MIT and UC San Diego Researchers Develop Method to Identify and Manipulate Abstract Concepts in Large Language Models
Large language models may harbor hidden biases and personas, but a new method reveals how to detect and control them. Researchers from MIT and UC San Diego have developed recursive feature machines (RFMs) to identify and manipulate abstract concepts—such as biases, moods, and preferences—within large language models (LLMs).
This approach exploits mathematical structure in the models' internal representations to isolate specific concepts and dial them up or down, diverging from traditional unsupervised learning techniques that lack such targeted control.
The team demonstrated the method by targeting over 500 concepts across LLMs, including personas like 'conspiracy theorist' and preferences such as 'fan of Boston.' When applied to a vision-language model, enhancing the 'conspiracy theorist' concept altered the model's explanation of the Apollo 17 'Blue Marble' image, shifting its tone to a conspiratorial narrative.
Adityanarayanan Radhakrishnan, a researcher involved in the study, noted: 'With our method, there’s ways to extract these different concepts and activate them in ways that prompting cannot give you answers to.'
Unlike conventional unsupervised learning, which identifies patterns without predefined labels, RFMs explicitly map abstract concepts to mathematical structures within LLMs. This allows researchers to both detect and manipulate these concepts with precision.
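The article does not spell out the RFM procedure itself, but the general idea of detecting a concept as a direction in a model's hidden-state space and then manipulating it can be sketched with a simpler, widely used stand-in: a difference-of-means "steering vector" added back into the activations. Everything below is hypothetical and uses synthetic data in place of real LLM activations; it illustrates the detect-then-steer pattern, not the paper's actual RFM method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimensionality

# Synthetic stand-ins for hidden states collected from an LLM:
# one batch from prompts expressing the concept, one from neutral prompts.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)
concept_acts = rng.normal(size=(200, d)) + 3.0 * true_direction
neutral_acts = rng.normal(size=(200, d))

# 1. Detect: estimate the concept direction as the difference of means.
direction = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2. Manipulate: steer an activation by adding a scaled copy of the
# direction (positive alpha amplifies the concept, negative suppresses it).
def steer(activation, direction, alpha):
    return activation + alpha * direction

x = rng.normal(size=d)
x_amplified = steer(x, direction, alpha=5.0)

# The projection onto the concept direction grows by exactly alpha.
print(round(float((x_amplified - x) @ direction), 3))  # prints 5.0
```

In a real setting the activations would come from a chosen layer of the model, and the scaled direction would be injected during the forward pass at generation time; the RFM approach described in the study learns these concept features in a more principled way than the simple mean difference used here.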
The method's code is publicly available, offering potential applications in improving model safety and performance. The study was funded by the National Science Foundation, Simons Foundation, TILOS institute, and U.S. Office of Naval Research.
The researchers caution that while the technique enables greater transparency, it also raises risks if misused. 'There’s a balance between understanding model behavior and ensuring these tools aren’t exploited,' the team notes. The study underscores the need for further research into the ethical implications of such interventions.