
In this episode of Allied Angels, we unlock the inner workings of large language models (LLMs) such as Claude 3.5 Haiku by breaking down Anthropic's latest research: Tracing the thoughts of a large language model.
Join us as we delve into mechanistic interpretability, exploring how AI truly "thinks" by revealing its computational graphs and underlying circuits.
Discover the innovative circuit tracing methodology, which uses attribution graphs and cross-layer transcoders (CLTs) to dissect the complex processes within these models.
We uncover interpretable features – the building blocks of AI computation – and map their interactions to understand how models generate text and perform tasks.
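For the technically curious, here is a minimal sketch of the cross-layer transcoder idea, assuming a PyTorch-style setup. All names and shapes are illustrative, not Anthropic's actual implementation: an encoder reads the residual stream entering one layer and produces sparse, non-negative feature activations, and per-layer decoders write those features back into the MLP outputs of that layer and every later one.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    # Illustrative sketch of a cross-layer transcoder (CLT): sparse
    # features read from one layer's residual stream write their
    # reconstructions into this layer's and all later layers' MLP outputs.
    def __init__(self, d_model: int, n_features: int, n_downstream_layers: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        # One linear decoder per downstream layer the features write to.
        self.decoders = nn.ModuleList(
            [nn.Linear(n_features, d_model, bias=False)
             for _ in range(n_downstream_layers)]
        )

    def forward(self, resid: torch.Tensor):
        # ReLU keeps activations non-negative; sparsity would come from
        # a penalty during training, omitted here for brevity.
        acts = torch.relu(self.encoder(resid))
        # Return the feature activations and one reconstruction per layer.
        return acts, [dec(acts) for dec in self.decoders]
```

The point of the substitution is interpretability: the hard-to-read MLP neurons are replaced by a dictionary of features that each tend to fire on a single human-recognizable concept, which is what makes the attribution graphs legible.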
We also explore fascinating "AI biology" as we trace the pathways behind diverse behaviors, such as:
• Multilingualism: Uncover evidence of a shared conceptual space and both language-specific and language-independent circuits.
• Planning: Learn how language models plan their outputs, even in creative tasks like poetry generation, by identifying future words and working backward.
• Refusals: Understand the mechanisms behind a model's decision to decline harmful requests and how specific features contribute to this behavior.
• Jailbreaks: Investigate prompting strategies that can bypass safety mechanisms and the underlying weaknesses they exploit.
• Factual Recall: See how models access and utilize factual knowledge to answer questions.
• Addition: Delve into the surprisingly intricate circuits responsible for simple arithmetic (see the toy sketch after this list).
• Entity Recognition and Hallucinations: Learn how models distinguish between known and unknown entities and the circuit misfires that can lead to fabricated information.
• Chain-of-thought Faithfulness: Examine whether a model's stated reasoning aligns with its actual computational steps.
• Hidden Goals: Uncover how fine-tuning can embed secret objectives within a model's persona.
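On the addition example: the paper describes parallel pathways, with features that track the exact ones digit working alongside features that track rough magnitude, converging on the answer. The toy Python below echoes that decomposition; it is not the model's actual circuit, and the exact sum used as the "estimate" stands in for a signal the model computes only fuzzily.

```python
def toy_addition(a: int, b: int) -> int:
    # Two parallel "pathways", combined at the end: an exact ones-digit
    # lookup plus a coarse magnitude estimate, loosely echoing the
    # lookup-table and magnitude features described in the paper.
    ones = (a % 10 + b % 10) % 10      # exact ones digit: 36 + 59 -> 5
    estimate = a + b                   # stand-in for a fuzzy magnitude signal
    base = round(estimate / 10) * 10   # snap the estimate to the nearest ten
    candidates = (base - 10 + ones, base + ones)
    # Pick whichever candidate with the right ones digit sits closest
    # to the magnitude estimate.
    return min(candidates, key=lambda c: abs(c - estimate))

assert toy_addition(36, 59) == 95
```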
Gain insights into the limitations of current methods, including missing attention circuits, reconstruction errors, and the challenges of understanding global circuits. We also discuss the crucial role of validation through perturbation experiments.
This podcast provides a unique window into the "thoughts" of large language models, revealing the fascinating interplay of features and circuits that drive their capabilities and limitations.
Tune in to explore the cutting-edge of AI interpretability and the quest to build an "AI microscope" to understand the complex world within.
------------
Allied VC is Western Canada's largest angel syndicate, investing in early-stage technology startups across Canada and the USA.
Pitch us, Invest, Scout, and more: https://linktr.ee/alliedvc
Allied Angels is powered by NotebookLM - Google's new AI note-taking & research assistant.