
Disclaimer: This podcast is completely AI-generated by NotebookLM 🤖
Summary
In this episode we discuss a research paper that investigates whether transformer-based language models can learn to reason implicitly over knowledge, a skill that even the most advanced models struggle with. The authors focus on two types of reasoning: composition (chaining facts together) and comparison (comparing entities' attributes). Their experiments show that transformers can learn implicit reasoning, but only through grokking: generalisation emerges after extended training, well beyond the point where the model has fit the training data. The study then probes the model's internal mechanisms throughout training to understand how and why grokking happens. The authors find that transformers develop distinct circuits for composition and comparison, which explains the differences in their ability to generalise to unseen data. Finally, the paper demonstrates the power of parametric memory for complex reasoning tasks, showing that a fully grokked transformer outperforms state-of-the-art LLMs that rely on non-parametric memory.
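To make the two reasoning types concrete, here is a minimal illustrative sketch (not the paper's actual data pipeline; the entities, relations, and function names are made up for this example). Composition chains two stored facts into a new one, while comparison contrasts the same attribute across two entities:

```python
# Hypothetical atomic facts: (entity, relation) -> value
facts = {
    ("alice", "mother"): "beth",
    ("beth", "birthplace"): "paris",
    ("alice", "age"): 30,
    ("carol", "age"): 45,
}

# Composition: chain two facts, e.g. "Where was Alice's mother born?"
def compose(entity, rel1, rel2):
    bridge = facts[(entity, rel1)]   # first hop: alice -> beth
    return facts[(bridge, rel2)]     # second hop: beth -> paris

# Comparison: compare one attribute across two entities,
# e.g. "Is Alice younger than Carol?"
def compare(entity_a, entity_b, attribute):
    return facts[(entity_a, attribute)] < facts[(entity_b, attribute)]

print(compose("alice", "mother", "birthplace"))  # paris
print(compare("alice", "carol", "age"))          # True
```

In the paper's setting, the model never sees such a lookup at inference time: it must answer the composed or comparative query directly, using only the facts it has memorised in its parameters during training.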