The Inside View
Michaël Trazzi
52 episodes
1 week ago
The goal of this podcast is to create a place where people discuss their inside views about existential risk from AI.
Technology
Neel Nanda on mechanistic interpretability, superposition and grokking
The Inside View
2 hours 4 minutes 53 seconds
2 years ago

Neel Nanda is a researcher at Google DeepMind working on mechanistic interpretability. He is also known for his YouTube channel, where he explains to a broad audience what is going on inside neural networks.

In this conversation, we discuss what mechanistic interpretability is, how Neel got into it, his research methodology, and his advice for people who want to get started, as well as papers on superposition, toy models of universality, and grokking, among other things.


YouTube: https://youtu.be/cVBGjhN4-1g


Transcript: https://theinsideview.ai/neel


OUTLINE


(00:00) Intro

(00:57) Why Neel Started Doing Walkthroughs Of Papers On Youtube

(07:59) Induction Heads, Or Why Nanda Comes After Neel

(12:19) Detecting Induction Heads In Basically Every Model

(14:35) How Neel Got Into Mechanistic Interpretability

(16:22) Neel's Journey Into Alignment

(22:09) Enjoying Mechanistic Interpretability And Being Good At It Are The Main Multipliers

(24:49) What Is AI Alignment Work Like At DeepMind?

(25:46) Scalable Oversight

(28:30) Most Ambitious Degree Of Interpretability With Current Transformer Architectures

(31:05) To Understand Neel's Methodology, Watch The Research Walkthroughs

(32:23) Three Modes Of Research: Confirming, Red Teaming And Gaining Surface Area

(34:58) You Can Be Both Hypothesis Driven And Capable Of Being Surprised

(36:51) You Need To Be Able To Generate Multiple Hypotheses Before Getting Started

(37:55) All the theory is bullshit without empirical evidence and it's overall dignified to make the mechanistic interpretability bet

(40:11) Mechanistic interpretability is alien neuroscience for truth seeking biologists in a world of math

(42:12) Actually, Othello-GPT Has A Linear Emergent World Representation

(45:08) You Need To Use Simple Probes That Don't Do Any Computation To Prove The Model Actually Knows Something

(47:29) The Mechanistic Interpretability Researcher Mindset

(49:49) The Algorithms Learned By Models Might Or Might Not Be Universal

(51:49) On The Importance Of Being Truth Seeking And Skeptical

(54:18) The Linear Representation Hypothesis: Linear Representations Are The Right Abstractions

(57:26) Superposition Is How Models Compress Information

(01:00:15) The Polysemanticity Problem: Neurons Are Not Meaningful

(01:05:42) Superposition and Interference are at the Frontier of the Field of Mechanistic Interpretability

(01:07:33) Finding Neurons in a Haystack: Superposition Through De-Tokenization And Compound Word Detectors

(01:09:03) Not Being Able to Be Both Blood Pressure and Social Security Number at the Same Time Is Prime Real Estate for Superposition

(01:15:02) The Two Differences Of Superposition: Computational And Representational

(01:18:07) Toy Models Of Superposition

(01:25:39) How Mentoring Nine People at Once Through SERI MATS Helped Neel's Research

(01:31:25) The Backstory Behind Toy Models of Universality

(01:35:19) From Modular Addition To Permutation Groups

(01:38:52) The Model Needs To Learn Modular Addition On A Finite Number Of Token Inputs

(01:41:54) Why Is The Paper Called Toy Model Of Universality

(01:46:16) Progress Measures For Grokking Via Mechanistic Interpretability, Circuit Formation

(01:52:45) Getting Started In Mechanistic Interpretability And Which Walkthroughs To Start With

(01:56:15) Why Does Mechanistic Interpretability Matter From an Alignment Perspective

(01:58:41) How Detecting Deception With Mechanistic Interpretability Compares to Collin Burns' Work

(02:01:20) Final Words From Neel
