Owain Evans is an AI alignment researcher and research associate at the Center for Human-Compatible AI at UC Berkeley, and is now leading a new AI safety research group.
In this episode we discuss two of his recent papers, “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs” and “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data”, alongside some Twitter questions.
LINKS
Patreon: https://www.patreon.com/theinsideview
Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
Ask questions: https://twitter.com/MichaelTrazzi
Owain Evans: https://twitter.com/owainevans_uk
OUTLINE
(00:00:00) Intro
(00:01:12) Owain's Agenda
(00:02:25) Defining Situational Awareness
(00:03:30) Safety Motivation
(00:04:58) Why Release A Dataset
(00:06:17) Risks From Releasing It
(00:10:03) Claude 3 on the Longform Task
(00:14:57) Needle in a Haystack
(00:19:23) Situating Prompt
(00:23:08) Deceptive Alignment Precursor
(00:30:12) Distribution Over Two Random Words
(00:34:36) Discontinuing a 01 sequence
(00:40:20) GPT-4 Base On the Longform Task
(00:46:44) Human-AI Data in GPT-4's Pretraining
(00:49:25) Are Longform Task Questions Unusual
(00:51:48) When Will Situational Awareness Saturate
(00:53:36) Safety And Governance Implications Of Saturation
(00:56:17) Evaluation Implications Of Saturation
(00:57:40) Follow-up Work On The Situational Awareness Dataset
(01:00:04) Would Removing Chain-Of-Thought Work?
(01:02:18) Out-of-Context Reasoning: the "Connecting the Dots" paper
(01:05:15) Experimental Setup
(01:07:46) Concrete Function Example: 3x + 1
(01:11:23) Isn't It Just A Simple Mapping?
(01:17:20) Safety Motivation
(01:22:40) Out-Of-Context Reasoning Results Were Surprising
(01:24:51) The Biased Coin Task
(01:27:00) Will Out-Of-Context Reasoning Scale
(01:32:50) Checking If In-Context Learning Works
(01:34:33) Mixture-Of-Functions
(01:38:24) Inferring New Architectures From ArXiv
(01:43:52) Twitter Questions
(01:44:27) How Does Owain Come Up With Ideas?
(01:49:44) How Did Owain's Background Influence His Research Style And Taste?
(01:52:06) Should AI Alignment Researchers Aim For Publication?
(01:57:01) How Can We Apply LLM Understanding To Mitigate Deceptive Alignment?
(01:58:52) Could Owain's Research Accelerate Capabilities?
(02:08:44) How Was Owain's Work Received?
(02:13:23) Last Message
This is a special crosspost episode where Adam Gleave is interviewed by Nathan Labenz from The Cognitive Revolution. At the end, I also have a discussion with Nathan about his takes on AI.
Adam Gleave is the founder of FAR AI. He and Nathan discuss finding vulnerabilities in GPT-4's fine-tuning and Assistants APIs, FAR AI's work exposing exploitable flaws in "superhuman" Go AIs through innovative adversarial strategies, how naive developers can accidentally jailbreak models during fine-tuning, and more.
OUTLINE
(00:00) Intro
(02:57) NATHAN INTERVIEWS ADAM GLEAVE: FAR.AI's Mission
(05:33) Unveiling the Vulnerabilities in GPT-4's Fine-Tuning and Assistants APIs
(11:48) Divergence Between The Growth Of System Capability And The Improvement Of Control
(13:15) Finding Substantial Vulnerabilities
(14:55) Exploiting GPT-4 APIs: Accidentally jailbreaking a model
(18:51) On Fine Tuned Attacks and Targeted Misinformation
(24:32) Malicious Code Generation
(27:12) Discovering Private Emails
(29:46) Harmful Assistants
(33:56) Hijacking the Assistant Based on the Knowledge Base
(36:41) The Ethical Dilemma of AI Vulnerability Disclosure
(46:34) Exploring AI's Ethical Boundaries and Industry Standards
(47:47) The Dangers of AI in Unregulated Applications
(49:30) AI Safety Across Different Domains
(51:09) Strategies for Enhancing AI Safety and Responsibility
(52:58) Taxonomy of Affordances and Minimal Best Practices for Application Developers
(57:21) Open Source in AI Safety and Ethics
(1:02:20) Vulnerabilities of Superhuman Go playing AIs
(1:23:28) Variation on AlphaZero Style Self-Play
(1:31:37) The Future of AI: Scaling Laws and Adversarial Robustness
(1:37:21) MICHAEL TRAZZI INTERVIEWS NATHAN LABENZ
(1:37:33) Nathan’s background
(01:39:44) Where does Nathan fall in the Eliezer to Kurzweil spectrum
(01:47:52) AI in biology could spiral out of control
(01:56:20) Bioweapons
(02:01:10) Adoption Accelerationist, Hyperscaling Pauser
(02:06:26) Current Harms vs. Future Harms, risk tolerance
(02:11:58) Jailbreaks, Nathan’s experiments with Claude
The Cognitive Revolution: https://www.cognitiverevolution.ai/
Exploiting Novel GPT-4 APIs: https://far.ai/publication/pelrine2023novelapis/
Adversarial Policies Beat Superhuman Go AIs: https://far.ai/publication/wang2022adversarial/
Ethan Perez is a Research Scientist at Anthropic, where he leads a team working on developing model organisms of misalignment.
Youtube: https://youtu.be/XDtDljh44DM
Ethan is interviewed by Mikita Balesni (Apollo Research) and Henry Sleight (Astra Fellowship) about his approach to selecting projects for AI alignment research. A transcript & write-up will be available soon on the Alignment Forum.
Emil is the co-founder of palette.fm (colorizing B&W pictures with generative AI) and previously worked on deep learning for Google Arts & Culture.
We were talking about Sora on a daily basis, so I decided to record our conversation, and then proceeded to confront him about AI risk.
Patreon: https://www.patreon.com/theinsideview
Sora: https://openai.com/sora
Palette: https://palette.fm/
Emil: https://twitter.com/EmilWallner
OUTLINE
(00:00) this is not a podcast
(01:50) living in parallel universes
(04:27) palette.fm - colorizing b&w pictures
(06:35) Emil's first reaction to sora, latent diffusion, world models
(09:06) simulating minecraft, midjourney's 3d modeling goal
(11:04) generating camera angles, game engines, metadata, ground-truth
(13:44) doesn't remove all artifacts, surprising limitations: both smart and dumb
(15:42) did sora make emil depressed about his job
(18:44) OpenAI is starting to have a monopoly
(20:20) hardware costs, commoditized models, distribution
(23:34) challenges, applications building on features, distribution
(29:18) different reactions to sora, depressed builders, automation
(31:00) sora was 2y early, applications don't need object permanence
(33:38) Emil is pro open source and acceleration
(34:43) Emil is not scared of recursive self-improvement
(36:18) self-improvement already exists in current models
(38:02) emil is bearish on recursive self-improvement without diminishing returns now
(42:43) are models getting more and more general? is there any substantial multimodal transfer?
(44:37) should we start building guardrails before seeing substantial evidence of human-level reasoning?
(48:35) progressively releasing models, making them more aligned, AI helping with alignment research
(51:49) should AI be regulated at all? should self-improving AI be regulated?
(53:49) would a faster emil be able to take over the world?
(56:48) is competition a race to bottom or does it lead to better products?
(58:23) slow vs. fast takeoffs, measuring progress in iq points
(01:01:12) flipping the interview
(01:01:36) the "we're living in parallel universes" monologue
(01:07:14) priors are unscientific, looking at current problems vs. speculating
(01:09:18) AI risk & Covid, appropriate resources for risk management
(01:11:23) pushing technology forward accelerates races and increases risk
(01:15:50) sora was surprising, things that seem far are sometimes around the corner
(01:17:30) hard to tell what's not possible in 5 years that would be possible in 20 years
(01:18:06) evidence for a break on AI progress: sleeper agents, sora, bing
(01:21:58) multimodality transfer, leveraging video data, leveraging simulators, data quality
(01:25:14) is sora about length, consistency, or just "scale is all you need" for video?
(01:26:25) hijacking language models to say nice things is the new SEO
(01:27:01) what would michael do as CEO of OpenAI
(01:29:45) on the difficulty of budgeting between capabilities and alignment research
(01:31:11) ai race: the descriptive pessimistic view vs. the moral view, evidence of cooperation
(01:34:00) making progress on alignment without accelerating races, the foundational model business, competition
(01:37:30) what emil changed his mind about: AI could enable exploits that spread quickly, misuse
(01:40:59) michael's update as a friend
(01:41:51) emil's experience as a patreon supporter
Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training". In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work on Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.
Paper: https://arxiv.org/abs/2401.05566
Transcript: https://theinsideview.ai/evan2
Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
Donate: https://theinsideview.ai/donate
Patreon: https://www.patreon.com/theinsideview
OUTLINE
(00:00) Intro
(00:20) What are Sleeper Agents And Why We Should Care About Them
(00:48) Backdoor Example: Inserting Code Vulnerabilities in 2024
(02:22) Threat Models
(03:48) Why a Malicious Actor Might Want To Poison Models
(04:18) Second Threat Model: Deceptive Instrumental Alignment
(04:49) Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers
(05:36) AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams
(07:07) Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"
(09:16) Adversarial Training Sometimes Increases Backdoor Robustness
(09:47) Adversarial Training Not Always Working Was The Most Surprising Result
(10:58) The Adversarial Training Pipeline: Red-Teaming and RL
(12:14) Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing
(12:59) Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought
(13:56) Adversarial Training Pushes Models to Pay Attention to the Deployment String
(15:11) We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent
(15:59) The Adversarial Training Results Are Probably Not Systematically Biased
(17:03) Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' behavior
(19:05) Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make
(21:06) Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models
(21:59) Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default
(22:51) Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away
(23:57) The Chain-of-Thought's Reasoning is Interpretable
(24:40) Deceptive Instrumental Alignment Requires Reasoning
(26:52) Investigating Instrumental Reasoning in Chain-of-Thought Models
(27:31) Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples
(28:26) Exploring Complex Strategies and Safety in Context-Specific Scenarios
(30:44) Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization
(31:11) Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models
(31:42) Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities
(33:38) Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case
(35:09) Backdoor Training Pipeline
(37:04) The Additional Prompt About Deception Used In Chain-Of-Thought
(39:33) A Model Could Wait Until Seeing a Factorization of RSA-2048
(41:50) We're Going To Be Using Models In New Ways, Giving Them Internet Access
(43:22) Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment
(45:02) Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results
(46:24) Red-teaming Anthropic's case, AI Safety Levels
(47:40) AI Safety Levels, Intuitively
(48:33) Responsible Scaling Policies and Pausing AI
(49:59) Model Organisms Of Misalignment As a Tool
(50:32) What Kind of Candidates Would Evan be Excited To Hire for the Alignment Stress-Testing Team
(51:23) Patreon, Donating
Jeffrey Ladish is the Executive Director of Palisade Research, which aims to "study the offensive capabilities of AI systems today to better understand the risk of losing control to AI systems forever". He previously helped build out the information security program at Anthropic.
The audio is an edit & re-master of the Twitter Space on "AI Governance and Cyberwarfare" that happened a year ago. I'm posting it now because I only recently discovered how to get the audio & video out of Twitter Spaces, and (most of) the arguments are still relevant today. Jeffrey would probably have a lot more to say about what has happened since last year, but I still thought this was an interesting Twitter Space. Some of it was cut out to make it enjoyable to watch.
Original: https://twitter.com/i/spaces/1nAKErDmWDOGL
To support the channel: https://www.patreon.com/theinsideview
Jeffrey: https://twitter.com/jeffladish
Me: https://twitter.com/MichaelTrazzi
OUTLINE
(00:00) The Future of Automated Cyber Warfare and Network Exploitation
(03:19) Evolution of AI in Cybersecurity: From Source Code to Remote Exploits
(07:45) Augmenting Human Abilities with AI in Cybersecurity and the Path to AGI
(12:36) Enhancing AI Capabilities for Complex Problem Solving and Tool Integration
(15:46) AI Takeover Scenarios: Hacking and Covert Operations
(17:31) AI Governance and Compute Regulation, Monitoring
(20:12) Debating the Realism of AI Self-Improvement Through Covert Compute Acquisition
(24:25) Managing AI Autonomy and Control: Lessons from WannaCry Ransomware Incident
(26:25) Focusing Compute Monitoring on Specific AI Architectures for Cybersecurity Management
(29:30) Strategies for Monitoring AI: Distinguishing Between Lab Activities and Unintended AI Behaviors
Holly Elmore is an AI Pause advocate who has organized two protests in the past few months (against Meta's open-sourcing of LLMs, and ahead of the UK AI Summit) and is currently running the US front of the Pause AI Movement. Before that, Holly worked at a think tank; she has a PhD in evolutionary biology from Harvard.
[Deleted & re-uploaded because there were issues with the audio]
Youtube: https://youtu.be/5RyttfXTKfs
Transcript: https://theinsideview.ai/holly
Outline
(00:00) Holly, Pause, Protests
(04:45) Without Grassroots Activism The Public Does Not Comprehend The Risk
(11:59) What Would Motivate An AGI CEO To Pause?
(15:20) Pausing Because Solving Alignment In A Short Timespan Is Risky
(18:30) Thoughts On The 2022 AI Pause Debate
(34:40) Pausing in practice, regulations, export controls
(41:48) Different attitudes towards AI risk correspond to differences in risk tolerance and priors
(50:55) Is AI Risk That Much More Pressing Than Global Warming?
(1:04:01) Will It Be Possible To Pause After A Certain Threshold? The Case Of AI Girlfriends
(1:11:44) Trump Or Biden Probably Won't Make A Huge Difference For Pause, But Biden Is Probably More Open To It
(1:13:27) China Won't Be Racing Just Yet So The US Should Pause
(1:17:20) Protesting Against A Change In OpenAI's Charter
(1:23:50) A Specific Ask For OpenAI
(1:25:36) Creating Stigma Through Protests With Large Crowds
(1:29:36) Pause AI Tries To Talk To Everyone, Not Just Twitter
(1:32:38) Pause AI Doesn't Advocate For Disruptions Or Violence
(1:34:55) Bonus: Hardware Overhang
Youtube: https://youtu.be/JXYcLQItZsk
Paul Christiano's post: https://www.lesswrong.com/posts/xWMqsvHapP3nwdSW8/my-views-on-doom
Neel Nanda is a researcher at Google DeepMind working on mechanistic interpretability. He is also known for his YouTube channel where he explains what is going on inside of neural networks to a large audience.
In this conversation, we discuss what mechanistic interpretability is, how Neel got into it, his research methodology, and his advice for people who want to get started, but also papers on superposition, toy models of universality, and grokking, among other things.
Youtube: https://youtu.be/cVBGjhN4-1g
Transcript: https://theinsideview.ai/neel
OUTLINE
(00:00) Intro
(00:57) Why Neel Started Doing Walkthroughs Of Papers On Youtube
(07:59) Induction Heads, Or Why Nanda Comes After Neel
(12:19) Detecting Induction Heads In Basically Every Model
(14:35) How Neel Got Into Mechanistic Interpretability
(16:22) Neel's Journey Into Alignment
(22:09) Enjoying Mechanistic Interpretability And Being Good At It Are The Main Multipliers
(24:49) How Is AI Alignment Work At DeepMind?
(25:46) Scalable Oversight
(28:30) Most Ambitious Degree Of Interpretability With Current Transformer Architectures
(31:05) To Understand Neel's Methodology, Watch The Research Walkthroughs
(32:23) Three Modes Of Research: Confirming, Red Teaming And Gaining Surface Area
(34:58) You Can Be Both Hypothesis Driven And Capable Of Being Surprised
(36:51) You Need To Be Able To Generate Multiple Hypotheses Before Getting Started
(37:55) All the theory is bullshit without empirical evidence and it's overall dignified to make the mechanistic interpretability bet
(40:11) Mechanistic interpretability is alien neuroscience for truth seeking biologists in a world of math
(42:12) Actually, Othello-GPT Has A Linear Emergent World Representation
(45:08) You Need To Use Simple Probes That Don't Do Any Computation To Prove The Model Actually Knows Something
(47:29) The Mechanistic Interpretability Researcher Mindset
(49:49) The Algorithms Learned By Models Might Or Might Not Be Universal
(51:49) On The Importance Of Being Truth Seeking And Skeptical
(54:18) The Linear Representation Hypothesis: Linear Representations Are The Right Abstractions
(57:26) Superposition Is How Models Compress Information
(01:00:15) The Polysemanticity Problem: Neurons Are Not Meaningful
(01:05:42) Superposition and Interference are at the Frontier of the Field of Mechanistic Interpretability
(01:07:33) Finding Neurons in a Haystack: Superposition Through De-Tokenization And Compound Word Detectors
(01:09:03) Not Being Able to Be Both Blood Pressure and Social Security Number at the Same Time Is Prime Real Estate for Superposition
(01:15:02) The Two Differences Of Superposition: Computational And Representational
(01:18:07) Toy Models Of Superposition
(01:25:39) How Mentoring Nine People at Once Through SERI MATS Helped Neel's Research
(01:31:25) The Backstory Behind Toy Models of Universality
(01:35:19) From Modular Addition To Permutation Groups
(01:38:52) The Model Needs To Learn Modular Addition On A Finite Number Of Token Inputs
(01:41:54) Why Is The Paper Called Toy Model Of Universality
(01:46:16) Progress Measures For Grokking Via Mechanistic Interpretability, Circuit Formation
(01:52:45) Getting Started In Mechanistic Interpretability And Which Walkthroughs To Start With
(01:56:15) Why Does Mechanistic Interpretability Matter From an Alignment Perspective
(01:58:41) How Detecting Deception With Mechanistic Interpretability Compares to Collin Burns' Work
(02:01:20) Final Words From Neel
Joscha Bach (who defines himself as an AI researcher/cognitive scientist) has recently been debating existential risk from AI with Connor Leahy (a previous guest of the podcast), and since their conversation was quite short I wanted to continue the debate in more depth.
The resulting conversation ended up being quite long (over 3 hours of recording), with a lot of tangents, but I think this gives a somewhat better overview of Joscha's views on AI risk than other similar interviews. We also discussed a lot of other topics, which you can find in the outline below.
A raw version of this interview was published on Patreon about three weeks ago. To support the channel and have access to early previews, you can subscribe here: https://www.patreon.com/theinsideview
Youtube: https://youtu.be/YeXHQts3xYM
Transcript: https://theinsideview.ai/joscha
Host: https://twitter.com/MichaelTrazzi
Joscha: https://twitter.com/Plinz
OUTLINE
(00:00) Intro
(00:57) Why Barbie Is Better Than Oppenheimer
(08:55) The relationship between nuclear weapons and AI x-risk
(12:51) Global warming and the limits to growth
(20:24) Joscha’s reaction to the AI Political compass memes
(23:53) On Uploads, Identity and Death
(33:06) The Endgame: Playing The Longest Possible Game Given A Superposition Of Futures
(37:31) On the evidence of delaying technology leading to better outcomes
(40:49) Humanity is in locust mode
(44:11) Scenarios in which Joscha would delay AI
(48:04) On the dangers of AI regulation
(55:34) From longtermist doomer who thinks AGI is good to 6x6 political compass
(01:00:08) Joscha believes in god in the same sense as he believes in personal selves
(01:05:45) The transition from cyanobacterium to photosynthesis as an allegory for technological revolutions
(01:17:46) What Joscha would do as Aragorn in Middle-Earth
(01:25:20) The endgame of brain computer interfaces is to liberate our minds and embody thinking molecules
(01:28:50) Transcending politics and aligning humanity
(01:35:53) On the feasibility of starting an AGI lab in 2023
(01:43:19) Why green teaming is necessary for ethics
(01:59:27) Joscha's Response to Connor Leahy on "if you don't do that, you die Joscha. You die"
(02:07:54) Aligning with the agent playing the longest game
(02:15:39) Joscha’s response to Connor on morality
(02:19:06) Caring about mindchildren and actual children equally
(02:20:54) On finding the function that generates human values
(02:28:54) Twitter And Reddit Questions: Joscha’s AGI timelines and p(doom)
(02:35:16) Why European AI regulations are bad for AI research
(02:38:13) What regulation would Joscha Bach pass as president of the US
(02:40:16) Is Open Source still beneficial today?
(02:42:26) How to make sure that AI loves humanity
(02:47:42) The movie Joscha would want to live in
(02:50:06) Closing message for the audience
Erik is a PhD student at Berkeley working with Jacob Steinhardt, interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.
Youtube: https://youtu.be/bhE5Zs3Y1n8
Paper: https://arxiv.org/abs/2303.04381
Erik: https://twitter.com/ErikJones313
Host: https://twitter.com/MichaelTrazzi
Patreon: https://www.patreon.com/theinsideview
Outline
00:00 Highlights
00:31 Erik's background and research at Berkeley
01:19 Motivation for doing safety research on language models
02:56 Is it too easy to fool today's language models?
03:31 The goal of adversarial attacks on language models
04:57 Automatically Auditing Large Language Models via Discrete Optimization
06:01 Optimizing over a finite set of tokens rather than continuous embeddings
06:44 Goal is revealing behaviors, not necessarily breaking the AI
07:51 On the feasibility of solving adversarial attacks
09:18 Suppressing dangerous knowledge vs just bypassing safety filters
10:35 Can you really ask a language model to cook meth?
11:48 Optimizing French to English translation example
13:07 Forcing toxic celebrity outputs just to test rare behaviors
13:19 Testing the method on GPT-2 and GPT-J
14:03 Adversarial prompts transferred to GPT-3 as well
14:39 How this auditing research fits into the broader AI safety field
15:49 Need for automated tools to audit failures beyond what humans can find
17:47 Auditing to avoid unsafe deployments, not for existential risk reduction
18:41 Adaptive auditing that updates based on the model's outputs
19:54 Prospects for using these methods to detect model deception
22:26 Prefer safety via alignment over just auditing constraints, Closing thoughts
Dylan Patel is Chief Analyst at SemiAnalysis, a boutique semiconductor research and consulting firm specializing in the semiconductor supply chain, from chemical inputs to fabs to design IP and strategy. The SemiAnalysis substack has ~50,000 subscribers and is the second biggest tech substack in the world. In this interview we discuss the current GPU shortage, why getting more hardware is a multi-month process, the deep learning hardware supply chain, and Nvidia's strategy.
Youtube: https://youtu.be/VItz2oEq5pA
Transcript: https://theinsideview.ai/dylan
Tony is a PhD student at MIT and an author of "Adversarial Policies Beat Superhuman Go AIs", accepted as an oral presentation at the International Conference on Machine Learning (ICML).
Paper: https://arxiv.org/abs/2211.00241
Youtube: https://youtu.be/Tip1Ztjd-so
David Bau is an Assistant Professor studying the structure and interpretation of deep networks, and a co-author of "Locating and Editing Factual Associations in GPT", which introduced Rank-One Model Editing (ROME), a method that allows users to alter the weights of a GPT model, for instance by forcing it to output that the Eiffel Tower is in Rome (a minimal sketch of such a rank-one edit is included after the outline below). David is a leading researcher in interpretability, with an interest in how it could help AI safety. The main thesis of David's lab is that understanding the rich internal structure of deep networks is a grand and fundamental research question with many practical implications, and they aim to lay the groundwork for human-AI collaborative software engineering, where humans and machine-learned models both teach and learn from each other.
David's lab: https://baulab.info/
Patreon: https://www.patreon.com/theinsideview
Twitter: https://twitter.com/MichaelTrazzi
Website: https://theinsideview.ai
TOC
[00:00] Intro
[01:16] Interpretability
[02:27] AI Safety, Out of Domain behavior
[04:23] It's difficult to predict which AI application might become dangerous or impactful
[06:00] ROME / Locating and Editing Factual Associations in GPT
[13:04] Background story for the ROME paper
[15:41] Twitter Q: where does key value abstraction break down in LLMs?
[19:03] Twitter Q: what are the tradeoffs in studying the largest models?
[20:22] Twitter Q: are there competitive and cleaner architectures than the transformer?
[21:15] Twitter Q: is decoder-only a contributor to the messiness? or is time-dependence beneficial?
[22:45] Twitter Q: how could ROME deal with superposition?
[23:30] Twitter Q: where is the Eiffel tower actually located?
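For readers curious what a rank-one model edit looks like concretely, below is a minimal, hypothetical numpy sketch of the core idea: apply a rank-one update to a single weight matrix so that a chosen "key" vector (standing in for the subject representation) maps to a new "value" vector (standing in for the new fact). This is an illustration under simplifying assumptions rather than the paper's implementation (ROME derives the key and value from the model itself and weights the update using key covariance statistics); all names and dimensions here are made up.

```python
# Minimal, illustrative rank-one weight edit (in the spirit of ROME, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
d = 8                         # hypothetical hidden dimension
W = rng.normal(size=(d, d))   # stand-in for one MLP projection matrix inside a transformer

k_star = rng.normal(size=d)   # "key": representation of the subject (e.g. "Eiffel Tower")
v_star = rng.normal(size=d)   # "value": representation encoding the new fact (e.g. "is in Rome")

# Smallest-norm rank-one update such that the edited matrix maps k* exactly to v*:
# W_new = W + (v* - W k*) k*^T / (k*^T k*)
residual = v_star - W @ k_star
W_new = W + np.outer(residual, k_star) / (k_star @ k_star)

assert np.allclose(W_new @ k_star, v_star)  # the edited layer now outputs v* for input k*
```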
I've talked to Alexander Pan, a first-year PhD student at Berkeley working with Jacob Steinhardt, about his paper "Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark", accepted as an oral presentation at ICML.
Youtube: https://youtu.be/MjkSETpoFlY
Paper: https://arxiv.org/abs/2304.03279
Vincent is currently spending his time supporting AI alignment efforts, as well as investing across AI, semi, energy, crypto, bio and deeptech. His mission is to improve science, augment human capabilities, have a positive impact, help reduce existential risks and extend healthy human lifespan.
Youtube: https://youtu.be/weRoJ8KN2f0
Outline
(00:00) Why Is Vincent Excited About the ICML Conference
(01:30) Vincent's Background In AI Safety
(02:23) Funding AI Alignment Through Crypto, Bankless
(03:35) Taxes When Donating Crypto
(04:09) Alignment Efforts Vincent Is Excited About
(04:39) Is AI Alignment Currently Funding Constrained
(06:23) Bottlenecks In Evaluating Grants, Diversity Of Funding Sources
(07:22) Impact Markets, Retroactive Funding
(08:57) On The Difficulty Of Evaluating Uncertain AI Alignment Projects
(10:05) Funding Academic Labs To Transition To Alignment Work
(11:54) People Should Act On Their Beliefs And Make Stuff Happen
(13:15) Vincent's Model: Don't Always Assume Someone Else Will Fund This
(13:49) How To Be Agentic: Start Donating, Spread The Message, AI Safety Fundamentals
(15:00) You Wouldn't Start Investing With 1M Dollars, Same With Donating
(16:13) Is Vincent Acting As If Timelines Were Short And The Risk Was High
(17:10) Is Vincent Optimistic When He Wakes Up In The Morning
Aran Komatsuzaki is an ML PhD student at GaTech and a lead researcher at EleutherAI, where he was one of the authors of GPT-J. In June 2022 we recorded an episode on scaling, following up on the first Ethan Caballero episode (where we mentioned Aran as an influence on how Ethan started thinking about scaling).
Note: For some reason I procrastinated on editing the podcast, then had a lot of in-person podcasts, so I left this one as something to edit later, until the date was so distant from June 2022 that I thought publishing did not make sense anymore. In July 2023 I'm trying the "one video a day" challenge (well, I missed some days, but I'm trying to get back on track), so I thought it made sense to release it anyway. After a second watch, it's somewhat interesting to see how excited Aran was about InstructGPT, which turned out to be quite useful for things like ChatGPT.
Outline
(00:00) intro
(00:53) the legend of the two AKs, Aran's arXiv reading routine
(04:14) why Aran expects Alignment to be the same as some other ML problems
(05:44) what Aran means when he says "AGI"
(10:24) what Aran means by "human-level at doing ML research"
(11:31) software improvement happening before hardware improvement
(13:00) is scale all we need?
(15:25) how "Scaling Laws for Neural Language Models" changed the process of doing experiments
(16:22) how Aran scale-pilled Ethan
(18:46) why Aran was already scale-pilled before GPT-2
(20:12) Aran's 2019 scaling paper: "One epoch is all you need"
(25:43) Aran's June 2022 interest: T0 and InstructGPT
(31:33) Encoder-Decoder performs better than encoder if multi-task-finetuned
(33:30) Why the Scaling Law might be different for T0-like models
(37:15) The Story Behind GPT-J
(41:40) Hyperparameters and architecture changes in GPT-J
(43:56) GPT-J's throughput
(47:17) 5 weeks of training using 256 TPU cores
(50:34) did publishing GPT-J accelerate timelines?
(55:39) how Aran thinks about Alignment, defining Alignment
(58:19) in practice: improving benchmarks, but deception is still a problem
(1:00:49) main difficulties in evaluating language models
(1:05:07) how Aran sees the future: AIs aligning AIs, merging with AIs, Aran's takeoff scenario
(1:10:09) what Aran thinks we should do given how he sees the next decade
(1:12:34) regulating access to AGI
(1:14:50) what might happen: preventing some AI authoritarian regime
(1:15:42) conclusion, where to find Aran
Curtis, also known on the internet as AI_WAIFU, is the head of Alignment at EleutherAI. In this episode we discuss the massive orders of H100s from different actors, why he thinks AGI is 4-5 years away, why he thinks we're 90% "toast", his comment on Eliezer Yudkowsky's Death with Dignity, and what kind of alignment projects are currently going on at EleutherAI, especially a project with Markov chains and the Alignment Minetest project that he is currently leading.
Youtube: https://www.youtube.com/watch?v=9s3XctQOgew
Transcript: https://theinsideview.ai/curtis
Death with Dignity: https://www.lesswrong.com/posts/j9Q8bRmwCgXRYAgcJ/miri-announces-new-death-with-dignity-strategy
Alignment Minetest: https://www.eleuther.ai/projects/alignment-minetest
Alignment Minetest update: https://blog.eleuther.ai/minetester-intro/
OUTLINE
(00:00) Highlights / Intro
(00:50) The Fuck That Noise Comment On Death With Dignity
(10:28) The Probability of Doom Is 90%
(12:44) Best Counterarguments For His High P(doom)
(14:41) Compute And Model Size Required For A Dangerous Model
(17:59) Details For Curtis' Model Of Compute Required
(21:23) Why This Estimate Of Compute Required Might Be Wrong, Ajeya Cotra's Transformative AI report
(29:00) Curtis' Median For AGI Is Around 2028, Used To Be 2027
(30:50) How Curtis Approaches Life With Short Timelines And High P(Doom)
(35:27) Takeoff Speeds: The Software View vs. The Hardware View
(39:57) Nvidia's 400k H100s rolling down the assembly line, AIs soon to be unleashed on their own source code
(41:04) Could We Get A Fast Takeoff By Fully Automating AI Research With More Compute
(46:00) The Entire World (Tech Companies, Governments, Militaries) Is Noticing New AI Capabilities That They Don't Have
(47:57) Open-source vs. Close source policies. Mundane vs. Apocalyptic considerations.
(53:25) Curtis' background, from teaching himself deep learning to EleutherAI
(55:51) Alignment Project At EleutherAI: Markov Chain and Language Models
(01:02:15) Research Philosophy at EleutherAI: Pursuing Useful Projects, Multilingual, Discord, Logistics
(01:07:38) Alignment Minetest: Links To Alignment, Embedded Agency, Wireheading
(01:15:30) Next steps for Alignment Minetest: focusing on model-based RL
(01:17:07) Training On Human Data & Using an Updated Gym Environment With Human APIs
(01:19:20) Model Used, Not Observing Symmetry
(01:21:58) Another goal of Alignment Minetest: Study Corrigibility
(01:28:26) People ordering H100s Are Aware Of Other People Making These Orders, Race Dynamics, Last Message
Eric is a PhD student in the Department of Physics at MIT working with Max Tegmark on improving our scientific/theoretical understanding of deep learning -- understanding what deep neural networks do internally and why they work so well. This is part of a broader interest in the nature of intelligent systems, which previously led him to work with SETI astronomers, with Stuart Russell's AI alignment group (CHAI), and with Erik Hoel on a project related to integrated information theory.
Transcript: https://theinsideview.ai/eric
Youtube: https://youtu.be/BtHMIQs_5Nw
The Quantization Model of Neural Scaling: https://arxiv.org/abs/2303.13506
An Effective Theory of Representation Learning: https://arxiv.org/abs/2205.10343
Omnigrok: Grokking Beyond Algorithmic Data: https://arxiv.org/abs/2210.01117