The Inside View
Michaël Trazzi
52 episodes
1 week ago
The goal of this podcast is to create a place where people discuss their inside views about existential risk from AI.
Technology
Erik Jones on Automatically Auditing Large Language Models
The Inside View
22 minutes 36 seconds
2 years ago

Erik is a PhD student at Berkeley working with Jacob Steinhardt, interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.
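The core idea discussed in the episode, searching over a finite token vocabulary to find prompts that elicit a target behavior, can be illustrated with a deliberately tiny sketch. This is not the paper's actual algorithm (which uses discrete optimization to scale far beyond brute force); `toy_model`, `VOCAB`, and the exhaustive search below are all hypothetical stand-ins.

```python
import itertools

# Hypothetical toy vocabulary and "model"; a real audit targets an LLM's
# token space and uses log-probabilities rather than exact string matches.
VOCAB = ["safe", "ignore", "please", "override", "hello"]

def toy_model(prompt_tokens):
    # Stand-in for a language model: emits "UNSAFE" only for one
    # rare prompt, the kind of behavior an audit tries to surface.
    return "UNSAFE" if list(prompt_tokens) == ["override", "ignore"] else "OK"

def audit(target, length=2):
    # Exhaustive search over discrete token sequences. The paper's
    # contribution is making this search tractable for real models,
    # where enumerating every sequence is impossible.
    for candidate in itertools.product(VOCAB, repeat=length):
        if toy_model(candidate) == target:
            return list(candidate)
    return None

print(audit("UNSAFE"))  # → ['override', 'ignore']
```

The point of the sketch is the framing Erik describes in the interview: auditing is posed as optimization over a finite set of tokens, not over continuous embeddings, so any prompt the search finds is a real input you could actually feed the model.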


Youtube: https://youtu.be/bhE5Zs3Y1n8

Paper: https://arxiv.org/abs/2303.04381

Erik: https://twitter.com/ErikJones313

Host: https://twitter.com/MichaelTrazzi

Patreon: https://www.patreon.com/theinsideview


Outline


00:00 Highlights

00:31 Erik's background and research at Berkeley

01:19 Motivation for doing safety research on language models

02:56 Is it too easy to fool today's language models?

03:31 The goal of adversarial attacks on language models

04:57 Automatically Auditing Large Language Models via Discrete Optimization

06:01 Optimizing over a finite set of tokens rather than continuous embeddings

06:44 Goal is revealing behaviors, not necessarily breaking the AI

07:51 On the feasibility of solving adversarial attacks

09:18 Suppressing dangerous knowledge vs just bypassing safety filters

10:35 Can you really ask a language model to cook meth?

11:48 Optimizing French to English translation example

13:07 Forcing toxic celebrity outputs just to test rare behaviors

13:19 Testing the method on GPT-2 and GPT-J

14:03 Adversarial prompts transferred to GPT-3 as well

14:39 How this auditing research fits into the broader AI safety field

15:49 Need for automated tools to audit failures beyond what humans can find

17:47 Auditing to avoid unsafe deployments, not for existential risk reduction

18:41 Adaptive auditing that updates based on the model's outputs

19:54 Prospects for using these methods to detect model deception

22:26 Preferring safety via alignment over auditing constraints alone; closing thoughts


Patreon supporters:

  • Tassilo Neubauer
  • MonikerEpsilon
  • Alexey Malafeev
  • Jack Seroy
  • JJ Hepburn
  • Max Chiswick
  • William Freire
  • Edward Huff
  • Gunnar Höglund
  • Ryan Coppolo
  • Cameron Holmes
  • Emil Wallner
  • Jesse Hoogland
  • Jacques Thibodeau
  • Vincent Weisser