This episode of the AI Safety Paper Digest is about the World Economic Forum's new playbook on advancing responsible AI innovation. Developed in collaboration with Accenture, the report provides a practical roadmap for turning responsible AI from an aspiration into a competitive advantage while building public trust.
Link to the Report: https://www.weforum.org/publications/advancing-responsible-ai-innovation-a-playbook/
Disclaimer: This summary was generated with the assistance of Google’s NotebookLM AI. For full technical details and comprehensive findings, please consult the original report.
How can we make autonomous driving systems safer through generative AI? In this episode, we explore LD-Scene, a novel framework that combines Large Language Models (LLMs) with Latent Diffusion Models (LDMs) to create controllable, safety-critical driving scenarios. These adversarial scenarios are essential for evaluating and stress-testing autonomous vehicles, yet they’re extremely rare in real-world data.
Sources referenced in this episode:
Mingxing Peng, Yuting Xie, Xusen Guo, Ruoyu Yao, Hai Yang, Jun Ma: “LD-Scene: LLM-Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios”
Disclaimer: This podcast summary was generated with the assistance of Google’s NotebookLM AI. For full technical details and comprehensive findings, please consult the original research paper.
Ever wanted a clear, comprehensive explanation of all the key terms related to Large Language Models (LLMs)? This episode has you covered.
In this deep dive of over an hour, we'll guide you through an essential glossary of LLM-related terms and foundational concepts, perfect for listening while driving, working, or on the go. Whether you're new to LLMs or looking to reinforce your understanding, this episode is designed to make complex terms accessible.
Sources referenced in this episode:
Humza Naveed et al., "A Comprehensive Overview of Large Language Models"
Tessa Gengnagel et al., "LLM Glossary (draft version)"
Disclaimer: This podcast summary was generated with the help of Google's NotebookLM AI. While we aim to provide an accurate and informative overview, we encourage listeners to consult the original research papers for a deeper and more comprehensive understanding of the topics discussed.
In this special Christmas episode, we delve into "Best-of-N Jailbreaking," a powerful new black-box algorithm that demonstrates the vulnerabilities of cutting-edge AI systems. This approach works by sampling numerous augmented prompts (for example, with shuffled characters or random capitalization) until a harmful response is elicited.
Discover how Best-of-N (BoN) Jailbreaking achieves high attack success rates against leading closed- and open-source models, and how the technique extends beyond text to vision and audio inputs.
We’ll also explore how BoN Jailbreaking scales with the number of prompt samples, following a power-law relationship, and how combining BoN with other techniques amplifies its effectiveness. This episode unpacks the implications of these findings for AI security and resilience.
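To make the core loop concrete, below is a minimal Python sketch of the sample-augment-retry idea. The augmentations (scrambling the middle of words, random capitalization) follow the paper's description, but the function names and the query_model and is_flagged stubs are placeholders for illustration, not the authors' implementation or any real API.

```python
import random


def augment(prompt: str, rng: random.Random) -> str:
    """Scramble the middle of longer words and randomly change letter case,
    mirroring the character-level augmentations described in the paper."""
    words = []
    for word in prompt.split():
        if len(word) > 3 and rng.random() < 0.5:
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    text = " ".join(words)
    return "".join(ch.upper() if rng.random() < 0.3 else ch for ch in text)


def best_of_n(prompt: str, n: int, query_model, is_flagged, seed: int = 0):
    """Sample up to n augmented variants of a prompt; stop as soon as the
    (placeholder) classifier flags the model's response."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        response = query_model(augment(prompt, rng))  # placeholder model call
        if is_flagged(response):                      # placeholder classifier
            return response, i
    return None
```

The number of samples n in this loop is the quantity behind the power-law scaling discussed above.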
Paper: Hughes, John, et al. "Best-of-N Jailbreaking." (2024). arXiv.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
In this episode, we explore the latest advancements in automated red teaming from OpenAI, presented in the paper "Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning." Automated red teaming has become essential for discovering rare failures and generating challenging test cases for large language models (LLMs). This paper tackles a core challenge: how to ensure attacks are both diverse and effective.
We dive into their two-step approach: first, automatically generating a diverse set of attacker goals; second, training an attacker with multi-step reinforcement learning to produce effective attacks for those goals, using auto-generated rule-based rewards to grade success and rewarding attacks that differ from earlier attempts.
Discover how this approach improves on previous methods by generating more varied and successful attacks, including prompt injection attacks and prompts that elicit unsafe responses, paving the way for more robust AI models.
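As a rough illustration of how "effective" and "diverse" can be combined into a single training signal, here is a conceptual Python sketch: a rule-based grader scores whether an attack met its goal, and a similarity penalty shrinks the reward for attacks that resemble earlier attempts. This is a simplified reading of the paper's idea rather than its exact reward design; rule_based_grade and embed are hypothetical placeholders.

```python
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def attacker_reward(
    attack: str,
    goal: str,
    past_attacks: List[str],
    rule_based_grade: Callable[[str, str], float],  # placeholder: 1.0 if the attack meets the goal
    embed: Callable[[str], Sequence[float]],        # placeholder sentence embedder
    diversity_weight: float = 0.5,
) -> float:
    """Combine an effectiveness score with a diversity bonus: the more the
    attack resembles past attempts, the smaller the bonus."""
    effectiveness = rule_based_grade(attack, goal)
    if past_attacks:
        max_sim = max(cosine(embed(attack), embed(p)) for p in past_attacks)
    else:
        max_sim = 0.0
    return effectiveness + diversity_weight * (1.0 - max_sim)
```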
Paper: Beutel, Alex, et al. "Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning." (2024). OpenAI.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
In this episode, we explore the findings from "Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis." As large language models (LLMs) are integrated into more applications, the security risks they pose, including information leaks and jailbreak attacks, grow as well. This study examines four major open-source vulnerability scanners - Garak, Giskard, PyRIT, and CyberSecEval - evaluating their effectiveness and reliability in detecting these risks. We'll discuss the unique features of each tool, uncover key gaps in their reliability, and share strategic recommendations for organizations looking to bolster their red-teaming efforts. Join us to understand how these tools stack up and what this means for the future of AI security.
Paper: Brokman, Jonathan, et al. "Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis." (2024). arXiv.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
In this episode, we delve into the groundbreaking watermarking technology presented in the paper "Scalable Watermarking for Identifying Large Language Model Outputs," published in Nature. SynthID-Text, a new watermarking scheme developed for large-scale production systems, preserves text quality while enabling high detection accuracy for synthetic content. We explore how this technology tackles the challenges of text watermarking without affecting LLM performance or training, and how it’s being implemented across millions of AI-generated outputs. Join us as we discuss how SynthID-Text could reshape the future of synthetic content detection and ensure responsible use of large language models.
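For intuition about how statistical watermark detection works in general, here is a minimal Python sketch of a keyed "green-list" scorer in the style of earlier text-watermarking schemes. It is a generic illustration, not SynthID-Text's actual tournament-sampling method; the key, the tokenization, and any decision threshold are placeholders.

```python
import hashlib
from typing import List


def is_green(prev_token: str, token: str, key: str, green_fraction: float = 0.5) -> bool:
    """Keyed pseudo-random assignment of a token to the 'green list',
    seeded by the preceding token and a secret key."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") / 2**64) < green_fraction


def watermark_score(tokens: List[str], key: str, green_fraction: float = 0.5) -> float:
    """Return a z-score for the observed fraction of green tokens; large
    values suggest the text was sampled with the keyed watermark."""
    hits = sum(
        is_green(prev, tok, key, green_fraction)
        for prev, tok in zip(tokens, tokens[1:])
    )
    n = max(len(tokens) - 1, 1)
    expected = n * green_fraction
    variance = n * green_fraction * (1 - green_fraction)
    return (hits - expected) / (variance ** 0.5)
```

A detector built on this idea would compare the returned z-score against a threshold chosen for an acceptable false-positive rate.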
Paper: Dathathri, Sumanth, et al. "Scalable Watermarking for Identifying Large Language Model Outputs." (2024). Nature.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research paper for a comprehensive understanding of the study and its findings.
In this episode, we dive into PyRIT, the open-source toolkit developed by Microsoft for red teaming and security risk identification in generative AI systems. PyRIT offers a model-agnostic framework that enables red teamers to detect novel risks, harms, and jailbreaks in both single- and multi-modal AI models. We’ll explore how this cutting-edge tool is shaping the future of AI security and its practical applications in securing generative AI against emerging threats.
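To illustrate the kind of model-agnostic architecture the episode describes (targets that wrap a model or API, converters that transform prompts, and an orchestrator that drives them), here is a conceptual Python sketch. The class names and interfaces are hypothetical stand-ins chosen for illustration, not PyRIT's actual API; see the toolkit's documentation for the real entry points.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Tuple


class PromptTarget(Protocol):
    """Anything that can answer a prompt: a chat API, a local model, a VLM."""
    def send(self, prompt: str) -> str: ...


class EchoTarget:
    """Toy target so the sketch runs without any external model or API key."""
    def send(self, prompt: str) -> str:
        return f"[model reply to: {prompt}]"


@dataclass
class RedTeamOrchestrator:
    """Drive a target with a list of probe prompts, applying optional
    converters (e.g., encodings or paraphrases) before each send."""
    target: PromptTarget
    converters: List[Callable[[str], str]] = field(default_factory=list)

    def run(self, probes: List[str]) -> List[Tuple[str, str]]:
        results = []
        for probe in probes:
            for convert in self.converters:
                probe = convert(probe)
            results.append((probe, self.target.send(probe)))
        return results


if __name__ == "__main__":
    orchestrator = RedTeamOrchestrator(target=EchoTarget(), converters=[str.upper])
    for prompt, reply in orchestrator.run(["benign test probe"]):
        print(prompt, "->", reply)
```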
Paper (preprint): Lopez Munoz, Gary D., et al. "PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems." (2024). arXiv.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
This podcast, "Jailbreaking GPT o1, " explores how the GPT o1 series, known for its advanced "slow-thinking" abilities, can be manipulated into generating disallowed content like hate speech through a novel attack method, the Single-Turn Crescendo Attack (STCA), which effectively bypasses GPT o1's safety protocols by leveraging the AI's learned language patterns and its step-by-step reasoning process.
Paper (preprint): Aqrawi, Alan and Arian Abbasi. “Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA).” (2024). TechRxiv.
Disclaimer: This podcast was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
This episode explores the intricate world of red-teaming generative AI models as discussed in the paper "Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI." We'll dive into the vulnerabilities that emerge as LLMs are increasingly integrated into real-world applications, and into the evolving tactics of adversarial attacks. Our conversation will center around the "Attack Atlas" - a practical framework that helps practitioners analyze and defend against single-turn input attacks - and we'll examine the critical challenges in both red- and blue-teaming generative AI systems. Whether you're a security expert or simply fascinated by the defense of cutting-edge AI, this episode breaks down the complexities of safeguarding AI against adversarial threats.
Paper (preprint): Rawat, Ambrish, et al. "Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI." (2024). arXiv.
Disclaimer: This podcast was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
In this episode, we examine the cutting-edge adversarial strategy presented in "Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)." Building on the multi-turn crescendo attack method, STCA escalates context within a single, expertly crafted prompt, effectively breaching the safeguards of large language models (LLMs) like never before. We discuss how this method can bypass moderation filters in a single interaction, the implications of this for responsible AI (RAI), and what can be done to fortify defenses against such sophisticated exploits. Join us as we break down how a single, well-designed prompt can reveal deep vulnerabilities in current AI safety protocols.
Paper (preprint): Aqrawi, Alan and Arian Abbasi. "Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)." (2024). arXiv.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.
This episode dives into how the Crescendo Multi-Turn Jailbreak Attack leverages seemingly benign prompts to escalate dialogues with large language models (LLMs) such as ChatGPT, Gemini, and Anthropic Chat, ultimately bypassing safety protocols to generate restricted content. The attack begins with general questions and subtly steers the model's responses, sidestepping traditional input filters and achieving a high success rate across popular LLMs. The discussion also covers Crescendomation, an automated tool that outperforms other jailbreak methods and showcases how vulnerable current AI systems are to gradual escalation.
Paper (preprint): Russinovich, Mark, et al. "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack." (2024). arXiv.
Disclaimer: This podcast was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.