AI: AX - introspection
mcgrof
8 episodes
2 days ago
The art of looking into a model and understanding what is going on inside it through introspection is referred to as AX.
Technology
HarmBench: Automated Red Teaming for LLM Safety
22 minutes 28 seconds
3 months ago


This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also presents R2D2, a novel adversarial training method that leverages HarmBench to significantly improve LLM refusal mechanisms without compromising overall performance, ultimately aiming to foster safer AI development.


Source: February 2024 - https://arxiv.org/pdf/2402.04249 - HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
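To make the benchmark's central metric concrete, here is a minimal Python sketch of an attack-success-rate (ASR) computation of the kind HarmBench standardizes: a red-team method produces an adversarial prompt per harmful behavior, the target LLM responds, and a judge classifier decides whether the behavior was elicited. All names here (attack, generate, judge_is_harmful) are hypothetical stand-ins, not HarmBench's actual API.

from typing import Callable, Iterable

def attack_success_rate(
    behaviors: Iterable[str],
    attack: Callable[[str], str],                  # red-team method: behavior -> adversarial prompt
    generate: Callable[[str], str],                # target LLM: prompt -> completion
    judge_is_harmful: Callable[[str, str], bool],  # judge: (behavior, completion) -> success?
) -> float:
    """Fraction of harmful behaviors the attack successfully elicits."""
    behaviors = list(behaviors)
    successes = 0
    for behavior in behaviors:
        prompt = attack(behavior)
        completion = generate(prompt)
        if judge_is_harmful(behavior, completion):
            successes += 1
    return successes / len(behaviors) if behaviors else 0.0

Holding the behavior set and the judge fixed across attacks and models is what makes the resulting ASR numbers comparable, which is the standardization gap the paper argues earlier evaluations left open.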
