Two Voice Devs
Mark and Allen
256 episodes
2 days ago
Mark and Allen talk about the latest news in the VoiceFirst world from a developer point of view.
Technology
Episode 238 - LLM Benchmarking: What, Why, Who, and How
Two Voice Devs
31 minutes 44 seconds
5 months ago

How do you know if a Large Language Model is good for your specific task? You benchmark it! In this episode, Allen speaks with Amy Russ about her fascinating career path from international affairs to data, and how that unique perspective now informs her work in LLM benchmarking.

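To make the idea concrete, here is a minimal sketch of a benchmark loop in Python. This is not how any particular benchmark is implemented; generate() is a placeholder for whatever model API you use, and the test cases are invented for illustration.

# Minimal benchmark loop: a fixed set of cases, a fixed scoring rule,
# and a single score out the other end.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")  # placeholder

CASES = [  # illustrative (input, expected) pairs, not a real benchmark
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def run_benchmark() -> float:
    correct = sum(
        1 for prompt, expected in CASES
        if expected.lower() in generate(prompt).lower()
    )
    return correct / len(CASES)  # e.g. 0.5 means half the cases passed
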
Amy explains what benchmarking is, why it's crucial for both model builders and app developers, and how it goes far beyond simple technical tests to include societal, cultural, and ethical considerations like preventing harms.

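Safety-oriented benchmarks often score the opposite behavior: for a harmful prompt, a refusal is the passing answer. A toy illustration of that inverted scoring (the refusal markers are assumptions, and real evaluations are far more nuanced):

# For harm-prevention cases, "correct" means the model declined to answer.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def is_refusal(response: str) -> bool:
    return response.strip().lower().startswith(REFUSAL_MARKERS)
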
Learn about the complex process involved: assembling diverse teams, defining fuzzy criteria, and using technical tools such as data versioning and prompt template engines. Amy also shares how to get involved in open benchmarking efforts and where to find benchmarks relevant to your own LLM projects.

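Two of the tools mentioned, prompt template engines and data versioning, can be sketched with the Python standard library alone. The template text and file name below are assumptions for the example, not anything from the episode:

import hashlib
from string import Template

# A prompt template engine in miniature: one canonical template,
# filled in per test case, so every model sees identical wording.
PROMPT = Template("Answer concisely.\nQuestion: $question\nAnswer:")

def render(question: str) -> str:
    return PROMPT.substitute(question=question)

# Data versioning in miniature: hash the benchmark file so a score can
# always be traced back to the exact data that produced it.
def dataset_version(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

# dataset_version("benchmark_cases.jsonl")  # hypothetical file name
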
Whether you're building models or using them in your applications, understanding benchmarking is key to finding and evaluating the best AI for your needs.


Learn More:

* ML Commons - https://mlcommons.org/


Timestamps:

00:18 Amy's Career Path (From Diplomacy to Data)
02:46 What Amy Does Now (Benchmarking & Policy)
03:38 Defining LLM Benchmarking
05:08 Policy & Societal Benchmarking (Preventing Harms)
07:55 The Need for Diverse Benchmarking Teams
09:55 Technical Aspects & Tooling (Data Integrity, Versioning)
10:50 Prompt Engineering & Versioning for Benchmarking
12:48 Preventing Models from Tuning to Benchmarks
15:30 Prompt Template Engines & Generating Prompts
17:10 Other Benchmarking Tools & Testing Nuances
19:10 Benchmarking Compared to Traditional QA
21:45 Evaluating Benchmark Results (Human & Metrics)
23:05 The Challenge of Establishing an Evaluation Scale
23:58 How to Get Started in Benchmarking (Volunteering, Organizations)
25:20 Open Benchmarks & Where to Find Them
26:35 Benchmarking Your Own Model or App
28:55 Why Benchmarking Matters for App Builders
29:55 Where to Learn More & Follow Amy


Hashtags:

#LLM #Benchmarking #AI #MachineLearning #GenAI #DataScience #DataEngineering #PromptEngineering #ModelEvaluation #TechPodcast #Developer #TwoVoiceDevs #MLCommons #QA
