Beyond the Benchmarks: o3 Reality Check, AI Companies, and The Leaderboard Problem

Fresh From the Labs

Pioneer Square Labs

23 episodes

3 days ago

Fresh From the Labs is your front-row seat to the future of AI — straight from the builders shaping it. Hosted by the product team at Pioneer Square Labs, a Seattle-based venture studio, each episode dives into the week's most exciting AI breakthroughs, tools, and trends. No hype, just hands-on insight from the people actually prototyping, experimenting, and pushing boundaries with the latest tech. Whether you're building with AI or just trying to keep up, this podcast is your lab-tested shortcut to what matters most.

Technology

All content for Fresh From the Labs is the property of Pioneer Square Labs and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Beyond the Benchmarks: o3 Reality Check, AI Companies, and The Leaderboard Problem

43 minutes 59 seconds

6 months ago

This week on Fresh From the Labs, we're looking past the leaderboards and hype to explore the real-world challenges and limitations of today's AI.

Can AI actually run a company? We dive into recent CMU research that put AI agents to the test, revealing significant struggles with common-sense tasks and with complex automation, such as using a web browser effectively.

The conversation unpacks the performance of specific models like o3, contrasting benchmark achievements with practical usability and the ever-present issue of AI hallucinations. We discuss the dangers these hallucinations pose, especially in critical applications: how they can subtly mislead users and create more work. We also cover why simply topping a leaderboard (thanks, Goodhart's Law!) doesn't guarantee success for your specific problem.

Join Shilpa, Jared, and Kevin as they discuss the trial-and-error reality of model selection, the importance of truly understanding the problem you're solving, and why promising developments like local models might offer a path forward through some of these current hurdles. It's a candid look at where AI excels and where it still falls short.

Link to Dr. Anthony Diamond's blog post on o1: https://www.psl.com/feed-posts/o1-an-entirely-different-animal---buyer-beware