
This week on Fresh From the Labs, we're looking past the leaderboards and hype to explore the real-world challenges and limitations of today's AI.
Can AI actually run a company? We dive into recent CMU research that put AI agents to the test, revealing significant struggles with common-sense tasks and complex automation like using a web browser effectively.
The conversation unpacks the performance of specific models like o3, contrasting benchmark achievements with practical usability and the ever-present issue of AI hallucinations. We discuss the dangers these hallucinations pose, especially in critical applications, how they can subtly mislead users and create more work, and why simply topping a leaderboard (thanks, Goodhart's Law!) doesn't guarantee success for your specific problem.
Join Shilpa, Jared, and Kevin as they discuss the trial-and-error reality of model selection, the importance of truly understanding the problem you're solving, and why promising developments like local models might offer a path forward through some of these current hurdles. It's a candid look at where AI excels and where it still falls short.
Link to Dr. Anthony Diamond's blog post on o1: https://www.psl.com/feed-posts/o1-an-entirely-different-animal---buyer-beware