Next in AI: Your Daily News Podcast
Next in AI
36 episodes
1 day ago
Stay ahead of artificial intelligence daily. AI Daily Brief brings you the latest AI news, research, tools, and industry trends — explained clearly and quickly. This daily AI podcast helps founders, developers, and curious minds cut through the noise and understand what’s next in technology.
Technology
Perplexity MoE Deployment Deep Dive: The Custom Kernels and Network Secrets That Make Massive AI Models Run 5X Faster
Next in AI: Your Daily News Podcast
16 minutes 10 seconds
3 days ago
This episode covers the development of high-performance, portable communication kernels built for the demanding sparse expert-parallel (EP) communication steps, Dispatch and Combine, in large-scale Mixture-of-Experts (MoE) models such as DeepSeek R1 and Kimi-K2. An initial open-source NVSHMEM-based library ran up to 10x faster than standard All-to-All communication, using GPU-initiated communication (IBGDA) and a split-kernel architecture that overlaps computation with communication, cutting latency 2.5x on single-node deployments. Specialized hybrid CPU-GPU kernels were then developed to reach state-of-the-art latencies for inter-node deployments over ConnectX-7 and AWS Elastic Fabric Adapter (EFA), which is crucial for serving trillion-parameter models. The multi-node approach uses high EP degrees to spread experts across more GPUs and reduce per-GPU memory-bandwidth pressure, letting MoE models achieve higher throughput and lower latency at the same time, an effect that often runs contrary to dense-model scaling.
