NVIDIA's Jet Nemotron - Post Neural Architecture Search & JetBlock

AI Intuition

Dan Sarmiento

89 episodes

5 days ago

This is the gold rush era of artificial intelligence. You want to learn quickly so you don't get left behind, but how can you learn about AI without an advanced degree in computer science and mathematics? You translate all the complicated concepts into plain language and you summarize the relevant news into a podcast you can listen to while you do everything else. This is the method that helped me speed up my learning and maybe it can help you too.

Technology

47 minutes 7 seconds

2 months ago

This episode covers NVIDIA's new Jet-Nemotron model family, which takes a hybrid-architecture approach to Large Language Models (LLMs) to significantly improve efficiency without sacrificing accuracy. Two technologies drive it: Post Neural Architecture Search (PostNAS), a method for "retrofitting" existing models by identifying less critical full-attention layers and replacing them with more efficient ones, and JetBlock, a novel linear attention module. The core idea is that not all attention layers are equally important. Replacing the less important ones drastically shrinks the Key-Value (KV) cache, which leads to up to a 53.6x increase in decoding throughput and a potential 98% reduction in inference cost. Jet-Nemotron aims to set a new standard for LLM evaluation, emphasizing real-world performance and hardware efficiency across a range of devices, from data centers to edge devices, making high-performance AI more economically viable and accessible.
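To build some intuition for why fewer full-attention layers means a smaller KV cache, here is a rough back-of-the-envelope sketch in Python. The layer counts, head counts, and context length are hypothetical numbers chosen for illustration, not Jet-Nemotron's actual configuration. The sketch ignores the small fixed-size state that linear attention layers keep, and the cost estimate assumes inference cost scales inversely with decoding throughput, which is a simplification.

```python
def kv_cache_bytes(num_full_attn_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_value=2):
    """Approximate KV cache size for the full-attention layers only.

    The leading factor of 2 counts one tensor for keys and one for values.
    bytes_per_value=2 assumes 16-bit storage (an assumption for this sketch).
    """
    return (2 * num_full_attn_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def cost_reduction_from_speedup(speedup):
    """Fraction of cost saved if cost is inversely proportional to throughput."""
    return 1.0 - 1.0 / speedup


# Hypothetical model: 28 full-attention layers, 2 KV heads of dimension 128,
# and a 65,536-token context. None of these values come from the episode.
baseline = kv_cache_bytes(num_full_attn_layers=28, num_kv_heads=2,
                          head_dim=128, seq_len=65536)

# Hypothetical hybrid: keep only 2 full-attention layers and swap the rest
# for linear attention, whose state does not grow with sequence length.
hybrid = kv_cache_bytes(num_full_attn_layers=2, num_kv_heads=2,
                        head_dim=128, seq_len=65536)

print(f"baseline cache: {baseline / 2**30:.2f} GiB")
print(f"hybrid cache:   {hybrid / 2**30:.2f} GiB")
print(f"cache shrinks by a factor of {baseline / hybrid:.0f}")
print(f"cost saved at 53.6x throughput: {cost_reduction_from_speedup(53.6):.1%}")
```

Under these assumptions the cache size falls in direct proportion to the number of full-attention layers that remain, and a 53.6x throughput gain works out to roughly the 98% cost reduction quoted above.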