Jigsaw: Training Multi-Billion-Parameter AI Weather Models With Optimized Model Parallelism

Earthly Machine Learning

Amirpasha

38 episodes

6 days ago

“Earthly Machine Learning (EML)” offers AI-generated insights into cutting-edge machine learning research in weather and climate sciences. Powered by Google NotebookLM, each episode distils the essence of a standout paper, helping you decide if it’s worth a deeper look. Stay updated on the ML innovations shaping our understanding of Earth. It may contain hallucinations.

Earth Sciences

Science

13 minutes 36 seconds

1 week ago

Authors: Deifilia Kieckhefen, Markus Götz, Lars H. Heyen, Achim Streit, and Charlotte Debus (Karlsruhe Institute of Technology, Helmholtz AI)

The paper introduces WeatherMixer (WM), a multi-layer perceptron (MLP)-based architecture for atmospheric forecasting that serves as a competitive alternative to Transformer-based models. WM's workload scales linearly with input size, which avoids the quadratic computational complexity of the self-attention mechanism in Transformers when dealing with gigabyte-sized atmospheric data.

The paper also proposes a novel parallelization scheme called Jigsaw parallelism, which combines domain parallelism and tensor parallelism to efficiently train multi-billion-parameter models. Jigsaw is optimized for large input data: it fully shards the data, model parameters, and optimizer states across devices, eliminating memory redundancy.
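To make the "eliminating memory redundancy" point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the function name `per_device_bytes`, the 2-billion-parameter model size, 4-byte values, and two optimizer slots per parameter (as in Adam) are all assumptions chosen for illustration. It compares the per-device memory of a fully replicated setup with one where parameters, gradients, and optimizer states are split evenly across devices, and it ignores activations and input data.

```python
def per_device_bytes(n_params, n_devices, sharded,
                     bytes_per_value=4, optimizer_slots=2):
    """Approximate per-device bytes for parameters, gradients, and optimizer state.

    Assumptions (not from the paper): one copy each of parameters and
    gradients, plus `optimizer_slots` extra values per parameter
    (two slots would match Adam's first and second moments).
    """
    total = n_params * bytes_per_value * (2 + optimizer_slots)
    # Fully sharded: every device holds an equal 1/n_devices share.
    # Replicated: every device holds the full state.
    return total / n_devices if sharded else total


n_params = 2_000_000_000  # hypothetical 2-billion-parameter model
for n in (1, 4, 16):
    replicated = per_device_bytes(n_params, n, sharded=False) / 1e9
    sharded = per_device_bytes(n_params, n, sharded=True) / 1e9
    print(f"{n:>2} devices: replicated {replicated:.0f} GB, sharded {sharded:.0f} GB")
```

Under these assumptions the replicated footprint stays constant as devices are added, while the sharded footprint shrinks in proportion to the device count.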

Jigsaw effectively mitigates hardware bottlenecks, particularly I/O-bandwidth limitations frequently encountered in training large scientific AI models. Due to its partitioned data loading (domain parallelism), the scheme achieves superscalar weak scaling in I/O-bandwidth-limited systems.
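The idea of partitioned data loading can be sketched as follows. This is a simplified illustration, not the paper's implementation: the helper names `domain_slice` and `bytes_read`, the row-wise split, and the grid of 721 x 1440 points with 70 channels are hypothetical. Each rank reads only its own slice of a sample, so the bytes each device pulls from storage per sample shrink as more devices share the domain.

```python
def domain_slice(rank, n_ranks, n_rows):
    """Return the [start, stop) row range owned by `rank` in a near-even split."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def bytes_read(rank, n_ranks, n_rows, n_cols, n_channels, bytes_per_value=4):
    """Bytes a rank reads for one sample when it loads only its own rows."""
    start, stop = domain_slice(rank, n_ranks, n_rows)
    return (stop - start) * n_cols * n_channels * bytes_per_value


# Hypothetical sample: 721 x 1440 grid points, 70 channels, 4-byte values.
n_rows, n_cols, n_channels = 721, 1440, 70
for n_ranks in (1, 4):
    per_rank = [bytes_read(r, n_ranks, n_rows, n_cols, n_channels)
                for r in range(n_ranks)]
    print(n_ranks, "ranks, max MB per rank:", round(max(per_rank) / 1e6, 1))
```

In this sketch the slices are disjoint and together cover the whole sample, so no rank reads data that another rank also reads.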

The method demonstrates excellent scaling behavior on high-performance computing systems, exceeding state-of-the-art strong-scaling performance in computation–communication-limited systems. Training was scaled to 256 GPUs, reaching peak performances of 9 and 11 PFLOPs.

Beyond hardware efficiency, Jigsaw improves predictive performance. Partitioning the model across more GPUs (model parallelism), instead of relying solely on data parallelism, naturally enforces smaller global batch sizes. Empirically, this helps mitigate the problematic large-batch effects observed in AI weather models and leads to lower loss values.
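The link between model parallelism and global batch size comes down to simple arithmetic, sketched below. The function name `global_batch_size` and the per-replica batch of one sample are assumptions for illustration; the 256-GPU count is the scale reported above. When a group of GPUs jointly holds one model replica, fewer replicas fit on the same hardware, so fewer samples are processed per optimizer step.

```python
def global_batch_size(n_gpus, model_parallel_degree, per_replica_batch=1):
    """Global batch size when each model replica spans `model_parallel_degree` GPUs."""
    if n_gpus % model_parallel_degree != 0:
        raise ValueError("n_gpus must be divisible by model_parallel_degree")
    # GPUs that share one replica process the same samples together,
    # so only the number of replicas multiplies the batch size.
    data_parallel_replicas = n_gpus // model_parallel_degree
    return data_parallel_replicas * per_replica_batch


for degree in (1, 4, 16):
    print(f"model-parallel degree {degree:>2}: "
          f"global batch {global_batch_size(256, degree)}")
```

With pure data parallelism (degree 1) on 256 GPUs, the global batch in this sketch is 256 samples; spreading each replica over 16 GPUs reduces it to 16.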