Jed talks with Brandon Liu about building maps for the web with Protomaps and PMTiles. We cover why new formats won't work without a compelling application, how a single-file base map functions as a reusable data product, how to design simple specs for long-term usability, and how object-storage-based approaches can replace server-based stacks while staying fast and easy to integrate. Many thanks to our listeners from Norway and Egypt who stayed up very late for the live stream!
Links and Resources
- Protomaps – a free, customizable base map you can self-host
- PMTiles Viewer – drag-and-drop viewer for .pmtiles files
- Google-Microsoft-OSM Open Buildings - combined by VIDA – browse 2.7 billion building footprints in PMTiles on Source
- Emergent standards white paper from the Institutional Architecture Lab
Key Takeaways
1. Ship a killer app if you want a new format to gain traction — The Protomaps base map is the product that makes the PMTiles format matter.
2. Single-file, object storage first — PMTiles runs from a bucket or an SD card, with a browser-based viewer for offline use; see the code sketch after this list.
3. Design simple, future-proof specifications — Keep formats small and reimplementable with minimal dependencies; simplicity preserves longevity and portability.
4. Prioritize the developer experience — Single-binary installs, easy local preview, and minimal incidental complexity drive adoption more than raw capability does.
5. Build the right pipeline for the job — Separate visualization-optimized packaging from analysis-ready data; don’t force one format to do everything.
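To ground the object-storage takeaway, here is a minimal sketch of reading tiles straight from a PMTiles archive, assuming the pmtiles npm package; the archive URL is a placeholder, and the only server-side requirement is HTTP range request support.

```ts
import { PMTiles } from "pmtiles";

// Placeholder URL: any object store or static host that honors
// HTTP Range headers works; no tile server is involved.
const archive = new PMTiles("https://example.com/basemap.pmtiles");

async function main() {
  // The header and tile directories are fetched via small range reads.
  const header = await archive.getHeader();
  console.log(`zoom levels ${header.minZoom}-${header.maxZoom}`);

  // Fetch one tile by z/x/y; the library resolves its byte offset
  // from the archive's directory and issues a single range request.
  const tile = await archive.getZxy(0, 0, 0);
  if (tile) {
    console.log(`tile 0/0/0 is ${tile.data.byteLength} bytes`);
  }
}

main();
```

For rendering, the same package ships a Protocol adapter that can be registered with MapLibre GL JS, so a map style can point at a pmtiles:// URL and pull tiles from the archive on demand.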
Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how AI has consumed the "low-hanging fruit" of internet data and graphics-driven hardware innovation, and what this means for the future of AI development.
The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten digits dataset that became machine learning's "Hello World"; ImageNet (2008), Fei-Fei Li's image dataset that launched deep learning through AlexNet's 2012 breakthrough; and Common Crawl (2007), Gil Elbaz's web crawling project that supplied roughly 60% of GPT-3's training data. Drew argues that great data products create ecosystems around themselves, using the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of creating sustainable data infrastructure for the next generation of AI systems.
Links and Resources
- Common Crawl Foundation Event - October 22nd at Stanford!
- Cloud-Native Geospatial Forum Conference 2026 - 6-9 October 2026 at Snowbird, Utah!
- Why LLM Advancements Have Slowed: The Low-Hanging Fruit Has Been Eaten - Drew's blog post that inspired this conversation
- Unicorns, Show Ponies, and Gazelles - Jed's vision for sustainable data organizations
- ARC AGI Benchmark - François Chollet's reasoning benchmark
- Thinking Machines Lab - Mira Murati's reproducibility research lab
- Terminal Bench - Stanford's coding agent evaluation benchmark
- Data Science at the Singularity - David Donoho's masterful paper examining the power of frictionless reproducibility
- Rethinking Dataset Discovery with DataScout - New paper examining dataset discovery
- MNIST Dataset - The foundational machine learning dataset on Hugging Face
Key Takeaways
1. Great data products create ecosystems - They don't just provide data; they enable entire communities and industries to flourish
2. Benchmarks are data products with intent - They encode values and shape the direction of AI development
3. We've consumed the easy wins - The internet and graphics innovations that powered early AI breakthroughs are largely exhausted
4. The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
5. Data markets need new models - Traditional approaches to data sharing may not work in the AI era