Dear Analyst
KeyCuts
10 episodes · 6 months ago
This is a podcast made by a lifelong analyst. I cover topics including Excel, data analysis, and tools for sharing data. In addition to data analysis topics, I may also cover topics related to software engineering and building applications. I also do a roundup of my favorite podcasts and episodes.
Categories: Tech News, Education, Technology, News, How To

Dear Analyst #129: How to scale self-serve analytics tools to thousands of users at Datadog with Jean-Mathieu Saponaro
31 minutes 58 seconds · 1 year ago

When your organization is small, a centralized data team can take care of all the internal data tooling, reporting, and requests for every department. As the company grows from 100 to thousands of people, a centralized data team simply cannot handle the volume of requests and doesn't have the domain knowledge of every department. Jean-Mathieu Saponaro (JM) has experienced this transformation at Datadog. He joined Datadog in 2015 as a research engineer and was part of the inaugural data analytics team, which now supports 6,000+ employees. In this episode, he discusses scaling a self-serve analytics tool, moving from ETL to ELT data pipelines, and structuring the data team in a hybrid data mesh model.

Building a data catalog for data discovery

According to JM, creating a data catalog is not that hard (when your organization is small). I've seen data catalogs done in a shared Google Doc where everyone knows what all the tables and columns mean. When the data warehouse grows to hundreds of tables, that's when you'll need a proper data cataloging solution to store all the metadata about your data assets. This is when you move to something like Excel (just kidding)! In all seriousness, a shared Google Sheet isn't a terrible solution if your data warehouse isn't that large and the data structure isn't very complicated.

(Image source: North Shore Data Services)
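To make that concrete, here's a minimal sketch of what a spreadsheet-style catalog holds, written in Python. The table names, descriptions, and owners are invented for illustration (this isn't Datadog's actual catalog); the point is just that each warehouse table maps to a description and someone you can ask about it.

```python
from dataclasses import dataclass

# One row per warehouse table, just like a row in a shared Google Sheet.
@dataclass
class CatalogEntry:
    table_name: str    # physical name in the warehouse
    description: str   # what the table contains
    owner: str         # team or person to ask about it

# Hypothetical entries for illustration only.
CATALOG = [
    CatalogEntry("dim_customers", "One row per customer account", "data-eng"),
    CatalogEntry("fact_orders", "One row per completed order", "data-eng"),
]

def search_catalog(keyword: str) -> list[CatalogEntry]:
    """Naive keyword search over table names and descriptions."""
    keyword = keyword.lower()
    return [
        e for e in CATALOG
        if keyword in e.table_name.lower() or keyword in e.description.lower()
    ]

if __name__ == "__main__":
    for entry in search_catalog("order"):
        print(entry.table_name, "-", entry.description)
```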

JM discussed a few strategies that helped them scale their internal data discovery tool:

Strong naming conventions

A pretty common pattern for data warehouses containing "business" data is using dim and fact tables. Every table in the data warehouse is prefixed with dim or fact so that it's clear what kind of data the table stores. There are also consistent naming conventions for the columns in each table. Finally, the "display" name for the table should be closely related to the physical table name itself. For instance, if the table is dim_customers, the display name for the table would just be customers.
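As a rough illustration of this convention, here's how you might enforce the dim/fact prefixes and derive display names in Python. The exact validation rule is my assumption for the sketch, not Datadog's actual policy.

```python
import re

# Assumed convention: every table is named dim_<name> or fact_<name>,
# all lowercase with underscores.
TABLE_NAME_PATTERN = re.compile(r"^(dim|fact)_[a-z][a-z0-9_]*$")

def is_valid_table_name(table_name: str) -> bool:
    """Check that a table follows the dim_/fact_ naming convention."""
    return TABLE_NAME_PATTERN.match(table_name) is not None

def display_name(table_name: str) -> str:
    """Derive the catalog display name by stripping the prefix."""
    return re.sub(r"^(dim|fact)_", "", table_name)

assert is_valid_table_name("dim_customers")
assert not is_valid_table_name("customers_dim")
assert display_name("dim_customers") == "customers"
assert display_name("fact_orders") == "orders"
```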

Snowflake schema

Another common pattern is using a snowflake schema to structure the relationships between tables. This structure makes it easy to do business intelligence (e.g. reports in Excel) later on.

(Snowflake schema diagram. Source: Wikipedia)
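Here's a hedged sketch of a tiny snowflake schema as plain Python records (all table and column names invented). The "snowflake" part is that dimensions are normalized into their own tables, so answering a business question means walking fact table → dimension → sub-dimension, which is exactly the join path a BI tool would generate.

```python
# Sub-dimension: regions, normalized out of the customer dimension.
dim_regions = {
    1: {"region_name": "EMEA"},
    2: {"region_name": "AMER"},
}

# Dimension: customers, each pointing at a region (FK -> dim_regions).
dim_customers = {
    10: {"customer_name": "Acme Corp", "region_id": 1},
    11: {"customer_name": "Globex", "region_id": 2},
}

# Fact table: one row per order (FK -> dim_customers).
fact_orders = [
    {"order_id": 100, "customer_id": 10, "amount": 250.0},
    {"order_id": 101, "customer_id": 11, "amount": 99.0},
]

# "Revenue by region" walks fact -> dimension -> sub-dimension.
revenue_by_region: dict[str, float] = {}
for order in fact_orders:
    customer = dim_customers[order["customer_id"]]
    region = dim_regions[customer["region_id"]]["region_name"]
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + order["amount"]

print(revenue_by_region)  # {'EMEA': 250.0, 'AMER': 99.0}
```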

Customizing the data discovery experience

Datadog switched BI tools a few years ago so that the tool could be used by technical and non-technical users alike. They ended up going with Metabase because it didn't feel as "advanced" as Tableau.

In terms of their data catalog, one of the key decisions going into picking a tool was being able to quickly answer questions like: Where do I start? Where do I go to learn about our customer data? Product data? This is where the discovery experience is important. JM said the entry point to their catalog is still just a list of 800+ tables, but they are working on a custom home page.

JM's team thought about the classic build vs. buy decision for their data cataloging tool. Given the size of their organization, they went with building the tool internally. If the number of users were smaller, it would've been fine to go with an off-the-shelf SaaS tool. JM's team set a goal of building the tool in a few months, and it took them exactly 3.5 months. Building the tool internally also meant they could design and reuse custom UI components, which resulted in a consistent user experience at every step of the data discovery process.

Should you migrate data pipelines from ETL to ELT?