This paper details the evolution of Google's Spanner, a globally distributed database system, from a key-value store to a fully fledged SQL system. Key improvements discussed include distributed query execution, handling of transient failures via query restarts, efficient range extraction for data retrieval, and the adoption of a common SQL dialect. The authors also explain the transition from a Bigtable-like storage format to a more efficient blockwise-columnar store (Ressi). Finally, the paper highlights lessons learned during Spanner's large-scale deployment and outlines remaining challenges.
The article explores Change Data Capture (CDC), a method for tracking database changes, highlighting its advantages over traditional daily snapshots. It details three CDC implementation approaches: using database triggers (e.g., in PostgreSQL), capturing API requests and using a message broker (e.g., Kafka), and leveraging change streams within a data warehouse (e.g., Snowflake). The article compares these methods, weighing their pros and cons in terms of performance, scalability, and ease of implementation. A subsequent discussion critiques the presented methods, suggesting alternative, more robust solutions based on logical replication tools like Debezium.
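To make the trigger-based approach concrete, here is a minimal sketch (not taken from the article) of capturing row changes on a hypothetical PostgreSQL `orders` table into a changelog table; the table, trigger, and function names are illustrative.

```python
# Trigger-based CDC sketch: every INSERT/UPDATE/DELETE on "orders" is copied,
# with its operation type and a JSON row image, into "orders_changelog".
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS orders_changelog (
    id         BIGSERIAL PRIMARY KEY,
    operation  TEXT        NOT NULL,              -- INSERT / UPDATE / DELETE
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    row_data   JSONB       NOT NULL               -- full row image as JSON
);

CREATE OR REPLACE FUNCTION capture_order_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO orders_changelog (operation, row_data)
    VALUES (TG_OP, to_jsonb(COALESCE(NEW, OLD)));
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS orders_cdc ON orders;
CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_order_change();
"""

def install_cdc_trigger(dsn: str) -> None:
    """Install the changelog table and trigger on the source database."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```

The trade-off the article weighs is visible here: the capture logic runs inside every write transaction, which is simple to reason about but adds latency to the source database — one reason the follow-up discussion favours log-based tools such as Debezium.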
This research paper introduces DeepSeek-R1, a large language model enhanced for reasoning capabilities using reinforcement learning (RL). Two versions are presented: DeepSeek-R1-Zero, trained purely via RL without supervised fine-tuning, and DeepSeek-R1, which incorporates additional multi-stage training and cold-start data for improved readability and performance. DeepSeek-R1 achieves results comparable to OpenAI's o1-1217 on various reasoning benchmarks. The study also explores distilling DeepSeek-R1's reasoning capabilities into smaller, more efficient models, achieving state-of-the-art results. Finally, the paper discusses unsuccessful attempts using process reward models and Monte Carlo Tree Search, providing valuable insights for future research.
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
This Atlassian blog post details the migration of Jira Cloud's Issue Service from JSON to Protocol Buffers (Protobuf) to enhance performance. The switch involved a phased approach to minimise downtime, creating new endpoints and logic to handle both formats concurrently before a complete transition. The results showcased significant improvements: 75% less Memcached CPU usage, 80% smaller data size, and a substantially faster response time. Challenges encountered included Protobuf's handling of null values and incompatibility with Spring's default error controller, which required workarounds. Ultimately, the migration yielded substantial performance gains and reduced infrastructure needs.
https://www.atlassian.com/blog/atlassian-engineering/using-protobuf-to-make-jira-cloud-faster
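On the null-handling point: proto3 scalars have no native null, which is why the migration needed workarounds. The snippet below is one generic way to carry an explicit null in a Protobuf payload using the well-known `Struct`/`NullValue` types — shown only to illustrate the problem, not necessarily the workaround Atlassian used; the field names are invented.

```python
# proto3 has no null for scalar fields, so "missing" and "null" look the same.
# The Struct well-known type can carry an explicit JSON-style null.
from google.protobuf import struct_pb2, json_format

payload = struct_pb2.Struct()
payload.update({"summary": "Fix login bug", "assignee": None})  # None -> NullValue

print(json_format.MessageToJson(payload))
# {"summary": "Fix login bug", "assignee": null}
```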
This research paper introduces Hyaline, a novel family of memory reclamation schemes for lock-free data structures in unmanaged C/C++ code. Hyaline leverages reference counting, but only during reclamation, minimising overhead during object access and balancing workload across threads. The paper details Hyaline's design, including a scalable multi-list version and robust extensions to handle stalled threads. Extensive testing across multiple architectures demonstrates Hyaline's superior performance and memory efficiency compared to existing schemes like epoch-based reclamation and hazard pointers, particularly in read-dominated and oversubscribed scenarios. The paper concludes by proving Hyaline's correctness and lock-freedom properties.
This Atlassian blog post details Trello's migration from RabbitMQ to Kafka for its websocket architecture. RabbitMQ's unreliability during network partitions and high costs associated with queue creation and deletion prompted the switch. The article compares various queuing systems, highlighting Kafka's superior failover capabilities and in-order message delivery. Trello implemented a master-client architecture with Kafka, resulting in improved performance, reduced costs, and fewer outages. Key performance improvements included a 33% decrease in memory usage and a substantial cost reduction.
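A small sketch (not Trello's actual code) of the in-order delivery property mentioned above: keying Kafka messages by a board identifier sends all updates for that board to the same partition, so consumers replay them in order. Topic, key, and field names are illustrative, using the `confluent_kafka` client.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_board_update(board_id: str, delta: dict) -> None:
    # Same key -> same partition -> strict ordering for that board's updates.
    producer.produce(
        "board-updates",
        key=board_id,
        value=json.dumps(delta).encode(),
    )

publish_board_update("board-123", {"card": "c1", "action": "moved"})
producer.flush()  # block until outstanding messages are delivered
```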
This podcast explores the field of reliability engineering, tracing its origins at Google with the development of Site Reliability Engineering (SRE). It differentiates reliability engineering from SRE, highlighting its broader applicability across various organisational structures. The podcast outlines four key promises of a successful reliability team: defining service levels (SLA/SLO/SLI), managing the service infrastructure, participating in technical design, and providing tactical support during incidents. Finally, it discusses the evolving landscape of reliability engineering, emphasising pragmatic approaches to balancing cost and reliability needs, and advocating for a more nuanced understanding of when to build versus buy solutions.
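The SLO discussion boils down to simple error-budget arithmetic; here is a back-of-the-envelope sketch, where the 99.9% target and 30-day window are illustrative rather than taken from the episode.

```python
# How much downtime does a given availability SLO permit per window?
SLO = 0.999          # 99.9% availability target
WINDOW_DAYS = 30

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = (1 - SLO) * window_minutes

print(f"{SLO:.1%} over {WINDOW_DAYS} days allows ~{error_budget_minutes:.0f} minutes of downtime")
# -> 99.9% over 30 days allows ~43 minutes of downtime
```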
This podcast profiles Antithesis, a company developing a "multiverse debugger" for large, distributed systems. It traces the history of debugging tools, highlighting Antithesis's innovative approach using deterministic simulation testing (DST) to allow time travel debugging. The podcast includes a Q&A with Antithesis's co-founder, detailing the challenges of debugging large systems and how Antithesis addresses them. Furthermore, it discusses Antithesis's tech stack, engineering culture, and the trade-offs of using their complex, but potentially game-changing, technology. Finally, it considers the implications of widespread adoption of Antithesis's technology for the future of software development.
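To give a flavour of deterministic simulation testing, here is a toy illustration (in no way Antithesis's implementation): the whole "system" runs single-threaded, and all nondeterminism — message ordering and injected drops — flows from one seeded RNG, so any failing run can be replayed exactly from its seed.

```python
import random

def run_simulation(seed: int) -> bool:
    rng = random.Random(seed)
    inbox, store = ["write:a", "write:b", "ack"], []
    while inbox:
        msg = inbox.pop(rng.randrange(len(inbox)))   # RNG decides delivery order
        if rng.random() < 0.1:                       # RNG injects message drops
            continue
        store.append(msg)
    return "ack" in store                            # invariant under test

# Every failure is tied to a seed, so it can be replayed and debugged step by step.
failing = [s for s in range(1000) if not run_simulation(s)]
print("replayable failing seeds:", failing[:5])
```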
This podcast details the creation of Shopify's interactive Black Friday/Cyber Monday live dashboard, nicknamed "Live Globe". The 2024 version, built by a six-person team in two months, features a spaceship-themed interface showcasing real-time sales data and boasts impressive technical specifications, including peak loads of nearly 30 million database reads per second. The design process involved extensive prototyping and the use of AI-generated imagery for inspiration. The podcast also highlights the technology stack (React Three Fiber, Go, Rails, Kafka, and Flink), the inclusion of numerous Easter eggs, and the challenges of performance optimisation and real-time data streaming. Finally, it explores the project's unique approach to ROI, prioritising fun and innovation.
This podcast examines the contrasting "wartime" and "peacetime" operating modes in tech companies, drawing on the author's experiences at Uber and observations across the industry. It defines these modes in terms of leadership styles, employee behaviours, and organisational priorities, highlighting the differences in approaches to project management, performance reviews, and tech debt. The text explores the transitions between these modes, identifying common triggers and observable signs, and offers advice for employees and managers on thriving in each environment. Finally, it discusses the counterintuitive relationship between extended "wartime" periods and tech debt accumulation.
Jim McCormick's "The First Time Manager" offers a practical guide for new managers, covering essential aspects like communication, delegation, and conflict resolution. The book employs a clear and relatable style, using real-world examples and actionable advice to help readers build foundational leadership skills. While some advice may be general, its comprehensive approach to fundamental management principles makes it a valuable resource for aspiring and new managers seeking a strong start in their careers. The book also touches on crucial aspects of personal development and emotional intelligence in leadership. Even experienced managers might find its refresher on core concepts beneficial.
This research paper details the development and implementation of efficient techniques for processing multiple, similar aggregate queries in data streaming systems. The authors address the challenges of scaling to handle hundreds of concurrent queries, each with potentially different time windows and selection predicates. Their proposed "on-the-fly" methods avoid computationally expensive static query analysis, offering significant performance improvements (up to an order of magnitude) over existing approaches. The techniques are validated through a performance study using real-world stock market data, demonstrating their practical effectiveness. The core contributions are novel algorithms for shared time slices, shared data fragments, and a combined approach called shared data shards.
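As a rough intuition for the shared-time-slices idea (a simplified sketch, not the paper's algorithm): the stream is bucketed into fixed-width slices, each slice keeps one partial aggregate, and every query, whatever its window length, is answered by combining the slices it covers. The slice width and queries below are illustrative.

```python
from collections import defaultdict

SLICE_SECONDS = 60
slice_sums = defaultdict(float)      # slice index -> partial SUM(price)

def ingest(timestamp: float, price: float) -> None:
    slice_sums[int(timestamp // SLICE_SECONDS)] += price

def window_sum(now: float, window_seconds: int) -> float:
    """Answer SUM(price) over the last window_seconds, reusing the shared slices."""
    end = int(now // SLICE_SECONDS)
    start = int((now - window_seconds) // SLICE_SECONDS) + 1
    return sum(slice_sums[i] for i in range(start, end + 1))

# Two queries with different windows share the same per-slice partials,
# so the stream is aggregated once rather than once per query.
ingest(100, 10.0); ingest(130, 5.0); ingest(200, 2.0)
print(window_sum(now=210, window_seconds=120))   # roughly the last 2 minutes
print(window_sum(now=210, window_seconds=600))   # roughly the last 10 minutes
```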
The article details how Vercel's platform handles web requests, from initial user input to final response. Vercel's Edge Network directs requests to optimal data centres, minimising latency. A multi-layered firewall system protects against threats. Advanced routing features, including middleware, manage request flow. Finally, Edge caching and Vercel Functions optimise speed and scalability for dynamic content.
This research paper introduces Monolith, a real-time recommendation system designed by ByteDance. Addressing limitations of existing deep learning frameworks, Monolith uses a novel collisionless embedding table to efficiently handle sparse, dynamic features, significantly improving model quality and memory usage. A key innovation is its online training architecture, enabling real-time model updates based on user feedback. The authors demonstrate Monolith's superior performance through experiments and A/B tests, highlighting the trade-offs between real-time learning and system reliability. Finally, the paper compares Monolith to existing solutions, showcasing its advantages in scalability and efficiency for large-scale recommendation tasks.
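The toy contrast below (not Monolith's code) shows why collisionless tables matter: a fixed-size table folds sparse feature IDs modulo its size, so unrelated features can share one vector, whereas a table keyed directly by the raw ID gives every feature its own embedding. Sizes and IDs are illustrative.

```python
import numpy as np

DIM, BUCKETS = 4, 8
rng = np.random.default_rng(0)

# Collision-prone: IDs are folded into a fixed number of buckets.
fixed_table = rng.normal(size=(BUCKETS, DIM))
def lookup_fixed(feature_id: int) -> np.ndarray:
    return fixed_table[feature_id % BUCKETS]

# "Collisionless": the table grows on demand, one row per distinct raw ID.
dynamic_table: dict[int, np.ndarray] = {}
def lookup_collisionless(feature_id: int) -> np.ndarray:
    return dynamic_table.setdefault(feature_id, rng.normal(size=DIM))

print(np.array_equal(lookup_fixed(3), lookup_fixed(11)))                  # True: 3 and 11 collide
print(np.array_equal(lookup_collisionless(3), lookup_collisionless(11)))  # False: separate rows
```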
This article looks back at the history of the Postgres project, spearheaded by Michael Stonebraker at UC Berkeley from the mid-1980s to the mid-1990s. It details Stonebraker's design philosophy and the project's technical innovations, including support for abstract data types, active databases, and novel storage and recovery mechanisms. The article highlights Postgres's evolution into the open-source PostgreSQL system, its significant commercial impact through various spin-off companies, and the lessons learned from its success. It also discusses the unexpected benefits of open-sourcing the research and the project's lasting influence on database technology. The author reflects on his own involvement and contributions to the project.
This blog post details Yelp's in-place migration of their Yelp Reservations service database from PostgreSQL to MySQL. The migration, necessitated by maintenance and expertise limitations with PostgreSQL, involved significant code refactoring to address unsupported features and ensure data consistency. A gradual rollout strategy, employing multi-DB support and careful synchronisation, was implemented to minimise disruption. The process revealed several unexpected challenges, including issues with auto-incrementing keys and ProxySQL memory usage, highlighting the complexities of such large-scale database migrations. Ultimately, the switch to the company standard MySQL improved performance and maintainability.
Meta's FBDetect system, detailed in this research paper, is a robust, in-production performance regression detection system. It identifies minuscule performance regressions (as small as 0.005%) across millions of servers and hundreds of services by monitoring hundreds of thousands of time series metrics. Key to FBDetect's success are advanced techniques for subroutine-level performance analysis, filtering false positives, deduplicating correlated regressions, and root cause analysis. The paper validates FBDetect's effectiveness through simulations and real-world production data, showcasing its superiority over existing methods and highlighting the significance of its seven years of successful operation.
This paper details the architecture and evolution of Amazon DynamoDB, a fully managed NoSQL database service. Key features highlighted include its scalability, predictable performance, high availability (achieved through multi-region replication and sophisticated failure handling), and strong durability (guaranteed by techniques like write-ahead logging and continuous data verification). The authors discuss challenges faced during DynamoDB's development, such as handling uneven traffic distribution and optimising resource allocation, and explain the solutions implemented, including the shift from provisioned to on-demand capacity. Performance benchmarks are provided to demonstrate the system's consistent low latency even under extreme load.
This blog post discusses the multifaceted definition of a senior software engineer. Technical expertise is crucial, encompassing a T-shaped skill profile and a deep understanding of software development principles. However, soft skills, such as communication, leadership, and a growth mindset, are equally vital for moving projects and teams forward. The author suggests several strategies for professional growth, including pair programming, content creation, and seeking challenging tasks. Ultimately, the article posits that becoming a senior engineer is an ongoing journey of learning and improvement, rather than a fixed destination.
Amazon Web Services (AWS) has launched Amazon S3 Tables, a new storage service optimised for analytical workloads. These tables, stored in a new type of S3 bucket, utilise the Apache Iceberg format for efficient querying with tools like Amazon Athena and Apache Spark. Offering significant performance improvements (up to 3x faster queries and 10x more transactions per second) over self-managed solutions, S3 Tables provide fully managed features including automatic compaction, snapshot management, and unreferenced file removal. The service integrates with other AWS analytics services and supports standard S3 APIs, offering enhanced security and scalability. Currently available in select US regions, S3 Tables are designed to streamline large-scale data analytics.
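Because S3 Tables expose standard Iceberg tables, they can be queried with ordinary SQL through Athena. The sketch below uses boto3's Athena client and assumes the table bucket is already registered with the analytics catalog; the database, table, and result-bucket names are placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) FROM daily_orders GROUP BY order_date",
    QueryExecutionContext={"Database": "sales_namespace"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```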