Tuesday 2025-04-08 Assorted Links

Assorted links for Tuesday, April 8:

Building a ubiquitous shared infrastructure using Twine

Twine is our homegrown cluster management system, which has been running in production for the past decade. A cluster management system allocates workloads to machines and manages the life cycle of machines, containers, and workloads. Kubernetes is a prominent example of an open source cluster management system. Twine has helped convert our infrastructure from a collection of siloed pools of customized machines dedicated to individual workloads to a large-scale ubiquitous shared infrastructure in which any machine can run any workload.
Ensuring data reaches disk

The purpose of this document is to describe the path data takes from the application down to the storage, concentrating on places where data is buffered, and to then provide best practices for ensuring data is committed to stable storage so it is not lost along the way in the case of an adverse event. The main focus is on the C programming language, though the system calls mentioned should translate fairly easily to most other languages.
Revolutionizing Money Movements at Scale with Strong Data Consistency

As one of the underlying engines, Uber Money fulfills some of the most important aspects of people’s engagement in the Uber experience. A system like this should not only be robust, but should also be highly available with zero-tolerance to downtime, after our success mantra: “To collect and disburse on-time, accurately and in-compliance”.

While we expand to multiple lines of businesses, and strategize the next best, the engineers in Uber Money also thrive on building the next generation’s Payments Platform which extends Uber’s growth. In this blog, we introduce you to this platform and provide insights into our learnings. This includes migrating hundreds of millions customers between two asynchronous systems while maintaining data-consistency with a goal of zero impact on our users.
How 30 Lines of Code Blew Up a 27-Ton Generator

A secret experiment in 2007 proved that hackers could devastate power grid equipment beyond repair—with a file no bigger than a GIF.
Avoiding overload in distributed systems by putting the smaller service in control

Within AWS, a common pattern is to split the system into services that are responsible for executing customer requests (the data plane), and services that are responsible for managing and vending customer configuration (the control plane). In this article, I discuss a number of different ways the data plane and the control plane interact with each other to avoid system overload. In many of these architectures the larger data plane fleet calls the smaller control plane fleet, but I also want to share the success we’ve had at Amazon when we put the smaller fleet in control.