Gergely Orosz

AWS S3 is the world's largest cloud storage service, but just how big is it and how is it engineered to be as reliable as it is at such a massive scale? Mai-Lan Tomsen Bukovec is the VP of Technology, Data and Analytics at AWS and has been running S3 for 13 years.

Gergely Orosz

Today we discuss the sheer scale of S3 in the data stored and the number of servers it runs on; how seemingly overnight AWS went from an eventually consistent data store to a strongly consistent one and the massive engineering and complexity behind this move.

Gergely Orosz

What are correlated failure, crash consistency, and failure allowances, and why do engineers on S3 live and breathe these concepts? We'll also cover the importance of formal methods in ensuring correctness at S3's scale, and much more. A lot of these topics are ones that AWS engineering rarely talks about in public. I hope you enjoy these rarely shared details.

Gergely Orosz

If you're interested in how one of the largest systems in the world is built and keeps evolving, this episode is for you. This episode is presented by Statsig, the Unified platform for flags, analytics, experiments, and more. Check out the show notes to learn more about them and our other season sponsors. So, Mai-Lan, welcome to the podcast.

Mai-Lan Tomsen Bukovec

Thanks for having me.

Gergely Orosz

To kick things off, can you tell me the scale of S3 today?

Mai-Lan Tomsen Bukovec

Well, if you want to take a step back and just think about S3, it is a place where you put an incredible amount of data. Right now, S3 holds over 500 trillion objects. We have hundreds of exabytes of data, and we serve hundreds of millions of transactions per second worldwide. And if you want another fun stat, we process over a quadrillion requests every single year.

Mai-Lan Tomsen Bukovec

What's under the hood of all that is also pretty amazing scale. If you think about what's underneath the hood of S3, at the fundamental level it's disks and servers, which sit in racks, and those racks sit in buildings. And if you think about the scale of what is under the hood, we manage tens of millions of hard drives across millions of servers. And that is in 120 availability zones across 38 regions, which is pretty amazing if you think about it.

Gergely Orosz

So deep down it all starts with hard drives sitting inside servers, sitting inside racks, and then you have a bunch of these racks and then rows of them, buildings of them, right? And that's what you said—there's tens of millions of hard drives deep down in the bottom of this.

Mai-Lan Tomsen Bukovec

That's right. In fact, if you think about the scale of this, if you imagine stacking all of our drives one on top of another, it would go all the way to the International Space Station and just about back. It's kind of a fun visual to have for us who work on the service, but kind of fundamentally, it's really hard to get your brain around the scale of S3.

Mai-Lan Tomsen Bukovec

A lot of our customers assume the scale is there; they assume that all of the drives are always there and they just focus on what S3 is to them, which is "it just works." It just works for any type of data and all of your data.

Gergely Orosz

Yeah. Even for me, when you talk about exabytes, I had to look up exabytes because I know of petabytes, which is already massive. If a company has like one or two or three petabytes of data, it's tons. And an exabyte is a thousand petabytes. You told me that you're thinking at that level. It's just hard to fathom.

Mai-Lan Tomsen Bukovec

Yeah, we have individual customers that have exabytes of data in what they call a data lake. Although last week I heard a great term—we had the Sony group CEO talk about what Sony is doing with data, and they refer to it as a "data ocean" and not a data lake. If you have exabytes of data in your data lake, it is in fact a data ocean, and that ocean is kind of fundamentally S3.

Gergely Orosz

Can you tell me how S3 started? I did some research and there was a story about a distinguished engineer sitting in a pub in Seattle—who knows if it's true or not—who was a bit frustrated with engineers at Amazon building the same infrastructure again and again.

Mai-Lan Tomsen Bukovec

Yeah. If you think back, S3 development really started in 2005 and we launched as the first AWS service in 2006. If you think about the technical problems of 2006, a lot of customers were building things like e-commerce websites, right, like Amazon.com. The engineers at Amazon knew that they had a lot of data that at the time was very unstructured—it was PDFs, it was images, it was backups—and they wanted a place where they could store that at an economic price point that let them not think about the growth of storage.

Mai-Lan Tomsen Bukovec

So they built S3, and they really built it for a certain type of storage. The original design of S3 in 2006 was really anchored around eventual consistency. The idea of eventual consistency is that when you put data in storage for S3, we're not going to give you an ACK back on your PUT unless we have your data. So, we have your data, but the eventual consistency part is that if you were to LIST your data, it might not show up because it's being eventually consistent. It's there, but it might not show up on a list.
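As an illustration of the behavior Mai-Lan describes—a toy model, not S3's actual implementation—a PUT is acknowledged once the data is durably stored, but a LIST served by a lagging index replica may not include the new key yet:

```python
# Toy model of eventual consistency: illustrative only, not S3's design.
# A PUT is ACKed once stored durably; LIST is served by a replica that
# catches up asynchronously, so a just-written key may be missing.

class EventuallyConsistentStore:
    def __init__(self):
        self.primary = {}          # durably stored objects (source of truth)
        self.list_replica = set()  # lagging index replica used by LIST

    def put(self, key, value):
        self.primary[key] = value  # data is durable before we ACK
        return "200 OK"            # ACK means "we have your data"

    def get(self, key):
        return self.primary.get(key)

    def list_keys(self):
        return sorted(self.list_replica)  # may lag behind primary

    def replicate(self):
        self.list_replica = set(self.primary)  # replica catches up

store = EventuallyConsistentStore()
store.put("photos/cat.jpg", b"...")
print(store.list_keys())   # [] -- the key is not visible on LIST yet
store.replicate()
print(store.list_keys())   # ['photos/cat.jpg'] -- eventually consistent
```

A human refreshing an e-commerce page is effectively calling `replicate()`-then-`list_keys()` a moment later, which is why this model was fine for those workloads.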

Mai-Lan Tomsen Bukovec

We built that consistency model at the time because we were really optimizing for things like durability and availability. It worked like a champ for e-commerce sites and things like that because when a human was interacting with an e-commerce site and an image happened to not show up exactly at the moment where you put the data into storage, it was okay because a human would just refresh.

Mai-Lan Tomsen Bukovec

When we launched in 2006—here's a fun fact for you—2006 is when Apache Hadoop first began as a community as well. And so we had a set of what I think of as frontier data customers like Netflix and Pinterest who took a look at things like Hadoop and they put it together with the economics and the attributes of S3, which is unlimited storage with pretty good performance at a great price point. They decided to build what we first began to call data lakes at the time. They decided to extend the idea of unstructured storage and include things like tabular data.

Mai-Lan Tomsen Bukovec

The first wave of frontier data customers were adopting data lakes in about 2013 to 2015. Those were the frontier data customers born in the cloud. Around 2015 to 2020, we started to see all the enterprises take that same data pattern of how can I use S3—the home of all the unstructured data on the planet—and extend it to tabular data. That's when, about five years ago in 2020, I started to see exabytes upon exabytes of Parquet files.

Mai-Lan Tomsen Bukovec

I have worked on S3 for a minute—I started working on S3 in 2013. I'd been at AWS since 2010, so kind of a while. The rise of Parquet was really interesting because what people did is they said, "Oh, okay. I like the traits and the attributes of S3 and I want to apply it to a table." And so I am going to run my own Parquet data in S3.

Mai-Lan Tomsen Bukovec

Around 2019-2020, we started to see the rise of Iceberg. Iceberg is incredibly popular, and it gives table attributes to the underlying Parquet data. We started to see customers do it in many of our largest data lakes across different industries. One of the things that we did in 2024 is we introduced S3 Tables.

Gergely Orosz

Just for those who don't know what Iceberg is, it's an open-source table format for massive analytic datasets, right?

Mai-Lan Tomsen Bukovec

That's right. If I ask our customers of these data oceans why they care so much about Iceberg, it's because they want to be able to have what a lot of customers are calling a decentralized analytics architecture. They can have lines of business or different teams within their company pick what type of analytics to use as long as it's Iceberg compliant.

Mai-Lan Tomsen Bukovec

If Iceberg is the common format for tabular data, then you have choice and flexibility for what type of analytics engines you use in a decentralized analytics architecture. I think that's one of the reasons why Iceberg has just taken off—it makes it easy to use data at scale, but it also gives a business owner, the chief data officers or the CTOs of the world, future-proofing for analytics. They can replace their analytics or change it out.

Mai-Lan Tomsen Bukovec

They can adopt new types of analytics and AI because you have this Iceberg at the "bottom turtle" of S3. We launched S3 Tables in December 2024. This year we've had over 15 new features that we've added to S3 Tables. And then this year, of course, we launched the preview of S3 Vectors in July, and last week it became generally available. The story of S3 is a story that our customers have written for data, but it's been super fun to work on all these different evolving attributes.

Gergely Orosz

As an engineer, what is the kind of basic architecture and the basic terminology I should know about when I'm starting to work with S3?

Mai-Lan Tomsen Bukovec

When we first launched in 2006, the whole goal for S3 was to provide a very simple developer experience, and we've really tried to stick with that. In fact, when we're sitting around talking about what do we build next, we always go back to that idea of how do you make things really simple to use S3.

Mai-Lan Tomsen Bukovec

Fundamentally, S3 has a lot of different capabilities now, but it's really about the PUT and the GET. The PUT of the storage in and the GET of the storage out, and where we can do that really well at scale—that is the heart of S3. Now we have a ton of extra capabilities that we've launched over time, but fundamentally, when customers think about using S3, they think about the PUT and the GET.

Gergely Orosz

Yeah. So like PUT data, GET data, and I guess some of the other basic operations—it's a bit like HTTP, right? There's also DELETE, LIST, COPY, a few other primitives.

Mai-Lan Tomsen Bukovec

There are. And if I think about where we have gone over time, we've added capabilities on top of that just based on what developers are trying to do. Okay, let's just take PUT. We recently added a set of conditionals to the PUT capability; last year we did `PutIfAbsent` and `PutIfMatch`. This year we did `CopyIfAbsent`, and we did `DeleteIfMatch`. The core thing for us with conditionals is that we can give developers the capability of doing things like the PUT, but based on the behaviors of their application.
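The names above are informal; on the wire these conditionals map to HTTP preconditions (`If-None-Match: *` for put-if-absent, `If-Match: <etag>` for put-if-match and delete-if-match). Here's a toy in-memory model of those semantics—an illustration of the idea, not the real S3 API:

```python
# Toy model of S3-style conditional writes (illustrative, not the real API).
# "Put if absent"  -> HTTP If-None-Match: *   (fail if the key exists)
# "Put if match"   -> HTTP If-Match: <etag>   (fail unless the etag matches)
import hashlib

class ConditionalStore:
    def __init__(self):
        self.objects = {}  # key -> (body, etag)

    def put(self, key, body, if_none_match=None, if_match=None):
        current = self.objects.get(key)
        if if_none_match == "*" and current is not None:
            return 412  # Precondition Failed: object already exists
        if if_match is not None and (current is None or current[1] != if_match):
            return 412  # Precondition Failed: etag changed under us
        etag = hashlib.md5(body).hexdigest()
        self.objects[key] = (body, etag)
        return 200

    def etag(self, key):
        return self.objects[key][1]

store = ConditionalStore()
assert store.put("k", b"v1", if_none_match="*") == 200  # first writer wins
assert store.put("k", b"v2", if_none_match="*") == 412  # second writer loses
tag = store.etag("k")
assert store.put("k", b"v2", if_match=tag) == 200       # optimistic update
```

This is why conditionals matter for application behavior: two uncoordinated writers can race on the same key and the storage system itself arbitrates, instead of the application needing its own lock service.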

Gergely Orosz

Outside of the GET and PUT, the basic operations, I guess the base terminology that you should just know about is buckets, objects, and keys, right? That's how we think about our data.

Mai-Lan Tomsen Bukovec

Yeah. And now it's not just objects. If you think about the two latest primitives or building blocks we've introduced as native to S3, one of them is the Iceberg table with our S3 Tables, and the other one is Vectors. Under the hood of an S3 Table is a set of Parquet files that we're managing on your behalf. But that's not the case for Vectors. A vector is just a long string of numbers. That is a new data structure for us, and it's sitting in S3 just like your objects.

Gergely Orosz

Mai-Lan was talking about the building blocks of S3 like the PUT, GET, Tables, and Vectors. Speaking of primitives for building applications leads nicely to our season sponsor, WorkOS. WorkOS is a set of primitives to make your application enterprise-ready. Primitives like Single Sign-On, Authentication, Directory Sync, Multi-Factor Authentication, and many others.

Gergely Orosz

One feature does not make an app enterprise-ready; rather, it's the combination of primitives altogether that solves enterprise needs. When your product grows in scale, you can always reach for new building blocks for infrastructure from places like AWS. Similarly, when you need to go upmarket and sell to larger enterprises, WorkOS provides the application-level building blocks that you need for this. WorkOS has seen the edge cases and enterprise complexity and solves this for you so you can focus on your core product.

Gergely Orosz

One example of such a building block is adding authentication to your MCP server. If you had to build it from scratch, it gets pretty complex to set up the OAuth flows behind the scenes. But with WorkOS, it's a few simple steps. Add the AuthKit component to your project, configure it via the UI, then you just direct clients of your MCP server to authorize via AuthKit, verify the response you get via some code, and that's pretty much it. This is the power of well-built primitives. To learn more, head to workos.com.

Gergely Orosz

And with this, let's get back to S3 and how it all started. I'd like to still go back to the beginning of S3. When it was launched, it was pretty shocking for the broader community because S3 launched with a pricing of 15 cents per gigabyte per month, which was about a third to a fifth of the price of anything else. The going rate at the time was something like 50 cents or 75 cents.

Gergely Orosz

On the first day, I read that like 12,000 developers signed up immediately. A lot of companies immediately or very quickly moved over, and then the surprising thing was that S3 kept cutting prices. It was unheard of before. You were there in the 2010s when some large price cuts happened. Can you tell me what was the thinking inside the S3 team on this unusual pricing? It seemed customers would have been willing to pay more, and yet you kept cutting prices, even today. I think today it's something like 2 cents or 2.3 cents for the same storage that cost 15 cents at launch.

Mai-Lan Tomsen Bukovec

Yeah. I think part of this goes back to what the goal is for S3. The mission of S3 is to provide the best storage service on the planet. And if you think about the growth of data, IDC says that data is growing at a rate of 27% year-over-year. But I have to tell you, we have so many customers that are growing so much faster than that.

Gergely Orosz

Yeah, I was about to say it sounds pretty low.

Mai-Lan Tomsen Bukovec

I know! But that's an average across everything. We have a lot of customers that grow twice or three times that rate. If you think about all the data that's being generated from sensors, from applications, from AI...

Gergely Orosz

...from just taking photos every day, right?

Mai-Lan Tomsen Bukovec

Photos, that's right. If you think about your phone, too—think about how the resolution of the cameras on your phone has grown. You just end up with what Sony talked about: the "data ocean." In order to have all that data and to grow it, you have to be able to grow it economically. You have to be able to grow it at a price point where you don't really think, "Okay, what data am I going to delete now because I'm running out of space?"

Mai-Lan Tomsen Bukovec

You don't have that conversation with S3 customers because of two things. One is we do lower the price of either storage or the capabilities of what we're doing. For example, we lowered the cost of compaction for S3 Tables pretty dramatically within a year after launching S3 Tables. It's not just that—it's like the overall total cost of ownership of your storage. We give you the ability to tier and to archive storage.

Mai-Lan Tomsen Bukovec

We give you the ability to do something called Intelligent Tiering, which is: if you don't touch your data for a month, we'll automatically give you up to a 40% discount on that storage, because we're watching it. It's dynamic discounting, so you don't even have to think about it.
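Back-of-envelope arithmetic for what a discount like that means at scale (the prices below are illustrative placeholders, not current AWS list prices—check the S3 pricing page for real numbers):

```python
# Illustrative arithmetic for a tiering-style discount. The prices are
# made-up placeholders, not current AWS list prices.
standard_price = 0.023                       # $/GB-month, frequent-access tier
cold_price = standard_price * (1 - 0.40)     # "up to 40% discount"

gb_stored = 100_000                          # 100 TB that has gone cold
monthly_standard = gb_stored * standard_price
monthly_tiered = gb_stored * cold_price
savings = monthly_standard - monthly_tiered
print(f"standard ${monthly_standard:,.0f}/mo, tiered ${monthly_tiered:,.0f}/mo, "
      f"saving ${savings:,.0f}/mo")
```

The point of the automatic version is that the owner never has to run this math or move the data themselves—the discount kicks in from observed access patterns.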

Mai-Lan Tomsen Bukovec

Our whole goal is that you can grow the data that you need to grow because we know that's being used to pre-train models. We know it's being used to fine-tune and do any type of post-training of AI. We know you're using it for analytics. We know you're using it for all these different things, either now or in the future. So our goal is that you can keep your data and use it in a way that advances whatever the thing is that you're doing—whether it's life sciences or you're an enterprise in manufacturing. Whatever you need, the data should be there, and you should be able to grow it and keep it and use it any way you want.

Gergely Orosz

Yeah, I did want to ask you about this part. So there's Intelligent Tiering, which was launched in 2018—so like 12 years after S3 was launched. One thing that really got my attention was Amazon Glacier, which was launched in 2012. You can store data that you don't need immediate access to; you're okay waiting for some time to get access to it, I think maybe even hours.

Gergely Orosz

When it launched, it was only one cent per gigabyte per month, which was again something back then—the going rate for storage was like 15 cents, so roughly 15 times cheaper. How do you do that? What is the architecture and thinking behind how you're able to have this trade-off where if you don't need your data quickly, we can do it a lot cheaper? How should I imagine the kind of trade-offs that you and the engineering team were making?

Mai-Lan Tomsen Bukovec

Well, you're an engineer yourself, and a lot of engineering is about constraints, right? That is the fun part about working on S3. When you think about constraints—you think about availability, you think about the cost of storage—we start to get really creative.

Mai-Lan Tomsen Bukovec

In S3, because we build all the way down to the metal of the drives and the capabilities that we have in our hardware, we're able to drive efficiencies at every single part of our stack. Our engineers, when they get together and talk about the constraints and the design goals, we'll do something like set a target for the cost of a byte and we'll drive for that at every single part of the process.

Mai-Lan Tomsen Bukovec

That process also includes the data center. How do our data center technicians operate the service of S3 from a hardware and a data center perspective—the physical buildings—just like we do for the software and the layers of S3 itself? When you have that ability to run across the whole stack all the way down to the physical buildings, and you're thinking so deeply about the cost and the lifetime of every byte, you're able to do things like Glacier.

Gergely Orosz

You mentioned something really interesting—that when S3 started it was eventually consistent, which means the data eventually shows up everywhere; a read might not see it yet, it might be behind. You mentioned that the reason the team launched this was because durability and availability were more important—and I assume, of course, cost as well. But during those initial phases while S3 was eventually consistent, what kind of benefits does eventual consistency give? Is it a cost constraint? Is it just easier to build high-availability systems from an engineering perspective?

Mai-Lan Tomsen Bukovec

Well, from an engineering perspective, the main optimization was availability. It was not necessarily durability, but it was availability. If you take a step back and look at the original design of S3, we were really focused very hard on availability.

Mai-Lan Tomsen Bukovec

Let's take a step back. When you talk about consistency, it's the property where the object retrieval (the GET) reflects the most recent PUT to that same object. If you think about what parts of the system of S3 that really hits, a lot of it just starts with our indexing subsystem. The indexing subsystem in S3 holds all of your object metadata—like its name, its tags, its creation time.

Mai-Lan Tomsen Bukovec

Our index is accessed on every single GET or PUT or LIST or HEAD or DELETE—any API call like that. Every single data plane request where you go back into our storage system to go get an object goes through our index. In fact, more requests go through our index than our storage system because, for example, it's serving things like HEAD requests and LIST requests that don't end up going back into our storage system at all. Those are metadata or index requests.

Mai-Lan Tomsen Bukovec

So if you think about our indexing system, we have a storage system in there. And that is a really central concept: a storage system in the middle of our indexing system.

Gergely Orosz

You need a storage system for your index inside the indexing system, right?

Mai-Lan Tomsen Bukovec

That's right. And so we have to configure and size the system to deliver on our design promise for both availability and durability. The data in our index system is stored across a set of replicas and it uses a quorum-based algorithm. A quorum-based algorithm tends to be very forgiving to failures.

Mai-Lan Tomsen Bukovec

If you think about how we implemented quorum in our index system, we start first from servers that are running in these separate availability zones. The reason we do that is that it lets us avoid correlation on a single fault domain. Since the failure of a single disk, a server, a rack, or a zone only affects a subset of data, it never affects all of the data for a single object or even a majority of the data for a single object, which we have sharded across a wide spread of servers.
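The quorum idea she sketches can be shown in miniature. This is a generic quorum sketch under the classic W + R > N overlap rule—an illustration of the technique, not S3's actual implementation:

```python
# Minimal quorum read/write sketch (illustrative, not S3's implementation).
# With N replicas, writing to W and reading from R where W + R > N
# guarantees every read quorum overlaps the latest write quorum, so a
# minority of failed replicas can be tolerated.

N, W, R = 5, 3, 3  # 3 + 3 > 5: read and write quorums must intersect

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

def quorum_write(version, value, reachable):
    acks = 0
    for r in reachable:
        r.version, r.value = version, value
        acks += 1
    return acks >= W             # ACK only once W replicas have the write

def quorum_read(reachable):
    responses = [(r.version, r.value) for r in reachable[:R]]
    return max(responses)        # the highest version number wins

replicas = [Replica() for _ in range(N)]
assert not quorum_write(1, "v1", replicas[:2])  # 2 reachable: no quorum
assert quorum_write(1, "v1", replicas[:3])      # 3 reachable: quorum met
assert quorum_read(replicas[:3]) == (1, "v1")   # read overlaps the write
```

Spreading those N replicas across separate availability zones is what makes the "minority can fail" assumption hold even for rack- or zone-level faults, since no single fault domain can take out a quorum.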

Mai-Lan Tomsen Bukovec

The core of availability for us is this idea that we spread everything. When a read comes in, it's coming into the S3 front end, and we heavily cache objects across our systems.

Gergely Orosz

And a read could route at random, so you could create a situation where you get an inconsistent read.

Mai-Lan Tomsen Bukovec

And so when we have quorum at the index storage layer, we can see reads and writes overlap, but in the cache they don't because we're optimizing for availability.

Gergely Orosz

So just so I understand the first part—the eventual consistency: correct me if I'm wrong that you can just write to all these distributed nodes and you ask one of them, and if it doesn't have it, no problem because it will be eventually consistent. You now have high availability because you don't need to worry about all of them being in the same state?

Mai-Lan Tomsen Bukovec

That's correct.

Gergely Orosz

That's phase one of S3, and it gives you availability. Now you're explaining how you're able to, behind the scenes, turn this into a strongly consistent system. Strong consistency means a read is guaranteed to reflect the latest write across the whole system, which is hard to do because you could have distributed failures, etc.

Mai-Lan Tomsen Bukovec

And this replicated journal—it took us a while to build, I won't lie. We don't talk about this stuff very much because this is kind of the secret sauce of S3. But again, our engineers who were in the room were thinking about how do you deliver on both the strong consistency without compromising availability.

Mai-Lan Tomsen Bukovec

I go back to constraints. In that case, we were not trading off consistency and availability anymore. So the engineers had to come up with a new data structure—we do this in S3; Vectors is a new data structure that we came up with as well. But if you think about what we had to invent for strong consistency at S3 scale without relaxing the constraint of availability, we had to build this replicated journal.

Mai-Lan Tomsen Bukovec

The replicated journal is a distributed data structure where we're chaining nodes together so that when a write is coming into the system, it's flowing through the nodes sequentially. A reader/writer in a strongly consistent system for S3 flows through these storage nodes in the journal sequentially. Every node is forwarding to the next node. When the storage nodes get written to, they learn the sequence number of the value along with the value itself.

Mai-Lan Tomsen Bukovec

Therefore, on a subsequent read (like through our cache), the sequence number can be retrieved and stored. So now you have this strongly consistent and highly available capability in S3, and the heart of that is this replicated journal.
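The journal as described resembles chain replication: writes flow node to node, and each node records the value together with its sequence number. A toy sketch of that shape—my illustration of the pattern, not S3's internals:

```python
# Toy chain-replicated journal (illustrative sketch, not S3's internals).
# Writes enter at the head and flow node-to-node toward the tail; every
# node stores the value together with its sequence number, so any copy
# can later say exactly how fresh it is.

class JournalNode:
    def __init__(self, next_node=None):
        self.log = {}            # sequence number -> value
        self.next_node = next_node

    def write(self, seq, value):
        self.log[seq] = value    # learn the value and its sequence number
        if self.next_node:       # forward sequentially down the chain
            self.next_node.write(seq, value)

tail = JournalNode()
mid = JournalNode(tail)
head = JournalNode(mid)

head.write(1, "object-v1")
head.write(2, "object-v2")
# Treat a write as committed once it has reached the tail:
assert tail.log[2] == "object-v2"
# Every node agrees on which sequence number each value carries:
assert head.log == mid.log == tail.log
```

Because the write visits the nodes in one fixed order, there is a single, total order of sequence numbers—which is exactly what a reader needs to decide whether the copy it is looking at is the latest.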

Gergely Orosz

Okay. But what's the catch? Because there's always something with trade-offs. On one end, you obviously have more complicated business logic. And then I guess the second obvious question is: what about failures? Because in the case of eventual consistency, you don't worry too much about one failure. Clearly, in this case, what if a node in the sequence fails, either at the time of the write or later? How does the system monitor this and recover? Because that's going to be the tricky part, right?

Mai-Lan Tomsen Bukovec

There's another piece to this puzzle that we implemented, which is a cache coherency protocol. The idea is that this is where we built what we think of as a "failure allowance." In this mode, we needed to retain the property that multiple servers can receive requests and some are allowed to fail.

Mai-Lan Tomsen Bukovec

It's this combination of this replicated journal as a new data structure, plus we implemented this new cache coherency protocol that gave us a failure allowance. Those two things working in concert gave us this strong consistency. I will say, too, this does come at some actual cost.
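One way to picture how those sequence numbers could give a cache-coherency check—my own sketch of the idea, not AWS's published protocol—is that a cached entry is served only if its sequence number is at least as fresh as the journal's latest committed write for that key:

```python
# Sketch of sequence-number-based cache validation (my illustration of
# the idea described above, not AWS's actual protocol).

committed = {}   # key -> latest committed sequence number (from the journal)
cache = {}       # key -> (sequence number, value)

def cached_get(key):
    entry = cache.get(key)
    if entry and entry[0] >= committed.get(key, 0):
        return entry[1]          # cache is provably fresh: serve it
    return None                  # stale or missing: fall back to the journal

committed["k"] = 2               # journal has committed write #2
cache["k"] = (1, "old-value")    # cache still holds write #1
assert cached_get("k") is None   # a stale entry is never served
cache["k"] = (2, "new-value")
assert cached_get("k") == "new-value"
```

The "failure allowance" she mentions falls out of the same shape: many caching servers can hold copies and some can fail or lag, because a lagging copy simply fails the freshness check and the read falls back, rather than returning a stale answer.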

Gergely Orosz

I was about to say—nothing is free in engineering, right?

Mai-Lan Tomsen Bukovec

There's hardware cost in this because you can imagine we've done some more engineering behind the scenes. But I remember sitting in the room with our engineers on S3 and we debated it. We said, "There's actual costs to the underlying hardware for this, do we pass it along to customers or not?" And we made that explicit decision not to.

Gergely Orosz

Really?

Mai-Lan Tomsen Bukovec

Yeah. We said that when we launch this, we should launch strong consistency, we should make it free of charge to customers, and it should just work for any request that comes into S3. We shouldn't say it's only available on this bucket type or what have you; this should be true for every request made to S3. Part of that mindset for S3 is: how can we provide these types of capabilities, and how can we make it something that becomes a building block—part of the "turtle" of S3—so you shouldn't have to think about the cost of it.

Gergely Orosz

This was the very surprising thing about this launch, by the way—that suddenly AWS said, "Okay, everything is strongly consistent, and it does not cost you more." Latency-wise, it shouldn't have changed significantly either. I mean, I'm sure when you roll out initially, you do your measurements, etc., but that was the promise, and that was why I couldn't really believe it when I re-read the history. It typically doesn't happen; typically strong consistency adds latency or increases cost. It's very unusual.

Mai-Lan Tomsen Bukovec

If I think about that, one of the things that was also very important for us—and we haven't really talked about this as much, but we think about it a lot on the S3 team—is correctness. It's one thing to say that you're strongly consistent on every request; it's another thing to know it.

Mai-Lan Tomsen Bukovec

When we built this strong consistency, I talked about our new caching protocol and I talked about this replicated journal as a new data structure. That took a little bit of time to do and to get right. But at S3 scale, we could not say that we were strongly consistent unless we *knew* we were strongly consistent.

Mai-Lan Tomsen Bukovec

What does that mean? How do you do that at S3 scale when everybody is using it for every last workload? In fact, one of the reasons why people use it is because our scale is such that we're decorrelating workloads and you can run absolutely anything on S3. But how do you know?

Gergely Orosz

Mai-Lan just talked about how strong consistency made it so much easier to trust S3. Trust is something that is just as important when writing code, especially now that with AI we write more code than before. This is a good time to talk about our season sponsor, Sonar.

Gergely Orosz

What is the impact that AI is having on developers? Let's look at some data. A new report from Sonar, "The State of Developer Survey Report," found that 82% of developers believe they can code faster with AI. But here's what's interesting: in this same survey, 96% of developers said they do not highly trust the accuracy of AI code. This checks out for me as well—while I write code faster with AI agents, I don't exactly trust the code it produces.

Gergely Orosz

This really becomes a problem at the code review stage where all this AI-generated code must be rigorously verified for security, reliability, and maintainability. SonarQube is precisely built to solve this code verification issue. Sonar has been a leader in the automated code analysis business for over 17 years, analyzing 750 billion lines of code daily. That's over 8 million lines of code per second.

Gergely Orosz

I first came across Sonar 13 years ago, in 2013, when I was working at Microsoft, and a bunch of teams already used SonarQube to improve the quality of their code. I've been a fan since. Sonar provides an essential and independent verification layer—it's an automated guardrail that analyzes all code, whether it's developer or AI-generated, ensuring it meets your quality and security standards before it ever reaches production. To get started for free, head to sonarsource.com/pragmatic.

Gergely Orosz

And with this, let's get back to the importance of strong consistency at AWS. How do you know that you're strongly consistent?

Mai-Lan Tomsen Bukovec

And that is why we used automated reasoning.

Gergely Orosz

What is automated reasoning for those of us who are not as familiar with this—which will be most people outside of very few domains like S3?

Mai-Lan Tomsen Bukovec

Yeah, S3 uses automated reasoning all over the place. Automated reasoning is a specialized form of computer science. If you kind of think about if computer science and math got married and had kids, it would be automated reasoning.

Gergely Orosz

Is it formal methods or based on formal methods?

Mai-Lan Tomsen Bukovec

That's exactly right.

Gergely Orosz

Oh, yeah. I studied computer science, so that's fun. So it's proper formal methods that you're using.

Mai-Lan Tomsen Bukovec

That is right. And we use formal methods in many different places in S3. One of the first places that we adopted it was for us to feel good that we had delivered strong consistency across every request. So what we did is we proved it, right? We built a proof for it, and then we incorporated our proof on check-ins into this index area that I talked about—where you have your caching and then you have your storage sub-layers of the index capabilities.

Mai-Lan Tomsen Bukovec

When anybody is working on our index subsystem now and they're checking in code into the code paths that are being used for consistency, we are proving through formal methods that we haven't regressed our consistency model.
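AWS's actual proofs use real formal-methods tooling; purely to give the flavor of the idea, here is a tiny property check that exhaustively explores every ordering of a miniature model—the kind of state-space exploration a model checker does rigorously at far larger scale:

```python
# Loose illustration of checking a consistency property by exhausting a
# tiny model's state space. Real formal-methods tools (e.g. TLA+/P-style
# model checkers) do this rigorously; this is just the flavor, and is in
# no way AWS's actual proof.
from itertools import permutations

def check_read_your_writes():
    # Model: two writes to one key, then a read. In a strongly consistent
    # model, under every possible serialization of the writes, the read
    # must return whichever write was applied last.
    writes = [("w1", "a"), ("w2", "b")]
    for order in permutations(writes):     # explore every interleaving
        store = {}
        for _, value in order:
            store["key"] = value
        last_value = order[-1][1]
        assert store["key"] == last_value, f"violated under {order}"
    return True

assert check_read_your_writes()
```

The useful property of doing this on every check-in, as she describes, is that a regression in a consistency-critical code path fails the proof immediately instead of surfacing as a rare production anomaly.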

Gergely Orosz

Can you just give us a rough idea—because the formal methods that I have studied were pretty abstract, things like designing languages and how to have the different operators and math involved. But what are the primitives—like servers, networks, etc., and models being built, data flows... how can I imagine a simple proof of something inside S3 roughly, at a really high level?

Mai-Lan Tomsen Bukovec

If you go back to the fundamental notion of a proof, you are proving something to be correct. The places that we use these proofs... we use them in consistency, where we built a proof across all the different combinatorics to make sure that the consistency model is correct.

Mai-Lan Tomsen Bukovec

We use it in cross-region replication to prove that a replication of data from one region to another arrived, and we use it in different places within S3 to prove the correctness of APIs. In all of these cases, we talk about durability, availability, and cost, but just as strong of a design principle for us across S3 is correctness. It's the correctness of a thing, an API request, an operation.

Mai-Lan Tomsen Bukovec

The key thing for us is that you don't want to just prove it once. You want to prove it on every single check-in and you want to prove it on every single request so you can verify—validate and verify—that you are doing, in fact, what you say you do. I think for us at a certain scale, math has to save you, right? Because at a certain scale you can't do all the combinatorics of every single edge case, but math can save you and help you on this at S3 scale. So we use formal methods in many different places of S3. We have some research papers, too; I can send you some links to some research papers where we talk about this.

Gergely Orosz

Yeah, please do, and we will put it in the show notes below so anyone can check it out, because I think it's really interesting. I feel formal methods are not really a thing in a lot of startups and even infrastructure startups yet, but it sounds very reassuring to me to have an ongoing proof of that.

Gergely Orosz

Speaking of which, I want to ask about one thing that is related to this: durability. Amazon S3 has very high durability promises—I think it's 11 nines, which I had to double-check because in backend systems, whenever you say three nines, or four nines of availability... we're not talking availability, but durability. Four nines of availability is already hard to achieve and beyond that it just gets very expensive, and I have never heard of 11 nines of durability.

Gergely Orosz

One question that I got when I shared this stat publicly—people were asking, and I was also thinking: how can you prove that? Not just in a formal way, but you're now storing, as you said, 500 trillion objects, which is large enough that just by this durability promise, you might be losing some of them. Do you validate it on the actual data as well, outside of the proof? Because I assume in the proof you will have assumptions on hardware failure rate which might or might not be true.

Gergely Orosz

So my question is that at Amazon S3 level, when you are able to look at "are we living up to our durability promise," how do you go about that and what are your findings?
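
Some back-of-envelope arithmetic makes the question concrete (illustrative only; this is not AWS's actual durability model, just the naive reading of the design target against the object count quoted in this episode):

```python
# What does a design target of 99.999999999% (11 nines) annual durability
# imply at S3's stated object count, read naively?
durability = 0.99999999999          # 11 nines, per object, per year
annual_loss_rate = 1 - durability   # ~1e-11
objects = 500e12                    # ~500 trillion objects (from the episode)
expected_losses_per_year = objects * annual_loss_rate
print(round(expected_losses_per_year))   # ≈ 5000 objects/year in expectation
```

Which is exactly why "can you verify it on the real data, not just the proof?" is the interesting question here.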

Mai-Lan Tomsen Bukovec

Yeah. So we just spent a lot of time talking about our index subsystem because that is the subsystem that is related to consistency. But when you think about durability, you think about it all at different levels of the S3 stack, but we really think about it in the storage layer.

Mai-Lan Tomsen Bukovec

In the storage layer, you have this promise of the design, and underneath that is a combination of things. It's software, but it's also the physical layout of where our data is across everything that we have in S3. One of the things that I talked about is that we have disks and servers which sit in racks which sit in buildings, and we have tens of millions of these hard drives. We have millions of servers and we have 120 availability zones across 38 regions.

Gergely Orosz

And two availability zones are two physically separate locations?

Mai-Lan Tomsen Bukovec

Physically separate, and sometimes they're a ways away from each other. In some of our regions we have more than three availability zones, which gives us a different fault domain. If I were to think about durability, I think the most important thing for us is our auditors.

Mai-Lan Tomsen Bukovec

If you think about a distributed system, we talked about the PUT and the GET. We have many, many microservices that are all doing one or two things very well in the background. We have many different varieties of health checks, but we also have repair systems and we have auditor systems. Our auditor systems go and they inspect every single byte across our whole fleet.

Mai-Lan Tomsen Bukovec

If there are signs that repair is needed, another repair system will come into place. In the world of distributed systems, these are all microservices working together, loosely coupled, but communicating through well-known interfaces. That collection of systems—which are over 200 microservices now—all sit behind one S3 regional endpoint. A fair number of those subsystems, those microservices, are all dedicated to the notion of durability.
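
A minimal sketch of that auditor/repair split (all names and structures invented here; S3's real services obviously don't look like this): one service scans stored shards and verifies checksums, a separate one re-replicates whatever the auditor flags.

```python
import zlib

def audit(shards):
    """Yield ids of shards whose stored checksum no longer matches the data."""
    for shard_id, (data, stored_crc) in shards.items():
        if zlib.crc32(data) != stored_crc:
            yield shard_id

def repair(shards, healthy_copy):
    """Replace corrupted shards from a known-good replica."""
    for shard_id in list(audit(shards)):
        shards[shard_id] = (healthy_copy[shard_id],
                            zlib.crc32(healthy_copy[shard_id]))

# usage: shard 2 has silently bit-rotted; the auditor finds it, repair fixes it
good = b"payload"
shards = {1: (good, zlib.crc32(good)), 2: (b"bit-rotted", zlib.crc32(good))}
assert list(audit(shards)) == [2]
repair(shards, {2: good})
assert list(audit(shards)) == []
```

The design point is that auditing and repairing are continuous background processes, entirely separate from the GET/PUT path the customer sees.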

Gergely Orosz

So they will go and check and log and report back. So do I understand correctly that in any given time frame at S3, some systems can answer the question of "What is our durability for the past week, month, year," and so on?

Mai-Lan Tomsen Bukovec

Yes.

Gergely Orosz

Okay, great. So you can verify your durability promise and check if the math is mathing.

Mai-Lan Tomsen Bukovec

Yes. And part of our design is that at any given moment in this conversation that you and I have had just today, we're having servers fail—because servers fail. What we have built in S3 is an assumption that servers fail.

Mai-Lan Tomsen Bukovec

Our systems are always checking to see where any failure might hit an individual node, how it affects a certain byte, and what repair needs to automatically kick in place. This system is constantly moving behind the scenes—and that is a completely separate thing from the GET and the PUT. The GET and the PUT is what the customer sees. There's this whole universe under the hood of how do we manage the business of bytes at scale.

Gergely Orosz

I'm just thinking... because for a lot of us engineers who are building moderately sized systems compared to S3—they can already be big—but a failure is a big deal. For example, a machine going down... I have a small side project and my storage filled up and it started to give errors; this is a big deal because it rarely happens to me. This is the first time it happened in 3 years.

Mai-Lan Tomsen Bukovec

Yeah.

Gergely Orosz

But I understand in your business or when you work at S3 scale, this is just every day. And the question is not "if," it's just "how often" and "how do you deal with it." I guess it's a different world.

Mai-Lan Tomsen Bukovec

It is a different world. And the trick is to really think about correlated failure. If you're thinking about availability at any scale, it's the correlated failure that'll get you.

Gergely Orosz

And what is a correlated failure?

Mai-Lan Tomsen Bukovec

That's super interesting. If you think about what I talked about with eventual consistency, we talked about quorum. Quorum is okay for one node to fail, but if all of the nodes go south—for example, if they're in the same availability zone or on the same rack—then you're really going to be messing with your availability of the underlying storage. You've just lost your failure allowance that I talked about with the cache because they all fail together.
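
Toy numbers make the point (all invented; real failure rates and replica counts differ): with independent failures, losing all three replicas requires three coincidences, but if the replicas share a rack, one rack failure takes them all.

```python
# Per-fault-domain annual failure probability (invented): 1 in 1,000.
p_node = 1e-3   # one server failing
p_rack = 1e-3   # one whole rack failing

# 3 replicas, data lost only if every copy is gone:
p_loss_spread = p_node ** 3       # independent failures: ~1e-9
p_loss_same_rack = p_rack         # correlated: the rack IS the failure: 1e-3

print(round(p_loss_same_rack / p_loss_spread))  # ~1,000,000x worse
```

That million-fold gap is why replica placement across fault domains matters as much as the replica count itself.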

Mai-Lan Tomsen Bukovec

A correlated failure is an incredibly important thing to think about when you're thinking about availability. When we're designing around correlated failures, the things we have to think about are: how are those workloads exposed to different levels of failure?

Mai-Lan Tomsen Bukovec

When you upload an object to S3 with a PUT, we replicate that object. We don't just store one copy of it; we store it many times. That replication is important for durability, but what's interesting is it's also important for availability because if any of those correlated failure domains fail—like if a whole AZ fails—there's still a copy somewhere else and the data is still available somewhere even though an availability zone has failed or a rack has failed or a server has failed.

Mai-Lan Tomsen Bukovec

That idea of how you manage and design around correlated failures with our physical infrastructure is super important for S3, for both availability and durability. We also think about something called crash consistency. I mean, Gergely, you can tell I can go on and on about this, so you just have to stop me.

Gergely Orosz

No, but this is the interesting stuff.

Mai-Lan Tomsen Bukovec

All right. So the whole idea of crash consistency is that any system you build should always return to a consistent state after a fail-stop failure. If you can do things like reason about the set of states that a system can reach in the presence of failure—and you just always assume the presence of failure—then you also assume the presence of consistency and availability.
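
One classic way to get that property—sketched here as a write-ahead log, which is a common general technique and not a claim about S3's internals—is to make every update durable in a log before acknowledging it, so recovery after a fail-stop crash just replays the log and always lands in a consistent state.

```python
import json, os, tempfile

class CrashConsistentStore:
    """Tiny key-value store whose state is always recoverable from its log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._recover()

    def _recover(self):
        # After any crash, replaying the log reconstructs a consistent state.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.state[rec["key"]] = rec["value"]

    def put(self, key, value):
        # Append to the log and force it to disk BEFORE acknowledging.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value

# simulate a fail-stop crash: write, throw away memory, recover from the log
path = os.path.join(tempfile.mkdtemp(), "wal.jsonl")
store = CrashConsistentStore(path)
store.put("k", "v")
recovered = CrashConsistentStore(path)   # "crash" and restart
assert recovered.state == {"k": "v"}
```

The reasoning discipline is the same either way: enumerate the states a crash can leave behind, and make sure every one of them recovers to something consistent.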

Mai-Lan Tomsen Bukovec

Then you just design all of these different microservices to all work together in an underlying capability like S3. That's what our engineers do. They think about crash consistency. They think about correlated failures, they think about failure allowances and caches. It's all that deep distributed system work that our engineers come in every day to work on.

Gergely Orosz

Can we talk about how you think about failure allowances? Because again, there is a concept of error budgets in other companies as well. I feel it's a bit loosely handled, whereas I feel this is kind of your bread and butter. So what is a failure allowance, how do you measure it, and what do you do if you overstep it or overspend it?

Mai-Lan Tomsen Bukovec

I think the idea of a failure allowance is you *want* to have it—you have to have it. If you assume you'll never have a failure, you'll have a very bad day for your customer. We account for failure allowances. But the most important thing—let's just talk about the failure allowance in our cache. How do we manage that?

Mai-Lan Tomsen Bukovec

Well, we manage it in such a way that you'll never experience it because we size it. If you're sizing the cache and you're making sure that the underlying capabilities and the hardware are always there—and we have, like I talked about, those distributed subsystems, those microservices that are all interoperating under the hood—we have a ton of them that do nothing but just track metrics. The sizing of our cache is all related to the metrics and the size of our underlying system.
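
In sizing terms, "accounting for the failure allowance" reduces to something like this (numbers and function invented for illustration): provision enough nodes that losing your allowed number of them still leaves full serving capacity, so customers never observe the failures.

```python
import math

def nodes_needed(peak_rps, rps_per_node, failure_allowance):
    """Capacity to serve peak load, plus headroom for concurrent node failures."""
    base = math.ceil(peak_rps / rps_per_node)
    return base + failure_allowance

# e.g. 1M requests/sec at 50K/node needs 20 nodes; allow 2 to be down at once
assert nodes_needed(peak_rps=1_000_000, rps_per_node=50_000,
                    failure_allowance=2) == 22
```

The real work, of course, is in the fleet of metric-tracking services that keep those inputs honest as load shifts.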

Mai-Lan Tomsen Bukovec

One of the really big benefits of running on S3 is that because our system is so huge, you have these massive layers. The massive layers are all managing things like correlated failures and failure allowances. Because they are so huge at the scale of S3, any application that's sitting on top of S3 gets the benefit of it.

Gergely Orosz

Let's take a break for a minute from S3 to talk about a one-of-a-kind event I'm organizing for the first time: The Pragmatic Summit, in partnership with Statsig. Have you ever wanted to meet standout guests from The Pragmatic Engineer podcast, plus folks from leading tech companies, and learn about what works and what doesn't in building software in this new age of AI?

Gergely Orosz

Come join me February 11th in San Francisco for a very special one-day event. The Pragmatic Summit features industry legends and past podcast guests like Laura Tacho, Kent Beck, Simon Willison, Chip Huyen, Martin Fowler, and many others. We'll also have insider stories on how engineering teams like Cursor, Linear, OpenAI, Ramp, and others built cutting-edge products. And we'll have roundtables where everyone can meet and chat with each other.

Gergely Orosz

Something I'm hoping will make this event extra special: seats are limited and you can apply to attend at pragmaticsummit.com. Talks will be recorded and shared, and paid subscribers will get early access afterwards as a thank you for your additional support. I hope to meet many of you there and I am so excited about this event.

Gergely Orosz

And now let's jump back to S3 and the massive scale of the service, to get a sense of what the reality is like working as an engineer or an engineering leader inside an organization like this. I read a quote from a distinguished engineer, Andy Warfield, who said—I'm just quoting what he said: "Early in my career, I had this sort of naive view that what it meant to build large-scale commercial software was just code. The thing I realized very quickly working on S3 was that the code was inseparable from the organizational memory and the operational practices and the scale of the system."

Gergely Orosz

Since you've now been more than a decade in S3, how do you think of this "beast"—this really complex system, hundreds of microservices, data that is hard to fathom unless you think of hard drives stacking all the way to the space station? How do your engineers wrangle this? Because it does feel a bit intimidating, I'm not going to lie.

Mai-Lan Tomsen Bukovec

Well, I think so much of this just comes back to the culture and the commitment on the team. I've worked on S3 for a very long time now and I have such deep respect for the engineering community on S3. Honestly, this is true for all of the services in our data and analytics stack, but we have engineers in S3 who come in every single day with this deep commitment to the durability, availability, and consistency of your bytes.

Mai-Lan Tomsen Bukovec

The type of conversations that we have are so interesting because we have people who are early out of school, and we have engineers who have been working on S3 for 15 years, and everything in between. The creativity and the invention of S3... you have this tension where on one side you have to be very conservative with S3, and on the other hand, we have this principal engineering tenet called "Respect what came before."

Mai-Lan Tomsen Bukovec

That's an Amazon engineering tenet: if it has worked for many years, you have to respect that. But then there's also this tenet—and these two tenets are a little bit in tension with each other, which is kind of what makes it so fun—called "Be technically fearless." I believe that the S3 engineers are just amazing at this: respecting what came before, because if we build new capabilities in S3 we have to maintain the properties and traits of S3, which is "it just works."

Mai-Lan Tomsen Bukovec

But at the same time, we have to be technically fearless. Our ability to go into the world of conditionals, our ability to go into the world of native support for Iceberg or for Vectors, means that we are extending this foundation of storage in a way that helps customers build whatever application they need now and in the future. That combination of those two things is what I think about when I think about our S3 engineering team—they come in every day and they embody that.

Gergely Orosz

Now going back to the evolution of S3 from unstructured to structured data. You were mentioning how Hadoop and the data warehouse were big use cases where customers started to use it on top of S3. Then at S3, you noticed what some of your biggest customers were doing and you kind of built it yourself with more structured data, and then S3 Tables came along, and then Vectors.

Gergely Orosz

Would you mind sharing a little bit more on how you evolve S3? This was another question that came up when I asked people what they'd like to know about S3: "Is it done? Is it finished, or is it still evolving?" Because there is this notion that S3 can store anything already—any object, any blob. What new thing is there? And yet we have a lot of new things.

Mai-Lan Tomsen Bukovec

Yeah. And if you kind of go back in time a little bit and you think about the rise of Parquet... the rise of Parquet data in S3 started about 2020, and we started to see more and more people store their tabular data in S3. If you think about what Iceberg provided, it provided a replacement for Hive. Hive was giving your file system access into S3 unstructured storage; Iceberg is giving that tabular access, including the compaction and all the table maintenance that goes along with it, into your Parquet data.

Mai-Lan Tomsen Bukovec

I think that, in the future, the world's tabular data is going to live in S3. If you just think about the launch that, for example, Supabase did last week—Supabase announced that their Postgres database is now going to do secondary writes directly into an S3 Table, just like their Postgres extension for vectors is going to integrate directly with S3 Vectors. If the world of data as a source goes directly into an S3 Table, what does that mean for the world's data? SQL, as we know, is the lingua franca of data, and the world's LLMs have all been trained on decades of SQL and Python.

Mai-Lan Tomsen Bukovec

We have many AWS customers who know the S3 API pretty darn well by this point—it's a pretty simple API—but now you have the ability to interact with data in S3 through SQL. What that means is that you don't have to be somebody who's building cloud applications or know S3; you just need to know SQL.

Gergely Orosz

And this is with S3 Tables, right?

Mai-Lan Tomsen Bukovec

With S3 Tables. You can just write SQL into an S3 Table, whether you're an AI agent or a human. You're introducing the lingua franca of data as a native property of S3 with S3 Tables, and I think you're just going to see that take off in the upcoming years.

Gergely Orosz

And your latest launch is S3 Vectors. Can you share a little bit what it takes to build a new data primitive like Vectors—behind the scenes, how long it takes, how the team comes together, and what are some engineering challenges of launching something like this? Again, we're talking about vectors, so you use embeddings—whenever you have LLMs, you create an embedding, it's a vector, you want to store that somewhere, you want to do search on it. There's specialized vector databases, specialized vector additions, etc. I'm assuming this is the functionality that S3 Vectors supports very nicely.

Mai-Lan Tomsen Bukovec

Yeah. Today, a lot of customers use vector databases, just like back in the day a lot of people put their tabular data in databases. They just used the structure of the database in order to take advantage of being able to query their data, but they didn't really need to use a database; they just put it in a database. Then S3 came along and we introduced this way—with the help of open formats like Apache Parquet—of being able to store that structured data in S3.

Mai-Lan Tomsen Bukovec

That's kind of what we're doing with vectors right now. If you think about vectors, vectors are a bespoke data type. A vector, at the end of the day, is a very long list of numbers. Vectors have been around for a long time, but they really took off in people's data worlds in the last couple of years with the rise of embedding models.

Mai-Lan Tomsen Bukovec

If you take a step back and you think about one of the great ironies of data, it is that you have to know your data to know your data—you have to know what your schema is, what the data types are, where it is. As these data lakes become data oceans, it gets harder and harder to know what's in your data. The beautiful thing about embeddings is that embedding models will understand your data so that *you* don't have to understand your data. The format in which these embedding models put this semantic understanding of your data is, in fact, a vector.

Mai-Lan Tomsen Bukovec

When we talk to customers, they're so excited about how these embedding models are getting better and better; they want to apply more and more semantic understanding to their underlying data, whether it's unstructured or structured. So they want to store billions of vectors.

Gergely Orosz

Just to say—when you say they want to understand, correct me if I'm wrong, but hypothetically you have a bunch of text data or maybe some image data, and a lot of customers and teams would like to write queries to say, "Hey, can you find an image that looks like a puppy?" or "Can you find an article that contains this or that?" Embeddings are great for that, but then you need to create the embedding and build the system.

Mai-Lan Tomsen Bukovec

Yeah. If you think about what vectors can do, if you think about all the data that a given company has... your knowledge across your business or your life isn't organized into rows and columns like a database. It's in PDFs, it's on your phone, it's in audio customer care recordings which capture the sentiment of how a customer feels. It's on whiteboards filled up with ideas, and it's in documents across dozens of systems.

Mai-Lan Tomsen Bukovec

It's not that you don't have data—you have tons of data—but understanding what data you have across all of those different formats is a real problem. It's one that AI models can help you with. The capabilities of those AI models have gotten so much better in the last 18 to 24 months, but we needed a place to put billions of vectors, billions of semantic understandings of relationships, and that's what we built S3 Vectors for.

Mai-Lan Tomsen Bukovec

The state-of-the-art embedding models combined with the ability to have vectors across S3 is a really important part. It's not a database; it has the cost structure and scale of S3, but for vector storage.

Gergely Orosz

And do I understand correctly—did you need to build new primitives to store this, going down to the metal and figuring out exactly how we do this, or did you build it on top of your existing primitives like blob storage?

Mai-Lan Tomsen Bukovec

It's a new primitive. We talked about S3 Tables, which is building on objects because those individual Parquet files, at the end of the day, are objects. Vectors are totally different. With Vectors, we built a new data structure and a new data type.

Mai-Lan Tomsen Bukovec

It turns out that when you're working with vectors, finding the nearest neighbor—searching for the closest vector in a very high-dimensional vector space—is really hard. In a database, you often have to essentially compare every vector, and that's super expensive. What makes S3 different is that we aren't storing all of our vectors in memory; we're storing them on our fleet of S3, but we still need to provide super low latency.

Mai-Lan Tomsen Bukovec

In our launch last week, we were getting about 100 milliseconds or less for a warm query to our vector space, which is pretty fast. It's not database-fast, but it's pretty fast. The way that we do that is we pre-compute a bunch of what we think of as "vector neighborhoods." It's a cluster of a bunch of vectors that are clustered together by similarity—like a type of dog, as an example.

Mai-Lan Tomsen Bukovec

These vector neighborhoods are computed ahead of time asynchronously so that when you're doing your query, it's not going to impact performance. Every time a new vector is inserted into S3, the vector gets added to one or more of these vector neighborhoods based on where it's located. So when you are executing a query on S3 Vectors, there's a much smaller search that's done to find the nearest neighborhoods. Just the vectors and the vector neighborhoods are loaded from S3 into fast memory, where we apply the nearest neighbor algorithm. It can result in really good sub-100 millisecond query times.
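
The general shape of this precompute-then-probe idea can be sketched as a miniature inverted-file (IVF-style) index—a standard technique the description maps onto, not S3 Vectors' actual (unpublished) implementation:

```python
import math

def dist(a, b):
    return math.dist(a, b)

def assign(vectors, centroids):
    """Precompute neighborhoods: bucket each vector under its nearest centroid.
    This is the asynchronous, ahead-of-time step."""
    neighborhoods = {i: [] for i in range(len(centroids))}
    for v in vectors:
        i = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        neighborhoods[i].append(v)
    return neighborhoods

def query(q, centroids, neighborhoods, n_probe=1):
    """At query time, load and scan only the n_probe closest neighborhoods
    instead of comparing against every stored vector."""
    probed = sorted(range(len(centroids)),
                    key=lambda i: dist(q, centroids[i]))[:n_probe]
    candidates = [v for i in probed for v in neighborhoods[i]]
    return min(candidates, key=lambda v: dist(q, v))

# usage: two neighborhoods; the query only searches the nearby one
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0.1, 0.2), (9.8, 10.1), (0.3, -0.1)]
hoods = assign(vectors, centroids)
assert query((10.0, 9.9), centroids, hoods) == (9.8, 10.1)
```

Shrinking the candidate set this way is what makes sub-100ms queries possible even when the full vector set lives on storage rather than in memory.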

Mai-Lan Tomsen Bukovec

If you think about the scale—S3 will give you up to two billion vectors per index. You think about the scale of an S3 vector bucket, which is up to 20 trillion vectors. Combined with 100 milliseconds or less for warm query performance, that just opens up what you can do with creating a semantic understanding of your data and how you can query it.

Gergely Orosz

It sounds very interesting and also challenging because you have to build this for scale from day one. I guess that's one of the benefits and curses of working at S3: everything that you launch, you need to prepare for what would be extreme data elsewhere, but here it's just Monday.

Mai-Lan Tomsen Bukovec

We have S3 service tenets as well. One of the tenets—and one phrase that I use all the time and our engineers do, too—is "Scale is to your advantage." If you're an engineer and you think about that, it just changes how you design. It means that you can't build something where the bigger you get, the worse your performance gets. It has to be constructed so that the bigger you get, the better your performance gets. The bigger S3 gets, the more decorrelated the workloads are that run in S3. That is a great example of "scale is to your advantage."

Mai-Lan Tomsen Bukovec

When we built Vectors, just like everything in S3, we asked ourselves how we can build this such that scale is to our advantage—how we can build this such that 100 milliseconds or less is just the start of the performance we're going after, and how we can make sure that the more vectors we have in storage, the better the traits of S3 for vectors.

Gergely Orosz

I have a different question about the limitations of S3. I read that the largest object you can store in S3 is 50 terabytes. Why is there a limit on the largest object? I think we can imagine this will be through multiple hard drives and so on, but why did you decide to have a limit? I'm just interested in the thought process of how the team comes up with, "Okay, this will be the limit and this is why."

Mai-Lan Tomsen Bukovec

Well, that limit of 50 terabytes is 10 times greater than what we launched with. We launched with five terabytes, and now we're 50 terabytes. Sometimes we sit and tell customers that and they go, "What am I going to store that's going to be 50 terabytes?" and we're like, "High-resolution video."

Mai-Lan Tomsen Bukovec

Generally speaking, we do try to optimize for certain patterns. When you raise the size of an object by 10 times, we're optimizing for the performance and scale of the underlying systems—like we increased the scale of our batch operations by 10 times last week, too. The idea behind that is that the underlying systems were just optimizing for distributions of work that are the new norm for how people are doing things.

Mai-Lan Tomsen Bukovec

We'll just keep on looking at what customers are doing across a distribution of workloads and seeing if there's something that needs to be changed. We did have a lot of conversations with customers and they're like, "Really? I don't have that many individual objects that are that big," but with the increase of high-resolution cameras, we are seeing larger size objects and we just wanted them to be able to grow unfettered in S3.

Gergely Orosz

So how does S3 evolve and how has the roadmap changed? Because so far, everything that you've told me is saying, "Well, our customers were doing this or that." You live and breathe data here, so you see the patterns and the stats. Is it only you talking with customers and seeing what they're struggling with and then deciding to improve that—whether it's limits or new data types—or is there also some kind of vision or roadmap of what you'll do?

Mai-Lan Tomsen Bukovec

It's a great question. In fact, one of the things that we talk about all the time is the coherency of S3. There are certain things that people always expect from S3: it's the traits of S3, the durability and availability attributes we talked about. A fair amount of engineering goes on under the hood for that.

Mai-Lan Tomsen Bukovec

I think back to 2020—we've launched over a thousand new capabilities since then in S3. Some of them are what we think of as the 90% of the roadmap, which is what people ask for explicitly. For example, some of our media customers want the bigger object size, so we delivered that.

Mai-Lan Tomsen Bukovec

But then we have some things that we invent because we look at what customers are doing with data and we ask ourselves how we can build that. Vectors fall into that category. We told ourselves, "Look, we can continue to make S3 the best repository for data on the planet," and we will. But there's this other element: how do you make sure that the data you have is in fact usable, and how do you make sure it's usable in a way that's industry-standard, like that Iceberg layer on top of our tabular data?

Mai-Lan Tomsen Bukovec

It's usable because AI models have now gotten so good at embeddings that you can have AI give you a semantic understanding of your data, if only you had the cost point of putting billions of vectors into storage. So for us, a lot of it is taking a step back and looking not just at what customers ask us for, but we want to remove the constraint of the cost of data and remove the constraint of working with your data.

Mai-Lan Tomsen Bukovec

When we can do both of those things—if we can make it possible that your data grows as your business needs it and you can tap into all the capabilities that you're getting with AI—then we have what we call a "product shape."

Gergely Orosz

What's a product shape?

Mai-Lan Tomsen Bukovec

When I think about S3, I think of it as almost like this living, breathing organism where the shape of the product is evolving. It's evolving with coherency around what you expect for the traits of S3, but it's evolving in a way that lets you steer into how you want to use data, not just now but in the future. We'll continue to evolve the product shape of S3 based on what you want to do with data.

Mai-Lan Tomsen Bukovec

In a lot of ways, we're transcending the boundaries of what object storage was or what a database traditionally was, because now we have tabular formats and conditionals, and we're evolving into this new shape—and it is ultimately uniquely S3.

Gergely Orosz

It kind of sounds like because you have all these microservices, it's evolving almost like a plant or a living organism.

Mai-Lan Tomsen Bukovec

Yes. I am, in fact, a former Peace Corps volunteer from forestry, and so a lot of times I will go back to the natural world for my metaphors. S3 is this living, breathing repository of data that lets people do things with data that they never thought possible.

Gergely Orosz

It's just interesting because I think as engineers, we don't often think to relate the systems that we build to a living organism, when in fact there's code, but as you said, there's people, there's servers, and there's failures that happen at a cadence you can almost predict. You can probably predict how many hard drives are failing today at your scale already. Do you think it's because of the scale—when things become large enough they start to have these characteristics?

Gergely Orosz

What I find fascinating talking to you is the way engineering works inside of S3 feels very different to how it works inside a smaller organization—your startup which does terabytes or maybe even a few petabytes, but that's it. What changes at this large scale? What do you think makes it feel so different?

Mai-Lan Tomsen Bukovec

In order for us to sustain the traits of S3 and to evolve it over time, we have to constantly go back to simplification. We have a very complex system with all of our different microservices, but those microservices have to do one or two things really well, and we have to stay true to that. Otherwise, the complexification of a distributed system is unmaintainable over time.

Mai-Lan Tomsen Bukovec

The concept of "simple" in S3 is a couple of things. One, it's the simplicity of the user model, where you have a simple API but now you also have the simplicity of using SQL or leveraging AI embedding models. That concept of simplicity is in the user model, but under the hood, if you sit in any of our engineering meetings, you will hear our engineers talk about how to implement a capability with the greatest simplicity possible.

Gergely Orosz

Speaking of which, what type of engineers do you typically hire to work at S3 in terms of traits and past experience?

Mai-Lan Tomsen Bukovec

Well, we hire all kinds of engineers. We have many who are early-career, straight out of school, and we have many who have been on S3 for a long time. I think there's a really strong element in our teams around ownership. People feel this personal sense of commitment. I feel it every day I come in—a personal sense of commitment to your bytes, to the preservation of your bytes, to the usefulness of your bytes, so that you can think about what your application does next and not the types of storage you need or how you grow it.

Mai-Lan Tomsen Bukovec

That deep sense of ownership and commitment is a very common thread across our data teams, because we know that at the end of the day, every modern business is a data business. Everything that people are trying to do is based on your data shaping the core of your application experience. That data is our responsibility, and we feel it very deeply.

Gergely Orosz

And what would your advice be to, let's say, a mid-career software engineer who has a few years of experience and decided one day they'd love to work on a deep infrastructure team like S3? For more experienced folks, what experiences or activities would make you consider them?

Mai-Lan Tomsen Bukovec

There's a strong value in relentless curiosity. When you work on S3 or a large-scale distributed system which continues to reinvent what storage means, you're not really coloring within the lines—you're drawing what the lines are today and knowing that you might have to rub those out and draw new lines in the future.

Mai-Lan Tomsen Bukovec

I think it's really important to always take a step back and take a look at the latest research. Some of the papers that I'll share with you are around how we took formal methods and brought them into storage systems, or thought about failure in a different way. That creativity, that relentless curiosity... I don't think you can go wrong with that. I think the next generation of software is all driven by the creativity of the engineering mind, and it is in all of us; we just have to unlock it and unleash it.

Gergely Orosz

I also love that not only has S3 created something that previously did not exist and was once unimaginable, but now I'm hearing about startups that are building on top of S3—Turbopuffer is a good example. They're innovating because they now have a base layer. You decide where you want to innovate—at the lowest level or one level higher—and you just use the right primitives. In your case, this is doing hardware and storage better than anyone.

Mai-Lan Tomsen Bukovec

Yeah, it's very exciting for us to see so many different types of infrastructure built on S3 now.

Gergely Orosz

As we close, what is a book or a paper that you would recommend reading that you enjoyed, and why?

Mai-Lan Tomsen Bukovec

I am fascinated by how quickly the evolution of embedding models is coming along, and in particular a field of science that I'm quite interested in is the multimodal embedding model. The world that we experience is multimodal, and therefore the understanding that we have of data should be multimodal as well. There's this whole field of science that's emerging quite rapidly around multimodal embedding models, and I encourage people who are working in the field of data to look at that.

Mai-Lan Tomsen Bukovec

If you think about the next world of data lakes, I think it's going to be on metadata—it's going to be on the semantic understanding of our data. Understanding how that is created through vectors and how it's being searched across multiple modalities is an important area of both research and advancement.

Gergely Orosz

Amazing. And do you have any book recommendations?

Mai-Lan Tomsen Bukovec

I will give you a book recommendation that won't be in the field of computer science; it will be about the evolution of the ecology around us and supporting native bees and insects. A tiny bit further afield, but if your readers are interested, they can take a look at how to support the bees of the planet.

Gergely Orosz

Well, Mai-Lan, thank you very much. This was fascinating, and it was very interesting to get a peek into this massive world of data at scale and respecting the byte.

Mai-Lan Tomsen Bukovec

It was great talking to you. Thank you to you and to all of your listeners who use S3. We quite literally wouldn't be able to do what we do without the feedback and the encouragement from everybody who uses S3 today. So thank you for that.

Gergely Orosz

Just wow. I always suspected there's a lot of complexity behind a system like S3, but I just did not realize the scale of it. Whenever I worked on systems with even hundreds of virtual machines, failure of one machine was a rare event and not something that we really counted on.

Gergely Orosz

During my conversation with Mai-Lan, she casually mentioned that several machines had failed during our conversation, which is something that the S3 team knows about, prepares for, and treats like an everyday event. I personally really liked how AWS has two conflicting tenets heavily used on the S3 team: "Respect what came before" and "Be technically fearless." For such a massive system, it would be easy to say, "Let's move conservatively because of how many companies depend on us." But if they did so, S3 would fall behind.

Gergely Orosz

Finally, I'm still in awe that AWS put strong consistency in place and rolled it out to all customers without increasing pricing or latency, at S3 scale. This is an absolutely next-level engineering achievement; in fact, it was probably one of the lesser-known engineering feats of the decade. I hope you found the episode as fascinating as I did.

Gergely Orosz

If you'd like to learn more about Amazon and AWS, check out the exclusive deep dive I did with AWS's incident management team on how they handle outages, in the show notes below. In The Pragmatic Engineer, I also did other deep dives about Amazon and AWS; they are also linked in the show notes. If you enjoy this podcast, please do subscribe on your favorite podcast platform and on YouTube. A special thank you if you also leave a rating on the show.

Automatically generated transcript. May contain errors.