If you take a look through Dick Craddock’s blog post from earlier this year, A short history of Hotmail, or Arthur de Haan’s A peek behind the scenes at Hotmail, you’ll see that Hotmail has experienced enormous growth since it started 15 years ago. Today it delivers a service that is localized to 59 markets, encompassing well over a billion inboxes that generate over two petabytes (2 million gigabytes) of new storage each month, and stores hundreds of petabytes in total. You’ll also see that through the years Hotmail’s back-end architecture has been upgraded and reworked multiple times to allow for this rapid growth and to incorporate advances in hardware. Given the operational challenges and trade-offs required to run a service of this size, developing and managing our systems at scale is second nature to the Hotmail team.
In this video, Hotmail engineering team leads Dick Craddock, Mike Schackwitz, and Phil Smoot (me) discuss the major issues that arise from running such a large service.
Next, let’s dig into how we run Hotmail, and how we think about development at scale in general.
For starters, while some of the more than 360 million people who come to Hotmail each month may think about the web UI they see and use, there is a whole lot more going on behind the scenes to run this service.
Under Hotmail’s hood
Hotmail has lots of users; that’s a given. But with that comes the need to support diverse user scenarios. People access Hotmail from just about every country on the planet, in dozens of languages, from multiple devices (phones and PCs), on multiple browsers running on multiple operating systems (Windows, Mac, Linux, Unix). And each month they add new accounts and contacts, create filters, share and download photos, import email from other services, and send and receive tens of billions of messages. For all of this, they look to Hotmail for a fast, secure, and seamless experience. To make this happen, there are about 100 different services that Hotmail is running all the time.
These 100 or so different service types run on tens of thousands of servers in data centers around the world.  From the ground up, they are grouped into different classes:
- Manageability services let us operate the system with very little administrative support, the key to running a large number of servers.  Our design goal is a “self-healing” system, and our management services help us do this. They automate software deployment to our servers, monitor the health of our servers, automatically repair failing servers and rebalance the system as necessary, all without any human involvement.
- Storage services use tens of thousands of servers to store our users’ data, including 3-4 copies of each piece of data for redundancy and backup purposes. Our data systems rely on automatic replication, consistency, and fail-over algorithms to keep all of our data correct and always available.
- Message delivery services deliver mail to and from Hotmail. Today we process over 8 billion messages a day with 2.5 billion messages being delivered to the inbox (the difference between those two numbers is primarily blocked spam).
- Anti-abuse, safety, and privacy services protect our users and identify spammers. Our spam prevention incorporates dozens of systems for filtering incoming mail, integrating third-party block- and safe-sender lists, recording what email users report as junk, and handling end-point reputation systems (i.e., where did the message come from, and where is it directing the user?). Our anti-virus protection removes viruses from messages before they reach the user.
- Data synchronization services exchange data between our services and users’ devices, like PCs and phones. These include POP3, ActiveSync, and DeltaSync for synchronizing data with PC applications like Outlook and Windows Live Mail, and with mobile devices like Windows Mobile, BlackBerry, and the iPhone. They also include services that exchange data with other internal data services, like instant messaging, billing, and the Windows Live ID authentication systems. We also support services that aggregate data from other email services and social networks like Twitter and Facebook.
- Site maintenance services run in the background, cleaning up after the party. These include data warehouse services, which track feature usage and system performance. They include system garbage collection services that remove deleted and junk mail from the system, and load balancing services that ensure that storage, CPU, memory, and networking demands are distributed over the entire network of servers.
- Application services like mail, calendar, contacts, and instant messaging are our web-based applications that consumers use directly, driving tens of billions of page views per month. These applications implement a variety of techniques including caching, geographical data placement, bandwidth detection, etc., to ensure that the application performance meets our goals in all markets.
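To make the redundancy idea behind the storage services above concrete, here is a minimal sketch of writing one piece of data to several replicas and treating the write as durable only when a majority acknowledge it. All the names here are illustrative assumptions for the sake of the example; Hotmail’s real replication and fail-over machinery is far richer than this.

```python
# Sketch: write one item to N replicas, succeed on a majority of acks.
# Hypothetical names -- not Hotmail's actual storage implementation.

REPLICA_COUNT = 3  # the post mentions keeping 3-4 copies of each piece of data


class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}

    def write(self, key, value):
        if not self.healthy:
            raise IOError(f"replica {self.name} unavailable")
        self.store[key] = value


def replicated_write(replicas, key, value):
    """Return True if a majority of replicas acknowledged the write."""
    acks = 0
    for r in replicas:
        try:
            r.write(key, value)
            acks += 1
        except IOError:
            pass  # a failed replica is repaired and re-synced later
    return acks > len(replicas) // 2


replicas = [Replica("r1"), Replica("r2", healthy=False), Replica("r3")]
ok = replicated_write(replicas, "inbox:42", "message body")
print(ok)  # True: 2 of 3 replicas acknowledged, so the write is durable
```

The point of the majority rule is that a single failed server neither loses data nor blocks the write; the self-healing services described above bring the lagging replica back into sync afterward.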
Hotmail clusters – how do we put all these services together?
Hotmail uses what we call "clusters" to build out and manage all of these different services. A cluster is a management unit of computation, storage, and memory-caching servers grouped together in a network unit. Clusters allow us to build and run the Hotmail operational system in a repeatable and predictable manner at ever-growing scale. A cluster contains everything necessary to manage and run a set of services for a set of users. This design provides good performance, because everything a user needs is in one place. It also limits the impact of system outages to only the users on a cluster that might be experiencing a problem. Hotmail has hundreds of clusters, and is adding dozens per year to keep up with the needs of our users.
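One common way to realize the "set of services for a set of users" idea is to give each user a home cluster and route every request there. The sketch below uses a bare hash purely for illustration; a production system like this would typically use a directory service instead, so accounts can be migrated between clusters, and none of these names are Hotmail’s real ones.

```python
import hashlib

# Sketch: map each user to a home cluster so all of that user's
# mail, contacts, and calendar data live in one place.
# Illustrative only -- not Hotmail's actual placement mechanism.

CLUSTERS = [f"cluster-{i:03d}" for i in range(200)]  # "hundreds of clusters"


def home_cluster(user_id: str) -> str:
    """Deterministically pick the cluster that owns this user's data."""
    digest = hashlib.sha1(user_id.lower().encode("utf-8")).hexdigest()
    return CLUSTERS[int(digest, 16) % len(CLUSTERS)]


# Every front-end and sync service computes the same answer,
# so a request always lands where the user's data is stored.
assert home_cluster("someone@example.com") == home_cluster("SOMEONE@example.com")
print(home_cluster("someone@example.com"))
```

Keeping a user’s data on one cluster is what gives the locality benefit described above, and it is also why an outage on one cluster affects only that cluster’s users.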
Tips for developing at scale
If you’re planning to start your own billion-user service, look no further! The Hotmail developers have come up with their top five scaling suggestions – things to keep in mind when writing code for lots of users.
- Keep the overall design as simple as possible. The goal is to fail fast in simple ways. Over time, 80% of all work is maintenance done by people who didn’t invent the system, and often the system becomes so large that no one can understand the entire end-to-end experience.
- Remove all single points of failure from your designs. No single component failure should affect the performance or availability of your system. Then plan for, practice, and test these failures as normal occurrences. When your service fails, don’t wake up administrators at night; ensure the system handles failure without human intervention.
- Build in performance testing from day one.  Big composite systems spread over the earth can become expensive quickly.  Add in large distances and the constant speed of light, and performance can become very slow with just a simple extra network round-trip or additional disk IO.
- Automate everything. Humans don’t scale well, they are expensive, and make lots of mistakes.
- Isolate composite system failures. When a neighbor system fails, back off so the neighbor can recover. And make sure you don’t get tangled up in your neighbor’s mess.
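The last two tips – handle failure without human intervention, and back off so a struggling neighbor can recover – are commonly combined as retry with exponential backoff and jitter. Here is a minimal sketch of that pattern; the function and parameter names are assumptions for the example, not Hotmail’s code.

```python
import random
import time

# Sketch: retry a call to a neighbor system with exponential backoff
# and jitter, so a failing service gets breathing room to recover.


def call_with_backoff(operation, max_attempts=5, base_delay=0.01, max_delay=1.0):
    """Retry `operation`, sleeping longer (with random jitter) after
    each failure instead of hammering the struggling neighbor."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # give up; let monitoring and self-healing take over
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.random())  # full jitter


# Example: a flaky neighbor that succeeds on the third try.
attempts = {"n": 0}


def flaky_neighbor():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("neighbor overloaded")
    return "ok"


print(call_with_backoff(flaky_neighbor))  # "ok" after two backoffs
```

The jitter matters at scale: if thousands of servers retry on the same schedule, their synchronized retries become a new load spike on the very system that is trying to recover.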
Phil Smoot
Partner Development Manager, Windows Live Hotmail