February 23, 2020

DDIA 1. Reliable, Scalable, and Maintainable Applications

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

In this book, we focus on three concerns that are important in most software systems:

  • Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
  • Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
  • Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behaviour and adapting the system to new use cases), and they should all be able to work on it productively.

Reliability

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

Hardware Faults

When we think of the causes of system failure, hardware faults quickly come to mind.

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced.
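
The same idea carries over to software that talks to redundant components. The following is a minimal sketch, not taken from the book: `ReplicaDown`, `read_with_failover`, `broken_replica`, and `healthy_replica` are hypothetical names chosen for illustration. It shows a single faulty replica being tolerated, so that the fault does not turn into a failure of the whole read path.

```python
class ReplicaDown(Exception):
    """Raised by a (hypothetical) replica that has suffered a fault."""


def read_with_failover(replicas, key):
    """Try each replica in turn; give up only if every replica is faulty."""
    last_error = None
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown as exc:
            # A fault: one component deviated from its spec. Try the next one.
            last_error = exc
    # A failure: the system as a whole cannot provide the service.
    raise RuntimeError(f"all {len(replicas)} replicas failed") from last_error


def broken_replica(key):
    # Stand-in for a machine whose disk has died.
    raise ReplicaDown("disk died")


def healthy_replica(key):
    # Stand-in for a working machine.
    return f"value-for-{key}"


print(read_with_failover([broken_replica, healthy_replica], "user:42"))
```

With one healthy replica the caller never sees the fault. Only when every replica is down does the read fail.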

Software Errors

Another class of fault is a systematic error within the system.

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; and measuring, monitoring, and analyzing system behaviour in production.

Human Errors

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable.

How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

  • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing”.
  • Decouple the places where people make the most mistakes from the places where they can cause failures.
  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests.

Scalability

Scalability is the term we use to describe a system’s ability to cope with increased load.

Describing Load

Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.
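
As a rough illustration, a few of these load parameters can be derived from a request log. This is a sketch under stated assumptions: the `Request` record, its fields, and the `describe_load` function are hypothetical and not from the book, and a real system would pick the parameters that match its own architecture.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation window
    is_write: bool    # True for a write, False for a read
    cache_hit: bool   # True if the request was served from the cache


def describe_load(requests):
    """Summarize a request log as a handful of load parameters."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to measure a rate")
    timestamps = [r.timestamp for r in requests]
    window = max(timestamps) - min(timestamps)
    if window <= 0:
        raise ValueError("requests must span a non-zero time window")
    writes = sum(r.is_write for r in requests)
    reads = len(requests) - writes
    hits = sum(r.cache_hit for r in requests)
    return {
        "requests_per_second": len(requests) / window,
        "reads_per_write": reads / writes if writes else float("inf"),
        "cache_hit_rate": hits / len(requests),
    }


log = [
    Request(0.0, is_write=False, cache_hit=True),
    Request(0.5, is_write=False, cache_hit=True),
    Request(1.0, is_write=True, cache_hit=False),
    Request(1.5, is_write=False, cache_hit=False),
    Request(2.0, is_write=False, cache_hit=True),
]
print(describe_load(log))
```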

Describing Performance

Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:

  • When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Both questions require performance numbers:

  • In a batch processing system such as Hadoop, we usually care about throughput - the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
  • In online systems, what’s usually more important is the service’s response time - that is, the time between a client sending a request and receiving a response.

We need to think of response time not as a single number, but as a distribution of values that you can measure.
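
One common way to summarize such a distribution is with percentiles. The sketch below is only illustrative: the `percentile` helper uses the nearest-rank method (one of several definitions), and the sample response times are made up. It shows how a couple of slow requests barely move the median but dominate the mean and the high percentiles.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up response times in milliseconds; two requests are outliers.
response_times_ms = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]

print("mean:", statistics.mean(response_times_ms))  # 126.2
print("p50: ", percentile(response_times_ms, 50))   # 14
print("p90: ", percentile(response_times_ms, 90))   # 250
print("p99: ", percentile(response_times_ms, 99))   # 900
```

Here the mean (126.2 ms) describes almost no actual request, while the median (14 ms) reflects the typical case and the high percentiles expose the slow outliers.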
