Designing Data-Intensive Apps
Chapter 1: Foundations of Data Systems
- Reliability - tolerating hardware and software faults
- Scalability - measuring load and performance; latency; percentiles; throughput
- Maintainability - operability, simplicity, and evolvability
Reliability
- Reliability implies "continuing to work correctly even when things go wrong"
- The things that can go wrong are called faults; systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
- Many faults are due to poor error handling; by deliberately inducing faults we can verify that the system handles them as expected (see Netflix's Chaos Monkey)
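As a rough sketch of the fault-injection idea (not Netflix's actual tooling; the workload, failure rate, and retry policy here are invented for illustration):

```python
import random

def process_batch(batch):
    # Stand-in workload: sum a batch of numbers.
    return sum(batch)

def run_with_fault_injection(batches, failure_rate=0.2, seed=0):
    """Randomly 'kill' some tasks to verify that the retry path works."""
    rng = random.Random(seed)
    results = []
    for batch in batches:
        try:
            if rng.random() < failure_rate:
                raise RuntimeError("injected fault: task killed")
            results.append(process_batch(batch))
        except RuntimeError:
            # Fault-tolerant path: retry the failed task once.
            results.append(process_batch(batch))
    return results

if __name__ == "__main__":
    batches = [[1, 2, 3], [4, 5], [6]]
    # The injected faults must not change the final result.
    assert run_with_fault_injection(batches) == [6, 9, 6]
```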
Data Models and Query Languages
Relational Model
The roots of relational databases lie in business data processing, which was performed on mainframe computers in the 1960s and ’70s. The use cases appear mundane from today’s perspective: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).
Birth of NoSQL
There are several driving forces behind the adoption of NoSQL databases, including:
- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
- A widespread preference for free and open source software over commercial database products
- Specialized query operations that are not well supported by the relational model
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model
New non-relational “NoSQL” datastores have diverged in two main directions:
- Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything
Document Databases
Document databases reverted back to the hierarchical model in one aspect: storing nested records (one-to-many relationships, like positions, education, and contact_info in Figure 2-1) within their parent record rather than in a separate table.
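As a minimal sketch of this locality, a résumé-style record with nested one-to-many items can be stored as one self-contained JSON document; the field values below are invented, and the field names loosely follow Figure 2-1:

```python
import json

# One self-contained document: the one-to-many items (positions,
# education, contact_info) are nested inside the parent record
# instead of living in separate tables.
user_251 = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
    "contact_info": {"blog": "https://www.gatesnotes.com"},
}

# The entire tree is stored and loaded as one unit.
print(json.dumps(user_251, indent=2))
```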
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and for many-to-one and many-to-many relationships.
If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it's probably a good idea to use a document model.
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like “the second item in the list of positions for user 251”.
However, if your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent.
Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database.
Transaction Processing or Analytics
Records are inserted or updated based on the user’s input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
Usually, an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user. For example, if your data is a table of sales transactions, then analytic queries might be:
- What was the total revenue of each of our stores in January?
- How many more bananas than usual did we sell during our latest promotion?
- Which brand of baby food is most often purchased together with a particular brand of diapers?
These queries are often written by business analysts and feed into reports that help a company’s management make better decisions (business intelligence). To differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP) [47]. The difference between OLTP and OLAP is not always clear-cut, but some typical characteristics are listed in Table 3-1.
![[/image1.png]]
These OLTP systems are usually expected to be highly available and to process transactions with low latency since they are often critical to the operation of the business. Database administrators, therefore, closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions.
Data Warehouse
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content without affecting OLTP operations [48]. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL).
![[/image2.png]]
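A toy sketch of the three ETL steps, assuming a hypothetical sales.csv dump from an OLTP system and a local SQLite file standing in for the warehouse (the column names are made up):

```python
import csv
import sqlite3

def etl(oltp_dump_path="sales.csv", warehouse_path="warehouse.db"):
    # Extract: read a periodic dump produced by the OLTP system.
    with open(oltp_dump_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: clean up and reshape into an analysis-friendly schema.
    facts = [
        (row["date"], row["store_id"], row["product_id"], float(row["amount"]))
        for row in rows
        if row.get("amount")  # drop malformed rows
    ]

    # Load: append into the read-only analytics copy (the warehouse).
    con = sqlite3.connect(warehouse_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(date TEXT, store_id TEXT, product_id TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", facts)
    con.commit()
    con.close()

# etl()  # run once per dump
```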
A significant advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns. It turns out that the indexing algorithms discussed in the first half of this chapter work well for OLTP but are not very good at answering analytic queries.
Star and Snowflake Schema
The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star. A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the dim_product table could reference the brand and category as foreign keys, rather than storing them as strings in the dim_product table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with [55].
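As a sketch of that layout, here is one possible snowflake-style schema in SQLite via Python; only date_key and product_sk come from the text, and the remaining table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Snowflake variant: the product dimension references further sub-dimensions
# (brand, category) by foreign key instead of storing them as strings.
con.executescript("""
CREATE TABLE dim_brand    (brand_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_sk INTEGER PRIMARY KEY, sku TEXT,
                           brand_id INTEGER REFERENCES dim_brand,
                           category_id INTEGER REFERENCES dim_category);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, calendar_date TEXT);

-- The fact table sits in the middle, pointing at the dimension tables.
CREATE TABLE fact_sales   (date_key INTEGER REFERENCES dim_date,
                           product_sk INTEGER REFERENCES dim_product,
                           quantity INTEGER, net_price REAL);
""")
```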
Column-Oriented Storage
In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one table row are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes. You can see this in the CSV example of Figure 3-1.
To process a query like Example 3-1, you may have indexes on fact_sales.date_key and fact_sales.product_sk that tell the storage engine where to find all the sales for a particular date or for a particular product. But then a row-oriented storage engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into memory, parse them, and filter out those that don’t meet the required conditions. That can take a long time. The idea behind column-oriented storage is simple: don’t store all the values from one row together, but instead store all the values from each column together. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work. This principle is illustrated in Figure 3-10.
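A minimal sketch of the difference: the same rows stored row-wise versus column-wise, where an aggregate that touches only two columns reads only those two value lists (a dict of lists stands in for one file per column; the sample values are invented):

```python
# Row-oriented: each row's many attributes live together.
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1, "net_price": 13.99},
    {"date_key": 140102, "product_sk": 74, "quantity": 3, "net_price": 2.29},
    {"date_key": 140103, "product_sk": 69, "quantity": 2, "net_price": 13.99},
]

# Column-oriented: all values from each column are stored together
# (in a real system, roughly one file per column).
columns = {
    "date_key":   [r["date_key"] for r in rows],
    "product_sk": [r["product_sk"] for r in rows],
    "quantity":   [r["quantity"] for r in rows],
    "net_price":  [r["net_price"] for r in rows],
}

# An analytic query touching only two columns reads only those two lists,
# instead of loading and parsing every full row.
total_qty_for_69 = sum(
    q for p, q in zip(columns["product_sk"], columns["quantity"]) if p == 69
)
print(total_qty_for_69)  # 3
```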
Finishing off the OLTP side, we did a brief tour through some more complicated indexing structures and databases optimized for keeping all data in memory.
We then took a detour from the internals of storage engines to look at the high-level architecture of a typical data warehouse. This background illustrated why analytic workloads are so different from OLTP: when your queries require sequentially scanning across a large number of rows, indexes are much less relevant. Instead, it becomes important to encode data very compactly, to minimize the amount of data the query needs to read from disk.
As an application developer, armed with this knowledge about the internals of storage engines, you are in a better position to know which tool is best suited for your particular application. If you need to adjust a database’s tuning parameters, this understanding also allows you to imagine what effect a higher or lower value may have.
![[/image3.png]]
Models of Data Flow
That’s a fairly abstract idea—there are many ways data can flow from one process to another. Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the most common ways that data flows between processes:
- Via databases (see “Dataflow Through Databases” on page 129)
- Via service calls (see “Dataflow Through Services: REST and RPC” on page 131)
- Via asynchronous message passing (see “Message-Passing Dataflow” on page 136)
Dataflow Through Services: REST and RPC
When you have processes that need to communicate over a network, there are a few different ways of arranging that communication. The most common arrangement is to have two roles: clients and servers. The servers expose an API over the network, and the clients can connect to the servers to make requests to that API. The API exposed by the server is known as a service.
Web browsers are not the only type of client. For example, a native app running on a mobile device or a desktop computer can also make network requests to a server, and a client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (this technique is known as Ajax [30]). In this case, the server’s response is typically not HTML for displaying to a human, but rather data in an encoding that is convenient for further processing by the client-side application code (such as JSON). Although HTTP may be used as the transport protocol, the API implemented on top is application-specific, and the client and server need to agree on the details of that API.
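A minimal client-side sketch of that pattern using only the Python standard library; the URL and the shape of the JSON response are placeholders, not a real API:

```python
import json
from urllib.request import Request, urlopen

def fetch_user(user_id):
    # The API on top of HTTP is application-specific: client and server
    # must agree that this path returns JSON with these fields.
    req = Request(
        f"https://api.example.com/users/{user_id}",
        headers={"Accept": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# profile = fetch_user(251)   # e.g. {"user_id": 251, "first_name": "..."}
```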
This way of building applications has traditionally been called a service-oriented architecture (SOA), more recently refined and rebranded as microservices architecture [31, 32].
A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable. For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams. In other words, we should expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API—precisely what we’ve been talking about in this chapter.
Message-Passing Dataflow
In this final section, we will briefly look at asynchronous message-passing systems, which are somewhere between RPC and databases. They are similar to RPC in that a client’s request (usually called a message) is delivered to another process with low latency.
Message brokers
In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular. The detailed delivery semantics vary by implementation and configuration, but in general, message brokers are used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic
However, a consumer may itself publish messages to another topic (so you can chain them together, as we shall see in Chapter 11), or to a reply queue that is consumed by the sender of the original message (allowing a request/response dataflow, similar to RPC).
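A toy in-memory sketch of the publish/subscribe pattern described above; real brokers such as RabbitMQ or Kafka differ substantially in API, durability, and delivery guarantees:

```python
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every consumer subscribed to the topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = ToyBroker()
broker.subscribe("orders", lambda msg: print("consumer A got:", msg))
broker.subscribe("orders", lambda msg: print("consumer B got:", msg))
broker.publish("orders", {"order_id": 1, "amount": 9.99})
```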
We also discussed several modes of dataflow, illustrating different scenarios in which data encodings are important:
- Databases, where the process writing to the database encodes the data and the process reading from the database decodes it
- RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
- Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient
Stream Processing
In general, a “stream” refers to data that is incrementally made available over time. The concept appears in many places: in the stdin and stdout of Unix, programming languages (lazy lists) [2], filesystem APIs (such as Java’s FileInputStream), TCP connections, delivering audio and video over the internet, and so on.
In a stream processing context, a record is more commonly known as an event, but it is essentially the same thing: a small, self-contained, immutable object containing the details of something that happened at some point in time.
An event may be encoded as a text string, or JSON, or perhaps in some binary form, as discussed in Chapter 4. This encoding allows you to store an event, for example by appending it to a file, inserting it into a relational table, or writing it to a document database. It also allows you to send the event over the network to another node in order to process it.
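For example, a minimal sketch of appending JSON-encoded events to a log file, one event per line; the event fields are invented for illustration:

```python
import json
import time

def append_event(path, event):
    # Encode the event and append it to a log file, one JSON object per line.
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

append_event("events.log", {
    "event_type": "page_view",
    "user_id": 251,
    "url": "/products/42",
    "timestamp": time.time(),
})
```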
Databases have traditionally not supported this kind of notification mechanism very well: relational databases commonly have triggers, which can react to a change (e.g., a row being inserted into a table), but they are very limited in what they can do and have been somewhat of an afterthought in database design [4, 5]. Instead, specialized tools have been developed for the purpose of delivering event notifications.
We talked about where streams come from (user activity events, sensors, and writes to databases), and about how streams are transported (through direct messaging, via message brokers, and in event logs).
Logs for Message Storage
A log is simply an append-only sequence of records on disk.
The same structure can be used to implement a message broker: a producer sends a message by appending it to the end of the log, and a consumer receives messages by reading the log sequentially. If a consumer reaches the end of the log, it waits for a notification that a new message has been appended
In order to scale to higher throughput than a single disk can offer, the log can be partitioned (in the sense of Chapter 6). Different partitions can then be hosted on different machines, making each partition a separate log that can be read and written independently from other partitions.
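A simplified sketch of that idea: messages are appended to a partition chosen by hashing a key, and a consumer reads a partition sequentially from its current offset (replication, durability, and consumer-offset tracking are omitted):

```python
class PartitionedLog:
    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Messages with the same key always go to the same partition,
        # so their relative order is preserved within that partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # A consumer reads sequentially from its current offset onwards.
        return self.partitions[partition][offset:]

log = PartitionedLog()
log.append("user-251", {"action": "click"})
log.append("user-251", {"action": "purchase"})
p, _ = log.append("user-251", {"action": "logout"})
print(log.read(p, 0))  # all messages in this partition, in append order
```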
Processing Streams
Broadly, there are three options:
- You can take the data in the events and write it to a database, cache, search index, or similar storage system, from where it can then be queried by other clients. As shown in Figure 11-5, this is a good way of keeping a database in sync with changes happening in other parts of the system—especially if the stream consumer is the only client writing to the database. Writing to a storage system is the streaming equivalent of what we discussed in “The Output of Batch Workflows” on page 411.
- You can push the events to users in some way, for example by sending email alerts or push notifications, or by streaming the events to a real-time dashboard where they are visualized. In this case, a human is the ultimate consumer of the stream.
- You can process one or more input streams to produce one or more output streams. Streams may go through a pipeline consisting of several such processing stages before they eventually end up at an output (option 1 or 2); a minimal sketch of such a stage follows below.
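A minimal sketch of the third option: a single processing stage implemented as a Python generator that consumes an input stream and emits a derived output stream (the events are invented for illustration):

```python
def input_stream():
    # Stand-in for an incoming stream of events (e.g. read from a broker).
    yield {"user_id": 1, "action": "click"}
    yield {"user_id": 2, "action": "purchase", "amount": 9.99}
    yield {"user_id": 1, "action": "purchase", "amount": 4.50}

def purchases_only(events):
    # One processing stage: filter and transform, emitting a derived stream.
    for event in events:
        if event["action"] == "purchase":
            yield {"user_id": event["user_id"], "amount": event["amount"]}

# Stages can be chained into a pipeline before reaching an output (option 1 or 2).
for out_event in purchases_only(input_stream()):
    print(out_event)
```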