Chapter 4, Data Management Patterns GPT Flashcards
What are the unique characteristics of cloud native data compared to traditional data processing practices?
Cloud native data can be stored in many forms, in a variety of data formats and data stores; it does not maintain a fixed schema, and duplicating data is encouraged to favor availability and performance over consistency. Rather than accessing the same database directly, services are encouraged to call the APIs of the service that owns each data store. This provides separation of concerns and allows cloud native data to scale out.
Page 162
Just as cloud native microservices have characteristics such as being scalable, resilient, and manageable, cloud native data has its own unique characteristics that are quite different from traditional data processing practices. Most important, cloud native data can be stored in many forms, in a variety of data formats and data stores. It is not expected to maintain a fixed schema, and duplicate data is encouraged to facilitate availability and performance over consistency. Furthermore, in cloud native applications, multiple services are discouraged from accessing the same database directly; instead, they should call the APIs of the service that owns the data store. All of this provides separation of concerns and allows cloud native data to scale out.
What are stateless applications, and why are they simpler to implement and scale compared to stateful applications?
Stateless applications depend only on input and configuration data, so a failure or restart has almost no impact on their execution. In contrast, stateful applications also depend on state data, which makes them more complex to implement and scale: application failures can corrupt their state, leading to incorrect execution.
Page 162
Applications that depend only on input and configuration (config) data are called stateless applications. These applications are relatively simple to implement and scale because their failure or restart has almost no impact on their execution. In contrast, applications that depend on input, config, and state data—stateful applications—are much more complex to implement and scale. The state of the application is stored in data stores, so application failures can result in partial writes that corrupt their state, which can lead to incorrect execution of the application.
What are relational databases best suited for, and what principle do they follow for schema definition?
Relational databases are ideal for storing structured data that has a predefined schema and use Structured Query Language (SQL) for processing, storing, and accessing data. They follow the principle of defining schema on write, meaning the data schema is defined before writing the data to the database.
Page 165
Relational databases are ideal for storing structured data that has a predefined schema. These databases use Structured Query Language (SQL) for processing, storing, and accessing data. They also follow the principle of defining schema on write: the data schema is defined before writing the data to the database.
What are the advantages of using relational databases for cloud native application data?
Relational databases can optimally store and retrieve data using database indexing and normalization, provide transaction guarantees through ACID properties, and help deploy and scale the data along with microservices as a single deployment unit.
Page 165-166
Relational databases can optimally store and retrieve data by using database indexing and normalization. Because these databases support atomicity, consistency, isolation, and durability (ACID) properties, they can also provide transaction guarantees.
Relational databases are a good option for storing cloud native application data. We recommend using a relational database per microservice, as this will help deploy and scale the data along with the microservice as a single deployment unit.
What is the principle of schema on read, and which type of databases follow this principle?
The principle of schema on read means that the schema of the data is defined only at the time of accessing the data for processing, not when it is written to the disk. NoSQL databases follow this principle.
Page 166
NoSQL databases follow the principle of schema on read: the schema of the data is defined only at the time of accessing the data for processing, and not when it is written to the disk.
Why are NoSQL databases suitable for handling big data, and what is a general recommendation regarding their use for transaction guarantees?
NoSQL databases are designed for scalability and performance, making them suitable for handling big data. However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
Page 166-167
NoSQL databases are best suited to handling big data, as they are designed for scalability and performance.
However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
How does a column store database manage data, and what are some common examples?
A column store database stores multiple key (column) and value pairs in each of its rows, allowing for writing any number of columns during the write phase and specifying only the columns of interest during data retrieval. Examples include Apache Cassandra and Apache HBase.
Page 167
A column store stores multiple key (column) and value pairs in each of its rows, as shown in Figure 4-2. These stores are a good example of schema on read: we can write any number of columns during the write phase, and when data is retrieved, we can specify only the columns we are interested in processing. The most widely used column store is Apache Cassandra. For those who use big data and Apache Hadoop infrastructure, Apache HBase can be an option, as it is part of the Hadoop ecosystem.
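As an illustration (not from the book), here is a minimal Python sketch of the schema-on-read idea behind column stores: each row carries its own set of columns, and the reader names only the columns it wants. The row keys and column names are invented for this example.

```python
# Schema on read: rows hold arbitrary (column, value) pairs, and the
# schema is effectively decided by the reader at query time.
rows = {
    "user:1001": {"name": "Alice", "email": "alice@example.com"},
    "user:1002": {"name": "Bob", "last_login": "2023-05-01", "plan": "pro"},
}

def read_columns(row_key, columns):
    """Return only the requested columns; rows may lack any of them."""
    row = rows.get(row_key, {})
    return {col: row[col] for col in columns if col in row}

print(read_columns("user:1002", ["name", "plan"]))
# {'name': 'Bob', 'plan': 'pro'}
```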
What type of data is stored in a document store, and which databases are popular for this purpose?
A document store can store semi-structured data such as JSON and XML documents, allowing processing with JSON and XML path expressions. Popular document stores include MongoDB, Apache CouchDB, and Couchbase.
Page 169
A document store can store semi-structured data such as JSON and XML documents. This also allows us to process stored documents by using JSON and XML path expressions. These data stores are popular as they can store JSON and XML messages, which are usually used by frontend applications and APIs for communication. MongoDB, Apache CouchDB, and Couchbase are popular options for storing JSON documents.
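As a hedged illustration (assuming a MongoDB instance on localhost and an invented shop/orders collection), a JSON document can be stored and then queried by a path expression with pymongo:

```python
# Requires the pymongo package and a running MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # database and collection names are hypothetical

orders.insert_one({
    "order_id": "A-100",
    "customer": {"name": "Alice", "country": "NZ"},
    "items": [{"sku": "B-1", "qty": 2}],
})

# Dot notation acts as a path expression into the nested JSON document.
doc = orders.find_one({"customer.country": "NZ"})
print(doc["order_id"])  # A-100
```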
What is the CAP theorem, and how does it apply to NoSQL stores?
The CAP theorem states that a distributed application can provide either full availability or consistency, but not both, while ensuring partition tolerance. Availability means the system is fully functional when some nodes are down, consistency means an update/change in one node is immediately propagated to others, and partition tolerance means the system can work even when some nodes cannot connect to each other.
Page 169
NoSQL stores are distributed, so they need to adhere to the CAP theorem; CAP stands for consistency, availability, and partition tolerance. This theorem states that a distributed application can provide either full availability or consistency; we cannot achieve both while providing network partition tolerance. Here, availability means that the system is fully functional when some of its nodes are down, consistency means an update/change in one node is immediately propagated to other nodes, and partition tolerance means that the system can continue to work even when some nodes cannot connect to each other.
Why is filesystem storage preferred for unstructured data in cloud native applications?
Filesystem storage is preferred for unstructured data because it optimizes data storage and retrieval without trying to understand the data. It can also be used to store large application data as a cache, which can be cheaper than retrieving data repeatedly over the network.
Page 171
Filesystem storage is the best option for storing unstructured data in cloud native applications. Unlike NoSQL stores, it does not try to understand the data but rather purely optimizes data storage and retrieval. We can also use filesystem storage to store large application data as a cache, as this can be cheaper than retrieving the data repeatedly over the network.
When should cloud native applications use relational data stores, NoSQL stores, or filesystem storage?
Cloud native applications should use relational data stores when they need transactional guarantees and the data is tightly coupled with the application. Semi-structured or unstructured fields should be separated out and stored in NoSQL or filesystem stores to achieve scalability while preserving transactional guarantees. NoSQL is also suitable when the data is extremely large, needs querying capability, or fits a specialized use case such as graph processing.
Page 172
Cloud native applications should use relational data stores when they need transactional guarantees and when data needs to be tightly coupled with the application.
When data contains semi-structured or unstructured fields, they can be separated and stored in NoSQL or filesystem stores to achieve scalability while still preserving transactional guarantees. Applications can choose to store data in NoSQL stores when the data quantity is extremely large, the data needs querying capability or is semi-structured, or the data store is specialized enough to handle the specific application use case, such as graph processing.
What are the advantages and disadvantages of centralized data management in traditional data-centric applications?
Centralized data management allows data normalization for high consistency and enables running stored procedures across multiple tables for faster retrieval. However, it introduces tight coupling between applications, hinders their ability to evolve independently, and is considered an antipattern for cloud native applications.
Page 172
Centralized data management is the most common type in traditional data-centric applications. In this approach, all data is stored in a single database, and multiple components of the application are allowed to access the data for processing (Figure 4-3).
This approach has several advantages; for instance, the data in these database tables can be normalized, providing high data consistency. Furthermore, as components can access all the tables, the centralized data storage provides the ability to run stored procedures across multiple tables and to retrieve results faster. On the other hand, this approach introduces tight coupling between applications and hinders their ability to evolve independently. Therefore, it is considered an antipattern when building cloud native applications.
How does decentralized data management benefit microservices, and what are its potential disadvantages?
Decentralized data management allows scaling microservices independently, improving development time and release cycles, and solving data management and ownership problems. However, it can increase the cost of running separate data stores for each service.
Page 174
In decentralized data management, each independent functional component can be modeled as a microservice with a separate data store exclusive to it. This approach, illustrated in Figure 4-4, allows us to scale microservices independently without impacting other microservices.
Although application owners have less freedom to manage or evolve the data as a whole, segregating the data within each microservice so that it is managed by its own team/owners not only solves data management and ownership problems, but also shortens the development time of new feature implementations and release cycles.
Decentralized data management allows services to choose the most appropriate data store for their use case. For example, a Payment service may use a relational database to perform transactions, while an Inquiry service may use a document store to store the details of the inquiry, and a Shopping Cart service may use a distributed key-value store to store the items picked by the customer.
One of the disadvantages of decentralized data management is the cost of running separate data stores for each service.
What is hybrid data management, and how does it help with data protection and security enforcement?
Hybrid data management helps achieve compliance with modern data-protection laws and ease security enforcement by having customer data managed via a few microservices within a secured bounded context. It provides ownership of the data to one or a few well-trained teams to apply data-protection policies.
Page 175
Hybrid data management helps achieve compliance with modern data-protection laws and eases security enforcement because the data resides in a central place. Therefore, it is advisable to have all customer data managed via a few microservices within a secured bounded context, and to give ownership of the data to one or a few well-trained teams that apply data-protection policies.
What benefits does exposing data as a data service provide, and in what situations is the Data Service Pattern useful?
Exposing data as a data service allows control over data presentation, security, and priority-based throttling. The Data Service Pattern is useful when data does not belong to a specific microservice and multiple microservices depend on it, or for exposing legacy on-premises or proprietary data stores to cloud native applications.
Page 180, 182
Exposing data as a data service, shown in Figure 4-10, gives us more control over that data. It allows us to present data in various compositions to various clients, apply security, and enforce priority-based throttling, allowing only critical services to access data during resource-constrained situations such as load spikes or system failures.
These data services can perform simple read and write operations to a database or even perform complex logic such as joining multiple tables or running stored procedures to build responses much more efficiently. These data services can also utilize caching to enhance their read performance.
We can use the Data Service Pattern when the data does not belong to any particular microservice; no microservice is the rightful owner of that data, yet multiple microservices are depending on it for their operation. In such cases, the common data should be exposed as an independent data service, allowing all dependent applications to access the data via APIs.
We can also use the Data Service Pattern to expose legacy on-premises or proprietary data stores to other cloud native applications.
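To make the idea concrete, here is a minimal sketch (not from the book) of a data service that owns a shared product catalog and exposes it only through an API, using Flask; the endpoint and data are invented:

```python
# Sketch of a data service owning common data that several microservices
# depend on; consumers go through the API, never the store itself.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Stands in for the data store this service exclusively owns.
CATALOG = {"B-1": {"name": "Widget", "price": 9.99}}

@app.route("/products/<sku>")
def get_product(sku):
    product = CATALOG.get(sku)
    if product is None:
        abort(404)  # the underlying store is never exposed directly
    return jsonify(product)

if __name__ == "__main__":
    app.run(port=8080)
```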
Why is accessing the same data via multiple microservices considered an antipattern, and how can the Data Service Pattern help?
Accessing the same data via multiple microservices introduces tight coupling and hinders scalability and independent evolution of microservices. The Data Service Pattern helps reduce coupling by providing managed APIs to access data.
Page 183
Considerations: When building cloud native applications, accessing the same data via multiple microservices is considered an antipattern. It introduces tight coupling between the microservices and does not allow them to scale and evolve on their own. The Data Service Pattern can help reduce coupling by providing managed APIs to access data.
The Data Service Pattern should not be used when the data can clearly be associated with an existing microservice, as introducing unnecessary microservices will cause additional management complexity.
What is the primary purpose of the Sharding Pattern, and what should be avoided when generating shard keys?
The primary purpose of the Sharding Pattern is to improve data retrieval time by distributing data across multiple shards. When generating shard keys, avoid using auto-incrementing fields and ensure the fields that contribute to the shard key remain fixed to avoid time-consuming data migration.
Page 198, 200
For sharding to be useful, the data should contain one or a collection of fields that uniquely identifies the data or meaningfully groups it into subsets. The combination of these fields generates the shard/partition key that will be used to locate the data. The values stored in the fields that contribute to the shard key should be fixed and never be changed upon data updates. This is because when they change, they will also change the shard key, and if the updated shard key now points to a different shard location, the data also needs to be migrated from the current shard to the new shard location. Moving data among shards is time-consuming, so this should be avoided at all costs.
We don’t recommend using auto-incrementing fields when generating shard keys. Shards do not communicate with each other, so if auto-incrementing fields are used, multiple shards may generate the same keys, each referring to different data locally. This can become a problem when the data is redistributed during data-rebalancing operations.
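A minimal sketch of shard-key routing, assuming a fixed natural key (a user ID) rather than an auto-incrementing field; the shard names are hypothetical. Plain modulo hashing, used here for brevity, forces broad data migration when the shard count changes; production systems typically use consistent hashing to limit that movement.

```python
# Route records to shards by hashing a stable, immutable key.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id: str) -> str:
    """Derive the shard from a fixed natural key, never an
    auto-incrementing counter, so independent shards cannot
    mint colliding keys for different data."""
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("user-1001"))  # always maps to the same shard
```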
How does the Command and Query Responsibility Segregation (CQRS) Pattern enhance performance and scalability?
The CQRS Pattern enhances performance and scalability by segregating command (update/write) and query (read) operations into different services, allowing them to run on different nodes, optimize for their specific operations, and independently scale. It reduces data store contention and isolates operations needing higher security enforcement.
Page 203-204
We can separate commands (updates/writes) and queries (reads) by creating different services responsible for each (Figure 4-16). This not only facilitates running update-related and read-related services on different nodes, but also helps model services appropriate for those operations and scale the services independently.
The command and query should not have data store–specific information but rather have high-level data relevant to the application. When a command is issued to a service, it extracts the information from the message and updates the data store. Then it will send that information as an event asynchronously to the services that serve the queries, such that they can build their data model. The Event Sourcing pattern using a log-based queue system like Kafka can be used to pass the events between services. Through this, the query services can read data from the event queues and perform bulk updates on their local stores, in the optimal format for serving that data.
Distribute operations and reduce data contention
The Command and Query Responsibility Segregation Pattern can be used when cloud native applications have performance-intensive update operations, such as data and security validations or message transformations, or performance-intensive query operations containing complex joins or data mapping. When the same instance of the data store is used for both commands and queries, it can produce poor overall performance due to the higher load on the data store. Therefore, by splitting the command and query operations, CQRS not only eliminates the impact of one on the other, improving the performance and scalability of the system, but also helps isolate operations that need higher security enforcement.
Because the Command and Query Responsibility Segregation Pattern allows commands and queries to be executed in different stores, it also enables the command and query systems to have different scaling requirements.
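The following sketch (not from the book) shows the CQRS flow in miniature: a command handler writes to its own store and emits an event, and a projection builds a separate read-optimized view. An in-process queue stands in for a log-based broker such as Kafka; all names are invented.

```python
# CQRS in miniature: separate write path, event stream, and read model.
import queue

events = queue.Queue()   # stands in for the event stream (e.g., Kafka)
write_store = {}         # command-side store
read_view = {}           # query-side store, denormalized for reads

def handle_command(order_id, total):
    write_store[order_id] = {"total": total}           # validated write
    events.put({"order_id": order_id, "total": total}) # async notification

def project_events():
    while not events.empty():
        e = events.get()
        # Build whatever shape serves queries best (eventual consistency).
        read_view[e["order_id"]] = f"Order {e['order_id']}: ${e['total']:.2f}"

handle_command("A-100", 42.0)
project_events()
print(read_view["A-100"])  # Order A-100: $42.00
```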
Why is CQRS not recommended for applications requiring high consistency between command and query operations?
CQRS is not recommended for applications that require high consistency between command and query operations, because updates are sent asynchronously to the query stores via events, so only eventual consistency is achieved; obtaining high consistency through synchronous data replication can introduce lock contention and high latencies.
Page 205
Considerations: Because the Command and Query Responsibility Segregation Pattern segregates the command and query operations, it can provide high availability. Even if some command or query services become unavailable, the full system will not be halted. In the Command and Query Responsibility Segregation Pattern, we can scale the query operations infinitely, and with an appropriate number of replications, the query operations can provide guarantees of zero downtime. When scaling command operations, we might need to use patterns such as Data Sharding to partition data and eliminate potential merge conflicts.
CQRS is not recommended when high consistency is required between command and query operations. When data is updated, the updates are sent asynchronously to the query stores via events by using patterns such as Event Sourcing. Hence, use CQRS only when eventual consistency is tolerable. Achieving high consistency with synchronous data replication is not recommended in cloud native application environments as it can cause lock contention and introduce high latencies.
When using the Command and Query Responsibility Segregation Pattern, we may not be able to automatically generate separate command and query models by using tools such as object-relational mapping (ORM). Most of these tools use database schemas and usually produce combined models, so we may need to manually modify the models or write them from scratch.
What does the Materialized View Pattern accomplish, and how does it improve service performance?
The Materialized View Pattern replicates and moves data from dependent services to its local data store, building materialized views for efficient querying. It improves service performance by reducing the time to retrieve data, simplifying service logic, and providing resiliency by allowing operations to continue even when the source service is unavailable.
Page 210-212
The Materialized View Pattern replicates and moves data from dependent services to its local data store and builds materialized views (Figure 4-17). It also builds optimal views to efficiently query the data, similar to the Composite Data Services pattern.
The Materialized View Pattern asynchronously replicates data from the dependent services. If databases support asynchronous data replication, we can use it as a way to transfer data from one data store to another. Failing this, we need to use the Event Sourcing pattern and use event streams to replicate the data. The source service pushes each insert, delete, and update operation asynchronously to an event stream, and they get propagated to the services that build materialized views, where they will fetch and load the data to their local stores.
Even when we bring data into the same database, at times joining multiple tables can still be costly. In this case, we can use techniques like relational database views to consolidate data into an easily queryable materialized view.
Provide access to nonsensitive data hosted in secure systems
In some use cases, our caller service might depend on nonsensitive data that is behind a security layer, requiring the service to authenticate and go through validation checks before retrieving the data. Through the Materialized View Pattern, we can replicate the nonsensitive data relevant to the service and allow the caller service to access it directly from its local store. This approach not only removes unnecessary security checks and validations but also improves performance.
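A minimal sketch of building a materialized view by replaying change events from a source service; the event shapes and field names are invented for illustration:

```python
# Replay insert/update/delete events into a local, read-optimized view.
source_events = [
    {"op": "insert", "id": 1, "name": "Alice", "spend": 120.0},
    {"op": "insert", "id": 2, "name": "Bob", "spend": 30.0},
    {"op": "update", "id": 2, "spend": 75.0},
    {"op": "delete", "id": 1},
]

materialized_view = {}  # local copy kept in sync via the event stream

for event in source_events:
    if event["op"] == "delete":
        materialized_view.pop(event["id"], None)
    else:  # insert/update: merge changed fields into the local row
        row = materialized_view.setdefault(event["id"], {})
        row.update({k: v for k, v in event.items() if k not in ("op", "id")})

print(materialized_view)  # {2: {'name': 'Bob', 'spend': 75.0}}
```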
When is the Data Locality Pattern especially useful, and what considerations should be taken into account when using it?
The Data Locality Pattern is especially useful when retrieving data from multiple sources to perform data aggregation or filtering operations, as it reduces data transfer and improves bandwidth utilization. Consider not overloading data nodes and balance the trade-off between bandwidth savings and additional execution cost at data nodes.
Page 216-217
Reduce bandwidth usage when retrieving data
The Data Locality Pattern is especially useful when we need to retrieve data from multiple sources to perform data aggregation or filtering operations. The output of these queries will be significantly smaller than their input. By running the execution closer to the data source, we need to transfer only a small amount of data, which can improve bandwidth utilization. This is especially useful when data stores are huge and clients are geographically distributed, and it is a good approach when cloud native applications are experiencing bandwidth bottlenecks.
Considerations: Applying the Data Locality Pattern can also help utilize idle CPU resources at the data nodes. Most data nodes are I/O intensive, and when the queries they perform are simple enough, they may have plenty of CPU resources idling. Moving execution to the data nodes can better utilize these resources and optimize overall performance. We should be careful not to move all executions to the data nodes, as this can overload them and cause issues with data retrieval.
The Data Locality Pattern is not ideal when queries output most of their input. These cases will overload the data nodes without any savings to bandwidth or performance. Deciding when to use the Data Locality Pattern depends on the trade-off between bandwidth and CPU utilization. We recommend using the Data Locality Pattern when the gains achieved by reducing the data transfer are much greater than the additional execution cost incurred at the data nodes.
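The trade-off can be sketched as follows (invented dataset and predicate): moving a selective filter to the data node means only the small result crosses the network, whereas filtering at the client would require shipping every row first.

```python
# Data Locality sketch: compare what would cross the network in each case.
ROWS = [{"id": i, "value": i * 3 % 7} for i in range(100_000)]

def query_at_data_node(predicate):
    """Execution moved to the data node: only matching rows are shipped."""
    return [r for r in ROWS if predicate(r)]

matches = query_at_data_node(lambda r: r["value"] == 0)
print(f"ship {len(matches)} rows instead of {len(ROWS)}")
```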
What are the key functions and benefits of caching in data retrieval?
Caching stores previously processed or retrieved data so it can be reused without reprocessing or retrieving it again. It improves data-retrieval time, speeds up static content loading, and reduces data store contention. It can also provide high availability by relaxing the data store dependency.
Page 219-221
How it works
A cache is usually an in-memory data store used to store previously processed or retrieved data so we can reuse that data when required without reprocessing or retrieving it again. When a request is made to retrieve data, and we can find the necessary data stored in the cache, we have a cache hit. If the data is not available in the cache, we have a cache miss.
When a cache miss occurs, the system usually needs to process or fetch data from the data store, as well as update the cache with the retrieved data for future reference. This process is called a read-through cache operation. Similarly, when a request is made to update the data, we should update it in the data store and remove or invalidate any relevant previously fetched entries stored in the cache. This process is called a write-through cache operation. Here, invalidation is important, because when that data is requested again, the cache should not return the old data but should retrieve updated data from the store by using the read-through cache operation. This reading and updating behavior is commonly referred to as a cache aside, and most commercial caches support this feature by default.
Caching data can happen on either the client or server side, or both, and the cache itself can be local (storing data in one instance) or shared (storing data in a distributed manner).
Especially when the cache is not shared, it cannot keep on adding data, as it will eventually exhaust available memory. Hence, it uses eviction policies to remove some records to accommodate new ones. The most popular eviction policy is least recently used (LRU), which removes data that is not used for a long period to accommodate new entries. Other policies include first in, first out (FIFO), which removes the oldest loaded entry; most recently used (MRU), which removes the last-used entry; and trigger-based options that remove entries based on values in the trigger event. We should use the eviction policy appropriate for our use case.
When data is cached, data stored in the data store can be updated by other applications, so holding data for a long period in the cache can cause inconsistencies between the data in the cache and the store. This is handled by using an expiry time for each cache entry. This helps reload the data from the data store upon time-out and improves consistency between the cache and data store.
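A compact sketch (not from the book) combining the read-through and write-through behavior described above with LRU eviction and per-entry expiry; the backing store, capacity, and TTL are invented:

```python
# Read-through/write-through cache with LRU eviction and entry expiry.
import time
from collections import OrderedDict

DATABASE = {"user:1": "Alice", "user:2": "Bob"}   # slow backing store
MAX_ENTRIES, TTL_SECONDS = 2, 60.0

cache = OrderedDict()  # key -> (value, loaded_at); order tracks recency

def read_through(key):
    entry = cache.get(key)
    if entry and time.time() - entry[1] < TTL_SECONDS:
        cache.move_to_end(key)        # cache hit: mark as recently used
        return entry[0]
    value = DATABASE[key]             # cache miss or expired: hit the store
    cache[key] = (value, time.time())
    cache.move_to_end(key)
    if len(cache) > MAX_ENTRIES:      # LRU eviction to bound memory
        cache.popitem(last=False)
    return value

def write_through(key, value):
    DATABASE[key] = value             # update the store...
    cache.pop(key, None)              # ...and invalidate the stale entry

print(read_through("user:1"))  # miss, loaded from the store
print(read_through("user:1"))  # hit, served from the cache
```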
Improve time to retrieve data
Caching can be used when retrieving data from the data store requires much more time than retrieving from the cache. This is especially useful when the original store needs to perform complex operations or is deployed in a remote location, and hence the network latency is high.
Improve static content loading
Caching is best for static data or for data that is rarely updated. Especially when the data is static and can be stored in memory, we can load the full data set to the cache and configure the cache not to expire. This drastically improves data-retrieval time and eliminates the need to load the data from the original data source.
Reduce data store contention
Because it reduces the number of calls to the data store, we can use the Caching Pattern to reduce data store contention or when the store is overloaded with many concurrent requests. If the application consuming the data can tolerate inconsistencies, such as data being outdated by a few minutes, we can also deploy the Caching Pattern on write-intensive data stores to reduce the read load and improve the stability of the system. In this case, the data in the cache will eventually become consistent when the cache times out.
What are the benefits of prefetching data in a cache, and when should this technique be used?
Prefetching data improves data-retrieval time by loading the cache with data likely to be queried, reducing initial cache misses, and stress on the service and data store. This technique should be used when predictable query patterns exist, such as processing recent orders or anticipating user actions like fetching the next set of search results.
Page 221
Prefetch data to improve data-retrieval time
We can preload the cache fully or partially when we know the kind of queries that are more likely to be issued. For example, if we are processing orders and know that the applications will mostly call last week’s data, we can preload the cache with last week’s data when we start the service. This can provide better performance than loading data on demand. When preloading is omitted, the service and the data store can encounter high stress, as most of the initial requests will result in a cache miss.
The Caching Pattern can also be used when we know what data will be queried next. For example, if a user is searching for products on a retail website and we are rendering only the first 10 entries, the user will likely request the next 10 entries. Preloading the next 10 entries to the cache can save time when that data is needed.
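A small sketch of the order-processing example, assuming an invented fetch_orders_since query: the cache is warmed at startup with last week's orders so that initial requests do not all miss.

```python
# Prefetch: warm the cache with the data most likely to be queried.
import datetime

def fetch_orders_since(cutoff):
    # Stands in for a real data store query.
    return {"A-100": {"placed": cutoff, "total": 42.0}}

cache = {}

def prefetch_recent_orders():
    cutoff = datetime.date.today() - datetime.timedelta(days=7)
    cache.update(fetch_orders_since(cutoff))  # avoids a cold-start miss storm

prefetch_recent_orders()
print(len(cache), "orders preloaded")
```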
How can caching achieve high availability, and what are the benefits of using a distributed cache?
Caching can achieve high availability by handling service calls with cached data even when the backend data store is unavailable, and using a fallback mechanism with a shared or distributed cache. Distributed caches provide scalability and resiliency by partitioning and replicating data, bringing data closer to clients and yielding faster response times.
Page 222-223
Achieve high availability by relaxing the data store dependency
Caching can also be used to achieve high availability, especially when service availability is more important than the consistency of the data. We can handle service calls with cached data even when the backend data store is not available. As shown in Figure 4-20, we can also extend the Caching Pattern by making the local cache fall back on a shared or distributed cache, which in turn can fall back to the data store when the data is not present. The Caching Pattern can incorporate the Resilient Connectivity pattern with a circuit breaker, discussed in Chapter 3, for the fallback calls so that they can retry and gracefully reconnect when the backends become available after a failure.
When using a shared cache, we can also introduce a secondary cache instance as a standby and replicate the data to it, to improve availability. This allows our applications to fall back to the standby when the primary cache fails.
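A minimal sketch of the layered fallback described above, with dicts standing in for the local cache, shared cache, and data store; a real deployment would wrap the fallback calls in a circuit breaker as noted.

```python
# Layered lookup: local cache -> shared cache -> data store.
local_cache, shared_cache = {}, {"user:2": "Bob"}
data_store = {"user:1": "Alice", "user:2": "Bob"}

def get(key):
    if key in local_cache:                 # fastest tier
        return local_cache[key]
    if key in shared_cache:                # shared/distributed tier
        local_cache[key] = shared_cache[key]
        return local_cache[key]
    value = data_store[key]                # last resort: the backend store
    shared_cache[key] = local_cache[key] = value
    return value

print(get("user:2"))  # served from the shared cache, then cached locally
print(get("user:1"))  # served from the data store, then cached in both tiers
```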
Cache more data than a single node can hold
Distributed caching systems are an alternative when the local cache or shared cache cannot contain all the needed data. They also provide scalability and resiliency by partitioning and replicating data. These systems support read-through and write-through operations and can make direct calls to the data stores to retrieve and update data. We can also scale them by simply adding more cache servers as needed.
Though distributed caches can store lots of data, they are not as fast as the local cache and add more complexity to the system. We might need additional network hops to retrieve data, and we now need to manage an additional set of nodes. Most important, all nodes participating in the distributed cache should be within the same network and have relatively high bandwidth among one another; otherwise, they can also suffer data-synchronization delays. In contrast, when the clients are geographically distributed, a distributed cache can bring the data closer to the clients, yielding faster response times.