Chapter 4, Data Management Patterns GPT Flashcards
What are the unique characteristics of cloud native data compared to traditional data processing practices?
Cloud native data can be stored in many forms, in a variety of data formats and data stores; it does not maintain a fixed schema, and duplicating data is encouraged to favor availability and performance over consistency. Rather than accessing the same database directly, services are encouraged to call the APIs of the service that owns each data store. This provides separation of concerns and allows cloud native data to scale out.
Page 162
Just as cloud native microservices have characteristics such as being scalable, resilient, and manageable, cloud native data has its own unique characteristics that are quite different from traditional data processing practices. Most important, cloud native data can be stored in many forms, in a variety of data formats and data stores. It is not expected to maintain a fixed schema, and duplicate data is encouraged to facilitate availability and performance over consistency. Furthermore, in cloud native applications, multiple services are discouraged from accessing the same database directly; instead, they should call the APIs of the service that owns the data store. All of this provides separation of concerns and allows cloud native data to scale out.
What are stateless applications, and why are they simpler to implement and scale compared to stateful applications?
Stateless applications depend only on input and configuration data, so a failure or restart has almost no impact on their execution. In contrast, stateful applications also depend on state data, which makes them more complex to implement and scale: application failures can corrupt their state, leading to incorrect execution.
Page 162
Applications that depend only on input and configuration (config) data are called stateless applications. These applications are relatively simple to implement and scale because their failure or restart has almost no impact on their execution. In contrast, applications that depend on input, config, and state data—stateful applications—are much more complex to implement and scale. The state of the application is stored in data stores, so application failures can result in partial writes that corrupt their state, which can lead to incorrect execution of the application.
What are relational databases best suited for, and what principle do they follow for schema definition?
Relational databases are ideal for storing structured data that has a predefined schema and use Structured Query Language (SQL) for processing, storing, and accessing data. They follow the principle of defining schema on write, meaning the data schema is defined before writing the data to the database.
Page 165
Relational databases are ideal for storing structured data that has a predefined schema. These databases use Structured Query Language (SQL) for processing, storing, and accessing data. They also follow the principle of defining schema on write: the data schema is defined before writing the data to the database.
What are the advantages of using relational databases for cloud native application data?
Relational databases can optimally store and retrieve data using database indexing and normalization, provide transaction guarantees through ACID properties, and help deploy and scale the data along with microservices as a single deployment unit.
Page 165-166
Relational databases can optimally store and retrieve data by using database indexing and normalization. Because these databases support atomicity, consistency, isolation, and durability (ACID) properties, they can also provide transaction guarantees.
Relational databases are a good option for storing cloud native application data. We recommend using a relational database per microservice, as this will help deploy and scale the data along with the microservice as a single deployment unit.
What is the principle of schema on read, and which type of databases follow this principle?
The principle of schema on read means that the schema of the data is defined only at the time of accessing the data for processing, not when it is written to the disk. NoSQL databases follow this principle.
Page 166
NoSQL databases follow the principle of schema on read: the schema of the data is defined only at the time of accessing the data for processing, and not when it is written to the disk.
Why are NoSQL databases suitable for handling big data, and what is a general recommendation regarding their use for transaction guarantees?
NoSQL databases are designed for scalability and performance, making them suitable for handling big data. However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
Page 166-167
NoSQL databases are best suited to handling big data, as they are designed for scalability and performance.
However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
How does a column store database manage data, and what are some common examples?
A column store database stores multiple key (column) and value pairs in each of its rows, allowing for writing any number of columns during the write phase and specifying only the columns of interest during data retrieval. Examples include Apache Cassandra and Apache HBase.
Page 167
A column store stores multiple key (column) and value pairs in each of its rows, as shown in Figure 4-2. These stores are a good example of schema on read: we can write any number of columns during the write phase, and when data is retrieved, we can specify only the columns we are interested in processing. The most widely used column store is Apache Cassandra. For those who use big data and Apache Hadoop infrastructure, Apache HBase can be an option, as it is part of the Hadoop ecosystem.
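As an illustration (not from the book), here is a minimal Python sketch of the schema-on-read idea behind column stores: each row carries its own set of columns, and the reader names only the columns it wants. The row keys and column names are invented for this example.

```python
# Schema on read: rows hold arbitrary (column, value) pairs, and the
# schema is effectively decided by the reader at query time.
rows = {
    "user:1001": {"name": "Alice", "email": "alice@example.com"},
    "user:1002": {"name": "Bob", "last_login": "2023-05-01", "plan": "pro"},
}

def read_columns(row_key, columns):
    """Return only the requested columns; rows may lack any of them."""
    row = rows.get(row_key, {})
    return {col: row[col] for col in columns if col in row}

print(read_columns("user:1002", ["name", "plan"]))
# {'name': 'Bob', 'plan': 'pro'}
```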
What type of data is stored in a document store, and which databases are popular for this purpose?
A document store can store semi-structured data such as JSON and XML documents, allowing processing with JSON and XML path expressions. Popular document stores include MongoDB, Apache CouchDB, and Couchbase.
Page 169
A document store can store semi-structured data such as JSON and XML documents. This also allows us to process stored documents by using JSON and XML path expressions. These data stores are popular as they can store JSON and XML messages, which are usually used by frontend applications and APIs for communication. MongoDB, Apache CouchDB, and Couchbase are popular options for storing JSON documents.
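As a hedged illustration (assuming a MongoDB instance on localhost and an invented shop/orders collection), a JSON document can be stored and then queried by a path expression with pymongo:

```python
# Requires the pymongo package and a running MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # database and collection names are hypothetical

orders.insert_one({
    "order_id": "A-100",
    "customer": {"name": "Alice", "country": "NZ"},
    "items": [{"sku": "B-1", "qty": 2}],
})

# Dot notation acts as a path expression into the nested JSON document.
doc = orders.find_one({"customer.country": "NZ"})
print(doc["order_id"])  # A-100
```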
What is the CAP theorem, and how does it apply to NoSQL stores?
The CAP theorem states that a distributed application can provide either full availability or consistency, but not both, while ensuring partition tolerance. Availability means the system is fully functional when some nodes are down, consistency means an update/change in one node is immediately propagated to others, and partition tolerance means the system can work even when some nodes cannot connect to each other.
Page 169
NoSQL stores are distributed, so they need to adhere to the CAP theorem; CAP stands for consistency, availability, and partition tolerance. This theorem states that a distributed application can provide either full availability or consistency; we cannot achieve both while providing network partition tolerance. Here, availability means that the system is fully functional when some of its nodes are down, consistency means an update/change in one node is immediately propagated to other nodes, and partition tolerance means that the system can continue to work even when some nodes cannot connect to each other.
Why is filesystem storage preferred for unstructured data in cloud native applications?
Filesystem storage is preferred for unstructured data because it optimizes data storage and retrieval without trying to understand the data. It can also be used to store large application data as a cache, which can be cheaper than retrieving data repeatedly over the network.
Page 171
Filesystem storage is the best option for storing unstructured data in cloud native applications. Unlike NoSQL stores, it does not try to understand the data but rather purely optimizes data storage and retrieval. We can also use filesystem storage to store large application data as a cache, as this can be cheaper than retrieving the data repeatedly over the network.
When should cloud native applications use relational data stores, NoSQL stores, or filesystem storage?
Cloud native applications should use relational data stores when they need transactional guarantees and the data is tightly coupled with the application. Semi-structured or unstructured fields should be separated out and stored in NoSQL or filesystem stores to achieve scalability while preserving transactional guarantees. NoSQL is also suitable when the data is extremely large, needs querying capability, or fits a specialized use case such as graph processing.
Page 172
Cloud native applications should use relational data stores when they need transactional guarantees and when data needs to be tightly coupled with the application.
When data contains semi-structured or unstructured fields, they can be separated and stored in NoSQL or filesystem stores to achieve scalability while still preserving transactional guarantees. Applications can choose to store data in NoSQL stores when the data quantity is extremely large, the data needs querying capability or is semi-structured, or the data store is specialized enough to handle the specific application use case, such as graph processing.
What are the advantages and disadvantages of centralized data management in traditional data-centric applications?
Centralized data management allows data normalization for high consistency and enables running stored procedures across multiple tables for faster retrieval. However, it introduces tight coupling between applications, hinders their ability to evolve independently, and is considered an antipattern for cloud native applications.
Page 172
Centralized data management is the most common type in traditional data-centric applications. In this approach, all data is stored in a single database, and multiple components of the application are allowed to access the data for processing (Figure 4-3).
This approach has several advantages; for instance, the data in these database tables can be normalized, providing high data consistency. Furthermore, as components can access all the tables, the centralized data storage provides the ability to run stored procedures across multiple tables and to retrieve results faster. On the other hand, this approach introduces tight coupling between applications and hinders their ability to evolve independently. Therefore, it is considered an antipattern when building cloud native applications.
How does decentralized data management benefit microservices, and what are its potential disadvantages?
Decentralized data management allows scaling microservices independently, improving development time and release cycles, and solving data management and ownership problems. However, it can increase the cost of running separate data stores for each service.
Page 174
In decentralized data management, each independent functional component can be modeled as a microservice with a separate data store exclusive to it. This approach, illustrated in Figure 4-4, allows us to scale microservices independently without impacting other microservices.
Although application owners have less freedom to manage or evolve the data as a whole, segregating the data within each microservice so that it is managed by its own team/owners not only solves data management and ownership problems, but also shortens the development time of new feature implementations and release cycles.
Decentralized data management allows services to choose the most appropriate data store for their use case. For example, a Payment service may use a relational database to perform transactions, while an Inquiry service may use a document store to store the details of the inquiry, and a Shopping Cart service may use a distributed key-value store to store the items picked by the customer.
One of the disadvantages of decentralized data management is the cost of running separate data stores for each service.
What is hybrid data management, and how does it help with data protection and security enforcement?
Hybrid data management helps achieve compliance with modern data-protection laws and ease security enforcement by having customer data managed via a few microservices within a secured bounded context. It provides ownership of the data to one or a few well-trained teams to apply data-protection policies.
Page 175
Hybrid data management helps achieve compliance with modern data-protection laws and eases security enforcement because the data resides in a central place. Therefore, it is advisable to have all customer data managed via a few microservices within a secured bounded context, and to give ownership of the data to one or a few well-trained teams that apply data-protection policies.
What benefits does exposing data as a data service provide, and in what situations is the Data Service Pattern useful?
Exposing data as a data service allows control over data presentation, security, and priority-based throttling. The Data Service Pattern is useful when data does not belong to a specific microservice and multiple microservices depend on it, or for exposing legacy on-premises or proprietary data stores to cloud native applications.
Page 180, 182
Exposing data as a data service, shown in Figure 4-10, gives us more control over that data. It allows us to present data in various compositions to various clients, apply security, and enforce priority-based throttling, allowing only critical services to access data during resource-constrained situations such as load spikes or system failures.
These data services can perform simple read and write operations to a database or even perform complex logic such as joining multiple tables or running stored procedures to build responses much more efficiently. These data services can also utilize caching to enhance their read performance.
We can use the Data Service Pattern when the data does not belong to any particular microservice; no microservice is the rightful owner of that data, yet multiple microservices are depending on it for their operation. In such cases, the common data should be exposed as an independent data service, allowing all dependent applications to access the data via APIs.
We can also use the Data Service Pattern to expose legacy on-premises or proprietary data stores to other cloud native applications.
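To make the idea concrete, here is a minimal sketch (not from the book) of a data service that owns a shared product catalog and exposes it only through an API, using Flask; the endpoint and data are invented:

```python
# Sketch of a data service owning common data that several microservices
# depend on; consumers go through the API, never the store itself.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Stands in for the data store this service exclusively owns.
CATALOG = {"B-1": {"name": "Widget", "price": 9.99}}

@app.route("/products/<sku>")
def get_product(sku):
    product = CATALOG.get(sku)
    if product is None:
        abort(404)  # the underlying store is never exposed directly
    return jsonify(product)

if __name__ == "__main__":
    app.run(port=8080)
```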
Why is accessing the same data via multiple microservices considered an antipattern, and how can the Data Service Pattern help?
Accessing the same data via multiple microservices introduces tight coupling and hinders scalability and independent evolution of microservices. The Data Service Pattern helps reduce coupling by providing managed APIs to access data.
Page 183
Considerations: When building cloud native applications, accessing the same data via multiple microservices is considered an antipattern. It introduces tight coupling between the microservices and does not allow them to scale and evolve on their own. The Data Service Pattern can help reduce coupling by providing managed APIs to access data.
The Data Service Pattern should not be used when the data can clearly be associated with an existing microservice, as introducing unnecessary microservices will cause additional management complexity.
What is the primary purpose of the Sharding Pattern, and what should be avoided when generating shard keys?
The primary purpose of the Sharding Pattern is to improve data retrieval time by distributing data across multiple shards. When generating shard keys, avoid using auto-incrementing fields and ensure the fields that contribute to the shard key remain fixed to avoid time-consuming data migration.
Page 198, 200
For sharding to be useful, the data should contain one or a collection of fields that uniquely identifies the data or meaningfully groups it into subsets. The combination of these fields generates the shard/partition key that will be used to locate the data. The values stored in the fields that contribute to the shard key should be fixed and never be changed upon data updates. This is because when they change, they will also change the shard key, and if the updated shard key now points to a different shard location, the data also needs to be migrated from the current shard to the new shard location. Moving data among shards is time-consuming, so this should be avoided at all costs.
We don’t recommend using auto-incrementing fields when generating shard keys. Shards do not communicate with each other, so if auto-incrementing fields are used, multiple shards may generate the same keys, each referring to different data locally. This can become a problem when the data is redistributed during data-rebalancing operations.
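A minimal sketch of shard-key routing, assuming a fixed natural key (a user ID) rather than an auto-incrementing field; the shard names are hypothetical. Plain modulo hashing, used here for brevity, forces broad data migration when the shard count changes; production systems typically use consistent hashing to limit that movement.

```python
# Route records to shards by hashing a stable, immutable key.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id: str) -> str:
    """Derive the shard from a fixed natural key, never an
    auto-incrementing counter, so independent shards cannot
    mint colliding keys for different data."""
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("user-1001"))  # always maps to the same shard
```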
How does the Command and Query Responsibility Segregation (CQRS) Pattern enhance performance and scalability?
The CQRS Pattern enhances performance and scalability by segregating command (update/write) and query (read) operations into different services, allowing them to run on different nodes, optimize for their specific operations, and independently scale. It reduces data store contention and isolates operations needing higher security enforcement.
Page 203-204
We can separate commands (updates/writes) and queries (reads) by creating different services responsible for each (Figure 4-16). This not only facilitates running update-related and read-related services on different nodes, but also helps model services appropriate for those operations and scale the services independently.
The command and query should not have data store–specific information but rather have high-level data relevant to the application. When a command is issued to a service, it extracts the information from the message and updates the data store. Then it will send that information as an event asynchronously to the services that serve the queries, such that they can build their data model. The Event Sourcing pattern using a log-based queue system like Kafka can be used to pass the events between services. Through this, the query services can read data from the event queues and perform bulk updates on their local stores, in the optimal format for serving that data.
Distribute operations and reduce data contention
The Command and Query Responsibility Segregation Pattern can be used when cloud native applications have performance-intensive update operations, such as data and security validations or message transformations, or performance-intensive query operations containing complex joins or data mapping. When the same instance of the data store is used for both commands and queries, it can produce poor overall performance due to the higher load on the data store. Therefore, by splitting the command and query operations, CQRS not only eliminates the impact of one on the other, improving the performance and scalability of the system, but also helps isolate operations that need higher security enforcement.
Because the Command and Query Responsibility Segregation Pattern allows commands and queries to be executed in different stores, it also enables the command and query systems to have different scaling requirements.
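The following sketch (not from the book) shows the CQRS flow in miniature: a command handler writes to its own store and emits an event, and a projection builds a separate read-optimized view. An in-process queue stands in for a log-based broker such as Kafka; all names are invented.

```python
# CQRS in miniature: separate write path, event stream, and read model.
import queue

events = queue.Queue()   # stands in for the event stream (e.g., Kafka)
write_store = {}         # command-side store
read_view = {}           # query-side store, denormalized for reads

def handle_command(order_id, total):
    write_store[order_id] = {"total": total}           # validated write
    events.put({"order_id": order_id, "total": total}) # async notification

def project_events():
    while not events.empty():
        e = events.get()
        # Build whatever shape serves queries best (eventual consistency).
        read_view[e["order_id"]] = f"Order {e['order_id']}: ${e['total']:.2f}"

handle_command("A-100", 42.0)
project_events()
print(read_view["A-100"])  # Order A-100: $42.00
```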
Why is CQRS not recommended for applications requiring high consistency between command and query operations?
CQRS is not recommended for applications that require high consistency between command and query operations, because updates are sent asynchronously to the query stores via events, so only eventual consistency is achieved; obtaining high consistency through synchronous data replication can introduce lock contention and high latencies.
Page 205
Considerations: Because the Command and Query Responsibility Segregation Pattern segregates the command and query operations, it can provide high availability. Even if some command or query services become unavailable, the full system will not be halted. In the Command and Query Responsibility Segregation Pattern, we can scale the query operations infinitely, and with an appropriate number of replications, the query operations can provide guarantees of zero downtime. When scaling command operations, we might need to use patterns such as Data Sharding to partition data and eliminate potential merge conflicts.
CQRS is not recommended when high consistency is required between command and query operations. When data is updated, the updates are sent asynchronously to the query stores via events by using patterns such as Event Sourcing. Hence, use CQRS only when eventual consistency is tolerable. Achieving high consistency with synchronous data replication is not recommended in cloud native application environments as it can cause lock contention and introduce high latencies.
When using the Command and Query Responsibility Segregation Pattern, we may not be able to automatically generate separate command and query models by using tools such as object-relational mapping (ORM). Most of these tools use database schemas and usually produce combined models, so we may need to manually modify the models or write them from scratch.
What does the Materialized View Pattern accomplish, and how does it improve service performance?
The Materialized View Pattern replicates and moves data from dependent services to its local data store, building materialized views for efficient querying. It improves service performance by reducing the time to retrieve data, simplifying service logic, and providing resiliency by allowing operations to continue even when the source service is unavailable.
Page 210-212
The Materialized View Pattern replicates and moves data from dependent services to its local data store and builds materialized views (Figure 4-17). It also builds optimal views to efficiently query the data, similar to the Composite Data Services pattern.
The Materialized View Pattern asynchronously replicates data from the dependent services. If databases support asynchronous data replication, we can use it as a way to transfer data from one data store to another. Failing this, we need to use the Event Sourcing pattern and use event streams to replicate the data. The source service pushes each insert, delete, and update operation asynchronously to an event stream, and they get propagated to the services that build materialized views, where they will fetch and load the data to their local stores.
Even when we bring data into the same database, at times joining multiple tables can still be costly. In this case, we can use techniques like relational database views to consolidate data into an easily queryable materialized view.
Provide access to nonsensitive data hosted in secure systems
In some use cases, our caller service might depend on nonsensitive data that is behind a security layer, requiring the service to authenticate and go through validation checks before retrieving the data. Through the Materialized View Pattern, we can replicate the nonsensitive data relevant to the service and allow the caller service to access it directly from its local store. This approach not only removes unnecessary security checks and validations but also improves performance.
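A minimal sketch of building a materialized view by replaying change events from a source service; the event shapes and field names are invented for illustration:

```python
# Replay insert/update/delete events into a local, read-optimized view.
source_events = [
    {"op": "insert", "id": 1, "name": "Alice", "spend": 120.0},
    {"op": "insert", "id": 2, "name": "Bob", "spend": 30.0},
    {"op": "update", "id": 2, "spend": 75.0},
    {"op": "delete", "id": 1},
]

materialized_view = {}  # local copy kept in sync via the event stream

for event in source_events:
    if event["op"] == "delete":
        materialized_view.pop(event["id"], None)
    else:  # insert/update: merge changed fields into the local row
        row = materialized_view.setdefault(event["id"], {})
        row.update({k: v for k, v in event.items() if k not in ("op", "id")})

print(materialized_view)  # {2: {'name': 'Bob', 'spend': 75.0}}
```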
When is the Data Locality Pattern especially useful, and what considerations should be taken into account when using it?
The Data Locality Pattern is especially useful when retrieving data from multiple sources to perform data aggregation or filtering operations, as it reduces data transfer and improves bandwidth utilization. Consider not overloading data nodes and balance the trade-off between bandwidth savings and additional execution cost at data nodes.
Page 216-217
Reduce bandwidth usage when retrieving data
The Data Locality Pattern is especially useful when we need to retrieve data from multiple sources to perform data aggregation or filtering operations. The output of these queries will be significantly smaller than their input. By running the execution closer to the data source, we need to transfer only a small amount of data, which can improve bandwidth utilization. This is especially useful when data stores are huge and clients are geographically distributed, and it is a good approach when cloud native applications are experiencing bandwidth bottlenecks.
Considerations: Applying the Data Locality Pattern can also help utilize idle CPU resources at the data nodes. Most data nodes are I/O intensive, and when the queries they perform are simple enough, they may have plenty of CPU resources idling. Moving execution to the data nodes can better utilize these resources and optimize overall performance. We should be careful not to move all executions to the data nodes, as this can overload them and cause issues with data retrieval.
The Data Locality Pattern is not ideal when queries output most of their input. These cases will overload the data nodes without any savings to bandwidth or performance. Deciding when to use the Data Locality Pattern depends on the trade-off between bandwidth and CPU utilization. We recommend using the Data Locality Pattern when the gains achieved by reducing the data transfer are much greater than the additional execution cost incurred at the data nodes.
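The trade-off can be sketched as follows (invented dataset and predicate): moving a selective filter to the data node means only the small result crosses the network, whereas filtering at the client would require shipping every row first.

```python
# Data Locality sketch: compare what would cross the network in each case.
ROWS = [{"id": i, "value": i * 3 % 7} for i in range(100_000)]

def query_at_data_node(predicate):
    """Execution moved to the data node: only matching rows are shipped."""
    return [r for r in ROWS if predicate(r)]

matches = query_at_data_node(lambda r: r["value"] == 0)
print(f"ship {len(matches)} rows instead of {len(ROWS)}")
```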
What are the key functions and benefits of caching in data retrieval?
Caching stores previously processed or retrieved data so it can be reused without reprocessing or retrieving it again. It improves data-retrieval time, speeds up static content loading, and reduces data store contention. It can also provide high availability by relaxing the data store dependency.
Page 219-221
How it works
A cache is usually an in-memory data store used to store previously processed or retrieved data so we can reuse that data when required without reprocessing or retrieving it again. When a request is made to retrieve data, and we can find the necessary data stored in the cache, we have a cache hit. If the data is not available in the cache, we have a cache miss.
When a cache miss occurs, the system usually needs to process or fetch data from the data store, as well as update the cache with the retrieved data for future reference. This process is called a read-through cache operation. Similarly, when a request is made to update the data, we should update it in the data store and remove or invalidate any relevant previously fetched entries stored in the cache. This process is called a write-through cache operation. Here, invalidation is important, because when that data is requested again, the cache should not return the old data but should retrieve updated data from the store by using the read-through cache operation. This reading and updating behavior is commonly referred to as a cache aside, and most commercial caches support this feature by default.
Caching data can happen on either the client or server side, or both, and the cache itself can be local (storing data in one instance) or shared (storing data in a distributed manner).
Especially when the cache is not shared, it cannot keep on adding data, as it will eventually exhaust available memory. Hence, it uses eviction policies to remove some records to accommodate new ones. The most popular eviction policy is least recently used (LRU), which removes data that is not used for a long period to accommodate new entries. Other policies include first in, first out (FIFO), which removes the oldest loaded entry; most recently used (MRU), which removes the last-used entry; and trigger-based options that remove entries based on values in the trigger event. We should use the eviction policy appropriate for our use case.
When data is cached, data stored in the data store can be updated by other applications, so holding data for a long period in the cache can cause inconsistencies between the data in the cache and the store. This is handled by using an expiry time for each cache entry. This helps reload the data from the data store upon time-out and improves consistency between the cache and data store.
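A compact sketch (not from the book) combining the read-through and write-through behavior described above with LRU eviction and per-entry expiry; the backing store, capacity, and TTL are invented:

```python
# Read-through/write-through cache with LRU eviction and entry expiry.
import time
from collections import OrderedDict

DATABASE = {"user:1": "Alice", "user:2": "Bob"}   # slow backing store
MAX_ENTRIES, TTL_SECONDS = 2, 60.0

cache = OrderedDict()  # key -> (value, loaded_at); order tracks recency

def read_through(key):
    entry = cache.get(key)
    if entry and time.time() - entry[1] < TTL_SECONDS:
        cache.move_to_end(key)        # cache hit: mark as recently used
        return entry[0]
    value = DATABASE[key]             # cache miss or expired: hit the store
    cache[key] = (value, time.time())
    cache.move_to_end(key)
    if len(cache) > MAX_ENTRIES:      # LRU eviction to bound memory
        cache.popitem(last=False)
    return value

def write_through(key, value):
    DATABASE[key] = value             # update the store...
    cache.pop(key, None)              # ...and invalidate the stale entry

print(read_through("user:1"))  # miss, loaded from the store
print(read_through("user:1"))  # hit, served from the cache
```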
Improve time to retrieve data
Caching can be used when retrieving data from the data store requires much more time than retrieving from the cache. This is especially useful when the original store needs to perform complex operations or is deployed in a remote location, and hence the network latency is high.
Improve static content loading
Caching is best for static data or for data that is rarely updated. Especially when the data is static and can be stored in memory, we can load the full data set to the cache and configure the cache not to expire. This drastically improves data-retrieval time and eliminates the need to load the data from the original data source.
Reduce data store contention
Because it reduces the number of calls to the data store, we can use the Caching Pattern to reduce data store contention or when the store is overloaded with many concurrent requests. If the application consuming the data can tolerate inconsistencies, such as data being outdated by a few minutes, we can also deploy the Caching Pattern on write-intensive data stores to reduce the read load and improve the stability of the system. In this case, the data in the cache will eventually become consistent when the cache times out.
What are the benefits of prefetching data in a cache, and when should this technique be used?
Prefetching data improves data-retrieval time by loading the cache with data likely to be queried, reducing initial cache misses, and stress on the service and data store. This technique should be used when predictable query patterns exist, such as processing recent orders or anticipating user actions like fetching the next set of search results.
Page 221
Prefetch data to improve data-retrieval time
We can preload the cache fully or partially when we know the kind of queries that are more likely to be issued. For example, if we are processing orders and know that the applications will mostly call last week’s data, we can preload the cache with last week’s data when we start the service. This can provide better performance than loading data on demand. When preloading is omitted, the service and the data store can encounter high stress, as most of the initial requests will result in a cache miss.
The Caching Pattern can also be used when we know what data will be queried next. For example, if a user is searching for products on a retail website and we are rendering only the first 10 entries, the user will likely request the next 10 entries. Preloading the next 10 entries to the cache can save time when that data is needed.
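A small sketch of the order-processing example, assuming an invented fetch_orders_since query: the cache is warmed at startup with last week's orders so that initial requests do not all miss.

```python
# Prefetch: warm the cache with the data most likely to be queried.
import datetime

def fetch_orders_since(cutoff):
    # Stands in for a real data store query.
    return {"A-100": {"placed": cutoff, "total": 42.0}}

cache = {}

def prefetch_recent_orders():
    cutoff = datetime.date.today() - datetime.timedelta(days=7)
    cache.update(fetch_orders_since(cutoff))  # avoids a cold-start miss storm

prefetch_recent_orders()
print(len(cache), "orders preloaded")
```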
How can caching achieve high availability, and what are the benefits of using a distributed cache?
Caching can achieve high availability by handling service calls with cached data even when the backend data store is unavailable, and using a fallback mechanism with a shared or distributed cache. Distributed caches provide scalability and resiliency by partitioning and replicating data, bringing data closer to clients and yielding faster response times.
Page 222-223
Achieve high availability by relaxing the data store dependency
Caching can also be used to achieve high availability, especially when service availability is more important than the consistency of the data. We can handle service calls with cached data even when the backend data store is not available. As shown in Figure 4-20, we can also extend the Caching Pattern by making the local cache fall back on a shared or distributed cache, which in turn can fall back to the data store when the data is not present. The Caching Pattern can incorporate the Resilient Connectivity pattern with a circuit breaker, discussed in Chapter 3, for the fallback calls so that they can retry and gracefully reconnect when the backends become available after a failure.
When using a shared cache, we can also introduce a secondary cache instance as a standby and replicate the data to it, to improve availability. This allows our applications to fall back to the standby when the primary cache fails.
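A minimal sketch of the layered fallback described above, with dicts standing in for the local cache, shared cache, and data store; a real deployment would wrap the fallback calls in a circuit breaker as noted.

```python
# Layered lookup: local cache -> shared cache -> data store.
local_cache, shared_cache = {}, {"user:2": "Bob"}
data_store = {"user:1": "Alice", "user:2": "Bob"}

def get(key):
    if key in local_cache:                 # fastest tier
        return local_cache[key]
    if key in shared_cache:                # shared/distributed tier
        local_cache[key] = shared_cache[key]
        return local_cache[key]
    value = data_store[key]                # last resort: the backend store
    shared_cache[key] = local_cache[key] = value
    return value

print(get("user:2"))  # served from the shared cache, then cached locally
print(get("user:1"))  # served from the data store, then cached in both tiers
```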
Cache more data than a single node can hold
Distributed caching systems are an alternative when the local cache or shared cache cannot contain all the needed data. They also provide scalability and resiliency by partitioning and replicating data. These systems support read-through and write-through operations and can make direct calls to the data stores to retrieve and update data. We can also scale them by simply adding more cache servers as needed.
Though distributed caches can store lots of data, they are not as fast as the local cache and add more complexity to the system. We might need additional network hops to retrieve data, and we now need to manage an additional set of nodes. Most important, all nodes participating in the distributed cache should be within the same network and have relatively high bandwidth among one another; otherwise, they can also suffer data-synchronization delays. In contrast, when the clients are geographically distributed, a distributed cache can bring the data closer to the clients, yielding faster response times.