Data-Intensive Applications Flashcards

1
Q

Impedance Mismatch

A

The mismatch that arises when converting between objects in application code and rows in a SQL table, and vice versa.
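
A hypothetical sketch of the mismatch: one in-memory object with a nested list has to be flattened into rows for two tables (the class and table names are illustrative, not from any particular ORM).

```python
# One object in code becomes rows in two relational tables.

class User:
    def __init__(self, user_id, name, emails):
        self.user_id = user_id
        self.name = name
        self.emails = emails  # a nested list: natural in code, not in a single row

def to_rows(user):
    """Flatten one object into rows for a `users` and a `user_emails` table."""
    users_row = (user.user_id, user.name)
    email_rows = [(user.user_id, e) for e in user.emails]
    return users_row, email_rows

alice = User(1, "Alice", ["a@example.com", "a@work.example"])
user_row, email_rows = to_rows(alice)
```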

2
Q

Another name for rows in a DB?

A

Tuples

3
Q

Schema-on-read

A

The structure of the data is implicit and only interpreted when the data is read.

4
Q

Schema-on-write

A

The traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it.
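
The contrast with schema-on-read can be sketched in a few lines (illustrative only, not tied to any particular database):

```python
import json

SCHEMA = {"name", "age"}  # the explicit schema enforced on write

def write_validated(record, store):
    """Schema-on-write: reject data that does not conform to the schema."""
    if set(record) != SCHEMA:
        raise ValueError("record does not conform to schema")
    store.append(json.dumps(record))

def read_flexible(raw):
    """Schema-on-read: structure is interpreted only when the data is read."""
    record = json.loads(raw)
    return record.get("name", "<unknown>")

store = []
write_validated({"name": "Alice", "age": 30}, store)
```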

5
Q

Document database

A

Data structures are self-contained documents, so a JSON representation can be quite appropriate.

Document oriented databases: MongoDB, RethinkDB, CouchDB and Espresso.

6
Q

Imperative Languages

A

Tells the computer to perform tasks in a particular order. This makes them hard to parallelize across multiple cores and machines.

7
Q

Declarative languages

A

Specify the pattern of the results, not the algorithm used to compute them. This lends them to parallel execution.
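
A minimal sketch of the difference, using Python stand-ins (a loop versus a comprehension) for the imperative and declarative styles:

```python
# Imperative: spell out the order of operations step by step.
def even_squares_imperative(numbers):
    result = []
    for n in numbers:          # explicit iteration order
        if n % 2 == 0:
            result.append(n * n)
    return result

# Declarative: state the pattern of the result; the runtime decides how.
def even_squares_declarative(numbers):
    return [n * n for n in numbers if n % 2 == 0]
```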

8
Q

Explain MapReduce?

A

In a real-world computing context, MapReduce is a programming model used for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. It’s popularly used in big data applications and was a key component of the original Google search engine’s infrastructure.

Imagine you’re a teacher with a large class of students. You need to find out the total number of books each student read over the summer. Instead of doing this task by yourself, which would be quite time-consuming, you employ the MapReduce strategy.

  1. Map: You divide the task among your students. You ask each student to make a list of the books they’ve read over the summer and count them. This division and individual counting is like the “map” phase of MapReduce.
  2. Shuffle & Sort: After your students complete their lists, you gather the lists together. You arrange them in order, making it easier for you to count.
  3. Reduce: Lastly, you go through the sorted list and sum up the total number of books read. This final aggregation is like the “reduce” phase of MapReduce.
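The teacher/books analogy above can be sketched as a minimal in-process MapReduce (names and data are illustrative; real frameworks distribute these phases across machines):

```python
from collections import defaultdict

def map_phase(student, books):
    # map: emit one (key, value) pair per book read
    return [(student, 1) for _ in books]

def shuffle(pairs):
    # shuffle & sort: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: merge all values associated with the same key
    return {key: sum(values) for key, values in groups.items()}

records = {"Ana": ["Dune", "Emma"], "Ben": ["It"]}
pairs = [p for s, b in records.items() for p in map_phase(s, b)]
totals = reduce_phase(shuffle(pairs))
```
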
9
Q

Which data model is appropriate for one-to-many (tree-structured) relationships or no relationships?

A

Document model is appropriate.

10
Q

Which data model is appropriate if you have many-to-many relationships?

A

The relational model can handle simple cases, but as the connections within your data become more complex, it becomes more natural to model your data as a graph.

11
Q

What is the basic structure of a Property Graph

A

Each Vertex has a unique identifier, set of outgoing & set of incoming edges plus a collection of properties (key-value pairs).

Each Edge has a unique identifier, tail vertex (start), head vertex (end), label to describe relationship between two vertices and a collection of properties (key-value pairs).
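
A skeleton of those two structures, assuming illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: str
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list)  # edge ids
    incoming: list = field(default_factory=list)  # edge ids

@dataclass
class Edge:
    eid: str
    tail: str   # start vertex id
    head: str   # end vertex id
    label: str  # describes the relationship between the two vertices
    properties: dict = field(default_factory=dict)

def add_edge(vertices, edges, edge):
    edges[edge.eid] = edge
    vertices[edge.tail].outgoing.append(edge.eid)
    vertices[edge.head].incoming.append(edge.eid)

vertices = {"v1": Vertex("v1", {"name": "Lucy"}),
            "v2": Vertex("v2", {"name": "Idaho"})}
edges = {}
add_edge(vertices, edges, Edge("e1", "v1", "v2", "born_in", {"year": 1989}))
```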

12
Q

When is it beneficial to use asynchronous programming models?

A

Asynchronous programming models are beneficial when dealing with I/O-bound operations, such as network requests or file operations. By allowing tasks to run concurrently and asynchronously, it helps improve responsiveness and resource utilization in scenarios where tasks spend a significant amount of time waiting for external operations to complete.

With asynchronous programming, you can utilize asynchronous calls to the API, allowing the application to continue executing other tasks while waiting for the response. Here’s a step-by-step breakdown:

  1. The application sends an asynchronous request to the API using a designated function or library. This function typically takes a callback function as a parameter or returns a Promise object.
  2. Instead of waiting for the API response, the application can continue executing other tasks while the request is being processed by the API. This ensures that the application remains responsive and can perform other operations in the meantime.
  3. Once the API response is received, the callback function (or the resolved Promise) is triggered, allowing the application to handle the response. This callback function typically takes the response data as a parameter and contains the logic to process and display the data.
  4. The application can then update the webpage or perform any necessary operations with the received data.
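The four steps above can be sketched with a thread pool and a callback (here `fetch` is a hypothetical stand-in for a slow API call):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    time.sleep(0.01)          # simulate waiting on I/O
    return {"url": url, "status": 200}

responses = []

def on_response(future):
    # step 3: the callback is triggered once the response is received
    responses.append(future.result())

with ThreadPoolExecutor() as pool:
    future = pool.submit(fetch, "https://api.example.com/data")  # step 1
    future.add_done_callback(on_response)
    # step 2: the main thread is free to do other work here
# step 4: process the received data once the callback has run
```
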
13
Q

When is it beneficial to use threading models?

A

Threading models are beneficial in scenarios involving CPU-bound tasks or parallel processing, where tasks require heavy computation and can benefit from utilizing multiple processor cores. Threading allows for true parallelism, enabling tasks to execute simultaneously and speed up the overall processing time.

14
Q

Describe how to use async/await functions.

A

async/await:
  • The async/await syntax is a modern approach that makes asynchronous code appear more synchronous and easier to read.
  • It is built on top of Promises and provides a way to write asynchronous code that looks similar to synchronous code.
  • The async keyword is used to declare an asynchronous function, and the await keyword is used to wait for the Promise to be resolved before continuing execution.
  • A try/catch block can be used to handle any errors that occur within the async function.
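
The card describes JavaScript Promises, but the same ideas carry over to Python's asyncio; a minimal analogue (function names are illustrative):

```python
import asyncio

async def fetch_data(should_fail=False):
    await asyncio.sleep(0.01)   # await suspends until the awaitable resolves
    if should_fail:
        raise RuntimeError("request failed")
    return {"status": "ok"}

async def main():
    try:                        # try/except handles errors inside the async code
        return await fetch_data()
    except RuntimeError:
        return {"status": "error"}
```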

15
Q

Describe the major parts of the Circuit Breaker Pattern

A

The Circuit Breaker pattern works by monitoring the availability and responsiveness of a service. It maintains the state of the service (closed, open, or half-open) based on the observed behavior. Here’s a step-by-step breakdown of how the pattern operates:

  1. Closed State: Initially, the circuit breaker is in the closed state, allowing requests to pass through to the remote service/API as usual.
  2. Monitoring: The circuit breaker monitors the responses from the service. It keeps track of metrics such as response times, error rates, or timeouts.
  3. Thresholds: Based on the observed metrics, the circuit breaker sets thresholds to determine when to open the circuit. For example, if the error rate exceeds a certain threshold or the response time exceeds a specified duration, it triggers the circuit breaker to open.
  4. Open State: When the circuit breaker opens, it stops any further requests from reaching the remote service/API. Instead, it immediately returns a predefined fallback response (e.g., an error or cached data) without forwarding the request.
  5. Wait Duration: After the circuit breaker opens, it enters a wait duration known as the “open state timeout.” During this period, the circuit breaker periodically allows a limited number of requests to pass through to the remote service/API to test its availability.
  6. Half-Open State: After the wait duration elapses, the circuit breaker enters the half-open state. It allows a small number of requests to reach the remote service/API. If these requests succeed, it indicates that the service is now available, and the circuit breaker transitions back to the closed state. However, if any request fails, the circuit breaker reopens and returns to the open state.

By utilizing the Circuit Breaker pattern, you can achieve the following benefits:
- Fail Fast: Requests are quickly intercepted and don’t waste resources on unresponsive or failing services.
- Graceful Degradation: Instead of completely blocking requests, a fallback response is provided during service unavailability, ensuring that the application can still function partially.
- Avoid Cascading Failures: By isolating problematic services, the Circuit Breaker pattern prevents the propagation of failures to other parts of the system.

It’s worth noting that there are various implementations of the Circuit Breaker pattern available in different programming languages and frameworks. Some popular libraries for circuit breaking include Hystrix (Java), resilience4j (Java), and Polly (.NET).
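
A minimal illustration of the closed/open/half-open states described above (thresholds and timings are arbitrary; production libraries like resilience4j or Polly add much more):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_timeout=0.05):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.open_timeout:
                self.state = "half-open"   # wait elapsed: probe the service again
            else:
                return fallback            # fail fast with the fallback response
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0
        self.state = "closed"              # success closes the circuit
        return result
```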

16
Q

How does a Canary Release reduce the risk of new code impacting production?

A
  1. Canary Release Pattern:
    The Canary Release pattern involves deploying a new version of an application or feature to a small subset of users or a specific environment (the canary group) while keeping the remaining users or environments on the stable version (the control group). This allows for a controlled and gradual rollout of the new code. The pattern typically involves the following steps:
  • Identify the canary group: Select a subset of users, such as a percentage or specific user segment, or a specific environment, to receive the new version of the application or feature.
  • Deploy the new version: Release the new code to the canary group while the majority of users or environments remain on the stable version.
  • Monitor and collect data: Monitor the canary group’s performance, including metrics like response times, error rates, and user feedback. Collect data to evaluate the impact of the new version.
  • Gradual expansion or rollback: Based on the monitored metrics and feedback, gradually expand the deployment to a larger user base or environment. Alternatively, if issues or negative impacts are detected, roll back the deployment to the stable version.

By gradually exposing the new version to a limited audience, issues can be detected and addressed before impacting the entire user base. It allows for gathering real-time feedback and performance data, ensuring a smooth transition and reducing the risk of widespread failures.
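
One common way to pick the canary group deterministically is to hash each user id into a bucket and route a fixed percentage to the new version (a sketch; real routers usually do this at the load balancer):

```python
import hashlib

def release_for(user_id, canary_percent=10):
    """Route the same user to the same version on every request."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```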

17
Q

What are good patterns for quickly rolling back a deployment?

A

In addition to the Canary Release pattern, here are a couple of patterns that can be used to stop a rollout or rollback a deployment:

  1. Blue-Green Deployment Pattern:
    The Blue-Green Deployment pattern involves maintaining two identical environments, the “blue” and the “green,” with only one actively serving production traffic at a time. The pattern allows for switching the traffic between the two environments easily. If issues arise during a rollout, rolling back to the previous version is as simple as redirecting the traffic back to the stable environment.
  2. Feature Toggle Pattern:
    The Feature Toggle pattern (also known as Feature Flags or Feature Flipping) provides a mechanism to enable or disable a specific feature in production. It allows for selectively activating or deactivating features without redeploying the application. If issues occur during a rollout, the feature toggle can be used to disable the new feature or revert to the previous behavior.
18
Q

What are key features of Dynamo DB and describe what it is?

A

Amazon DynamoDB is a managed NoSQL database service. It’s designed to provide low-latency, high-throughput performance by automatically distributing data across multiple servers to meet the requirements of your applications.

DynamoDB supports both key-value and document data models. Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad tech, IoT, and many other applications.

Some of the key features of DynamoDB include:

  1. Serverless: There’s no server management, and it scales automatically based on the load on the database.
  2. Performance at scale: It can handle 10 trillion requests per day and can support peaks of more than 20 million requests per second.
  3. Built for mission-critical workloads: It delivers single-digit millisecond performance at any scale with built-in security and in-memory caching.
  4. Microsecond latency with DynamoDB Accelerator (DAX): DAX is a fully managed, highly available, in-memory cache for DynamoDB that can speed up the response times from milliseconds to microseconds, even at millions of requests per second.
19
Q

What is a typical architecture that incorporates AWS Kinesis?

A

A typical architecture that incorporates Amazon Kinesis might involve several other AWS services for a complete data processing pipeline. Here’s a simple example:

  1. Data Producers: These are your sources of streaming data. They could be web servers, mobile apps, IoT devices, etc. These producers send data to Amazon Kinesis Streams.
  2. Amazon Kinesis Streams: This service allows you to ingest and store data in real time. It can reliably handle large volumes of data, from hundreds of kilobytes per second to terabytes per hour.
  3. Amazon Kinesis Data Analytics: This is an optional but useful step. Kinesis Data Analytics lets you process and analyze your streaming data with standard SQL. You can create time-series analytics, run real-time dashboards, detect anomalies, and more.
  4. AWS Lambda: After your data is processed and analyzed, AWS Lambda can be triggered by Kinesis to perform any kind of action based on your business logic. For example, you might use Lambda to write records to a database, send notifications, or even trigger other AWS services.
  5. Data Storage: After your data has been processed and any immediate actions have been taken, you might store the data for long-term analysis. Amazon S3 is a popular choice for this, but you might also use Amazon Redshift if you have a lot of structured data that you want to query with SQL.
  6. Amazon CloudWatch: Throughout this pipeline, Amazon CloudWatch can be used to monitor your Kinesis streams and other AWS resources. You can set up alarms to notify you of any issues, and you can use CloudWatch logs to debug problems.
  7. Amazon QuickSight / Other BI tools: Once your data is stored, you might use Amazon QuickSight or another business intelligence tool to create visualizations and reports based on your data.

Remember, this is just one example. The exact architecture would depend on your specific use case and requirements.

20
Q

What is a the name of a popular ML function you can uses with AWS Kinesis Data Analytics for anomaly detection?

A

RANDOM_CUT_FOREST - With Amazon Kinesis Data Analytics, you can write SQL queries that use machine learning functions to identify anomalies in your data stream. For example, you could use the RANDOM_CUT_FOREST function, which implements a popular algorithm for anomaly detection.

21
Q

What are the advantages to Edge Computing?

A

Edge computing refers to the practice of processing data closer to its source, rather than relying on a central location such as a cloud-based data center. This approach can lead to faster response times and less bandwidth use. Here are some key aspects:

  • Speed: By processing data near its source, edge computing reduces the latency that would be experienced if the data had to be sent to a central data center. This is particularly crucial for real-time applications.
  • Bandwidth Reduction: When large volumes of data are generated (such as from IoT devices), it’s often impractical to send all of this data to the cloud due to bandwidth limitations. Edge computing allows initial processing to be done locally, reducing the amount of data that needs to be transmitted.
  • Reliability: Since edge computing involves decentralized processing, it doesn’t rely on a constant connection to a central server. Therefore, even if the connection to the main server is lost, the local device can continue functioning.
  • Security and Privacy: With edge computing, sensitive data can be processed locally without being sent over the network, reducing the risk of interception.

Common use cases for edge computing include IoT devices, autonomous vehicles, drones, and any other applications that require real-time processing and decision-making. Tech giants such as Google, Amazon, and Microsoft are actively investing in edge computing technologies.

22
Q

What is AWS Greengrass?

A

AWS Greengrass: This service extends AWS onto your devices, so they can act locally on the data they generate, while still using the cloud for management, analytics, and storage. It allows you to run local compute, messaging, data caching, and sync capabilities for connected devices in a secure way.

23
Q

What is the feature of AWS Cloudfront that let’s you run code closer to users of your application?

A

Lambda@Edge

24
Q

Pivotal has Cloud Foundry, what are are some of it’s key features?

A

Cloud Foundry is an open-source platform-as-a-service (PaaS) that allows developers to build, deploy, run and scale applications. It’s designed to work with various languages and services and is highly customizable through the use of buildpacks. Here’s a summary of its key features:

  1. Application Lifecycle Management: Cloud Foundry facilitates the entire process from coding to deployment. You just push your code, and Cloud Foundry takes care of the rest. This includes everything from setting up the appropriate runtime to managing the application instance once it’s running.
  2. Multi-language Support: Cloud Foundry supports a wide variety of programming languages and frameworks thanks to its buildpack mechanism. This includes Java, Node.js, Ruby, Go, Python, PHP, .NET, and more. You can also add custom buildpacks if needed.
  3. Multi-cloud: Cloud Foundry is designed to be cloud-agnostic, meaning it can run on various Infrastructure as a Service (IaaS) providers like AWS, Google Cloud Platform, Microsoft Azure, OpenStack, and more. This helps to prevent vendor lock-in and gives you the flexibility to use the infrastructure that best suits your needs.
  4. Scaling and Load Balancing: Cloud Foundry allows you to easily scale your applications, both manually and automatically, based on the demand. It also provides routing and load balancing for your applications.
  5. Services Marketplace: Cloud Foundry includes a marketplace of services (such as databases, messaging systems, and more) that can be easily bound to your applications. These services can be offered by the platform, or externally by third-party providers.
  6. Health Management: Cloud Foundry monitors the health of your applications and can restart instances if they become unresponsive.

In essence, Cloud Foundry aims to simplify the deployment process, provide flexibility in terms of language and cloud provider choice, and enable easy scaling and management of applications.

25
Q

Describe the difference between AWS EC2 and Elastic Beanstalk?

A

AWS Elastic Beanstalk and Amazon EC2 (Elastic Compute Cloud) are both services provided by Amazon Web Services, but they serve different purposes and operate at different levels of abstraction.

  1. Amazon EC2: EC2 provides scalable computing capacity in the AWS cloud. It gives you complete control over your computing resources and lets you run on Amazon’s computing environment. Essentially, EC2 provides virtual machines, called instances, where you have operating system level control. You can configure your instances as needed, install software, create containers, and manage the network settings. However, this also means you’re responsible for setting up, managing, and scaling these resources.
  2. AWS Elastic Beanstalk: Elastic Beanstalk is a Platform as a Service (PaaS) that simplifies the deployment and scaling of applications. You just upload your application, and Elastic Beanstalk handles the details of capacity provisioning, load balancing, scaling, and application health monitoring. Elastic Beanstalk uses EC2 instances behind the scenes, but it manages them for you. It’s designed to support applications written in various languages, including Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker.

In summary, if you need control over your instances and are willing to manage the infrastructure, you might choose Amazon EC2. If you’d prefer to focus on the application and let AWS manage the infrastructure, you might choose AWS Elastic Beanstalk.

26
Q

What is Kubernetes and what are the popular implementations on AWS and GCP called?

A

AWS and GCP both offer managed services for Kubernetes, which allow you to run Kubernetes clusters without having to manage the underlying infrastructure.

  1. For Amazon Web Services (AWS), the managed Kubernetes service is called Amazon Elastic Kubernetes Service (EKS). EKS takes care of the underlying Kubernetes infrastructure (like the control plane), so you can focus on deploying and running applications.
  2. For Google Cloud Platform (GCP), the managed Kubernetes service is called Google Kubernetes Engine (GKE). GKE was one of the first managed Kubernetes services as Kubernetes was originally designed by Google. Like EKS, GKE manages the underlying Kubernetes infrastructure, allowing you to concentrate on your applications.

Both of these services handle tasks such as cluster maintenance, scaling, and upgrades, making it easier to use Kubernetes in the cloud. They also integrate well with other services in their respective cloud ecosystems.

27
Q

Highlight key aspects of Microservices and the Twelve-Factor App methodologies?

A

Microservices is an architectural style where an application is built as a suite of small, loosely coupled services. Each microservice is a separate component that runs a unique process and communicates with other components via well-defined APIs. They’re independently deployable and can be written in different programming languages. They can also have their own databases and storage systems, allowing them to be scaled independently. This architecture can improve an application’s scalability and speed up the software development process.

The Twelve-Factor App is a methodology for building software-as-a-service applications. These best practices are designed to enable applications to be built with portability and resilience when deployed to the web. Here are the twelve factors:

  1. Codebase: One codebase tracked in revision control, many deploys
  2. Dependencies: Explicitly declare and isolate dependencies
  3. Config: Store config in the environment
  4. Backing Services: Treat backing services as attached resources
  5. Build, Release, Run: Strictly separate build and run stages
  6. Processes: Execute the app as one or more stateless processes
  7. Port Binding: Export services via port binding
  8. Concurrency: Scale out via the process model
  9. Disposability: Maximize robustness with fast startup and graceful shutdown
  10. Dev/Prod Parity: Keep development, staging, and production as similar as possible
  11. Logs: Treat logs as event streams
  12. Admin Processes: Run admin/management tasks as one-off processes

The combination of microservices architecture and twelve-factor app methodology can help organizations build and manage modern, scalable, and maintainable software applications.
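
Factor 3 (Config) is the easiest to show in code: read configuration from the environment rather than hard-coding it, so the same build behaves differently per deploy (variable names are illustrative):

```python
import os

def load_config():
    return {
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "debug": os.environ.get("DEBUG", "false").lower() == "true",
    }

os.environ["DEBUG"] = "true"   # set by the deploy environment, not the code
config = load_config()
```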

28
Q

Explain a Rolling Upgrade Architecture

A

In a rolling upgrade, you upgrade one component of your application at a time, and you can continue to serve traffic to users while the upgrade is in progress.

Here are some of the benefits of using a rolling upgrade architecture:

  • Reduced downtime: Rolling upgrades let you upgrade your software without taking your application offline, saving time and money and improving the user experience.
  • Improved reliability: The upgrade proceeds one component at a time, so a problem affects only part of the system and is easier to detect and contain.
  • Increased security: New security patches can be applied in a timely manner without waiting for a maintenance window.

29
Q

What are popular Rolling Restart Architectures?

A

In software systems, a rolling restart architecture refers to a technique of restarting the components of a distributed system in a rolling or sequential manner to minimize downtime and maintain system availability. Here are some popular rolling restart architectures:

  1. Rolling Restart with Load Balancer: In this approach, a load balancer is used to distribute incoming requests across multiple instances of the application. The rolling restart is performed by taking each instance out of the load balancer pool, restarting it, and then adding it back to the pool. This ensures that there is no interruption in service as the remaining instances continue to handle the incoming requests.
  2. Rolling Restart with Blue-Green Deployment: In a blue-green deployment, two identical environments (blue and green) are maintained, with only one environment serving live traffic at a time. The rolling restart involves switching the live traffic from the blue environment to the green environment by updating the load balancer configuration. Once the green environment is confirmed to be functioning correctly, the blue environment can be restarted and updated.
  3. Rolling Restart with Container Orchestration: Container orchestration platforms like Kubernetes provide rolling update capabilities out of the box. The rolling restart can be achieved by updating the container image or configuration one instance at a time, while the other instances continue to run and serve traffic. Kubernetes manages the rollout and ensures that the desired number of instances are always available during the restart process.
  4. Rolling Restart with Database Clustering: In systems with distributed databases, a rolling restart can be performed by taking one node out of the cluster, restarting it, and then adding it back to the cluster. The cluster handles the data replication and synchronization, allowing the system to continue serving requests while the rolling restart is underway.
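The load-balancer variant (item 1) can be simulated in a few lines: each instance leaves the pool, restarts, and rejoins, while the pool is never empty (instance names are illustrative):

```python
def rolling_restart(pool, restart):
    for instance in list(pool):       # iterate over a snapshot of the pool
        pool.remove(instance)         # take it out of the load balancer
        assert pool, "pool must keep serving traffic during the restart"
        restart(instance)
        pool.append(instance)         # add it back once healthy
    return pool

restarted = []
pool = ["app-1", "app-2", "app-3"]
rolling_restart(pool, restarted.append)
```
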
30
Q

When discussing uptime what does six, five, four and three nines mean with regards to downtime per year?

A
  • Three nines (99.9%) 8 hours, 46 minutes
  • Four nines (99.99%) 52 minutes, 36 seconds
  • Five nines (99.999%) 5.26 minutes
  • Six nines (99.9999%) 31.56 seconds
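
These figures follow from a one-line calculation over a 365-day year:

```python
def downtime_seconds_per_year(nines):
    """Downtime allowed per year by an availability target of N nines."""
    unavailability = 10 ** (-nines)          # e.g. 99.9% -> 0.001
    return unavailability * 365 * 24 * 3600  # seconds in a 365-day year

three = downtime_seconds_per_year(3)   # ~31,536 s = 8 h 46 min
four = downtime_seconds_per_year(4)    # ~3,154 s  = 52 min 36 s
six = downtime_seconds_per_year(6)     # ~31.5 s
```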