Database & Analytics Flashcards
Databases
A database is an organized collection of structured information, or data, typically stored electronically in a computer system.
• You build indexes to efficiently query / search through the data
• You define relationships between your datasets
Relational Databases
A relational database is a collection of information that organizes data in predefined relationships, with the data stored in one or more tables of columns and rows.
• Can use the SQL language to perform queries / lookups
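A minimal sketch of tables, relationships, indexes, and SQL queries, using Python's built-in sqlite3 module as a stand-in for a relational engine. The table and column names (customers, orders) are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for any relational engine.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Two tables of columns and rows, linked by a predefined relationship.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(id),
           total REAL NOT NULL
       )"""
)
# An index so lookups by customer do not have to scan the whole table.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# A SQL query that follows the relationship with a JOIN.
rows = conn.execute(
    """SELECT c.name, SUM(o.total)
       FROM customers c JOIN orders o ON o.customer_id = c.id
       GROUP BY c.name
       ORDER BY c.name"""
).fetchall()
print(rows)
```

The same SQL would run largely unchanged against the engines RDS manages, although data types and connection code differ per engine.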
NoSQL Databases
• NoSQL databases are purpose built for specific data models and have flexible schemas for building modern applications
• Benefits: Flexibility, Scalability, High-performance, Highly functional
• Examples: Key-value, document, graph, in-memory, search databases
NoSQL data example: JSON
• JSON = JavaScript Object Notation
• JSON is a common form of data that fits into a NoSQL model
• Data can be nested
• Fields can change over time
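As an illustration of nesting and changing fields, here is a small sketch using Python's standard json module. The record and its field names (user_id, address, loyalty_tier) are made up for the example.

```python
import json

# Hypothetical user record; note the nested "address" object and "orders" list.
raw = """
{
  "user_id": "u-123",
  "name": "Ada",
  "address": {"city": "London", "country": "UK"},
  "orders": [{"id": 1, "total": 20.0}, {"id": 2, "total": 5.5}]
}
"""
user = json.loads(raw)

# Data can be nested: reach into the inner object and the list.
city = user["address"]["city"]
order_total = sum(order["total"] for order in user["orders"])

# Fields can change over time: add one without any schema migration.
user["loyalty_tier"] = "gold"
serialized = json.dumps(user, sort_keys=True)
print(city, order_total)
```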
Databases & Shared Responsibility on AWS
• AWS offers managed services for different types of databases
• Benefits include:
• Quick Provisioning, High Availability, Vertical and Horizontal Scaling
• Automated Backup & Restore, Operations, Upgrades
• Operating System Patching is handled by AWS
• Monitoring, alerting
AWS RDS
• RDS stands for Relational Database Service
• It’s a managed DB service for databases that use SQL as a query language.
• It allows you to create databases in the cloud that are managed by AWS
• Postgres
• MySQL
• MariaDB
• Oracle
• Microsoft SQL Server
• Aurora (AWS Proprietary database)
Advantages of using RDS versus deploying a DB on EC2
• Automated provisioning, OS patching
• Continuous backups and restore to a specific timestamp (Point in Time Restore)
• Monitoring dashboards
• Read replicas for improved read performance
• Multi AZ setup for DR (Disaster Recovery)
• Maintenance windows for upgrades
• Scaling capability (vertical and horizontal)
• Storage backed by EBS (gp2 or io1)
Amazon Aurora
• Aurora is a proprietary technology from AWS (not open sourced)
• PostgreSQL and MySQL are both supported as Aurora DB
• Aurora is “AWS cloud optimized”, better performance than RDS
• Aurora storage automatically grows in increments of 10GB, up to 64 TB.
• Aurora costs more than RDS (20% more) – but is more efficient
RDS Deployments: Read Replicas, Multi-AZ
• Read Replicas:
• Scale the read workload of your DB
• Can create up to 5 Read Replicas
• Data is only written to the main DB
• Multi-AZ:
• Failover in case of AZ outage (high availability)
• Data is only read/written to the main database
• Can only have 1 other AZ as failover
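The read/write split used with Read Replicas can be sketched as below. This is an in-memory toy model, not the RDS API: the class and method names are hypothetical, and replication is shown as an instant copy, whereas real read replicas replicate asynchronously.

```python
import itertools


class ToyCluster:
    """Toy model: one main DB plus read replicas (all names are hypothetical)."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        # Round-robin iterator over replica indexes to spread reads.
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Data is only written to the main DB...
        self.primary[key] = value
        # ...then copied to each replica (instant here, asynchronous in practice).
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads are spread across replicas, which scales the read workload.
        index = next(self._next_replica)
        return index, self.replicas[index].get(key)


cluster = ToyCluster(replica_count=2)
cluster.write("user:1", "Ada")
served_by = [cluster.read("user:1") for _ in range(3)]
print(served_by)
```

In a Multi-AZ setup, by contrast, the standby copy serves no traffic at all until a failover promotes it.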
RDS Deployments: Multi-Region
• Multi-Region (Read Replicas)
• Disaster recovery in case of region issue
• Local performance for global reads
• Replication cost
Amazon ElastiCache
• ElastiCache provides managed Redis or Memcached
• Caches are in-memory databases with high performance, low latency
• Helps reduce the load on databases for read-intensive workloads
• Query results are saved in the cache so that they’re readily available without querying the database again.
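One common way to use a cache in front of a database is the cache-aside pattern, sketched below. A plain dict stands in for Redis or Memcached, and slow_db_query is a hypothetical placeholder for a real database call.

```python
import time

cache = {}    # stand-in for Redis or Memcached
db_calls = 0  # counts how often the database is actually queried


def slow_db_query(user_id):
    """Hypothetical expensive database lookup."""
    global db_calls
    db_calls += 1
    time.sleep(0.01)  # simulate query latency
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:
        # Cache hit: the database does no work.
        return cache[key]
    # Cache miss: query the database, then keep the result for later reads.
    result = slow_db_query(user_id)
    cache[key] = result
    return result


first = get_user(42)   # miss, goes to the database
second = get_user(42)  # hit, served from memory
print(db_calls)
```

A real deployment would also need an expiry or invalidation strategy so that cached results do not go stale; that part is left out of this sketch.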
DynamoDB
• Fully managed, highly available with replication across 3 AZ
• NoSQL database, serverless
• Automatically scales up and down to adjust for capacity and maintain performance
• Millions of requests per second, 100s of TB of storage
• Single-digit millisecond latency – low latency retrieval
DynamoDB – type of data
• DynamoDB is a key/value database
DynamoDB Accelerator - DAX
• Fully managed in-memory cache for DynamoDB
• 10x performance improvement: from single-digit millisecond latency to microsecond latency
• Secure, highly scalable & highly available
DynamoDB – Global Tables
• Make a DynamoDB table accessible with low latency in multiple Regions
• Active-Active replication (read/write to any AWS Region)
Redshift
• Relational database
• Redshift is based on PostgreSQL, but it’s not used for OLTP
• It’s OLAP – online analytical processing (analytics and data warehousing)
• Columnar storage of data (instead of row based)
• Load data once every hour, not every second
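To see why columnar storage suits analytics, here is a small sketch contrasting the two layouts in plain Python. The sales records are hypothetical, and this illustrates only the layout idea, not how Redshift stores data internally.

```python
# Hypothetical sales records in a row-based layout: one record per row.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 5.5},
    {"order_id": 3, "region": "EU", "amount": 12.0},
]

# The same data in a columnar layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row-based aggregate: every full row is visited to read a single field.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar aggregate: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
print(total_from_rows, total_from_columns)
```

Analytical queries usually aggregate a few columns over many rows, so reading only the needed columns saves I/O; transactional (OLTP) workloads that fetch or update whole records tend to favor the row layout.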
Amazon EMR “Elastic MapReduce”
• EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
• The clusters can be made of hundreds of EC2 instances
• Use cases: data processing, machine learning, web indexing, big data
Amazon Athena
• Serverless query service to analyze data stored in Amazon S3
• Uses standard SQL language
• Use cases: Business intelligence / analytics / reporting, analyze
• Exam Tip: to analyze data in S3 using serverless SQL, use Athena
Amazon QuickSight
• Serverless business intelligence service that lets you create dashboards on your databases, so you can visually represent your data and show your business users the insights they’re looking for
• Fast, automatically scalable, embeddable, with per-session pricing
• Use cases: business analytics, building visualizations
DocumentDB
• DocumentDB is to MongoDB what Aurora is to PostgreSQL / MySQL: an AWS implementation compatible with it (MongoDB is a NoSQL database)
• MongoDB is used to store, query, and index JSON data
• Fully Managed, highly available with replication across 3 AZ
• DocumentDB storage automatically grows in increments of 10GB, up to 64 TB.
• Automatically scales to workloads with millions of requests per second
Amazon Neptune
• Fully managed graph database
• A popular graph dataset would be a social network
• Highly available across 3 AZ, with up to 15 read replicas
• Build and run applications working with highly connected datasets
• Can store up to billions of relations
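A highly connected dataset such as a social network can be sketched as an adjacency map. Neptune itself is queried with graph query languages such as Gremlin or SPARQL; the plain-Python version below, with made-up names, only illustrates the kind of relationship traversal a graph database is built for.

```python
# Hypothetical social graph stored as an adjacency map: person -> set of friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    # Exclude the person themselves and anyone they already know.
    return two_hops - direct - {person}


suggestions = sorted(friends_of_friends(friends, "alice"))
print(suggestions)
```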
Amazon QLDB
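• QLDB stands for “Quantum Ledger Database”
• Centralized: the ledger is owned by a single trusted authority (no decentralization component, unlike Amazon Managed Blockchain)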
• A ledger is a book recording financial transactions
• Fully Managed, Serverless, Highly available, Replication across 3 AZ
• Used to review history of all the changes made to your application data over time
• NoSQL
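The idea of a ledger whose history can be reviewed and checked can be sketched as an append-only, hash-chained journal. This is a simplified illustration under stated assumptions, not QLDB's actual implementation or API: the ToyLedger class and its methods are hypothetical.

```python
import hashlib
import json


class ToyLedger:
    """Append-only journal; each entry stores the hash of the previous one."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash, change):
        # Hash the previous entry's hash together with the new change.
        payload = json.dumps({"prev": prev_hash, "change": change}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, change):
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        self.entries.append(
            {"prev": prev_hash, "change": change, "hash": self._digest(prev_hash, change)}
        )

    def verify(self):
        # Recompute every hash in order; any edit to history breaks the chain.
        prev_hash = ""
        for entry in self.entries:
            expected = self._digest(prev_hash, entry["change"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = ToyLedger()
ledger.append({"account": "A", "balance": 100})
ledger.append({"account": "A", "balance": 70})
valid_before = ledger.verify()

# Tampering with an old entry is detected on the next verification.
ledger.entries[0]["change"]["balance"] = 1_000_000
valid_after = ledger.verify()
print(valid_before, valid_after)
```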
Amazon Managed Blockchain
• Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority
• Amazon Managed Blockchain is a managed service to:
• Join public blockchain networks
• Or create your own scalable private network
AWS Glue
• Fully serverless service
• Managed extract, transform, and load (ETL) service
• Useful to prepare and transform data for analytics
• Glue Data Catalog: catalog of datasets
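The extract, transform, and load steps can be sketched in plain Python as below. This does not use the Glue API; the CSV content, function names, and the in-memory "warehouse" list are assumptions made for illustration.

```python
import csv
import io

# Hypothetical raw CSV export with inconsistent formatting and one bad row.
raw_csv = """order_id,region,amount
1, eu ,20.0
2,US,not_a_number
3,eu,12.0
"""


def extract(text):
    """Extract: read the raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: normalize regions, parse amounts, drop rows that fail to parse."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # skip rows with an unparseable amount
        cleaned.append(
            {
                "order_id": int(record["order_id"]),
                "region": record["region"].strip().upper(),
                "amount": amount,
            }
        )
    return cleaned


def load(records, target):
    """Load: append to a list standing in for an analytics table."""
    target.extend(records)
    return len(records)


warehouse = []
loaded = load(transform(extract(raw_csv)), warehouse)
print(loaded, warehouse)
```

In practice the load target would be an analytics store such as Redshift, or files in S3 that Athena can query.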