Section 12: Databases and Analytics Flashcards
Relational vs Non-Relational databases
Relational
* SQL
* organised into tables, rows and columns
* rigid schema
* rules enforced in database
* usually vertically scaled
* supports complex queries and joins
* Amazon RDS, Oracle, MySQL, PostgreSQL
Non-relational
* NoSQL
* varied data storage models
* flexible schema stored in key-value pairs, columns, documents or graphs
* rules can be defined in application code (outside of database)
* scales horizontally
* unstructured, supports any kind of schema
* Amazon DynamoDB, MongoDB, Redis, Neo4j
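The contrast above can be sketched in a few lines. This is only an illustration: Python's built-in `sqlite3` stands in for a relational engine, a plain dict stands in for a key-value/document store, and the table names, keys and sample data are all hypothetical.

```python
import sqlite3

# Relational: rigid schema, rules enforced by the database, joins supported.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "total REAL NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")
conn.execute("INSERT INTO orders VALUES (11, 1, 17.5)")

# A join across two tables, aggregated per customer.
relational_result = conn.execute(
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
).fetchall()

# The database itself rejects a row that breaks the schema rules.
try:
    conn.execute("INSERT INTO orders VALUES (12, 999, 5.0)")  # no such customer
    rule_enforced = False
except sqlite3.IntegrityError:
    rule_enforced = True

# Non-relational: flexible schema. Items may differ in shape, and any rules
# have to live in application code.
document_store = {
    "customer#1": {
        "name": "Ada",
        "orders": [{"id": 10, "total": 25.0}, {"id": 11, "total": 17.5}],
    },
    "customer#2": {"name": "Lin", "newsletter": True},  # different attributes
}
document_total = sum(o["total"] for o in document_store["customer#1"]["orders"])
```

The relational side answers the question with a join; the document side keeps the orders nested inside the customer item, so no join is needed but nothing stops two items from having different shapes.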
Amazon Relational Database Service (RDS)
- Scales vertically, which means moving to a larger instance type (more CPU and RAM)
- Is an OLTP type of database (Online Transaction Processing)
- Horizontal scaling for queries (reads) can be done by creating a read replica. This means there is an RDS primary (master) database and an RDS read replica; the primary replicates its changes asynchronously to the read replica.
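A minimal sketch of how an application might split traffic between the primary and read replicas. The `EndpointRouter` class and the endpoint hostnames are hypothetical, and the "starts with SELECT" check is a deliberate simplification; real applications usually configure separate reader and writer connections.

```python
import itertools


class EndpointRouter:
    """Send writes to the primary and spread reads across read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary for reads if there are no replicas.
        self._read_cycle = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_cycle) if is_read else self.primary


router = EndpointRouter(
    primary="primary.example.internal",
    replicas=["replica-1.example.internal", "replica-2.example.internal"],
)

write_target = router.endpoint_for("INSERT INTO orders VALUES (1)")
read_targets = [router.endpoint_for("SELECT * FROM orders") for _ in range(3)]
```

Because replication to a read replica is asynchronous, a read sent to a replica straight after a write may briefly return stale data.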
Relational Database Service (RDS) backups
Relational Database Service backups
Automated backups
* automated backups are retained for 0 to 35 days
* restore can be to any point in time during the retention period
Manual backups (snapshots)
* backs up entire DB instance, not just individual database
* snapshots do not expire
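The retention rule for automated backups can be sketched as a small check. The function name `can_restore_to` is hypothetical, and the sketch simplifies by treating "now" as the latest restorable time; in practice the latest restorable time lags slightly behind.

```python
from datetime import datetime, timedelta, timezone


def can_restore_to(target, now, retention_days):
    """Return True if `target` falls inside the automated-backup window."""
    if not 0 <= retention_days <= 35:
        raise ValueError("retention must be between 0 and 35 days")
    if retention_days == 0:  # a retention of 0 disables automated backups
        return False
    earliest = now - timedelta(days=retention_days)
    return earliest <= target <= now


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
three_days_ago = now - timedelta(days=3)
ten_days_ago = now - timedelta(days=10)

inside_window = can_restore_to(three_days_ago, now, retention_days=7)
outside_window = can_restore_to(ten_days_ago, now, retention_days=7)
```

Manual snapshots are not subject to this window because they do not expire.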
What is Amazon Aurora
Amazon Aurora:
* database in the RDS family
* high durability and scalability
* MySQL and PostgreSQL compatible
* built-in fault tolerance
Aurora key features
Aurora key features:
* high performance and scalability
* supports MySQL and PostgreSQL
* aurora replicas: in-region read scaling and failover target (up to 15 replicas)
* global database: cross-region cluster with read scaling
* multi-master: scales out writes within a region
* serverless: on-demand, autoscaling configuration; does not support read replicas or public IPs (this applies to Aurora Serverless v1). Aurora Serverless is a deployment option of Aurora rather than a separate service
When to use Aurora Serverless
Use cases:
* infrequently used apps
* new apps
* variable workload
* unpredictable workloads
* dev and test databases
* multi-tenant apps
What is RDS Proxy?
- RDS Proxy is a fully managed database proxy for RDS
- highly available across multiple AZs
- increases scalability, fault tolerance and security
- reduces stress on database CPU/memory
- control authentication method
- controls pool of connections to database
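Connection pooling, the idea behind the last bullet, can be sketched as follows. This is a toy model and not how RDS Proxy is implemented; `ConnectionPool` and `open_connection` are hypothetical names, and the "connection" is just a placeholder object.

```python
import queue


class ConnectionPool:
    """Keep a fixed set of open connections and share them between callers."""

    def __init__(self, open_connection, size):
        self._idle = queue.Queue(maxsize=size)
        self.opened = 0
        for _ in range(size):
            self._idle.put(open_connection())
            self.opened += 1

    def acquire(self, timeout=1.0):
        # Blocks (up to `timeout` seconds) until a connection is free.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)


def open_connection():
    return object()  # placeholder for a real database connection


pool = ConnectionPool(open_connection, size=2)

served = 0
for _ in range(10):        # ten client requests...
    conn = pool.acquire()
    served += 1
    pool.release(conn)     # ...reuse the same two connections
```

Ten requests are served while only two connections are ever opened, which is why pooling reduces CPU and memory pressure on the database.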
What is Amazon ElastiCache
- Fully managed implementation of Redis and Memcached
- It is a key/value store
- Can be put in front of databases such as RDS and DynamoDB
- ElastiCache runs on Amazon EC2 instances, so you must choose an instance family/type
ElastiCache - Memcached vs Redis
Redis:
* Data persistence
* Complex data types
* Partitioning (only in Cluster Mode)
* high availability
* Not multithreaded
Memcached:
* No data persistence
* Simple data types
* Partitioning
* Not highly available
* Multithreaded
ElastiCache use cases
- data that is relatively static and frequently accessed
- apps that are tolerant of stale data
- often used for storing session state (DynamoDB can also be used)
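A common way to put a cache in front of a database is the cache-aside (lazy loading) pattern with a TTL, sketched below. The `TTLCache` class is an in-process stand-in for Redis or Memcached, and `load_from_db`, the key names and the 60-second TTL are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-memory key/value cache whose entries expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # entry is stale, treat it as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)


def get_with_cache(cache, key, load_from_db):
    """Cache-aside: try the cache first, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value


# Demonstration with a fake clock and a fake database loader.
fake_now = [0.0]
db_calls = []


def load_from_db(key):
    db_calls.append(key)
    return f"profile for {key}"


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
get_with_cache(cache, "user#1", load_from_db)  # miss, loads from the database
get_with_cache(cache, "user#1", load_from_db)  # hit, served from the cache
calls_before_expiry = len(db_calls)

fake_now[0] = 61.0                             # the entry has now expired
get_with_cache(cache, "user#1", load_from_db)  # miss again
calls_after_expiry = len(db_calls)
```

Between the first load and expiry the application may serve stale data, which is why this approach suits apps that tolerate staleness.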
What is Amazon DynamoDB?
- NoSQL database service
- key/value store and document store
- non-relational, key-value type of database
- fully serverless
- autoscaling based on the read/write capacity defined
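As a rough illustration of what "read/write capacity" means in provisioned mode: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one write capacity unit covers one write per second of an item up to 1 KB. The function names below are hypothetical and the sketch ignores transactional requests.

```python
import math


def read_capacity_units(item_size_kb, reads_per_second, strongly_consistent=True):
    """Estimate RCUs: item size is rounded up to the next 4 KB per read."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_kb, writes_per_second):
    """Estimate WCUs: item size is rounded up to the next 1 KB per write."""
    return math.ceil(item_size_kb) * writes_per_second


strong_reads = read_capacity_units(6, 10)
eventual_reads = read_capacity_units(6, 10, strongly_consistent=False)
writes = write_capacity_units(2.5, 10)
```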
DynamoDB - TTL
- TTL (time to live) lets you define when an item can be deleted. Useful when DynamoDB is used like Redis, for caching purposes
- you add a timestamp attribute to each item, and the item is deleted after that time has passed
- No extra cost and does not use WCU/RCU (write capacity units / read capacity units)
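A minimal sketch of preparing an item with a TTL attribute. The TTL value is a Unix epoch timestamp in seconds stored in a Number attribute; the attribute name `expires_at` and the helper names are hypothetical, and no AWS call is made here.

```python
import time


def with_ttl(item, ttl_seconds, now=None):
    """Return a copy of `item` with an epoch-seconds expiry attribute."""
    now = int(time.time()) if now is None else int(now)
    return {**item, "expires_at": now + ttl_seconds}


def is_expired(item, now):
    """True once the item's expiry timestamp has passed."""
    return item["expires_at"] <= now


session = with_ttl({"session_id": "abc123"}, ttl_seconds=3600, now=1_700_000_000)
expired_early = is_expired(session, now=1_700_000_100)
expired_late = is_expired(session, now=1_700_003_600)
```

Expired items are removed by a background process some time after the timestamp passes, so an application that needs exact expiry should also filter on the attribute when reading.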
What is DynamoDB Streams?
DynamoDB Streams:
Captures a time-ordered sequence of item-level modifications to any DynamoDB table and stores this information in a log for up to 24 hours
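The behaviour described above can be modelled with a toy change log. This is a sketch of the concept only, not the Streams API; the `ChangeLog` class and the sample keys are hypothetical, while `INSERT`, `MODIFY` and `REMOVE` mirror the event names used in stream records.

```python
from collections import deque

RETENTION_SECONDS = 24 * 60 * 60  # records are kept for up to 24 hours


class ChangeLog:
    """Toy time-ordered log of item-level changes."""

    def __init__(self):
        self._records = deque()

    def record(self, timestamp, event_name, key, new_image=None):
        # Appending in arrival order keeps the log time-ordered.
        self._records.append(
            {"timestamp": timestamp, "event": event_name, "key": key, "new_image": new_image}
        )

    def read(self, now):
        # Drop records that have aged out of the retention window.
        while self._records and now - self._records[0]["timestamp"] > RETENTION_SECONDS:
            self._records.popleft()
        return list(self._records)


log = ChangeLog()
log.record(0, "INSERT", "user#1", {"name": "Ada"})
log.record(3_600, "MODIFY", "user#1", {"name": "Ada L."})
log.record(7_200, "REMOVE", "user#1")

events_after_2h = [r["event"] for r in log.read(now=7_200)]
events_later = [r["event"] for r in log.read(now=24 * 3_600 + 1_800)]
```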
What is DynamoDB Accelerator (DAX)?
- DAX is a fully managed, highly available, in-memory cache for DynamoDB
- improves response times from milliseconds to microseconds (lower latency)
- mainly improves read performance; it acts as a read-through and write-through cache, so writes go to the table and the cache together
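Read-through and write-through caching can be sketched with a dict standing in for the table. This is a conceptual model, not the DAX client; the class name and the `table_reads` counter are hypothetical.

```python
class ReadWriteThroughCache:
    """In-memory cache that sits in front of a backing table (a dict here)."""

    def __init__(self, table):
        self._table = table
        self._cache = {}
        self.table_reads = 0

    def get(self, key):
        # Read-through: on a miss, fetch from the table and remember the result.
        if key in self._cache:
            return self._cache[key]
        self.table_reads += 1
        value = self._table.get(key)
        if value is not None:
            self._cache[key] = value
        return value

    def put(self, key, value):
        # Write-through: write to the table first, then update the cache.
        self._table[key] = value
        self._cache[key] = value


table = {"user#1": {"name": "Ada"}}
dax = ReadWriteThroughCache(table)

first = dax.get("user#1")            # miss, goes to the table
second = dax.get("user#1")           # hit, served from memory
dax.put("user#2", {"name": "Lin"})   # stored in the table and the cache
third = dax.get("user#2")            # hit thanks to write-through
```

Only the first read touches the table; repeated reads are served from memory, which is where the latency improvement comes from.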
What is DynamoDB Global Tables
DynamoDB Global Tables:
* multi-region, multi-active database
* asynchronous replication of DynamoDB tables across regions (same data set)
What is Amazon Redshift?
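Because every Region accepts writes and replication is asynchronous, two Regions can update the same item at nearly the same time. Global tables reconcile this with a last-writer-wins rule, which the toy model below illustrates. The function name, Region variables, keys and timestamps are hypothetical.

```python
def apply_write(replica, key, value, written_at):
    """Apply a local or replicated write using a last-writer-wins rule."""
    current = replica.get(key)
    if current is None or written_at > current["written_at"]:
        replica[key] = {"value": value, "written_at": written_at}


# Two Regional replicas of the same table, both accepting writes.
us_east, eu_west = {}, {}

# Concurrent local writes to the same item in each Region.
apply_write(us_east, "cart#7", "2 items", written_at=100)
apply_write(eu_west, "cart#7", "3 items", written_at=105)

# Asynchronous replication later delivers each write to the other Region.
apply_write(eu_west, "cart#7", "2 items", written_at=100)  # older, ignored
apply_write(us_east, "cart#7", "3 items", written_at=105)  # newer, applied
```

After replication catches up, both replicas hold the same data set.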
Amazon Redshift:
* data warehouse
* used to analyse data using SQL and other Business Intelligence (BI) tools such as Amazon QuickSight, Tableau, Microsoft Power BI
* relational database
* used for OLAP (online analytical processing)
* uses EC2 instances
* keeps 3 copies of your data
* continuous and incremental backup
Use cases for Amazon Redshift
Amazon Redshift (data warehouse) use cases:
* perform **complex queries** on massive collections of structured and semi-structured data with fast performance
* use Redshift Spectrum for direct access of S3 objects in a data lake
What is Amazon Elastic MapReduce (EMR)?
- Amazon Elastic MapReduce is Amazon’s managed Hadoop service
- It is used for running big data frameworks such as Apache Hadoop and Apache Spark
- used for processing data for analytics and business intelligence
- can also be used for transforming and moving large amounts of data
- performs extract, transform and load functions (ETL)
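The MapReduce model that gives the service its name can be sketched on a single machine with a word count. This only illustrates the map, shuffle and reduce phases; the function names and input lines are hypothetical, and a real EMR job would run on a framework such as Hadoop or Spark across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data pipeline"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```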
What is Amazon Kinesis?
Amazon Kinesis:
Amazon Kinesis cost-effectively processes and analyzes streaming data at any scale as a fully managed service. With Kinesis, you can ingest real-time data, such as video, audio, application logs, website clickstreams, and IoT telemetry data, for machine learning (ML), analytics, and other applications.
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it simple to analyze data directly in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can choose to pay based on the queries you run or compute needed by your queries.
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can more easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows in a few steps in AWS Glue Studio.
What is Amazon OpenSearch Service (Elasticsearch)
Search, visualise, and analyse text and unstructured data. It is based on Elasticsearch, meaning you can use it with Logstash and Kibana dashboards (ELK stack).
Supports queries using SQL.
Amazon OpenSearch Service is a managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch.
Amazon OpenSearch (Elasticsearch) best practices
- deploy OpenSearch data instances across 3 Availability Zones
- provision instances in multiples of 3
- if 3 AZs are not available, use 2 AZs with an equal number of instances in each
- configure at least 1 replica for each index
- apply restrictive resource-based access policies to the domain (or use fine-grained access control)
- create the domain within an Amazon VPC
- for sensitive data, enable node-to-node encryption (in transit) and encryption at rest
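The sizing rules in this list can be turned into a simple checklist function. This is a hypothetical helper for revision purposes, not part of any AWS SDK; the function name, parameters and messages are assumptions.

```python
def check_domain_layout(az_count, data_instances, replicas_per_index):
    """Return a list of problems with a proposed domain layout."""
    problems = []
    if az_count == 3:
        if data_instances % 3 != 0:
            problems.append("provision data instances in multiples of 3")
    elif az_count == 2:
        if data_instances % 2 != 0:
            problems.append("use an equal number of instances in each of the 2 AZs")
    else:
        problems.append("deploy across 3 AZs, or 2 if 3 are not available")
    if replicas_per_index < 1:
        problems.append("configure at least 1 replica for each index")
    return problems


good_layout = check_domain_layout(az_count=3, data_instances=6, replicas_per_index=1)
bad_layout = check_domain_layout(az_count=3, data_instances=4, replicas_per_index=0)
```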