SQL in AWS: Amazon RDS, Glue, and Redshift Flashcards
(50 cards)
What is Amazon RDS and how does it support MySQL?
Amazon RDS (Relational Database Service) is a managed database service that simplifies the setup, operation, and scaling of relational databases in the cloud. It supports several database engines, including MySQL. With RDS, you can easily create a MySQL instance, manage backups, apply patches, and scale your database without worrying about the underlying infrastructure. RDS handles routine database tasks, allowing you to focus on your application.
RDS provides a fully managed MySQL environment, ensuring compatibility with standard MySQL features while adding cloud-specific benefits like automated backups, point-in-time recovery, and high availability through Multi-AZ deployments. Unlike a self-managed MySQL instance where you’d handle server maintenance, RDS abstracts these complexities, making it ideal for production workloads.
– Example of creating a table in a MySQL RDS instance
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100)
);
How do you secure a MySQL database in Amazon RDS?
Securing a MySQL database in RDS involves multiple layers: (1) VPC and Security Groups: Place your RDS instance in a Virtual Private Cloud (VPC) and configure security groups to control traffic; (2) IAM Roles: Use AWS Identity and Access Management (IAM) to manage access; (3) Encryption: Enable encryption at rest with AWS KMS and in transit with SSL/TLS; (4) Database Authentication: Use strong passwords or IAM database authentication.
MySQL offers basic security features like user privileges and SSL connections, but RDS enhances these with AWS-specific tools. For instance, while MySQL requires manual SSL configuration, RDS simplifies it with pre-configured certificates and IAM integration, offering a more robust security model.
Example of enabling SSL for MySQL connections
mysql -h myrdsinstance.abcdef123456.us-west-2.rds.amazonaws.com –ssl-ca=rds-ca-2019-root.pem –ssl-mode=VERIFY_IDENTITY
How do automated backups work in Amazon RDS for MySQL?
RDS provides automated backups with daily snapshots taken during a backup window, retained for up to 35 days. It also supports continuous backups for point-in-time recovery to any second within that period. These features ensure data durability and easy recovery.
In a self-managed MySQL setup, backups require tools like mysqldump or custom scripts, often scheduled via cron jobs. RDS eliminates this overhead by automating the process, offering more granular recovery options than MySQL’s native binary log-based recovery.
Example of restoring a MySQL RDS instance to a point in time
aws rds restore-db-instance-to-point-in-time –source-db-instance-identifier mydbinstance –target-db-instance-identifier myrestoredinstance –restore-time 2025-06-20T23:45:00Z
What is AWS Glue and how does it relate to SQL?
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data preparation for analytics. It can crawl data sources, build a catalog, and generate ETL code. Glue supports SQL for transformations, allowing you to write queries to process data efficiently.
Glue can use MySQL as a data source or target, enabling SQL-based transformations before loading data elsewhere. Unlike MySQL, which focuses on transactional queries, Glue’s SQL capabilities are geared toward batch processing and data integration across AWS services.
Example of a Glue job script using SQL
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, [‘JOB_NAME’])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args[‘JOB_NAME’], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=”mydatabase”, table_name=”mytable”)
sql_transform = datasource0.sql(“SELECT id, name FROM mytable WHERE age > 18”)
datasink4 = glueContext.write_dynamic_frame.from_options(frame=sql_transform, connection_type=”s3”, connection_options={“path”: “s3://mybucket/output”}, format=”json”)
job.commit()
What is the Glue Data Catalog and how is it used?
The Glue Data Catalog is a centralized metadata repository that stores information about data assets across AWS. It acts as a hive metastore for services like Glue, EMR, and Athena, enabling data discovery and management.
By crawling MySQL databases, Glue populates the Data Catalog, making MySQL tables queryable via SQL in tools like Athena. MySQL itself doesn’t offer a comparable metadata service, requiring manual schema tracking.
Example of creating a Glue crawler for a MySQL database
aws glue create-crawler –name mysql-crawler –role GlueServiceRole –database-name mydatabase –targets “JdbcTargets=[{ConnectionName=mysql-connection, Path=mydatabase/%}]”
What is Amazon Redshift and how does it differ from MySQL?
Amazon Redshift is a fully managed data warehouse service designed for large-scale analytics. It uses columnar storage and parallel processing, contrasting with MySQL’s row-based, transactional focus. Redshift excels at complex analytical queries over massive datasets.
MySQL is suited for OLTP workloads, while Redshift targets OLAP. Redshift’s SQL dialect, based on PostgreSQL, differs from MySQL’s, offering features like advanced analytics that MySQL lacks natively.
– Example of creating a table in Redshift
CREATE TABLE sales (
sale_id INT,
product VARCHAR(50),
quantity INT,
sale_date DATE
)
DISTKEY(product)
SORTKEY(sale_date);
How do you load data into Amazon Redshift from S3?
Redshift uses the COPY command to efficiently load data from S3, supporting formats like CSV or JSON. This bulk ingestion method leverages Redshift’s distributed architecture for speed.
MySQL’s LOAD DATA INFILE serves a similar purpose but operates on a single server. Redshift’s COPY is optimized for parallel loading across nodes, far surpassing MySQL’s capabilities for large datasets.
– Example of loading data from S3 to Redshift
COPY sales FROM ‘s3://mybucket/sales_data.csv’
IAM_ROLE ‘arn:aws:iam::123456789012:role/MyRedshiftRole’
FORMAT AS CSV;
What are distribution styles in Redshift and why are they important?
Distribution styles (EVEN, KEY, ALL) dictate how data is spread across Redshift nodes. Choosing the right style reduces data movement during queries, boosting performance for joins and aggregations.
MySQL lacks data distribution as it’s not a clustered database. Redshift’s distribution styles are a key optimization absent in MySQL, tailored for distributed analytics.
– Example of creating a table with KEY distribution
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE
)
DISTKEY(customer_id);
What are window functions in Redshift and how do they compare to MySQL?
Window functions in Redshift, like ROW_NUMBER() or RANK(), compute values across rows without collapsing the result set. They’re powerful for analytics, supported by Redshift’s PostgreSQL-based SQL.
MySQL added window functions in version 8.0, aligning its capabilities with Redshift’s. However, Redshift’s implementation benefits from its distributed architecture, handling larger datasets more efficiently.
– Example of using ROW_NUMBER() in Redshift
SELECT sale_id, product, quantity,
ROW_NUMBER() OVER (PARTITION BY product ORDER BY quantity DESC) as rank
FROM sales;
How can AWS Glue be used with Amazon Redshift?
Glue extracts data from sources, transforms it with SQL or Python, and loads it into Redshift. The Glue Data Catalog also supports Redshift Spectrum for querying external data.
Glue can extract from MySQL, but its Redshift integration excels in data warehousing. MySQL lacks native ETL orchestration, making Glue a complementary tool for MySQL-to-Redshift pipelines.
Example of a Glue job loading data into Redshift
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, [‘JOB_NAME’])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args[‘JOB_NAME’], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=”mydatabase”, table_name=”mytable”)
transformed_data = datasource0.select_fields([“id”, “name”])
glueContext.write_dynamic_frame.from_jdbc_conf(frame=transformed_data,
catalog_connection=”redshift-connection”,
connection_options={“dbtable”: “public.mytable”, “database”: “mydatabase”},
redshift_tmp_dir=args[“TempDir”])
job.commit()
What is a Multi-AZ deployment in Amazon RDS?
Multi-AZ deployments enhance availability by replicating your RDS database to a standby instance in another Availability Zone. If the primary fails, RDS automatically fails over to the standby.
MySQL requires manual replication setup for high availability. RDS automates this with synchronous replication, reducing downtime compared to MySQL’s asynchronous options.
Example of enabling Multi-AZ for an RDS instance
aws rds modify-db-instance –db-instance-identifier mydbinstance –multi-az
What are Glue job bookmarks?
Glue job bookmarks track processed data across job runs, enabling incremental processing of new data only. This optimizes ETL jobs handling growing datasets.
MySQL might use timestamps or IDs for incremental updates, but Glue automates this with bookmarks, integrating seamlessly with AWS data sources beyond MySQL’s scope.
Example of enabling job bookmarks in Glue
job = Job(glueContext)
job.init(args[‘JOB_NAME’], args, bookmark=True)
What is Redshift workload management (WLM)?
WLM in Redshift manages query concurrency and memory by defining queues, prioritizing workloads to optimize cluster performance.
MySQL uses connection limits and timeouts for concurrency, but Redshift’s WLM offers finer control, critical for managing diverse analytical workloads unlike MySQL’s transactional focus.
– Example of setting up a WLM queue in Redshift
ALTER WLM CONFIGURATION
SET query_concurrency = 5
FOR queue_name = ‘default’;
What are RDS parameter groups?
Parameter groups in RDS manage database engine settings, allowing customization for performance or features. You can create and apply custom groups to instances.
MySQL uses my.cnf for configuration, requiring manual edits. RDS parameter groups offer a managed, scalable alternative, simplifying MySQL tuning in the cloud.
Example of creating a custom parameter group
aws rds create-db-parameter-group –db-parameter-group-name myparamgroup –db-parameter-group-family mysql8.0 –description “My custom parameter group”
How do Glue crawlers detect schema changes?
Glue crawlers detect schema changes in data sources and update the Data Catalog, ensuring metadata reflects the latest structure without manual updates.
MySQL requires manual schema management. Glue’s automation extends beyond MySQL, supporting dynamic schema evolution across diverse AWS data stores.
Example of scheduling a Glue crawler
aws glue start-crawler –name mysql-crawler –schedule “cron(0 12 * * ? *)”
What are materialized views in Redshift?
Materialized views in Redshift store precomputed query results, speeding up repeated analytical queries. They can be refreshed as needed.
MySQL lacks native materialized views, relying on triggers or jobs to simulate them. Redshift’s built-in support offers a performance edge for analytics.
– Example of creating a materialized view in Redshift
CREATE MATERIALIZED VIEW sales_summary AS
SELECT product, SUM(quantity) as total_quantity
FROM sales
GROUP BY product;
What are RDS read replicas?
RDS read replicas are copies of the primary database for read-only traffic, enhancing scalability. They can be promoted to standalone instances if needed.
MySQL supports replication, but RDS automates read replica management, reducing setup complexity and improving scalability over MySQL’s manual approach.
Example of creating a read replica
aws rds create-db-instance-read-replica –db-instance-identifier myreadreplica –source-db-instance-identifier mydbinstance
What are Glue triggers and workflows?
Glue triggers automate ETL job execution based on schedules or events, while workflows orchestrate multiple jobs and dependencies for complex pipelines.
MySQL uses cron jobs or event schedulers for automation. Glue’s managed workflows provide a broader, scalable solution beyond MySQL’s capabilities.
Example of creating a Glue trigger
aws glue create-trigger –name mytrigger –type SCHEDULED –schedule “cron(0 12 * * ? *)” –actions JobName=myjob
What are sort keys in Redshift?
Sort keys in Redshift physically order data on disk, optimizing query performance by reducing sorting overhead, especially for range-based queries.
MySQL uses indexes for sorting, but Redshift’s sort keys are integrated into table design, offering superior performance for large-scale analytics.
– Example of creating a table with a sort key
CREATE TABLE events (
event_id INT,
event_time TIMESTAMP
)
SORTKEY(event_time);
What is RDS Performance Insights?
Performance Insights in RDS monitors database performance, identifying bottlenecks and optimizing queries with visual tools and metrics.
MySQL offers EXPLAIN and slow query logs, but Performance Insights provides a more comprehensive, user-friendly interface, enhancing MySQL monitoring in RDS.
Example of enabling Performance Insights
aws rds modify-db-instance –db-instance-identifier mydbinstance –enable-performance-insights
How does Glue handle schema evolution?
Glue detects schema changes and updates the Data Catalog, supporting schema evolution without disrupting ETL jobs.
MySQL requires manual schema migrations. Glue’s automated evolution simplifies managing dynamic data sources, complementing MySQL’s static structure.
Example of using Glue’s schema evolution
glueContext.create_dynamic_frame.from_catalog(database=”mydatabase”, table_name=”mytable”, push_down_predicate=”partition_key=2025”)
What is Redshift Spectrum?
Redshift Spectrum lets you query S3 data using SQL without loading it into Redshift, leveraging the Glue Data Catalog for metadata.
MySQL’s federated tables offer similar external querying, but Redshift Spectrum scales better for data lakes, far exceeding MySQL’s scope.
– Example of querying S3 data with Redshift Spectrum
SELECT * FROM s3.my_external_table WHERE year=2025;
How do you scale an RDS MySQL instance?
RDS MySQL scales vertically by changing instance types or horizontally with read replicas. Storage can also auto-scale.
Self-managed MySQL scaling is manual and labor-intensive. RDS automates these processes, making scalability more accessible than in MySQL alone.
Example of modifying an RDS instance type
aws rds modify-db-instance –db-instance-identifier mydbinstance –db-instance-class db.m5.large –apply-immediately
What is Glue’s machine learning transform?
Glue’s ML transforms use machine learning to detect data quality issues like duplicates, automating cleanup tasks in ETL jobs.
MySQL requires custom SQL for such tasks. Glue’s ML capabilities extend beyond MySQL, offering advanced data processing automation.
Example of using FindMatches ML transform
find_matches = FindMatches.apply(frame=datasource0, transform_id=”tfm-123”, transformation_ctx=”find_matches”)