Redshift Flashcards

1
Q
A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is Amazon Redshift?

A

A fully managed, petabyte-scale data warehouse service in the cloud.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are some key benefits of using Amazon Redshift?

A
  • High performance: Up to 10x faster than other data warehouse solutions.
  • Cost-effective: Pay only for what you use.
  • Scalable: Easily scale your cluster up or down to meet changing needs.
  • Secure: Data is encrypted at rest and in transit.
  • Durable: Data is replicated within the cluster and backed up to S3.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is Redshift Spectrum?

A

Allows you to query exabytes of data stored in S3 without loading it into Redshift.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How does Redshift achieve high performance?

A

Utilizes Massively Parallel Processing (MPP) to distribute queries across multiple nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are the different Redshift distribution styles?

A
  • AUTO: Redshift automatically chooses the best distribution style.
  • EVEN: Data is distributed evenly across all nodes.
  • KEY: Data is distributed based on a chosen key column.
  • ALL: The entire table is copied to all nodes.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is the purpose of the COPY command?

A

Efficiently load large amounts of data from external sources into Redshift.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the function of the UNLOAD command?

A

Unload data from Redshift tables to files in S3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is Redshift Workload Management (WLM)?

A

Prioritizes different types of queries and manages resources to optimize performance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is Concurrency Scaling?

A

Automatically adds cluster capacity to handle spikes in concurrent read queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Automatically adds cluster capacity to handle spikes in concurrent read queries.

A

Automatically adds cluster capacity to handle spikes in concurrent read queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are the two types of cluster resizing in Redshift?

A

Elastic Resize: Quick resizing with minimal downtime.
Classic Resize: More time-consuming but allows for changing node types.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the purpose of the VACUUM command?

A

Reclaims disk space by removing deleted rows and sorting data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are RA3 nodes?

A

A new generation of Redshift nodes that allow for independent scaling of compute and storage.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is Redshift Data Lake Export?

A

Enables unloading data from Redshift to S3 in Parquet format for efficient data lake integration.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are Materialized Views?

A

Pre-compute and store query results for faster performance on complex queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is Redshift Data Sharing?

A

Allows you to securely share live data across Redshift clusters without copying or moving data.

18
Q

What are Redshift Lambda UDFs?

A

Integrate custom code written in any language with your SQL queries.

19
Q

What are Redshift Federated Queries?

A

Query and analyze data across databases, data warehouses, and data lakes without ETL.

20
Q

What is the Redshift Data API?

A

Execute SQL statements against your Redshift cluster via HTTPS.

21
Q

What is Redshift Serverless?

A

Automatically provisions and scales compute capacity based on workload demands, eliminating the need to manage infrastructure.

22
Q

What are Redshift Processing Units (RPUs)?

A

A measure of compute capacity in Redshift Serverless.

23
Q

What are some of the Redshift system tables and views?

A
  • SVV_EXTERNAL_SCHEMAS: Information about external schemas.
  • STL_QUERYTEXT: Text of queries executed on the system.
  • SYS_QUERY_HISTORY: History of queries run on the system.
24
Q

What is DBLINK?

A

Allows you to connect Redshift to a PostgreSQL database (e.g., in RDS) for data copying and synchronization.

25
What is the benefit of Amazon Aurora zero-ETL integration with Redshift?
Enables automatic data replication from Aurora to Redshift without needing ETL processes.
26
What are some Redshift anti-patterns?
* Using Redshift for small datasets (better suited for RDS). * Using Redshift for OLTP workloads (consider RDS or DynamoDB). * Storing unstructured data directly in Redshift (use ETL and/or Redshift Spectrum). * Storing BLOB data directly in Redshift (store references to files in S3 instead).
27
How can you enhance security in Redshift?
* Use a Hardware Security Module (HSM) for encryption key management. * Define granular access privileges for users and groups using GRANT and REVOKE commands.
28
What are the different types of data shares in Redshift?
* Standard data shares: For sharing data across Redshift clusters. * AWS Data Exchange data shares: For sharing data with subscribers. * AWS Lake Formation-managed data shares: For sharing data with fine-grained access control.
29
What are some use cases for the Redshift Data API?
* Application integration: Build applications that interact with Redshift data. * ETL orchestration: Use with AWS Step Functions to create serverless data processing workflows. * Event-driven ETL: Trigger ETL jobs based on events using Amazon EventBridge. * Access from SageMaker notebooks: Run queries and analyze data in Jupyter notebooks.
30
What are some limitations of Redshift Serverless?
* No support for parameter groups. * No workload management capabilities. * Limited AWS Partner integration. * No public endpoints (must be accessed within a VPC).
31
How can you monitor Redshift Serverless?
* Monitoring views: SYS_QUERY_HISTORY, SYS_LOAD_HISTORY, SYS_SERVERLESS_USAGE. * CloudWatch logs: Connection and user logs, optional user activity logs. * CloudWatch metrics: QueriesCompletedPerSecond, QueryDuration, QueriesRunning.
32
What are the different ways to load data into Redshift?
* COPY command: For loading large datasets from external sources (S3, EMR, DynamoDB, etc.) * INSERT INTO ... SELECT: For loading data from existing tables within Redshift. * CREATE TABLE AS: For creating a new table based on the results of a query. * Streaming ingestion: For continuously loading data from Kinesis Data Streams or Amazon MSK.
33
What are some best practices for using the COPY command?
* Use a manifest file when loading from S3. * Leverage IAM roles for secure access to data sources. * Consider using compression (Gzip, Lzop, bzip2) to speed up data loading. * Use the AUTOMATIC COMPRESSION option to let Redshift determine the optimal compression scheme.
34
What is a Redshift snapshot?
A point-in-time backup of your Redshift cluster that can be used for disaster recovery or to create a new cluster.
35
How can you improve the performance of queries in Redshift?
* Use the appropriate distribution style for your tables. * Choose the right sort keys for your tables. * Use materialized views for frequently accessed data. * Optimize your queries by avoiding unnecessary joins and scans. * Utilize Redshift Workload Management (WLM) to prioritize queries.
36
What are some common Redshift performance metrics to monitor?
* CPU utilization * Disk space usage * Query runtime * Throughput (queries per second) * Concurrency (number of active queries)
37
What are some tools and services that can be integrated with Redshift?
* AWS S3: For data storage and loading. * Amazon EMR: For data preprocessing and ETL. * AWS Data Pipeline: For orchestrating data movement and transformation. * AWS Database Migration Service (DMS): For migrating data to Redshift. * Amazon QuickSight: For data visualization and business intelligence.
38
What is the difference between Redshift and Amazon Aurora?
* Redshift is a data warehouse optimized for analytical workloads. * Aurora is a relational database optimized for transactional workloads.
39
What are some security best practices for Redshift?
* Encrypt your cluster using AWS KMS. * Control network access using security groups and VPCs. * Implement least privilege access control using IAM roles and policies. * Regularly audit user activity and monitor for suspicious behavior.
40
What are some cost optimization strategies for Redshift?
* Choose the right node type and cluster size for your workload. * Utilize Redshift Spectrum to query data directly in S3. * Leverage reserved instances for predictable workloads. * Monitor your cluster usage and identify opportunities for optimization. * Consider using Redshift Serverless for variable workloads.
41
What are some resources for learning more about Redshift?
* AWS documentation * AWS Redshift blogs and whitepapers * AWS training courses and certifications * AWS community forums and support