Database Specialty - Redshift Architecture Flashcards

1
Q

Redshift Architecture

A
  • Massively parallel columnar database, runs within a VPC
  • Single leader node and multiple compute nodes
  • You can connect to Redshift using any application that supports JDBC or ODBC drivers
    for PostgreSQL
  • Clients query the leader node via a SQL endpoint
  • The leader node distributes each job across the compute nodes
  • Compute nodes partition the job into slices
  • The leader node then aggregates the results and returns them to the client
2
Q

Redshift node types

A
  • Dense compute nodes (DC2)
    • For compute-intensive DW workloads with local SSD storage
  • Dense storage nodes (DS2)
    • For large DWs, uses hard disk drives (HDDs)
  • RA3 nodes with managed storage
    • For large DWs, uses large local SSDs
    • Recommended over DS2
    • Automatically offloads data to S3 if it grows beyond the node's local storage
    • Compute and managed storage are billed independently
3
Q

Loading data into Redshift

A
  • Typically, data from OLTP systems is loaded into Redshift for analytics and BI purposes
  • Data from OLTP systems can be loaded into S3 and data from S3 can then be loaded into Redshift
  • Data from Kinesis Firehose can also be loaded in the same way
  • COPY command
    • Loads data from files stored in S3 into Redshift
    • Data is stored locally in the Redshift cluster (persistent storage = cost)
    • DynamoDB table data and EMR data can also be loaded using the COPY command
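A COPY from DynamoDB might look like the sketch below; the table names, role ARN, and READRATIO value are illustrative assumptions, not from the source:

```sql
-- Hypothetical: load a DynamoDB table into a Redshift table.
-- READRATIO caps the share of the table's provisioned read capacity used by the load.
COPY users
FROM 'dynamodb://Users'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;
```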
4
Q

Loading data from S3 with COPY command
copy users from 's3://my_bucket/tickit/allusers_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::0123456789:role/MyRedshiftRole'
delimiter '|' region 'us-west-2';

A
  • Create an IAM Role
  • Create your Redshift cluster
  • Attach the IAM role to the cluster
  • The cluster can then temporarily assume the IAM role on your behalf
  • Load data from S3 using COPY command
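The CREDENTIALS clause in the question can also be written with the newer IAM_ROLE keyword; a sketch reusing the question's illustrative bucket and role ARN:

```sql
-- Equivalent COPY using the IAM_ROLE keyword instead of CREDENTIALS
COPY users
FROM 's3://my_bucket/tickit/allusers_pipe.txt'
IAM_ROLE 'arn:aws:iam::0123456789:role/MyRedshiftRole'
DELIMITER '|'
REGION 'us-west-2';
```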
5
Q

More ways to load data into Redshift

A
  • Use AWS Glue – fully managed ETL service
    • ETL = Extract, Transform, and Load
  • Use ETL tools from APN partners
  • Use AWS Data Pipeline
  • For migration from on-premises systems, use:
    • AWS Import/Export service (AWS Snowball)
    • AWS Direct Connect (private connection between your data center and AWS)
6
Q

Querying external data with Redshift

A
  • Two ways:
    • Redshift Spectrum
    • Redshift Federated Query
7
Q

Redshift Spectrum

A
  • Query exabytes of data from S3 without loading it into Redshift
  • Must have a Redshift cluster available to start the query
  • The query is then submitted to thousands of Redshift Spectrum nodes
  • External table structure/schemas can be created in external data catalog like Athena / Glue / Apache Hive metastore (EMR)
  • These external tables are read-only (insert / update / delete operations not possible)
  • Redshift cluster and S3 bucket must be in the same region
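A minimal Spectrum setup might look like this; the schema, database, table, bucket, and role ARN are all illustrative assumptions:

```sql
-- Register an external schema backed by the Glue/Athena data catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a read-only external table over files in S3
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   INTEGER,
    amount    DECIMAL(10,2),
    sale_date DATE
)
STORED AS PARQUET
LOCATION 's3://my_bucket/sales/';

-- Query the S3 data in place, without loading it into the cluster
SELECT sale_date, SUM(amount)
FROM spectrum_schema.sales
GROUP BY sale_date;
```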
8
Q

Redshift Federated Query

A
  • Query and analyze data across different DBs, DWs, and data lakes
  • Currently works with Redshift, PostgreSQL on RDS, Aurora PostgreSQL, and S3
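A federated query against an RDS PostgreSQL instance could be set up as sketched below; the hostname, database, secret ARN, role ARN, and table names are illustrative assumptions:

```sql
-- Map a live PostgreSQL database into Redshift as an external schema
CREATE EXTERNAL SCHEMA pg_schema
FROM POSTGRES
DATABASE 'ordersdb' SCHEMA 'public'
URI 'my-rds-instance.abc123.us-west-2.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:pg-creds';

-- Join live operational data with a local warehouse table
SELECT o.order_id, u.username
FROM pg_schema.orders o
JOIN users u ON u.user_id = o.user_id;
```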
9
Q

Star Schema

A
  • Typical way of organizing data in a data warehouse
    • Two types of tables – fact tables and dimension tables
    • A star pattern consists of a fact table and multiple dimension tables
    • Fact table has foreign key relationships with multiple dimension tables
    • Dimension tables are generally small (fewer records, but often with many fields)
  • Recommended to store:
    • smaller dimension tables as local Redshift tables
    • larger fact tables as external Spectrum tables
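The recommended split above can be sketched in DDL; all table and column names are illustrative, and the fact table is shown as a local table (a very large one could instead be an external Spectrum table):

```sql
-- Small dimension table, stored locally in Redshift
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,  -- informational only, not enforced
    name       VARCHAR(100),
    category   VARCHAR(50)
);

-- Fact table with a foreign-key relationship to the dimension table
CREATE TABLE fact_sales (
    sale_id    BIGINT,
    product_id INTEGER REFERENCES dim_product (product_id),
    amount     DECIMAL(10,2),
    sale_date  DATE
);
```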
10
Q

Redshift Key Constraints

A
  • Redshift does not support indexes (clustered or non-clustered)
  • Has a concept of sort keys
  • data is stored on disk in a sorted order by sort key
  • Primary Key, Unique key, and Foreign key constraints are not enforced (informational only)
  • Query optimizer uses these constraints to generate more efficient query plans
  • You can enforce PK/FK relationships through your application
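A sort key and an informational primary key might be declared as in this sketch (table and column names are assumptions):

```sql
CREATE TABLE events (
    event_id   BIGINT PRIMARY KEY,   -- informational only; Redshift does not enforce it
    user_id    INTEGER,
    event_time TIMESTAMP
)
SORTKEY (event_time);                -- rows are stored on disk sorted by event_time
```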
11
Q

Redshift Table Design

A
  • Key factors for efficient table design
    • Data distribution – how data is distributed across nodes
    • Sort strategies – how data is sorted in the tables
    • Compression – to reduce storage and I/O needs
12
Q

Redshift Sort Styles

A
  • Single-column sort key (e.g. Dept)
  • Compound sort key
    • more than one column as sort key
    • e.g. Dept + Location
    • Hierarchical (order of the columns in the sort key is important)
  • Interleaved sort key
    • gives equal weight to each column (or subset of columns) in the sort key
    • In effect, you can have multiple sort key combinations
    • e.g. Dept + Location, Location + Dept
  • Must be defined at table creation time
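The compound and interleaved styles above can be sketched in DDL; table and column names are illustrative (a single-column SORTKEY with no keyword defaults to compound):

```sql
-- Hierarchical: sorts by dept first, then location within each dept
CREATE TABLE emp_compound (
    emp_id   INTEGER,
    dept     VARCHAR(30),
    location VARCHAR(30)
)
COMPOUND SORTKEY (dept, location);

-- Equal weight to dept and location; helps queries filtering on either column
CREATE TABLE emp_interleaved (
    emp_id   INTEGER,
    dept     VARCHAR(30),
    location VARCHAR(30)
)
INTERLEAVED SORTKEY (dept, location);
```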
13
Q

VACUUM operation in the tables

A
  • As you delete rows from or add rows to a sorted table containing data, performance might deteriorate over time
  • VACUUM operation re-sorts rows in one or all DB tables
  • And reclaims space from table rows marked for deletion
  • Redshift automatically sorts data and runs VACUUM DELETE in the background
  • Need not run manually, but you can if required
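If you do run VACUUM manually, the main variants look like this (the table name is illustrative):

```sql
VACUUM FULL events;        -- re-sort rows AND reclaim space from deleted rows (default)
VACUUM DELETE ONLY events; -- reclaim deleted-row space without re-sorting
VACUUM SORT ONLY events;   -- re-sort rows without reclaiming space
```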
14
Q

Redshift Compression

A
  • Column-level operation to reduce size of stored data
  • Reduces disk I/O and improves query performance
  • Compression type = encoding
  • Each column can be separately compressed with different encodings (manually / automatically)
  • COPY command applies compression by default
  • RAW = No compression (default for sort keys)
  • LZO = Very high compression with good performance (default encoding)
  • Encoding cannot be changed after table creation
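Per-column encodings are set in the table DDL; a sketch with illustrative names and encoding choices:

```sql
CREATE TABLE page_views (
    view_id    BIGINT       ENCODE raw,   -- no compression
    url        VARCHAR(500) ENCODE lzo,   -- high compression, good performance
    view_count INTEGER      ENCODE delta  -- suits values that change by small increments
)
SORTKEY (view_id);                        -- sort key columns default to RAW

-- Ask Redshift to suggest encodings from a sample of existing table data
ANALYZE COMPRESSION page_views;
```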
15
Q

Redshift Workload Management (WLM)

A
  • Helps you prioritize workloads
  • Can prevent long-running queries from impacting short-running ones
  • Two modes – Automatic WLM and Manual WLM
  • Automatic WLM supports queue priorities
  • SQA (Short Query Acceleration)
    • prioritizes selected short-running queries without creating a dedicated queue
16
Q

Modifying the WLM configuration

A
  • Switching between Manual and Auto WLM modes requires cluster reboot
  • WLM uses parameter groups for its config
  • WLM configuration properties are either dynamic or static
  • Dynamic property changes do not require cluster reboot
  • Static property changes require cluster reboot