Data Loading & Unloading Flashcards

(98 cards)

1
Q

Stages

A

Stages are temporary storage locations for data files used in the data loading and unloading process
- Broken into Internal and External

2
Q

Internal Stages are divided into what 3 stages?

A
  • User stage
  • Table stage
  • Named stage
3
Q

User Stage

A
  • An internal stage
  • Automatically allocated when a user is created
  • Use a PUT command to upload a file from your local machine to Snowflake
  • Reference to a user stage is: ls @~;
  • Cannot be altered or dropped
  • Not appropriate if multiple users need access to stage
4
Q

Table Stage

A
  • An internal stage
  • Automatically allocated when a table is created
  • Use a PUT command to upload a file into a table stage
  • Reference to a table stage is: ls @%MY_TABLE;
  • Cannot be altered or dropped
  • User must have ownership privileges on table
5
Q

Named Stage

A
  • An internal stage
  • User created database object
  • Use a PUT command to upload a file into a named stage
  • Reference to a named stage is: ls @MY_STAGE;
  • Securable object
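
A minimal sketch of creating a named internal stage and uploading to it from a client such as SnowSQL (stage and file names are illustrative, not from the deck):

CREATE STAGE MY_STAGE;
PUT file:///tmp/orders.csv @MY_STAGE;
LIST @MY_STAGE;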
6
Q

Uncompressed files are automatically compressed using ________ when loaded into an internal stage, unless explicitly set not to.

A

GZIP

7
Q

In internal stages, staged files are automatically encrypted using ____-bit keys.

A

128

8
Q

External Stages

A

External stages reference data files stored in a location outside of Snowflake, which we manage ourselves.
- Could be Amazon S3 buckets, Google Cloud Storage buckets, or Microsoft Azure containers

9
Q

External named stage

A
  • User created database object
  • Files are uploaded using the cloud utilities of the cloud provider
  • Reference to a named stage is: ls @MY_STAGE;
  • Storage locations can be private or public
  • Copy options such as ON_ERROR and PURGE can be set on stages
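
A hedged sketch of an external stage pointing at S3 (URL and integration name are placeholders):

CREATE STAGE MY_EXT_STAGE
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = MY_S3_INT;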
10
Q

Storage Integration

A

A storage integration is a reusable and securable Snowflake object which can be applied across stages and is recommended to avoid having to explicitly set sensitive information for each stage definition.
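
A minimal sketch of a storage integration for S3, assuming an IAM role already exists (the ARN and bucket are placeholders):

CREATE STORAGE INTEGRATION MY_S3_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/data/');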

11
Q

Stage Helper Command - LIST

A

List the contents of a stage:
- Path of staged file
- Size of staged file
- MD5 Hash of staged file
- Last updated timestamp

  • Can optionally specify a path for specific folders or files
  • References to named and table stages can optionally be qualified with a database and schema, e.g. ls @MY_DB.MY_SCHEMA.MY_STAGE;
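
Illustrative LIST invocations (names and paths are placeholders):

LIST @MY_STAGE;            -- named stage
ls @%MY_TABLE/2023/;       -- table stage, specific path
ls @~;                     -- user stage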
12
Q

Stage Helper Command - SELECT

A
  • Query the contents of staged files directly using standard SQL for both internal and external stages
  • Useful for inspecting files prior to data loading/unloading
  • Reference metadata columns such as filename and row numbers for a staged file
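
A hedged example of querying a staged CSV directly (stage and file format names are placeholders):

SELECT metadata$filename, metadata$file_row_number, $1, $2
FROM @MY_STAGE (FILE_FORMAT => 'MY_CSV_FORMAT');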
13
Q

Stage Helper Command - REMOVE

A
  • Remove files from either an external or internal stage
  • Can optionally specify a path for specific folders or files
  • References to named and table stages can optionally be qualified with a database and schema, e.g. rm @MY_DB.MY_SCHEMA.MY_STAGE;
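
Illustrative REMOVE invocations (names and patterns are placeholders):

REMOVE @MY_STAGE/2023/;
rm @%MY_TABLE PATTERN = '.*\\.csv\\.gz';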
14
Q

PUT command

A
  • The PUT command uploads data files from a local directory on a client machine to any of the three types of internal stage.
  • Uploaded files are automatically encrypted with a 128-bit key with optional support for a 256-bit key.
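
A minimal PUT sketch, run from a client such as SnowSQL rather than a worksheet (path and stage names are placeholders):

PUT file:///tmp/orders.csv @MY_STAGE AUTO_COMPRESS = TRUE;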
15
Q

PUT cannot be executed from within ____________.

A

worksheets

16
Q

Duplicate files uploaded to a stage via PUT are _______.

A

ignored

17
Q

COPY INTO <table> statement

A
  • The COPY INTO <table> statement copies the contents of an internal or external location directly into a table
  • COPY INTO <table> requires a user created virtual warehouse to execute
  • Load history is stored in the metadata of the target table for 64 days, which ensures files are not loaded twice
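
A hedged COPY INTO <table> sketch (table, stage, and format details are placeholders):

COPY INTO MY_TABLE
FROM @MY_STAGE/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);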
18
Q

What file formats can be uploaded into Snowflake?

A
  • Delimited files (CSV, TSV, etc.)
  • JSON
  • Avro
  • ORC
  • Parquet
  • XML
19
Q

Copy Option: ON_ERROR

A

Value that specifies the error handling for the load operation:
- CONTINUE
- SKIP_FILE
- SKIP_FILE_<num>
- SKIP_FILE_<num>%
- ABORT_STATEMENT

Default Value - 'ABORT_STATEMENT'
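
For example (names are placeholders):

COPY INTO MY_TABLE
FROM @MY_STAGE
ON_ERROR = 'SKIP_FILE';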

20
Q

Copy Option: SIZE_LIMIT

A

Number that specifies the maximum size of data loaded by a COPY statement

Default Value - null (no size limit)

21
Q

Copy Option: PURGE

A

Boolean that specifies whether to remove the data files from the stage automatically after the data is loaded successfully

Default Value - FALSE

22
Q

Copy Option: RETURN_FAILED_ONLY

A

Boolean that specifies whether to return only files that have failed to load in the statement result

Default Value - FALSE

23
Q

Copy Option: MATCH_BY_COLUMN_NAME

A

String that specifies whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data

Default Value - None

24
Q

Copy Option: ENFORCE_LENGTH

A

Boolean that specifies whether to truncate text strings that exceed the target column length. If TRUE, COPY produces an error when a loaded string exceeds the target column length; if FALSE, strings are automatically truncated (the reverse of TRUNCATECOLUMNS)

Default Value - TRUE

25
Q

Copy Option: TRUNCATECOLUMNS

A

Boolean that specifies whether to truncate text strings that exceed the target column length

Default Value - FALSE

26
Q

Copy Option: FORCE

A

Boolean that specifies to load all files, regardless of whether they've been loaded previously and have not changed since they were loaded

Default Value - FALSE

27
Q

Copy Option: LOAD_UNCERTAIN_FILES

A

Boolean that specifies to load files for which the load status is unknown. The COPY command skips these files by default.

Default Value - FALSE
28
Q

Output Column Names: COPY INTO <table>

A

- FILE
- STATUS
- ROWS_PARSED
- ROWS_LOADED
- ERROR_LIMIT
- ERRORS_SEEN
- FIRST_ERROR
- FIRST_ERROR_LINE
- FIRST_ERROR_CHARACTER
- FIRST_ERROR_COLUMN_NAME
29
Q

COPY INTO <table>: Column Name: FILE

A

Data Type - TEXT
Description - Name of source file and relative path to the file

30
Q

COPY INTO <table>: Column Name: STATUS

A

Data Type - TEXT
Description - Status: loaded, load failed, or partially loaded

31
Q

COPY INTO <table>: Column Name: ROWS_PARSED

A

Data Type - NUMBER
Description - Number of rows parsed from the source file

32
Q

COPY INTO <table>: Column Name: ROWS_LOADED

A

Data Type - NUMBER
Description - Number of rows loaded from the source file

33
Q

COPY INTO <table>: Column Name: ERROR_LIMIT

A

Data Type - NUMBER
Description - If the number of errors reaches this limit, then abort

34
Q

COPY INTO <table>: Column Name: ERRORS_SEEN

A

Data Type - NUMBER
Description - Number of error rows in the source file

35
Q

COPY INTO <table>: Column Name: FIRST_ERROR

A

Data Type - TEXT
Description - First error of the source file

36
Q

COPY INTO <table>: Column Name: FIRST_ERROR_LINE

A

Data Type - NUMBER
Description - Line number of the first error

37
Q

COPY INTO <table>: Column Name: FIRST_ERROR_CHARACTER

A

Data Type - NUMBER
Description - Position of the first error character

38
Q

COPY INTO <table>: Column Name: FIRST_ERROR_COLUMN_NAME

A

Data Type - TEXT
Description - Column name of the first error
39
Q

VALIDATION_MODE parameter

A

Optional parameter that allows you to perform a dry run of the load process to expose errors when running COPY INTO <table>:
- RETURN_N_ROWS
- RETURN_ERRORS
- RETURN_ALL_ERRORS

40
Q

VALIDATE function

A

  • VALIDATE is a table function to view all errors encountered during a previous COPY INTO execution
  • VALIDATE accepts a job ID of a previous query or the last load operation executed
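
Hedged sketches of both (table and stage names are placeholders; '_last' refers to the last load operation in the session):

COPY INTO MY_TABLE FROM @MY_STAGE VALIDATION_MODE = 'RETURN_ERRORS';
SELECT * FROM TABLE(VALIDATE(MY_TABLE, JOB_ID => '_last'));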
41
Q

File format options can be set on a named stage or _____ ____ statement.

A

COPY INTO

42
Q

Explicitly declared file format options can all be rolled up into independent _________ ____ ______ objects.

A

Snowflake File Format

43
Q

File formats can be applied to both named stages and COPY INTO statements. If set on both, the _____ ______ options will take precedence.

A

COPY INTO

44
Q

In the File Format object, the file format you're expecting to load is set via the 'type' property with one of the following values: ______, ______, ______, ______, _______, or ______.

A

CSV, JSON, AVRO, ORC, PARQUET, XML
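
A minimal file format object sketch and its use in a COPY statement (names and options are illustrative):

CREATE FILE FORMAT MY_CSV_FORMAT
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

COPY INTO MY_TABLE FROM @MY_STAGE FILE_FORMAT = (FORMAT_NAME = 'MY_CSV_FORMAT');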
45
Q

CSV File Type

A

Comma-Separated Values file. A plain text file that contains a list of data. They mostly use the comma character to separate data, but sometimes use other characters, like semicolons.

46
Q

JSON File Type

A

JavaScript Object Notation file. A file that stores simple data structures and objects in JavaScript Object Notation (JSON) format. It is primarily used for transmitting data between a web application and a server. JSON files are lightweight, text-based, human-readable, and can be edited using a text editor.

47
Q

AVRO File Type

A

Stores the data definition in JSON format, making it easy to read and interpret; the data itself is stored in binary format, making it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for Apache MapReduce processing.
48
Q

ORC File Type

A

Optimized Row Columnar (ORC). Open-source columnar storage file format originally released in early 2013 for Hadoop workloads. ORC provides a highly efficient way to store Apache Hive data, though it can store other data as well. It was designed and optimized specifically with Hive data in mind, improving overall performance when Hive reads, writes, and processes data.
49
Q

PARQUET File Type

A

Apache Parquet is a file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. It's open-source and supports multiple coding languages, including Java, C++, and Python.

50
Q

XML File Type

A

Extensible Markup Language file. It contains a formatted dataset that is intended to be processed by a website, web application, or software program. XML files can be thought of as text-based databases.
51
Q

If a File Format object or options are not provided to either the stage or COPY statement, the default behavior is to try to interpret the contents of a stage as a _____ with _____ encoding.

A

CSV, UTF-8
52
Q

SNOWPIPE

A

The Pipe object defines a COPY INTO <table> statement that will execute in response to a file being uploaded to a stage.

53
Q

The two methods for detecting when a new file has been uploaded to a stage:

A

1. Automating Snowpipe using cloud messaging (external stages only)
2. Calling Snowpipe REST endpoints (internal and external stages)
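
A hedged pipe sketch using auto-ingest with an external stage (names are placeholders):

CREATE PIPE MY_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO MY_TABLE
  FROM @MY_EXT_STAGE
  FILE_FORMAT = (TYPE = 'JSON');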
54
Q

Snowpipe: Cloud Messaging flow

A

File lands in an external stage → cloud provider event notification is sent to a queue Snowflake monitors → Snowpipe receives the notification → pipe executes its COPY INTO <table> statement to load the file.

55
Q

Snowpipe: REST Endpoint flow

A

File is uploaded to an internal or external stage → client calls the Snowpipe insertFiles REST endpoint with a list of file names → Snowpipe queues the files → pipe executes its COPY INTO <table> statement to load them.
56
Q

Snowpipe is designed to load new data typically within a _______ after a file notification is sent.

A

minute

57
Q

Snowpipe is a _________ feature, using Snowflake-managed compute resources to load data files, not a user-managed _______ _________.

A

serverless, virtual warehouse

58
Q

Snowpipe load history is stored in the ________ of the pipe for ____ days, used to prevent reloading the same files into a table.

A

metadata, 14

59
Q

When a pipe is paused, event messages received for the pipe enter a limited retention period. The period is ____ days by default.

A

14
60
Q

Compare Bulk Loading vs. Snowpipe: Authentication

A

Bulk Loading: Relies on the security options supported by the client for authenticating and initiating a user session.
Snowpipe: When calling the REST endpoints, requires key pair authentication with JSON Web Token (JWT). JWTs are signed using a public/private key pair with RSA encryption.

61
Q

Compare Bulk Loading vs. Snowpipe: Load History

A

Bulk Loading: Stored in the metadata of the target table for 64 days.
Snowpipe: Stored in the metadata of the pipe for 14 days.

62
Q

Compare Bulk Loading vs. Snowpipe: Compute Resources

A

Bulk Loading: Requires a user-specified warehouse to execute COPY statements.
Snowpipe: Uses Snowflake-supplied compute resources.

63
Q

Compare Bulk Loading vs. Snowpipe: Billing

A

Bulk Loading: Billed for the amount of time each virtual warehouse is active.
Snowpipe: Snowflake tracks the resource consumption of loads for all pipes in an account, with per-second/per-core granularity, as Snowpipe actively queues and processes data files. In addition to resource consumption, an overhead is included in the utilization costs charged for Snowpipe: 0.06 credits per 1,000 files notified or listed via event notifications or REST API calls.
64
Q

Data Loading Best Practices

A

- Break files into 100-250 MB (compressed) chunks
- Organize data by path
- Use separate virtual warehouses for loading and querying
- Pre-sort data
- Stage files no more often than once per minute so they don't back up in the queue and incur unnecessary cost
65
Q

Table data can be unloaded to a stage via the _________ command.

A

COPY INTO <location>

66
Q

The ____ command is used to download a staged file to the local file system.

A

GET

67
Q

By default, results unloaded to a stage using the _________ command are split into multiple files.

A

COPY INTO <location>

68
Q

All data files unloaded to internal stages are automatically encrypted using ____-bit keys.

A

128

69
Q

COPY INTO <location> output files can be prefixed by specifying a string at the ____ of a stage path.

A

end

70
Q

COPY INTO <location> includes a ____________ ___ copy option to partition unloaded data into a directory structure.

A

PARTITION BY

71
Q

COPY INTO <location> can copy table records directly to an _________ cloud provider's blob storage.

A

external
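
Hedged unloading sketches showing a filename prefix and PARTITION BY (stage, table, and the region column are placeholders):

COPY INTO @MY_STAGE/unload/orders_
FROM MY_TABLE
FILE_FORMAT = (TYPE = 'CSV');

COPY INTO @MY_STAGE/unload/
FROM MY_TABLE
PARTITION BY ('region=' || region)
FILE_FORMAT = (TYPE = 'CSV');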
72
Q

COPY INTO <location> Copy Option: OVERWRITE

A

Definition: Boolean that specifies whether the COPY command overwrites existing files with matching names, if any, in the location where files are stored.
Default Value: FALSE

73
Q

COPY INTO <location> Copy Option: SINGLE

A

Definition: Boolean that specifies whether to generate a single file or multiple files.
Default Value: FALSE

74
Q

COPY INTO <location> Copy Option: MAX_FILE_SIZE

A

Definition: Number (>0) that specifies the upper size limit (in bytes) of each file to be generated in parallel per thread.
Default Value: 16777216 (16 MB)

75
Q

COPY INTO <location> Copy Option: INCLUDE_QUERY_ID

A

Definition: Boolean that specifies whether to uniquely identify unloaded files by including a universally unique identifier (UUID) in the filenames of unloaded data files.
Default Value: FALSE
76
Q

______ is the reverse of PUT. It allows users to specify a source stage and a _______ local directory to download files to.

A

GET, target

77
Q

GET cannot be used for _________ stages.

A

external

78
Q

GET cannot be ________ from within worksheets.

A

executed

79
Q

When using the GET command, downloaded files are automatically decrypted. T/F

A

True

80
Q

When using the GET command, the __________ optional parameter specifies the number of threads to use for downloading files. Increasing this number can improve ____________ when downloading large files.

A

PARALLEL, parallelization

81
Q

When using the GET command, the _________ optional parameter specifies a regular expression pattern for filtering files to download.

A

PATTERN
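
A minimal GET sketch, run from a client such as SnowSQL (paths and pattern are placeholders):

GET @MY_STAGE/unload/ file:///tmp/downloads/
  PARALLEL = 10
  PATTERN = '.*\\.csv\\.gz';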
82
Q

Semi-structured Data Type: ARRAY

A

Contains 0 or more elements of data. Each element is accessed by its position in the array.

83
Q

Semi-structured Data Type: OBJECT

A

Represents collections of key-value pairs.

84
Q

Semi-structured Data Type: VARIANT

A

Universal semi-structured data type used to represent arbitrary data structures.

85
Q

VARIANT data type can hold up to ___ MB of compressed data per row.

A

16

86
Q

Semi-structured data formats supported by Snowflake

A

JSON, AVRO, ORC, PARQUET, XML

87
Q

Loading Semi-Structured Data Flow

A

Semi-structured data file ---PUT--> Stage ---COPY INTO--> Table
88
Q

JSON File Format Option: DATE_FORMAT

A

Used only for loading JSON data into separate columns. Defines the format of date string values in the data files.

89
Q

JSON File Format Option: TIME_FORMAT

A

Used only for loading JSON data into separate columns. Defines the format of time string values in the data files.

90
Q

JSON File Format Option: COMPRESSION

A

Supported algorithms: GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE. If BROTLI, cannot use AUTO.

91
Q

JSON File Format Option: ALLOW_DUPLICATE

A

Only used for loading. If TRUE, allows duplicate object field names (only the last one will be preserved).

92
Q

JSON File Format Option: STRIP_OUTER_ARRAY

A

Only used for loading. If TRUE, the JSON parser will remove the outer brackets [ ].

93
Q

JSON File Format Option: STRIP_NULL_VALUES

A

Only used for loading. If TRUE, the JSON parser will remove object fields or array elements containing NULL.
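
A hedged sketch combining these options in a file format object (the name is a placeholder):

CREATE FILE FORMAT MY_JSON_FORMAT
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
  STRIP_NULL_VALUES = TRUE;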
94
Q

Three Semi-Structured Data Loading Approaches

A

1. ELT (Extract, Load, Transform)
2. ETL (Extract, Transform, Load)
3. Automatic schema detection (INFER_SCHEMA, MATCH_BY_COLUMN_NAME)

95
Q

Unloading Semi-structured Data Flow

A

Table ---COPY INTO--> Stage ---GET--> Semi-structured data files
96
Q

Accessing Semi-Structured Data: Dot Notation Structure

A

SELECT <column>:<level1_element>.<level2_element> FROM <table>;

97
Q

Accessing Semi-Structured Data: Bracket Notation Structure

A

SELECT <column>['<level1_element>']['<level2_element>'] FROM <table>;

98
Q

Accessing Semi-Structured Data: Repeating Element

A

SELECT SRC:<element>[<index>] FROM <table>;