The optional VALIDATION_MODE parameter allows you to perform a dry run of the load process to expose errors when running COPY INTO
- RETURN_N_ROWS
- RETURN_ERRORS
- RETURN_ALL_ERRORS
- VALIDATE is a table function used to view all errors encountered during a previous COPY INTO execution
- VALIDATE accepts the job ID of a previous query, or the last load operation executed
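A minimal sketch of both features, using hypothetical table and stage names:

```sql
-- Dry run: report errors without loading any data.
COPY INTO my_table
  FROM @my_stage
  VALIDATION_MODE = 'RETURN_ERRORS';

-- Review errors from a previous load by job ID, or from the last load:
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '<query_id>'));
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));
```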
File format options can be set on a named stage or _____ ____ statement.
COPY INTO
Explicitly declared file format options can all be rolled up into independent _____ _____ _______ objects.
Snowflake File Format
File formats can be applied to both named stages and COPY INTO statements. If set on both _____ ______ will take precedence.
COPY INTO
In the File Format object the file format you're expecting to load is set via 'type' property with one of the following values: ______, ______, ______, ______, _______, or ______.
CSV, JSON, AVRO, ORC, PARQUET, XML
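A sketch of creating a File Format object and applying it at both levels (object names are illustrative):

```sql
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- Applied on a named stage...
CREATE OR REPLACE STAGE my_stage FILE_FORMAT = my_csv_format;

-- ...or on the COPY INTO statement (which takes precedence over the stage's format):
COPY INTO my_table FROM @my_stage FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```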
Comma-Separated Values file
A plain-text file that contains a list of data. CSV files mostly use the comma character to separate values, but sometimes use other characters, like semicolons.
JavaScript Object Notation file
A file that stores simple data structures and objects in JavaScript Object Notation (JSON) format. It is primarily used for transmitting data between a web application and a server. They are lightweight, text-based, human-readable, and can be edited using a text editor
Apache Avro file
Stores the data definition in JSON format, making it easy to read and interpret; the data itself is stored in binary format, making it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for Apache MapReduce processing.
Optimized Row Columnar (ORC)
Open-source columnar storage file format originally released in early 2013 for Hadoop workloads. ORC provides a highly efficient way to store Apache Hive data, though it can store other data as well. It was designed and optimized specifically with Hive data in mind, improving the overall performance when Hive reads, writes, and processes data.
Apache Parquet file
Apache Parquet is a file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. It's open-source and supports multiple coding languages, including Java, C++, and Python.
Extensible Markup Language file
It contains a formatted dataset that is intended to be processed by a website, web application, or software program. XML files can be thought of as text-based databases
If a File Format object or options are not provided to either the stage or COPY statement, the default behavior will be to try to interpret the contents of a stage as a _____ file with _____ encoding.
CSV, UTF-8
The Pipe object defines a COPY INTO statement that will execute in response to a file being uploaded to a stage.
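A minimal pipe definition, assuming a hypothetical external stage with cloud-messaging notifications configured:

```sql
CREATE OR REPLACE PIPE my_pipe
  AUTO_INGEST = TRUE  -- cloud-messaging detection (external stages only)
AS
  COPY INTO my_table
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = 'JSON');
```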
The two methods for detecting when a new file has been uploaded to a stage:
1. Automating Snowpipe using cloud messaging (external stages only)
2. Calling Snowpipe REST endpoints (internal and external stages)
Snowpipe: Cloud Messaging flow
Snowpipe: REST Endpoint flow
Snowpipe is designed to load new data typically within a _______ after a file notification is sent.
minute
Snowpipe is a _______________ feature, using Snowflake-managed compute resources to load data files, not a user-managed ________________ __________________.
serverless, Virtual Warehouse
Snowpipe load history is stored in the __________ of the pipe for _____ days, used to prevent reloading the same files into a table.
metadata, 14
When a pipe is paused, event messages received for the pipe enter a limited retention period. The period is ____ days by default.
14
Compare Bulk Loading vs. Snowpipe: Authentication Feature
Bulk Loading: Relies on the security options supported by the client for authenticating and initiating a user session.
Snowpipe: When calling the REST endpoints: Requires key pair authentication with JSON Web Token (JWT). JWTs are signed using a public/private key pair with RSA encryption.
Compare Bulk Loading vs. Snowpipe: Load History
Bulk Loading: Stored in the metadata of the target table for 64 days.
Snowpipe: Stored in the metadata of the pipe for 14 days.
Compare Bulk Loading vs. Snowpipe: Compute Resources
Bulk Loading: Requires a user-specified warehouse to execute COPY statements.
Snowpipe: Uses Snowflake-supplied compute resources
Compare Bulk Loading vs. Snowpipe: Billing
Bulk Loading: Billed for the amount of time each virtual warehouse is active.
Snowpipe: Snowflake tracks the resource consumption of loads for all pipes in an account, with per-second/per-core granularity, as Snowpipe actively queues and processes data files. In addition to resource consumption, an overhead is included in the utilization costs charged for Snowpipe: 0.06 credits per 1,000 files notified or listed via event notifications or REST API calls.
Data Loading Best Practices
- Break files into 100-250 MB (compressed) chunks
- Organize Data by Path
- Separate virtual warehouses for Load and Query
- Pre-sort data
- Stage files no more often than once per minute, so they don't back up in the queue and incur unnecessary cost
Table data can be unloaded to a stage via the ________ command.
COPY INTO
The ____ command is used to download a staged file to the local file system.
GET
By default, results unloaded to a stage using the _______________ command are split into multiple files.
COPY INTO
All data files unloaded to internal stages are automatically encrypted using ____-bit keys.
128
COPY INTO output files can be prefixed by specifying a string at the ____ of a stage path.
end
COPY INTO includes a ____________ ___ copy option to partition unloaded data into a directory structure.
PARTITION BY
COPY INTO can copy table records directly to _________ cloud provider's blob storage.
external
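A sketch of unloading with a filename prefix, a PARTITION BY directory layout, and a direct unload to external cloud storage (bucket, column, and integration names are hypothetical):

```sql
-- Output files are prefixed with 'result_' and partitioned into subdirectories.
COPY INTO @my_stage/result_
  FROM my_table
  PARTITION BY ('date=' || TO_VARCHAR(order_date));

-- Unload directly to a cloud provider's blob storage:
COPY INTO 's3://my-bucket/unload/'
  FROM my_table
  STORAGE_INTEGRATION = my_s3_integration;
```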
COPY INTO Copy Option: OVERWRITE
Definition: Boolean that specifies whether the COPY command overwrites existing files with matching names, if any, in the location where files are stored.
Default Value: FALSE
COPY INTO Copy Option: SINGLE
Definition: Boolean that specifies whether to generate a single file or multiple files.
Default Value: FALSE
COPY INTO Copy Option: MAX_FILE_SIZE
Definition: Number (>0) that specifies the upper size limit (in bytes) of each file to be generated in parallel per thread.
Default Value: 16777216 (16 MB)
COPY INTO Copy Option: INCLUDE_QUERY_ID
Definition: Boolean that specifies whether to uniquely identify unloaded files by including a universally unique identifier (UUID) in the filenames of unloaded data files.
Default Value: FALSE
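The copy options above can be combined on a single unload statement; a sketch with hypothetical names:

```sql
COPY INTO @my_stage/out/
  FROM my_table
  OVERWRITE = TRUE          -- replace files with matching names (default FALSE)
  SINGLE = FALSE            -- allow multiple output files (default FALSE)
  MAX_FILE_SIZE = 52428800  -- ~50 MB per file (default 16777216 bytes)
  INCLUDE_QUERY_ID = TRUE;  -- add a UUID to each filename (default FALSE)
```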
______ is the reverse of PUT. It allows users to specify a source stage and a _______ local directory to download files to.
GET, target
GET cannot be used for _________ stages.
external
GET cannot be ________ from within worksheets.
executed
When using the GET command, downloaded files are automatically decrypted? T/F
True
When using the GET command, the __________ optional parameter specifies the number of threads to use for downloading files. Increasing this number can improve ____________ when downloading large files.
parallel, parallelization
When using the GET command, _________ optional parameter specifies a regular expression pattern for filtering files to download.
pattern
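Both optional parameters in one GET, run from a client such as SnowSQL (stage and path names are illustrative):

```sql
GET @my_stage/unload/ file:///tmp/downloads/
  PARALLEL = 10             -- threads used for parallel download
  PATTERN = '.*[.]csv[.]gz'; -- regex filter on which files to download
```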
Semi-structured Data Type: ARRAY
Contains 0 or more elements of data. Each element is accessed by its position in the array.
Semi-structured Data Type: OBJECT
Represent collections of key-value pairs.
Semi-structured Data Type: VARIANT
Universal semi-structured data type used to represent arbitrary data structures.
VARIANT data type can hold up to ___ MB compressed data per row.
16
Semi-structured Data Formats supported by Snowflake.
JSON, AVRO, ORC, PARQUET, XML
Loading Semi-Structured Data Flow
Semi-Structured Data file ---PUT--> Stage --- COPY INTO --> Table
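The flow above as a sketch, with hypothetical file, stage, and table names (PUT runs from a client such as SnowSQL):

```sql
-- Stage a local JSON file, then load it into a VARIANT column.
PUT file:///tmp/data.json @my_stage;

CREATE OR REPLACE TABLE raw_json (v VARIANT);

COPY INTO raw_json
  FROM @my_stage/data.json.gz
  FILE_FORMAT = (TYPE = 'JSON');
```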
JSON File Format Options: DATE_FORMAT
Used only for loading JSON data into separate columns. Defines the format of date string values in the data files.
JSON File Format Options: TIME_FORMAT
Used only for loading JSON data into separate columns. Defines the format of time string values in the data files.
JSON File Format Options: COMPRESSION
Supported algorithms: GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE. If BROTLI, cannot use AUTO.
JSON File Format Options: ALLOW_DUPLICATE
Only used for loading. If TRUE, allows duplicate object field names (only the last one will be preserved)
JSON File Format Options: STRIP_OUTER_ARRAY
Only used for loading. If TRUE, JSON parser will remove outer brackets []
JSON File Format Options: STRIP_NULL_VALUES
Only used for loading. If TRUE, JSON parser will remove object fields or array elements containing NULL
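The options above combined into one JSON File Format object (the format name is hypothetical):

```sql
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = 'JSON'
  COMPRESSION = 'GZIP'
  ALLOW_DUPLICATE = FALSE
  STRIP_OUTER_ARRAY = TRUE   -- remove enclosing [] so each element loads as a row
  STRIP_NULL_VALUES = TRUE;  -- drop fields/elements containing NULL
```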
Three Semi-Structured Data Loading Approaches
1. ELT (Extract, Load, Transform)
2. ETL (Extract, Transform, Load)
3. Automatic Schema Detection (INFER_SCHEMA, MATCH_BY_COLUMN_NAME)
Unloading Semi-structured Data Flow
Table ---COPY INTO--> Stage --GET--> Semi-structured Data Files
Accessing Semi-Structured Data: Dot Notation Structure
SELECT <column>:<level1_element>.<level2_element> FROM <table>;
Accessing Semi-Structured Data: Bracket Notation Structure
SELECT <column>['<level1_element>'] FROM <table>;
Accessing Semi-Structured Data: Repeating Element
SELECT SRC:<repeating_element>[<element_index>] FROM <table>;
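The three access patterns side by side, assuming a hypothetical table raw_json with a VARIANT column v holding e.g. { "customer": { "name": "Ada", "orders": [ { "id": 1 }, { "id": 2 } ] } }:

```sql
-- Dot notation:
SELECT v:customer.name FROM raw_json;

-- Bracket notation (equivalent):
SELECT v['customer']['name'] FROM raw_json;

-- Repeating element, accessed by index:
SELECT v:customer.orders[0].id FROM raw_json;
```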