Databricks Data Engineer Associate Certification Flashcards
https://www.databricks.com/sites/default/files/2025-02/databricks-certified-data-engineer-associate-exam-guide-1-mar-2025.pdf (102 cards)
Describe the relationship between the data lakehouse and the data warehouse.
A data warehouse stores only structured data, whereas a data lakehouse can store both structured and unstructured data. A data lakehouse builds on data lake architecture, adding a metadata layer.
https://www.databricks.com/glossary/data-lakehouse
Identify the improvement in data quality in the data lakehouse over the data lake.
The data lakehouse adds:
* A metadata layer
* New query engine designs
* Optimized access for data science and ML tools
https://www.databricks.com/glossary/data-lakehouse
Compare and contrast silver and gold tables, which workloads will use a bronze table as a source, which workloads will use a gold table as a source.
Silver layer tables are minimally cleansed and conformed, just enough for self-service ad-hoc reporting, advanced analytics, or ML.
Gold layer tables are typically consumption-ready and organized into project-specific databases for reporting. Gold layer tables are more often de-normalized and should be read-optimized.
Bronze tables hold raw ingested data and serve as the source for the ETL workloads that build the silver layer; gold tables serve as the source for BI reporting, dashboards, and other end-user analytics workloads.
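As a minimal sketch (catalog, table, and column names are hypothetical), a gold table for reporting might be built from a silver table like this:
```sql
-- Aggregate a cleansed silver table into a de-normalized, read-optimized
-- gold table for reporting (all names are illustrative).
CREATE OR REPLACE TABLE gold.daily_sales AS
SELECT
  order_date,
  region,
  SUM(amount) AS total_sales,
  COUNT(*)    AS order_count
FROM silver.orders
GROUP BY order_date, region;
```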
https://www.databricks.com/glossary/medallion-architecture
Identify elements of the Databricks Platform Architecture, such as what is located in the data plane versus the control plane and what resides in the customer’s cloud account.
Control Plane: backend services that Databricks manages in your Databricks account. The web application is in the control plane.
Compute Plane (also called the “data plane”): where your data is processed:
* Serverless compute in your Databricks account
* Classic Databricks compute in your AWS or Azure account
https://docs.databricks.com/aws/en/getting-started/overview
Differentiate between all-purpose clusters and jobs clusters.
- You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
- The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
https://docs.databricks.com/aws/en/getting-started/concepts#computation-management
Identify how cluster software is versioned using the Databricks Runtime.
- Long Term Support versions are marked with an “LTS” qualifier
- Major versions are represented by the integer preceding the decimal point; increments to the major version are likely to include backwards-incompatible changes
- Feature versions are represented by the integer following the decimal point and introduce new features that are always backwards compatible
For example, in Databricks Runtime 15.4 LTS, 15 is the major version, 4 is the feature version, and the LTS qualifier marks long-term support.
https://docs.databricks.com/aws/en/compute/#runtime-versioning
Identify how clusters can be filtered to view those that are accessible by the user.
Describe how clusters are terminated and the impact of terminating a cluster.
- Manually through the UI, CLI, or REST API by a user with “CAN RESTART” or “CAN MANAGE” permissions
- Automatically after a specified period of inactivity
- Unexpected termination
Terminating a compute releases its cloud instances and any data cached in memory; the compute configuration is retained, so an all-purpose compute can be restarted later.
https://docs.databricks.com/aws/en/compute/clusters-manage#terminate-a-compute
Identify a scenario in which restarting the cluster will be useful.
When you restart a compute, it gets the latest images for the compute resource containers and the VM hosts. It is important to schedule regular restarts for long-running compute such as those used for processing streaming data.
https://docs.databricks.com/aws/en/compute/clusters-manage#restart-a-compute-to-update-it-with-the-latest-images
Describe how to use multiple languages within the same notebook.
- Click the cell language button and select a different language for a cell
- Use a language magic command such as `%python`, `%scala`, or `%sql` at the top of a cell (see the example below)
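A minimal sketch, assuming the notebook's default language is Python; the `%sql` magic on the first line switches just this cell to SQL:
```sql
%sql
-- Runs as SQL even though the notebook's default language is Python
SELECT current_date() AS today
```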
https://docs.databricks.com/aws/en/notebooks/notebooks-code#code-languages-in-notebooks
Identify how to run one notebook from within another notebook.
- Use the `%run` magic command: this includes another notebook within your notebook, in the same scope as your notebook.
- Use `dbutils.notebook.run()`: this starts a new job to run the notebook, and allows you to pass parameters and return values.
https://docs.databricks.com/aws/en/notebooks/notebook-workflows
Identify how notebooks can be shared with other users.
- Click “Share” at the top of the notebook and select one or more users or groups
- Share a folder in which a notebook is located
https://docs.databricks.com/aws/en/notebooks/notebooks-collaborate#share-a-notebook
Describe how Databricks Repos enables CI/CD workflows in Databricks.
Using the Databricks Repos API:
- Admin flow: For production flows, a Databricks workspace admin sets up top-level folders in your workspace to host the production git folders. The admin clones a Git repository and branch when creating them, and could give these folders meaningful names such as “Production”, “Test”, or “Staging” which correspond to the remote Git repositories’ purpose in your development flows. For more details, see Production Git folder.
- User flow: A user can create a Git folder under /Workspace/Users/<email>/ based on a remote Git repository. The user creates a local, user-specific branch for their work, commits changes to it, and pushes the branch to the remote repository. For information on collaborating in user-specific Git folders, see Collaborate using Git folders.
- Merge flow: Users can create pull requests (PRs) after pushing from a Git folder. When the PR is merged, automation can pull the changes into the production Git folders using the Databricks Repos API.
https://docs.databricks.com/aws/en/repos/ci-cd
Identify Git operations available via Databricks Repos.
- Create branch
- Switch branch
- Commit & push
- Pull
- Merge
- Rebase
- Resolve Merge Conflicts
- Reset
- Sparse Checkout
https://docs.databricks.com/aws/en/repos/git-operations-with-repos
Identify limitations in Databricks Notebooks version control functionality relative to Repos.
- Working branches are limited to 1 gigabyte (GB).
- Files larger than 10 MB can’t be viewed in the Databricks UI.
- Individual workspace files are subject to a separate size limit. For more details, read Limitations.
- The local version of a branch can remain present in the associated Git folder for up to 30 days after the remote branch is deleted. To completely remove a local branch in a Git folder, delete the repository.
https://docs.databricks.com/aws/en/repos/limits
Extract data from a single file and from a directory of files.
- Unity Catalog volume paths
SQL:
```sql
SELECT * FROM parquet.`/Volumes/<catalog_name>/<schema_name>/<volume_name>/path/to/data`
```
Python:
```python
spark.read.format("parquet").load("/Volumes/catalog_name/schema_name/volume_name/path/to/data")
```
- Cloud storage URIs
SQL:
```sql
SELECT * FROM json.`s3://bucket_name/path/to/data`
```
Python:
```python
spark.read.format("parquet").load("s3://bucket_name/path/to/data")
```
The path can point either to a single file or to a directory, in which case all files in the directory are read.
https://docs.databricks.com/aws/en/files/
Identify the prefix included after the FROM keyword as the data type.
```sql
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data`
```
The identifier before the path (here `parquet`) specifies the file format used to read the data.
https://docs.databricks.com/aws/en/files/
Create a view, a temporary view, and a CTE as a reference to a file.
Create a view:
```sql
CREATE VIEW view_name AS
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
```
Create a temporary view:
```sql
CREATE TEMPORARY VIEW temp_view_name AS
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
```
Create a CTE:
```sql
WITH DATA_CTE AS (
  SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
)
SELECT * FROM DATA_CTE
```
Identify that tables from external sources are not Delta Lake tables.
```sql
DESCRIBE DETAIL catalog.schema.table
```
The output includes `format` and `location` columns. A `format` of `delta` indicates a Delta Lake table; any other value means the table is not backed by Delta Lake.
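A related check, using an illustrative three-level table name: `DESCRIBE TABLE EXTENDED` reports a Provider row that shows the table's underlying data source.
```sql
-- The "Provider" row in the detailed table information shows the data source,
-- e.g. delta for Delta Lake tables or CSV for a table defined over CSV files.
DESCRIBE TABLE EXTENDED catalog.schema.table
```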
Create a table from a JDBC connection and from an external CSV file.
```sql
-- Create a table from a JDBC connection
CREATE TABLE IF NOT EXISTS ora_tab
USING ORACLE
OPTIONS (
  url '<jdbc-url>',
  dbtable '<table-name>',
  user '<username>',
  password '<password>'
);

-- Create a table from an external CSV file
CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
```
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using#examples
Identify how the count_if function and the count where x is null can be used.
count_if(expr) returns the number of rows for which expr evaluates to true. To count rows where a column x is NULL, use count_if(x IS NULL).
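A minimal sketch, assuming a table named orders with a nullable amount column:
```sql
-- Count rows matching a predicate, and rows where a column is NULL
SELECT
  count_if(amount > 100)   AS large_orders,    -- rows where the expression is true
  count_if(amount IS NULL) AS missing_amounts  -- rows where amount is NULL
FROM orders;
```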
https://docs.databricks.com/aws/en/sql/language-manual/functions/count_if
Identify when the SQL count function skips NULL values.
- `count(*)`: counts all rows, including rows that contain NULL values
- `count(COLUMN_NAME)`: counts only the non-null values in the named column (see the sketch below)
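A minimal sketch, assuming a table t with a nullable column c:
```sql
-- count(*) counts every row; count(c) skips rows where c is NULL
SELECT
  count(*) AS all_rows,
  count(c) AS non_null_c
FROM t;
```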
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-null-semantics#built-in-aggregate-expressions
Deduplicate rows from an existing Delta Lake table.
Using MERGE INTO ... WHEN NOT MATCHED BY SOURCE THEN DELETE:
```sql
-- Keep the earliest row per id and delete the remaining duplicates.
-- The ON clause also matches on the ordering column (updated); otherwise every
-- target row would match the deduplicated source on id alone and nothing would be deleted.
WITH deduplicated_target AS (
  SELECT * FROM target
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated ASC) = 1
)
MERGE INTO target
USING deduplicated_target
ON deduplicated_target.id = target.id
  AND deduplicated_target.updated = target.updated
WHEN NOT MATCHED BY SOURCE THEN DELETE
```
https://docs.databricks.com/gcp/en/delta/merge
Create a new table from an existing table while removing duplicate rows.
```sql
-- CREATE TABLE ... LIKE copies only the schema of the existing table,
-- so the deduplicated rows are loaded with a separate INSERT.
CREATE TABLE new_table LIKE existing_table;

INSERT INTO new_table
SELECT * FROM existing_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated ASC) = 1;
```
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-like