Databricks Data Engineer Associate Certification Flashcards

https://www.databricks.com/sites/default/files/2025-02/databricks-certified-data-engineer-associate-exam-guide-1-mar-2025.pdf (102 cards)

1
Q

Describe the relationship between the data lakehouse and the data warehouse.

A

A data warehouse stores only structured data, whereas a data lakehouse can store both structured and unstructured data. A data lakehouse builds on data lake architecture, adding a metadata layer.

https://www.databricks.com/glossary/data-lakehouse

2
Q

Identify the improvement in data quality in the data lakehouse over the data lake.

A

The data lakehouse adds:
* A metadata layer
* New query engine designs
* Optimized access for data science and ML tools

https://www.databricks.com/glossary/data-lakehouse

3
Q

Compare and contrast silver and gold tables; identify which workloads will use a bronze table as a source and which workloads will use a gold table as a source.

A

Silver layer tables are minimally cleansed and conformed just enough for self-service ad-hoc reporting, advanced analytics, or ML.

Gold layer tables are typically consumption-ready and organized into project-specific databases for reporting. Gold layer tables are more often de-normalized, and should be read-optimized.

https://www.databricks.com/glossary/medallion-architecture

4
Q

Identify elements of the Databricks Platform Architecture, such as what is located in the data plane versus the control plane and what resides in the customer’s cloud account.

A

Control Plane: backend services that Databricks manages in your Databricks account. The web application is in the control plane.
Compute Plane (also called the “data plane”): where your data is processed:
* Serverless compute in your Databricks account
* Classic Databricks compute in your AWS or Azure account

https://docs.databricks.com/aws/en/getting-started/overview

5
Q

Differentiate between all-purpose clusters and jobs clusters.

A
  • You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
  • The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.

https://docs.databricks.com/aws/en/getting-started/concepts#computation-management

6
Q

Identify how cluster software is versioned using the Databricks Runtime.

A
  • Long Term Support versions are represented by an “LTS” qualifier
  • Major versions are represented by the integer preceding the decimal; increments to the major version are likely to include backwards-incompatible changes
  • Feature versions are represented by the integer following the decimal and represent the introduction of new features, which are always backwards compatible

https://docs.databricks.com/aws/en/compute/#runtime-versioning

7
Q

Identify how clusters can be filtered to view those that are accessible by the user.

A
8
Q

Describe how clusters are terminated and the impact of terminating a cluster.

A
  • Manually through the UI, CLI, or REST API by a user with “CAN RESTART” or “CAN MANAGE” permissions
  • Automatically after a specified period of inactivity
  • Unexpected termination

https://docs.databricks.com/aws/en/compute/clusters-manage#terminate-a-compute

9
Q

Identify a scenario in which restarting the cluster will be useful.

A

When you restart a compute, it gets the latest images for the compute resource containers and the VM hosts. It is important to schedule regular restarts for long-running compute such as those used for processing streaming data.

https://docs.databricks.com/aws/en/compute/clusters-manage#restart-a-compute-to-update-it-with-the-latest-images

10
Q

Describe how to use multiple languages within the same notebook.

A
  • Click the cell language button and select a different language for a cell
  • Use the language magic command such as %python, %scala, or %sql
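A minimal sketch of how this looks in practice (the temp view name is hypothetical); the notebook's default language runs as-is, and a magic command on the first line switches a single cell:

```python
# Cell 1: the notebook's default language (Python) registers a temp view.
spark.range(5).toDF("n").createOrReplaceTempView("numbers")

# Cell 2: starting the cell with the %sql magic runs that cell as SQL instead:
# %sql
# SELECT * FROM numbers
```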

https://docs.databricks.com/aws/en/notebooks/notebooks-code#code-languages-in-notebooks

11
Q

Identify how to run one notebook from within another notebook.

A
  • Use the %run magic command: This includes another notebook within your notebook, in the same scope as your notebook.
  • Use dbutils.notebook.run(): This starts a new job to run the notebook, and allows you to pass parameters and return values.
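A minimal sketch, assuming a child notebook at a hypothetical relative path that returns a value via dbutils.notebook.exit():

```python
# %run ./child_notebook   <- inlines the child notebook into the current scope

# dbutils.notebook.run() instead launches the child notebook as a separate run,
# passes parameters, and returns whatever the child passes to dbutils.notebook.exit():
result = dbutils.notebook.run("./child_notebook", 600, {"run_date": "2024-01-01"})
print(result)
```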

https://docs.databricks.com/aws/en/notebooks/notebook-workflows

12
Q

Identify how to share a notebook with other users or groups.

A
  • Click “Share” at the top of the notebook and select a user(s) or group(s)
  • Share a folder in which a notebook is located

https://docs.databricks.com/aws/en/notebooks/notebooks-collaborate#share-a-notebook

13
Q

Describe how Databricks Repos enables CI/CD workflows in Databricks.

A

Using the Databricks Repos API:

  • Admin flow: For production flows, a Databricks workspace admin sets up top-level folders in your workspace to host the production git folders. The admin clones a Git repository and branch when creating them, and could give these folders meaningful names such as “Production”, “Test”, or “Staging” which correspond to the remote Git repositories’ purpose in your development flows. For more details, see Production Git folder.
  • User flow: A user can create a Git folder under /Workspace/Users/<email>/ based on a remote Git repository. The user creates a local, user-specific branch for their work, commits to it, and pushes to the remote repository. For information on collaborating in user-specific Git folders, see Collaborate using Git folders.
  • Merge flow: Users can create pull requests (PRs) after pushing from a Git folder. When the PR is merged, automation can pull the changes into the production Git folders using the Databricks Repos API.
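As a sketch of that automation step (the workspace host, repo ID, and token below are placeholders, not values from this deck), a CI job can call the Repos API after the PR is merged to update the production Git folder to the latest commit on its branch:

```python
import os
import requests

host = "https://<workspace-host>"   # placeholder
repo_id = "<production-repo-id>"    # placeholder: ID of the production Git folder
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},        # check out the latest commit on 'main'
)
resp.raise_for_status()
```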

https://docs.databricks.com/aws/en/repos/ci-cd

14
Q

Identify Git operations available via Databricks Repos.

A
  • Create branch
  • Switch branch
  • Commit & push
  • Pull
  • Merge
  • Rebase
  • Resolve Merge Conflicts
  • Reset
  • Sparse Checkout

https://docs.databricks.com/aws/en/repos/git-operations-with-repos

15
Q

Identify limitations in Databricks Notebooks version control functionality relative to Repos.

A
  • Working branches are limited to 1 gigabyte (GB).
  • Files larger than 10 MB can’t be viewed in the Databricks UI.
  • Individual workspace files are subject to a separate size limit. For more details, read Limitations.
  • The local version of a branch can remain present in the associated Git folder for up to 30 days after the remote branch is deleted. To completely remove a local branch in a Git folder, delete the repository.

https://docs.databricks.com/aws/en/repos/limits

16
Q

Extract data from a single file and from a directory of files.

A
  • Unity Catalog volume path

    SQL:

    ```sql
    SELECT * FROM parquet.`/Volumes/<catalog_name>/<schema_name>/<volume_name>/path/to/data`
    ```

    Python:

    ```python
    spark.read.format("parquet").load("/Volumes/catalog_name/schema_name/volume_name/path/to/data")
    ```

  • Cloud storage URI

    SQL:

    ```sql
    SELECT * FROM json.`s3://bucket_name/path/to/data`
    ```

    Python:

    ```python
    spark.read.format("parquet").load("s3://bucket_name/path/to/data")
    ```

https://docs.databricks.com/aws/en/files/

17
Q

Identify the prefix included after the FROM keyword as the data type.

A

The identifier between FROM and the backtick-quoted path specifies the file format (for example parquet, json, or csv):

```sql
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data`
```

https://docs.databricks.com/aws/en/files/

18
Q

Create a view, a temporary view, and a CTE as a reference to a file.

A

Create a view:

```sql
CREATE VIEW my_view AS
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
```

Create a temporary view:

```sql
CREATE TEMPORARY VIEW my_temp_view AS
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
```

Create a CTE:

```sql
WITH data_cte AS (
  SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
)
SELECT * FROM data_cte
```

19
Q

Identify that tables from external sources are not Delta Lake tables.

A

```sql
DESCRIBE DETAIL catalog.schema.table
```

...the output includes columns such as:

```text
format:   delta
location: dbfs:/Volumes/catalog_name/schema_name/table_name
```

The `format` column indicates whether the table is a Delta table (for example `delta` versus `csv` or `parquet`).

20
Q

Create a table from a JDBC connection and from an external CSV file.

A

```sql
CREATE TABLE IF NOT EXISTS ora_tab
USING ORACLE
OPTIONS (
  url '<jdbc-url>',
  dbtable '<table-name>',
  user '<username>',
  password '<password>'
);

CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
```

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using#examples

21
Q

Identify how the count_if function and the count where x is null can be used.

A

Return the number of true values for an expression; `count_if(x IS NULL)` therefore returns the number of rows where x is NULL.
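For example (table and column names are hypothetical):

```sql
SELECT
  count_if(x IS NULL) AS null_count,   -- number of rows where x is NULL
  count_if(x > 10)    AS large_count   -- number of rows where the predicate is true
FROM my_table;
```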

https://docs.databricks.com/aws/en/sql/language-manual/functions/count_if

22
Q

Identify when the SQL count function skips NULL values.

A
  • count(*): counts all rows
  • count(COLUMN_NAME): counts non-null values in the named column
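For example (hypothetical table and column):

```sql
SELECT
  count(*) AS total_rows,   -- counts every row, including rows where x is NULL
  count(x) AS non_null_x    -- NULL values of x are skipped
FROM my_table;
```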

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-null-semantics#built-in-aggregate-expressions

23
Q

Deduplicate rows from an existing Delta Lake table.

A

Using MERGE INTO ... WHEN NOT MATCHED BY SOURCE THEN DELETE:

```sql
MERGE INTO target
USING (
  SELECT * FROM target
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated ASC) = 1
) AS deduplicated_target
ON deduplicated_target.id = target.id
   AND deduplicated_target.updated = target.updated
WHEN NOT MATCHED BY SOURCE THEN DELETE
```

This keeps the earliest row per id and deletes every other duplicate.

https://docs.databricks.com/gcp/en/delta/merge

24
Q

Create a new table from an existing table while removing duplicate rows.

A

```sql
CREATE TABLE new_table AS
SELECT * FROM existing_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated ASC) = 1
```

(`CREATE TABLE ... LIKE existing_table` copies only the table definition and cannot be combined with `AS SELECT`, so CTAS is used here.)

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-like

25
Deduplicate a row based on specific columns.
26
Validate that the primary key is unique across all rows.
27
Validate that a field is associated with just one unique value in another field.
28
Validate that a value is not present in a specific field.
29
Cast a column to a timestamp.
`CAST(COLUMN AS TIMESTAMP)` or `TO_TIMESTAMP(COLUMN)` or `TRY_TO_TIMESTAMP(COLUMN)` ## Footnote https://docs.databricks.com/aws/en/sql/language-manual/functions/to_timestamp
30
Extract calendar data from a timestamp.
```sql
SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456');
 2019
```

## Footnote https://docs.databricks.com/aws/en/sql/language-manual/functions/extract
31
Extract a specific pattern from an existing string column.
```sql
SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
```

...returns:

```text
100
```
32
Utilize the dot syntax to extract nested data fields.
```sql
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data
```

## Footnote https://docs.databricks.com/aws/en/semi-structured/json
33
Identify the benefits of using array functions.
* **Performance:** Array functions are native Spark SQL expressions that are optimized for distributed processing (no Python UDF overhead).
* **Expressiveness:** They let you filter, transform, and aggregate nested collections in place, without first exploding them into rows.

## Footnote https://medium.com/@sujathamudadla1213/elt-with-apache-spark-for-topic-identify-the-benefits-of-using-array-functions-2ef4470a51d8
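A few illustrative built-in array functions (the values are arbitrary):

```sql
SELECT array_contains(array(1, 2, 3), 2);          -- true
SELECT transform(array(1, 2, 3), x -> x * 2);      -- [2, 4, 6]
SELECT filter(array(1, 2, 3, 4), x -> x % 2 = 0);  -- [2, 4]
```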
34
Parse JSON strings into structs.
```sql
SELECT from_json(json_column, schema)
```

or

```sql
SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
 {"a":1,"b":0.8}
```

## Footnote https://docs.databricks.com/gcp/en/sql/language-manual/functions/from_json
35
Identify which result will be returned based on a join query.
36
Identify a scenario to use the explode function versus the flatten function.
Using `LATERAL VIEW explode` or a table reference with `explode` un-nests an array by creating a new row for each value in the array, whereas `flatten` transforms an array of arrays into a single array.

## Footnote
https://docs.databricks.com/aws/en/sql/language-manual/functions/explode
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-table-reference
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-lateral-view
https://docs.databricks.com/aws/en/sql/language-manual/functions/flatten
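For example:

```sql
SELECT explode(array(10, 20));                     -- two rows: 10 and 20
SELECT flatten(array(array(1, 2), array(3, 4)));   -- one row: [1, 2, 3, 4]
```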
37
Identify the PIVOT clause as a way to convert data from a long format to a wide format.
```sql
-- A very basic PIVOT
-- Given a table with sales by quarter, return a table that returns sales across quarters per year.
> CREATE TEMP VIEW sales(year, quarter, region, sales) AS
  VALUES (2018, 1, 'east', 100),
         (2018, 2, 'east',  20),
         (2018, 3, 'east',  40),
         (2018, 4, 'east',  40),
         (2019, 1, 'east', 120),
         (2019, 2, 'east', 110),
         (2019, 3, 'east',  80),
         (2019, 4, 'east',  60),
         (2018, 1, 'west', 105),
         (2018, 2, 'west',  25),
         (2018, 3, 'west',  45),
         (2018, 4, 'west',  45),
         (2019, 1, 'west', 125),
         (2019, 2, 'west', 115),
         (2019, 3, 'west',  85),
         (2019, 4, 'west',  65);

> SELECT year, region, q1, q2, q3, q4
    FROM sales
    PIVOT (sum(sales) AS sales
           FOR quarter
           IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));
 2018 east 100  20  40  40
 2019 east 120 110  80  60
 2018 west 105  25  45  45
 2019 west 125 115  85  65
```

## Footnote https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-pivot
38
Define a SQL UDF.
```sql
CREATE FUNCTION to_hex(x INT COMMENT 'Any number between 0 - 255')
  RETURNS STRING
  COMMENT 'Converts a decimal to a hexadecimal'
  CONTAINS SQL DETERMINISTIC
  RETURN lpad(hex(least(greatest(0, x), 255)), 2, 0)
```
39
Identify the location of a function.
`DESCRIBE FUNCTION <function_name>` (the output includes the fully qualified catalog.schema.function name)
40
Describe the security model for sharing SQL UDFs.
Permissions for UDFs are managed based on the access controls applied to the catalog, schema, or database where the UDF is registered.

```sql
GRANT EXECUTE ON FUNCTION my_catalog.my_schema.calculate_bmi TO `user@example.com`;
```

## Footnote https://docs.databricks.com/aws/en/udf/unity-catalog#share-udfs-in-unity-catalog
41
Use CASE/WHEN in SQL code.
42
Leverage CASE/WHEN for custom control flow.
43
Identify where Delta Lake provides ACID transactions.
44
Identify the benefits of ACID transactions.
## Footnote https://docs.databricks.com/aws/en/lakehouse/acid
45
Identify whether a transaction is ACID-compliant.
46
Compare and contrast data and metadata.
47
Compare and contrast managed and external tables.
48
Identify a scenario to use an external table.
49
Create a managed table.
50
Identify the location of a table.
51
Inspect the directory structure of Delta Lake files.
`DESCRIBE DETAIL '/data/events/'` or `DESCRIBE DETAIL events_table`
52
Identify who has written previous versions of a table.
`DESCRIBE HISTORY`
53
Review a history of table transactions.
`DESCRIBE HISTORY`
54
Roll back a table to a previous version.
```sql
RESTORE TABLE target_table TO VERSION AS OF <version_number>;
RESTORE TABLE target_table TO TIMESTAMP AS OF <timestamp>;
```
55
Identify that a table can be rolled back to a previous version.
`DESCRIBE HISTORY table_name` ## Footnote https://docs.databricks.com/gcp/en/delta/history
56
Query a specific version of a table.
`SELECT * FROM events@v123` or `SELECT * FROM events VERSION AS OF 123`

## Footnote
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select#as-of-syntax
https://docs.databricks.com/gcp/en/delta/history
57
Identify why Z-Ordering is beneficial to Delta Lake tables.
Z-Ordering is a technique to colocate related information in the same set of files. It is beneficial for high-cardinality columns, particularly for read-heavy workloads where multi-column filtering is common. For example, for a table containing companies and dates you might partition by company and Z-order by date, assuming the table collects data for several years. ## Footnote https://community.databricks.com/t5/data-engineering/what-is-z-ordering-in-delta-and-what-are-some-best-practices-on/td-p/26639
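For example (hypothetical table and column):

```sql
-- Compact files and colocate rows with similar event_date values in the same files.
OPTIMIZE events
ZORDER BY (event_date);
```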
58
Identify how vacuum commits deletes.
VACUUM removes all files from the table directory that are not managed by Delta, as well as data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. ## Footnote https://docs.databricks.com/aws/en/sql/language-manual/delta-vacuum
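For example (hypothetical table name):

```sql
VACUUM events;                   -- uses the default 7-day retention threshold
VACUUM events RETAIN 168 HOURS;  -- explicit retention window
```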
59
Identify the kind of files Optimize compacts.
The OPTIMIZE command rewrites data files to improve data layout for Delta tables. For tables with liquid clustering enabled, OPTIMIZE rewrites data files to group data by liquid clustering keys. For tables with partitions defined, file compaction and data layout are performed within partitions. Tables without liquid clustering can optionally include a ZORDER BY clause to improve data clustering on rewrite. Databricks recommends using liquid clustering instead of partitions, ZORDER, or other data layout approaches.
60
Identify CTAS as a solution.
`CREATE TABLE ... AS <query>`
61
Create a generated column.
```sql
CREATE TABLE default.people10m (
  id INT,
  firstName STRING,
  middleName STRING,
  lastName STRING,
  gender STRING,
  birthDate TIMESTAMP,
  dateOfBirth DATE GENERATED ALWAYS AS (CAST(birthDate AS DATE)),
  ssn STRING,
  salary INT
)
```

## Footnote https://docs.databricks.com/aws/en/delta/generated-columns
62
Add a table comment.
`COMMENT ON TABLE my_table IS 'This is my table'` ## Footnote https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-comment
63
Use CREATE OR REPLACE TABLE and INSERT OVERWRITE.
```sql
CREATE OR REPLACE TABLE ...;
INSERT OVERWRITE <table_name> VALUES ...;
```

## Footnote https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-dml-insert-into
64
Compare and contrast CREATE OR REPLACE TABLE and INSERT OVERWRITE.
65
Identify a scenario in which MERGE should be used.
When you want to "upsert" (update or insert rows) based on a match condition, such as a primary key match. ## Footnote https://docs.databricks.com/gcp/en/delta/merge
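A minimal upsert sketch (table and column names are hypothetical):

```sql
MERGE INTO customers AS t
USING customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```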
66
Identify MERGE as a command to deduplicate data upon writing.
```sql
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED THEN INSERT *
```

## Footnote https://docs.databricks.com/gcp/en/delta/merge#data-deduplication-when-writing-into-delta-tables
67
Describe the benefits of the MERGE command.
* Atomic Operations: Ensures all updates/inserts happen in a single transaction.
* Efficiency: Reduces multiple I/O operations by combining inserts and updates.
* Data Integrity: Prevents duplicates and maintains consistency.

## Footnote https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-merge
68
Identify why a COPY INTO statement is not duplicating data in the target table.
The COPY INTO command internally uses a key-value store (RocksDB) to track the details of the input files it has already loaded. This information is stored inside the Delta table's log directory and acts like the checkpoint information for a streaming query. The next time COPY INTO is triggered on the same table, the data from RocksDB is loaded first and compared against the input files; under the hood, dedupe logic is applied to ensure idempotency. ## Footnote https://community.databricks.com/t5/machine-learning/how-is-idempotency-ensured-for-copy-into-command/td-p/19795
69
Identify a scenario in which COPY INTO should be used.
Idempotent, SQL-based loading of files from cloud object storage into a Delta Lake table, including loading data into a (potentially schemaless) Delta Lake table, when the expected number of files is in the thousands rather than millions. ## Footnote https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/copy-into/#example-load-data-into-a-schemaless-delta-lake-table
70
Use COPY INTO to insert data.
```sql
COPY INTO sales_data
FROM 's3://bucket/path/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
```

## Footnote https://medium.com/@lsleena/databricks-data-engineer-associate-certification-study-notes-18-118eceabc412
71
Identify the components necessary to create a new DLT pipeline.
* Target Dataset: The final table(s) generated by the pipeline.
* Notebook Libraries: Python/SQL notebooks containing the transformation logic.
* Pipeline Configuration: Settings such as pipeline mode (triggered or continuous), storage location, and cluster specs.

## Footnote
https://docs.databricks.com/aws/en/dlt/tutorial-pipelines
https://medium.com/@lsleena/databricks-data-engineer-associate-certification-study-notes-18-118eceabc412
72
Identify the purpose of the target and of the notebook libraries in creating a pipeline.
* Target: Defines the output tables (e.g., cleaned, aggregated data).
* Notebook Libraries: Contain the business logic (SQL queries, PySpark transformations) that processes raw data into the target.

## Footnote https://medium.com/@lsleena/databricks-data-engineer-associate-certification-study-notes-18-118eceabc412
73
Compare and contrast triggered and continuous pipelines in terms of cost and latency.
Triggered pipelines cost less but have higher latency than continuous pipelines, which require always-on clusters. ## Footnote https://docs.databricks.com/aws/en/dlt/pipeline-mode
74
Identify which source location is utilizing Auto Loader.
Auto Loader is in use when a source location is configured with the "cloudFiles" format. ## Footnote https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/#auto-loader
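A minimal Auto Loader sketch (the paths, checkpoint locations, and table name are hypothetical):

```python
df = (spark.readStream
      .format("cloudFiles")                          # <- this is what makes it Auto Loader
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/Volumes/main/default/chk/schema")
      .load("s3://bucket_name/path/to/raw/"))

(df.writeStream
   .option("checkpointLocation", "/Volumes/main/default/chk/bronze")
   .trigger(availableNow=True)
   .toTable("main.default.bronze_events"))
```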
75
Identify a scenario in which Auto Loader is beneficial.
If you're going to ingest files in the order of thousands over time, you can use COPY INTO. If you are expecting files in the order of millions or more over time, use Auto Loader. Auto Loader requires fewer total operations to discover files compared to COPY INTO and can split the processing into multiple batches, which means that Auto Loader is less expensive and more efficient at scale. ## Footnote https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/#when-to-use-copy-into-and-when-to-use-auto-loader
76
Identify why Auto Loader has inferred all data to be STRING from a JSON source.
By default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don't encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files). For formats with typed schema (Parquet and Avro), Auto Loader samples a subset of files and merges the schemas of individual files. ## Footnote https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/schema#how-does-auto-loader-schema-inference-work
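To opt in to typed inference for formats such as JSON, a sketch using the cloudFiles.inferColumnTypes option described in the schema-inference docs (paths are hypothetical):

```python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.inferColumnTypes", "true")   # infer INT, DOUBLE, etc. instead of STRING
      .option("cloudFiles.schemaLocation", "/Volumes/main/default/chk/schema")
      .load("s3://bucket_name/path/to/raw/"))
```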
77
Identify the default behavior of a constraint violation.
For `NOT NULL` and `CHECK` constraints, the transaction fails. Primary and foreign keys are informational only and are not enforced. ## Footnote https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
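For example (hypothetical table and columns), adding enforced constraints whose violation fails the write:

```sql
ALTER TABLE events ALTER COLUMN event_id SET NOT NULL;
ALTER TABLE events ADD CONSTRAINT valid_date CHECK (event_date > '2000-01-01');
```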
78
Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation
* `ON VIOLATION DROP ROW`: Data loss (offending rows are dropped)
* `ON VIOLATION FAIL UPDATE`: Failed transaction

## Footnote https://docs.databricks.com/aws/en/dlt/expectations
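A DLT expectation sketch (the dataset and column names are hypothetical):

```sql
CREATE OR REFRESH LIVE TABLE orders_clean (
  CONSTRAINT valid_amount EXPECT (amount > 0)      ON VIOLATION DROP ROW,
  CONSTRAINT valid_id     EXPECT (id IS NOT NULL)  ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.orders_raw;
```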
79
Explain change data capture and the behavior of APPLY CHANGES INTO.
Applies change data capture from a source of change events to a target table.

Example:

```sql
APPLY CHANGES INTO target_table
FROM STREAM(source_table)
KEYS (id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY STRUCT(version_column, timestamp_column)
```

INSERT and UPDATE events are applied automatically; `APPLY AS DELETE WHEN` (and `APPLY AS TRUNCATE WHEN`) handle delete/truncate events, and `SEQUENCE BY` specifies the ordering of change events.

## Footnote https://docs.databricks.com/aws/en/dlt/cdc
80
Query the event log to get metrics, perform audit logging, and examine lineage.
By default, the name of the hidden event log table is formatted as `event_log_{pipeline_id}`, where the pipeline ID is the system-assigned UUID with dashes replaced by underscores. ## Footnote https://docs.databricks.com/aws/en/dlt/observability
81
Troubleshoot DLT syntax: Identify which notebook in a DLT pipeline produced an error, identify the need for LIVE in create statement, identify the need for STREAM in from clause.
Example:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE sales_cleaned
AS SELECT * FROM STREAM(LIVE.sales_raw);
```

The `LIVE.` prefix is required when referencing another dataset defined in the same pipeline, and `STREAM()` is required when reading that dataset incrementally as a streaming source. To identify which notebook produced an error, navigate to the DLT pipeline UI in the Databricks workspace and check the "Execution Details" to see which step or notebook caused the failure.

## Footnote https://medium.com/@sujathamudadla1213/databricks-data-engineer-associate-level-course-incremental-data-processing-troubleshoot-101c4381377b
82
Identify the benefits of using multiple tasks in Jobs.
* Modularity
* Parallel Execution
* Error Isolation
* Resource Optimization (different job clusters for different tasks)
* Simplified Maintenance

## Footnote https://medium.com/@lsleena/databricks-data-engineer-associate-certification-study-notes-20-81819978a414
83
Set up a predecessor task in Jobs.
Select a task. In the Depends on field, click the X to remove a task or select tasks to add from the drop-down menu. ## Footnote https://docs.databricks.com/aws/en/jobs/run-if
84
Identify a scenario in which a predecessor task should be set up.
85
Review a task's execution history.
Navigate to the Jobs UI, select the specific job, and then the Runs tab.
86
Identify CRON as a scheduling opportunity.
87
Debug a failed task.
88
Set up a retry policy in case of failure.
89
Create an alert in the case of a failed task.
90
Identify that an alert can be sent via email.
91
Identify one of the four areas of data governance.
* Data Quality
* Data Stewardship
* Data Protection and Compliance
* Data Management

## Footnote https://www.databricks.com/discover/data-governance
92
Compare and contrast metastores and catalogs.
* Metastore: The top-level container for metadata in Unity Catalog.
* Catalog: The first organizational level within a metastore.

## Footnote https://docs.databricks.com/aws/en/data-governance/unity-catalog
93
Identify Unity Catalog securables.
Objects to which privileges can be granted or revoked:

```text
securable_object
  { CATALOG [ catalog_name ] |
    CONNECTION connection_name |
    CLEAN ROOM clean_room_name |
    EXTERNAL LOCATION location_name |
    EXTERNAL METADATA metadata_name |
    FUNCTION function_name |
    METASTORE |
    PROCEDURE procedure_name |
    SCHEMA schema_name |
    SHARE share_name |
    [ STORAGE | SERVICE ] CREDENTIAL credential_name |
    [ TABLE ] table_name |
    MATERIALIZED VIEW view_name |
    VIEW view_name |
    VOLUME volume_name }
```

## Footnote https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-privileges
94
Define a service principal.
An identity designed for automation and programmatic access. ## Footnote https://docs.databricks.com/aws/en/admin/users-groups/service-principals
95
Identify the cluster security modes compatible with Unity Catalog.
* Dedicated (formerly single user): Can be assigned to and used by a single user or group.
* Standard (formerly shared): Can be used by multiple users, with data isolation among users.
96
Create a UC-enabled all-purpose cluster.
Under compute -> create cluster, in the "advanced options", "Enable Unity Catalog"
97
Create a DBSQL warehouse.
Under Compute -> SQL Warehouses
98
Identify how to query a three-layer namespace.
`SELECT * FROM <catalog>.<schema>.<table>`

or

```sql
USE <catalog>.<schema>;
SELECT * FROM <table>;
```
99
Implement data object access control.
Use access control lists ## Footnote https://docs.databricks.com/aws/en/security/auth/access-control/
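For example (hypothetical principals and objects), granting read access with SQL:

```sql
GRANT USE CATALOG ON CATALOG main TO `data_analysts`;
GRANT USE SCHEMA  ON SCHEMA main.reporting TO `data_analysts`;
GRANT SELECT      ON TABLE main.reporting.daily_sales TO `data_analysts`;
```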
100
Identify colocating metastores with a workspace as best practice.
101
Identify using service principals for connections as best practice.
102
Identify the segregation of business units across catalog as best practice.
When you design your data governance model, you should give careful thought to the catalogs that you create. As the highest level in your organization's data governance model, each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain. Catalogs therefore often mirror organizational units or software development lifecycle scopes. You might choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data. ## Footnote https://docs.databricks.com/aws/en/catalogs/#how-should-i-organize-my-data-into-catalogs