Databricks Data Engineer Associate Certification Flashcards
https://www.databricks.com/sites/default/files/2025-02/databricks-certified-data-engineer-associate-exam-guide-1-mar-2025.pdf (102 cards)
Describe the relationship between the data lakehouse and the data warehouse.
A data warehouse stores only structured data, whereas a data lakehouse can store both structured and unstructured data. A data lakehouse builds on data lake architecture, adding a metadata layer.
https://www.databricks.com/glossary/data-lakehouse
Identify the improvement in data quality in the data lakehouse over the data lake.
The data lakehouse adds:
* A metadata layer
* New query engine designs
* Optimized access for data science and ML tools
https://www.databricks.com/glossary/data-lakehouse
Compare and contrast silver and gold tables, which workloads will use a bronze table as a source, which workloads will use a gold table as a source.
Silver layer tables are minimally cleansed and conformed, just enough for self-service ad-hoc reporting, advanced analytics, or ML.
Gold layer tables are typically consumption-ready and organized into project-specific databases for reporting. Gold layer tables are more often de-normalized and should be read-optimized.
Bronze tables hold raw ingested data and serve as the source for the ETL workloads that build the silver layer; gold tables serve as the source for BI reporting, dashboards, and other end-user analytics workloads.
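As a minimal sketch (catalog, table, and column names are hypothetical), a gold table for reporting might be built from a silver table like this:
```sql
-- Aggregate a cleansed silver table into a de-normalized, read-optimized
-- gold table for reporting (all names are illustrative).
CREATE OR REPLACE TABLE gold.daily_sales AS
SELECT
  order_date,
  region,
  SUM(amount) AS total_sales,
  COUNT(*)    AS order_count
FROM silver.orders
GROUP BY order_date, region;
```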
https://www.databricks.com/glossary/medallion-architecture
Identify elements of the Databricks Platform Architecture, such as what is located in the data plane versus the control plane and what resides in the customer’s cloud account.
Control Plane: backend services that Databricks manages in your Databricks account. The web application is in the control plane.
Compute Plane (also called the “data plane”): where your data is processed:
* Serverless compute in your Databricks account
* Classic Databricks compute in your AWS or Azure account
https://docs.databricks.com/aws/en/getting-started/overview
Differentiate between all-purpose clusters and jobs clusters.
- You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
- The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
https://docs.databricks.com/aws/en/getting-started/concepts#computation-management
Identify how cluster software is versioned using the Databricks Runtime.
- Long Term Support versions are marked with an “LTS” qualifier
- Major versions are represented by the integer preceding the decimal point; increments to the major version are likely to include backwards-incompatible changes
- Feature versions are represented by the integer following the decimal point and introduce new features that are always backwards compatible
For example, in Databricks Runtime 15.4 LTS, 15 is the major version, 4 is the feature version, and the LTS qualifier marks long-term support.
https://docs.databricks.com/aws/en/compute/#runtime-versioning
Identify how clusters can be filtered to view those that are accessible by the user.
Describe how clusters are terminated and the impact of terminating a cluster.
- Manually through the UI, CLI, or REST API by a user with “CAN RESTART” or “CAN MANAGE” permissions
- Automatically after a specified period of inactivity
- Unexpected termination
Terminating a compute releases its cloud instances and any data cached in memory; the compute configuration is retained, so an all-purpose compute can be restarted later.
https://docs.databricks.com/aws/en/compute/clusters-manage#terminate-a-compute
Identify a scenario in which restarting the cluster will be useful.
When you restart a compute, it gets the latest images for the compute resource containers and the VM hosts. It is important to schedule regular restarts for long-running compute such as those used for processing streaming data.
https://docs.databricks.com/aws/en/compute/clusters-manage#restart-a-compute-to-update-it-with-the-latest-images
Describe how to use multiple languages within the same notebook.
- Click the cell language button and select a different language for a cell
- Use a language magic command such as `%python`, `%scala`, or `%sql` at the top of a cell (see the example below)
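A minimal sketch, assuming the notebook's default language is Python; the `%sql` magic on the first line switches just this cell to SQL:
```sql
%sql
-- Runs as SQL even though the notebook's default language is Python
SELECT current_date() AS today
```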
https://docs.databricks.com/aws/en/notebooks/notebooks-code#code-languages-in-notebooks
Identify how to run one notebook from within another notebook.
- Use the `%run` magic command: this includes another notebook within your notebook, in the same scope as your notebook.
- Use `dbutils.notebook.run()`: this starts a new job to run the notebook, and allows you to pass parameters and return values.
https://docs.databricks.com/aws/en/notebooks/notebook-workflows
Identify how notebooks can be shared with other users.
- Click “Share” at the top of the notebook and select one or more users or groups
- Share a folder in which a notebook is located
https://docs.databricks.com/aws/en/notebooks/notebooks-collaborate#share-a-notebook
Describe how Databricks Repos enables CI/CD workflows in Databricks.
Using the Databricks Repos API:
- Admin flow: For production flows, a Databricks workspace admin sets up top-level folders in your workspace to host the production git folders. The admin clones a Git repository and branch when creating them, and could give these folders meaningful names such as “Production”, “Test”, or “Staging” which correspond to the remote Git repositories’ purpose in your development flows. For more details, see Production Git folder.
- User flow: A user can create a Git folder under /Workspace/Users/<email>/ based on a remote Git repository. The user creates a local, user-specific branch for their work, commits changes to it, and pushes the branch to the remote repository. For information on collaborating in user-specific Git folders, see Collaborate using Git folders.
- Merge flow: Users can create pull requests (PRs) after pushing from a Git folder. When the PR is merged, automation can pull the changes into the production Git folders using the Databricks Repos API.
https://docs.databricks.com/aws/en/repos/ci-cd
Identify Git operations available via Databricks Repos.
- Create branch
- Switch branch
- Commit & push
- Pull
- Merge
- Rebase
- Resolve Merge Conflicts
- Reset
- Sparse Checkout
https://docs.databricks.com/aws/en/repos/git-operations-with-repos
Identify limitations in Databricks Notebooks version control functionality relative to Repos.
- Working branches are limited to 1 gigabyte (GB).
- Files larger than 10 MB can’t be viewed in the Databricks UI.
- Individual workspace files are subject to a separate size limit. For more details, read Limitations.
- The local version of a branch can remain present in the associated Git folder for up to 30 days after the remote branch is deleted. To completely remove a local branch in a Git folder, delete the repository.
https://docs.databricks.com/aws/en/repos/limits
Extract data from a single file and from a directory of files.
- Unity Catalog volume paths
SQL:
```sql
SELECT * FROM parquet.`/Volumes/<catalog_name>/<schema_name>/<volume_name>/path/to/data`
```
Python:
```python
spark.read.format("parquet").load("/Volumes/catalog_name/schema_name/volume_name/path/to/data")
```
- Cloud storage URIs
SQL:
```sql
SELECT * FROM json.`s3://bucket_name/path/to/data`
```
Python:
```python
spark.read.format("parquet").load("s3://bucket_name/path/to/data")
```
The path can point either to a single file or to a directory, in which case all files in the directory are read.
https://docs.databricks.com/aws/en/files/
Identify the prefix included after the FROM keyword as the data type.
```sql
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data`
```
The identifier before the path (here `parquet`) specifies the file format used to read the data.
https://docs.databricks.com/aws/en/files/
Create a view, a temporary view, and a CTE as a reference to a file.
Create a view:
```sql
CREATE VIEW view_name AS
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
```
Create a temporary view:
```sql
CREATE TEMPORARY VIEW temp_view_name AS
SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
```
Create a CTE:
```sql
WITH DATA_CTE AS (
  SELECT * FROM parquet.`/Volumes/catalog_name/schema_name/volume_name/path/to/data.parquet`
)
SELECT * FROM DATA_CTE
```
Identify that tables from external sources are not Delta Lake tables.
```sql
DESCRIBE DETAIL catalog.schema.table
```
The output includes `format` and `location` columns. A `format` of `delta` indicates a Delta Lake table; any other value means the table is not backed by Delta Lake.
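A related check, using an illustrative three-level table name: `DESCRIBE TABLE EXTENDED` reports a Provider row that shows the table's underlying data source.
```sql
-- The "Provider" row in the detailed table information shows the data source,
-- e.g. delta for Delta Lake tables or CSV for a table defined over CSV files.
DESCRIBE TABLE EXTENDED catalog.schema.table
```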
Create a table from a JDBC connection and from an external CSV file.
```sql
-- Create a table from a JDBC connection
CREATE TABLE IF NOT EXISTS ora_tab
USING ORACLE
OPTIONS (
  url '<jdbc-url>',
  dbtable '<table-name>',
  user '<username>',
  password '<password>'
);

-- Create a table from an external CSV file
CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
```
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using#examples
Identify how the count_if function and the count where x is null can be used.
count_if(expr) returns the number of rows for which expr evaluates to true. To count rows where a column x is NULL, use count_if(x IS NULL).
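A minimal sketch, assuming a table named orders with a nullable amount column:
```sql
-- Count rows matching a predicate, and rows where a column is NULL
SELECT
  count_if(amount > 100)   AS large_orders,    -- rows where the expression is true
  count_if(amount IS NULL) AS missing_amounts  -- rows where amount is NULL
FROM orders;
```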
https://docs.databricks.com/aws/en/sql/language-manual/functions/count_if
Identify when the SQL count function skips NULL values.
- `count(*)`: counts all rows, including rows that contain NULL values
- `count(COLUMN_NAME)`: counts only the non-null values in the named column (see the sketch below)
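A minimal sketch, assuming a table t with a nullable column c:
```sql
-- count(*) counts every row; count(c) skips rows where c is NULL
SELECT
  count(*) AS all_rows,
  count(c) AS non_null_c
FROM t;
```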
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-null-semantics#built-in-aggregate-expressions
Deduplicate rows from an existing Delta Lake table.
Using MERGE INTO ... WHEN NOT MATCHED BY SOURCE THEN DELETE:
```sql
-- Keep the earliest row per id and delete the remaining duplicates.
-- The ON clause also matches on the ordering column (updated); otherwise every
-- target row would match the deduplicated source on id alone and nothing would be deleted.
WITH deduplicated_target AS (
  SELECT * FROM target
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated ASC) = 1
)
MERGE INTO target
USING deduplicated_target
ON deduplicated_target.id = target.id
  AND deduplicated_target.updated = target.updated
WHEN NOT MATCHED BY SOURCE THEN DELETE
```
https://docs.databricks.com/gcp/en/delta/merge
Create a new table from an existing table while removing duplicate rows.
```sql
-- CREATE TABLE ... LIKE copies only the schema of the existing table,
-- so the deduplicated rows are loaded with a separate INSERT.
CREATE TABLE new_table LIKE existing_table;

INSERT INTO new_table
SELECT * FROM existing_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated ASC) = 1;
```
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-table-like