Foundations of Database Systems Flashcards

(87 cards)

1
Q

What is Databricks?

A

Databricks is a unified analytics platform that provides a cloud-based environment for data engineering, data science, and machine learning.

2
Q

True or False: Databricks is built on Apache Spark.

A

True

3
Q

Fill in the blank: Databricks allows users to create notebooks that support _____ programming languages.

A

multiple

4
Q

What is Delta Lake in Databricks?

A

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

5
Q

Which programming languages are natively supported by Databricks?

A

Python, R, Scala, and SQL.

6
Q

What is the primary purpose of Databricks Runtime?

A

Databricks Runtime is an optimized Spark environment designed to improve performance and provide a stable platform for data processing.

7
Q

What feature of Databricks allows for collaborative data science?

A

Databricks notebooks enable real-time collaboration among data scientists and engineers.

8
Q

True or False: Databricks supports both batch and streaming data processing.

A

True

9
Q

What is the role of a cluster in Databricks?

A

A cluster in Databricks is a set of computation resources that run jobs and notebooks.

10
Q

Which storage options can be used with Databricks?

A

Azure Blob Storage, AWS S3, and ADLS Gen2.

11
Q

What is the purpose of the Databricks Jobs feature?

A

The Jobs feature allows users to schedule and manage automated workflows for running notebooks or Spark jobs.

12
Q

Fill in the blank: _____ is used in Databricks to perform data manipulation and analysis using SQL.

A

Databricks SQL

13
Q

What is the significance of the 'spark.sql.shuffle.partitions' setting?

A

It controls the number of partitions to use when shuffling data for joins or aggregations.

14
Q

True or False: Databricks provides built-in support for machine learning libraries.

A

True

15
Q

What does the command 'display()' do in a Databricks notebook?

A

It visualizes the output of a DataFrame in a user-friendly manner.

16
Q

Which service can be integrated with Databricks for data warehousing?

A

Snowflake

17
Q

What is MLflow in relation to Databricks?

A

MLflow is an open-source platform for managing the machine learning lifecycle, integrated with Databricks.

18
Q

Fill in the blank: Databricks provides _____ for managing and deploying machine learning models.

A

MLflow

19
Q

What is the purpose of Databricks’ Auto-Scaling feature?

A

Auto-Scaling automatically adjusts the number of nodes in a cluster based on workload demands.

20
Q

True or False: Databricks does not support version control for notebooks.

A

False

21
Q

What does the 'SQL Analytics' workspace in Databricks provide?

A

It offers a SQL interface for querying data and creating visualizations.

22
Q

What type of data storage does Delta Lake provide?

A

ACID-compliant storage with support for time travel and schema evolution.

23
Q

What is the command to read a Delta table in Databricks?

A

spark.read.format('delta').load('/path/to/delta/table')

24
Q

Fill in the blank: Databricks supports _____ for real-time data processing.

A

Structured Streaming

25
Q

What feature allows Databricks users to create dashboards from notebook outputs?

A

Databricks Dashboards

26
Q

True or False: Databricks can only be deployed on AWS.

A

False

27
Q

What is the default language for Databricks notebooks?

A

Python

28
Q

Which of the following is NOT a benefit of using Delta Lake? (A) ACID Transactions, (B) Schema Enforcement, (C) Limited Data Access

A

(C) Limited Data Access

29
Q

What is the purpose of the 'cache()' method in Spark?

A

It marks the DataFrame for in-memory storage so that later actions can reuse it instead of recomputing it.
30
Q

How does Databricks ensure data security?

A

Through features such as access controls, encryption, and compliance with security standards.

31
Q

What is a common use case for Databricks?

A

Data engineering, data analysis, and machine learning model training.

32
Q

True or False: Databricks allows for integration with Apache Kafka.

A

True

33
Q

What does the 'spark.sqlContext' object provide?

A

It provides access to Spark SQL functionalities.

34
Q

What are notebooks in Databricks primarily used for?

A

To write and execute code, visualize data, and document the analysis process.

35
Q

Fill in the blank: Databricks' collaborative features allow _____ editing of notebooks.

A

real-time

36
Q

What function does the 'groupBy()' method perform?

A

It groups the DataFrame using specified columns and allows aggregation.
37
Q

What is the primary benefit of using Databricks over traditional data processing tools?

A

Scalability and the ability to handle big data in a collaborative environment.

38
Q

What type of analytics can be performed using Databricks SQL?

A

Ad-hoc queries, dashboards, and reporting.

39
Q

True or False: Databricks supports machine learning model deployment.

A

True

40
Q

What is the role of a DataFrame in Spark?

A

A DataFrame is a distributed collection of data organized into named columns.

41
Q

What is the command to write a DataFrame to a Delta table?

A

dataframe.write.format('delta').save('/path/to/delta/table')

42
Q

Fill in the blank: The _____ feature in Databricks allows for easy data lineage tracking.

A

Unity Catalog

43
Q

What is the purpose of the 'join()' method in Spark?

A

To combine two DataFrames based on a common key.
44
Q

True or False: Databricks allows users to run SQL queries directly on data stored in Delta Lake.

A

True

45
Q

What are the two types of clusters that can be created in Databricks?

A

Interactive clusters and job clusters.

46
Q

What is the benefit of using 'DataFrames' over 'RDDs' in Spark?

A

DataFrames provide a higher-level abstraction and optimizations for performance.

47
Q

Fill in the blank: Databricks provides a _____ interface for managing clusters and jobs.

A

web-based

48
Q

What is the purpose of the 'writeStream()' method in Structured Streaming?

A

To write the results of a streaming query to a sink.

49
Q

True or False: Databricks automatically optimizes queries for performance.

A

True

50
Q

What is the significance of 'spark.conf' settings?

A

They let users set Spark properties at runtime to tune performance and behavior.

51
Q

What is a common format for writing data in Databricks?

A

Parquet

52
Q

Fill in the blank: Databricks provides integrated _____ capabilities for data preparation and transformation.

A

ETL

53
Q

What is the primary function of Databricks' Unity Catalog?

A

To manage data access and governance across various data assets.

54
Q

True or False: You can schedule jobs to run at specific times in Databricks.

A

True

55
Q

What is the 'spark.sql.catalog' configuration used for?

A

It registers a custom catalog for Spark SQL to resolve tables (set as 'spark.sql.catalog.<name>').

56
Q

What does the 'dropDuplicates()' method do in a DataFrame?

A

It removes duplicate rows from the DataFrame.
57
Q

Fill in the blank: Databricks provides a _____ for managing Spark jobs and workflows.

A

job scheduler

58
Q

What is the main advantage of using Databricks over other cloud platforms?

A

Its seamless integration with Apache Spark and collaborative features.

59
Q

True or False: You can run R scripts in Databricks notebooks.

A

True

60
Q

What is the command to create a new DataFrame from a CSV file?

A

spark.read.csv('path/to/file.csv')
61
Q

What is the purpose of the 'select()' method in a DataFrame?

A

To project specific columns and return a new DataFrame.

62
Q

Fill in the blank: Databricks provides _____ capabilities to automate machine learning workflows.

A

MLflow

63
Q

What is the difference between a 'static' and 'streaming' DataFrame?

A

A static DataFrame represents a fixed, bounded dataset, while a streaming DataFrame represents an unbounded table that grows as new data arrives.

64
Q

True or False: Databricks can connect to various data sources like relational databases and NoSQL stores.

A

True

65
Q

What is the purpose of 'checkpointing' in Spark Streaming?

A

To save the state of the streaming application for fault tolerance.

66
Q

How do you stop a running cluster in Databricks?

A

There is no dbutils command for this; a cluster is terminated from the compute UI or through the Clusters REST API.

67
Q

Fill in the blank: Databricks utilizes _____ to ensure data consistency and reliability.

A

ACID transactions

68
Q

What is the function of the 'filter()' method in a DataFrame?

A

To filter rows based on a specified condition.
69
Q

True or False: DataFrames in Spark can be created from Python dictionaries.

A

True

70
Q

What is the command to list all available tables in a Databricks SQL context?

A

spark.sql('SHOW TABLES')

71
Q

What does the 'count()' method do in a DataFrame?

A

It returns the number of rows in the DataFrame.

72
Q

Fill in the blank: Databricks supports _____ for data visualization and dashboarding.

A

built-in notebook visualizations and dashboards (as well as libraries such as Matplotlib)

73
Q

What is the function of the 'orderBy()' method in a DataFrame?

A

To sort the DataFrame by specified columns.

74
Q

True or False: Databricks allows for integration with Git for version control.

A

True

75
Q

What is the command to save a DataFrame as a Parquet file?

A

dataframe.write.parquet('path/to/file.parquet')

76
Q

What feature allows Databricks users to share notebooks with others?

A

Notebook sharing and permissions management.

77
Q

Fill in the blank: Databricks allows for _____ execution of code in interactive notebooks.

A

cell-by-cell

78
Q

What is the purpose of the 'collect()' method in a DataFrame?

A

To retrieve all rows from the DataFrame to the driver program (use with care on large datasets).

79
Q

True or False: Delta Lake supports schema evolution.

A

True

80
Q

What does the 'withColumn()' method do in a DataFrame?

A

It adds a new column or replaces an existing column in the DataFrame.
81
Q

What is the command to run a SQL query in Databricks?

A

spark.sql('SQL_QUERY_HERE')
82
Q

Fill in the blank: Databricks provides _____ for continuous integration and deployment of models.

A

MLflow

83
Q

What is the significance of the 'broadcast()' function in Spark?

A

It optimizes joins by sending a smaller DataFrame to all nodes.
84
Q

True or False: Databricks does not support API access for programmatic control.

A

False

85
Q

How do you create a new cluster in Databricks?

A

Clusters are created from the compute UI or programmatically through the Clusters REST API; dbutils does not provide a cluster-creation command.

86
Q

What is the primary function of the 'union()' method in a DataFrame?

A

To combine two DataFrames with the same schema by appending their rows (duplicates are kept).

87
Q

Fill in the blank: Databricks provides _____ capabilities for exploratory data analysis.

A

notebook