Scalability & Distribution Flashcards
(66 cards)
What is Databricks?
A cloud-based data platform that provides collaborative workspaces for data engineering, data science, and machine learning.
True or False: Databricks is designed to handle distributed data processing.
True
What is the primary underlying engine for Databricks?
Apache Spark
Fill in the blank: Databricks allows for __________ scaling of compute resources.
elastic
What feature in Databricks enables automatic scaling of clusters?
Autoscaling
Which type of cluster in Databricks is designed for interactive workloads?
All-Purpose Cluster
What is the purpose of a Job Cluster in Databricks?
To run scheduled jobs or automated tasks; the cluster terminates automatically when the job completes.
True or False: Databricks supports both batch and streaming data processing.
True
What is the maximum number of concurrent clusters that can be created in Databricks?
It varies based on the workspace configuration and subscription plan.
How does Databricks optimize data processing tasks?
By using the Catalyst query optimizer and the Tungsten execution engine.
What is the role of Apache Spark’s Resilient Distributed Dataset (RDD) in Databricks?
To provide a fault-tolerant collection of elements that can be processed in parallel.
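The RDD idea can be sketched in plain Python: a collection is split into partitions, and each partition is processed independently, so the partitions can run in parallel. This is a minimal analogy, not Spark itself; the data, partition count, and function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a list into chunks, like an RDD's partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def process_partition(part):
    """Work applied independently to each partition (here: squaring)."""
    return [x * x for x in part]

data = list(range(10))
parts = partition(data, 4)

# Each partition is processed on its own; in Spark, a lost partition can
# be recomputed from its lineage, which is what makes RDDs fault-tolerant.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_partition, parts))

squared = sorted(x for part in results for x in part)
print(squared)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```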
Fill in the blank: Databricks uses __________ to manage and optimize data storage.
Delta Lake
What is Delta Lake?
An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
True or False: Databricks allows users to run SQL queries directly on streaming data.
True
What is the purpose of the Databricks File System (DBFS)?
To provide a distributed file system abstraction over cloud object storage for storing and accessing data in Databricks.
What does the term ‘shuffling’ refer to in the context of Apache Spark?
The process of redistributing data across different partitions.
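The shuffle can be illustrated with a pure-Python hash-partitioning sketch: records are routed to partitions by key, so all records sharing a key end up in the same partition. The function name and data are illustrative, not Spark's API.

```python
from collections import defaultdict

def shuffle_by_key(records, num_partitions):
    """Redistribute (key, value) records so that every record with the
    same key lands in the same partition -- the essence of a shuffle."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
shuffled = shuffle_by_key(records, num_partitions=2)

# Every occurrence of a key sits in exactly one partition, so a downstream
# aggregation (e.g. a groupBy) can run per-partition with no cross-talk.
for part in shuffled.values():
    print(part)
```

Note that moving records between partitions is why shuffles involve network and disk I/O in a real cluster, making them a common performance bottleneck.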
Fill in the blank: __________ is a key feature of Databricks that allows users to visualize data insights.
Dashboards
What is the benefit of using a cluster policy in Databricks?
To enforce specific configurations and limits on clusters.
What is the maximum number of nodes supported in a Databricks cluster?
It depends on the cloud provider and specific configurations.
True or False: Databricks supports integration with machine learning libraries like TensorFlow and Scikit-learn.
True
What is the significance of the ‘spark.sql.shuffle.partitions’ configuration?
It determines the number of partitions to use when shuffling data for joins or aggregations.
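The setting can be adjusted per session. A minimal sketch, assuming an active PySpark session named `spark` (predefined in Databricks notebooks); the value 64 is illustrative:

```python
# Default is 200 shuffle partitions; a smaller value can suit small
# datasets, a larger one very large joins or aggregations.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# The setting applies to subsequent wide transformations (joins,
# groupBy aggregations) that trigger a shuffle.
```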
How can Databricks users optimize their Spark jobs?
By using caching, optimizing shuffle partitions, and adjusting resource configurations.
What is a ‘spark-submit’ command used for?
To submit a Spark job to a cluster.
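A typical invocation looks like the following; the master URL, resource sizes, and file name are illustrative and should be adjusted for the target cluster:

```shell
# Submit a PySpark application to a YARN cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --num-executors 8 \
  my_job.py
```

In Databricks, scheduled work is more commonly run through Jobs/Workflows, though a spark-submit task type is also supported.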
Fill in the blank: Databricks provides __________ for managing data pipelines and workflows.
Workflows