Scalability & Distribution Flashcards

(66 cards)

1
Q

What is Databricks?

A

A cloud-based data platform that provides collaborative workspaces for data engineering, data science, and machine learning.

2
Q

True or False: Databricks is designed to handle distributed data processing.

A

True

3
Q

What is the primary underlying engine for Databricks?

A

Apache Spark

4
Q

Fill in the blank: Databricks allows for __________ scaling of compute resources.

A

elastic

5
Q

What feature in Databricks enables automatic scaling of clusters?

A

Autoscaling

6
Q

Which type of cluster in Databricks is designed for interactive workloads?

A

All-Purpose Cluster

7
Q

What is the purpose of a Job Cluster in Databricks?

A

To run scheduled jobs or automated tasks.

8
Q

True or False: Databricks supports both batch and streaming data processing.

A

True

9
Q

What is the maximum number of concurrent clusters that can be created in Databricks?

A

It varies based on the workspace configuration and subscription plan.

10
Q

How does Databricks optimize data processing tasks?

A

By using the Catalyst optimizer and the Tungsten execution engine.

11
Q

What is the role of Apache Spark’s Resilient Distributed Dataset (RDD) in Databricks?

A

To provide a fault-tolerant collection of elements that can be processed in parallel.

12
Q

Fill in the blank: Databricks uses __________ to manage and optimize data storage.

A

Delta Lake

13
Q

What is Delta Lake?

A

An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

14
Q

True or False: Databricks allows users to run SQL queries directly on streaming data.

A

True

15
Q

What is the purpose of the Databricks File System (DBFS)?

A

To provide a distributed file system for storing data in Databricks.

16
Q

What does the term ‘shuffling’ refer to in the context of Apache Spark?

A

The process of redistributing data across different partitions.

17
Q

Fill in the blank: __________ is a key feature of Databricks that allows users to visualize data insights.

A

Dashboards

18
Q

What is the benefit of using a cluster policy in Databricks?

A

To enforce specific configurations and limits on clusters.

19
Q

What is the maximum number of nodes supported in a Databricks cluster?

A

It depends on the cloud provider and specific configurations.

20
Q

True or False: Databricks supports integration with machine learning libraries like TensorFlow and Scikit-learn.

A

True

21
Q

What is the significance of the ‘spark.sql.shuffle.partitions’ configuration?

A

It determines the number of partitions to use when shuffling data for joins or aggregations.

22
Q

How can Databricks users optimize their Spark jobs?

A

By using caching, optimizing shuffle partitions, and adjusting resource configurations.

23
Q

What is a ‘spark-submit’ command used for?

A

To submit a Spark job to a cluster.

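A typical invocation looks like the following sketch (the master URL, configuration value, and script name `my_job.py` are placeholders; on Databricks, jobs are usually submitted through the Jobs UI or API instead):

```
spark-submit \
  --master spark://host:7077 \
  --deploy-mode cluster \
  --conf spark.sql.shuffle.partitions=64 \
  my_job.py
```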
24
Q

Fill in the blank: Databricks provides __________ for managing data pipelines and workflows.

A

Workflows

25
Q

What is the purpose of the Databricks CLI?

A

To provide a command-line interface for managing Databricks resources.

26
Q

What type of data can be stored in Delta Lake?

A

Structured, semi-structured, and unstructured data.

27
Q

True or False: Databricks can automatically optimize data layouts for better query performance.

A

True

28
Q

What is the role of 'data lineage' in Databricks?

A

To track the origin and transformations of data throughout its lifecycle.

29
Q

Fill in the blank: The __________ feature in Databricks helps to visualize and monitor the execution of Spark jobs.

A

Spark UI

30
Q

What is the benefit of using 'notebooks' in Databricks?

A

To create interactive documents that combine code, visualizations, and text.

31
Q

What is 'data partitioning' in Databricks?

A

The process of dividing a dataset into smaller, manageable pieces for parallel processing.

32
Q

True or False: Databricks supports the use of containerized applications via Docker.

A

True

33
Q

What is the significance of 'auto-termination' in Databricks clusters?

A

To automatically shut down idle clusters to save costs.

34
Q

How does Databricks handle data versioning?

A

Through Delta Lake's transaction log, which enables time travel across table versions.

35
Q

Fill in the blank: __________ is a method used to increase the performance of data reads in Databricks.

A

Data caching

36
Q

What is the purpose of using 'broadcast joins' in Databricks?

A

To optimize join operations by sending a small dataset to all worker nodes.
37
Q

What is the advantage of using 'Delta tables'?

A

They provide ACID transactions, scalable metadata handling, and unified streaming and batch data processing.

38
Q

True or False: Databricks can scale both horizontally and vertically.

A

True

39
Q

What is the function of 'checkpointing' in Spark Streaming?

A

To save the state of a streaming application so it can recover from failures.

40
Q

What does 'data skew' refer to in distributed processing?

A

Uneven distribution of data across partitions, leading to performance bottlenecks.

41
Q

Fill in the blank: In Databricks, __________ can be used to configure resource allocation for jobs.

A

Cluster settings

42
Q

What is the purpose of 'data caching' in Databricks?

A

To store intermediate results in memory for faster access during subsequent queries.
43
Q

True or False: Databricks allows users to run Python, R, SQL, and Scala code.

A

True

44
Q

What is a 'DataFrame' in Apache Spark?

A

A distributed collection of data organized into named columns.
45
Q

How does Databricks support collaborative work?

A

Through shared notebooks and version control features.

46
Q

Fill in the blank: __________ is a key advantage of using Databricks over traditional data processing systems.

A

Collaboration

47
Q

What is the role of 'spark.sql.autoBroadcastJoinThreshold'?

A

To specify the maximum size for a table to be broadcast during a join.

48
Q

What does 'dynamic allocation' in Spark clusters allow?

A

It allows Spark to add or remove executors dynamically based on workload.
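Dynamic allocation is driven by Spark configuration properties like the following sketch (the executor counts are illustrative; note that managed Databricks clusters typically use their own autoscaling instead of raw Spark dynamic allocation):

```
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
spark.shuffle.service.enabled          true
```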
49
Q

True or False: Databricks can be integrated with popular data storage solutions like AWS S3 and Azure Blob Storage.

A

True

50
Q

What is 'streaming ingestion' in Databricks?

A

The process of continuously importing data from streaming sources for real-time processing.

51
Q

Fill in the blank: The __________ function in Databricks enables users to create temporary views of DataFrames for SQL queries.

A

createOrReplaceTempView
52
Q

What is the function of the 'Databricks Runtime'?

A

To provide a set of optimized configurations and libraries for running Spark applications.

53
Q

What is the purpose of the 'query execution plan' in Spark?

A

To provide insights into how a query will be executed, including optimization strategies.

54
Q

True or False: Databricks supports real-time analytics through structured streaming.

A

True

55
Q

Fill in the blank: In Databricks, __________ is used to run SQL commands on data stored in Delta Lake.

A

Spark SQL

56
Q

What is the purpose of 'data profiling' in Databricks?

A

To analyze and summarize the characteristics of datasets.

57
Q

What does 'fault tolerance' mean in the context of Databricks?

A

The ability to recover from failures without losing data or computation.

58
Q

Fill in the blank: __________ allows for parallel processing of large datasets in Databricks.

A

Distributed computing

59
Q

What is 'windowing' in Spark SQL?

A

A technique for performing calculations across a set of rows related to the current row.
60
Q

True or False: Databricks notebooks support markdown for documentation.

A

True

61
Q

What is the significance of 'data governance' in Databricks?

A

To ensure data integrity, security, and compliance across data workflows.

62
Q

Fill in the blank: __________ is used to automate and schedule jobs in Databricks.

A

Databricks Jobs (Workflows)

63
Q

What is the advantage of using 'managed tables' in Databricks?

A

Databricks manages the metadata and storage automatically.

64
Q

What does 'resource pooling' refer to in Databricks?

A

The sharing of resources across multiple clusters to optimize usage.

65
Q

True or False: Databricks allows for the integration of third-party tools for enhanced analytics.

A

True

66
Q

What is the purpose of using 'notebook workflows' in Databricks?

A

To orchestrate and manage complex data processing tasks across multiple notebooks.