Questions Flashcards

1
Q

What are the differences between star schema and snowflake schema in data warehousing?

A

Star schema and snowflake schema are two dimensional modeling techniques used in data warehousing. A star schema has a central fact table that joins to multiple denormalized dimension tables, each describing a business entity such as product, customer, or date. A snowflake schema is a variation of the star schema that normalizes the dimension tables into multiple levels of hierarchy. The star schema is simpler and faster to query; the snowflake schema is more normalized and reduces data redundancy.
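The difference can be sketched with an in-memory SQLite database (the table and column names here are illustrative, not from any particular warehouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: one fact table referencing denormalized dimension tables.
cur.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT          -- category kept inline: denormalized (star)
);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'widget', 'tools');
INSERT INTO dim_date VALUES (10, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 10, 9.99);
""")

# A snowflake schema would instead move category into its own table and
# join dim_product to it via a category_id foreign key (one more join).
cur.execute("""
SELECT p.category, SUM(f.amount)
FROM fact_sales f JOIN dim_product p USING (product_id)
GROUP BY p.category
""")
rows = cur.fetchall()
```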

2
Q

What are the pros and cons of Hadoop for big data processing?

A

Hadoop is an open-source framework for distributed storage and processing of large-scale data sets. Pros: it can handle various data types, scales horizontally, and is cost-effective on commodity hardware. Cons: it has a steep learning curve, high latency for interactive queries, and high maintenance costs for administration and security.

3
Q

What are some common data quality issues?

A
  • Missing, incomplete, or incorrect data
  • Duplicate or redundant data
  • Inconsistent or incompatible data formats or standards
  • Outdated or irrelevant data
  • Data that does not comply with business rules or regulations
4
Q

How do you handle data quality issues?

A
  • Define the data quality criteria and metrics
  • Perform data profiling and auditing to identify the issues
  • Implement data cleansing and validation techniques
  • Monitor and report the data quality status and improvement
  • Establish data governance and security policies
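A minimal cleansing-and-validation pass over the steps above might look like the following sketch (the record schema and validation rules are hypothetical):

```python
import re

# Hypothetical raw records with typical quality issues.
raw = [
    {"email": "a@example.com", "age": "34"},
    {"email": "a@example.com", "age": "34"},   # duplicate
    {"email": "not-an-email", "age": "29"},    # fails validation
    {"email": "b@example.com", "age": None},   # missing value
]

# Simple illustrative email rule, not a full RFC-compliant check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

seen, clean, rejected = set(), [], []
for rec in raw:
    key = (rec["email"], rec["age"])
    if key in seen:          # deduplicate exact repeats
        continue
    seen.add(key)
    if rec["age"] is None or not EMAIL_RE.match(rec["email"] or ""):
        rejected.append(rec)  # quarantine for profiling/review
    else:
        clean.append({"email": rec["email"], "age": int(rec["age"])})
```

Real pipelines would report the `rejected` count as a data quality metric and feed it back into monitoring.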
5
Q

What are some of the benefits of working with cloud-based data platforms?

A
  • Scalability: Cloud platforms can easily adjust to the changing data volume and demand.
  • Availability: Cloud platforms can provide high availability and reliability by replicating the data across multiple locations and servers.
  • Cost-effectiveness: Cloud platforms can reduce the upfront and maintenance costs of owning and operating physical infrastructure and software.
6
Q

What are some of the challenges working with cloud-based data platforms?

A
  • Integration: Cloud platforms can have integration challenges with existing on-premise systems or other cloud providers.
  • Hidden costs: usage-based pricing can make spending hard to predict and control.
  • Vendor lock-in: dependence on a provider's infrastructure and proprietary services makes migration difficult.
7
Q

What is Spark Driver?

A

The process that runs the main() method of the Spark application and creates the SparkSession object. It is responsible for coordinating the execution of tasks across the Spark cluster.

8
Q

What is SparkSession?

A

The entry point to the Spark application. It provides access to the Spark functionality, such as creating and manipulating RDDs, DataFrames, Datasets, and Spark SQL.

9
Q

What is Spark Cluster Manager?

A

The component that manages the allocation and release of resources across the Spark cluster. It can be one of the following: Standalone, YARN, Mesos, or Kubernetes.

10
Q

What is Spark Executor?

A

The process that runs on each worker node in the cluster and executes the tasks assigned by the driver. It also stores the data in memory or disk.

11
Q

What is Spark Task?

A

The unit of work that is sent to the executor by the driver. It is a computation on a partition of data.

12
Q

What is Spark Job?

A

A parallel computation that consists of multiple tasks that are triggered by an action on an RDD, DataFrame, or Dataset.

13
Q

What is Spark Stage?

A

A set of tasks within a job that can be executed in parallel. A stage is divided by shuffle boundaries, which are operations that require data movement across executors.

14
Q

What is Spark RDD?

A

Resilient Distributed Dataset: The original and low-level abstraction in Spark. It is an immutable collection of objects that can be partitioned across the cluster and operated on in parallel. It supports two types of operations: transformations and actions. It provides fault-tolerance by maintaining lineage information. It does not have any schema or optimization information.

15
Q

What is Spark DataFrame?

A

A higher-level abstraction in Spark that is similar to a table in a relational database. It is a distributed collection of rows organized into named columns. It supports both SQL and domain-specific language (DSL) queries. It provides fault-tolerance by maintaining lineage information. It has a schema and optimization information that can be used by the Catalyst optimizer to improve performance.

16
Q

What is Spark Dataset?

A

A higher-level abstraction in Spark that combines the benefits of RDDs and DataFrames. It is a distributed collection of typed objects that can be manipulated using both functional and relational operations. It supports both SQL and domain-specific language (DSL) queries. It provides fault-tolerance by maintaining lineage information. It has a schema and optimization information that can be used by the Catalyst optimizer to improve performance. The typed Dataset API is available in Scala and Java; in Python, the untyped DataFrame API plays this role.

17
Q

What are some of the benefits of using Spark for big data processing?

A
  • Speed: Spark can run some workloads up to 100 times faster than Hadoop MapReduce by using in-memory computation.
  • Ease of use: Spark provides over 80 high-level operators that make it easy to build parallel apps using various languages.
  • Generality: Spark supports multiple types of data processing, such as batch, streaming, interactive, graph, or machine learning.
  • Compatibility: Spark can run on various platforms and access data from multiple sources.
18
Q

What are some of the challenges of using Spark for big data processing?

A
  • Memory management: Spark requires sufficient memory to store and process large amounts of data in memory. If the memory is insufficient or poorly configured, it can cause performance issues or errors.
  • Debugging: Spark applications can be difficult to debug due to their distributed nature and lack of visibility into the framework.
  • Tuning: Spark applications can require careful tuning to achieve optimal performance and resource utilization. Some of the parameters that need to be tuned are parallelism level, partition size, serialization format, memory fraction, garbage collection strategy, etc.
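A few of the tuning knobs above, as they might appear in `spark-defaults.conf` (the values are illustrative placeholders, not recommendations):

```properties
# Illustrative values only; tune against your own workload.
spark.executor.memory          4g
spark.executor.cores           2
# Parallelism for RDD operations / partition count after a DataFrame shuffle:
spark.default.parallelism      200
spark.sql.shuffle.partitions   200
# Kryo is typically faster and more compact than Java serialization:
spark.serializer               org.apache.spark.serializer.KryoSerializer
# Fraction of heap shared by execution and storage:
spark.memory.fraction          0.6
```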
19
Q

Data Warehouse

A

A database for structured data with predefined schema and fast SQL queries for reporting and analysis.

20
Q

Data Lake

A

A repository for raw data in any format with flexible analytics for dashboards, big data, real-time, and machine learning.

21
Q

Data Lakehouse

A

A new architecture for data in open formats with data management and ACID transactions for BI and ML on all data.

22
Q

Data Vault

A

A design pattern for data warehouse with hubs, links, and satellites for agile and scalable analytics.

23
Q

Main advantage for Data Lakehouse

A

Combines the performance and optimization of a data warehouse with the flexibility of a data lake, enabling structure and schema to be applied to unstructured data.

24
Q

Main disadvantage for Data Lakehouse?

A

Can be difficult to implement, maintain, or migrate to. The additional management layers involve extra processing steps, so data latency may be higher.