Data Warehousing with Apache Hive Flashcards

1
Q

What is a star schema in data warehousing?

A

A star schema is a type of database schema that is a popular design pattern for data warehouses and business intelligence applications. It consists of one or more fact tables referencing any number of dimension tables, resembling a star.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the role of a fact table in a star schema?

A

The fact table is the central table in a star schema. It contains measurable, quantitative data (like sales amount) and foreign keys from dimension tables, representing the core data being analyzed.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are dimension tables in a star schema?

A

Dimension tables in a star schema store attributes or dimensions that describe the objects in the fact table. These include descriptive data like dates, products, or geographic locations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the benefits of using a star schema?

A

Star schemas simplify data modeling and querying. They enhance query performance due to their straightforward structure, making them ideal for business intelligence and data warehousing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How does a star schema differ from a snowflake schema?

A

Unlike the more complex snowflake schema, the star schema is simpler with denormalized dimension tables. This simplicity can lead to faster query times but potentially larger storage requirements due to denormalization.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What would a fact table for a product order transaction typically include?

A

A fact table for a product order transaction would include measurable, quantitative data such as Order ID, Product ID, Quantity Ordered, Total Price, and possibly foreign keys linking to dimension tables like Date, Customer, Product, and Store.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What attributes might a Date dimension table contain for a product order?

A

The Date dimension table could include Date ID (as primary key), Day, Month, Quarter, Year, and other relevant time-related attributes like Weekday or Holiday.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What would you typically find in a Customer dimension table?

A

A Customer dimension table would feature Customer ID, Name, Address, Contact Information, Demographic Data, and possibly a Customer Segment or Category.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are common fields in a Product dimension table?

A

The Product dimension table would contain Product ID, Name, Description, Category, Supplier, Price, and potentially Size, Color, or other product-specific attributes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What details might a Store dimension table include for a product order?

A

A Store dimension table might have Store ID, Location, Address, Region, Manager, and Store Type or Size. This table provides context about where the order was placed or fulfilled.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a snowflake schema in data warehousing?

A

A snowflake schema is a type of database schema often used in data warehousing. It’s an extension of the star schema, where dimension tables are normalized into multiple related tables, forming a structure resembling a snowflake.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How does normalization work in a snowflake schema?

A

In a snowflake schema, dimension tables are broken down into more tables to eliminate redundancy and improve data integrity. This leads to a more complex structure with more joins than in a star schema.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the role of a fact table in a snowflake schema?

A

Similar to the star schema, the fact table in a snowflake schema is at the center. It stores transactional data (like sales) and links to dimension tables, which are further normalized.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the benefits of using a snowflake schema?

A

Snowflake schemas offer better data organization and improved data integrity due to normalization. They can also reduce data redundancy and storage costs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How does a snowflake schema differ from a star schema?

A

The snowflake schema is more complex due to the normalization of dimension tables, leading to more joins and potentially slower query performance, compared to the simpler, denormalized structure of the star schema.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is a galaxy schema in data warehousing?

A

A galaxy schema, also known as a fact constellation schema, is a data warehouse design that includes multiple fact tables sharing dimension tables. It’s a complex schema designed for multifaceted data analysis across various business processes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How is a galaxy schema structured?

A

In a galaxy schema, there are several fact tables that are related to shared dimension tables. Each fact table corresponds to a different business process but can share dimensions with other fact tables, resembling a collection of star schemas (hence the name “galaxy”).

18
Q

What is the role of fact tables in a galaxy schema?

A

Each fact table in a galaxy schema represents a specific business process or event, such as sales or inventory, containing the metrics or measurements of that process.

19
Q

How do shared dimension tables work in a galaxy schema?

A

Shared dimension tables in a galaxy schema serve multiple fact tables. For instance, a ‘Time’ dimension could be shared by both ‘Sales’ and ‘Inventory’ fact tables.

20
Q

What are the advantages of using a galaxy schema?

A

The galaxy schema allows for extensive and complex analysis across different business processes, providing a more comprehensive view. It’s flexible in accommodating diverse business requirements and analyses.

21
Q

What is a star schema?

A

The star schema is the simplest data warehouse schema, characterized by a central fact table connected to multiple dimension tables. It’s efficient for querying large data sets and is easy to understand.

22
Q

What is a snowflake schema?

A

The snowflake schema is a variation of the star schema where dimension tables are normalized, breaking them down into additional tables. This reduces redundancy but increases complexity.

23
Q

What is a galaxy schema?

A

The galaxy schema, or fact constellation schema, involves multiple fact tables that share dimension tables. It’s suitable for complex data warehousing scenarios involving multiple business processes.

24
Q

How do star, snowflake, and galaxy schemas compare in terms of complexity and performance?

A

Star schemas are simplest and fastest for queries. Snowflake schemas are more complex due to normalization, potentially slowing down queries. Galaxy schemas are the most complex, suitable for advanced, multifaceted analysis.

25
Q

Which schema should be used based on business requirements?

A

Star schemas are great for straightforward, large-volume data querying. Snowflake schemas suit scenarios needing normalized data structures for maintaining data integrity. Galaxy schemas fit complex analytical needs across various business domains.

26
Q

What is a data mart?

A

A data mart is a subset of a data warehouse focused on a specific business line or team. It contains data relevant to a particular group, like sales, finance, or marketing, making it more manageable and specific than the broader data warehouse.

27
Q

What is the primary purpose of a data mart?

A

The main purpose of a data mart is to provide business users with access to relevant data tailored to their specific needs. It simplifies data analysis by offering a more focused view of data, which is particularly useful for department-specific reporting and analysis.

28
Q

What are the different types of data marts?

A

There are three main types of data marts: independent, dependent, and hybrid. Independent data marts are created without a data warehouse. Dependent data marts are sourced from an existing data warehouse. Hybrid data marts combine both approaches.

29
Q

What are the advantages of using data marts?

A

Data marts offer improved performance for specific queries, ease of use for end-users, lower cost than full-scale data warehouses, and quicker implementation. They are tailored to specific needs, providing more relevant insights.

30
Q

How does a data mart differ from a data warehouse?

A

A data mart is a subset of a data warehouse designed for a specific line of business or purpose. While a data warehouse contains a broad view of the company’s data for all departments, a data mart focuses on specific needs, making it smaller and more focused.

31
Q

What are Slowly Changing Dimensions (SCD)?

A

Slowly Changing Dimensions are dimensions in a data warehouse that undergo changes infrequently but must be managed to maintain historical accuracy of data. These changes could be due to updates in business processes or alterations in data interpretation.

32
Q

What is SCD Type 1?

A

SCD Type 1 involves overwriting the old data with new data, without keeping historical data. This is suitable when it’s not necessary to keep track of historical changes.

33
Q

What is SCD Type 2?

A

SCD Type 2 involves keeping multiple rows to store historical data, with new records added for changes. This method maintains the history of dimensional changes, useful for tracking trends over time.

34
Q

What is SCD Type 3?

A

SCD Type 3 keeps the current and previous value in the same row. This method limits the history to a specific number of changes and is less commonly used.

35
Q

What is Slowly Changing Dimension Type 0 (SCD Type 0) in data warehousing?

A

SCD Type 0 refers to a dimension attribute that never changes once it has been loaded into the data warehouse. It’s a passive method where historical data remains as it was at the time of its initial load, with no updates or changes allowed. This type is used for permanent or historical data that must be preserved in its original state, such as birth dates or original account numbers, which are significant for historical reporting and analysis and should not be altered.

36
Q

What is a surrogate key in the context of database management?

A

A surrogate key is an artificial or synthetic key used in a database table. It’s a unique identifier for each row in the table, typically assigned by the database system itself. Surrogate keys are not derived from application data, unlike natural keys which are derived from meaningful data. They are often used in data warehousing to provide a unique identifier for each record, regardless of any changes to the actual data. Surrogate keys are useful for simplifying relationships between tables and improving query performance.

37
Q

What is the basic concept of SCD Type 2 in data warehousing?

A

SCD Type 2 is used to track and store the entire history of data changes. It involves adding new records to the dimension table with each change, preserving historical data for accurate analysis over time.

38
Q

What are key features of SCD Type 2?

A

It includes features like maintaining multiple historical records for each change, using surrogate keys for uniqueness, and possibly date/time stamps to record when changes occurred.

39
Q

How is SCD Type 2 typically implemented?

A

Implementation involves creating new entries in the dimension table for each change, along with effective and expiration dates for each record, and a flag to indicate the most current record.

40
Q

Why is SCD Type 2 important in business intelligence and reporting?

A

It’s crucial for detailed historical analysis, allowing businesses to track changes over time and understand trends and patterns in historical data, leading to more informed decision-making.

41
Q

What should be considered when using SCD Type 2?

A

Considerations include the impact on storage and query performance, the complexity of managing historical data, and the need for robust data management practices to handle the volume and complexity of the data.