Chapter 3 - Databases and Data Acquisition Flashcards

1
Q

What does an entity contain? What kind of database does it relate to?

A

An entity contains data about a single subject. Think of it as a noun that describes a person, place or thing. Used in relational databases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What’s an entity instance?

A

It’s like a single record or instances of an entity subject. Like picking out one record in a table about pets.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the name given to describe the relationship between two entities?

A

Cardinality

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Which types of entity relationships are comparatively complex and rare?

A

Unary and Ternary.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the difference between an entity instance and an entity attribute?

A

An entity attribute is akin to the column heading in a table. An entity instance is akin to a single record in a table, for example, the record about one particular customer. The record contains multiple entity attributes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Entities become a separate _____ in a database

A

table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What two things together link data between two tables?

A

An associative table and a foreign keys

the F-keys link to the primary keys in the two related tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

A foreign key helps to enforce what?

A

foreign key enforces referential integriy, or how consistent the data is in related tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What describes the modifications necessary to to create tables for storing data?

A

A schema

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

A table that contains a primary key in more than one column has a primary key called what?

A

A composite primary key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

You need to store and retrieve simple data quickly that doesn’t have a lot of structure, which of the two non-relational databases would you choose?
Key-value or Column-family

A

Key-value

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Name 2 high level key benefits of a column-family database?

A

1) DYNAMIC/FLEXIBLE - Column data can vary dynamically between key values
2) ABILITY TO SCALE - can process vast amounts of data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What kind of database is best at exploring relationships between pieces of data?

A

Graph databases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What database type do the below relate to and what are they?

1) NODES -
2) PROPERTIES -
3) RELATIONSHIPS -

A

GRAPH DATABASES

2) 1) NODES - these are the equivalent of an entity instance; a person or a thing.
2) PROPERTIES - these store attributes about a node
3) RELATIONSHIPS - these are arrows connecting nodes denoting their relationship to each other

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What kind of database is often behind website features like related or RECOMMENDED items?

A

Graph database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What database process is employed to reduce data redundancy?

A

Normalization

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

List the 5 rules of 1NF

A

1) KEY - a table without a primary key is not permitted
2) GROUPS - Repeating groups are not permitted
3) ORDER - Row order is not permitted
4) DATATYPES - Mixing of datatypes in a column is not permitted
5) VALUES - non-atomic values are not permitted

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What level of normalization does this statement describe? “every attribute in a table should depend on the key, the whole key. and nothing but the key”?

A

3NF - third normal form

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

it doesn’t use a schema
it stores data in raw format
it requires specialist knowledge (data scientists) about the data to operate

A

a datalake

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

a subset of a datawarehouse that is data that serves a particular part or subject within the organization as opposed to data that serves the entire organziation

A

A Datamart

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

List 2 key features of the Star Schema design pattern

A

1) Denormalized to improve read performance over large datasets
2) Uses a central fact table with dimension tables as spokes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Regarding OLAP databases, which schema design pattern would you pick if storage was a concern?

A

Snowflake schema

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Regarding design patterns, which design pattern does a datawarehouse usually use and why and which design pattern does a datamart usually use and why?

A

1) Datawarehouses usually use snowflake schema design patterns because they’re supplied from many differences sources.
2) Datamarts usually use the Star schema because they represent a single data subject area.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

When designing a dimension table, it is crucial to understand what to help you build it?

A

The questions that analysts will be asking about the the business or subject?

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

A dimension table provides additional context to a fact table. These could be information about people, or the price of a product. The data may change over time and you’ll need to add additional attributes to the table. Regardless of how quickly the data changes, what must you consider about the data when designing dimension tables?

A

whether you need to store the history/indicate whether the attribute is current or not

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

what are 3 approaches mentioned by the author to handle the time dimension in tables

A

1) using a start date / end date for the attribute
2) indicator flag (shows what data is current)
3) effective date - date the attribute became effective

you must consider the historic requirements of the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

Name the main disadvantage of using effective date approach in creation of dimension tables

A

Itcomplicates queries because you need to calculate date math

when you ask the question “what was the price on x date” it needs to calculate between the effective dates

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What’s the key difference between ETL and ELT and what technology do they use to do it?

A

1) ETL transforms data before it enters the warehouse and often uses Python
2) ELT transforms data after it enters the warehouse and typically uses SQL

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Describe initial load, delta load and incremental load and batch window

A

1) initial load is the first load of data into the DWH
2) the delta loads are subsequent loads after the initial load
3) incremental loads are the same as delta loads
4) batch window is the time period in which you have to extract data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

ETL and ELT are associated with what data acquisition concept?

A

Integration

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Regarding data-collection, Qualtrics is an application you can use in what type of data collection method?

A

conducting Surveys

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

when acquiring qualitative data using observation, what should you be mindful of?

A

1) accounting for subconscious action
2) bias - both in the observer and observee

you must develop methods for observation to mitigate the two

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

What lets you actually create a database ready to accept data from a database design?

A

SQL’s Data Definition Language

34
Q

What’s the difference between DDL and DML?

A

1) DDL is about creating database tables and objects
2) DML is about querying, altering and retrieving data from the database table

35
Q

What are the four CRUD data manipulation actions and their association SQL Keywords?

A

1) Create - INSERT
2) Read - SELECT
3) Update - UPDATE
4) Delete DELETE

36
Q

Which logical function could you use to convert a given value into a different name of your choice to help divide data into categories?

For example, you want to change the M and F letters in the Sex column to Boys and Girls.

A

The IFF function

IFF(boolean_expression, true_value, false_value)

37
Q

List the 6 aggregate functions

A

1) Max
2) Min
3) Count
4) Sum
5) AVG
6) STDD

38
Q

Preventing injection attacks and reducing several queries into a single query are benefits of what?

A

Parameterization

39
Q

Name the key benefit of database indexes and one drawback

A

1) BENEFIT - they improve query speed
2) DRAWBACK - it slows down create, update and delete activity

40
Q

You want to explore historic sales data. The sales history table is huge. What’s the issue with this and what solution could you emply to mitigate it?

A

Querying a very large table is inefficient.
It’s more efficient to create a temporary table using a SUBSET of data from the main table

41
Q

What would you look at to troubleshoot query performance?

A

The execution plan

42
Q

What should you look at to ensure you’re developing an efficient query?

A

the execution plan

43
Q

Snowflake and Star schemas are also known as what label?

A

Multidimensional Schemas

44
Q

Generally speaking, Fact tables contain ______ data and dimension tables contain _____________ data?

A

1) fact tables = quantitative data
2) dimension tables = qualitative data

45
Q

Regarding data acquisition. ETL or ELT is associated with sourcing data from where?

A

a transactional system

46
Q

What are the key differences between ETL and ELT when it comes to how they transform the data?

A

1) ETL uses external technology to transform the data
2) ELT uses the database itself to transform the data

47
Q

What’s the name given to the concept of moving and combining data from different sources into a a data warehouse or datalake?

A

Data Integration

48
Q

In data integration, API’s return transaction data as what? And return bulk data as what?

A

1) JSON files
2) CSVs

49
Q

What were the 4 approaches mentioned in the book that mitigate performance issues for query optimization

A

1) Parameterization
2) Using data subsets and temp tables
3) Review execution plan with DB admin
4) Applying a database index

50
Q

Why is adding a time dimension to star of snowflake tables a good idea?

A

It allows handling of current and historical data

51
Q

which data collection method is good for validating business processes?

A

Observation

52
Q

if you’re doing just ad-hoc analysis, what would you utilize?

A

data subsets and use temporary tables

53
Q

Star and Snowflake Schemas are designed for what kind of database?

A

OLAP databases

54
Q

Why is the the denormalized nature of OLAP databases result in quicker read performance?

A

Queries are simpler and data reads faster due to fewer table joins required due to the hub and spoke nature of the schema.

55
Q

what determines the frequency of ETL data loads?

A

Business requirements

56
Q

What defines the batch window for loading data into the datawarehouse?

A

Between the last transaction of the day and the first transaction of the following day

57
Q

__________ translates the human-readable SQL into code the database understands

A

parsing

58
Q

____________ reduces the number of times the database has to parse individual queries

A

parameterization

59
Q

Large tables without indexes are inefficient. Applying a database index helps speed queries up. Ideally you want to include all columns in the index, but if you can’t, what should should you ensure when writing your select statement?

A

That the column you’re selecting is covered by an index

60
Q

while indexing improves read speed of a query, it slows down what 3 activites?

A

create, update and delete

the indexing strategy needs to match the type of database, i.e. transactional or reporting

61
Q

You find that a table where a non-key attribute is dependent on another non-key attribute. What form does this violate?

A

3NF

62
Q

When the primary key of a table is indirectly related to a non-key attribute through a key-attribute, what violation is this called?

A

transitive violation

because the key is related to a non-key attribute indirectly.

this violates 3NF

63
Q

Match the type of normalization to the statements below:
1) All non-key attributes must depend on the key
2) All non-key attributes must depend on the whole key and nothing else!

A

1) All non-key attributes must depend on the key - 2NF
2) All non-key attributes must depend on the whole key and nothing else! - 3NF

64
Q

You may have multiple applications in your business that need to request the same data, each using the same SQL calls to the database. What issue arises from this and what method can make this more efficient?

A

Issue - SQL code has to be maintained in multiple places.
Solution - implement a single API call for the specific data needed from the database. All applications can call the API instead of maintaining their own SQL.

65
Q

Frosty wants to use an web service enabled API. What must they obtain/purchase before they can use it?

A

an API key

66
Q

Webscrapers use Python / R or webscrapers use API’s?

A

Webscrapers use Python / R

67
Q

re: data collection/acquisition - the act of collecting primary source data from people or machines

A

is known as Observation

68
Q

Why is the snowflake schema more efficient with data storage than star schema?

A

because it is more normalized, reducing data duplication

69
Q

Fred is looking at where data comes from, how frequently it changes, and how long it needs to persist. What is he doing?

A

Considering the life-cycle of data

for managing data in a datawarehouse or datamart.

70
Q

Which method of handling time dimensionality keeps all pricing data in a single place (as opposed to splitting it out into a different table) and avoids using date math to get the current data?

A

Indicator flag method

71
Q

Which is the best choice if you need to get massive amounts of data from a transactional system into an OLAP system? ETL or ELT?

A

ELT

72
Q

you extract data from the source system and place it into a staging area as flat file. What Integration approach is this?

A

ETL

73
Q

what should take into account when designing the delta load strategy?

A

the batch window

74
Q

slowly changing dimension - keeps current and historic
adds column to show previous value

A

Type 3

75
Q

Slowly changing dimension - keep current state

A

Type 1

76
Q

slowly changing dimension - keeps current and hisotrical changes
tracks change on new row

A

Type 2

77
Q

Which symbol denotes the end of an SQL statement?

A

Semi-colon “;”

78
Q

which key enforced referencial integrity?

A

Foreign key

79
Q

s a set of protocols within a computer system that allows two unrelated systems to communicate.

cm

A

API

80
Q

is an API that allows a hosted computer on a network to share data with a computer in the same hosted environment.

cm

A

web service API

DO NOT AGREE!