Chapter 3 - Databases and Data Acquisition Flashcards

(80 cards)

1
Q

What does an entity contain? What kind of database does it relate to?

A

An entity contains data about a single subject. Think of it as a noun that describes a person, place or thing. Used in relational databases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What’s an entity instance?

A

It’s like a single record or instances of an entity subject. Like picking out one record in a table about pets.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the name given to describe the relationship between two entities?

A

Cardinality

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Which types of entity relationships are comparatively complex and rare?

A

Unary and Ternary.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the difference between an entity instance and an entity attribute?

A

An entity attribute is akin to the column heading in a table. An entity instance is akin to a single record in a table, for example, the record about one particular customer. The record contains multiple entity attributes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Entities become a separate _____ in a database

A

table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What two things together link data between two tables?

A

An associative table and a foreign keys

the F-keys link to the primary keys in the two related tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

A foreign key helps to enforce what?

A

foreign key enforces referential integriy, or how consistent the data is in related tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What describes the modifications necessary to to create tables for storing data?

A

A schema

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

A table that contains a primary key in more than one column has a primary key called what?

A

A composite primary key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

You need to store and retrieve simple data quickly that doesn’t have a lot of structure, which of the two non-relational databases would you choose?
Key-value or Column-family

A

Key-value

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Name 2 high level key benefits of a column-family database?

A

1) DYNAMIC/FLEXIBLE - Column data can vary dynamically between key values
2) ABILITY TO SCALE - can process vast amounts of data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What kind of database is best at exploring relationships between pieces of data?

A

Graph databases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What database type do the below relate to and what are they?

1) NODES -
2) PROPERTIES -
3) RELATIONSHIPS -

A

GRAPH DATABASES

2) 1) NODES - these are the equivalent of an entity instance; a person or a thing.
2) PROPERTIES - these store attributes about a node
3) RELATIONSHIPS - these are arrows connecting nodes denoting their relationship to each other

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What kind of database is often behind website features like related or RECOMMENDED items?

A

Graph database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What database process is employed to reduce data redundancy?

A

Normalization

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

List the 5 rules of 1NF

A

1) KEY - a table without a primary key is not permitted
2) GROUPS - Repeating groups are not permitted
3) ORDER - Row order is not permitted
4) DATATYPES - Mixing of datatypes in a column is not permitted
5) VALUES - non-atomic values are not permitted

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What level of normalization does this statement describe? “every attribute in a table should depend on the key, the whole key. and nothing but the key”?

A

3NF - third normal form

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

it doesn’t use a schema
it stores data in raw format
it requires specialist knowledge (data scientists) about the data to operate

A

a datalake

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

a subset of a datawarehouse that is data that serves a particular part or subject within the organization as opposed to data that serves the entire organziation

A

A Datamart

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

List 2 key features of the Star Schema design pattern

A

1) Denormalized to improve read performance over large datasets
2) Uses a central fact table with dimension tables as spokes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Regarding OLAP databases, which schema design pattern would you pick if storage was a concern?

A

Snowflake schema

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Regarding design patterns, which design pattern does a datawarehouse usually use and why and which design pattern does a datamart usually use and why?

A

1) Datawarehouses usually use snowflake schema design patterns because they’re supplied from many differences sources.
2) Datamarts usually use the Star schema because they represent a single data subject area.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

When designing a dimension table, it is crucial to understand what to help you build it?

A

The questions that analysts will be asking about the the business or subject?

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
A dimension table provides additional context to a fact table. These could be information about people, or the price of a product. The data may change over time and you'll need to add additional attributes to the table. Regardless of how quickly the data changes, what must you consider about the data when designing dimension tables?
whether you need to store the history/indicate whether the attribute is current or not
26
what are 3 approaches mentioned by the author to handle the time dimension in tables
1) using a start date / end date for the attribute 2) indicator flag (shows what data is current) 3) effective date - date the attribute became effective ## Footnote you must consider the historic requirements of the data
27
Name the main disadvantage of using effective date approach in creation of dimension tables
Itcomplicates queries because you need to calculate date math ## Footnote when you ask the question "what was the price on x date" it needs to calculate between the effective dates
28
What's the key difference between ETL and ELT and what technology do they use to do it?
1) ETL transforms data before it enters the warehouse and often uses Python 2) ELT transforms data after it enters the warehouse and typically uses SQL
29
Describe initial load, delta load and incremental load and batch window
1) initial load is the first load of data into the DWH 2) the delta loads are subsequent loads after the initial load 3) incremental loads are the same as delta loads 4) batch window is the time period in which you have to extract data.
30
ETL and ELT are associated with what data acquisition concept?
Integration
31
Regarding data-collection, Qualtrics is an application you can use in what type of data collection method?
conducting Surveys
32
when acquiring qualitative data using observation, what should you be mindful of?
1) accounting for subconscious action 2) bias - both in the observer and observee | you must develop methods for observation to mitigate the two
33
What lets you actually create a database ready to accept data from a database design?
SQL's Data Definition Language
34
What's the difference between DDL and DML?
1) DDL is about creating database tables and objects 2) DML is about querying, altering and retrieving data from the database table
35
What are the four CRUD data manipulation actions and their association SQL Keywords?
1) Create - INSERT 2) Read - SELECT 3) Update - UPDATE 4) Delete DELETE
36
Which logical function could you use to convert a given value into a different name of your choice to help divide data into categories? ## Footnote For example, you want to change the M and F letters in the Sex column to Boys and Girls.
The IFF function | IFF(boolean_expression, true_value, false_value)
37
List the 6 aggregate functions
1) Max 2) Min 3) Count 4) Sum 5) AVG 6) STDD
38
Preventing injection attacks and reducing several queries into a single query are benefits of what?
Parameterization
39
Name the key benefit of database indexes and one drawback
1) BENEFIT - they improve query speed 2) DRAWBACK - it slows down create, update and delete activity
40
You want to explore historic sales data. The sales history table is huge. What's the issue with this and what solution could you emply to mitigate it?
Querying a very large table is inefficient. It's more efficient to create a temporary table using a SUBSET of data from the main table
41
What would you look at to troubleshoot query performance?
The execution plan
42
What should you look at to ensure you're developing an efficient query?
the execution plan
43
Snowflake and Star schemas are also known as what label?
Multidimensional Schemas
44
Generally speaking, Fact tables contain ______ data and dimension tables contain _____________ data?
1) fact tables = quantitative data 2) dimension tables = qualitative data
45
Regarding data acquisition. ETL or ELT is associated with sourcing data from where?
a transactional system
46
What are the key differences between ETL and ELT when it comes to how they transform the data?
1) ETL uses external technology to transform the data 2) ELT uses the database itself to transform the data
47
What's the name given to the concept of moving and combining data from different sources into a a data warehouse or datalake?
Data Integration
48
In data integration, API's return transaction data as what? And return bulk data as what?
1) JSON files 2) CSVs
49
What were the 4 approaches mentioned in the book that mitigate performance issues for query optimization
1) Parameterization 2) Using data subsets and temp tables 3) Review execution plan with DB admin 4) Applying a database index
50
Why is adding a time dimension to star of snowflake tables a good idea?
It allows handling of current and historical data
51
which data collection method is good for validating business processes?
Observation
52
if you're doing just ad-hoc analysis, what would you utilize?
data subsets and use temporary tables
53
Star and Snowflake Schemas are designed for what kind of database?
OLAP databases
54
Why is the the denormalized nature of OLAP databases result in quicker read performance?
Queries are simpler and data reads faster due to fewer table joins required due to the hub and spoke nature of the schema.
55
what determines the frequency of ETL data loads?
Business requirements
56
What defines the batch window for loading data into the datawarehouse?
Between the last transaction of the day and the first transaction of the following day
57
__________ translates the human-readable SQL into code the database understands
parsing
58
____________ reduces the number of times the database has to parse individual queries
parameterization
59
Large tables without indexes are inefficient. Applying a database index helps speed queries up. Ideally you want to include all columns in the index, but if you can't, what should should you ensure when writing your select statement?
That the column you're selecting is covered by an index
60
while indexing improves read speed of a query, it slows down what 3 activites?
create, update and delete ## Footnote the indexing strategy needs to match the type of database, i.e. transactional or reporting
61
You find that a table where a non-key attribute is dependent on another non-key attribute. What form does this violate?
3NF
62
When the primary key of a table is indirectly related to a non-key attribute through a key-attribute, what violation is this called?
transitive violation | because the key is related to a non-key attribute indirectly. ## Footnote this violates 3NF
63
Match the type of normalization to the statements below: 1) All non-key attributes must depend on the key 2) All non-key attributes must depend on the whole key and nothing else!
1) All non-key attributes must depend on the key - 2NF 2) All non-key attributes must depend on the whole key and nothing else! - 3NF
64
You may have multiple applications in your business that need to request the same data, each using the same SQL calls to the database. What issue arises from this and what method can make this more efficient?
**Issue** - SQL code has to be maintained in multiple places. **Solution** - implement a single API call for the specific data needed from the database. All applications can call the API instead of maintaining their own SQL.
65
Frosty wants to use an web service enabled API. What must they obtain/purchase before they can use it?
an API key
66
Webscrapers use Python / R or webscrapers use API's?
Webscrapers use Python / R
67
re: data collection/acquisition - the act of collecting primary source data from people or machines
is known as Observation
68
Why is the snowflake schema more efficient with data storage than star schema?
because it is more normalized, reducing data duplication
69
Fred is looking at where data comes from, how frequently it changes, and how long it needs to persist. What is he doing?
Considering the life-cycle of data ## Footnote for managing data in a datawarehouse or datamart.
70
Which method of handling time dimensionality keeps all pricing data in a single place (as opposed to splitting it out into a different table) and avoids using date math to get the current data?
Indicator flag method
71
Which is the best choice if you need to get massive amounts of data from a transactional system into an OLAP system? ETL or ELT?
ELT
72
you extract data from the source system and place it into a staging area as flat file. What Integration approach is this?
ETL
73
what should take into account when designing the delta load strategy?
the batch window
74
slowly changing dimension - keeps current and historic adds column to show previous value
Type 3
75
Slowly changing dimension - keep current state
Type 1
76
slowly changing dimension - keeps current and hisotrical changes tracks change on new row
Type 2
77
Which symbol denotes the end of an SQL statement?
Semi-colon ";"
78
which key enforced referencial integrity?
Foreign key
79
s a set of protocols within a computer system that allows two unrelated systems to communicate. | cm
API
80
is an API that allows a hosted computer on a network to share data with a computer in the same hosted environment. | cm
web service API ## Footnote DO NOT AGREE!