Database Design / Case Study Flashcards

Question 1

Q

What database relationship do you know?

Answer

A

1 to 1: so one entity associating with another. An example would be an employee with a particular work location

1-to-many: when one entity is associated with many others. an example would be if we continue a work location and all the employees that work there

Many to 1: when many entities are associated with one. An exmaple all of the students working on one project

Many to Many: when many entities are associated with many others. customers and products are a good one. Many products can be associated with many customers

Question 2

Q

Difference between snowflake and star schema and when you might use each?

Answer

A

A star schema is a data model where dimensional and fact tables are connected in a star like shape, for instance:

lets say an e-commerce site. it could have many different dimension tables. One for product. One for customer. One for store and one for date. these would link to a sales fact table. That would have the id’s of all the dimension tables and the purchase amount. The shape of this would resemble a star hence the name.

A snowflake schema splits dimensional tables into further dimension tables. So you will have more tables usually with an id and then name. each dimension is split until there is no repetition of values exceot for identifiers like id’s. So snowflake schemas are normalized whereas star chemas are denormaizled

Advantages:

Star schema has simple query writing.A snowflake schema might need multiple joins so there is a higher processing overhead.

However, star schemas take less space because they have data redundancy

Snowflake has better data integrity because it is normalized where as star schemas on update or delete can only update or delete one of the redundant rows

Generally, star schemas are easier to setup and query. Snowflake is easier to maintain.

Question 3

Q

What is normalization? Denormalization?

Answer

A

Data normalization is organizing and formatting data to appear similar across all records and fields. Data normalization helps provide analysts with more efficient and precise navigation, removing duplicate data and maintaining referential integrity.

On the other hand, Denormalization is a database technique in which redundant data is added to one or more tables. This technique can optimize performance by reducing the need for costly joins.

Question 4

Q

How would you handle data loss during a migration?

Answer

A

1) Define the specific data required for migration
2) Avoid migrating data that is no longer needed
3) run business analysis to determine data quality rules
4) understand the different rules of the source system you are moving to and test for that
5) perform real time verifications during the migration
6) Define a clear flow for data, error reporting, and rerun procedures
7) define the correct order of migration depending on the source.
8)Batch migration so maybe you split into 3 parts and test each time

Question 5

Q

What are the three main types of data model techniques?

Answer

A

relational, dimensional, and entity-relationship

relational data model uses tables with columns and rows

entity relationship - a model with rel world entities and the relaitonships between them. An employee in an employee database

Dimensional modeling is the tehcnique of data model to use fact and dimension tables. This is to optomize for the SELECT operation on SQL.

Question 6

Q

What is normalization? Denormalization?

Answer

A

Normalization is removing duplicate data and making data appear similair across all records and fields. This makes it easier to update data correctly, maintains a higher data integrity.

Denormalizaiton is a database technique in which redundant data is added to one or more tables to make querying faster because you ned to do less joins.

Question 7

Q

What are advantages and disadvantages of a NoSQL database?

Answer

A

NoSQL has superior scalability and superior performation and can handle all types of data. Unstructured, semi-structured and structured. It is easier to spin up as you dont need to define schema.

If you have a huge huge amount of data like an Uber then you would probably be using this somewhere. Or if you are rapidly changing schema’s. NoSQL now though is a bit of a misnomer as like MongoDB does have querying and relational references.

relational databases are more consistent during writes because there is a rigid schema behind them. meaning your data is more consistent.

Generally I feel that relational databases get cheaper and better for performance. If you dont have to use NoSQL then dont it will make your life and your data organization easier. If you need to cut costs and have the scale then you can.

The biggest question is can you let go of dara consistency for faster writes? like yfor you guys i would think no way because its financial informaiton.

Question 8

Q

structured vs unsctructured vs semistructred data?

Question 9

Q

ACID

Answer

A

Acid is an acronym for a database system that optimizes around atomicity, consistency, isolation and durability.

Automicity - the entire process ush finsih or else the file, document, transaction reverts back to its prior state

Consistency - any transaction you complete in the database follows rules you or others gave

Isolation - Transactions happen seperately from one another. This is tricky because it slows your sysetems down but adds security and data protection.

Durability - the database stores all the information on permanent storage.

Question 10

Q

What is a lake-house?

Answer

A

a new data management architecture that combines the scale of data lakes with the ACID stable transactions of a data warehouses. Databricks offers a data lakehouse.

Its i also think a marketing strategy for alot of companies.

Question 11

Q

With what database design patterns do you have the most experience

Answer

A

Lazy loading for sure. That us ow we have made our app alot faster. CAlls to the DB dont happen until the user initiates them.

Question 12

Q

How would you design a relational database of customer data?

Question 13

Q

You are analyzing auto insurance data and find that the marriage attribute column is marked TRUE for all customers.

Answer

A

How far back does this bug go? Is this a star or snowflake schema? If I add a new row is this populating?

Question 14

Q

What database optimizations might you consider for a Tinder-style app?

Answer

A

O okay. So first off querying the whole database for user_id of the swiped or somehtign woudl be awful. So fist partition by zipcode since you are only going to be swiping by those in your zipcode.

Secondly make one transactional table that just dealt with all the information the UI would need. That way we can make that info super fast and likely we dont need that much info.

All the BI would go to a different database for machine learning.

I see the biggest challenge being swiping with a Yes/No value and how best to store that data. I think that would be a good example of where you coudl use NoSQL because its a huge amount of very simple not complicated data. You might have to use NoSQL for that if every swipe is a new row.

Question 15

Q

How would you design a relational database of customer data?

Answer

A

1) gather staheholder input and determine the purpose of the database. What kind of analysis will be performed? how oftern will that analysis be performed ?

2) I would then write out the schema I needed. Grab test data and start writing queries and

3) after that I would clean the data. Add columns and compress the size of necessary data.

4) then have a stakeholder test

5) then ship

Question 16

Q

How do you go about debugging an ETL tool?

Answer

A

Okay so first the tool you are using should have a log. Airflow not only has that but it has alot of built in debugging code that you can use to check variables at certain points. Try and pinpoint where in your ETL process this is happening between. Once you have pinpointed that you need to find the query or function its happening on.

Question 17

Q

How would you design a system for DoorDash to minimize missing or wrong orders placed on the app?

Answer

A

When you say missing orders do you mean that customers are placing an order that is then not showing up?

Or are they entering the wrong information?

Question 18

Q

How would you design the YouTube video recommendation system? What are important factors to keep in mind when building recommendation algorithms?

Answer

A

The purpose of a recommendation algorithm is to recommend videos that a user might like. One way to approach this would be to suggest metrics that indicate how well a user likes a video. Let’s say we set a metric to gauge user interest in a video: whether users watch a whole video or stop before the video completes.

Once we have a functioning metric for whether users like or dislike videos, we can associate users with similar interests and attributes to generate a basic framework for a recommendation. Our approach relies on the assumption that if person A likes a lot of the things that person B likes or is similar in other respects (such as age, sex, etc.), there’s an above-average chance that person B will enjoy a video that person A likes.

What other factors might we want to take into account for our algorithm?

Question 19

Q

What is the difference between a view and a table?

Answer

A

Definition:

Table: A table stores data in a database

View: A view is a virtual table who’s rows and columns are made by a saved query. Really a view is a saved SELECT statement.

The advantage of views are that they can show data from different tables and they can be used as a security measure since the user does not have access to the underlying database.

Example:

If i’m running a query on employees and salaries to see the top 3 salaried employees.

I could do SELECT * from view instead of sending a complex query to the database each time

Brainscape's Knowledge GenomeTM

Database Design / Case Study Flashcards

Brainscape's Knowledge Genome^TM