Data Analysis Fundamentals Flashcards

(25 cards)

1
Q

What is a data warehouse?

A

A data management system that centralises and consolidates large amounts of data from various sources.

2
Q

What are the benefits of using a data warehouse?

A
  • Centralised storage - data from multiple sources in one location.
  • Structured data - makes queries and analysis easier to perform.
  • Optimisation for analysis - designed to run complex analytical queries efficiently.
3
Q

What is a data lake?

A

A centralised repository designed to store large amounts of data in raw form - structured, semi-structured or unstructured.

4
Q

What is AWS Redshift?

A

A cloud data warehousing service from AWS, designed for querying and analysing large volumes of data.

5
Q

True or false: AWS Redshift is optimised for OLTP.

A

False - it is optimised for analytical (OLAP-style) workloads rather than transactional ones.

6
Q

What is the CRISP-DM framework?

A

Cross-Industry Standard Process for Data Mining: a common methodology applied to analytical data projects.

7
Q

True or false: when following the CRISP-DM framework, you will not need to return to a previous phase once you have moved on from it.

A

False - you may move between the initial phases (business understanding, data understanding and data preparation) multiple times to refine the approach before advancing to modelling, evaluation, etc.

8
Q

Where may you store database credentials to keep them secure?

A

In a .env file which is not committed to your version control system (e.g. GitHub), or using a dedicated secrets manager such as GitHub Secrets or AWS Secrets Manager for more sensitive credentials.

These methods are considered much more secure than hardcoding credentials in scripts.
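A minimal sketch of the .env approach, assuming the python-dotenv package and a hypothetical .env file containing DB_USER and DB_PASSWORD entries:

    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    # Read key=value pairs from the .env file (excluded from version control
    # via .gitignore) into environment variables.
    load_dotenv()

    db_user = os.getenv("DB_USER")          # hypothetical variable names
    db_password = os.getenv("DB_PASSWORD")

    # Credentials are used here without ever being hardcoded in the script.
    print(f"Connecting as {db_user}...")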

9
Q

What is SQLite?

A

A lightweight, serverless relational database management system that allows data to be stored in a single file on disk.

An alternative to traditional client-server RDBMSs like PostgreSQL.
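A minimal sketch using Python's built-in sqlite3 module (the file and table names are illustrative):

    import sqlite3

    # The whole database lives in a single file on disk - no server process needed.
    conn = sqlite3.connect("example.db")
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)")
    cur.execute("INSERT INTO sales (amount) VALUES (?)", (19.99,))
    conn.commit()

    for row in cur.execute("SELECT id, amount FROM sales"):
        print(row)

    conn.close()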

10
Q

What is OLTP?

A

A data processing system used for real-time processing of online database transactions e.g. credit card payment processing.

Online Transactional Processing

11
Q

What is OLAP?

A

A data processing system optimised for performing analysis at high speed on large amounts of data, e.g. in the work of data scientists and business analysts.

Online Analytical Processing

12
Q

What are Jupyter notebooks used for?

A

Documenting and demonstrating code alongside its results.
They are also useful for experimenting with code, data exploration, analysis and visualisation.

13
Q

True or False: code cells in a Jupyter notebook execute independently but do not share memory.

A

False - code cells execute independently and share memory.

Application: e.g. a variable assigned in one cell can be used again in a subsequent cell.
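A minimal illustration, assuming two separate notebook cells run one after the other:

    # Cell 1: assign a variable
    row_count = 1000

    # Cell 2: run separately later - the variable is still held in the kernel's memory
    print(row_count * 2)   # 2000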

14
Q

What are the benefits of independent code cell execution in Jupyter notebooks?

A

Greater control over code execution: cells can be run and re-run individually, which makes experimentation easier and makes the results of specific code blocks easier to interpret.

15
Q

What is a tooltip?

A

A UI element which displays contextual information when the user hovers over an element, e.g. the value of a specific bar in a bar chart.

16
Q

What are the benefits of using Pandas?

A
  • Integration with other Python libraries like NumPy
  • Comprehensive documentation
  • Adaptable to data in different structures or formats, e.g. CSV files or SQL databases (see the sketch below).

https://www.altexsoft.com/blog/pandas-library/
https://www.nvidia.com/en-gb/glossary/pandas-python/
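A small sketch of that adaptability, assuming a hypothetical sales.csv file with an amount column:

    import pandas as pd

    # Load a CSV file straight into a DataFrame.
    df = pd.read_csv("sales.csv")

    # Pandas is built on NumPy, so columns expose their underlying arrays directly.
    amounts = df["amount"].to_numpy()
    print(amounts.mean())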

17
Q

When should you consider using SQL over Pandas? (Multiple answers)
* A) When working with very large datasets stored in a database.
* B) When performing complex data transformations within a Python environment.
* C) When needing to perform high-speed queries on data stored across multiple tables.
* D) When working primarily with in-memory data for quick analysis.
* E) When requiring integration with Python libraries for data visualisation.

A

A and C - SQL is better suited to querying very large datasets stored in a database and to fast queries/joins across multiple tables; B, D and E describe cases where Pandas is the better fit.
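A minimal sketch of pushing the heavy lifting (joining and filtering across tables) to the database and pulling only the reduced result into Pandas; the database, table and column names are illustrative:

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("example.db")

    # The join and filter run inside the database engine, not in Python memory.
    query = """
        SELECT o.order_id, c.name, o.amount
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE o.amount > 100
    """
    df = pd.read_sql(query, conn)
    print(df.head())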

18
Q

How would you join dataframes with Pandas?

A

Using pd.merge() (or the equivalent DataFrame.merge() method) to join on one or more shared key columns, DataFrame.join() to join on the index, or pd.concat() to stack dataframes along an axis.
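A minimal sketch using pd.merge() on a shared key column (the data is illustrative):

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "name": ["Ana", "Ben", "Cal"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 3],
                           "amount": [25.0, 40.0, 15.0]})

    # Inner join on the shared key; how= can also be "left", "right" or "outer".
    joined = customers.merge(orders, on="customer_id", how="inner")
    print(joined)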

19
Q

What is the primary purpose of measures of dispersion in data analysis?

A

To quantify the amount of variation or spread in a set of data values.

20
Q

Fill in the blank: The _______ is calculated as the difference between the maximum and minimum values in a dataset.

A

Range.

21
Q

What is the standard deviation a measure of?

A

The variation of the values of a variable about its mean.

22
Q

How may you use the standard deviation of a variable to identify potential outliers?

A

Data points more than 3 standard deviations away from the mean are commonly treated as potential outliers.
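A minimal sketch of this rule using Pandas (the data is illustrative):

    import pandas as pd

    # Illustrative values; in practice this would be a column of your dataset.
    s = pd.Series([10, 12, 11, 13, 12, 95, 11, 12, 13, 10, 12, 11])

    # Standardise: how many standard deviations each value lies from the mean.
    z_scores = (s - s.mean()) / s.std()

    # Values more than 3 standard deviations from the mean are flagged as potential outliers.
    potential_outliers = s[z_scores.abs() > 3]
    print(potential_outliers)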

23
Q

What are row-oriented databases vs. column-based databases?

A

Row-oriented databases store data in rows, which in turn consist of columns.
In contrast, a column-based database stores the values of each column together - rather than the values of each row together.

https://www.scylladb.com/glossary/columnar-database/ - contains a helpful visualisation for this difference.

24
Q

What is a quantile?

A

Cut points that divide a distribution of values into intervals, each containing the same fraction of the total population, e.g. quartiles or percentiles.
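A small illustration using NumPy (the data is illustrative):

    import numpy as np

    data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])

    # Quartiles are the cut points that split the distribution into four equal-sized groups.
    q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])
    print(q1, q2, q3)   # q2 is the median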

25
Q

What may you do during the data understanding phase of CRISP-DM to explore the data?

A
  • Look at data fields - identify if variables are continuous or categorical.
  • Look at the number of entries.
  • Look for relationships/links between variables/datasets.
  • Look for data quality issues - e.g. null values.
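A short sketch of some of these checks in Pandas, assuming the dataset has been loaded from a hypothetical data.csv file:

    import pandas as pd

    df = pd.read_csv("data.csv")

    print(df.shape)                      # number of entries (rows) and fields (columns)
    print(df.dtypes)                     # continuous (numeric) vs categorical (object) fields
    print(df.isnull().sum())             # data quality: null values per column
    print(df.describe())                 # summary statistics for numeric columns
    print(df.corr(numeric_only=True))    # relationships between numeric variables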