Data Analysis Fundamentals Flashcards

(25 cards)

1
Q

What is a data warehouse?

A

A data management system that centralises and consolidates large amounts of data from various sources.

2
Q

What are the benefits of using a data warehouse?

A
  • Centralised storage - data from multiple sources in one location.
  • Structured data - makes queries and analysis easier to perform.
  • Optimisation for analysis - designed to run complex analytical queries efficiently.
3
Q

What is a data lake?

A

A centralised repository designed to store large amounts of data in raw form - structured, semi-structured or unstructured.

4
Q

What is AWS Redshift?

A

A cloud data warehousing service from AWS, designed for querying and analysing large volumes of data.

5
Q

True or false: AWS Redshift is optimised for OLTP.

A

False - it is optimised for analytical (OLAP-style) workloads rather than transactional ones.

6
Q

What is the CRISP-DM framework?

A

Cross-Industry Standard Process for Data Mining: a common methodology applied to analytical data projects.

7
Q

True or false: when following the CRISP-DM framework, you will not need to return to a previous phase once you have moved on from it.

A

False - you may move between the initial phases (business understanding, data understanding and data preparation) multiple times to refine the approach before advancing to modelling, evaluation, etc.

8
Q

Where may you store database credentials to keep them secure?

A

In a .env file which is not committed to your version control system (e.g. GitHub), or using a dedicated secrets manager such as GitHub Secrets or AWS Secrets Manager for more sensitive credentials.

These methods are considered much more secure than hardcoding credentials in scripts.
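A minimal sketch of the .env approach, assuming the python-dotenv package and a hypothetical .env file containing DB_USER and DB_PASSWORD entries:

    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    # Read key=value pairs from the .env file (excluded from version control
    # via .gitignore) into environment variables.
    load_dotenv()

    db_user = os.getenv("DB_USER")          # hypothetical variable names
    db_password = os.getenv("DB_PASSWORD")

    # Credentials are used here without ever being hardcoded in the script.
    print(f"Connecting as {db_user}...")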

9
Q

What is SQLite?

A

A lightweight, serverless relational database management system that allows data to be stored in a single file on disk.

An alternative to traditional client-server RDBMSs like PostgreSQL.
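A minimal sketch using Python's built-in sqlite3 module (the file and table names are illustrative):

    import sqlite3

    # The whole database lives in a single file on disk - no server process needed.
    conn = sqlite3.connect("example.db")
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)")
    cur.execute("INSERT INTO sales (amount) VALUES (?)", (19.99,))
    conn.commit()

    for row in cur.execute("SELECT id, amount FROM sales"):
        print(row)

    conn.close()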

10
Q

What is OLTP?

A

A data processing system used for real-time processing of online database transactions e.g. credit card payment processing.

Online Transactional Processing

11
Q

What is OLAP?

A

A data processing system optimised for performing analysis at high speed on large amounts of data, e.g. in the work of data scientists and business analysts.

Online Analytical Processing

12
Q

What are Jupyter notebooks used for?

A

Documenting and demonstrating code alongside its results.
They are also useful for experimenting with code, data exploration, analysis and visualisation.

13
Q

True or False: code cells in a Jupyter notebook execute independently but do not share memory.

A

False - code cells execute independently and share memory.

Application: e.g. a variable assigned in one cell can be used again in a subsequent cell.
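A minimal illustration, assuming two separate notebook cells run one after the other:

    # Cell 1: assign a variable
    row_count = 1000

    # Cell 2: run separately later - the variable is still held in the kernel's memory
    print(row_count * 2)   # 2000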

14
Q

What are the benefits of independent code cell execution in Jupyter notebooks?

A

Greater control over code execution: cells can be run and re-run individually, which makes experimentation easier and makes the results of specific code blocks easier to interpret.

15
Q

What is a tooltip?

A

A UI element which displays contextual information when the user hovers over an element, e.g. the value of a specific bar in a bar chart.

16
Q

What are the benefits of using Pandas?

A
  • Integration with other Python libraries like NumPy
  • Comprehensive documentation
  • Adaptable to data in different structures or formats, e.g. CSV files or SQL databases (see the sketch below).

https://www.altexsoft.com/blog/pandas-library/
https://www.nvidia.com/en-gb/glossary/pandas-python/
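A small sketch of that adaptability, assuming a hypothetical sales.csv file with an amount column:

    import pandas as pd

    # Load a CSV file straight into a DataFrame.
    df = pd.read_csv("sales.csv")

    # Pandas is built on NumPy, so columns expose their underlying arrays directly.
    amounts = df["amount"].to_numpy()
    print(amounts.mean())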

17
Q

When should you consider using SQL over Pandas? (Multiple answers)
* A) When working with very large datasets stored in a database.
* B) When performing complex data transformations within a Python environment.
* C) When needing to perform high-speed queries on data stored across multiple tables.
* D) When working primarily with in-memory data for quick analysis.
* E) When requiring integration with Python libraries for data visualisation.

A

A and C - SQL is better suited to querying very large datasets stored in a database and to fast queries/joins across multiple tables; B, D and E describe cases where Pandas is the better fit.
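A minimal sketch of pushing the heavy lifting (joining and filtering across tables) to the database and pulling only the reduced result into Pandas; the database, table and column names are illustrative:

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("example.db")

    # The join and filter run inside the database engine, not in Python memory.
    query = """
        SELECT o.order_id, c.name, o.amount
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE o.amount > 100
    """
    df = pd.read_sql(query, conn)
    print(df.head())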

18
Q

How would you join dataframes with Pandas?

A

Using pd.merge() (or the equivalent DataFrame.merge() method) to join on one or more shared key columns, DataFrame.join() to join on the index, or pd.concat() to stack dataframes along an axis.
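A minimal sketch using pd.merge() on a shared key column (the data is illustrative):

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "name": ["Ana", "Ben", "Cal"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 3],
                           "amount": [25.0, 40.0, 15.0]})

    # Inner join on the shared key; how= can also be "left", "right" or "outer".
    joined = customers.merge(orders, on="customer_id", how="inner")
    print(joined)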

19
Q

What is the primary purpose of measures of dispersion in data analysis?

A

To quantify the amount of variation or spread in a set of data values.

20
Q

Fill in the blank: The _______ is calculated as the difference between the maximum and minimum values in a dataset.

A

Range.

21
Q

What is the standard deviation a measure of?

A

The variation of the values of a variable about its mean.

22
Q

How may you use the standard deviation of a variable to identify potential outliers?

A

Data points more than 3 standard deviations away from the mean are commonly treated as potential outliers.
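A minimal sketch of this rule using Pandas (the data is illustrative):

    import pandas as pd

    # Illustrative values; in practice this would be a column of your dataset.
    s = pd.Series([10, 12, 11, 13, 12, 95, 11, 12, 13, 10, 12, 11])

    # Standardise: how many standard deviations each value lies from the mean.
    z_scores = (s - s.mean()) / s.std()

    # Values more than 3 standard deviations from the mean are flagged as potential outliers.
    potential_outliers = s[z_scores.abs() > 3]
    print(potential_outliers)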

23
Q

What are row-oriented databases vs. column-based databases?

A

Row-oriented databases store data in rows, which in turn consist of columns.
In contrast, a column-based database stores the values of each column together - rather than the values of each row together.

https://www.scylladb.com/glossary/columnar-database/ - contains a helpful visualisation for this difference.

24
Q

What is a quantile?

A

Cut points that divide a distribution of values into intervals, each containing the same fraction of the total population, e.g. quartiles or percentiles.
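A small illustration using NumPy (the data is illustrative):

    import numpy as np

    data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])

    # Quartiles are the cut points that split the distribution into four equal-sized groups.
    q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])
    print(q1, q2, q3)   # q2 is the median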

25
Q

What may you do during the data understanding phase of CRISP-DM to explore the data?

A
  • Look at data fields - identify if variables are continuous or categorical.
  • Look at the number of entries.
  • Look for relationships/links between variables/datasets.
  • Look for data quality issues - e.g. null values.
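A short sketch of some of these checks in Pandas, assuming the dataset has been loaded from a hypothetical data.csv file:

    import pandas as pd

    df = pd.read_csv("data.csv")

    print(df.shape)                      # number of entries (rows) and fields (columns)
    print(df.dtypes)                     # continuous (numeric) vs categorical (object) fields
    print(df.isnull().sum())             # data quality: null values per column
    print(df.describe())                 # summary statistics for numeric columns
    print(df.corr(numeric_only=True))    # relationships between numeric variables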