Common Questions Flashcards
(28 cards)
What is data analysis?
The practice of gathering, cleaning, transforming, and interpreting data to extract meaningful insights and use them to make decisions.
Explain the main aspects of data analysis.
- Data collection - collecting raw data from numerous sources e.g. external datasets, surveys, internal databases etc.
- Data transformation - cleaning, standardisation (data type alignment), data enrichment (adding additional data to existing records) e.g. historical/ geographical, data aggregation (from different sources), data mapping (matching fields/ cols from different sources), data partitioning (one dataset into smaller ones).
- EDA - Exploratory Data Analysis, aimed at studying and summarising the characteristics of data. The main methods are statistics and data visualisation. Statistics provide brief numerical summaries of data, e.g. mean, median, standard deviation, and correlation coefficients. Data visualisation is the graphical representation of data; some graphs are more useful than others, e.g. a boxplot is a good way to visualise the distribution of data and spot extreme values. This stage also involves applying mathematical/ statistical techniques to data to draw conclusions.
- Showcasing results e.g. using Tableau, Power BI, Python packages such as Matplotlib, Seaborn etc., R packages such as ggplot2, Lattice etc.
How do data analysts differ from data scientists?
Data analysts - responsible for collecting, cleaning, and analysing data to help make better decisions. Visualisation tools are used to identify trends, and reports/ dashboards may be made to communicate findings.
Data scientists - responsible for creating/ implementing machine learning and statistical models on data which are used to make predictions and enhance business processes.
Give examples of different tools used for DA.
- Spreadsheet software e.g. Excel, Google Sheets.
- Database Management Systems to store, manage, and organise large datasets e.g. MySQL, SQL Server, PostgreSQL.
- Programming languages e.g. Python, R.
What is data wrangling?
AKA data munging.
Involves cleaning, transforming, and organising raw, unstructured data into a usable format, to improve the dataset structure and quality.
Can involve data cleaning, data transformation, data integration, data restructuring, data enrichment, and quality assurance.
What is data cleaning?
Identifying and removing errors, inconsistencies, and missing values from datasets.
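A minimal sketch of data cleaning in plain Python. The records and the `clean` helper are hypothetical, and dropping rows with missing values is only one possible policy (imputing them is another).

```python
# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"id": 1, "city": " London ", "age": 34},
    {"id": 2, "city": "paris", "age": None},
    {"id": 1, "city": " London ", "age": 34},  # exact duplicate of the first row
    {"id": 3, "city": "Berlin", "age": 29},
]


def clean(rows):
    """Drop duplicate rows and rows with missing values, and tidy text fields."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        if any(value is None for value in row.values()):
            continue  # skip rows with missing values
        # Fix inconsistent whitespace and capitalisation in the text column.
        cleaned.append({**row, "city": row["city"].strip().title()})
    return cleaned


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```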
What is data transformation?
Transforming the structure, format, or values of data as per analysis requirements, which may include normalisation, scaling, and encoding categorical values.
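As an illustration, scaling and categorical encoding can be sketched as below. The function names and sample values are made up for this card; real projects often use a library for this.

```python
def min_max_scale(values):
    """Rescale numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero on constant data
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Turn each category into a 0/1 vector, one position per category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


incomes = [20_000, 35_000, 50_000]
colours = ["red", "blue", "red"]

scaled = min_max_scale(incomes)    # [0.0, 0.5, 1.0]
encoded = one_hot_encode(colours)  # columns are (blue, red)
print(scaled, encoded)
```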
What is data restructuring?
Reorganising data to make it more suitable for analysis, e.g. reshaping into different formats or creating new variables by aggregating features at different levels.
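A small sketch of both ideas, assuming hypothetical sales records: reshaping from a long format (one row per store and month) to a wide format (one entry per store), then creating a new variable aggregated at store level.

```python
from collections import defaultdict

# Hypothetical long-format data: one row per store per month.
sales = [
    {"store": "A", "month": "Jan", "amount": 100},
    {"store": "A", "month": "Feb", "amount": 150},
    {"store": "B", "month": "Jan", "amount": 80},
    {"store": "B", "month": "Feb", "amount": 120},
]

# Reshape long -> wide: one entry per store, one key per month.
wide = defaultdict(dict)
for row in sales:
    wide[row["store"]][row["month"]] = row["amount"]

# New variable aggregated at store level.
total_by_store = {store: sum(months.values()) for store, months in wide.items()}
print(dict(wide), total_by_store)
```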
What is data enrichment?
Data is enriched by adding additional relevant info e.g. combined aggregation of numerous features, external data etc.
What is quality assurance?
Ensuring data meets quality standards and is fit for analysis.
What is descriptive analysis?
- Used to answer questions e.g. what has happened previously, what are the key characteristics of the data?
- Identifies patterns, trends, and relationships within data.
- Uses statistical measures, visualisations, and exploratory DA techniques to gain insight.
- Concerned with historical perspective, summary statistics (mode, mean, median), visualisations, trends, and exploration.
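For example, the summary statistics named above can be computed with Python's built-in `statistics` module. The order counts are invented for illustration.

```python
import statistics

# Hypothetical historical data: orders received per month.
monthly_orders = [12, 15, 15, 18, 20, 22, 24]

summary = {
    "mean": statistics.mean(monthly_orders),
    "median": statistics.median(monthly_orders),
    "mode": statistics.mode(monthly_orders),
    "stdev": statistics.stdev(monthly_orders),  # sample standard deviation
}
print(summary)
```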
What is predictive analysis?
- Uses past data and applies statistical and ML models to identify patterns and make future predictions.
- Concerned with future projection, model building (using historical data), validation and testing (using unseen data to assess accuracy), feature selection (identifying variables that influence the predicted outcome), and decision making (by providing insights).
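A toy sketch of building on historical data and testing on unseen data. The model here is a 1-nearest-neighbour classifier, chosen only because it is short; the data and the `predict_1nn` helper are hypothetical.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


# Hypothetical historical data: (hours of product usage per week, outcome).
history = [(1, "churn"), (2, "churn"), (3, "churn"),
           (8, "stay"), (9, "stay"), (10, "stay")]

# Unseen data held back to assess accuracy.
unseen = [(2.5, "churn"), (7, "stay"), (4, "churn"), (6, "churn")]

correct = sum(predict_1nn(history, x) == label for x, label in unseen)
accuracy = correct / len(unseen)
print(accuracy)
```

The last unseen customer is misclassified, which is the point of testing on data the model has not seen: accuracy on the historical data alone would look perfect.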
What is univariate, bivariate, and multivariate analysis?
- Univariate - analyses one variable at a time using measures of central tendency (mean, median, mode etc.), measures of dispersion (range, variance), and graphical methods e.g. histograms.
- Bivariate - analyses the relationship between 2 variables to understand how one variable is related to another, how strong the correlation is etc. using scatter plots, contingency tables etc.
- Multivariate - analyses relationship between 3 or more vars to identify patterns/ clusters/ dependencies, using cluster analysis, regression analysis etc.
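As a bivariate example, the strength of a linear relationship between two variables can be measured with the Pearson correlation coefficient. A minimal sketch with invented numbers:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: advertising spend and sales for five periods.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson(ad_spend, sales)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates a weak one.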
Give examples of some of the most popular DA tools.
- Tableau - visualisation and dashboard creation from numerous data sources.
- Power BI - data visualisation.
- Qlik Sense - data visualisation.
- SAS - advanced analytics, multivariate analysis, and business intelligence.
What are the steps taken to analyse a dataset (generic)?
- Ensure problem/ objective is clear.
- Collect data from various sources e.g. surveys, tests, databases, web scraping, and ensure it’s representative/ accurate.
- Data preprocessing/ data cleaning - fixing missing values, removing blanks/ duplicates/ extreme outliers, formatting, redefining columns etc.
- EDA (Exploratory Data Analysis) - apply different graphical/ statistical approaches to analyse data and discover trends/ patterns, identify outliers, and gain initial insights.
- Data visualisations - provide visual representation of complex info and patterns, which enhances understanding and allows communication to stakeholders.
Why is exploratory data analysis important?
- EDA - investigating and understanding data through graphical and statistical techniques.
- Helps identify trends and understand relationships between variables.
- Non-parametric (makes few assumptions about the underlying distribution of the data).
- Can get deep understanding of variable relationships, patterns, and nature of data.
- Can analyse quality of the dataset through univariate analysis e.g. mean, mode, median, interquartile range and identify patterns in single variables (columns) of the dataset.
- Can find the most influential features of the dataset via correlations, covariance, and bivariate/ multivariate plotting.
- Can identify outliers using box plots.
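Box plots usually flag points lying more than 1.5 x IQR beyond the quartiles. A sketch of that rule without plotting, using hypothetical delivery times (quartile conventions vary between tools; this uses the `inclusive` method of `statistics.quantiles`):

```python
import statistics


def iqr_outliers(values):
    """Return values outside the 1.5 * IQR fences used by a typical box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical data: delivery times in days.
delivery_days = [2, 3, 3, 4, 4, 5, 5, 6, 30]

outliers = iqr_outliers(delivery_days)
print(outliers)
```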
What are key considerations when undertaking data transformation?
Data Profiling: Understanding the characteristics of the data before transformation.
Mapping: Defining how to map data from different sources.
Transformation Rules: Implementing rules to transform data into the desired format.
Testing and Validation: Ensuring the accuracy of the transformed data.
Iteration and Refinement: Continuously improving the transformation process based on feedback.
What are the four types of DA?
Descriptive analysis - what happened? Summarises historical data to understand what has happened previously.
Diagnostic analysis - why did it happen? Comparing different data sets to understand an outcome.
Predictive analysis - what will happen? Can use statistical models and forecasting techniques to understand the future, and involves using data from the past to predict what might happen in the future.
Prescriptive analysis - how can we make it happen? Builds on predicted future outcomes to suggest which actions to take, e.g. using machine learning.
What is exploratory analysis?
Used to understand main characteristics of data set.
Often used at beginning of DA process to summarise main aspects of the data/ check for missing data/ test assumptions.
Can involve visual methods such as scatter plots, histograms, and box plots.
What is regression analysis?
Statistical method used to understand the relationship between a dependent variable and one or more independent variables.
Commonly used for forecasting, time series modelling, and investigating cause-and-effect relationships between variables.
E.g. linear regression.
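A minimal sketch of simple linear regression (one independent variable) fitted by ordinary least squares. The `fit_line` helper and the salary figures are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience (independent) vs salary in thousands (dependent).
years_experience = [1, 2, 3, 4, 5]
salary_k = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years_experience, salary_k)
forecast = intercept + slope * 6  # predicted salary at 6 years of experience
print(slope, intercept, forecast)
```

The slope reads as the expected change in the dependent variable for a one-unit change in the independent variable.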
What is factor analysis?
Technique used to reduce a large number of variables into fewer factors.
The factors are constructed in such a way that they capture the maximum possible info from the original variables.
Often used in market research, customer segmentation, and image recognition.
What is cluster analysis?
Technique used to group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
Often used in market segmentation, image segmentation, and recommendation systems.
E.g. hierarchical clustering and k-means clustering.
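A simplified sketch of k-means on one-dimensional data. The starting centroids are fixed by hand here so the result is repeatable; real implementations usually pick them randomly and work in several dimensions. All names and values are illustrative.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: monthly spend per customer.
spend = [10, 12, 11, 50, 52, 49]

centroids, clusters = k_means_1d(spend, centroids=[10, 50])
print(centroids, clusters)
```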