Chapter 2 Flashcards
(28 cards)
What is Data Management?
The process that a firm uses to acquire, organise, store, manipulate and distribute data
What is Data Wrangling?
Process of retrieving, cleansing, integrating, transforming and enriching data to support subsequent data analysis.
Transforms raw data into a format that is more appropriate and easier to analyse
Why do we need Data Wrangling?
The increasing volume and variety of data compel firms to spend great amounts of time and resources on gathering, cleaning and organising data before performing any analysis.
As the amount of data grows, both the need for data wrangling and the difficulty of performing it increase.
What are some objectives of Data Wrangling?
- improve data quality
- reduce the time and effort needed to perform analytics
- reveal the true intelligence of the data
What is a Database?
A collection of data logically organised to enable easy retrieval, management and distribution of data.
What is a Database Management System (DBMS)?
Software application for defining, manipulating and managing data in databases
What is a Relational Database?
Most common type of database that is modelled to offer flexibility and ease of data retrieval
What is Data Modeling?
Process of defining the structure of a database
What is an ERD?
Entity relationship diagram is a graphical representation used to illustrate the structure of data
What are the 6 key elements of an ERD?
- entity
- instance
- relationship
- primary key
- foreign key
- composite primary key
What are the 3 different relationships an ERD can have?
1:1
1:M
M:N
How can we retrieve data that is stored in a relational database?
By using database queries written in SQL (Structured Query Language): a language for manipulating data in a relational database using relatively simple and intuitive commands
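As a minimal sketch of such a query, the example below uses Python's built-in sqlite3 module with an in-memory database. The customer and purchase tables, their columns and the sample rows are hypothetical and only illustrate a 1:M relationship linked by a primary key and a foreign key.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key
        name        TEXT NOT NULL
    );
    CREATE TABLE purchase (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer(customer_id),  -- foreign key (1:M)
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO purchase VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Retrieve each customer's total spend by joining on the foreign key.
rows = conn.execute("""
    SELECT c.name, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY total DESC
""").fetchall()
conn.close()

for name, total in rows:
    print(name, total)
```

One customer can have many purchases, so the JOIN matches each purchase row to its customer before the totals are summed.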
What is a Data Warehouse?
Central repository of data from multiple departments within a firm. Its primary purpose is to support managerial decision making, so data in a data warehouse is organised around subjects such as sales, customers or products that are relevant to business decisions.
How is data from different databases in different departments integrated into a data warehouse?
- an ETL (extract, transform, load) process is used
- extract: retrieve the data from the source databases
- transform: reconcile and convert the data into consistent formats
- load: load the final data into the data warehouse
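The ETL steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production pipeline: the two departmental sources, their field names and date formats are hypothetical, and a plain list stands in for the data warehouse.

```python
from datetime import datetime

# Extract: hypothetical records retrieved from two departmental databases
# that store the same facts in inconsistent formats.
sales_rows = [{"cust": "C1", "date": "2024-01-05"}]
support_rows = [{"customer_id": "c1", "opened": "05/01/2024"}]

def transform_sales(row):
    # Reconcile field names and parse the ISO-style date.
    return {
        "customer_id": row["cust"].upper(),
        "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
        "source": "sales",
    }

def transform_support(row):
    # Same target format, but the source uses lower-case ids and DD/MM/YYYY.
    return {
        "customer_id": row["customer_id"].upper(),
        "date": datetime.strptime(row["opened"], "%d/%m/%Y").date(),
        "source": "support",
    }

# Load: append the consistent records to the (stand-in) warehouse.
warehouse = []
warehouse.extend(transform_sales(r) for r in sales_rows)
warehouse.extend(transform_support(r) for r in support_rows)

for record in warehouse:
    print(record)
```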
What is a Data Mart?
Small-scale data warehouse, or subset of a data warehouse, that focuses on a specific subject or decision area and conforms to a multidimensional data model, also known as a star schema
What type of database is used for Big Data?
NoSQL = "Not only SQL"
Non relational database that supports the storage of a wide range of data types (structured, semi structured or unstructured)
What is Data Inspection?
Once raw data is extracted from the database, warehouse or mart we have to review and inspect the data to assess the quality and relevance of the information for the analysis.
We also need to count and sort the data to get a better understanding of the data as well as to determine if the data set is complete or if it has any missing values.
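A minimal sketch of this kind of inspection, using plain Python. The records and field names are hypothetical, and None is assumed to mark a missing value.

```python
# Hypothetical extracted data; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 48000},
]

# Count: how many observations are there, and how many values
# are missing in each field?
n_records = len(records)
missing_per_field = {
    field: sum(1 for r in records if r[field] is None)
    for field in records[0]
}

# Sort: order the complete observations by income, highest first.
complete = [r for r in records if all(v is not None for v in r.values())]
by_income = sorted(complete, key=lambda r: r["income"], reverse=True)

print(n_records, missing_per_field)
print([r["id"] for r in by_income])
```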
What is Data Preparation?
Happens after the data has been inspected. Two techniques are examined: handling missing values and subsetting data.
Why are some values or data missing?
- respondents decline to provide information due to its sensitive nature
- some items do not apply to every respondent
- caused by human errors, sloppy data collection or equipment failures
What are the 2 Strategies for dealing with missing values?
Omission and imputation
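Both strategies can be sketched on a small list of values. The data are hypothetical, and filling with the mean of the observed values is only one possible imputation rule, chosen here as an assumption for illustration.

```python
from statistics import mean

# Hypothetical variable with one missing value (None).
ages = [34, None, 29, 41]

# Omission: drop the observations that have a missing value.
omitted = [a for a in ages if a is not None]

# Imputation: replace each missing value with a reasonable substitute,
# here the mean of the observed values (an assumed choice).
fill_value = mean(omitted)
imputed = [a if a is not None else fill_value for a in ages]

print(omitted)
print(imputed)
```

Omission shrinks the data set, while imputation keeps every observation at the cost of introducing estimated values.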
What is Subsetting?
Process of extracting portions of a data set that are relevant to the analysis
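As a small sketch, subsetting can mean keeping only certain observations (rows), only certain variables (columns), or both. The customer records and the "North" region filter below are hypothetical.

```python
# Hypothetical data set.
customers = [
    {"id": 1, "region": "North", "income": 52000},
    {"id": 2, "region": "South", "income": 61000},
    {"id": 3, "region": "North", "income": 48000},
]

# Keep only the rows for the North region and only the
# variables needed for the analysis (id and income).
subset = [
    {"id": c["id"], "income": c["income"]}
    for c in customers
    if c["region"] == "North"
]

print(subset)
```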
What is Data Transformation?
Data conversion process from one format to another. Performed to meet the requirements of the statistical and data mining techniques used for the analysis
What are the 2 ways to transform numerical data?
Binning and Mathematical transformation
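A minimal sketch of both approaches. The income values and the bin boundaries (30,000 and 100,000) are hypothetical, and the natural logarithm is used as one example of a mathematical transformation.

```python
import math

# Hypothetical numerical variable.
incomes = [18000, 42000, 75000, 130000]

def income_bin(value):
    # Binning: map a numerical value to a category.
    # The cut-off points are assumed for illustration only.
    if value < 30000:
        return "low"
    if value < 100000:
        return "medium"
    return "high"

binned = [income_bin(v) for v in incomes]

# Mathematical transformation: the natural log compresses large values
# while preserving the ordering of the observations.
logged = [math.log(v) for v in incomes]

print(binned)
print([round(x, 2) for x in logged])
```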
Sometimes, nominal and ordinal variables come with too many categories. Which potential problems could this cause?
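- can pull down model performance, because several parameters associated with the categories must be estimated, rather than the single parameter needed for a numerical variable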
- if the variable has categories that rarely occur it can be difficult to capture its impact
- relatively small samples may not contain any observations in some categories which can cause errors when the analytical model is later applied to a larger data set with observations in all categories
- if one category dominates in terms of occurrence, the categorical variable will fail to make a positive impact since modelling success is dependent on being able to differentiate among the observations
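One way to spot the rare and dominant categories described above is to tabulate how often each category occurs. The sketch below uses hypothetical data, and the 15% and 70% thresholds are arbitrary assumptions rather than recommended values.

```python
from collections import Counter

# Hypothetical categorical variable with one dominant category.
colours = ["red"] * 8 + ["blue"] + ["green"]

counts = Counter(colours)
total = len(colours)
shares = {category: n / total for category, n in counts.items()}

# Thresholds are assumed for illustration only.
rare = sorted(c for c, s in shares.items() if s < 0.15)
dominant = [c for c, s in shares.items() if s > 0.70]

print(shares)
print("rare:", rare, "dominant:", dominant)
```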