Data Analysis Concepts Flashcards
(27 cards)
What are the Data Analysis phases (in this course)?
AP-PASA
Ask: understand problem, goals, stakeholders - plan project
Prepare: get the data
Process: clean, organize, transform
Analyze: explore, visualize, stats
Share: communicate, report, story
Act: solutions
What is Data Analysis?
Turning data into insights for informed action - reduce risk of wasted efforts
What is the SMART question methodology?
Specific: simple, focused
Measurable: quantifiable
Action-oriented: encourages change
Relevant
Time-bound
What’s the difference between a Data Analyst, Data Engineer, & Data Scientist?
Analyst: answers questions with existing data – SQL, spreadsheets, DB’s, BI, dashboards
Engineer: turn raw data into actionable pipelines
Scientist: creates new ways of modeling and using data
What’s the difference between data-driven vs data-inspired decision-making?
Data-driven: using facts to guide strategy… requires quality & quantity… over-reliance can result in historical bias, ignoring qualitative insight
Data-inspired: adds in other sources of info - feelings/experience, difficult to measure qualities, related concepts
Quantitative vs Qualitative data: explain differences and give examples
Quantitative: specific & measurable. Often gives WHAT of a problem.
* Structured interviews, surveys/polls
Qualitative: subjective or explanatory - can’t be quantified. Often gives WHY of a problem.
* Focus groups, social media text/review analysis, in-person interviews
Powerful when combined
Report vs Dashboard: explain differences, strengths/weaknesses
Report: Static, distributed periodically
+/- High level, historical
+. Quick to build, easy IF maintained
+. Static data - no cleaning
-. Continual maintenance
-. Less interactive
Dashboard: Real-time data, multiple datasets in one place
+. Dynamic, automated, interactive
+. User exploration
-. Labor-intensive design
-. Can be confusing/overwhelming (requires training)
-. More initial effort, and may need fixes
-. Potentially unclean data
3 Types of Common Dashboard focus
Strategic: long term goals - highest level metrics over time frame
Operational: short-term performance and goals (most common - real-time status)
Analytical: datasets and mathematics
Small Data vs Big Data: define and explain the differences in use
**Small Data:** specific, short time-period, day-to-day decisions
- usually spreadsheets
- small/mid-size businesses
- simple to collect, store, manage, sort, visualize
- usually manageable size for analysis
**Big Data:** larger, less-specific, longer time period, big decisions
- usually database, queried
- larger businesses
- takes effort to collect, store, manage, sort, visualize
- usually needs to be broken down for analysis
- often more data than needed - challenge is to sift for gems
What is structured thinking?
Process - recognize problem, organize available info, reveal gaps/opportunities, identify options for action.
Scope of Work & Statement of Work: Define and explain difference
Scope of Work: agreed-upon outline of the work to be done, including deliverables, timeline, milestones, and reports
Statement of Work: identifies products/services vendor or contractor will provide an organization (objectives, guidelines, deliverables, schedule, cost)
What are the W questions to explore possible bias in data?
**Who:** person/organization who collected/funded the data
**What:** things in the world that could have impacted the data
**Where:** origin of data
**When:** time data was created/collected
**Why:** motivation behind creation/collection
**How:** methods used to create/collect
Important to include context/possible bias when presenting/reporting data
What are some tips when dealing with Executive team stakeholders?
- Strategic
- Headlines first
- Limited time
- Details in appendix
1st vs 2nd vs 3rd party data sources: what’s the difference?
First Party: collected by individual/group themselves for own use
Second Party: collected by a group from its own audience, then sold
Third Party: collected by outside sources who didn’t collect it directly themselves - requires more checking
Discrete vs Continuous?
Ordinal vs Nominal?
Internal vs External?
Discrete: whole numbers only
**Continuous:** any numeric value
Ordinal: qualitative data with set order/sequence
Nominal: qualitative data with no order/sequence
**Internal:** lives in org’s systems (more reliable, easier to collect)
**External:** lives outside org’s systems
Structured vs Unstructured data?
Structured: defined data types; usually quantitative; easy to organize/search/analyze; rows/columns; stored in DB or warehouse
Unstructured: varied data types; usually qualitative; difficult to search; more freedom for analysis; stored in data lakes, warehouses, NoSQL DB’s; Can’t be put in rows/columns; Examples: txt msgs, soc media comments, phone call transcripts, images/audio/video
What is Data Modeling?
Diagramming how data is organized/structured.
Data analysts don’t create data models, but they must be able to read/understand them.
What are the three types of data modeling (from lowest to highest level of detail)?
**1. Conceptual** (Business concepts): high level view of data, key entities
2. Logical (data entities): relationships, attributes, entities (not actual table/column names)
3. Physical (physical tables): specific definitions of all tables, attributes, relationships, columns, data types, ACTUAL names
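To make the physical level concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customer, purchase, and so on) are hypothetical and chosen only for illustration; the point is that a physical model spells out actual names, data types, and relationships.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Physical model: ACTUAL table names, column names, data types, and keys.
# All names below are made up for this example.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL NOT NULL
);
""")

# Read the column names and types back from the database.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(purchase)")]
print(columns)
```

A logical model would describe the same two entities and their relationship without committing to these exact names or types.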
ERD vs UML
**ERD:** Entity Relationship Diagram - visual display of relationships between entities in a database
**UML:** Unified Modeling Language - detailed diagrams that show ERD contents PLUS system behaviors and workflows (more detailed)
Data Types:
Number
String
Boolean
Number: numbers only, decimal
String: text characters, punctuation
Boolean: T or F (formula uses Boolean operators AND, OR, NOT)
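A quick illustration of the three types, sketched in Python rather than spreadsheet formulas. The order record and variable names are invented for the example.

```python
# Hypothetical order record, used only for illustration.
quantity = 3             # Number (whole)
unit_price = 19.99       # Number (decimal)
customer = "A. Lopez"    # String: text characters and punctuation
is_member = True         # Boolean: True or False

# Boolean operators combine True/False values into a new True/False value.
free_shipping = is_member and quantity >= 2           # AND
needs_review = quantity > 100 or unit_price > 1000    # OR
full_price = not is_member                            # NOT

print(free_shipping, needs_review, full_price)
```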
Wide Data vs Long Data
Wide Data: each subject has single row with multiple columns for attributes. Easier to compare specific attribute across different subjects. Analysts often transform Long data into Wide data for analysis/visualizations
**Long Data:** each row is one time point per subject. Each subject may have multiple rows. Versatile way to store data. More advanced, more detail on each subject.
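The two layouts hold the same information, which a small reshaping sketch can show. This uses plain Python so it runs without extra libraries; in practice a tool such as pandas (`pivot`, `melt`) would usually do this. The data and the helper names `long_to_wide` and `wide_to_long` are hypothetical.

```python
# Hypothetical long-format data: one row per subject per time point.
long_rows = [
    {"country": "A", "year": 2020, "population": 100},
    {"country": "A", "year": 2021, "population": 110},
    {"country": "B", "year": 2020, "population": 200},
    {"country": "B", "year": 2021, "population": 205},
]

def long_to_wide(rows, subject, key, value):
    """Pivot long rows into one row per subject, with one column per key."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[subject], {subject: row[subject]})
        record[row[key]] = row[value]
    return list(wide.values())

def wide_to_long(rows, subject, key, value):
    """Unpivot wide rows back into one row per subject per key."""
    return [
        {subject: row[subject], key: col, value: val}
        for row in rows
        for col, val in row.items()
        if col != subject
    ]

wide_rows = long_to_wide(long_rows, "country", "year", "population")
print(wide_rows)  # one row per country, one column per year
```

Calling `wide_to_long(wide_rows, "country", "year", "population")` returns the original long rows, so no information is lost in either direction.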
What is data transformation?
Changing data format, structure, values (adding/copying/deleting/renaming/combining/joining datasets/reformatting).
Goal: reorganize data for easier use/analysis; improve compatibility/portability; merge datasets; enhance with more detail/fields; compare data
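A small sketch of several of these transformations at once: renaming fields, reformatting a date, converting text to a number, and joining in a second dataset. The records, field names, and customer lookup are all invented for illustration.

```python
from datetime import datetime

# Hypothetical raw sales records with terse field names and text values.
sales = [
    {"cust": 1, "date": "03/15/2024", "amt": "19.99"},
    {"cust": 2, "date": "03/16/2024", "amt": "5.00"},
]

# Hypothetical second dataset: customer ID -> customer name.
customers = {1: "Avery", 2: "Blake"}

transformed = [
    {
        "customer_id": row["cust"],                # rename
        "customer_name": customers[row["cust"]],   # join with second dataset
        # reformat MM/DD/YYYY text into ISO YYYY-MM-DD
        "sale_date": datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat(),
        "amount": float(row["amt"]),               # text -> number
    }
    for row in sales
]

print(transformed[0])
```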
What are the three common types of metadata?
Descriptive: used to identify record later on (name, ID, title, author, etc)
Structural: how organized related to other data/collections
Administrative: technical source of an asset
What 3 elements are needed to calculate a sample size? What do they each represent?
Population size (total)
Confidence Level (%): how sure you are that results are representative (if you did survey again, likelihood you’d get similar results). Standard 95% or 90%
Margin of Error (+/- %): how much your results might vary from ACTUAL value. standard
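
One common way to combine these three inputs is Cochran's formula with a finite population correction. The sketch below assumes that formula, plus a worst-case proportion of 0.5; these are assumptions of the example, not necessarily what the course's sample size calculator uses. The function name `sample_size` is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence=0.95, margin_of_error=0.05, proportion=0.5):
    """Estimate a survey sample size (assumed: Cochran's formula with
    finite population correction; proportion=0.5 is the most conservative)."""
    # z-score for a two-sided confidence level, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink the estimate for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(500))      # population 500, 95% confidence, +/- 5%
print(sample_size(100_000))  # larger population needs a somewhat larger sample
```

Raising the confidence level or tightening the margin of error increases the required sample, while population size matters less and less as it grows.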