Exam questions Flashcards
(L1): What distinguishes data science from traditional analytics or business intelligence?
Data science goes beyond reporting and dashboards; it focuses on extracting actionable insights from complex and often unstructured data using programming, statistical modeling, and machine learning. Unlike traditional business intelligence, which is retrospective and descriptive, data science is predictive, exploratory, and iterative.
(L1): How does the data science workflow support data-driven decision making?
The workflow—typically based on models like CRISP-DM—guides the process from understanding a business problem to collecting, preparing, analyzing, and deploying data solutions. It ensures that insights are not only technically correct but also aligned with business objectives.
(L1): What are the main types of data analytics (descriptive, diagnostic, predictive, prescriptive) and how are they applied in business?
Descriptive: What happened? (e.g., sales reports)
Diagnostic: Why did it happen? (e.g., churn analysis)
Predictive: What will happen? (e.g., demand forecasting)
Prescriptive: What should we do? (e.g., route optimization)
Each type supports decision-making at different stages, from insight generation to strategic planning.
(L1): Why is programming an essential skill for modern data scientists, even in business contexts?
Programming enables data scientists to automate tasks, clean and manipulate large datasets, develop models, and customize analyses. It bridges the gap between raw data and strategic insights, empowering them to build scalable, repeatable solutions.
(L1): What is R, and why is it widely used in data science?
R is a statistical programming language designed for data analysis, visualization, and modeling. It is popular because of its rich package ecosystem, strong community support, and strengths in exploratory and statistical work. It’s especially favored in academia and applied research.
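A minimal sketch of the exploratory and statistical work R is designed for, using only base R and the built-in mtcars dataset:

```r
# Quick exploratory analysis with base R and the built-in mtcars dataset
data(mtcars)

# Descriptive statistics for every variable
summary(mtcars)

# Fit a simple linear model: fuel efficiency as a function of weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)  # coefficients, R-squared, p-values in one call

# One-line visualization of the relationship, with the fitted line
plot(mpg ~ wt, data = mtcars, main = "Fuel efficiency vs. weight")
abline(model)
```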
(L1): How does understanding programming improve your ability to collaborate across business and technical teams?
Programming literacy helps data scientists translate business questions into data problems, communicate with developers, and explain technical results to non-technical stakeholders. This ensures that analytical solutions are strategically relevant and implementable.
(L2): What does it mean to say that all data is “socially constructed”?
To say that all data is “socially constructed” means that data does not exist independently as pure, objective facts. Instead, it is:
Collected, categorized, and defined by people based on specific goals, contexts, and assumptions.
Shaped by choices about what to measure, how to measure it, and for what purpose.
Embedded with values, biases, and power structures, often reflecting the interests of those who design the systems.
As emphasized in Rosenberg (2013), “raw data” is an oxymoron—data is never neutral; it is always filtered through human and institutional decisions.
(L2): Why is it important to question the objectivity of data sources in business contexts?
Questioning the objectivity of data sources in business contexts is crucial because:
Data reflects design choices—what to collect, how, and from whom—which may introduce bias or omissions.
Business decisions based on biased data can lead to unfair outcomes (e.g., discriminatory models), misallocation of resources, or flawed strategies.
Contextual understanding is needed to avoid over-trusting data that appears neutral but is influenced by historical, cultural, or organizational factors.
It ensures ethical, accurate, and responsible use of data for modeling, forecasting, and decision-making.
In short: uncritical use of data risks turning flawed input into flawed conclusions.
(L2): How can data collection methods introduce bias into an analysis?
Data collection methods can introduce bias into an analysis through:
Sampling bias – when the data does not represent the target population (e.g., only collecting data from active users).
Measurement bias – when the tools or definitions used to collect data skew results (e.g., vague survey questions, poorly calibrated sensors).
Exclusion bias – when important groups or variables are left out (e.g., ignoring non-digital consumers in online studies).
Observer or recording bias – when human judgment affects what is recorded or how (e.g., manual categorization or tagging).
Platform or algorithmic bias – when digital systems (e.g., search engines, social media) shape what data gets collected in the first place.
These biases distort findings, reduce generalizability, and can lead to misleading or harmful conclusions.
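A small hypothetical simulation makes sampling bias concrete (the population and the "active user" cutoff below are invented for illustration):

```r
set.seed(42)

# Hypothetical population: customer spending, skewed distribution
population <- rlnorm(10000, meanlog = 4, sdlog = 0.8)

# Unbiased sample: every customer has an equal chance of selection
random_sample <- sample(population, 500)

# Biased sample: the survey only reaches high-spending "active" customers
active_only <- population[population > median(population)]
biased_sample <- sample(active_only, 500)

mean(population)     # true population mean
mean(random_sample)  # close to the truth
mean(biased_sample)  # systematically overestimates spending
```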
(L2): What are some practical ways to document data provenance and transformation in a project?
Practical ways to document data provenance (origin) and transformation (changes) include:
Data dictionaries – Describe each variable: name, type, source, meaning, and units.
Metadata files – Record dataset origin, collection date, method, and context.
Version control (e.g., Git) – Track changes in datasets, scripts, and models over time.
Code-based workflows – Use reproducible scripts (e.g., R scripts, Jupyter notebooks) to log each data cleaning and transformation step (see the sketch after this list).
CRISP-DM documentation – Follow structured steps to log business understanding, data preparation, modeling, and evaluation.
Data lineage diagrams – Visualize how raw data moves through transformations to final outputs.
These practices support transparency, reproducibility, and auditability.
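A minimal sketch of such a code-based workflow in R, recording metadata alongside each transformation; the file and column names are invented:

```r
# Illustrative provenance log for a cleaning script; file and column
# names (sales_raw.csv, amount) are hypothetical
library(dplyr)

raw <- read.csv("sales_raw.csv")

# Record dataset-level metadata alongside the data
metadata <- list(
  source       = "sales_raw.csv",
  collected_on = "2024-03-01",
  method       = "export from CRM system",
  accessed_on  = Sys.Date()
)

# Each transformation is an explicit, commented step
clean <- raw |>
  filter(!is.na(amount)) |>          # drop rows with missing amounts
  mutate(amount_eur = amount / 100)  # convert cents to euros

# Persist both the cleaned data and its provenance record
saveRDS(list(data = clean, metadata = metadata), "sales_clean.rds")
```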
(L2): What ethical considerations should be made before collecting or using data for analysis?
Before collecting or using data for analysis, key ethical considerations include:
Consent – Was data collected with informed, voluntary consent?
Privacy – Does the analysis protect personal or sensitive information? Are anonymization and data minimization applied?
Purpose limitation – Is the data used strictly for its intended, declared purpose?
Bias and fairness – Could the data or analysis lead to discrimination or reinforce social inequalities?
Transparency – Are data sources, assumptions, and limitations clearly documented?
Accountability – Who is responsible for the outcomes of the analysis, especially if automated decisions are involved?
These issues are central to responsible data science and align with principles discussed in Lecture 9 (Ethics).
(L2): What does the phrase “no such thing as raw data” mean in the context of data science?
The phrase “no such thing as raw data” means that data is never neutral, pure, or untouched—it is always the result of human decisions about what to observe, how to measure it, and why it matters.
In data science, this highlights that:
Data is constructed, not discovered.
It reflects assumptions, context, and bias from its collection and processing.
Analysts must treat data as rhetorical and interpretive, not as unquestionable fact.
This concept, emphasized by Rosenberg (2013), challenges the myth of objective data and calls for critical engagement with how data is created and used.
(L2): How do social, technical, and political choices influence how data is collected and interpreted?
Social, technical, and political choices shape both what data is collected and how it is interpreted, in the following ways:
Social: Cultural norms and societal values influence what is deemed important to measure (e.g., gender categories, health metrics).
Technical: The tools and systems used (e.g., sensors, platforms, algorithms) define what can be captured and how accurately.
Political: Policies, funding, and power dynamics determine data priorities, access, and framing (e.g., census questions, surveillance practices).
These choices embed bias, exclusions, and power structures into data, affecting both the analysis and the decisions based on it. Data is never neutral—it reflects the worldviews of those who design its collection and use.
(L2): Why is it important to document data provenance, metadata, and transformations?
Documenting data provenance, metadata, and transformations is essential because it ensures:
Transparency – Others can understand where the data came from and how it was processed.
Reproducibility – Analyses can be repeated and validated by others using the same steps.
Accountability – It’s clear who made which decisions, reducing errors and ethical risks.
Contextual understanding – Metadata provides meaning, helping analysts interpret data correctly.
Data quality control – Tracks issues like missing values or inconsistencies introduced during cleaning.
Without documentation, analysis becomes opaque, unreliable, and potentially misleading.
(L2): What risks arise when we treat data as objective or neutral?
Treating data as objective or neutral introduces serious risks:
Bias reinforcement – Hidden biases in data can be mistaken for truths, leading to discriminatory models or decisions.
False legitimacy – Flawed conclusions gain credibility because “the data says so.”
Ethical blind spots – Ignoring the social context of data may result in privacy violations or harm to vulnerable groups.
Oversimplification – Complex social issues may be reduced to misleading numbers or categories.
Uncritical automation – Models trained on biased or incomplete data can make flawed decisions at scale.
In short: assuming neutrality masks the human choices behind data and undermines responsible analysis.
(L2): How can ethical considerations be incorporated at the data collection stage?
Ethical considerations can be incorporated at the data collection stage by:
Obtaining informed consent – Ensure participants know what data is collected, why, and how it will be used.
Minimizing data – Collect only what is necessary to reduce privacy risks.
Ensuring anonymity – Remove or mask personal identifiers where possible (a sketch follows this list).
Being inclusive and fair – Design sampling methods to represent diverse groups and avoid exclusion.
Clarifying purpose and ownership – Be transparent about who owns the data and for what purposes it will be used.
Following legal and ethical standards – Comply with data protection laws (e.g., GDPR) and institutional ethics guidelines.
These steps promote responsible, trustworthy data practices from the outset.
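As an illustration of minimization and masking in code, a sketch in R using the digest package; the data frame and its columns are hypothetical:

```r
# Sketch of data minimization and identifier masking;
# column names (participant_id, age_group, response) are hypothetical
library(digest)  # provides one-way hashing

anonymize <- function(df) {
  # Pseudonymize the identifier with a one-way hash
  df$participant_id <- vapply(
    df$participant_id,
    function(id) digest(id, algo = "sha256"),
    character(1)
  )
  # Data minimization: keep only the variables the analysis needs
  df[, c("participant_id", "age_group", "response")]
}
```

Note that hashing identifiers is pseudonymization rather than full anonymization: under GDPR, pseudonymized data can still count as personal data, so minimization and access controls remain necessary.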
(L3/4): What is the CRISP-DM model, and how does it structure a data science project?
CRISP-DM (Cross-Industry Standard Process for Data Mining) structures a data science project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It provides a flexible, iterative framework to align technical work with business needs.
(L3/4): How does business understanding translate into an analytical objective in CRISP-DM?
The business understanding phase involves clarifying goals, constraints, and success criteria. These are then translated into specific analytical tasks, like predicting customer churn or segmenting users, forming the basis for model development.
(L3/4): Why is iteration essential in the data science solution framework?
Iteration allows for refinement as new insights emerge during data exploration or modeling. It ensures models remain relevant, reliable, and aligned with business goals, especially when assumptions or data quality issues are uncovered later in the process.
(L3/4): What are the differences between analytical and operational deployment of models?
Analytical deployment refers to using the model for decision support (e.g., dashboards, ad hoc analysis).
Operational deployment means embedding the model into automated systems for real-time or repeated use (e.g., credit scoring in loan applications).
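As a sketch of the operational side, an R model can be exposed as a scoring service with the plumber package. The model file, input parameters, and endpoint below are invented, and predict() assumes a glm-style model:

```r
# plumber.R — expose a saved model as an HTTP scoring endpoint
library(plumber)

model <- readRDS("churn_model.rds")  # hypothetical pre-trained model

#* Score a single customer
#* @param tenure:numeric Months as a customer
#* @param spend:numeric Average monthly spend
#* @post /score
function(tenure, spend) {
  newdata <- data.frame(tenure = as.numeric(tenure),
                        spend  = as.numeric(spend))
  list(churn_probability = predict(model, newdata, type = "response"))
}

# Run with: plumber::plumb("plumber.R")$run(port = 8000)
```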
(L3/4): How can you use proxy variables or composite measures in data preparation?
Proxy variables are substitutes when direct measures are unavailable (e.g., using zip code as a proxy for income). Composite measures combine multiple indicators into one (e.g., customer engagement index), often improving interpretability or predictive power.
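A toy sketch of building a composite engagement index in R; the indicators and equal weighting are invented for illustration:

```r
library(dplyr)

# Toy data standing in for real customer records
customers <- data.frame(
  monthly_logins  = c(12, 3, 25, 8),
  purchase_count  = c(4, 0, 9, 2),
  support_tickets = c(1, 5, 0, 2)
)

customers <- customers |>
  mutate(
    # Standardize each indicator so they are on a comparable scale
    logins_z    = as.numeric(scale(monthly_logins)),
    purchases_z = as.numeric(scale(purchase_count)),
    support_z   = as.numeric(scale(support_tickets)),
    # Equal-weight composite; support tickets count against engagement
    engagement_index = (logins_z + purchases_z - support_z) / 3
  )
```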
(L3/4): Why is model evaluation tied to business success criteria, not just statistical accuracy?
A model can be statistically strong yet useless in practice if it doesn’t improve business outcomes. Evaluation should consider metrics like ROI, user adoption, or operational feasibility, not just accuracy or precision.
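A hypothetical sketch in R: with invented retention values and offer costs, a churn model with lower accuracy can still produce higher expected profit:

```r
# Evaluating a churn model in business terms; the values are invented:
# retaining a true churner is worth 100, a retention offer costs 10
offer_cost     <- 10
retained_value <- 100

evaluate <- function(predicted, actual) {
  accuracy <- mean(predicted == actual)
  targeted <- predicted == 1                  # customers sent an offer
  profit   <- sum(actual[targeted]) * retained_value -
              sum(targeted) * offer_cost
  c(accuracy = accuracy, expected_profit = profit)
}

actual <- c(1, 1, 0, 0, 0, 0, 0, 0, 1, 0)

# Higher accuracy (0.9) but misses a churner: profit 180
evaluate(predicted = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0), actual)
# Lower accuracy (0.8) but catches all churners: profit 250
evaluate(predicted = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0), actual)
```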
(L3/4): How does maintaining a reproducible script structure (e.g., load packages → load data → analysis) improve project quality?
A clear structure improves readability, reusability, and makes debugging easier. It also supports replication, which is essential for validating results and collaborating with others on shared codebases.
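A skeleton of that structure in R; the file and column names are placeholders:

```r
# analysis.R — the load packages → load data → analysis structure
# File and column names below are placeholders

## 1. Load packages -----------------------------------------------
library(dplyr)

## 2. Load data ----------------------------------------------------
sales <- read.csv("data/sales.csv")

## 3. Prepare data -------------------------------------------------
sales <- sales |>
  filter(!is.na(revenue)) |>
  mutate(month = factor(month, levels = month.name))

## 4. Analysis -----------------------------------------------------
monthly <- sales |>
  group_by(month) |>
  summarise(total_revenue = sum(revenue))

## 5. Output -------------------------------------------------------
write.csv(monthly, "output/monthly_revenue.csv", row.names = FALSE)
```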
(L3/4): What is the Data Science Solution Framework (DSSF), and why is it useful?
The DSSF is a structured approach to solving business problems with data. It ensures alignment between analytical methods and business needs, guiding teams from problem definition to solution evaluation in a repeatable and goal-oriented way.