Big Data Quiz Flashcards

(20 cards)

1
Q

Which of the following is NOT a common principle of responsible AI?
a. Fairness
b. Explainability
c. Unlimited data collection
d. Transparency

A

c. Unlimited data collection

2
Q
What ethical concern arises when an AI recommendation system consistently pushes high-risk financial products to certain demographics?
a. Data privacy violations
b. Discrimination and unfair targeting
c. Historical bias
d. Human bias
A

b. Discrimination and unfair targeting

3
Q
How does AI impact sustainability?
a. AI has no measurable impact on sustainability
b. AI is environmentally neutral by design
c. AI optimizes energy usage in systems but relies on large datasets and consumes resources
d. AI can replace all carbon-based operations today
A

c. AI optimizes energy usage in systems but relies on large datasets and consumes resources

4
Q
Which one of the following is a good example of ethical AI deployment in HR analytics?
a. Using biometric data to predict productivity
b. Avoiding model evaluation to stay unbiased
c. Limiting access to model outcomes
d. Transparently reporting the model’s impact on hiring diversity
A

d. Transparently reporting the model’s impact on hiring diversity

5
Q
Which of the following technologies is foundational for training large-scale language models like ChatGPT?
a. Transformers
b. Anomaly detection
c. Blockchain
d. Simple neural networks
A

a. Transformers

6
Q
What is metadata useful for?
a. To delete raw data efficiently
b. To improve .csv compression
c. To help ensure stored data is searchable and usable
d. To get data from multiple sources
A

c. To help ensure stored data is searchable and usable

7
Q
What potential issue should you check for in the following code?
a. Whether df1 contains numeric values
b. Whether "customer_id" exists in both DataFrames
c. Whether the print statement is needed
d. Whether df2 has more than 1000 rows
A

b. Whether "customer_id" exists in both DataFrames
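
Note: the code snippet this question refers to did not survive the export. A plausible reconstruction, inferred from the answer options (the frames and column names here are hypothetical):

    import pandas as pd

    # Hypothetical reconstruction; the original snippet is missing from this export.
    df1 = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [100.0, 250.0, 80.0]})
    df2 = pd.DataFrame({"customer_id": [2, 3, 4], "region": ["EU", "US", "APAC"]})

    # merge() raises a KeyError if "customer_id" is missing from either frame,
    # which is the check the correct answer points to.
    merged = df1.merge(df2, on="customer_id")
    print(merged)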

8
Q
Which of the following prompts would result in a better-quality exploratory data cleaning for a new customer dataset?
a. Write the code to remove all nulls in the dataset and plot everything. Simulate running the code and share your results.
b. Write Python code to determine the missing values, outliers, and inconsistent formats in the dataset. Then suggest corrections and write the corresponding Python code. Simulate running the code and share your results.
c. Write Python code to clean the dataset completely. Simulate running it and share your results.
d. List all potential data cleaning issues that may arise in customer datasets, along with suggested solutions for each. Write the Python code to detect each potential issue and resolve it based on the suggested solutions. Simulate running the code and share your findings.
A

d. List all potential data cleaning issues that may arise in customer datasets, along with suggested solutions for each. Write the Python code to detect each potential issue and resolve it based on the suggested solutions. Simulate running the code and share your findings.
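
Note: as a rough illustration of the detection pass the winning prompt asks for (the file name and column names are assumptions):

    import pandas as pd

    df = pd.read_csv("customers.csv")  # assumed file and column names throughout

    # 1. Missing values per column
    print(df.isna().sum())

    # 2. IQR-based outlier count for an assumed numeric column
    q1, q3 = df["annual_spend"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df["annual_spend"] < q1 - 1.5 * iqr) | (df["annual_spend"] > q3 + 1.5 * iqr)
    print(mask.sum(), "potential outliers")

    # 3. Inconsistent formats, e.g. mixed-case or padded country labels
    print(df["country"].str.strip().str.lower().value_counts())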

9
Q
What is the most likely output of the following Python code?
a. Error due to NoneType
b. Missing value replaced with 0
c. Row dropped automatically
d. Missing value replaced with 4.33
A

d. Missing value replaced with 4.33
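
Note: the snippet is missing here as well; the answer implies mean imputation along these lines (the values are hypothetical, chosen so the mean comes out to 4.33):

    import pandas as pd

    # Illustrative values: the mean of the non-missing entries is (4 + 4 + 5) / 3 = 4.33 (rounded).
    s = pd.Series([4.0, 4.0, 5.0, None])

    # fillna() with the column mean replaces the missing entry with 4.33.
    print(s.fillna(s.mean()))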

10
Q

Which of the following best explains the role of libraries in Python? (0.5pts)
a. They provide reusable and optimized components for modeling and data handling
b. They slow down execution
c. They remove the need for logic
d. They visualize predictions only

A

a. They provide reusable and optimized components for modeling and data handling

11
Q
Suppose you want to train a RandomForest model to classify continuous revenue data. Which of the following prompts would be a good starting point? (2pts)
a. Write the Python code to classify revenues with RandomForest in the attached dataset. Simulate running the code and share your results.
b. Write the Python code to clean the attached dataset, split it into training and testing sets, train the RandomForest model, evaluate with accuracy and precision, and explain key features. Simulate running the code and share your results.
c. Write the Python code to split the data into training and testing sets, train the RandomForest model, clean the data, evaluate with accuracy and precision, and explain key features. Simulate running the code and share your results.
d. Write the Python code to clean and classify the dataset with the RandomForest model. Share your findings.
A

b. Write the Python code to clean the attached dataset, split it into training and testing sets, train the RandomForest model, evaluate with accuracy and precision, and explain key features. Simulate running the code and share your results.
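
Note: a minimal sketch of the workflow the correct prompt describes; since revenue is continuous, it must be binned into classes before a classifier applies (the file name, feature layout, and bin count are assumptions):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score

    df = pd.read_csv("revenue.csv")  # assumed file; assumed all-numeric features
    df = df.dropna()                 # clean first, before splitting

    # Revenue is continuous, so bin it into classes before classification.
    df["revenue_class"] = pd.qcut(df["revenue"], q=3, labels=["low", "mid", "high"])

    X = df.drop(columns=["revenue", "revenue_class"])
    y = df["revenue_class"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, pred))
    print("precision:", precision_score(y_test, pred, average="macro"))

    # Key features, as the prompt requests
    print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))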

12
Q
You are working with a preprocessed DataFrame df containing company-level financial features and a binary target variable called Exited. Which of the following code snippets is most appropriate for training and evaluating a classification model?
A

See the mock exam for the full snippet.
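
Note: the mock isn't reproduced here, but an appropriate snippet follows this shape, splitting before training and evaluating on held-out data (the model choice and stand-in data are assumptions):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Tiny stand-in for the preprocessed DataFrame described in the question.
    df = pd.DataFrame({
        "revenue_growth": [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.05, -0.3, 0.15, -0.25],
        "debt_ratio":     [0.5,  0.9, 0.2, 0.6,  0.8, 0.3, 0.4,  0.7, 0.35,  0.85],
        "Exited":         [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    })

    X = df.drop(columns=["Exited"])
    y = df["Exited"]

    # Hold out a test set so the model is evaluated on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test), zero_division=0))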

13
Q
You want to use ChatGPT to generate a synthetic dataset for a model that predicts whether a customer will repurchase. Which of the following prompts is the most appropriate? (1.5pts)
a. Generate a dataset of 10,000 customers with features such as Age, Gender, and Income. Assign the Repurchase column randomly with values 0 and 1. Include at least 1,000 repurchases to reflect real-world class balance. Return a Pandas DataFrame.
b. Create a synthetic dataset for repurchase prediction with 10,000 observations. Include features: Age, Region, SpendingScore, and LoyaltyLevel. Label customers as 1 if SpendingScore > 50 and 0 otherwise. Add slight variation in SpendingScore using Gaussian noise.
c. Create a synthetic dataset with 10,000 customers including Age, TimeSinceLastPurchase, TotalSpend, and LoyaltyScore. The Repurchase column should be probabilistically determined, with higher loyalty and recent purchases increasing repurchase likelihood. Add some noise and ensure the dataset is balanced but realistic. Output as a Pandas DataFrame ready for classification modeling.
d. Generate a synthetic customer dataset with Income, Tenure, and SatisfactionScore as features. Assign the Repurchase label using a 60/40 class distribution. Use random sampling for each feature from standard distributions and ensure the data has no missing values or outliers.
A

c. Create a synthetic dataset with 10,000 customers including Age, TimeSinceLastPurchase, TotalSpend, and LoyaltyScore. The Repurchase column should be probabilistically determined, with higher loyalty and recent purchases increasing repurchase likelihood. Add some noise and ensure the dataset is balanced but realistic. Output as a Pandas DataFrame ready for classification modeling.
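
Note: option c could be realized roughly as follows (the distributions, coefficients, and intercept are assumptions):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 10_000

    df = pd.DataFrame({
        "Age": rng.integers(18, 75, n),
        "TimeSinceLastPurchase": rng.exponential(30.0, n),  # days
        "TotalSpend": rng.gamma(2.0, 150.0, n),
        "LoyaltyScore": rng.uniform(0, 100, n),
    })

    # Higher loyalty and more recent purchases raise the repurchase probability;
    # the -1.0 intercept is tuned toward rough class balance, and the noise term
    # keeps the labels from being perfectly separable.
    logit = (0.04 * df["LoyaltyScore"] - 0.03 * df["TimeSinceLastPurchase"]
             - 1.0 + rng.normal(0, 0.5, n))
    prob = 1 / (1 + np.exp(-logit))
    df["Repurchase"] = rng.binomial(1, prob)

    print(df["Repurchase"].mean())  # sanity-check the class balance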

14
Q
You asked ChatGPT to write a Python script for sentiment analysis. Below is the code it returned. What is the main issue with this script? (1.5pts)
a. The model uses the wrong algorithm for sentiment analysis; Logistic Regression is not applicable here.
b. The script attempts to use TF-IDF on numeric data, which is not allowed.
c. The evaluation is invalid because the model is tested on the same data it was trained on.
d. The model should have been trained with raw text instead of transformed features.
A

c. The evaluation is invalid because the model is tested on the same data it was trained on.
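
Note: the fix is to hold out data before evaluating. A minimal corrected sketch (the sample texts and labels are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    texts = ["great product", "terrible support", "loved it", "awful experience",
             "works fine", "never again", "excellent value", "very disappointing"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0]

    # Split FIRST, then fit TF-IDF on the training text only,
    # so the test set stays unseen during both vectorization and training.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=42)

    vec = TfidfVectorizer()
    X_train_tfidf = vec.fit_transform(X_train)
    X_test_tfidf = vec.transform(X_test)

    model = LogisticRegression()
    model.fit(X_train_tfidf, y_train)
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test_tfidf)))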

15
Q
You asked ChatGPT to generate a script to forecast revenue using historical quarterly data. Below is the code. What is the main issue with this forecasting workflow? (1.5pts)
a. The ARIMA model is misused because it requires at least three seasonal components.
b. The model is fitted on revenue values without checking for stationarity or transforming the data.
c. The plot does not visualize confidence intervals.
d. ARIMA should not be used for numeric values such as revenue, only for categorical time series.
A

b. The model is fitted on revenue values without checking for stationarity or transforming the data.
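
Note: a corrected workflow tests for stationarity and differences the series first. A minimal sketch using statsmodels (the quarterly series here is synthetic):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic quarterly revenue with a trend, standing in for the real data.
    idx = pd.period_range("2015Q1", periods=32, freq="Q")
    revenue = pd.Series(
        100 + 3 * np.arange(32) + np.random.default_rng(0).normal(0, 5, 32),
        index=idx)

    # Check stationarity before fitting; difference once if the ADF test fails.
    p_value = adfuller(revenue)[1]
    d = 0 if p_value < 0.05 else 1
    print(f"ADF p-value: {p_value:.3f} -> using d={d}")

    model = ARIMA(revenue, order=(1, d, 1)).fit()
    print(model.forecast(steps=4))  # next four quarters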

16
Q
You are designing a custom GPT for your company’s customer support operations. The goal is to streamline complaint handling while maintaining safety, alignment, and accountability. Which of the following GPT design descriptions is the most complete and appropriate? (2pts)
a. Create a GPT that uses brand-aligned tone guidelines to draft complaint responses and proactively generates discount offers or refund messages when sentiment analysis detects dissatisfaction. It should simulate empathy and avoid escalating issues unless asked.
b. Build a GPT that automatically analyzes support messages, pulls customer history via API tools, generates apology messages, and closes tickets that match low-priority thresholds. Use few-shot examples to maintain a consistent tone across responses.
c. Design a GPT that classifies incoming complaints into urgency levels, drafts response templates, and routes critical tickets to managers. Include access to internal product manuals via tools, and allow the GPT to update resolution logs based on its outputs.
d. Create a GPT that operates within a narrow scope: responding to customer complaints using predefined tone guidance, flagging edge cases to human agents, and using tools only for retrieving ticket summaries. All actions must be logged. It should use chain-of-thought reasoning to explain complex responses and remain auditable at all times.
A

d. Create a GPT that operates within a narrow scope: responding to customer complaints using predefined tone guidance, flagging edge cases to human agents, and using tools only for retrieving ticket summaries. All actions must be logged. It should use chain-of-thought reasoning to explain complex responses and remain auditable at all times.

17
Q
You are tasked with evaluating two GPTs: one for internal employee use (internal-facing), and one for responding directly to customer complaints (customer-facing). Which of the following design comparisons is most accurate and appropriate? (1pt)
a. Internal-facing GPTs should be tightly constrained in language and tone to avoid misunderstandings, while customer-facing GPTs can operate more freely since they only surface templated responses.
b. Internal-facing GPTs should log all outputs and avoid using external tools, while customer-facing GPTs can access full toolsets (e.g., customer databases) as long as they are accurate.
c. Customer-facing GPTs require stricter tone control, tool access limits, and logging for traceability, while internal-facing GPTs can be more flexible in tone and reasoning style, assuming access is permissioned and outputs remain internal.
d. Customer-facing GPTs are safer by default because their responses are short and templated, while internal GPTs can be riskier since they interact with employees and may hallucinate.
A

c. Customer-facing GPTs require stricter tone control, tool access limits, and logging for traceability, while internal-facing GPTs can be more flexible in tone and reasoning style, assuming access is permissioned and outputs remain internal.

18
Q

Which of the following scenarios best illustrates the distinction between supervised, unsupervised, and reinforcement learning? (0.5pts)
a. A model clusters transaction records to detect anomalies, then retrains on those clusters using reinforcement-based feedback loops.
b. An image classifier is fine-tuned using labeled data, while a recommendation engine groups user behaviors into clusters, and an ad bidding agent learns through repeated environment interaction and feedback-based rewards.
c. In supervised learning, models are trained on unlabeled patterns; in unsupervised learning, labels guide clustering decisions; reinforcement learning involves training without feedback.
d. A time series forecasting model predicts demand using labeled past data; a second model filters noise from the dataset using autoencoders; a third one uses reward functions to classify inputs.

A

b. An image classifier is fine-tuned using labeled data, while a recommendation engine groups user behaviors into clusters, and an ad bidding agent learns through repeated environment interaction and feedback-based rewards.

19
Q

What best captures the relationship between AI and Big Data in business contexts? (0.5pts)
a. Big Data techniques provide cloud storage and pipeline design, while AI models focus on inference speed.
b. AI enables automated decisions, but requires pre-aggregated Big Data that is pre-cleaned and sampled to a manageable size.
c. Big Data infrastructures allow for continuous, high-volume data collection and storage, which fuels the performance and adaptability of AI systems today.
d. AI operates independently of Big Data, relying instead on curated structured datasets and embedded algorithms.

A

c. Big Data infrastructures allow for continuous, high-volume data collection and storage, which fuels the performance and adaptability of AI systems today.

20
Q

Which of the following statements about Artificial General Intelligence (AGI) and quantum computing is most accurate? (0.5pts)
a. AGIs will exist once quantum processors surpass classical GPUs in deterministic logic and data labeling speed.
b. Quantum computing is required for any form of AGI, because classical computation lacks the power to support consciousness models.
c. AGIs may one day benefit from quantum computing’s extremely high processing abilities.
d. AGIs and quantum computing are both being studied by large AI labs, but are unrelated and will likely remain siloed fields.

A

c. AGIs may one day benefit from quantum computing’s extremely high processing abilities.