Big Data Quiz Flashcards
(20 cards)
- Which of the following is NOT a common principle of responsible
AI?
a. Fairness
b. Explainability
c. Unlimited data collection
d. Transparency
c. Unlimited data collection
- What ethical concern arises when an AI recommendation system
consistently pushes high-risk financial products to certain
demographics?
a. Data privacy violations
b. Discrimination and unfair targeting
c. Historical bias
d. Human bias
b. Discrimination and unfair targeting
- How does AI impact sustainability?
a. AI has no measurable impact on sustainability
b. AI is environmentally neutral by design
c. AI optimizes energy usage in systems but relies on large
datasets and consumes resources
d. AI can replace all carbon-based operations today
c. AI optimizes energy usage in systems but relies on large
datasets and consumes resources
- Which one of the following is a good example of ethical AI
deployment in HR analytics?
a. Using biometric data to predict productivity
b. Avoiding model evaluation to stay unbiased
c. Limiting access to model outcomes
d. Transparently reporting the model’s impact on hiring
diversity
d. Transparently reporting the model’s impact on hiring
diversity
- Which of the following technologies is foundational for training
large-scale language models like ChatGPT?
a. Transformers
b. Anomaly detection
c. Blockchain
d. Simple neural networks
a. Transformers
- What is metadata useful for?
a. To delete raw data efficiently
b. To improve .csv compression
c. To help ensure stored data is searchable and usable
d. To get data from multiple sources
c. To help ensure stored data is searchable and usable
- What potential issue should you check for in the following
code?
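(The code snippet for this card was not preserved; a minimal reconstruction consistent with the answer options, assuming a pandas merge of df1 and df2 on "customer_id" with illustrative values:)

    import pandas as pd

    # Hypothetical reconstruction: join two dataframes on a shared key.
    # pd.merge raises a KeyError if "customer_id" is missing from either frame.
    df1 = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [100, 200, 300]})
    df2 = pd.DataFrame({"customer_id": [2, 3, 4], "region": ["EU", "US", "APAC"]})

    merged = pd.merge(df1, df2, on="customer_id")
    print(merged)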
a. Whether df1 contains numeric values
b. Whether “customer_id” exists in both dataframes
c. Whether the print statement is needed
d. Whether df2 has more than 1000 rows
b. Whether “customer_id” exists in both dataframes
- Which of the following prompts would produce the best
exploratory data cleaning for a new customer dataset?
a. Write the code to remove all nulls in the dataset and plot
everything. Simulate running the code and share your
results.
b. Write Python code to identify the missing values,
outliers, and inconsistent formats in the dataset. Then
suggest corrections and write the corresponding Python
code. Simulate running the code and share your results.
c. Write Python code to clean the dataset completely.
Simulate running it and share your results.
d. List all potential data cleaning issues that may arise in
customer datasets, along with a suggested solution for each.
Write the Python code to detect each issue and resolve it
based on the suggested solution. Simulate running the code
and share your findings.
d. List all potential data cleaning issues that may arise in
customer datasets, along with a suggested solution for each.
Write the Python code to detect each issue and resolve it
based on the suggested solution. Simulate running the code
and share your findings.
- What is the most likely output of the following Python code?
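(The snippet was not preserved; a minimal sketch consistent with the answer, assuming a pandas Series whose missing value is filled with the column mean; the values are assumptions chosen so the mean is 4.33:)

    import pandas as pd

    # Hypothetical reconstruction: fill the missing value with the column mean.
    s = pd.Series([4.0, 5.0, None, 4.0])
    s = s.fillna(round(s.mean(), 2))  # mean of [4, 5, 4] = 4.33
    print(s)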
a. Error due to NoneType
b. Missing value replaced with 0
c. Row dropped automatically
d. Missing value replaced with 4.33
d. Missing value replaced with 4.33
- Which of the following best explains the role of libraries in
Python? (0.5pts)
a. They provide reusable and optimized components for
modeling and data handling
b. They slow down execution
c. They remove the need for logic
d. They visualize predictions only
a. They provide reusable and optimized components for
modeling and data handling
- Suppose you want to train a RandomForest model to classify
continuous revenue data. Which of the following prompts
would be a good starting point? (2pts)
a. Write the Python code to classify revenues with
RandomForest in the attached dataset. Simulate running the
code and share your results
b. Write the Python code to clean the attached dataset, split
it into training and testing, train the RandomForest model,
evaluate with accuracy and precision, and explain key
features. Simulate running the code and share your results
c. Write the Python code to split the data into training and
testing, train the RandomForest model, clean the data,
evaluate with accuracy and precision, and explain key
features. Simulate running the code and share your results
d. Write the Python code to clean and classify the dataset with
the RandomForest model. Share your findings.
b. Write the Python code to clean the attached dataset, split
it into training and testing, train the RandomForest model,
evaluate with accuracy and precision, and explain key
features. Simulate running the code and share your results
- You are working with a preprocessed DataFrame df containing
company-level financial features and a binary target variable
called Exited. Which of the following code snippets is most
appropriate for training and evaluating a classification model?
See the mock exam for the candidate code snippets.
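(As a reference point, a minimal sketch of an appropriate workflow; the placeholder data and feature names stand in for the real df and are assumptions:)

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the preprocessed DataFrame `df`;
    # the feature names are illustrative assumptions.
    df = pd.DataFrame({
        "revenue_growth": [0.10, -0.20, 0.05, 0.30, -0.10, 0.20, 0.00, -0.30],
        "debt_ratio":     [0.50, 0.90, 0.40, 0.20, 0.80, 0.30, 0.60, 1.00],
        "Exited":         [0, 1, 0, 0, 1, 0, 0, 1],
    })

    X = df.drop(columns=["Exited"])
    y = df["Exited"]

    # Hold out a test set so the model is never evaluated on its training data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))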
- You want to use ChatGPT to generate a synthetic dataset for a
model that predicts whether a customer will repurchase. Which
of the following prompts is the most appropriate? (1.5pts)
a. Generate a dataset of 10,000 customers with features such
as Age, Gender, and Income. Assign the Repurchase column
randomly with values 0 and 1. Include at least 1,000
repurchases to reflect real-world class balance. Return a
Pandas DataFrame.
b. Create a synthetic dataset for repurchase prediction with
10,000 observations. Include features: Age, Region,
SpendingScore, and LoyaltyLevel. Label customers as 1 if
SpendingScore > 50 and 0 otherwise. Add slight variation in
SpendingScore using Gaussian noise.
c. Create a synthetic dataset with 10,000 customers including
Age, TimeSinceLastPurchase, TotalSpend, and
LoyaltyScore. The Repurchase column should be
probabilistically determined, with higher loyalty and recent
purchases increasing repurchase likelihood. Add some
noise and ensure the dataset is balanced but realistic.
Output as a Pandas DataFrame ready for classification
modeling.
d. Generate a synthetic customer dataset with Income, Tenure,
and SatisfactionScore as features. Assign the Repurchase
label using a 60/40 class distribution. Use random sampling
for each feature from standard distributions and ensure the
data has no missing values or outliers.
c. Create a synthetic dataset with 10,000 customers including
Age, TimeSinceLastPurchase, TotalSpend, and
LoyaltyScore. The Repurchase column should be
probabilistically determined, with higher loyalty and recent
purchases increasing repurchase likelihood. Add some
noise and ensure the dataset is balanced but realistic.
Output as a Pandas DataFrame ready for classification
modeling.
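(A minimal sketch of what prompt (c) asks for: plausible feature distributions and a probabilistically assigned Repurchase label driven by loyalty and recency, with added noise. All distributions and coefficients are assumptions:)

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 10_000

    # Illustrative feature distributions (assumptions, not from the quiz).
    df = pd.DataFrame({
        "Age": rng.integers(18, 75, n),
        "TimeSinceLastPurchase": rng.exponential(30, n),  # days
        "TotalSpend": rng.gamma(2.0, 150.0, n),
        "LoyaltyScore": rng.uniform(0, 100, n),
    })

    # Higher loyalty and more recent purchases raise repurchase probability;
    # Gaussian noise keeps the labels from being deterministic.
    logit = (0.04 * df["LoyaltyScore"]
             - 0.04 * df["TimeSinceLastPurchase"]
             - 0.8
             + rng.normal(0.0, 0.5, n))
    prob = 1 / (1 + np.exp(-logit))
    df["Repurchase"] = (rng.random(n) < prob).astype(int)

    print(df["Repurchase"].mean())  # sanity-check the class balance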
- You asked ChatGPT to write a Python script for sentiment
analysis. Below is the code it returned. What is the main issue
with this script? (1.5pts)
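(The returned script was not preserved; a minimal reconstruction consistent with the answer: TF-IDF plus Logistic Regression is a valid pipeline, but the model is evaluated on its own training data. The example texts are assumptions:)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    texts = ["great product", "terrible service", "loved it", "awful experience"]
    labels = [1, 0, 1, 0]

    X = TfidfVectorizer().fit_transform(texts)
    model = LogisticRegression().fit(X, labels)

    # Flaw: predictions are made on the same data the model was trained on,
    # so the reported accuracy is inflated and meaningless.
    print(accuracy_score(labels, model.predict(X)))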
a. The model uses the wrong algorithm for sentiment analysis
— Logistic Regression is not applicable here.
b. The script attempts to use TF-IDF on numeric data, which is
not allowed.
c. The evaluation is invalid because the model is tested on
the same data it was trained on.
d. The model should have been trained with raw text instead
of transformed features.
c. The evaluation is invalid because the model is tested on
the same data it was trained on.
- You asked ChatGPT to generate a script to forecast revenue
using historical quarterly data. Below is the code. What is the
main issue with this forecasting workflow? (1.5pts)
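(The code was not preserved; a minimal reconstruction consistent with the answer, assuming statsmodels' ARIMA fitted directly on trending raw revenue with no stationarity check, differencing, or transformation. The figures are illustrative:)

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Clearly trending (non-stationary) quarterly revenue.
    revenue = pd.Series(
        [100, 120, 150, 200, 260, 340, 450, 600],
        index=pd.period_range("2022Q1", periods=8, freq="Q"),
    )

    # Flaw: no ADF test, differencing, or log transform before fitting;
    # order=(1, 0, 0) applies no differencing to the trending series.
    model = ARIMA(revenue, order=(1, 0, 0)).fit()
    print(model.forecast(steps=4))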
a. The ARIMA model is misused because it requires at least
three seasonal components.
b. The model is fitted on revenue values without checking for
stationarity or transforming the data.
c. The plot does not visualize confidence intervals.
d. ARIMA should not be used for numeric values such as
revenue — only for categorical time series.
b. The model is fitted on revenue values without checking for
stationarity or transforming the data.
- You are designing a custom GPT for your company’s customer
support operations. The goal is to streamline complaint
handling while maintaining safety, alignment, and
accountability. Which of the following GPT design descriptions
is the most complete and appropriate? (2pts)
a. Create a GPT that uses brand-aligned tone guidelines to
draft complaint responses and proactively generates
discount offers or refund messages when sentiment analysis
detects dissatisfaction. It should simulate empathy and
avoid escalating issues unless asked.
b. Build a GPT that automatically analyzes support messages,
pulls customer history via API tools, generates apology
messages, and closes tickets that match low-priority
thresholds. Use few-shot examples to maintain a consistent
tone across responses.
c. Design a GPT that classifies incoming complaints into
urgency levels, drafts response templates, and routes critical
tickets to managers. Include access to internal product
manuals via tools, and allow the GPT to update resolution
logs based on its outputs.
d. Create a GPT that operates within a narrow scope:
responding to customer complaints using predefined tone
guidance, flagging edge cases to human agents, and using
tools only for retrieving ticket summaries. All actions must
be logged. It should use chain-of-thought reasoning to
explain complex responses and remain auditable at all
times.
d. Create a GPT that operates within a narrow scope:
responding to customer complaints using predefined tone
guidance, flagging edge cases to human agents, and using
tools only for retrieving ticket summaries. All actions must
be logged. It should use chain-of-thought reasoning to
explain complex responses and remain auditable at all
times.
- You are tasked with evaluating two GPTs: one for internal
employee use (internal-facing), and one for responding directly
to customer complaints (customer-facing). Which of the
following design comparisons is most accurate and appropriate?
(1pt)
a. Internal-facing GPTs should be tightly constrained in
language and tone to avoid misunderstandings, while
customer-facing GPTs can operate more freely since they
only surface templated responses.
b. Internal-facing GPTs should log all outputs and avoid using
external tools, while customer-facing GPTs can access full
toolsets (e.g., customer databases) as long as they are
accurate.
c. Customer-facing GPTs require stricter tone control, tool
access limits, and logging for traceability, while internal-facing
GPTs can be more flexible in tone and reasoning
style, assuming access is permissioned and outputs remain
internal.
d. Customer-facing GPTs are safer by default because their
responses are short and templated, while internal GPTs can
be riskier since they interact with employees and may
hallucinate.
c. Customer-facing GPTs require stricter tone control, tool
access limits, and logging for traceability, while internal-facing
GPTs can be more flexible in tone and reasoning
style, assuming access is permissioned and outputs remain
internal.
- Which of the following scenarios best illustrates the distinction
between supervised, unsupervised, and reinforcement
learning? (0.5pts)
a. A model clusters transaction records to detect anomalies,
then retrains on those clusters using reinforcement-based
feedback loops.
b. An image classifier is fine-tuned using labeled data, while a
recommendation engine groups user behaviors into
clusters, and an ad bidding agent learns through repeated
environment interaction and feedback-based rewards.
c. In supervised learning, models are trained on unlabeled
patterns; in unsupervised learning, labels guide clustering
decisions; reinforcement learning involves training without
feedback.
d. A time series forecasting model predicts demand using
labeled past data; a second model filters noise from the
dataset using autoencoders; a third one uses reward
functions to classify inputs.
b. An image classifier is fine-tuned using labeled data, while a
recommendation engine groups user behaviors into
clusters, and an ad bidding agent learns through repeated
environment interaction and feedback-based rewards.
- What best captures the relationship between AI and Big Data in
business contexts? (0.5pts)
a. Big Data techniques provide cloud storage and pipeline
design, while AI models focus on inference speed.
b. AI enables automated decisions, but requires pre-aggregated
Big Data that is pre-cleaned and sampled to a
manageable size.
c. Big Data infrastructures allow for continuous, high-volume
data collection and storage, which fuels the performance
and adaptability of AI systems today.
d. AI operates independently of Big Data, relying instead on
curated structured datasets and embedded algorithms.
c. Big Data infrastructures allow for continuous, high-volume
data collection and storage, which fuels the performance
and adaptability of AI systems today.
- Which of the following statements about Artificial General
Intelligence (AGI) and quantum computing is most accurate?
(0.5pts)
a. AGIs will exist once quantum processors surpass classical
GPUs in deterministic logic and data labeling speed.
b. Quantum computing is required for any form of AGI,
because classical computation lacks the power to support
consciousness models.
c. AGIs may one day benefit from quantum computing’s
extremely high processing abilities
d. AGIs and quantum computing are both being studied by
large AI labs, but are unrelated and will likely remain siloed
fields
c. AGIs may one day benefit from quantum computing’s
extremely high processing abilities