ExamQuestions Flashcards

(193 cards)

1
Q

What is the difference between Q-learning and SARSA?

A

Both are used in reinforcement learning: Q-learning is off-policy, SARSA is on-policy.

Use SARSA: When safety during exploration matters (e.g., robot navigation near obstacles).
Use Q-learning: When learning optimal policy is more important than short-term realism (e.g., games, simulations).
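A minimal sketch of the two tabular update rules in Python (the dict-based Q-table and the variable names are illustrative assumptions, not taken from any particular library):

```python
# Q maps (state, action) -> estimated value; alpha is the learning rate, gamma the discount factor.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the best action in the next state,
    # regardless of which action the behavior policy will actually take.
    best_next = max(Q.get((s_next, a_), 0.0) for a_ in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action the agent actually takes next.
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))
```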

2
Q

Why is reproducibility important in AI?

A

To verify and trust scientific results and models.

3
Q

What is the MEU principle?

A

Maximum Expected Utility – choose the action with highest expected utility.

4
Q

What is the role of the learning rate α in RL?

A

It determines how much new information overrides old estimates.

5
Q

What is a transition matrix?

A

It defines the probabilities of moving from one state to another.

6
Q

What is the importance of AI transparency?

A

To ensure accountability and trust in AI systems.

7
Q

What is batch normalization?

A

A technique to normalize layer inputs for faster and more stable training.

8
Q

Why is the decision node not allowed to influence chance nodes?

A

Because decisions don’t change the underlying state, only actions.

In a decision network, a decision node is not allowed to influence a chance node because that would imply the agent controls a random event directly, which violates the principle of modeling uncertainty. The probability that it rains tomorrow shouldn’t change based on the robot’s choice.

9
Q

What is the difference between hidden and observed variables in HMMs?

A

A Hidden Markov Model (HMM) is a statistical model used to describe systems that evolve over time with hidden internal states that produce observable outputs.

States (S) – The internal, unobservable states of the system (e.g., “sunny” or “rainy” when you are inside).
Observations (O) – What you can see or measure (e.g., someone carrying an umbrella).
Transition Probabilities (P(Sₜ | Sₜ₋₁)) – Probability of moving from one state to another.
Emission Probabilities (P(Oₜ | Sₜ)) – Probability of an observation given a state.
Initial State Distribution (P(S₀)) – Probability of starting in each state.

10
Q

What is backpropagation?

A

Backpropagation is the algorithm used to train neural networks.
It tells us how to adjust the weights in the network to reduce the error between predicted and actual output.
An algorithm to compute the gradient of the loss function w.r.t. each weight.

11
Q

Why is it important that a Bayesian network is a DAG?

A

Because it avoids cycles, which would make probability calculations inconsistent.

12
Q

What are convolutional filters?

A

Small weight matrices applied across an input to detect local features. A convolutional filter is a small matrix (like 3×3 or 5×5) of weights that is slid over an image (or feature map) to detect patterns, like edges, textures, or other features.

13
Q

What is a heuristic?

A

A heuristic is a strategy or rule-of-thumb that helps an algorithm make decisions faster by estimating how close a state is to the goal. In AI search, a heuristic is a function:
h(n) = estimated cost from node n to a goal

14
Q

What is tokenization?

A

Splitting text into words or sub-word units.

15
Q

What is stemming?

A

Reducing words to their root forms. The goal is to group different forms of a word so they can be treated as the same during tasks like text classification or search. Playing -> play

16
Q

How does observing a collider activate a path?

A

A collider is a node that has two incoming arrows. A → C ← B
Without any observation: A and B are independent.
A and B become dependent given C
Suppose:
A = “burglary”
B = “earthquake”
C = “alarm goes off”
If you know the alarm went off (C), learning there was a burglary (A) makes it less likely there was also an earthquake (B) — they now compete to explain the same event.

17
Q

What is algorithmic bias?

A

Systematic errors in decision making due to biased training data.
Algorithmic bias occurs when an algorithm systematically produces unfair, prejudiced, or discriminatory outcomes — usually because it has learned patterns from biased training data or is influenced by biased assumptions in its design.

18
Q

What does it mean if an MDP has a stationary policy?

A

In a Markov Decision Process (MDP), a stationary policy is a decision strategy where the action the agent chooses in each state does not change over time. The best action depends only on the current state.

19
Q

What is L2 regularization?

A

L2 regularization adds a penalty term to the loss function to discourage the model from learning large weights. It doesn’t change the goal of minimizing error — it just adds a “cost” to making the model too complex.

A penalty on the squared values of the weights to reduce overfitting.

20
Q

What are the three main problems solved by HMMs?

A

Evaluation: Compute the probability of an observed sequence, using Forward algorithm

Decoding: Find the most likely sequence of hidden states, using Viterbi algorithm

Learning: Adjust model parameters to best explain observed data, using Baum-Welch algorithm (an EM algorithm)

21
Q

What is an expert system?

A

An expert system is a type of AI program designed to replicate the decision-making ability of a human expert in a specific domain. It uses rules, facts, and a reasoning engine to draw conclusions and solve problems.

22
Q

What is an episode in RL?

A

An episode is a complete sequence of interactions between an RL agent and the environment, starting from an initial state and ending in a terminal state (or after a set number of steps).
It’s like one full run of the agent trying to achieve its goal — e.g., finishing a game, navigating a maze, or completing a task.

23
Q

What is the backward algorithm used for?

A

The backward algorithm is one of the core algorithms used in Hidden Markov Models (HMMs). It computes the probability of the remaining portion of the observation sequence given a current state: if the system is in state s_t at time t, what is the probability of seeing the observations from time t+1 to the end? In short, it computes the probability of future observations from a given state.

24
Q

What is a recurrent neural network (RNN)?

A

A Recurrent Neural Network (RNN) is a type of neural network designed to work with sequential data, such as time series, text, or speech. Unlike standard neural networks, RNNs have memory — they can retain information from previous inputs to help influence future predictions.

25
What is marginalization in probability?
Marginalization is the process of summing (or integrating) out one or more variables to get the probability distribution over a subset of variables. It's used to simplify complex distributions by removing unneeded variables. If you have two variables, X and Y, and you want the marginal probability of X, you sum over all values of Y. P(X=x) = sum_y P(X=x, Y=y).
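A tiny Python illustration of marginalizing Y out of a joint distribution (the numbers are made up):

```python
# Marginalizing Y out of a small joint distribution P(X, Y).
joint = {("x1", "y1"): 0.2, ("x1", "y2"): 0.1,
         ("x2", "y1"): 0.4, ("x2", "y2"): 0.3}

p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p   # P(X=x) = sum over y of P(X=x, Y=y)

print(p_x)  # approximately {'x1': 0.3, 'x2': 0.7}
```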
26
How does a decision network differ from a Bayesian network?
Bayesian network:
Purpose: Represent probabilistic relationships among variables
Nodes: Only chance nodes (random variables)
Output: Probabilities of outcomes
Used for: Inference, prediction, diagnosis
Decision network:
Purpose: Make decisions under uncertainty based on outcomes and utilities
Nodes: Chance, decision, and utility nodes
Output: Optimal decisions to maximize expected utility
Used for: Rational decision-making, planning
27
What is the difference between joint and conditional probability?
Joint is the probability of events together; conditional is given another event.
28
What is explainability in AI?
Explainability (or interpretability) in AI refers to how easily humans can understand why and how an AI system made a particular decision or prediction. 🤖🗣️ It’s about making AI transparent, trustworthy, and accountable to users, developers, and regulators.
29
What is a benefit of symbolic AI in terms of ethics?
Symbolic AI uses explicit rules, logic, and symbols, with human-readable knowledge bases (like "IF fever AND cough THEN flu"). Modern machine learning (like deep learning) learns patterns automatically and often works like a black box that is hard to inspect or explain. Symbolic AI is more interpretable and easier to audit: symbolic systems can clearly explain why they made a decision, using human-understandable logic and rules.
30
What is the forward algorithm used for?
The forward algorithm computes the probability of a sequence of observations given an HMM: how likely is it that this observed sequence was generated by this model? In short, it is used to compute the probability of an observed sequence.
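A minimal Python sketch of the forward algorithm on a toy umbrella-style HMM (all probabilities are invented for illustration):

```python
states = ["rainy", "sunny"]
init   = {"rainy": 0.5, "sunny": 0.5}                       # P(S_0)
trans  = {"rainy": {"rainy": 0.7, "sunny": 0.3},            # P(S_t | S_t-1)
          "sunny": {"rainy": 0.3, "sunny": 0.7}}
emit   = {"rainy": {"umbrella": 0.9, "no_umbrella": 0.1},   # P(O_t | S_t)
          "sunny": {"umbrella": 0.2, "no_umbrella": 0.8}}

def forward(observations):
    # alpha[s] = P(o_1..o_t, S_t = s); summing over states at the end gives P(observations | model)
    alpha = {s: init[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states) for s in states}
    return sum(alpha.values())

print(forward(["umbrella", "umbrella", "no_umbrella"]))
```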
31
What is max pooling in CNNs?
A down-sampling operation that keeps the maximum value in each region. Max pooling slides a small window (e.g., 2×2) across the feature map and, at each location, selects the maximum value within that window. It keeps the strongest feature and discards weaker signals in each region.
32
What is the policy iteration algorithm?
The policy iteration algorithm is a classic method in Reinforcement Learning and Markov Decision Processes (MDPs) used to find the optimal policy — that is, the best set of actions an agent can take in every state to maximize expected reward. An algorithm that evaluates and improves a policy until it converges.
33
What is a case base?
A case base is a core concept in Case-Based Reasoning (CBR), a type of AI that solves new problems by remembering and adapting previous similar problems. A case base is a collection of past cases stored in memory. Each case typically includes: A problem description A solution (Optional) Outcome or success rating of the solution
34
What are terminal states in MDPs?
In Markov Decision Processes (MDPs), a terminal state is a special type of state that ends the episode. A terminal state is a state in an MDP where: No further actions are taken. No more rewards are received (typically). The episode ends once the agent enters this state. Example: Maze: exit, Chess: Checkmate or draw States from which no further transitions occur.
35
What is similarity measurement in CBR?
In Case-Based Reasoning (CBR), similarity measurement is a critical step — it determines which past cases are most relevant for solving a new problem. Similarity measurement is the process of comparing a new problem to past cases in the case base to find the most similar ones.
36
What is temporal difference learning?
Temporal Difference (TD) Learning is one of the most important ideas in Reinforcement Learning (RL). It’s the foundation of popular algorithms like Q-learning and SARSA. Temporal Difference learning is a method where an agent learns value estimates by comparing predictions at successive time steps — rather than waiting until the final outcome. At time t, the agent is in state s_t, and takes action a_t, gets reward r_(t+1) and lands in new state s_(t+1). Then it updates its value estimate for s_t based on: The immediate reward and the estimated value of the next state V(s_(t+1))
37
What is adaptation in CBR?
In Case-Based Reasoning (CBR), adaptation is the step where the system modifies a retrieved solution from a past case to make it fit a new problem.
38
What is the purpose of normalization in Bayesian inference?
In Bayesian inference, normalization ensures that the final result is a valid probability distribution — that is, all probabilities add up to 1.
39
What is the chain rule of probability?
P(A, B, C) = P(A)*P(B|A)*P(C|A,B).
40
How can influence diagrams support rational decision making?
Influence diagrams (also known as decision networks) are powerful tools in AI and decision theory that support rational decision-making under uncertainty. By allowing systematic evaluation of expected utility for each decision. Influence diagram might include: 🎲 Chance node: Disease (yes/no), Test Result (positive/negative) 🔷 Decision node: Run Test or Not 💰 Utility node: Patient health, cost of test
41
What is the black box problem in AI?
Difficulty in interpreting how complex models make decisions.
42
What is a long short-term memory (LSTM)?
A Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) designed to remember information over long sequences — something traditional RNNs struggle with. An LSTM is a neural network architecture that can learn long-term dependencies in sequential data by using a system of gates and a memory cell. 🧾 It's designed to preserve information over long sequences and forget irrelevant parts.
43
What is d-separation used for?
To determine whether a set of variables is conditionally independent in a Bayesian network. Does knowing variable A tell me anything more about variable B, once I already know C? If yes, then A and B are dependent given C. If no, then A and B are conditionally independent given C → they are d-separated.
44
What is a Markov blanket?
The Markov blanket is a key concept in Bayesian networks that tells you exactly which variables you need to know about to make a node conditionally independent from the rest of the network. The Markov blanket of a node X is the smallest set of nodes in a Bayesian network that, when known (observed), makes X conditionally independent of all the other nodes in the network. The set of a node's parents, children, and co-parents of its children.
45
In a Bayesian network, what type of node structure creates a v-structure?
Two parent nodes pointing to a common child.
46
What is the difference between value iteration and policy iteration?
Value iteration and policy iteration are two fundamental algorithms for solving Markov Decision Processes (MDPs). Both aim to find an optimal policy, but they go about it in different ways. Value iteration updates utilities; policy iteration updates policies. Example: Value iteration: You rank every city (state) by how close it is to the destination and then decide what to do. Policy iteration: You start with a full plan (policy), see how well it works (evaluation), then tweak it.
47
When is Bayes’ Rule typically used?
To reverse conditional probabilities when direct measurement is hard. P(X|Y) = P(Y|X) * P(X) / P(Y)
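A small worked example in Python, using invented numbers for a disease test:

```python
# Bayes' rule: P(disease | positive test), all probabilities are made up for illustration.
p_disease = 0.01                 # P(X): prior
p_pos_given_disease = 0.95       # P(Y|X)
p_pos_given_healthy = 0.05       # false positive rate

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)   # P(Y)
posterior = p_pos_given_disease * p_disease / p_pos                               # P(X|Y)
print(round(posterior, 3))       # approximately 0.161
```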
48
What is a language model used for?
To predict the probability of a sequence of words. A language model (LM) is a model that learns to assign probabilities to sequences of words. It predicts how likely a given sequence is — or what word is likely to come next.
49
What is named entity recognition?
Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) that involves identifying and classifying specific pieces of information in text: detecting entities (words or phrases) in a sentence and classifying them into predefined categories like 🧑 Person, 🌍 Location, or 🏢 Organization.
50
Why is CBR interpretable?
Case-Based Reasoning (CBR) is considered highly interpretable because it solves new problems by referring to concrete, human-understandable past cases — rather than relying on abstract or opaque computations like deep neural networks. Because solutions are based on concrete past examples.
51
What is a utility function?
A function that assigns a numerical value to outcomes representing preferences.
52
Why is discounting used in MDPs?
To prioritize immediate rewards over distant future rewards. In MDPs, we use a discount factor γ∈[0,1] to reduce the weight of future rewards when computing the total expected reward. Total_reward = R(t+1) + γR(t+2) + γ^2R(t+3) ...
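A quick Python illustration of a discounted return with gamma = 0.9 (the reward values are made up):

```python
gamma = 0.9
rewards = [1, 0, 0, 10]          # R(t+1), R(t+2), R(t+3), R(t+4)
total = sum(gamma**k * r for k, r in enumerate(rewards))
print(total)                     # 1 + 0.9**3 * 10, approximately 8.29
```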
53
What is the vanishing gradient problem?
When gradients become too small for deep networks to learn effectively. The vanishing gradient problem is a major challenge in training deep neural networks — especially Recurrent Neural Networks (RNNs) and deep feedforward networks. Occurs when gradients (used to update weights) become extremely small as they are propagated backward through the network during training. As a result, early layers (or earlier time steps in RNNs) learn very slowly or not at all — because the weight updates are near zero. This is especially problematic when using sigmoid or tanh activation functions, because their derivatives are small (max ~0.25), so gradients shrink with each layer or time step.
54
What is Occam's Razor in AI?
Prefer simpler models that explain the data. In AI and machine learning, this means: When building models, choose the simplest model that fits the data well. Occam’s Razor helps AI systems: Avoid overfitting (memorizing noise in training data) Generalize better to unseen data Choose between models with similar performance
55
What is lemmatization?
Technique in Natural Language Processing (NLP). Simplify words while preserving their meaning and grammatical role. Reducing words to their dictionary base form. Unlike stemming, which may just chop off endings, lemmatization uses context (like part of speech) to return the correct, meaningful root word.
56
What is indexing in CBR?
Organizing cases to make retrieval efficient. In Case-Based Reasoning (CBR), indexing is the process of organizing and accessing cases efficiently so the system can quickly retrieve relevant past experiences when faced with a new problem.
57
Why are non-linear activation functions needed?
To allow networks to model complex functions. With only linear functions, no matter how deep the network is, it can only learn linear functions — which severely limits what it can model.
58
What is the Bellman optimality equation for Q-values?
The Bellman optimality equation for Q-values is a central concept in Reinforcement Learning (RL) — it defines how to compute the maximum possible expected return from any given state–action pair under an optimal policy. Q(s,a) = E[r + γ max_a' Q(s',a') | s, a], where Q(s,a) is the Q-value for a state–action pair, r is the received reward, γ is the discount factor, and E is the expectation over outcomes when transitions are stochastic.
59
What is model interpretability?
How easily humans can understand a model's logic.
60
What is a loss function?
A function that measures the difference between predicted and actual outputs.
61
What are the components of a Hidden Markov Model?
States, observations, transition model, observation model.
Hidden states S: What we want to infer (not directly observed)
Observations O: What we observe at each time step
Transition matrix A: Probabilities of moving between hidden states
Emission matrix B: Probabilities of observations given hidden states
Initial distribution π: Where the sequence starts
62
What does the Turing Test evaluate?
Whether a machine's behavior is indistinguishable from a human's.
63
How are word embeddings used in spam classification?
Word embeddings are dense vector representations of words that capture semantic meaning and contextual similarity. ✅ Capturing meaning, not just exact words ✅ Grouping similar spammy words together (e.g., “buy now” ~ “purchase today”) ✅ Feeding useful input features into machine learning models They convert words to vectors which are then aggregated and fed to classifiers.
64
What is a decision tree?
A flowchart-like structure used for classification or regression. A decision tree is a supervised machine learning model that makes predictions by splitting data into branches based on feature values — like asking a series of if-then questions. Example:
Is "free" in email?
├─ Yes → Contains "click"?
│         ├─ Yes → Spam
│         └─ No → Not Spam
└─ No → Not Spam
65
What does it mean for an agent to be rational?
It means the agent chooses actions that maximize expected utility. A rational agent is an agent that always chooses the action that maximizes its expected performance measure, based on: Its current knowledge Its perception of the environment Its goals In other words: It does the right thing, given the information and goals it has.
66
What is a CPT in Bayesian networks?
Conditional Probability Table, specifying probabilities for a node given its parents. Nodes represent random variables, edges represent dependencies. Each node has a CPT that defines P(Node | Parents). If a node has no parents, its CPT is just a prior distribution.

| Burglary | P(Alarm = T) | P(Alarm = F) |
| -------- | ------------ | ------------ |
| True     | 0.95         | 0.05         |
| False    | 0.01         | 0.99         |
67
Why are CNNs better for image classification than MLPs?
Because of spatial locality, fewer parameters, and shared weights. Unlike Multi-Layer Perceptrons (MLPs), which treat every pixel independently, Convolutional Neural Networks (CNNs): ✅ Understand the spatial layout of images ✅ Use filters to detect meaningful patterns like edges and textures ✅ Reuse weights across the image to reduce complexity and improve learning MLP: One pattern for every cat in every possible position — inefficient and brittle CNN: Learns cat features (ears, whiskers, tail) and recognizes them anywhere in the image — efficient and generalizable
68
What is the goal of weak AI?
To build systems that can perform specific tasks intelligently, without having general understanding or consciousness. It’s about creating software that behaves intelligently in a narrow domain, like recognizing faces, classifying spam, or recommending movies.
69
What is pruning in decision trees?
Pruning removes parts of the tree that are not useful or are too specific to the training data. A fully grown decision tree might perfectly classify training examples but perform poorly on new, unseen data. This is called overfitting. Pruning helps combat this by making the tree smaller and more general.
70
What is a convolutional neural network (CNN)?
A Convolutional Neural Network (CNN) is a type of deep learning model designed to automatically and efficiently learn spatial patterns — especially from images and visual data. A Convolutional Neural Network (CNN) is a neural network that: Uses convolutional layers to detect features (like edges, textures, shapes) Learns to classify, detect, or segment images by processing them in parts Preserves the spatial structure of data (unlike traditional MLPs) 📸 It's the go-to architecture for tasks like image classification, object detection, and face recognition. Main layers: Convolution, ReLU, Pooling, Fully Connected
71
What does it mean if two variables are conditionally independent?
Two variables A and B are said to be conditionally independent given a third variable C if once you know C, knowing A gives you no additional information about B — and vice versa.
72
What is value iteration?
Value iteration is a classic dynamic programming algorithm used in Markov Decision Processes (MDPs) to compute the optimal policy by iteratively improving value estimates for each state. 💡 It’s used to find out: “What is the best thing to do in each state to maximize long-term reward?”
73
Can weak AI pass the Turing test?
Turing test: If a machine can engage in a text conversation such that a human cannot reliably tell whether they’re talking to a machine or a person, it "passes" the test. Yes, weak AI can pass the Turing test. Many modern AI systems — like ChatGPT or other large language models — are examples of weak AI: They don’t understand language like a human. They don’t have goals, consciousness, or emotions. But they can mimic human-like responses extremely well. So even though the intelligence is simulated and narrow, the output may be convincing enough to fool a human — thus passing the test.
74
What is the difference between prior and posterior probability?
Prior is the initial belief; posterior is the updated belief after evidence.
75
How does ID3 build decision trees?
ID3 (Iterative Dichotomiser 3) is a classic algorithm used to build decision trees for classification tasks. It does this by greedily selecting the best feature at each step based on information gain. ID3 is a top-down, greedy algorithm that builds a decision tree by: Choosing the best attribute to split the data at each node Recursively creating child nodes for each possible value of that attribute Stopping when: All examples in the subset belong to the same class There are no more attributes left to split on
76
What is alpha in filtering algorithms?
In the context of filtering algorithms — especially in Bayesian filtering (like in Hidden Markov Models, Kalman filters, or particle filters) — α (alpha) is often used as a normalization constant. Normalizes the probabilities so they sum to 1.
77
What is backward smoothing?
Backward smoothing is a technique used in Hidden Markov Models (HMMs) and Bayesian filtering to improve the estimate of the hidden state at time t, after observing future evidence (up to time T). Backward smoothing uses both past and future observations to give a better estimate of a hidden state than filtering alone. (Filtering only uses past and current observations.)
78
What is the difference between strong and weak AI?
Strong AI is conscious and self-aware; weak AI is not.
79
What is information gain?
Information gain is a key concept in decision tree learning (like ID3) that tells us which attribute is most useful for splitting the data at each step. Information gain measures how much uncertainty (entropy) is reduced after splitting a dataset on a particular attribute.
80
What is a perceptron?
A perceptron is the simplest type of artificial neural network, originally proposed by Frank Rosenblatt in 1958, designed to mimic a single neuron in the human brain. It’s a binary classifier that makes decisions by weighing input signals, applying a threshold, and outputting either 0 or 1.
81
What is the difference between classification and regression?
Both classification and regression are types of supervised learning, where you train a model using labeled data. The main difference lies in the type of output they predict: Classification predicts a category or label, like spam or not spam. Regression predicts continuous numeric values, like house price.
82
What is prediction in HMM?
Prediction in HMM refers to the process of computing the most likely future hidden state(s) based on past observations — without yet seeing future evidence.
83
What are pooling layers in CNNs?
Pooling layers are a key component of Convolutional Neural Networks (CNNs) that reduce the spatial dimensions (width and height) of the feature maps while preserving important information.
✅ 1. Max Pooling (most common): takes the maximum value in each window; highlights the most prominent feature.
✅ 2. Average Pooling: takes the average value in the window; smoother output, but may dilute sharp features.
Typical architecture: Input image → Convolution → ReLU → Pooling → Convolution → ReLU → Pooling → Fully Connected → Output
84
What are activation functions?
In a neural network, an activation function determines the output of a neuron given its input — it introduces non-linearity into the model, allowing it to learn complex patterns.
ReLU: max(0, x) — most common, fast, avoids vanishing gradients
Sigmoid: output between 0 and 1, used in binary classification
Tanh: output between -1 and 1, zero-centered
Softmax: turns outputs into probabilities — for multi-class classification
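A plain-NumPy sketch of these functions (illustrative, not a framework's API):

```python
import numpy as np

def relu(x):    return np.maximum(0, x)
def sigmoid(x): return 1 / (1 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```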
85
What is Q-learning?
Q-learning is a fundamental reinforcement learning (RL) algorithm that teaches an agent how to act optimally in an environment by learning the value of actions without knowing the environment’s dynamics. Imagine a robot in a maze: It tries random moves (exploration) It gets rewards when it finds good outcomes (e.g., reaching the goal) It updates its Q-table with better estimates over time Eventually, it learns the best action for each state Q-learning: Model-free RL algorithm to learn optimal actions Learns: Q-values: how good each action is in each state Goal: Maximize long-term cumulative reward Used in: Games (like Atari, chess), robotics, navigation, decision-making
86
What is a policy in MDPs?
In an MDP (Markov Decision Process), a policy defines the agent’s behavior — that is: 🔁 A policy π is a mapping from states to actions. It tells the agent what action a to take when in state s. Two types of policies: Deterministic: Always choose same action in state. In state s1 choose a1. Stochastic: Choose action based on probability. In state s1, 20% a1, 80% a2
87
What is semi-supervised learning?
Learning from a small amount of labeled data and a large amount of unlabeled data. Labeled data is costly; unlabeled is cheap and plentiful
88
What are the four types of AI agents?
Simple reflex: A vacuum cleaner that turns left if it sees a wall.
Model-based reflex: A robot that knows the room layout and updates where it thinks it is, even when it can't see everything.
Goal-based: A GPS system finding the fastest route to a destination.
Utility-based: A self-driving car balancing speed, safety, and passenger comfort.
89
What is entropy in decision trees?
In decision trees, entropy is a measure of impurity or uncertainty in a dataset. It tells us how mixed the class labels are at a given node. 🎯 The goal in building a decision tree is to split the data in a way that reduces entropy — meaning the split leads to purer subsets.
90
How is a joint probability distribution computed in a Bayesian network?
A Bayesian network represents the joint probability distribution (JPD) over a set of variables using a directed acyclic graph (DAG). You multiply the conditional probabilities of each variable given its parents. Imagine a Bayesian network with 3 nodes: A has no parents, B's parent is A, and C's parents are A and B. Then the joint probability is P(A,B,C) = P(A)⋅P(B|A)⋅P(C|A,B), i.e., the product of the conditional probabilities of each variable given its parents.
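A small Python illustration of evaluating that product for one assignment (all probabilities are invented):

```python
# P(A, B, C) = P(A) * P(B|A) * P(C|A,B), evaluated for a = True, b = True, c = False.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}                    # P_B_given_A[a][b]
P_C_given_AB = {(True, True): {True: 0.9, False: 0.1},
                (True, False): {True: 0.5, False: 0.5},
                (False, True): {True: 0.4, False: 0.6},
                (False, False): {True: 0.05, False: 0.95}}         # P_C_given_AB[(a, b)][c]

a, b, c = True, True, False
p = P_A[a] * P_B_given_A[a][b] * P_C_given_AB[(a, b)][c]
print(p)  # 0.3 * 0.8 * 0.1, approximately 0.024
```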
91
What is a language model?
A language model (LM) is a model that learns to assign probabilities to sequences of words. It predicts how likely a given sequence is — or what word is likely to come next.
92
What are the three canonical structures for conditional independence?
Chain, common cause, common effect (v-structure).
93
What makes a neural network 'deep'?
Having multiple hidden layers.
94
What are some ethical concerns in AI?
Bias, surveillance, job loss, transparency, misuse in weapons.
95
What is the retain step in CBR?
The retain step is the final phase of the Case-Based Reasoning (CBR) cycle, where the system stores the new problem-solving experience so it can use it in the future. The full CBR cycle has four main steps: Retrieve — Find similar past cases Reuse — Adapt the solution to the new problem Revise — Test and improve the proposed solution Retain — Store the new solution as a new case.
96
What is the update rule in Q-learning?
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)] α = learning rate γ = discount factor
97
What is a Bayesian network?
A graphical model that represents probabilistic relationships among variables.
98
What is filtering in HMM?
Filtering in an HMM is the process of computing the probability distribution over the current hidden state given all evidence (observations) up to now: "Given everything I've observed so far, what is the most likely state I'm in right now?" In an HMM, hidden states (like weather conditions) are not directly observable; we only see evidence (like whether someone is carrying an umbrella). Filtering helps us maintain a belief state — a probability distribution over possible current states — as new evidence arrives.
99
What is the explore-exploit dilemma?
The trade-off between exploring new actions and exploiting known good ones. If the agent only exploits: ✅ It gets good rewards now ❌ But it might miss even better actions it never tried If the agent only explores: ✅ It might find the best action eventually ❌ But it wastes time (and reward) trying bad ones
100
What is the Chinese Room argument?
A thought experiment against strong AI asserting that syntax alone isn't understanding. Imagine: You're locked in a room. You don’t speak Chinese at all. You’re given a book of rules (a program) that tells you: "When you see this Chinese symbol, write this other one in response." Native Chinese speakers slip questions into the room in Chinese. You follow the rules and send back perfectly written answers — in Chinese. From the outside, it looks like you understand Chinese. But you don’t — you’re just manipulating symbols.
101
What does the learning rate control?
The learning rate controls how much the model's weights are updated during training in response to the error it sees. Specifically, it determines the step size at each iteration of the optimization algorithm (typically gradient descent) when moving toward a minimum of the loss function. Too high a learning rate can cause the model to overshoot the minimum, potentially leading to divergence or erratic behavior. Too low a learning rate makes training very slow and might get stuck in local minima. It's a key hyperparameter that affects both the speed and stability of training.
102
What is supervised learning?
Learning from labeled data.
103
What causes overfitting in decision trees?
Using too many splits or fitting noise in the training data. Overfitting in decision trees happens when the tree learns not just the general patterns in the training data, but also the noise and specific quirks. This leads to excellent performance on the training set but poor generalization to new, unseen data. Common causes include:
Tree depth is too large – the tree keeps splitting until each leaf has very few samples (or even just one), capturing noise.
No pruning – without pruning, the tree doesn't remove branches that add complexity without improving accuracy.
Low minimum samples per leaf or split – if the model is allowed to split even on very small data subsets, it can memorize the training data.
Too many features – especially with irrelevant or redundant features, the tree can make splits that are too specific.
No regularization – lack of constraints like max_depth, min_samples_split, or min_samples_leaf.
104
What is expected utility?
Expected utility is a concept from decision theory that represents the average value of an outcome, weighted by the probability of each possible outcome and the utility (or value) the decision-maker assigns to those outcomes. Expected Utility=∑(Probability of Outcome×Utility of Outcome)
105
What is a word embedding?
A dense vector representing the semantic meaning of a word. A word embedding is a way to represent words as dense vectors of real numbers in a continuous vector space, where semantically similar words are mapped to nearby points. Unlike one-hot encoding (which is sparse and doesn't capture meaning), word embeddings capture relationships and context between words. For example, in a well-trained embedding space: vector("king") - vector("man") + vector("woman") ≈ vector("queen") Popular word embedding models include: Word2Vec GloVe FastText
106
What is inference in Bayesian networks?
Inference in Bayesian networks is the process of computing the probability of one or more variables given evidence about others. In other words, it's about updating beliefs based on observed data using the network's structure and conditional probabilities. If a Bayesian network models a medical diagnosis and you observe that a patient has a cough, inference can help compute the probability they have the flu, given that observation. Computing the posterior distribution of a variable given evidence.
107
What is the Bellman equation?
The Bellman equation is a fundamental recursive relationship in reinforcement learning and dynamic programming. It describes how the value of a state (or state-action pair) is related to the values of successor states. The value of a state equals the expected immediate reward plus the discounted value of the next state.
108
What are utility and probability used for in rational agents?
They are used to make decisions under uncertainty. In rational agents, utility and probability are used together to guide decision-making under uncertainty: Probability is used to model uncertainty about the world — i.e., how likely different outcomes are. Utility is used to express preferences — i.e., how desirable each outcome is to the agent. A rational agent chooses actions to maximize expected utility, which combines these two: Expected Utility = ∑ (Probability of outcome) × (Utility of outcome)
109
What is the most likely explanation in an HMM?
The sequence of states that most likely generated the observations. The Viterbi algorithm uses dynamic programming to avoid recomputing subproblems and finds the single path through the HMM that has the highest joint probability of both the state sequence and the observations.
110
What is the revise step in CBR?
The proposed solution is tested and possibly corrected. In Case-Based Reasoning (CBR), the revise step is where the proposed solution from a retrieved and adapted past case is evaluated and potentially improved before being retained for future use. Specifically, in the CBR cycle (Retrieve, Reuse, Revise, Retain), the Revise step involves: Testing the proposed solution (e.g. in the real world or a simulation), Detecting errors or mismatches between the solution and the actual outcome, Correcting the solution if needed. This step ensures that the final solution is accurate and suitable before it is learned from (i.e., retained in the case base).
111
What happens during reuse in CBR?
The solution of the retrieved case is adapted to the new case. During the reuse step in Case-Based Reasoning (CBR), the system takes the solution from the most relevant past case(s) and adapts it to fit the new problem. Specifically, reuse involves: Extracting the solution from the retrieved case(s), Modifying or adapting that solution if the new problem differs in important ways, Producing a proposed solution to be tested in the revise step. For example, in a troubleshooting system: If a similar past case fixed a network issue by restarting the router, reuse might propose the same action — unless the current case has a different network setup, in which case it might need to adapt the solution (e.g., restart a switch instead).
112
How does RL differ from MDPs?
In RL, the transition model and rewards are unknown and learned through experience. Reinforcement Learning (RL) and Markov Decision Processes (MDPs) are closely related but not the same: 🔁 MDP: A formal model that describes decision-making in environments with: States Actions Transition probabilities Rewards Assumes full knowledge of the model (i.e., transition and reward functions are known). Used to define the problem. 🤖 RL: A learning framework used to solve MDPs when the model is unknown. The agent learns: What actions to take (policy) From interacting with the environment (trial and error) Estimates transition and reward functions or learns value functions/policies directly.
113
What is gradient descent used for in neural nets?
To update weights in order to minimize error. Gradient descent is used in neural networks to optimize the model's weights by minimizing the loss function — which measures how far the network's predictions are from the actual targets. Here's how it works: Compute the loss (e.g., cross-entropy or MSE) for a batch of training data. Calculate the gradient of the loss with respect to each weight (using backpropagation). Update each weight in the direction that reduces the loss: w ← w − η ∂L/∂w
114
What is the sum rule in probability?
P(A) = Σ P(A, B) over all B.
115
What is a Markov Decision Process?
A model for sequential decision making with states, actions, transition probabilities, and rewards. A Markov Decision Process (MDP) is a formal framework used to model decision-making in environments where outcomes are partly random and partly under the control of a decision-maker (agent). MDPs satisfy the Markov property: The future is independent of the past given the present. In other words, the next state and reward depend only on the current state and action, not on the full history.
116
What is overfitting?
When a model performs well on training data but poorly on unseen data.
117
What is the goal of reinforcement learning?
To learn a policy that maximizes expected cumulative reward. The goal of reinforcement learning (RL) is to train an agent to learn a policy — a strategy for choosing actions — that maximizes cumulative reward over time while interacting with an environment. The agent learns through trial and error, using feedback from the environment (in the form of rewards or punishments) to improve its behavior over time — without knowing the environment's dynamics in advance.
118
What is the role of transition models?
To specify the probability of moving between states. The role of transition models is to describe how an environment changes in response to actions — specifically, they define the probability of moving to a new state given the current state and action. Formally, in a Markov Decision Process (MDP), the transition model is P(s' | s, a). In model-free RL, the agent learns without explicitly using a transition model. In model-based RL, the agent tries to learn or is given the transition model to plan more efficiently.
119
In a chain structure A -> B -> C, is A independent of C given B?
Yes.
120
What is a utility node?
A node that quantifies the agent's preferences. A utility node is a component in a decision network that represents the agent’s preferences over outcomes by assigning numerical utility values to different states of the world.
121
What is dropout?
Dropout is a regularization technique used in neural networks to prevent overfitting during training. During each training step, randomly selected neurons are “dropped out” (i.e., temporarily removed) from the network with a certain probability (e.g., 0.5). This means those neurons don’t contribute to the forward pass or backpropagation in that step. At test time, no neurons are dropped, but their outputs are scaled to account for the dropout during training.
122
What is the purpose of a decision node?
A decision node in a decision network (or influence diagram) represents a choice the agent can make — that is, a point where the agent selects an action based on the available information. Key characteristics: Typically shown as a rectangle in diagrams. Has no probabilities (unlike chance nodes). Inputs: Can take information from chance nodes or other decisions (what the agent knows when making the choice). Output: Feeds into utility nodes (to evaluate the consequences of the decision).
123
What is unsupervised learning?
Finding patterns in unlabeled data.
124
What is TF-IDF?
TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical statistic used in information retrieval and text mining to reflect how important a word is to a document in a collection (corpus). Words that appear frequently in a single document but rarely across others (like "robot" in a robotics paper) get high TF-IDF scores — meaning they are more informative. Common words like "the" or "and" get low scores because they appear in many documents.
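A minimal Python sketch of one common TF-IDF variant (real libraries differ in smoothing and normalization details; the toy corpus is made up):

```python
import math

docs = [["robot", "learns", "to", "walk"],
        ["the", "robot", "arm", "moves"],
        ["the", "weather", "is", "nice"],
        ["the", "cat", "sat"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                    # term frequency within the document
    df = sum(1 for d in corpus if term in d)           # number of documents containing the term
    idf = math.log(len(corpus) / df)                   # rare across the corpus -> higher idf
    return tf * idf

print(tf_idf("robot", docs[0], docs))   # informative word -> higher score
print(tf_idf("the", docs[2], docs))     # common word -> lower score
```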
125
What are decision networks?
Bayesian networks extended with decision and utility nodes.
126
What is the discount factor?
A factor γ ∈ [0,1] that determines how future rewards are valued.
127
What is the conditional independence assumption?
That certain variables are independent given others, reducing the number of probabilities needed.
128
Why are word embeddings better than one-hot encoding?
Word embeddings are better than one-hot encoding for most NLP tasks because they provide dense, meaningful, and scalable representations of words. Here's why: One-hot encoding: Words are represented as binary vectors with a single 1 and the rest 0s — no information about meaning or similarity. Example: "cat" and "dog" are just as unrelated as "cat" and "laptop". Word embeddings: Words are mapped to dense vectors where similar words have similar values (e.g., "cat" and "dog" are close in the vector space).
129
In a v-structure A -> C <- B, are A and B independent given C?
No, they become dependent when C is observed.
130
What is the product rule in probability?
P(A, B) = P(A|B) * P(B).
131
Why is fairness important in AI?
To ensure that decisions do not systematically disadvantage any group. Fairness is important in AI because AI systems increasingly make decisions that directly impact people's lives, and unfair or biased systems can cause real harm — including discrimination, exclusion, or unequal treatment.
132
What is the Viterbi algorithm used for?
The Viterbi algorithm is used to find the most likely sequence of hidden states (also called the best path) in a Hidden Markov Model (HMM), given a sequence of observed events. Common applications: speech recognition, part-of-speech tagging, DNA sequence analysis, spell checking and error correction. What it does: given a sequence of observations (e.g., sounds or words) and an HMM with known transition and emission probabilities, the Viterbi algorithm finds the single best state sequence S = (s1, s2, s3, ...) that maximizes the joint probability, arg max_S P(S, O), where O is the observed sequence. It uses dynamic programming to avoid recomputing overlapping subproblems, making it much faster than brute-force enumeration of all possible state sequences.
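A minimal Python sketch of Viterbi on a toy umbrella-style HMM (all numbers are invented for illustration):

```python
states = ["rainy", "sunny"]
init   = {"rainy": 0.5, "sunny": 0.5}
trans  = {"rainy": {"rainy": 0.7, "sunny": 0.3},
          "sunny": {"rainy": 0.3, "sunny": 0.7}}
emit   = {"rainy": {"umbrella": 0.9, "no_umbrella": 0.1},
          "sunny": {"umbrella": 0.2, "no_umbrella": 0.8}}

def viterbi(observations):
    # delta[s] = probability of the best state path ending in s; back[t][s] remembers the best predecessor
    delta = {s: init[s] * emit[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        prev = {s: max(states, key=lambda p: delta[p] * trans[p][s]) for s in states}
        delta = {s: delta[prev[s]] * trans[prev[s]][s] * emit[s][obs] for s in states}
        back.append(prev)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for prev in reversed(back):       # walk the back-pointers to recover the full path
        path.insert(0, prev[path[0]])
    return path

print(viterbi(["umbrella", "umbrella", "no_umbrella"]))  # ['rainy', 'rainy', 'sunny']
```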
133
What is a Bayesian network?
A Bayesian network is a type of probabilistic graphical model that uses a directed acyclic graph (DAG) to represent a set of random variables and their conditional dependencies. Each node represents a variable, and each edge represents a direct influence from one variable to another. These networks are used to compute probabilities efficiently by leveraging conditional independencies among variables.
134
What is conditional independence in Bayesian networks?
Two variables are conditionally independent given a third if knowing the third makes the first two independent. In Bayesian networks, this means that once you observe the 'middle' variable on a path (unless it's a collider), the information from one end of the path doesn't influence the other. Conditional independence is critical for simplifying joint probability computations.
135
What is the difference between on-policy and off-policy reinforcement learning?
On-policy learning means the agent learns the value of the policy it is currently using to make decisions (like SARSA). Off-policy learning, like Q-learning, means the agent learns the value of an optimal policy regardless of the policy it is actually using. This allows Q-learning to learn optimal strategies even while exploring with a different behavior.
136
What is utility in decision theory?
Utility is a numeric value that represents the desirability or preference for a particular outcome. In AI, utility functions help agents evaluate and compare outcomes, guiding rational decision-making under uncertainty. Higher utility corresponds to more preferred outcomes.
137
What is the CBR (Case-Based Reasoning) cycle?
CBR involves solving new problems based on solutions to past similar problems. The cycle includes four steps: (1) Retrieve the most similar case(s), (2) Reuse the solution of the retrieved case, (3) Revise the proposed solution if needed, and (4) Retain the new experience for future use. It is a model of learning from experience.
138
What is overfitting in machine learning?
Overfitting happens when a model learns patterns specific to the training data, including noise, rather than general patterns. This results in poor performance on unseen data. It's like memorizing answers for a test rather than understanding the material. Techniques to prevent overfitting include regularization, dropout, cross-validation, and using more training data.
139
What is an HMM (Hidden Markov Model)?
An HMM is a statistical model used to represent systems that are Markov processes with hidden states. It assumes that the system being modeled is a sequence of observations generated by hidden states, each of which follows a Markov process (only depends on the previous state). HMMs are widely used in speech recognition, bioinformatics, and time series analysis.
140
What is Q-learning?
Q-learning is a model-free reinforcement learning algorithm. It learns a Q-value function that estimates the expected cumulative reward for taking a given action in a given state, and following the best policy thereafter. It updates its estimates based on the Bellman equation, using the maximum reward of the next state, even if the current action is exploratory.
141
What is a decision tree?
A decision tree is a flowchart-like structure used in supervised learning to make decisions or classifications. Each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or output. It works by recursively partitioning the data into subsets that are as pure as possible.
142
What is reinforcement learning?
Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. It receives rewards or penalties for its actions and aims to learn a policy that maximizes cumulative reward over time. Unlike supervised learning, it doesn't get direct feedback on the correct action but must learn from trial and error.
143
What is a Hidden Markov Model (HMM) used for?
A Hidden Markov Model is used to model systems where the underlying process (state) is not directly observable (hidden), but we have observations that are probabilistically related to these states. It is especially useful in time-series data like speech recognition, where the actual spoken words (states) are inferred from audio signals (observations). HMMs use two key probabilities: transition probabilities between states and emission probabilities for observations.
144
What does the Bellman equation express in reinforcement learning?
The Bellman equation defines the relationship between the value of a state and the expected return of future states. It breaks down the value of a state into the immediate reward received from the current action and the discounted future value. This recursive structure allows dynamic programming methods like value iteration to efficiently compute optimal policies in Markov Decision Processes (MDPs).
145
What is filtering in the context of HMMs?
Filtering is the task of computing the probability distribution over the current hidden state given all past observations. It is useful for tracking or monitoring applications. The forward algorithm is commonly used to perform filtering efficiently by propagating beliefs through time using the transition and observation models.
146
How does value iteration work in MDPs?
Value iteration is an algorithm that updates the utility of each state using the Bellman equation until the values converge. It starts with arbitrary utilities and repeatedly updates each state’s value based on expected utilities of successor states and rewards. Once the values stabilize, the optimal policy can be derived by choosing actions that maximize expected utility at each state.
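A minimal value-iteration sketch in Python for a tiny two-state MDP (states, actions, and all numbers are invented for illustration):

```python
states  = ["s1", "s2"]
actions = ["stay", "move"]
gamma   = 0.9
# P[(s, a)] is a list of (probability, next_state, reward) transitions
P = {("s1", "stay"): [(1.0, "s1", 0)],
     ("s1", "move"): [(0.8, "s2", 5), (0.2, "s1", 0)],
     ("s2", "stay"): [(1.0, "s2", 1)],
     ("s2", "move"): [(1.0, "s1", 0)]}

V = {s: 0.0 for s in states}
for _ in range(100):                        # repeat the Bellman backup until values stabilize
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions)
         for s in states}

# Derive the greedy policy from the converged values
policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
print(V, policy)
```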
147
What is an influence diagram?
An influence diagram is an extension of a Bayesian network that includes decision nodes and utility nodes. It is used for decision-making under uncertainty. Chance nodes represent random variables, decision nodes represent choices, and utility nodes represent preferences. The diagram encodes dependencies and helps compute expected utilities to guide rational decisions.
148
What is the difference between supervised and unsupervised learning?
In supervised learning, the model is trained on labeled data where the output is known, such as classification or regression tasks. In unsupervised learning, the model tries to find patterns or structures in data without labeled outputs, such as clustering or dimensionality reduction. Supervised learning is used when prediction is the goal, while unsupervised learning is used for exploration and understanding data structure.
149
What is the exploration vs. exploitation trade-off in reinforcement learning?
This trade-off refers to the dilemma of choosing between exploring new actions to discover better rewards and exploiting known actions that yield high rewards. Effective RL strategies need to balance these two approaches to avoid getting stuck in suboptimal behavior. Common solutions include ε-greedy policies or algorithms like Upper Confidence Bound (UCB).
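A Python sketch of ε-greedy action selection, one simple way to strike this balance (the dict-based Q-table is an illustrative assumption):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore: pick a random action
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit: pick the best known action
```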
150
What is a Markov Decision Process (MDP)?
An MDP is a mathematical framework for modeling decision-making problems where outcomes are partly random and partly under the control of an agent. It consists of states, actions, transition probabilities, a reward function, and a discount factor. The goal is to find a policy that maximizes the expected sum of rewards over time.
151
What is the purpose of dropout in neural networks?
Dropout is a regularization technique used during training of neural networks to prevent overfitting. It randomly 'drops out' a proportion of neurons during each training step, forcing the network to learn redundant representations. This helps ensure the model generalizes better to new data by preventing it from relying too heavily on specific neurons.
152
What is a convolutional neural network (CNN)?
A CNN is a deep learning model designed for processing data with a grid-like topology, such as images. It uses convolutional layers to apply filters that detect local patterns like edges or textures, followed by pooling layers to reduce spatial dimensions. CNNs are efficient in learning hierarchical features and are widely used in computer vision tasks.
153
What are word embeddings in NLP?
Word embeddings are dense vector representations of words where words with similar meaning have similar representations. Unlike one-hot encoding, which is sparse and does not capture relationships, embeddings like Word2Vec or GloVe capture semantic and syntactic similarities and allow algorithms to understand context and meaning in text data.
154
What is the role of the discount factor γ in reinforcement learning?
The discount factor determines how much future rewards are valued compared to immediate rewards. A value close to 1 means future rewards are nearly as valuable as immediate ones, promoting long-term planning. A value near 0 emphasizes short-term gains. It ensures the sum of future rewards is finite and reflects time preference in decision making.
155
What is a utility function in AI decision making?
A utility function assigns a numeric value to each possible outcome, reflecting the agent’s preference for that outcome. Higher utility means the outcome is more desirable. In decision theory, agents use utility functions to evaluate options and choose actions that maximize expected utility, ensuring rational and goal-directed behavior.
156
How does SARSA differ from Q-learning?
SARSA (State-Action-Reward-State-Action) is an on-policy algorithm that updates its value estimates based on the actual action taken in the next state. Q-learning is off-policy and updates based on the best possible action in the next state, regardless of which action was actually taken. SARSA tends to be more conservative and safer, especially in risky environments.
157
What is the Chinese Room argument?
The Chinese Room argument, proposed by philosopher John Searle, challenges the notion that a computer running a program can understand language or possess a mind. In the thought experiment, a person manipulates Chinese symbols using a rule book without understanding the language, suggesting that syntactic processing (like AI) does not equal semantic understanding.
158
What is the Turing Test and what does it evaluate?
The Turing Test, proposed by Alan Turing, is a test of a machine's ability to exhibit intelligent behavior indistinguishable from that of a human. If a human evaluator cannot reliably tell whether responses come from a machine or a person, the machine is considered to have passed the test. It evaluates behavioral intelligence, not consciousness or understanding.
159
What is smoothing in HMMs?
Smoothing refers to estimating the hidden state at a previous time step, given the entire sequence of observations. Unlike filtering, which only uses past and current observations, smoothing incorporates future observations to make better estimates of past states. The forward-backward algorithm is commonly used for this purpose.
160
What is the most likely explanation in HMMs?
The most likely explanation is the sequence of hidden states that is most probable given the entire sequence of observations. It is computed using the Viterbi algorithm, which efficiently finds the single best path through the state space rather than the most probable state at each time step.
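A compact NumPy sketch of the Viterbi algorithm (the matrix shapes and variable names are just one common convention):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    # pi: initial distribution (K,), A: transition matrix (K, K),
    # B: emission matrix (K, M), obs: list of observation indices.
    K, T = len(pi), len(obs)
    delta = np.zeros((T, K))            # best path probability ending in each state
    back = np.zeros((T, K), dtype=int)  # backpointers for reconstructing the path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(K):
            scores = delta[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]  # trace back the single best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```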
161
What are the key steps of the gradient descent algorithm?
Gradient descent involves computing the gradient (partial derivatives) of the loss function with respect to the model parameters, then taking a small step in the direction of the negative gradient, which locally reduces the loss. The step size is controlled by the learning rate, and the update is repeated iteratively until the loss stops improving (or a step budget is reached).
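A minimal worked example: gradient descent on a one-parameter least-squares fit (the data and learning rate are arbitrary toy choices):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x

w, learning_rate = 0.0, 0.01
for step in range(200):
    grad = 2 * np.mean(x * (w * x - y))   # d/dw of mean((w*x - y)^2)
    w -= learning_rate * grad             # step against the gradient
print(w)  # converges to about 2.0
```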
162
What is backpropagation in neural networks?
Backpropagation is an algorithm used to compute the gradient of the loss function with respect to each weight in the network. It works by propagating errors backward from the output layer to the input layer, applying the chain rule of calculus. It enables efficient training of multi-layer networks via gradient descent.
163
What is entropy in the context of decision trees?
Entropy is a measure of impurity or uncertainty in a dataset. In decision trees, it's used to determine how mixed the classes are within a dataset. If a dataset contains only one class, its entropy is 0, meaning it is pure. The goal of splitting data in decision trees is to reduce entropy, creating branches that are as pure as possible.
164
What is information gain in decision trees?
Information gain measures the reduction in entropy achieved by splitting a dataset based on a particular attribute. It helps select the attribute that best separates the data into different classes. The attribute with the highest information gain is chosen for splitting at each node in the tree-building process.
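A small self-contained sketch that computes entropy and information gain for a made-up split:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child subsets.
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 5 + ["no"] * 5                      # entropy = 1.0 bit
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(information_gain(parent, children))  # ≈ 0.28 bits
```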
165
What is pruning in decision trees?
Pruning is the process of removing nodes or branches from a decision tree to reduce its complexity and improve generalization. This helps prevent overfitting, which can occur when the tree memorizes noise in the training data. Pruning can be done during or after training using techniques like cost complexity pruning or reduced error pruning.
166
What is a perceptron?
A perceptron is a basic type of neural network unit used for binary classification. It computes a weighted sum of inputs, applies an activation function (like the sign function), and outputs either +1 or -1. If the data is linearly separable, the perceptron learning algorithm can find a separating hyperplane that classifies the data correctly.
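A sketch of the perceptron learning rule on a tiny, made-up linearly separable dataset:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    # Labels are in {-1, +1}; X has one example per row.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi                   # nudge the hyperplane toward xi
                b += yi
    return w, b

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 0.8]])
y = np.array([-1, 1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # matches y because the data is separable
```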
167
What is a learning rate in neural networks?
The learning rate is a hyperparameter that controls how much the model's weights are adjusted in response to the calculated error each time they are updated. If the learning rate is too high, the model may overshoot the optimal weights. If it is too low, training can be very slow or get stuck in local minima.
168
What is the difference between classification and regression?
Classification is the task of predicting a categorical label (e.g., spam or not spam), whereas regression predicts a continuous quantity (e.g., price of a house). Both are types of supervised learning, but use different loss functions and evaluation metrics suited to their problem type.
169
What is regularization in machine learning?
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. This discourages the model from learning overly complex patterns. Common types include L1 regularization (Lasso), which promotes sparsity, and L2 regularization (Ridge), which penalizes large weights.
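For L2 (Ridge), the penalty simply adds λ·Σw² to the loss; a minimal sketch:

```python
import numpy as np

def ridge_loss(w, X, y, lam=0.1):
    # Mean squared error plus an L2 penalty that discourages large weights.
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def ridge_gradient(w, X, y, lam=0.1):
    # The penalty contributes an extra 2*lam*w, shrinking weights toward zero.
    n = len(y)
    return (2.0 / n) * X.T @ (X @ w - y) + 2 * lam * w
```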
170
What is the curse of dimensionality?
The curse of dimensionality refers to the challenges that arise when working with high-dimensional data. As the number of dimensions increases, data becomes sparse, distances between points become less meaningful, and algorithms may struggle to generalize. This makes learning and visualization more difficult without dimensionality reduction techniques.
171
What is dimensionality reduction?
Dimensionality reduction is the process of reducing the number of input variables in a dataset while preserving as much information as possible. Techniques like Principal Component Analysis (PCA) or t-SNE are used to simplify data, improve visualization, and make machine learning algorithms more efficient.
172
What is Principal Component Analysis (PCA)?
PCA is a technique used for dimensionality reduction that transforms the original features into a new set of uncorrelated features (principal components). These components are ordered by the amount of variance they explain in the data. PCA helps reduce complexity while retaining the most important information.
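A NumPy sketch of PCA via the SVD of the centred data matrix:

```python
import numpy as np

def pca(X, n_components=2):
    # X: (n_samples, n_features). Centre the data, then take the top
    # right-singular vectors as the principal directions.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                  # directions of maximal variance
    explained_variance = (S ** 2) / (len(X) - 1)    # variance along each direction
    return Xc @ components.T, components, explained_variance[:n_components]
```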
173
What is a value function in reinforcement learning?
A value function estimates how good it is for an agent to be in a given state (or to take a specific action in a state), in terms of expected future rewards. It helps guide the agent's decisions by indicating which states or actions lead to higher cumulative rewards over time.
174
What is policy evaluation in reinforcement learning?
Policy evaluation is the process of determining the expected return (value) of each state under a specific policy. This involves calculating the value function by averaging the rewards and transitions when following the policy, and is a key step in methods like policy iteration.
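A sketch of iterative policy evaluation, assuming the same dict-style MDP format as in the card-150 sketch above (P[s][a] is a list of (probability, next_state, reward) tuples):

```python
def evaluate_policy(P, policy, gamma=0.9, theta=1e-8):
    # Repeatedly apply the Bellman expectation backup until values stop changing.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```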
175
What is policy improvement in reinforcement learning?
Policy improvement is the process of using the value function of a current policy to derive a better policy. The agent chooses actions that maximize the expected value in each state, effectively creating a new policy that is guaranteed to perform at least as well as the old one.
176
What is a greedy policy in reinforcement learning?
A greedy policy is one that always selects the action with the highest estimated value in a given state. While this strategy can quickly find good actions, it may miss better long-term strategies due to lack of exploration. It's often combined with exploration strategies like ε-greedy.
177
What is ε-greedy strategy?
The ε-greedy strategy balances exploration and exploitation by choosing the best known action most of the time (with probability 1-ε), and exploring a random action with probability ε. This prevents the agent from getting stuck with suboptimal policies by ensuring occasional exploration.
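A minimal sketch of ε-greedy action selection (Q assumed to be a dict keyed by (state, action)):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit
```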
178
What is an activation function in neural networks?
An activation function introduces non-linearity into a neural network, allowing it to learn complex patterns. Common activation functions include ReLU, sigmoid, and tanh. Without them, the network would be equivalent to a linear model and unable to solve problems like image recognition or natural language understanding.
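For reference, the three common activations applied to a small made-up input:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu    = np.maximum(0.0, x)        # [0, 0, 0, 1.5]
sigmoid = 1 / (1 + np.exp(-x))      # squashes values into (0, 1)
tanh    = np.tanh(x)                # squashes values into (-1, 1)
```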
179
What is the role of the softmax function?
Softmax is an activation function commonly used in the output layer of classification networks. It converts raw output scores (logits) into probabilities that sum to 1. Each score is exponentiated and divided by the sum of all exponentiated scores, allowing the network to predict class probabilities.
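A numerically stable implementation subtracts the maximum logit before exponentiating, which does not change the result:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.66, 0.24, 0.10], sums to 1
```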
180
What is the vanishing gradient problem?
The vanishing gradient problem occurs when gradients become very small during backpropagation, especially in deep networks. This causes weights in earlier layers to update very slowly, hindering learning. It often occurs with sigmoid or tanh activations and is mitigated by using ReLU or residual connections.
181
What is a recurrent neural network (RNN)?
An RNN is a type of neural network designed to handle sequential data by maintaining a hidden state that captures information from previous time steps. This makes them suitable for tasks like language modeling, time-series prediction, and speech recognition. However, they struggle with long-range dependencies.
182
What are long short-term memory (LSTM) networks?
LSTMs are a special kind of RNN designed to remember information for longer periods. They include memory cells and gating mechanisms that regulate the flow of information, allowing them to overcome the vanishing gradient problem and capture long-term dependencies in sequences like text or speech.
183
What is the main idea behind deep learning?
Deep learning uses multi-layered neural networks to learn complex representations of data. Each layer learns increasingly abstract features, from simple edges in early layers to high-level concepts in deeper ones. This allows deep learning models to achieve high performance on tasks like image and speech recognition.
184
What is case-based reasoning (CBR)?
Case-based reasoning is a problem-solving approach where new problems are solved by adapting solutions from similar past problems. Instead of learning general rules from data, CBR stores and reuses specific experiences. This makes it useful in domains where human-like reasoning based on past cases is important, such as legal or medical decision-making.
185
What is the difference between model-free and model-based reinforcement learning?
Model-free RL learns policies or value functions directly from experience, without building a model of the environment. Examples include Q-learning and SARSA. Model-based RL, on the other hand, involves learning or using a model of the environment's dynamics to plan ahead and simulate outcomes, which can lead to more sample-efficient learning.
186
What is a utility node in a decision network?
A utility node represents the agent’s preferences over possible outcomes in a decision network. It assigns numerical utility values to different outcomes, allowing the agent to compare and choose between them rationally. The goal is to select decisions that maximize expected utility based on probabilities and outcomes.
187
What is a decision node in a decision network?
A decision node represents a point where the agent must choose between different actions or options. Its value is set by the agent rather than by a probability distribution, so it has no conditional probability table; any arcs entering it are information links showing what the agent knows when the decision is made. Its chosen action influences downstream utility (and possibly chance) nodes, and the agent selects the decision that leads to the highest expected utility.
188
What is a chance node in a Bayesian or decision network?
A chance node represents a random variable whose value is not controlled by the decision-maker but is determined probabilistically. Its conditional probabilities are defined in a Conditional Probability Table (CPT), and its value affects other nodes in the network.
189
What is the difference between a belief state and a state in RL?
A state in reinforcement learning is the complete representation of the environment at a given time. A belief state, used in partially observable environments (POMDPs), is a probability distribution over possible actual states, representing the agent’s uncertainty about the current situation.
190
What is the forward algorithm used for in HMMs?
The forward algorithm is used to compute the probability of an observed sequence of events in an HMM. It efficiently sums over all possible hidden state sequences using dynamic programming, avoiding the need to enumerate all paths explicitly. It is used in filtering and sequence evaluation.
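A compact NumPy sketch (same matrix conventions as the Viterbi sketch above):

```python
import numpy as np

def forward(pi, A, B, obs):
    # pi: initial distribution (K,), A: transitions (K, K),
    # B: emissions (K, M), obs: list of observation indices.
    alpha = pi * B[:, obs[0]]            # joint prob. of each state and the first obs
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate one step and weight by emission
    return alpha.sum()                   # total probability of the observed sequence
```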
191
What is an emission probability in HMMs?
An emission probability is the probability of observing a particular evidence or output given a specific hidden state. It models how the hidden states produce observations and is a key part of the observation model in HMMs.
192
Why are CNNs more efficient than fully connected networks for image data?
CNNs take advantage of spatial locality in images by using shared filters across different regions of the image. This drastically reduces the number of parameters compared to fully connected layers, allowing for faster training, better generalization, and the ability to learn spatial hierarchies of features.
193
What is transfer learning in deep learning?
Transfer learning involves using a model trained on one task and adapting it to a different but related task. It is especially useful when data for the target task is limited. A common approach is to take a pre-trained deep network and fine-tune it on new data, leveraging previously learned features.