w7 Flashcards
(36 cards)
4P’s of Creativity
- Product (output, like list of creative ideas)
- Process (how one comes up with the ideas, e.g., optimal foraging)
- Person (characteristics of a creative person, e.g., openness, # published poems)
- Press (effects of context/environment, e.g., instructional manipulation)
Alternative Uses Task: GPT vs. humans
Humans:
- Excel in originality, generating creative and semantically distant ideas.
- Show a tradeoff between originality and utility, balancing both effectively.
GPT:
- Excels in utility, producing practical and functional responses.
- Struggles with originality, often generating predictable or less novel ideas due to reliance on training-data patterns.
What is abstraction?
The ability to discover patterns from a few noisy examples, whether images, sounds, or input from other senses.
What is analogical reasoning?
Learning about new things by relating them to what you already know.
What is analogy-making?
Using what you know about a situation to infer knowledge about a new, somehow related instance.
Calculator is to arithmetic as ChatGPT is to?
In adults, the processing steps are:
(1) encode analogy elements A, B and C
(2) search for relationship between A and B (“stands on”)
(3) align A and C (“body and tree are things that stand”)
(4) map A:B relationship to C to get D (“tree stands on roots”)
In children, the processing steps are (contrast with the adult steps in the sketch below):
(1) encode analogy elements C (maybe A, B, instruction)
(2) ignore / forget about A and B?
(3) search for similar instances to C (“leaves”, “branches”, “bush”, “nest”)
(4) use perceptual/semantic similarity with C to get D (“tree has leaves”)
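A minimal sketch of the contrast, in Python, using a hypothetical relation table and association list (illustrative data, not from the lecture):

```python
# Toy contrast of adult vs. child analogy completion on "body : feet :: tree : ?".
# RELATIONS and ASSOCIATIONS are illustrative assumptions, not lecture data.

RELATIONS = {
    ("body", "feet"): "stands on",   # the A:B relation
    ("tree", "roots"): "stands on",  # candidate C:D relations
    ("tree", "leaves"): "has",
    ("tree", "branches"): "has",
}

ASSOCIATIONS = {"tree": ["leaves", "branches", "bush", "nest", "roots"]}

def adult_solve(a, b, c):
    """Adult strategy: extract the A:B relation, then map it onto C."""
    relation = RELATIONS[(a, b)]                      # step (2): "stands on"
    for (source, target), rel in RELATIONS.items():   # step (4): find D bearing the same relation to C
        if source == c and rel == relation:
            return target
    return None

def child_solve(c):
    """Child strategy: ignore A and B; return the strongest associate of C."""
    return ASSOCIATIONS[c][0]                         # steps (3)-(4): "leaves"

print(adult_solve("body", "feet", "tree"))  # -> roots  (relational mapping)
print(child_solve("tree"))                  # -> leaves (associative completion)
```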
Why do LLMs have trouble with analogical transfer?
- Lack domain knowledge
- Lack conceptual abstraction of what constitutes an alphabet, such as being an ordered sequence, so they can't flexibly map to less familiar domains
What do good (A)GI tests have?
- unlimited rules
- limited training examples
to force the AI to "think"
Associative error: Duplication
Associative Error:
- Errors made by AI models due to their reliance on statistical associations within the training data. These errors reflect overreliance on patterns or repetitions rather than reasoning.
Duplication:
- A specific type of associative error where the model repeats parts of the input or closely related content, mistaking it for a valid response.
Example: In language generation, if prompted with a sentence about a “brick wall,” the model might redundantly describe “a wall made of bricks” without adding new information or insight.
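A toy way to quantify duplication (an illustrative heuristic, not from the lecture): score a response by the fraction of its words already present in the prompt.

```python
# Toy duplication score: fraction of response words already in the prompt.
# Purely illustrative; real evaluations use more robust overlap metrics.
def duplication_score(prompt: str, response: str) -> float:
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    return sum(w in prompt_words for w in response_words) / max(len(response_words), 1)

# The "brick wall" example: 2 of 5 response words repeat the prompt.
print(duplication_score("a brick wall", "a wall made of bricks"))  # 0.4
```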
Literal vs. conceptual solutions
Literal Solutions:
- Solutions that rely on surface-level patterns or direct, observable features of a problem without engaging with deeper abstract principles or relationships.
In AI (as highlighted in the lecture and book chapters), literal solutions are often the result of statistical correlations learned during training.
Example: In an analogy problem, if AI matches answers based solely on visible similarities without understanding the underlying relationship, it provides a literal solution.
Conceptual Solutions:
- Solutions that involve understanding and applying abstract principles, relationships, or rules that go beyond surface-level features.
These solutions demonstrate an ability to generalize and capture deeper meanings, which is challenging for LLMs.
Example: Correctly solving analogies like “Athens : Greece :: Paris : France” by understanding the city-country relationship rather than pattern-matching.
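A minimal sketch of the distinction (hypothetical lookup table; a crude letter-overlap heuristic stands in for surface similarity):

```python
# Toy contrast of literal vs. conceptual solving on "Athens : Greece :: Paris : ?".
# CAPITAL_OF and the letter-overlap heuristic are illustrative assumptions.

CAPITAL_OF = {"Athens": "Greece", "Paris": "France", "Rome": "Italy"}

def conceptual_solve(a, b, c):
    """Apply the abstract city-country relation: if b is a's country, return c's country."""
    if CAPITAL_OF.get(a) == b:
        return CAPITAL_OF.get(c)
    return None

def literal_solve(c, options):
    """Pick the option with the most surface overlap with C (shared letters),
    ignoring the A:B relation entirely."""
    return max(options, key=lambda opt: len(set(c.lower()) & set(opt.lower())))

options = ["France", "Paris Saint-Germain", "Italy"]
print(conceptual_solve("Athens", "Greece", "Paris"))  # -> France (relation applied)
print(literal_solve("Paris", options))                # -> Paris Saint-Germain (surface match)
```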
Do LLMs solve kidsARC items like children?
- Young kids and LLMs make numerous duplication errors.
- Humans make more concept errors.
- LLMs make more literal errors.
True or False: LLMs' analogy-making and abstraction lag far behind humans', and LLMs currently cannot generalize as well as children can.
True
Challenges of Evaluating AI Intelligence:
Data Contamination: Inflated performance occurs when test data leaks into training sets, making results unreliable (e.g., GPT models performing well on benchmarks they were exposed to).
Benchmark Limitations: Tests like the Bar Exam or GLUE measure surface-level skills and statistical patterns but do not reflect true general intelligence.
Robustness Issues: AI struggles with edge cases and generalizing to unseen data.
Example: GPT-4’s ability to pass standardized tests does not equate to deep understanding or reasoning.
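A minimal sketch of one common contamination check, flagging test items that share a long verbatim n-gram with the training corpus (function name and the choice of n are illustrative assumptions):

```python
# Naive data-contamination check: flag test items that share an 8-word
# verbatim n-gram with the training corpus. Illustrative heuristic only;
# real audits also look for paraphrases and partial leaks.
def find_contaminated(test_items, training_text, n=8):
    words = training_text.split()
    corpus_ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    flagged = []
    for item in test_items:
        item_words = item.split()
        grams = (" ".join(item_words[i:i + n]) for i in range(len(item_words) - n + 1))
        if any(g in corpus_ngrams for g in grams):
            flagged.append(item)
    return flagged
```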
Creativity in AI: Discuss how the 4P's of Creativity framework from the lecture applies to AI systems. Which aspects of this framework do AI models struggle with the most, and why?
- The 4P’s framework (Product, Process, Person, Press) shows AI excels in creating products but struggles with Process and Person aspects. AI lacks emotional nuance, meaning, and flexibility in ideation.
Example: In the “Alternative Uses Test,” GPT-3 gave functional but predictable responses (“brick as decoration”), whereas humans generated abstract ideas tied to emotional and situational contexts.
Theory of Mind in AI:
GPT-4’s apparent “theory of mind” success is attributed to statistical associations rather than genuine psychological reasoning.
Example: In false-belief tasks, GPT-4 predicts answers based on learned patterns but lacks an internal model of beliefs or mental states, as highlighted in the Melanie Mitchell article.
Human-Centered AI:
Principles like fairness, inclusivity, accountability, and transparency ensure AI aligns with human values.
Addressing bias in hiring algorithms or ensuring equitable access to AI technologies exemplifies these principles. Regulation is essential to mitigate risks like job displacement and surveillance.
Analogical Reasoning:
AI fails in analogical reasoning because it lacks relational understanding. Humans use context and knowledge transfer (e.g., “roots anchor a tree like feet anchor a body”), while AI relies on surface-level patterns.
Example: ARC tests expose AI’s inability to generalize relationships from unfamiliar symbols or novel tasks.
What is the primary issue with data contamination in AI evaluations?
a) AI systems fail to recognize patterns in new datasets.
b) AI training datasets unintentionally include test questions, inflating performance.
c) AI systems are unable to distinguish between training and test data.
d) AI evaluations are too costly to administer reliably.
b) AI training datasets unintentionally include test questions, inflating performance.
In the “Alternative Uses Test,” how did GPT-3’s responses differ from human responses?
a) GPT-3 generated highly abstract uses compared to humans.
b) GPT-3 responses were more flexible but lacked persistence.
c) GPT-3 provided functional but less semantically distant uses.
d) GPT-3 outperformed humans in generating original ideas.
c) GPT-3 provided functional but less semantically distant uses.
What is the primary reason AI struggles with theory-of-mind tasks?
a) Lack of sufficient training data on psychological reasoning.
b) Dependence on shallow heuristics rather than robust conceptual understanding.
c) AI models lack the processing power for abstract reasoning.
d) AI systems misinterpret linguistic prompts in tasks.
b) Dependence on shallow heuristics rather than robust conceptual understanding.
According to Chapter 15, which principle is NOT part of human-centered AI design?
a) Accountability and transparency.
b) Maximizing automation to replace human labor.
c) Ensuring fairness and inclusivity.
d) Aligning AI with human values.
b) Maximizing automation to replace human labor.
Why are benchmarks for AI often criticized as “broken”?
a) They are too complex for most AI models.
b) They encourage overfitting by focusing on statistical shortcuts.
c) They fail to include tasks that humans perform poorly on.
d) They prioritize open-source models over proprietary ones.
b) They encourage overfitting by focusing on statistical shortcuts.
True or False: Melanie Mitchell argues that AI models’ high performance on benchmarks like the Bar Exam proves they possess general intelligence.
False