2.6 Side Effects and Reward Hacking Flashcards
(6 cards)
Side effects and reward hacking
Can result in AI-based systems generating unexpected, and even harmful results
Negative side effects can result when
The designer of an AI-based system specifies a goal that ‘focuses on accomplishing some specific tasks in the environment but ignores other aspects of the (potentially very large) environment, and thus implicitly expresses indifference over environmental variables that might actually be harmful to change’
Example of negative side effect
A self-driving car aiming for ‘fuel-efficient and safe’ travel may achieve the goal, but passengers become annoyed at excessive time taken
Reward hacking can result from
An AI-based system achieving a specified goal by using a ‘clever’ or ‘easy’ solution that ‘perverts the spirit of the designer’s intent’
Reward hacking effectively means
The goal can be gamed
Widely used example of reward hacking
An AI-based system teaching itself to play an arcade computer game with the goal of ‘highest score’ hacks the data record that stores the highest score, rather than playing the game to achieve it