Canine Instruction Inspiring Gained Abilities in Reinforcement Learning

The overarching goal of our work on Reinforcement Learning (RL) is to address the ways in which computational approaches to RL have drifted from the behavior-analytic principles well studied in psychology. At the highest level, machine RL is a model of operant conditioning. However, RL has evolved into a highly sophisticated science that has more in common with dynamic programming than with behaviorism. Our efforts to advance RL seek to address this disconnect.
Despite a substantial asymmetry in communication capabilities and a severely limited vocabulary comprising a few gradations of "yes" and "no", humans are able to train animals to guide the blind, detect minute amounts of explosive material, predict the onset of seizures, track missing persons, and much more. Compared with the myriad remarkable things humans and animals can accomplish together, our ability to leverage RL agents for even moderately challenging tasks is poor. Our strong hypothesis is that bringing RL closer to quantitative behavior analysis will dramatically enhance our ability to leverage RL to solve increasingly complex problems.
We are especially interested in scenarios where RL agents receive feedback from humans, particularly humans who are not experts in machine learning. Learning from human feedback is challenging for RL because rewards are non-stationary and the "cost" of receiving a reward (which requires a human in the loop) is generally very high. Thus, it is imperative for RL algorithms to maximize the efficiency of learning from every single reward, and our belief is that algorithms incorporating quantitative behavior analysis models are the key to such efficiencies.
Projects
Optimality-preserving Reward Shaping
We aim to address three key limitations of Potential-Based Reward Shaping (PBRS), a technique used in Reinforcement Learning (RL) to improve learning in sparse-reward environments without altering optimal policies: its limited scope, which does not cover all optimality-preserving shaping functions; the practical difficulty of hand-designing effective potential-based rewards, particularly for complex domains such as intrinsic motivation (IM); and its inapplicability to Reinforcement Learning from Human Feedback (RLHF) scenarios, where the "true" reward is unknown and must be estimated.
We are developing a class of methods designed to accelerate training in environments with sparse but known rewards by modifying potentially non-Markovian shaping rewards while preserving optimality. We focus on "plug-and-play" methods that require no manual design of shaping rewards and are easily adaptable to existing shaping rewards or IM terms. The approach can also lead to methods that mitigate deviations from optimality caused by imperfectly learned reward models in environments where the true reward function is inaccessible, e.g., RLHF.
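For reference, the standard PBRS construction adds a shaping term of the form F(s, s') = γΦ(s') − Φ(s) to the environment reward, which is known to leave the set of optimal policies unchanged. The sketch below illustrates that construction; the potential function and grid-world goal are illustrative assumptions, not part of our methods.

```python
# Minimal sketch of standard potential-based reward shaping (PBRS).
# The potential function `phi` and the grid-world goal are illustrative
# assumptions, not part of this project's methods.

def shaped_reward(reward, state, next_state, phi, gamma=0.99, done=False):
    """Add the PBRS term F(s, s') = gamma * phi(s') - phi(s).

    Adding F to the environment reward leaves the set of optimal
    policies unchanged (Ng, Harada, and Russell, 1999).
    """
    next_potential = 0.0 if done else phi(next_state)
    return reward + gamma * next_potential - phi(state)


def manhattan_potential(state, goal=(9, 9)):
    """Example hand-designed potential for a grid world: the negative
    Manhattan distance to a (hypothetical) goal cell."""
    x, y = state
    return -(abs(goal[0] - x) + abs(goal[1] - y))
```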
Funding Source
None
People Involved
- Grant Forbes
- Jianxun Wang
- Leonardo Villalobos-Arias
- Dr. Arnav Jhala
- Dr. David L. Roberts
Reinforcement Learning from Human Ratings
Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to the support of the dataset. However, practical offline datasets often reflect low-stochasticity behavior with limited exploration of the environment, and are collected from multiple behavior policies with diverse levels of expertise. Limited exploration can impair the accuracy of offline RL algorithms' Q- or V-value estimates, while constraining toward diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior-policy constraints. These scenarios bear resemblance to behavior analysis, where coarse gradations of reinforcement and punishment provide barely more than ordinal feedback.
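As a purely illustrative example of the tension described above, consider a behavior-cloning-regularized actor loss in the style of TD3+BC, in which a single coefficient trades off the RL objective against staying close to the dataset actions. This is a generic sketch under assumed actor/critic interfaces, not the algorithm developed in this project.

```python
# Illustrative sketch (not this project's algorithm): a behavior-cloning-
# regularized actor loss in the style of TD3+BC, showing the trade-off
# between the RL objective and the behavior-policy constraint.
import torch


def actor_loss(critic, actor, states, dataset_actions, alpha=2.5):
    """Balance Q-maximization against imitation of the dataset actions.

    `alpha` controls how strongly the learned policy trusts the (possibly
    inaccurate) Q estimates; smaller alpha is more conservative.
    """
    policy_actions = actor(states)
    q_values = critic(states, policy_actions)
    # Normalize the Q term so the two losses are on comparable scales.
    lam = alpha / q_values.abs().mean().detach()
    rl_term = -lam * q_values.mean()
    bc_term = ((policy_actions - dataset_actions) ** 2).mean()
    return rl_term + bc_term
```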
Funding Source
None
Entropy-based Autonomous Curriculum Generation
Curriculum learning is a training method in which an agent is first trained on a curriculum of relatively simple tasks related to a target task in an effort to shorten the time required to train on the target task. Autonomous curriculum generation involves the design of such curricula with no reliance on human knowledge or expertise. Finding an efficient and effective way of autonomously designing curricula and reducing an agent's dependence on costly feedback functions remains an open problem in RL. We aim to leverage the learner's uncertainty to generate curricula that reduce the reliance of learners on human feedback or expensive rewards, improving the efficiency of learning while maintaining the quality of the learned policy. Our approach supports the generation of autonomous curricula in a self-assessed manner by leveraging the learner's past and current policies to estimate where uncertainty could be reduced with more training.
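One hypothetical way to operationalize this idea is to score candidate tasks by the entropy of the learner's current policy and train next on the task where the policy is most uncertain. The sketch below assumes simple `policy(state)` and `sample_states(task, n)` interfaces and is meant only to illustrate the self-assessment intuition, not our specific method.

```python
# Hypothetical sketch: pick the next curriculum task where the learner's
# policy is most uncertain (highest mean action entropy), on the intuition
# that uncertainty marks where more training would help most.
import numpy as np


def policy_entropy(action_probs):
    """Shannon entropy of a single action distribution."""
    p = np.clip(action_probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p))


def select_next_task(candidate_tasks, policy, sample_states, n_samples=32):
    """Return the candidate task with the highest mean policy entropy.

    `sample_states(task, n)` and `policy(state)` are assumed interfaces:
    the former draws representative states from a task, the latter returns
    an action-probability vector for a state.
    """
    def mean_entropy(task):
        states = sample_states(task, n_samples)
        return np.mean([policy_entropy(policy(s)) for s in states])

    return max(candidate_tasks, key=mean_entropy)
```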
Funding Source
None
Canine Inspired Machine Learning
This project looks at how users can train computers and virtual agents to perform tasks using discrete, non-numerical forms of communication, and takes inspiration from techniques used in animal training. We have shown that people use many different strategies when providing feedback to a learner, and may use a lack of feedback to communicate in the same way that they use, for example, explicit rewards. We have developed the SABL algorithm, which allows learning agents to adapt to a user's particular training strategy, allowing them to learn more quickly than with other approaches. We have also considered how natural language commands can be learned through positive and negative feedback.
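To illustrate how a lack of feedback can itself carry information, the simplified sketch below performs a Bayesian update over which action a trainer intends to be correct, with strategy parameters governing how often the trainer stays silent after correct versus incorrect actions. It is a toy illustration of the underlying idea, not the published SABL implementation, and the parameter values are assumptions.

```python
# Toy sketch of strategy-aware learning from discrete feedback: "no
# feedback" is evidence whose meaning depends on the trainer's strategy.
# Not the published SABL implementation; mu_plus, mu_minus, and eps are
# assumed strategy parameters.
import numpy as np


def feedback_likelihood(feedback, action_is_correct,
                        mu_plus=0.1, mu_minus=0.7, eps=0.05):
    """P(feedback | correctness) for feedback in {+1, 0, -1}.

    mu_plus: probability the trainer stays silent after a correct action.
    mu_minus: probability the trainer stays silent after an incorrect action.
    eps: probability of an erroneous (flipped) explicit signal.
    """
    if action_is_correct:
        probs = {+1: (1 - mu_plus) * (1 - eps), 0: mu_plus,
                 -1: (1 - mu_plus) * eps}
    else:
        probs = {+1: (1 - mu_minus) * eps, 0: mu_minus,
                 -1: (1 - mu_minus) * (1 - eps)}
    return probs[feedback]


def update_action_belief(belief, taken_action, feedback, **strategy):
    """Bayesian update of P(desired action) after one feedback event."""
    likelihoods = np.array([
        feedback_likelihood(feedback, taken_action == a, **strategy)
        for a in range(len(belief))
    ])
    posterior = belief * likelihoods
    return posterior / posterior.sum()
```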
Our work has also considered the different factors that influence how users give feedback when teaching virtual agents. We have looked at how a user's feedback depends on the structure of a task. For example, users may give more feedback when the learner reaches a doorway between different rooms than while it is crossing a room. We have also looked at how the speed at which an agent moves affects the amount of feedback given. We have shown that we can effectively adjust that speed to reflect the agent's confidence in its action, in order to elicit more feedback when needed.
Funding Source
NSF
People Involved
- Robert Loftin
- Bei Peng
- James MacGlashan
- Dr. Michael L. Littman
- Dr. Matthew E. Taylor
- Dr. David L. Roberts
Continuous Reinforcement Learning Over Long Time Horizons
Significant progress has been made in applying reinforcement learning algorithms to continuous, high-dimensional domains. Algorithms such as fitted Q-iteration allow agents to learn about complex environments and plan their actions accordingly. These algorithms can struggle, however, when faced with tasks that require a large amount of time and a large number of actions to complete. When using approximate representations of the learning problem, reinforcement learning algorithms can fail to accurately account for the long-term effects of their actions. This project seeks to develop reinforcement learning algorithms that are specifically suited to long-term planning. In particular, we are interested in modifications to fitted Q-iteration and fitted policy iteration with neural network representations.
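As a point of reference, the sketch below shows a minimal fitted Q-iteration loop with a generic regressor standing in for the neural-network function approximators mentioned above; the transition-tuple format and discrete action set are illustrative assumptions. Because each iteration regresses onto bootstrapped targets from the previous model, approximation errors can compound across iterations, which is precisely the long-horizon difficulty this project targets.

```python
# Minimal sketch of fitted Q-iteration with a generic regressor (a
# scikit-learn model as a stand-in for a neural network). The transition
# format (s, a, r, s', done) and discrete action set are assumptions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor


def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.99):
    """transitions: list of (state, action, reward, next_state, done)."""
    states = np.array([t[0] for t in transitions])
    actions = np.array([[t[1]] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    dones = np.array([t[4] for t in transitions], dtype=float)

    X = np.hstack([states, actions])
    q = None
    for _ in range(n_iters):
        if q is None:
            targets = rewards  # first iteration: one-step rewards only
        else:
            # Bootstrap: max over actions of the previous Q estimate.
            # Errors in `q` propagate into these targets and can compound
            # over many iterations (i.e., over long horizons).
            next_qs = np.column_stack([
                q.predict(np.hstack([next_states,
                                     np.full((len(next_states), 1), a)]))
                for a in range(n_actions)
            ])
            targets = rewards + gamma * (1.0 - dones) * next_qs.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q
```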