Generative Expressive Robot Behaviors using Large Language Models

Generative Expressive Robot Behaviors
Using Large Language Models
HRI 2024 Best Paper (Technical Track)

Google Deepmind

Paper

Appendices

Abstract

People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying "excuse me" to pass people in a busy corridor. We would like robots to also demonstrate expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalities or social situations, while data-driven methods require specialized datasets for each social situation the robot is used in. We propose to leverage the rich social context available from large language models (LLMs) and their ability to generate motion based on instructions or user preferences, to generate expressive robot motion that is adaptable and composable, building upon each other. Our approach utilizes few-shot chain-of-thought prompting to translate human language instructions into parametrized control code using the robot's available and learned skills. Through user studies and simulation experiments, we demonstrate that our approach produces behaviors that users found to be competent and easy to understand. Supplementary material can be found at https://generative-expressive-motion.github.io/.

Approach

We aim to tackle the problem of expressive behavior generation that is both adaptive to user feedback and composable so that more complex behaviors can build on simpler behaviors. Formally, we define being expressive as the distance between some expert expressive trajectory that could be generated by an animator (or demonstrated) \( \tau_{\text{expert}} \) and a robot trajectory \( \tau \). \( \text{dist}(\tau, \tau_{\text{expert}}) \) can be any desirable distance metric between the two trajectories, e.g., dynamic time warping (DTW). GenEM aims to minimize this distance \( d^* = \min \text{dist}(\tau,\tau_{\text{expert}}) \).

Our approach uses several LLMs in a modular fashion so that each LLM agent plays a distinct role. GenEM takes user language instructions \( l_{in} \in L \) as input and outputs a robot policy \( \pi_\theta \), which is in the form of a parameterized code. Human iterative feedback \( f_i \in L \) can be used to update the policy \( \pi_\theta \). The policy parameters get updated one step at a time given the feedback \( f_i \), where \( i \in \{1,\dots, K\} \). The policy can be instantiated from some initial state \( s_0 \in S \) to produce trajectories \( \tau = \{s_0, a_0, \dots, a_{N-1},s_N\} \) or instantiations of expressive robot behavior.

Studies

We conducted two user studies to assess whether our approach, GenEM, can be used to generate expressive behaviors that are perceivable by people. We generated two versions of behaviors: GenEM, and GenEM with iterative Feedback (or GenEM++). In both studies, all comparisons were made against behaviors designed by a professional animator and implemented by a software developer, which we term the oracle animator. In the first study, our goal was to assess whether behaviors that are generated using GenEM and GenEM++ would be perceived similarly to the behaviors created using the oracle animator. In the second study, we attempted to generate behaviors using GenEM and GenEM++ that were similar to the behaviors created using the oracle animator. Both studies aim to demonstrate that our approach is adaptable to human feedback.

Study 1: Benchmarking GenEM

To determine whether our approach produces expressive behaviors that people can perceive, we conducted a within-subjects user study with thirty participants. We generated ten expressive behaviors ranging in complexity: Nod, shake head (Shake), wake up (Wake), excuse me (Excuse), recoverable mistake (Recoverable), unrecoverable mistake (Unrecoverable), acknowledge person walking by (Acknowledge), follow person (Follow), approach person (Approach) and pay attention to person (Attention). The input included a one-line instruction (e.g., Respond to a person saying, “Come here. You cannot speak.”).

Oracle Animator

Nod

Shake

Wake Up

Excuse Me

Recoverable Mistake

Unrecoverable Mistake

Acknowledge, Walk

Approach Person

Pay Attention

GenEM

Nod

Shake

Wake Up

Excuse Me

Recoverable Mistake

Unrecoverable Mistake

Acknowledge, Walk

Follow Person

Approach Person

Pay Attention

GenEM++

Nod

Shake

Wake Up

Excuse Me

Recoverable Mistkae

Unrecoverable Mistake

Acknowledge, Walk

Follow Person

Approach Person

Pay Attention

Results

We found that GenEM++ behaviors were well received and were not perceived as significantly worse than the baseline oracle animator behaviors (and in some cases were perceived as better).

Study 2: Mimicking the Oracle Animator

We conducted an additional user study with twenty four participants to assess whether using GenEM to generate behaviors that resembled the oracle animator would be perceived differently. We generated ten expressive behaviors ranging in complexity, with eight overlapping behaviors from the first study: nod (Nod), shake head (Shake), wake up (Wake), excuse me (Excuse), recoverable mistake (Recoverable), unrecoverable mistake (Unrecoverable), acknowledge person walking by (Acknowledge Walking), acknowledge person stopping by (Acknowledge Stop), follow person (Follow), and teaching session (Teach). Behaviors that were different from the first study were chosen to add further complexity. Unlike in the first study, the prompts were more varied and sometimes included additional descriptions such as for the more complex behaviors (see Appendices for full prompts for each behavior). To create each GenEM++ behavior, an experimenter refined the GenEM behavior through iterative feedback until it appeared similar to the equivalent oracle animator behavior or after exceeding the maximum number of feedback rounds