IRL (v3), Hung-yi Lee, Deep Learning
• Reward function:
  • Forward vs. reverse driving
  • Amount of switching between forward and reverse
  • Lane keeping
  • On-road vs. off-road
  • Curvature of paths
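These features suggest a linear reward of the form R(τ) = w · f(τ). Below is a minimal pure-Python sketch of such a reward; the feature names and the weight values are made up for illustration and are not the ones learned in the parking-lot-navigation work.

# Hand-designed features of one driving trajectory (illustrative values).
# Each entry measures one of the properties listed above.
features = {
    "forward_driving":    1.0,   # fraction of time driving forward
    "reverse_driving":    0.0,   # fraction of time driving in reverse
    "direction_switches": 2.0,   # number of forward/reverse switches
    "lane_keeping":       0.9,   # how well the path stays in lane
    "off_road":           0.0,   # fraction of path that is off-road
    "curvature":          0.3,   # average curvature of the path
}

# IRL learns these weights; here they are just made-up numbers.
weights = {
    "forward_driving":    1.0,
    "reverse_driving":   -0.5,
    "direction_switches": -0.2,
    "lane_keeping":       2.0,
    "off_road":          -3.0,
    "curvature":         -0.1,
}

# Linear reward: R(trajectory) = w · f(trajectory)
reward = sum(weights[k] * features[k] for k in features)
print(reward)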
Inverse Reinforcement Learning
Environment Dynamics P(s’|s,a)
Inverse Reinforcement Learning
Modeling the reward can be easier: a simple reward function can lead to a complex policy.
For each training example r: ỹ^r = arg max_{y ∈ Y} w · φ(x^r, y)    (solving this arg max can be an issue)
If ỹ^r ≠ ŷ^r:  w ← w + φ(x^r, ŷ^r) − φ(x^r, ỹ^r)
If ỹ^r = ŷ^r for every example: We are done!
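A minimal Python sketch of this perceptron-style update, assuming the feature function φ(x, y) and the training pairs are given and the output space Y is small enough to enumerate (in the IRL analogy, the arg max step would instead be solved by reinforcement learning). All names are placeholders, not code from the lecture.

import numpy as np

def structured_perceptron(examples, phi, Y, dim, n_epochs=10):
    """examples: list of (x, y_hat) pairs; phi(x, y) -> np.ndarray of length dim;
    Y: list of candidate outputs, assumed enumerable and comparable with !=."""
    w = np.zeros(dim)
    for _ in range(n_epochs):
        updated = False
        for x, y_hat in examples:
            # y_tilde = arg max_y  w · phi(x, y)   (the hard part in general)
            y_tilde = max(Y, key=lambda y: w @ phi(x, y))
            if y_tilde != y_hat:
                # Move w toward the expert's output and away from the current best.
                w = w + phi(x, y_hat) - phi(x, y_tilde)
                updated = True
        if not updated:      # every y_tilde already equals y_hat: we are done
            break
    return w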
IRL vs. Structured Perceptron
F(x, y) = w · φ(x, y)
Third Person Imitation Learning
One-shot Imitation Learning
Unstructured Demonstration
• Review: InfoGAN
  The generator input z is split into two parts: z = (z', c), where c is the latent code.
Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, Joseph Lim, "Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets", arXiv preprint, 2017
Testing (inference): find the policy.
Review: Structured Perceptron
Training data: {(x¹, ŷ¹), (x², ŷ²), …, (x^r, ŷ^r), …}
F(x, y) = w · φ(x, y)
o_i → NN Actor → a_i
Behavior Cloning
• Problem
The expert only samples limited observations (states). Solution: Dataset Aggregation, i.e., let the expert label the states seen by the machine.
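A rough sketch of the resulting Dataset Aggregation (DAgger) loop. The helpers run_policy, expert_action, and train_supervised are hypothetical stand-ins for collecting rollouts, querying the expert, and supervised training; they are not from the slides.

# Dataset Aggregation (DAgger), sketched with hypothetical helpers.
def dagger(env, expert_action, train_supervised, run_policy, n_iterations=5):
    dataset = []          # list of (observation, expert action) pairs
    policy = None
    for i in range(n_iterations):
        # 1. Run the current policy (the expert on the first iteration)
        #    to collect the observations the machine itself actually visits.
        observations = run_policy(env, policy if policy is not None else expert_action)
        # 2. Ask the expert what to do in those states (labels only;
        #    the expert's actions are not executed).
        dataset += [(o, expert_action(o)) for o in observations]
        # 3. Retrain the actor on the aggregated dataset by supervised learning.
        policy = train_supervised(dataset)
    return policy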
R(τ) = w · f(τ)    w: parameters
ỹ = arg max_{y ∈ Y} F(x, y)
This is reinforcement learning.
GAN for Imitation Learning
Jonathan Ho and Stefano Ermon, "Generative Adversarial Imitation Learning", NIPS, 2016
Mismatch: the observations o_i seen by the actor (o_i → Actor → a_i) can differ from those in the expert's demonstrations.
Inverse Reinforcement Learning (IRL)
Also known as inverse optimal control, inverse optimal planning
Pieter Abbeel and Andrew Y. Ng, "Apprenticeship Learning via Inverse Reinforcement Learning", ICML, 2004
Behavior Cloning
• Dataset Aggregation
Behavior Cloning
• Major problem: if the machine has limited capacity, it may choose the wrong behaviors to copy.
Behavior Cloning
• Self-driving cars as an example
Yes, this is supervised learning.
Observation → Expert (human driver): drive forward; Machine: drive forward
Training data: pairs of observations o_i and expert actions â_i
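Behavior cloning then reduces to ordinary supervised learning on these (o_i, â_i) pairs. A minimal PyTorch sketch, where the network size, the discrete action space, and the random stand-in data are assumptions for illustration:

import torch
import torch.nn as nn

obs_dim, n_actions = 16, 4          # made-up sizes for illustration

# NN Actor: observation o_i -> scores over discrete actions (e.g., forward/left/right/brake)
actor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training data: observations o_i and the expert's actions a_i (random stand-ins here).
obs = torch.randn(1000, obs_dim)
expert_actions = torch.randint(0, n_actions, (1000,))

for epoch in range(10):
    logits = actor(obs)
    loss = loss_fn(logits, expert_actions)   # supervised learning: every error counts equally
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()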
[Diagram: training pairs o_i → NN Actor → a_i; the expert demonstrates both speech and gestures, and the actor trained by behavior cloning copies both the speech and the gestures.]
• Some behaviors must be copied, but others can be ignored. • Supervised learning treats all errors equally.
Algorithm
• Find a reward function under which the expert's demonstrations obtain a larger total reward than the actor's trajectories.
• Find an actor that maximizes this reward by reinforcement learning.
• Iterate the two steps.
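A toy numerical sketch of this loop, where the reward is assumed linear in trajectory features (R(τ) = w · f(τ)) and the reinforcement-learning step is replaced by picking the best of a few fixed candidate trajectories; all numbers are made up.

import numpy as np

# Toy IRL loop over hand-made trajectory feature vectors (all values illustrative).
expert_f = np.array([1.0, 0.9, 0.0])          # f(expert trajectory)
candidate_fs = [np.array([0.2, 0.1, 0.8]),    # f(trajectories the actor can produce);
                np.array([0.9, 0.5, 0.1]),    # in practice these would come from
                np.array([1.0, 0.8, 0.05])]   # running the actor in the environment

w = np.zeros(3)                                # parameters of R(tau) = w · f(tau)
for it in range(20):
    # "Find the actor maximizing reward": here a trivial stand-in for the RL step.
    actor_f = max(candidate_fs, key=lambda f: w @ f)
    # "Find a reward function under which the expert gets larger reward than the actor."
    if w @ expert_f <= w @ actor_f and not np.allclose(actor_f, expert_f):
        w = w + (expert_f - actor_f)
    else:
        break
print(w)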
Recap: Sentence Generation & Chat-bot
(“<BOS>”, ”床”), (“床”, ”前”), (“床前”, ”明”), ……
Maximum likelihood is behavior cloning. Now we have better approaches such as SeqGAN.
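To make the "maximum likelihood is behavior cloning" point concrete, here is a tiny PyTorch sketch that trains a next-token model on (prefix, next token) pairs like the ones above; the bigram-style model (conditioning only on the previous token) and the layer sizes are simplifications for illustration, not the actual seq2seq generator.

import torch
import torch.nn as nn

# Expert trajectory "床前明月光" turned into (previous token, next token) pairs,
# exactly the behavior-cloning pairs listed above.
vocab = ["<BOS>", "床", "前", "明", "月", "光"]
tok = {c: i for i, c in enumerate(vocab)}
pairs = [("<BOS>", "床"), ("床", "前"), ("前", "明"), ("明", "月"), ("月", "光")]

# A tiny next-token model (bigram-style, as a simplification of seq2seq).
model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.tensor([tok[a] for a, b in pairs])
y = torch.tensor([tok[b] for a, b in pairs])
for step in range(200):
    loss = loss_fn(model(x), y)     # maximum likelihood: copy the expert's next token
    opt.zero_grad()
    loss.backward()
    opt.step()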
Examples of Recent Study
Parking Lot Navigation
Sentence generation. Expert trajectory: 床前明月光 (a line from a classical Chinese poem), giving training pairs (“<BOS>”, ”床”), (“床”, ”前”), (“床前”, ”明”), ……
Chat-bot. Expert trajectory: input “how are you”, output “I am fine”, giving training pairs (“input, <BOS>”, ”I”), (“input, I”, ”am”), (“input, I am”, ”fine”), ……
First Person
Third Person
http://lasa.epfl.ch/research_new/ML/index.php
https://kknews.cc/sports/q5kbb8.html
http://sc.chinaz.com/Files/pic/icons/1913/%E6%9C%BA%E5%99%A8%E4%BA%BA%E5%9B%BE%E6%A0%87%E4%B8%8B%E8%BD%BD34.png
Inverse Reinforcement Learning
Ring a bell in your mind?
Inverse Reinforcement Learning
Structured learning analogy. Training: find the reward function.
http://www.commitstrip.com/en/2017/06/07/ai-inside/?
Path Planning
Third Person Imitation Learning
• Ref: Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever, “Third-Person Imitation Learning”, arXiv preprint, 2017
Imitation Learning
Introduction
• Imitation Learning
  • Also known as learning by demonstration or apprenticeship learning
  • An expert demonstrates how to solve the task
  • The machine can also interact with the environment, but cannot explicitly obtain reward
    • It is hard to define the reward in some tasks
    • Hand-crafted rewards can lead to uncontrolled behavior
  • Three approaches:
    • Behavior Cloning
    • Inverse Reinforcement Learning
    • Generative Adversarial Network (GAN)
[Diagram: the Generator takes z = (z', c) and outputs x; the Discriminator maps x to a scalar (real or generated); the Predictor takes x and predicts the code c that generated x.]
Unstructured Demonstration
• The solution is similar to InfoGAN.
[Diagram: the Actor takes observation o and code c and outputs action a; the Discriminator maps (o, a) to a scalar, comparing the actor's pairs with the expert demonstrations; the Predictor predicts the code c given o and a.]
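A sketch of the three networks and the InfoGAN-style code-prediction term in the actor's objective. The layer sizes, the one-hot code, and the exact loss form are assumptions for illustration, not the implementation of Hausman et al.

import torch
import torch.nn as nn

obs_dim, act_dim, n_codes = 8, 2, 3     # made-up sizes

actor = nn.Sequential(nn.Linear(obs_dim + n_codes, 64), nn.Tanh(), nn.Linear(64, act_dim))
discriminator = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                              nn.Linear(64, 1), nn.Sigmoid())      # expert-like or not (scalar)
predictor = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                          nn.Linear(64, n_codes))                  # recover code c from (o, a)

o = torch.randn(16, obs_dim)                       # a batch of observations (random stand-ins)
c = torch.randint(0, n_codes, (16,))               # sampled intentions / skills
c_onehot = nn.functional.one_hot(c, n_codes).float()

a = actor(torch.cat([o, c_onehot], dim=-1))        # the action depends on both o and c
d_score = discriminator(torch.cat([o, a], dim=-1)) # how expert-like is (o, a)?
code_logits = predictor(torch.cat([o, a], dim=-1))

# The actor is rewarded for fooling the discriminator AND for keeping c recoverable,
# which separates the different skills mixed in the unstructured demonstrations.
actor_loss = -torch.log(d_score + 1e-8).mean() \
             + nn.functional.cross_entropy(code_logits, c)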
GAN vs. Imitation Learning
[Diagram: in a GAN, the generator G maps samples from a normal distribution to outputs whose distribution should be as close as possible to the data distribution; in imitation learning, the actor interacts with the dynamics of the environment so that its trajectory distribution is as close as possible to the expert's.]
GAN for Imitation Learning
• Discriminator D: is a whole trajectory from the expert or not?
• Find a discriminator such that expert trajectories obtain large values and the actor's trajectories obtain small values.
GAN for Imitation Learning
• Discriminator
• Local discriminator d(s, a): distinguishes (s, a) pairs from the expert from (s, a) pairs generated by the actor.
GAN for Imitation Learning
• Generator
• The generator (actor) is updated by policy gradient. Given the discriminator D, each step in the same trajectory can have a different value.
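A rough GAIL-style sketch of these two steps on stand-in data: the discriminator is trained to score expert (s, a) pairs high and actor pairs low, and the actor is updated by a REINFORCE-style policy gradient in which every (s, a) step gets its own value from the discriminator. Network sizes, the one-hot action encoding, and the use of −log(1 − d(s, a)) as the per-step value are assumptions for illustration, not the exact objective from the paper.

import torch
import torch.nn as nn

state_dim, n_actions = 8, 4                         # made-up sizes

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
d = nn.Sequential(nn.Linear(state_dim + n_actions, 64), nn.Tanh(),
                  nn.Linear(64, 1), nn.Sigmoid())   # local discriminator d(s, a)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)

def onehot(a):
    return nn.functional.one_hot(a, n_actions).float()

# Stand-in data: expert (s, a) pairs and states visited by the current actor.
expert_s = torch.randn(64, state_dim)
expert_a = torch.randint(0, n_actions, (64,))
actor_s = torch.randn(64, state_dim)
with torch.no_grad():
    actor_a = torch.distributions.Categorical(logits=policy(actor_s)).sample()

# 1. Discriminator step: expert pairs should get large values, actor pairs small values.
d_expert = d(torch.cat([expert_s, onehot(expert_a)], dim=-1))
d_actor = d(torch.cat([actor_s, onehot(actor_a)], dim=-1))
loss_d = -(torch.log(d_expert + 1e-8).mean() + torch.log(1 - d_actor + 1e-8).mean())
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# 2. Generator (actor) step: policy gradient, where every (s, a) step in a trajectory
#    gets its own value from the discriminator instead of one value per trajectory.
with torch.no_grad():
    step_value = -torch.log(1 - d(torch.cat([actor_s, onehot(actor_a)], dim=-1)) + 1e-8).squeeze(-1)
logp = torch.distributions.Categorical(logits=policy(actor_s)).log_prob(actor_a)
loss_pi = -(logp * step_value).mean()
opt_pi.zero_grad()
loss_pi.backward()
opt_pi.step()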