Flappy Bird Based on Deep Reinforcement Learning


Flappy Bird Source Program Design

Introduction

Flappy Bird is a classic bird game originally developed by the Vietnamese game developer Dong Nguyen in 2013.

Its simple controls and addictive gameplay made it hugely popular in a very short time, and it became a great success across platforms.

This document describes the source program design of Flappy Bird, covering key elements such as the basic game logic, the graphical interface, and collision detection.

Game Logic

The game logic of Flappy Bird is very simple.

The player taps the screen to control the bird's flight and guide it through the pipes that continuously scroll across the screen.

Each pipe passed earns one point; hitting a pipe or the ground ends the game.

Players can keep retrying to chase a higher score.

The game logic consists of the following parts:
- Initialization: set the size of the game window, the bird's starting position, and the initial positions and spacing of the pipes.
- Main loop: bird jump (the bird's height is driven by the timing and frequency of the player's taps); pipe movement (pipes scroll to the left, and the game checks whether the bird has passed a pipe); collision detection (the game ends if the bird collides with a pipe or with the ground).
- Game over: show the game-over screen and display the player's score. A minimal loop skeleton is sketched below.
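As a rough illustration of the structure just described, here is a minimal game-loop skeleton in Python using pygame. It is only a sketch of the described flow, not the original program: the class names, window size, gravity and jump constants, and pipe geometry are all assumptions chosen for the example, and drawing is omitted.

```python
import pygame

WIDTH, HEIGHT, GROUND_Y = 288, 512, 450
GRAVITY, JUMP_VELOCITY, PIPE_SPEED = 0.4, -7.0, 3

class Bird:
    def __init__(self):
        self.x, self.y, self.vel = 60, 250.0, 0.0
        self.rect = pygame.Rect(self.x, int(self.y), 34, 24)
    def flap(self):
        self.vel = JUMP_VELOCITY            # a tap gives an upward impulse
    def step(self):
        self.vel += GRAVITY                 # constant downward acceleration
        self.y += self.vel
        self.rect.y = int(self.y)

class Pipe:
    def __init__(self, x, gap_y, gap=100):
        self.x, self.passed = x, False
        self.top = pygame.Rect(x, 0, 52, gap_y)               # upper pipe
        self.bottom = pygame.Rect(x, gap_y + gap, 52, HEIGHT) # lower pipe
    def step(self):
        self.x -= PIPE_SPEED                # pipes scroll to the left
        self.top.x = self.bottom.x = int(self.x)

def game_loop():
    pygame.init()
    pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()
    bird, pipes, score = Bird(), [Pipe(300, 200), Pipe(480, 260)], 0
    while True:
        for event in pygame.event.get():
            if event.type == pygame.MOUSEBUTTONDOWN:
                bird.flap()                 # player input: jump
        bird.step()
        for p in pipes:
            p.step()
            if not p.passed and p.x + 52 < bird.x:
                p.passed, score = True, score + 1   # one point per pipe passed
            if bird.rect.colliderect(p.top) or bird.rect.colliderect(p.bottom):
                return score                # hit a pipe: game over
        if bird.y + 24 >= GROUND_Y:
            return score                    # hit the ground: game over
        # drawing of background, pipes, bird and score is omitted here
        clock.tick(60)

if __name__ == "__main__":
    print("score:", game_loop())
```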

Graphical Interface Design

Flappy Bird's graphical interface is simple and fun. It consists of the following elements:
- Background: blue sky and clouds, scrolled in a loop to create a sense of motion.
- Bird: a simple 2D sprite; a small sequence of images in different states produces the flapping animation.
- Pipes: each obstacle is made of an upper and a lower pipe, scrolled in a loop to create motion.
- Ground: a platform; the game ends when the bird hits it.
- Score: the player's current score is shown at the top or center of the game window.

Collision Detection

Collision detection is a crucial part of Flappy Bird. There are two main checks:
- Bird vs. pipes: test whether the bird's position overlaps either the upper or the lower pipe; any overlap counts as a collision.
- Bird vs. ground: test whether the bottom of the bird has reached the ground; if so, it counts as a collision. A simple rectangle-overlap check along these lines is sketched below.
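The document does not give the actual collision routine, so the following is a minimal axis-aligned bounding-box (AABB) sketch of the two checks just described; the rectangle tuple format (x, y, width, height) and the helper names are assumptions for illustration.

```python
def rects_overlap(a, b):
    """a and b are (x, y, w, h) rectangles; True if they intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def bird_collides(bird, upper_pipe, lower_pipe, ground_y):
    """Collision if the bird overlaps either pipe, or its bottom reaches the ground."""
    hit_pipe = rects_overlap(bird, upper_pipe) or rects_overlap(bird, lower_pipe)
    hit_ground = bird[1] + bird[3] >= ground_y
    return hit_pipe or hit_ground

# example: a 34x24 bird at (60, 300), a pipe pair at x=80, ground at y=450
print(bird_collides((60, 300, 34, 24), (80, 0, 52, 200), (80, 320, 52, 130), 450))
```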

浅述基于深度强化学习的鸟群检测与应用


浅述基于深度强化学习的鸟群检测与应用作者:赵翔武淑红来源:《科学与信息化》2018年第29期摘要飞机在起飞和下降的过程一旦受到飞鸟的撞击,会导致机毁人亡的惨剧。

使用摄像头捕获机场的图像然后检测出鸟群出现的位置进行驱赶,能够提高机场工作人员的工作。

基于深度学习的卷积神经网络近些年在物体检测领域取得了突破性的研究和进展,该算法使用卷积网络用于机场的鸟群检测,然后在进行人工的驱赶。

考虑到现有的基于区域的物体检测算法需要在大量的候选框基础上进行边框回归消耗了大量的时间,基于此,本文提出基于深度强化学习的飞鸟检测算法。

实验表明,本文提出的检测算法能够在提升检测的效率同时保证检测的精度。

关键词深度学习;强化学习;物体检测;卷积网络前言飞机作为最便捷的交通工具受到人们的广泛喜爱。

由于飞机的飞行速度快,在起飞和下降的过程中如果和飞鸟碰撞则会导致机毁人亡的惨剧。

因此飞机场每年都会花费大量的人力和物力进行飞鸟驱赶任务,其中检测环节最为重要。

近些年基于深度学习的卷积神经网络[1]在计算机视觉领域取得了突破性的研究和进展,例如图像分类,物体分割和识别。

基于深度学习的物体检测算法首先使用selective search算法生成一些候选区域,然后将这些局部区域作为卷积网络的输入学习方框中包含物体的类别,并且使用边框回归算法进行位置的细化。

基于区域的物体检测算法虽然检测精度较高,但是时间效率低下。

基于此本文提出基于深度强化学习的飞鸟检测算法。

强化学习[2]属于机器学习的领域范畴,使用智能体与环境进行交互,在不同的状态下执行不同的动作,环境会给出对应的奖励值。

强化学习通过最大化累计的奖励值优化目标函数,学习状态和动作的映射关系。

将强化学习引入飞鸟检测领域使用“智能”的滑动窗口在图像中寻找物体的位置。

通过实验表明,基于深度强化学习的飞鸟检测算法在8个步骤之内就能够找到飞鸟的大概位置,由于飞鸟检测任务仅是为了找出鸟的大概位置,然后在进行人工驱赶,因此本文所提出的深度强化学习物体检测算法能够满足实际的需求,完成机场飞鸟检测的任务。

小学六年级课后服务:scratch少儿编程 四阶第7课:Flappy Bird

Flappy Bird
课程目标
课程内容 模拟流程一步一步完成一个完整的像素鸟飞行游戏。
课程时间 教学目标 教学难点
45分钟
1、像素鸟的飞行运动。 2、管道与地面的克隆运行。 3、场景的互动。
点击鼠标后小鸟的飞行与下降速率占比。
设备要求 音响、A4纸、笔
• 课程导入 • 程序解析 • 课堂任务 • 升级任务 • 知识拓展 • 创意练习
06 创意练习
• 创意练习 练习:1、那发挥自由的想象力, 能不能添加关卡功能。
04 升级任务
• 动手练习
练习:1.尝试能不能画出三角形,四边形或者多边形的物体来 !
05 知识拓展
05 像素的秘密
像素的秘密: 今天我们一起制作了像素鸟飞行程序,一起学习像素鸟飞行程序的
知识。就让我们一起来了解一下像素是什么吧! 像素是指由图像的小方格组成的, 这些小方块都有一个明确的位置和被分配的色彩 数值, 小方格颜色和位置就决定该图像所呈现出来的样子。可以将像素视为整个图 像中不可分割的单位或者是元素。不可分割的意思是它不能够再切割成更小单位抑 或是元素,它是以一个单一颜色的小格存在 。每一个点阵图像包含了一定量的像素, 这些像素决定图像在屏幕上所呈现的大小。相机所说的像素, 其实是最大像素的意 思, 像素是分辨率的单位, 这个像素值仅仅是相机所支持的有效最大分辨率。
01 课程导入
• 课程导入
今天将完成像素鸟飞行 的小程序, 通过点击鼠标, 控制小鸟的上下飞行, 躲避 管道, 现在就来跟着老师来 一起完成一析
1. 管道移动的效果; 2. 碰到障碍物的效果; 3. 分数榜的制作。
02 程序解析

基于深度强化学习的智能飞行器控制研究


基于深度强化学习的智能飞行器控制研究随着人工智能领域的不断推进,智能飞行器也逐渐成为研究的热点之一。

与传统的飞行控制技术不同,基于深度强化学习的智能飞行器控制技术具有更高的智能化和自主化,能够更好地适应不同的飞行环境和任务需求。

一、强化学习在智能飞行器中的应用强化学习是人工智能领域中的一个重要分支,它通过智能体与环境的交互,试图寻找最优的行为策略,从而最大化累积奖励。

在智能飞行器中,强化学习技术可以用于控制飞行器的姿态、高度、速度、飞行路径等参数,实现智能飞行和自主导航。

例如,使用深度强化学习算法,可以训练飞行器在复杂的三维空间中进行高速飞行和避障,使其能够更好地适应实际环境和任务需求。

二、深度强化学习技术在智能飞行器中的研究进展近年来,深度强化学习技术在智能飞行器控制领域得到了广泛应用和研究。

其中,深度强化学习网络是实现智能飞行器控制的核心技术之一。

通过建立深度神经网络,将状态、动作和奖励进行映射,可以实现飞行器的自主学习和控制。

例如,利用深度强化学习算法,可以对无人机的航线进行规划和自主飞行,同时实现对目标的检测和识别,使其能够应对不同的飞行任务和环境。

同时,基于深度强化学习的智能飞行器控制技术也存在着一些挑战和困难。

首先,智能飞行器在不同的环境和任务中需要不断调整和优化自身的行为策略,这需要大量的实验和训练数据。

其次,深度强化学习算法的训练过程需要消耗大量的计算资源和时间,对硬件和算法的要求较高。

最后,智能飞行器的控制涉及到多种物理量和参数的控制,需要从多个角度进行综合考虑,这也增加了智能飞行器控制的难度。

三、未来智能飞行器控制技术的发展方向未来,基于深度强化学习的智能飞行器控制技术将会继续得到发展和优化。

一方面,随着深度学习和强化学习算法的不断进步,智能飞行器的控制能力和智能化水平将会不断提升。

另一方面,智能飞行器领域也将涌现出一系列新的技术和应用场景,例如多机协同、智能决策等领域,这些新技术和场景的出现将进一步推动智能飞行器控制技术的发展和创新。

人工智能与信息社会 网课答案


如果没找到答案,请关注公众号:搜搜题免费搜题!!!1. 单选题电影()中,机器人最终脱离了人类社会,上演了“出埃及记”一幕。

(分)我,机器人2. 单选题 1977年在斯坦福大学研发的专家系统()是用于地质领域探测矿藏的一个专家系统。

(分)没搜到哦~3. 单选题能够提取出图片边缘特征的网络是()。

(分)卷积层4. 单选题在ε-greedy策略当中,ε的值越大,表示采用随机的一个动作的概率越(),采用当前Q函数值最大的动作的概率越()。

(分)大;小5. 单选题考虑到对称性,井字棋最终局面有()种不相同的可能。

(分)没搜到哦~6. 单选题在语音识别中,按照从微观到宏观的顺序排列正确的是()。

(分)帧-状态-音素-单词7. 单选题~没搜到哦.8. 单选题在强化学习过程中,()表示随机地采取某个动作,以便于尝试各种结果;()表示采取当前认为最优的动作,以便于进一步优化评估当前认为最优的动作的值。

(分)探索;开发9. 单选题一个运用二分查找算法的程序的时间复杂度是()。

(分)没搜到哦~10. 单选题典型的“鸡尾酒会”问题中,提取出不同人说话的声音是属于()。

(分)非监督学习11. 单选题 2016年3月,人工智能程序()在韩国首尔以4:1的比分战胜的人类围棋冠军李世石。

(分)AlphaGo12. 单选题首个在新闻报道的翻译质量和准确率上可以比肩人工翻译的翻译系统是()。

(分)微软13. 单选题被誉为计算机科学与人工智能之父的是()。

(分)图灵14. 单选题没搜到哦~15. 单选题科大讯飞目前的主要业务领域是()。

(分)语音识别16. 单选题如果某个隐藏层中存在以下四层,那么其中最接近输出层的是()。

(分)归一化指数层17. 单选题每一次比较都使搜索范围减少一半的方法是()。

(分)没搜到哦~18. 单选题人类对于知识的归纳总是通过()来进行的。

(分)没搜到哦~19. 单选题语音识别技术的英文缩写为()。

(分)ASR20. 单选题关于MNIST,下列说法错误的是()。

人工智能与信息社会2019尔雅答案教学教材


人工智能与信息社会2019尔雅答案人工智能与信息社会2019尔雅答案第一章1.AI时代主要的人机交互方式为()。

DA、鼠标、鼠标B、键盘、键盘C、触屏、触屏D、语音+视觉视觉2.2016年3月,人工智能程序()在韩国首尔以4:1的比分战胜的人类围棋冠军李世石。

AA、AlphaGoB、DeepMindC、DeepblueD、AlphaGo Zero3.Cortana是()推出的个人语音助手。

CA、苹果、苹果B、亚马逊、亚马逊C、微软、微软D、阿里巴巴、阿里巴巴4.首个在新闻报道的翻译质量和准确率上可以比肩人工翻译的翻译系统是()。

CA、苹果、苹果B、谷歌、谷歌C、微软、微软D、科大讯飞、科大讯飞5.相较于其他早期的面部解锁,iPhone X的原深感摄像头能够有效解决的问题是()。

CA、机主需要通过特定表情解锁手机、机主需要通过特定表情解锁手机B、机主是否主动解锁手机、机主是否主动解锁手机C、机主平面照片能够解锁手机、机主平面照片能够解锁手机D、机主双胞胎解锁手机、机主双胞胎解锁手机6.属于家中的人工智能产品的有()。

ABDA、智能音箱、智能音箱B、扫地机器人、扫地机器人C、声控灯、声控灯D、个人语音助手、个人语音助手7.谷歌相册与传统手机相册最大不同点是()。

ABEA、根据照片内容自动添加标记、根据照片内容自动添加标记B 、根据不同标记进行归类和搜索、根据不同标记进行归类和搜索C 、自动对照片进行美颜、自动对照片进行美颜D 、定时备份照片、定时备份照片E 、人脸识别和搜索、人脸识别和搜索8.目前外科手术领域的医用机器人的优点有()。

ABA 、定位误差小、定位误差小B 、手术创口小、手术创口小C 、不需要人类医生进行操作、不需要人类医生进行操作D 、能够实时监控患者的情况、能够实时监控患者的情况E 、可以帮助医生诊断病情、可以帮助医生诊断病情9.智能推荐系统的特点包括()。

ABCDA 、根据用户的购买记录记忆用户的偏好、根据用户的购买记录记忆用户的偏好B 、根据浏览时间判断商品对用户的吸引力、根据浏览时间判断商品对用户的吸引力C 、推荐用户消费过的相关产品、推荐用户消费过的相关产品D 、根据用户的喜好进行相关推荐、根据用户的喜好进行相关推荐10.一般来说,扫地机器人必需的传感器有()。

小学六年级课后服务:scratch少儿编程 四阶第7课:Flappy Bird

(教师)《教室介绍学校,以及自我介绍》授课老师开始授课!引
入上节课复习。
1 分钟
播放视频 1:课程导入
1 分钟
第二小节(上节回顾)
1 分钟
(教师)询问学生是否还有疑问,并引入本节课内容。
2 分钟
第三小节(本节课内容介绍)
(教师)抛出互动问题!和学生进行互动,提问
2 分钟
播放视频 1:课程导入
1 分钟
第十小节(课后作业)
播放视频 6:拓展练习
1 分钟
(课程结束)我们拓展练习就是使金牌银牌、铜牌分别对应不同的奖牌形状,看一看你还可以获得更多的分数吗?那么今天的课程就到这里了,大家可以把这节课完成的作品提交给老师。希望同学们能够在以后的课程中展现自己的奇思妙想,为我们的编程课堂迸发出不
一样的思维火花,我们下次编程课堂不见不散,拜拜!
(同学们操作,老师助教,保证学生完成本小节的代码指令!)
2 分钟
四、知识延伸
(教师)今天我们一起制作了像素鸟飞行程序,一起学习像素鸟飞
行程序的知识。就让我们一起来了解一下像素是什么吧!
1 分钟
播放视频 4:知识延伸
1 分钟
(师生互动)关于像素的知识,你都知道了吗。
2 分钟
第九小节(课程总结)
播放视频 5:课程总结(该视频为静态图片,用于辅助老师总结)
划出来吧。
1 分钟
分解流程图
1 分钟
(师生互动:动手练习)现在和老师一起来想一想,画出流程图。我们来为本节课的内容做一个划分,自己动手一起来分解一下我们要完成
的步骤吧。(让每一个同学完成流程图绘制)
3 分钟
三、编写程序
第六小节(程序的开始效果)
(教师)引入本节需要学习的代码指令,让学生认真听讲。

Playing the Game of Flappy Bird with Deep Reinforcement Learning


SHANGHAI JIAO TONG UNIVERSITY

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number: G-07
Group Members: Wang Wenqing 116032910080, Gao Xiaoning 116032910032, Qian Chen 116032910073

Contents
1 Introduction
2 Deep Q-learning Network
2.1 Q-learning
2.1.1 Reinforcement Learning Problem
2.1.2 Q-learning Formulation [6]
2.2 Deep Q-learning Network
2.3 Input Pre-processing
2.4 Experience Replay and Stability
2.5 DQN Architecture and Algorithm
3 Experiments
3.1 Parameters Settings
3.2 Results Analysis
4 Conclusion
5 References

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Abstract

Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of the game and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game of Flappy Bird as the input of the DQN, which preserves scalability to other games. After training with some tricks, the DQN can greatly outperform human beings.

1 Introduction

Flappy Bird has been a popular game worldwide in recent years. The player's goal is to guide the bird on screen through the gap between two pipes by tapping the screen. If the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, while the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state (b) crash state (c) passing state

Our goal in this paper is to design an agent that plays Flappy Bird automatically with the same input as a human player, which means that we use raw images and rewards to teach our agent to learn how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.

In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images. Therefore, it is natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in using deep learning. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. The third issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.

This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. By using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
2 Deep Q-learning Network

Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network, which operates directly on raw images and efficiently updates its parameters using stochastic gradient descent. In the following section, we describe the Deep Q-learning Network algorithm (DQN) and how its model is parameterized.

2.1 Q-learning

2.1.1 Reinforcement Learning Problem

Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].

Figure 2: Traditional Reinforcement Learning scenario

The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with this might be negative [5].

2.1.2 Q-learning Formulation [6]

In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward after performing the action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:

$$R_t = r_t + r_{t+1} + \cdots + r_{n-1} + r_n \qquad (1)$$

In order to ensure convergence and to balance the immediate reward and the future reward, the total reward must use the discounted future reward:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n = \sum_{i=t}^{n} \gamma^{i-t} r_i \qquad (2)$$

Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming equation (2) gives:

$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$

In Q-learning, define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:

$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$

It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent is to always choose the action that maximizes the discounted future reward:

$$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t) \qquad (5)$$

Here $\pi$ represents the policy, the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s_t, a_t) = r + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$

The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is stored as a table. Here is the overall Algorithm 1:

Algorithm 1: Q-learning
  Initialize Q[num_states, num_actions] arbitrarily
  Observe initial state s_0
  Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q(s, a) := Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
    s := s'
  Until terminated
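Algorithm 1 is compact enough to express directly in code. The following is a minimal tabular Q-learning sketch in Python, assuming a generic environment object with `reset()`, `step(action)` and `num_actions` in a Gym-like style; the environment interface, the hyperparameter values, and the ε-greedy exploration shown here are assumptions for illustration, not part of the report.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (a sketch of Algorithm 1)."""
    Q = defaultdict(float)                      # Q[(state, action)] initialized to 0

    def greedy(state):
        # action with the largest Q-value in this state
        return max(range(env.num_actions), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = greedy(state)
            next_state, reward, done = env.step(action)
            # Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = reward + (0.0 if done else gamma * Q[(next_state, greedy(next_state))])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```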
2.2 Deep Q-learning Network

In Q-learning, the state space is often too big to fit into main memory. A game frame of 80×80 binary pixels has $2^{6400}$ states, which is impossible to represent with a Q-table. What is more, when encountering an unseen state during training, Q-learning can only perform a random action, meaning that it is not heuristic. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10]. After training, the multilayer neural network approaches the traditional optimal Q-table:

$$Q(s_t, a_t; \theta) \approx Q^{*}(s_t, a_t) \qquad (7)$$

For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3.

Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values Q(s, a), one neuron corresponding to one action's Q-value.

To update the CNN's weights, define the cost function and gradient update rule as [9][10]:

$$L = \frac{1}{2}\Big[\, r + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \,\Big]^2 \qquad (8)$$

$$\nabla_{\theta} L = \Big[\, r + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \,\Big]\, \nabla_{\theta} Q(s_t, a_t; \theta) \qquad (9)$$

$$\theta := \theta + \nabla_{\theta} L(\theta) \qquad (10)$$

Here, $\theta$ are the DQN parameters that get trained and $\theta^{-}$ are the non-updated parameters of the target Q-value function. During training, equation (9) is used to update the weights of the CNN.

Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The ε-greedy approach achieves this: during training, a random action is selected with probability ε, otherwise the optimal action $a_t = \arg\max_{a'} Q(s_t, a'; \theta)$ is chosen. The ε anneals linearly to zero as the number of updates increases.

2.3 Input Pre-processing

Working directly with raw game frames, which are 288×512-pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.

Figure 4: Pre-processing of game frames. First convert frames to gray images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.

To improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an 80×80 image. The gray image is then converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity reduces as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
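A minimal sketch of this preprocessing pipeline in Python is shown below, using OpenCV and NumPy; the threshold value and the deque-based frame stack are assumptions chosen for illustration, and the report's additional step of blacking out the background is not reproduced here.

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """RGB game frame (288x512) -> 80x80 binary image, as described in Section 2.3."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)           # to gray-scale
    small = cv2.resize(gray, (80, 80))                           # down-sample to 80x80
    _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)  # to a binary image
    return binary.astype(np.float32) / 255.0

class FrameStack:
    """Keeps the last 4 preprocessed frames and stacks them into one state."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)
    def reset(self, frame_rgb):
        f = preprocess_frame(frame_rgb)
        for _ in range(self.frames.maxlen):
            self.frames.append(f)                                # fill with the first frame
        return np.stack(self.frames, axis=-1)                    # shape (80, 80, 4)
    def step(self, frame_rgb):
        self.frames.append(preprocess_frame(frame_rgb))
        return np.stack(self.frames, axis=-1)
```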
2.4 Experience Replay and Stability

By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. But the approximation of Q-values using non-linear functions is not very stable. In Q-learning, the experiences recorded in a sequential manner are highly correlated. If we use them sequentially to update the DQN parameters, the training process might get stuck in a poor local minimum or diverge.

To ensure the stability of DQN training, we use a technical trick called experience replay. During game playing, a certain number of experiences $(s_t, a_t, r_{t+1}, s_{t+1})$ are stored in a replay memory. When training the network, random mini-batches from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which might otherwise drive the network into a local minimum. As a result of this randomness in the choice of the mini-batch, the data that go into updating the DQN parameters are likely to be de-correlated.

Furthermore, to improve the stability of the convergence of the loss function, we use a clone of the DQN model with parameters $\theta^{-}$. The parameters $\theta^{-}$ are updated to $\theta$ after every C updates of the DQN.

2.5 DQN Architecture and Algorithm

As shown in Figure 5, we first grab a Flappy Bird game frame and, after the pre-processing described in Section 2.3, stack up the last 4 frames as a state. This state is fed as raw images into the CNN, whose output is the quality of each action in the given state. According to the policy $\pi(s_t) = \arg\max_{a} Q(s_t, a)$, the agent performs the greedy action, or, with probability ε, a random action. The current experience is stored in a replay memory, and a random mini-batch of experiences is sampled from the memory and used to perform a gradient descent step on the CNN's parameters. This is an interactive process that continues until some criteria are satisfied.

Figure 5: DQN's training architecture: the upper data flow shows the training process, while the lower data flow displays the interactive process between the agent and the environment.

The complete DQN training process is shown in Algorithm 2. Note that the ε factor is set to zero during testing, while during training we use a decaying value, balancing exploration and exploitation.

Algorithm 2: Deep Q-learning Network
  Initialize replay memory D to a certain capacity N
  Initialize the CNN with random weights θ
  Initialize θ⁻ := θ
  for games = 1 : maxGames do
    for snapShots = 1 : T do
      With probability ε select a random action a_t,
        otherwise select a_t := argmax_{a'} Q(s_t, a'; θ)
      Execute a_t and observe r_{t+1} and the next state s_{t+1}
      Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
      Sample a mini-batch of transitions from D
      for j = 1 : batchSize do
        if the game terminates at the next state then
          Q_pred := r_j
        else
          Q_pred := r_j + γ max_{a'} Q(s_{t+1}, a'; θ⁻)
        end if
        Perform gradient descent on L = ½ (Q_pred − Q(s_t, a_t; θ))² according to equation (10)
      end for
      Every C steps reset θ⁻ := θ
    end for
  end for
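The replay memory and target-network bookkeeping of Algorithm 2 can be sketched independently of any particular deep learning framework. In the sketch below, `q_values(params, states)` stands for a forward pass of the CNN and is left abstract; the buffer capacity, batch size, and function names are assumptions for illustration.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions with uniform sampling."""
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)
    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))
    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def dqn_targets(q_values, target_params, batch, gamma=0.99):
    """Q_pred of Algorithm 2: r for terminal transitions, r + gamma * max_a' Q(s',a'; theta-) otherwise."""
    _, _, r, s_next, done = batch
    next_q = q_values(target_params, s_next)      # shape (batch, num_actions), from the frozen clone
    return r + gamma * (1.0 - done.astype(np.float32)) * next_q.max(axis=1)

# Every C gradient steps the frozen parameters are re-synchronized, e.g.
#     target_params = copy.deepcopy(params)
# and the squared error (Q_pred - Q(s_t, a_t; theta))^2 is minimized on each mini-batch.
```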
3 Experiments

This section describes our algorithm's parameter settings and the analysis of the experimental results.

3.1 Parameters Settings

Figure 6 illustrates our CNN's layer settings. The neural network has 3 convolutional hidden layers followed by 2 fully connected hidden layers. Table 1 shows the detailed parameters of every layer. We use max pooling only in the first convolutional hidden layer, and we use the ReLU activation function to produce the neural output.

Figure 6: The layer setting of the CNN: this CNN has 3 convolutional layers followed by 2 fully connected layers. For training, we use the Adam optimizer to update the CNN's parameters.

Table 1: The detailed layer settings of the CNN

Table 2 lists the training parameters of the DQN. We use a decayed ε ranging from 0.1 to 0.001 to balance exploration and exploitation. Moreover, Table 2 shows that the batch stochastic gradient descent optimizer is Adam with a batch size of 32. Finally, we also allocate a large replay memory.

Table 2: The training parameters of the DQN

3.2 Results Analysis

We train our model for about 4 million epochs. Figure 7 shows the weights and biases of the CNN's first hidden layer. The weights and biases finally centralize around 0 with low variance, which directly stabilizes the CNN's output Q-value $Q(s_t, a_t)$ and reduces the probability of random actions. The stability of the CNN's parameters leads to obtaining the optimal policy.

Figure 7: The left (right) figure is the histogram of the weights (biases) of the CNN's first hidden layer.

Figure 8 shows the cost value of the DQN during training. The cost function has a slow downtrend and is close to 0 after 3.5 million epochs. This means that the DQN has learned the most common state subspace and will perform the optimal action when coming across a known state. In a word, the DQN has obtained its best action policy.

Figure 8: DQN's cost function: the plot shows the training progress of the DQN. We trained our model for about 4 million epochs.

When playing Flappy Bird, if the bird gets through a pipe, we give a reward of 1; if it dies, −1; otherwise 0.1. Figure 9 shows the average reward returned by the environment. The stability in the final stage of training means that the agent can automatically choose the best action, and the environment in turn gives the best reward. The agent and environment have entered into a friendly interaction, guaranteeing the maximal total reward.

Figure 9: The average returned reward from the environment. We average the returned reward every 1000 epochs.

As Figure 10 shows, the predicted max Q-value from the CNN converges and stabilizes at a value after about 100,000 epochs. This means that the CNN can accurately predict the quality of actions in a specific state, and we can steadily perform the actions with the max Q-value. The convergence of the max Q-values indicates that the CNN has explored the state space widely and approximated the environment well.

Figure 10: The average max Q-value obtained from the CNN's output. We average the max Q-value every 1000 epochs.

Figure 11 illustrates the DQN's action strategy. If the predicted max Q-value is high, we are confident that the bird will get through the gap when performing the action with the max Q-value, as at A and C. If the max Q-value is relatively low and we perform the action, we might hit the pipe, as at B. In the final stage of training, the max Q-value is dramatically high, meaning that we are confident to get through the gaps if we perform the actions with the max Q-value.

Figure 11: The leftmost plot shows the CNN's predicted max Q-value for a 100-frame segment of the game Flappy Bird. The three screenshots correspond to the frames labeled A, B, and C respectively.
4 Conclusion

We successfully use DQN to play Flappy Bird, and it can outperform human beings. DQN can automatically learn knowledge from the environment, using only raw images to play the game without prior knowledge. This feature gives DQN the power to play almost any simple game. Moreover, the use of a CNN as a function approximator allows DQN to deal with large environments with an almost infinite state space. Last but not least, the CNN can also represent the feature space well without handcrafted feature extraction, reducing massive manual work.

5 References

[1] C. Clark and A. Storkey. Teaching deep convolutional neural networks to play Go. arXiv preprint arXiv:1412.3409, 2014.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[3] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[4] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[5] Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5:1063–1088, 2004.
[6] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[7] Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719–726, 2010.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Flappybird游戏设计


Flappybird游戏设计Flappybird是一款非常受欢迎的手机游戏,其简单易上手的操作方式、可爱的游戏角色以及趣味性别具特色,深受玩家喜爱。

在本文档中,我们将讨论Flappybird游戏的设计实现过程,并详细介绍其中的关键要素。

游戏基本设计Flappybird是一款端游,在游戏开始时,玩家需要通过点击屏幕,让游戏中的小鸟飞行,同时避开屏幕上的不断出现的柱子障碍。

当小鸟撞击柱子或掉落到地面后,游戏结束并重新开始游戏。

游戏结束后,玩家可以查看自己的得分,并重新开始游戏。

游戏场景游戏场景是Flappybird游戏非常重要的一部分,其主要作用是为游戏提供一个视觉背景。

在游戏中,我们通常使用背景图片或动画来实现场景的呈现,并通过不断循环移动场景中的背景元素,为玩家带来不同的视觉效果。

场景的背景图片可以是翻转、旋转、变化或闪烁等等,以此为玩家带来更加生动的视觉效果。

游戏元素Flappybird游戏中的元素主要包括小鸟、柱子、地面等。

其中,小鸟是游戏的主角,其掌握游戏的命运。

柱子是小鸟需要避开的障碍,通常是两根竖直排列的柱子,之间有一个缝隙,玩家需要通过缝隙使小鸟飞过。

地面则是小鸟落地的区域,如果小鸟坠落到了地面,那么游戏就结束了。

游戏难度游戏难度是Flappybird游戏设计中的一个重要因素。

游戏可以设置关卡难度,例如:障碍物的数量、障碍物的大小、障碍物的距离等等。

游戏设计者可以根据玩家的反馈,设定游戏的难度,使之更加具有挑战性。

游戏实现过程Flappybird游戏的实现过程需要技术支持,下面我们将介绍游戏的实现过程以及所需要的技术细节。

游戏引擎在游戏开发中,我们通常使用游戏引擎(Game Engine)来实现游戏的逻辑。

游戏引擎是一种软件体系,包括游戏中需要使用的各种工具和游戏逻辑实现。

Flappybird游戏中常使用的游戏引擎有Unity3D和Cocos等。

游戏物理引擎在Flappybird游戏的实现过程中,游戏物理引擎是非常重要的组成部分。

一文看懂深度学习(白话解释8个优缺点4个典型算法)


一文看懂深度学习(白话解释8个优缺点4个典型算法)深度学习有很好的表现,引领了第三次人工智能的浪潮。

目前大部分表现优异的应用都用到了深度学习,大红大紫的AlphaGo 就使用到了深度学习。

本文将详细的给大家介绍深度学习的基本概念、优缺点和主流的几种算法。

深度学习、神经网络、机器学习、人工智能的关系深度学习、机器学习、人工智能简单来说:1.深度学习是机器学习的一个分支(最重要的分支)2.机器学习是人工智能的一个分支目前表现最好的一些应用大部分都是深度学习,正是因为深度学习的突出表现,引发了人工智能的第三次浪潮。

详情可以看《人工智能的发展史——3次 AI 浪潮》深度学习、神经网络深度学习的概念源于人工神经网络的研究,但是并不完全等于传统神经网络。

不过在叫法上,很多深度学习算法中都会包含”神经网络”这个词,比如:卷积神经网络、循环神经网络。

所以,深度学习可以说是在传统神经网络基础上的升级,约等于神经网络。

大白话解释深度学习看了很多版本的解释,发现李开复在《人工智能》一书中讲的是最容易理解的,所以下面直接引用他的解释:我们以识别图片中的汉字为例。

假设深度学习要处理的信息是“水流”,而处理数据的深度学习网络是一个由管道和阀门组成的巨大水管网络。

网络的入口是若干管道开口,网络的出口也是若干管道开口。

这个水管网络有许多层,每一层由许多个可以控制水流流向与流量的调节阀。

根据不同任务的需要,水管网络的层数、每层的调节阀数量可以有不同的变化组合。

对复杂任务来说,调节阀的总数可以成千上万甚至更多。

水管网络中,每一层的每个调节阀都通过水管与下一层的所有调节阀连接起来,组成一个从前到后,逐层完全连通的水流系统。

那么,计算机该如何使用这个庞大的水管网络来学习识字呢?比如,当计算机看到一张写有“田”字的图片,就简单将组成这张图片的所有数字(在计算机里,图片的每个颜色点都是用“0”和“1”组成的数字来表示的)全都变成信息的水流,从入口灌进水管网络。

Flappy Bird Game Interaction Based on Reinforcement Learning

Reinforcement learning differs from supervised and unsupervised learning in that it mainly obtains an effective policy through interaction with the environment.

Simply put, during reinforcement learning the model interacts with the environment, the environment returns feedback (which may be positive or negative), the model then produces the next action, and through continued training it learns to collect as much positive feedback as possible.

Flappy Bird is a classic game in which the player controls a small bird and crosses obstacles formed by pipes of various lengths.

This project trains a well-performing model with reinforcement learning algorithms and deploys it to different platforms.

Automatic Flappy Bird flight was achieved on each of these platforms.

1 Preparation

To make the final result as good as possible, a great deal of preparatory work was done, including but not limited to re-implementing the game on each platform, collecting datasets, and setting up the Raspberry Pi runtime environment.

The game was re-implemented on each platform to better display the running results and to meet the requirements for training and testing; the Raspberry Pi environment was prepared for convenience in later testing. The preparatory work is described in detail below.

1.1 Re-implementing the PC version

For the PC version, the main goal is to reproduce the original game logic accurately; there is no strong requirement for innovation.

Python was chosen for this. When the keyboard input is 1, the ascend method is called and the bird's trajectory trends upward; when the input is 0, the descend method is called. Note that the descend method does not mean the bird flies straight down: the motion is determined by acceleration, so the instantaneous direction may be up or down, but the overall trend is downward.

If there is no input, the frame does not advance.

Keyboard input changes the game state and updates the display; the PC re-implementation is mainly used for dataset collection.

1.2 Re-implementing the Android version

Once the PC version was done, the Android re-implementation became simpler: all that is needed is to keep every piece of logic consistent with the PC Flappy Bird.

Some minor issues arose with collision detection. The final Android version uses precise detection: rather than simply checking whether the bounding boxes of the bird and pipe images overlap, it uses pixel-level detection. The current images are processed into masks in which pixels belonging to the bird or a pipe are 1 and empty pixels are 0, and a collision is declared if any 1s in the two masks overlap. A mask-overlap sketch follows.
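A pixel-mask overlap test of the kind described can be sketched in a few lines of NumPy. The sprite images, their on-screen offsets, and the alpha-channel rule used to build the masks are assumptions for illustration; the project's own Android implementation is not shown in the text.

```python
import numpy as np

def sprite_mask(rgba_image):
    """1 where the sprite has visible pixels (bird or pipe), 0 where it is transparent."""
    return (rgba_image[..., 3] > 0).astype(np.uint8)

def pixel_collision(mask_a, pos_a, mask_b, pos_b):
    """True if any '1' pixels of the two masks overlap on screen.

    pos_* are the (x, y) offsets of each sprite's top-left corner."""
    ax, ay = pos_a
    bx, by = pos_b
    # intersection rectangle of the two sprites in screen coordinates
    x0, y0 = max(ax, bx), max(ay, by)
    x1 = min(ax + mask_a.shape[1], bx + mask_b.shape[1])
    y1 = min(ay + mask_a.shape[0], by + mask_b.shape[0])
    if x0 >= x1 or y0 >= y1:
        return False                     # bounding boxes do not even touch
    sub_a = mask_a[y0 - ay:y1 - ay, x0 - ax:x1 - ax]
    sub_b = mask_b[y0 - by:y1 - by, x0 - bx:x1 - bx]
    return bool(np.any(sub_a & sub_b))   # overlap of two '1' pixels = collision
```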

基于深度学习的鸟类自动识别技术研究


基于深度学习的鸟类自动识别技术研究在自然界中,鸟类的种类繁多,防护森林和自然保护区内经常可以看到各种各样的鸟类。

然而,鸟类的形态复杂多样,种类繁多,在野外进行鸟类的观察与识别是一件非常困难的事情,因此寻找一种自动识别鸟类的技术显得尤为重要。

近年来,随着深度学习的快速发展,人工智能在图像处理领域得到了广泛的应用,鸟类自动识别技术也得到了长足的进展。

基于深度学习的鸟类自动识别技术是一种应用广泛的新技术,其依靠输入图片的颜色,纹理和形状等特征,通过大量的图片训练,提取并学习鸟类的特征,从而实现对鸟类的准确自动识别。

目前,基于深度学习的鸟类自动识别技术已经被广泛研究和应用。

具体来说,该技术实现的方法主要包括图像预处理、特征提取和分类器训练三个步骤。

首先,图像预处理是基于深度学习的鸟类自动识别技术的第一步,其目的是使输入的鸟类图片能够获得最佳的处理结果。

在预处理过程中,需要对原始的鸟类图片进行一系列处理,如剪裁、缩放、灰度化、图像增强等,从而获得更为清晰高质的图片。

其次,特征提取是基于深度学习的鸟类自动识别技术的另一个关键步骤。

该技术利用卷积神经网络(CNN)等深度学习模型,能够提取出鸟类图片中的各种纹理、形状等特征,形成一组高维特征向量。

这些特征向量的提取过程需要经过卷积层、池化层、全连接层等多个过程。

最后,分类器的训练是基于深度学习的鸟类自动识别技术的关键环节。

在这个过程中,需要使用大量的鸟类图片来训练分类器,从而得到高精度的深度学习模型。

训练的过程主要是通过反向传播算法来优化深度学习模型中的权值和偏移量,使其能够更加准确地识别不同种类的鸟类。

总的来说,基于深度学习的鸟类自动识别技术主要依靠数据训练和深度神经网络模型优化来实现。

目前,该技术已被成功应用于多种领域,如鸟类监测、生态保护等。

通过该技术,不仅能够减轻人工监测的工作难度和成本,同时也能够帮助人们更好地了解鸟类的分布和生态习性,从而更好地保护和管理野生动物资源。

基于深度强化学习的智能飞行控制方法研究


基于深度强化学习的智能飞行控制方法研究随着机器学习和人工智能技术的不断发展,智能飞行控制系统的研究和应用也越来越受到重视。

深度强化学习是一种新兴的机器学习方法,它能够通过与环境的交互学习到最佳的决策策略,因此被视为一种有潜力的智能飞行控制方法。

本文将从深度强化学习的基本原理、智能飞行控制系统的架构和实验研究等方面,探讨基于深度强化学习的智能飞行控制方法研究进展和存在的问题。

一、深度强化学习的基本原理深度强化学习是一种结合深度神经网络和强化学习的方法。

其中,深度神经网络用于学习状态和行动之间的映射关系,强化学习则用于在与环境的交互中学习到最佳的决策策略。

强化学习是一种通过对环境的探索和试错来学习最佳行为策略的机器学习方法。

在强化学习中,智能体与环境进行交互,通过优化回报函数来学习最佳行为策略。

其中,回报函数可以根据具体的应用场景进行设计,比如在智能飞行控制中可以将坠毁或飞行耗费的能量作为惩罚项,将飞行距离和平稳度等指标作为奖励项。

深度神经网络是一种能够学习到非线性映射关系的神经网络。

在强化学习中,深度神经网络用于学习状态和行动之间的映射关系,从而实现最优策略的推断。

常见的深度学习算法包括卷积神经网络(CNN)、循环神经网络(RNN)和深度Q网络(DQN)等。

二、智能飞行控制系统的架构基于深度强化学习的智能飞行控制系统通常包含以下三个模块:感知模块、决策模块和执行模块。

其中,感知模块用于将传感器获取的数据转化为状态特征表示,决策模块用于学习最优的决策策略,执行模块用于实现最优决策策略。

感知模块一般包括传感器数据的采集和预处理、特征提取和特征选择等过程。

在飞行控制中,常用的传感器包括惯性导航系统(Inertial Navigation System, INS)、全球定位系统(Global Positioning System, GPS)、空气动力学传感器、蒸汽压力计等。

特征提取则可以使用各种机器学习算法,比如主成分分析(Principal Component Analysis, PCA)、线性判别分析(Linear Discriminant Analysis, LDA)、自编码器(Autoencoder)等。

FlappyBird小游戏实训报告


郑州轻工业学院实训报告—FlappyBird小游戏专业班级:姓名:学号:电话:游戏,2D图形,场景类,地面,柱子,重力因素,碰撞检测,类声明文件,类实现文件一、FlappyBird游戏开发步骤1、准备资源图片,影音文件;2、使用QtCreator框架开发;2.1 安装QSDK4.8以上版本;2.2 安装QtCreator集成开发环境;2D图形QImage图片类QWidget窗口/ 其他组件QPainter画笔二、实现步骤:1、创建场景类World Widget,实现背景图片的加载。

2、创建小鸟类Bird,加载小鸟图片数组,选择当前显示图片,定义位置,初速度,时间间隔,重力等因素。

添加paint()函数。

3、在场景类中添加小鸟对象,调用bird->paint()将小鸟画入场景。

4、实现小鸟自由落体运动效果bird.h中添加成员属性g,t,speed,distance;在bird.h中增加step() 计算小鸟y轴位置;增加flappy() ; 修改当前速度,使小鸟向上移动;在world类run()中调用小鸟的step()和flappy();5、地面移动ground.h声明Ground类ground.cpp 实现Ground6、加入柱子column.h Column类column.cpp 实现Column类现有以上功能实现。

另外加入碰撞检测;Bird类中:bool hit(column& ,column&, Ground&);三、文件组织结构:main.cpp 主程序world.h场景World类声明文件world.cpp World类实现文件bird.h小鸟类Bird声明文件bird.cpp Bird类实现文件四、程序代码:main.cpp#include <QApplication>#include "world.h"int main(intargc, char** argv){QApplicationapp(argc,argv);World w;w.show();returnapp.exec();}#ifndef _WORLD_H#define _WORLD_H#include<QWidget>World.h#ifndef _WORLD_H#define _WORLD_H#include <QWidget>#include <QPaintEvent>#include <QTimer>#include <QLabel>#include "bird.h"#include "ground.h"#include "column.h"//场景类:负责维护各种图片class World : public QWidget{Q_OBJECTpublic:World(QWidget* parent = 0);//绘制窗口~World(); //void restart();//重新开始void save(unsigned short );//保存文件voidpaintEvent(QPaintEvent*); voidmousePressEvent(QMouseEvent *); public slots://自定义槽,控制图片运行void run();private:Bird* bird;Ground* ground;Column* c1;Column* c2;QTimer timer; QImagegameoverImage; QImagebgImage;//加入gatReady图片QImagestartImage;boolgameOver;// 游戏结束boolstartGame;//游戏是否开始unsigned short score;//分数unsigned short best_score;// 历史最高QLabel* score_label;};#endifWorld.cpp#include "world.h"#include <QPainter>#include <QFile>#include <QTextStream>#include <QDataStream>#include "bird.h"#include <QDebug>World::World(QWidget* parent): QWidget(parent){//this->resize(432, 644);this->setGeometry(400,200, 432,644);bird = new Bird;ground = new Ground;c1 = new Column(0);c2 = new Column(1);gameoverImage.load(":gameover");bgImage.load(":bg");startImage.load(":start");gameOver = false;startGame = false;score = 0;score_label = new QLabel(this);score_label->setGeometry(QRect(270,10,120,40));score_label->setStyleSheet(QString::fromUtf8("font: 20pt \"Khmer OS System\";\n""color: rgb(85, 0, 255);"));timer.setInterval(1000/70);connect(&timer, SIGNAL(timeout()),this, SLOT(run())); //一会写run// timer.start();QFilefile("./score.dat");if(!file.open(QFile::ReadOnly | QFile::Text)){best_score = 0;}else{//QTextStreamin(&file);QDataStreamin(&file);in>>best_score;qDebug() << "read...";}file.close();}World::~World(){if(score >best_score)save(score);}void World::save(unsigned short best){QFilefile("./score.dat");if(!file.open(QFile::WriteOnly | QFile::Text)){return;}else{// QTextStreamout(&file);QDataStreamout(&file);out<< best;//qDebug() << "write";}file.close();}//哑元函数void World::paintEvent(QPaintEvent*){QPainterpainter(this);painter.drawImage(0,0,bgImage);//将画笔传给bird对象,由bird对象画出当前小鸟的图片c1->paint(&painter);c2->paint(&painter);bird->paint(&painter);ground->paint(&painter);if(!startGame){painter.drawImage(0,0,startImage);}if(gameOver){painter.drawImage(0,0,gameoverImage);}if(!startGame){painter.setFont(QFont("Khmer OS System",20,QFont::Bold)); painter.drawText(QRect(QPoint(145,390),QPoint(320,445)),QString::fromUtf8("历史最高:")+=QString::number(best_score));}score_label->setText(QString("score:")+=QString::number(score));}void World::run(){bird->fly();//飞bird->step();//小鸟下落c1->step();c2->step();ground->step();if(bird->pass(*c1) || bird->pass(*c2)){qDebug("pass");score++;}if(bird->hit(*c1,*c2,*ground)){timer.stop();gameOver = true;//gameover ...//TODO/**1)加载gameover图片,实现点击图片的开始按钮重新开始游戏。

基于深度强化学习的游戏智能体设计与实现


基于深度强化学习的游戏智能体设计与实现深度强化学习(Deep Reinforcement Learning, DRL)是一种人工智能领域的前沿技术,近年来在游戏智能体设计与实现中取得了显著的成果。

本文将基于深度强化学习的方法,探讨游戏智能体的设计与实现。

首先,我们将介绍深度强化学习在游戏智能体设计中的基本原理。

深度强化学习是将深度学习和强化学习相结合的一种方法。

其核心思想是通过强化学习算法使智能体通过与环境的交互,学习到如何在给定的游戏环境中采取最优的行动策略。

深度强化学习的关键在于使用深度神经网络来近似智能体的行动策略,并通过优化算法来不断调整网络的参数,使得智能体能够逐渐改善其对环境的理解和行动选择能力。

其次,我们将探讨游戏智能体设计中的关键问题和挑战。

首先是状态表示的设计。

游戏智能体的状态表示往往是非常复杂的,它需要包含大量的信息以便智能体能够理解当前的游戏环境。

在设计状态表示时,我们可以采用图像处理技术,将游戏画面转换成对应的特征向量,或者使用模型推理等技术,将游戏场景的各种属性转化为对应的状态向量。

其次是行动策略的选择。

在深度强化学习中,我们通常采用Q-learning算法或者其改进算法,来学习智能体的行动策略。

对于每个时间步骤,智能体需要根据当前的状态选择一个最优的行动,而行动的选择则依赖于Q值。

最后是奖励函数的设计。

奖励函数是用来评估智能体在某一状态下的行动好坏的指标。

在设计奖励函数时,我们需要根据游戏的目标和规则来设置相应的奖励权重,以引导智能体学习到正确的行动策略。

接下来,我们将介绍游戏智能体的实现过程。

首先是数据采集。

在实现过程中,我们需要使用游戏环境以及与环境交互的智能体来采集大量的状态、动作、奖励以及下一状态的数据样本。

这些数据样本将用于训练智能体的深度神经网络。

其次是网络训练。

我们使用采集到的数据样本来训练深度神经网络,通过优化算法来不断调整网络的参数,使得网络能够逐渐准确地预测每个状态下的行动价值。

基于深度强化学习的智能游戏设计与优化


基于深度强化学习的智能游戏设计与优化近年来,深度强化学习(Deep Reinforcement Learning,简称DRL)在人工智能领域中的应用逐渐引起了广泛关注。

其通过模拟自主决策的智能代理与环境的交互,在游戏设计与优化中发挥了重要作用。

本文将着重探讨基于深度强化学习的智能游戏设计与优化方法。

一、深度强化学习在游戏设计中的应用深度强化学习是一种基于智能体通过与环境交互,通过不断试错来学习最优策略的方法。

在游戏设计中,可以将游戏的规则和环境构建成一个强化学习的模型。

智能代理通过与游戏环境的交互,不断调整自己的决策和策略,从而优化游戏的设计。

1. 游戏环境建模在基于深度强化学习的智能游戏设计中,首先需要对游戏环境进行建模。

游戏环境的构建可以基于现有的游戏引擎,也可以自行设计。

通过将游戏的规则、道具、地图等要素整合进模型中,构建出逼真的虚拟游戏场景。

2. 智能代理训练智能代理作为游戏中的控制单元,需要通过与游戏环境的交互来学习最优的策略。

在深度强化学习中,可以使用神经网络作为智能代理的决策模型,通过不断迭代更新神经网络参数,逐渐优化代理的决策能力。

同时,采用强化学习算法,如Q-learning、Actor-Critic等,能够帮助智能代理在不断试错中逐渐学习到最优策略。

二、游戏设计的优化方法基于深度强化学习的智能游戏设计不仅可以提供更加智能化的游戏体验,还可以实现游戏的优化和创新。

1. 游戏难度调整通过深度强化学习,智能代理可以自动学习和调整游戏的难度。

智能代理不断与游戏环境交互,根据游戏中的反馈信息调整自身的决策与行为,使游戏难度能够适应玩家的水平。

这种自适应的游戏难度调整,可以提高游戏的可玩性和挑战性,使玩家获得更好的游戏体验。

2. 游戏关卡设计利用深度强化学习,可以对游戏关卡进行优化设计。

智能代理通过与多个关卡的交互学习,逐渐掌握游戏关卡的规律和难点,从而设计出更加丰富多样、平衡度更高的关卡。

基于深度强化学习的智能游戏设计与优化


基于深度强化学习的智能游戏设计与优化在智能游戏设计与优化领域,基于深度强化学习的技术的应用正变得越来越重要。

深度强化学习以其在算法和模型方面的创新,为游戏开发者提供了一种全新的方法来设计和优化智能游戏。

本文将探讨基于深度强化学习的智能游戏设计与优化,并介绍其在游戏领域中的应用。

一、深度强化学习简介深度强化学习是结合了深度学习和强化学习两种技术的方法。

深度学习是一种通过多层神经网络来学习复杂模式和特征的机器学习方法;而强化学习是一种通过智能体与环境交互来学习最优行为策略的方法。

深度强化学习将这两种技术相结合,可以在游戏设计与优化中发挥巨大的作用。

二、智能游戏设计基于深度强化学习的智能游戏设计,可以使游戏中的NPC角色拥有更加智能的行为。

传统的游戏NPC往往是通过预先设定的规则来确定其行为,而基于深度强化学习的智能游戏设计能够让NPC角色通过与环境的交互来自主学习和调整策略。

这种设计方法可以使游戏中的NPC角色更具挑战性和真实感,提高游戏的可玩性和乐趣。

三、智能游戏优化除了设计智能的NPC角色,基于深度强化学习的智能游戏优化还可以帮助游戏开发者提高游戏性能和用户体验。

通过深度强化学习算法来自动优化游戏中的关卡设计、游戏平衡性和难度设置等方面,可以有效提高游戏的吸引力和可持续性。

四、深度强化学习在游戏中的应用案例基于深度强化学习的智能游戏设计与优化已经在多个游戏中得到了成功应用。

例如,在围棋等棋类游戏中,AlphaGo等基于深度强化学习的算法战胜了多位世界冠军选手,展示了其在智能游戏中的强大能力。

此外,基于深度强化学习的技术也在其他类型的游戏中取得了显著的优化效果,如射击游戏中的敌人行为模拟、角色扮演游戏中的任务生成等方面。

五、挑战与展望尽管基于深度强化学习的智能游戏设计与优化已经取得了很大的进展,但仍然面临一些挑战。

首先,深度强化学习算法的训练过程通常需要大量的计算资源和数据集,对于小型游戏开发者来说可能存在一定的门槛。

AI玩Flappy Bird课堂教学设计


课堂教学设计方案及技术运用的说明18物联网目前处于高三,是浙江省第一届中本一体化班级。

学生入学成绩在中职生中名列前茅,学习专业积极性很高,但因为要参加高职考目前处在高职考学科复习阶段。

也是因为高职考整个高三他们专业课学习处于停滞状态。

学生没有学习新知识的时间,但学生具有程序设计基础但没有学习过时下人工智能领域使用最广的“Python”语言,无法使用功能强大的“Pytorch”“TensorFlow”等人工智能框架。

所以我选择了“源码编辑器”它自带简单的人工智能插件,图形化界面学生简单学习就可以上手使用。

教学重难点(根据学情确定本次教学的重点和难点)重点:“人工神经网络”模型的建立、训练、使用的方法难点:理解为什么“AI玩Flappy Bird”需要借助于“机器学习”中的“分类”任务来实现教学其他准备(指信息技术的软硬件、数字资源、数字模型的准备,以及教学课件、微课程的设计制作等)课前教学视频利用钉钉平台提前发布,让学生学习“人工智能”“机器学习”“深度学习”“神经网络”等相关知识,课上通过“测试”检查学生学习情况,达到巩固知识引出“新知”的目的。

课上利用“操作帮助”文档帮助课堂上的学困生完成操作。

教学实施过程在此阶段如果有运用技术的情形,请在相应环节处说明选用技术的软件名称和技术使用的基本情况(包括根据教学目标和学情使用技术的理由和技术的使用方式等)教学导入测试题1,2,3。

一方面测试学生课前学习情况另一方面引出如何把“AI玩Flappy Bird”与神经网络联系在一起。

教学展开(本节课教学实施的流程顺序,包括教师讲授和学生自主学习的各个活动等)教学活动一学习任务课堂小测试问题引领“AI玩Flappy Bird”属于“分类”问题还是“回归”问题给出三个课堂“测试”,题目层层层递进,测试学生课前学情况,通过第三题引发学生思考引出本节课内容“AI玩Flappy Bird”属于“分类”问题还是“回归”问题,原因是什么?教学活动二学习任务确定“AI玩Flappy Bird”中的“输入特征”和“输出结果”问题引领如果利用神经网络完成“AI玩Flappy Bird”那么它的“输入特征”和“输出结果”是什么?逐步分析探究利用神经网络完成“AI玩Flappy Bird”输入层的“输入特征”,输出层“输出结果”是什么?隐藏层具体处理过程不过多涉及。

(完整版)基于深度强化学习的flappybird

(完整版)基于深度强化学习的flappybird

SHANGHAI JIAO TONG UNIVERSITYProject Title: Playing the Game of Flappy Bird with DeepReinforcement LearningGroup Number: G-07Group Members: Wang Wenqing 116032910080Gao Xiaoning 116032910032Qian Chen 116032910073Contents1Introduction (1)2Deep Q-learning Network (2)2.1Q-learning (2)2.1.1Reinforcement Learning Problem (2)2.1.2Q-learning Formulation [6] (3)2.2Deep Q-learning Network (4)2.3Input Pre-processing (5)2.4Experience Replay and Stability (5)2.5DQN Architecture and Algorithm (6)3Experiments (7)3.1Parameters Settings (7)3.2Results Analysis (9)4Conclusion (11)5References (12)Playing the Game of Flappy Bird with Deep Reinforcement LearningAbstractLetting machine play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge, lacking scalability. In this project, we utilize a convolutional neural network to represent the environment of games, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm as deep reinforcement learning or Deep Q-learning Network(DQN). Moreover, we only use the raw images of the game of flappy bird as the input of DQN, which guarantees the scalability for other games. After training with some tricks, DQN can greatly outperform human beings.1IntroductionFlappy bird is a popular game in the world recent years. The goal of players is guiding the bird on screen to pass the gap constructed by two pipes by tapping screen. If the player tap the screen, the bird will jump up, and if the player do nothing, the bird will fall down at a constant rate. The game will be over when the bird crash on pipes or ground, while the scores will be added one when the bird pass through the gap. In Figure1, there are three different state of bird. Figure 1 (a) represents the normal flight state, (b) represents the crash state, (c) represents the passing state.(a) (b) (c)Figure 1: (a) normal flight state (b) crash state (c) passing stateOur goal in this paper is to design an agent to play Flappy bird automatically with the same input comparing to human player, which means that we use raw images and rewards to teach our agent to learn how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.Recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high dimension features from raw images. Therefore, it is nature to ask whether the deep learning can be used in reinforcement learning. However, there are four challenges in using deep learning. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting whencompared to the direct association between inputs and targets found in supervised learning. The third issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. 
Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.This paper will demonstrate that using Convolutional Neural Network (CNN) can overcome those challenge mentioned above and learn successful control polices from raw images data in the game Flappy bird. This network is trained with a variant of the Q-learning algorithm [6]. By using Deep Q-learning Network (DQN), we construct the agent to make right decisions on the game flappy bird barely according to consequent raw images.2 Deep Q-learning NetworkRecent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features[2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network, which operates directly on raw images and efficiently update parameters by using stochastic gradient descent.In the following section, we describe the Deep Q-learning Network algorithm (DQN) and how its model is parameterized.2.1 Q-learning2.1.1 Reinforcement Learning ProblemQ-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 show, an agent interacts with its environment in discrete time steps. At each time t, the agent receives an state t s and a reward t r . It then chooses an action t a from the set of actions available, which is subsequently sent to the environment. The environment moves to a new state 1t s + and the reward 1t r + associated with the transition1(,,)t t t s a s +is determined [4].Figure 2: Traditional Reinforcement Learning scenarioThe goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history and it can even randomize its action selection. Note that in order to act near optimally, the agent must reason about the long term consequences of its actions (i.e., maximize the future income), although the immediate reward associated with this might be negative [5].2.1.2 Q-learning Formulation [6]In Q-learning problem, the set of states and actions, together with rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:00111211s ,,,s ,,,...,s ,,,n n n n a r a r a r s --Here s i represents the state, i a is the action and 1i r +is the reward after performing theaction i a . The episode ends with terminal state n s . To perform well in the long-term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. Define the total future reward from time point t onward as:11...t t t n n R r r r r +-=++++ (1)In order to ensure the divergence and balance the immediate reward and future reward, total reward must use discounted future reward:111...n n t n t i t t t t n n i i t R r r r r r γγγγ----+-==++++=∑(2)Here γis the discount factor between 0 and 1, the more into the future the reward is, the less we take it into consideration. Transforming equation (2) can get:1t t t R r R γ+=+ (3)In Q-learning, define a function (, )t t Q s a representing the maximum discounted future reward when we perform action t a in state:1(,)max t t t Q s a R += (4)It is called Q-function, because it represents the “quality” of a certain action in a given state. 
A good strategy for an agent would be to always choose an action that maximizes the discounted future reward:()arg max (,)t t a t t s Q s a π= (5)Here π represents the policy, the rule how we choose an action in each state. Given a transition 1(,,)t t t s a s +, equation (3)(4) can get following bellman equation - maximumfuture reward for this state and action is the immediate reward plus maximum future reward for the next state:''1(,)max (,)t t t t a Q s a r Q s a γ+=+(6) The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function (,)t t Q s a , which is a table in. Here is the overall algorithm 1:Algorithm 1 Q-learningInitialize Q[num_states, num_actions] arbitrarilyObserve initial state s 0RepeatSelect and carry out an action aObserve reward r and new state s’'''(,)(,)(max (,)(,))a Q s a Q s a r Q s a Q s a αγ=++-s = s’Until terminated2.2 Deep Q-learning NetworkIn Q-learning, the state space often is too big to be put into main memory. A game frame of 8080⨯ binary images has 64002states, which is impossible to be represented by Q-table. What’s more, during training, encountering a known state, Q-learning just perform a random action, meaning that it’s not heuristic. In order overcome these two problems, just approximate the Q-table with a convolutional neural networks (CNN)[7][8]. This variation of Q-learning is called Deep Q-learning Network (DQN) [9][10]. After training the DQN, a multilayer neural networks can approach the traditional optimal Q-table as followed:*(,;)(,)t t t t Q s a Q s a θ= (7)As for playing flappy bird, the screenshot s t is inputted into the CNN, and the outputs are the Q-value of actions, as shown in Figure 3:Figure 3: In DQN, CNN’s input is raw game image while its outputs are Q-values Q(s,a), one neuron corresponding to one action’s Q-value.In order to update CNN’s weight, defining the cost function and gradient update function as [9][10]: '2'11max (,;)(,;)2t t t t a L r Q s a Q s a θθ-+⎡⎤=+-⎣⎦ (8) ''1(max (,;)(,;)(,;)t t t t t t a L r Q s a Q s a Q s a θθγθθθ-+⎡⎤∇=+-∇⎣⎦(9)()L θθθθ---=+∇ (10)Here, θare the DQN parameters that get trained and θ-are non-updated parameters for the Q-value function. During training, use equation(9) to update the weights of CNN.Meanwhile, obtaining optimal reward in every episode requires the balance between exploring the environment and exploiting experience.ε-greedy approach can achieve this target. When training, select a random action with probability ε o r otherwise choose the optimal action ''argmax (,;)t a a Q s a θ= . The εanneals linearly to zero with increase in number of updates.2.3 Input Pre-processingWorking directly with raw game frames, which are 288512⨯pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.Figure 4: Pre-process game frames. First convert frames to gray images, then down-sample them to specific size. Afterwards, convert them to binary images, finally stack up last 4 frames as a state.In order to improve the accuracy of the convolutional network, the background of game was removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an 8080⨯image. Then convert the gray image to binary image. In addition, stack up last 4 game frames as a state for CNN. 
The current frame is overlapped with the previous frames with slightly reduced intensities and the intensity reduces as we move farther away from the most recent frame. Thus, the input image will give good information on the trajectory on which the bird is currently in.2.4 Experience Replay and StabilityBy now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. But the approximation of Q-values using non-linear functions is not very stable. In Q-learning, the experiences recorded in a sequential manner are highly correlated. If sequentially use them to update the DQN parameters, the training process might stuck in a local minimal solution or diverge.To ensure the stability of training of DQN, we use a technical trick called experience replay. During game playing, particular number of experience 11(,,,)t t t t s a r s ++ are stored in a replay memory. When training the network, random mini-batches from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. As a result of this randomness in the choice of the mini-batch, the data that goes in to update the DQN parameters are likely to be de-correlated.Furthermore, to better the stability of the convergence of the loss functions, we use a clone of the DQN model with parameters θ-. The parameters θ-are updated to θ after every C updates to the DQN.2.5 DQN Architecture and AlgorithmAs shown in Figure 5, firstly, get the flappy bird game frame, and after pre-processing described in section 2.3, stack up last 4 frames as a state. Input this state as raw images into the CNN whose output is the quality of specific action in given state., the agent performs an action According to policy ()arg max (,)t t a t t s Q s a π=, with probability ε, otherwise perform a random action. The current experience is stored in a replay memory, a random mini-batch of experiences are sampled from the memory and used to perform a gradient descent on the CNN’s parameters. This is an interactive process until some criteria are being satisfied.Figure 5: DQN’s training architecture: upper data flow show the training process, whilethe lower data flow display the interactive process between the agent and environment.The complete DQN training process is shown in Algorithm 2. We should note that the εfactor is set to zero during test, and while training we use a decaying value, balancing the exploration and exploitation.Algorithm 2 Deep Q-learning NetworkInitialize replay memory D to certain capacity N Initialize the CNN with random weights θInitialize θ-=: θfor games = 1: maxGames dofor snapShots = 1: T doWith probability εselect a random action a totherwise select '':argmax (,;)t a a Q s a θ=Execute a t and observe r t+1 and next sate s t+1Store transition (s t ,a t , r t+1 , s t+1) in replay memory DSample mini-batch of transitions from Dfor j = 1: batchSize doif game terminates at next state thenQ_pred =: r jelseQ_pred =: r j + ''1max (,;)t a Q s a θ-+end ifPerform gradient descent on 21(_(,;))2t t L Q pred Q s a θ=- according to equation (10)end forEvery C steps reset θ-=: θend forend for3 ExperimentsThis section will describe our algorithm’s parameters setting and the analysis of experiment results.3.1 Parameters SettingsFigure 6 illustrates our CNN’s layers setting. The neural networks has 3 CNN hidden layers followed by 2 fully connected hidden layers. 
Table 1 show the detailed parameters of every layer. Here we just use a max pooling in the first CNN hidden layer. Also, we use the ReLU activation function to produce the neural output.Figure 6: The layer setting of CNN: this CNN has 3 convolutional layers followed by 2 fully connected layers. As for training, we use Adam optimizer to update the CNN’s parameters.Table 1: The detailed layers setting of CNNTable 1 lists all the parameter setting of DQN. We use a decayed ranging from 0.1 to 0.001 to balance exploration and exploitation. What’s more, Table 2 shows that the batch stochastic gradient descent optimizer is Adam with batch size of 32. Finally, we also allocate a large replay memory.Table 2: The training parameters of DQN3.2Results AnalysisWe train our model about 4 million epochs. Figure 7 shows the weights and biases of CNN’s first hidden layer. The weights and biases finally centralize around 0, with low variance, which directly stabilize CNN’s output Q-value(,)Q s a and reducet tprobability of random action. The stability of CNN’s parameters leads to obtaining optimal policy.Figure 7: Left (right) figure is the histogram of weights (biases) of CNN’s first hidden layerFigure 8is the cost value of DQN during training. The cost function has a slow downtrend, close to 0 after 3.5 million epochs. It means that DQN has learned the most common state subspace and will perform optimal action when coming across known state.In a word, DQN has obtained its best action policy.Figure 8:DQN’s cost function: the plot shows the training progress of DQN. We trained our model about 4 million epochs.When playing flappy bird, if the bird gets through the pipe , we give a reward 1, if dead, give -1, otherwise 0.1. Figure 9 is the average returned reward from environment. The stabiltiy in final training state means that the agent can automatically choose the best action, and the environment gives the best reward in turns. We know that the agent and environment has enter into a friendly interaction, guaranteeing the maximal total reward.Figure 9: The average returned reward from environment. We average the returned reward every 1000 epochs.From this Figure 10, the predicted max Q-value from CNN converges and stabilizes in a value after about 100 000. It means that CNN can accurately predict the quality of actions in specific state, and we can steadily perform actions with max Q-value. The convergence of max Q-values states that CNN has explored the state space widely and greatly approximated the environment well.Figure 10: The average max Q-value obtained from CNN’s output. We average the max Q-value every 1000 epochs.Figure 11 illustrates the DQN’s action strategy. If the predicted max Q-value is so high, we are confident that we will get through the gap when perform the action with max Q-value like A, C. If the max Q-value is relatively low, and we perform the action, we might hit the pipe, like B. In the final state of training, the max Q-value is dramatically high, meaning that we are confident to get through the gaps if performing the actions with max Q-value.Figure 11: The leftmost plot shows the CNN’s predicted max Q-value for a 100 frames segment of the game flappy bird. The three screenshots correspond to the frames labeled by A, B, and C respectively.4ConclusionWe successfully use DQN to play flappy bird, which can outperform human beings. DQN can automatically learn knowledge from environment just using raw image to play games without prior knowledge. 
This feature give DQN the power to play almost simple games. Moreover, the use of CNN as a function approximation allow DQN to deal with large environment which has almost infinite state space. Last but not least, CNN can also greatly represent feature space without handcrafted feature extraction reducing the massive manual work.5References[1] C. Clark and A. Storkey. Teaching deep convolutional neural networks to play go.arXiv preprint arXiv:1412.3409, 2014. 1.[2]Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification withdeep convolutional neural networks.In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.[3]George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30 –42, 2012, 1.[4]Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MITPress, 1998.[5]Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored statesand actions. Journal of Machine Learning Research, 5:1063–1088, 2004.[6]Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.[7]Hamid Maei, Csaba Szepesv´ari, Shalabh Bhatnagar, and Richard S. Sutton.Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719–726, 2010.[8]Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.[9]V.Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D.Wierstra, andM.Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013. 1.[10]V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A.Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. 3, 5.。


