KinectV2传感器实时动态手势识别算法
KinectV2传感器实时动态手势识别算法
Figure 2. Gesture sample library building flowchart 图 2. 手势样本库建立流程图
10
仪器与设备
刘希东 等
4. 手势运动特征提取
如何提取低计算量,低复杂度的特征并且找到统一的动作特征来对人体动作进行识别,是手势识别 过程中的关键问题。以往的研究者们在进行骨骼动作的识别的时候,主要提取位置特征,角度特征或者 速度特征来描述手势的运动[9],然而对于不同身高,不同体型的人做同一动作或者同一个人做不同动作 的过程中,由于每个人的骨骼关节位置信息有一定的差异,运动速度也不尽相同[10],均不能统一而准确 的描述手势的运动,但是即使每个人的骨骼坐标位置和运动速度不同,在运动过程中,关节间的夹角特 征并没有发生明显变化,基于这种现状,本文中提出了将手势运动过程中的骨骼间的向量夹角作为手势 特征描述,并结合动态时间规整(DTW)算法进行模板匹配,夹角特征提取过程如下。
KinectV2 Sensor Real-Time Dynamic Gesture Recognition Algorithm
Xidong Liu1, Yan Meng1, Guoyou Li2, Lei Tang2, Jiadong Guo2
1College of Biological and Environmental Engineering, Tianjin Vocational Institute University, Tianjing 2Shanghai Institute of Aerospace Technology, Shanghai
关Hale Waihona Puke 词KinectV2传感器,手势识别,骨骼角度特征,人机交互
Copyright © 2020 by author(s) and Hans Publishers Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). /licenses/by/4.0/
基于微软Kinect的手势识别技术研究
基于微软Kinect的手势识别技术研究近年来,随着科技的快速发展,各种智能设备也逐渐普及。
其中,基于微软Kinect的手势识别技术备受关注。
这一技术通过使用Kinect摄像头和深度传感器,可以感知人体的运动姿势和动作,并将其转化为相应的指令,实现对设备的控制。
本文就从技术原理、应用场景及发展趋势等方面进行探讨。
一、技术原理微软Kinect是一款专门用于游戏控制的设备,其核心技术在于3D深度摄像头和红外发射器。
摄像头能够捕捉周围环境的深度信息,而红外发射器则能够检测人体的运动姿势。
其中,深度摄像头通过红外线的反射和接收,可以精确地测量物体距离,将其转化为三维坐标。
而红外发射器则可以发射红外信号,以检测人体在空间中的运动。
基于这一技术,Kinect的手势识别功能应运而生。
通过人体骨架模型的建立和动作捕捉算法,Kinect可以识别人体姿态和动作,并将其转化成计算机能够理解的指令。
例如,向左移动手臂可以控制游戏中的角色向左转动,向前伸手可以控制角色前进等等。
同时,Kinect还支持语音控制和面部表情识别等功能,大大提升了用户的交互体验。
二、应用场景手势识别技术作为一种新兴的交互方式,已经得到了广泛的应用。
除了游戏控制以外,其实际应用场景还包括人机交互、智能家居、虚拟现实等领域。
在人机交互领域,手势识别技术可以被应用于机器人控制、医疗诊断、智能办公等方面。
例如,在智能办公场景中,用户可以通过手势控制电脑的开关、程序的启动以及文档的编辑等操作,提高工作效率。
在智能家居领域,手势识别技术可以被应用于智能家电的控制、家庭安防等方面。
例如,在智能家电方面,用户可以通过手势控制灯光的开关、音响的播放以及电视的切换等操作,提高家居生活的便利程度。
在虚拟现实领域,手势识别技术可以被应用于3D模型设计、游戏控制等方面。
例如,在游戏领域中,用户可以通过手势控制游戏角色的动作和攻击技能,提供更为真实的游戏体验。
三、发展趋势手势识别技术虽然已经取得了很大的进展,但还有很多发展空间。
基于Kinect动态手势识别的机械臂实时位姿控制系统
基于Kinect动态手势识别的机械臂实时位姿控制系统倪涛;赵泳嘉;张红彦;刘香福;黄玲涛【期刊名称】《农业机械学报》【年(卷),期】2017(48)10【摘要】基于Kinect动态手势识别达到实时控制机械臂末端位姿的效果.位置控制信息的获取采用Kinect计算手部4个关节点在控制中的位置变动,数据噪声在控制中易引起机械臂误动作和运动振动等问题,为了避免噪声对实时控制的不利影响,采用卡尔曼滤波跟踪降噪.姿势控制信息通过采集手部点云经滤波处理后应用最小二乘拟合的方式获取掌心所在平面,运用迭代器降噪处理.系统通过对手部位置和姿势信息的整合、手势到机械臂空间坐标映射及运动学求解来实时控制机械臂末端位姿.实验结果证明,手势控制系统满足控制要求,简单、易于操作,机械臂实时响应速度快、运动准确.%The research achieved to control the mechanical arm position and pose by using real-time dynamic gesture recognition based on Kinect device.The information of the position controlling was obtained by calculating the position changes of the four hand joint points.The noise of the joints was liable to lead mechanical arm misoperation and the vibration of motion during the control of the mechanical arm.Aiming to avoid the negative impact of the noise in real-time controlling,Kalmanfilter was adopted to track position and reduce noise.According to the hand point cloud information,the information of the posture controlling was obtained by means of using least squares fitting to get the plane of hand mind.The end of the position and pose of the mechanical arm wascontrolled by integrating the position and posture information,space coordinate mapping and the resolving of kinematics in real-time.The result of the experiment indicated that the gesture control was easy to operate and mechanical arm responded at high speed.The effect of filter was so remarkable that the motion of the mechanical arm was controlled accurately and smoothly,and no mechanical arm misoperation and others controlling anomaly.Gesture control system could meet the requirement of actually controlling.System could be applied to a variety of human-computer interaction.【总页数】8页(P417-423,407)【作者】倪涛;赵泳嘉;张红彦;刘香福;黄玲涛【作者单位】吉林大学机械科学与工程学院,长春130022;吉林大学机械科学与工程学院,长春130022;吉林大学机械科学与工程学院,长春130022;吉林大学机械科学与工程学院,长春130022;吉林大学机械科学与工程学院,长春130022【正文语种】中文【中图分类】TP241;TP311.52【相关文献】1.基于Kinect的动态手势识别 [J], 王兵;董洪伟;张明敏;潘志庚2.基于Kinect的实时手势识别方法 [J], 田元;王学璠;王志锋;陈加;姚璜3.基于Kinect的动态手势识别算法改进与实现 [J], 李国友; 孟岩; 闫春玮; 宋成全4.基于Kinect的动态手势识别研究 [J], 邵天培;蒋刚;留沧海5.基于Kinect和改进DTW算法的动态手势识别 [J], 魏秋月;刘雨帆因版权原因,仅展示原文概要,查看原文内容请购买。
基于Kinect的动态手势识别算法改进与实现
基于Kinect的动态手势识别算法改进与实现李国友; 孟岩; 闫春玮; 宋成全【期刊名称】《《高技术通讯》》【年(卷),期】2019(029)009【总页数】11页(P841-851)【关键词】Kinect V2传感器; 动态手势识别; 改进隐马尔科夫模型(HMM); 未定义手势; 识别率【作者】李国友; 孟岩; 闫春玮; 宋成全【作者单位】燕山大学电气工程学院秦皇岛066004【正文语种】中文0 引言手势作为人们日常生活中的习惯交流方式,有着直观、自然的特性。
因此,随着计算机技术的不断发展,基于机器视觉的手势识别技术也逐渐成为人机交互领域的研究热点。
基于手势的人机交互方式的出现使用户不再局限于鼠标、键盘等传统的人机交互方式,而是一种类似于人与人的,更加自然流畅的交互方式[1]。
由于手势识别交互方式的自然性和灵活性,使其广泛应用于各个领域,例如医学图像、聋哑人辅助生活、远程交流、机器人操作、虚拟仿真、电子游戏及无人驾驶等领域[2]。
手势识别根据手势运动状态主要分为静态手势识别和动态手势识别[3]。
动态手势识别相对静态手势识别而言,动态手势可以更直观、更方便地表达客户的用途,满足用户的需求[4]。
但在研究过程中也有一定的困难,比如特征复杂、分类困难等。
近年来,主要的动态手势学习和识别方法有隐马尔科夫模型(hidden Markov model,HMM)、动态时间规整(dynamic time warping, DTW)、最长公共子序列(longest common subsequence, LCSS)、K近邻(K-nearest neighbor, KNN)[5]等。
其中动态时间规整、最长公共子序列和K近邻属于模板匹配的方法,隐马尔科夫模型是概率统计的方法[6]。
李凯等人[7]提出了一种改进的动态时间规整算法,该方法将获取到的Kinect骨骼点坐标及手型数据结合,构造了矢量特征来描述手的运动轨迹,实现了手势的快速匹配。
基于Kinect传感器深度信息的动态手势识别
第 4期
厦 门大 学 学报 ( 自然 科 学版 )
J o u r n a l o f Xi a me n Un i v e r s i t y( Na t u r a l S c i e n c e )
Vo 1 . 5 2 No . 4
2 0 1 3年 7月
模型空 间里的点分类 到该空 间里某个 子集 的过 程 ; 而动 态手势识别 就是 把这 些 点组 成 的轨 迹分 类 到某 个 子集 的过程 , 由于涉及 时 间及 空 间上 下文 , 在 进行 建 模 时通
势; 文 献[ 5 ] 提 出用 S UR F剪 裁手 势 , 并 通 过简 单 的 前
常显示 为一条动作 轨迹 , 这种识 别难度 大但 更具 有实用
性. 现有 的动态手 势识 别 技术 可 以分 归 3类 : 基 于 隐 马 尔可 夫 模 型 ( HMM) 的 识 别 ] , 基 于 动 态 时 间 规 整 ( D TW) 的识 别 以 及 基 于 压 缩 时 间 轴 的 识 别 l - 3 ] . 基 于
激光散斑 获取深度 信息 , 不受环 境 的光 照变 化和 复杂 背 景的影 响 , 以此来进行 手势跟踪 与识别.
处理( 手势 跟踪 、 手势 分 割 、 特征提取等) ; 二 是 手 势 运 动轨迹 提取 和分 类. 例如文 献[ 2 ] 提 出采用 三维 坐标 系
利用手 势重 心 的 中心位 置 进 行 轨迹 提 取 和 分类 , 识 别
J u 1 .2 o 1 3
基 于 Ki n e c t 传 感 器 深 度 信 息 的 动 态 手 势 识 别
陶丽君 , 李 翠 华 , 张希婧 , 李 胜睿
( 厦 门大 学 信 息 科 学 与 技 术 学 院 , 福建 厦门 3 6 1 0 0 5 )
基于Kinect的动态手势识别系统的开题报告
基于Kinect的动态手势识别系统的开题报告1. 问题提出在现代生活中,人与计算机的交互方式越来越多样化。
其中手势交互成为一种快速、自然的交互方式。
手势识别技术的发展使得计算机可以根据人体动作的信息实现人与计算机之间的交互。
Kinect作为一种深度摄像头,可以捕捉人体动作以及深度信息,为手势识别技术提供强有力的支持,被广泛应用于人机交互领域。
然而,目前基于Kinect的手势识别系统还存在很多问题,例如:精度不够高、实时性差、容易被环境影响等。
因此,开发一种高效、实用的基于Kinect的动态手势识别系统具有重要的研究价值和实际意义。
2. 研究目标本文旨在设计一种基于Kinect的动态手势识别系统,具体研究目标包括:(1) 建立手势库:收集并整理手势图片,建立丰富的手势库。
(2) 设计手势识别算法:通过分析和比较不同的手势识别算法,设计出一种精度高、实时性好的手势识别算法。
(3) 系统设计与实现:根据手势识别算法,设计并实现一套完整的基于Kinect的动态手势识别系统,包括图像采集、手势追踪、手势识别等模块。
(4) 系统优化与实验验证:通过实验验证和系统优化,提高系统的性能和稳定性,并对系统的精度、实时性等参数进行评估和分析,分析系统的优缺点以及未来改进方向。
3. 研究方法本文采用以下研究方法:(1) 文献调研:调研国内外关于基于Kinect的手势识别系统的研究现状和发展趋势,分析已有手势识别算法的优缺点,探索新的算法和实现方法。
(2) 系统设计:根据手势识别算法和系统需求,设计系统的整体框架、数据流程和模块实现。
(3) 系统实现:利用C#等编程语言和Visual Studio等开发工具,实现系统的各个模块,完成手势采集、识别、运动跟踪等功能。
(4) 系统测试:选取不同场景下的手势图片,对系统进行测试并进行参数分析和性能评估,分析系统的优缺点及未来的改进方向。
4. 研究意义本文将研究和实现一套高效的基于Kinect的动态手势识别系统,为人机交互技术提供了一种新的交互方式。
基于kinect的动态手势识别算法改进与实现
路径约束改进的 DTW 算法ꎬ该方法选取 8 个点作为
学模型ꎮ 动态时间规整法适用于处理特征匹配问
题ꎬ但相比于 HMMꎬ不能融合所有训练数据的特性ꎬ
导致其对于模型的代表性不是很好ꎮ KNN 虽然可
以选择具有代表性的轨迹特征作为模板ꎬ但是用户
ꎮ 但在研究过程中
完成一个动作时仍会有较大差异ꎮ Nguyen ̄Dinh 使
像、聋哑人辅助生活、远程交流、机器人操作、虚拟仿
手部特征ꎬ并且通过加权距离的方法建立手势的数
的交互方式
[1]
和灵活性ꎬ使其广泛应用于各个领域ꎬ例如医学图
真、电子游戏及无人驾驶等领域
[2]
ꎮ
手势识别根据手势运动状态主要分为静态手势
识别和动态手势识别 [3] ꎮ 动态手势识别相对静态
手势识别而言ꎬ动态手势可以更直观、更方便地表达
进行均匀量化编码ꎬ通过设置概率阈值模型及编码的种类来排除未定义手势、进行动态手
势识别ꎬ并对比不同实验环境下的识别效果ꎮ 实验结果表明ꎬ改进后的 HMM 算法有效地
排除了多种未定义手势ꎬ能够适应复杂背景和黑暗条件ꎬ而且能够提高对已定义手势的识
别率ꎮ
关键词 Kinect V2 传感器ꎬ 动态手势识别ꎬ 改进隐马尔科夫模型( HMM) ꎬ 未定义手势ꎬ
1 手势分割及质心轨迹跟踪
来ꎬ主要的动态手势学习和识别方法有隐马尔科夫
时将 LCSS 分别与 DTW 和 SVM 进行了比较ꎬ结果表
客户的用途ꎬ满足用户的需求
[4]
也有一定的困难ꎬ比如特征复杂、分类困难等ꎮ 近年
①
②
③
用最长公共子序列实现了实时的动态手势识别ꎬ同
国家自然科学基金( F2012203111) 和河北省高等学校科学技术研究青年基金(2011139) 资助项目ꎮ
kinect v2原理
Kinect V2原理介绍Kinect V2是微软开发的一种深度感应器,通过红外线投影和红外相机共同工作,能够实现对用户的动作和姿势的跟踪。
本文将深入探讨Kinect V2的工作原理和技术细节。
红外线投影Kinect V2使用了红外线投影技术来获取深度信息。
它通过发射大量的红外光点到场景中,然后利用红外相机来获取这些光点的位置信息。
这种投影方式能够在不受外部光照影响的情况下获取深度信息,并且适用于各种室内环境。
红外线光源Kinect V2使用一个内置的红外线激光发射器作为光源。
该激光发射器能够发射大量集中在一个平面上的红外光点,为后续的深度信息获取提供了基础。
红外线相机Kinect V2内置了一台红外线相机,用于捕捉红外光点的位置信息。
这台相机具有高分辨率和高帧率的特点,能够精确地捕捉到红外光点的位置,并将其转换为深度信息。
深度感应原理通过红外线投影和红外相机的配合,Kinect V2能够实现对场景中物体的深度感应。
具体的原理如下:1.发射红外线光点 Kinect V2发射的红外线光点会照射到场景中的物体上,光点在物体表面产生反射。
2.接收红外线光点红外相机会捕捉到反射的红外线光点,并记录它们的位置信息。
3.计算深度值 Kinect V2通过比较红外线激光发射器和红外相机之间的距离,计算出每个红外线光点的深度值。
4.生成深度图像利用红外线光点的深度值,Kinect V2可以生成一个深度图像,其中每个像素点表示对应位置的物体距离红外相机的距离。
骨骼追踪技术除了深度感应,Kinect V2还支持骨骼追踪技术,能够实时跟踪用户的动作和姿势。
它通过分析深度图像中的物体形状和动态信息,提取出用户的骨骼关节位置。
深度图像处理为了进行骨骼追踪,Kinect V2首先需要对深度图像进行处理,以提取出物体的轮廓和形状信息。
它使用了一系列的图像处理算法,包括边缘检测、轮廓提取和形状匹配等。
骨骼模型Kinect V2使用了一个预定义的骨骼模型来表示人体的骨骼结构。
《2024年基于Kinect的手势识别与机器人控制技术研究》范文
《基于Kinect的手势识别与机器人控制技术研究》篇一一、引言随着人工智能技术的不断发展,人机交互技术已成为研究热点之一。
其中,基于Kinect的手势识别技术因其高精度、实时性和自然性成为了重要的人机交互方式。
同时,机器人控制技术也在不断进步,如何将手势识别技术应用于机器人控制,提高机器人的智能化水平,已成为研究的重要方向。
本文旨在研究基于Kinect的手势识别与机器人控制技术,探索其应用前景和实现方法。
二、Kinect手势识别技术Kinect是一种常用的深度传感器,能够通过捕捉人体运动信息来实现手势识别。
基于Kinect的手势识别技术主要包括以下几个步骤:1. 数据采集:通过Kinect传感器捕捉人体运动信息,包括骨骼数据、颜色信息和深度信息等。
2. 数据预处理:对采集到的数据进行预处理,如去除噪声、平滑处理等,以提高数据的准确性和可靠性。
3. 特征提取:从预处理后的数据中提取出与手势相关的特征,如手势的形状、速度、加速度等。
4. 模式识别:采用模式识别算法对提取出的特征进行分类和识别,实现手势的分类和识别。
三、机器人控制技术机器人控制技术是实现机器人运动和行为的关键技术。
基于Kinect的手势识别技术可以应用于机器人控制,实现机器人的智能化控制。
机器人控制技术主要包括以下几个方面的内容:1. 运动规划:根据机器人的任务需求,制定合理的运动轨迹和姿态。
2. 控制算法:采用控制算法对机器人的运动进行控制和调节,保证机器人的稳定性和精度。
3. 传感器融合:将多种传感器信息进行融合,提高机器人的感知能力和反应速度。
四、基于Kinect的手势识别与机器人控制技术的研究基于Kinect的手势识别与机器人控制技术的结合,可以实现人机自然交互,提高机器人的智能化水平。
具体实现方法包括:1. 构建系统框架:搭建基于Kinect的手势识别系统,将手势识别结果传输给机器人控制系统。
2. 训练模型:采用机器学习算法对手势识别模型进行训练和优化,提高识别的准确性和实时性。
基于Kinect的指尖检测与手势识别方法
Fi ng e r t i p d e t e c t i o n a n d g e s t ur e r e c o g ni t i o n me t ho d ba s e d o n Ki ne c t
T AN J i a p u ,XU We n s h e n g
o n t h e i s s ue , a h nd a g e s t u r e r e c o g ni t i o n me t h o d ba s e d o n de p t h i ma g e , s k e l e t o n i ma g e n d a c o l o r i ma g e i n f o r ma t i o n Wa s
t h e d i s t nc a e s b e t we e n t h e g e s t u r e c o n t o u r p o i n t s nd a t h e p l m a p o i n t w e r e c a l c u l a t e d t o g e n e r a t e t h e d i s t a n c e c u r v e , a nd t h e r a t i o o f c u r v e p e a k s t o t r o u g h s wa s s e t u p t o o b t a i n i f n g e r p o i n t .F i n ll a y ,c o mmo n l y u s e d 1 2 g e s t u es r we r e i d e n t i i f e d b y c o mb i n i n g en b d i n g i f n g e r p o i n t f e a t u r e s a n d t h e ma x i mu m a mo u n t o f c o n t o u r a r e a .S i x e x p e ime r n t e r s we r e i n v i t e d t o v li a d a t e t h e p r o p o s e d me t h o d i n e x p e i r me n t l a r e s u l t s v e i r i f c a t i o n p h a s e .Ev e r y g e s t u r e W s a e x p e i r me n t e d 1 2 0 t i me s u n d e r t h e c o n d i t i o n o f r e l a t i v e l y s t a b l e l i g h t e n v i r o n me n t a n d t h e r e c o ni g t i o n r a t e o f 1 2 g e s t u r e s w a s 9 7 . 9 2 % o n a v e r a g e .T h e e x p e i r me n t a l r e s u l t s s h o w t h a t t h e p op r os e d me t h o d C n a q u i c k l y l o c a t i o n g e s t u r e s nd a a c c u r a t e l y r e c o ni g z e t h e c o mmo n l y u s e d 1 2 k i n d s o f h a n d
《2024年基于Kinect的手势识别与机器人控制技术研究》范文
《基于Kinect的手势识别与机器人控制技术研究》篇一一、引言随着人工智能技术的快速发展,人机交互技术已经成为当今研究的热点。
其中,基于Kinect的手势识别与机器人控制技术,以其高效、自然的人机交互方式,正受到广泛关注。
本文将重点探讨基于Kinect的手势识别技术及其在机器人控制领域的应用。
二、Kinect技术概述Kinect是微软开发的一款体感摄像头,它能够捕捉人体动作、姿态和手势等信息,从而实现自然的人机交互。
Kinect技术通过深度传感器和RGB摄像头等设备,对人体进行三维空间的定位和跟踪,从而实现对人体动作的精确识别。
三、手势识别技术手势识别是Kinect技术的重要应用之一。
通过对手势的捕捉和分析,可以实现对人机交互的进一步优化。
基于Kinect的手势识别技术主要包括以下步骤:1. 数据采集:利用Kinect的深度传感器和RGB摄像头,对人体进行三维空间的定位和跟踪,获取手势数据。
2. 数据预处理:对采集到的数据进行去噪、平滑等处理,以提高手势识别的准确性。
3. 特征提取:从预处理后的数据中提取出手势的特征信息,如手势的形状、运动轨迹等。
4. 模式识别:通过机器学习、深度学习等算法,对提取出的特征信息进行分类和识别,实现手势的分类和识别。
四、机器人控制技术应用基于Kinect的手势识别技术可以广泛应用于机器人控制领域。
通过对手势的识别和分析,可以实现对机器人的远程控制。
具体应用包括:1. 家庭服务机器人:通过识别用户的简单手势,如挥手、指向等,实现对家庭服务机器人的控制,如开关电视、调节灯光等。
2. 工业机器人:在工业生产线上,通过识别工人的手势指令,实现对工业机器人的远程操控,提高生产效率。
3. 医疗康复机器人:在医疗康复领域,通过识别患者的康复训练手势,实现对康复机器人的控制,帮助患者进行康复训练。
五、技术研究挑战与展望虽然基于Kinect的手势识别与机器人控制技术已经取得了很大的进展,但仍面临一些挑战和问题。
Kinect手势识别原理
Accurate,Robust,and Flexible Real-time Hand Tracking Toby Sharp†Cem Keskin†Duncan Robertson†Jonathan Taylor†Jamie Shotton†David Kim Christoph Rhemann Ido Leichter Alon Vinnikov Yichen WeiDaniel Freedman Pushmeet Kohli Eyal Krupka Andrew Fitzgibbon Shahram IzadiMicrosoft ResearchFigure1:We present a new system for tracking the detailed motion of a user’s hand using only a commodity depth camera. Our system can accurately reconstruct the complex articulated pose of the hand,whilst being robust to tracking failure,and supportingflexible setups such as tracking at large distances and over-the-shoulder camera placement.ABSTRACTWe present a new real-time hand tracking system based on a single depth camera.The system can accurately reconstruct complex hand poses across a variety of subjects.It also allows for robust tracking,rapidly recovering from any temporary failures.Most uniquely,our tracker is highlyflexible,dra-matically improving upon previous approaches which have focused on front-facing close-range scenarios.Thisflexibil-ity opens up new possibilities for human-computer interaction with examples including tracking at distances from tens of centimeters through to several meters(for controlling the TV at a distance),supporting tracking using a moving depth cam-era(for mobile scenarios),and arbitrary camera placements (for VR headsets).These features are achieved through a new pipeline that combines a multi-layered discriminative reini-tialization strategy for per-frame pose estimation,followed by a generative model-fitting stage.We provide extensive techni-cal details and a detailed qualitative and quantitative analysis. INTRODUCTIONThe human hand is remarkably dextrous,capable of high-bandwidth communication such as typing and sign language. Computer interfaces based on the human hand have so far been limited in their ability to accurately and reliably track the detailed articulated motion of a user’s hand in real time.We believe that,if these limitations can be lifted,hand tracking will become a foundational interaction technology for a wide range of applications including immersive virtual reality,as-sistive technologies,robotics,home automation,and gaming. However,hand tracking is challenging:hands can form a va-riety of complex poses due to their many degrees of free-dom(DoFs),and come in different shapes and sizes.Despite some notable successes(e.g.[7,32]),solutions that augment †denotes jointfirst authorship. denotes joint last authorship.Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on thefirst page.Copyrights for components of this work owned by others than ACM must be honored.Abstracting with credit is permitted.To copy otherwise, or republish,to post on servers or to redistribute to lists,requires prior specific permission and/or a fee.Request permissions from permissions@.CHI’15,Apr18–232015,Seoul,Republic of Korea.Copyright is held by the owner/author(s).Publication rights licensed to ACM.ACM978-1-4503-3145-6/15/04...$15.00./10.1145/2702123.2702179.the user’s hand with gloves or markers can be cumbersome and inaccurate.Much recent effort,including this work,has thus focused on camera-based systems.However,cameras, even modern consumer depth cameras,pose further difficul-ties:thefingers can be hard to disambiguate visually and are often occluded by other parts of the hand.Even state of the art academic and commercial systems are thus sometimes in-accurate and susceptible to loss of track,e.g.due to fast mo-tion.Many approaches address these concerns by severely constraining the tracking setup,for example by supporting only close-range and front facing scenarios,or by using mul-tiple cameras to help with occlusions.In this paper,we present a system that aims to relax these constraints.We aim for high accuracy,i.e.the correctness andfidelity of thefinal reconstruction across a wide range of human hand poses and motions.Our system is remarkably robust,i.e.can rapidly recover from momentary failures of tracking.But perhaps most uniquely,our system isflexible:it uses only a single commodity depth camera;the user’s hand does not need to be instrumented;the hand andfingers can point in arbitrary directions relative to the sensor;it works well at distances of several meters;and the camera placement is largely unconstrained and need not be static(see Fig.1). Our system combines a per-frame reinitializer that ensures ro-bust recovery from loss of track,with a model-fitter that uses temporal information to achieve a smooth and accurate result. We propose a new reinitializer that uses machine learning to efficiently predict a hierarchical distribution over hand poses. For the model-fitter,we describe a‘golden’objective function and stochastic optimization algorithm that minimizes the re-construction error between a detailed3D hand model and the observed depth image.The pipeline runs in real time on con-sumer hardware.We provide extensive technical details to aid replication,as well as a detailed qualitative and quantitative analysis and comparison on several test datasets.We show how our tracker’sflexibility can enable new interactive possi-bilities beyond previous work,including:tracking across the range of tens of centimeters through to a several meters(for highfidelity control of displays,such as TVs,at a distance); tracking using a moving depth camera(for mobile scenarios);and arbitrary camera placements,includingfirst person(en-abling tracking for head-worn VR systems).RELATED WORKGiven our aims described above,we avoid encumbering the user’s hand with data gloves[7],colored gloves[32],wear-able cameras[13],or markers[38],all of which can be bar-riers for natural interaction.We focus below on vision-based articulated hand tracking.Discriminative approaches work directly on the image data(e.g.extracting image features and using classification or regression techniques)to establish a mapping to a predefined set of hand pose configurations.Th-ese often do not require temporal information and can thus be used as robust reinitializers[21].Generative(or model-based)methods use an explicit hand model to recover pose. Hybrid methods(such as[3,23]and ours)combine discrim-inative and generative to improve the robustness of frame-to-frame modelfitting with per-frame reinitialization.RGB input Early work relied on monocular RGB cameras, making the problem extremely challenging(see[8]for a sur-vey).Discriminative methods[2,34]used small databases of restricted hand poses,limiting accuracy.Generative methods used models with restricted DoFs,working with simplified hand representations based on2D,2½D or inverse kinematics (IK)(e.g.[24,35]),again resulting in limited accuracy.Bray et al.[4]used a more detailed3D hand model.de La Gorce et al.[6]automatically apply a scaling to the bones in the hand model during tracking.Most of these systems performed of-fline tracking using recorded sequences.An early example of online(10Hz)tracking with a simplified deformable hand model is[10].This work,as with other RGB-based meth-ods,struggled with complex poses,changing backgrounds, and occlusions,thus limiting general applicability.Multi-camera input There has also been recent work on high-quality,offline(non-interactive)performance capture of hands using multi-camera rigs.Ballan et al.[3]demonstrate high-quality results closelyfitting a detailed scanned mesh model to complex two-handed and hand-object interactions. The pipeline takes about30seconds per frame.Zhao et al.[38] use a depth camera,motion capture rig,and markers worn on the user’s hand,to capture complex single-hand poses,again offline.Wang et al.[33]show complex hand-object interac-tions by minimizing a silhouette,color,and edge-based ob-jective using a physics engine,though take minutes per frame. Sridhar et al.[23]use a rig comprisingfive RGB cameras and a time-of-flight(ToF)sensor to track a user’s hand using a person-calibrated model at∼10Hz.We provide a direct com-parison later in this paper.The above systems can produce highly accurate results,but are impractical for interactive con-sumer scenarios.Depth input The advent of consumer depth cameras such as Kinect has made computer vision more tractable,for example through robustness to lighting changes and invariance to fore-ground and background appearance.This,together with the use of GPUs for highly-parallel processing,has made real-time hand tracking more feasible.However interactive,de-tailed hand tracking that is simultaneously accurate,robust, andflexible remains an unsolved and challenging problem. Oikonomidis et al.[17]present a generative method based on particle swarm optimization(PSO)for full DoF hand tracking (at15Hz)using a depth sensor.The hand is tracked from a known initial pose,and the method cannot recover from loss of track.Qian et al.[20]extend[17]by adding a‘guided’PSO step and a reinitializer that requiresfingertips to be cle-arly visible.Melax et al.[16]use a generative approach driven by a physics solver to generate3D pose estimates.These systems work only at close ranges,and are based on simple polyhedral models;our approach instead works across a wide range of distances,and exploits a full3D hand mesh model that is better able tofit to the observed data.Keskin et al.[12]propose a discriminative method using a multi-layered random-forest to predict hand parts and thereby tofit a simple skeleton.The system runs at30Hz on con-sumer CPU hardware,but can fail under occlusion.Tang et al.[27,26]extend this work demonstrating more complex poses at25Hz.Whilst more robust to occlusions than[12], neither approach employs an explicit modelfitting step mean-ing that results may not be kinematically valid(e.g.implausi-ble articulations orfinger lengths).Xu et al.[36]estimate the global orientation and location of the hand,regress candidate 21-DoF hand poses,and select the correct pose by minimiz-ing reconstruction error.The system runs at12Hz,and the lack of tracking can lead to jittery pose estimates.Tompson et al.[30]demonstrate impressive hand tracking results us-ing deep neural networks to predict feature locations and IK to infer a skeleton.While real-time,the approach only tack-les close-range scenarios.Wang et al.[32,31]demonstrate a discriminative nearest-neighbor lookup scheme using a large hand pose database,and IK for pose refinement.Nearest-neighbor methods are highly dependent on the database of poses,and can struggle to generalize to unseen poses. Commercial systems Beyond this research,there have also been commercial hand tracking systems.The3Gear Sys-tems[1](based on prior work of[32,31])and the second-generation software for the Leap Motion[15]have shown tracking of a range of complex poses,including two-handed interaction.We compare to both in our results section. SYSTEM OVERVIEW AND CONTRIBUTIONSWe follow recent approaches to hand and body tracking[3, 17,28]and adopt an approach based on‘analysis by synthe-sis’[37].We use machine learning and temporal propagation to generate a large set of candidate hand pose hypotheses for a new input frame.Each pose hypothesis contains a param-eter vector describing the global position and orientation of the hand,as well as the angles between joints in the hand skeleton.Each hypothesis is then rendered as an image us-ing standard graphics techniques,and scored against the raw input image.Finally,the best-scoring hypothesis is output. One of the main contributions of this paper is to make this approach practical,by which we mean accurate,robust,and flexible as described in the introduction.To this end,we make three technical contributions.First,we show how to use a simple‘golden’energy function to accurately distinguish good and bad hypotheses.*Second,we present a discrimina-tive approach(the‘reinitializer’)that predicts a distribution over hand poses.We can then quickly sample diverse pose hypotheses from this distribution,some of which are likely to be close to the correct answer.Third,we demonstrate that the rendering of a detailed articulated hand mesh model and the *‘Energy’,‘objective’,and‘scoring’functions are notionally equivalent, though‘good’will be used to imply low energy but high objective/score.Hand RoI Detector ReinitializerPSOGolden Energy RendererInput Depth Streamglobal rotation classifierglobal translation rotation refinement pose classifiersampled posesGPU.This goes beyond existing approaches that have relied on approximate hand models such as spheres and cylinders.PipelineEach input depth image is processed by a pipeline comprising three steps (see Fig.2):1.Hand RoI extraction :Identify a square region of interest (RoI)around the hand and segment hand from background.2.Reinitialization :Infer a hierarchical distribution over hand poses with a layered discriminative model applied to the RoI.3.Model fitting :Optimize a ‘population’of hand pose hy-potheses (‘particles’)using a stochastic optimizer based on particle swarm optimization (PSO).We describe each of these components in detail below,but first introduce some preliminaries.PRELIMINARIES 3D Hand ModelWe represent the human hand as a 3D model,represented by a detailed mesh of triangles and vertices.The 3D positions of the M mesh vertices are represented as columns in a 3×M matrix V that defines the hand shape in a ‘base’(rest)pose.To be useful for pose estimation,we must be able to articu-late the wrist,finger,and thumb joints in the model.We use a standard ‘kinematic’skeleton for this which specifies a hi-erarchy of joints and the transformations between them.We use vector θto denote the pose of the hand,including:global scale (3parameters);global translation (3);global orientation (3);and the relative scale (1)and rotations (3)at each joint (three joints per finger and thumb,plus the wrist).The global rotation is represented as a normalized quaternion which al-lows us to deal with arbitrary global rotations without risk of ‘gimbal lock’.The joint rotations are represented as ‘Euler angles’about local coordinate axes which are defined to cor-respond to ‘flexion’,‘abduction’and ‘twist’.Given the pose vector θ,the vertices of the posed mesh can be computed by a function Φ(θ;V ),which returns a 3×M matrix.The columns of this matrix correspond to those in the base pose model V ,but are in the pose specified by θ.For function Φwe use a standard technique called linear blend skinning (LBS).The precise details of LBS are unimportant for this paper,though see e.g.[29]for more details.Input depth imageOur algorithm operates on images captured by a single depthcamera.The image is represented as a two-dimensional array of depth valuesZ ={z ij |0≤i <H,0≤j <W }where z ij is the depth in meters stored at pixel location (i,j )and W and H are respectively the width and height of the depth map.We assume the depth map has been pre-processed such that invalid pixels (e.g.low reflectance or shadowed from the illuminator)are set to a large background depth value z bg .SCORING FUNCTION:THE ‘GOLDEN ENERGY’An underlying assumption of all analysis-by-synthesis algo-rithms is that,given unlimited computation,the best solution can be found by rendering all possible poses of the hand,and selecting the pose whose corresponding rendering best matches the input image (while appropriately accounting for the prior probabilities of poses).For this assumption to hold,a number of details are important.First,the model should be able to accurately represent the ob-served data.Overly approximate models such as those based on spheres and cylinders will struggle to accurately describe the observed data.While still an approximation,we employ a detailed skinned hand mesh that we believe can describe the observed data much more accurately.Furthermore,given our focus on flexible camera setups,the camera may see not just the hand,but also a large portion of the user’s body,as well as the background.To avoid having to simultaneously fit an en-tire model of the hand,body,and background to the depth im-age,we thus detect and extract (resample)a reasonably tight region of interest (RoI)Z roi ={¯z ij |0≤i <S,0≤j <S },comprising of S ×S pixels around the hand,and segment the hand from the background (see next section).Working with a tight RoI also increases the efficiency of the scoring function by ensuring that a large proportion of pixels in the rendered images belong to the hands.Second,several ‘short cut’solutions to rendering can invali-date the approach.For example,simply projecting each ver-tex of the model into the image and measuring distance (either in depth,or in 3D using a distance transform)fails to account for occlusion.We thus define a GPU-based rendering module which takes as input a base mesh V ,a vector of pose parame-ters θ,and a bounding box B within the original depth image Z ,and renders a synthetic depth imageR roi (θ;V ,B )={r ij |0≤i <S,0≤j <S }(1)of the same size as Z roi ,where background pixels receive the value r ij =z bg .The bounding box B is used to render R roi perspective-correctly regardless of its position in the original depth image.Given compatible rendered and acquired images,the ‘golden energy’scoring function we use is simple:E Au(Z roi ,R roi )=ijρ(¯z ij −r ij ).(2)Instead of a squared (‘L2’)error,we employ a truncated ‘L1’distance ρ(e )=min(|e |,τ)which is much less sensitive to outliers due to camera noise and other factors.Despite its simplicity,this energy is able to capture important subtleties.In particular,where a model hand pixel is rendered over data background,or model background over data from the hand,the contribution to the energy is z bg ,which is large enough to always translate to the truncation value τ.Where rendered and data are both z bg ,the score is zero.Input Depth, Approximate Hand Localizationx , and Inferred Segmentation ExtractedHand RoI Z dG r o u n d t r u t h I n p u t d e p t hT raining Data (Synthetic)xx B Figure 3:Hand Region of Interest (RoI)extraction.We dub the above function the ‘golden energy’to distinguish it from energies that do not directly aim to explain the image data such as [16,23,28].E Au very effectively captures the notion of analysis by synthesis,and experiments suggest that casting tracking as an optimization problem whose sole ob-jective is minimization of E Au yields high-quality results.We found the energy to be reasonably robust to different values of τ;all our experiments used a value τ=100mm.REGION OF INTEREST (RoI)EXTRACTIONThe RoI extraction process is illustrated in Fig.3left.We startfrom a rough estimate ˆxof the 3D hand location.This can be obtained in several ways,for example using motion extrapo-lation from previous frames,the output of a hand detector,or the hand position provided by the Kinect skeletal tracker.Theestimate ˆxis often fairly approximate,and so a second step is used to more accurately segment and localize the hand using a learned pixel-wise classifier based on [21].The classifier,trained on 100k synthetic images of the hand and arm (Fig.3right),is applied to pixels within a 3D search radius r 1aroundlocation ˆxto classify pixels as hand or arm.To precisely localize the hand,we then search for position x for which the accumulated hand probability within a 20cm ×20cm win-dow is maximal.Finally,we extract the RoI,using nearest-neighbor downsampling to a S ×S pixel image Z d for use in the golden energy.We typically set S between 64or 128:larger will preserve more detail but will be slower to render in the golden energy computation.Note that we do not remove the arm pixels (see below),though we do set pixels outside ra-dius r 2(<r 1)of x to background value z bg .We also record the bounding box B of the RoI in the original depth map.While in practice we found the classifier [21]to work reliably at improving the localization of the RoI around the hand,we decided against attempting to directly segment the hand from the forearm,for two reasons.First,the inferred boundary be-tween hand and arm proved rather imprecise,and from frame to frame would typically move up and down the arm by sev-eral centimeters.Second,we observed that there is consider-able value in observing a small amount of forearm that abuts the edge of the RoI.Unlike existing approaches,we include a forearm in our hand mesh model,allowing our golden energy to explain the observed forearm pixels within the RoI and thus help prevent the optimized hand pose from flipping or sliding up and down the arm.This part of the pipeline is similar to the hand segmentation approach in [30],but does not require training data for all possible backgrounds,can exploit large quantities of synthetic data,is potentially faster since the classifier is only applied to relatively few pixels,and is able to exploit the forearm signal.ROBUST REINITIALIZATIONOf all parts in our pipeline,we found the reinitializer the most critical in achieving the accuracy,robustness,and flexibility demonstrated in our results.Its primary goal is to output a pool of hypotheses of the full hand pose by observing just the current input depth image,i.e.without temporal information.The hypotheses will in due course be evaluated and refined refined in later stages of the pipeline.Existing approaches include multi-layer random forests [12],nearest-neighbor look-up [31],fingertip detection [20],or con-volutional neural networks [30].However,our focus on flexi-ble camera setups pushes the requirements for reinitialization far beyond what has yet been demonstrated.In particular,we place no restrictions on the global rotation of the hand.This means that we cannot rely on seeing fingertips,and further,the range of hand appearance that we expect to see is dra-matically increased compared to the near-frontal close-range hands evaluated in existing work.Our reinitialization component makes two significant advan-ces beyond existing work.First,we believe the problem is so hard that we cannot hope to predict a single good pose solu-tion.Instead,our approach predicts a distribution over poses,from which our model fitter is free to quickly sample as many poses as desired and use the golden energy to disambiguate the good from the bad candidates.Second,while following existing practice [25,12]in breaking the regression into two ‘layers’(or stages),we focus the first layer exclusively in pre-dicting quantized global hand rotation.By predicting coarse rotation at the first layer,we dramatically reduce the appear-ance variation that each second-layer predictor will have to deal with.Synthetic Training DataOur reinitializers are trained on a corpus of synthetic training data.Given the dexterity of hands and fingers,the potential pose space is large,and to achieve our goal of flexibility,the training data must ensure good coverage of this space while avoiding unlikely poses.We achieve this by sampling from a predefined prior distribution that is specifically designed to give a broad coverage of pose space.We believe it is substan-tially more general-purpose than existing work which tends to concentrate on close-range frontal poses.Global hand orientation samples are drawn from a uniform distribution,global translation samples are drawn from a uni-form distribution within the view frustum,and wrist pose is randomized within sensible flexion/abduction limits.For the finger and thumb rotations,we employ six manually defined prototype poses (or proto-poses ):Open,Flat,Closed,Point-ing,HalfOpen,Pinching.Each proto-pose is able to gener-ate realistic poses for a subset of pose space,with a ‘mean’shape designed to look like the proto-pose name,and a set of randomization rules to allow one to draw samples from the proto-pose (see supplementary material for example pseudo-code and samples).In future we hope these might instead be learned from a large corpus of hand motion capture data.As training data,we sample 100k poses θfrom the above dis-tribution,and for each pose,we generate a synthetic depth image using the same mesh model and renderer as is used for computing the golden energy.The reinitializer acts on a depth RoI rather than the full depth map,and so a tight bound-ing box is computed around the pixels belonging to hand (notforearm),and then expanded randomly by up toeach side to simulate an imprecise RoI at test time.Two-Layer Reinitialization ArchitectureGiven the above data,we train a two-layer[12,25]izer.Of all elements of the pose vector,global rotationably causes the largest changes in the hand’s appearance. thus train afirst layer to predict global rotation,quantized 128discrete bins.For each rotation bin,seconddictors are trained to infer other elements of thesuch asfinger rotations.Each second layer predictoronly on those images within the respective globalThis reduces the appearance variation each seconddictor needs to handle,and thus simplifies the learninglem[12,25].The number of bins was chosen toAt the second layer,we train three predictors for each of the 128rotation bins q:a global rotation refinement regressor, an offset translation regressor,and afinger proto-pose classi-fier.Each predictor is trained using the subset of data belong-ing to the relevant ground truth rotation bin q.We employ per-pixel decision jungles[22]for all layer-two predictors given their small memory requirements and the large num-ber of predictors.The g lobal r otation r efinement regressor is trained to minimize the variance over quaternion globalrotation predictions.It predicts a distribution P grrq (q|Z roi)over the global rotation quaternion q.The o ffset t ranslation r egressor[9]is trained to predict the3D offset from any RoI pixel’s3D camera space location to the wrist joint position (i.e.the global translation).It is trained similarly to minimize the variance of the3D offsets at the leaf nodes.By aggregat-ing the predictions from multiple pixels at test time,a distribu-tion P otrq (t|Z roi)over the absolute global translation t is ob-tained.Finally,the p roto-p ose c lassifier is a trained as a con-ventional per-pixel classifier over the6proto-pose classes.At test time,multiple randomly-chosen foreground pixels’pre-dictions are aggregated to predict a distribution P ppcq(f|Z roi). Making PredictionsAt test time,thefirst layer is evaluated to predict P grb(q|Z roi). We assume that the top5most likely global rotation bins con-centrate the majority of the probability mass,and so for speed evaluate the relevant second layer predictors only for these bins,in order to predict the refined global rotation,the global translation,and the proto-pose.At this point we have effec-tively inferred a hierarchical distribution over hand poses,and can now efficiently sample as many draws from this distribu-tion as possible:first sample a q,then conditioned on q samplea global rotation q from P grrq (q|Z roi),a global translation tinput depth groundtruth samples from reinitializerby our reinitializer.Forimage of the input depthin the ground truth pose,from the reinitializer.Ourglobal rotations to allowf from P ppcq(f|Z roi).Fi-,we concatenate the globalfinger pose sampledapplied to syn-thetic test images in Fig.4.Observe that at least some sam-ples from the reinitializer are usually close to the correct handpose,especially so for global rotation and -paring rows3and4one can see the effect of the second layerpose classifier:row3has more‘open hand’samples,whereasrow4has more‘closed hand’samples.Even the incorrectsamples often exhibit sensible confusions,for example beingflipped along a reasonable axis.MODEL FITTINGIn order to optimize the golden energy and achieve an accu-rate hand pose estimate,we employ a modelfitting algorithmthat combines features of particle swarm optimization(PSO)and genetic algorithms(GA)in the spirit of[11,17,20].Themodelfitter is the component that integrates information fromprevious frames along with proposals from the reinitializer.Itis also the main computational workhorse of our system,andwe describe in the supplementary material how our modelfit-ter can be implemented efficiently on the GPU to achieve real-time frame rates.The algorithm maintains a population of P‘particles’{φp}P p=1(each particle contains a pose vectorθand additional statesuch as velocity in pose space),and the scoring function isevaluated across the population in parallel on the GPU toyield scores{E Au p}P p=1.Each such evaluation comprises one‘generation’.The standard PSO algorithm(see e.g.[17])thenspecifies an update rule for populating the next generation.This rule incorporates‘momentum’in pose space and attrac-tion towards current local minima of the energy function.Wefound that the standard update rules did not work well on theirown,and describe below our extensions which we found crit-ical for success.。
基于Kinect的实时手势识别方法
4!引!言
在人机交互中#手势能生动表达大量交互信息!目前# 手势识别主要有依靠穿戴设备的方法$+#J%和利用 615摄像 头的方法$!)4%!依靠穿戴设备能准确提取手势且有较高辨识 度#如 87T/8I76 I 提 出 的 基 于 颜 色 手 套 的 识 别 方 法$+%#用户穿戴特 殊 红 色 手 套 进 行 手 势 识 别#但 这 对 手 部 活动造成不便#无法识别用户指尖!利用 615摄像头的识
J-+4年#月 第!"卷!第#期
计算机工程与设计
$`8./=H6H@1(@HH6(@1 7@IIH?(1@
Z_;'!J-+4 b%C0!"!@%0#
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
基于 0(61/,的实时手势识别方法
田!元王学王志锋陈!加姚!璜
6'AC)F:E'UA;BK'PF_&'&'G%K;:F:%;E'FU%B_P:;K^:;'GF
=(7@ S_A;#97@1R_')WA;#97@1TU:)W';K#$2H@Z:A#S7` 2_A;K
&?GU%%C%WHB_GAF:%;AC(;W%&EAF:%;='GU;%C%KD#$';F&AC$U:;A@%&EAC/;:Y'&P:FD#9_UA;*!--<"#$U:;A'
华中师范大学 教育信息技术学院湖北 武汉 *!--<"
摘!要针对手势识别的准确性受到复杂背景光照变化肤色干扰旋转平移缩放等影响的问题提出一种基于 ^:;'GF的实时手势识别方法使用 ^:;'GF体感设备获取人体的骨骼信息和深度信息分析人体手部特征根据人体手部 轮廓提取指尖位置特征结合骨骼关节点位置分析人体手部以及得到的手指特征检测并实时识别+-种计数手势实验结 果表明该方法能够在复杂背景肤色干扰以及光照发生变化的条件下获得正确的识别结果满足手势的旋转平移和缩 放不变性识别率高具有良好的实时性和鲁棒性 关键词手势识别手势特征体感设备骨骼信息深度图像 中图法分类号=.!"+0*+! 文献标识号7 !文)+<J+)-# &'(+-0+#J-4+a0:PP;+---)<-J*0J-+40-#0-!"
基于Kinect视觉功能对体感手势目标识别
基于Kinect视觉功能对体感手势目标识别李国城(广州新华学院,广东广州510520)随着计算机视觉技术的发展,基于视觉的人机交互智能化不断提高,利用人的手势动作、语音等,与计算机环境建立一种类似人与人的人机交互模式[1],相比通过传统的鼠标、键盘等设备交流,可以减少硬件对人机交互的束缚和局限性,提高信息交流的简便性、自然性。
近年来,众多科研研究机构对基于视觉的非接触式手势感知技术进行了深入研究,在体感娱乐游戏等领域有了新型的操控模式[2]。
目前新型的手势识别人机交互技术有基于计算机视觉和基于数据手套两种方式。
手套内嵌微处理器及加速度传感器、弯曲传感器设计基于数据手套的识别系统,通过提取手掌轮廓区域的特征、手的倾斜度、手掌的运动轨迹等进行分析识别,识别精度较高,但需要佩戴特殊手套,使用有一定的局限性[3]。
基于计算机视觉的方法是使用摄像头获取识别对象的RGB彩色图像和深度图像,对深度图像数据进行手势分割、定位、特征提取处理,达到对动态手势的正确识别[4]。
本文选用微软公司的Kinect 传感器作为摄像头获取动态手势的彩色图像和深度图像数据,再利用基于OpenCV视觉库的手势识别算法,并融入滤波算法,对手势进行提取和定义,达到最优解,降低了背景、光照等因素的干扰,有效提高了对动态手势识别的鲁棒性、实时性。
1Kinect简介Kinect是微软公司推出的一款XBOX体感输入设备,通过特定的摄像头与传感器实现了人机自然交互。
如图1所示,Kinect硬件主要由一组多阵列麦克风组成,能进行语音输入,增强人机交互效果。
中间彩色输入摄像头,能够以每秒30帧的速率捕获分辨率为640×480的彩色图像。
两端的深度传感器为红外投影机和红外摄像头,对环境有较高的鲁棒性,用来获取用户位置和手势动作信息,经过内部芯片处理之后得到320×240像素的深度图像。
同时微软公司与OpenNI推出的Kinect for Windows SDK开发工具包,使得Kinect通过插值处理,向上层软件实时提供相差无几像素的彩色数据流和深度数据流[5]。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Abstract
In order to solve the problems of low generalization of features extracted by the dynamic gesture action recognition algorithm, poor real-time recognition and low recognition rate, this paper proposes a template matching gesture recognition method based on KinectV2 sensor. This method uses Kinect Studio and Visual Gesture Builder to build a gesture action database, extracts the angle between key bones of the human body as a one-dimensional feature vector that changes with time, and proposes Weighted DTW gesture motion recognition algorithm based on the dynamic time warping (DTW) algorithm with the help of dynamic planning ideas. Through the improvement of the DTW algorithm, the regular path converges faster, the real-time and accuracy of gesture motion recognition is improved, and the real-time and validity verification experiments are carried out in the Unity 3D virtual simulation experiment platform for simple human-computer interaction.
Figure 2. Gesture sample library building flowchart 图 2. 手势样本库建立流程图
10
仪器与设备
刘如何提取低计算量,低复杂度的特征并且找到统一的动作特征来对人体动作进行识别,是手势识别 过程中的关键问题。以往的研究者们在进行骨骼动作的识别的时候,主要提取位置特征,角度特征或者 速度特征来描述手势的运动[9],然而对于不同身高,不同体型的人做同一动作或者同一个人做不同动作 的过程中,由于每个人的骨骼关节位置信息有一定的差异,运动速度也不尽相同[10],均不能统一而准确 的描述手势的运动,但是即使每个人的骨骼坐标位置和运动速度不同,在运动过程中,关节间的夹角特 征并没有发生明显变化,基于这种现状,本文中提出了将手势运动过程中的骨骼间的向量夹角作为手势 特征描述,并结合动态时间规整(DTW)算法进行模板匹配,夹角特征提取过程如下。
4.1. 人体骨骼节点坐标获取
如图 3 所示,以求取右手手腕,右肩和颈下脊椎之间的夹角为例,其中,WR 代表右手手腕关节点, SR 代表右肩关节点,SS 代表颈下脊椎关节点,要想求它们之间的空间向量的夹角,则需要求其关节点 的空间坐标位置。
Figure 3. Diagram of joint angle 图 3. 关节夹角示意图
通过研究发现,语音识别算法和手势识别算法基本形同,主要有矢量量化,轨迹追踪和模板匹配三 种方法[2],因为矢量化的识别算法识别效率比较低,现在已经基本不再研究,所以近年来的手势识别算 法主要是以隐马尔科夫模型(HMM)为代表的轨迹追踪算法和以动态时间规整模型(DTW)为代表的模板匹 配算法。郭晓利等人[3]提出了一种改进的 HMM 模型算法,该算法与模糊神经网络相结合,比原 HMM 模型要稳定,但是还没有应用到实际中,还有待在实际环境中进一步检测。Saha S 等人[4]提出了一种特 征描述子用来描叙单个时间序列的扭曲信息,结合 HMM 模型来进行手势识别,该方法对 12 种不同的手 势进行了测试,虽然准确度较高,但由于该方法模型较为复杂,所以实时性较差。李凯等人[5]提出了一 种改进的动态时间规整算法,该方法将获取到的 Kinect 的骨骼点坐标及手型数据结合,构造了矢量特征 来描述手的运动轨迹,实现了手势的快速匹配,但未考虑到每个人因骨骼坐标位置不同而对识别结果带 来的影响;Ruan X 等人[6]提出了一种通过失真度阈值和路径约束来改进 DTW 的算法,该方法通过选取 8 个点作为手部特征建立手势的数学模型,实验结果表明,该方法无论是在识别率还是鲁棒性方便,都 取得了很好的实验效果。综上所述,HMM 和 DTW 算法在手势明显的情况下,分类效果差别不大,通常 情况下,HMM 算法的识别率是要稍高于 DTW 算法,但是 HMM 方法模型复杂,计算量较大,因此实时 性不如 DTW 算法,很难在实际生活中应用。
收稿日期:2020年2月15日;录用日期:2020年3月3日;发布日期:2020年3月10日
摘要
为解决动态手势动作识别算法提取的特征泛化程度低,识别实时性差和识别率不高的问题,本文提出了
文章引用: 刘希东, 孟岩, 李国友, 唐磊, 郭佳栋. KinectV2 传感器实时动态手势识别算法[J]. 仪器与设备, 2020, 8(1): 8-20. DOI: 10.12677/iae.2020.81002
关键词
KinectV2传感器,手势识别,骨骼角度特征,人机交互
Copyright © 2020 by author(s) and Hans Publishers Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). /licenses/by/4.0/
Figure 1. Contrast map of human skeleton node 图 1. 人体骨架节点对照图
3. 建立手势样本库
本文在建立手势样本库过程中主要分为录取手势,制作手势正负样本和通过程序代码调用手势样本 库三个步骤,具体步骤的工具和如图 2 所示:
DOI: 10.12677/iae.2020.81002
刘希东 等
一种基于KinectV2传感器 的模板匹配手势识别方法。该方法利用 Kinect Studio和Visual Gesture Builder建立手势动作数据库,提取人体关键骨骼间的夹角作为随时间变化的一维特征向量,借助动态规 划思想,在动态时间规整(DTW)算法的基础上提出了加权DTW手势动作识别算法。通过对DTW算法的 改进,使其规整路径更快的收敛,提高了手势动作识别的实时性和准确性,并在Unity 3D虚拟仿真实验 平台中进行了实时性和有效性的验证实验,同时实现了简单的人机交互。
KinectV2 Sensor Real-Time Dynamic Gesture Recognition Algorithm
Xidong Liu1, Yan Meng1, Guoyou Li2, Lei Tang2, Jiadong Guo2
1College of Biological and Environmental Engineering, Tianjin Vocational Institute University, Tianjing 2Shanghai Institute of Aerospace Technology, Shanghai
虽然 KinectV2 能够得到人体各骨骼节点图像,但是却不能够直接获取其各个节点的空间坐标位置, 所以我们需要通过编程的方式来采集需要的关节点坐标,如果将 KinectV2 捕捉的 25 个关节点全部用来 做人体运动的特征分析,则会增加本系统的计算量和复杂度,从而影响系统的实时性,考虑到文中所识 别的手势只用到了人体上半身的 8 个关键骨骼点,所以只对上半身的 8 个骨骼节点坐标进行提取,由于 篇幅原因,表 1 中只列出了其中一组关键骨骼节点的空间位置坐标。
本文针对提取骨骼坐标位置特征时,因个人骨骼坐标位置不同造成的识别效率低,实时性差的缺陷,
DOI: 10.12677/iae.2020.81002
9
仪器与设备
刘希东 等
提出了一种基于 KinectV2 传感器骨骼数据的改进 DTW 模板匹配手势识别算法。该算法通过提取人体关 键骨骼间的夹角特征,改善了因每个人骨骼位置不同造成的对手势动作特征描述不准确的缺点;通过对 DTW 算法进行改进,提高了计算速率,使动态规整路径更快的收敛。本文的算法不受骨骼坐标位置差异 影响,能够使路径更快的收敛,既满手势识别的准确性,又满足实时性。
Table 1. Coordinate positions of key skeletal nodes 表 1. 各关键骨骼节点坐标位置
2. Kinect 骨骼跟踪技术
Kinect SDK 2.0 中提供了人体的各个骨骼关节的空间坐标位置分布信息及相对位置的变化,骨骼节点 的数据类型以骨骼帧的形式表示,每一帧都是由 25 个骨骼节点组成的 JointType 类型的集合[7],人体各 部位节点位置标注图和编号如图 1 所示。根据 KinectV2 追踪骨骼数据的状态在 Kinect SDK 2.0 中定义了 三种骨骼节点跟踪情况:能够进行跟踪(Tracked)、不能够进行跟踪(Not Tracked)、通过运动状态进行推断 跟踪(Infered) [8]。
Instrumentation and Equipments 仪器与设备, 2020, 8(1), 8-20 Published Online March 2020 in Hans. /journal/iae https:///10.12677/iae.2020.81002