基于辅助信息的无人机图像批处理三维重建方法
第39卷第6期 自动化学报 Vol.39, No.6
2013年6月 ACTA AUTOMATICA SINICA June, 2013

基于辅助信息的无人机图像批处理三维重建方法
郭复胜1 高伟1

摘要 随着我国低空空域对民用的开放,无人机(Unmanned aerial vehicles, UAVs)的应用将是一个巨大的潜在市场.目前,如何对轻便的无人机获取的图像进行全自动处理,是一项急需解决的瓶颈技术.本文将探索如何将近年来在视频、图像领域获得巨大成功的三维重建技术应用到无人机图像处理领域,对无人机图像进行全自动的大场景三维重建.本文首先给出了经典增量式三维重建方法Bundler在无人机图像处理中存在的问题,然后通过分析无人机图像的辅助信息的特点,提出了一种基于批处理重建(Batch reconstruction)框架下的鲁棒无人机图像三维重建方法.多组无人机图像三维重建实验表明:本文提出的方法在算法鲁棒性、三维重建效率与精度等方面都具有很好的结果.

关键词 三维重建, 无人机, 批处理重建, 辅助信息
引用格式 郭复胜, 高伟. 基于辅助信息的无人机图像批处理三维重建方法. 自动化学报, 2013, 39(6): 834−845
DOI 10.3724/SP.J.1004.2013.00834

Batch Reconstruction from UAV Images with Prior Information
GUO Fu-Sheng1 GAO Wei1

Abstract With the latest deregulation and opening-up policy of the Chinese government on low altitude airspace to private sectors, the applications of unmanned aerial vehicles (UAVs) will be a huge potential market. Currently the automatic processing technology of UAV images is far behind the market demand, and has become the bottleneck of various applications. This work is meant to apply hugely successful scene reconstruction techniques in the computer vision field to large scene reconstruction from UAV images. To this end, at first, specific problems of direct application of the Bundler, a popular incremental reconstruction technique in computer vision, are investigated. Then a batch reconstruction method from UAV images is proposed by fully taking into account various pieces of prior information which are usually available in UAV images, such as those from GPS, IMU, DSM, etc. Our method is tested with several sets of UAV images, and the experiments show that our method performs satisfactorily in terms of robustness, accuracy and scalability for UAV images.

Key words 3D reconstruction, unmanned aerial vehicles (UAVs), batch reconstruction, prior information
Citation Guo Fu-Sheng, Gao Wei. Batch reconstruction from UAV images with prior information. Acta Automatica Sinica, 2013, 39(6): 834−845

随着空间信息科学技术的迅速发展,遥感影像数据被越来越多地应用于社会各个领域.无人机(Unmanned aerial vehicles, UAVs)遥感以其灵活性强、操作方便、投入低、适用范围广等优势,填补了卫星、航空遥感在一些特定应用范围快速获取高分辨率影像需求上的空白.近年来,随着我国低空域对民用的开放,无人机遥感技术已逐步从研究开发阶段发展到实际应用阶段,无人机的应用将是一个巨大的潜在市场.目前,如何对轻便无人机获取的大数据量图像进行快速鲁棒的全自动三维处理,是一项亟需解决的瓶颈技术[1−2].

（收稿日期 2012-04-20 录用日期 2012-07-25
Manuscript received April 20, 2012; accepted July 25, 2012
国家重点基础研究发展计划(973计划)(2012CB316302), 中国科学院战略性先导科技专项计划(XDA06030300), 国家自然科学基金(61203278)资助
Supported by National Basic Research Program of China (973 Program) (2012CB316302), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030300), and National Natural Science Foundation of China (61203278)
本文责任编委 贾云得
Recommended by Associate Editor JIA Yun-De
1. 中国科学院自动化研究所模式识别国家重点实验室 北京100190
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190）

在过去的二十多年里,随着特征点检测和匹配算法[3−4]、鲁棒性估计算法[5−6]、自标定算法[7−9]、运动恢复结构重建算法(Structure from motion, SfM)[10−11]以及多视图立体匹配算法(Multi-view stereopsis, MVS)[12−13]等技术的不断进步和完善,基于图像的三维重建技术有了突飞猛进的发展.这也使得基于图像的自动三维重建技术受到了越来越多的关注,并在诸多领域得到了广泛的应用.

运动恢复结构重建算法是指由图像匹配恢复摄像机投影矩阵和场景三维结构的方法,是三维重建过程中最核心的关键技术.当前运动恢复结构重建的研究,最具有代表性和影响力的工作当属Snavely等发布的Photo tourism system[10],其中发布的核心系统Bundler完成了由不同视点的二维图像构建相应的三维点坐标和恢复拍摄图像的相机内参数、位置以及方向信息的整个流程.由于Bundler开放了源代码,同时系统具有全自动和鲁棒性等特点,以及该系统在Internet网络社区图像重建的成功应用,该工作受到了广泛关注.

目前较为成熟的运动恢复结构算法很多,总结起来大体可以分为三类:增量式重建方法(Incremental reconstruction)、分步重建方法(Hierarchical reconstruction)[14−15]和批处理重建方法(Batch reconstruction)[16−18].
增量式重建方法从一个小的“种子”重建出发,通过不断地添加新的相机和三维点并进行不断地优化,实现运动恢复结构重建,Bundler即为一种增量式重建系统.分步重建方法是将一个大场景重建问题分解为许多子场景的重建问题,然后通过子场景重建的融合实现大场景三维重建.与前两种方法依赖的迭代优化架构不同,批处理重建方法只需要一次性优化即可完成场景的三维重建.因此,批处理重建方法更加适合于大数据量无人机图像的三维重建处理.

批处理三维重建方法首先由Tomasi等[18]针对仿射相机模型进行了研究,后来由Sturm等[17]推广到透视相机.但是由于数据缺失和外点的存在,对于一般的三维重建问题不存在闭合解.Kahl等[19]通过将该问题转化为L∞范数下的最小化问题,利用二阶锥规划(Second order cone programming, SOCP)方法推导出一种闭合近似解,在已知相机旋转的情况下,可以同时求解相机的平移量和三维点位置.但是该方法对错误的特征点匹配非常敏感,甚至一个错误的外点就会导致失败.Martinec等[16]提出了一种只依赖于4个内点(点对应)的二阶锥规划方法.Sim等[20]提出了一种更为鲁棒的求解方法,该方法不依赖于单独的点对应,而是利用了两个相机之间的相对平移方向进行求解.但是如果两幅图像的相对位姿估计不准,会导致算法失败.

近年,已经有研究人员[21−24]尝试将计算机视觉的方法引入无人机图像重建领域.文献[21]首先从图像序列提取关键帧,然后两两图像通过内点匹配恢复摄像机间的旋转和平移,最后通过三角化完成三维点的重建.文献[22]采用了仿射投影模型作为透视投影模型的近似模型求解标定的初始值,然后再基于透视投影模型进行优化,该方法的优点是考虑了无人机图像属于远景图像,通过仿射近似减少参数个数,但文章只是进行了模拟实验,并且实际精度并未有显著提高.文献[23]所采用的方法是增量式重建的方法.Irschara等[24]的方法与本文最相似,也利用了无人机的辅助信息,并采用了批处理的重建架构求解姿态信息,但是该文在特征匹配的鲁棒处理上的工作不多,并且该文对摄像机位置直接采用了GPS的坐标,过于简单且精度无法保障.

本文主要工作是针对无人机图像数据的特点,利用其辅助信息,提出了一种鲁棒的批处理三维重建方法,从而实现了大数据量无人机图像的快速三维重建.本文的结构如下:第1节给出了增量式重建方法的简单介绍,以及在无人机图像重建应用上存在的问题;第2节给出本文提出的基于辅助信息的无人机图像批处理三维重建方法;第3节是实验与结果分析;第4节对存在的问题及需要进一步开展的工作进行了讨论.

1 Bundler及其在无人机图像重建中存在的问题

这里首先简单介绍在计算机视觉中广泛应用的增量式重建算法Bundler的主要流程,随后介绍该算法直接应用于无人机图像三维重建过程中存在的问题.

1.1 增量式重建Bundler简介

目前,Snavely等[10]发布的Bundler以及以Bundler为核心算法的微软商业网络软件Photosynth是现行SfM中比较优秀的系统集成,该系统集成了当时最优秀的特征点检测、匹配和捆绑调整优化的算法,最终可以得到很好的重建效果.这里简单介绍一下Bundler算法的主要流程(如图1所示).

Bundler算法首先选择用于重建的初始图像对,通过迭代地增加单幅(或少量几幅)图像,对新的公共匹配点进行三角化,并对新加入的图像进行标定.该算法运行稳定的重要原因在于精心地选择了初始重建的图像对,并设计了较好的图像加入策略,每一步加入图像完成之后,会进行捆绑调整以优化相机的位置姿态信息.

1.2 增量式重建在无人机图像重建应用中存在的问题

无人机图像获取与网络下载图像的获取方式不同,获取的图像具有自身的数据特点,因此不宜直接简单地套用网络下载图像的重建方法对无人机图像进行重建.文献[23]直接采用了Bundler对无人机图像进行重建,发现存在以下一些问题:

1)对于大数据量的无人机图像,增量式重建算法效率低.原始增量式重建算法中,特征点匹配和迭代式的捆绑调整,是整个算法最费时的环节[25].由于必须进行两两图像间的特征点匹配,图像特征点匹配的时间复杂度为O(n²),其中n为处理的图像个数.而文献[15]分析中指出,在捆绑调整过程中,捆绑调整过程的时间复杂度达到了O(n⁴).解决效率低下问题的一种有效方法就是通过并行加速,最近的一些研究在增量式重建基础上进行了GPU加速,比较有代表性的工作是Sinha等[26]和Wu等[27]对SIFT特征点检测和匹配的GPU并行算法,和Wu等[28]对增量式重建算法中稀疏捆绑调整进行GPU和多核CPU并行的优化.虽然该工作使得处理速度有所提高,但依然无法改善增量式重建算法的本质问题.

图1 增量式重建算法Bundler的算法流程 (Fig. 1 The flow chart of Bundler)

2)对初始图像对选择的依赖,引起的算法稳定性问题.由图1可知,增量式重建方法首先需要选取一对较好的图像对作为参考的图像对进行重建,然后依次加入剩余的图像进行摄像机矩阵估计和三维结构恢复,直到所有图像遍历结束.该算法依赖于初始图像对的选取,同时新的相机的加入是一个迭代的过程,这就使得最终的重建结果依赖于初始图像对的选取和相机的增加次序.另外,在对地形场景等数据的重建中,某些纯平面或接近纯平面的场景也会引起重建结果不稳定,甚至出现严重的错误.

3)没有利用辅助信息.对无人机图像数据,在图像获取过程中,一般事先已知了相机的畸变信息和内参数信息,同时会有一些精度不高的其他辅助先验信息,如航线设计、无人机导航数据以及稳定平台的姿态等信息,充分利用这些辅助信息有望提高重建的精度和效率.文献[23]在灾场图像场景的重建应用中通过实验发现,直接用Bundler而没有用到这些辅助信息进行重建时,有些数据的实验结果很不理想.

2 基于辅助信息的鲁棒无人机图像批处理三维重建方法

由于无人机数据不同于无序下载的网络图像数据,也不同于连续帧间的视频数据,同时由无人机的飞行方式、平台搭载的辅助传感器等特点,决定针对无人机图像的快速三维重建,有必要设计具有针对性的算法流程.本文基于辅助信息的鲁棒批处理三维重建方法是基于批处理三维重建架构.在介绍本文重建算法之前,首先介绍无人机图像数据的特点.

2.1 无人机图像数据特点分析

无人机搭载的相机一般为固定镜头的非量测型的普通数码相机,在获取图像同时,会同步记录该图像对应的经纬度坐标(导航GPS提供),以及飞行速度、高度和方向角信息,部分无人机为了保证飞行的稳定,设计了稳定平台,同时提供了横滚角和俯仰角信息.由于无人机载荷以及成本限制,装载的导航GPS只有十米左右的精度,同时辅助数据记录的角信息精度也较低.无人机飞行之前,会设计航线规划的轨迹,而实际的航飞轨迹并不规则,部分飞行任务会偏离原设计的航线,同时飞行过程不能保证姿态稳定,倾斜较大.对拍摄的地形,可以通过Google Earth数据或者公开的DSM数据[29−30],获得该区域对应的精度在30米左右的粗略的地形高程数据.为了保证飞行过程不产生漏拍,无人机影像重叠率较高,做重建的图像获取航向重叠一般超过70%,旁向重叠超过30%,一次飞行任务获取的影像张数较多,一般以百计,部分大场景的应用会拍摄获取上千张影像.

对无人机图像数据特点的总结,可以发现无人机图像数据具有以下几个特点:1)大部分应用中,获取无人机图像的相机为定焦镜头,其焦距值固定,同时可以通过严格的标定,消除畸变,获得相机的内参数信息;2)有大约10米左右精度的位置辅助信息;3)有精度不高的姿态辅助信息,一般在10度以内;4)有粗略的地形高程数据;5)保证了高重叠率(航向重叠和旁向重叠);6)地面图像数据纹理丰富,适合自动匹配重建.本文基于辅助信息的鲁棒批处理三维重建方法正是利用了无人机数据的上述特点,实现了具有针对性的鲁棒批处理重建系统.

2.2 批处理三维重建架构

批处理三维重建方法由于不依赖于迭代优化重建架构,只需要一次性优化即可完成整个三维重建过程,因此特别适合于大数据量的三维场景重建.算法核心是在给定两两视图i和j间的相对旋转矩阵R_ij和平移T_ij下,如何获取全局一致性的旋转矩阵R_i和平移量T_i,并保持T_i尺度的一致性.
主要包括两部分内容:1)由相对旋转估计绝对旋转(在全局坐标系下);2)在给定绝对旋转情况下,由相对位移估计绝对位移(在全局坐标系下).具体步骤见图2.

图2 批处理重建框架流程图 (Fig. 2 The flow chart of the batch reconstruction method)

这里相机的内参数已知,由两两图像匹配可以获得相对旋转矩阵和平移量,它们与最后重建结果绝对空间的结构相差一个相似变换.

2.2.1 绝对旋转估计

两幅图像i, j之间的相对旋转用R_ij表示,图像i的绝对旋转用R_i表示.给定相对旋转R_ij的情况下,我们需要估计绝对旋转R_i, i = 1, 2, ···, m,使其满足如下两个条件:

$$R_j = R_{ij}R_i,\ \forall i,j;\qquad R_iR_i^{\mathrm T} = I,\ |R_i| = 1,\ i = 1,2,\cdots,m \tag{1}$$

当至少有m个相对旋转已知时,需要求解在满足上述第二个条件(正交性约束)情况下的最小二乘解,本文采用文献[16]给出的基于SVD分解的估计方法.

2.2.2 绝对位移估计

图像i和图像j之间的相对位移用T_ij表示,图像i的绝对位移用T_i表示.C_ij表示相机光心分别位于C_i, C_j(在全局坐标系下)时的单位相对位移,

$$C_{ij} = \frac{C_j - C_i}{\|C_j - C_i\|}$$

由于C_ij = −R_ij^T T_ij, C_i = −R_i^T T_i,因此可以将给定相对位移T_ij、估计绝对位移T_i的问题转化为给定单位相对位移C_ij、估计相机光心位置C_i的问题.由于位移估计在相差一个尺度情况下是相等的,所以不存在C_ik = C_ij + C_jk的关系.因此,无法采用类似估计绝对旋转的最小二乘解法.同时,由于噪声的影响,并不存在精确解.在L∞架构下,将该问题转化为给定单位相对位移C_ij,求解x = (C_1, C_2, ···, C_m)^T的最小最大化问题,即:

$$\min_x \max_{i,j}\ \tan\theta_{ji} \tag{2}$$

这里,θ_ji表示向量C_ji和$\hat C_{ji} = C_i - C_j$之间的夹角.文献[20]在给定绝对旋转的情况下,将该问题转化为二阶锥规划进行求解.

为了改进绝对位移的估计精度,还可以采用三视图之间的三焦张量(Trifocal tensor)信息估计相对位移对(C_ji, C_ik).令单位方向向量

$$C_{jik} = \frac{(C_{ji}^{\mathrm T}, C_{ik}^{\mathrm T})^{\mathrm T}}{\|(C_{ji}^{\mathrm T}, C_{ik}^{\mathrm T})^{\mathrm T}\|}$$

给定C_jik,求解x = (C_1, C_2, ···, C_m)^T,即:

$$\min_x \max_{j,i,k}\ \tan\theta_{jik} \tag{3}$$

这里,θ_jik表示向量C_jik和$\hat C_{jik} = ((C_i - C_j)^{\mathrm T}, (C_k - C_i)^{\mathrm T})^{\mathrm T}$之间的夹角.这里采用文献[20]的二阶锥规划可以进行求解.

批处理三维重建方法虽然具有处理速度快的优点,但是当图像匹配错误(两幅图像之间本身不匹配而进行了匹配)或者图像相对位姿错误(由两幅图像之间的特征点错误匹配导致的错误位姿估计)时,都会使得批处理三维重建失败,因此需要鲁棒的批处理三维重建方法.

2.3 基于辅助信息的鲁棒的批处理三维重建方法

由于批处理三维重建方法受外点影响较大,本文提出了一种基于辅助信息的鲁棒的批处理三维重建方法,该方法的基本步骤如下:1)利用辅助信息对每两幅图像进行特征点提取与匹配,并在RANSAC架构下利用五点算法估计每两幅图像之间的相对位姿;2)利用三视图匹配,剔除错误图像匹配或图像相对位姿错误;3)估计绝对位姿,进行三维点云重建并利用捆绑调整方法进行一次性优化.具体流程图如图3所示,下面将详细介绍算法的各步骤.

图3 基于辅助信息的鲁棒的批处理三维重建方法 (Fig. 3 The flow chart of our robust batch reconstruction of UAV images with prior information)

2.3.1 利用辅助信息进行两两图像间的特征点匹配和位姿计算

首先,我们充分利用无人机平台的辅助信息,包括低精度的位置、姿态信息以及已知的粗略地形高程数据,可以获得粗略的图像匹配集合

$$S = \{\langle i,j\rangle \mid i,j = 1,\cdots,m\} \tag{4}$$

具体做法是:在已知每幅图像拍摄时刻的GPS信息和IMU信息情况下,可以获取到每幅图像近似的投影矩阵信息.又在确定飞行区域情况下,通过一些公开的网络地理数据[29−30],可以获取该地区的近似高程信息.利用无人机平台的这些辅助信息,将无人机图像的四个图像角点投影到与地平面平行的平面,且保证该平面是所获取的地形高程最高值所在的平面.如图4所示,图像i, j投影到地形最高值所在的地平面上,判断在所投影的平面上两幅图像的投影四边形区域是否有重叠,如果存在一定的重叠区域,就认为对应的两幅图像(记为i, j图像)具有匹配关系,并将⟨i, j⟩加入集合S.

图4 利用辅助信息计算图像间的重叠与匹配子集 (Fig. 4 The image overlap estimation by projecting the image on the highest parallel plane)

虽然这里所用的辅助信息并不十分精确,计算得到的图像匹配关系仅是一个粗略值.然而由于在计算的过程中放宽了重叠度的要求,因此真实的图像匹配集合S̄是该图像匹配集合S的一个子集.

然后在每幅图像上分别检测SIFT特征点,并在图像匹配集合S中进行特征点匹配.利用图像的待匹配集合取代原来的穷举匹配(每幅图像与其他所有图像进行匹配),在匹配过程中限定了图像的匹配范围,总的图像匹配的计算复杂度由O(n²)减少到O(n),提高了匹配效率.又因为匹配过程只是选取有可能重叠的图像进行匹配,可以排除非关联图像的干扰,从理论上减少了由于不存在的图像匹配产生的误匹配,提高匹配的准确率,从而提高重建系统的鲁棒性.

如果两幅图像i, j的匹配点个数少于a1,则认为这两幅图像不匹配,将⟨i, j⟩从集合S中删除.否则,在内参数已知情况下,利用RANSAC架构的五点算法计算本质矩阵,同时分解得到相对位姿(R_ij, T_ij).当计算相对位姿的内点个数少于a2时,则认为该相对位姿不准确,将⟨i, j⟩从集合S中删除.

2.3.2 利用三视图匹配,剔除错误图像匹配或图像相对位姿错误

有了图像匹配集合S,就可以构造一个无向图G = (V, E),V表示节点的集合,E表示边的集合.无向图G中的每一个节点v_i ∈ V表示一幅图像,i = 1, 2, ···, m.如果集合S中存在元素⟨i, j⟩,则认为节点i, j之间存在一条边e_ij ∈ E.如果节点i, j, k之间同时存在边e_ij, e_jk, e_ik,那么认为i, j, k为三视匹配关系,记为⟨i, j, k⟩.无向图G中的所有三视匹配关系构成的集合记为S̃.

对于每一个三视匹配关系⟨i, j, k⟩ ∈ S̃,可以利用i, j, k之间的相对位姿关系的冗余信息进行错误图像匹配或者错误相对位姿的剔除.如果图像i, j, k的公共匹配点个数小于a3,则将⟨i, j, k⟩从集合S̃中删除;否则,利用相对旋转(R_ij, R_jk, R_ik),根据第2.2.1节计算三视图中的一致性旋转(R_i, R_j, R_k).然后在给定绝对旋转的前提下,由相对位移(T_ij, T_jk, T_ik),根据第2.2.2节计算三视图中的一致性位移(T_i, T_j, T_k).对三视图像的公共匹配点进行三维重建,如果三维点的重投影误差大于β1,则剔除该公共匹配点.进一步,如果公共匹配点个数小于a3,则将⟨i, j, k⟩从集合S̃中删除.

2.3.3 估计绝对位姿

根据三视匹配集合S̃,构造新的无向图G̃ = (Ṽ, Ẽ).节点集合Ṽ依然由所有图像构成.如果集合S̃中存在元素⟨i, j, k⟩,则认为节点i, j, k之间分别存在边e_ij, e_jk, e_ik.由于不能保证所拍摄的图像都能通过图像匹配建立一个完整的连通关系,又第2.3.2节中通过匹配的阈值设置剔除了一些错误图像匹配,因此无向图G̃有可能是一个非连通图.本文采用深度优先(Depth first search)方法搜索连通分量(Connected component),并将具有最多节点个数的连通分量记为G_sub = (V_sub, E_sub).对于集合S̃中的元素⟨i, j, k⟩,
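下面用一段Python/NumPy草图说明绝对旋转估计这一步的思想(按文献[16]的线性化思路:把约束R_j = R_ij R_i堆叠成齐次线性方程组,取近似零空间后再把各3×3块经SVD投影回旋转矩阵).这只是在上述假设下的示意实现,函数与数据结构均为本文假设,并非原文代码;式(2)、(3)的绝对位移部分需调用SOCP求解器,此处从略:

import numpy as np

def project_to_SO3(M):
    # Project a 3x3 matrix onto the nearest rotation matrix via SVD.
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:                    # enforce det(R) = +1
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def average_rotations(pairwise, m):
    # pairwise: {(i, j): R_ij}, with the convention R_j = R_ij @ R_i of Eq. (1).
    A = np.zeros((3 * len(pairwise), 3 * m))
    for r, ((i, j), R_ij) in enumerate(pairwise.items()):
        A[3*r:3*r+3, 3*j:3*j+3] = np.eye(3)     # +R_j block
        A[3*r:3*r+3, 3*i:3*i+3] = -R_ij         # -R_ij R_i block
    # The 3-dimensional (approximate) null space of A stacks all R_i
    # up to a common gauge; take the 3 smallest right singular vectors.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-3:, :].T                            # shape (3m, 3)
    Rs = [project_to_SO3(X[3*i:3*i+3, :]) for i in range(m)]
    g = Rs[0].T                                 # fix the gauge: R_0 = I
    return [R @ g for R in Rs]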
如果i, j, k中任意一个不属于V_sub,则将⟨i, j, k⟩从集合S̃中删除;否则,根据第2.2.2节,由三视图一致性的(R_i, R_j, R_k)和(T_i, T_j, T_k)估计对应E_sub中的(R_ij, R_jk, R_ik)和$C_{jik} = (C_{ji}^{\mathrm T}, C_{ik}^{\mathrm T})^{\mathrm T}/\|(C_{ji}^{\mathrm T}, C_{ik}^{\mathrm T})^{\mathrm T}\|$.

下面我们通过相对位姿关系估计包含在V_sub中图像的绝对位姿.由已知的相对旋转R_ij, i, j ∈ V_sub,根据第2.2.1节计算绝对旋转R_i, i ∈ V_sub.在给定绝对旋转的情况下,根据第2.2.2节,这里采用三视匹配关系C_jik,可以得到各摄像机光心x = (C_1, C_2, ···, C_m)^T的估计,并计算得到绝对位移T_i, i ∈ V_sub.根据绝对位姿和特征点匹配进行三维重建,如果三维点的重投影误差大于β2,则剔除该匹配点.最后将三维重建点云与绝对位姿利用捆绑调整方法进行一次性优化.

2.4 摄像机的绝对定向

得到解算后的位置和姿态信息后,该信息和摄像机自身的真实位置相差一个相似变换,即由摄像机坐标系到世界坐标系的相似变换.如果有地面控制点,可以采用地面控制点完成,最少需要三个点.本文利用记录下的GPS位置信息进行相似变换,由于已知的GPS坐标个数远多于三个,这里需要进行最小二乘求解.考虑到部分求解的摄像机位置可能存在错误,本文采用了RANSAC框架的鲁棒估计方法.

3 实验结果及结果分析

本文对多种无人机系统数据进行了实验,提供本文实验数据的低空无人机系统包括灾害测量系统、资源调查系统以及部分环境调查系统,采用了各种不同小型固定翼无人机的飞行平台,无人机飞行高度在200米∼2000米之间高度不等,无人机平台上挂载了定位精度在5米∼15米左右的动态单点定位GPS和精度在10度以内的陀螺仪.飞行之前,由航线规划软件可以对拍摄的航向重叠和旁向重叠进行设置,提供的数据预设的航向重叠在80%以上,旁向重叠在40%以上.采集的数据有单航带数据,也有多航带数据.为了保证获取图像不模糊,相机的曝光时间在1/500秒以下.我们对十多组数据进行了实验,以验证算法的稳定性和有效性.这里给出其中两组具有代表性的实验数据的结果以及结果分析.

3.1 实验结果

本文实现了第2节介绍的批处理方法,同时对Bundler也进行了GPU并行化,这里将比较增量式重建方法和批处理方法之间的标定结果的差异.给出的实验数据中,其中一组数据包含了专业摄影测量软件Inpho的标定结果[31],以此结果作为参考依据,将批处理方法的实验结果与此进行了对比.下面分别从效率、稳定性、精度等方面对这两组数据进行分析.实验中,各参数固定,其中a1取值为80,a2取值为50,a3取值为30,β1取值为10,β2取值为8.本文实验环境为64位的Win 7系统,内存24 GB,处理器为Intel Xeon(R) X5550@2.67 GHz(4处理器),同时利用的GPU是显存为4 GB的Tesla C1060.
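第2.4节的绝对定向可以用如下Python/NumPy草图说明:在RANSAC框架下以最小三点样本估计摄像机坐标系到世界坐标系(GPS)的相似变换,再用全部内点重新拟合.这里用Umeyama闭式解求相似变换,属于本文为说明而作的假设,原文未指明具体解法;阈值取值也仅为示例:

import numpy as np

def umeyama(X, Y):
    # Closed-form similarity s, R, t with Y ≈ s * R @ X + t; X, Y: (3, n).
    n = X.shape[1]
    mx, my = X.mean(1, keepdims=True), Y.mean(1, keepdims=True)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Yc @ Xc.T / n)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:
        D[2, 2] = -1.0                        # keep R a proper rotation
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / ((Xc ** 2).sum() / n)
    t = my - s * R @ mx
    return s, R, t

def ransac_similarity(X, Y, iters=1000, thresh=10.0):
    # X: reconstructed camera centres, Y: GPS positions, both (3, n);
    # threshold in metres, of the order of the 5-15 m GPS accuracy above.
    n, best_inl = X.shape[1], None
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(n, 3, replace=False)
        s, R, t = umeyama(X[:, idx], Y[:, idx])
        err = np.linalg.norm(s * R @ X + t - Y, axis=0)
        inl = err < thresh
        if best_inl is None or inl.sum() > best_inl.sum():
            best_inl = inl
    return umeyama(X[:, best_inl], Y[:, best_inl])   # refit on all inliers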
张家俊, 李茂西, 周玉, 陈钰枫, 宗成庆 (中国)
2 参评系统描述
在这次机器翻译评测中我们使用了6个翻译系统,即: (1) 基于最大熵括弧转录文法(MEBTG)的统计机器翻译系统、(2) 句法增强的基于最大熵括弧转录文法(SynMEBTG)的统计机器翻译系统、(3) 开源基于短语的翻译系统(Moses)、(4) 开源基于层次短语的翻译系统(Joshua)、(5) 词语级系统融合系统(WordComb)以及 (6) 句子级系统融合系统(SenComb)。
模型打分中包含语言模型项 λ₇ · P_LM(y) 与调序项 λ₈ · Ω。其中 Ω 是调序分值,λ₈ 为相应特征的权重。与[Xiong et al., 2006]相似,调序的分值由基于词汇化(边界词)特征的最大熵模型训练得到。
2.2 SynMEBTG

SynMEBTG 是 MEBTG 系统的句法增强版本。由于 MEBTG 的核心思想就是将顺序合并和逆序合并看成一个最大熵的二元分类问题,因此,分类所采用的特征将成为决定系统性能的关键因素。MEBTG 系统只采用了词汇化的特征,分类的正确率不是很高。SynMEBTG 系统就是设法在不降低实际解码速度的情形下,将源语言的句法信息高效地融入调序模型。

SynMEBTG 的基本思想就是:如果被合并的两个短语都是句法短语,我们就采用句法调序信息,否则我们采用 MEBTG 的词汇化调序信息。不同于在解码过程中计算句法调序信息,我们将句法调序信息的计算作为翻译前的预处理模块。类似于[Li et al., 2007],我们从一棵句法树上获得句法调序信息。[Li et al., 2007]处理含有两个或三个孩子节点的子树,然后决定孩子节点间是否需要调序,最终得到调序后的源语言句子。我们的方法如下:如果一个节点有两个孩子节点,我们即可构造一个规则决定它们是否需要交换顺序;如果一个节点含有三个以上的孩子节点,我们首先判断孩子节点中是否有中心节点(VP 或者 NP),有的话,我们便设计一个规则决定位于中心节点前的修饰节点是否需要调至中心节点后。综合而得我们设计的规则如下(规则的应用方式见下面的示意代码):
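原文在此列出的具体规则未能完整保留。下面给出一个假想的Python草图,仅示意上述两类规则的应用方式(两个孩子节点时决定是否交换顺序;三个以上孩子节点且存在中心节点VP/NP时,决定是否把中心节点前的修饰节点移到其后)。其中 swap_rule、shift_rule 为占位的规则函数,树结构接口亦为本文假设:

def reorder(node, swap_rule, shift_rule):
    # Pre-translation source-side reordering over a parse tree;
    # returns the reordered leaf sequence.
    children = getattr(node, "children", None)
    if not children:
        return [node]
    ch = list(children)
    if len(ch) == 2 and swap_rule(node):
        ch.reverse()                        # two children: swap or keep
    elif len(ch) >= 3:
        heads = [c for c in ch if getattr(c, "label", "") in ("VP", "NP")]
        if heads:                           # move pre-head modifiers back
            h = ch.index(heads[0])
            moved = [c for c in ch[:h] if shift_rule(node, c)]
            kept = [c for c in ch[:h] if c not in moved]
            ch = kept + [ch[h]] + moved + ch[h+1:]
    out = []
    for c in ch:
        out.extend(reorder(c, swap_rule, shift_rule))
    return out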
中科院自动化所的中英文新闻语料库
中科院自动化所(Institute of Automation, Chinese Academy of Sciences)是中国科学院下属的一家研究机构,致力于开展自动化科学及其应用的研究。
该所的研究涵盖了从理论基础到技术创新的广泛领域,包括人工智能、机器人技术、自动控制、模式识别等。
下面将分别从中文和英文角度介绍该所的相关新闻语料。
[中文新闻语料]

1. 中国科学院自动化所在人脸识别领域取得重大突破
中国科学院自动化所的研究团队在人脸识别技术方面取得了重大突破。
通过深度学习算法和大规模数据集的训练,该研究团队成功地提高了人脸识别的准确性和稳定性,使其在安防、金融等领域得到广泛应用。
2. 中科院自动化所发布最新研究成果:基于机器学习的智能交通系统
中科院自动化所发布了一项基于机器学习的智能交通系统研究成果。
通过对交通数据的收集和分析,研究团队开发了智能交通控制算法,能够优化交通流量,减少交通拥堵和时间浪费,提高交通效率。
3. 中国科学院自动化所举办国际学术研讨会
中国科学院自动化所举办了一场国际学术研讨会,邀请了来自不同国家的自动化领域专家参加。
研讨会涵盖了人工智能、机器人技术、自动化控制等多个研究方向,旨在促进国际间的学术交流和合作。
4. 中科院自动化所签署合作协议,推动机器人技术的产业化发展
中科院自动化所与一家著名机器人企业签署了合作协议,共同推动机器人技术的产业化发展。
合作内容包括技术研发、人才培养、市场推广等方面,旨在加强学界与工业界的合作,加速机器人技术的应用和推广。
5. 中国科学院自动化所获得国家科技进步一等奖
中国科学院自动化所凭借在人工智能领域的重要研究成果荣获国家科技进步一等奖。
该研究成果在自动驾驶、物联网等领域具有重要应用价值,并对相关行业的创新和发展起到了积极推动作用。
[英文新闻语料]

1. Institute of Automation, Chinese Academy of Sciences achieves a major breakthrough in face recognition
The research team at the Institute of Automation, Chinese Academy of Sciences has made a major breakthrough in face recognition technology. Through training with deep learning algorithms and large-scale datasets, the research team has successfully improved the accuracy and stability of face recognition, which has been widely applied in areas such as security and finance.

2. Institute of Automation, Chinese Academy of Sciences releases latest research on machine learning-based intelligent transportation system
The Institute of Automation, Chinese Academy of Sciences has released a research paper on a machine learning-based intelligent transportation system. By collecting and analyzing traffic data, the research team has developed intelligent traffic control algorithms that optimize traffic flow, reduce congestion, and minimize time wastage, thereby enhancing overall traffic efficiency.

3. Institute of Automation, Chinese Academy of Sciences hosts international academic symposium
The Institute of Automation, Chinese Academy of Sciences recently held an international academic symposium, inviting automation experts from different countries to participate. The symposium covered various research areas, including artificial intelligence, robotics, and automatic control, aiming to facilitate academic exchanges and collaborations on an international level.

4. Institute of Automation, Chinese Academy of Sciences signs cooperation agreement to promote the industrialization of robotics technology
The Institute of Automation, Chinese Academy of Sciences has signed a cooperation agreement with a renowned robotics company to jointly promote the industrialization of robotics technology. The cooperation includes areas such as technology research and development, talent cultivation, and market promotion, aiming to strengthen the collaboration between academia and industry and accelerate the application and popularization of robotics technology.

5. Institute of Automation, Chinese Academy of Sciences receives National Science and Technology Progress Award (First Class)
The Institute of Automation, Chinese Academy of Sciences has been awarded the National Science and Technology Progress Award (First Class) for its important research achievements in the field of artificial intelligence. The research outcomes have significant application value in areas such as autonomous driving and the Internet of Things, playing a proactive role in promoting innovation and development in related industries.
汉语朗读话语重音自动分类研究
STUDY ON STRESS PERCEPTION IN CHINESE SPEECH

胡伟湘 董宏辉 陶建华 黄泰翼
中科院自动化所模式识别国家重点实验室 100080 北京
Hu Weixiang, Dong Honghui, Tao Jianhua, Huang Taiyi
National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences
{wxhu, hhd, jhtao, huang}@

Abstract
Restricted by the prosody hierarchy and disturbed by tone and intonation, it is a hard task to detect the stress of Chinese speech automatically. In this paper, aiming at automatic stress perception in normal Mandarin reading speech, we studied some acoustical measurements based on F0, duration and intensity, and proposed a novel model to calculate the stress of each syllable. With a classification-tree structure, the model effectively combined the restrictions of tone context and prosody hierarchy. The results show that the top line of pitch, pitch range and duration are important cues for stress perception.

摘 要
汉语的重音由于受到声调、语调以及韵律单元层级的干扰和制约,对于重音的自动感知一直是比较困难的问题。
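下面按摘要中提到的线索(F0上线、音域、时长、强度)给出一个音节级声学特征计算的最小示意(Python/NumPy)。特征定义取常见做法,属本文假设,并非原文的确切实现:

import numpy as np

def stress_features(f0, duration, samples):
    # f0: per-frame F0 values (Hz) of one syllable, 0 for unvoiced frames;
    # duration: syllable duration in seconds; samples: waveform samples.
    f0 = np.asarray(f0, float)
    voiced = np.log(f0[f0 > 0])             # work on log-F0
    rms = np.sqrt(np.mean(np.square(np.asarray(samples, float))))
    return {
        "f0_top_line": voiced.max(),        # top line of pitch
        "pitch_range": voiced.max() - voiced.min(),
        "duration": duration,
        "intensity_db": 20 * np.log10(rms + 1e-12),
    }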
中国科学院大学模式识别国家重点实验室计算机视觉课件
(模式识别国家重点实验室,中国科学院自动化研究所 / National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)

混合高斯模型
• 流程图
(图: 混合高斯模型流程图)

运动分析
(图: 输入图象与背景图象比较示意)

背景差法
• 怎样获得背景图像?
  – 人为给定若干背景图像: 求其均值图像; 图像训练集的中值图像。
  – 没有指定背景图象: 混合高斯模型; 其它。
• 原理: 计算当前图像与背景图像的逐象素的灰度差,再通过设置阈值来确定运动前景区域。
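下面给出背景差法的一个最小示意实现(Python/NumPy):先由若干背景帧求均值(或中值)背景,再做逐象素灰度差并阈值化。阈值30仅为示例值:

import numpy as np

def build_background(frames, use_median=False):
    # frames: list of grayscale uint8 images without foreground objects.
    stack = np.stack([f.astype(np.float32) for f in frames])
    return np.median(stack, axis=0) if use_median else stack.mean(axis=0)

def foreground_mask(frame, background, thresh=30):
    # Per-pixel absolute gray-level difference, then thresholding.
    diff = np.abs(frame.astype(np.float32) - background)
    return diff > thresh                    # True = moving foreground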
视角和光照显著变化时的变化检测方法研究
第35卷第5期 自动化学报 Vol.35, No.5
2009年5月 ACTA AUTOMATICA SINICA May, 2009

视角和光照显著变化时的变化检测方法研究
李炜明1 吴毅红1 胡占义1

摘要 探讨使用计算机视觉的最新方法来解决基于两幅高空间分辨率光学遥感图像的城市变化检测问题.基本原理是通过提取聚类出现的变化直线段群来提取城市变化,重点研究了拍摄视角和光照条件显著变化时的几个主要问题.提出了一种基于多种类型图像特征的匹配方法来提取无变化建筑的顶部区域,结合几何约束引入了变化盲区的概念以处理高层建筑在不同视角和光照下的图像不同现象.使用真实遥感图像进行实验,在视角和光照显著变化时仍可取得满意的变化检测结果.

关键词 变化检测, 高空间分辨率遥感图像, 城市变化, 变化盲区, 视角变化, 光照变化
中图分类号 TP75

Urban Change Detection under Large View and Illumination Variations
LI Wei-Ming1 WU Yi-Hong1 HU Zhan-Yi1

Abstract This paper intends to explore the state of the art computer vision techniques to detect urban changes from bi-temporal very high resolution (VHR) remote sensing images. The underlying principle of this work is that most real urban changes usually involve clustered line segment changes. Several major problems are investigated when the view angle and illumination condition undergo large variations. In particular, a method is proposed to extract roof regions of unchanged tall buildings by matching both image point groups and image regions. In addition, by considering the geometrical constraints, the concept of change blindness region is introduced to remove the image changes related to unchanged tall buildings. Experiments with real remote sensing images show that the proposed approach still performs well though the image pairs undergo significant variations of viewing angles and illumination conditions.

Key words Change detection, very high resolution remote sensing images, urban changes, change blindness regions, view variations, illumination variations

从大范围城市中快速、自动地检测出发生了变化的区域,成为城市地图更新、土地利用监控、城市执法管理、紧急灾害救援等应用问题的迫切需求[1−3].本文主要探讨如何使用计算机视觉的最新方法来解决从两幅高空间分辨率光学遥感图像中自动检测城市变化的问题.在实际的城市场景中,感兴趣的城市变化在图像上通常表现为聚类出现的变化图像特征群,因此可以通过提取这样的变化图像特征群来提取城市变化.本文将讨论基于上述原理的变化检测方法,并重点研究在两幅图像的拍摄视角和光照条件发生显著变化情况下的城市变化检测问题.

近年来,空间分辨率优于1m的高空间分辨率光学遥感图像开始大量进入市场,如Ikonos卫星图像、Quickbird卫星图像以及航拍图像等.这些图像可以清晰分辨出单个建筑物的形状、结构和大小.本文研究的目标是使用两幅同一城市地区在不同时刻拍摄的这类图像,自动检测面积相当于或大于单个建筑物(如15m×15m)的城市变化区域,这些变化由城市基础设施的变化造成,如建筑物的新建和拆除、道路的改变、城市场地使用类型的改变等.

（收稿日期 2008-04-02 收修改稿日期 2008-10-22
Received April 2, 2008; in revised form October 22, 2008
国家自然科学基金(60673104, 60675020)资助
Supported by National Natural Science Foundation of China (60673104, 60675020)
1. 中国科学院自动化研究所模式识别国家重点实验室 北京100190
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
DOI: 10.3724/SP.J.1004.2009.00449）

变化检测是遥感图像处理领域中的经典问题,但传统的遥感图像变化检测方法[2−3]大都只适用于中、低空间分辨率的多光谱(或高光谱、超光谱)遥感图像,很难用于本文的问题.近年来针对高空间分辨率遥感图像的变化检测问题,国内外的研究者先后开展了一些相关研究[1−5].其中的一些方法借助于使用两幅光学图像以外的其他类型数据,如基于场景模型的方法[6]需要预先建立场景中物体的三维线框模型;基于数字高程模型(Digital elevation model, DEM)的方法[7]需要使用场景的DEM数据.通过使用这些辅助数据可以使得问题大大简化,但在实际应用中这些辅助数据可能难以获得.本文主要研究仅使用两幅图像作为主要输入数据的变化检测方法.基于两幅图像的方法主要通过分析比较两幅图像之间某种类型图像特征的变化情况来检测物理场景的变化,如基于点特征的方法[4,8−9]、基于图像区块的方法[10−11]、基于线特征的方法[12−15]、基于图像对象特征的方法[16−21]等,这些方法在两幅图像的拍摄视角基本不变或变化不大时可以取得良好的效果.但是对于包含大量高层三维建筑物的城市场景,当两个时刻的拍摄视角和场景光照情况发生显著变化时,单个建筑大小的图像特征受图像拍摄条件的影响严重,同一建筑在两个时刻的图像特征可能产生明显不同,这些拍摄条件导致的图像变化和真实城市变化导致的图像变化同时出现,给上述基于图像特征的变化检测方法造成了很大困难.在包含大量高层建筑物、场景内容复杂的城市区域,当视角和光照变化显著时,仅使用两幅光学图像作为主要输入数据来检测单个建筑物大小的城市变化仍然是一个尚未得到很好解决的问题.

城市变化通常体现为聚集出现并占据一定面积的变化图像特征群.本文选择直线段作为检测变化的图像特征,因此可以将检测城市变化的问题转变为在两幅图像之间检测聚类出现的变化直线段群,这是本文的基本原理.然而实际中变化直线段的成因比较复杂,可将其大致分为两类:1)由场景中真实的城市物体变化造成的;2)由两幅图像之间拍摄视角变化、场景光照变化以及其他的噪声干扰造成的.在检测聚类的变化直线段群之前必须首先设法排除第2类变化直线段.

文献[12]中提出了一种遵循上述思路的城市变化检测方法.由于考虑了第2类变化直线段的影响,此方法对视角和光照的变化不敏感,可以在众多的图像变化中分辨出真实的城市变化.但是对其进行测试时发现,当视角变化和光照变化的幅度显著时,此方法的性能将下降,出现大量的错误检测.正如原文实验部分所指出,对于一些高层建筑物附近复杂的视差和阴影情况,这种方法尚不能很好地进行处理.

本文对文献[12]中的方法进行了改进,主要的贡献在于:1)分析并归纳了在视角、光照显著变化的情况下进行城市变化检测的新问题,提出了针对这些问题的改进思路和实现算法;2)研究了问题中的几何约束,提出使用基于多种类型图像特征的匹配方法来提取没有发生变化的建筑物屋顶区域;3)针对高层建筑在视角和光照显著变化下出现的大视差、自遮挡、互遮挡和地面阴影变化的问题,提出了“变化盲区”的概念以及相应的检测方法.通过检测变化盲区可以有效地去除没有变化的高层建筑物导致的虚假图像变化.使用真实图像对本文方法进行初步验证,当两幅图像的拍摄视角和光照条件存在显著变化时,本文方法仍然可以取得满意的检测结果.

本文的组织结构如下:第1节介绍视角和光照显著变化时的特殊问题,并提出相应的解决思路和方法;第2节结合一个真实遥感图像的例子介绍了基于这种解决方法的算法实现,并给出了中间结果;第3节使用Ikonos卫星图像、Quickbird卫星图像和航空图像对上述方法进行了实验,并对算法中的问题进行了分析;第4节对本文进行了总结;附录部分给出了第2.1节一个命题的证明过程.

1 视角和光照显著变化情况下的问题及其解决思路和方法

文献[12]介绍了一种城市变化检测方法.在包含大量高层建筑的城市场景中,这种方法在视角和光照变化显著时会遇到新的问题,这些问题主要包括:

问题1. 无变化高层建筑物顶部的显著视差.当拍摄视角不同时,场景中同一个三维点在两幅图像上的像点位置不同(通常称之为视差现象).建筑物的高度越高,拍摄的视角变化越大,则视差越大.当拍摄视角变化十分显著时,同一座高层建筑物的顶部区域在两幅图像之间位置偏移的幅度甚至可能超过单个建筑物的尺度.

问题2. 无变化建筑物侧面的变形和建筑物的自身遮挡.由于遥感图像的拍摄视角是朝向地面,因此建筑物侧面区域在竖直方向被大幅度压缩变形.在视角变化显著时,同一建筑物的侧面区域在两幅图像中的变形情况很不相同,在实际中很难进行正确匹配.另一方面,由于建筑物本身不透明(本文暂不考虑透明的建筑),建筑的某些侧面区域可能只出现在一幅图像中,而在另一幅图像的视角下由于被建筑物自身遮挡而不可见.

问题3. 建筑物之间以及建筑物与其周围地面之间的互相遮挡.尤其是高层建筑会遮挡其周围的地面或高度较低的建筑.当视角发生显著变化时,两幅图像上的遮挡情况可能很不相同,某些物体在一幅图像中可见,而在另一幅图像中由于被高层建筑遮挡而不可见.

上述三种情况产生的变化直线段都不是由场景中的真实城市变化引起的,因此必须设法去除.
对于问题1,文献[12]通过结合极线几何约束匹配尺度不变特征变换(Scale invariant feature transform, SIFT)[22]来提取没有变化的城市建筑,从而去除视差导致的错误变化直线段.这种方法在视角变化不大的情况下效果良好.但当视角变化显著时,实验发现SIFT特征无法提取足够多的匹配图像特征.其中的原因之一在于视角的显著变化引起同一高层建筑物顶部在两幅图像上对应的图像区域产生明显平移,经平移后对应图像区域周围的图像内容发生了显著变化,使用SIFT特征无法匹配这些图像特征.为此,本文提出了一种结合多种类型图像特征的方法来提取两幅图像中的无变化区域,具体的特征提取和匹配方法将在第2.3节中介绍.

对于问题2和问题3,文献[12]都没有考虑,因此将在这些区域中产生错误的城市变化检测.为了解决这些问题,本文引入“变化盲区”的概念.变化盲区定义为两幅图像中的一些特殊图像区域,由于两幅图像的成像过程不同,这些区域对应的物体表面仅仅在两幅不同时刻图像中的一幅图像中可以有效地得以呈现.由于不同时具备两个时刻的有效信息,没有足够的依据对这些区域中的变化情况作出判断,因而称之为变化盲区.产生变化盲区的本质原因在于成像过程中的信息丢失.从三维物理世界到二维图像平面的成像过程伴随着多种信息丢失过程,包括由三维到二维的几何投影过程、不透明物体表面间的遮挡、物体对光线的遮挡以及相机传感器能力的限制等.当两个时刻的成像过程不同时,两次信息丢失的情况也不同,某些物体表面的信息只在一幅图像中出现而在另一幅图像中丢失,因而这些表面成为变化盲区.根据成因,变化盲区有多种不同类型.按照城市变化检测的需要,本文将检测三种变化盲区,其中前两种分别针对问题2和问题3,第三种用来解决阴影检测的问题.检测三种变化盲区的具体方法将在第2.4节中详细介绍,这里首先给出其定义和基本的检测思路:

变化盲区1. 无变化建筑物侧面对应的图像区域.由于俯视拍摄导致这些区域在竖直方向被剧烈压缩变形,其中的图像特征很难有效提取和利用,因此将其作为变化盲区处理.同时,当视角显著变化时,这些区域往往被建筑物自身遮挡成为变化盲区.检测这部分变化盲区并将其中的变化直线段去除可以解决问题2导致的错误变化检测.上文提到的图像特征匹配步骤在提取出无变化建筑物屋顶区域的同时也得到这些区域的视差大小.在某屋顶区域视差已知的情况下,利用视差和建筑物高度之间的比例关系,结合城市场景结构的几何约束可以计算出该屋顶区域对应的建筑侧面在图像上的区域范围.

变化盲区2. 无变化建筑物产生的互遮挡区域.这些区域由于周围高层建筑物的遮挡而仅仅在一幅图像的视角下可见.检测这部分变化盲区主要针对问题3.利用上述无变化建筑物顶部的提取结果以及建筑物侧面的计算结果,可以检测出互遮挡区域.

变化盲区3. 无变化建筑物产生的地面阴影区域.在阴影区域中,由于直射阳光被遮挡,物体只能接受环境光线的照射.由于环境光线比较微弱,阴影区域图像的信噪比很低,物体的图像特征很难可靠提取.对于这种光线不足引起的信息丢失情况,这里也作为一种变化盲区处理.利用上文提到的无变化高层建筑物顶部区域的位置形状和视差的大小,结合光照模型,可以解析地预测图像中阴影区域的分布情况.将上述计算得到的阴影区域和实际图像中的灰度属性结合起来,可以更加准确地分割出图像中的阴影区域.

在上述这些变化盲区中,即使实际场景中的物体没有发生任何变化,也可能会发生图像特征的变化,对于这些变化盲区中检测到的图像特征变化应该予以去除.

综上所述,为了解决视角和光照显著变化给变化检测带来的特殊问题,本文使用更加有效的方法提取没有变化的建筑物区域,并结合问题的几何约束定义和提取变化盲区.将与这些匹配区域和变化盲区有关的变化直线段去除,可以有效地减少算法的错误检测.需要说明的是,本文定义的变化盲区中也可能包含真实的城市变化.按照上述方法将这些变化盲区去除可能造成算法的“漏检测”.这种漏检测的原因在于两幅图像中的数据缺失,即场景中的某些局部不能在两幅图像中都得以有效呈现.本文明确地定义并提取了这些两幅图像中由于数据缺失而无法进行变化检测的“变化盲区”.在实际应用中可以根据需要进一步对变化盲区进行处理.例如,可以借助于其他视角或光照条件下拍摄的补充图像数据进一步对变化盲区中的变化情况进行确认.

2 变化检测的算法实现

基于上述方法的算法流程图如图1所示.算法首先对两幅输入图像进行几何配准并标定算法中需要用到的几何参数.在一些特殊性质的图像区域中,变化检测问题可以有特殊的先验知识约束使得问题得以简化,因此算法接下来提取三种特殊区域并单独处理其中的变化检测问题(三种特殊区域是大面积植被区域、大面积水体区域以及大面积地面区域).在这些特殊区域以外的城市建筑用地区域中,使用上文提到的基于变化直线段群的变化检测方法来检测城市变化.下面将结合一个真实图像的例子来说明算法每一步的计算过程和结果.在这个例子中,我们使用由Ikonos卫星分别于2000年和2002年对纽约曼哈顿城区拍摄的遥感图像,原始图像的空间分辨率为1m.作为美国空间成像(Space Imaging)公司提供的样例图像,原始图像被压缩为24位标准JPEG格式.

图1 变化检测算法流程图 (Fig. 1 Flowchart of the change detection algorithm)

2.1 图像配准和几何标定

首先按照地面特征通过一个全局变换将两幅原始图像进行几何配准.对于修建在平坦地区的城市,可以认为城市场景由两类物体组成:地平面和分布在地平面上具有一定高度的三维建筑物.通过一个全局坐标变换可以将两幅图像按照地平面上的特征进行配准.在一般情况下,将两幅图像中同一个平面物体对应的区域配准需要一个单应变换(Homography),对于本文使用的Quickbird和Ikonos图像,单应变换退化为相似变换.对于航空图像的局部子图像,这种退化仍然成立.我们使用RANSAC方法[23]估计两幅图像之间的相似变换,从而将两幅图像按照地面配准到相同的图像坐标下.当成像过程可以用弱透视投影模型表达时,使用全局相似变换配准的两幅图像具有以下几个性质:

性质1. 地平面上的物体在两个时刻图像上对应的像素对齐到相同的图像坐标下.
性质2. 两幅图像之间的对极线平行.
性质3. 场景中的平行直线段在每幅经过配准图像中的投影直线段仍然平行.

配准后并截取公共部分得到的两幅图像I_t1和I_t2如图2(a)和图2(b)所示.由性质1,可见地面物体对齐到相同的图像坐标系下,可以观察到地面物体的配准误差一般在两个像素以内.图像的大小为858像素×766像素,覆盖实际面积约为0.6km².例子中两幅图像之间的拍摄视角和光照方向都存在显著不同,尤其是视角不同引起的图像变化在这两幅图像之间表现得十分明显.

图2 按照地面配准后的两幅Ikonos图像: (a) t1时刻图像I_t1; (b) t2时刻图像I_t2 (Fig. 2 Ikonos images after ground registration)

图3是I_t1和I_t2的右下角局部,图中标出了三对对应点的位置(A1, A2)、(B1, B2)和(C1, C2).从对应点在两幅图像上位置的差异可以看出视差现象十分明显(注意两幅图像中的地面物体对齐到相同的图像坐标下).高层建筑物由于视角变化而产生的遮挡现象也很明显,不妨记图像中的方位为上北下南、左西右东,则可见左图中的高层建筑物呈现出向东的侧面,而右图中的高层建筑物呈现出向西和向南的侧面.按照性质2,两幅图像中的对极线平行,本文通过手工选取一双对应点来确定两幅图像之间的对极线方向.实际中,这个方向可以通过传感器记录的拍摄角度计算出来,也可以通过图像特征匹配方法自动求取.

图3 两幅图像之间的对应点示例: (a) 图像I_t1局部; (b) 图像I_t2局部 (Fig. 3 Corresponding points between the image pair)

两幅配准图像中的几个关键的方向为变化检测提供了重要的约束,包括太阳光线在两幅图像中的方向、场景中地平面垂线在两幅图像中的方向、场景中地面垂线的阴影方向在两幅图像中的方向.对于场景中某个高于地平面的点在一幅图像中的像点,可以按照上述三个方向在这幅图像中确定一个三角形,我们称之为这个图像点的城市变化检测几何特征约束三角形,简称几何特征三角形(Geometrical feature triangle, GFT).例如,图像I_t1中某点P1对应的几何特征三角形P1R1S1和图像I_t2中某点P2对应的几何特征三角形P2R2S2如图4所示.由图4可见,点P1对应于某高层建筑物矩形屋顶的一个顶点,直线段P1R1是这座高层建筑某侧面上的竖直边缘在图像中的成像,点S1是阳光经过点P1对应的物点投在地平面上的阴影点在图像中的像,点R1是经过点P1对应的物点的竖直直线与地平面的交点在图像上的像(这个交点一般位于建筑物的底部),直线段R1S1的方向代表了场景中地平面垂线在地平面上阴影的方向.三角形P2R2S2的定义与此相似.实际上,场景中每一个高于地平面的点都可以在每幅图像上对应一个几何特征三角形.由于三维物体的遮挡,这个三角形中的一条边或多条边未必在图像中全部可见.这里需要说明的是,图中的P1和P2并不是对应点,但由于P1和P2对应的物点位于同一个平顶建筑的屋顶,它们的视差相同,这个结论可以由性质3直接推出.

图4 几何特征三角形示例: (a) 图I_t1中的几何特征三角形; (b) 图I_t2中的几何特征三角形 (Fig. 4 Examples of geometrical feature triangles)

需要说明的是,在一些特殊的情况下,几何特征三角形将退化为一条直线段或一个点.按照定义,几何特征三角形的形状受到下列三个方向的影响:相机拍摄方向、场景中的地平面垂线方向和太阳光线入射方向.当这三个方向中的任意两个方向平行时,几何特征三角形将退化为一条直线段;当这三个方向全部平行时,几何特征三角形退化成一个点.以几何特征三角形P1R1S1举例说明:当拍摄方向平行于地面垂线方向时,点P1和点R1重合,高层建筑物侧面区域在图像上不出现;当光照方向平行于地面垂线方向时,点R1和点S1重合,建筑物地面阴影区域在图像上不出现;当拍摄方向平行于光照方向时,点P1和点S1重合,建筑物地面阴影区域被建筑物本体遮挡而在图像上不可见.当三个方向都重合时,点P1、点R1和点S1重合,建筑物侧面区域和建筑物地面阴影区域在图像上都不出现.当几何特征三角形发生退化的情况时,本文只处理余下来的一条边或者一个点,相应的定义作适当调整,受篇幅所限后文各处不再重复说明.

几何特征三角形具有下列有用的性质:同一幅图像中两个不同图像点对应的两个几何特征三角形相似,且它们对应边长度的比例等于这两个点在两幅配准图像中像点视差大小的比例.该性质的证明见本文的附录.由此,只需在每幅图像中指定一个点的几何特征三角形并确定这个点的视差,就可以得知这幅图像中任意一点的几何特征三角形的形状,并可以由该点的视差计算出这个几何特征三角形的大小.本文通过手工的方式完成上述几何标定,即在两幅图像中各选择三个点确定一个特征三角形并找出其对应点的视差.实际工程中的遥感传感器往往可以提供拍摄图像时的相机姿态参数,在未来的工作中将结合这些数据设计算法自动完成这部分工作.

图4中所示的两个几何特征三角形视差相同,在两幅图像中的三条边均为可见,且来自场景中最高的建筑物之一.我们使用像这样的一对几何特征三角形来定量估计拍摄视角变化和光照方向变化对变化检测问题的影响程度.通过平移两个几何特征三角形可以将R1和R2对齐,将平移后的两个几何特征三角形仍然记为三角形P1R1S1和三角形P2R2S2,则P1和P2的欧氏距离|P1P2|反映了拍摄视角变化造成的视差大小,S1和S2的欧氏距离|S1S2|反映了光照方向变化造成的阴影变化的大小.这个例子中,|P1P2| = 93像素,|S1S2| = 21像素.对于空间分辨率为1m的Ikonos图像,这意味着场景中同一物点在两幅图像上的视差位移可能达到实际距离中的93m,阴影点的位移可能达到21m.当算法意图在单个建筑的尺度上检测城市变化时(如15m×15m的方形建筑),上述视角和光照的变化情况都会产生严重的干扰.

2.2 特殊图像区域的单独处理

图2中的纽约曼哈顿城区是一个典型的复杂城市场景,其中包括多种类型的城市基础设施,如大型高层建筑、密集建筑群、机动车道路、停车坪等,在这个例子中还包括大片公园绿地形成的植被区域、临近海港的大片水体区域以及码头设施.城市中类型相近的设施经常聚集在一起形成大片的城市区域.在不同类型的城市区域中,城市建筑用地区域是经常出现的城市区域.由于这些区域包含大量的三维高层建筑物,因此视角和光照变化对这类区域图像特征的影响尤为显著,本文将在下文
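根据上面给出的性质(同一幅图像中不同点的几何特征三角形相似,对应边长比例等于视差比例),手工标定一个参考三角形及其视差后,即可推算任意点的几何特征三角形.下面是一个Python/NumPy示意实现,接口为本文假设:

import numpy as np

def gft_for_point(p, gft_ref, disp_ref, disp_p):
    # gft_ref = (P_ref, R_ref, S_ref): calibrated reference triangle;
    # disp_ref / disp_p: disparities of the reference point and of p.
    P0, R0, S0 = (np.asarray(v, float) for v in gft_ref)
    p = np.asarray(p, float)
    k = disp_p / disp_ref                   # disparity ratio = size ratio
    R = p + k * (R0 - P0)                   # building-bottom point
    S = p + k * (S0 - P0)                   # ground-shadow point
    return p, R, S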
文章题目:深度探讨:模式识别的意义与应用

一、引言
在当今信息爆炸的时代,我们面临着海量的数据和信息,如何从这些信息中提取有用的知识和规律成为了一个重要的问题。
而模式识别作为一种重要的人工智能技术,能够帮助我们从数据中找到隐藏的规律和趋势,提供了一种强大的工具来解决这个问题。
在本文中,我们将深入探讨模式识别的意义与应用,并探讨其在不同领域的作用。
二、模式识别的定义与基本原理
模式识别是一种通过数据分析和学习来识别数据中的规律和模式的技术。
它的基本原理是通过对输入的数据进行特征提取和分类,从而识别出数据中的各种模式。
在这个过程中,模式识别涉及到统计学、机器学习、人工智能等多个领域的知识,是一个集成了多种技术和方法的交叉学科。
通过模式识别技术,我们能够从数据中发现隐藏的规律和结构,从而实现对数据的理解和应用。
三、模式识别的意义与应用
在现实生活中,模式识别技术得到了广泛的应用。
在医学领域,模式识别可以帮助医生从医学影像中识别出病变的模式,辅助医生进行诊断和治疗。
在金融领域,模式识别可以帮助分析师从市场数据中发现交易的规律和趋势,指导投资决策。
在工业生产中,模式识别可以帮助工程师监测设备运行的模式,预测设备的故障和维护周期。
在自然语言处理中,模式识别可以帮助机器理解和生成自然语言,实现智能对话和翻译。
可以看到,模式识别技术在各个领域都发挥着重要作用,促进了这些领域的发展和进步。
四、模式识别的个人观点与理解
在我看来,模式识别是一种强大的工具,能够帮助人类从数据中挖掘出宝贵的知识和规律。
通过模式识别技术,我们能够实现对复杂数据的理解和应用,帮助人类更好地处理现实生活中的各种问题。
在未来,随着人工智能技术的不断发展,模式识别技术将发挥越来越重要的作用,并且将在更多的领域得到应用,实现人类社会的进步和发展。
五、总结
通过本文的探讨,我们对模式识别的意义与应用有了更深入的理解。
模式识别作为一种重要的人工智能技术,不仅在医学、金融、工业、自然语言处理等领域得到了广泛的应用,而且在人类生活和工作中起着重要的作用。
视觉通路
(模式识别国家重点实验室,中国科学院自动化研究所 / National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)

视束交叉
(图: 左右视野L1、L2、R1、R2经视束交叉的投射示意)

外侧膝状体
• 颞侧的一半视神经纤维投射到同侧,鼻侧的一半投射到对侧。
• 右侧的视觉信息投射到左脑,左侧的视觉信息投射到右脑。

光线
(图: 视网膜分层结构示意,标注1、2、3)

视束交叉
(图: 视束交叉通路示意,标注L1、L2、R1、R2、C)

初级阶段(包括V1、V2和V3)

初级视皮层单细胞感受野
• 带状交替结构
• 对给光敏感
• 对撤光敏感
面部识别在中国的应用英语作文
Facial recognition technology has been rapidly advancing in recent years, and China has emerged as a global leader in the development and implementation of this innovative technology. China's vast population, coupled with its ambitious plans to build a comprehensive surveillance system, has made facial recognition a crucial component of the country's technological landscape. This essay will explore the various applications of facial recognition in China, its benefits, and the ethical concerns surrounding its use.

One of the primary applications of facial recognition in China is its integration into the country's extensive surveillance network. China has been investing heavily in building a nationwide network of surveillance cameras, with estimates suggesting that the country has over 200 million surveillance cameras installed, making it the world's largest video surveillance system. Facial recognition technology is used to identify and track individuals as they move through public spaces, providing the government with a powerful tool for monitoring and controlling its citizens.

The Chinese government has justified the use of facial recognition by claiming that it enhances public safety and security. The technology has been employed to identify and apprehend criminals, as well as to monitor the movements of individuals deemed to be potential threats to social stability. For example, the government has used facial recognition to track and monitor the Uyghur minority population in the Xinjiang region, a practice that has been widely criticized by human rights organizations as a violation of individual privacy and a form of ethnic discrimination.

In addition to its use in surveillance, facial recognition technology has also been integrated into various other aspects of daily life in China. The technology is widely used in mobile payment systems, allowing users to authenticate their identity and make payments using their facial features. This has led to a significant increase in the adoption of mobile payment platforms, such as Alipay and WeChat Pay, which have become ubiquitous in the country.

Furthermore, facial recognition has been implemented in various public services, such as accessing public transportation, entering office buildings, and even checking into hotels. This has led to increased efficiency and convenience for users, but it has also raised concerns about the potential for abuse and the erosion of personal privacy.

One of the most controversial applications of facial recognition in China is its use in the country's social credit system. The social credit system is a government-run initiative that aims to monitor and assess the behavior of Chinese citizens, with the goal of incentivizing "good" behavior and punishing "bad" behavior. Facial recognition is used to identify individuals and track their activities, which can then be used to assign them a social credit score. This score can have significant consequences, affecting an individual's access to various public services and opportunities.

The use of facial recognition in China's social credit system has been widely criticized by human rights organizations and international observers. They argue that the system represents a significant threat to individual privacy and civil liberties, as it gives the government unprecedented power to monitor and control its citizens.

Despite these concerns, the Chinese government has continued to invest heavily in the development and deployment of facial recognition technology.
The country has become a global leader in this field, with Chinese companies such as Hikvision, Dahua, and SenseTime emerging as major players in the global facial recognition market.

The rapid advancement of facial recognition technology in China has also raised concerns about the potential for abuse and the erosion of individual privacy. There are fears that the technology could be used to suppress dissent, target minority groups, and create a highly invasive surveillance state. Moreover, the lack of robust privacy protections and oversight mechanisms in China has exacerbated these concerns.

In response to these concerns, the Chinese government has attempted to address some of the ethical issues surrounding the use of facial recognition. For example, the government has introduced regulations that require companies to obtain user consent before collecting and using facial recognition data. Additionally, the government has established guidelines for the ethical use of facial recognition technology, which include measures to protect individual privacy and prevent discrimination.

However, critics argue that these measures are largely inadequate and that the Chinese government's commitment to protecting individual privacy is questionable. They point to the government's continued use of facial recognition for surveillance and social control purposes as evidence of its prioritization of national security over individual rights.

In conclusion, the application of facial recognition technology in China is a complex and multifaceted issue. While the technology has brought about increased efficiency and convenience in various aspects of daily life, it has also raised significant ethical concerns about the potential for abuse and the erosion of individual privacy. As China continues to push the boundaries of this technology, it will be crucial for the government to strike a delicate balance between national security and individual rights, and to implement robust safeguards and oversight mechanisms to ensure the ethical and responsible use of facial recognition technology.
基于跨模态深度度量学习的甲骨文字识别
张颐康1,2 张恒1 刘永革4 刘成林1,2,3

摘要 甲骨文字图像可以分为拓片甲骨文字与临摹甲骨文字两类. 拓片甲骨文字图像是从龟甲、兽骨等载体上获取的原始拓片图像, 临摹甲骨文字图像是经过专家手工书写得到的高清图像. 拓片甲骨文字样本难以获得, 而临摹文字样本相对容易获得. 为了提高拓片甲骨文字识别的性能, 本文提出一种基于跨模态深度度量学习的甲骨文字识别方法, 通过对临摹甲骨文字和拓片甲骨文字进行共享特征空间建模和最近邻分类, 实现了拓片甲骨文字的跨模态识别. 实验结果表明, 在拓片甲骨文字识别任务上, 本文提出的跨模态学习方法比单模态方法有明显的提升, 同时对新类别拓片甲骨文字也能增量识别.

关键词 甲骨文字识别, 深度度量学习, 最近邻分类, 跨模态学习
引用格式 张颐康, 张恒, 刘永革, 刘成林. 基于跨模态深度度量学习的甲骨文字识别. 自动化学报, 2021, 47(4): 791−800
DOI 10.16383/j.aas.c200443

Oracle Character Recognition Based on Cross-Modal Deep Metric Learning
ZHANG Yi-Kang1,2 ZHANG Heng1 LIU Yong-Ge4 LIU Cheng-Lin1,2,3

Abstract There are two types of oracle character images: handprinted ones that are clean, and ones scanned from bones and shells that are noised. The collection of handprinted samples is easier than that of scanned images. Therefore, to improve the recognition of scanned oracle characters, we propose a method based on cross-modal deep metric learning to take advantage of the handprinted samples. Via shared feature space learning using cross-modal handprinted and scanned samples, scanned characters can be recognized by nearest neighbor classification in the shared space. Experimental results demonstrate that the proposed method not only achieves better performance in oracle character recognition but also can recognize new categories incrementally.

Key words Oracle character recognition, deep metric learning, nearest neighbor classification, cross-modal learning
Citation Zhang Yi-Kang, Zhang Heng, Liu Yong-Ge, Liu Cheng-Lin. Oracle character recognition based on cross-modal deep metric learning. Acta Automatica Sinica, 2021, 47(4): 791−800

甲骨文字是早在中国商朝时期就出现的文字, 是世界上最古老的文字之一, 同时也是中国及东亚已知的最早成体系的一种文字形式. 自动识别甲骨文字对考古学、古文字学以及历史年代学等多个领域都有着非常重要的应用价值. 目前甲骨文字标注基本只能依靠甲骨文专家手动处理, 计算机自动检测与识别技术刚刚起步, 性能远不能达到实用化水平. 随着人工智能技术的发展, 如何让计算机像处理现代文字一样处理甲骨文字, 成为计算机学者和文字与语言学者共同关注的课题.

如图1所示, 甲骨文字图像可以分为临摹甲骨文字图像与拓片甲骨文字图像两类. 拓片甲骨文字图像是从龟甲、兽骨等载体上获取的原始拓片图像, 临摹甲骨文字图像是专家临摹拓片甲骨文字后得到的高清图像, 修复了拓片甲骨文字图像的残缺和噪声等问题. 临摹甲骨文字图像可以通过临摹、手绘得到大量样本, 而拓片甲骨文字因为客观条件的限制难以获取. 由于缺少训练样本, 拓片甲骨文字识别很难取得较高的识别精度[1]. 因此, 本文研究如何用临摹甲骨文字样本辅助训练分类器进行拓片甲骨文字识别. 同时, 由于一些拓片甲骨文字类别没有训练样本, 甲骨文字的增量识别也是辅助甲骨文字专家进行语言研究的重要手段.

图1 不同模态的甲骨文字图像: (a) 临摹甲骨文字图像 (Handprinted oracle character images); (b) 拓片甲骨文字图像 (Oracle character images scanned from bones and shells)
图2 拓片甲骨文中的字形残缺、大量噪声问题 (Fig. 2 Incomplete and noisy oracle character images)

由于甲骨文字本身具有噪声严重、图像残缺(如图2所示)、类内样本少、类间样本不均衡等问题, 文字识别领域性能优异的深度学习方法[2−4]由于依赖大量样本训练而难以得到满意的识别性能. Guo等[5]提出了一种基于卷积神经网络(Convolutional neural network, CNN)的甲骨文字分类方法, 他们在基于Gabor算子的低层特征和基于稀疏自编码器[6]的中层特征表示基础上, 设计了一种多特征融合的层次化特征表示方法, 继而通过CNN[7]实现了更好的识别效果.

（收稿日期 2020-06-22 录用日期 2020-10-19
Manuscript received June 22, 2020; accepted October 19, 2020
新一代人工智能重大项目(2018AAA0100400), 国家自然科学基金(61936003, 61721004), 安阳师范学院甲骨文信息处理教育部重点实验室开放课题(KFKT2018001)资助
Supported by Major Project for New Generation AI (2018AAA0100400), National Natural Science Foundation of China (61936003, 61721004), Open Project from Key Laboratory of Oracle Information Processing in Anyang Normal University (KFKT2018001)
本文责任编委 张军平
Recommended by Associate Editor ZHANG Jun-Ping
1. 中国科学院自动化研究所模式识别国家重点实验室 北京 100190
2. 中科院大学人工智能学院 北京 100049
3. 中国科学院脑科学与智能技术卓越创新中心 北京 100190
4. 安阳师范学院 安阳 455099
1. National Laboratory of Pattern Recognition, Institute of Automation of Chinese Academy of Sciences, Beijing 100190
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049
3. Chinese Academy of Sciences Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190
4. Anyang Normal University, Anyang 455099）
这种方法依然是基于常规分类任务的CNN框架, 并没有充分利用甲骨文字自身的特点, 所以对样本极少的类别很难取得良好的识别性能.

为了充分利用临摹甲骨文字训练样本以提高拓片甲骨文字的识别性能, 我们提出一种基于深度度量学习的跨模态甲骨文字识别方法. 基于CNN和深度度量学习分别将拓片甲骨文字与临摹甲骨文字映射到相同维度的特征空间, 并通过对抗学习算法使相同类别的拓片甲骨文字和临摹甲骨文字具有相似的特征分布, 再使用深度度量学习对拓片甲骨文字特征进行修正, 以增大拓片字符样本与异类临摹甲骨文字特征的距离, 同时减小与同类临摹甲骨文字特征的距离, 实现甲骨文字的跨模态特征空间建模. 在跨模态特征学习的基础上, 我们以临摹甲骨文字特征作为原型, 使用最近邻分类方法对拓片甲骨文进行识别, 不仅可以提高已知(已训练)类别的识别性能, 还可以对没有训练样本的拓片甲骨文字进行增量识别(使用临摹甲骨文字原型). 根据已有资料来看, 本文工作首先在甲骨文字识别中提出跨模态学习方法, 通过利用临摹甲骨文字明显提高了拓片甲骨文字的识别精度, 并且可以实现对无训练样本的新类别拓片甲骨文字增量识别.

本文接下来的组织结构如下: 第1节主要介绍与本文研究相关的甲骨文字识别和草图识别, 以及深度度量学习和跨模态特征学习; 第2节介绍本文提出的跨模态甲骨文字识别方法; 第3节介绍实验设置和实验结果及分析; 第4节给出全文总结.

1 相关工作

1.1 基于深度学习的甲骨文字识别和草图识别

甲骨文字带有明显的图画痕迹, 因而也可以被看作是一种草图. 早期的草图识别工作, 一般都是先提取一些人工设计的特征, 例如形状上下文特征(Shape context, SC)[8]、方位形状直方图(Histogram of orientation shape context, HOOSC)[9]等; 然后将这些特征送入支撑向量机(Support vector machine, SVM)[10]等分类器进行识别.

随着深度学习的发展, 研究者们开始采用深度学习方法进行草图识别. Yu等[11]提出了一种多尺度、多架构的CNN框架以及两种新颖的数据增强策略, 通过基于联合贝叶斯的方案对多个子网络进行融合后, 得到了较高识别性能. Creswell等[12]提出了一种基于生成对抗神经网络(Generative adversarial networks, GAN)[13]的CNN特征提取器. 其采取了一种无监督的方式来训练生成对抗神经网络, 使之能够生成可以以假乱真的数据, 然后把去掉最后一层全连接层的网络作为特征提取器. 该方法在一个商业图标数据集上进行草图检索实验, 取得了不错的效果.

相对而言, 基于深度学习的甲骨文字识别的工作目前十分稀缺. Guo等[5]提出了一种基于CNN的甲骨文字分类方法, 他们设计了一种多层次特征融合的表示方法, 继而通过结合CNN提高了识别精度. 然而, 他们在实验中丢掉了小样本量的类别, 仅仅在一个样本量类间分布均衡且类别数较小的数据集上进行了相关实验.

1.2 深度度量学习

度量学习的目标是学习一个可以衡量样本间相似性的度量方法[14−16]. 深度度量学习是指用一个深度神经网络(如CNN)来建模上述度量函数或特征空间. 早期的深度度量学习是应用于已经人工设计好的特征上的, 而近年来的工作主要用CNN来建模从特征提取到度量函数的整个流程. Schroff等[17]提出了一个框架FaceNet, 基于CNN直接将人脸数据映射到一个欧氏距离度量空间. 在该度量空间内, 人脸验证或者聚类等问题便可以基于距离度量的方法较为简单地实现. 在训练过程中, FaceNet基于两个对应同一人的人脸图像与一个他人的人脸图像构建三元组来对CNN进行训练, 并提出了一个动态的三元组选择方法.

1.3 基于领域自适应的共享特征空间学习

计算机视觉和模式识别任务中经常遇到一类跨领域问题: 有两类数据, 一类数据有标签信息, 而另一类没有标签信息或者标签信息较少. 已知这两类数据非常相关的情况下, 如何利用有标签的数据学习得到一个可以应用于无/少标签数据的模型, 这就是典型的领域自适应问题[18−20], 其中有标签的数据与无标签的数据对应的域分别称为“源域”与“目标域”. 领域自适应方法是解决跨模态共享特征空间学习、实现跨模态识别的一种有效方法.

如果直接在源域训练模型并将其应用于目标域, 效果往往不好. 这是因为尽管两个域具有较强相关性, 但两者的特征分布还是会存在一定的差别. 为了消除这个差别, 近年来的主流方法大多是基于对抗学习的方法[21−23], 即通过对抗训练的方式使得两个域上学到的特征服从相同的分布. 通过该方式, 实现了源域上训练得到的模型在目标域上应用的目的. 本文提出的跨模态甲骨文字识别方法中, 同样基于对抗训练的思路, 将拓片甲骨文字映射到与临摹甲骨文字相同的特征空间中, 并约束二者服从尽可能相似的分布. 为了令对抗训练的过程更加稳定, 我们采用了Wasserstein GAN[24]的框架与训练技巧[25].

2 基于跨模态学习的甲骨文字识别

图3为本文方法的总体框架图. 该框架包括一个基于CNN的临摹甲骨文字特征编码器、一个拓片甲骨文字特征编码器, 通过学习共享特征空间, 实现跨模态分类. 由于临摹甲骨文字有更多样本和类别, 本方法首先基于单模态度量学习的方式对临摹甲骨文字编码器进行训练, 将临摹甲骨文字图像映射到一个特征空间. 然后, 基于领域自适应与度量学习算法训练拓片甲骨文字编码器(网络结构和临摹甲骨文字编码器相同, 而参数不同), 将拓片甲骨文字映射到同样的共享特征空间. 由于拓片甲骨文字样本的特点(类别数大、类内样本不均衡、噪声严重等), 采用Softmax输出的神经网络由于输出概率的闭集特性, 难以推广到新类别进行增类识别. 因此, 我们采用扩展性更强、更接近人类模式识别方式的原型分类器: 在上述深度神经网络(CNN)学习得到的特征空间内, 用最近邻分类得到识别结果. 最近邻分类所使用的原型来自于与待识别输入(拓片甲骨文字)不同的模态(临摹甲骨文字), 故而称为“跨模态最近邻分类”. 此外, 通过跨模态最近邻分类与Softmax分类进行结合的方式, 可以进一步提升拓片甲骨文字的识别性能. 同时, 只要提供相应的临摹甲骨文字样本, 便可以基于跨模态最近邻分类识别新增类别的拓片甲骨文字.

图3 基于跨模态深度度量学习的拓片甲骨文字识别 (Fig. 3 Oracle character recognition based on cross-modal deep metric learning)

下面对临摹甲骨文字编码器的训练算法、拓片甲骨文字编码器的训练算法和跨模态识别方法进行详细阐述.

2.1 临摹甲骨文字编码器训练

2.1.1 临摹甲骨文字编码器

考虑识别性能, 我们采用视觉识别中常用且性能优良的DenseNet作为甲骨文字识别的编码器[26], 如图4所示. 编码器的输入是一个64×64的单通道图像, 输入图像先经过一个卷积层, 再先后经过4个Dense block, 相邻的Dense block之间会基于Transition对特征进行下采样. Transition由Batch Normalization层、输入输出通道数相等的1×1卷积层以及一个2×2的平均池化层组成. 最后一个Transition的输出经过Dropout层[26]后作为全连接层的输入.
全连接层的输出为128维的向量, 经过L2范数归一化后便是整个网络的输出特征.

图4 甲骨文字编码器结构 (Fig. 4 Embedding of oracle character images; 结构依次为Convolution、4个Dense block与Transition、Dropout、Linear、L2 norm、Embedding)

上述每一个Dense block内, 各个层之间的特征会采用稠密连接的方式, 即第l层的输入就是前l−1层的输出, 每个Dense block包含9个卷积层, 且每层的输出特征通道数均为6, 第l层的输出O_l可以形式化表达为:

$$O_l = H_l([O_0, O_1, \cdots, O_{l-1}])$$

其中, [·]表示特征的连接操作, H_l表示由Batch normalization (BN)[27]、ReLU[28]以及一个3×3卷积层组成的复合函数.

2.1.2 训练算法

给定一个临摹甲骨文字图像x, 它会由临摹甲骨文字编码器映射到特征空间f(x), 特征间的欧氏距离衡量了临摹甲骨文字图像间的差异. 由于临摹甲骨文字编码器(图4)的输出经过了L2范数归一化, 故而f(x)满足‖f(x)‖₂ = 1. 在距离度量学习框架下, 本文使用三元组损失函数[17]来优化特征表示. 具体来说, 第i个三元组包括一个锚点x_i^a、一个与锚点同类的正样本x_i^p、一个与锚点异类的负样本x_i^n. 我们的编码器优化目标是同类样本更近、异类样本更远(如图5所示):

$$\|f(x_i^a)-f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a)-f(x_i^n)\|_2^2,\quad \forall (x_i^a, x_i^p, x_i^n)\in\tau \tag{2}$$

其中, α为超参, 表示三元组中锚点与正样本、负样本特征距离的间隔, 本文中α设定为0.2(与Schroff等[17]的设置保持一致), τ表示容量为N的三元组集合, 故而i = 1, 2, ···, N.

图5 三元组损失函数的学习目标 (Fig. 5 Learning objective of triplet loss function)

由于训练的过程中, 大部分三元组很快就会满足式(2), 不会再对编码器参数的更新产生影响, 故而为了使得训练更加高效, 在训练过程中会动态选择难以区分的三元组, 即不满足式(2)的三元组. 在整个训练过程中, 三元组的构建分为两个阶段:

阶段1. 由于网络参数随机初始化后, 大部分三元组仍然不满足式(2), 所以在第一阶段三元组采用完全随机的策略进行构建.

阶段2. 阶段一的训练收敛后, 已经有大量的三元组不再影响网络参数的更新, 此时开始动态地选择难以区分正负样本的三元组来训练网络. 具体来说, 由于网络基于批优化进行训练, 所以我们可以令每批(Mini-batch)都由来自不同类别的若干锚点−正样本对(x_i^a, x_i^p)组成, 并选择和锚点距离最近的异类样本作为负样本x_i^n, 继而构成三元组(x_i^a, x_i^p, x_i^n).

2.2 拓片甲骨文字编码器训练

拓片甲骨文字编码器的网络结构与临摹甲骨文字编码器(见第2.1.1节)完全一致. 其训练过程分为两个步骤: 先基于跨模态对抗训练进行领域自适应, 使同类的不同模态文字具有相似的特征分布; 然后基于度量学习对特征进行修正, 使不同模态的文字特征可以更好地度量欧氏距离.

2.2.1 基于对抗训练的领域自适应

如图6所示, 对每个类别c均引入一个判别器D_c(x)与拓片甲骨文字编码器进行对抗训练, 各个类别的判别器共享最后一层之前的网络参数. 网络输入为128维的多模态甲骨文字特征, 经过两层以LeakyReLU为激活函数的全连接层后, 再基于全连接回归对应各个类别的D_c(x).

图6 判别器的网络结构 (Fig. 6 Network structure of discriminator)

对于第c个类别, 记对应该类别的拓片甲骨文字和临摹甲骨文字的后验概率分别为P_g^c和P_r^c. 对抗训练过程中, 判别器D_c(x)的目标是将P_g^c与P_r^c进行区分, 而甲骨文字编码器的目标是令判别器无法区分P_g^c与P_r^c, 通过两者的博弈最终达到P_g^c与P_r^c服从相同分布的目标. 我们采用Wasserstein GAN[24−25]的训练框架可以令上述对抗训练过程更加稳定, 对应的损失函数为:

$$L_D = \mathbb{E}_{\tilde x\sim P_g^c}[D_c(\tilde x)] - \mathbb{E}_{x\sim P_r^c}[D_c(x)] + \lambda\,\mathbb{E}_{\hat x\sim P_{\hat x}^c}\big[(\|\nabla_{\hat x}D_c(\hat x)\|_2 - 1)^2\big] \tag{3}$$

其中, P_x̂^c表示第c个类别的临摹甲骨文字与拓片甲骨文字在特征空间的插值所对应的概率分布, ∇_x̂ D_c(x̂)表示D_c(x)对于输入x̂的梯度向量.

基于损失函数(3), 判别器D_c(x)经过训练后会使得E_{x̃∼P_g^c}[D_c(x̃)]相比于E_{x∼P_r^c}[D_c(x)]更小, 且在正则化项E_{x̂∼P_x̂^c}[(‖∇_x̂ D_c(x̂)‖₂−1)²]的约束下, D_c(x)在特征空间会平滑地变化. 故而, 以增大E_{x̃∼P_g^c}[D_c(x̃)]为目标训练拓片甲骨文字编码器可以令P_g^c向P_r^c“靠拢”. 因而在训练判别器D_c(x)后, 拓片甲骨文字编码器基于以下损失函数进行优化:

$$L_G = -\mathbb{E}_{\tilde x\sim P_g^c}[D_c(\tilde x)] \tag{4}$$

基于损失函数(3)与(4)的迭代对抗训练, 可以令P_g^c与P_r^c服从近似相同的分布.

2.2.2 基于度量学习的特征修正

第2.2.1节的领域自适应可以使得拓片甲骨文字与临摹甲骨文字在每个类内都有着相似的特征分布. 然而该方法并不能保证拓片甲骨文字与异类临摹甲骨文字具有足够的特征分布差异, 直接使用上述特征进行跨模态最近邻分类不一定能得到正确分类. 举例说明如下.

图7(a)表示未经训练的甲骨文字特征分布, 其中每个点代表一个样本, 包括属于A、B两类的临摹甲骨文字和拓片甲骨文字. 此时拓片甲骨文字编码器尚未训练, 故而拓片甲骨文字特征近似随机分布. 图7(b)表示经过对抗训练后的特征分布, 每个类别的拓片甲骨文字样本会服从以同类临摹甲骨文字为中心的类高斯分布. 然而此时由于没有利用异类临摹甲骨文字样本进行判别学习, 部分样本会越过最近邻分类的“分类面”落入其他类别所在的区域. 我们在本文采用度量学习的方式, 可以使拓片甲骨文字的特征与同类临摹甲骨文字特征更近、与异类临摹甲骨文字特征更远, 如图7(c)所示, 此时基于跨模态最近邻分类可以达到更好的效果.

图7 不同阶段特征分布示意图 (Fig. 7 Feature distributions in different stages: (a) 初始特征分布; (b) 对抗训练后的特征分布; (c) 经过度量学习修正后的特征分布)

此处的度量学习方法与第2.1.2节类似, 同样是基于三元组式(2). 三元组包含一个拓片甲骨文字样本(锚点样本)、一个与锚点同类的临摹甲骨文字样本(正样本)以及一个与锚点异类的临摹甲骨文字样本(负样本). 以式(2)为目标对拓片甲骨文字编码器进行训练, 以减小正样本与锚点的距离、同时增大负样本与锚点的距离. 与第2.1.2节不同, 由于拓片甲骨文字编码器已经进行过对抗训练, 多数三元组都满足式(2), 故而度量学习的训练过程略过阶段1, 仅包括阶段2.

我们利用反卷积方法对临摹甲骨文和拓片甲骨文特征进行可视化, 如图8所示. 对于128维特征的每个维度, 我们均可获得一个与输入图像同尺寸的特征图, 为了更清晰地观察特征的分布区域, 我们将128个特征图通过在每个空间位置求和的方式融合为一个特征图.
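式(2)的三元组损失与阶段2的“最近异类负样本”挖掘可用如下Python/NumPy草图表示, 仅说明计算方式(实际训练中特征由编码器给出并已做L2归一化), 并非原文训练代码:

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # Hinge form of Eq. (2): d(a,p) + alpha < d(a,n), summed over triplets.
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)
    d_an = np.sum((f_a - f_n) ** 2, axis=1)
    return np.maximum(0.0, d_ap - d_an + alpha).sum()

def hardest_negatives(f_anchor, y_anchor, f_batch, y_batch):
    # Stage-2 mining: for each anchor pick the nearest sample of a
    # different class within the mini-batch as its negative.
    d = ((f_anchor[:, None, :] - f_batch[None, :, :]) ** 2).sum(-1)
    d[y_anchor[:, None] == y_batch[None, :]] = np.inf   # mask same class
    return f_batch[d.argmin(axis=1)]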
可以看出, 无论是临摹甲骨文还是拓片甲骨文, 学到的共享特征均集中于文字区域(而不是背景区域), 反映了文字本质结构特点.

图8 不同模态的甲骨文特征可视化 (Fig. 8 Visualization of oracle character features with different modals)

2.3 基于跨模态最近邻分类的拓片甲骨文字识别

以临摹甲骨文字的特征为原型, 检索待识别的拓片甲骨文字特征, 距离最近的临摹甲骨文字的类别就是拓片甲骨文字的识别结果. 只要存在对应类别的临摹甲骨文字, 就可以对拓片甲骨文字进行识别, 所以可以实现新类拓片甲骨文字的增量识别(即使拓片甲骨文字没有训练过).

针对已知类识别, 为了进一步提高识别性能, 我们可以采用单模态CNN模型和跨模态最近邻分类相结合的方法. 单模态CNN的网络结构与甲骨文字编码器基本一致(仅将图4最后的全连接层输出维度设置为与类别数相等, 并且将L2归一化层替换为Softmax层). 这里分两步分类: 当单模态CNN的分类置信度较高时(类别置信度最大值高于一个阈值), 直接输出CNN的分类结果; 否则, 通过跨模态最近邻分类给出识别结果. 这样很好地利用了两个分类器的互补性. 为了提高最近邻分类器的计算效率, 原型检索仅仅在单模态CNN输出置信度最高的若干类别中进行.

3 实验

3.1 数据集介绍

本文使用的甲骨文字数据集由安阳师范学院甲骨文信息处理实验室提供. 拓片甲骨文字[1]包含样本数较多的241个类别, 共295 466个样本, 其中每个类别最少16个、最多25 898个、平均1 226个样本, 样本量的类间分布不均衡, 如图9所示. 本实验会对图像进行缩放使得长边为64像素, 然后将其置于一个大小为64×64的黑色背景图像的中心. 临摹甲骨文字数据集[26]包含2 583个类别, 共39 062个样本, 其中每个类别最少2个、最多287个、平均16个样本. 我们使用了其中241个对应于拓片甲骨文字的类别进行实验. 其余类别的样本可用于拓片甲骨文字增量识别(也就是只要有临摹文字样本的类别, 就可以识别对应的拓片甲骨文字), 但由于这些类别没有拓片甲骨文字测试样本, 我们没有进行评测, 而是用241类拓片甲骨文字样本中的41类进行新类的增量识别评测(详见第3.4节), 已知类别识别的数据是全部241类拓片甲骨文字样本(详见第3.2节、第3.3节).

图9 拓片甲骨文字中241个类别的样本个数分布 (Fig. 9 Sample distribution of oracle characters scanned from bones and shells)

3.2 基于对抗训练的领域自适应

如图10所示, 对于拓片甲骨文字, 每一类内往往包含了多种字形. 如果在特征空间, 某拓片甲骨文字与某临摹甲骨文字的特征距离十分相近, 他们对应的图像应该不仅仅属于同一类, 更应该对应相近的字形. 由于我们只有拓片甲骨文字与临摹甲骨文字的类别信息, 而没有类别下具体的字形标签, 故而在构建三元组的时候无法选择合适的锚点与正样本. 如果锚点与正样本来自同一类下的两个不同字形, 通过优化令二者在度量空间相近是难以做到的. 基于上述分析, 在利用度量学习的方式训练拓片甲骨文字编码器前, 先对拓片甲骨文字特征进行基于对抗训练的领域自适应学习.

图10 拓片甲骨文字类内样本示例, 每一列属于同一类 (Fig. 10 Oracle images with the same characters in each array)

实验设置 训练集的构建方式为(241类):
1) 对于样本量小于900的类别, 随机调取2/3的样本加入训练集.
2) 对于样本量大于等于900的类别, 随机调取600个样本加入训练集.
其余样本作为后续已知类识别实验的测试集.

图11展示了领域自适应后的五组最近邻对. 观察数据可以发现, 进行领域自适应之后, 同一类内不同模态的最近邻对已经在字形上非常相似. 可以看到尽管拓片甲骨文字图像被噪声污染严重, 其字形结构与最近邻的临摹甲骨文字是一致的. 故而可以基于对抗训练的结果, 以上述最近邻对作为三元组中的锚点与正样本进行深度度量学习.

图11 领域自适应后的最近邻对示例 (Fig. 11 Nearest neighbor pairs after domain adaption)

3.3 拓片甲骨文字已知类别识别

这里比较4个分类方法的性能: 单模态最近邻分类器[26]、单模态CNN、跨模态最近邻分类器、融合跨模态信息的CNN. 单模态最近邻分类和单模态CNN的训练集构建与第3.2节一致, 训练样本之外的拓片甲骨文字样本为测试集. 单模态最近邻分类中, 拓片甲骨文字编码器训练方法与第2.1.2节临摹甲骨文字编码器的训练方法一致. 单模态CNN的网络结构与甲骨文字编码器基本一致(如第2.3节所述). 跨模态最近邻分类的原型来自临摹甲骨文字特征, 临摹甲骨文字特征编码器的相关设置和文献[26]相同. 融合跨模态信息的CNN是指CNN分类置信度(输出类别置信度最大值)低于给定阈值τ时采用跨模态最近邻分类, 并在CNN置信度最高的K个类别上搜索最近邻原型. 下面我们首先分别研究阈值τ和参数K对融合跨模态信息的CNN分类性能的影响, 再对4种不同的甲骨文识别方法进行比较.

图12反映了阈值τ对融合跨模态信息的CNN分类性能的影响. 当阈值τ过小时, 主要由CNN分类模块发挥作用. 由于甲骨文字的特点, 即使本实验中数据充足(每个类平均1 226个样本), 单凭该模块也难以达到较高的识别性能. 当置信度阈值过大时, 主要由跨模态最近邻分类发挥作用. 由于对应较高置信度的CNN分类结果往往比较可靠, 如果没有利用该信息, 同样会降低识别性能. 通过曲线可以看到, 本文所采用的阈值0.66是一个相对较好的平衡点, 后续实验的τ值固定为0.66.

图12 置信度阈值与识别精度关系曲线图 (Fig. 12 Relationship between confidence threshold and recognition accuracy; 曲线比较本文方法与基于Softmax的分类)

图13展示了参数K对融合跨模态信息的CNN分类性能的影响. 可以看到, 当K小于4时, 由于考虑的类别数太小, 识别精度会很低. 当K大于4时, 不仅仅最近邻检索效率变低, 算法精度也会逐渐变低. 这说明考虑置信度过低的类别不仅仅会增加最近邻检索的计算量, 同时这些类别会对最近邻
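第2.3节与第3.3节描述的“置信度门控”融合策略(CNN最大置信度不低于τ时直接输出, 否则在CNN置信度最高的K个类别内做跨模态最近邻)可用如下Python/NumPy草图表示; τ=0.66、K=4取自文中报告的设置, 其余接口为本文假设:

import numpy as np

def fused_classify(probs, feat, prototypes, proto_labels, tau=0.66, K=4):
    # probs: CNN softmax vector; feat: L2-normalized embedding of the input;
    # prototypes / proto_labels: handprinted prototype features and classes.
    if probs.max() >= tau:
        return int(probs.argmax())          # confident CNN decision
    topk = np.argsort(probs)[-K:]           # restrict the prototype search
    mask = np.isin(proto_labels, topk)
    d = np.linalg.norm(prototypes[mask] - feat, axis=1)
    return int(proto_labels[mask][d.argmin()])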
复杂网络聚类系数和平均路径长度计算的MATLAB源代码
声明: 文章来自百度用户carrot_hy。
复杂网络的代码总共是三个m文件, 复制如下:

第一个文件:

function [Cp_Global, Cp_Nodal] = CCM_ClusteringCoef(gMatrix, Types)
% CCM_ClusteringCoef calculates clustering coefficients.
% Input:
%   gMatrix   adjacency matrix
%   Types     type of graph: 'binary','weighted','directed','all'(default).
% Usage:
%   [Cp_Global, Cp_Nodal] = CCM_ClusteringCoef(gMatrix, Types) returns
%   clustering coefficients for all nodes "Cp_Nodal" and average clustering
%   coefficient of network "Cp_Global".
% Example:
%   G = CCM_TestGraph1('nograph');
%   [Cp_Global, Cp_Nodal] = CCM_ClusteringCoef(G);
% Note:
%   1) one node has value 0 when it only has a neighbour or none.
%   2) The directed network terms triplets that fulfill the following
%      condition as non-vacuous: j->i->k and k->i->j; triplets that don't
%      satisfy it are vacuous, e.g.: j->i, k->i and i->j, i->k. The closed
%      triplets are only j->i->k == j->k and k->i->j == k->j.
%   3) 'ALL' type network code from Mika Rubinov's BCT toolkit.
% Refer:
% [1] Barrat et al. (2004) The architecture of the complex weighted networks.
% [2] Wasserman, S., Faust, K. (1994) Social Network Analysis: Methods and
%     Applications.
% [3] Tore Opsahl and Pietro Panzarasa (2009). "Clustering in Weighted
%     Networks". Social Networks 31(2).
% See also CCM_Transitivity
% Written by Yong Liu, Oct, 2007
% Center for Computational Medicine (CCM),
% National Laboratory of Pattern Recognition (NLPR),
% Institute of Automation, Chinese Academy of Sciences (IACAS), China.
% Revised by Hu Yong, Nov, 2010
% E-mail:
% Based on Matlab 2006a
% $Revision: , Copyright (c) 2007

error(nargchk(1,2,nargin,'struct'));
if(nargin < 2), Types = 'all'; end

N = length(gMatrix);
gMatrix(1:(N+1):end) = 0;   % Clear self-edges
Cp_Nodal = zeros(N,1);      % Preallocate

switch(upper(Types))
case 'BINARY'    % Binary network
    gMatrix = double(gMatrix > 0);   % Ensure binary network
    for i = 1:N
        neighbor = (gMatrix(i,:) > 0);
        Num = sum(neighbor);         % number of neighbor nodes
        temp = gMatrix(neighbor, neighbor);
        if(Num > 1), Cp_Nodal(i) = sum(temp(:))/Num/(Num-1); end
    end

case 'WEIGHTED'  % Weighted network -- arithmetic mean
    for i = 1:N
        neighbor = (gMatrix(i,:) > 0);
        n_weight = gMatrix(i,neighbor);
        Si = sum(n_weight);
        Num = sum(neighbor);
        if(Num > 1)
            n_weight = ones(Num,1)*n_weight;
            n_weight = n_weight + n_weight';
            n_weight = n_weight.*(gMatrix(neighbor, neighbor) > 0);
            Cp_Nodal(i) = sum(n_weight(:))/(2*Si*(Num-1));
        end
    end
    % Weighted network -- geometric mean
    % A = (gMatrix ~= 0);
    % G3 = diag((gMatrix.^(1/3))^3);
    % A(A == 0) = inf;   % close-triplet no exist, let CpNode = 0 (A = inf)
    % CpNode = G3./(A.*(A-1));

case 'DIRECTED'  % Directed network
    for i = 1:N
        inset  = (gMatrix(:,i) > 0);   % in-nodes set
        outset = (gMatrix(i,:) > 0)';  % out-nodes set
        if(any(inset & outset))
            allset = and(inset, outset);
            % Ensure aji*aik > 0, j belongs to inset, and k belongs to outset
            total = sum(inset)*sum(outset) - sum(allset);
            tri   = sum(sum(gMatrix(inset, outset)));
            Cp_Nodal(i) = tri./total;
        end
    end
    % Directed network -- clarity format (from Mika Rubinov, UNSW)
    % G  = gMatrix + gMatrix';             % symmetrized
    % D  = sum(G,2);                       % total degree
    % g3 = diag(G^3)/2;                    % number of triplets
    % D(g3 == 0) = inf;                    % 3-cycles no exist, let Cp = 0
    % c3 = D.*(D-1) - 2*diag(gMatrix^2);   % number of all possible 3-cycles
    % Cp_Nodal = g3./c3;

% Note: Directed & weighted network (from Mika Rubinov)
case 'ALL'       % All types
    A  = (gMatrix ~= 0);                  % adjacency matrix
    G  = gMatrix.^(1/3) + (gMatrix.').^(1/3);
    D  = sum(A + A.', 2);                 % total degree
    g3 = diag(G^3)/2;                     % number of triplets
    D(g3 == 0) = inf;                     % 3-cycles no exist, let Cp = 0
    c3 = D.*(D-1) - 2*diag(A^2);
    Cp_Nodal = g3./c3;

otherwise        % Error Msg
    error('Type only four: "Binary","Weighted","Directed",and "All"');
end

Cp_Global = sum(Cp_Nodal)/N;

第二个文件:

function [D_Global, D_Nodal] = CCM_AvgShortestPath(gMatrix, s, t)
% CCM_AvgShortestPath generates the shortest distance matrix of source nodes
% indices s to the target nodes indices t.
% Input:
%   gMatrix   symmetric binary connect matrix or weighted connect matrix
%   s         source nodes, default is 1:N
%   t         target nodes, default is 1:N
% Usage:
%   [D_Global, D_Nodal] = CCM_AvgShortestPath(gMatrix) returns the mean
%   shortest-path length of the whole network D_Global, and the mean
%   shortest-path length of each node in the network
% Example:
%   G = CCM_TestGraph1('nograph');
%   [D_Global, D_Nodal] = CCM_AvgShortestPath(G);
% See also dijk, MEAN, SUM
% Written by Yong Liu, Oct, 2007
% Modified by Hu Yong, Nov 2010
% Center for Computational Medicine (CCM),
% Based on Matlab 2008a
% $Revision: , Copyright (c) 2007

% ###### Input check #########
error(nargchk(1,3,nargin,'struct'));
N = length(gMatrix);
if(nargin < 2 | isempty(s)), s = (1:N)'; else s = s(:); end
if(nargin < 3 | isempty(t)), t = (1:N)'; else t = t(:); end

% Calculate the shortest-path from s to all nodes
D = dijk(gMatrix,s);
% D(isinf(D)) = 0;
D = D(:,t);   % To target nodes

D_Nodal  = (sum(D,2)./sum(D>0,2));
% D_Nodal(isnan(D_Nodal)) = [];
D_Global = mean(D_Nodal);

第三个文件:

function D = dijk(A,s,t)
% DIJK Shortest paths from nodes 's' to nodes 't' using Dijkstra algorithm.
% D = dijk(A,s,t)
%   A = n x n node-node weighted adjacency matrix of arc lengths
%       (Note: A(i,j) = 0   => Arc (i,j) does not exist;
%              A(i,j) = NaN => Arc (i,j) exists with 0 weight)
%   s = FROM node indices
%     = [] (default), paths from all nodes
%   t = TO node indices
%     = [] (default), paths to all nodes
%   D = |s| x |t| matrix of shortest path distances from 's' to 't'
%     = [D(i,j)], where D(i,j) = distance from node 'i' to node 'j'
%
% (If A is a triangular matrix, then the computationally intensive node
% selection step is not needed since the graph is acyclic (triangularity is
% a sufficient, but not a necessary, condition for a graph to be acyclic)
% and A can have non-negative elements)
%
% (If |s| >> |t|, then DIJK is faster if DIJK(A',t,s) is used, where D is
% now transposed and P now represents successor indices)
%
% (Based on Fig. 4.6 in Ahuja, Magnanti, and Orlin, Network Flows,
% Prentice-Hall, 1993, p. 109.)
% Copyright (c) 1998-2000 by Michael G. Kay
% Matlog Version 29-Aug-2000
%
% Modified by JBT, Dec 2000, to delete paths

% Input Error Checking ******************************************************
error(nargchk(1,3,nargin,'struct'));
[n,cA] = size(A);
if nargin < 2 | isempty(s), s = (1:n)'; else s = s(:); end
if nargin < 3 | isempty(t), t = (1:n)'; else t = t(:); end

if ~any(any(tril(A) ~= 0))      % A is upper triangular
    isAcyclic = 1;
elseif ~any(any(triu(A) ~= 0))  % A is lower triangular
    isAcyclic = 2;
else                            % Graph may not be acyclic
    isAcyclic = 0;
end

if n ~= cA
    error('A must be a square matrix');
elseif ~isAcyclic & any(any(A < 0))
    error('A must be non-negative');
elseif any(s < 1 | s > n)
    error(['''s'' must be an integer between 1 and ',num2str(n)]);
elseif any(t < 1 | t > n)
    error(['''t'' must be an integer between 1 and ',num2str(n)]);
end
% End (Input Error Checking) ************************************************

A = A';   % Use transpose to speed up FIND for sparse A

D = zeros(length(s),length(t));
P = zeros(length(s),n);

for i = 1:length(s)
    j = s(i);

    Di = Inf*ones(n,1); Di(j) = 0;

    isLab = logical(zeros(length(t),1));
    if isAcyclic == 1
        nLab = j - 1;
    elseif isAcyclic == 2
        nLab = n - j;
    else
        nLab = 0;
        UnLab = 1:n;
        isUnLab = logical(ones(n,1));
    end

    while nLab < n & ~all(isLab)
        if isAcyclic
            Dj = Di(j);
        else   % Node selection
            [Dj,jj] = min(Di(isUnLab));
            j = UnLab(jj);
            UnLab(jj) = [];
            isUnLab(j) = 0;
        end

        nLab = nLab + 1;
        if length(t) < n, isLab = isLab | (j == t); end

        [jA,kA,Aj] = find(A(:,j));
        Aj(isnan(Aj)) = 0;

        if isempty(Aj), Dk = Inf; else Dk = Dj + Aj; end

        P(i,jA(Dk < Di(jA))) = j;
        Di(jA) = min(Di(jA),Dk);

        if isAcyclic == 1       % Increment node index for upper triangular A
            j = j + 1;
        elseif isAcyclic == 2   % Decrement node index for lower triangular A
            j = j - 1;
        end

        % disp(num2str(nLab));
    end
    D(i,:) = Di(t)';
end
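作为交叉验证, 下面用Python的networkx对无向二值网络计算同类指标。注意CCM_AvgShortestPath是先对每个源节点取可达目标的平均、再求总平均, 在非连通图上与networkx按连通图计算的结果处理方式略有不同:

import networkx as nx

G = nx.erdos_renyi_graph(50, 0.1, seed=1)
print(nx.average_clustering(G))                 # cf. Cp_Global, Types='binary'
if nx.is_connected(G):
    print(nx.average_shortest_path_length(G))   # cf. D_Global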
一种新的基于Kruppa方程的摄像机自标定方法
(雷成等, 《计算机学报》, 2003年)

2 基于Kruppa方程的摄像机自标定技术的简要回顾

本文中, 假设摄像机的模型是常用的针孔模型. 因此从三维空间点 X = (x, y, z, 1)ᵀ 到二维图像点 m = (u, v, 1)ᵀ 的成像关系可以表示为

$$m \simeq K[R \mid t]X \tag{1}$$

其中

$$K = \begin{bmatrix} f_u & \gamma & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

是摄像机的内参数矩阵.

在这种摄像机自标定方法中, 人们是利用由IAC的对极几何关系所推导出的Kruppa方程所提供的关于IAC的约束, 通过确定IAC来标定摄像机内参数的. 但实践中发现基于Kruppa方程的摄像机标定方法并不十分鲁棒, 为此人们又提出了很多更为鲁棒的自标定算法, 但大多需要作一些基于摄像机先验知识的假设, 或者对摄像机的运动, 或对所拍摄的场景有一些特殊的要求. 而在某些情况下, 我们又不可避免地需要利用Kruppa方程来进行摄像机标定, 因此对如何提高基于Kruppa方程的摄像机标定算法的鲁棒性和实用性仍有着很重要的意义.

Kruppa方程[3,4]如下: 记 C = KKᵀ, F 为两幅图像间的基础矩阵, e′ 为第二幅图像中的极点, 则 FCFᵀ 与 [e′]ₓC[e′]ₓᵀ 相差一个比例因子, 即对应元素的比值相等:

$$\cdots = \frac{(FCF^{\mathrm T})_{22}}{([e']_\times C[e']_\times^{\mathrm T})_{22}} = \frac{(FCF^{\mathrm T})_{23}}{([e']_\times C[e']_\times^{\mathrm T})_{23}} = \cdots$$

设 F 的奇异值分解为 F = U diag(σ₁ᶠ, σ₂ᶠ, 0) Vᵀ, uᵢ、vᵢ 分别为 U、V 的第 i 列, s′ 为未知比例因子, 则Kruppa方程可写成矩阵形式:

$$\begin{bmatrix} (\sigma_1^F)^2\, v_1^{\mathrm T}Cv_1 & \sigma_1^F\sigma_2^F\, v_1^{\mathrm T}Cv_2 & 0 \\ \sigma_1^F\sigma_2^F\, v_2^{\mathrm T}Cv_1 & (\sigma_2^F)^2\, v_2^{\mathrm T}Cv_2 & 0 \\ 0 & 0 & 0 \end{bmatrix} = s'\begin{bmatrix} u_2^{\mathrm T}Cu_2 & -u_2^{\mathrm T}Cu_1 & 0 \\ -u_1^{\mathrm T}Cu_2 & u_1^{\mathrm T}Cu_1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
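上面的SVD形式可以直接写成对 C = KKᵀ 的多项式约束。下面给出一个Python/NumPy示意函数: 给定基础矩阵F和C的猜测值, 消去比例因子 s′ 后返回交叉相乘得到的残差(接口为本文假设的示意, 并非原文实现):

import numpy as np

def kruppa_residuals(F, C):
    # Simplified Kruppa constraints from the SVD of F (notation as above).
    U, S, Vt = np.linalg.svd(F)
    u1, u2 = U[:, 0], U[:, 1]
    v1, v2 = Vt[0, :], Vt[1, :]
    s1, s2 = S[0], S[1]
    num = [s1 * s1 * (v1 @ C @ v1),
           s1 * s2 * (v1 @ C @ v2),
           s2 * s2 * (v2 @ C @ v2)]
    den = [u2 @ C @ u2, -(u2 @ C @ u1), u1 @ C @ u1]
    # The three ratios num[k]/den[k] share the unknown scale s';
    # cross-multiplying pairs removes s' and yields the constraints.
    return [num[a] * den[b] - num[b] * den[a]
            for a in range(3) for b in range(a + 1, 3)]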
微分滤波器PPT课件
模式识别国家重点实验室
中国科学院自动化研究所
f (x)
f (x)h(x)
Edge detection with differential operators: 1-D signals

Maxima of the first derivative:
    Edge = { x | x = arg max f'(x) }

Zero crossings of the second derivative:
    Edge = { x | f''(x) = 0, zero crossings }

Note: f''(x) = 0 alone is not enough, since a constant function also gives 0; there must be a sign change.
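Both criteria in a small self-contained sketch (my own illustration, using a smoothed step edge at x = 0):

x  = linspace(-5, 5, 1001);
f  = 1 ./ (1 + exp(-3*x));                   % smoothed step edge at x = 0
dx = x(2) - x(1);

df  = gradient(f, dx);                       % first derivative f'
d2f = gradient(df, dx);                      % second derivative f''

[~, iMax] = max(df);                         % criterion 1: maximum of f'
edge1 = x(iMax);                             % close to 0

sc = find(d2f(1:end-1) .* d2f(2:end) < 0);   % criterion 2: sign change of f''
edge2 = x(sc(1));                            % zero crossing, not merely f'' == 0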
Computing a local mean smooths the noise.
Sobel operator

• The Sobel operator approximates the first derivative: denoising plus edge enhancement, with a larger weight on the 4-neighborhood:

    -1  0  1
    -2  0  2
    -1  0  1
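A minimal sketch of the kernel in use (the test image name is only illustrative; conv2 is base MATLAB):

Gx = [-1 0 1; -2 0 2; -1 0 1];            % the horizontal-gradient kernel above
Gy = Gx';                                 % vertical-gradient kernel

I  = im2double(imread('cameraman.tif'));  % any grayscale image works here
Ix = conv2(I, Gx, 'same');                % response to vertical edges
Iy = conv2(I, Gy, 'same');                % response to horizontal edges
M  = sqrt(Ix.^2 + Iy.^2);                 % gradient magnitude
E  = M > 0.5 * max(M(:));                 % crude threshold to localize edges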
• Gray-level image edge extraction, the main idea:
  – suppress noise (low-pass filtering, smoothing, denoising, blurring)
  – enhance edge features (high-pass filtering, sharpening)
  – localize the edges

original image → suppress noise, enhance edges → intermediate result → edge localization → image edges
A novel finger and hand pose estimation technique for real-time hand gesture recognition

Yimin Zhou (a,1), Guolai Jiang (a,b,1), Yaorong Lin (b)
(a) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
(b) School of Electronic and Information Engineering, South China University of Technology, China
Pattern Recognition 49 (2016) 102–114. Received 17 March 2014; received in revised form 8 August 2014; accepted 29 July 2015; available online 8 August 2015. © 2015 Elsevier Ltd. All rights reserved.
(1) The first and second authors contributed equally to the paper.

Abstract: This paper presents a high-level hand feature extraction method for real-time gesture recognition. Firstly, the fingers are modelled as cylindrical objects due to their parallel edge feature. Then a novel algorithm is proposed to directly extract fingers from salient hand edges. Considering the hand geometrical characteristics, the hand posture is segmented and described based on the finger positions, palm center location and wrist position. A weighted radial projection algorithm with the origin at the wrist position is applied to localize each finger. The developed system can not only extract extensional fingers but also flexional fingers with high accuracy. Furthermore, hand rotation and finger angle variation have no effect on the algorithm performance. The orientation of the gesture can be calculated without the aid of the arm direction and it would not be disturbed by the bare arm area. Experiments have been performed to demonstrate that the proposed method can directly extract high-level hand features and estimate hand poses in real-time.

Keywords: computer vision; finger modelling; salient hand edge; convolution operator; real-time hand gesture recognition

1. Introduction

Hand gesture recognition based on computer vision technology has received great interest recently, due to its natural human-computer interaction characteristics. Hand gestures are generally composed of different hand postures and their motions. However, the human hand is an articulated object with over 20 degrees of freedom (DOF) [12], and many self-occlusions occur in its projected images. Moreover, hand motion is often too fast and complicated for current computer image processing speeds. Therefore, real-time hand posture estimation is still a challenging research topic, involving multi-disciplinary work in pattern recognition, image processing, computer vision, artificial intelligence and machine learning. In the history of human-machine interaction, keyboard input with character text output and mouse input with graphic window display are the main traditional interaction forms. With the development of computer techniques, human-machine interaction via hand posture plays an important role in three-dimensional virtual environments. Many methods have been developed for hand pose recognition [3,4,10,18,24,29].

A general framework for vision-based hand gesture recognition is illustrated in Fig. 1. Firstly, the hand is located and segmented from the input image, which can be achieved via skin-color based segmentation methods [27,31] or direct object recognition algorithms. The second step is to extract useful features for static hand posture and motion identification. Then the gesture can be identified via feature matching.
Finally, different human-machine interactions can be applied based on the successful hand gesture recognition.

There are a lot of constraints and difficulties in accurate hand gesture recognition from images, since the human hand is an object with complex and versatile shapes [25]. Firstly, different from less deformable objects such as the human face, the human hand possesses over 20 degrees of freedom plus variations in gesture location and rotation, which makes hand posture estimation extremely difficult. Evidence shows that at least 6-dimensional information is required for basic hand gesture estimation. Occlusion also increases the difficulty of pose recognition: since the involved hand gesture images are usually two-dimensional, some key parts of the hand are occluded in the planar projection due to the various heights of the hand shape. Besides, the impact of complex environments on broadly applied vision-based hand gesture recognition techniques has to be considered: lightness variation and complex backgrounds make hand gesture segmentation more difficult. Up to now, there is no unified definition for dynamic hand gesture recognition, which is also an unsolved problem to accommodate human habits and facilitate computer recognition.

[Fig. 2. Hand gesture models with different complexities: (a) 3D strip model; (b) 3D surface model; (c) paper model [36]; (d) gesture silhouette; and (e) gesture contour.]

It should be noted that the human hand has a deformable shape in front of a camera due to its own characteristics. The extraction of a hand image has to be executed in real-time, independent of the users and device. Human motion reaches speeds of up to 5 m/s for translation and 300°/s for rotation, while the sampling frequency of a digital camera is about 30-60 Hz, which can blur the collected images and degrade further identification. On the other hand, with a hand gesture module added to the system, the number of frames processed per second will be even lower, putting more pressure on the relatively low sampling speed. Moreover, a large amount of data has to be handled in a computer vision system, especially for highly complex, versatile objects; under current hardware conditions, many high-precision recognition algorithms are difficult to run in real-time.

Our developed algorithm focuses on single-camera, real-time hand gesture recognition. Some assumptions are made without loss of generality: (a) the background is not too complex, without large areas of skin-color disturbance; (b) the lighting avoids extremes of too dark or too bright; (c) the palm faces the camera at a distance of roughly 0.5 m or less. These three limitations are not difficult to realize in actual application scenarios.

Firstly, a new finger detection algorithm is proposed. Compared to previous finger detection algorithms, the developed algorithm is independent of the fingertip feature and can extract fingers directly from the main edge of the whole finger.
Considering that each fi nger has two main “parallel ” edges, a fi nger is determined from convolution result of salient hand edge image with a speci fi c operator G. The algorithm can not only extract extensional fi ngers but also fl exional fi ngers with high accuracy, which is the basis for complete hand pose high-level feature extraction. After the fi nger central area has been obtained, the center, orientation of the hand gesture can be calculated. During the procedure, a novel high-level gesture feature extraction algorithm is developed. Through weighted radius projection algorithm, the gesture feature sequence can be extracted and the fi ngers can also be localized from the local maxima of angular projection, thus the gesture can be estimated directly in real-time.The remainder of the paper is organized as follows. Section 2 describes hand gesture recognition procedure and generally used methods. Finger extraction algorithm based on parallel edge characteristics is introduced in Section 3. Salient hand image can also be achieved. The speci fi c operator G and threshold is explained in detail in Section 4. High-level hand feature extraction through convolution is demonstrated in Section 5. Experiments in different scenarios are performed to prove the effectiveness of the proposed algorithm in Section 6. Conclusions and future works are given in Section 7.2. Methods of hand gesture recognition based on computer vision2.1. Hand modellingHand posture modelling plays a key role in the whole hand gesture recognition system. The selection of the hand model is dependent on the actual application environments. The hand model can be categorized as gesture appearance modelling and 3D modelling. Generally used hand gesture models are demonstrated in Fig. 2.3D hand gesture model considers the geometrical structure with histogram or hyperquadric surface to approximate fi nger joints and palm. The model parameters can be estimated from single image or several images. However, the 3D model based gesture modelling has quite a high calculation complexity, and too many linearization and approximation would cause unreliable parameter estimation. As for appearance based gesture models, they are built through appearance characteristics, which have the advantages of less computation load and fast processing speed. The adoption of the silhouette, contour model and paper model can only re fl ect partial hand gesture characteristics. In this paper, based on the simpli fi ed paper gesture model [36], a new gesture model is proposed where each fi nger is represented by extension and fl exion states considering gesture completeness and real-time recognition requirements.Many hand pose recognition methods use skin color-based detection and take geometrical features for hand modelling. Hand pose estimation from 2D to 3D using multi-viewpoint silhouette images is described in [35]. In recent years, 3D sensors, such as binocular cameras, Kinect and leap motion, have been applied for hand gesture recognition with good performance [5]. However, hand gesture recognition has quite a limitation, since 3D sensors are not always available in many systems, i.e., Google Glasses.2.2. Description of hand gesture featureThe feature extraction and matching is the most important component in vision-based hand posture recognition system. In early stage of the hand gesture recognition, colored glove or labeling methods are usually chosen to strengthen the feature in different parts of the hand for extraction and recognition. 
Mechanical gloves can be used to capture hand motion, however, they are rigid with only certain free movements and relatively expensive cost [23]. Compared with the hand recognition methods with additional assistance of data glove or other devices, computervision based hand gesture recognition will need less or no additional device, which is more adaptable and has bright application prospect. A real-time algorithm to track and recognize hand gesture is described for video game in [23]. Only four gestures can be recognized, which has no generality. Here, the hand gesture images without any markers arediscussed for feature extraction.Fig. 1. The general framework of computer based hand posture recognition.Y . Zhou et al. / Pattern Recognition 49 (2016) 102–114 105The generally used image feature for hand gesture recognition can be divided into two categories, low-level and high-level, as shown in Fig. 3. The low-level features such as edge, edge orientation, histogram of oriented gradients (HOG) contour/silhouette and Haar feature, are basic computer image characteristics and can be extracted conveniently. However, in actual applications, due to the diversities of hand motions, even for the same gesture, the subtle variation in fi nger angle can result in large difference in the image. With rotational changes in hand gesture, it is much more dif fi cult to recognize gestures with direct adoption of low-level feature matching.Since the skin color is a distinctive cue of hands which is invariant to scale and rotation, it is regarded as one of the key features. Skin color segmentation is widely used for hand localization [16,31]. Skin detection are normally achieved by Bayesian decision theory, Bayesian classi fi er model and training images [22]. Edge is another common feature for model-based matching [36]. Histogram of oriented gradients has been implemented in [30]. Combinations of multiple features can improve the accuracy and robustness of the algorithm [8].Fingertip position, fi nger location and gesture orientation such high-level gesture features are related to the hand structure, which has direct relationship to the hand recognition. Therefore, they can be easily matched for various gesture recognition in real-time. However, this type of features is generally dif fi cult to be extracted accurately.In [1], fi ngertips were located from probabilistic models. The detected edge segments of monochrome images are computed by Hough transform for fi ngertip detection. But the light and brightness would seriously affect the quality of the dealt images and detection result. Fingertip detection and fi nger type determination are studied with a model-based method in [5], which is only applicable for static hand pose recognition. In [26], fi ngertips are found by fi ngertip masks considering their characteristics, and they can be located via feature matching. However, objects which share similar fi ngertip shapes could result in a misjudgment.Hand postures can be recognized through the geometric features and external shapes of the palm and fi ngers [4]. It proposes a prediction model for showing the hand postures. The measurement error would be large, however, because of the complexity of the hand gestures and diversi fi ed hand motions. In [3], palm and fi ngers were detected by skin-colored blob and ridge features. In [11], a fi nger detection method using grayscale morphology and blob analysis is described, which can be used for fl exional fi nger detection. 
In [9,13], high-level hand features were extracted by analyzing hand contour. 2.3. Methods of hand gesture segmentationFast and accurate hand segmentation from image sequences is the fundamental for gesture recognition, which have direct impact on the followed gesture tracking, feature extraction and fi nal recognition performance. Many geometrical characteristics can be used to detect hand existence in image sequences via projection, such as contour, fi ngertip and fi nger orientation [26]. Other non-geometrical features, i.e., color [2], strip [34] and motion can also be used for hand detection. Due to the complex background, unpredictable environmental factors and diversi fi ed hand shapes, hand gesture segmentation is still an open issue.Typical methods for hand segmentation are summarized as follows. Increasing constraints and building hand gesture shape database are usually used for segmentation. Black or white wall and dark color cloth can be applied to simplify the backgrounds. Besides, particular colored gloves can be worn todivide the hand and background through emphasized front view. Although these kinds of methods have good performance, it adds more limitation at the cost of freedom. A database can be built to collect hand sample images at any moment with different positions and ratios for hand segmentation through matching. It is a time consuming process, though, the completeness of thedatabase can never be achieved which has to be updated all the time.Methods of contour tracking include snake-model based segmentation [17], which can track the deformation and non-rigid movement effectively, so as to segment the hand gesture from the backgrounds. Differential method [20] and its improved algorithm can realize the segmentation by the deduction from the object images to the background images. It has a fatal defect that the camera has to be fi xed and the background should be kept invariant during background and hand image extraction.Skin color, assumed as one of the most remarkable surface features of human body, is often applied in gesture segmentation [31]. However, only adoption of this feature would be easily affected by the ambient environmental variations, especially when a large area of skin color disturbance is in the background, i.e., hand gesture overlapped by human face. Motion is another remarkable and easily extracted feature in gesture images. The combination of these two features becomes more and more popular in recent years [15].Depth information (the distance information between the object and the camera) in the object images can also be used for background elimination and segmentation since human hands are the closet objects to the camera. Currently, the normally used depth camera are Swiss Ranger 4000 from Mesa Imaging company, Cam cube 2.0 from PMD Technologies, Kinect from Microsoft and Depth camera from PrimeSense. 2.4. Methods of gesture recognition2.4.1. Methods of static gesture recognitionMethods of static gesture recognition can be classi fi ed into several categories:(1) Edge feature based matching: Gesture recognition based on this type offeature is realized through the calculated relationship between data sets of the features and samples to seek the best matching [22,28,33,36]. Although it is relatively simple for feature extraction and adaptable for complex background and lightness, the data based matching algorithm is quite complicated with heavy computational load and time cost. 
A large amount of templates should be prepared to identify different gestures.(2) Gesture silhouette based matching: Gesture silhouette is normallydenoted as binary images of the wrapped gesture from segmentation. In [20], matching is calculated through the size of the overlapped area between the template and silhouette.Zernike matrix of the images is used to cope with the gesture rotation [14] and feature set is developed for matching. The disadvantages of this type of method are that not all gestures can be identi fi ed with only silhouette and accurate hand segmentation is required.(3) Harr-like feature based recognition: Harr-like feature based Adaboostrecognition algorithm has achieved good performance in face recognition [21]. This method can be used for hand detection [37] and simple gesture recognition [10]. Experiments demonstrate that the method can recognize speci fi c gestures in real-time under complex background environments. However, Harr-like feature based algorithmFig. 3. The generally used feature and classi fi cation for hand gesture recognition.Extraction Identification106 Y. Zhou et al. / Pattern Recognition 49 (2016) 102–114has high requirement on the consistency for the dealt objects, whereas hand gesture has diversi fi ed shape variations. Currently, this type method can only be applied in prede fi ned static gestures recognition.(4)External contour based recognition: External contour is an importantfeature for gesture. Generally speaking, different gestures have different external contours. The curvature of the external contour is varied in different positions of a hand (i.e., curvature is large at fi ngertip). In [9], curvature is analyzed for CSS (Curvature Scale Space) feature extraction to recognize gesture. Fingertip, fi nger root and joint such high-level features can be extracted from contour analysis [13]. A feature sequence is constructed by the distances from the contour points to the center for gesture recognition [32]. This type of method is adaptable for the angle variation between fi ngers but also dependent on the performance of the segmentation.(5)Finger feature based recognition: Finger is the most widely appliedhigh-level feature in hand pose recognition since the location and states of the fi ngers embody the most intuitional characteristics of different gestures. Only the fi nger positions, fi nger states and hand center are located thus simple hand gestures can be determined directly.Several fi ngertip recognition algorithms are compared in [6]. In [26], circular fi ngertip template is used to seek the fi ngertip location and motion tracking. Combined with skin color feature, Blob and Ridge features are used to recognize palm and fi ngers [3]. However, only extensional fi ngers can be recognized via this type of methods.2.4.2. Methods of motion gesture recognitionTime domain models are normally adopted for motion gesture recognition, which includes HMM (Hidden Markov Model) and DTW (Dynamic Time Warping) based methods:(1)HMM-based method: It has achieved good performance in voicerecognition area and applied in gesture recognition as well. Different motion gesture sequences are modelled via HMM, and each gesture is related to a HMM process. HMMbased method can realize recognition through feature matching at each moment, whose training process is a dynamic programming (DP) process. This method can provide time scale invariance and keep the gestures in time sequence. 
However, the training process is time consuming and the selection of its topology structure is determined by the expert experience, i.e., trial and error method used for number of invisible states and transfer states determination.(2)DTW-based method: It is widely used in simple tracking recognitionthrough the difference between the dealt gestures and standard gestures for feature matching at each moment. HMM and DTW are essentially dynamic programming processes, and DTW is the simpli fi ed version of HMM. DTW-based recognition has limitation on the word database applications.In summary, methods of hand modelling, gesture segmentation and feature extraction are discussed. Most used hand gesture recognition methods are also illustrated. The following sections will introduce the proposed algorithm for real-time hand gesture recognition in detail. 3.Finger extraction algorithm based on parallel edge featureThe most notable parts of a hand to differentiate other skin objects, i.e, human face, arm, are fi ngers. As it is known, fi nger feature extraction and matching have great signi fi cance in hand segmentation. Contour can be extracted from silhouette of a hand region as the commonly used feature for hand recognition. Due to nearly 30 degrees of freedom in hand motion, hand image extraction will be executed regarding a hand as a whole. Moreover, arm should be eliminated. It should be noted that occlusion among four fi ngers (except thumb) could frequently occur, especially for fl exional fi ngers.To solve these problems associated with hand image extraction, a model-based approach for fi nger extraction is developed in this paper. It can obviate the fi nger joint location in hand motion and extract fi nger features from the silhouette of the segmented hand region. In complex background circumstances, models with fi xed threshold can result in false detection or detection failure. However, fi xed threshold color model is still selected for segmentation in this paper because of its simplicity, low computational load and invariant properties with regard to the various hand shapes. The threshold is prede fi ned to accommodate the general human hand sizes. Then the selected pixels are transformed from RGB-space to YCbCr-space for segmentation. Finger extraction is explained in detail.3.1.Salient hand gesture edge extraction3.1.1. Finger modellingCombined with the skin, edge and external contour such easily extracted low-level features, a novel fi nger extraction algorithm is proposed based on the approximately parallel fi nger edge appearance. It can detect the states of the extensional and fl exional fi ngers accurately, which can also be used for further high-level feature extraction. It is known that fi ngers are cylindrical objects with nearly constant diameter from root to tip. As for human hand motion, it is almost impossible to move the distal interphalangeal (DIP) joint without moving the adjacent proximal interphalangeal (PIP) joint with no external force assistance and vice versa. Therefore, there is almost a linear relationship between these two types of joints, where the fi nger description can be simpli fi ed accordingly.Firstly, each fi nger is modelled by its edges, which is illustrated in Fig. 
4(a). C_fi, the boundary of the i-th finger, is the composition of arc edges (C_ti, fingertip or joints) and a pair of parallel edges (C_ei ∪ C'_ei, the finger body), described as

C_fi = (C_ei ∪ C'_ei) ∪ Σ_{j=1,2} C_ti,j    (1)

where the arc edge C_ti,j denotes the finger either in the extensional (j = 1) state or in the flexional (j = 2) state (see the two green circles in Fig. 4(b)).

The finger center line (FCL), C_FCLi, is introduced as the center line of the parallel finger edges to represent the main finger body. The distance between finger edges is defined as 2d, which is the averaged diameter of all the fingers. The fingertip/joint center O_ti is located at the end of the FCL, and it is the center of the arc curve C_ti as well. The finger central area along C_FCLi will be extracted for finger detection. Compared with many algorithms based on the fingertip feature [26], the proposed method is more reliable and can also detect flexional fingers successfully.

[Fig. 5. The diagram of the procedure for extracting the salient hand edge.]

3.1.2. Structure of hand gesture edge

The remarkable hand edge C_hand of a gesture can provide a concise and clear label for different hand gestures. Considering the hand structure characteristics and the assumed constraints, C_hand consists of the following curves:

C_hand = Σ_{i=1}^{5} C_fi + C_p + C_n    (2)

where C_p is the palm/arm edge curve and C_n is the noise edge curve. The diagram of the hand edge is shown in Fig. 4(b). Finger edges and palm/arm edges have a direct relationship with the gesture structure; they are the main parts of the hand gesture edges and have to be detected as fully as possible. Edge curves formed by palmprint, skin color variation and skin rumples are noise, which has no connection with the gesture structure and should be eliminated completely.

3.1.3. Extracting the salient hand gesture edge

For a hand gesture image with a complex background, a skin-color based segmentation algorithm is selected for initial segmentation. A morphology filter is then used to extract the hand area mask and the gesture contour I_contour(x, y). The gray image I_gray(x, y) of the gesture can be obtained at the same time, where the arm area might be included. The Canny edge detection algorithm [7] can extract most of the remarkable gesture edges. However, the detection results will also contain some noisy edges formed by rumples and reflections from the hand surface, which should be separated from the finger edges. The salient hand edge image I_edge(x, y) is mainly made up of the hand contour, which includes the boundaries of extensional fingers, flexional fingers and arm/palm. Adduction (approximation) and abduction (separation) movements of the fingers can be referenced with finger III (the middle finger), which moves only slightly without external force disturbance during motion. When the fingers are in extensional states, they are free to carry out adduction and abduction movements, whose edges are easily obtained. When the fingers are clenched into a fist or in flexional states such as the 'six' number gesture, as shown in Fig. 4, an obvious ravine is formed in the appressed parts, with lower gray values in the related pixels. Based on this characteristic, most noisy edges can be eliminated.

The procedure of extracting the salient hand edge is depicted in Fig. 5. One of the hand postures shown in Fig. 4(b) is used as an example; its grayscale hand image can be seen in Fig. 5(a). The steps of I_edge(x, y) extraction are summarized as follows:

1. Extract the grayscale hand image I_gray(x, y) (see Fig. 5(a)) and the hand contour image I_contour(x, y) (see Fig. 5(b)) from the source color image using the skin color segmentation method in [16].
2. Extract the Canny edge image I_canny(x, y) (see Fig. 5(c)) from I_gray(x, y) [7].
3. Apply the predefined threshold Th_black to the grayscale hand image; the obtained I_black(x, y) (see Fig. 5(d)) is

   I_black(x, y) = { 1, I_gray(x, y) < Th_black;  0, I_gray(x, y) ≥ Th_black }    (3)

   The boundaries of the flexional fingers are extracted from the overlapped area, i.e., I_canny(x, y) ∩ I_black(x, y).
4. Then the salient hand edge image I_edge(x, y) is obtained:

   I_edge(x, y) = ( I_black(x, y) ∩ I_canny(x, y) ) ∪ I_contour(x, y)    (4)

   The curve denoted by the binary image I_edge(x, y) (shown in Fig. 5(e)) is the remarkable edge C_hand.

3.2. Finger extraction via parallel edge feature

[Fig. 4. The diagram of the finger edge model: finger parallel edges C_ei, C'_ei; fingertip/joint curve C_ti with center O_ti; finger center line (FCL) C_FCLi; palm/arm edge C_p; noise edge C_n. For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.]
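A compact sketch of steps 1-4 above (assumptions flagged inline: a generic YCbCr skin-color range stands in for the segmentation of [16], Th_black is a made-up threshold, and Image Processing Toolbox functions are used):

function I_edge = salientHandEdge(I_rgb, Th_black)
% Steps 1-4: skin mask -> contour, Canny edges, dark-ravine mask, then
% I_edge = (I_black AND I_canny) OR I_contour, as in Eq. (4).
ycc = rgb2ycbcr(I_rgb);
Cb  = ycc(:,:,2);  Cr = ycc(:,:,3);
mask = Cb >= 77 & Cb <= 127 & Cr >= 133 & Cr <= 173;  % generic skin range (assumption)
mask = imfill(bwareafilt(mask, 1), 'holes');          % keep largest blob, fill holes

I_gray        = rgb2gray(I_rgb);
I_gray(~mask) = 0;                                    % hand-region grayscale image
I_contour = bwperim(mask);                            % step 1: hand contour
I_canny   = edge(I_gray, 'canny');                    % step 2: Canny edges
I_black   = I_gray < Th_black & mask;                 % step 3: dark ravines, Eq. (3)

I_edge = (I_black & I_canny) | I_contour;             % step 4: Eq. (4)
end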
Iris Image Recognition and Processing: Translated Foreign-Language Material
Foreign-language text 1: Abstract

Biometric recognition identifies a person from the physiological and behavioral characteristics inherent in the human body, by closely combining computers with high-tech methods such as optics, acoustics, biosensors and biometric principles. Biometric traits are not easily forgotten, are hard to forge or steal, are "carried" with the person, and are available anytime and anywhere.

Iris recognition is a biometric identification method of significant value in the information-security field. Building on the previous work of other researchers, this thesis discusses the key techniques of iris image capture and iris circle localization, and puts forward several improvements. Iris localization is a crucial, and here effectively implemented, part of iris recognition: the speed and accuracy of the localization decide the performance of the whole recognition system. Taking advantage of the properties of iris images, the images are preprocessed, a pseudo-center of the pupil is determined by gray projection, and the inner and outer iris boundaries are then localized with an integro-differential ("calculus") operator based on the Daugman algorithm. The localization process is finally implemented in MATLAB.

Keywords: iris localization; biometric feature recognition; integro-differential operator; Daugman algorithm

Table of Contents
Chapter 1  Introduction
  1.1 Research background of iris recognition
  1.2 Purpose and significance
  1.3 Research at home and abroad
Chapter 2  Introduction to iris recognition technology
  2.1 Biometric identification technology
    2.1.1 Status and development
    2.1.2 Several biometric technologies
  2.2 Iris recognition technology
  2.3 Summary
Chapter 3  Research status of iris localization algorithms
  3.1 Several common localization algorithms
    3.1.1 Hough transform method
    3.1.2 Geometric-feature localization method
    3.1.3 Active-contour localization method
  3.2 The localization algorithm studied
Chapter 4  Iris localization algorithm based on the integro-differential operator
  4.1 Image preprocessing
    4.1.1 Iris image smoothing (denoising)
    4.1.2 Image sharpening (filtering)
  4.2 Coarse localization of the inner iris edge
  4.3 Iris localization with the integro-differential operator
  4.4 Summary
Chapter 5  Conclusion
References

Chapter 1
1.1 Research background of iris recognition

Biometrics is a technology for personal identification using the physiological and behavioral characteristics inherent in the human body. Physiological characteristics usable for biometric recognition include the fingerprint, hand shape, face, iris, retina, pulse and ear; behavioral characteristics include the signature, voice and gait. Based on these characteristics, hand-shape recognition, fingerprint recognition, facial recognition, iris recognition, signature recognition and other biometric technologies have been developed, many of which have matured into practical applications. Biometric identification has a long history: the ancient Egyptians identified individuals by measuring the sizes of body parts, perhaps the earliest recorded use of biometrics.
The biological features recognition technology has is not easy to forget, the forgery-proof performance good, not easy forge or is robbed, “carries” along with and anytime and anywhere available and so on merits.Iris recognition is a new method for man identification based on the biological features, which has the significant value in the information and security field. Combined with the previous work of other researchers, a discussion is elaborately made on the key techniques concerning the capture of iris images, location of iris circle and some improved and approaches to these problems are put forward. The location of iris recognition is realized which proves efficient.Iris location is a crucial part in the process of iris recognition,thus obtaining the iris localization precisely and fleetly is the prelude of effective iris localization .Iris location of is a kernel procession in an iris recognition system. The speed an accuracy of the iris location decide the performance of the iris recognition system.Take the advantages of the iris image, per-processes the images, decides the pesudo –center of pupil by a method of gray projection .Then the application calculus operator law carries on inside and outside the iris the boundary localization,in this paper ,this algorithm is based on the Daugman algorithm .Finally realizes the localization process in matlab.Keywords: Iris location,Biological features recognition,Calculus operator,Daugman algorithmTable of ContentsThe 1 Chapter Introduction1.1 The research background of iris recognition (6)1.2 The purpose and significance (8)1.3 Domestic and foreign research (9)Chapter 2 of iris recognition technology Introduction2.1 biometric identification technology (14)2.1.1 The status and development (14)2.1.2 Several biometric technology (17)2.2Iris recognition technology (23)2.3 Summary (26)Chapter 3 Research Status of iris location algorithm3.1Several common localization algorithm (27)3.1.1 Hough transform method (27)3.1.2 Geometric features location method (28)3.1.3 Active contour positioning method (29)3.2 Positioning algorithm studied (31)Chapter 4 operator calculus based iris localization algorithm4.1Image preprocessing (34)4.1.1Iris image smoothing (denoising) (36)4.1.2 Sharpen the image (filter)..................37.4.2Coarse positioning the inner edge of the iris (39)4.3 the iris to locate calculus operator law (40)4.4 Summary (41)Chapter 5 Conclusion (41)References (43)The first chapter1.1 The research background of iris recognitionBiometrics is a technology for personal identification using physiological characteristics and behavior characteristics inherent in the human body. Can be used for the biological characteristics of biological recognition, fingerprint, hand type face, iris, retina, pulse, ear etc.. Behavior has the following characteristics: signature, voice, gait, etc.. Based on these characteristics, it has been the development of hand shape recognition, fingerprint recognition, facial recognition, iris recognition, signature recognition and other biometric technology, many techniques have been formed and mature to application of.Biological recognition technology in a , has a long history, the ancient Egyptians throughidentification of each part of the body size measure to carry out identity may be the earliest human based on the earliest history of biometrics. 
But the modern biological recognition technology began in twentieth Century 70 time metaphase, as biometric devices early is relatively expensive, so only a higher security level atomic test, production base.due to declining cost of microprocessor and various electronic components, precision gradually improve, control device of a biological recognition technology has been gradually applied to commerce authorized, such as access control, attendance management, management system, safety certification field etc..All biometric technology, iris recognition is currently used as a convenient and accurate. Making twenty-first Century is information technology, network technology of the century, is also the human get rid of traditional technology, more and more freedom of the century. In the information, free for the characteristics of the century, biometric authentication technology, high-tech as the end of the twentieth Century began to flourish, will play a more and more important role in social life, fundamentally change the human way of life . Characteristics of the iris, fingerprint, DNA the body itself, will gradually existing password, key, become people lifestyle, instead of at the same time, personal data to ensure maximum safety, maximize the prevention of various types of crime, economic crime.Iris recognition technology, because of its unique in terms of acquisition, accuracy and other advantages, will become the mainstream of biometric authentication technology in the future society. Application of safety control, the customs import and export inspection, e-commerce and other fields in the future, is also inevitable in iris recognition technology as the focus. This trend, now in various applications around the world began to appear in the.1.2 Objective and significance of iris recognitionIris recognition technology rising in recent years, because of its strong advantages and potential commercial value, driven by some international companies and institutions have invested a lot of manpower, financial resources and energy research. The concept of automatic iris identification is first proposed by Frown, then Daugman for the first time in the algorithm becomes feasible.The iris is a colored ring in the pupil in the eye of fabric shape, each iris contains a structure like the one and only based on the crown, crystalline, filaments, spots, structure, concave point, ray, wrinkles and fringe characteristic. The iris is different from the retina, retinal is located in the fundus, difficult to image, iris can be seen directly, biometric identification technology can obtain the image of iris fine with camera equipment based on the following basis: Iris fibrous tissue details is rich and complicated, and the formation and embryonic tissue of iris details the occurrence stage of the environment, have great random the. The inherent characteristics of iris tissue is differ from man to man, even identical twins, there is no real possibility of characteristics of the same.When the iris are fully developed, he changes in people's life and tiny. In the iris outer, with a layer of transparent corneal it is separated from the outside world. So mature iris less susceptible to external damage and change.These characteristics of the iris has the advantages, the iris image acquisition, the human eye is not in direct contact with CCD, CMOS and other light sensor, uses a non technology acquisition invasion. 
So, as an important biometric identity verification system, iris recognition by virtue of the iris texture information, stability, uniqueness and non aggressive, more and more attention from both academic and industrial circles.1.3 Status and application of domestic and foreign research on iris recognitionIDC (International Data Group) statistics show that: by the end of 2003, the global iris recognition technology and related products market capacity will reach the level of $2000000000. Predicted conserved survey China biometric authentication center: in the next 5 years, only in the Chinese, iris recognition in the market amounted to 4000000000 rmb. With the expansion of application of the iris recognition technology, and the application in the electronic commerce domain, this number will expand to hundreds of billions.The development of iris recognition can be traced back to nineteenth Century 80's.. In 1885, ALPHONSE BERTILLON will use the criminal prison thoughts of the application of biometrics individual in Paris, including biological characteristics for use at the time: the size of the ears, feet in length, iris.In 1987, ARAN SAFIR and LEONARD FLOM Department of Ophthalmology experts first proposed the concept, the use of automatic iris recognition iris image in 1991, USA Los ala Moss National Laboratory JOHNSON realized an automatic iris recognition system.In 1993, JOHN DAUGMAN to achieve a high performance automatic iris recognition system.In 1997, the first patent Chinese iris recognition is approved, the applicant, Wang Jiesheng.In 2005, the Chinese Academy of Sciences Institute of automation, National Laboratory of pattern recognition, because of outstanding achievement "in recognition of" iris image acquisition and aspects, won the two "National Technology Invention Prize", the highest level represents the development of iris recognition technology in china.In 2007 November, "requirements for information security technology in iris recognition system" (GB/T20979-2007) national standards promulgated and implemented, the drafting unit: Beijing arithen Information Technology Co., ltd..Application of safety control, the customs import and export inspection, e-commerce and other fields in the future, is also inevitable in iris recognition technology as the focus. This trend, now in various applications around the world began to appear in the. In foreign countries, iris recognition products have been applied in a wide range.In February 8, 2002, the British Heathrow Airport began to test an advanced security system, the new system can scan the passenger's eyes, instead of to check passports. It is reported, the pilot scheme for a period of five months, a British Airways and virgin Airlines passengers can participate in this test. The International Air Transport Association interested in the results of this study are, they encourage the Heathrow Airport to test, through the iris boarding passengers to determine its identity as a boarding pass.Iris recognition system America "Iriscan" developed has been applied in the three business department of Union Bank of American Texas within. Depositors to be left with nothing whatsoever to banking, no bank card password, no more memories trouble. 
They get money fromthe A TM, a camera first eye of the user to scan, and then scan the image into digital information and data check, check the user's identity.America Plumsted school in New Jersey has been in the campus installed device of iris recognition for security control of any school, students and staff are no longer use cards and certificates of any kind, as long as they passed in the iris camera before, their location, identity is system identification, all foreign workers must be iris data logging to enter the campus. At the same time, through the central login and access control system to carry on the control to enter the scope of activities. After the installation of the system, various campus in violation of rules and infringement, criminal activity is greatly reduced, greatly reducing the campus management difficulty.In Afghanistan, the United Nations (UN) and the United Nations USA federal agency refugee agency (UNHCR) using iris recognition system identification of refugees, to prevent the same refugee multiple receive relief goods. Use the same system in refugee camps in Pakistan and Afghanistan. A total of more than 2000000 refugees use iris recognition system, this system to a key role for the United Nations for distribution of humanitarian aid from.In March 18, 2003, Abu Zabi (one of the Arabia and the United Arab Emirates) announced the iris recognition technology for expelled foreigners iris tracking and control system based on the borders opened the world's first set of national level, this system began construction from 2001, its purpose is to prevent all expelled by Abu Zabi tourists and other personnel to enter the Abu Zabi. Without this system in the past, due to the unique characteristics of the surface of the Arabs (Hu Xuduo), and the number of the expulsion of the numerous, customs inspection staff is very difficult to distinguish between what is a deported person. By using this system, illegal immigration, all be avoided, the maximum guarantee of national security.Kennedy International Airport in New Jersey state (John F. Kennedy International Airport) of the iris recognition system installed on its international flights fourth boarding port, 300 of all 1300 employees have already started to use the system login control. By using this system, all can enter to the apron personnel must be after the system safety certification of personnel. Unauthorized want to break through, the system will automatically take emergency measures to try to force through personnel closed in the guard space. Using this system, the safety grade Kennedy International Airport rose from B+ to A+ grade. The Kennedy International Airport to travel to other parts of the passengers has increased by 18.7%.Generally speaking, the iris recognition technology has already begun in all walks of life in various forms of application in the world. At the same time, to the application of their units of all had seen and what sorts of social benefits and economic benefits are not see. This trend is to enhance the high speed, the next 10 years will be gradually achieve the comprehensive application of iris recognition in each industry.In China, due to the Chinese embargo and iris technology itself and the difficulty in domestic cannot develop products. So far, there has not been a real application of iris recognition system. 
However, many domestic units are expressed using strong intention, especially the "9 · 11" later, security anti terrorism consciousness has become the most concerned problems in the field of aviation, finance. Iris recognition system is a major airline companies, major financial institutions and other security mechanisms (such as aerospace bureau) become the focus of attention of object and other key national security agency. As with the trend of development in the world, iris recognition technology will in the near future in application China set off climax.The second chapter of introduction of iris recognition technology2.1 Technology of biological feature recognition based on2.1.1 Present status and development of biological feature recognition“9.11" event is an important turning point in the devel opment of biometric identification technology in the world, the importance of it makes governments more clearly aware of the biological recognition technology. Traditional identity recognition technologies in the face of defect anti terrorism has shown, the government began a large-scale investment in the research and application of biometric technology. At the same time, the public understanding of biological recognition technology with "9.11" exposure rate and greatly improve the.The traditional method of individual identification is the identity of the people with knowledge, identity objects recognition. The so-called identity: knowledge refers to the knowledge and memory system of personal identification, cannot be stolen, and the system is easy to install, but once the identification knowledge stolen or forgotten, the identity of easily being fake or replaced, this method at present in a wide range of applications. For example: the user name and password. The so-called identity items: refers to the person, master items. Although it is stable and reliable, but mainly depend on the outer body, lost or stolen identification items once proof of identity, the identity of easily being fake or replaced, for example: keys, certificates, magnetic card, IC card etc..Biometric identification technology is related to physical characteristics, someone using prior record of behavior, to confirm whether the facts. Biometric identification technology can be widely used in all fields of society. For example: a customer came into the bank, he did not take bank card, also did not remember the password directly drawing, when he was drawing in the drawing machine, a camera to scan on his eyes, and then quickly and accurately complete the user identification and deal with business. This is the application of the iris recognition system of modern biological identification technology. "".America "9.11" after the incident, the anti terrorist activity has become the consensus of governments, it is very important to strengthen the security and defense security at the airport, some airports USA can in the crowd out a face, whether he Is it right? Wanted. 
This is the application of modern technology in biological feature recognition "facial recognition technology".Compared with the traditional means of identity recognition, biometric identity recognition technology in general has the following advantages:(1) the security performance is good, not easy to counterfeit or stolen.(2) carry, whenever and wherever possible, therefore more safety and security and other identification method.For the biological information of biometric recognition, its basic nature must meet the following three conditions: universality, uniqueness and permanency.The so-called universality, refers to any individual has the. Uniqueness, is in addition to other than himself, other people did not have any, namely different. The so-called permanent, refers to the character does not change over time, namely, life-long.Feature selection of organisms with more than three properties, is the first step of biological recognition.In addition, there are two important indexes in biological recognition technology. The rejection rate and recognition rate. Adjusting the relation of these two values is very important. The reject rate, the so-called false rejection, this value is high, use frequency is low, the errorrecognition, its value is high, safety is relatively reduced. So in the biological identification of any adjustment, the two index is a can not abandon the process. The choice of range size, related to the biological identification is feasible and available .And technology of identity recognition based on iris feature now appears, it is the development of biometric identification technology quickly, due to its uniqueness, stability, convenience and reliability, so the formation of biometric identification technology has the prospects for development.Generally speaking, the biological recognition system consists of 4 steps. The first step, the image acquisition system of collecting biometric image; the second step, the biological characteristics of image preprocessing (location, normalization, image enhancement and so on); the third step, feature information extraction, converted into digital code; the fourth step, the generation of test code and database template code to compare, make identification。
Action Recognition (lecture slides)
Template-matching-based methods

A template-matching method measures the similarity between the features extracted from the input image sequence and the templates stored in advance during training; the class of the known template with the smallest distance to the test sequence is taken as the recognition result (a minimal sketch follows below).
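As a sketch of the idea (illustrative only; the names are made up and the feature extraction itself is outside the snippet):

function label = nearestTemplate(testFeat, templates, labels)
% Nearest-template classification.
% templates: cell array of stored training feature vectors (equal length)
% labels:    class label of each stored template
d = cellfun(@(t) norm(testFeat - t), templates);  % distance to every template
[~, k] = min(d);                                  % closest template wins
label = labels(k);
end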
Dynamic Time Warping (DTW)

When the time scales of the test-sequence pattern and the reference-sequence pattern are not fully consistent:
Dynamic Time Warping (DTW)

When the test-sequence pattern and the reference-sequence pattern differ in length,

C = {c1, c2, …, cm},  Q = {q1, q2, …, qn},

the DTW distance D between C and Q is accumulated along a warping path of length at least min(m, n), via the dynamic-programming recursion D(i, j) = d(ci, qj) + min{ D(i-1, j), D(i-1, j-1), D(i, j-1) } (see the sketch below).
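A minimal dynamic-programming sketch of that recursion (1-D sequences with absolute-difference local cost; my own illustration):

function D = dtwDistance(C, Q)
% Cumulative-cost DTW between sequences C (length m) and Q (length n).
m = numel(C);  n = numel(Q);
D = inf(m+1, n+1);
D(1,1) = 0;
for i = 1:m
    for j = 1:n
        cost = abs(C(i) - Q(j));     % local distance d(c_i, q_j)
        D(i+1,j+1) = cost + min([D(i,j+1), D(i+1,j), D(i,j)]);
    end
end
D = D(m+1, n+1);                     % total warped distance
end

For example, dtwDistance([1 2 3 4], [1 1 2 3 3 4]) returns 0: the shorter sequence is stretched to match the longer one at no cost.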
The special nature of human motion

Types of motion: rigid motion vs. non-rigid motion. Human motion belongs to a subclass of non-rigid motion, articulated motion: the motion of each individual body part is rigid, while the motion of the body as a whole is non-rigid.
Visual Pathway
Structure of the retina

1. Photoreceptors (cones and rods): convert the light signal into electrical impulses.
2. Intermediate layer: forms the direct and indirect channels that transmit visual information.
3. Ganglion cell layer: here the visual information forms fiber bundles and leaves the eye.
Visual cortex of the brain
中国科学院自动化研究所 Institute of Automation, Chinese Academy of Sciences This figure is modified from Figure 21.14 of Nicholls et al. From neuron to brain.
The visual pathway
Hu Zhanyi, Institute of Automation, Chinese Academy of Sciences
Optic chiasm

[Figure: the two retinal halves, labeled R2, R1 and L2, L1, and their projections to the lateral geniculate nucleus (LGN)]

• The temporal half of the optic-nerve fibers projects to the same side; the nasal half crosses to the opposite side.
• Visual information from the right visual field projects to the left hemisphere, and information from the left visual field projects to the right hemisphere.
Research and Progress of Voice Conversion Technology
声音转换技术的研究与进展左国玉1,2,刘文举1,阮晓钢2(1.中科院自动化所模式识别国家重点实验室,北京100080;2.北京工业大学电子信息与控制工程学院,北京100022) 摘 要: 声音转换是一项改变说话人声音特征的技术,可以将一人的语音模式转换为与其特性不同的另一人语音模式.声音转换算法的目标是确定一个什么样的模式转换规则,使转换语音保持第一个说话人原有语音信息内容不变,而具有第二个说话人的声音特点.本文介绍了当前声音转换技术领域的研究状态,主要分析现有声音转换技术中各种转换算法的实现原理,描述声音转换系统性能的各种评估方法,最后给出了对声音转换技术的简要评述和展望.关键词: 声音转换;语音频谱;基频曲线;声门激励;码本映射;人工神经网络;高斯混合模型;隐马尔科夫模型中图分类号: T N91213 文献标识码: A 文章编号: 037222112(2004)0721165208Voice Conversion Technology and I ts DevelopmentZ UO G uo 2yu 1,2,LI U Wen 2ju 1,RUAN X iao 2gang 2(1.National Laboratory o f Pattern Recognition ,Institute o f Automation ,Chinese Academy o f Sciences ,Beijing 100080,China ;2.School o f Electronics Information and Control Engineering ,Beijing Univer sity o f Technology ,Beijing 100022,China )Abstract : V oice conversion technology trans forms one pers on ’s speech pattern into another pattern with distinct characteris 2tics.The g oal of a v oice conversion alg orithm is to achieve a trans formation that makes the speech of the first speaker s ounds as though it were uttered by the second speaker giving it a new identity ,while preserving the original meaning.This paper introduces s ome cur 2rent studies on v oice conversion technology ,which focus on various types of alg orithms im plemented in v oice conversion area.Different evaluation methods for v oice conversion per formance are described.A technological outlook for this speech technique is given in the last section.K ey words : v oice conversion ;speech spectrum ;pitch contour ;glottal excitation ;codebook mapping ;artificial neural netw ork ;G aussian mixture m odel ;hidden Markov m odel1 引言 在人们的日常生活交流中,一个人的声音往往就是他的身份名片,也就是通常所说的说话人身份(speaker identity ).说话人身份使人们仅从说话人的声音就能辨认出自己的亲戚朋友,在广播节目中听出是否是自己熟悉的主持人在主持节目.《红楼梦》中王熙凤的出场描写可谓是“未见其人,先闻其声”.这些现象成为人们研究声音转换技术的最初出发点.声音转换或声音个性化是一项改变说话人声音特性的技术,使得一人的声音听起来像是由另一人说出来的[1].声音转换技术属于语音识别和语者识别技术的范畴.自动语音识别预处理过程中的说话人自适应方法被广泛用在声音转换技术中[2,3].这项语音技术发展和延伸了说话人识别技术[4,5].1970年代初,Atal 等人[6]就研究了使用LPC 声码器改变声音特性的可行性.Seneff [7]研究了一种改变激励和声道参数的方法.Childers 等人[8,9]检验了男声变女声、女声变男声的方法.Abe 等人[10]提出了一种基于矢量量化(VQ )的码本映射技术.I wahashi 等人[11]提出用频谱插值法增强码本映射技术的鲁棒性.Rinscheid [12]使用时变滤波器和拓扑特征映射实现了声音的改变.Valbret 等人[13]使用基音同步叠加法(PS O 2LA )调整激励信号中的韵律特征来改善转换性能.Naren 2dranath [14]和Watanabe [15]分别用BP 和R BF 等人工神经网络方法实现共振峰特性和LPC 频谱包络的变换.近年来,更多的研究人员致力于语音特征的统计分布来实现声音的转换[2,16-19].国内也开始出现这项语音技术的研究[3,20].所有这些先驱的工作极大地推动了声音转换技术的发展.声音转换技术是对语音合成技术的丰富和延拓,有着良好的技术发展前景.这项技术首先期望应用于特定文语合成系统[21].未来的系统会在人们接收E 2mail 或手机短信息时自动将信件内容用发信人的声音读出来.扩展自然对话系统功能是这种应用的一种延伸.特别是在娱乐和教育领域,产生多说话人特征的语音显示出很高的需求性,如戏剧、广播剧和电收稿日期:2002212213;修回日期:2003211205基金项目:国家自然科学基金项目(N o 160172055;N o.60121302);北京市自然科学基金(N o 14042025).第7期2004年7月电 子 学 报ACT A E LECTRONICA SINICA V ol.32 N o.7July 2004影里的角色配音(v oice dubbing )等[22].语音数据的采集与传输赋予声音转换技术以新的研究价值.传统的语料采集办法非常耗时费力,使用声音转换技术有可能使这个过程变得比较简单.如图1所示,语音合成系统从一个单说话人语音库中提取每一句话输入声音转换系统,分别采用不同目标说话人的模型,使新产生的语音具有所期望的多个目标说话人声音特性,从而建设成为一个由单人语音库生成的多说话人语音库.声音转换技术的优越性也将反映在超低带宽的语音编码领域.当语音编码系统设计的传输速率为2.4kbs 或更低时,在传输过程中将不再保留说话人的语音特征[23].声音转换技术则有可能在接收方重现解码语音,使其与传送人的说话人特征相匹配.图1 单人语音库生成多人语音库系统示意图声音转换的另一个主要用途是用于说话人辨认技术.声音调整是多方会话翻译系统的一个重要技术内容[16,23].系统首先识别一方说话人的每一句话,然后用对方(另一方)语言翻译出来,再用本方说话人声音特征合成新的声音,这样使持不同语言的双方(多方)交流更为方便.在整个会话过程中维持转换语音的自然度是这项应用的重要技术要素.安全系统中的访问控制也激励了声音转换技术的进展[13].一般来说,声音转换的技术实现主要包括以下几个要素:(1)语音模型和特征:模型类型规定了系统要调整语音信号的哪方面参数.模型参数或特征由训练和转换过程中的语音分析阶段获得.(2)映射规则:其作用是将源说话人的声学特征映射到一个近似于目标说话人的特征集上.(3)语音库:在训练过程中用于训练数据和性能评估时用于测试的语音句子集合.本文主要从这几个方面阐述当前声音转换技术领域的研究状态,以期能帮助对这项语音技术感兴趣的研究者有一个比较全面的了解,起到抛砖引玉的作用.2 说话人特征与语音模型及其参数表示 语音信号中含有各种各样的信息,主要载有语音内容信息(what was said )、说话人特征信息(who said it )以及说话环境信息(where it was said 
).说话人特征描述了与说话人身份相关的声音方面特征,而与具体内容信息和说话环境无关.声音转换的任务就是要改变说话人特征,而其他方面的信息保留不变.一般地,说话人特征可表示为以下三个层面[24~26]:(1)音段信息:短时声学特征,如共振峰位置、共振峰带宽、频谱倾斜(spectral tilt )、基频(F0)和能量,与音质相关,依赖于发音器官条件和情感状况.(2)超音段信息:表示声学特征的时变演化,如平均基频、音素时长变化、语调变化和句中重读等,与说话风格和韵律相关,受到社会因素和心理状态的影响.(3)语言学信息:说话时字词的选取、方言和口音等,在当前声音转换技术研究范围之外.当某人说话时,超音段特征比较容易改变,说话人可以随意加快放慢语速、提高降低音量以及改变语气的轻重等.音段特征与语音产生器官的生理情况密切相关,因此可认为几乎不变.音段和超音段特征在说话人识别中具有很重要的感知性意义.在所有超音段特征中,平均F0和语音速度对说话人识别贡献最多,而在音段特征中,频谱包络和共振峰位置起主要作用.特别地,平均F0解释了55%的辨别说话人能力,而F0与前三个共振峰和F0与频谱倾斜分别能够表示71%和85%的说话人特征变化[27].以短时频谱形式表达的音段特征和超音段特征的平均行为(主要是语速和平均F0)能充分满足很大程度上的说话人区别,仅频谱包络就包含了丰富的说话人身份信息[28].因此当前声音转换系统主要集中在短时频谱包络参数的变换上,同时调整源说话人的基频、能量和语速从均值上匹配目标说话人的这些参数.语音模型是语音信号的数学建模.在声音转换系统中,源2滤波器语音模型较好地表示了短时语音频谱,这种模型通过把频谱包络拟合到短时语音幅度谱上,将声道近似为一个缓变滤波器.转换算法常用的模型参数为LPC [10,15]及其演变形式,如LPC 倒谱系数[19,29]、线频谱频率(LSF )[17,30]等,以及进一步分析LPC 频谱获得的共振峰频率和共振峰带宽[14,31,32].其它参数还包括美尔频标倒谱系数(MFCC )[3,16],美尔倒谱系数(MCC )[2,33].将语音信号用相应的LPC 滤波器反向滤波就得到近似于声门激励波形的LPC 残差.由于LPC 残差仍然含有一定的说话人信息.因此,很多转换系统在进行频谱变换时为提高转换语音质量,同时考虑了对残差信号或声门激励波形的变换处理[29,30,34,35].3 声音转换算法实现 声音转换系统通过改变语音信号的声学特征参数来调整语音.一般地,声音转换过程可以分为训练和转换两个步骤来进行[29,36],如图2所示.图2 声音转换算法原理在训练阶段,系统对源说话人和目标说话人的语音样本进行训练,估计映射规则,获取源语音和目标语音的模型参数之间的关系.在转换阶段,利用转换函数对源语音的音段特征和超音段特征等进行变换,使合成语音具有目标说话人特征.6611 电 子 学 报2004年在训练过程中,系统在一个特定的语音模型假设下分析源语音和目标语音.每一种变换算法都有一个提取模型参数的语音分析过程.在分析完成后,训练过程根据对应的语音将源2目标特征聚类分组,构造训练数据.特征关联属性(feature as 2s ociation )一般可由时间对准和分类过程得到,如动态时间规整(DT W )[10,13,16,29,31]、无监督隐马尔柯夫建模[30]或强制对准(forced 2alignment )语音识别[30,37]等过程.经过时间对准后的数据被用来估计转换函数.在当前转换系统中已实现了多种语音频谱的转换算法,其中包括映射码本[10,37]、线性多变量回归[13]、动态频率规整[13]、人工神经网络[14,15]、高斯混合模型[16,17]和隐马尔科夫模型[2,19].转换时,已训练好的转换函数从新输入源语音特征来预测目标语音特征,最后在合成阶段,由预测特征产生最终的转换语音信号.源说话人的韵律特征如F 0曲线、能量曲线和说话速度也被调整,使之匹配目标说话人的韵律特征.311 语音频谱变换语音频谱承载了说话人特征的重要信息,调整语音频谱是当前声音转换技术的首要内容.训练频谱变换函数就是为了找到源、目标说话人声学特征之间的映射关系.一般地,训练前,需将源于两个说话人的特征矢量流采用某种算法进行时间对准,然后再根据映射方案训练频谱变换函数.31111 码本映射 码本映射是声音转换领域比较常用的转换算法.这种转换算法最早是由Abe 和Shikano 等人[10]提出来的,源于语音识别过程中的说话人自适应技术[38].图3显示了这种基于VQ 码本映射的声音转换实现原理.图3 基于矢量量化码本映射的声音转换系统在这个方案中,为产生映射码本,首先用矢量量化算法将源说话人和目标说话人的特征空间进行划分,用DT W 算法将源矢量和目标矢量相关联,产生对应码本矢量的统计直方图.最终的目标码本定义为用直方统计值作为权函数的目标码字的线性组合.可以这样表达:一个与源说话人输入频谱X (S )相对应的VQ 频谱V i (S )经对目标说话人VQ 频谱V j (T )加权求和(权值h ij 表示训练数据时V i (S )→V j (T )的对应统计值)转换成与V j (T )目标说话人的频谱X i (T ),表示如下:X i (T )=∑N ij =1h ij V j (T )∑N ij =1h ij (1)这种算法的一个基本问题是由于矢量量化作用造成的频谱的不连续性.为克服这种简单矢量量化方法的缺点,有人提出了模糊矢量量化技术(fuzzy VQ )[39].源说话人输入频谱X(S )就不再唯一地量化成V i (S ),而是表达为V i (S )邻域码矢量的线性组合∑Mii=1u i V i(S ),其中u i 是由X (S )的模糊关系函数确定的系数.Abe 等人[22]又提出了分段矢量量化技术(segment VQ )的改进方案,用音素时长大小语音段来代替单帧VQ 编码,语音段的切分采用了常用的H M M 音素切分器.通过矢量量化可以获得比较精确的转换.Arslan 等人[30,37]提出一种基于音素码本和滤波器思想的转换算法,较好地改善了转换信号连续语音帧之间的过渡性能和转换语音质量.这种方法采用语句H M M (sentence H M M )方法代替DT W 方法做音素对准,因而对准精度较高,鲁棒性较好.其训练过程如图4所示.图4 音素码本训练流程源、目标码本生成后,与源频谱V s (w )对应的源语音LSF矢量被近似为源码本LSF 矢量的加权组合,与目标频谱V t (w )对应的目标语音LSF 矢量估计为相同权值加权的目标码本的线性组合,声道滤波器H v (w )就可表示为V t (w )与V s(w )的商.当输入语音的频谱为X (w )时,输出语音就可表示为Y (w )=H v (w )3X (w ).还有一些改善码本映射算法性能的工作.文献[40]使用一个三层的神经网络实现映射码本.频谱插值法通过几个说话人语音频谱之间插值确定转换语音频谱以提高系统的鲁棒性[11],类似算法在文献[41]中也有描述.整体上,上述码本映射方案还是受到转换质量不高和鲁棒性不太好等因素的困扰.31112 线性多变量回归和动态频率规整 有学者提出了不同于全局的码本映射思想的转换算法[13,31].用标准的无监督聚类技术(如VQ )将说话人声学空间划分为多个不相重叠的类,每一类语音对应于一个转换函数(也称作局部函数),每个转换函数都表述了这一类中源2目标语音之间的映射关系,所以码本映射方案中的全局映射就被这些局部函数所近似.文献[13]中提出了多变量线性回归和动态频率规整算法两种局部转换算法.多变量线性回归(LMR )方法通过最小化每一类中所有源2目标矢量对之间预测误差的均方值来确定各最优线性变换矩阵M :C k C =Mc kS(2)J =min∑Nk =1(CkT -C k C )2最优解为:M =C T T C S (C TS C S )-1(3)其中,C k S 、C k T 和C kC 分别表示源倒谱矢量、目标倒谱矢量和由通过最小化性能指标J 计算得到的最优化矩阵M 变换7611第 7 期左国玉:声音转换技术的研究与进展而得的转换矢量,C T代表矩阵C的转置,C S和C 
The dynamic frequency warping (DFW) algorithm tries to find a mapping path between source and target speech spectra within the same acoustic class. This method first computes the log magnitude spectra of the source and target speakers and removes the spectral tilt from them. A frequency-warping algorithm is applied to the normalized source and target spectra, yielding a warping curve of source-target vector correspondences. The number of warping functions in each class equals the number of source-target vector pairs in that class; the average warping function of the class is computed and represented by a third-order polynomial. DFW changes the spectral shape in the frequency domain, so it can adjust formant frequencies and their bandwidths while leaving the amplitudes almost unaffected, but its conversion performance is slightly inferior to LMR [13].

Mizuno et al. [31] proposed a similar local-transformation algorithm. The speaker's spectral space is partitioned by vector quantization into many subspaces, from which a set of linear formant conversion rules is computed. Such multiple local conversion functions can produce infinitely many target feature values, but because the selection of a single local function remains discrete, discontinuities still appear in the output speech.

3.1.3 Artificial neural network models. An artificial neural network (ANN) is an example of a continuous conversion function. In theory, an ANN with a nonlinear hidden layer can approximate an arbitrary mapping. In continuous speech, the characteristics of the vocal-tract system change rapidly; to transform a speaker's acoustic features faithfully, the codebook in a mapping approach must be very large. A neural network, in contrast, can learn a continuous feature-mapping function well even from a small amount of training data, provided it is well chosen. This generalization property helps reduce data requirements while still transforming speaker characteristics well.

Following this principle, Narendranath et al. [14] used an ANN trained with the BP algorithm to transform formant frequencies. The training algorithm for the formant conversion function is:

    repeat
      for each formant data set:
        step 1: present the source speaker's (male) formant frequencies (F1, F2, F3) as the network input.
        step 2: extract the formants of the corresponding frame of the target speaker's (female) speech as the desired output.
        step 3: adjust the weights with the BP algorithm.
    until the weights converge

Besides BP networks for capturing the relation between source and target acoustic features, radial basis function (RBF) networks can also realize spectral transformation between speakers [15]. During training, speech phonemes of the source and target speakers, represented by their LPC spectra, are extracted from the training set and used as the RBF network's input and output, respectively; the connection weights are adjusted by least squares, minimizing the mean square error between the actual and desired outputs. Although ANNs exhibit good continuity, little experimental evidence shows that ANN methods achieve superior conversion performance.

3.1.4 Gaussian mixture models. In recent years, many researchers have adopted probabilistic methods to improve the naturalness and target-speaker similarity of converted speech. Stylianou et al. [16] used a Gaussian mixture model (GMM) to capture the mapping between the source feature distribution and the target feature distribution. A GMM is fitted to the probability distribution of the source feature vectors x, giving a "soft" classification of the source feature space:

$$p(x) = \sum_{i=1}^{m} \alpha_i\, \mathcal N(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{m} \alpha_i = 1, \;\; \alpha_i \ge 0 \qquad (4)$$

where N(x; μ_i, Σ_i) is the normal density of the i-th abstract acoustic class, m is the number of Gaussian mixture components, and α_i is the weight of class i. By Bayes' rule, the probability that an observed vector x belongs to class i is

$$h_i(x) = \frac{\alpha_i\, \mathcal N(x; \mu_i, \Sigma_i)}{\sum_{j=1}^{m} \alpha_j\, \mathcal N(x; \mu_j, \Sigma_j)} \qquad (5)$$

The parameters (α, μ, Σ) are estimated by the EM algorithm [42]. The conversion function is expressed as

$$\hat y = F(x) = \sum_{i=1}^{m} h_i(x)\left[ v_i + \Gamma_i (x - \mu_i) \right] \qquad (6)$$

Based on the correspondence between source and target features, the parameters v_i and Γ_i of each local transformation are estimated by solving the normal equations of a least-squares problem, minimizing the conversion error over all training data:

$$\varepsilon = E\left[ \| y - \hat y \|^2 \right] \qquad (7)$$

Kain and Macon et al. [17] made some changes to this algorithm: a GMM is fitted to the joint probability distribution P(x, y) of the joint vector z = [xᵀ, yᵀ]ᵀ formed from the source vector x and the target vector y. Finding E[y | x] given x is a regression:

$$\hat y = E[y\,|\,x] = \sum_{i=1}^{m} h_i(x)\left[ \mu_i^y + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} (x - \mu_i^x) \right], \qquad h_i(x) = \frac{\alpha_i\, \mathcal N(x; \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{m} \alpha_j\, \mathcal N(x; \mu_j^x, \Sigma_j^{xx})} \qquad (8)$$

where μ_i^x and μ_i^y denote the class-i mean vectors of the source and target speakers, Σ_i^xx denotes the class-i covariance of the source speaker, and Σ_i^xy denotes the class-i cross-covariance between the source and target speakers. The class-i covariance and mean of the joint vector z are

$$\Sigma_i^z = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}, \qquad \mu_i^z = \begin{bmatrix} \mu_i^x \\ \mu_i^y \end{bmatrix} \qquad (9)$$

Compared with the least-squares approach, the joint-density approach can in theory configure the Gaussian mixture components more sensibly for the regression problem, but the EM computation is much more expensive. Experiments show that GMM methods effectively improve the naturalness of converted speech; combined with prosodic parameter adjustment, they can greatly improve the target-speaker similarity of the converted speech.

3.1.5 Hidden Markov models. In some speech synthesis systems, the voice conversion algorithm adopts the speaker adaptation techniques widely used in speaker-independent (SI) speech recognition [2,3], such as maximum likelihood linear regression (MLLR) [43], maximum a posteriori (MAP) estimation [44], vector field smoothing (VFS) [45], and HMM interpolation [33]. In the HMM-based TTS system [2], the basic principle is as follows: phoneme HMMs serve as the synthesis units, and an initial speaker-independent phoneme HMM (the "average voice" HMM) is trained from observation vectors in the training stage. MLLR adaptation of the speaker-independent phoneme HMMs with the target speaker's speech gives the adapted phoneme models the target speaker's characteristics. At synthesis time, the given text is converted into a context-dependent phoneme label sequence, and according to this sequence a sentence HMM is concatenated from the adapted phoneme HMM units. This algorithm replaces the source speaker's voice with an average voice. Note that the HMM conversion algorithm differs markedly in architecture from the algorithms described above: the synthesis process essentially embeds a speech recognition process with adaptation, and its technical basis is that the synthesis parameters can be generated from the phoneme HMMs. Although the quality of speech converted and synthesized through HMM adaptation is not yet very good, the advantage is that only a small amount of adaptation data is needed to conveniently synthesize the speech of different target speakers.

3.2 Excitation signal transformation

As discussed above, in the source-filter speech model the glottal excitation signal, or LPC residual signal, still contains some speaker-characteristic information. Current voice conversion systems handle the excitation signal mainly in the following ways.

3.2.1 Excitation codebooks. Childers et al. [34] divided the glottal excitation signal into voiced and unvoiced parts. For the voiced part, the excitation waveform is represented by a sixth-order polynomial model, and the polynomial coefficient vectors form a 32-entry voiced glottal-excitation codebook; for the unvoiced part, the noise excitation signal is represented by a 256-entry randomly generated noise codebook. Analogous to spectral codebook mapping, the source excitation codebook is mapped onto the target codebook, followed by prosody and spectrum adjustment [9]. In Arslan's excitation codebook filter scheme [30], the excitation mapping codebook is trained from the LPC residual signals of the source and target speech. The target glottal excitation is estimated by a filter H_g(w) built as a weighted combination of the excitation codebook filters U_i^t(w)/U_i^s(w), where U_i^t(w) and U_i^s(w) are the magnitude spectra of the i-th target and source excitation codewords. With input speech X(w) and vocal-tract codebook filter H_v(w), the output speech after spectral and excitation transformation is Y(w) = H_g(w) · H_v(w) · X(w).

3.2.2 Neural-network predictors. Lee et al. [29] divided the acoustic features into linear and nonlinear parts: LPC cepstral coefficients are extracted as the features of the linear part, while the nonlinear part represents the excitation signal and is modeled by a long-delay nonlinear predictor realized as a neural network. The conversion rule for the LPC cepstral coefficient vectors is determined by an orthogonal vector-space transformation [46], and the source signal is converted using the mean F0 ratio and a mapping codebook.

3.2.3 LPC residual prediction. Kain et al. [35] argue that for the voiced part of speech, the transformed LPC spectrum can predict the target LPC residual. The basic assumption is that for acoustically similar classes of a given speaker's speech, the residual signals are also similar and therefore predictable.

These treatments of the excitation or residual signal increase the target-speaker similarity of the converted speech to varying degrees.

3.3 F0 contour transformation

Prosodic features, especially the F0 contour, carry a large amount of speaker-identity information and play a very important role in determining speaker identity. Compared with the vocal-tract acoustic features and the excitation signal, however, prosody conversion has received relatively little study, and that research concentrates mainly on F0-contour transformation, while the energy contour and speaking rate get only simple mean linear transformations. F0 contour transformation appears mainly in the following forms.

3.3.1 Mean-variance linear transformation. The simplest way to model the F0 contour is to assume that F0 follows a Gaussian distribution, estimate its mean and variance, and map input F0 values to the desired F0 values [18,29,30,32,47]:

$$x_2 = \frac{x_1 - \mu_s}{\sigma_s}\,\sigma_t + \mu_t \qquad (10)$$

where μ_s and σ_s are the mean and standard deviation of the source speaker's F0, and μ_t and σ_t are the mean and standard deviation of the target speaker's F0.
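Eq. (10) amounts to a per-frame z-score renormalization of F0. The sketch below is illustrative rather than taken from any cited system; working in the log-F0 domain and marking unvoiced frames with 0 are assumptions of the sketch.

```python
import numpy as np

def convert_f0(f0_src, f0_src_train, f0_tgt_train):
    """Map source F0 values toward the target speaker via Eq. (10).

    All inputs are arrays of F0 in Hz; 0 marks unvoiced frames.
    Statistics are computed on voiced frames in the log domain
    (a common choice; Eq. (10) itself is domain-agnostic).
    """
    def stats(f0):
        v = np.log(f0[f0 > 0])
        return v.mean(), v.std()

    mu_s, sd_s = stats(f0_src_train)
    mu_t, sd_t = stats(f0_tgt_train)

    out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    out[voiced] = np.exp((np.log(f0_src[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out
```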
3.3.2 Mixed deterministic/stochastic modeling. Ceyssens et al. [48] proposed modeling the F0 contour with mixed deterministic and stochastic components. For each utterance, a regression line is fitted to the F0 values in the log domain, determining the F0 offset Po and the declination slope Pd, and the variance V of the actual F0 contour around the regression line is estimated. Regressions of each element of (Po, Pd, V) on the sentence length L are computed in the same way, yielding nine parameters in total: the offset, slope, and variance of Po, Pd, and V with respect to L. The transformation of the F0 contour offset can be expressed as

$$P_o^C = \left( P_{oo}^T + L \cdot P_{od}^T \right) + P_{ov}^T \cdot \frac{P_o - \left( P_{oo}^S + L \cdot P_{od}^S \right)}{P_{ov}^S} \qquad (11)$$

where Po denotes the F0 offset of the input utterance; the superscripts S, T, and C denote the source, target, and converted utterance; and the subscripts o, d, and v denote offset, slope, and variance, respectively. The declination slope P_d^C and the variance V^C of the converted speech are obtained analogously. Eq. (11) can be regarded as a first-order form of Eq. (10).

3.3.3 Piecewise linear mapping. Gillet proposed a piecewise linear F0 mapping method [49]. Four F0 points are first determined for each utterance: the sentence-initial high value S, the non-initial peak value H, the post-accent valley value L, and the sentence-final low value F. With one segment passing through (F_s, F_t) and (L_s, L_t), one through (L_s, L_t) and (H_s, H_t), and another through (H_s, H_t) and (S_s, S_t), the source-target F0 mapping is obtained piecewise:

$$M(x) = \begin{cases} F_t + \dfrac{(x - F_s)(L_t - F_t)}{L_s - F_s}, & x < L_s \\[2mm] L_t + \dfrac{(x - L_s)(H_t - L_t)}{H_s - L_s}, & L_s \le x \le H_s \\[2mm] H_t + \dfrac{(x - H_s)(S_t - H_t)}{S_s - H_s}, & x > H_s \end{cases} \qquad (12)$$

The piecewise linear mapping method performs better than the F0 conversion of Eq. (10).

3.3.4 Sentence F0-contour codebooks. It is possible to model a speaker's F0 characteristics with a codebook of F0 contours generated at the whole-sentence level [50]. This method uses DTW to compare the test utterance against all utterances in the same speaker's data set. After the utterance with the closest-matching F0 contour is found, the same utterance of the desired speaker is selected; the two selected utterances are then DTW-aligned with phoneme or word boundaries as intermediate constraints, and the resulting F0 contour is warped back onto the test utterance, producing a new F0 contour for the converted speech. The advantage of this algorithm is the possibility of synthesizing or adjusting with real F0 contours, but it is limited to small-vocabulary, special-purpose settings, since generating all possible F0 contours is impractical. Türk [51] proposed a similar segmental F0-contour mapping method.

3.3.5 Gaussian mixture model. Reference [52] proposes an idea different from all the above algorithms: F0 and the LSF spectrum are considered class-dependent, so a single Gaussian mixture model can describe the distribution of this generalized class of the input speech, adjusting the F0 parameters at the same time as the magnitude spectrum is changed. When the F0 adjustment factor is large, this method can improve the naturalness of the converted speech to some degree.

Judging from current speech technology, the extraction and control of high-level information such as pitch and tone still present many difficulties, so specific suprasegmental features cannot yet be modeled and converted well.
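The three-branch map of Eq. (12) is linear interpolation between corresponding F0 anchor points, so it can be realized directly with `np.interp`. In the sketch below the anchor values are invented for illustration, and inputs are assumed to stay within the source anchor range (Eq. (12) additionally extrapolates the end segments).

```python
import numpy as np

def piecewise_f0_map(x, src, tgt):
    """Eq. (12): map source F0 values through anchors (F, L, H, S).

    src, tgt: dicts with keys 'F', 'L', 'H', 'S' giving the
    sentence-final low, valley, peak, and sentence-initial high
    F0 values of the source and target speakers (in Hz).
    """
    xs = [src['F'], src['L'], src['H'], src['S']]  # must be increasing
    ys = [tgt['F'], tgt['L'], tgt['H'], tgt['S']]
    return np.interp(x, xs, ys)

# Illustrative anchors (Hz), not from the paper:
src = dict(F=90.0, L=100.0, H=140.0, S=160.0)
tgt = dict(F=160.0, L=180.0, H=240.0, S=260.0)
print(piecewise_f0_map(np.array([95.0, 120.0, 150.0]), src, tgt))
# -> [170. 210. 250.]
```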
Pattern Recognition — lecture slides (new edition).ppt
Xu Jianhua (许建华)  xujianhua@
Department of Computer Science, Nanjing Normal University
March-June 2007
Chapter 1  Introduction
1.1 Pattern recognition and the concept of patterns
1.2 Pattern recognition systems
1.3 Some basic issues in pattern recognition
1.1 Pattern Recognition and the Concept of Patterns
1.1.1 Basic concepts. Two examples:
grouping things into corresponding classes by content or by appearance
"Like things cluster together; people divide into groups" (物以类聚,人以群分)
Human pattern recognition ability
The ability of a person to receive external information through sight, smell, hearing, taste, and touch, process it in the brain using existing knowledge, and then judge things or determine their nature (class).
Pattern Recognition
Using computers to realize the human ability of pattern recognition, i.e., to analyze, describe, judge, and recognize various things or phenomena by computer.
Minkowski distance:

$d(x, y) = \left( \sum_{k=1}^{n} \left| x_k - y_k \right|^q \right)^{1/q}$
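A quick numeric sketch of this distance (the vectors are made up; q = 1 and q = 2 give the Manhattan and Euclidean special cases):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance; q=2 is Euclidean, q=1 is Manhattan."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(minkowski(x, y, 1), minkowski(x, y, 2))  # 6.0, ~3.742
```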
Mahalanobis distance

$d(x, y) = \sqrt{(x - y)^{\mathsf T}\, \Sigma^{-1}\, (x - y)}$

where the covariance matrix and the mean are

$\Sigma = \frac{1}{l - 1} \sum_{i=1}^{l} (x_i - \bar{x})(x_i - \bar{x})^{\mathsf T}, \qquad \bar{x} = \frac{1}{l} \sum_{i=1}^{l} x_i$
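The two formulas above map directly onto a few lines of NumPy; in this sketch the sample data are synthetic, and `np.cov` supplies the 1/(l-1) normalization used in the covariance formula.

```python
import numpy as np

# Toy data: l = 200 samples of a 3-dimensional feature vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([1.0, 2.0, 0.5])

mean = X.mean(axis=0)              # x-bar = (1/l) * sum of x_i
cov = np.cov(X, rowvar=False)      # Sigma, with the 1/(l-1) factor
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, y):
    """d(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    d = x - y
    return np.sqrt(d @ cov_inv @ d)

print(mahalanobis(X[0], mean))
```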
1.3.4 Data standardization
Purpose: to eliminate the influence that the differing numeric ranges of the individual components have on an algorithm.
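One standard way to achieve this is per-component z-scoring, sketched below with made-up data (height in centimeters next to income in yuan — two wildly different ranges):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Z-score each column so all components get zero mean, unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X = np.array([[170.0, 60000.0],
              [180.0, 30000.0],
              [160.0, 90000.0]])
print(standardize(X))
```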
Toddlers recognizing animals; sorting books
Toddlers recognizing animals
The teacher teaches the toddler (learning); the toddler recognizes on its own (decision); misclassification can occur.
Sorting books
Grouping 1: finely printed books / ordinarily printed books
Grouping 2: large-format books / small-format books / miniature books
Grouping 3: mathematics books / physics books / chemistry books / computer science books / fiction / law books
Lecture 1  Introduction to Pattern Recognition
Basic components of a pattern recognition system
A complete pattern recognition system consists of five basic parts:
input → sensor → preprocessor (segmenter) → feature extractor → classifier → post-processor (context-information adjustment) → decision
Sensor
The sensor is the input device of a pattern recognition system, for example a camera or a microphone. Sensor characteristics and limitations include parameters or specifications such as bandwidth, resolution, sensitivity, distortion, signal-to-noise ratio, and latency.
Samples, patterns, and pattern classes
In a pattern recognition system, the input is usually called a sample, sometimes also a pattern; the output is usually called a pattern, sometimes also a pattern class.
The complexity of pattern recognition
Pattern recognition is a very complex problem. In biology, a whale is a mammal and should be grouped with cattle; but from an industry point of view, whaling belongs to fisheries, so the whale groups with fish, while cattle belong to animal husbandry and fall in a different class.
Lecture 1  Introduction to Pattern Recognition
Examples of common pattern recognition systems · basic concepts of pattern recognition · related fields · applications · basic components of a pattern recognition system
Examples of common pattern recognition systems
Barcode recognition
Examples of common pattern recognition systems
Optical character recognition (OCR) systems
Preprocessor
The main role of the preprocessor is to separate the information that matters most for recognition from the background noise; it is therefore also called a segmenter.
Feature extractor
The feature extractor produces a representation or description of the input that is relevant to class information. Feature values of different samples from the same class are usually required to be very close, while feature values of samples from different classes should differ greatly; features should be class-invariant.
Density estimation is used to solve for the (probability) density with which class members (samples) having certain specific features occur.
Robust Lip Tracking by Combining Shape, Color and Motion

Ying-li Tian, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 (yltian@); National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China
Takeo Kanade, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 (tk@)
Jeffrey F. Cohn, Department of Psychology, University of Pittsburgh, Pittsburgh, PA 15260 (jeffcohn@)

Abstract

Accurately tracking facial features requires coping with the large variation in appearance across subjects and the combination of rigid and non-rigid motion. In this paper, we describe our work toward developing a robust method of tracking facial features — in particular, lip contours — by using a multi-state mouth model and combining lip color, shape, and motion information. Three lip states are explicitly modeled: open, relatively closed, and tightly closed. The gross shapes of lip contours are modeled by using different lip templates. Given the initial location of the lip template in the first frame, the lip and skin color is modeled by a Gaussian mixture. Several points of a lip are tracked over the image sequence, and the lip contours are obtained by calculating the corresponding lip template parameters. The color and shape information is used to obtain lip states. Our method has been tested on 5000 images from the University of Pittsburgh-Carnegie Mellon University (Pitt-CMU) Facial Expression AU Coded Database, which includes image sequences of children and adults of European, African, and Asian ancestry. The subjects were videotaped while displaying a wide variety of facial expressions with and without head motion. Accurate tracking results were obtained in 99% of the image sequences. Processing speed on a Pentium II 400 MHz PC was approximately 4 frames/second. The multi-state model based method accurately tracked lip motion and was robust to variation in facial appearance among subjects, specularity, mouth state, and head motion.

Keywords: lip tracking; multiple-state mouth model; lip template

1. Introduction

Robust and accurate analysis of facial features requires coping with the large variation in appearance across subjects and the large appearance variability of a single subject caused by changes in lighting, pose, and facial expressions. Facial analysis has received a great deal of attention in the face detection and recognition literature [1,2,6,14]. Mouth features play a central role in automatic face recognition, facial expression analysis, lip-reading, and speech processing.
Accurately and robustly tracking lip motion in image sequences is especially difficult because lips are highly deformable, and they vary in shape, color, specularity, and relation to surrounding features across individuals; in addition, they are subject to both non-rigid (expression) and rigid (head movement) motion. Although many lip tracking methods have been proposed, each method has limitations when it comes to obtaining robust results.

In this paper, we developed a robust method for tracking lip contours in color image sequences by combining color, shape, and motion. A multi-state mouth model is used to represent the different mouth states: open, relatively closed, and tightly closed. Two parabolic arcs are used as a lip template for modeling the lip shape. Our goal is to robustly track the lip contour in image sequences across individuals, specularity, and facial expressions. The lip template is manually located in the first frame, and the lip color information is modeled as a Gaussian mixture. Then, the key points of the lip template are automatically tracked in the image sequences. The lip states and state transitions are determined by the lip shape and color. We have tested our method on the Pitt-CMU Facial Expression AU Coded Database, which includes more than 5000 images of different kinds of people (including Caucasian, Afro-American, Hispanic, and Asian) and expressions. Excellent results have been obtained even when there is head motion.

2. Lip Tracking

Each of the lip tracking methods that have been proposed so far has its own strengths and limitations. We believe that a feature extraction system intended to be robust to all the sources of variability (i.e., individual differences in people's appearance, etc.) should use as much knowledge about the scene as possible.
Lip tracking methods based on a single cue about the scene are insufficient for robustly and accurately tracking lips. For example, the snake [7] and active contour methods [11] often converge to the wrong result when the lip edges are indistinct or when lip color is very close to face color. Luettin and Thacker [10] proposed a lip tracking method for speechreading using probabilistic models; their method needs a large set of training data to learn patterns of typical lip deformation. The lip feature point tracking method of Lien [8] is sensitive to the initial feature point positions, and the lip feature points are ambiguous along the lip edges. A feature extraction system should use all available information about the scene to handle all the sources of variability in real environments (illumination, individual appearance, etc.).

Many researchers have tried to combine more information. Bregler and his colleagues [4] developed an audio-visual speech recognition system that uses Kass's snake approach with shape constraints imposed on possible contour deformations. They found that the outer lip contour was not sufficiently distinctive; this method uses image forces consisting of gray-level gradients, which are known to be inadequate for identifying the outer lip contour [14]. Yuille et al. [14] used the edge, peak, and valley information with a mouth template for lip tracking, but still experienced some problems during energy minimization: the weights for each energy term were adjusted by performing preliminary experiments, a process that was time-consuming, and the weights were not applicable to novel subjects. The color-based deformable template method developed by Rao [13] combines shape and color information, but it has difficulty when there is a shadow area near the lip or the lip color is similar to that of the face. Examples of some of these problems are shown in Figure 1, where the limitations of these methods can be clearly observed. Figure 1(a) shows the failure of Lien's feature point tracking when the lip contour becomes occluded: the feature points on the lip shift to wrong positions when the lips are tightly closed. Figure 1(b) shows the failure of Rao's color-based deformable method due to shadow and to occlusion: the lower lip contour jumps to the chin because of the shadow.

[Figure 1: (a) Lip tracking by the feature point tracking method — lip contour points shift to the wrong positions. (b) Lip tracking by the color-based deformable template method — the lower lip contour jumps to the wrong position because of the shadow.]
3. Lip Tracking by Combining Color, Shape and Motion

As shown in Figure 2, we classify the mouth states as open, relatively closed, and tightly closed. We define the lip state as tightly closed if the lips are invisible because of lip suck. For the different states, different lip templates are used to obtain the lip contour (Figure 2(e), (f), and (g)). For the open mouth, a more complex template could be used that includes the inner lip contour and the visibility of teeth or tongue; currently, only the outer lip contour is considered. For the relatively closed mouth, the outer contour of the lips is modeled by two parabolic arcs with six parameters: lip center (xc, yc), lip shape (h1, h2, and w), and lip rotation (θ). For the tightly closed mouth, the dark mouth line ending at the lip corners is used (Figure 2(g)). The state transitions are determined by the lip shape and color.

[Figure 2: the multi-state mouth model — (a)-(d) example mouth states; (e)-(g) lip templates, with parameters θ, h1, h2, (xc, yc), w, and key points p1-p4 on the contour and at the lip corners.]

Lip color: We model the color distribution inside the closed mouth as a Gaussian mixture. There are three prominent color regions inside the mouth: a dark aperture, pink lips, and bright specularity. The density function of the mouth Gaussian mixture is given by:

$$p(x) = \sum_{i=1}^{3} \pi_i\, \mathcal N(x; \mu_i, \Sigma_i) \qquad (1)$$

$$\sum_{i=1}^{3} \pi_i = 1, \qquad 0 \le \pi_i \le 1 \qquad (2)$$

where π_i are the mixture weights, μ_i are the mixture means, and Σ_i are the mixture covariance matrices. In order to identify the model parameter values, the lip region is manually specified in the first frame image. The Expectation-Maximization (EM) algorithm [5] is used to estimate both the mixture weights and the underlying Gaussian parameters, with K-means clustering providing initial estimates of the parameters. Once a model is built, the succeeding frames are tested by a look-up table obtained from the first frame.

Lip motion: In our system, the lip motion is obtained by a modified version of the Lucas-Kanade tracking algorithm [9]. We assume that the intensity values of any given region (feature window size) do not change, but merely shift from one position to another. Consider an intensity feature template I_t(x) over a region R in the reference image at time t; we wish to find the translation d of this region in the following frame I_{t+1} at time t+1, by minimizing a cost function defined as:

$$E(d) = \sum_{x \in R} \left[ I_{t+1}(x + d) - I_t(x) \right]^2 \qquad (3)$$

and the minimization for finding the translation can be done in Newton-type iterations:

$$d_{k+1} = d_k + \left( \sum_{x \in R} g\, g^{\mathsf T} \right)^{-1} \sum_{x \in R} g \left[ I_t(x) - I_{t+1}(x + d_k) \right], \qquad g = \nabla I_{t+1}(x + d_k) \qquad (4)$$

Lip state: For the open mouth and the tightly closed mouth, there are non-lip pixels inside the lip contours. Assume the lip state in the first frame is neutral (relatively closed). From the lip color distribution, we get the lip states from the lip-height ratio h/h0 and the proportion p = n1/n of non-lip pixels inside the lip contour:

$$\text{state} = \begin{cases} \text{open}, & h/h_0 > \tau_1 \text{ and } p > \tau_1 \\ \text{tightly closed}, & h/h_0 < \tau_2 \text{ and } p < \tau_2 \\ \text{relatively closed}, & \text{otherwise} \end{cases} \qquad (5)$$

where h0 and h are the sums of the top-lip and bottom-lip heights in the first frame and the current frame, n1 is the number of non-lip pixels inside the lip contour, n is the total number of pixels inside the lip contour, and τ1 = 0.35 and τ2 = 0.25 are thresholds in our application.

[Figure 3: (a) an original image and the color distribution of a tightly closed mouth; (b) an original image and the color distribution of an open mouth.]

Lip contours for tightly closed lips: For the tightly closed mouth, we trace the mouth line by locating the darkest pixels along perpendicular lines sampled over the distance between the two lip corners (Figure 4). [Figure 4: (a) approximate contour; (b) refined contour.]

Based on the multi-state mouth model, the color, shape, and motion information are combined in our lip tracking method, which is very robust for tracking lips in each state. Figure 5(a) shows the lip tracking results obtained by using shape and motion information without the multi-state mouth model: the lip contours are not correct for tightly closed lips. Figure 5(b) shows the correct lip tracking results for tightly closed lips after adding the multi-state mouth model.

[Figure 5: (a) lip tracking by combining shape and motion without the multi-state mouth model — the contours for tightly closed lips are not correct; (b) lip tracking by combining shape, color, and motion based on the multi-state mouth model — the contours for tightly closed lips are correct.]
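The point tracking of Eqs. (3)-(4) is the classic Lucas-Kanade scheme [9], for which pyramidal implementations are available off the shelf. The sketch below uses OpenCV rather than the authors' modified tracker; the window size, pyramid depth, and termination criteria are illustrative choices, not values from the paper.

```python
import cv2

def track_lip_points(prev_gray, next_gray, prev_pts):
    """Track lip key points between frames with pyramidal Lucas-Kanade.

    prev_pts: float32 array of shape (N, 1, 2) with (x, y) positions,
    e.g., the lip corners and template key points of the current frame.
    Returns the new positions and a per-point success mask.
    """
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(15, 15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 20, 0.03))
    return next_pts, status.ravel() == 1
```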
4. Experiment Results

We tested our method on 500 image sequences (approximately 5000 images) from the Pitt-CMU Facial Expression AU Coded Database. Subjects ranged in age from 3 to 30 years old and included males and females of European, African, and Asian ancestry. They were videotaped in an indoor environment with uniform lighting. During recording, the camera was positioned in front of the subjects and provided a full-face view. Images were digitized into 640 x 480 pixel arrays with 24-bit color resolution. A large variety of lip motions was represented [3].

Figures 6, 7, and 8 show representative results of lip contour tracking. Figure 6 shows tracking results for the same subject showing different expressions. From the happy, surprised, and angry expression sequences, we see that the lip states transition into one another and correct lip contours are obtained for each lip state. The results for the dark-skinned subjects are shown in Figure 7; our method works well even for big lip movements, for example the expression of surprise. Figure 8 demonstrates the robustness of our method for tracking lip contours when subjects exhibit both non-rigid and rigid motion: the lip contours are tracked correctly under different kinds of head motion. The first and second rows demonstrate good lip tracking results for different expressions with horizontal and vertical head rotation. In the last row, excellent tracking results for an infant are given, including big lip deformation, big head motion, and background motion. Our system can track 4 images/sec on a Pentium II 400 MHz PC and works robustly for images of men, women, and children, and for people of varying skin color and appearance.

5. Conclusion and Discussion

We have described a robust method for tracking unadorned lip movements in color image sequences of various subjects and various expressions by combining color, shape, and motion information. A multi-state mouth model was introduced, with three lip states considered: open, relatively closed, and tightly closed. The lip state transitions were determined by the lip shape and color. Given the initial location of the lip template in the first frame, the lip color information was obtained by a Gaussian mixture. Then, the lip key points were tracked via the tracking method developed by Lucas and Kanade [9] in the image sequence, and the lip contours were obtained by calculating the corresponding lip template parameters. The color and shape information is used to obtain lip states. Compared with other approaches, our method requires no training data, and works well for different individuals, for mouths with specularity, and for different mouth states. Our method is able to track lips with vertical and horizontal head rotation.

A limitation of our method is that we assume that the lip template is symmetrical about the perpendicular bisector to the line connecting the lip corners. For non-symmetrical expressions and complex lip shapes, there are some errors between the tracked lip contour and the actual lip shape (Figure 9). A more complex lip template will be necessary to get more accurate lip contours for non-symmetrical expression analysis in our future work. [Figure 9: (a), (b) — examples of tracking errors for non-symmetrical lip shapes.]

Acknowledgements

The authors would like to thank Zara Ambadar for testing the method on the database. We also thank Simon Baker for his comments and suggestions on earlier versions of this paper. This work is supported by NIMH grant R01 MH51435.

References

[1] R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(10):1042-1052, Oct. 1993.
[2] G. Chow and X. Li. Towards a system for automatic facial feature detection. Pattern Recognition, 26(12):1739-1755, 1993.
[3] J. F. Cohn, A. J. Zlochower, J. Lien, and T. Kanade. Automated face analysis by feature point tracking has high concurrent validity with manual FACS coding. Psychophysiology, 36:35-43, 1999.
[4] C. Bregler and S. Omohundro. Surface learning with applications to lipreading. In J. D. Cowan, G. Tesauro, and J. Alspector (eds), Advances in Neural Information Processing Systems 6, Morgan Kaufmann Publishers, San Francisco, 1994.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM
algorithm. J. R. Stat. Soc., B(39):1-38, 1977.
[6] L. Huang and C. W. Chen. Human facial feature extraction for face interpretation and recognition. Pattern Recognition, 25(12):1435-1444, 1992.
[7] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321-331, 1988.
[8] J.-J. J. Lien. Automatic Recognition of Facial Expressions Using Hidden Markov Models and Estimation of Expression Intensity. PhD Thesis, University of Pittsburgh, 1998.
[9] B. Lucas and T. Kanade. An iterative image registration technique with an application in stereo vision. In The 7th International Joint Conference on Artificial Intelligence, pages 674-679, 1981.
[10] J. Luettin and N. A. Thacker. Speechreading using probabilistic models. Computer Vision and Image Understanding, 65(2):163-178, Feb. 1997.
[11] J. Luettin, N. A. Thacker, and S. W. Beet. Active Shape Models for Visual Speech Feature Extraction. Electronic Systems Group Report No. 95/44, University of Sheffield, UK, 1995.
[12] C. Poelman. The paraperspective and projective factorization method for recovering shape and motion. Technical Report CMU-CS-95-173, Carnegie Mellon University, 1995.
[13] R. R. Rao. Audio-Visual Interaction in Multimedia. PhD Thesis, Electrical Engineering, Georgia Institute of Technology, 1998.
[14] A. Yuille, P. Hallinan, and D. S. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99-111, 1992.