Sparse Feature Learning for Deep Belief Networks
大连理工裴文彬的学术成果
大连理工裴文彬的学术成果大连理工大学的裴文彬教授在多个领域都取得了丰硕的学术成果。
以下是他在一些重要领域的研究成果:1. 图像处理和计算机视觉,裴文彬教授在图像处理和计算机视觉方面的研究上取得了显著的成果。
他在图像分割、目标检测和识别、图像重建等方面做出了重要贡献。
他提出了一种基于深度学习的图像分割方法,该方法在准确性和效率上都取得了显著的提升。
2. 人工智能和机器学习,裴文彬教授在人工智能和机器学习领域也进行了深入的研究。
他在深度学习、强化学习和模式识别等方面做出了重要贡献。
他提出了一种基于深度神经网络的图像识别方法,该方法在多个数据集上取得了优秀的性能。
3. 数据挖掘和大数据分析,裴文彬教授在数据挖掘和大数据分析方面也有很多研究成果。
他提出了一种基于集成学习的数据分类方法,该方法在处理大规模数据时具有很高的效率和准确性。
他还研究了大数据的存储和处理技术,为大规模数据分析提供了有效的解决方案。
4. 计算机网络和信息安全,裴文彬教授在计算机网络和信息安全领域也有很多研究成果。
他研究了网络流量分析和入侵检测技术,提出了一种基于机器学习的入侵检测方法,可以有效地识别网络中的恶意行为。
他还研究了网络安全的加密和认证技术,提高了网络通信的安全性。
5. 智能交通系统,裴文彬教授还在智能交通系统方面做出了一些研究成果。
他研究了交通流量预测和交通信号优化等问题,提出了一种基于机器学习的交通流量预测方法,可以准确地预测交通拥堵情况,并提供相应的交通优化方案。
这只是裴文彬教授在学术研究上的一部分成果,他还在其他领域也有很多重要的研究贡献。
他的学术成果不仅在国内外学术界产生了广泛影响,也为相关领域的发展做出了重要贡献。
集成学习综述
集成学习综述梁英毅摘要 机器学习方法在生产、科研和生活中有着广泛应用,而集成学习则是机器学习的首要热门方向[1]。
集成学习是使用一系列学习器进行学习,并使用某种规则把各个学习结果进行整合从而获得比单个学习器更好的学习效果的一种机器学习方法。
本文对集成学习的概念以及一些主要的集成学习方法进行简介,以便于进行进一步的研究。
一、 引言机器学习是计算机科学中研究怎么让机器具有学习能力的分支,[2]把机器学习的目标归纳为“给出关于如何进行学习的严格的、计算上具体的、合理的说明”。
[3]指出四类问题的解决对于人类来说是困难的甚至不可能的,从而说明机器学习的必要性。
目前,机器学习方法已经在科学研究、语音识别、人脸识别、手写识别、数据挖掘、医疗诊断、游戏等等领域之中得到应用[1, 4]。
随着机器学习方法的普及,机器学习方面的研究也越来越热门,目前来说机器学习的研究主要分为四个大方向[1]: a) 通过集成学习方法提高学习精度;b) 扩大学习规模;c) 强化学习;d) 学习复杂的随机模型;有关Machine Learning 的进一步介绍请参考[5, 1,3, 4, 6]。
本文的目的是对集成学习的各种方法进行综述,以了解当前集成学习方面的进展和问题。
本文以下内容组织如下:第二节首先介绍集成学习;第三节对一些常见的集成学习方法进行简单介绍;第四节给出一些关于集成学习的分析方法和分析结果。
二、 集成学习简介1、 分类问题分类问题属于概念学习的范畴。
分类问题是集成学习的基本研究问题,简单来说就是把一系列实例根据某种规则进行分类,这实际上是要寻找某个函数)(x f y =,使得对于一个给定的实例x ,找出正确的分类。
机器学习中的解决思路是通过某种学习方法在假设空间中找出一个足够好的函数来近似,这个近似函数就叫做分类器[7]。
y h f h2、 什么是集成学习传统的机器学习方法是在一个由各种可能的函数构成的空间(称为“假设空间”)中寻找一个最接近实际分类函数的分类器h [6]。
引入稳定学习的多中心脑磁共振影像统计分类方法研究
第37卷第1期湖南理工学院学报(自然科学版)V ol. 37 No. 1 2024年3月 Journal of Hunan Institute of Science and Technology (Natural Sciences) Mar. 2024引入稳定学习的多中心脑磁共振影像统计分类方法研究杨勃, 钟志锴(湖南理工学院信息科学与工程学院, 湖南岳阳 414006)摘要:针对现有统计分析方法在多中心统计分类任务上缺乏稳定性的问题, 提出一种引入稳定学习的多中心脑磁共振影像的统计分类方法. 该方法使用多层3D卷积神经网络作为骨干结构, 并引入稳定学习旁路结构调节卷积网络习得特征的稳定性. 在稳定学习旁路中, 首先使用随机傅里叶变换获取卷积网络特征的多路随机序列, 然后通过学习和优化批次样本采样权重以获取卷积网络特征之间的独立性, 从而改善跨中心分类泛化性. 最后, 在公开数据库FCP中的3中心脑影像数据集上进行跨中心性别分类实验. 实验结果表明, 与基准卷积网络相比, 引入稳定学习的卷积网络具有更高的跨中心分类正确率, 有效提高了跨中心泛化性和多中心统计分类的稳定性.关键词:多中心脑磁共振影像分析; 卷积神经网络; 稳定学习; 跨中心泛化中图分类号: TP183 文章编号: 1672-5298(2024)01-0015-05 Research on a Classification Approach for Multi-site Brain Magnetic Resonance Imaging Analysis byIntroducing Stable LearningYANG Bo, ZHONG Zhikai(School of Information Science and Engineering, Hunan Institute of Science and Technology, Yueyang 414006, China) Abstract: Aiming at the lack of stability of existing statistical analysis methods suitable for single site tasks in a multi-site setting, a statistical classification approach integrating stable learning for multi-site brain magnetic resonance imaging(MRI) analysis tasks was proposed. In the proposed approach, a multi-layer 3-dimensional convolutional neural network(3D CNN) was used as the backbone structure, while a stable learning module used for improving the stability of features learning by CNN was integrated as bypassing structure. In the stable learning module, the random Fourier transform was firstly used to obtain the random sequences of CNN features, and then the independence between different sequences was obtained by optimizing sampling weights of every sample batch and improving the cross-site generalization. Finally, a cross-site gender classification experiment was conducted on the 3 brain MRI data site from the publicly available database FCP. The experimental results show that compared with the basic CNN, the CNN with stable learning has a higher accuracy in cross-site classification, and effectively improves the stability of cross-center generalization and multi-center statistical classification.Key words: multi-site brain MRI analysis; convolutional neural network; stable learning; cross-site generalization0 引言经典机器学习方法使用训练数据集来训练模型, 然后使用训练好的模型对新数据进行预测. 确保该训练—预测流程的有效性, 主要基于两点[1]: 一是理论上满足独立同分布假设, 即训练数据和新数据均独立采样自同一统计分布; 二是训练数据量要充分, 能够准确描述该统计分布.在大量实际应用中, 收集到的数据往往来自不同数据域, 不满足独立同分布假设, 导致经典机器学习方法在此场景下性能显著退化, 在某一个域中训练得到的模型完全无法迁移到其他域的数据上, 跨域泛化性差[2]. 磁共振影像(Magnetic Resonance Imaging, MRI)分析领域也同样存在此类问题. 为增大数据量以获得更优的训练效果, 单中心脑MRI分析已逐渐发展到多中心脑MRI分析. 虽然多中心影像数据量显著增收稿日期: 2023-06-19基金项目:湖南省研究生科研创新项目(CX20221231,YCX2023A50); 湖南省自然科学基金项目“面向小样本脑磁共振影像分析的数据生成技术与深度学习方法研究”(2024JJ7208)作者简介: 杨勃, 男, 博士, 教授. 主要研究方向: 机器学习、脑影像分析16 湖南理工学院学报(自然科学版) 第37卷长, 但由于存在机器参数、被试生理参数等诸多不同, 不同中心的数据无法满足独立同分布假设, 导致多中心统计分析表现出较差的稳定性[3,4].为提升多域分析的稳定性, 近年来机器学习理论研究从因果分析角度提出一系列基于线性无关特征采样的稳定预测方法[5,6], 并在低维数据上取得了一定效果, 初步展现出在多域分析上的巨大潜力. Zhang等[7]在此基础上提出稳定学习方法, 扩展了以前的线性框架, 以纳入深度模型. 由于在深度模型中获得的复杂非线性特征之间的依赖关系比线性情况下更难测量和消除[8,9], 因此稳定学习采用了一种基于随机傅里叶特征(Random Fourier Features, RFF)[10]的非线性特征去相关方法; 同时, 为了适应现代深度模型, 还专门设计了一种全局关联的保存和重新加载机制, 以减少训练大规模数据时的存储和计算成本. 相关实验表明, 稳定学习结合深度学习在高维图像识别任务上表现出较好的稳定性[7].本文尝试将稳定学习引入多中心脑MRI 的统计分类任务中, 将稳定学习与3D CNN 结合, 解决跨中心泛化性问题, 提高多中心分类稳定性. 首先介绍本研究设计的融合稳定学习的3D CNN 网络架构; 然后介绍稳定学习特征独立性最大化准则; 最后与基准3D CNN 分别在公开数据集FCP 中的3中心脑MRI 数据集上进行对比分类实验. 实验结果表明, 引入稳定学习的卷积网络具有更高的跨中心分类正确率, 有效提高了多中心脑MRI 统计分类的稳定性.1 融合稳定学习的3D CNN 架构设计融合稳定学习的3D CNN 总体架构设计如图1所示. 首先使用3D CNN 提取脑MRI 的3D 特征, 再将特征分别输出至稳定学习旁路和分类器主路进行训练. 稳定学习旁路使用随机傅里叶变换模块提取3D特征的多路RFF 特征, 然后使用样本加权解相关模块(Learning Sample Weighting for Decorrelation, LSWD)优化样本采样权重. 最后使用样本权重对分类器的预测损失进行加权, 以加权损失最小化为优化目标进行反向传播.图1 融合稳定学习的3D CNN 总体架构设计2 特征独立性最大化2.1 基于随机傅里叶变换的随机变量独立性判定设X 、Y 为两个随机变量, ()X f X 、()Y f Y 、(,)f X Y 分别表示X 的概率密度、Y 的概率密度以及X 和Y 的联合概率密度, 若满足(,)()()X Y f X Y f X f Y =,则称随机变量X 、Y 相互独立.当X 、Y 均服从高斯分布时, 统计独立性等价于统计不相关, 即Conv (,)((())(()))()()()0X Y E X E x Y E Y E XY E X E Y =--=-=,其中Conv (,)⋅⋅为两随机变量之间的协方差, ()E ⋅为随机变量的期望.第1期 杨 勃, 等: 引入稳定学习的多中心脑磁共振影像统计分类方法研究 17在本文深度神经网络中, 随机变量,X Y 就是脑MRI 的3D 特征变量. 设有n 个训练样本, 可将其视为对随机变量,X Y 分别进行了n 次采样, 获得了对应的随机序列12(,,,)n X x x x = 和12(,,,)nY y y y = . 可使用随机序列之间的协方差进行无偏估计:Conv 111()111,1n n n i j i j i j j X Y x x y y n n n ===⎛⎫⎛⎫=-- ⎪ ⎪-⎝⎭⎝⎭∑∑∑ . 需要指出的是, 若,X Y 不服从高斯分布, 则Conv0(),X Y = 不能作为变量独立性判定准则. 文[9]指出, 此情形下可将随机序列,X Y 转换为k 个随机傅里叶变换序列{RFF },){RFF }()(i i k i i kX Y ≤≤后再使用协方差进行判定.随机傅里叶变换公式为RFF ,)()(s i i i X X ωφ+ ~(0,1),i N ω~Uniform(0,2π),iφi <∞. 其中随机频率i ω从标准正态分布中采样得到, 随机相位i φ从0~2π之间的均匀分布中采样得到.通过随机傅里叶变换可获得如下两个随机矩阵RFF(),RFF()n k XY ⨯∈ : 1212RFF()(RFF ,RFF ,,RFF ),R .()())FF()(F ()RF ,RF ,()(F ,RFF ())k kX X X X Y Y Y Y == 计算这两个随机矩阵的协方差矩阵:Conv T111111(((RFF ),RF ()()()(F())RFF RFF RFF RFF 1)n n n i j i j i j j X Y X X Y Y n n n ===⎡⎤=--⎢⎣⎥-⎣⎦⎡⎤⎢⎥⎦∑∑∑ . 若||Conv 2(RFF(),RFF())||0,F X Y = 则可判定随机变量,X Y 相互独立. 本文参照文[6]建议, 固定5k =.2.2 基于样本加权的特征独立性最大化在融合稳定学习的深度神经网络中, 通过LSWD 模块优化样本权重并最大化特征之间的独立性, 优化准则如下:,1,j |arg min |()m i j i L =<=∑w Conv (RFF(w ⨀i Q ), RFF(w ⨀2))||j F Q , T s.t.,n >=0w w e .其中1n i ⨯∈ Q 为网络输出的第i 个特征序列, ⨀为Hamard 乘积运算, 1n ⨯∈ w 为n 个样本的权重, e 为全1向量. 上述优化准则, 可使得深度神经网络输出特征两两之间相互独立.3 实验结果与分析3.1 实验数据与预处理实验数据来自网上公共数据库1000功能连接组计划(1000 Functional Connectomes Project, FCP). 该公共数据库收集了35个中心合计1355名被试的脑MRI 数据. 本实验使用了FCP 中3个中心的数据集, 分别为:北京(Beijing)、剑桥(Cambridge)和国际医学会议(ICBM)[11], 主要任务是使用其中的3D 脑结构MRI 数据完成性别分类. 其中, Beijing 数据集包含被试样本140个(男性70个/女性70个), Cambridge 数据集包含被试样本198个(男性75个/女性123个), ICBM 数据集包含被试样本86个(男性41个/女性45个).在Matlab 2015中使用SPM8工具包对原始脑结构MRI 数据进行如下数据预处理:第1步 脑影像颅骨剥离;第2步 分割去颅骨脑影像为灰质、白质和脑脊液3部分(本实验仅使用灰质数据);第3步 标准化预处理, 将脑影像统一配准到MNI(Montreal Neurological Institute)模版空间;第4步 去噪与平滑预处理, 使用高斯平滑方法平滑标准化灰质影像.预处理后, 最终得到尺寸大小为121×145×121的3D 结构影像.18 湖南理工学院学报(自然科学版) 第37卷此外, 为减少后续计算量, 通过尺度缩放操作将预处理后的3D 结构影像尺寸进一步缩小至64×64×64.然后使用Z-Score 标准化方法对每个中心的数据分别进行中心偏差校正.3.2 分类器参数设置分别测试了基准3D CNN 和融合稳定学习的3D CNN 的多中心脑MRI 分类性能. 其中3D CNN 架构部分,两种分类器均采用同样的网络架构和参数, 具体如下.网络层数设计为5层, 每层包含2个3D 卷积操作(with padding), 2个ReLU 非线性映射操作和1个3D maxpooling 操作(每层窗宽均为2). 其中, 第1层卷积核尺寸为7×7×7, 第2~5层卷积核尺寸均为3×3×3, 1~5层输出通道大小分别为32、64、128、256、512.使用Pytorch 1.12.0平台搭建网络. 训练时, 初始学习率固定为0.001, 使用Adam 优化器进行训练,batchsize 固定为128(男女样本各64个).3.3 跨中心性别分类实验采用域泛化实验设置LOSO(Leave One Site Out)来测试不同分类器的跨中心脑MRI 分类的泛化能力, 即留一个中心数据作为测试数据, 其他中心数据作为训练数据. 在训练过程中, 确保用于测试的中心数据完全隔离. 实验重复三次, 使用不同的随机种子, 取平均值作为最终结果. 跨中心分类平均正确率见表1.表1 跨中心分类平均正确率(%)对比方法 测试中心 总体平均分类正确率(Cambridge, ICBM)-Beijing (Beijing, ICBM)-Cambridge (Cambridge, Beijing)-ICBMbase 75.76 73.91 72.48 74.05stable 78.11 75.59 75.97 76.56* base: 基准3D CNN; stable: 融合稳定学习的3D CNN.由表1可知, 融合稳定学习的3D CNN 在(Cambridge, ICBM)-Beijing 、(Beijing, ICBM)-Cambridge 、(Cambridge, Beijing)-ICBM 三个LOSO 分类测试中平均分类正确率分别提升2.35、1.68、3.49个百分点, 总体平均类正确率则提升2.51个百分点. 实验结果验证了引入稳定学习后, 跨中心泛化性得到明显提升.进一步绘制三个LOSO 分类任务的PR(Precision-Recall)曲线和ROC(Receiver Operating Characteristic)曲线, 并计算AUC(Area Under the Curve), 以评估分类方法的跨中心预测性能, 如图2~3所示.(a) (Cambridge, ICBM)-Beijing (b) (Beijing, ICBM)-Cambridge(c) (Cambridge, Beijing)-ICBM图2 跨中心分类ROC 曲线(a) (Cambridge, ICBM)-Beijing (b) (Beijing, ICBM)-Cambridge(c) (Cambridge, Beijing)-ICBM图3 跨中心分类PR 曲线 图2显示, 在三个LOSO 分类任务中融合稳定学习的3D CNN 的ROC 曲线明显优于基准3D CNN, 其AUC 值也分别提升了0.01, 0.05和0.05. 此外, 由每个LOSO 分类的三次随机实验统计得到的标准差相比基第1期 杨 勃, 等: 引入稳定学习的多中心脑磁共振影像统计分类方法研究 19 准3D CNN 显著下降了1个数量级, 也很好地证实了融合稳定学习的3D CNN 具有很好的多中心分类稳定性. 图3中, 除第1个LOSO 分类任务无法确定两种方法的优劣外, 在后两个LOSO 分类任务上, 融合稳定学习的3D CNN 表现明显优于基准3D CNN.最后绘制三个LOSO 分类任务训练过程中测试正确率变化曲线, 结果如图4所示.(a) (Cambridge, ICBM)-Beijing (b) (Beijing, ICBM)-Cambridge(c) (Cambridge, Beijing)-ICBM图4 跨中心分类训练过程中测试正确率变化情况 图4显示, 三个LOSO 分类任务在训练迭代到100代后, 融合稳定学习的3D CNN 的测试分类正确率稳定优于基准3D CNN, 进一步展示了引入稳定学习的多中心脑MRI 分类的有效性.4 结束语为解决多中心脑MRI 分类的稳定性问题, 本文提出引入稳定学习的统计分类方法, 设计融合稳定学习的3D CNN 架构, 通过学习样本权重提升特征之间的统计独立性, 从而提高对未知中心数据的跨中心预测能力. 通过3中心公共数据集性别分类实验, 最后验证了融合稳定学习的3D CNN 分类模型的有效性. 实验表明, 将稳定学习引入多中心脑MRI 统计分类任务中, 可以改善跨中心分类方法的泛化性能, 从而进一步提高多中心脑MRI 统计分类的稳定性.参考文献:[1] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016.[2] GEIRHOS R, RUBISCH P, MICHAELIS C, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy androbustness[EB/OL]. (2018-11-29)[2024-3-20]. https:///abs/1811.12231.[3] ZENG L L, WANG H, HU P, et al. Multi-site diagnostic classification of schizophrenia using discriminant deep learning with functional connectivityMRI[J]. EBioMedicine, 2018, 30: 74−85.[4] 李文彬, 许雁玲, 钟志楷, 等. 基于稳定学习的图神经网络模型[J]. 湖南理工学院学报(自然科学版), 2023, 36(4): 16−18.[5] KUANG K, XIONG R, CUI P, et al. Stable prediction with model misspecification and agnostic distribution shift[C]//Proceedings of the AAAI Conferenceon Artificial Intelligence, 2020, 34(4): 4485−4492.[6] KUANG K, CUI P, ATHEY S, et al. Stable prediction across unknown environments[C]//Proceedings of the 24th ACM SIGKDD International Conferenceon Knowledge Discovery & Data Mining, New York: Association for Computing Machinery, 2018: 1617–1626.[7] ZHANG X, CUI P, XU R, et al. Deep stable learning for out-of-distribution generalization[C]// Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition, IEEE Computer Society, 2021: 5368−5378.[8] LI H, PAN S J, WANG S, et al. Domain generalization with adversarial feature learning[C]//Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, IEEE Computer Society, 2018: 5400−5409.[9] GRUBINGER T, BIRLUTIU A, SCHONER H, et al. Domain generalization based on transfer component analysis[C]//Proceedings of the 13th InternationalWork-Conference on Artificial Neural Networks, Springer, 2015: 325−334.[10] RAHIMI A, RECHT B. Random features for large-scale kernel machines[C]//Proceedings of the 20th International Conference on Neural InformationProcessing Systems, 2007: 1177–1184.[11] JIANG R, ABBOTT C C, JIANG T, et al. SMRI biomarkers predict electroconvulsive treatment outcomes: accuracy with independent data sets[J].Neuropsychopharmacology, 2018, 43(5): 1078−1087.。
deep learning-based models
deep learning-based models
基于深度学习的模型(Deep Learning-based models)是一种机器学习的方法,它使用深度神经网络来处理大量的数据并从中学习。
深度学习模型通常使用大量的参数和复杂的网络结构,以在各种任务中实现卓越的性能,包括图像识别、语音识别、自然语言处理等。
深度学习模型的基本结构包括输入层、隐藏层和输出层。
输入层接收原始数据,隐藏层通过一系列复杂的计算将输入转化为有意义的特征表示,最后输出层将隐藏层的结果转化为具体的输出。
深度学习模型能够自动学习和提取输入数据的特征,这使得它们在许多任务中比传统的机器学习方法更有效。
深度学习的应用非常广泛,包括但不限于:
1.图像识别:深度学习模型可以自动学习和识别图像中的特征,例如人脸识别、物体检测等。
2.自然语言处理:深度学习模型可以处理和生成自然语言文本,例如机器翻译、文本生成等。
3.语音识别:深度学习模型可以自动识别和转化语音为文本,例如语音助手、语音搜索等。
4.推荐系统:深度学习模型可以根据用户的历史行为和偏好,自动推荐相关的内容或产品,例如视频推荐、电商推荐等。
5.医学影像分析:深度学习模型可以自动分析和识别医学影像,例如CT扫描、MRI图像等,用于辅助医生诊断和治疗疾病。
总的来说,基于深度学习的模型在人工智能领域中发挥着越来越重要的作用,并将在未来继续推动着技术的发展和创新。
基于多层特征嵌入的单目标跟踪算法
基于多层特征嵌入的单目标跟踪算法1. 内容描述基于多层特征嵌入的单目标跟踪算法是一种在计算机视觉领域中广泛应用的跟踪技术。
该算法的核心思想是通过多层特征嵌入来提取目标物体的特征表示,并利用这些特征表示进行目标跟踪。
该算法首先通过预处理步骤对输入图像进行降维和增强,然后将降维后的图像输入到神经网络中,得到不同层次的特征图。
通过对这些特征图进行池化操作,得到一个低维度的特征向量。
将这个特征向量输入到跟踪器中,以实现对目标物体的实时跟踪。
为了提高单目标跟踪算法的性能,本研究提出了一种基于多层特征嵌入的方法。
该方法首先引入了一个自适应的学习率策略,使得神经网络能够根据当前训练状态自动调整学习率。
通过引入注意力机制,使得神经网络能够更加关注重要的特征信息。
为了进一步提高跟踪器的鲁棒性,本研究还采用了一种多目标融合的方法,将多个跟踪器的结果进行加权融合,从而得到更加准确的目标位置估计。
通过实验验证,本研究提出的方法在多种数据集上均取得了显著的性能提升,证明了其在单目标跟踪领域的有效性和可行性。
1.1 研究背景随着计算机视觉和深度学习技术的快速发展,目标跟踪在许多领域(如安防、智能监控、自动驾驶等)中发挥着越来越重要的作用。
单目标跟踪(MOT)算法是一种广泛应用于视频分析领域的技术,它能够实时跟踪视频序列中的单个目标物体,并将其位置信息与相邻帧进行比较,以估计目标的运动轨迹。
传统的单目标跟踪算法在处理复杂场景、遮挡、运动模糊等问题时表现出较差的鲁棒性。
为了解决这些问题,研究者们提出了许多改进的单目标跟踪算法,如基于卡尔曼滤波的目标跟踪、基于扩展卡尔曼滤波的目标跟踪以及基于深度学习的目标跟踪等。
这些方法在一定程度上提高了单目标跟踪的性能,但仍然存在一些局限性,如对多目标跟踪的支持不足、对非平稳运动的适应性差等。
开发一种既能有效跟踪单个目标物体,又能应对多种挑战的单目标跟踪算法具有重要的理论和实际意义。
1.2 研究目的本研究旨在设计一种基于多层特征嵌入的单目标跟踪算法,以提高目标跟踪的准确性和鲁棒性。
Deep Sparse Rectifier Neural Networks
Deep Sparse Rectifier Neural NetworksXavier Glorot Antoine Bordes Yoshua BengioDIRO,Universit´e de Montr´e al Montr´e al,QC,Canada glorotxa@iro.umontreal.ca Heudiasyc,UMR CNRS6599UTC,Compi`e gne,FranceandDIRO,Universit´e de Montr´e alMontr´e al,QC,Canadaantoine.bordes@hds.utc.frDIRO,Universit´e de Montr´e alMontr´e al,QC,Canadabengioy@iro.umontreal.caAbstractWhile logistic sigmoid neurons are more bi-ologically plausible than hyperbolic tangentneurons,the latter work better for train-ing multi-layer neural networks.This pa-per shows that rectifying neurons are aneven better model of biological neurons andyield equal or better performance than hy-perbolic tangent networks in spite of thehard non-linearity and non-differentiabilityat zero,creating sparse representations withtrue zeros,which seem remarkably suitablefor naturally sparse data.Even though theycan take advantage of semi-supervised setupswith extra-unlabeled data,deep rectifier net-works can reach their best performance with-out requiring any unsupervised pre-trainingon purely supervised tasks with large labeleddatasets.Hence,these results can be seen asa new milestone in the attempts at under-standing the difficulty in training deep butpurely supervised neural networks,and clos-ing the performance gap between neural net-works learnt with and without unsupervisedpre-training.1IntroductionMany differences exist between the neural network models used by machine learning researchers and those used by computational neuroscientists.This is in part Appearing in Proceedings of the14th International Con-ference on Artificial Intelligence and Statistics(AISTATS) 2011,Fort Lauderdale,FL,USA.Volume15of JMLR: W&CP15.Copyright2011by the authors.because the objective of the former is to obtain com-putationally efficient learners,that generalize well to new examples,whereas the objective of the latter is to abstract out neuroscientific data while obtaining ex-planations of the principles involved,providing predic-tions and guidance for future biological experiments. Areas where both objectives coincide are therefore particularly worthy of investigation,pointing towards computationally motivated principles of operation in the brain that can also enhance research in artificial intelligence.In this paper we show that two com-mon gaps between computational neuroscience models and machine learning neural network models can be bridged by using the following linear by part activa-tion:max(0,x),called the rectifier(or hinge)activa-tion function.Experimental results will show engaging training behavior of this activation function,especially for deep architectures(see Bengio(2009)for a review), i.e.,where the number of hidden layers in the neural network is3or more.Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures.This is in part inspired by observations of the mammalian vi-sual cortex,which consists of a chain of processing elements,each of which is associated with a different representation of the raw visual input.This is partic-ularly clear in the primate visual system(Serre et al., 2007),with its sequence of processing stages:detection of edges,primitive shapes,and moving up to gradu-ally more complex visual shapes.Interestingly,it was found that the features learned in deep architectures resemble those observed in thefirst two of these stages (in areas V1and V2of visual cortex)(Lee et al.,2008), and that they become increasingly invariant to factors of variation(such as camera movement)in higher lay-ers(Goodfellow et al.,2009).Deep Sparse Rectifier Neural NetworksRegarding the training of deep networks,something that can be considered a breakthrough happened in2006,with the introduction of Deep Belief Net-works(Hinton et al.,2006),and more generally the idea of initializing each layer by unsupervised learn-ing(Bengio et al.,2007;Ranzato et al.,2007).Some authors have tried to understand why this unsuper-vised procedure helps(Erhan et al.,2010)while oth-ers investigated why the original training procedure for deep neural networks failed(Bengio and Glorot,2010). From the machine learning point of view,this paper brings additional results in these lines of investigation. We propose to explore the use of rectifying non-linearities as alternatives to the hyperbolic tangent or sigmoid in deep artificial neural networks,in ad-dition to using an L1regularizer on the activation val-ues to promote sparsity and prevent potential numer-ical problems with unbounded activation.Nair and Hinton(2010)present promising results of the influ-ence of such units in the context of Restricted Boltz-mann Machines compared to logistic sigmoid activa-tions on image classification tasks.Our work extends this for the case of pre-training using denoising auto-encoders(Vincent et al.,2008)and provides an exten-sive empirical comparison of the rectifying activation function against the hyperbolic tangent on image clas-sification benchmarks as well as an original derivation for the text application of sentiment analysis.Our experiments on image and text data indicate that training proceeds better when the artificial neurons are either offor operating mostly in a linear regime.Sur-prisingly,rectifying activation allows deep networks to achieve their best performance without unsupervised pre-training.Hence,our work proposes a new contri-bution to the trend of understanding and merging the performance gap between deep networks learnt with and without unsupervised pre-training(Erhan et al., 2010;Bengio and Glorot,2010).Still,rectifier net-works can benefit from unsupervised pre-training in the context of semi-supervised learning where large amounts of unlabeled data are provided.Furthermore, as rectifier units naturally lead to sparse networks and are closer to biological neurons’responses in their main operating regime,this work also bridges(in part)a machine learning/neuroscience gap in terms of acti-vation function and sparsity.This paper is organized as follows.Section2presents some neuroscience and machine learning background which inspired this work.Section3introduces recti-fier neurons and explains their potential benefits and drawbacks in deep networks.Then we propose an experimental study with empirical results on image recognition in Section4.1and sentiment analysis in Section4.2.Section5presents our conclusions.2Background2.1Neuroscience ObservationsFor models of biological neurons,the activation func-tion is the expectedfiring rate as a function of the total input currently arising out of incoming signals at synapses(Dayan and Abott,2001).An activation function is termed,respectively antisymmetric or sym-metric when its response to the opposite of a strongly excitatory input pattern is respectively a strongly in-hibitory or excitatory one,and one-sided when this response is zero.The main gaps that we wish to con-sider between computational neuroscience models and machine learning models are the following:•Studies on brain energy expense suggest that neurons encode information in a sparse and dis-tributed way(Attwell and Laughlin,2001),esti-mating the percentage of neurons active at the same time to be between1and4%(Lennie,2003).This corresponds to a trade-offbetween richness of representation and small action potential en-ergy expenditure.Without additional regulariza-tion,such as an L1penalty,ordinary feedforward neural nets do not have this property.For ex-ample,the sigmoid activation has a steady state regime around12,therefore,after initializing with small weights,all neuronsfire at half their satura-tion regime.This is biologically implausible and hurts gradient-based optimization(LeCun et al., 1998;Bengio and Glorot,2010).•Important divergences between biological and machine learning models concern non-linear activation functions.A common biological model of neuron,the leaky integrate-and-fire(or LIF)(Dayan and Abott,2001),gives the follow-ing relation between thefiring rate and the input current,illustrated in Figure1(left):f(I)=τlogE+RI−V rE+RI−V th+t ref−1,if E+RI>V th0,if E+RI≤V thwhere t ref is the refractory period(minimal time between two action potentials),I the input cur-rent,V r the resting potential and V th the thresh-old potential(with V th>V r),and R,E,τthe membrane resistance,potential and time con-stant.The most commonly used activation func-tions in the deep learning and neural networks lit-erature are the standard logistic sigmoid and theXavier Glorot,Antoine Bordes,YoshuaBengioFigure1:Left:Common neural activation function motivated by biological data.Right:Commonly used activation functions in neural networks literature:logistic sigmoid and hyperbolic tangent(tanh).hyperbolic tangent(see Figure1,right),which areequivalent up to a linear transformation.The hy-perbolic tangent has a steady state at0,and istherefore preferred from the optimization stand-point(LeCun et al.,1998;Bengio and Glorot,2010),but it forces an antisymmetry around0which is absent in biological neurons.2.2Advantages of SparsitySparsity has become a concept of interest,not only incomputational neuroscience and machine learning butalso in statistics and signal processing(Candes andTao,2005).It wasfirst introduced in computationalneuroscience in the context of sparse coding in the vi-sual system(Olshausen and Field,1997).It has beena key element of deep convolutional networks exploit-ing a variant of auto-encoders(Ranzato et al.,2007,2008;Mairal et al.,2009)with a sparse distributedrepresentation,and has also become a key ingredientin Deep Belief Networks(Lee et al.,2008).A sparsitypenalty has been used in several computational neuro-science(Olshausen and Field,1997;Doi et al.,2006)and machine learning models(Lee et al.,2007;Mairalet al.,2009),in particular for deep architectures(Leeet al.,2008;Ranzato et al.,2007,2008).However,inthe latter,the neurons end up taking small but non-zero activation orfiring probability.We show here thatusing a rectifying non-linearity gives rise to real zerosof activations and thus truly sparse representations.From a computational point of view,such representa-tions are appealing for the following reasons:•Information disentangling.One of theclaimed objectives of deep learning algo-rithms(Bengio,2009)is to disentangle thefactors explaining the variations in the data.Adense representation is highly entangled becausealmost any change in the input modifies most ofthe entries in the representation vector.Instead,if a representation is both sparse and robust tosmall input changes,the set of non-zero featuresis almost always roughly conserved by smallchanges of the input.•Efficient variable-size representation.Dif-ferent inputs may contain different amounts of in-formation and would be more conveniently repre-sented using a variable-size data-structure,whichis common in computer representations of infor-mation.Varying the number of active neuronsallows a model to control the effective dimension-ality of the representation for a given input andthe required precision.•Linear separability.Sparse representations arealso more likely to be linearly separable,or moreeasily separable with less non-linear machinery,simply because the information is represented ina high-dimensional space.Besides,this can reflectthe original data format.In text-related applica-tions for instance,the original raw data is alreadyvery sparse(see Section4.2).•Distributed but sparse.Dense distributed rep-resentations are the richest representations,be-ing potentially exponentially more efficient thanpurely local ones(Bengio,2009).Sparse repre-sentations’efficiency is still exponentially greater,with the power of the exponent being the numberof non-zero features.They may represent a goodtrade-offwith respect to the above criteria.Nevertheless,forcing too much sparsity may hurt pre-dictive performance for an equal number of neurons,because it reduces the effective capacity of the model.Deep Sparse Rectifier NeuralNetworksFigure 2:Left:Sparse propagation of activations and gradients in a network of rectifier units.The input selects a subset of active neurons and computation is linear in this subset.Right:Rectifier and softplus activation functions.The second one is a smooth version of the first.3Deep Rectifier Networks3.1Rectifier NeuronsThe neuroscience literature (Bush and Sejnowski,1995;Douglas and al.,2003)indicates that corti-cal neurons are rarely in their maximum saturation regime ,and suggests that their activation function can be approximated by a rectifier.Most previous stud-ies of neural networks involving a rectifying activation function concern recurrent networks (Salinas and Ab-bott,1996;Hahnloser,1998).The rectifier function rectifier(x )=max(0,x )is one-sided and therefore does not enforce a sign symmetry 1or antisymmetry 1:instead,the response to the oppo-site of an excitatory input pattern is 0(no response).However,we can obtain symmetry or antisymmetry by combining two rectifier units sharing parameters.Advantages The rectifier activation function allows a network to easily obtain sparse representations.For example,after uniform initialization of the weights,around 50%of hidden units continuous output val-ues are real zeros,and this fraction can easily increase with sparsity-inducing regularization.Apart from be-ing more biologically plausible,sparsity also leads to mathematical advantages (see previous section).As illustrated in Figure 2(left),the only non-linearity in the network comes from the path selection associ-ated with individual neurons being active or not.For a given input only a subset of neurons are active .Com-putation is linear on this subset:once this subset of neurons is selected,the output is a linear function of1The hyperbolic tangent absolute value non-linearity |tanh(x )|used by Jarrett et al.(2009)enforces sign symme-try.A tanh(x )non-linearity enforces sign antisymmetry.the input (although a large enough change can trigger a discrete change of the active set of neurons).The function computed by each neuron or by the network output in terms of the network input is thus linear by parts.We can see the model as an exponential num-ber of linear models that share parameters (Nair and Hinton,2010).Because of this linearity,gradients flow well on the active paths of neurons (there is no gra-dient vanishing effect due to activation non-linearities of sigmoid or tanh units),and mathematical investi-gation is putations are also cheaper:there is no need for computing the exponential function in activations,and sparsity can be exploited.Potential Problems One may hypothesize that the hard saturation at 0may hurt optimization by block-ing gradient back-propagation.To evaluate the poten-tial impact of this effect we also investigate the soft-plus activation:softplus (x )=log (1+e x )(Dugas et al.,2001),a smooth version of the rectifying non-linearity.We lose the exact sparsity,but may hope to gain eas-ier training.However,experimental results (see Sec-tion 4.1)tend to contradict that hypothesis,suggesting that hard zeros can actually help supervised training.We hypothesize that the hard non-linearities do not hurt so long as the gradient can propagate along some paths ,i.e.,that some of the hidden units in each layer are non-zero.With the credit and blame assigned to these ON units rather than distributed more evenly,we hypothesize that optimization is easier.Another prob-lem could arise due to the unbounded behavior of the activations;one may thus want to use a regularizer to prevent potential numerical problems.Therefore,we use the L 1penalty on the activation values,which also promotes additional sparsity.Also recall that,in or-der to efficiently represent symmetric/antisymmetric behavior in the data,a rectifier network would needXavier Glorot,Antoine Bordes,Yoshua Bengiotwice as many hidden units as a network of symmet-ric/antisymmetric activation functions.Finally,rectifier networks are subject to ill-conditioning of the parametrization.Biases and weights can be scaled in different (and consistent)ways while preserving the same overall network function.More precisely,consider for each layer of depth i of the network a scalar αi ,and scaling the parameters asW i =W iαi and b i =b i ij =1αj.The output units values then change as follow:s =sn j =1αj .Therefore,aslong as nj =1αj is 1,the network function is identical.3.2Unsupervised Pre-trainingThis paper is particularly inspired by the sparse repre-sentations learned in the context of auto-encoder vari-ants,as they have been found to be very useful intraining deep architectures (Bengio,2009),especially for unsupervised pre-training of neural networks (Er-han et al.,2010).Nonetheless,certain difficulties arise when one wants to introduce rectifier activations into stacked denois-ing auto-encoders (Vincent et al.,2008).First,the hard saturation below the threshold of the rectifier function is not suited for the reconstruction units.In-deed,whenever the network happens to reconstruct a zero in place of a non-zero target,the reconstruc-tion unit can not backpropagate any gradient.2Sec-ond,the unbounded behavior of the rectifier activation also needs to be taken into account.In the follow-ing,we denote ˜x the corrupted version of the input x ,σ()the logistic sigmoid function and θthe model pa-rameters (W enc ,b enc ,W dec ,b dec ),and define the linear recontruction function as:f (x,θ)=W dec max(W enc x +b enc ,0)+b dec .Here are the several strategies we have experimented:e a softplus activation function for the recon-struction layer,along with a quadratic cost:L (x,θ)=||x −log(1+exp(f (˜x ,θ)))||2.2.Scale the rectifier activation values coming from the previous encoding layer to bound them be-tween 0and 1,then use a sigmoid activation func-tion for the reconstruction layer,along with a cross-entropy reconstruction cost.L (x,θ)=−x log(σ(f (˜x ,θ)))−(1−x )log(1−σ(f (˜x ,θ))).2Why is this not a problem for hidden layers too?we hy-pothesize that it is because gradients can still flow throughthe active (non-zero),possibly helping rather than hurting the assignment of credit.e a linear activation function for the reconstruc-tion layer,along with a quadratic cost.We triedto use input unit values either before or after the rectifier non-linearity as reconstruction targets.(For the first layer,raw inputs are directly used.)e a rectifier activation function for the recon-struction layer,along with a quadratic cost.The first strategy has proven to yield better gener-alization on image data and the second one on text data.Consequently,the following experimental study presents results using those two.4Experimental StudyThis section discusses our empirical evaluation of recti-fier units for deep networks.We first compare them to hyperbolic tangent and softplus activations on image benchmarks with and without pre-training,and then apply them to the text task of sentiment analysis.4.1Image RecognitionExperimental setup We considered the image datasets detailed below.Each of them has a train-ing set (for tuning parameters),a validation set (for tuning hyper-parameters)and a test set (for report-ing generalization performance).They are presented according to their number of training/validation/test examples,their respective image sizes,as well as their number of classes:•MNIST (LeCun et al.,1998):50k/10k/10k,28×28digit images,10classes.•CIFAR10(Krizhevsky and Hinton,2009):50k/5k/5k,32×32×3RGB images,10classes.•NISTP:81,920k/80k/20k,32×32character im-ages from the NIST database 19,with randomized distortions (Bengio and al,2010),62classes.This dataset is much larger and more difficult than the original NIST (Grother,1995).•NORB:233,172/58,428/58,320,taken from Jittered-Cluttered NORB (LeCun et al.,2004).Stereo-pair images of toys on a cluttered background,6classes.The data has been prepro-cessed similarly to (Nair and Hinton,2010):we subsampled the original 2×108×108stereo-pair images to 2×32×32and scaled linearly the image in the range [−1,1].We followed the procedure used by Nair and Hinton (2010)to create the validation set.Deep Sparse Rectifier Neural NetworksTable1:Test error on networks of depth3.Bold results represent statistical equivalence between similar ex-periments,with and without pre-training,under the null hypothesis of the pairwise test with p=0.05.Neuron MNIST CIF AR10NISTP NORB With unsupervised pre-trainingRectifier 1.20%49.96%32.86%16.46% Tanh 1.16%50.79%35.89%17.66% Softplus 1.17%49.52%33.27%19.19% Without unsupervised pre-trainingRectifier 1.43%50.86%32.64%16.40% Tanh 1.57%52.62%36.46%19.29% Softplus 1.77%53.20%35.48%17.68% For all experiments except on the NORB data(Le-Cun et al.,2004),the models we used are stacked denoising auto-encoders(Vincent et al.,2008)with three hidden layers and1000units per layer.The ar-chitecture of Nair and Hinton(2010)has been used on NORB:two hidden layers with respectively4000 and2000units.We used a cross-entropy reconstruc-tion cost for tanh networks and a quadratic cost over a softplus reconstruction layer for the rectifier and softplus networks.We chose masking noise as the corruption process:each pixel has a probability of0.25of being artificially set to0.The unsuper-vised learning rate is constant,and the following val-ues have been explored:{.1,.01,.001,.0001}.We se-lect the model with the lowest reconstruction error. For the supervisedfine-tuning we chose a constant learning rate in the same range as the unsupervised learning rate with respect to the supervised valida-tion error.The training cost is the negative log likeli-hood−log P(correct class|input)where the probabil-ities are obtained from the output layer(which imple-ments a softmax logistic regression).We used stochas-tic gradient descent with mini-batches of size10for both unsupervised and supervised training phases.To take into account the potential problem of rectifier units not being symmetric around0,we use a vari-ant of the activation function for whichhalf of the units output values are multiplied by-1.This serves to cancel out the mean activation value for each layer and can be interpreted either as inhibitory neurons or simply as a way to equalize activations numerically. Additionally,an L1penalty on the activations with a coefficient of0.001was added to the cost function dur-ing pre-training andfine-tuning in order to increase the amount of sparsity in the learned representations. Main results Table1summarizes the results on networks of3hidden layers of1000hidden units each,Figure3:Influence offinal sparsity on accu-racy.200randomly initialized deep rectifier networks were trained on MNIST with various L1penalties(from 0to0.01)to obtain different sparsity levels.Results show that enforcing sparsity of the activation does not hurtfinal performance until around85%of true zeros.comparing all the neuron types3on all the datasets, with or without unsupervised pre-training.In the lat-ter case,the supervised training phase has been carried out using the same experimental setup as the one de-scribed above forfine-tuning.The main observations we make are the following:•Despite the hard threshold at0,networks trained with the rectifier activation function canfind lo-cal minima of greater or equal quality than those obtained with its smooth counterpart,the soft-plus.On NORB,we tested a rescaled version of the softplus defined by1αsoftplus(αx),which allows to interpolate in a smooth manner be-tween the softplus(α=1)and the rectifier(α=∞).We obtained the followingα/test error cou-ples:1/17.68%,1.3/17.53%,2/16.9%,3/16.66%, 6/16.54%,∞/16.40%.There is no trade-offbe-tween those activation functions.Rectifiers are not only biologically plausible,they are also com-putationally efficient.•There is almost no improvement when using un-supervised pre-training with rectifier activations, contrary to what is experienced using tanh or soft-plus.Purely supervised rectifier networks remain competitive on all4datasets,even against the pretrained tanh or softplus models.3We also tested a rescaled version of the LIF and max(tanh(x),0)as activation functions.We obtained worse generalization performance than those of Table1, and chose not to report them.Xavier Glorot,Antoine Bordes,Yoshua Bengio•Rectifier networks are truly deep sparse networks.There is an average exact sparsity(fraction of ze-ros)of the hidden layers of83.4%on MNIST,72.0%on CIFAR10,68.0%on NISTP and73.8%on NORB.Figure3provides a better understand-ing of the influence of sparsity.It displays the MNIST test error of deep rectifier networks(with-out pre-training)according to different average sparsity obtained by varying the L1penalty on the works appear to be quite ro-bust to it as models with70%to almost85%of true zeros can achieve similar performances. With labeled data,deep rectifier networks appear to be attractive models.They are biologically credible, and,compared to their standard counterparts,do not seem to depend as much on unsupervised pre-training, while ultimately yielding sparse representations.This last conclusion is slightly different from those re-ported in(Nair and Hinton,2010)in which is demon-strated that unsupervised pre-training with Restricted Boltzmann Machines and using rectifier units is ben-eficial.In particular,the paper reports that pre-trained rectified Deep Belief Networks can achieve a test error on NORB below16%.However,we be-lieve that our results are compatible with those:we extend the experimental framework to a different kind of models(stacked denoising auto-encoders)and dif-ferent datasets(on which conclusions seem to be differ-ent).Furthermore,note that our rectified model with-out pre-training on NORB is very competitive(16.4% error)and outperforms the17.6%error of the non-pretrained model from Nair and Hinton(2010),which is basically what wefind with the non-pretrained soft-plus units(17.68%error).Semi-supervised setting Figure4presents re-sults of semi-supervised experiments conducted on the NORB dataset.We vary the percentage of the orig-inal labeled training set which is used for the super-vised training phase of the rectifier and hyperbolic tan-gent networks and evaluate the effect of the unsuper-vised pre-training(using the whole training set,unla-beled).Confirming conclusions of Erhan et al.(2010), the network with hyperbolic tangent activations im-proves with unsupervised pre-training for any labeled set size(even when all the training set is labeled). However,the picture changes with rectifying activa-tions.In semi-supervised setups(with few labeled data),the pre-training is highly beneficial.But the more the labeled set grows,the closer the models with and without pre-training.Eventually,when all avail-able data is labeled,the two models achieve identical performance.Rectifier networks can maximally ex-ploit labeled and unlabeledinformation.Figure4:Effect of unsupervised pre-training.On NORB,we compare hyperbolic tangent and rectifier net-works,with or without unsupervised pre-training,andfine-tune only on subsets of increasing size of the training set.4.2Sentiment AnalysisNair and Hinton(2010)also demonstrated that recti-fier units were efficient for image-related tasks.They mentioned the intensity equivariance property(i.e. without bias parameters the network function is lin-early variant to intensity changes in the input)as ar-gument to explain this observation.This would sug-gest that rectifying activation is mostly useful to im-age data.In this section,we investigate on a different modality to cast a fresh light on rectifier units.A recent study(Zhou et al.,2010)shows that Deep Be-lief Networks with binary units are competitive with the state-of-the-art methods for sentiment analysis. This indicates that deep learning is appropriate to this text task which seems therefore ideal to observe the behavior of rectifier units on a different modality,and provide a data point towards the hypothesis that rec-tifier nets are particarly appropriate for sparse input vectors,such as found in NLP.Sentiment analysis is a text mining area which aims to determine the judg-ment of a writer with respect to a given topic(see (Pang and Lee,2008)for a review).The basic task consists in classifying the polarity of reviews either by predicting whether the expressed opinions are positive or negative,or by assigning them star ratings on either 3,4or5star scales.Following a task originally proposed by Snyder and Barzilay(2007),our data consists of restaurant reviews which have been extracted from the restaurant review site .We have access to10,000 labeled and300,000unlabeled training reviews,while the test set contains10,000examples.The goal is to predict the rating on a5star scale and performance is evaluated using Root Mean Squared Error(RMSE).4 4Even though our tasks are identical,our database is。
泛化迁移深度学习下的跨模态图像行人识别算法
第42卷 第1期吉林大学学报(信息科学版)Vol.42 No.12024年1月Journal of Jilin University (Information Science Edition)Jan.2024文章编号:1671⁃5896(2024)01⁃0137⁃06泛化迁移深度学习下的跨模态图像行人识别算法收稿日期:2022⁃10⁃13基金项目:西安明德理工学院科研基金资助项目(2021XY01L09)作者简介:蔡现龙(1976 ),男,陕西渭南人,西安明德理工学院讲师,主要从事计算机科学与技术研究,(Tel)86⁃189****7386(E⁃mail)2631069053@㊂蔡现龙,李 阳,陈 曦(西安明德理工学院信息工程学院,西安710124)摘要:针对由于受光照条件变化㊁行人身高差异等影响,致使监控视频图像在不同时刻的成像存在较大的跨模态差异问题,为准确识别跨模态图像中的行人,提出基于泛化迁移深度学习的跨模态图像行人识别算法㊂通过循环生成对抗网络(Cyele GAN:Cycle Generative Adversarial Network)形成跨模态图像,采用单目标图像处理对基准图分割处理,得到人体候选区域,在匹配图中搜索和其匹配的区域,得到人体区域的视差,通过视差提取人体区域的深度和透视特征㊂将注意力机制和跨模态行人识别相结合,分析两种不同类型图像的差异,将两个子空间映射到同一个特征空间,同时引入泛化迁移深度学习算法对损失函数度量学习,自动筛选跨模态图像的行人特征,最终通过模态融合模块将筛选的特征融合处理完成行人识别㊂实验结果表明,所提算法可以快速㊁准确地提取不同模态图像中的行人,识别效果较好㊂关键词:泛化迁移深度学习;跨模态图像;行人识别;特征提取中图分类号:TP311文献标志码:APedestrian Recognition Algorithm of Cross⁃Modal Image under Generalized Transfer Deep LearningCAI Xianlong,LI Yang,CHEN Xi(School of Information Engineering,Xi’an Mingde Institute of Technology,Xi’an 710124,China)Abstract :Due to the influence of changes in lighting conditions and pedestrian height differences,there are large cross modal differences in surveillance video images at different times.In order to accurately identify pedestrians in cross modal images,a pedestrian recognition algorithm based on generalized transfer depth learning is proposed.The cross modal image is formed through Cyele GAN(Cycle Generative Adversarial Network),and the reference map is segmented using single object image processing to obtain candidate human body regions.The matching regions are searched in the matching map to obtain the disparity of human body regions,and the depth and perspective features of human body regions are extracted through the disparity.The attention mechanism and cross modal pedestrian recognition are combined to analyze the differences between the two types of images.The two subspaces are mapped to the same feature space.And the generalized migration depth learning algorithm is introduced to learn the loss function measurement,automatically screen the pedestrian features of the cross modal images,and finally complete pedestrian recognition through the modal fusion module to fuse the filtered features.The experimental results show that the proposed algorithm can quickly and accurately extract pedestrians from different modal images,and the recognition effect is good.Key words :generalization transfer deep learning;cross⁃modal images;pedestrian recognition;feature extraction0 引 言由于在光照条件较差的环境中对单模态行人识别,无法满足相关领域对行人识别效果的预期要求,因此人们将深度学习技术应用于行人识别[1⁃2]中,并在对应的数据集中取得了较高的识别率㊂由于昼夜光照差异比较明显,导致跨模态的行人识别面临巨大挑战㊂目前人们针对跨模态行人识别方面的研究已有许多报道,如王留洋等[3]优先组建双模态特征提取网络,通过构建的网络对图像深度特征实行提取操作,增强处理全部特征后融合图像的全部像素信息,完成行人识别㊂Oh 等[4]利用多个图像区域(头部㊁身体等)的convnet 特征构建了行人识别框架,从时间和视点两方面分析了不同特征的重要性,利用人脸识别器实现了行人人脸识别㊂郑爱华等[5]采用双路模型提取不同模态下的全局特征,对其实行局部精细化处理,挖掘行人的结构化局部信息;通过标签和预测信息构建跨模态局部信息之间的关联,完成跨模态融合处理,确保各个特征之间相互补充,最终实现行人识别㊂为降低光照等因素引起的图像模态差异对行人识别效果的影响,笔者引入泛化迁移深度学习,提出一种跨模态图像行人识别算法㊂经实验测试结果表明,所提算法能有效降低行人识别时间,提升行人识别结果的准确性㊂1 跨模态图像行人识别模型设计1.1 跨模态图像行人特征提取由于受摄像机角度㊁外部环境等因素影响,使行人视频监控图像产生了较大的模态差异,为此需要将识别的行人视频设定为一个图像集,利用Cyele GAN 生成跨模态图像㊂由于人体的轮廓在图像集中近似为矩形,所以可借助矩形目标检测方法得到人体候选区域㊂优先采用Hough变换方法提取行人的主要图1 人体候选区域获取流程图Fig.1 Flow chart of human body candidate region acquisition 特征信息,通过视知觉分组的灰度分类器和共圆分类器将人体候选区域虚假信息剔除㊂图1给出了人体候选区域获取的详细操作流程图㊂为得到人体候选区域不同区域的特征信息,优先需要获取不同区域的视差㊂在实际操作过程中,采用基于局部约束的像素点区域匹配算法㊂以基准图中待匹配像素点为中心构建一个窗口,通过窗口内相邻像素的灰度值描述图像中的像素特征㊂将基准图中随机一个像素点设定为中心,同时创建多个大小完全一致的滑动窗口,引入搜索策略获取像点图在对准图中对应的像素点,两者之间的差值即为视差㊂块匹配方法[6⁃7]的核心是将基准图待匹配的窗口设定为模板图像,对准图像作为目标图像,对两者实行模板匹配㊂在匹配过程中,主要通过人体候选区域每个灰度间的相关测度描述不同视图间的相关性,如下:D p SSD (h )=∑(u ,w )∈R p R (u ,v )-I m R (u ,v ),(1)其中D p SSD (h )表示视图之间的相关性;R (u ,v )表示跨模态图像的水平偏移量;I m 表示基准图像;R p 表示随机像素对应的块状邻域㊂由于每个候选区域的相关性保持不变,所以需要将目标区域中区域相关性设定为式(1)的形式,进而获取目标区域对应的距离测度,如下:D T SAD (h )=∑(u ,w )∈R p 1R (u ,v )-I m R (u ,v ),(2)其中D T SAD (h )表示各个目标区域之间的距离测度㊂在实际应用过程中,需要消除左右两个视图之间由于光照亮度产生的差异,为此引入零均值方法,将其应用目标匹配过程中,进而获取零均值视图相关性D T ZSAD (h ),如下:831吉林大学学报(信息科学版)第42卷D T ZSAD (h )=∑(u ,w )∈R p R (u ,v )-I m -1R (u ,v )-I m R (u ,v )㊂(3) 通过候选人体区域取代式(2)和式(3)中的目标区域,而候选人体区域的视差可根据外极线约束在经过校正处理后的左右视图中,沿外极线方向搜索目标最小视图相关性D T ZSAD (h ),如下:[D T SAD (h )]min =arg min (u ,w )∈R p [D p SSD (h )-D T ZSAD (h )]㊂(4) 在跨模态图像中,人体和其他物体之间存在明显差异,则跨模态图像可能出现的行人身高最小值为h min ,如下:h min =H -b D T ZSAD (h ),(5)其中H 表示人体候选区域内的深度特征㊂设定人体区域在空间中的真实长度为l ,在采集人体图像的过程中,可通过小孔透视比例得到不同轮廓的特征提取结果:W (u ,v )=(z -h )h 1R (u ,v ),(6)其中z 表示人体候选区域的深度㊂由于跨模态图像中人体候选区域的视差半径和真实人体身高之间存在密切关联,而人体的真实身高可看做是行人的固有特征,设定行人身高的变化范围,则有h min ≤h ≤h max ㊂通过上述分析,利用图2给出跨模态图像行人特征提取流程图㊂图2 跨模态图像行人特征提取流程Fig.2 Flow chart of pedestrian feature extraction from cross⁃modal images 通过人体视觉[8⁃9]可得到人体区域的深度和透视特征,如下:S (u ,v )=1[D T SAD (h )]min R (u ,v )I m ,T (u ,v )={W (u ,v )(z -h )}2I m ìîíïïï,(7)其中S (u ,v )和T (u ,v )分别表示人体区域的深度特征和透视特征㊂1.2 泛化迁移深度学习下的跨模态图像行人识别深度学习中的注意力机制是指重点关注图像的细节信息,忽略没有利用价值的信息,使其在图像领域得到广泛应用,取得了十分显著的成果㊂将通道域思想应用于跨模态图像行人识别中,可以快速获取红绿蓝(RGB:Red,Green and Blue)和相对照度(RI:Relative Illumination)图像两者之间的差异性,进而准确区分不同类型的行人㊂通过SeNet 网络的思想全面引入压缩激活神经网络,其中压缩激活模块主要是利用每个通道之间的关系,学习特征权重,有效增强特征图关键信息的权重比例㊂设定输入特征为F ={f 1,f 2, ,f n },大小为F ∈E (h ,w ,c ),优先对1.1小节得到的特征压缩处理,通过全局池化的方式,将特征图转换为大小完全相同的向量,即全局通道描述符b (u ,v ),如下:b (u ,v )=F (sp )(u ,v ),1W (u ,v )∑m =1∑n =1f n (i ,j {),(8)931第1期蔡现龙,等:泛化迁移深度学习下的跨模态图像行人识别算法其中F (sp )(㊃)表示压缩操作;f n (i ,j )表示通道总数㊂通过两个全连接层得到特征向量u 的计算如下:u =H (u ,v )(i ,j ),β(g (u ,v {)),(9)其中H (u ,v )(㊃)表示激励操作;β表示激活函数;g (u ,v )表示两个全连接层对应的权值矩阵㊂将注意力机制应用于跨模态图像行人识别中,构建基于压缩激活机制的双路径跨模态模型,模型中融入了压缩激活模块,方便后续学习更加具有鲁棒性的特征㊂学习不同模态下的特征,将其映射到对应的子空间中㊂通过上述分析,优先计算行人各个特征之间的欧氏距离,并基于其再次计算即可获取三元组损失函数,如下:K chtri =1F (sp )(u ,v )∑m =1∑n =1f n (i ,j )[max(D (u ,v )-min D (u ,v ))+β],(10)其中K chtri 表示三元组损失函数;D (u ,v )表示相同跨模态图像之间的特征距离㊂将三元组损失函数和身份损失函数两者结合,最终获取综合损失函数如下:K tocal =K chtri +K id ,(11)其中K tocal 表示综合损失函数;K id 表示身份损失函数㊂经上述分析,引入泛化迁移深度学习算法对综合损失函数度量学习,则有:K tocal (u ,v )=(k a ,p -β)K chtri +K id ,(12)其中K tocal (u ,v )表示综合损失函数的度量学习结果;k a ,p 表示超参数㊂对输入的原始图像,通过测试集形成的跨模态图像集并没有得到充分应用,所以需要借助模态融合模块将两种筛选后的特征融合处理,同时将融合后的结果输入到全连接层中,采用SoftMax 损失展开有监督的训练㊂模态融合[10]模块的主要目的是将原始图像和跨模态图像两者有效融合,在设定条件下可利用RGB 图得到丰富的颜色特征,采用RI 图像可得到丰富的纹理特征,如下:L lsr =(1-β)lg{p (k )}-1/K chtri (k a ,p -β),(13)其中L lsr 表示跨模态图像的纹理特征;p (k )表示平滑参数㊂采用模态融合模块融合处理上述提取的特征和式(13)提取的纹理特征,以实现跨模态图像行人识别,如下:Q (u ,v )=1/(1-β){(k a ,p -β)K tocal (u ,v )}f n (i ,j ),(14)其中Q (u ,v )表示跨模态图像的行人识别结果㊂至此,实现跨模态图像行人识别㊂2 实验分析为验证所提泛化迁移深度学习下的跨模态图像行人识别算法的有效性,实验在INRIA Person Dataset 图像库(http:∥pascal.inrialpes.fr /data /human /)中随机选择200幅跨模态图像作为测试图像集,设定图像的大小为256×256像素,优先利用图3给出部分测试图像㊂图3 部分行人测试图像集Fig.3 Part of the pedestrian test image set 041吉林大学学报(信息科学版)第42卷将文献[3⁃4]算法作为所提方法的对比方法,从不同角度对图3所示的行人图像进行测试㊂2.1 实验流程实验计算机配备IntelXeon 6230(2.10GHz)CPU 和32GByte 视频内存的NVIDIA Tesla V100视频卡㊂实验中,文献[3⁃4]算法行人识别流程和参数设置依照其实验最佳参数进行设定㊂笔者算法具体的实验流程如图4所示㊂图4 所提算法识别流程Fig.4 Identification process of the proposed algorithm 2.2 实验结果分析在图3所示的测试图像集上进行实验测试,分析不同算法的识别效果,实验测试结果如图5所示㊂图5 不同算法的跨模态图像行人识别结果对比Fig.5 Comparison of pedestrian recognition results incross⁃modal images by different algorithms 从图5可看出,无论白天还是夜晚,采用所提算法均可准确识别行人,而另外两种算法在比较复杂的场景下只能识别出行人的局部特征信息,出现了漏识和误识现象㊂由此可见,所提算法利用模态融合模块能更好地完成行人识别,且受光照差异造成的模态差异影响较小㊂以相同数据集中不同光照强度的图像作为测试对象,将识别时间作为测试指标,表1给出了具体实验分析结果㊂表1 不同算法的跨模态图像行人识别时间测试结果对比 平均识别时间为1.732s,分别低于另外两种算法的1.79s 和1.85s,全面验证了笔者算法的优势,同时可以更快的速度完成行人识别,受光照影响较小㊂141第1期蔡现龙,等:泛化迁移深度学习下的跨模态图像行人识别算法图6 图像不同视差距离下峰值信噪比数值Fig.6 Peak signal to noise ratio values of images at different parallax distances 以峰值信噪比(PSNR:Peak Singal⁃Noise Ratio)为指标,测试在图像不同视差距离下行人识别的峰值信噪比数值,结果如图6所示㊂从图6可看出,随着视差距离的增大,行人识别图像峰值信噪比数值虽然呈现降低趋势,但降低幅度很小㊂其中笔者方法的峰值信噪比数值始终高于两种对比算法㊂上述结果说明笔者方法将泛化迁移深度学习引入到行人识别中,获取的行人识别结果较完整,表明识别能力较好㊂3 结 语针对行人识别方法受光照㊁视差距离影响产生的模态差异造成识别时间较长以及识别结果不准确的问题,笔者提出一种泛化迁移深度学习下的跨模态图像行人识别算法㊂通过和另外两种算法对比可知,笔者算法可以全面降低行人识别所用时间,同时还能增加识别结果准确性,为后续开展此方面研究提供了重要的策略和理论依据㊂参考文献:[1]祁磊,于沛泽,高阳.弱监督场景下的行人重识别研究综述[J].软件学报,2020,31(9):2883⁃2902.QI L,YU P Z,GAO Y.Research on Weak⁃Supervised Person Re⁃Identification [J].Journal of Software,2020,31(9):2883⁃2902.[2]韩光,葛亚鸣,张城玮.基于去相关高精度分类网络与重排序的行人再识别[J].计算机应用研究,2020,37(5):1587⁃1591,1596.HAN G,GE Y M,ZHANG C W.Person Re⁃Identification by Decorrelated High⁃Precision Classification Network and Re⁃Ranking [J].Application Research of Computers,2020,37(5):1587⁃1591,1596.[3]王留洋,芮挺,郑南,等.基于跨模态特征增强的RGB⁃T 行人检测算法研究[J].兵器装备工程学报,2022,43(5):254⁃260.WANG L Y,RUI T,ZHENG N,et al.Research on RGB⁃T Pedestrian Detection Algorithm Based on Cross⁃Modal Feature Enhancement [J].Journal of Ordnance Equipment Engineering,2022,43(5):254⁃260.[4]OH S J,BENENSON R,FRITZ M,et al.Person Recognition in Personal Photo Collections [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020,42(1):203⁃220.[5]郑爱华,曾小强,江波,等.基于局部异质协同双路网络的跨模态行人重识别[J].模式识别与人工智能,2020,33(10):867⁃878.ZHENG A H,ZENG X Q,JIANG B,et al.Cross⁃Modal Person Re⁃Identification Based on Local Heterogeneous CollaborativeDual⁃Path Network [J].Pattern Recognition and Artificial Intelligence,2020,33(10):867⁃878.[6]AGARWAL R,VERMA O P.Robust Copy⁃Move Forgery Detection Using Modified Superpixel Based FCM Clustering withEmperor Penguin Optimization and Block Feature Matching [J].Evolving Systems,2022,13(1):27⁃41.[7]JAVDANI D,RAHMANI H,WEISS G.SeMBlock:A Semantic⁃Aware Meta⁃Blocking Approach for Entity Resolution [J].Intelligent Decision Technologies:An International Journal,2021,15(3):461⁃468.[8]WU J Y,LU C H,LO H H,et al.P⁃23:Image Adaptation to Human Vision (Eyeglasses Free):Full Visual⁃CorrectedFunction in Light⁃Field Near⁃to⁃Eye Displays [J].SID International Symposium:Digest of Technology Papers,2021,52(3):1143⁃1145.[9]ANNAMALAI R,DORNEICH M,TOKADLI G.Evaluating the Effect of Poor Contrast Ratio in Simulated Sensor⁃Based VisionSystems on Performance [J].IEEE Transactions on Human⁃Machine Systems,2021,51(6):632⁃640.[10]邓佳桐,程志江,叶浩劼.改进YOLOv3的多模态融合行人检测算法[J].中国测试,2022,48(5):108⁃115.DENG J T,CHENG Z J,YE H J.Multimodal Fusion Pedestrian Detection Algorithm Based on Improved YOLOv3[J].China Measurement &Testing Technology,2022,48(5):108⁃115.(责任编辑:刘东亮)241吉林大学学报(信息科学版)第42卷。
retinaface原理
retinaface原理RetinaFace是一种检测人脸的深度学习算法,能够精确地检测出多个人脸并对其位置、大小和姿态进行准确估计。
RetinaFace最初由中国香港城市大学傅仁辉教授领导的团队开发,在2019年通过CVPR论文《RetinaFace: Single-stage Dense Face Localisation in the Wild》首次公布于众。
RetinaFace的原理基于SSD(Single Shot Multibox Detector)和DenseBox两种方法的结合。
SSD是一种基于深度学习的目标检测算法,能够在不使用候选框的情况下,精确地检测目标。
而DenseBox则是一种密集的候选框生成方法,能够在不牺牲精度的情况下提高检测速度。
通过将这两种方法结合,RetinaFace能够在不牺牲精度的情况下提高检测速度。
更具体地说,RetinaFace首先使用一系列锚框(anchor)对图像进行划分,然后对每个锚框进行分类和回归。
其中,分类是指判断锚框是否包含人脸,回归则是指根据锚框的位置、大小和姿态等信息,对人脸的位置、大小和姿态进行精确估计。
为了提高检测效果,RetinaFace使用了多尺度归一化(multi-scale normalization)和特征金字塔(feature pyramid)等技术,使得算法能够适应不同大小和姿态的人脸。
总的来说,RetinaFace是一种非常高效和精确的人脸检测算法,能够广泛应用于实际场景中的人脸识别、人脸对齐、人脸特征提取等任务。
尤其是在进行大规模人群检测时,RetinaFace具有非常大的优势,能够大幅提高检测速度和准确度。
未来,随着深度学习技术的不断发展,相信RetinaFace会在更多应用场景中得到广泛应用。
基于BBNN的网络攻击文本自动化分类方法
第22卷第1期信息工程大学学报Vol.22No.12021年2月Journal of Information Engineering UniversityFeb.2021㊀㊀收稿日期:2020-08-31;修回日期:2020-09-08㊀㊀基金项目:国家自然科学基金资助项目(61502528)㊀㊀作者简介:欧昀佳(1989-),男,硕士生,主要研究方向为网络安全㊂DOI :10.3969/j.issn.1671-0673.2021.01.008基于BBNN 的网络攻击文本自动化分类方法欧昀佳,周天阳,朱俊虎,臧艺超(信息工程大学,河南郑州450001)摘要:基于描述文本的网络攻击自动化分类是实现APT 攻击知识智能抽取的重要基础㊂针对网络攻击文本专业词汇多㊁难识别,语义上下文依赖强㊁难判断等问题提出一种基于上下文语义分析的文本词句特征自动抽取方法,通过构建BERT 与BiLSTM 的混合神经网络模型BBNN (BERT and BiLSTM Neural Network ),计算得到网络攻击文本的初步分类结果,再利用方差过滤器对分类结果进行自动筛选㊂在CAPEC (Common Attack Pattern Enumeration and Classifica-tion )攻击知识库上的实验结果显示,该方法的准确率达到了79.17%,相较于单一的BERT 模型和BiLSTM 模型的分类结果分别提高了7.29%和3.00%,实现了更好的网络攻击文本自动化分类㊂关键词:神经网络;APT 网络攻击;文本分类中图分类号:TP393㊀㊀㊀文献标识码:A文章编号:1671-0673(2021)01-0044-07Document Classification in Cyberattack Text Based on BBNN ModelOU Yunjia,ZHOU Tianyang,ZHU Junhu,ZANG Yichao(Information Engineering University,Zhengzhou 450001,China)Abstract :The document classification in cyberattack text is fundamental to automatic knowledge ex-traction from APT attack information.In this paper,an automatic method based on context analysis is proposed to tackle the problems rooted in cyberattack,such as having too many terminologies,be-ing hard to distinguish and classify,over-relying on context,etc.,by extracting text features in words level and sentences level respectively.This method,BBNN (BERT and BiLSTM Neural Net-work)model,synthesized BERT and BiLSTM Neural Network,can compute the preliminary classifi-cation results of cyberattack,and automatically filter the classification results of the text via vari-ance.The experiment results from the attack knowledge base of CAPEC (Common Attack Pattern Enumeration and Classification)suggests this method can reach a 79.17%accuracy,which is in-creased by 7.29%and 3.00%compared to singular BERT or BiLSTM models,and thus achieve a better automatic classification of cyberattack.Key words :neural network;cyber attack;document classification㊀㊀随着信息技术的飞速发展,网络攻击手段层出不穷,在互联网上的黑客论坛㊁技术博客以及安全厂商发布的研究报告中含有大量关于高级可持续威胁(Advanced Persistent Threat,APT)攻击的研究成果㊂这些原始的攻击技术描述信息和攻击案例的流程分析文本是安全研究人员构建网络攻击安全知识体系,深化攻击研究的重要数据来源㊂从复杂多样的攻击描述文本中抽取APT 攻击㊀第1期欧昀佳,等:基于BBNN的网络攻击文本自动化分类方法45㊀知识,构建完整的网络攻击知识图谱,可以更好地确定安全威胁程度,推测攻击者意图,拟定有针对性地防御措施㊂通过网络攻击文本分类,将攻击知识映射到ATT&CK(Adversarial Tactics,Tech-niques,and Common Knowledge)㊁CAPEC等成熟攻击知识框架,是认知网络攻击,抽取攻击知识的重要基础㊂传统的人工阅读专家判定的分类方法存在人员素质要求严苛㊁效率低下㊁成本过高等问题㊂如何自动㊁快速㊁准确地识别大量网络安全文本中攻击手段的所属类别是当前网络安全研究的新热点㊂开源的网络攻击描述文本具有来源多样㊁表达各异㊁语法不规范等特点㊂例如,技术性短文本偏重于对某一具体的攻击技术实现进行描述,而大多APT报告则通过长期跟踪着重对捕获案例的整个攻击流程进行描述㊂原始网络攻击文本的这些特点极大地增加了自动化识别与分类的难度㊂针对该问题,本文构建了一种基于新型混合神经网络模型的网络攻击文本分类方法,通过分析网络攻击文本的上下文语义,自动抽取文本词句特征,并以此作为自动化攻击判定与分类的计算依据㊂该方法利用模型输出的概率分布方差评估预测项的确定程度,对特定的低概率分类项进行随机选取预测,可以有效提高网络攻击文本分类的准确率㊂1㊀相关研究文本分类是自然语言处理的一大热门研究领域㊂早期的研究大多基于规则和统计的方法,但该类方法存在效率低下㊁人工成本高㊁泛化能力弱㊁特征提取不充分等缺点㊂神经网络能够弥补上述缺点更好地解决分类问题㊂文献[1]首先提出基于卷积神经网络的文本特征提取,并用于文本倾向性分类任务上㊂文献[2]在此基础上提出VDCNN模型,该方法采用深度卷积网络对文本局部特征进行提取,提高文本分类效果㊂但是卷积神经网络在考虑文本的上下文关系时有很大局限性,其固定的卷积核不能建模更长的序列信息,使得文本分类效果很难进一步提升㊂RNN(Recurrent Neural Network)模型不仅可以处理不同长度的文本,而且能够学习输入序列的上下文语义关系㊂双向RNN[3]可以让输出获取当前输出之前以及之后的时间步信息㊂但由于RNN是通过时间序列进行输入,随着输入的增多,RNN对很久以前信息的感知能力下降,将会产生长期依赖和梯度消失问题[4]㊂在RNN基础之上改进而来的门控制单元(GRU)[5]和长短时序记忆模型(Long Short Term Memory,LSTM)[6]利用门机制可以解决RNN的长期依赖和梯度消失问题㊂文献[7]利用RNN设计了3种用于多任务文本分类问题的模型㊂文献[8]提出了一种基于word2vec与LSTM模型的健康文本分类方法㊂为进一步提高文本分类的准确率,文献[9]首先提出了一种用于文档分类的分层注意力网络模型㊂该模型在单词和句子级别应用了两个不同的注意力机制,使得模型在构建文档时能够给予重要内容更大权重,同时也可以缓解RNN在捕捉文档序列信息时产生的梯度消失问题㊂文献[10]提出一种基于注意力机制的深度学习态势信息推荐模型,该模型能学习指挥员与态势信息之间的潜在关系,为用户推荐其关心的重要信息㊂在注意力机制基础上,Google团队提出了Transformer模型[11],它摒弃了常用的CNN或者RNN模型,采用Encoder-Decoder架构,对文本进行加密㊁解密处理㊂Trans-former通过利用大量原始的语料库训练,从而得到一个泛化能力很强的模型,只需微调参数进行训练,就可以将模型应用到特定的文本分类任务中[12]㊂词嵌入(word embedding)是单词的一种数值化表示方法,是机器学习方法用于文本分类的基础㊂通过此技术可以将文本中的词语转化为在向量空间上的分布式表示,并大大降低词语向量化后的维度㊂预训练模型是在具体分类任务训练前使用大量文本训练词嵌入模型,使模型通过学习一般语义规律,获得单词的特定词嵌入表达㊂文献[13]在2013年提出了word2vec词嵌入训练模型,包括CBOW模型和Skip-Gram模型[14],这两种模型可以通过学习得到高质量的词语分布式表达,能训练大量的语料且捕捉文本之间的相似性㊂但其缺点是只考虑到了文本的局部信息,未考虑整体信息㊂文献[15]提出GloVe模型,利用共现矩阵,同时考虑局部信息和整体信息㊂上述提及的词嵌入技术均属于静态词嵌入,即训练后词向量固定不变㊂但在实际应用中,同一词在不同语境下的语义是不一样的,其表达需要根据不同语境进行变化㊂针对该问题,文献[16]提出了BERT预训练模型,使用多层双向Transformer编码器对海量语料进行训练,结合所有层的上下文信息进行提取,实现了文本的46㊀信息工程大学学报㊀2021年㊀深度双向表示㊂网络攻击的描述文本相对特殊,大多采用人工分析,专家定义的方式㊂该类文本具有长度短㊁专业词汇多㊁文字特征稀疏㊁上下文依赖性强等特点㊂神经网络的优势在于对文本的特征提取上,但短文本的文字特征稀疏,单一的神经网络难以进行有效处理㊂为此,可以在此基础上引入迁移学习的思想,即利用经过预训练的深度神经网络对文本进行编码,经过这样编码的文本蕴含了在大量文本上提取的一般特征,然后再针对具体分类任务将这些特征用于神经网络的训练㊂由于具体分类任务的关注点不同,导致文本信息对分类预测的重要程度不同㊂注意力机制可以使分类模型关注到重点信息,从而有效提升模型性能㊂本文将神经网络㊁预训练和注意力机制进行有机融合,开展网络攻击分类方法研究㊂2㊀网络攻击文本分类模型网络攻击文本分类任务不同于一般的新闻文本分类任务,其存在描述范围有限㊁各类攻击关联性强㊁区分度不大且专业性词汇多的特点㊂例如:句1:Client side injection-induced buffer over-flow,this type of attack exploits a buffer overflow vul-nerability in targeted client software through injection of malicious content from a custom-built hostile serv-ice.句2:Log injection-tampering-forging,this attack targets the log files of the target host.句1㊁句2都有对注入injection的描述,但两者描述的攻击类别分别属于Manipulate Data Struc-tures类和Manipulate System Resources类㊂句子中的exploits属于专业词汇,与平时所指代的 开发㊁开拓 意义不同,这里指的是 对软件的脆弱性利用 ㊂这导致了普通的静态预训练模型在该任务上性能不佳㊂针对以上问题,本文提出从词语级和句子级分别对攻击文本进行特征提取㊂利用BERT模型的动态编码机制对输入攻击文本进行动态编码表示,使计算出的句子向量能包含具体攻击描述文本中的语义信息,解决文本中存在的语义偏移和专业词汇识别难等问题㊂同时,采用BiLSTM模型提取攻击文本中词与词的前后关联关系特征,学习的词向量利用注意力机制强化特殊词汇对于攻击文本分类的权重比值,解决文本中存在的描述类似导致区分度小等问题㊂然后,结合以上两个模型学习到的特征,分别计算出攻击文本分类的预估值㊂继而对获取的两类预估值进行分析判定,求出最后的分类预测值㊂如图1所示,BBNN攻击文本分类模型分为4层,输入层㊁模型计算层㊁预值分析层㊁输出层㊂下面将依次对模型的各层结构进行详细介绍㊂图1㊀BBNN攻击文本分类模型输入层的数据处理要经过数据输入㊁分词㊁向量计算3个阶段㊂对于要输入给BiLSTM模块的数据只需经过简单的分词和词嵌入处理㊂而对于要输入给BERT模块的数据则需要由词向量㊁分段向量和位置向量组成㊂BERT模块的分词阶段主要包含两个部分:①将输入的文本进行wordpiece 处理,wordpiece处理是将词划分为更小的单元,如图1中rapport一词就被划分为rap和##port,这样做的目的是压缩了词典大小,并且增大了可表示词数量;②在句子首尾分别嵌入[CLS]和[SEP]两个特殊token㊂对于BERT模块来说,由于文本分类任务的输入对象是一个单句,所以分段向量E A全部以0进行填充㊂位置向量(E1至E7)则通过正弦余弦编码来获取,表征该句子序列的位置信息㊂将上述3部分向量相加得到BERT模块的输入向量㊂模型计算层包含BiLSTM计算模块和BERT 计算模块㊂这两个模块需要进行单独训练,再加入到整体的BBNN攻击文本分类模型结构中,如图2所示㊂从图2可以看到,BiLSTM模块包含一个前向LSTM神经网络和一个后向LSTM神经网络,分别学习输入序列中各个词的前后信息㊂该模块接收输入层编码的词嵌入向量Embedding作为输入㊂E代表经过词嵌入处理的词向量,t代表第t个时间步㊂hңt是第t个时间步的前向隐藏状态编码,它由上一个时间步的隐藏状态编码hңt-1和当前输㊀第1期欧昀佳,等:基于BBNN 的网络攻击文本自动化分类方法47㊀图2㊀BiLSTM 模块结构入向量E t 计算得出㊂h ѳt 是第t 个时间步的后向隐藏状态编码,由下一个时间步隐藏状态编码h ѳt-1和当前输入向量E t 计算得出㊂连接前向后向隐藏状态向量得到含有上下文信息的t 时刻隐藏状态向量h =[h ңt ,h ѳt ]㊂综合各个时间步隐藏状态向量得到文本词向量矩阵H ={h 0,h 1 ,h n }作为输出,传递给实现注意力机制的Attention 网络㊂Atten-tion 网络接收文本向量矩阵H 作为输入,为每个输入值分配不同的权重ω和偏置项b ,计算出文本中每个单词的权重μ,得到权重矩阵U ㊂W 是随机初始化生成的矩阵:u t =tanh(h t ∗ωt +b )(1)U =tanh(H ∗W +b )(2)V 是随机初始化生成的一维矩阵,使用softmax 函数计算出每个时间步对当前时刻的权值向量αt ,得到得分矩阵A :αt =e u t∗Vðt e u t∗V(3)A =Softmax(UV )(4)将每个得分权重和对应的隐藏状态编码进行加权求和,得到Attention 网络的输出s :s =ðnt =0αt ∗h t(5)最后将得到的向量s 进行softmax 归一化处理,计算各个分类的概率情况作为BiLSTM 模块的输出㊂BERT 模块接收输入层的输出向量作为输入,其模型结构如图3所示㊂双向Transformer 模型能利用内部的self_attetnion 机制计算出该句其他词和当前编码词的语义关联程度,从而增强上下文语义信息的提取㊂词嵌入向量通过双向Transformer 模型结构进行编码学习,其第一位词嵌入向量(即[CLS]对应词向量编码)包含该句所有信息㊂抽取第一位词嵌入向量作为该句子的向量表示,利用tanh 函数激活后做softmax 处理,获得该句子的各类概率分布作为BERT 模块的输出㊂图3㊀BERT 模块结构分析层的输入为BiLSTM 模块和BERT 模块对文本的预测概率分布㊂分析层主要有两个阶段,①选取两个模型确定性更大的预测分布作为最终预测分布,②在最终预测分布上筛选不确定性更大的预测项,对该项分布的最大二值进行随机抽取确定最终输出标签㊂第1个阶段中,两个概率分布的离散程度反应了该模型对预测分类的准确度,即离散程度越大表明该模型对此项分类判定的确定性越大㊂反之,概率分布的离散程度越小,代表各个分类的概率值相近,判断相对模糊㊂因此,通过对比两个模型得出的概率分布可以获知某个模型对该段文本的预测确定性更高㊂取确定性更高的分布作为BBNN 攻击文本分类模型的最终概率分布输出㊂如算法1所示㊂算法1㊀基于概率分布的预测确定性分析算法输入:BiLSTM 和BERT 模块预测的各类别的概率分布P bert ㊁P lstm输出:最终判定的概率分布P out1㊀v bert =GetVariance(P bert )2㊀v lstm =GetVariance(P lstm )3㊀if v bert >v lstm4㊀㊀P out =P bert5㊀else6㊀㊀P out =P lstm7㊀return P out48㊀信息工程大学学报㊀2021年㊀步骤1㊁步骤2分别计算BERT模块㊁BiLSTM 模块预测概率分布的方差;步骤3~步骤6通过对比方差大小选定确定性更大的概率分布;步骤7输出最终概率分布㊂第2个阶段中,认为最终输出概率分布中低概率值㊁低方差区域的预测项不确定性最大,这些分布的各项预测值相近㊂其中最大预测值项并不是真实分类项的概率最高㊂因此,可扩大选择范围,对相对较高的预测值进行随机选取以确定最后预测项㊂这里,采用最优二值随机选择的办法来进行㊂首先通过设置阈值选定随机区域,然后对方差和概率值属于该区域的预测项进行最优二项值随机选取㊂如算法2所示㊂算法2㊀基于概率分布方差及概率值的低确定性随机选择算法输入:预测概率分布P in,概率项阈值α㊁方差项阈值β输出:最终预测分类编号1㊀v=GetVariance(P in)2㊀P max1=Max(P in(x i))3㊀if P max1<αand v<β4㊀P max2=Max(P in(X=x i)-P max1)5㊀P final=Random(P max1,P max2)6㊀else7㊀㊀P final=P max18㊀㊀x out=Classify(P out)9㊀return x out步骤1获取输入分布的方差值v,步骤2获取输入分布的最大概率值P max1,步骤3判断最大概率值和方差是否均分别小于阈值α㊁β,步骤4㊁步骤5表示若满足步骤3的条件则求出第二大概率值并在最大概率值和第二大概率值间做随机选择赋值给最终选择概率P final,步骤6㊁步骤7表示若不符合步骤3条件则将最大概率值赋值给最终选择概率P final,步骤8求出输入分布中概率为P final的分类标签x out,步骤9输出标签㊂3㊀实验3.1㊀实验设置本文采用通用攻击模型枚举分类项目, CAPEC中的攻击分类描述文本作为实验数据集,该数据集包括了攻击定义描述㊁攻击实例描述㊁攻击条件准备等信息㊂目前,从攻击机制的角度,CAPEC将所有攻击共分为9个基础类型[17](截止2019年9月30日)㊂本次实验数据通过对数据集的攻击名称㊁攻击定义描述㊁实例描述数据进行提取并标注,获得了1452条样本信息㊂样本分布信息如图4所示, CAPEC的9类数据中Collect and Analyze Informa-tion类别的样本数最多,这是因为网络攻击过程中的多个步骤均涉及信息的搜集与分析,所占比例最大㊂Employ Probabilitstic Techniques和Manipulate Timing and State类的样本数不足50所占比例最少,表明该两类攻击使用领域较小㊂图4㊀样本分布实验中将样本数据做随机抽取,选取其中70%的样本数据作为训练集,10%作为验证集, 20%作为测试集㊂实验环境如表1所示㊂为验证BBNN模型的有效性,实验使用以下几个模型进行对比:①通过GloVe预训练模型获得词嵌入向量输入微调训练后的BiLSTM+Attention模型;②经数据微调训练后获得的BERT模型;③无随机选择的BBNN模型;④选定阈值随机选择的BBNN模型㊂表1㊀实验环境项目版本操作系统Linux Mint19.3Cinnamon4.4.5处理器 2.6GHz18核Intel Core i9-7980XE内存125.6GiB显卡NVIDIA Corporation GV100[TITAN V]Python 3.6Tensorflow Gpu-2.2.0Numpy 1.18.1Pandas 1.0.3Keras 2.3.1Matplotlib 3.1.3实验中GloVe预训练模型使用预训练好的㊀第1期欧昀佳,等:基于BBNN的网络攻击文本自动化分类方法49㊀glove.6B.300d,即编码后的词嵌入向量为300维, LSTM隐藏单元数设置为64,Attention网络隐藏单元为16㊂BERT预训练模型结构使用Google发布的uncased_L-12_H-768_A-12,经过该BERT模型结构的词向量维度为768㊂为避免过拟合BiLSTM 和BERT模型输出后的Drpout设置为0.5,模型最大句子长度为512㊂BBNN模型的随机选择区域阈值,概率项设为0.75和方差项设为0.06,即概率值小于0.75且方差小于0.06的预测项进行最优二项选择㊂由于随机选择存在随机性,BBNN模型的随机预测结果评估值取100次预测的平均评估值作为最后的参考指标㊂3.2㊀实验结果与分析几种模型在测试集上的预测分布方差和预测概率值如图5所示,图5(a)为BERT模型预测结果,图5(b)为BiLSTM模型预测结果,图5(c)为BBNN模型低命中区域无随机选取的预测结果,图5(d)为BBNN低命中区域随机选取的预测结果㊂图5中,横轴代表预测的概率值,纵轴代表预测分布的方差㊂点标代表预测正确的文本,叉标表示预测错误的文本㊂对比图5(a)和图5(b)可以看出,大多数预测错误项的方差和概率值小于预测正确项的方差和概率值,预测正确的值集中在方差与预测概率值较高的区域㊂并且,方差越小概率值越小㊂图5(c)对比图5(a)和图5(b)可以发现经过BBNN模型优选后的输出结果在方差和概率值低的区域预测错误项有明显下降㊂对比图5(c)和图5(d),设置阈值并选取低概率值㊁低方差区域做随机猜选能够提高低命中率区域的命中率㊂图5㊀模型预测结果分布㊀㊀模型在测试集上的预测表现如表2所示, BBNN模型在测试集上的各项指标优于其他模型㊂对比BERT和BiLSTM+Attention模型可以发现BBNN模型的优选机制对模型的准确率有一定提升,分别提高了7.29%和3.00%㊂同时,F1值分别提升了7.62%㊁1.73%,精确率分别提升了8.29%㊁1.72%,召回率分别提升了6.22%㊁1.90%㊂表2㊀测试集预测评估结果对比表模型Accuracy(%)F1(%)Precision(%)Recall(%) BERT71.8872.0572.7072.94 BiLSTM76.1777.9479.2777.26BBNN78.5278.9079.9278.69 BBNN(Random)79.1779.6780.9979.16实验中使用BBNN模型进行了100次随机选择预测,结果如图6所示㊂图中,横轴代表预测次数,纵轴代表百分比㊂4项指标中的最好预测结果为准确率80.46%㊁精确率82.21%㊁F1值80.63%㊁回归率80.08%;最差预测结果为准确率77.73%㊁精确率79.45%㊁F1值78.33%㊁回归率78.14%㊂最差预测结果的各项指标相比单一的模型预测结果要好㊂随机选择机制使得模型的各项指标在3%以内上下浮动,存在一定的不稳定性㊂图6㊀100次随机预测评估结果4㊀结束语网络中存在大量攻击描述文本,对该类文本进行快速分类有助于研究人员从原始数据中迅速提炼知识,形成有体系的知识架构㊂本文通过对网络攻击描述文本进行分析,提出一种BBNN混合优选型模型,用于自动识别网络攻击文本中描述的攻击方式㊂与传统的机器学习模型不同,该模型能从词语级和句子级的层面对文本特征进行学习,并通过优选机制得到更确定的预测分类㊂实验表明,50㊀信息工程大学学报㊀2021年㊀BBNN模型对攻击文本的识别效果最好,但是该模型更为复杂,训练和使用时相比其他模型耗时更长,占用资源更多,且预测结果存在一定的不稳定性㊂下一步工作中将进一步扩大数据集,提高样本覆盖率㊂同时有针对性地对模型结构进行简化和优化,提高攻击文本分类方法的效率和稳定性㊂参考文献:[1]KIM Y.Convolutional neural networks for sentence clas-sification[C]//Empirical Methods in Natual Language Processing.Doha Ratar,2014:1746-1751. [2]CONNEAU A,SCHWENK H,BARRAULT L,et al. Very deep convolutional networks for text classification [J].Proceedings of the15th Conference of the Europeam Chapter of the Associational Linguistics.Vacencia Spain, 2017:193-207.[3]李洋,董红斌.基于CNN和BiLSTM网络特征融合的文本情感分析[J].计算机应用,2018,38(11): 29-34.[4]BENGIO Y,SIMARD P,FRASCONI P.Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks,1994,5(2): 157-166.[5]CHO K,VAN MERRIENBOER B,GULCEHRE C,et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]//2014 Conference on Empirical Methods in Natural Language Processing.Doha Qatar,2014:1724-1734.[6]HOCHREITER S,SCHMIDHUBER J.LSTM can solve hard long time lag problems[C]//Advances in Neural In-formation Processing Systems,1997:473-479. [7]LIU P,QIU X,HUANG X.Recurrent neural network for text classification with multi-task learning[C]//Interna-tional Joint Conference on Artifical Intelligence.New York,2016:2873-2879.[8]赵明,杜会芳,董翠翠,等.基于word2vec和LSTM 的饮食健康文本分类研究[J].农业机械学报,2017, 48(10):202-208.[9]YANG Z,YANG D,DYER C,et al.Hierarchical attention networks for document classification[C]//Proceedings of the2016Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies.San Diego,2016:1480-1489. [10]周春华,郭晓峰,沈建京,等.基于注意力机制的深度学习态势信息推荐模型[J].信息工程大学学报,2019,20(5):597-603.[11]VASWANI A,SHAZEER N,PARMAR N,et al.At-tention is all you need[C]//Advances in Neural Infor-mation Processing Systems,2017:5998-6008. [12]QIU X,SUN T,XU Y,et al.Pre-trained models fornatural language processing:a survey[EB/OL].[2020-08-24].https:/abs/2003.08271. [13]MIKOLOV T,CHEN K,CORRADO G,et al.Efficientestimation of word representations in vector space[EB/OL].[2013-09-07].arXiv preprint,2013arXiv:1301.3781.[14]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distrib-uted representations of words and phrases and their com-positionality[C]//Advances in neural information pro-cessing systems,2013:3111-3119.[15]PENNINGTON J,SOCHER R,MANNING C D.Glove:Global vectors for word representation[C]//Pro-ceedings of the2014conference on empirical methods innatural language processing(EMNLP).Doha Qatar,2014:1532-1543.[16]DEVLIN J,CHANG M W,LEE K,et al.Bert:Pre-training of deep bidirectional transformers for languageunderstanding[C]//2019Conference of the NorthAmerilan Chapter on the Association for ComputationalLinguistics Human Language Technologies.MinneapolisUSA,2019:4171-4186.[17]CAPEC TEAM.Schema Documentation[EB/OL].(2019-09-30).[2020-06-01].https://capec.mitre.org/documents/schema/index.html.(编辑:高明霞)。
量子人工智能研究中心陈玺教授:奋斗成就理想,科研反哺教育_50
人物简介陈玺,上海大学理学院物理系教授、博士生导师、理学院副院长。
本硕博毕业于上海大学理学院物理系。
2007年4月留校工作、2011年10月破格晋升教授。
2008年获西班牙科技部Juan de la Cierva奖学金;主持完成国家自然科学基金,上海市科委等科研项目10余项;先后入选上海市“东方学者”特聘教授、曙光学者、青年科技启明星、浦江学者、晨光学者等各类人才计划项目。
主要研究量子光学、量子信息、量子计算等,发表论文130余篇。
近年来,提出了量子绝热捷径技术,在原子物理及量子信息处理中得到了广泛应用,相关成果发表在Physical Review Letters和Nature Communications上,得到了国内外同行的关注和实验验证。
他是充满使命感的物理学人他是学生学术路上“亦师亦友”的引领者他是对自己有着极高要求的教育工作者他更是对学校充满自豪与期待的“老上大人”他就是——上海大学理学院量子人工智能中心陈玺教授作为学校国际化平台的主要负责人,陈玺教授身上有着很多闪耀的标签,但这一路走来,他始终没有忘记自己最初那份朴素的学术理想和对于国家、民族的使命与担当,这是他一直坚守的信仰。
一路走来的风风雨雨,让他对于学生和为人师者的使命有了更深的理解,在繁忙的科研之余,他不断探索学生的教育问题,希望能通过科学研究实现对教育的反哺。
不知道怎么做科研对研究生活很茫然不明白怎样去坚持听听陈教授的故事或许你能找到答案01热爱、使命、担当再踏征途,使命在肩成立量子人工智能研究中心是上海大学建设一流高水平大学国际化战略中重要举措之一。
平台命名与其所研究的内容息息相关,主要关注人工智能、量子计算和脑科学。
量子计算的存在是为了解决人工智能研究过程中所必需的大量计算;同时,人工智能的研究也有助于推进量子计算;而在人工智能所要学习的内容中有一个重要的方向,就是深入学习神经网络,这是脑科学的重要组成部分,平台因此做到了三者的有机结合。
ICDAR2011
Text Detection and Character Recognition in Scene Imageswith Unsupervised Feature LearningAdam Coates,Blake Carpenter,Carl Case,Sanjeev Satheesh,Bipin Suresh,Tao Wang,David J.Wu,Andrew Y.NgComputer Science DepartmentStanford University353Serra MallStanford,CA94305USA{acoates,blakec,cbcase,ssanjeev,bipins,twangcat,dwu4,ang}@Abstract—Reading text from photographs is a challenging problem that has received a significant amount of attention. Two key components of most systems are(i)text detection from images and(ii)character recognition,and many recent methods have been proposed to design better feature representations and models for both.In this paper,we apply methods recently developed in machine learning–specifically,large-scale algo-rithms for learning the features automatically from unlabeled data–and show that they allow us to construct highly effective classifiers for both detection and recognition to be used in a high accuracy end-to-end system.Keywords-Robust reading,character recognition,feature learning,photo OCRI.I NTRODUCTIONDetection of text and identification of characters in scene images is a challenging visual recognition problem.As in much of computer vision,the challenges posed by the complexity of these images have been combated with hand-designed features[1],[2],[3]and models that incorporate various pieces of high-level prior knowledge[4],[5].In this paper,we produce results from a system that attempts to learn the necessary features directly from the data as an alternative to using purpose-built,text-specific features or models.Among our results,we achieve performance among the best known on the ICDAR2003character recognition dataset.In contrast to more classical OCR problems,where the characters are typically monotone onfixed backgrounds, character recognition in scene images is potentially far more complicated due to the many possible variations in background,lighting,texture and font.As a result,build-ing complete systems for these scenarios requires us to invent representations that account for all of these types of variations.Indeed,significant effort has gone into creating such systems,with top performers integrating dozens of cleverly combined features and processing stages[5].Recent work in machine learning,however,has sought to create algorithms that can learn higher level representations of data automatically for many tasks.Such systems might be particularly valuable where specialized features are needed but not easily created by hand.Another potential strength of these approaches is that we can easily generate large numbers of features that enable higher performance to be achieved by classification algorithms.In this paper,we’ll apply one such feature learning system to determine to what extent these algorithms may be useful in scene text detection and character recognition.Feature learning algorithms have enjoyed a string of successes in otherfields(for instance,achieving high perfor-mance in visual recognition[6]and audio recognition[7]). Unfortunately,one caveat is that these systems have often been too computationally expensive,especially for applica-tion to large images.To apply these algorithms to scene text applications,we will thus use a more scalable feature learning system.Specifically,we use a variant of K-means clustering to train a bank of features,similarly to the system in[8].Armed with this tool,we will produce results showing the effect on recognition performance as we increase the number of learned features.Our results will show that it’s possible to do quite well simply by learning many features from the data.Our approach contrasts with much prior work in scene text applications,as none of the features used here have been explicitly built for the application at hand.Indeed, the system follows closely the one proposed in[8].This paper is organized as follows.We willfirst survey some related work in scene text recognition,as well as the machine learning and vision results that inform our basic approach in Section II.We’ll then describe the learning architecture used in our experiments in Section III,and present our experimental results in Section IV followed by our conclusions.II.R ELATED W ORKScene text recognition has generated significant interest from many branches of research.While it is now possible to achieve extremely high performance on tasks such as digit recognition in controlled settings[9],the task of detecting and labeling characters in complex scenes remains an active research topic.However,many of the methods used for scene text detection and character recognition arepredicated on cleverly engineered systems specific to the new task.For text detection,for instance,solutions have ranged from simple off-the-shelf classifiers trained on hand-coded features[10]to multi-stage pipelines combining many different algorithms[11],[5].Common features include edge features,texture descriptors,and shape contexts[1]. Meanwhile,variousflavors of probabilistic model have also been applied[4],[12],[13],folding many forms of prior knowledge into the detection and recognition system.On the other hand,some systems with highlyflexible learning schemes attempt to learn all necessary information from labeled data with minimal prior knowledge.For in-stance,multi-layered neural network architectures have been applied to character recognition and are competitive with other leading methods[14].This mirrors the success of such approaches in more traditional document and hand-written text recognition systems[15].Indeed,the method used in our system is related to convolutional neural networks.The primary difference is that the training method used here is unsupervised,and uses a much more scalable training algorithm that can rapidly train many features.Feature learning methods in general are currently the focus of much research,particularly applied to computer vision problems.As a result,a wide variety of algorithms are now available to learn features from unlabeled data[16], [17],[18],[19],[20].Many results obtained with feature learning systems have also shown that higher performance in recognition tasks could be achieved through larger scale representations,such as could be generated by a scalable feature learning system.For instance,Van Gemert et al.[21] showed that performance can grow with larger numbers of low-level features,and Li et al.[22]have provided evidence of a similar phenomenon for high-level features like objects and parts.In this work,we focus on training low-level features,but more sophisticated feature learning methods are capable of learning higher level constructs that might be even more effective[23],[7],[17],[6].III.L EARNING A RCHITECTUREWe now describe the architecture used to learn the feature representations and train the classifiers used for our detection and character recognition systems.The basic setup is closely related to a convolutional neural network[15],but due to its training method can be used to rapidly construct extremely large sets of features with minimal tuning.Our system proceeds in several stages:1)Apply an unsupervised feature learning algorithm to aset of image patches harvested from the training data to learn a bank of image features.2)Evaluate the features convolutionally over the trainingimages.Reduce the number of features using spatial pooling[15].3)Train a linear classifier for either text detection orcharacter recognition.We will now describe each of these stages in more detail.A.Feature learningThe key component of our system is the application of an unsupervised learning algorithm to generate the features used for classification.Many choices of unsupervised learn-ing algorithm are available for this purpose,such as auto-encoders[19],RBMs[16],and sparse coding[24].Here, however,we use a variant of K-means clustering that has been shown to yield results comparable to other methods while also being much simpler and faster.Like many feature learning schemes,our system works by applying a common recipe:1)Collect a set of small image patches,˜x(i)from trainingdata.In our case,we use8x8grayscale1patches,so˜x(i)∈R64.2)Apply simple statistical pre-processing(e.g.,whiten-ing)to the patches of the input to yield a new dataset x(i).3)Run an unsupervised learning algorithm on the x(i)tobuild a mapping from input patches to a feature vector, z(i)=f(x(i)).The particular system we employ is similar to the one presented in[8].First,given a set of training images,we extract a set of m8-by-8pixel patches to yield vectors of pixels˜x(i)∈R64,i∈{1,...,m}.Each vector is brightness and contrast normalized.2We then whiten the˜x(i)using ZCA3whitening[25]to yield x(i).Given this whitened bank of input vectors,we are now ready to learn a set of features that can be evaluated on such patches.For the unsupervised learning stage,we use a variant of K-means clustering.K-means can be modified so that it yields a dictionary D∈R64×d of normalized basis vectors.Specifically,instead of learning“centroids”based on Euclidean distance,we learn a set of normalized vectors D(j),j∈{1,...,d}to form the columns of D,using inner products as the similarity metric.That is,we solveminD,s(i)i||Ds(i)−x(i)||2(1)s.t.||s(i)||1=||s(i)||∞,∀i(2)||D(j)||2=1,∀j(3) where x(i)are the input examples and s(i)are the corre-sponding“one hot”encodings4of the examples.Like K-means,the optimization is done by alternating minimization over D and the s(i).Here,the optimal solution for s(i)given 1All of our experiments use grayscale images,though the methods here are equally applicable to color patches.2We subtract out the mean and divide by the standard deviation of all the pixel values.3ZCA whitening is like PCA whitening,except that it rotates the data back to the same axes as the original input.4The constraint||s(i)||1=||s(i)||∞means that s(i)may have only1 non-zero value,though its magnitude is unconstrained.Figure1.A small subset of the dictionary elements learned from grayscale, 8-by-8pixel image patches extracted from the ICDAR2003dataset.D is to set s(i)k=D(k) x(i)for k=arg max j D(j) x(i), and set s(i)j=0for all other j=k.Then,holding all s(i)fixed,it is easy to solve for D(in closed-form for each column)followed by renormalizing the columns.Shown in Figure1are a set of dictionary elements (columns of D)resulting from this algorithm when applied to whitened patches extracted from small images of char-acters.These are visibly similar tofilters learned by other algorithms(e.g.,[24],[25],[16]),even though the method we use is quite simple and very fast.Note that the features are specialized to the data—some elements correspond to short,curved strokes rather than simply to edges.Once we have our trained dictionary,D,we can then define the feature representation for a single new8-by-8patch.Given a new input patch˜x,wefirst apply the normalization and whitening transform used above to yield x,then map it to a new representation z∈R d by taking the inner product with each dictionary element(column of D) and applying a scalar nonlinear function.In this work,we use the following mapping,which we have found to work well in other applications:z=max{0,|Dx|−α}whereαis a hyper-parameter to be chosen.(We typically useα=0.5.)B.Feature extractionBoth our detector and character classifier consider32-by-32pixel images.To compute the feature representation of the 32-by-32image,we compute the representation described above for every8-by-8sub-patch of the input,yielding a25-by-25-by-d representation.Formally,we will let z(ij)∈R d be the representation of the8-by-8patch located at position i,j within the input image.At this stage,it is necessary to reduce the dimensionality of the representation before classification.A common way to do this is with spatial pooling[26]where we combine the responses of a feature at multiple locations into a single feature.In our system, we use average pooling:we sum up the vectors z(ij)over 9blocks in a3-by-3grid over the image,yielding afinal feature vector with9d features for this image.C.Text detector trainingFor text detection,we train a binary classifier that aims to distinguish32-by-32windows that contain text from windows that do not.We build a training set for thisclassifier(a)Distorted ICDAR ex-amples(b)Synthetic examplesFigure2.Augmented training examples.by extracting32-by-32windows from the ICDAR2003 training dataset,using the word bounding boxes to decide whether a window is text or non-text.5With this procedure, we harvest a set of6000032-by-32windows for training (30000positive,30000negative).We then use the feature extraction method described above to convert each image into a9d-dimensional feature vector.These feature vectors and the ground-truth“text”and“not text”labels acquired from the bounding boxes are then used to train a linear SVM.We will later use our feature extractor and the trained classifier for detection in the usual“sliding window”fashion.D.Character classifier trainingFor character classification,we also use afixed-sized input image of32-by-32pixels,which is applied to the character images in a set of labeled train and test datasets.6 However,since we can produce large numbers of features using the feature learning approach above,over-fitting be-comes a serious problem when training from the(relatively) small character datasets currently in use.To help mitigate this problem,we have combined data from multiple sources. In particular,we have compiled our training data from the ICDAR2003training images[27],Weinman et al.’s sign reading dataset[4],and the English subset of the Chars74k dataset[1].Our combined training set contains approximately12400labeled character images.With large numbers of features,it is useful to have even more data.To satisfy these needs,we have also experimented with synthetic augmentations of these datasets.In particular, we have added synthetic examples that are copies of the ICDAR training samples with random distortions and image filters applied(see Figure2(a)),as well as artificial examples of rendered characters blended with random scenery images 5We define a window as“text”if80%of the window’s area is within a text region,and the window’s width or height is within30%of the width or height(respectively)of the ground-truth region.The latter condition ensures that the detector tends to detector characters of size similar to the window. 6Typically,input images from public datasets are already cropped to the boundaries of the character.Since our classifier uses afixed-sized window, we re-cropped characters from the original images using an enclosing window of the proper size.Figure3.Precision-Recall curves for detectors with varying numbers of features.(Figure2(b)).With these examples included,our dataset includes a total of49200images.IV.E XPERIMENTSWe now present experimental results achieved with the system described above,demonstrating the impact of being able to train increasing numbers of features.Specifically, for detection and character recognition,we trained our classifiers with increasing numbers of learned features and in each case evaluated the results on the ICDAR2003test sets for text detection and character recognition.A.DetectionTo evaluate our detector over a large input image,we take the classifier trained as in Section III-C and compute the features and classifier output for each32-by-32window of the image.We perform this process at multiple scales and then,for each location in the original image assign it a score equal to the maximum classifier output achieved at any scale.By this mechanism,we label each pixel with a score according to whether that pixel is part of a block of text.These scores are then thresholded to yield binary decisions at each pixel.By varying the threshold and using the ICDAR bounding boxes as per-pixel labels,we sweep out a precision-recall curve for the detector and report the area under this curve(AUC)as ourfinal performance measure.Figure3plots the precision-recall curves for our detector for varying numbers of features.It is seen there that perfor-mance improves consistently as we increase the number of features.Our detector’s performance(area under each curve) improves from0.5AUC,to0.62AUC simply by including more features.While our performance is not yet comparable to top performing systems it is notable that our approach included virtually no prior knowledge.In contrast,Pan et al.’s recent state-of-the-art system[5]involvesmultiple(a)ICDAR testimage(b)Text detectorscores(c)ICDAR testimage(d)Text detector scoresFigure4.Example text detection classifier outputs.Figure5.Character classification accuracy(62-way)on ICDAR2003test set as a function of the number of learned features.highly tuned processing stages incorporating several sets of expert-chosen features.Note that these numbers are per-pixel accuracies(i.e., the performance of the detector in identifying,for a single window,whether it is text or non-text).In practice,the predicted labels of adjacent windows are highly correlated and thus the outputs include large contiguous“clumps”of positively and negatively labeled windows that could be passed on for more processing.A typical result generated by our detector is shown in Figure4.B.Character RecognitionAs with the detectors,we trained our character classifiers with varying numbers of features on the combined training set described in Section III.We then tested this classifier on the ICDAR2003test set,which contains5198test characters 7Achieved without pre-segmented characters.Table IT EST RECOGNITION ACCURACY ON ICDAR2003CHARACTER SETS.(D ATASET-C LASSES)Algorithm Test-62Sample-62Sample-36 Neumann and Matas,2010[28]67.0%7--Yokobayashi et al.,2006[2]-81.4%-Saidane and Garcia,2007[14]--84.5% This paper81.7%81.4%85.5% from62classes(10digits,26upper-and26lower-case letters).The average classification accuracy on the ICDAR test set for increasing numbers of features is plotted in Figure5.Again,we see that accuracy climbs as a function of the number of features.Note that the accuracy for the largest system(1500features)is the highest,at81.7%for the62-way classification problem.This is comparable or superior to other(purpose-built)systems tested on the same problem. For instance,the system in[2],achieves81.4%on the smaller ICDAR“sample”set where we,too,achieve81.4%. The authors of[14],employing a supervised convolutional network,achieve84.5%on this dataset when it is collapsed to a36-way problem(removing case sensitivity).In that scenario,our system achieves85.5%with1500features. These results are summarized in comparison to other work in Table I.V.C ONCLUSIONIn this paper we have produced a text detection and recognition system based on a scalable feature learning algorithm and applied it to images of text in natural scenes. We demonstrated that with larger banks of features we are able to achieve increasing accuracy with top performance comparable to other systems,similar to results observed in other areas of computer vision and machine learning.Thus, while much research has focused on developing by hand the models and features used in scene-text applications,our results point out that it may be possible to achieve high performance using a more automated and scalable solution. With more scalable and sophisticated feature learning al-gorithms currently being developed by machine learning researchers,it is possible that the approaches pursued here might achieve performance well beyond what is possible through other methods that rely heavily on hand-coded prior knowledge.A CKNOWLEDGMENTAdam Coates is supported by a Stanford Graduate Fel-lowship.R EFERENCES[1]T. E.de Campos, B.R.Babu,and M.Varma,“Charac-ter recognition in natural images,”in Proceedings of the International Conference on Computer Vision Theory and Applications,Lisbon,Portugal,February2009.[2]M.Yokobayashi and T.Wakahara,“Binarization and recog-nition of degraded characters using a maximum separability axis in color space and gat correlation,”in International Conference on Pattern Recognition,vol.2,2006,pp.885–888.[3]J.J.Weinman,“Typographical features for scene text recog-nition,”in Proc.IAPR International Conference on Pattern Recognition,Aug.2010,pp.3987–3990.[4]J.Weinman,E.Learned-Miller,and A.R.Hanson,“Scenetext recognition using similarity and a lexicon with sparse belief propagation,”in Transactions on Pattern Analysis and Machine Intelligence,vol.31,no.10,2009.[5]Y.Pan,X.Hou,and C.Liu,“Text localization in natural sceneimages based on conditional randomfield,”in International Conference on Document Analysis and Recognition,2009.[6]J.Yang,K.Yu,Y.Gong,and T.S.Huang,“Linear spatialpyramid matching using sparse coding for image classifica-tion.”in Computer Vision and Pattern Recognition,2009. [7]H.Lee,R.Grosse,R.Ranganath,and A.Y.Ng,“Convolu-tional deep belief networks for scalable unsupervised learning of hierarchical representations,”in International Conference on Machine Learning,2009.[8] A.Coates,H.Lee,and A.Y.Ng,“An analysis of single-layernetworks in unsupervised feature learning,”in International Conference on Artificial Intelligence and Statistics,2011. [9]M.Ranzato,Y.Boureau,and Y.LeCun,“Sparse featurelearning for deep belief networks,”in Neural Information Processing Systems,2007.[10]X.Chen and A.Yuille,“Detecting and reading text in naturalscenes,”in Computer Vision and Pattern Recognition,vol.2, 2004.[11]Y.Pan,X.Hou,and C.Liu,“A robust system to detectand localize texts in natural scene images,”in International Workshop on Document Analysis Systems,2008.[12]J.J.Weinman,E.Learned-Miller,and A.R.Hanson,“A dis-criminative semi-markov model for robust scene text recog-nition,”in Proc.IAPR International Conference on Pattern Recognition,Dec.2008.[13]X.Fan and G.Fan,“Graphical Models for Joint Segmentationand Recognition of License Plate Characters,”IEEE Signal Processing Letters,vol.16,no.1,2009.[14]Z.Saidane and C.Garcia,“Automatic scene text recogni-tion using a convolutional neural network,”in Workshop on Camera-Based Document Analysis and Recognition,2007.[15]Y.LeCun, B.Boser,J.S.Denker, D.Henderson,R. E.Howard,W.Hubbard,and L.D.Jackel,“Backpropagation applied to handwritten zip code recognition,”Neural Compu-tation,vol.1,pp.541–551,1989.[16]G.Hinton,S.Osindero,and Y.Teh,“A fast learning algorithmfor deep belief nets,”Neural Computation,vol.18,no.7,pp.1527–1554,2006.[17]R.Salakhutdinov and G.E.Hinton,“Deep Boltzmann Ma-chines,”in12th International Conference on AI and Statistics, 2009.[18]M.Ranzato,A.Krizhevsky,and G.E.Hinton,“Factored3-way Restricted Boltzmann Machines for Modeling Natural Images,”in13th International Conference on AI and Statis-tics,2010.[19]Y.Bengio,mblin, D.Popovici,and rochelle,“Greedy layer-wise training of deep networks,”in Neural Information Processing Systems,2006.[20]R.Raina,A.Battle,H.Lee,B.Packer,and A.Ng,“Self-taught learning:transfer learning from unlabeled data,”in 24th International Conference on Machine learning,2007.[21]J.C.van Gemert,J.M.Geusebroek,C.J.Veenman,andA.W.M.Smeulders,“Kernel codebooks for scene catego-rization,”in European Conference on Computer Vision,2008.[22]L.-J.Li,H.Su, E.Xing,and L.Fei-Fei,“Object bank:A high-level image representation for scene classificationand semantic feature sparsification,”in Advances in Neural Information Processing Systems,2010.[23]K.Kavukcuoglu,P.Sermanet,Y.Boureau,K.Gregor,M.Mathieu,and Y.LeCun,“Learning convolutional feature hierarchies for visual recognition,”in Advances in Neural Information Processing Systems,2010.[24] B.A.Olshausen and D.J.Field,“Emergence of simple-cellreceptivefield properties by learning a sparse code for natural images,”Nature,vol.381,no.6583,pp.607–609,1996. [25] A.Hyvarinen and E.Oja,“Independent component analysis:algorithms and applications,”Neural networks,vol.13,no.4-5,pp.411–430,2000.[26]Y.Boureau,F.Bach,Y.LeCun,and J.Ponce,“Learningmid-level features for recognition,”in Computer Vision and Pattern Recognition,2010.[27]S.Lucas,A.Panaretos,L.Sosa,A.Tang,S.Wong,andR.Young,“ICDAR2003robust reading competitions,”Inter-national Conference on Document Analysis and Recognition, 2003.[28]L.Neumann and J.Matas,“A method for text localizationand recognition in real-world images,”in Asian Conference on Computer Vision,2010.。
一天搞懂深度学习演示教学ppt课件
Softmax
1-2 基本思想
Neural Network
1-2 基本思想
……
……
……
……
……
……
y1
y2
y10
Cross Entropy
“1”
……
1
0
0
……
target
Softmax
……
Given a set of parameters
目标识别
目标分析
图像捕获 图像压缩 图像存储
图像预处理 图像分割
特征提取 目标分类 判断匹配
模型建立 行为识别
2-1 机器视觉
关键技术与应用
A)生物特征识别技术——安全领域应用广泛 生物特征识别技术是一种通过对生物特征识别和检测,对身伤实行鉴定的技术。从 统计意义上讲人类的指纹、虹膜等生理特征存在唯一性,可以作为鉴另用户身份 的依据。目前,生物特征识别技术主要用于身份识别,包括语音、指纹、人脸、 静脉,虹膜识别等。
1958: Perceptron (linear model) 1969: Perceptron has limitation 1980s: Multi-layer perceptron Do not have significant difference from DNN today 1986: Backpropagation Usually more than 3 hidden layers is not helpful 1989: 1 hidden layer is “good enough”, why deep? 2006: RBM initialization 2009: GPU 2011: Start to be popular in speech recognition 2012: win ILSVRC image competition 2015.2: Image recognition surpassing human-level performance 2016.3: Alpha GO beats Lee Sedol 2016.10: Speech recognition system as good as humans
基于改进EfficientNetV2网络的脑肿瘤分类方法
第61卷 第5期吉林大学学报(理学版)V o l .61 N o .5 2023年9月J o u r n a l o f J i l i nU n i v e r s i t y (S c i e n c eE d i t i o n )S e p 2023d o i :10.13413/j .c n k i .jd x b l x b .2022383基于改进E f f i c ie n t N e t V 2网络的脑肿瘤分类方法崔 博,贾兆年,姬 鹏,李秀华,侯阿临(长春工业大学计算机科学与工程学院,长春130012)摘要:针对脑肿瘤磁共振图像分类问题中过拟合及分类准确率较低的问题,提出一种基于改进E f f i c i e n t N e t V 2网络的脑肿瘤分类方法.该方法在E f f i c i e n t N e t V 2网络中引入坐标注意力机制,该注意力机制将同时从垂直和水平两个方向获取脑肿瘤的特征信息,精准识别脑肿瘤的病灶特征,从而帮助模型更全面㊁准确地定位和识别病灶区域信息,有效抑制背景信息对检测结果的影响,使模型分类精度更高,解决了因获取特征信息不足导致分类精度低的问题.为进一步提升分类准确率,引入H a r d -S w i s h 激活函数,该激活函数不仅可以提升脑肿瘤分类网络模型的运算速度,也可有效提高分类精度.同时,改进后的模型搭配了D r o p o u t 层和归一化层,可更好抑制过拟合的发生,加快模型收敛速度,提高模型的鲁棒性,且分类精度有明显提升.实验结果表明,改进后的模型在验证集中获得了98.4%的分类准确率,通过对比实验和消融实验验证了改进后的模型在脑肿瘤分类任务中的有效性.关键词:磁共振图像;脑肿瘤分类;E f f i c i e n t N e t V 2网络;注意力机制中图分类号:T P 391 文献标志码:A 文章编号:1671-5489(2023)05-1169-09B r a i nT u m o rC l a s s i f i c a t i o n M e t h o dB a s e d o n I m pr o v e dE f f i c i e n t N e t V 2N e t w o r k C U IB o ,J I AZ h a o n i a n ,J IP e n g,L IX i u h u a ,HO U A l i n (C o l l e g e o f C o m p u t e rS c i e n c e a n dE n g i n e e r i n g ,C h a n g c h u nU n i v e r s i t y o f T e c h n o l o g y ,C h a n g c h u n 130012,C h i n a )收稿日期:2022-09-26.第一作者简介:崔 博(1997 ),女,汉族,硕士研究生,从事智能图像处理与机器学习的研究,E -m a i l :953914472@q q.c o m.通信作者简介:侯阿临(1972 ),女,汉族,博士,教授,从事智能图像处理与机器学习的研究,E -m a i l :h o u a l i n @c c u t .e d u .c n .基金项目:吉林省教育厅科学技术研究规划项目(批准号:J J K H 20210738K J )和吉林省科技厅高新重点研发项目(批准号:20210201051G X ).A b s t r a c t :A i m i n g a tt h e p r o b l e m s o fo v e r f i t t i n g a n dl o w c l a s s i f i c a t i o n a c c u r a c y i n b r a i nt u m o r m a g n e t i c r e s o n a n c e i m a g e c l a s s i f i c a t i o n ,w e p r o p o s e d a b r a i n t u m o r c l a s s i f i c a t i o nm e t h o db a s e do n a n i m p r o v e dE f f i c i e n t N e t V 2n e t w o r k .T h em e t h o d i n t r o d u c e d t h e c o o r d i n a t e a t t e n t i o nm e c h a n i s mi n t h e E f f i c i e n t N e t V 2n e t w o r k ,w h i c hs i m u l t a n e o u s l y ob t a i n e d t h e f e a t u r e i n f o r m a t i o no f b r a i n t u m o r f r o m b o t hv e r t ic a l a n dh o r i z o n t a ld i re c t i o n s a n da c c u r a t e l y id e n t i f i e d t h e l e s i o n f e a t u r e so f b r a i nt u m o r .I t h e l p e dt h e m o d e lt ol o c a t e a n di d e n t i f y t h el e s i o n a r e ai n f o r m a t i o n m o r e c o m p r e h e n s i v e l y a n d a c c u r a t e l y ,a n d e f f e c t i v e l y s u p p r e s s e dt h ei n f l u e n c e o fb a c k g r o u n di n f o r m a t i o n o nt h e d e t e c t i o n r e s u l t s ,s ot h a tt h e m o d e lh a d h i g h e rc l a s s i f i c a t i o na c c u r a c y .T h e p r o b l e m o fl o w c l a s s i f i c a t i o n a c c u r a c y c a u s e db y i n s u f f i c i e n ta c q u i s i t i o no ff e a t u r ei n f o r m a t i o n w a ss o l v e d .I no r d e rt of u r t h e r i m p r o v e t h e c l a s s i f i c a t i o na c c u r a c y ,t h eH a r d -S w i s ha c t i v a t i o n f u n c t i o nw a s i n t r o d u c e d ,w h i c hc o u l d Copyright ©博看网. All Rights Reserved.0711吉林大学学报(理学版)第61卷n o t o n l y i m p r o v e t h ec o m p u t a t i o n a l s p e e do f t h eb r a i nt u m o r c l a s s i f i c a t i o nn e t w o r k m o d e l,b u t a l s oe f f e c t i v e l y i m p r o v et h ec l a s s i f i c a t i o na c c u r a c y.M e a n w h i l e,t h e i m p r o v e d m o d e lw a se q u i p p e d w i t h D r o p o u t l a y e ra n dn o r m a l i z a t i o nl a y e r,w h i c hc o u l db e t t e rs u p p r e s st h eo c c u r r e n c eo fo v e r f i t t i n g, a c c e l e r a t e t h e c o n v e r g e n c e s p e e d o f t h em o d e l,i m p r o v e t h e r o b u s t n e s s o f t h em o d e l,a n d s i g n i f i c a n t l y i m p r o v e t h e c l a s s i f i c a t i o na c c u r a c y.T h ee x p e r i m e n t a l r e s u l t s s h o wt h a t t h e i m p r o v e d m o d e l o b t a i n s c l a s s i f i c a t i o na c c u r a c y o f98.4%i n t h e v a l i d a t i o n s e t,a n d t h e e f f e c t i v e n e s s o f t h e i m p r o v e dm o d e l i n b r a i n t u m o r c l a s s i f i c a t i o n t a s k i s v e r i f i e db y c o m p a r i s o ne x p e r i m e n t s a n d a b l a t i o ne x p e r i m e n t s.K e y w o r d s:m a g n e t i c r e s o n a n c e i m a g e;b r a i n t u m o r c l a s s i f i c a t i o n;E f f i c i e n t N e t V2n e t w o r k;a t t e n t i o nm e c h a n i s m医学成像技术是非侵入性的,常用于检测肿瘤,是目前用于癌症类型分类最常见和最可靠的技术[1].磁共振成像(m a g n e t i c r e s o n a n c e i m a g i n g,M R I)技术是医学成像技术的一种,在脑肿瘤的分类中尤其适用于提供高分辨率的脑组织图像[2].脑肿瘤分为150种,其中包括良性肿瘤和恶性肿瘤[3].脑肿瘤的早期诊断和准确分类对挽救病人生命至关重要,脑部M R I图像可分为不同的肿瘤类型.自动分类可以在放射科医生最少干预的情况下对脑肿瘤M R I图像进行分类[4].卷积神经网络(c o n v o l u t i o n a ln e u r a ln e t w o r k s,C N N)是深度学习的主要代表算法之一. S h e l h a m e r等[5]提出了一种深度㊁粗层和细层相结合的双路径C N N跳跃结构,实现了对脑癌的精确分割.C h e n g等[6]利用T1-M R I数据研究了脑肿瘤的三级分类问题,该方法先利用图像膨胀扩大肿瘤区域,然后将其划分为逐渐细小的环状子区域.D a s等[7]使用包含3064张T1加权对比增强M R I图像的卷积神经网络识别各种脑癌,如胶质瘤㊁脑膜瘤和脑垂体瘤.C N N模型通过基于可变大小的卷积滤波器/核调整卷积网络的大小,获得了94%的准确率.B a dža等[8]提出了一种新的C N N结构,该结构基于对现有的预训练网络的改进,使用T1加权对比增强磁共振图像对脑肿瘤进行分类,该模型的准确率为96.56%,其由两个10倍交叉验证技术组成,使用增强图片.M z o u g h i等[9]采用基于灰度归一化和自适应对比度增强的预处理方法,提出了一种完全自动化的脑胶质瘤三维细胞神经网络模型,用于脑胶质瘤的低级别和高级别分类.深度学习中的C N N是目前较好的图像处理方法[10].A l e x N e t[11],V G G[12]等C N N的发展证明了增加网络深度能在一定程度上提高网络性能.但卷积核作为其中数量最多的参数,对网络性能的影响较大[13].若只通过简单的网络层堆叠增加深度会导致网络出现梯度消失和过拟合的情况[14].为解决这类问题,本文提出基于改进E f f i c i e n t n e t V2网络的方法,实验结果表明,与其他算法相比,将其运用到脑肿瘤分类的任务中,准确率更高.1模型架构1.1E f f i c i e n t N e t V2网络模型E f f c i c e n t N e t V2[15]是根据E f f i c i e n t N e t[16]的基本思想进行改进的网络.该网络通过改变网络的深度㊁宽度和输入图像分辨率的参数提高网络性能[16].E f f i c i e n t N e t V2是一种采用神经结构搜索技术(n e u r a l a r c h i t e c t u r e s e a r c h,N A S)和复合模型扩张方法的分类识别网络,它能选择最优的复合系数,即按比例扩展网络的深度㊁宽度和输入图像分辨率3个维度,以找到最大识别特征精度所需的最优参数.E f f i c i e n t N e t V2通过自适应地平衡3个维度,成功减少了模型训练的参数量和复杂度,从而大幅度提高了模型性能.与单一维度的缩放相比,这种方法能获得更好的效果,同时在训练速度上也有明显的提升.E f f c i c e n t N e t V2网络主要由F u s e d-M B C o n v[17]和M B C o n v模块堆叠而成,网络结构列于表1.理论上,随着硬件设备的优化和算法设计的改进,普通卷积在某些条件下可能会比之前更有效.为验证这两者对卷积计算速度的影响,在浅层卷积使用F u s e d-M B C o n v,在深层卷积使用深度可分离卷积M B C o n v.Copyright©博看网. All Rights Reserved.表1 E f f i c i e n t N e t V 2网络结构T a b l e 1 E f f i c i e n t N e t V 2n e t w o r ka r c h i t e c t u r e 阶段操作步长通道数层数0C o n v 3ˑ322411F u s e d -M B C o n v 1,k 3ˑ312422F u s e d -M B C o n v 4,k 3ˑ324843F u s e d -M B C o n v 4,k 3ˑ326444M B C o n v 4,k 3ˑ3212865M B C o n v 6,k 3ˑ3116096M B C o n v 6,k 3ˑ32256157C o n v 1ˑ1&P o o l i n g &F C 12801 为使网络运算速度更快,该网络在浅层卷积中使用F u s e d -M B C o n v 模块,在深层卷积中使用深度图1 模块结构F i g.1 M o d u l e s t r u c t u r e 可分离卷积M B C o n v 模块,M B C o n v 模块由2个1ˑ1卷积㊁1个3ˑ3深度可分离卷积㊁S E (s q u e e z e -a n d -e x c i t a t i o n )注意力模块组成,如图1(A )所示;由于深度可分离卷积用在浅层卷积中会拖慢运行速度,因此F u s e d -M B C o n v 模块是在M B C o n v 中使用一个普通的3ˑ3卷积结构替换1ˑ1的升维卷积和3ˑ3的深度卷积,如图1(B )所示.S E 注意力模块由全局平均池化和两个全连接层组成.该模块首先进行S q u e e z e 操作,将每个通道的特征压缩为全局特征,然后通过E x c i t a t i o n 操作将全局特征转化为权重系数,从而实现模型对不同通道间特征的区分.1.2 坐标注意力机制坐标注意力(c o o r d i n a t e a t t e n t i o n ,C A )机制[18]是一种新型轻量级即插即用的注意力机制,它可以很容易地插入到移动网络的经典模块中,不仅可获取通道的输入特征信息,还可获取方向和位置的特征信息,有助于模型更准确地定位和识别感兴趣的对象,重点关注重要信息的目标区域,有效抑制背景信息给检测结果产生的影响,缓解信息丢失问题[19].并且同时考虑水平和垂直方向的特征信息,在可学习参数和计算代价相当的情况下,坐标信息嵌入更有助于图像分类.C A 结构如图2所示.图2 C A 机制模块F i g.2 C A m e c h a n i s m m o d u l e 2 本文算法设计2.1 改进的网络结构由于人的大脑组织结构复杂,因此要想将脑部肿瘤精确分类,就要精准地获取输入的脑肿瘤M R I1711 第5期崔 博,等:基于改进E f f i c i e n t N e t V 2网络的脑肿瘤分类方法 Copyright ©博看网. All Rights Reserved.2711吉林大学学报(理学版)第61卷图像的特征信息.E f f i c i e n t N e t V2网络模型中M B C o n v模块加入了S E注意力机制,S E注意力机制只考虑对通道间信息进行编码,而忽略了位置信息的重要性,导致注意力机制提取脑肿瘤的特征信息不全面,从而使模型的分类精准度较低.而C A注意力模块恰好可以更准确地获取输入的位置特征信息,有助于网络更准确地定位感兴趣的对象,同时使网络覆盖更大的区域.坐标注意力模块可以自适应地选择输入脑肿瘤图像的感受野大小和位置,使其能更好地适应不同尺度和形状的输入图像,从而提高脑肿瘤分类模型的鲁棒性和泛化能力,因此将S E注意力模块替换为C A注意力模块.在替换时,需要保持注意力模块的输入和输出维度不变,以保证模型的正确性.为使该网络计算速度更快,对量化更友好,进一步提高脑肿瘤分类的正确率,本文将原网络中的S i L U激活函数换为H a r d-S w i s h激活函数,改进后的网络结构如图3所示.图3改进后的网络结构F i g.3I m p r o v e dn e t w o r k s t r u c t u r e该实验脑肿瘤M R I图像的样本数量较少,网络中用D r o p o u t层减少训练网络中部分神经元的活性,以防止网络过拟合的发生,B N(b a t c hn o r m a l i z a t i o n)[20]通过减少梯度对参数或其初始值尺度的依赖,从而能使用更高的学习率,极大加快了深层神经网络的训练速度.2.2激活函数在原E f f i c i e n t N e t V2网络使用的S i L U激活函数是S w i s h激活函数的一个特例,当β=1时, S w i s h激活函数称为S i L U激活函数.该激活函数存在一些潜在的缺点:S i L U激活函数的导数对于较大的负输入可能会变得非常小,可能导致反向传播期间梯度消失.对于深度神经网络,会导致训练变得困难;虽然S i L U是一种灵活的激活函数,但它表达能力有限,可能会限制其在数据中表示复杂函数和模式的能力;S i L U激活函数缺乏可解释性,与其他一些激活函数(如S i g m o i d或双曲正切)不同, S i L U函数没有明确的概率解释,可能会使解释模型的行为和输出的含义变得更困难.S i L U激活函数曲线如图4所示.H a r d-S w i s h激活函数[21]是S w i s h激活函数非线性改进版本,其计算速度更快,计算过程只涉及基本的数学运算,如乘法和加法,实现相对简单,而S i L U涉及指数运算,计算速度相对较慢.并且H a r d-S w i s h激活函数对量化更友好,更适合深度卷积网络,同时也可增加脑肿瘤分类的正确率.因此本文将原网络中的S i L U激活函数换为H a r d-S w i s h激活函数.H a r d-S w i s h激活函数曲线如图5所示.训练卷积网络时,在梯度下降的过程中,梯度消失是一个常见的问题.相比于S i L U激活函数, H a r d-S w i s h激活函数在输入接近于0时的梯度更大,因此在这种情况下出现梯度消失问题的概率更小.H a r d-S w i s h激活函数更复杂,因此在训练深层神经网络时可以更好地捕捉特征,从而提高训练效果.S w i s h函数表达式为S w i s h(x)=xˑS i g m o i d(βx),(1)其中β表示可调参数.H a r d-S w i s h函数表达式为H a r d-S w i s h[x]=xˑR e L U(x+3)6.(2)Copyright©博看网. All Rights Reserved.图4 S i L U 激活函数曲线F i g .4 S i L Ua c t i v a t i o n f u n c t i o n c u r ve 图5 H a r d -S w i s h 激活函数曲线F i g.5 H a r d -S w i s ha c t i v a t i o n f u n c t i o n c u r v e 3 实验与分析3.1 数据集与预处理本文使用的脑肿瘤图像数据集来自公开数据集f i g s h a r e (w w w.f i gs h a r e .c o m ),该数据集由233名患者的脑部M R I 图像组成,包括横断面㊁冠状面和矢状面共3064张图像,其中脑膜瘤切片708张,胶质瘤切片1426张,垂体瘤切片930张.本文预处理首先将脑肿瘤图像数据集中的每张图像调整尺寸大小至512像素ˑ512像素;其次将M R I 图像原始数据信息进行特征提取,得到带有标签的样本集合;最后将数据集按照80%和20%划分为训练集和验证集.分类标签包括脑膜瘤㊁胶质瘤㊁垂体瘤3种类别标签.3.2 参数设置与评价指标3.2.1 实验环境本文实验基于P y T o r c h 深度学习框架,使用P y t h o n 3.6语言实现,搭建了基于改进E f f i c i e n t N e t V 2网络的脑肿瘤图像三分类框架.在L i n u xC e n t O S 7环境下完成,C U D A 版本为10.2.硬件设备为:I n t e l (R )X e o n (R )C P U E 5-26502.2G H z C P U ,N V I D I A T I T A N X Pˑ2显卡,12G B ˑ2显存.3.2.2 参数设置在训练过程中,本文实验使用随机梯度下降(S G D M )优化器优化所设计的模型,网络模型训练过程中涉及的超参数设置列于表2.表2 训练过程的超参数设置T a b l e 2 H y p e r p a r a m e t e r s e t t i n g s f o r t r a i n i n gp r o c e s s 超参数迭代次数批次大小优化器初始学习率D r o p o u t 数值2008S G D M 0.0030.2 因为仅凭准确率评价指标并不能充分证明网络设计及参数调整的合理性,因此,为全面评价本文改进的E f f i c i e n t N e t V 2网络的性能,使用准确率和混淆矩阵作为评价指标.混淆矩阵是一种用于总结分类算法性能的技术.如果每个类别中的观察数不相等,或者数据集中有两个以上类别,则仅凭分类准确性可能会产生误导.混淆矩阵提供真阳性(T P ),即预测为正㊁事实为正;真阴性(T N ),即预测为负㊁事实为负;假阳性(F P ),即预测为正㊁事实为负;假阴性(F N ),即预测为负㊁事实为正的值.在经过准确率和混淆矩阵进行网络模型评估后,这些值可用于计算本文算法的准确率㊁精确度㊁召回率和F 1-S c o r e 的数值[22-23].其中F 1-S c o r e 是准确率和召回率的调和平均评估指标[24].本文用这5个评价指标作为实验结果的评判标准.准确率表示所有预测正确的样本占总样本的百分数,用公式表示为A c c u r a c y=T P +T N T P +T N +F N +F P ˑ100%.(3)精确度表示预测正确的正类样本占所有预测为正类样本的百分数,用公式表示为3711 第5期崔 博,等:基于改进E f f i c i e n t N e t V 2网络的脑肿瘤分类方法 Copyright ©博看网. All Rights Reserved.4711吉林大学学报(理学版)第61卷P r e c i s i o n=T PT P+F Pˑ100%.(4)召回率表示预测正确的正类样本占所有事实正类样本情况的百分数.召回率在医学图像分类问题中又被称为灵敏度,灵敏度越高,说明对病灶区域越敏感,用公式表示为R e c a l l=T PT P+F Nˑ100%.(5) F1-S c o r e表示精确率和召回率的调和平均数,综合考虑了精确率和召回率,是一个综合评价指标,通常被用作分类模型性能的重要指标之一[25],用公式表示为F1-S c o r e=2ˑP r e c i s i o nˑR e c a l lP r e c i s i o n+R e c a l l.(6) 3.3实验结果与分析为证明本文基于改进的E f f i c i e n t N e t V2网络的M R I脑肿瘤图像分类算法的有效性,实验选择引入坐标注意力机制的分类模型,模型中还结合了H a r d-S w i s h激活函数㊁B N以及D r o p o u t层.同时选择了S E,C B AM(c o n v o l u t i o n a l b l o c k a t t e n t i o nm o d u l e)和E C A(e f f i c i e n t c h a n n e l a t t e n t i o n)注意力模块进行对比实验,观察模型算法并进行分析.数据集上训练和验证过程的准确率和损失函数曲线如图6所示,其中:灰色曲线表示E f f i c i e n t N e t V2原模型中引入的S E注意力模块的训练曲线;橙色㊁绿色曲线分别表示将E f f i c i e n t N e t V2模型中的S E替换为C B AM和E C A;紫色曲线为本文改进的引入C A注意力模块的训练曲线.将改进的E f f i c i e n t N e t V2模型称C A-E f f i c i e n t N e t V2模型.准确率的变化曲线能反应模型训练过程中对脑肿瘤的分类精度,准确率的值越高表示模型的分类正确率越高.损失值的变化曲线能反应模型训练过程中的优化结果,损失值越小表示模型的鲁棒性越好,当损失值趋于平稳时,表示模型训练达到了局部最优值.图6训练和验证的准确率和损失函数曲线F i g.6A c c u r a c y a n d l o s s f u n c t i o n c u r v e s f o r t r a i n i n g a n d v a l i d a t i o n由图6可见:在训练过程中4种模型的训练效果均较稳定,但C A-E f f i c i e n t N e t V2模型的精度明显高于其他3种模型,达到了98.9%,损失函数值明显低于其他3种模型,达到了0.022;在验证过程中,C A-E f f i c i e n t N e t V2模型的准确率有明显提升,训练曲线平稳且收敛效果好,达到了98.4%,其他3种模型训练过程中的振荡较严重,且准确率明显低于C A-E f f i c i e n t N e t V2模型;在损失函数值上,Copyright©博看网. All Rights Reserved.其他3种模型训练过程中的振荡极其严重,而C A -E f f i c i e n t N e t V 2模型的训练过程较稳定,虽有轻微振荡,但不影响实验结果,并且损失函数值低于其他模型.总体分析C A -E f f i c i e n t N e t V 2模型的训练曲线,由图6(A )和(B )可见,模型在前100轮保持稳定的上升趋势,在后100轮准确率保持稳定,说明模型得到了拟合.由图6(C )和(D )可见,模型在前40轮下降速度较快,在后面的轮数中下降趋于平缓,最后保持稳定,说明模型在训练过程中得到收敛.因此可证明C A -E f f i c i e n t N e t V 2模型对脑肿瘤有良好的分类效果.为更清楚地表达C A -E f f i c i e n t N e t V 2网络对每个类别的预测结果,也为更准确㊁更方便地计算评判模型的评价指标,本文给出了脑肿瘤图像数据集在该模型下预测结果的混淆矩阵,结果列于表3.在混淆矩阵中,从左上到右下的对角线表明每个类别正确预测的个数,对角线上的数量越多,说明模型对验证集数据的预测效果越好,以此进一步评估模型的性能.表3 预测结果的混淆矩阵T a b l e 3 C o n f u s i o nm a t r i x f o r p r e d i c t i o n r e s u l t s实际预测胶质瘤脑膜瘤垂体瘤胶质瘤11912脑膜瘤02940垂体瘤34190 由表3可见:122例脑膜瘤中只有3例被错误预测为垂体瘤;299例脑胶质瘤中有1例被错误预测为脑膜瘤,4例被错误预测为垂体瘤;192例垂体瘤中只有2例被错误预测为胶质瘤.为更直观地表达本文网络模型的分类效果,将脑膜瘤㊁胶质瘤㊁垂体瘤三类在P r e c i s i o n ,R e c a l l ,S p e c i f i c i t y ,F 1-S c o r e 评价指标上进行详细分析,结果列于表4.表4 本文网络模型的评价指标T a b l e 4 E v a l u a t i o n i n d e x e s o f p r o p o s e dn e t w o r km o d e l %类别P r e c i s i o n R e c a l l S p e c i f i c i t y F 1-S c o r e 胶质瘤97.597.599.497.5脑膜瘤98.3100.098.499.1垂体瘤99.096.499.597.7 经公式计算该模型在P r e c i s i o n ,R e c a l l ,S p e c i f i c i t y ,F 1-S c o r e4个评价指标上的均值达到98.3%,98.0%,99.1%,98.1%,进一步证明了改进的网络模型在脑肿瘤分类中的稳定性和鲁棒性.本文将C A -E f f i c i e n t N e t V 2网络模型与其他文献中使用的相同数据集的分类方法进行对比,对比结果列于表5.由表5可见:文献[6]使用传统机器学习的方法,采用灰度直方图㊁灰度共生矩阵(G L C M )和词袋模型(B OW )3种特征提取方法对脑肿瘤进行分类,该方法最终的分类准确率为91.28%;文献[26]采用G A N 网络作为鉴别器,用S o f t m a x 分类器代替最后一个全连接层,最终对脑肿瘤的分类准确率为93.01%;文献[27]采用多尺寸卷积核模块和多深度融合残差块相结合的方法,最终得到的脑肿瘤分类准确率为93.51%;本文方法得到的分类准确率达到98.4%.实验结果表明,本文方法对脑膜瘤㊁脑胶质瘤和垂体瘤3种脑部肿瘤有良好的分类效果.表5 不同方法的分类准确率对比T a b l e 5 C o m p a r i s o no f c l a s s i f i c a t i o na c c u r a c y o f d i f f e r e n tm e t h o d s 方法文献[6]文献[26]文献[27]本文分类准确性/%91.2893.0193.5198.403.4 消融实验与分析为更好地评估C A -E f f i c i e n t N e t V 2网络模型的性能,本文进行了消融实验.实验将分别验证引入坐标注意力模块和H a r d -S w i s h 激活函数这两个改进方法对模型准确率和损失率的影响.首先在原始数据集上进行测试,虽然模型在复杂程度上有所增加,但改进后的模型分类精确度有明显优势,在不同模块部分的消融对比结果列于表6.由表6可见:仅将S i L U 激活函数替换为H a r d -5711 第5期崔 博,等:基于改进E f f i c i e n t N e t V 2网络的脑肿瘤分类方法 Copyright ©博看网. All Rights Reserved.6711吉林大学学报(理学版)第61卷S w i s h激活函数的实验表明,虽然在模型性能方面有所提升,但准确率提升并不明显;当引入了坐标注意力模块后,模型的准确率显著提高,损失率也有所改善;同时引入以上两个模块,脑肿瘤图像分类的网络性能得到明显提升.通过消融实验,进一步证明了改进后的模型具有更好的分类性能.表6在不同模块部分的消融对比结果T a b l e6C o m p a r i s o no f a b l a t i o n r e s u l t s i nd i f f e r e n tm o d u l e s e c t i o n s模型基准模型准确率/%损失E f f i c i e n t N e t V2E f f i c i e n t N e t V294.80.165E f f i c i e n t N e t V2+C A E f f i c i e n t N e t V297.20.073E f f i c i e n t N e t V2+H a r d-S w i s h E f f i c i e n t N e t V295.60.101E f f i c i e n t N e t V2+C A+H a r d-S w i s h E f f i c i e n t N e t V298.40.060综上所述,针对脑肿瘤磁共振图像分类问题中过拟合及分类准确率较低的问题,本文提出了一种基于改进E f f i c i e n t N e t V2网络的脑肿瘤分类方法.首先,介绍了注意力机制的原理和模块,同时介绍了改进的H a r d-S w i s h激活函数;其次,介绍了基于改进E f f i c i e n t N e t V2网络的脑肿瘤分类方法,将原网络中的S E注意力机制替换成C A注意力机制,解决了因S E注意力机制在捕获特征信息不足时产生网络分类精度低的问题,实验结果表明,C A注意力机制在E f f i c e n t N e t V2网络模型中针对脑肿瘤分类的精度有很大提升;再次,本文改进了H a r d-S w i s h激活函数并且融合了B N和D r o p o u t层,加快了网络计算速度,避免发生过拟合问题,使网络在稳定性更好的前提下增加了准确率,因此改进后的模型相对于其他模型有较高的分类精度,并且性能也优于其他模型,分类准确率达到98.4%;最后,通过对比实验和消融实验验证了改进脑肿瘤分类模型的先进性和有效性,该模型能有效提高医生的诊断效率.参考文献[1] U S MA N K,R A J P O O T K.B r a i nT u m o rC l a s s i f i c a t i o nf r o m M u l t i-m o d a l i t y M R IU s i n g W a v e l e t sa n d M a c h i n eL e a r n i n g[J].P a t t e r nA n a l y s i s a n dA p p l i c a t i o n s,2017,20(3):871-881.[2] P O L A TÖ,GÜN G E N C.C l a s s i f i c a t i o no fB r a i nT u m o r s f r o m M RI m a g e sU s i n g D e e p T r a n s f e rL e a r n i n g[J].T h e J o u r n a l o f S u p e r c o m p u t i n g,2021,77(7):7236-7252.[3] P R A D HA N A,M I S H R A D,D A S K,e ta l.O nt h eC l a s s i f i c a t i o no f M RI m a g e s U s i n g E L M-S S A C o a t e dH y b r i d M o d e l[J].M a t h e m a t i c s,2021,9(17):2095-1-2095-21.[4] S WA T I Z N K,Z HA O Q H,K A B I R M,e ta l.B r a i n T u m o rC l a s s i f i c a t i o nf o r M RI m a g e s U s i n g T r a n s f e rL e a r n i n g a n dF i n e-T u n i n g[J].C o m p u t e r i z e d M e d i c a l I m a g i n g a n dG r a p h i c s,2019,75(9):34-46.[5] S H E L HAM E RE,L O N GJ,D A R R E L L T.F u l l y C o n v o l u t i o n a lN e t w o r k sf o rS e m a n t i cS e g m e n t a t i o n[C]//P r o c e e d i n g s o f t h e I E E EC o n f e r e n c e o nC o m p u t e rV i s i o n a n dP a t t e r nR e c o g n i t i o n.P i s c a t a w a y,N J:I E E E,2015: 3431-3440.[6] C H E N GJ,HU A N G W,C A O SL,e t a l.C o r r e c t i o n:E n h a n c e dP e r f o r m a n c eo fB r a i nT u m o rC l a s s i f i c a t i o nv i aT u m o rR e g i o nA u g m e n t a t i o na n dP a r t i t i o n[J].P l o SO n e,2015,10(12):e0144479-1-e0144479-13.[7] D A SS,A R A N Y A OF M RR,L A B I B A N N.B r a i nT u m o rC l a s s i f i c a t i o nU s i n g C o n v o l u t i o n a lN e u r a lN e t w o r k[C]//20191s t I n t e r n a t i o n a l C o n f e r e n c e o n A d v a n c e s i n S c i e n c e,E n g i n e e r i n g a n d R o b o t i c s T e c h n o l o g y(I C A S E R T).P i s c a t a w a y,N J:I E E E,2019:1-5.[8] B A D㊅Z A M M,B A R J A K T A R O V I C'M㊅C.C l a s s i f i c a t i o n o f B r a i nT u m o r s f r o m M R I I m a g e sU s i n g aC o n v o l u t i o n a lN e u r a lN e t w o r k[J].A p p l i e dS c i e n c e s,2020,10(6):1999-1-1999-23.[9] M Z O U G H IH,N J E HI,WA L IA,e t a l.D e e p M u l t i-s c a l e3D C o n v o l u t i o n a lN e u r a lN e t w o r k(C N N)f o r M R IG l i o m a sB r a i nT u m o rC l a s s i f i c a t i o n[J].J o u r n a l o fD i g i t a l I m a g i n g,2020,33(4):903-915.[10] L E C U N Y,B E N G I O Y,H I N T O N G.D e e p L e a r n i n g[J].N a t u r e,2015,521:436-444.[11] K R I Z H E V S K Y A,S U T S K E V E RI,H I N T O N G E.I m a g e n e tC l a s s i f i c a t i o n w i t h D e e p C o n v o l u t i o n a lN e u r a lN e t w o r k s[J].C o mm u n i c a t i o n s o f t h eA C M,2017,60(6):84-90.Copyright©博看网. All Rights Reserved.[12] Q I A OSY ,Z HA N G Z S ,S H E N W ,e ta l .G r a d u a l l y U p d a t e d N e u r a l N e t w o r k sf o r L a r g e -S c a l eI m a g e R e c o g n i t i o n [C ]//I n t e r n a t i o n a l C o n f e r e n c e o n M a c h i n eL e a r n i n g .[S .l .]:P M L R ,2018:4188-4197.[13] 李鹏松,李俊达,倪天宇,等.基于图像特征的卷积核初始化方法[J ].吉林大学学报(理学版),2021,59(3):587-594.(L IPS ,L I JD ,N IT Y ,e t a l .AC o n v o l u t i o n a lK e r n e l I n i t i a l i z a t i o n M e t h o dB a s e do n I m a g eF e a t u r e s [J ].J o u r n a l o f J i l i nU n i v e r s i t y (S c i e n c eE d i t i o n ),2021,59(3):587-594.)[14] 周祥全,张津.深层网络中的梯度消失现象[J ].科技展望,2017,27(27):284.(Z HO U X Q ,Z HA N GJ .G r a d i e n tV a n i s h i n g P h e n o m e n o n i nD e e p N e t w o r k s [J ].S c i e n c e a n dT e c h n o l o g y O u t l o o k ,2017,27(27):284.)[15] T A N M X ,L E Q V.E f f i c i e n t n e t v 2:S m a l l e rM o d e l s a n dF a s t e rT r a i n i n g [E B /O L ].(2021-04-01)[2022-02-01].h t t p s ://a r x i v .o r g /a b s /2104.00298.[16] T A N M X ,L E Q V.E f f i c i e n t n e t :R e t h i n k i n g M o d e lS c a l i n g fo r C o n v o l u t i o n a l N e u r a l N e t w o r k s [C ]//I n t e r n a t i o n a l C o n f e r e n c e o n M a c h i n eL e a r n i n g .[S .l .]:P M L R ,2019:6105-6114.[17] G U P T A S ,A K I N B .A c c e l e r a t o r -A w a r e N e u r a l N e t w o r k D e s i g n U s i n g a u t o m l [EB /O L ].(2020-05-05)[2022-02-20].h t t p s ://a r x i v .o r g /a b s /2003.02838.[18] HO U QB ,Z HO UDQ ,F E N GJ S .C o o r d i n a t eA t t e n t i o n f o rE f f i c i e n tM o b i l eN e t w o r kD e s i g n [C ]//P r o c e e d i n gs o ft h e I E E E /C V F C o n f e r e n c e o n C o m p u t e r V i s i o n a n d P a t t e r n R e c o g n i t i o n .P i s c a t a w a y,N J :I E E E ,2021:13713-13722.[19] 安晨,汪成亮,廖超,等.基于注意力关系网络的无线胶囊内镜图像分类方法[J ].计算机工程,2021,47(10):252-268.(A NC ,WA N GCL ,L I A OC ,e t a l .A W i r e l e s sC a p s u l eE n d o s c o p y I m a g eC l a s s i f i c a t i o nM e t h o dB a s e d o nA t t e n t i o nR e l a t i o nN e t w o r k [J ].C o m p u t e rE n g i n e e r i n g ,2021,47(10):252-268.)[20] I O F F E S ,S Z E G E D Y C .B a t c h N o r m a l i z a t i o n :A c c e l e r a t i n g D e e p N e t w o r k T r a i n i n g b y R e d u c i n g In t e r n a l C o v a r i a t eS h i f t [C ]//I n t e r n a t i o n a l C o n f e r e n c e o n M a c h i n eL e a r n i n g .[S .l .]:P M L R ,2015:448-456.[21] HOWA R D A ,S A N D L E R M ,C HU G ,e ta l .S e a r c h i n g f o r M o b i l e n e t v 3[C ]//P r o c e e d i n gso f t h eI E E E /C V F I n t e r n a t i o n a l C o n f e r e n c e o nC o m p u t e rV i s i o n .P i s c a t a w a y ,N J :I E E E ,2019:1314-1324.[22] B I S WA SS ,HA Z R A R.A c t i v eC o n t o u r sD r i v e nb y M o d i f i e dL o GE n e r g y T e r ma n dO p t i m i s e dP e n a l t y Te r mf o r I m ag eS e g m e n t a t i o n [J ].I E TI m a g eP r o c e s s i n g ,2020,14(13):3232-3242.[23] B I S WA SS ,HA Z R A R.A L e v e l S e tM o d e l b y R e g u l a r i z i n g L o c a lF i t t i n g E n e r g y a n dP e n a l t y E n e r g y Te r mf o r I m ag eS e g m e n t a t i o n [J ].S i g n a l P r o c e s s i n g ,2021,183:108043-1-108043-15.[24] Z HA N GJP ,L IZ W ,Y A N GJ .A P a r a l l e lS VM T r a i n i n g A l g o r i th m o nL a r g e -S c a l eC l a s s i f i c a t i o nP r o b l e m s [C ]//2005I n t e r n a t i o n a l C o n f e r e n c e o n M a c h i n e L e a r n i n g a n d C y b e r n e t i c s .P i s c a t a w a y,N J :I E E E ,2005:1637-1641.[25] 蒋瑞,刘哲,宋余庆,等.基于联合特征学习和多重迁移学习的肝脏病变分类[J ].江苏大学学报(自然科学版),2021,42(5):554-568.(J I A N G R ,L I U Z ,S O N G Y Q ,e ta l .C l a s s i f i c a t i o no fL i v e rL e s i o n sB a s e do nJ o i n tF e a t u r eL e a r n i n g a n d M u l t i p l eM i g r a t i o nL e a r n i n g [J ].J o u r n a l o f J i a n g s uU n i v e r s i t y (N a t u r a l S c i e n c eE d i t i o n ),2021,42(5):554-568.)[26] G HA S S E M IN ,S HO E I B I A ,R O UHA N I M.D e e p Ne u r a l N e t w o r k w i t h G e n e r a t i v e A d v e r s a r i a l N e t w o r k s P r e -t r a i n i n g f o rB r a i nT u m o rC l a s s i f i c a t i o nB a s e do n M RI m a g e s [J ].B i o m e d i c a l S i g n a lP r o c e s s i n g a n dC o n t r o l ,2020,57:101678-1-101678-8.[27] 夏景明,邢露萍,谈玲,等.基于M D M -R e s N e t 的脑肿瘤分类方法[J ].南京信息工程大学学报(自然科学版),2022,14(2):212-219.(X I AJ M ,X I N G LP ,T A N L ,e t a l .M D M -R e s N e t -B a s e dB r a i nT u m o rC l a s s i f i c a t i o n M e t h o d [J ].J o u r n a l o fN a n j i n g U n i v e r s i t y o f I n f o r m a t i o nE n g i n e e r i n g (N a t u r a l S c i e n c eE d i t i o n ),2022,14(2):212-219.)(责任编辑:韩 啸)7711 第5期崔 博,等:基于改进E f f i c i e n t N e t V 2网络的脑肿瘤分类方法 Copyright ©博看网. All Rights Reserved.。
3D Convolutional Neural Networks for Human Action Recognition
Shuiwang Ji shuiwang.ji@ Arizona State University,Tempe,AZ85287,USAWei Xu xw@ Ming Yang myang@ Kai Yu kyu@ NEC Laboratories America,Inc.,Cupertino,CA95014,USAAbstractWe consider the fully automated recognitionof actions in uncontrolled environment.Mostexisting work relies on domain knowledge toconstruct complex handcrafted features frominputs.In addition,the environments areusually assumed to be controlled.Convolu-tional neural networks(CNNs)are a type ofdeep models that can act directly on the rawinputs,thus automating the process of fea-ture construction.However,such models arecurrently limited to handle2D inputs.In thispaper,we develop a novel3D CNN model foraction recognition.This model extracts fea-tures from both spatial and temporal dimen-sions by performing3D convolutions,therebycapturing the motion information encodedin multiple adjacent frames.The developedmodel generates multiple channels of infor-mation from the input frames,and thefinalfeature representation is obtained by com-bining information from all channels.Weapply the developed model to recognize hu-man actions in real-world environment,andit achieves superior performance without re-lying on handcrafted features.1.IntroductionRecognizing human actions in real-world environmentfinds applications in a variety of domains including in-telligent video surveillance,customer attributes,andshopping behavior analysis.However,accurate recog-nition of actions is a highly challenging task due to1/projects/trecvid/handcrafted features,demonstrating that the3D CNN model is more effective for real-world environments such as those captured in TRECVID data.The exper-iments also show that the3D CNN model significantly outperforms the frame-based2D CNN for most tasks. We also observe that the performance differences be-tween3D CNN and other methods tend to be larger when the number of positive training samples is small.2.3D Convolutional Neural Networks In2D CNNs,2D convolution is performed at the con-volutional layers to extract features from local neigh-borhood on feature maps in the previous layer.Then an additive bias is applied and the result is passed through a sigmoid function.Formally,the value of unit at position(x,y)in the j th feature map in the i th layer,denoted as v xyij,is given byv xyij=tanh b ij+ m P i−1 p=0Q i−1 q=0w pq ijm v(x+p)(y+q)(i−1)m ,(1)where tanh(·)is the hyperbolic tangent function,b ij is the bias for this feature map,m indexes over the set of feature maps in the(i−1)th layer connectedto the current feature map,w pqijkis the value at the position(p,q)of the kernel connected to the k th fea-ture map,and P i and Q i are the height and width of the kernel,respectively.In the subsampling lay-ers,the resolution of the feature maps is reduced by pooling over local neighborhood on the feature maps in the previous layer,thereby increasing invariance to distortions on the inputs.A CNN architecture can be constructed by stacking multiple layers of convolution and subsampling in an alternating fashion.The pa-rameters of CNN,such as the bias b ij and the kernelweight w pqijk,are usually trained using either super-vised or unsupervised approaches(LeCun et al.,1998; Ranzato et al.,2007).2.1.3D ConvolutionIn2D CNNs,convolutions are applied on the2D fea-ture maps to compute features from the spatial dimen-sions only.When applied to video analysis problems, it is desirable to capture the motion information en-coded in multiple contiguous frames.To this end,we propose to perform3D convolutions in the convolution stages of CNNs to compute features from both spa-tial and temporal dimensions.The3D convolution is achieved by convolving a3D kernel to the cube formed by stacking multiple contiguous frames together.By this construction,the feature maps in the convolution layer is connected to multiple contiguous frames in theDate\Class Total26921349784520056182030758533220954653621870819604416235821156135898485957281848051428 Total235561Method Measure3D CNN Precision0.02820.02560.01520.0230 AUC(×103)Precision0.11090.13560.09310.1132 AUC(×103)2D CNN Precision0.00970.01760.01920.0155 AUC(×103)Precision0.05050.09740.10200.0833 AUC(×103)SPM cubegray Precision0.00880.01920.01910.0157 AUC(×103)Precision0.05580.09610.09880.0836 AUC(×103)SPM cubeMEHI Precision0.01490.01660.01560.0157 AUC(×103)Precision0.08720.08250.10060.0901 AUC(×103)In this work,we considered the CNN model for ac-tion recognition.There are also other deep architec-tures,such as the deep belief networks(Hinton et al., 2006;Lee et al.,2009a),which achieve promising per-formance on object recognition tasks.It would be in-teresting to extend such models for action recognition. The developed3D CNN model was trained using su-pervised algorithm in this work,and it requires a large number of labeled samples.Prior studies show that the number of labeled samples can be significantly reduced when such model is pre-trained using unsupervised al-gorithms(Ranzato et al.,2007).We will explore the unsupervised training of3D CNN models in the future. AcknowledgmentsThe main part of this work was done during the intern-ship of thefirst author at NEC Laboratories America, Inc.,Cupertino,CA.ReferencesAhmed,A.,Yu,K.,Xu,W.,Gong,Y.,and Xing,E. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks.In ECCV,pp.69–82,2008.Bengio,Y.Learning deep architectures for AI.Foun-dations and Trends in Machine Learning,2(1):1–127,2009.Bromley,J.,Guyon,I.,LeCun,Y.,Sackinger,E.,and Shah,R.Signature verification using a siamese time delay neural network.In NIPS.1993. Collobert,R.and Weston,J.A unified architecture for natural language processing:deep neural net-works with multitask learning.In ICML,pp.160–167,2008.Doll´a r,P.,Rabaud,V.,Cottrell,G.,and Belongie, S.Behavior recognition via sparse spatio-temporal features.In ICCV VS-PETS,pp.65–72,2005.Method Average 90949784799797.959.773.660.454.983.8937785578590988693538882929892858796––––––Efros,A.A.,Berg,A.C.,Mori,G.,and Malik,J. Recognizing action at a distance.In ICCV,pp.726–733,2003.Fukushima,K.Neocognitron:A self-organizing neural network model for a mechanism of pattern recogni-tion unaffected by shift in position.Biol.Cyb.,36: 193–202,1980.Hinton,G.E.and Salakhutdinov,R.R.Reducing the dimensionality of data with neural networks.Sci-ence,313(5786):504–507,July2006.Hinton,G.E.,Osindero,S.,and Teh,Y.A fast learn-ing algorithm for deep belief nets.Neural Computa-tion,18:1527–1554,2006.Jain,V.,Murray,J.F.,Roth,F.,Turaga,S.,Zhigulin, V.,Briggman,K.L.,Helmstaedter,M.N.,Denk, W.,and Seung,H.S.Supervised learning of image restoration with convolutional networks.In ICCV, 2007.Jhuang,H.,Serre,T.,Wolf,L.,and Poggio,T.A biologically inspired system for action recognition. In ICCV,pp.1–8,2007.Kim,H.-J.,Lee,J.S.,and Yang,H.-S.Human ac-tion recognition using a modified convolutional neu-ral network.In Proceedings of the4th International Symposium on Neural Networks,pp.715–723,2007. Laptev,I.and P´e rez,P.Retrieving actions in movies. In ICCV,pp.1–8,2007.Lazebnik,S.,Achmid,C.,and Ponce,J.Beyond bags of features:Spatial pyramid matching for recogniz-ing natural scene categories.In CVPR,pp.2169–2178,2006.LeCun,Y.,Bottou,L.,Bengio,Y.,and Haffner,P. Gradient-based learning applied to document recog-nition.Proceedings of the IEEE,86(11):2278–2324, 1998.LeCun,Y.,Huang,F.-J.,and Bottou,L.Learning methods for generic object recognition with invari-ance to pose and lighting.In CVPR,2004.Lee,H.,Grosse,R.,Ranganath,R.,and Ng,A.Y. Convolutional deep belief networks for scalable un-supervised learning of hierarchical representations. In ICML,pp.609–616,2009a.Lee,H.,Pham,P.,Largman,Y.,and Ng,A.Unsuper-vised feature learning for audio classification using convolutional deep belief networks.In NIPS,pp. 1096–1104.2009b.Lowe,D.G.Distinctive image features from scale in-variant keypoints.International Journal of Com-puter Vision,60(2):91–110,2004.Mobahi,H.,Collobert,R.,and Weston,J.Deep learn-ing from temporal coherence in video.In ICML,pp. 737–744,2009.Mutch,J.and Lowe,D.G.Object class recognition and localization using sparse features with limited receptivefields.International Journal of Computer Vision,80(1):45–57,October2008.Niebles,J.C.,Wang,H.,and Fei-Fei,L.Unsupervised learning of human action categories using spatial-temporal words.International Journal of Computer Vision,79(3):299–318,2008.Ning,F.,Delhomme,D.,LeCun,Y.,Piano,F.,Bot-tou,L.,and Barbano,P.Toward automatic phe-notyping of developing embryos from videos.IEEE Trans.on Image Processing,14(9):1360–1371,2005. Ranzato,M.,Huang,F.-J.,Boureau,Y.,and LeCun, Y.Unsupervised learning of invariant feature hier-archies with applications to object recognition.In CVPR,2007.Schindler,K.and Van Gool,L.Action snippets: How many frames does human action recognition require?In CVPR,2008.Sch¨u ldt,C.,Laptev,I.,and Caputo,B.Recognizing human actions:A local SVM approach.In ICPR, pp.32–36,2004.Serre,T.,Wolf,L.,and Poggio,T.Object recognition with features inspired by visual cortex.In CVPR, pp.994–1000,2005.Yang,M.,Lv,F.,Xu,W.,Yu,K.,and Gong,Y.Hu-man action detection by boosting efficient motion features.In IEEE Workshop on Video-oriented Ob-ject and Event Classification,2009.Yu,K.,Xu,W.,and Gong,Y.Deep learning with kernel regularization for visual recognition.In NIPS, pp.1889–1896,2008.。
稀疏总结
稀疏表示在目标检测方面的学习总结1,稀疏表示的兴起大量研究表明视觉皮层复杂刺激的表达采用的是稀疏编码原则,以稀疏编码为基础的稀疏表示方法能较好刻画人类视觉系统对图像的认知特性,已引起人们极大的兴趣和关注,在机器学习和图像处理领域得到了广泛应用,是当前国内外的研究热点之一.[1]Vinje W E ,Gallant J L .Sparse coding and decorrelation in pri- mary visual cortex during natural vision [J].Science ,2000,287(5456):1273-1276.[2]Nirenberg S ,Carcieri S ,Jacobs A ,et al .Retinal ganglion cells act largely as independent encoders [J ].Nature ,2001,411(6838):698-701.[3]Serre T ,Wolf L ,Bileschi S ,et al .Robust object recognition with cortex-like mechanisms[J].IEEE Transactions on PatternAnalysis and Machine Intelligence ,2007,29(3):411-426.[4]赵松年,姚力,金真,等.视像整体特征在人类初级视皮层上的稀疏表象:脑功能成像的证据[J].科学通报,2008,53(11):1296-1304.图像稀疏表示研究主要沿着两条线展开:单一基方法和多基方法.前者主要是多尺度几何分析理论,认为图像具有非平稳性和非高斯性,用线性算法很难处理,应建立适合处理边缘及纹理各层面几何结构的图像模型,以脊波(Ridgelet)、曲波(Curvelet)等变换为代表的多尺度几何分析方法成为图像稀疏表示的有效途径;后者以Mallat 和Zhang 提出的过完备字典分解理论为基础,根据信号本身的特点自适应选取能够稀疏表示信号的冗余基。
基于动态贝叶斯模型的人工智能近似推理研究
基于动态贝叶斯模型的人工智能近似推理研究人工智能(Artificial Intelligence, AI)作为一门涉及模拟、理解和实现人类智能的学科,近年来取得了巨大的进展。
其中,推理是AI领域中的一个重要研究方向。
推理是指基于已有的知识和信息,通过逻辑和推断来得出新的结论或解决问题的过程。
而近似推理是指在不完全或不确定信息下进行推理,并得出可能性最大或最优解决方案。
动态贝叶斯模型(Dynamic Bayesian Network, DBN)是一种用于建模不确定性、动态系统和时间序列数据的概率图模型。
它结合了贝叶斯网络和时间序列分析方法,能够有效地处理时序数据,并进行概率推断。
本文旨在探讨基于动态贝叶斯模型的人工智能近似推理方法,并分析其在实际应用中的优势和局限性。
首先,我们将介绍动态贝叶斯网络及其基本原理。
动态贝叶斯网络是对时间序列数据进行建模和分析的有效工具。
它通过将时间上连续且相关联的随机变量表示为一个有向无环图(DAG)来描述系统的动态性质。
每个节点表示一个随机变量,每个边表示变量之间的依赖关系。
动态贝叶斯网络通过概率分布和条件概率分布来描述节点和边之间的关系,从而进行推理和预测。
接下来,我们将讨论动态贝叶斯模型在人工智能推理中的应用。
人工智能推理是指通过逻辑和推断从已知信息中得出新的结论或解决问题。
在不完全或不确定信息下进行推理是人工智能中一个重要而困难的问题。
动态贝叶斯模型通过引入概率和条件概率分布,能够对不完全或不确定信息进行建模,并进行近似推理。
然后,我们将探讨基于动态贝叶斯模型的人工智能近似推理方法。
在实际应用中,我们常常面临着大量、复杂且不完全的数据。
基于动态贝叶斯模型的近似推理方法通过对数据进行建模,并使用概率和条件概率分布来计算后验概率,从而得出最可能或最优解决方案。
最后,我们将分析基于动态贝叶斯模型的人工智能近似推理方法的优势和局限性。
优势在于能够处理不完全或不确定信息,并进行近似推理。
异质信息网络中基于表征学习的推荐算法研究
异质信息网络中基于表征学习的推荐算法研究随着互联网的快速发展和智能设备的普及,信息爆炸式增长给用户带来了巨大的信息过载问题。
在这样的背景下,个性化推荐系统应运而生,它通过分析用户的历史行为和兴趣,将用户感兴趣的信息推荐给他们。
然而,传统的推荐算法往往只考虑了用户和物品之间的关系,忽视了异质信息网络中的复杂关系。
因此,基于表征学习的推荐算法成为了解决这一问题的研究热点。
异质信息网络中存在多种类型的节点和边,例如用户、物品、标签等。
每个节点和边都具有丰富的属性和关系信息,而传统的推荐算法难以有效地利用这些信息。
基于表征学习的推荐算法通过学习节点和边的低维度表征向量,将复杂的网络结构转化为简化的向量表示,从而实现了对异质信息网络的建模和推荐任务的优化。
表征学习是一种通过学习节点和边的向量表示来捕捉网络结构和属性信息的方法。
常用的表征学习算法包括DeepWalk、Node2Vec和GraphSAGE等。
这些算法通过随机游走或图神经网络的方式,将网络中的节点转化为向量。
在推荐任务中,表征学习算法可以通过最大化推荐结果的准确性和覆盖率来优化节点的表征向量。
基于表征学习的推荐算法在异质信息网络中具有较好的性能。
首先,它可以充分利用节点和边的属性信息,挖掘用户和物品之间的潜在关系。
其次,它可以通过学习节点的表征向量,将用户和物品映射到同一向量空间,从而实现了跨类型推荐。
此外,基于表征学习的推荐算法还可以通过引入注意力机制和多任务学习等技术,进一步提升推荐结果的质量。
尽管基于表征学习的推荐算法在异质信息网络中取得了一定的研究进展,但仍存在一些问题和挑战。
例如,如何处理网络中的噪声和缺失数据,如何平衡推荐结果的多样性和准确性等。
因此,未来的研究可以从这些方面展开,进一步改进和完善基于表征学习的推荐算法。
总之,异质信息网络中基于表征学习的推荐算法是解决个性化推荐中复杂网络结构和属性信息的有效方法。
它通过学习节点的表征向量,实现了对异质信息网络的建模和推荐任务的优化。
攻克“贫瘠高原”量子人工智能获技术突破
65
Copyright©博看网 Bookan. All Rights Reserved.
中国民商 2022年第01期/总第109期
4
索尼开发出新能够在不平路面上稳定和高效移动 的机器人。这款机器人底部采用了 六条腿的轮子设计,腿部结构由六 个装有轮子的驱动器组成。机器人 在平地上使用轮子,而在楼梯等有 高度差的地方则使用轮子和腿来上 下移动。索尼称,这种设计让该机器 人无论在平坦还是不平坦的路面都 能稳定、高效的运行。该模型继承 了 2021 年国际智能机器人和系统
(责任编辑 顾岩娜)
编译自Scitech daily网站
63
Copyright©博看网 Bookan. All Rights Reserved.
科技|前沿
1
测。由于美国新冠病毒 Omicron 爆发,因此 这款检测仪一经上市便售罄,每天限量发售。
Detect 检测仪的原理是寻找病毒的基因片
段,采用“快速分子检测”的方式,而不是一些
文 / Los Alamos National Laboratory 编译 /《中国民商》 李雨蒙
在量子计算机上运行的卷积神经网络,其分析 量子数据的能力优于传统计算机,由此引发 了大量的关注。一个被称为“贫瘠高原”(Barren Plateau)的基础可解性问题限制了神经网络在大 数据集方面的应用。近日,来自美国洛斯阿拉莫斯 国家实验室(LANL)和伦敦大学的研究人员构建 了一个不存在贫瘠高原的特定量子神经网络架构, 这篇研究论文以《Absence of Barren Plateaus in Quantum Convolutional Neural Networks》为题 发表在《PHYSICAL REVIEW X》杂志上。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Sparse Feature Learning for Deep Belief NetworksMarc’Aurelio Ranzato1Y-Lan Boureau2,1Yann LeCun11Courant Institute of Mathematical Sciences,New York University2INRIA Rocquencourt{ranzato,ylan,yann@}AbstractUnsupervised learning algorithms aim to discover the structure hidden in the data,and to learn representations that are more suitable as input to a supervised machinethan the raw input.Many unsupervised methods are based on reconstructing theinput from the representation,while constraining the representation to have cer-tain desirable properties(e.g.low dimension,sparsity,etc).Others are based onapproximating density by stochastically reconstructing the input from the repre-sentation.We describe a novel and efficient algorithm to learn sparse represen-tations,and compare it theoretically and experimentally with a similar machinetrained probabilistically,namely a Restricted Boltzmann Machine.We propose asimple criterion to compare and select different unsupervised machines based onthe trade-off between the reconstruction error and the information content of therepresentation.We demonstrate this method by extracting features from a datasetof handwritten numerals,and from a dataset of natural image patches.We showthat by stacking multiple levels of such machines and by training sequentially,high-order dependencies between the input observed variables can be captured.1IntroductionOne of the main purposes of unsupervised learning is to produce good representations for data,that can be used for detection,recognition,prediction,or visualization.Good representations eliminate irrelevant variabilities of the input data,while preserving the information that is useful for the ul-timate task.One cause for the recent resurgence of interest in unsupervised learning is the ability to produce deep feature hierarchies by stacking unsupervised modules on top of each other,as pro-posed by Hinton et al.[1],Bengio et al.[2]and our group[3,4].The unsupervised module at one level in the hierarchy is fed with the representation vectors produced by the level below.Higher-level representations capture high-level dependencies between input variables,thereby improving the ability of the system to capture underlying regularities in the data.The output of the last layer in the hierarchy can be fed to a conventional supervised classifier.A natural way to design stackable unsupervised learning systems is the encoder-decoder paradigm[5].An encoder transforms the input into the representation(also known as the code or the feature vector),and a decoder reconstructs the input(perhaps stochastically)from the repre-sentation.PCA,Auto-encoder neural nets,Restricted Boltzmann Machines(RBMs),our previous sparse energy-based model[3],and the model proposed in[6]for noisy overcomplete channels are just examples of this kind of architecture.The encoder/decoder architecture is attractive for two rea-sons:1.after training,computing the code is a very fast process that merely consists in running the input through the encoder;2.reconstructing the input with the decoder provides a way to check that the code has captured the relevant information in the data.Some learning algorithms[7]do not have a decoder and must resort to computationally expensive Markov Chain Monte Carlo(MCMC)sam-pling methods in order to provide reconstructions.Other learning algorithms[8,9]lack an encoder, which makes it necessary to run an expensive optimization algorithm tofind the code associated with each new input sample.In this paper we will focus only on encoder-decoder architectures.In general terms,we can view an unsupervised model as defining a distribution over input vectors Y through an energy function E(Y,Z,W):P(Y|W)= z P(Y,z|W)= z e−βE(Y,z,W)y,z e−βE(y,z,W)(1)where Z is the code vector,W the trainable parameters of encoder and decoder,andβis an arbitrary positive constant.The energy function includes the reconstruction error,and perhaps other terms as well.For convenience,we will omit W from the notation in the following.Training the machine to model the input distribution is performed byfinding the encoder and decoder parameters that minimize a loss function equal to the negative log likelihood of the training data under the model. For a single training sample Y,the loss function isL(W,Y)=−1βlog z e−βE(Y,z)+1βlog y,z e−βE(y,z)(2)Thefirst term is the free energy Fβ(Y).Assuming that the distribution over Z is rather peaked,it can be simpler to approximate this distribution over Z by its mode,which turns the marginalization over Z into a minimization:L∗(W,Y)=E(Y,Z∗(Y))+1βlog y e−βE(y,Z∗(y))(3)where Z∗(Y)is the maximum likelihood value Z∗(Y)=argmin z E(Y,z),also known as the optimal code.We can then define an energy for each input point,that measures how well it is reconstructed by the model:F∞(Y)=E(Y,Z∗(Y))=limβ→∞−1βlog z e−βE(Y,z)(4)The second term in equation2and3is called the log partition function,and can be viewed as apenalty term for low energies.It ensures that the system produces low energy only for input vectorsthat have high probability in the(true)data distribution,and produces higher energies for all otherinput vectors[5].The overall loss is the average of the above over the training set.Regardless of whether only Z∗or the whole distribution over Z is considered,the main difficultywith this framework is that it can be very hard to compute the gradient of the log partition functionin equation2or3with respect to the parameters W.Efficient methods shortcut the computation bydrastically and cleverly reducing the integration domain.For instance,Restricted Boltzmann Ma-chines(RBM)[10]approximate the gradient of the log partition function in equation2by samplingvalues of Y whose energy will be pulled up using an MCMC technique.By running the MCMC fora short time,those samples are chosen in the vicinity of the training samples,thereby ensuring thatthe energy surface forms a ravine around the manifold of the training samples.This is the basis ofthe Contrastive Divergence method[10].The role of the log partition function is merely to ensure that the energy surface is lower aroundtraining samples than anywhere else.The method proposed here eliminates the log partition functionfrom the loss,and replaces it by a term that limits the volume of the input space over which the energysurface can take a low value.This is performed by adding a penalty term on the code rather than onthe input.While this class of methods does not directly maximize the likelihood of the data,it can beseen as a crude approximation of it.To understand the method,wefirst note that if for each vectorY,there exists a corresponding optimal code Z∗(Y)that makes the reconstruction error(or energy) F∞(Y)zero(or near zero),the model can perfectly reconstruct any input vector.This makes the energy surfaceflat and indiscriminate.On the other hand,if Z can only take a small number ofdifferent values(low entropy code),then the energy F∞(Y)can only be low in a limited number of places(the Y’s that are reconstructed from this small number of Z values),and the energy cannot beflat.More generally,a convenient method through whichflat energy surfaces can be avoided is to limitthe maximum information content of the code.Hence,minimizing the energy F∞(Y)together with the information content of the code is a good substitute for minimizing the log partition function.A popular way to minimize the information content in the code is to make the code sparse or low-dimensional[5].This technique is used in a number of unsupervised learning methods,including PCA,auto-encoders neural network,and sparse coding methods[6,3,8,9].In sparse methods, the code is forced to have only a few non-zero units while most code units are zero most of the time.Sparse-overcomplete representations have a number of theoretical and practical advantages, as demonstrated in a number of recent studies[6,8,3].In particular,they have good robustness to noise,and provide a good tiling of the joint space of location and frequency.In addition,they are advantageous for classifiers because classification is more likely to be easier in higher dimensional spaces.This may explain why biology seems to like sparse representations[11].In our context,the main advantage of sparsity constraints is to allow us to replace a marginalization by a minimization, and to free ourselves from the need to minimize the log partition function explicitly.In this paper we propose a new unsupervised learning algorithm called Sparse Encoding Symmetric Machine(SESM),which is based on the encoder-decoder paradigm,and which is able to produce sparse overcomplete representations efficiently without any need forfilter normalization[8,12]or code saturation[3].As described in more details in sec.2and3,we consider a loss function which is a weighted sum of the reconstruction error and a sparsity penalty,as in many other unsupervised learning algorithms[13,14,8].Encoder and decoder are constrained to be symmetric,and share a set of linearfilters.Although we only consider linearfilters in this paper,the method allows the use of any differentiable function for encoder and decoder.We propose an iterative on-line learning algorithm which is closely related to those proposed by Olshausen and Field[8]and by us previously[3].Thefirst step computes the optimal code by minimizing the energy for the given input.The second step updates the parameters of the machine so as to minimize the energy.In sec.4,we compare SESM with RBM and PCA.Following[15],we evaluate these methods by measuring the reconstruction error for a given entropy of the code.In another set of experiments, we train a classifier on the features extracted by the various methods,and measure the classification error on the MNIST dataset of handwritten numerals.Interestingly,the machine achieving the best recognition performance is the one with the best trade-off between RMSE and entropy.In sec.5,we compare thefilters learned by SESM and RBM for handwritten numerals and natural image patches. In sec.5.1.1,we describe a simple way to produce a deep belief net by stacking multiple levels of SESM modules.The representational power of this hierarchical non-linear feature extraction is demonstrated through the unsupervised discovery of the numeral class labels in the high-level code. 2ArchitectureIn this section we describe a Sparse Encoding Symmetric Machine(SESM)having a set of linearfil-ters in both encoder and decoder.However,everything can be easily extended to any other choice of parameterized functions as long as these are differentiable and maintain symmetry between encoder and decoder.Let us denote with Y the input defined in R N,and with Z the code defined in R M, where M is in general greater than N(for overcomplete representations).Let thefilters in encoder and decoder be the columns of matrix W∈R N×M,and let the biases in the encoder and decoder be denoted by b enc∈R M and b dec∈R N,respectively.Then,encoder and decoder compute:f enc(Y)=W T Y+b enc,f dec(Z)=W l(Z)+b dec(5) where the function l is a point-wise logistic non-linearity of the form:l(x)=1/(1+exp(−gx)),(6) with gfixed gain.The system is characterized by an energy measuring the compatibility between pairs of input Y and latent code Z,E(Y,Z)[16].The lower the energy,the more compatible(or likely)is the pair.We define the energy as:E(Y,Z)=αe Z−f enc(Y) 22+ Y−f dec(Z) 22(7) During training we minimize the following loss:L(W,Y)=E(Y,Z)+αs h(Z)+αr W 1=αe Z−f enc(Y) 22+ Y−f dec(Z) 22+αs h(Z)+αr W 1(8) Thefirst term tries to make the output of the encoder as similar as possible to the code Z.The second term is the mean-squared error between the input Y and the reconstruction provided by the decoder.The third term ensures the sparsity of the code by penalizing non zero values of code units;this term acts independently on each code unit and it is defined as h(Z)= M i=1log(1+l2(z i)),(correspond-ing to a factorized Student-t prior distribution on the non linearly transformed code units[8]through the logistic of equation6).The last term is an L1regularization on thefilters to suppress noise and favor more localizedfilters.The loss formulated in equation8combines terms that characterize also other methods.For instance,thefirst two terms appear in our previous model[3],but in that work,the weights of encoder and decoder were not tied and the parameters in the logistic were up-dated using running averages.The second and third terms are present in the“decoder-only”model proposed in[8].The third term was used in the“encoder-only”model of[7].Besides the already-mentioned advantages of using an encoder-decoder architecture,we point out another good feature of this algorithm due to its symmetry.A common idiosyncrasy for sparse-overcomplete methods using both a reconstruction and a sparsity penalty in the objective function(second and third term in equation8),is the need to normalize the basis functions in the decoder during learning[8,12]with somewhat ad-hoc technique,otherwise some of the basis functions collapse to zero,and some blow up to infinity.Because of the sparsity penalty and the linear reconstruction,code units become tiny and are compensated by thefilters in the decoder that grow without bound.Even though the overall loss decreases,training is unsuccessful.Unfortunately,simply normalizing thefilters makes less clear which objective function is minimized.Some authors have proposed quite expensive meth-ods to solve this issue:by making better approximations of the posterior distribution[15],or by using sampling techniques[17].In this work,we propose to enforce symmetry between encoder and decoder(through weight sharing)so as to have automatic scaling offilters.Their norm cannot possibly be large because code units,produced by the encoder weights,would have large values as well,producing bad reconstructions and increasing the energy(the second term in equation7and 8).3Learning AlgorithmLearning consists of determining the parameters in W,b enc,and b dec that minimize the loss in equation8.As indicated in the introduction,the energy augmented with the sparsity constraint is minimized with respect to the code tofind the optimal code.No marginalization over code distribu-tion is performed.This is akin to using the loss function in equation3.However,the log partition function term is dropped.Instead,we rely on the code sparsity constraints to ensure that the energy surface is notflat.Since the second term in equation8couples both Z and W and b dec,it is not straightforward to minimize this energy with respect to both.On the other hand,once Z is given,the minimization with respect to W is a convex quadratic problem.Vice versa,if the parameters W arefixed,the optimal code Z∗that minimizes L can be computed easily through gradient descent.This suggests the following iterative on-line coordinate descent learning algorithm:1.for a given sample Y and parameter setting,minimize the loss in equation8with respect to Z by gradient descent to obtain the optimal code Z∗2.clamping both the input Y and the optimal code Z∗found at the previous step,do one step of gradient descent to update the parameters.Unlike other methods[8,12],no column normalization of W is required.Also,all the parameters are updated by gradient descent unlike in our previous work[3]where some parameters are updated using a moving average.After training,the system converges to a state where the decoder produces good reconstructions from a sparse code,and the optimal code is predicted by a simple feed-forward propagation through the encoder.4Comparative Coding AnalysisIn the following sections,we mainly compare SESM with RBM in order to better understand their differences in terms of maximum likelihood approximation,and in terms of coding efficiency and robustness.RBM As explained in the introduction,RBMs minimize an approximation of the negative log likelihood of the data under the model.An RBM is a binary stochastic symmetric machine definedby an energy function of the form:E (Y,Z )=−Z T W T Y −b T enc Z −b T dec Y .Although this is not obvious at first glance,this energy can be seen as a special case of the encoder-decoder architecturethat pertains to binary data vectors and code vectors [5].Training an RBM minimizes an approxima-tion of the negative log likelihood loss function 2,averaged over the training set,through a gradient descent procedure.Instead of estimating the gradient of the log partition function,RBM training uses contrastive divergence [10],which takes random samples drawn over a limited region Ωaround the training samples.The loss becomes:L (W,Y )=−1βlog z e −βE (Y,z )+1βlog y ∈Ωz e −βE (y,z )(9)Because of the RBM architecture,given a Y ,the components of Z are independent,hence the sum over configurations of Z can be done independently for each component of Z .Sampling y in the neighborhood Ωis performed with one,or a few alternated MCMC steps over Y ,and Z .This means that only the energy of points around training samples is pulled up.Hence,the likelihood function takes the right shape around the training samples,but not necessarily everywhere.However,the code vector in an RBM is binary and noisy,and one may wonder whether this does not have the effect of surreptitiously limiting the information content of the code,thereby further minimizing the log partition function as a bonus.SESM RBM and SESM have almost the same architecture because they both have a symmetric encoder and decoder,and a logistic non-linearity on the top of the encoder.However,RBM is trained using (approximate)maximum likelihood,while SESM is trained by simply minimizing the average energy F ∞(Y )of equation 4with an additional code sparsity term.SESM relies on the sparsity term to prevent flat energy surfaces,while RBM relies on an explicit contrastive term in the loss,an approximation of the log partition function.Also,the coding strategy is very different because code units are “noisy”and binary in RBM,while they are quasi-binary and sparse in SESM.Features extracted by SESM look like object parts (see next section),while features produced by RBM lack an intuitive interpretation because they aim at modeling the input distribution and they are used in a distributed representation.4.1Experimental ComparisonIn the first experiment we have trained SESM,RBM,and PCA on the first 20000digits in the MNIST training dataset [18]in order to produce codes with 200components.Similarly to [15]we have collected test image codes after the logistic non linearity (except for PCA which is linear),and we have measured the root mean square error (RMSE)and the entropy.SESM was run for different values of the sparsity coefficient αs in equation 8(while all other parameters are left unchanged,seenext section for details).The RMSE is defined as 1σ 1P N Y −f dec (¯Z ) 22,where ¯Z is the uniformlyquantized code produced by the encoder,P is the number of test samples,and σis the estimated variance of units in the input Y .Assuming to encode the (quantized)code units independently and with the same distribution,the lower bound on the number of bits required to encode each of them is given by:H c.u.=− Q i =1c i P M log 2c i P M ,where c i is the number of counts in the i -th bin,and Q is the number of quantization levels.The number of bits per pixel is then equal to:M N H c.u..Unlike in [15,12],the reconstruction is done taking the quantized code in order to measure the robustness of the code to the quantization noise.As shown in fig.1-C,RBM is very robust to noise in the code because it is trained by sampling.The opposite is true for PCA which achieves the lowest RMSE when using high precision codes,but the highest RMSE when using a coarse quantization.SESM seems to give the best trade-off between RMSE and entropy.Fig.1-D/F compare the features learned by SESM and RBM.Despite the similarities in the architecture,filters look quite different in general,revealing two different coding strategies:distributed for RBM,and sparse for SESM.In the second experiment,we have compared these methods by means of a supervised task in order to assess which method produces the most discriminative representation.Since we have available also the labels in the MNIST,we have used the codes (produced by these machines trained unsupervised)as input to the same linear classifier.This is run for 100epochs to minimize the squared error between outputs and targets,and has a mild ridge regularizer.Fig.1-A/B show the result of these experiments in addition to what can be achieved by a linear classifier trained on the raw pixel data.Note that:1)training on features instead of raw data improves the recognition (except for PCA(A)(B)(C)(D)(E)(F)(G)(H)Figure1:(A)-(B)Error rate on MNIST training(with10,100and1000samples per class)andtest set produced by a linear classifier trained on the codes produced by SESM,RBM,and PCA.The entropy and RMSE refers to a quantization into256bins.The comparison has been extendedalso to the same classifier trained on raw pixel data(showing the advantage of extracting features).The error bars refer to1std.dev.of the error rate for10random choices of training datasets(same splits for all methods).The parameterαs in eq.8takes values:1,0.5,0.2,0.1,0.05.(C)Comparison between SESM,RBM,and PCA when quantizing the code into5and256bins.(D)Random selection from the200linearfilters that were learned by SESM(αs=0.2).(E)Some pairs of original and reconstructed digit from the code produced by the encoder in SESM(feed-forwardpropagation through encoder and decoder).(F)Random selection offilters learned by RBM.(G)Back-projection in image space of thefilters learned in the second stage of the hierarchical featureextractor.The second stage was trained on the non linearly transformed codes produced by thefirststage machine.The back-projection has been performed by using a1-of-10code in the second stagemachine,and propagating this through the second stage decoder andfirst stage decoder.Thefiltersat the second stage discover the class-prototypes(manually ordered for visual convenience)eventhough no class label was ever used during training.(H)Feature extraction from8x8natural imagepatches:somefilters that were learned.when the number of training samples is small),2)RBM performance is competitive overall when few training samples are available,3)the best performance is achieved by SESM for a sparsity level which trades off RMSE for entropy(overall for large training sets),4)the method with the best RMSE is not the one with lowest error rate,5)compared to a SESM having the same error rate RBM is more costly in terms of entropy.5ExperimentsThis section describes some experiments we have done with SESM.The coefficientαe in equation8 has always been set equal to1,and the gain in the logistic have been set equal to7in order to achieve a quasi-binary coding.The parameterαs has to be set by cross-validation to a value which depends on the level of sparsity required by the specific application.5.1Handwritten DigitsFig.1-B/E shows the result of training a SESM withαs is equal to0.2.Training was performed on 20000digits scaled between0and1,by settingαr to0.0004(in equation8)with a learning rate equal to0.025(decreased exponentially).Filters detect the strokes that can be combined to form a digit.Even if the code unit activation has a very sparse distribution,reconstructions are very good (no minimization in code space was performed).5.1.1Hierarchical FeaturesA hierarchical feature extractor can be trained layer-by-layer similarly to what has been proposed in[19,1]for training deep belief nets(DBNs).We have trained a second(higher)stage machine on the non linearly transformed codes produced by thefirst(lower)stage machine described in the previous example.We used just20000codes to produce a higher level representation with just10 components.Since we aimed tofind a1-of-10code we increased the sparsity level(in the second stage machine)by settingαs to1.Despite the completely unsupervised training procedure,the feature detectors in the second stage machine look like digit prototypes as can be seen infig.1-G. The hierarchical unsupervised feature extractor is able to capture higher order correlations among the input pixel intensities,and to discover the highly non-linear mapping from raw pixel data to the class labels.Changing the random initialization can sometimes lead to the discover of two different shapes of“9”without a unit encoding the“4”,for instance.Nevertheless,results are qualitatively very similar to this one.For comparison,when training a DBN,prototypes are not recovered because the learned code is distributed among units.5.2Natural Image PatchesA SESM with about the same set up was trained on a dataset of300008x8natural image patches randomly extracted from the Berkeley segmentation dataset[20].The input images were simply scaled down to the range[0,1.7],without even subtracting the mean.We have considered a2 times overcomplete code with128units.The parametersαs,αr and the learning rate were set to 0.4,0.025,and0.001respectively.Somefilters are localized Gabor-like edge detectors in different positions and orientations,other are more global,and some encode the mean value(seefig.1-H). 6ConclusionsThere are two strategies to train unsupervised machines:1)having a contrastive term in the loss function minimized during training,2)constraining the internal representation in such a way that training samples can be better reconstructed than other points in input space.We have shown that RBM,which falls in thefirst class of methods,is particularly robust to channel noise,it achieves very low RMSE and good recognition rate.We have also proposed a novel symmetric sparse encoding method following the second strategy which:is particularly efficient to train,has fast inference, works without requiring any withening or even mean removal from the input,can provide the best recognition performance and trade-off between entropy/RMSE,and can be easily extended to a hierarchy discovering hidden structure in the data.We have proposed an evaluation protocol to compare different machines which is based on RMSE,entropy and,eventually,error rate when alsolabels are available.Interestingly,the machine achieving the best performance in classification is the one with the best trade-off between reconstruction error and entropy.A future avenue of work is to understand the reasons for this“coincidence”,and deeper connections between these two strategies. AcknowledgmentsWe wish to thank Jonathan Goodman,Geoffrey Hinton,and Yoshua Bengio for helpful discussions.This work was supported in part by NSF grant IIS-0535166“toward category-level object recognition”,NSF ITR-0325463“new directions in predictive learning”,and ONR grant N00014-07-1-0535“integration and representation of high dimensional data”.References[1]G.E.Hinton and R.R Salakhutdinov.Reducing the dimensionality of data with neural networks.Science,313(5786):504–507,2006.[2]Y.Bengio,mblin,D.Popovici,and rochelle.Greedy layer-wise training of deep networks.InNIPS,2006.[3]M.Ranzato,C.Poultney,S.Chopra,and Y.LeCun.Efficient learning of sparse representations with anenergy-based model.In NIPS2006.MIT Press,2006.[4]Y.Bengio and Y.LeCun.Scaling learning algorithms towars ai.In D.DeCoste L.Bottou,O.Chapelleand J.Weston,editors,Large-Scale Kernel Machines.MIT Press,2007.[5]M.Ranzato,Y.Boureau,S.Chopra,and Y.LeCun.A unified energy-based framework for unsupervisedlearning.In Proc.Conference on AI and Statistics(AI-Stats),2007.[6] E.Doi,D.C.Balcan,and M.S.Lewicki.A theoretical analysis of robust coding over noisy overcompletechannels.In NIPS.MIT Press,2006.[7]Y.W.Teh,M.Welling,S.Osindero,and G.E.Hinton.Energy-based models for sparse overcompleterepresentations.Journal of Machine Learning Research,4:1235–1260,2003.[8] B.A.Olshausen and D.J.Field.Sparse coding with an overcomplete basis set:a strategy employed byv1?Vision Research,37:3311–3325,1997.[9] D.D.Lee and H.S.Seung.Learning the parts of objects by non-negative matrix factorization.Nature,401:788–791,1999.[10]G.E.Hinton.Training products of experts by minimizing contrastive divergence.Neural Computation,14:1771–1800,2002.[11]P.Lennie.The cost of cortical computation.Current biology,13:493–497,2003.[12]J.F.Murray and K.Kreutz-Delgado.Learning sparse overcomplete codes for images.The Journal ofVLSI Signal Processing,45:97–110,2008.[13]G.E.Hinton and R.S.Zemel.Autoencoders,minimum description length,and helmholtz free energy.InNIPS,1994.[14]G.E.Hinton,P.Dayan,and M.Revow.Modeling the manifolds of images of handwritten digits.IEEETransactions on Neural Networks,8:65–74,1997.[15]M.S.Lewicki and T.J.Sejnowski.Learning overcomplete representations.Neural Computation,12:337–365,2000.[16]Y.LeCun,S.Chopra,R.Hadsell,M.Ranzato,and F.J.Huang.A tutorial on energy-based learning.InG.Bakir and al..,editors,Predicting Structured Data.MIT Press,2006.[17]P.Sallee and B.A.Olshausen.Learning sparse multiscale image representations.In NIPS.MIT Press,2002.[18]/exdb/mnist/.[19]G.E.Hinton,S.Osindero,and Y.-W.Teh.A fast learning algorithm for deep belief nets.Neural Compu-tation,18:1527–1554,2006.[20]/projects/vision/grouping/segbench/.。