Sparse Representations in Unions of Bases
Some Vocabulary That May Be Useful in English Academic Writing

When writing papers in English I am constantly defeated by my own poor vocabulary, so I plan to record here some words, phrases, and usages that I had not seen before and that I came across while reading papers. All of them are easy to look up in a dictionary, but their real meaning only comes through when they are placed back in the original paper. After all, an invisible gulf of thought always separates an original text from its translation.

Adjectives

1. vanilla: adj. ordinary; not special in any way.
2. crucial: adj. extremely important; decisive.
3. parsimonious: adj. stingy; sparing; (of a model or representation) economical, compact. e.g. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity.
4. diverse: adj. different; varied; of many kinds.
5. intriguing: adj. very interesting; fascinating; mysterious. (intrigue: v. to arouse someone's interest or curiosity; to plot secretly against someone.) e.g. The results of this paper carry several intriguing implications.
6. intimate: adj. close; closely connected. v. to hint at; to imply. e.g. The above problems are intimately linked to machine learning on graphs.
7. akin: adj. similar; of the same kind. e.g. Akin to GNN, in LOCAL a graph plays a double role: ...
8. abundant: adj. plentiful; ample.
9. prone: adj. inclined to (something undesirable); susceptible to; lying face down. e.g. It is thus prone to oversmoothing when convolutions are applied repeatedly.
10. concrete: adj. made of concrete; specific, definite (rather than imagined or guessed); tangible; real. e.g. ... as a concrete example ... / More concretely, HGCN applies the Euclidean non-linear activation in...
11. plausible: adj. reasonable; believable; (of a person) glib, smooth-talking. e.g. ... this interpretation may be a plausible explanation of the success of the recently introduced methods.
12. ubiquitous: adj. seemingly present everywhere; extremely common. e.g. While these higher-order interactions are ubiquitous, an evaluation of the basic properties and organizational principles in such systems is missing.
13. disparate: adj. made up of very different elements; fundamentally different; not comparable. e.g. These seemingly disparate types of data have something in common: ...
14. profound: adj. great; deep, far-reaching; (of knowledge or understanding) deep; abstruse. e.g. This has profound consequences for network models of relational data — a cornerstone in the interdisciplinary study of complex systems.
15. blurry: adj. unclear; indistinct. e.g. When applying these estimators to solve (2), the line between the critic and the encoders $g_1, g_2$ can be blurry.
16. amenable: adj. compliant; tractable; capable of being treated in a particular way. e.g. Ou et al. utilize sparse generalized SVD to generate a graph embedding, HOPE, from a similarity matrix amenable to decomposition into two sparse proximity matrices.
17. elaborate: adj. complicated; detailed; carefully crafted. v. to expand on; to describe or work out in detail. e.g. Topic Modeling for Graphs also requires elaborate effort, as graphs are relational while documents are independent samples.
18. pivotal: adj. crucial; central. e.g. To ensure the stabilities of complex systems is of pivotal significance toward reliable and better service providing.
19. eminent: adj. distinguished; famous; illustrious; outstanding. e.g. To circumvent those defects, theoretical studies eminently represented by percolation theories appeared.
20. indispensable: adj. essential; absolutely necessary. n. an indispensable person or thing. e.g. However, little attention is paid to multipartite networks, which are an indispensable part of complex networks.
21. post-hoc: adj. after the fact. e.g. Post-hoc explainability typically considers the question "Why the GNN predictor made certain prediction?".
22. prevalent: adj. widespread; common. e.g. A prevalent solution is building an explainer model to conduct feature attribution.
23. salient: adj. most important; prominent; conspicuous. n. a salient angle; a projecting part of a fortification or front line. e.g. It decomposes the prediction into the contributions of the input features, which redistributes the probability of features according to their importance and sample the salient features as an explanatory subgraph.
24. rigorous: adj. strict and careful; thorough; exacting. e.g. To inspect the OOD effect rigorously, we take a causal look at the evaluation process with a Structural Causal Model.
25. substantial: adj. considerable in amount or value; important; large and solidly built. substantially: adv. considerably; largely; essentially.
26. cogent: adj. convincing; persuasive. e.g. The explanatory subgraph $G_s$ emphasizes tokens like "weak" and relations like "n't→funny", which is cogent according to human knowledge.
27. succinct: adj. concise; brief and clear. succinctly: adv. in short; concisely.
28. concrete: adj. see 10. concretely: adv. specifically; in concrete terms; tangibly.
29. predominant: adj. main; dominant; most noticeable; prevailing.

Verbs

1. mitigate: v. to alleviate; to lessen (opposite: enforce). e.g. In this work, we focus on mitigating this problem for a certain class of symbolic data.
2. corroborate: v. [VN] [often passive] (formal) to confirm; to give support to. e.g. This is corroborated by our experiments on real-world graphs.
3. endeavor: n./v. effort; to strive; to attempt. e.g. It encourages us to continue the endeavor in applying principles mathematics and theory in successful deployment of deep learning.
4. augment: v. to increase; to enlarge; to extend. n. an addition; a supplement. e.g. We also augment the graph with geographic information (longitude, latitude and altitude), and GDP of the country where the airport belongs to.
5. constitute: v. to be considered as, to count as; to make up, to form; to establish formally.
6. abide: v. to accept or comply with (a rule, decision, or advice); to stay, to remain. e.g. Training a graph classifier entails identifying what constitutes a class, i.e., finding properties shared by graphs in one class but not the other, and then deciding whether new graphs abide to said learned properties.
7. entail: v. to involve; to require; to make necessary; to involve something that cannot be avoided. e.g. Due to the recursive definition of the Chebyshev polynomials, the computation of the filter $g_\alpha(\Delta)f$ entails applying the Laplacian $r$ times, resulting in $O(rn)$ operations. e.g. The k-cycle detection problem entails determining if G contains a k-cycle.
8. encompass: v. to include, to cover (a large number of things); to surround, to encircle. e.g. This model is chosen as it is sufficiently general to encompass several state-of-the-art networks.
9. reveal: v. to show; to disclose; to make known.
10. bestow: v. to confer; to grant; to present. e.g. Aiming to bestow GCNs with theoretical guarantees, one promising research direction is to study graph scattering transforms (GSTs).
11. alleviate: v. to ease; to relieve.
12. investigate: v. to look into (a matter); to examine; to study. e.g. The sensitivity of pGST to random and localized noise is also investigated.
13. fuse: v. to merge, to blend, to combine; to melt; (of a fuse) to blow. e.g. We then fuse the topological embeddings with the initial node features into the initial query representations using a query network $f_q$ implemented as a two-layer feed-forward neural network.
14. magnify: v. to enlarge; to intensify; to exaggerate (importance or severity). e.g. ..., adding more layers also leads to more parameters which magnify the potential of overfitting.
15. circumvent: v. to get around, to evade; to bypass. e.g. To circumvent the issue and fulfill both goals simultaneously, we can add a negative term...
16. excel: v. to be very good at; to stand out; to surpass one's usual level. e.g. Nevertheless, these methods have been repeatedly shown to excel in practice.
17. exploit: v. to make use of (for one's own benefit); to exploit (people); to utilize; to bring into play. e.g. In time series and high-dimensional modeling, approaches that use next step prediction exploit the local smoothness of the signal.
18. regulate: v. to control or govern (with rules); to adjust (speed, pressure, temperature, etc.). e.g. ... where $b > 0$ is a parameter regulating the probability of this event.
19. necessitate: v. to make necessary. e.g. Combinatorial models reproduce many-body interactions, which appear in many systems and necessitate higher-order models that capture information beyond pairwise interactions.
20. portray: v. to depict, to describe; to represent as; to give an impression of; to play (a role). e.g. Considering pairwise interactions, a standard network model would portray the link topology of the underlying system as shown in Fig. 2b.
21. warrant: v. to make necessary; to justify. n. a warrant; an authorization; a voucher (for payment or services); justification, grounds (for doing something). e.g. Besides statistical methods that can be used to detect correlations that warrant higher-order models, ...
22. justify: v. to prove to be right or reasonable; to give an explanation for; to defend; (typesetting) to justify lines. e.g. ..., they also come with the assumption of transitive, Markovian paths, which is not justified in many real systems.
23. hinder: v. to impede; to obstruct (opposite: foster: v. to promote; to encourage; to nurture; to care for another's child for a period). e.g. The eigenvalues and eigenvectors of these matrix operators capture how the topology of a system influences the efficiency of diffusion and propagation processes, whether it enforces or mitigates the stability of dynamical systems, or if it hinders or fosters collective dynamics.
24. instantiate: v. to create an instance of; to illustrate with a concrete example. e.g. To learn the representation we instantiate (2) and split each input MNIST image into two parts ...
25. favor: v. to approve of; to like, to prefer; to be favorable to, to facilitate. n. liking, affection, goodwill, approval; partiality, favoritism; a kindness, a favor.
26. attenuate: v. to weaken; to reduce the effect of. e.g. It therefore seems that the bounds we consider favor hard-to-invert encoders, which heavily attenuate part of the noise, over well conditioned encoders.
27. elucidate: v. to clarify; to explain. e.g. Secondly, it elucidates the importance of appropriately choosing the negative samples, which is indeed a critical component in deep metric learning based on triplet losses.
28. violate: v. to break (a law, an agreement, etc.); to infringe (privacy); to disturb; to desecrate. e.g. Negative samples are obtained by patches from different images as well as patches from the same image, violating the independence assumption.
29. compel: v. to force; to oblige; to evoke (a response).
30. gauge: v. to judge or assess (especially feelings or attitudes); to measure or estimate (with an instrument). n. a measuring instrument; a gauge; width; thickness; (of a gun barrel) bore. e.g. Yet this hyperparameter-tuned approach raises a cubic worst-case space complexity and compels the user to traverse several feature sets and gauge the one that attains the best performance in the downstream task.
31. depict: v. to depict, to portray; to describe. e.g. As they depict different aspects of a node, it would take elaborate designs of graph convolutions such that each set of features would act as a complement to the other.
32. sketch: n. a sketch; an outline; a comedy skit; a brief summary. v. to sketch; to outline briefly. e.g. Next we sketch how to apply these insights to learning topic models.
33. underscore: v. to underline; to emphasize. n. an underscore. e.g. Moreover, the walk-topic distributions generated by Graph Anchor LDA are indeed sharper than those by ordinary LDA, underscoring the need for selecting anchors.
34. disclose: v. to reveal; to expose. e.g. Another drawback lies in their unexplainable nature, i.e., they cannot disclose the sciences beneath network dynamics.
35. coincide: v. to happen at the same time; to be identical; to agree, to match; to intersect; to overlap, to be in the same position. e.g. The simulation results coincide quite well with the theoretical results.
36. inspect: v. to examine; to look closely at sth/sb, especially to check that everything is as it should be.

Nouns

1. capacity: n. capacity, volume; ability to understand or to get things done; a position, a role. e.g. This paper studies theoretically the computational capacity limits of graph neural networks (GNN) falling within the message-passing framework of Gilmer et al. (2017).
2. implication: n. a possible effect or consequence; an implied meaning; involvement. e.g. Section 4 analyses the implications of restricting the depth $d$ and width $w$ of GNN that do not use a readout function.
3. trade-off: n. a balance between two things that are both needed but are in conflict. e.g. This reveals a direct trade-off between the depth and width of a graph neural network.
4. cornerstone: n. a cornerstone; the most important part; a foundation.
5. umbrella: n. an umbrella; an umbrella term, a covering whole; protection, shelter. e.g. Community detection is an umbrella term for a large number of algorithms that group nodes into distinct modules to simplify and highlight essential structures in the network topology.
6. folklore: n. folklore; folk tradition. e.g. It is folklore knowledge that maximizing MI does not necessarily lead to useful representations.
7. impediment: n. a hindrance, an obstacle; a speech impediment. e.g. While a recent approach overcomes this impediment, it results in poor quality in prediction tasks due to its linear nature.
8. obstacle: n. an obstacle; a barrier; a stumbling block. e.g. However, several major obstacles stand in our path towards leveraging topic modeling of structural patterns to enhance GCNs.
9. vicinity: n. the surrounding area; the neighborhood. e.g. The traits with which they engage are those that are performed in their vicinity.
10. demerit: n. a fault, a shortcoming; (in school) a demerit point. e.g. However, their principal demerit is that their implementations are time-consuming when the studied network is large in size.

Prepositions, Adverbs, and Conjunctions

1. notwithstanding: prep. despite; in spite of. adv. nevertheless. e.g. Notwithstanding this fundamental problem, the negative sampling strategy is often treated as a design choice.
2. albeit: conj. although. e.g. Such methods rely on an implicit, albeit rigid, notion of node neighborhood; yet this one-size-fits-all approach cannot grapple with the diversity of real-world networks and applications.
3. hitherto: adv. until now; up to a particular time. e.g. Hitherto, tremendous endeavors have been made by researchers to gauge the robustness of complex networks in face of perturbations.

Phrases

1. in a nutshell: in short; to sum up; in a word. e.g. In a nutshell, GNN are shown to be universal if four strong conditions are met: ...
2. counter-intuitively: contrary to intuition.
3. on-the-fly: dynamically; while running.
4. shed light on/into: to reveal; to clarify; to explain. e.g. These contemporary works shed light into the stability and generalization capabilities of GCNs. / Discovering roles and communities in networks can shed light on numerous graph mining tasks such as ...
5. boil down to: to come down to; to reduce to. e.g. These aforementioned works usually boil down to a general classification task, where the model is learnt on a training set and selected by checking a validation set.
6. for the sake of: for the purpose of. e.g. The local structures anchored around each node as well as the attributes of nodes therein are jointly encoded with graph convolution for the sake of high-level feature extraction.
7. date back to: to go back to; to be traceable to. e.g. The usual problem setup dates back at least to Becker and Hinton (1992) and can conceptually be described as follows: ...
8. carry out: to perform; to execute. e.g. We carry out extensive ablation studies and sensitivity analysis to show the effectiveness of the proposed functional time encoding and TGAT-layer.
9. lay beyond the reach of: to be out of the reach of; beyond the ability of. e.g. They provide us with information on higher-order dependencies between the components of a system, which lay beyond the reach of models that exclusively capture pairwise links.
10. account for: to make up (a share or proportion); to cause or explain (a fact or situation); to be responsible for (an action or policy); to budget for (an amount of money). e.g. Multilayer models account for the fact that many real complex systems exhibit multiple types of interactions.
11. along with: in addition to; together with. e.g. Along with giving us the ability to reason about topological features including community structures or node centralities, network science enables us to understand how the topology of a system influences dynamical processes, and thus its function.
12. to this end: for this purpose; with this aim. e.g. To this end, we consider a simple setup of learning a representation of the top half of MNIST handwritten digit images.
13. unless stated otherwise: unless otherwise specified. e.g. Unless stated otherwise, we use a bilinear critic $f(x, y) = x^T W y$, set the batch size to $128$ and the learning rate to $10^{-4}$.
14. as a reference point: for reference; as a baseline. e.g. As a reference point, the linear classification accuracy from pixels drops to about 84% due to the added noise.
15. through the lens of: from the perspective of. e.g. There are (at least) two immediate benefits of viewing recent representation learning methods based on MI estimators through the lens of metric learning.
16. in accordance with: consistent with; in agreement with. e.g. The metric learning view seems hence in better accordance with the observations from Section 3.2 than the MI view. / It can be shown that the anchors selected by our Graph Anchor LDA are not only indicative of "topics" but are also in accordance with the actual graph structures.
17. be akin to: to be similar to. e.g. Thus, our learning model is akin to complex contagion dynamics.
18. to name a few: to give just a few examples. e.g. Multitasking, multidisciplinary work and multi-authored works, to name a few, are ingrained in the fabric of science culture and certainly multi-multi is expected in order to succeed and move up the scientific ranks.
19. a handful of: a small number of. e.g. A handful of empirical work has investigated the robustness of complex networks at the community level.
20. wreak havoc: to cause great damage or chaos. e.g. Failures on one network could elicit failures on its coupled networks, i.e., networks with which the focal network interacts, and eventually those failures would wreak havoc on the entire network.
21. apart from: except for; in addition to. e.g. We further posit that apart from node $a$ node $b$ has $k$ neighboring nodes.
On Recovery of Sparse Signals Via ℓ1 Minimization
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009

On Recovery of Sparse Signals Via ℓ1 Minimization

T. Tony Cai, Guangwu Xu, and Jun Zhang, Senior Member, IEEE

Abstract—This paper considers constrained ℓ1 minimization methods in a unified framework for the recovery of high-dimensional sparse signals in three settings: noiseless, bounded error, and Gaussian noise. Both ℓ1 minimization with an ℓ∞ constraint (Dantzig selector) and ℓ1 minimization under an ℓ2 constraint are considered. The results of this paper improve the existing results in the literature by weakening the conditions and tightening the error bounds. The improvement on the conditions shows that signals with larger support can be recovered accurately. In particular, our results illustrate the relationship between ℓ1 minimization with an ℓ2 constraint and ℓ1 minimization with an ℓ∞ constraint. This paper also establishes connections between the restricted isometry property and the mutual incoherence property. Some results of Candes, Romberg, and Tao (2006), Candes and Tao (2007), and Donoho, Elad, and Temlyakov (2006) are extended.

Index Terms—Dantzig selector, ℓ1 minimization, restricted isometry property, sparse recovery, sparsity.

I. INTRODUCTION

The problem of recovering a high-dimensional sparse signal based on a small number of measurements, possibly corrupted by noise, has attracted much recent attention. This problem arises in many different settings, including compressed sensing, constructive approximation, model selection in linear regression, and inverse problems. Suppose we have observations of the form

$$y = F\beta + z \tag{I.1}$$

where the matrix $F$ is given and $z$ is a vector of measurement errors. The goal is to reconstruct the unknown vector $\beta$. Depending on the setting, the error vector $z$ can either be zero (in the noiseless case), bounded, or Gaussian. It is now well understood that ℓ1 minimization provides an effective way for reconstructing a sparse signal in all three settings. See, for example, Fuchs [13], Candes and Tao [5], [6], Candes, Romberg, and Tao [4], Tropp [18], and Donoho, Elad, and Temlyakov [10].

A special case of particular interest is when no noise is present in (I.1) and $z = 0$. This is then an underdetermined system of linear equations with more variables than the number of equations. It is clear that the problem is ill-posed and there are generally infinitely many solutions. However, in many applications the vector $\beta$ is known to be sparse or nearly sparse in the sense that it contains only a small number of nonzero entries. This sparsity assumption fundamentally changes the problem. Although there are infinitely many general solutions, under regularity conditions there is a unique sparse solution. Indeed, in many cases the unique sparse solution can be found exactly through ℓ1 minimization

$$(\text{Exact}) \qquad \min_{\gamma}\ \|\gamma\|_1 \quad \text{subject to} \quad F\gamma = y. \tag{I.2}$$

This minimization problem has been studied, for example, in Fuchs [13], Candes and Tao [5], and Donoho [8]. Understanding the noiseless case is not only of significant interest in its own right, it also provides deep insight into the problem of reconstructing sparse signals in the noisy case. See, for example, Candes and Tao [5], [6] and Donoho [8], [9].

When noise is present, there are two well-known ℓ1 minimization methods. One is ℓ1 minimization under an ℓ2 constraint on the residuals

$$(\ell_2\text{-Constraint}) \qquad \min_{\gamma}\ \|\gamma\|_1 \quad \text{subject to} \quad \|y - F\gamma\|_2 \le \epsilon. \tag{I.3}$$

Manuscript received May 01, 2008; revised November 13, 2008. Current version published June 24, 2009. The work of T.
Cai was supported in part by the National Science Foundation (NSF) under Grant DMS-0604954; the work of G. Xu was supported in part by the National 973 Project of China (No. 2007CB807903). T. T. Cai is with the Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: tcai@wharton.upenn.edu). G. Xu and J. Zhang are with the Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53211 USA (e-mail: gxu4uwm@; junzhang@uwm.edu). Communicated by J. Romberg, Associate Editor for Signal Processing. Digital Object Identifier 10.1109/TIT.2009.2021377.

Writing in terms of the Lagrangian function of (ℓ2-Constraint), this is closely related to finding the solution to the ℓ1-regularized least squares

$$\min_{\gamma}\ \frac{1}{2}\|y - F\gamma\|_2^2 + \lambda\|\gamma\|_1. \tag{I.4}$$

The latter is often called the Lasso in the statistics literature (Tibshirani [16]). Tropp [18] gave a detailed treatment of the ℓ1-regularized least squares problem. Another method, called the Dantzig selector, was recently proposed by Candes and Tao [6]. The Dantzig selector solves the sparse recovery problem through ℓ1 minimization with a constraint on the correlation between the residuals and the column vectors of $F$,

$$(\mathrm{DS}) \qquad \min_{\gamma}\ \|\gamma\|_1 \quad \text{subject to} \quad \|F^T(y - F\gamma)\|_\infty \le \lambda. \tag{I.5}$$

Candes and Tao [6] showed that the Dantzig selector can be computed by solving a linear program. Candes and Tao [6] also showed that the Dantzig selector mimics the performance of an oracle procedure up to a logarithmic factor.

It is clear that some regularity conditions are needed in order for these problems to be well behaved. Over the last few years, many interesting results for recovering sparse signals have been obtained in the framework of the Restricted Isometry Property (RIP). In their seminal work [5], [6], Candes and Tao considered sparse recovery problems in the RIP framework. They provided beautiful solutions to the problem under some conditions on the so-called restricted isometry constant and restricted orthogonality constant (defined in Section II). These conditions essentially require that every set of columns of $F$ with certain cardinality approximately behaves like an orthonormal system. Several different conditions on these constants have been imposed in various settings, for example in Candes and Tao [5], Candes, Romberg, and Tao [4], Candes and Tao [6], and Candes [3], where $k$ is the sparsity index. A natural question is: can these conditions be weakened in a unified way? Another widely used condition for sparse recovery is the Mutual Incoherence Property (MIP), which requires the pairwise correlations among the column vectors of $F$ to be small. See [10], [11], [13], [14], [18].

In this paper, we consider ℓ1 minimization methods in a single unified framework for sparse recovery in three cases: noiseless, bounded error, and Gaussian noise. Both ℓ1 minimization with an ℓ∞ constraint (DS) and ℓ1 minimization under the ℓ2 constraint (ℓ2-Constraint) are considered. Our results improve on the existing results in [3]–[6] by weakening the conditions and tightening the error bounds. In particular, our results clearly illustrate the relationship between ℓ1 minimization with an ℓ2 constraint and ℓ1 minimization with an ℓ∞ constraint (the Dantzig selector).
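The relationship between the two constrained programs can be seen on a small synthetic example. The sketch below is illustrative only and is not from the paper: it uses the cvxpy modeling package, the generic notation $y = F\beta + z$ for the model (I.1), and arbitrary choices of the constraint levels $\lambda$ and $\epsilon$.

```python
# Toy comparison of the two constrained L1 programs: the Dantzig selector
# (L-infinity constraint on the correlated residuals) and L1 minimization
# under an L2 residual constraint. Illustrative sketch; problem sizes and
# constraint levels are assumptions, not values from the paper.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, k, sigma = 64, 256, 5, 0.05

F = rng.standard_normal((n, p)) / np.sqrt(n)    # columns with roughly unit norm
beta = np.zeros(p)
beta[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
y = F @ beta + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(p))        # Dantzig-selector level (illustrative)
eps = 1.5 * sigma * np.sqrt(n)                  # L2 noise bound (illustrative)

gamma_ds = cp.Variable(p)
ds = cp.Problem(cp.Minimize(cp.norm1(gamma_ds)),
                [cp.norm_inf(F.T @ (y - F @ gamma_ds)) <= lam])
ds.solve()

gamma_l2 = cp.Variable(p)
l2c = cp.Problem(cp.Minimize(cp.norm1(gamma_l2)),
                 [cp.norm(y - F @ gamma_l2, 2) <= eps])
l2c.solve()

for name, g in [("Dantzig selector", gamma_ds.value),
                ("L2-constrained L1", gamma_l2.value)]:
    print(f"{name}: ||error||_2 = {np.linalg.norm(g - beta):.4f}")
```

The scaling $\lambda = 2\sigma\sqrt{2\log p}$ mirrors the usual Dantzig-selector choice; any feasible level works for the illustration, and with comparable constraint levels the two programs typically return similar estimates.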
In addition, we also establish connections between the concepts of RIP and MIP. As an application, we present an improvement to a recent result of Donoho, Elad, and Temlyakov [10]. In all cases, we solve the problems under the weaker condition (I.6) The improvement on the condition shows that for fixed and , signals with larger support can be recovered. Although our main interest is on recovering sparse signals, we state the results in the general setting of reconstructing an arbitrary signal. It is sometimes convenient to impose conditions that involve only the restricted isometry constant . Efforts have been made in this direction in the literature. In [7], the recovery result was . In [3], the weaker established under the condition was used. Similar conditions have condition also been used in the construction of (random) compressed and sensing matrices. For example, conditions were used in [15] and [1], respectively. We shall remark that, our results implies that the weaker conditionsparse recovery problem. We begin the analysis of minimization methods for sparse recovery by considering the exact recovery in the noiseless case in Section III. Our result improves the main result in Candes and Tao [5] by using weaker conditions and providing tighter error bounds. The analysis of the noiseless case provides insight to the case when the observations are contaminated by noise. We then consider the case of bounded error in Section IV. The connections between the RIP and MIP are also explored. Sparse recovery with Gaussian noise is treated in Section V. Appendices A–D contain the proofs of technical results. II. PRELIMINARIES In this section, we first introduce basic notation and definitions, and then present a technical inequality which will be used in proving our main results. . Let be a vector. The Let support of is the subset of defined byFor an integer , a vector is said to be -sparse if . For a given vector we shall denote by the vector with all but the -largest entries (in absolute value) set to zero and define , the vector with the -largest entries (in absolute value) set to zero. We shall use to denote the -norm of the vector . the standard notation Let the matrix and , the -restricted is defined to be the smallest constant isometry constant such that (II.1) for every -sparse vector . If , we can define another quantity, the -restricted orthogonality constant , as the smallest number that satisfies (II.2) for all and such that and are -sparse and -sparse, respectively, and have disjoint supports. Roughly speaking, the and restricted orthogonality constant isometry constant measure how close subsets of cardinality of columns of are to an orthonormal system. and For notational simplicity we shall write for for hereafter. It is easy to see that and are monotone. That is if if Candes and Tao [5] showed that the constants related by the following inequalities and (II.3) (II.4) are (II.5)suffices in sparse signal reconstruction. The paper is organized as follows. In Section II, after basic notation and definitions are reviewed, we introduce an elementary inequality, which allow us to make finer analysis of theAs mentioned in the Introduction, different conditions on and have been used in the literature. It is not always immediately transparent which condition is stronger and which is weaker. We shall present another important property on andAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. 
Restrictions apply.3390IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009which can be used to compare the conditions. In addition, it is especially useful in producing simplified recovery conditions. Proposition 2.1: If , then (II.6)Theorem 3.1 (Candes and Tao [5]): Let satisfies. Suppose (III.1)Let be a -sparse vector and minimizer to the problem. Then .is the uniqueIn particular,.A proof of the proposition is provided in Appendix A. Remark: Candes and Tao [6] imposes in a more recent paper Candes [3] uses consequence of Proposition 2.1 is that a strictly stronger condition than sition 2.1 yields implies that and . A direct is in fact since Propowhich means .As mentioned in the Introduction, other conditions on and have also been used in the literature. Candes, Romberg, and . Candes and Tao Tao [4] uses the condition [6] considers the Gaussian noise case. A special case with noise of Theorem 1.1 in that paper improves Theorem 3.1 level to by weakening the condition from . Candes [3] imposes the condition . We shall show below that these conditions can be uniformly improved by a transparent argument. A direct application of Proposition 2.2 yields the following result which weakens the above conditions toWe now introduce a useful elementary inequality. This inequality allows us to perform finer estimation on , norms. It will be used in proving our main results. Proposition 2.2: Let be a positive integer. Then any descending chain of real numbersNote that it follows from (II.3) and (II.4) that , and . So the condition is weaker . It is also easy to see from (II.5) and than is also weaker than (II.6) that the condition and the other conditions mentioned above. Theorem 3.2: Let . Suppose satisfiessatisfiesand obeys The proof of Proposition 2.2 is given in Appendix B. where III. SIGNAL RECOVERY IN THE NOISELESS CASE As mentioned in the Introduction, we shall give a unified minimization with an contreatment for the methods of straint and minimization with an constraint for recovery of sparse signals in three cases: noiseless, bounded error, and Gaussian noise. We begin in this section by considering the simplest setting: exact recovery of sparse signals when no noise is present. This is an interesting problem by itself and has been considered in a number of papers. See, for example, Fuchs [13], Donoho [8], and Candes and Tao [5]. More importantly, the solutions to this “clean” problem shed light on the noisy case. Our result improves the main result given in Candes and Tao [5]. The improvement is obtained by using the technical inequalities we developed in previous section. Although the focus is on recovering sparse signals, our results are stated in the general setting of reconstructing an arbitrary signal. with and suppose we are given and Let where for some unknown vector . The goal is to recover exactly when it is sparse. Candes and Tao [5] showed that a sparse solution can be obtained by minimization which is then solved via linear programming.. Then the minimizerto the problem., i.e., the In particular, if is a -sparse vector, then minimization recovers exactly. This theorem improves the results in [5], [6]. The improvement on the condition shows that for fixed and , signals with larger support can be recovered accurately. Remark: It is sometimes more convenient to use conditions only involving the restricted isometry constant . Note that the condition (III.2) implies . This is due to the factby Proposition 2.1. Hence, Theorem 3.2 holds under the condican also be used. 
tion (III.2). The condition Proof of Theorem 3.2: The proof relies on Proposition 2.2 and makes use of the ideas from [4]–[6]. In this proof, we shall also identify a vector as a function by assigning .Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3391Let Letbe a solution to the and letminimization problem (Exact). be the support of . WriteClaim:such that. Let In fact, from Proposition 2.2 and the fact that , we have(III.4)For a subset characteristic function of, we use , i.e., if if .to denote theFor each , let . Then is decomposed to . Note that ’s are pairwise disjoint, , and for . We first consider the case where is divisible by . For each , we divide into two halves in the following manner: with where is the first half of , i.e., andIt then follows from Proposition 2.2 thatProposition 2.2 also yieldsand We shall treat four equal parts. as a sum of four functions and divide withintofor any and. ThereforeWe then define that Note thatfor .by. It is clear(III.3)In fact, since, we haveSince, this yieldsIn the rest of our proof we write . Note that . So we get the equation at the top of the following page. This yieldsThe following claim follows from our Proposition 2.2.Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.3392IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009It then follows from (III.4) thatIV. RECOVERY OF SPARSE SIGNALS IN BOUNDED ERROR We now turn to the case of bounded error. The results obtained in this setting have direct implication for the case of Gaussian noise which will be discussed in Section V. and let Letwhere the noise is bounded, i.e., for some bounded set . In this case the noise can either be stochastic or deterministic. The minimization approach is to estimate by the minimizer of subject to We now turn to the case that is not divisible by . Let . Note that in this case and are understood as and , respectively. So the proof for the previous case works if we set and for and We specifically consider two cases: and . The first case is closely connected to the Dantzig selector in the Gaussian noise setting which will be discussed in more detail in Section V. Our results improve the results in Candes, Romberg, and Tao [4], Candes and Tao [6], and Donoho, Elad, and Temlyakov [10]. We shall first consider where satisfies shallLet be the solution to the (DS) problem given in (I.1). The Dantzig selector has the following property. Theorem 4.1: Suppose satisfying . If and with (IV.1) then the solution In this case, we need to use the following inequality whose proof is essentially the same as Proposition 2.2: For any descending chain of real numbers , we have to (DS) obeys (IV.2) with In particular, if . and is a -sparse vector, then .andRemark: Theorem 4.1 is comparable to Theorem 1.1 of Candes and Tao [6], but the result here is a deterministic one instead of a probabilistic one as bounded errors are considered. The proof in Candes and Tao [6] can be adapted to yield aAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3393similar result for bounded errors under the stronger condition . Proof of Theorem 4.1: We shall use the same notation as in the proof of Theorem 3.2. 
Since , letting and following essentially the same steps as in the first part of the proof of Theorem 3.2, we getwhere the noise satisfies . Once again, this problem can be solved through constrained minimization subject to (IV.3)An alternative to the constrained minimization approach is the so-called Lasso given in (I.4). The Lasso recovers a sparse regularized least squares. It is closely connected signal via minimization. The Lasso is a popular to the -constrained method in statistics literature (Tibshirani [16]). See Tropp [18] regularized least squares for a detailed treatment of the problem. By using a similar argument, we have the following result on the solution of the minimization (IV.3). Theorem 4.2: Let vector and with . Suppose . If is a -sparseIf that, then and for every , and we have. The latter forces . Otherwise(IV.4) then for any To finish the proof, we observe the following. 1) . be the submatrix obIn fact, let tained by extracting the columns of according to the in, as in [6]. Then dices in , the minimizer to the problem (IV.3) obeys (IV.5) with .imProof of Theorem 4.2: Notice that the condition , so we can use the first part of the proof plies that of Theorem 3.2. The notation used here is the same as that in the proof of Theorem 3.2. First, we haveand2) In factNote thatSoWe get the result by combining 1) and 2). This completes the proof. We now turn to the second case where the noise is bounded with . The problem is to in -norm. Let from recover the sparse signalRemark: Candes, Romberg, and Tao [4] showed that, if , then(The was set to be This impliesin [4].) Now suppose which yields. ,Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.3394IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009since and Theorem 4.2 that, with. It then follows fromProof of Theorem 4.3: It follows from Proposition 4.1 thatfor all -sparse vector where . Therefore, Theorem 4.2 improves the above result in Candes, Romberg, and Tao [4] by enlarging the support of by 60%. Remark: Similar to Theorems 3.2 and 4.1, we can have the estimation without assuming that is -sparse. In the general case, we haveSince Theorem 4.2, the conditionholds. ByA. Connections Between RIP and MIP In addition to the restricted isometry property (RIP), another commonly used condition in the sparse recovery literature is the mutual incoherence property (MIP). The mutual incoherence property of requires that the coherence bound (IV.6) be small, where are the columns of ( ’s are also assumed to be of length in -norm). Many interesting results on sparse recovery have been obtained by imposing conand the sparsity , see [10], ditions on the coherence bound [11], [13], [14], [18]. For example, a recent paper, Donoho, is a -sparse Elad, and Temlyakov [10] proved that if with , then for any , the vector and minimizer to the problem ( -Constraint) satisfiesRemarks: In this theorem, the result of Donoho, Elad, and Temlyakov [10] is improved in the following ways. to 1) The sparsity is relaxed from . So roughly speaking, Theorem 4.3 improves the result in Donoho, Elad, and Temlyakov [10] by enlarging the support of by 47%. is usually very 2) It is clear that larger is preferred. Since small, the bound is tightened from to , as is close to .V. RECOVERY OF SPARSE SIGNALS IN GAUSSIAN NOISE We now turn to the case where the noise is Gaussian. Suppose we observe (V.1) and wish to recover from and . 
We assume that is known and that the columns of are standardized to have unit norm. This is a case of significant interest, in particular in statistics. Many methods, including the Lasso (Tibshirani [16]), LARS (Efron, Hastie, Johnstone, and Tibshirani [12]) and Dantzig selector (Candes and Tao [6]), have been introduced and studied. The following results show that, with large probability, the Gaussian noise belongs to bounded sets. Lemma 5.1: The Gaussian error satisfies (V.2) and (V.3) Inequality (V.2) follows from standard probability calculations and inequality (V.3) is proved in Appendix D. Lemma 5.1 suggests that one can apply the results obtained in the previous section for the bounded error case to solve the Gaussian noise problem. Candes and Tao [6] introduced the Dantzig selector for sparse recovery in the Gaussian noise setting. Given the observations in (V.1), the Dantzig selector is the minimizer of subject to where . (V.4)with, provided.We shall now establish some connections between the RIP and MIP and show that the result of Donoho, Elad, and Temlyakov [10] can be improved under the RIP framework, by using Theorem 4.2. The following is a simple result that gives RIP constants from MIP. The proof can be found in Appendix C. It is remarked that the first inequality in the next proposition can be found in [17]. Proposition 4.1: Let be the coherence bound for and Now we are able to show the following result. Theorem 4.3: Suppose with satisfying (or, equivalently, the minimizer is a -sparse vector and . Let . If ), then for any . Then (IV.7),to the problem ( -Constraint) obeys (IV.8)with.Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3395In the classical linear regression problem when , the least squares estimator is the solution to the normal equation (V.5) in the convex program The constraint (DS) can thus be viewed as a relaxation of the normal (V.3). And similar to the noiseless case, minimization leads to the “sparsest” solution over the space of all feasible solutions. Candes and Tao [6] showed the following result. Theorem 5.1 (Candes and Tao [6]): Suppose -sparse vector. Let be such that Choose the Dantzig selector is asince . The improvement on the error bound is minor. The improvement on the condition is more significant as it shows signals with larger support can be recovered accurately for fixed and . Remark: Similar to the results obtained in the previous sections, if is not necessarily -sparse, in general we have, with probabilitywhere probabilityand, and within (I.1). Then with large probability, obeys (V.6) where and .with.1As mentioned earlier, the Lasso is another commonly used method in statistics. The Lasso solves the regularized least squares problem (I.4) and is closely related to the -constrained minimization problem ( -Constraint). In the Gaussian error be the mincase, we shall consider a particular setting. Let imizer of subject to (V.7)Remark: Candes and Tao [6] also proved an Oracle Inequality in the Gaussian noise setting under for the Dantzig selector . With some additional work, our the condition approach can be used to improve [6, Theorems 1.2 and 1.3] by . weakening the condition to APPENDIX A PROOF OF PROPOSITION 2.1 Let be -sparse and be supports are disjoint. Decompose such that is -sparse. Suppose their aswhere . 
Combining our results from the last section together with Lemma 5.1, we have the following results on the Dantzig seand the estimator obtained from minimizalector tion under the constraint. Again, these results improve the previous results in the literature by weakening the conditions and providing more precise bounds. Theorem 5.2: Suppose matrix satisfies Then with probability obeys (V.8) with obeys (V.9) with . , and with probability at least , is a -sparse vector and the-sparse for and for . Using the Cauchy–Schwartz inequality, we have, the Dantzig selectorThis yields we also have. Since . APPENDIX B PROOF OF PROPOSITION 2.2,Remark: In comparison to Theorem 5.1, our result in Theto orem 5.2 weakens the condition from and improves the constant in the bound from to . Note thatCandes and Tao [6], the constant C was stated as C appears that there was a typo and the constant C should be C .1InLet)= . It = 4=(1 0 0Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.3396IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009where eachis given (and bounded) byThereforeand the inequality is proved. Without loss of generality, we assume that is even. Write APPENDIX C PROOF OF PROPOSITION 4.1 Let be a -sparse vector. Without loss of generality, we as. A direct calculation shows sume that thatwhereandNow let us bound the second term. Note thatNowThese give usand henceFor the second inequality, we notice that follows from Proposition 2.1 that. It thenand APPENDIX D PROOF OF LEMMA 5.1 The first inequality is standard. For completeness, we give . Then each a short proof here. Let . Hence marginally has Gaussian distributionAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3397where the last step follows from the Gaussian tail probability bound that for a standard Gaussian variable and any constant , . is We now prove inequality (V.3). Note that a random variable. It follows from Lemma 4 in Cai [2] that for anyHencewhere. It now follows from the fact that[6] E. J. Candes and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger than n (with discussion),” Ann. Statist., vol. 35, pp. 2313–2351, 2007. [7] A. Cohen, W. Dahmen, and R. Devore, “Compressed Sensing and Best k -Term Approximation” 2006, preprint. [8] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ` -norm solution is also the sparsest solution,” Commun. Pure Appl. Math., vol. 59, pp. 797–829, 2006. [9] D. L. Donoho, “For most large underdetermined systems of equations, the minimal ` -norm near-solution approximates the sparsest near-solution,” Commun. Pure Appl. Math., vol. 59, pp. 907–934, 2006. [10] D. L. Donoho, M. Elad, and V. N. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006. [11] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001. [12] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression (with discussion,” Ann. Statist., vol. 32, pp. 407–451, 2004. [13] J.-J. Fuchs, “On sparse representations in arbitrary redundant bases,” IEEE Trans. Inf. Theory, vol. 50, no. 6, pp. 1341–1344, Jun. 2004. [14] J.-J. 
Fuchs, “Recovery of exact sparse representations in the presence of bounded noise,” IEEE Trans. Inf. Theory, vol. 51, no. 10, pp. 3601–3608, Oct. 2005. [15] M. Rudelson and R. Vershynin, “Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements,” in Proc. 40th Annu. Conf. Information Sciences and Systems, Princeton, NJ, Mar. 2006, pp. 207–212. [16] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. Ser. B, vol. 58, pp. 267–288, 1996. [17] J. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004. [18] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1030–1051, Mar. 2006. T. Tony Cai received the Ph.D. degree from Cornell University, Ithaca, NY, in 1996. He is currently the Dorothy Silberberg Professor of Statistics at the Wharton School of the University of Pennsylvania, Philadelphia. His research interests include high-dimensional inference, large-scale multiple testing, nonparametric function estimation, functional data analysis, and statistical decision theory. Prof. Cai is the recipient of the 2008 COPSS Presidents’ Award and a fellow of the Institute of Mathematical Statistics.Inequality (V.3) now follows by verifying directly that for all . ACKNOWLEDGMENT The authors wish to thank the referees for thorough and useful comments which have helped to improve the presentation of the paper. REFERENCES[1] W. Bajwa, J. Haupt, J. Raz, S. Wright, and R. Nowak, “Toeplitz-structured compressed sensing matrices,” in Proc. IEEE SSP Workshop, Madison, WI, Aug. 2007, pp. 294–298. [2] T. Cai, “On block thresholding in wavelet regression: Adaptivity, block size and threshold level,” Statist. Sinica, vol. 12, pp. 1241–1273, 2002. [3] E. J. Candes, “The restricted isometry property and its implications for compressed sensing,” Compte Rendus de l’ Academie des Sciences Paris, ser. I, vol. 346, pp. 589–592, 2008. [4] E. J. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Appl. Math., vol. 59, pp. 1207–1223, 2006. [5] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.Guangwu Xu received the Ph.D. degree in mathematics from the State University of New York (SUNY), Buffalo. He is now with the Department of Electrical engineering and Computer Science, University of Wisconsin-Milwaukee. His research interests include cryptography and information security, computational number theory, algorithms, and functional analysis.Jun Zhang (S’85–M’88–SM’01) received the B.S. degree in electrical and computer engineering from Harbin Shipbuilding Engineering Institute, Harbin, China, in 1982 and was admitted to the graduate program of the Radio Electronic Department of Tsinghua University. After a brief stay at Tsinghua, he came to the U.S. for graduate study on a scholarship from the Li Foundation, Glen Cover, New York. He received the M.S. and Ph.D. degrees, both in electrical engineering, from Rensselaer Polytechnic Institute, Troy, NY, in 1985 and 1988, respectively. He joined the faculty of the Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, and currently is a Professor. His research interests include image processing and computer vision, signal processing and digital communications. Prof. 
Zhang has been an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING.
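As a small computational companion to the RIP/MIP connection discussed in Section IV-A above, the following sketch computes the mutual incoherence $\mu$ of a column-normalized matrix, the classical Gershgorin-type bound $\delta_k \le (k-1)\mu$ that links the two properties, and a brute-force value of $\delta_k$ for a tiny instance. The dimensions and the brute-force check are illustrative choices made for this note, not part of the paper.

```python
# Empirical look at the coherence mu and the restricted isometry constant
# delta_k. Illustrative sketch only; dimensions are assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 20, 30, 3

F = rng.standard_normal((n, p))
F /= np.linalg.norm(F, axis=0)             # unit-norm columns

gram = F.T @ F
mu = np.max(np.abs(gram - np.eye(p)))      # largest off-diagonal |<f_i, f_j>|
print(f"mu = {mu:.3f}, Gershgorin bound (k-1)*mu = {(k - 1) * mu:.3f}")

# Brute-force delta_k: largest deviation from 1 of the eigenvalues of any
# k-column Gram submatrix.
delta_k = 0.0
for T in itertools.combinations(range(p), k):
    eig = np.linalg.eigvalsh(gram[np.ix_(T, T)])
    delta_k = max(delta_k, abs(eig[0] - 1), abs(eig[-1] - 1))
print(f"brute-force delta_{k} = {delta_k:.3f}")
```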
2014_ICASSP_EFFICIENT CONVOLUTIONAL SPARSE CODING
EFFICIENT CONVOLUTIONAL SPARSE CODING

Brendt Wohlberg
Theoretical Division
Los Alamos National Laboratory
Los Alamos, NM 87545, USA

ABSTRACT

When applying sparse representation techniques to images, the standard approach is to independently compute the representations for a set of image patches. This method performs very well in a variety of applications, but the independent sparse coding of each patch results in a representation that is not optimal for the image as a whole. A recent development is convolutional sparse coding, in which a sparse representation for an entire image is computed by replacing the linear combination of a set of dictionary vectors by the sum of a set of convolutions with dictionary filters. A disadvantage of this formulation is its computational expense, but the development of efficient algorithms has received some attention in the literature, with the current leading method exploiting a Fourier domain approach. The present paper introduces a new way of solving the problem in the Fourier domain, leading to substantially reduced computational cost.

Index Terms—Sparse Representation, Sparse Coding, Convolutional Sparse Coding, ADMM

1. INTRODUCTION

Over the past 15 years or so, sparse representations [1] have become a very widely used technique for a variety of problems in image processing. There are numerous approaches to sparse coding, the inverse problem of computing a sparse representation of a signal or image vector $s$, one of the most widely used being Basis Pursuit DeNoising (BPDN) [2]

$$\arg\min_{x}\ \frac{1}{2}\|Dx - s\|_2^2 + \lambda\|x\|_1, \tag{1}$$

where $D$ is a dictionary matrix, $x$ is the sparse representation, and $\lambda$ is a regularization parameter. When applied to images, decomposition is usually applied independently to a set of overlapping image patches covering the image; this approach is convenient, but often necessitates somewhat ad hoc subsequent handling of the overlap between patches, and results in a representation over the whole image that is suboptimal.

This research was supported by the U.S. Department of Energy through the LANL/LDRD Program.

More recently, these techniques have also begun to be applied, with considerable success, to computer vision problems such as face recognition [3] and image classification [4, 5, 6].
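As a point of reference for the convolutional formulation introduced next, the BPDN problem (1) can itself be solved with the same kind of ADMM variable splitting that Section 2 applies to the convolutional objective. The sketch below is a minimal NumPy illustration with assumed problem sizes, not code from the paper.

```python
# Minimal ADMM solver for the patch-based BPDN problem (1):
#   argmin_x (1/2)||D x - s||_2^2 + lam ||x||_1
# Illustrative sketch only; the same splitting reappears for the
# convolutional problem in Section 2.
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage S_t(v) = sign(v) * max(0, |v| - t)."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - t)

def bpdn_admm(D, s, lam, rho=1.0, iters=200):
    n, m = D.shape
    y = np.zeros(m)          # auxiliary copy of x
    u = np.zeros(m)          # scaled dual variable
    # The x-update solves (D^T D + rho I) x = D^T s + rho (y - u);
    # factor the (fixed) system matrix once.
    L = np.linalg.cholesky(D.T @ D + rho * np.eye(m))
    DTs = D.T @ s
    for _ in range(iters):
        rhs = DTs + rho * (y - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        y = soft_threshold(x + u, lam / rho)
        u = u + x - y
    return y                 # the shrinkage iterate is exactly sparse

# Example: recover a sparse code over a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
x_true = np.zeros(128)
x_true[rng.choice(128, 8, replace=False)] = 1.0
s = D @ x_true + 0.01 * rng.standard_normal(64)
x_hat = bpdn_admm(D, s, lam=0.1)
print("nonzeros recovered:", np.count_nonzero(np.abs(x_hat) > 1e-3))
```

In this non-convolutional setting a single Cholesky factorization of $D^T D + \rho I$ suffices; in the convolutional setting of Section 2 it is precisely this dense solve that the DFT-domain decomposition replaces.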
It is in this application context that convolutional sparse rep-resentations were introduced[7],replacing(1)with arg min{x m}12md m∗x m−s22+λmx m 1,(2)where{d m}is a set of M dictionaryfilters,∗denotes convo-lution,and{x m}is a set of coefficient maps,each of which is the same size as s.Here s is a full image,and the{d m} are usually much smaller.For notational simplicity s and x m are considered to be N dimensional vectors,where N is the the number of pixels in an image,and the notation{x m}is adopted to denote all M of the x m stacked as a single column vector.The derivations presented here are for a single image with a single color band,but the extension to multiple color bands(for both image andfilters)and simultaneous sparse coding of multiple images is mathematically straightforward.The original algorithm proposed for convolutional sparse coding[7]adopted a splitting technique with alternating minimization of two subproblems,thefirst consisting of the solution of a large linear system via an iterative method, and the other a simple shrinkage.The resulting alternating minimization algorithm is similar to one that would be ob-tained within an Alternating Direction Method of Multipliers (ADMM)[8,9]framework,but requires continuation on the auxiliary parameter to enforce the constraint inherent in the splitting.All computation is performed in the spatial domain, the authors expecting that computation in the Discrete Fourier Transform(DFT)domain would result in undesirable bound-ary artifacts[7].Other algorithms that have been proposed for this problem include coordinate descent[10],and a proximal gradient method[11],both operating in the spatial domain.Very recently,an ADMM algorithm operating in the DFT domain has been proposed for dictionary learning for con-volutional sparse representations[12].The use of the Fast Fourier Transform(FFT)in solving the relevant linear sys-tems is shown to give substantially better asymptotic perfor-mance than the original spatial domain method,and evidence is presented to support the claim that the resulting boundary2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP)effects are not significant.The present paper describes a convolutional sparse coding algorithm that is derived within the ADMM framework and exploits the FFT for computational advantage.It is very sim-ilar to the sparse coding component of the dictionary learning algorithm of[12],but introduces a method for solving the linear systems that dominate the computational cost of the al-gorithm in time that is linear in the number offilters,instead of cubic as in the method of[12].2.ADMM ALGORITHMRewriting(2)in a form suitable for ADMM by introducing auxiliary variables{y m},we havearg min {x m},{y m}12md m∗x m−s22+λmy m 1 such that x m−y m=0∀m,(3)for which the corresponding iterations(see[8,Sec.3]),with dual variables{u m},are{x m}(k+1)=arg min{x m}12md m∗x m−s22+ρ2mx m−y(k)m+u(k)m22(4){y m}(k+1)=arg min{y m}λmy m 1+ρ2mx(k+1)m−y m+u(k)m22(5)u(k+1) m =u(k)m+x(k+1)m−y(k+1)m.(6)Subproblem(5)is solved via shrinkage/soft thresholding asy(k+1) m =Sλ/ρx(k+1)m+u(k)m,(7)whereSγ(u)=sign(u) max(0,|u|−γ),(8) with sign(·)and|·|of a vector considered to be applied element-wise.The computational cost is O(MN).The only computationally expensive step is solving(4), which is of the formarg min {x m}12md m∗x m−s22+ρ2mx m−z m 22.(9)2.1.DFT Domain FormulationAn obvious approach is to attempt to exploit the FFT for ef-ficient implementation of the convolution via the DFT convo-lution 
theorem.(This does involve some increase in memory requirement since the d m are zero-padded to the size of the x m before application of the FFT.)Define linear operators D m such that D m x m=d m∗x m,and denote the variables D m,x m,s,and z m in the DFT domain byˆD m,ˆx m,ˆs,andˆz m respectively.It is easy to show via the DFT convolution theorem that(9)is equivalent toarg min{ˆx m}12mˆDmˆx m−ˆs22+ρ2mˆx m−ˆz m 22(10)with the{x m}minimizing(9)being given by the inverse DFT of the{ˆx m}minimizing(10).DefiningˆD= ˆDˆD1...,ˆx=⎛⎜⎝ˆx0ˆx1...⎞⎟⎠,ˆz=⎛⎜⎝ˆz0ˆz1...⎞⎟⎠,(11) this problem can be expressed asarg minˆx12ˆDˆx−ˆs22+ρ2ˆx−ˆz 22,(12) the solution being given by(ˆD HˆD+ρI)ˆx=ˆD Hˆs+ρˆz.(13) 2.2.Independent Linear SystemsMatrixˆD has a block structure consisting of M concatenated N×N diagonal matrices,where M is the number offilters and N is the number of samples in s.ˆD HˆD is an MN×MN matrix,but due to the diagonal block(not block diagonal) structure ofˆD,a row ofˆD H with its non-zero element at col-umn n will only have a non-zero product with a column ofˆD with its non-zero element at row n.As a result,there is no interaction between elements ofˆD corresponding to differ-ent frequencies,so that(as pointed out in[12])one need only solve N independent M×M linear systems to solve(13). Bristow et al.[12]do not specify how they solve these linear systems(and their software implementation was not available for inspection),but since they rate the computational cost of solving them as O(M3),it is reasonable to conclude that they apply a direct method such as Gaussian elimination.This can be very effective[8,Sec. 4.2.3]when it is possible to pre-compute and store a Cholesky or similar decomposition of the linear system(s),but in this case it is not practical unless M is very small,having an O(M2N)memory requirement for storage of these decomposition.Nevertheless,this remains a reasonable approach,the only obvious alternative being an iterative method such as conjugate gradient(CG).A more careful analysis of the unique structure of this problem,however,reveals that there is an alternative,and vastly more effective,solution.First,define the m th block of the right hand side of(13)asˆr m=ˆD H mˆs+ρˆz m,(14)so that⎛⎜⎝ˆr 0ˆr 1...⎞⎟⎠=ˆDH ˆs +ρˆz .(15)Now,denoting the n th element of a vector x by x (n )to avoid confusion between indexing of the vectors themselves and se-lection of elements of these vectors,definev n =⎛⎜⎝ˆx 0(n )ˆx 1(n )...⎞⎟⎠b n =⎛⎜⎝ˆr 0(n )ˆr 1(n )...⎞⎟⎠,(16)and define a n as the column vector containing all of the non-zero entries from column n of ˆDH ,i.e.writing ˆD =⎛⎜⎜⎜⎝ˆd 0,00...ˆd 1,00...0ˆd 0,10...0ˆd 1,10...00ˆd 0,2...00ˆd 1,2...........................⎞⎟⎟⎟⎠(17)thena n =⎛⎜⎝ˆd ∗0,nˆd ∗1,n ...⎞⎟⎠,(18)where ∗denotes complex conjugation.The linear system to solve corresponding to element n of the {x m }is (a n a H n +ρI )v n =b n .(19)The critical observation is that the matrix on the left handside of this system consists of a rank-one matrix plus a scaled identity.Applying the Sherman-Morrison formula(A +uv H )−1=A −1−A −1uv H A −11+u H A −1v (20)gives(ρI +aa H )−1=ρ−1 I −aaHρ+a H a,(21)so that the solution to (19)isv n =ρ−1b n −a H n b nρ+a H n a na n.(22)The only vector operations here are inner products,element-wise addition,and scalar multiplication,so that this method is O (M )instead of O (M 3)as in [12].The cost of solving N of these systems is O (MN ),and the cost of the FFTs is O (MN log N ).Here it is the cost of the FFTs that dominates,whereas in [12]the cost of solving the 
This approach can be implemented in an interpreted language such as Matlab in a form that avoids explicit iteration over the N frequency indices, by passing data for all N indices as a single array to the relevant linear-algebraic routines (commonly referred to as vectorization in Matlab terminology). Some additional computation time improvement is possible, at the cost of additional memory requirements, by pre-computing $a_n^H/(\rho + a_n^H a_n)$ in (22).

2.3. Algorithm Summary

The proposed algorithm is summarized in Alg. 1. Stopping criteria are those discussed in [8, Sec. 3.3], together with an upper bound on the number of iterations. The options for the ρ update are (i) fixed ρ (i.e. no update), (ii) the adaptive update strategy described in [8, Sec. 3.4.1], and (iii) the multiplicative increase scheme advocated in [12].

Input: image s, filter dictionary {d_m}, parameters λ, ρ
Precompute: FFTs of {d_m} → {D̂_m}, FFT of s → ŝ
Initialize: {y_m} = {u_m} = 0
while stopping criteria not met do
    Compute FFTs of {y_m} → {ŷ_m}, {u_m} → {û_m}
    Compute {x̂_m} using the method in Sec. 2.2
    Compute inverse FFTs of {x̂_m} → {x_m}
    {y_m} = S_{λ/ρ}({x_m} + {u_m})
    {u_m} = {u_m} + {x_m} − {y_m}
    Update ρ if appropriate
end
Output: coefficient maps {x_m}

Algorithm 1: Summary of proposed ADMM algorithm

The computational cost of the algorithm components is O(MN log N) for the FFTs, O(MN) for the proposed linear solver, and O(MN) for both the shrinkage and dual variable update, so that the cost of the entire algorithm is O(MN log N), dominated by the cost of the FFTs. In contrast, the cost of the algorithm proposed in [12] is O(M^3 N) (there is also an O(MN log N) cost for FFTs, but it is dominated by the O(M^3 N) cost of the linear solver), and the cost of the original spatial-domain algorithm [7] is O(M^2 N^2 L), where L is the dimensionality of the filters.
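Below is a hypothetical end-to-end sketch of Algorithm 1 for a 1-D signal (the paper treats images; the function names, array shapes, fixed iteration count, and the omission of stopping criteria and the ρ update are all simplifying assumptions, not the authors' implementation):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise shrinkage S_t(v) = sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cbpdn_admm_1d(s, d, lam, rho=1.0, n_iter=100):
    """Sketch of Algorithm 1 for a length-N signal s and M filters d of
    shape (M, L) with L <= N.  Returns coefficient maps x of shape (M, N)."""
    M, L = d.shape
    N = s.shape[0]
    dh = np.fft.fft(np.pad(d, ((0, 0), (0, N - L))), axis=1)  # FFTs of zero-padded filters
    sh = np.fft.fft(s)
    y = np.zeros((M, N))
    u = np.zeros((M, N))
    ana = np.sum(np.abs(dh) ** 2, axis=0)        # a_n^H a_n at every frequency
    for _ in range(n_iter):
        zh = np.fft.fft(y - u, axis=1)           # hat{z}_m
        bh = np.conj(dh) * sh + rho * zh         # right-hand side, Eq. (14)
        anb = np.sum(dh * bh, axis=0)            # a_n^H b_n (a_n = conjugated filter DFTs)
        xh = (bh - np.conj(dh) * (anb / (rho + ana))) / rho   # Eq. (22), all n at once
        x = np.real(np.fft.ifft(xh, axis=1))
        y = soft_threshold(x + u, lam / rho)     # S_{lambda/rho}
        u = u + x - y                            # dual variable update
        # rho update omitted; see the options listed above
    return x
```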
3. DICTIONARY LEARNING

The extension of (2) to learning a dictionary from training data involves replacing the minimization with respect to x_m with minimization with respect to both x_m and d_m. The optimization is invariably performed via alternating minimization between the two variables, the most common approach consisting of a sparse coding step followed by a dictionary update [13]. The commutativity of convolution suggests that the DFT domain solution of Sec. 2.1 can be directly applied in minimizing with respect to d_m instead of x_m, but this is not possible since the d_m are of constrained size, and must be zero-padded to the size of the x_m prior to a DFT domain implementation of the convolution. If the size constraint is implemented in an ADMM framework [14], however, the problem is decomposed into a computationally cheap subproblem corresponding to a projection onto the constraint set, and another subproblem that can be efficiently solved by extending the method in Sec. 2.1. This iterative algorithm for the dictionary update can alternate with a sparse coding stage to form a more traditional dictionary learning method [15], or the subproblems of the sparse coding and dictionary update algorithms can be combined into a single ADMM algorithm [12].

4. RESULTS

A comparison of execution times for the algorithm (λ = 0.05) with different methods of solving the linear system, for a set of overcomplete 8×8 DCT dictionaries and the 512×512 greyscale Lena image, is presented in Fig. 1. It is worth emphasizing that this is a large image by the standards of prior publications on convolutional sparse coding; the test images in [12], for example, are 50×50 and 128×128 pixels in size. The Gaussian elimination solution is computed using a Cholesky decomposition (since it is, in general, impossible to cache this decomposition, it is necessary to recompute it at every solution), as implemented by the Matlab mldivide function, and is applied by iterating over all frequencies in the apparent absence of any practical alternative. The conjugate gradient solution is computed using two different relative error tolerances. A significant part of the computational advantage here of CG over the direct method is that it is applied simultaneously over all frequencies. The two curves for the proposed solver based on the Sherman-Morrison formula illustrate the significant gain from an implementation that simultaneously solves over all frequencies, and that the relative advantage of doing so decreases with increasing M.

[Fig. 1 (plot omitted; axes: dictionary size M = 64–512 vs. execution time in seconds, log scale). A comparison of execution times for 10 steps of the ADMM algorithm for different methods of solving the linear system: Gaussian elimination (GE), conjugate gradient with relative error tolerance 10^-5 (CG 10^-5) and 10^-3 (CG 10^-3), and Sherman-Morrison implemented with a loop over frequencies (SM-L) or jointly over all frequencies (SM-V).]

The performance of the three ρ update strategies discussed in the previous section was compared by sparse coding a 256×256 Lena image using a 9×9×512 dictionary (from [16], by the authors of [17]) with a fixed value of λ = 0.02 and a range of initial ρ values ρ_0. The resulting values of the functional in (2) after 100, 500, and 1000 iterations of the proposed algorithm are displayed in Table 1. The adaptive update strategy uses the default parameters of [8, Sec. 3.4.1], and the increasing strategy uses a multiplicative update by a factor of 1.1 with a maximum of 10^5, as advocated by [12].

In summary, a fixed ρ can perform well, but is sensitive to a good choice of parameter. When initialized with a small ρ_0, the increasing ρ strategy provides the most rapid decrease in functional value, but thereafter converges very slowly. Overall, unless rapid computation of an approximate solution is desired, the adaptive ρ strategy appears to provide the best performance, with the least sensitivity to the choice of ρ_0. This issue is complex, however, and further experimentation is necessary before drawing any general conclusions that could be considered valid over a broad range of problems.

Table 1. Comparison of functional value convergence for the same problem with three different ρ update strategies.

                 ρ_0:  10^-2   10^-1   10^0    10^1    10^2    10^3
Fixed ρ
  100 iter.            28.27   27.80   18.10   10.09    9.76   11.60
  500 iter.            28.05   22.25   11.11    8.89    9.11   10.13
  1000 iter.           27.80   17.00    9.64    8.82    8.96    9.71
Adaptive ρ
  100 iter.            21.62   16.97   14.56   10.71   11.14   11.41
  500 iter.            10.81   10.23    9.81    9.01    9.18    9.09
  1000 iter.            9.44    9.21    9.06    8.83    8.87    8.84
Increasing ρ
  100 iter.            14.78    9.82    9.50    9.90   11.51   15.15
  500 iter.             9.55    9.45    9.46    9.89   11.47   14.51
  1000 iter.            9.53    9.44    9.45    9.88   11.41   13.97
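For reference, here is a small sketch of the three ρ-update options compared in Table 1; the interface is an assumption, the adaptive branch is the usual residual-balancing rule of [8, Sec. 3.4.1] with its default parameters, and the increasing branch is the factor-1.1 scheme with a 10^5 cap advocated in [12]:

```python
def update_rho(rho, r_norm, s_norm, mode="adaptive",
               mu=10.0, tau=2.0, gamma=1.1, rho_max=1e5):
    """r_norm, s_norm: norms of the ADMM primal and dual residuals."""
    if mode == "fixed":
        return rho
    if mode == "increasing":
        return min(gamma * rho, rho_max)
    # adaptive: keep the primal and dual residual norms within a factor mu
    if r_norm > mu * s_norm:
        return tau * rho
    if s_norm > mu * r_norm:
        return rho / tau
    return rho
```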
5. CONCLUSION

A computationally efficient algorithm is proposed for solving the convolutional sparse coding problem in the Fourier domain. This algorithm has the same general structure as a previously proposed approach [12], but enables a very significant reduction in computational cost by careful design of a linear solver for the most critical component of the iterative algorithm. The theoretical computational cost of the algorithm is reduced from O(M^3) to O(MN log N) (where N is the dimensionality of the data and M is the number of elements in the dictionary), and is also shown empirically to result in greatly reduced computation time. The significant improvement in efficiency of the proposed approach is expected to greatly increase the range of problems that can practically be addressed via convolutional sparse representations.

6. REFERENCES

[1] A. M. Bruckstein, D. L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Review, vol. 51, no. 1, pp. 34–81, 2009. doi:10.1137/060657704
[2] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998. doi:10.1137/S1064827596304010
[3] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, February 2009. doi:10.1109/tpami.2008.79
[4] Y. Boureau, F. Bach, Y. A. LeCun, and J. Ponce, "Learning mid-level features for recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 2559–2566. doi:10.1109/cvpr.2010.5539963
[5] J. Yang, K. Yu, and T. S. Huang, "Supervised translation-invariant sparse coding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3517–3524. doi:10.1109/cvpr.2010.5539958
[6] J. Mairal, F. Bach, and J. Ponce, "Task-driven dictionary learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 791–804, April 2012. doi:10.1109/tpami.2011.156
[7] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 2528–2535. doi:10.1109/cvpr.2010.5539957
[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010. doi:10.1561/2200000016
[9] J. Eckstein, "Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results," Rutgers Center for Operations Research, Rutgers University, Rutcor Research Report RRR 32-2012, December 2012. [Online]. Available: /pub/rrr/reports2012/322012.pdf
[10] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. A. LeCun, "Learning convolutional feature hierarchies for visual recognition," in Advances in Neural Information Processing Systems (NIPS 2010), 2010.
[11] R. Chalasani, J. C. Principe, and N. Ramakrishnan, "A fast proximal method for convolutional sparse coding," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), Aug. 2013, pp. 1–5. doi:10.1109/IJCNN.2013.6706854
[12] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2013, pp. 391–398. doi:10.1109/CVPR.2013.57
[13] B. Mailhé and M. D. Plumbley, "Dictionary learning with large step gradient descent for sparse representations," in Latent Variable Analysis and Signal Separation, ser. Lecture Notes in Computer Science, F. J. Theis, A. Cichocki, A. Yeredor, and M. Zibulevsky, Eds. Springer Berlin Heidelberg, 2012, vol. 7191, pp. 231–238. doi:10.1007/978-3-642-28551-6_29
[14] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, "An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems," IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 681–695, March 2011. doi:10.1109/tip.2010.2076294
[15] K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 1999, pp. 2443–2446. doi:10.1109/icassp.1999.760624
[16] J. Mairal, software available from http://lear.inrialpes.fr/people/mairal/denoise ICCV09.tar.gz.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2272–2279. doi:10.1109/iccv.2009.5459452
Unnatural L0 Sparse Representation for Natural Image Deblurring
[Figure 1 panels (detail): (e) our x̃, (f) final restored image]
Figure 1. Intermediate unnatural image representation exists in many state-of-the-art approaches.
1.1. Analysis
Prior MAP-based approaches can be roughly categorized into two groups, i.e., methods with explicit edge prediction
taining salient image edges. These maps are vital to make motion deblurring accomplishable in different MAP-variant frameworks.

Implicit Regularization. Shan et al. [20] adopted a sparse image prior. This method, in the first few iterations, uses a large regularization weight to suppress insignificant structures and preserve strong ones, creating crisp-edge image results, as exemplified in Fig. 1(b). This scheme is useful for removing harmful subtle image structures, making kernel estimation generally follow correct directions in iterations. Krishnan et al. [16] used an L1/L2 regularization scheme. The main feature is to adapt L1-norm regularization by treating the L2-norm of image gradients as a weight in iterations. One intermediate result from this method is shown in Fig. 1(c). The main difference between this form and that of [20] is the way regularization strength is adapted in iterations. Note that both of them suppress details in the early stage of optimization.

Explicit Filter and Selection. In [19, 3], the shock filter is introduced to create a sharpened reference map for kernel estimation (a minimal sketch of such a filter is given after this passage). Cho and Lee [3] performed bilateral filtering and edge thresholding in each iteration to remove small- and medium-amplitude structures (illustrated in Fig. 1(d)), also avoiding the trivial solution. Xu and Jia [25] proposed a texture-removal strategy, explained and extended in [28], to guide edge selection and detect large-scale structures. The resulting edge map in each step is also a small-edge-subdued version of the natural input. These two schemes have been extensively validated in motion deblurring.

Unnatural Representation. The above techniques enable several successful MAP frameworks in motion deblurring. All of them have intermediate image results or edge maps different from a natural one, as shown in Fig. 1, because they contain only high-contrast and step-like structures while suppressing others. We generally call them unnatural representations, which are the key to robust kernel estimation in motion deblurring.
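The shock filter referred to above can be sketched as follows (an illustrative NumPy version, not the exact formulation used in [19, 3]; pre-smoothing, boundary handling, and the step size are simplified assumptions):

```python
import numpy as np

def shock_filter(img, n_iter=10, dt=0.1):
    """Iterate I <- I - sign(Laplacian(I)) * |grad I| * dt, which pushes the
    image toward step-like edges (periodic boundaries for simplicity)."""
    out = img.astype(float).copy()
    for _ in range(n_iter):
        gy, gx = np.gradient(out)
        grad_mag = np.hypot(gx, gy)
        lap = (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
               np.roll(out, 1, 1) + np.roll(out, -1, 1) - 4.0 * out)
        out = out - np.sign(lap) * grad_mag * dt
    return out
```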
Sparse Subspace Clustering: Algorithm, Theory, and Applications
E. Elhamifar is with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94804. E-mail: ehsan@.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 35, NO. 11, NOVEMBER 2013
Sparse Subspace Clustering: Algorithm, Theory, and Applications
Index Terms—High-dimensional data, intrinsic low-dimensionality, subspaces, clustering, sparse representation, ℓ1-minimization, convex programming, spectral clustering, principal angles, motion segmentation, face clustering
On Single Image Scale-Up using Sparse-Representations (a paper by Michael Elad)
Roman Zeyde, Michael Elad and Matan Protter
for an additive i.i.d. white Gaussian noise, denoted by $v \sim \mathcal{N}(0, \sigma^2 I)$. Given $z_l$, the problem is to find $\hat{y} \in \mathbb{R}^{N_h}$ such that $\hat{y} \approx y_h$. Due to the
hereafter that H applies a known low-pass filter to the image, and S performs
a decimation by an integer factor s, by discarding rows/columns from the input
{romanz,elad,matanpr}@cs.technion.ac.il
Abstract. This paper deals with the single image scale-up problem using sparse-representation modeling. The goal is to recover an original image from its blurred and down-scaled noisy version. Since this problem is highly ill-posed, a prior is needed in order to regularize it. The literature offers various ways to address this problem, ranging from simple linear space-invariant interpolation schemes (e.g., bicubic interpolation) to spatially-adaptive and non-linear filters of various sorts. We embark from a recently-proposed successful algorithm by Yang et al. [13,14], and similarly assume a local Sparse-Land model on image patches, serving as regularization. Several important modifications to the above-mentioned solution are introduced, and are shown to lead to improved results. These modifications include a major simplification of the overall process both in terms of the computational complexity and the algorithm architecture, using a different training approach for the dictionary pair, and introducing the ability to operate without a training set by bootstrapping the scale-up task from the given low-resolution image. We demonstrate the results on true images, showing both visual and PSNR improvements.
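The fragments above describe the degradation model assumed in the paper: the low-resolution image is obtained from the high-resolution one by a known low-pass blur H, a decimation S by an integer factor s (discarding rows/columns), and additive white Gaussian noise v. A minimal sketch of that forward model follows (the Gaussian choice of blur, the function name, and the parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(y_h, s=2, blur_sigma=1.0, noise_sigma=5.0, seed=None):
    """Sketch of z_l = S H y_h + v: blur (H), decimate by s (S), add noise (v)."""
    rng = np.random.default_rng(seed)
    blurred = gaussian_filter(y_h.astype(float), blur_sigma)          # H y_h
    decimated = blurred[::s, ::s]                                     # S H y_h
    return decimated + rng.normal(0.0, noise_sigma, decimated.shape)  # + v
```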
Sparse Representations
Tara N. Sainath
SLTC Newsletter, November 2010

Sparse representations (SRs), including compressive sensing (CS), have gained popularity in the last few years as a technique used to reconstruct a signal from few training examples, a problem which arises in many machine learning applications. This reconstruction can be defined as adaptively finding a dictionary which best represents the signal on a per-sample basis. This dictionary could include random projections, as is typically done for signal reconstruction, or actual training samples from the data, which is explored in many machine learning applications. SRs is a rapidly growing field with contributions in a variety of signal processing and machine learning conferences such as ICASSP, ICML and NIPS, and more recently in speech recognition. Recently, a special session on Sparse Representations took place at Interspeech 2010 in Makuhari, Japan, from September 26-30, 2010. Below, work from this special session is summarized in more detail.

FACE RECOGNITION VIA COMPRESSIVE SENSING

Yang et al. present a method for image-based robust face recognition using sparse representations [1]. Most state-of-the-art face recognition systems suffer from limited abilities to handle image nuisances such as illumination, facial disguise, and pose misalignment. Motivated by work in compressive sensing, the described method finds the sparsest linear combination of a query image using all prior training images, where the dominant sparse coefficients reveal the identity of the query image. In addition, extensions of applying sparse representations for face recognition also address a wide range of problems in the field of face recognition, such as dimensionality reduction, image corruption, and face alignment. The paper also provides useful guidelines to practitioners working in similar fields, such as speech recognition.

EXEMPLAR-BASED SPARSE REPRESENTATION FEATURES

In Sainath et al. [2], the authors explore the use of exemplar-based sparse representations (SRs) to map test features into the linear span of training examples. Specifically, given a test vector y and a set of exemplars from the training set, which are put into a dictionary H, y is represented as a linear combination of training examples by solving y = Hβ subject to a sparseness constraint on β. The feature Hβ can be thought of as mapping the test sample y back into the linear span of training examples in H. The authors show that the frame classification accuracy using SRs is higher than using a Gaussian Mixture Model (GMM), showing that not only do SRs move test features closer to training, but they also move the features closer to the correct class. A Hidden Markov Model (HMM) is trained on these new SR features and evaluated in a speech recognition task. On the TIMIT corpus, applying the SR features on top of our best discriminatively trained system allows for a 0.7% absolute reduction in phonetic error rate (PER). Furthermore, on a large-vocabulary 50-hour broadcast news task, the SR features give a reduction in word error rate (WER) of 0.3% absolute, demonstrating their benefit for large-vocabulary tasks.

OBSERVATION UNCERTAINTY MEASURES FOR SPARSE IMPUTATION

Missing data techniques are used to estimate clean speech features from noisy environments by finding reliable information in the noisy speech signal.
Decoding is then performed based on either the reliable information alone or using both reliable and unreliable information, where unreliable parts of the signal are reconstructed using missing data imputation prior to decoding. Sparse imputation (SI) is an exemplar-based reconstruction method which is based on representing segments of the noisy speech signal as linear combinations of as few as possible clean speech example segments. Decoding accuracy depends on several factors, including the uncertainty in the speech segment. Gemmeke et al. propose various uncertainty measures to characterize the expected accuracy of a sparse imputation based missing data method [3]. In experiments on noisy large vocabulary speech data, using observation uncertainties derived from the proposed measures improved the speech recognition performance on features estimated with SI. Relative error reductions of up to 15% compared to the baseline system using SI without uncertainties were achieved with the best measures.

SPARSE AUTO-ASSOCIATIVE NEURAL NETWORKS: THEORY AND APPLICATION TO SPEECH RECOGNITION

Garimella et al. introduce a sparse auto-associative neural network (SAANN) in which the internal hidden layer output is forced to be sparse [4]. This is done by adding a sparse regularization term to the original reconstruction error cost function, and updating the parameters of the network to minimize the overall cost. The authors show the benefit of SAANN on the TIMIT phonetic recognition task. Specifically, a set of perceptual linear prediction (PLP) features are provided as input to the SAANN structure, and a set of sparse hidden layer outputs are produced and used as features. Experiments with the SAANN features on the TIMIT phoneme recognition system show a relative improvement in phoneme error rate of 5.1% over the baseline PLP features.

DATA SELECTION FOR LANGUAGE MODELING USING SPARSE REPRESENTATIONS

The ability to adapt language models to specific domains from large generic text corpora is of considerable interest to the language modeling community. One of the key challenges is to identify the text material relevant to a domain in the generic text collection. The text selection problem can be cast in a semi-supervised learning framework where the initial hypothesis from a speech recognition system is used to identify relevant training material. Sethy et al. [5] present a novel sparse representation formulation which selects a sparse set of relevant sentences from the training data which match the test set distribution. In this formulation, the training sentences are treated as the columns of the sparse representation matrix and the n-gram counts as the rows. The target vector is the n-gram probability distribution for the test data. A sparse solution to this problem formulation identifies a few columns which can best represent the target test vector, thus identifying the relevant set of sentences from the training data. Rescoring results with the language model built from the data selected using the proposed method yields modest gains on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%.

SPARSE REPRESENTATIONS FOR TEXT CATEGORIZATION

Given the superior performance of SRs compared to other classifiers for both image classification and phonetic classification, Sainath et al. extend the use of SRs to text classification [6], a method which has thus far not been explored for this domain. Specifically, Sainath et al.
show how SRs can be used for text classification and how their performance varies with the vocabulary size of the documents. The research finds that the SR method offers promising results over the Naive Bayes (NB) classifier, a standard baseline classifier used for text categorization, thus introducing an alternative class of methods for text categorization.

CONCLUSIONS

This article presented an overview of sparse representation research in the areas of face recognition, speech recognition, language modeling and text classification. For more information, please see:

[1] A. Yang, Z. Zhou, Y. Ma and S. Shankar Sastry, "Towards a robust face recognition system using compressive sensing", in Proc. Interspeech, September 2010.
[2] T. N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky and A. Sethy, "Exemplar-Based Sparse Representation Features for Speech Recognition", in Proc. Interspeech, September 2010.
[3] J. F. Gemmeke, U. Remes and K. J. Palomäki, "Observation Uncertainty Measures for Sparse Imputation", in Proc. Interspeech, September 2010.
[4] G.S.V.S. Sivaram, S. Ganapathy and H. Hermansky, "Sparse Auto-associative Neural Networks: Theory and Application to Speech Recognition", in Proc. Interspeech, September 2010.
[5] A. Sethy, T. N. Sainath, B. Ramabhadran and D. Kanevsky, "Data Selection for Language Modeling Using Sparse Representations", in Proc. Interspeech, September 2010.
[6] T. N. Sainath, S. Maskey, D. Kanevsky, B. Ramabhadran, D. Nahamoo and J. Hirschberg, "Sparse Representations for Text Categorization", in Proc. Interspeech, September 2010.
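The exemplar-based formulation summarized above represents a test vector y as a sparse combination of training exemplars, y ≈ Hβ, and uses Hβ as the new feature. A minimal sketch with a lasso-style solver (an assumption for illustration; the cited papers use other sparse solvers and constraints):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sr_feature(y, H, alpha=0.01):
    """H: (dim, n_exemplars) dictionary of training examples; y: (dim,).
    Solve y ~= H beta with an l1 penalty and return (H beta, beta)."""
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    model.fit(H, y)
    beta = model.coef_
    return H @ beta, beta
```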
Representational Learning
One rather abstract and general notion comes from considering causal models. The idea is that an input, such as the image of a scene, has distal causes, such as objects at given locations illuminated by some particular lighting scheme observed from a particular viewing direction. These causes make for the structure in the input and, since inferences and decisions are normally based on underlying causes, make for appropriate representations of the input. To put it another way, the images that we see live in an O(10^8)-dimensional space which has one dimension for each photoreceptor (plus one dimension for time). However, the set of all images we might ever naturally see is much smaller, since it is constrained by how images are actually generated. The image generation process specifies the natural coordinates for the observable images, in terms of the things that cause them, and it is these coordinates that we seek to use in order to represent new images.

More concrete notions about extracting the structure in the input (and therefore also about extracting causes) include finding lower-dimensional projections of the input that nevertheless convey most of the information in the input, or code the input in a way that makes it cheap to communicate; finding projections whose activities are mutually independent (also known as factorial) or sparse, or with distributions that are statistically unlikely and therefore potentially unusually informative. A major source of statistical structure in the input comes from groups of transformations – for instance translation and scale for visual images. Extracting the structure correctly requires respecting (or discovering) these transformations.

We consider a number of learning methods that are classed as being unsupervised or self-supervised, in the sense that no information other than the patterns themselves is explicitly provided to guide the representational choice. This is by contrast with supervised methods described in chapter 8, which learn appropriate representations in the service of particular tasks. A main reason for studying unsupervised methods is that they are likely to be much more common than supervised ones. For instance, there is only a derisory amount of supervisory information available to learn how to perform a task such as object recognition that is based on the activities in the O(10^8)-dimensional space of activities of the photoreceptors. Unsupervised learning is also often used to try to capture what is happening during activity-dependent development. We saw in chapters 8 and 9 mathematical models that capture the nature and plasticity of the development of cells in visual cortex on the basis of the activity of input neurons. We seek to provide quantifiable goals for this unsupervised process.

As will become evident, in neural terms, unsupervised learning methods are still in their infancy – they are only barely capable of generating the forms of sophisticated population code representations that we have discussed and dissected, and even central concepts, such as the notion of causes in causal models and the groups of relevant transformations that particular inputs undergo, are only relatively weakly formulated. The two things that unsupervised learning methods have to work with are
代数中常用英语词汇
(0,2) 插值||(0,2) interpolation
0#||zero-sharp; 读作零井或零开。
0+||zero-dagger; 读作零正。
1-因子||1-factor
3-流形||3-manifold; 又称“三维流形”。
AIC准则||AIC criterion, Akaike information criterion
Ap 权||Ap-weight
A稳定性||A-stability, absolute stability
A最优设计||A-optimal design
BCH 码||BCH code, Bose-Chaudhuri-Hocquenghem code
BIC准则||BIC criterion, Bayesian modification of the AIC
BMOA函数||analytic function of bounded mean oscillation; 全称“有界平均振动解析函数”。
BMO鞅||BMO martingale
BSD猜想||Birch and Swinnerton-Dyer conjecture; 全称“伯奇与斯温纳顿-戴尔猜想”。
B样条||B-spline
C*代数||C*-algebra; 读作“C星代数”。
C0 类函数||function of class C0; 又称“连续函数类”。
CAT准则||CAT criterion, criterion for autoregressive
CM域||CM field
CN 群||CN-group
CW 复形的同调||homology of CW complex
CW复形||CW complex
CW复形的同伦群||homotopy group of CW complexes
CW剖分||CW decomposition
Cn 类函数||function of class Cn; 又称“n次连续可微函数类”。
Cp统计量||Cp-statistic
Summary
Proc.Valencia/ISBA8th World Meeting on Bayesian StatisticsBenidorm(Alicante,Spain),June1st–6th,2006Bayesian nonparametriclatent feature modelsZoubin GhahramaniUniversity of Cambridge,UKzoubin@Thomas L.GriffithsUniversity of California at Berkeley,USAtomZoubin Ghahramani is in the Department of Engineering,University of Cambridge,and the Machine Learning Department,Carnegie Mellon University;Thomas L.Griffiths is in the Psychology Department,University of California at Berkeley;Peter Sollich is in the Department of Mathematics,King’s College London.2Z.Ghahramani,T.L.Griffiths,P.Sollich1.INTRODUCTIONLatent or hidden variables are an important component of many statistical models. The role of these latent variables may be to represent properties of the objects or data points being modelled that have not been directly observed,or to represent hidden causes that explain the observed data.Most models with latent variables assume afinite number of latent variables per object.At the extreme,mixture models can be represented via a single discrete latent variable,and hidden Markov models(HMMs)via a single latent variable evolving over time.Factor analysis and independent components analysis(ICA) generally use more than one latent variable per object but this number is usually assumed to be small.The close relationship between latent variable models such as factor analysis,state-space models,finite mixture models,HMMs,and ICA is reviewed in(Roweis and Ghahramani,1999).Our goal is to describe a class of latent variable models in which each object is associated with a(potentially unbounded)vector of latent tent feature representations can be found in several widely-used statistical models.In Latent Dirichlet Allocation(LDA;Blei,Ng,&Jordan,2003)each object is associated with a probability distribution over latent features.LDA has proven very successful for modelling the content of documents,where each feature indicates one of the topics that appears in the document.While using a probability distribution over features may be sensible to model the distribution of topics in a document,it introduces a conservation constraint—the more an object expresses one feature,the less it can express others—which may not be appropriate in other contexts.Other latent feature representations include binary vectors with entries indicating the presence or absence of each feature(e.g.,Ueda&Saito,2003),continuous vectors representing objects as points in a latent space(e.g.,Jolliffe,1986),and factorial models,in which each feature takes on one of a discrete set of values(e.g.,Zemel&Hinton,1994; Ghahramani,1995).While it may be computationally convenient to define models with a smallfinite number of latent variables or latent features per object,it may be statistically inappropriate to constrain the number of latent variables a priori.The problem of finding the number of latent variables in a statistical model has often been treated as a model selection problem,choosing the model with the dimensionality that results in the best performance.However,this treatment of the problem assumes that there is a single,finite-dimensional representation that correctly characterizes the properties of the observed objects.This assumption may be unreasonable.For example,when modelling symptoms in medical patients,the latent variables may include not only presence or absence of known diseases but also any number of environmental and genetic factors and potentially unknown diseases which relate to the pattern of symptoms the patient 
exhibited.The assumption that the observed objects manifest a sparse subset of an un-bounded number of latent classes is often used in nonparametric Bayesian statistics. In particular,this assumption is made in Dirichlet process mixture models,which are used for nonparametric density estimation(Antoniak,1974;Escobar&West, 1995;Ferguson,1983;Neal,2000).Under one interpretation of a Dirichlet process mixture model,each object is assigned to a latent class,and each class is associated with a distribution over observable properties.The prior distribution over assign-ments of objects to classes is specified in such a way that the number of classes used by the model is bounded only by the number of objects,making Dirichlet process mixture models“infinite”mixture models(Rasmussen,2000).Recent workBayesian nonparametric latent feature models3 has extended these methods to models in which each object is represented by a dis-tribution over features(Blei,Griffiths,Jordan,&Tenenbaum,2004;Teh,Jordan,Beal,&Blei,2004).However,there are no equivalent methods for dealing withother feature-based representations,be they binary vectors,factorial structures,orvectors of continuous feature values.In this paper,we take the idea of defining priors over infinite combinatorialstructures from nonparametric Bayesian statistics,and use it to develop methodsfor unsupervised learning in which each object is represented by a sparse subsetof an unbounded number of features.These features can be binary,take on mul-tiple discrete values,or have continuous weights.In all of these representations,the difficult problem is deciding which features an object should possess.The setof features possessed by a set of objects can be expressed in the form of a binary matrix,where each row is an object,each column is a feature,and an entry of1indicates that a particular objects possesses a particular feature.We thus focuson the problem of defining a distribution on infinite sparse binary matrices.Ourderivation of this distribution is analogous to the limiting argument in(Neal2000;Green and Richardson,2001)used to derive the Dirichlet process mixture model (Antoniak,1974;Ferguson,1983),and the resulting process we obtain is analogousto the Chinese restaurant process(CRP;Aldous,1985;Pitman,2002).This distri-bution over infinite binary matrices can be used to specify probabilistic models thatrepresent objects with infinitely many binary features,and can be combined withpriors on feature values to produce factorial and continuous representations.The plan of the paper is as follows.Section2discusses the role of a prioron infinite binary matrices in defining infinite latent feature models.Section3 describes such a prior,corresponding to a stochastic process we call the Indianbuffet process(IBP).Section4describes a two-parameter extension of this model which allows additionalflexibility in the structure of the infinite binary matrices. 
Section6illustrates several applications of the IBP prior.Section7presents someconclusions.TENT FEATURE MODELSAssume we have N objects,represented by an N×D matrix X,where the i th row of this matrix,x i,consists of measurements of D observable properties of the i thobject.In a latent feature model,each object is represented by a vector of latentfeature values f i,and the properties x i are generated from a distribution determinedby those latent feature tent feature values can be continuous,as in prin-cipal component analysis(PCA;Jolliffe,1986),or discrete,as in cooperative vector quantization(CVQ;Zemel&Hinton,1994;Ghahramani,1995).In the remainderof this Section,we will assume that feature values are ing the matrix F=ˆf T1f T2···f T N˜T to indicate the latent feature values for all N objects,the model is specified by a prior over features,p(F),and a distribution over observed propertymatrices conditioned on those features,p(X|F).As with latent class models,thesedistributions can be dealt with separately:p(F)specifies the number of features,their probability,and the distribution over values associated with each feature,while p(X|F)determines how these features relate to the properties of objects.Our focus will be on p(F),showing how such a prior can be defined without placing an upper bound on the number of features.We can break the matrix F into two components:a binary matrix Z indicatingwhich features are possessed by each object,with z ik=1if object i has feature4Z.Ghahramani,T.L.Griffiths,P.Sollich f i∈{1...K}finite mixture model DPMf i∈[0,1]K, k f ik=1LDA HDPf i∈{0,1}K factorial models,CVQ IBPf i∈ℜK FA,PCA,ICA derivable from IBPBayesian nonparametric latent feature models53.A DISTRIBUTION ON INFINITE BINARY MATRICESIn this Section,we derive a distribution on infinite binary matrices by starting with a simple model that assumes K features,and then taking the limit as K→∞.The resulting distribution corresponds to a simple generative process,which we term the Indian buffet process.3.1.Afinite feature modelWe have N objects and K features,and the possession of feature k by object i is indicated by a binary variable z ik.Each object can possess multiple features. The z ik thus form a binary N×K feature matrix,Z.We will assume that each object possesses feature k with probabilityπk,and that the features are generated independently.The probabilitiesπk can each take on any value in[0,1].Under this model,the probability of a matrix Z givenπ={π1,π2,...,πK},isP(Z|π)=KY k=1N Y i=1P(z ik|πk)=K Y k=1πm k k(1−πk)N−m k,(1)where m k=P N i=1z ik is the number of objects possessing feature k.We can define a prior onπby assuming that eachπk follows a beta distribution. The beta distribution has parameters r and s,and is conjugate to the binomial. The probability of anyπk under the Beta(r,s)distribution is given byp(πk)=πr−1k(1−πk)s−1Γ(r+s).(3)We take r=αK ,1)=Γ(αΓ(1+αα,(4)exploiting the recursive definition of the gamma function.The effect of varying s is explored in Section4.The probability model we have defined isπk|α∼Beta(α6Z.Ghahramani,T.L.Griffiths,P.Sollich marginal probability of a binary matrix Z isP(Z)=KY k=1Z N Y i=1P(z ik|πk)!p(πk)dπk(5)=KY k=1B(m k+αB(αKΓ(m k+αΓ(N+1+αKK,(8)where the result follows from the fact that the expectation of a Beta(r,s)random variable is r/(r+s).Consequently,Eˆ1T Z1˜=KEˆ1T z k˜=Nα/(1+(α/K)). 
For any K,the expectation of the number of entries in Z is bounded above by Nα.3.2.Equivalence classesIn order tofind the limit of the distribution specified by Equation7as K→∞,we need to define equivalence classes of binary matrices.Our equivalence classes will be defined with respect to a function on binary matrices,lof(·).This function maps binary matrices to left-ordered binary matrices.lof(Z)is obtained by ordering the columns of the binary matrix Z from left to right by the magnitude of the binary number expressed by that column,taking thefirst row as the most significant bit. The left-ordering of a binary matrix is shown in Figure2.In thefirst row of the left-ordered matrix,the columns for which z1k=1are grouped at the left.In the second row,the columns for which z2k=1are grouped at the left of the sets for which z1k=1.This grouping structure persists throughout the matrix.The history of feature k at object i is defined to be(z1k,...,z(i−1)k).Where no object is specified,we will use history to refer to the full history of feature k,(z1k,...,z Nk).We will individuate the histories of features using the decimal equivalent of the binary numbers corresponding to the column entries.For example, at object3,features can have one of four histories:0,corresponding to a feature with no previous assignments,1,being a feature for which z2k=1but z1k=0,2,being a feature for which z1k=1but z2k=0,and3,being a feature possessed by both previous objects.K h will denote the number of features possessing the history h, with K0being the number of features for which m k=0and K+=P2N−1h=1K h being the number of features for which m k>0,so K=K0+K+.This method of denoting histories also facilitates the process of placing a binary matrix in left-ordered form, as it is used in the definition of lof(·).in [Z ]is reduced if Z contains identical columns,since some re-orderings of the columns of Z result in exactly the same matrix.Taking this into account,the cardinality of [Z ]is …K K 0...K 2N −1«=K !Q 2N −1h =0K h !K Y k =1αK )Γ(N −m k +1)K).(10)8Z.Ghahramani,T.L.Griffiths,P.Sollich In order to take the limit of this expression as K →∞,we will divide the columns of Z into two subsets,corresponding to the features for which m k =0and the features for which m k >0.Re-ordering the columns such that m k >0if k ≤K +,and m k =0otherwise,we can break the product in Equation 10into two parts,corresponding to these two subsets.The product thus becomesK Y k =1αK)Γ(N −m k +1)K)=…αK )Γ(N +1)K )«K −K +K +Y k =1αK )Γ(N −m k +1)K )(11)=…αK )Γ(N +1)K)«K K +Y k =1Γ(m k +αΓ(αQ N j =1(j +αK ”K +K +Y k =1(N −m k )!Q m k −1j =1(j +αN !.(13)Substituting Equation 13into Equation 10and rearranging terms,we can compute our limitlim K →∞αK +K 0!K K +· N !K )!K ·K +Y k =1(N −m k )!Q m k −1j =1(j +αN !=αK +N !,(14)where H N is the N th harmonic number,H N =P N j =11Bayesian nonparametric latent feature models9 arranged in a line.Thefirst customer starts at the left of the buffet and takes a serving from each dish,stopping after a Poisson(α)number of dishes as his plate becomes overburdened.The i th customer moves along the buffet,sampling dishes in proportion to their popularity,serving himself with probability m k/i,where m k is the number of previous customers who have sampled a dish.Having reached the end of all previous sampled dishes,the i th customer then tries a Poisson(α/i)number of new dishes.Dishes1011121314151617181920We can indicate which customers chose which dishes using a binary matrix Z with N rows and infinitely many 
columns,where z ik=1if the i th customer sampled the k th dish.Figure3shows a matrix generated using the IBP withα=10.The first customer tried17dishes.The second customer tried7of those dishes,and then tried3new dishes.The third customer tried3dishes tried by both previous customers,5dishes tried by only thefirst customer,and2new dishes.Vertically concatenating the choices of the customers produces the binary matrix shown in the figure.Using K(i)1to indicate the number of new dishes sampled by the i th customer, the probability of any particular matrix being produced by this process isP(Z)=αK+N!.(15)As can be seen from Figure3,the matrices produced by this process are generally not in left-ordered form.However,these matrices are also not ordered arbitrarily because the Poisson draws always result in choices of new dishes that are to the right of the previously sampled dishes.Customers are not exchangeable under this distribution,as the number of dishes counted as K(i)1depends upon the order in which the customers make their choices.However,if we only pay attention to the lof-equivalence classes of the matrices generated by this process,we obtain the exchangeable distribution P([Z])given by Equation14:(Q N i=1K(i)1!)/(Q2N−1h=1K h!) matrices generated via this process map to the same left-ordered form,and P([Z])is10Z.Ghahramani,T.L.Griffiths,P.Sollich obtained by multiplying P(Z)from Equation15by this quantity.It is also possible to define a similar sequential process that directly produces a distribution on left-ordered binary matrices in which customers are exchangeable,but this requires more effort on the part of the customers.We call this the exchangeable IBP(Griffiths and Ghahramani,2005).3.5.Some properties of this distributionThese different views of the distribution specified by Equation14make it straight-forward to derive some of its properties.First,the effective dimension of the model, K+,follows a Poisson(αH N)distribution.This is easily shown using the generative process described in previous Section,since under this process K+is the sum ofPoisson(α),Poisson(α3),etc.The sum of a set of Poisson distributionsis a Poisson distribution with parameter equal to the sum of the parameters of its ing the definition of the N th harmonic number,this isαH N.A second property of this distribution is that the number of features possessed by each object follows a Poisson(α)distribution.This also follows from the defi-nition of the IBP.Thefirst customer chooses a Poisson(α)number of dishes.By exchangeability,all other customers must also choose a Poisson(α)number of dishes, since we can always specify an ordering on customers which begins with a particular customer.Finally,it is possible to show that Z remains sparse as K→∞.The simplest way to do this is to exploit the previous result:if the number of features possessed by each object follows a Poisson(α)distribution,then the expected number of entries in Z is Nα.This is consistent with the quantity obtained by taking the limit of this expectation in thefinite model,which is given in Equation8:lim K→∞Eˆ1T Z1˜= lim K→∞NαK=Nα.More generally,we can use the property of sums of Poisson random variables described above to show that1T Z1will follow a Poisson(Nα) distribution.Consequently,the probability of values higher than the mean decreases exponentially.3.6.Inference by Gibbs samplingWe have defined a distribution over infinite binary matrices that satisfies one of our desiderata–objects(the rows of the matrix)are exchangeable under this 
dis-tribution.It remains to be shown that inference in infinite latent feature models is tractable,as was the case for infinite mixture models.We will derive a Gibbs sampler for latent feature models in which the exchangeable IBP is used as a prior. The critical quantity needed to define the sampling algorithm is the full conditional distributionP(z ik=1|Z−(ik),X)∝p(X|Z)P(z ik=1|Z−(ik)),(16) where Z−(ik)denotes the entries of Z other than z ik,and we are leaving aside the issue of the feature values V for the moment.The likelihood term,p(X|Z),relates the latent features to the observed data,and will depend on the model chosen for the observed data.The prior on Z contributes to this probability by specifying P(z ik=1|Z−(ik)).In thefinite model,where P(Z)is given by Equation7,it is straightforward to compute the full conditional distribution for any z ik.Integrating overπk givesP(z ik=1|z−i,k)=Z10P(z ik|πk)p(πk|z−i,k)dπk=m−i,k+αN+αwhere z−i,k is the set of assignments of other objects,not including i,for feature k, and m−i,k is the number of objects possessing feature k,not including i.We need only condition on z−i,k rather than Z−(ik)because the columns of the matrix are generated independently under this prior.In the infinite case,we can derive the conditional distribution from the exchange-able IBP.Choosing an ordering on objects such that the i th object corresponds to the last customer to visit the buffet,we obtainm−i,kP(z ik=1|z−i,k)=extreme cases is very useful,but in general it will be helpful to have a prior where the overall number of features used can be specified.The required generalization is simple:one takes r=(αβ)/K and s=βin Equation2.Settingβ=1then recovers the one-parameter IBP,but the calculations go through in basically the same way also for otherβ.Equation(7),the joint distribution of feature vectors forfinite K,becomesP(Z)=KY k=1B(m k+αβB(αβK)Γ(N−m k+β)K+β)Γ(αβΓ(αβQ h≥1K h!e−K+defined below.As the one-parameter model,this two-parameter model also has a sequential generative process.Again,we will use the Indian buffet analogy.Like before,the first customer starts at the left of the buffet and samples Poisson(α)dishes.The i th customer serves himself from any dish previously sampled by m k>0customers with probability m k/(β+i−1),and in addition from Poisson(αβ/(β+i−1))new dishes.The customer-dish matrix is a sample from this two-parameter IBP.Two other generative processes for this model are described in the Appendix.The parameterβis introduced above in such a way as to preserve the average number of features per object,α;this result follows from exchangeability and the fact that thefirst customer samples Poisson(α)dishes.Thus,also the average number of nonzero entries in Z remains Nα.More interesting is the expected value of the overall number of features,i.e.the number K+of k with m k>0.One gets directly from the buffet interpretation, or via any of the other routes,that the expected overall number of features isβ+i−1,and that the distribution of K+is Poisson with this mean.We can see from this that the total number of features used increases asβincreases, so we can interpretβas the feature repulsion,or1/βas the feature stickiness.In the limitβ→∞(forfixed N),K+→α,again as expected:in this limit features are infinitely sticky and all customers sample the same dishes as thefirst one.Forfiniteβ,one sees that the asymptotic behavior of K+∼αβln N,because in the relevant terms in the sum one can then approximateβ/(β+ i−1)≈β/i.Ifβ≫1,on the other hand,the logarithmic regime is 
preceded by linear growth at small N<β,during whichroughly the same number of 1s,the number of features used varies considerably.We can see that at small values of β,features are very sticky,and the feature vector variance is low across objects.Conversely,at high values of βthere is a high degree of feature repulsion,with the probability of two objects possessing the same feature α=10 β=0.20510150102030405060708090100Prior sample from IBP with α=10 β=1010203040500102030405060708090100Prior sample from IBP with α=10 β=502040608010012014016001020304050607080901005.AN ILLUSTRATIONThe Indian buffet process can be used as the basis of non-parametric Bayesian models in diverse ways.Different models can be obtained by combining the IBP prior over latent features with different generative distributions for the observed data,p (X |Z ).We illustrate this using a simple model in which real valued data X is assumed to be linearly generated from the latent features,with Gaussian noise.This linear-Gaussian model can be thought of as a version of factor analysis with binary,instead of Gaussian,latent factors,or as a factorial model (Zemel and Hinton,1994;Ghahramani 1995)with infinitely many factors.5.1.A linear Gaussian modelWe motivate the linear-Gaussian IBP model with a toy problem of modelling simple images (Griffiths and Ghahramani,2005;2006).In the model,greyscale images are generated by linearly superimposing different visual elements (objects)and adding Gaussian noise.Each image is composed of a vector of real-valued pixel intensities.The model assumes that there are some unknown number of visual elements and that each image is generated by choosing,for each visual element,whether the image possesses this element or not.The binary latent variable z ik indicates whether image i possesses visual element k .The goal of the modelling task is to discover both the identities and the number of visual elements from a set of observed images.We will start by describing a finite version of the simple linear-Gaussian model with binary latent features used here,and then consider the infinite limit.In the finite model,image i is represented by a D -dimensional vector of pixel intensities,x i which is assumed to be generated from a Gaussian distribution with mean z i Aand covariance matrix ΣX =σ2X I ,where z i is a K -dimensional binary vector,andA is a K ×D matrix of weights.In matrix notation,E [X ]=ZA .If Z is a featurematrix,this is a form of binary factor analysis.The distribution of X given Z,A, andσX is matrix Gaussian:p(X|Z,A,σX)=12σ2Xtr((X−ZA)T(X−ZA))}(22)where tr(·)is the trace of a matrix.We need to define a prior on A,which we also take to be matrix Gaussian:p(A|σA)=12σ2Atr(A T A)},(23)whereσA is a parameter setting the diffuseness of the prior.This prior is conjugate to the likelihood which makes it possible to integrate out the model parameters A.Using the approach outlined in Section3.6,it is possible to derive a Gibbs sampler for thisfinite model in which the parameters A remain marginalized out.To extend this to the infinite model with K→∞,we need to check that p(X|Z,σX,σA) remains well-defined if Z has an unbounded number of columns.This is indeed the case(Griffiths and Ghahramani,2005)and a Gibbs sampler can be defined for this model.We applied the Gibbs sampler for the infinite binary linear-Gaussian model to a simulated dataset,X,consisting of1006×6images.Each image,x i,was represented as a36-dimensional vector of pixel intensity values1.The images were generated from a representation with 
four latent features,corresponding to the image elements shown in Figure6(a).These image elements correspond to the rows of the matrix A in the model,specifying the pixel intensity values associated with each binary feature.The non-zero elements of A were set to1.0,and are indicated with white pixels in thefigure.A feature vector,z i,for each image was sampled from a distribution under which each feature was present with probability0.5.Each image was then generated from a Gaussian distribution with mean z i A and covariance σX I,whereσX=0.5.Some of these images are shown in Figure6(b),together with the feature vectors,z i,that were used to generate them.The Gibbs sampler was initialized with K+=1,choosing the feature assign-ments for thefirst column by setting z i1=1with probability0.5.σA,σX,andαwere initially set to1.0,and then sampled by adding Metropolis steps to the MCMC algorithm(see Gilks et al.,1996).Figure6shows trace plots for thefirst1000it-erations of MCMC for the log joint probability of the data and the latent features, log p(X,Z),the number of features used by at least one object,K+,and the model parametersσA,σX,andα.The algorithm reached relatively stable values for all of these quantities after approximately100iterations,and our remaining analyses will use only samples taken from that point forward.The latent feature representation discovered by the model was extremely con-sistent with that used to generate the data(Griffiths and Ghahramani,2005).The posterior mean of the feature weights,A,given X and Z isE[A|X,Z]=(Z T Z+σ2X1This simple toy example was inspired by the“shapes problem”in(Ghahramani,1995);a larger scale example with real images is presented in(Griffiths and Ghahramani,2006)features used to generate the data.(b)Sample images from the dataset.(c)Image elements corresponding to the four features possessed by the most ob-jects in the1000th iteration of MCMC.(d)Reconstructions of the images in(b)using the output of the algorithm.The lower portion of thefigure showstrace plots for the MCMC simulation,which are described in more detail inthe text.Figure6(c)shows the posterior mean of a k for the four most frequent features in the1000th sample produced by the algorithm,ordered to match the features shown in Figure6(a).These features pick out the image elements used in generating the data.Figure6(d)shows the feature vectors z i from this sample for the four images in Figure6(b),together with the posterior means of the reconstructions of these images for this sample,E[z i A|X,Z].Similar reconstructions are obtained by averaging over all values of Z produced by the Markov chain.The reconstructions provided by the model clearly pick out the relevant features,despite the high level of noise in the original images.6.APPLICATIONSWe now outlinefive applications of the IBP,each of which uses the same prior over infinite binary matrices,P(Z),but different choices for the likelihood relating latent features to observed data,p(X|Z).These applications will hopefully provide an indication for the potential uses of this distribution.。
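The Indian buffet process described in this excerpt has a simple generative reading: the first customer samples Poisson(α) dishes, and customer i samples each previously taken dish k with probability m_k/i and then Poisson(α/i) new dishes. A minimal NumPy sketch of sampling a binary feature matrix Z this way (the function name and interface are assumptions):

```python
import numpy as np

def sample_ibp(n_objects, alpha, seed=None):
    """Sample Z (n_objects x K+) from the one-parameter IBP generative process."""
    rng = np.random.default_rng(seed)
    dish_counts = []                     # m_k for each dish sampled so far
    rows = []
    for i in range(1, n_objects + 1):
        row = [int(rng.random() < m / i) for m in dish_counts]   # existing dishes
        for k, took in enumerate(row):
            dish_counts[k] += took
        n_new = rng.poisson(alpha / i)                           # new dishes
        row.extend([1] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    K = len(dish_counts)
    Z = np.zeros((n_objects, K), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z
```

Consistent with the properties stated above, each row of Z has Poisson(α) ones on average, and the number of non-empty columns K+ is Poisson-distributed with mean α·H_N.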
离散数学中英文名词对照表
离散数学中英⽂名词对照表离散数学中英⽂名词对照表外⽂中⽂AAbel category Abel 范畴Abel group (commutative group) Abel 群(交换群)Abel semigroup Abel 半群accessibility relation 可达关系action 作⽤addition principle 加法原理adequate set of connectives 联结词的功能完备(全)集adjacent 相邻(邻接)adjacent matrix 邻接矩阵adjugate 伴随adjunction 接合affine plane 仿射平⾯algebraic closed field 代数闭域algebraic element 代数元素algebraic extension 代数扩域(代数扩张)almost equivalent ⼏乎相等的alternating group 三次交代群annihilator 零化⼦antecedent 前件anti symmetry 反对称性anti-isomorphism 反同构arboricity 荫度arc set 弧集arity 元数arrangement problem 布置问题associate 相伴元associative algebra 结合代数associator 结合⼦asymmetric 不对称的(⾮对称的)atom 原⼦atomic formula 原⼦公式augmenting digeon hole principle 加强的鸽⼦笼原理augmenting path 可增路automorphism ⾃同构automorphism group of graph 图的⾃同构群auxiliary symbol 辅助符号axiom of choice 选择公理axiom of equality 相等公理axiom of extensionality 外延公式axiom of infinity ⽆穷公理axiom of pairs 配对公理axiom of regularity 正则公理axiom of replacement for the formula Ф关于公式Ф的替换公式axiom of the empty set 空集存在公理axiom of union 并集公理Bbalanced imcomplete block design 平衡不完全区组设计barber paradox 理发师悖论base 基Bell number Bell 数Bernoulli number Bernoulli 数Berry paradox Berry 悖论bijective 双射bi-mdule 双模binary relation ⼆元关系binary symmetric channel ⼆进制对称信道binomial coefficient ⼆项式系数binomial theorem ⼆项式定理binomial transform ⼆项式变换bipartite graph ⼆分图block 块block 块图(区组)block code 分组码block design 区组设计Bondy theorem Bondy 定理Boole algebra Boole 代数Boole function Boole 函数Boole homomorophism Boole 同态Boole lattice Boole 格bound occurrence 约束出现bound variable 约束变量bounded lattice 有界格bridge 桥Bruijn theorem Bruijn 定理Burali-Forti paradox Burali-Forti 悖论Burnside lemma Burnside 引理Ccage 笼canonical epimorphism 标准满态射Cantor conjecture Cantor 猜想Cantor diagonal method Cantor 对⾓线法Cantor paradox Cantor 悖论cardinal number 基数Cartesion product of graph 图的笛卡⼉积Catalan number Catalan 数category 范畴Cayley graph Cayley 图Cayley theorem Cayley 定理center 中⼼characteristic function 特征函数characteristic of ring 环的特征characteristic polynomial 特征多项式check digits 校验位Chinese postman problem 中国邮递员问题chromatic number ⾊数chromatic polynomial ⾊多项式circuit 回路circulant graph 循环图circumference 周长class 类classical completeness 古典完全的classical consistent 古典相容的clique 团clique number 团数closed term 闭项closure 闭包closure of graph 图的闭包code 码code element 码元code length 码长code rate 码率code word 码字coefficient 系数coimage 上象co-kernal 上核coloring 着⾊coloring problem 着⾊问题combination number 组合数combination with repetation 可重组合common factor 公因⼦commutative diagram 交换图commutative ring 交换环commutative seimgroup 交换半群complement 补图(⼦图的余) complement element 补元complemented lattice 有补格complete bipartite graph 完全⼆分图complete graph 完全图complete k-partite graph 完全k-分图complete lattice 完全格composite 复合composite operation 复合运算composition (molecular proposition) 复合(分⼦)命题composition of graph (lexicographic product)图的合成(字典积)concatenation (juxtaposition) 邻接运算concatenation graph 连通图congruence relation 同余关系conjunctive normal form 正则合取范式connected component 连通分⽀connective 连接的connectivity 连通度consequence 推论(后承)consistent (non-contradiction) 相容性(⽆⽭盾性)continuum 连续统contraction of graph 图的收缩contradiction ⽭盾式(永假式)contravariant functor 反变函⼦coproduct 上积corank 余秩correct error 纠正错误corresponding universal map 对应的通⽤映射countably infinite set 可列⽆限集(可列集)covariant functor (共变)函⼦covering 覆盖covering number 覆盖数Coxeter graph Coxeter 图crossing number of graph 图的叉数cuset 陪集cotree 余树cut edge 割边cut vertex 割点cycle 圈cycle basis 圈基cycle matrix 圈矩阵cycle rank 圈秩cycle space 圈空间cycle vector 圈向量cyclic group 循环群cyclic index 循环(轮转)指标cyclic monoid 循环单元半群cyclic permutation 圆圈排列cyclic semigroup 循环半群DDe Morgan law De 
Morgan 律decision procedure 判决过程decoding table 译码表deduction theorem 演绎定理degree 次数,次(度)degree sequence 次(度)序列derivation algebra 微分代数Descartes product Descartes 积designated truth value 特指真值detect errer 检验错误deterministic 确定的diagonal functor 对⾓线函⼦diameter 直径digraph 有向图dilemma ⼆难推理direct consequence 直接推论(直接后承)direct limit 正向极限direct sum 直和directed by inclution 被包含关系定向discrete Fourier transform 离散 Fourier 变换disjunctive normal form 正则析取范式disjunctive syllogism 选⾔三段论distance 距离distance transitive graph 距离传递图distinguished element 特异元distributive lattice 分配格divisibility 整除division subring ⼦除环divison ring 除环divisor (factor) 因⼦domain 定义域Driac condition Dirac 条件dual category 对偶范畴dual form 对偶式dual graph 对偶图dual principle 对偶原则(对偶原理) dual statement 对偶命题dummy variable 哑变量(哑变元)Eeccentricity 离⼼率edge chromatic number 边⾊数edge coloring 边着⾊edge connectivity 边连通度edge covering 边覆盖edge covering number 边覆盖数edge cut 边割集edge set 边集edge-independence number 边独⽴数eigenvalue of graph 图的特征值elementary divisor ideal 初等因⼦理想elementary product 初等积elementary sum 初等和empty graph 空图empty relation 空关系empty set 空集endomorphism ⾃同态endpoint 端点enumeration function 计数函数epimorphism 满态射equipotent 等势equivalent category 等价范畴equivalent class 等价类equivalent matrix 等价矩阵equivalent object 等价对象equivalent relation 等价关系error function 错误函数error pattern 错误模式Euclid algorithm 欧⼏⾥德算法Euclid domain 欧⽒整环Euler characteristic Euler 特征Euler function Euler 函数Euler graph Euler 图Euler number Euler 数Euler polyhedron formula Euler 多⾯体公式Euler tour Euler 闭迹Euler trail Euler 迹existential generalization 存在推⼴规则existential quantifier 存在量词existential specification 存在特指规则extended Fibonacci number ⼴义 Fibonacci 数extended Lucas number ⼴义Lucas 数extension 扩充(扩张)extension field 扩域extension graph 扩图exterior algebra 外代数Fface ⾯factor 因⼦factorable 可因⼦化的factorization 因⼦分解faithful (full) functor 忠实(完满)函⼦Ferrers graph Ferrers 图Fibonacci number Fibonacci 数field 域filter 滤⼦finite extension 有限扩域finite field (Galois field ) 有限域(Galois 域)finite dimensional associative division algebra有限维结合可除代数finite set 有限(穷)集finitely generated module 有限⽣成模first order theory with equality 带符号的⼀阶系统five-color theorem 五⾊定理five-time-repetition 五倍重复码fixed point 不动点forest 森林forgetful functor 忘却函⼦four-color theorem(conjecture) 四⾊定理(猜想)F-reduced product F-归纳积free element ⾃由元free monoid ⾃由单元半群free occurrence ⾃由出现free R-module ⾃由R-模free variable ⾃由变元free-?-algebra ⾃由?代数function scheme 映射格式GGalileo paradox Galileo 悖论Gauss coefficient Gauss 系数GBN (G?del-Bernays-von Neumann system)GBN系统generalized petersen graph ⼴义 petersen 图generating function ⽣成函数generating procedure ⽣成过程generator ⽣成⼦(⽣成元)generator matrix ⽣成矩阵genus 亏格girth (腰)围长G?del completeness theorem G?del 完全性定理golden section number 黄⾦分割数(黄⾦分割率)graceful graph 优美图graceful tree conjecture 优美树猜想graph 图graph of first class for edge coloring 第⼀类边⾊图graph of second class for edge coloring 第⼆类边⾊图graph rank 图秩graph sequence 图序列greatest common factor 最⼤公因⼦greatest element 最⼤元(素)Grelling paradox Grelling 悖论Gr?tzsch graph Gr?tzsch 图group 群group code 群码group of graph 图的群HHajós conjecture Hajós 猜想Hamilton cycle Hamilton 圈Hamilton graph Hamilton 图Hamilton path Hamilton 路Harary graph Harary 图Hasse graph Hasse 图Heawood graph Heawood 图Herschel graph Herschel 图hom functor hom 函⼦homemorphism 图的同胚homomorphism 同态(同态映射)homomorphism of graph 图的同态hyperoctahedron 超⼋⾯体图hypothelical syllogism 假⾔三段论hypothese (premise) 假设(前提)Iideal 理想identity 单位元identity natural transformation 恒等⾃然变换imbedding 嵌⼊immediate predcessor 直接先⾏immediate successor 直接后继incident 关联incident axiom 关联公理incident matrix 
关联矩阵inclusion and exclusion principle 包含与排斥原理inclusion relation 包含关系indegree ⼊次(⼊度)independent 独⽴的independent number 独⽴数independent set 独⽴集independent transcendental element 独⽴超越元素index 指数individual variable 个体变元induced subgraph 导出⼦图infinite extension ⽆限扩域infinite group ⽆限群infinite set ⽆限(穷)集initial endpoint 始端initial object 初始对象injection 单射injection functor 单射函⼦injective (one to one mapping) 单射(内射)inner face 内⾯inner neighbour set 内(⼊)邻集integral domain 整环integral subdomain ⼦整环internal direct sum 内直和intersection 交集intersection of graph 图的交intersection operation 交运算interval 区间invariant factor 不变因⼦invariant factor ideal 不变因⼦理想inverse limit 逆向极限inverse morphism 逆态射inverse natural transformation 逆⾃然变换inverse operation 逆运算inverse relation 逆关系inversion 反演isomorphic category 同构范畴isomorphism 同构态射isomorphism of graph 图的同构join of graph 图的联JJordan algebra Jordan 代数Jordan product (anti-commutator) Jordan乘积(反交换⼦)Jordan sieve formula Jordan 筛法公式j-skew j-斜元juxtaposition 邻接乘法Kk-chromatic graph k-⾊图k-connected graph k-连通图k-critical graph k-⾊临界图k-edge chromatic graph k-边⾊图k-edge-connected graph k-边连通图k-edge-critical graph k-边临界图kernel 核Kirkman schoolgirl problem Kirkman ⼥⽣问题Kuratowski theorem Kuratowski 定理Llabeled graph 有标号图Lah number Lah 数Latin rectangle Latin 矩形Latin square Latin ⽅lattice 格lattice homomorphism 格同态law 规律leader cuset 陪集头least element 最⼩元least upper bound 上确界(最⼩上界)left (right) identity 左(右)单位元left (right) invertible element 左(右)可逆元left (right) module 左(右)模left (right) zero 左(右)零元left (right) zero divisor 左(右)零因⼦left adjoint functor 左伴随函⼦left cancellable 左可消的left coset 左陪集length 长度Lie algebra Lie 代数line- group 图的线群logically equivanlent 逻辑等价logically implies 逻辑蕴涵logically valid 逻辑有效的(普效的)loop 环Lucas number Lucas 数Mmagic 幻⽅many valued proposition logic 多值命题逻辑matching 匹配mathematical structure 数学结构matrix representation 矩阵表⽰maximal element 极⼤元maximal ideal 极⼤理想maximal outerplanar graph 极⼤外平⾯图maximal planar graph 极⼤平⾯图maximum matching 最⼤匹配maxterm 极⼤项(基本析取式)maxterm normal form(conjunctive normal form) 极⼤项范式(合取范式)McGee graph McGee 图meet 交Menger theorem Menger 定理Meredith graph Meredith 图message word 信息字mini term 极⼩项minimal κ-connected graph 极⼩κ-连通图minimal polynomial 极⼩多项式Minimanoff paradox Minimanoff 悖论minimum distance 最⼩距离Minkowski sum Minkowski 和minterm (fundamental conjunctive form) 极⼩项(基本合取式)minterm normal form(disjunctive normal form)极⼩项范式(析取范式)M?bius function M?bius 函数M?bius ladder M?bius 梯M?bius transform (inversion) M?bius 变换(反演)modal logic 模态逻辑model 模型module homomorphism 模同态(R-同态)modus ponens 分离规则modus tollens 否定后件式module isomorphism 模同构monic morphism 单同态monoid 单元半群monomorphism 单态射morphism (arrow) 态射(箭)M?bius function M?bius 函数M?bius ladder M?bius 梯M?bius transform (inversion) M?bius 变换(反演)multigraph 多重图multinomial coefficient 多项式系数multinomial expansion theorem 多项式展开定理multiple-error-correcting code 纠多错码multiplication principle 乘法原理mutually orthogonal Latin square 相互正交拉丁⽅Nn-ary operation n-元运算n-ary product n-元积natural deduction system ⾃然推理系统natural isomorphism ⾃然同构natural transformation ⾃然变换neighbour set 邻集next state 下⼀个状态next state transition function 状态转移函数non-associative algebra ⾮结合代数non-standard logic ⾮标准逻辑Norlund formula Norlund 公式normal form 正规形normal model 标准模型normal subgroup (invariant subgroup) 正规⼦群(不变⼦群)n-relation n-元关系null object 零对象nullary operation 零元运算Oobject 对象orbit 轨道order 阶order ideal 阶理想Ore condition Ore 条件orientation 定向orthogonal Latin square 正交拉丁⽅orthogonal layout 正交表outarc 出弧outdegree 出次(出度)outer face 外⾯outer neighbour 外(出)邻集outerneighbour set 出(外)邻集outerplanar graph 
外平⾯图Ppancycle graph 泛圈图parallelism 平⾏parallelism class 平⾏类parity-check code 奇偶校验码parity-check equation 奇偶校验⽅程parity-check machine 奇偶校验器parity-check matrix 奇偶校验矩阵partial function 偏函数partial ordering (partial relation) 偏序关系partial order relation 偏序关系partial order set (poset) 偏序集partition 划分,分划,分拆partition number of integer 整数的分拆数partition number of set 集合的划分数Pascal formula Pascal 公式path 路perfect code 完全码perfect t-error-correcting code 完全纠-错码perfect graph 完美图permutation 排列(置换)permutation group 置换群permutation with repetation 可重排列Petersen graph Petersen 图p-graph p-图Pierce arrow Pierce 箭pigeonhole principle 鸽⼦笼原理planar graph (可)平⾯图plane graph 平⾯图Pólya theorem Pólya 定理polynomail 多项式polynomial code 多项式码polynomial representation 多项式表⽰法polynomial ring 多项式环possible world 可能世界power functor 幂函⼦power of graph 图的幂power set 幂集predicate 谓词prenex normal form 前束范式pre-ordered set 拟序集primary cycle module 准素循环模prime field 素域prime to each other 互素primitive connective 初始联结词primitive element 本原元primitive polynomial 本原多项式principal ideal 主理想principal ideal domain 主理想整环principal of duality 对偶原理principal of redundancy 冗余性原则product 积product category 积范畴product-sum form 积和式proof (deduction) 证明(演绎)proper coloring 正常着⾊proper factor 真正因⼦proper filter 真滤⼦proper subgroup 真⼦群properly inclusive relation 真包含关系proposition 命题propositional constant 命题常量propositional formula(well-formed formula,wff)命题形式(合式公式)propositional function 命题函数propositional variable 命题变量pullback 拉回(回拖) pushout 推出Qquantification theory 量词理论quantifier 量词quasi order relation 拟序关系quaternion 四元数quotient (difference) algebra 商(差)代数quotient algebra 商代数quotient field (field of fraction) 商域(分式域)quotient group 商群quotient module 商模quotient ring (difference ring , residue ring) 商环(差环,同余类环)quotient set 商集RRamsey graph Ramsey 图Ramsey number Ramsey 数Ramsey theorem Ramsey 定理range 值域rank 秩reconstruction conjecture 重构猜想redundant digits 冗余位reflexive ⾃反的regular graph 正则图regular representation 正则表⽰relation matrix 关系矩阵replacement theorem 替换定理representation 表⽰representation functor 可表⽰函⼦restricted proposition form 受限命题形式restriction 限制retraction 收缩Richard paradox Richard 悖论right adjoint functor 右伴随函⼦right cancellable 右可消的right factor 右因⼦right zero divison 右零因⼦ring 环ring of endomorphism ⾃同态环ring with unity element 有单元的环R-linear independence R-线性⽆关root field 根域rule of inference 推理规则Russell paradox Russell 悖论Ssatisfiable 可满⾜的saturated 饱和的scope 辖域section 截⼝self-complement graph ⾃补图semantical completeness 语义完全的(弱完全的)semantical consistent 语义相容semigroup 半群separable element 可分元separable extension 可分扩域sequent ⽮列式sequential 序列的Sheffer stroke Sheffer 竖(谢弗竖)simple algebraic extension 单代数扩域simple extension 单扩域simple graph 简单图simple proposition (atomic proposition) 简单(原⼦)命题simple transcental extension 单超越扩域simplication 简化规则slope 斜率small category ⼩范畴smallest element 最⼩元(素)Socrates argument Socrates 论断(苏格拉底论断)soundness (validity) theorem 可靠性(有效性)定理spanning subgraph ⽣成⼦图spanning tree ⽣成树spectra of graph 图的谱spetral radius 谱半径splitting field 分裂域standard model 标准模型standard monomil 标准单项式Steiner triple Steiner 三元系⼤集Stirling number Stirling 数Stirling transform Stirling 变换subalgebra ⼦代数subcategory ⼦范畴subdirect product ⼦直积subdivison of graph 图的细分subfield ⼦域subformula ⼦公式subdivision of graph 图的细分subgraph ⼦图subgroup ⼦群sub-module ⼦模subrelation ⼦关系subring ⼦环sub-semigroup ⼦半群subset ⼦集substitution theorem 代⼊定理substraction 差集substraction operation 差运算succedent 后件surjection (surjective) 满射switching-network 开关⽹络Sylvester formula Sylvester公式symmetric 对称的symmetric difference 对称差symmetric graph 对称图symmetric group 
对称群syndrome 校验⼦syntactical completeness 语法完全的(强完全的)Syntactical consistent 语法相容system ?3 , ?n , ??0 , ??系统?3 , ?n , ??0 , ??system L 公理系统 Lsystem ?公理系统?system L1 公理系统 L1system L2 公理系统 L2system L3 公理系统 L3system L4 公理系统 L4system L5 公理系统 L5system L6 公理系统 L6system ?n 公理系统?nsystem of modal prepositional logic 模态命题逻辑系统system Pm 系统 Pmsystem S1 公理系统 S1system T (system M) 公理系统 T(系统M)Ttautology 重⾔式(永真公式)technique of truth table 真值表技术term 项terminal endpoint 终端terminal object 终结对象t-error-correcing BCH code 纠 t -错BCH码theorem (provable formal) 定理(可证公式)thickess 厚度timed sequence 时间序列torsion 扭元torsion module 扭模total chromatic number 全⾊数total chromatic number conjecture 全⾊数猜想total coloring 全着⾊total graph 全图total matrix ring 全⽅阵环total order set 全序集total permutation 全排列total relation 全关系tournament 竞赛图trace (trail) 迹tranformation group 变换群transcendental element 超越元素transitive 传递的tranverse design 横截设计traveling saleman problem 旅⾏商问题tree 树triple system 三元系triple-repetition code 三倍重复码trivial graph 平凡图trivial subgroup 平凡⼦群true in an interpretation 解释真truth table 真值表truth value function 真值函数Turán graph Turán 图Turán theorem Turán 定理Tutte graph Tutte 图Tutte theorem Tutte 定理Tutte-coxeter graph Tutte-coxeter 图UUlam conjecture Ulam 猜想ultrafilter 超滤⼦ultrapower 超幂ultraproduct 超积unary operation ⼀元运算unary relation ⼀元关系underlying graph 基础图undesignated truth value ⾮特指值undirected graph ⽆向图union 并(并集)union of graph 图的并union operation 并运算unique factorization 唯⼀分解unique factorization domain (Gauss domain) 唯⼀分解整域unique k-colorable graph 唯⼀k着⾊unit ideal 单位理想unity element 单元universal 全集universal algebra 泛代数(Ω代数)universal closure 全称闭包universal construction 通⽤结构universal enveloping algebra 通⽤包络代数universal generalization 全称推⼴规则universal quantifier 全称量词universal specification 全称特指规则universal upper bound 泛上界unlabeled graph ⽆标号图untorsion ⽆扭模upper (lower) bound 上(下)界useful equivalent 常⽤等值式useless code 废码字Vvalence 价valuation 赋值Vandermonde formula Vandermonde 公式variery 簇Venn graph Venn 图vertex cover 点覆盖vertex set 点割集vertex transitive graph 点传递图Vizing theorem Vizing 定理Wwalk 通道weakly antisymmetric 弱反对称的weight 重(权)weighted form for Burnside lemma 带权形式的Burnside引理well-formed formula (wff) 合式公式(wff) word 字Zzero divison 零因⼦zero element (universal lower bound) 零元(泛下界)ZFC (Zermelo-Fraenkel-Cohen) system ZFC系统form)normal(Skolemformnormalprenex-存在正则前束范式(Skolem 正则范式)3-value proposition logic 三值命题逻辑。
Discriminatively Trained Sparse Code Gradients for Contour Detection
Discriminatively Trained Sparse Code Gradientsfor Contour DetectionXiaofeng Ren and Liefeng BoIntel Science and Technology Center for Pervasive Computing,Intel LabsSeattle,W A98195,USA{xiaofeng.ren,liefeng.bo}@AbstractFinding contours in natural images is a fundamental problem that serves as thebasis of many tasks such as image segmentation and object recognition.At thecore of contour detection technologies are a set of hand-designed gradient fea-tures,used by most approaches including the state-of-the-art Global Pb(gPb)operator.In this work,we show that contour detection accuracy can be signif-icantly improved by computing Sparse Code Gradients(SCG),which measurecontrast using patch representations automatically learned through sparse coding.We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for com-puting sparse codes on oriented local neighborhoods,and apply multi-scale pool-ing and power transforms before classifying them with linear SVMs.By extract-ing rich representations from pixels and avoiding collapsing them prematurely,Sparse Code Gradients effectively learn how to measure local contrasts andfindcontours.We improve the F-measure metric on the BSDS500benchmark to0.74(up from0.71of gPb contours).Moreover,our learning approach can easily adaptto novel sensor data such as Kinect-style RGB-D cameras:Sparse Code Gradi-ents on depth maps and surface normals lead to promising contour detection usingdepth and depth+color,as verified on the NYU Depth Dataset.1IntroductionContour detection is a fundamental problem in vision.Accuratelyfinding both object boundaries and interior contours has far reaching implications for many vision tasks including segmentation,recog-nition and scene understanding.High-quality image segmentation has increasingly been relying on contour analysis,such as in the widely used system of Global Pb[2].Contours and segmentations have also seen extensive uses in shape matching and object recognition[8,9].Accuratelyfinding contours in natural images is a challenging problem and has been extensively studied.With the availability of datasets with human-marked groundtruth contours,a variety of approaches have been proposed and evaluated(see a summary in[2]),such as learning to clas-sify[17,20,16],contour grouping[23,31,12],multi-scale features[21,2],and hierarchical region analysis[2].Most of these approaches have one thing in common[17,23,31,21,12,2]:they are built on top of a set of gradient features[17]measuring local contrast of oriented discs,using chi-square distances of histograms of color and textons.Despite various efforts to use generic image features[5]or learn them[16],these hand-designed gradients are still widely used after a decade and support top-ranking algorithms on the Berkeley benchmarks[2].In this work,we demonstrate that contour detection can be vastly improved by replacing the hand-designed Pb gradients of[17]with rich representations that are automatically learned from data. 
We use sparse coding, in particular Orthogonal Matching Pursuit [18] and K-SVD [1], to learn such representations on patches. Instead of a direct classification of patches [16], the sparse codes on the pixels are pooled over multi-scale half-discs for each orientation, in the spirit of the Pb gradients, before being classified with a linear SVM. The SVM outputs are then smoothed and non-max suppressed over orientations, as commonly done, to produce the final contours (see Fig. 1).

[Figure 1 (pipeline: image patch (gray, ab) and optional depth patch (depth, surface normal) -> local sparse coding -> per-pixel sparse codes -> multi-scale pooling -> oriented gradients -> power transforms -> linear SVMs -> RGB-(D) contours): We combine sparse coding and oriented gradients for contour analysis on color as well as depth images. Sparse coding automatically learns a rich representation of patches from data. With multi-scale pooling, oriented gradients efficiently capture local contrast and lead to much more accurate contour detection than those using hand-designed features including Global Pb (gPb) [2].]

Our sparse code gradients (SCG) are much more effective in capturing local contour contrast than existing features. By only changing local features and keeping the smoothing and globalization parts fixed, we improve the F-measure on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb), a substantial step toward human-level accuracy (see the precision-recall curves in Fig. 4). Large improvements in accuracy are also observed on other datasets including MSRC2 and PASCAL2008. Moreover, our approach is built on unsupervised feature learning and can directly apply to novel sensor data such as RGB-D images from Kinect-style depth cameras. Using the NYU Depth dataset [27], we verify that our SCG approach combines the strengths of color and depth contour detection and outperforms an adaptation of gPb to RGB-D by a large margin.

2 Related Work

Contour detection has a long history in computer vision as a fundamental building block. Modern approaches to contour detection are evaluated on datasets of natural images against human-marked groundtruth. The Pb work of Martin et al. [17] combined a set of gradient features, using brightness, color and textons, to outperform the Canny edge detector on the Berkeley Benchmark (BSDS). Multi-scale versions of Pb were developed and found beneficial [21, 2]. Building on top of the Pb gradients, many approaches studied the globalization aspects, i.e. moving beyond local classification and enforcing consistency and continuity of contours. Ren et al. developed CRF models on superpixels to learn junction types [23]. Zhu et al. used circular embedding to enforce orderings of edgels [31]. The gPb work of Arbelaez et al. computed gradients on eigenvectors of the affinity graph and combined them with local cues [2]. In addition to Pb gradients, Dollar et al. [5] learned boosted trees on generic features such as gradients and Haar wavelets, Kokkinos used SIFT features on edgels [12], and Prasad et al. [20] used raw pixels in class-specific settings. One closely related work was the discriminative sparse models of Mairal et al. [16], which used K-SVD to represent multi-scale patches and had moderate success on the BSDS. A major difference of our work is the use of oriented gradients: comparing to directly classifying a patch, measuring contrast between oriented half-discs is a much easier problem and can be effectively learned.

Sparse coding represents a signal by reconstructing it using a small set of basis functions. It has seen wide uses in vision, for example for faces [28] and recognition
[29]. Similar to deep network approaches [11, 14], recent works tried to avoid feature engineering and employed sparse coding of image patches to learn features from "scratch", for texture analysis [15] and object recognition [30, 3]. In particular, Orthogonal Matching Pursuit [18] is a greedy algorithm that incrementally finds sparse codes, and K-SVD is also efficient and popular for dictionary learning. Closely related to our work but on the different problem of recognition, Bo et al. used matching pursuit and K-SVD to learn features in a coding hierarchy [3] and are extending their approach to RGB-D data [4].

Thanks to the mass production of Kinect, active RGB-D cameras became affordable and were quickly adopted in vision research and applications. The Kinect pose estimation of Shotton et al. used random forests to learn from a huge amount of data [25]. Henry et al. used RGB-D cameras to scan large environments into 3D models [10]. RGB-D data were also studied in the context of object recognition [13] and scene labeling [27, 22]. In-depth studies of contour and segmentation problems for depth data are much in need given the fast growing interests in RGB-D perception.

3 Contour Detection using Sparse Code Gradients

We start by examining the processing pipeline of Global Pb (gPb) [2], a highly influential and widely used system for contour detection. The gPb contour detection has two stages: local contrast estimation at multiple scales, and globalization of the local cues using spectral grouping. The core of the approach lies within its use of local cues in oriented gradients. Originally developed in [17], this set of features uses relatively simple pixel representations (histograms of brightness, color and textons) and similarity functions (chi-square distance, manually chosen), comparing to recent advances in using rich representations for high-level recognition (e.g. [11, 29, 30, 3]). We set out to show that both the pixel representation and the aggregation of pixel information in local neighborhoods can be much improved and, to a large extent, learned from and adapted to input data.
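For concreteness, the hand-designed gPb cue that SCG replaces can be summarized in a few lines: for one channel and one orientation, it bins the pixel values of two oriented half-discs into histograms and measures their chi-square distance. The sketch below is a minimal, unoptimized illustration of that idea and is not the actual gPb code; the function names, the way the disc is split by a line at angle theta, and the assumption that the channel is scaled to [0, 1] are all ours.

```python
import numpy as np

def chi_square(g, h, eps=1e-10):
    """Chi-square distance between two normalized histograms (the Pb similarity function)."""
    return 0.5 * np.sum((g - h) ** 2 / (g + h + eps))

def pb_oriented_gradient(channel, center, radius, theta, n_bins=32):
    """Pb-style local contrast at `center`: compare value histograms of the two
    half-discs obtained by cutting a disc of `radius` with a diameter at angle theta.
    `channel` is assumed to be a 2-D array with values scaled to [0, 1]."""
    h, w = channel.shape
    cy, cx = center
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    in_disc = ys ** 2 + xs ** 2 <= radius ** 2
    side = np.cos(theta) * ys - np.sin(theta) * xs          # sign picks the half-disc
    yy, xx = cy + ys, cx + xs
    valid = in_disc & (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
    vals, signs = channel[yy[valid], xx[valid]], side[valid]
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ga, _ = np.histogram(vals[signs > 0], bins=bins)
    gb, _ = np.histogram(vals[signs <= 0], bins=bins)
    ga = ga / max(ga.sum(), 1)                               # normalize to probability histograms
    gb = gb / max(gb.sum(), 1)
    return chi_square(ga, gb)
```

SCG keeps the same half-disc geometry but swaps the hand-designed histogram and chi-square pieces for learned sparse-code pooling and a trained linear SVM.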
For pixel representation, in Section 3.1 we show how to use Orthogonal Matching Pursuit [18] and K-SVD [1], efficient sparse coding and dictionary learning algorithms that readily apply to low-level vision, to extract sparse codes at every pixel. This sparse coding approach can be viewed similar in spirit to the use of filterbanks but avoids manual choices and thus directly applies to the RGB-D data from Kinect. We show learned dictionaries for a number of channels that exhibit different characteristics: grayscale/luminance, chromaticity (ab), depth, and surface normal. In Section 3.2 we show how the pixel-level sparse codes can be integrated through multi-scale pooling into a rich representation of oriented local neighborhoods. By computing oriented gradients on this high dimensional representation and using a double power transform to code the features for linear classification, we show a linear SVM can be efficiently and effectively trained for each orientation to classify contour vs non-contour, yielding local contrast estimates that are much more accurate than the hand-designed features in gPb.

3.1 Local Sparse Representation of RGB-(D) Patches

K-SVD and Orthogonal Matching Pursuit. K-SVD [1] is a popular dictionary learning algorithm that generalizes K-Means and learns dictionaries of codewords from unsupervised data. Given a set of image patches $Y = [y_1, \cdots, y_n]$, K-SVD jointly finds a dictionary $D = [d_1, \cdots, d_m]$ and an associated sparse code matrix $X = [x_1, \cdots, x_n]$ by minimizing the reconstruction error

$$\min_{D,X} \ \|Y - DX\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|x_i\|_0 \le K;\ \ \forall j,\ \|d_j\|_2 = 1 \qquad (1)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $x_i$ are the columns of $X$, the zero-norm $\|\cdot\|_0$ counts the non-zero entries in the sparse code $x_i$, and $K$ is a predefined sparsity level (number of non-zero entries). This optimization can be solved in an alternating manner. Given the dictionary $D$, optimizing the sparse code matrix $X$ can be decoupled into sub-problems, each solved with Orthogonal Matching Pursuit (OMP) [18], a greedy algorithm for finding sparse codes. Given the codes $X$, the dictionary $D$ and its associated sparse coefficients are updated sequentially by singular value decomposition. For our purpose of representing local patches, the dictionary $D$ has a small size (we use 75 for 5x5 patches) and does not require a lot of sample patches, and it can be learned in a matter of minutes.
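Equation (1) and the alternating OMP / SVD updates can be prototyped compactly. The sketch below is a minimal NumPy version of OMP and of the K-SVD loop described above, operating on mean-subtracted 5x5 patches stacked as columns of Y. It is written for clarity, not speed (the paper uses a batch OMP [24] and convolution for the per-pixel coding), and the helper names, the random-column initialization, and the assumption that there are at least `n_atoms` patches are our own choices.

```python
import numpy as np

def omp(D, y, K):
    """Orthogonal Matching Pursuit: greedily select at most K atoms of D
    (unit-norm columns) and least-squares fit y on the selected atoms."""
    x, idx, residual = np.zeros(D.shape[1]), [], y.astype(float).copy()
    for _ in range(K):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in idx:
            break
        idx.append(j)
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x[idx] = coef
    return x

def ksvd(Y, n_atoms=75, K=2, n_iter=15, seed=0):
    """Minimal K-SVD for eq. (1): alternate OMP coding with rank-1 SVD atom updates.
    Y is (patch_dim, n_patches), assumed to have at least n_atoms columns."""
    rng = np.random.default_rng(seed)
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0) + 1e-12
    for _ in range(n_iter):
        X = np.column_stack([omp(D, Y[:, i], K) for i in range(Y.shape[1])])
        for j in range(n_atoms):
            users = np.flatnonzero(X[j])
            if users.size == 0:
                continue
            X[j, users] = 0.0                       # residual with atom j removed
            E = Y[:, users] - D @ X[:, users]
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]                       # new unit-norm atom
            X[j, users] = s[0] * Vt[0]
    return D, X
```

With 75 atoms, a small K, and a few tens of thousands of sampled patches per channel, this mirrors the setup described above; per-pixel codes are then obtained by running the coding step on the 5x5 neighborhood of every pixel (the paper's batch OMP makes that take only a second or two per image).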
Once the dictionary $D$ is learned, we again use the Orthogonal Matching Pursuit (OMP) algorithm to compute sparse codes at every pixel. This can be efficiently done with convolution and a batch version of the OMP algorithm [24]. For a typical BSDS image of resolution 321x481, the sparse code extraction is efficient and takes 1∼2 seconds.

Sparse Representation of RGB-D Data. One advantage of unsupervised dictionary learning is that it readily applies to novel sensor data, such as the color and depth frames from a Kinect-style RGB-D camera. We learn K-SVD dictionaries for up to four channels of color and depth: grayscale for luminance, chromaticity ab for color in the Lab space, depth (distance to camera) and surface normal (3-dim). The learned dictionaries are visualized in Fig. 2. These dictionaries are interesting to look at and qualitatively distinctive: for example, the surface normal codewords tend to be more smooth due to flat surfaces, the depth codewords are also more smooth but with speckles, and the chromaticity codewords respect the opponent color pairs. The channels are coded separately.

[Figure 2 (panels: (a) Grayscale, (b) Chromaticity (ab), (c) Depth, (d) Surface normal): K-SVD dictionaries learned for four different channels: grayscale and chromaticity (in ab) for an RGB image (a, b), and depth and surface normal for a depth image (c, d). We use a fixed dictionary size of 75 on 5x5 patches. The ab channel is visualized using a constant luminance of 50. The 3-dimensional surface normal (xyz) is visualized in RGB (i.e. blue for frontal-parallel surfaces).]

3.2 Coding Multi-Scale Neighborhoods for Measuring Contrast

Multi-Scale Pooling over Oriented Half-Discs. Over decades of research on contour detection and related topics, a number of fundamental observations have been made, repeatedly: (1) contrast is the key to differentiate contour vs non-contour; (2) orientation is important for respecting contour continuity; and (3) multi-scale is useful. We do not wish to throw out these principles. Instead, we seek to adopt these principles for our case of high dimensional representations with sparse codes.

Each pixel is presented with sparse codes extracted from a small patch (5-by-5) around it. To aggregate pixel information, we use oriented half-discs as used in gPb (see an illustration in Fig. 1). Each orientation is processed separately. For each orientation, at each pixel $p$ and scale $s$, we define two half-discs (rectangles) $N^a$ and $N^b$ of size $s$-by-$(2s+1)$, on both sides of $p$, rotated to that orientation. For each half-disc $N$, we use average pooling on non-zero entries (i.e. a hybrid of average and max pooling) to generate its representation

$$F(N) = \left[ \frac{\sum_{i \in N} |x_{i1}|}{\sum_{i \in N} \mathbb{I}(|x_{i1}| > 0)}, \ \cdots, \ \frac{\sum_{i \in N} |x_{im}|}{\sum_{i \in N} \mathbb{I}(|x_{im}| > 0)} \right] \qquad (2)$$

where $x_{ij}$ is the $j$-th entry of the sparse code $x_i$, and $\mathbb{I}$ is the indicator function of whether $x_{ij}$ is non-zero. We rotate the image (after sparse coding) and use integral images for fast computations (on both $|x_{ij}|$ and $\mathbb{I}(|x_{ij}| > 0)$), whose costs are independent of the size of $N$. For two oriented half-discs $N^a_s$ and $N^b_s$ at a scale $s$, we compute a difference (gradient) vector $D$

$$D(N^a_s, N^b_s) = \left| F(N^a_s) - F(N^b_s) \right| \qquad (3)$$

where $|\cdot|$ is an element-wise absolute value operation. We divide $D(N^a_s, N^b_s)$ by their norms $\|F(N^a_s)\| + \|F(N^b_s)\| + \epsilon$, where $\epsilon$ is a positive number. Since the magnitude of sparse codes varies over a wide range due to local variations in illumination as well as occlusion, this step makes the appearance features robust to such variations and increases their discriminative power, as commonly done in both contour detection and object recognition. This value is
not hard to set, and we find a value of $\epsilon = 0.5$ is better than, for instance, $\epsilon = 0$.

At this stage, one could train a classifier on $D$ for each scale to convert it to a scalar value of contrast, which would resemble the chi-square distance function in gPb. Instead, we find that it is much better to avoid doing so separately at each scale, but combining multi-scale features in a joint representation, so as to allow interactions both between codewords and between scales. That is, our final representation of the contrast at a pixel $p$ is the concatenation of sparse codes pooled at all the scales $s \in \{1, \cdots, S\}$ (we use $S = 4$):

$$D_p = \left[ D(N^a_1, N^b_1), \cdots, D(N^a_S, N^b_S); \ F(N^a_1 \cup N^b_1), \cdots, F(N^a_S \cup N^b_S) \right] \qquad (4)$$

In addition to the difference $D$, we also include a union term $F(N^a_s \cup N^b_s)$, which captures the appearance of the whole disc (union of the two half discs) and is normalized by $\|F(N^a_s)\| + \|F(N^b_s)\| + \epsilon$.

Double Power Transform and Linear Classifiers. The concatenated feature $D_p$ (non-negative) provides multi-scale contrast information for classifying whether $p$ is a contour location for a particular orientation. As $D_p$ is high dimensional (1200 and above in our experiments) and we need to do it at every pixel and every orientation, we prefer using linear SVMs for both efficient testing as well as training. Directly learning a linear function on $D_p$, however, does not work very well. Instead, we apply a double power transformation to make the features more suitable for linear SVMs

$$D_p = \left[ D_p^{\alpha_1}, \ D_p^{\alpha_2} \right] \qquad (5)$$

where $0 < \alpha_1 < \alpha_2 < 1$. Empirically, we find that the double power transform works much better than either no transform or a single power transform $\alpha$, as sometimes done in other classification contexts. Perronnin et al. [19] provided an intuition why a power transform helps classification, which "re-normalizes" the distribution of the features into a more Gaussian form. One plausible intuition for a double power transform is that the optimal exponent $\alpha$ may be different across feature dimensions. By putting two power transforms of $D_p$ together, we allow the classifier to pick its linear combination, different for each dimension, during the stage of supervised training.

From Local Contrast to Global Contours. We intentionally only change the local contrast estimation in gPb and keep the other steps fixed. These steps include: (1) the Savitzky-Golay filter to smooth responses and find peak locations; (2) non-max suppression over orientations; and (3) optionally, we apply the globalization step in gPb that computes a spectral gradient from the local gradients and then linearly combines the spectral gradient with the local ones. A sigmoid transform step is needed to convert the SVM outputs on $D_p$ before computing spectral gradients.

4 Experiments

We use the evaluation framework of, and extensively compare to, the publicly available Global Pb (gPb) system [2], widely used as the state of the art for contour detection.¹ All the results reported on gPb are from running the gPb contour detection and evaluation codes (with default parameters), and accuracies are verified against the published results in [2]. The gPb evaluation includes a number of criteria, including precision-recall (P/R) curves from contour matching (Fig. 4), F-measures computed from P/R (Table 1, 2, 3) with a fixed contour threshold (ODS) or per-image thresholds (OIS), as well as average precisions (AP) from the P/R curves.

Benchmark Datasets. The main dataset we use is the BSDS500 benchmark [2], an extension of the original BSDS300 benchmark and commonly used for contour evaluation. It includes 500 natural images of
roughly resolution 321x481, including 200 for training, 100 for validation, and 200 for testing. We conduct both color and grayscale experiments (where we convert the BSDS500 images to grayscale and retain the groundtruth). In addition, we also use the MSRC2 and PASCAL2008 segmentation datasets [26, 6], as done in the gPb work [2]. The MSRC2 dataset has 591 images of resolution 200x300; we randomly choose half for training and half for testing. The PASCAL2008 dataset includes 1023 images in its training and validation sets, roughly of resolution 350x500. We randomly choose half for training and half for testing. For RGB-D contour detection, we use the NYU Depth dataset (v2) [27], which includes 1449 pairs of color and depth frames of resolution 480x640, with groundtruth semantic regions. We choose 60% images for training and 40% for testing, as in its scene labeling setup. The Kinect images are of lower quality than BSDS, and we resize the frames to 240x320 in our experiments.

Training Sparse Code Gradients. Given sparse codes from K-SVD and Orthogonal Matching Pursuit, we train the Sparse Code Gradients classifiers, one linear SVM per orientation, from sampled locations. For positive data, we sample groundtruth contour locations and estimate the orientations at these locations using groundtruth. For negative data, locations and orientations are random. We subtract the mean from the patches in each data channel. For BSDS500, we typically have 1.5 to 2 million data points. We use 4 spatial scales, at half-disc sizes 2, 4, 7, 25. For a dictionary size of 75 and 4 scales, the feature length for one data channel is 1200. For full RGB-D data, the dimension is 4800. For BSDS500, we train only using the 200 training images. We modify liblinear [7] to take dense matrices (features are dense after pooling) and single-precision floats.

¹ In this work we focus on contour detection and do not address how to derive segmentations from contours.

[Figure 3: Analysis of our sparse code gradients, using average precision of classification on sampled boundaries. (a) The effect of single-scale vs multi-scale pooling (accumulated from the smallest). (b) Accuracy increasing with dictionary size, for four orientation channels. (c) The effect of the sparsity level K, which exhibits different behavior for grayscale and chromaticity.]

BSDS500           ODS    OIS    AP
local:
  gPb (gray)      .67    .69    .68
  SCG (gray)      .69    .71    .71
  gPb (color)     .70    .72    .71
  SCG (color)     .72    .74    .75
global:
  gPb (gray)      .69    .71    .67
  SCG (gray)      .71    .73    .74
  gPb (color)     .71    .74    .72
  SCG (color)     .74    .76    .77

Table 1: F-measure evaluation on the BSDS500 benchmark [2], comparing to gPb on grayscale and color images, both for local contour detection as well as for global detection (i.e. combined with the spectral gradient analysis in [2]).

[Figure 4: Precision-recall curves of SCG vs gPb on BSDS500, for grayscale and color images. We make a substantial step beyond the current state of the art toward reaching human-level accuracy (green dot).]

Looking under the Hood. We empirically analyze a number of settings in our Sparse Code Gradients. In particular, we want to understand how the choices in the local sparse coding affect contour classification. Fig. 3 shows the effects of multi-scale pooling, dictionary size, and sparsity level (K). The numbers reported are intermediate results, namely the mean of average precision of four oriented gradient classifiers (0, 45, 90, 135 degrees) on sampled locations (grayscale unless otherwise noted, on
validation). As a reference, the average precision of gPb on this task is 0.878.

For multi-scale pooling, the single best scale for the half-disc filter is about 4x8, consistent with the settings in gPb. For accumulated scales (using all the scales from the smallest up to the current level), the accuracy continues to increase and does not seem to be saturated, suggesting the use of larger scales. The dictionary size has a minor impact, and there is a small (yet observable) benefit to use dictionaries larger than 75, particularly for diagonal orientations (45- and 135-deg). The sparsity level K is a more intriguing issue. In Fig. 3(c), we see that for grayscale only, K = 1 (normalized nearest neighbor) does quite well; on the other hand, color needs a larger K, possibly because ab is a nonlinear space. When combining grayscale and color, it seems that we want K to be at least 3. It also varies with orientation: horizontal and vertical edges require a smaller K than diagonal edges. (If using K = 1, our final F-measure on BSDS500 is 0.730.)

We also empirically evaluate the double power transform vs single power transform vs no transform. With no transform, the average precision is 0.865. With a single power transform, the best choice of the exponent is around 0.4, with average precision 0.884. A double power transform (with exponents 0.25 and 0.75, which can be computed through sqrt) improves the average precision to 0.900, which translates to a large improvement in contour detection accuracy.

MSRC2            ODS    OIS    AP
gPb              .37    .39    .22
SCG              .43    .43    .33

PASCAL2008       ODS    OIS    AP
gPb              .34    .38    .20
SCG              .37    .41    .27

Table 2: F-measure evaluation comparing our SCG approach to gPb on two additional image datasets with contour groundtruth: MSRC2 [26] and PASCAL2008 [6].

RGB-D (NYU v2)   ODS    OIS    AP
gPb (color)      .51    .52    .37
SCG (color)      .55    .57    .46
gPb (depth)      .44    .46    .28
SCG (depth)      .53    .54    .45
gPb (RGB-D)      .53    .54    .40
SCG (RGB-D)      .62    .63    .54

Table 3: F-measure evaluation on RGB-D contour detection using the NYU dataset (v2) [27]. We compare to gPb on using color image only, depth only, as well as color+depth.

[Figure 5: Examples from the BSDS500 dataset [2]. (Top) Image; (Middle) gPb output; (Bottom) SCG output (this work). Our SCG operator learns to preserve fine details (e.g. windmills, faces, fish fins) while at the same time achieving higher precision on large-scale contours (e.g. back of zebras). (Contours are shown in double width for the sake of visualization.)]

Image Benchmarking Results. In Table 1 and Fig. 4 we show the precision-recall of our Sparse Code Gradients vs gPb on the BSDS500 benchmark. We conduct four sets of experiments, using color or grayscale images, with or without the globalization component (for which we use exactly the same setup as in gPb). Using Sparse Code Gradients leads to a significant improvement in accuracy in all four cases. The local version of our SCG operator, i.e. only using local contrast, is already better (F = 0.72) than gPb with globalization (F = 0.71). The full version, local SCG plus spectral gradient (computed from local SCG), reaches an F-measure of 0.739, a large step forward from gPb, as seen in the precision-recall curves in Fig. 4. On BSDS300, our F-measure is 0.715.
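To make the feature construction of Section 3.2 concrete, the sketch below pools per-pixel sparse codes over the two half-discs of equation (2), forms the normalized gradient and union terms of equations (3)-(4) over the four scales, and applies the double power transform of equation (5) with the exponents 0.25 and 0.75 evaluated above. It handles a single orientation with axis-aligned rectangles and no integral-image speedup; rotating the coded image for other orientations, and all function and argument names, are assumptions on our part rather than the paper's code.

```python
import numpy as np

def pool_half_disc(codes, mask):
    """Eq. (2): per-codeword average of |code| over its non-zero support inside `mask`."""
    region = np.abs(codes[mask])                     # (n_pixels_in_mask, n_atoms)
    support = (region > 0).sum(axis=0)
    return region.sum(axis=0) / np.maximum(support, 1)

def scg_feature(codes, p, scales=(2, 4, 7, 25), alphas=(0.25, 0.75), eps=0.5):
    """Sparse Code Gradient feature at pixel p = (row, col) for one orientation.
    `codes` is an (H, W, n_atoms) array of per-pixel sparse codes."""
    H, W, _ = codes.shape
    r, c = p
    rows = np.arange(H)[:, None]
    cols = np.arange(W)[None, :]
    parts = []
    for s in scales:
        band = np.abs(cols - c) <= s                          # width 2s+1 around p
        upper = band & (rows >= r - s) & (rows < r)           # half-disc N^a_s
        lower = band & (rows > r) & (rows <= r + s)           # half-disc N^b_s
        fa, fb = pool_half_disc(codes, upper), pool_half_disc(codes, lower)
        norm = np.linalg.norm(fa) + np.linalg.norm(fb) + eps  # eq. (3) normalization
        parts.append(np.abs(fa - fb) / norm)                  # difference term
        parts.append(pool_half_disc(codes, upper | lower) / norm)  # union term of eq. (4)
    d_p = np.concatenate(parts)                               # eq. (4)
    return np.concatenate([d_p ** a for a in alphas])         # eq. (5), double power transform
```

In practice one would precompute integral images of the absolute codes and of their non-zero indicators, as noted in Section 3.2, so that the pooling cost does not depend on the disc size.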
We observe that SCG seems to pick upfine-scale details much better than gPb,hence the much higher recall rate,while maintaining higher precision over the entire range.This can be seen in the examples shown in Fig.5.While our scale range is similar to that of gPb,the multi-scale pooling scheme allows theflexibility of learning the balance of scales separately for each code word,which may help detecting the details.The supplemental material contains more comparison examples.In Table2we show the benchmarking results for two additional datasets,MSRC2and PAS-CAL2008.Again we observe large improvements in accuracy,in spite of the somewhat different natures of the scenes in these datasets.The improvement on MSRC2is much larger,partly because the images are smaller,hence the contours are smaller in scale and may be over-smoothed in gPb. As for computational cost,using integral images,local SCG takes∼100seconds to compute on a single-thread Intel Core i5-2500CPU on a BSDS image.It is slower than but comparable to the highly optimized multi-thread C++implementation of gPb(∼60seconds).Figure6:Examples of RGB-D contour detection on the NYU dataset(v2)[27].Thefive panels are:input image,input depth,image-only contours,depth-only contours,and color+depth contours. Color is good picking up details such as photos on the wall,and depth is useful where color is uniform(e.g.corner of a room,row1)or illumination is poor(e.g.chair,row2).RGB-D Contour Detection.We use the second version of the NYU Depth Dataset[27],which has higher quality groundtruth than thefirst version.A medianfiltering is applied to remove double contours(boundaries from two adjacent regions)within3pixels.For RGB-D baseline,we use a simple adaptation of gPb:the depth values are in meters and used directly as a grayscale image in gPb gradient computation.We use a linear combination to put(soft)color and depth gradients together in gPb before non-max suppression,with the weight set from validation.Table3lists the precision-recall evaluations of SCG vs gPb for RGB-D contour detection.All the SCG settings(such as scales and dictionary sizes)are kept the same as for BSDS.SCG again outperforms gPb in all the cases.In particular,we are much better for depth-only contours,for which gPb is not designed.Our approach learns the low-level representations of depth data fully automatically and does not require any manual tweaking.We also achieve a much larger boost by combining color and depth,demonstrating that color and depth channels contain complementary information and are both critical for RGB-D contour detection.Qualitatively,it is easy to see that RGB-D combines the strengths of color and depth and is a promising direction for contour and segmentation tasks and indoor scene analysis in general[22].Fig.6shows a few examples of RGB-D contours from our SCG operator.There are plenty of such cases where color alone or depth alone would fail to extract contours for meaningful parts of the scenes,and color+depth would succeed. 5DiscussionsIn this work we successfully showed how to learn and code local representations to extract contours in natural images.Our approach combined the proven concept of oriented gradients with powerful representations that are automatically learned through sparse coding.Sparse Code Gradients(SCG) performed significantly better than hand-designed features that were in use for a decade,and pushed contour detection much closer to human-level accuracy as illustrated on the BSDS500benchmark. 
Comparing to hand-designed features (e.g. Global Pb [2]), we maintain the high dimensional representation from pooling oriented neighborhoods and do not collapse them prematurely (such as computing chi-square distance at each scale). This passes a richer set of information into learning contour classification, where a double power transform effectively codes the features for linear classification. Comparing to previous learning approaches (e.g. discriminative dictionaries in [16]), our uses of multi-scale pooling and oriented gradients lead to much higher classification accuracies.

Our work opens up future possibilities for learning contour detection and segmentation. As we illustrated, there is a lot of information locally that is waiting to be extracted, and a learning approach such as sparse coding provides a principled way to do so, where rich representations can be automatically constructed and adapted. This is particularly important for novel sensor data such as RGB-D, for which we have less understanding but increasingly more need.
Image Super-Resolution via Sparse Representation
1Image Super-Resolution via Sparse Representation Jianchao Yang,Student Member,IEEE,John Wright,Student Member,IEEE Thomas Huang,Life Fellow,IEEEand Yi Ma,Senior Member,IEEEAbstract—This paper presents a new approach to single-image superresolution,based on sparse signal representation.Research on image statistics suggests that image patches can be well-represented as a sparse linear combination of elements from an appropriately chosen over-complete dictionary.Inspired by this observation,we seek a sparse representation for each patch of the low-resolution input,and then use the coefficients of this representation to generate the high-resolution output.Theoretical results from compressed sensing suggest that under mild condi-tions,the sparse representation can be correctly recovered from the downsampled signals.By jointly training two dictionaries for the low-and high-resolution image patches,we can enforce the similarity of sparse representations between the low resolution and high resolution image patch pair with respect to their own dictionaries.Therefore,the sparse representation of a low resolution image patch can be applied with the high resolution image patch dictionary to generate a high resolution image patch. The learned dictionary pair is a more compact representation of the patch pairs,compared to previous approaches,which simply sample a large amount of image patch pairs[1],reducing the computational cost substantially.The effectiveness of such a sparsity prior is demonstrated for both general image super-resolution and the special case of face hallucination.In both cases,our algorithm generates high-resolution images that are competitive or even superior in quality to images produced by other similar SR methods.In addition,the local sparse modeling of our approach is naturally robust to noise,and therefore the proposed algorithm can handle super-resolution with noisy inputs in a more unified framework.I.I NTRODUCTIONSuper-resolution(SR)image reconstruction is currently a very active area of research,as it offers the promise of overcoming some of the inherent resolution limitations of low-cost imaging sensors(e.g.cell phone or surveillance cameras)allowing better utilization of the growing capability of high-resolution displays(e.g.high-definition LCDs).Such resolution-enhancing technology may also prove to be essen-tial in medical imaging and satellite imaging where diagnosis or analysis from low-quality images can be extremely difficult. Conventional approaches to generating a super-resolution im-age normally require as input multiple low-resolution images of the same scene,which are aligned with sub-pixel accuracy. 
The SR task is cast as the inverse problem of recovering the original high-resolution image by fusing the low-resolution images,based on reasonable assumptions or prior knowledge about the observation model that maps the high-resolution im-age to the low-resolution ones.The fundamental reconstruction Jianchao Yang and Thomas Huang are with Beckman Institute,Uni-versity of Illinois Urbana-Champaign,Urbana,IL61801USA(email: jyang29@;huang@).John Wright and Yi Ma are with CSL,University of Illinois Urbana-Champaign,Urbana,IL61801USA(email:jnwright@; yima@).constraint for SR is that the recovered image,after applying the same generation model,should reproduce the observed low-resolution images.However,SR image reconstruction is gen-erally a severely ill-posed problem because of the insufficient number of low resolution images,ill-conditioned registration and unknown blurring operators,and the solution from the reconstruction constraint is not unique.Various regularization methods have been proposed to further stabilize the inversion of this ill-posed problem,such as[2],[3],[4].However,the performance of these reconstruction-based super-resolution algorithms degrades rapidly when the desired magnification factor is large or the number of available input images is small.In these cases,the result may be overly smooth,lacking important high-frequency details[5].Another class of SR approach is based on interpolation[6],[7], [8].While simple interpolation methods such as Bilinear or Bicubic interpolation tend to generate overly smooth images with ringing and jagged artifacts,interpolation by exploiting the natural image priors will generally produce more favorable results.Dai et al.[7]represented the local image patches using the background/foreground descriptors and reconstructed the sharp discontinuity between the two.Sun et.al.[8]explored the gradient profile prior for local image structures and ap-plied it to super-resolution.Such approaches are effective in preserving the edges in the zoomed image.However,they are limited in modeling the visual complexity of the real images. 
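For reference, the Bicubic interpolation baseline that these interpolation-based methods improve upon is essentially a one-liner; the sketch below uses SciPy's order-3 spline zoom as a stand-in for bicubic interpolation, and the magnification factor and array sizes are illustrative choices, not anything this paper prescribes.

```python
import numpy as np
from scipy.ndimage import zoom

def bicubic_upscale(Y, factor=2):
    """Interpolation-based SR baseline: order-3 spline upsampling of a low-resolution image Y."""
    return zoom(Y.astype(float), factor, order=3)

Y = np.random.default_rng(0).random((32, 32))   # stand-in low-resolution image
X_interp = bicubic_upscale(Y, factor=2)         # 64 x 64, typically overly smooth
```

As the text notes, such interpolation tends to produce overly smooth results with ringing and jagged artifacts, which is what the sparse-representation approach developed below aims to avoid.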
For natural images withfine textures or smooth shading,these approaches tend to produce watercolor-like artifacts.A third category of SR approach is based on ma-chine learning techniques,which attempt to capture the co-occurrence prior between low-resolution and high-resolution image patches.[9]proposed an example-based learning strat-egy that applies to generic images where the low-resolution to high-resolution prediction is learned via a Markov Random Field(MRF)solved by belief propagation.[10]extends this approach by using the Primal Sketch priors to enhance blurred edges,ridges and corners.Nevertheless,the above methods typically require enormous databases of millions of high-resolution and low-resolution patch pairs,and are therefore computationally intensive.[11]adopts the philosophy of Lo-cally Linear Embedding(LLE)[12]from manifold learning, assuming similarity between the two manifolds in the high-resolution and the low-resolution patch spaces.Their algorithm maps the local geometry of the low-resolution patch space to the high-resolution one,generating high-resolution patch as a linear combination of ing this strategy,more patch patterns can be represented using a smaller training database.However,using afixed number K neighbors for reconstruction often results in blurring effects,due to over-or under-fitting.In our previous work[1],we proposed a method2for adaptively choosing the most relevant reconstruction neigh-bors based on sparse coding,avoiding over-or under-fitting of [11]and producing superior results.However,sparse codingover a large sampled image patch database directly is too time-consuming.While the mentioned approaches above were proposed forgeneric image super-resolution,specific image priors can be incorporated when tailored to SR applications for specificdomains such as human faces.This face hallucination prob-lem was addressed in the pioneering work of Baker and Kanade[13].However,the gradient pyramid-based predictionintroduced in[13]does not directly model the face prior,and the pixels are predicted individually,causing discontinuities and artifacts.Liu et al.[14]proposed a two-step statisticalapproach integrating the global PCA model and a local patch model.Although the algorithm yields good results,the holistic PCA model tends to yield results like the mean face and theprobabilistic local patch model is complicated and compu-tationally demanding.Wei Liu et al.[15]proposed a newapproach based on TensorPatches and residue compensation. 
While this algorithm adds more details to the face,it also introduces more artifacts.This paper focuses on the problem of recovering the super-resolution version of a given low-resolution image.Similar to the aforementioned learning-based methods,we will relyon patches from the input image.However,instead of work-ing directly with the image patch pairs sampled from high-and low-resolution images[1],we learn a compact repre-sentation for these patch pairs to capture the co-occurrence prior,significantly improving the speed of the algorithm.Our approach is motivated by recent results in sparse signal representation,which suggest that the linear relationships among high-resolution signals can be accurately recoveredfrom their low-dimensional projections[16],[17].Although the super-resolution problem is very ill-posed,making preciserecovery impossible,the image patch sparse representation demonstrates both effectiveness and robustness in regularizing the inverse problem.a)Basic Ideas:To be more precise,let D∈R n×K be an overcomplete dictionary of K atoms(K>n),and suppose a signal x∈R n can be represented as a sparse linearcombination with respect to D.That is,the signal x can be written as x=Dα0where whereα0∈R K is a vector with very few( n)nonzero entries.In practice,we might onlyobserve a small set of measurements y of x:y.=L x=L Dα0,(1) where L∈R k×n with k<n is a projection matrix.In our super-resolution context,x is a high-resolution image(patch), while y is its low-resolution counter part(or features extracted from it).If the dictionary D is overcomplete,the equation x=Dαis underdetermined for the unknown coefficientsα. The equation y=L Dαis even more dramatically under-determined.Nevertheless,under mild conditions,the sparsest solutionα0to this equation will be unique.Furthermore,if D satisfies an appropriate near-isometry condition,then for a wide variety of matrices L,any sufficiently sparse linear representation of a high-resolution image patch x interms Fig.1.Reconstruction of a raccoon face with magnification factor2.Left: result by our method.Right:the original image.There is little noticeable difference visually even for such a complicated texture.The RMSE for the reconstructed image is5.92(only the local patch model is employed).of the D can be recovered(almost)perfectly from the low-resolution image patch[17],[18].1Fig.1shows an example that demonstrates the capabilities of our method derived fromthis principle.The image of the raccoon face is blurred anddownsampled to half of its original size in both dimensions. Then we zoom the low-resolution image to the original sizeusing the proposed method.Even for such a complicated texture,sparse representation recovers a visually appealingreconstruction of the original signal.Recently sparse representation has been successfully appliedto many other related inverse problems in image processing, such as denoising[19]and restoration[20],often improving onthe state-of-the-art.For example in[19],the authors use theK-SVD algorithm[21]to learn an overcomplete dictionary from natural image patches and successfully apply it to theimage denoising problem.In our setting,we do not directlycompute the sparse representation of the high-resolution patch. 
Instead,we will work with two coupled dictionaries,D h forhigh-resolution patches,and D l for low-resolution ones.The sparse representation of a low-resolution patch in terms of D l will be directly used to recover the corresponding high-resolution patch from D h.We obtain a locally consistent solution by allowing patches to overlap and demanding that thereconstructed high-resolution patches agree on the overlappedareas.In this paper,we try to learn the two overcomplete dictionaries in a probabilistic model similar to[22].To enforce that the image patch pairs have the same sparse representations with respect to D h and D l,we learn the two dictionaries simultaneously by concatenating them with proper normal-ization.The learned compact dictionaries will be applied to both generic image super-resolution and face hallucination to demonstrate their effectiveness.Compared with the aforementioned learning-based methods,our algorithm requires only two compact learned dictionaries, instead of a large training patch database.The computation, mainly based on linear programming or convex optimization, is much more efficient and scalable,compared with[9],[10], [11].The online recovery of the sparse representation uses the low-resolution dictionary only–the high-resolution dictionary 1Even though the structured projection matrix defined by blurring and downsampling in our SR context does not guarantee exact recovery ofα0, empirical experiments indeed demonstrate the effectiveness of such a sparse prior for our SR tasks.3is used to calculate thefinal high-resolution image.The computed sparse representation adaptively selects the most relevant patch bases in the dictionary to best represent each patch of the given low-resolution image.This leads to superior performance,both qualitatively and quantitatively,compared to the method described in[11],which uses afixed number of nearest neighbors,generating sharper edges and clearer textures.In addition,the sparse representation is robust to noise as suggested in[19],and thus our algorithm is more robust to noise in the test image,while most other methods cannot perform denoising and super-resolution simultaneously.b)Organization of the Paper:The remainder of this paper is organized as follows.Section II details our formula-tion and solution to the image super-resolution problem based on sparse representation.Specifically,we study how to apply sparse representation for both generic image super-resolution and face hallucination.In Section III,we discuss how to learn the two dictionaries for the high-and low-resolution image patches respectively.Various experimental results in Section IV demonstrate the efficacy of sparsity as a prior for regularizing image super-resolution.c)Notations:X and Y denote the high-and low-resolution images respectively,and x and y denote the high-and low-resolution image patches respectively.We use bold uppercase D to denote the dictionary for sparse coding, specifically we use D h and D l to denote the dictionaries for high-and low-resolution image patches respectively.Bold lowercase letters denote vectors.Plain uppercase letters denote regular matrices,i.e.,S is used as a downsampling operation in matrix form.Plain lowercase letters are used as scalars.II.I MAGE S UPER-R ESOLUTION FROM S PARSITYThe single-image super-resolution problem asks:given a low-resolution image Y,recover a higher-resolution image X of the same scene.Two constraints are modeled in this work to solve this ill-posed problem:1)reconstruction constraint, which 
requires that the recovered X should be consistent with the input Y with respect to the image observation model; and2)sparsity prior,which assumes that the high resolution patches can be sparsely represented in an appropriately chosen overcomplete dictionary,and that their sparse representations can be recovered from the low resolution observation.1)Reconstruction constraint:The observed low-resolution image Y is a blurred and downsampled version of the high resolution image X:Y=SH X(2) Here,H represents a blurringfilter,and S the downsampling operator.Super-resolution remains extremely ill-posed,since for a given low-resolution input Y,infinitely many high-resolution images X satisfy the above reconstruction constraint.We further regularize the problem via the following prior on small patches x of X:2)Sparsity prior:The patches x of the high-resolution image X can be represented as a sparse linear combination in a dictionary D h trained from high-resolution patches sampled from training images:x≈D hαfor someα∈R K with α 0 K.(3) The sparse representationαwill be recovered by representing patches y of the input image Y,with respect to a low resolution dictionary D l co-trained with D h.The dictionary training process will be discussed in Section III.We apply our approach to both generic images and face images.For generic image super-resolution,we divide the problem into two steps.First,as suggested by the sparsity prior(3),wefind the sparse representation for each local patch,respecting spatial compatibility between neighbors. Next,using the result from this local sparse representation, we further regularize and refine the entire image using the reconstruction constraint(2).In this strategy,a local model from the sparsity prior is used to recover lost high-frequency for local details.The global model from the reconstruction constraint is then applied to remove possible artifacts from thefirst step and make the image more consistent and natural. The face images differ from the generic images in that the face images have more regular structure and thus reconstruction constraints in the face subspace can be more effective.For face image super-resolution,we reverse the above two steps to make better use of the global face structure as a regularizer. Wefirstfind a suitable subspace for human faces,and apply the reconstruction constraints to recover a medium resolution image.We then recover the local details using the sparsity prior for image patches.The remainder of this section is organized as follows:in Section II-A,we discuss super-resolution for generic images. We will introduce the local model based on sparse represen-tation and global model based on reconstruction constraints. 
In Section II-B we discuss how to introduce the global face structure into this framework to achieve more accurate and visually appealing super-resolution for face images.A.Generic Image Super-Resolution from Sparsity1)Local model from sparse representation:Similar to the patch-based methods mentioned previously,our algorithm tries to infer the high-resolution image patch for each low-resolution image patch from the input.For this local model, we have two dictionaries D h and D l,which are trained to have the same sparse representations for each high-resolution and low-resolution image patch pair.We subtract the mean pixel value for each patch,so that the dictionary represents image textures rather than absolute intensities.In the recovery process,the mean value for each high-resolution image patch is then predicted by its low-resolution version.For each input low-resolution patch y,wefind a sparse representation with respect to D l.The corresponding high-resolution patch bases D h will be combined according to these coefficients to generate the output high-resolution patch x. The problem offinding the sparsest representation of y can be formulated as:min α 0s.t. F D lα−F y 22≤ ,(4)4 where F is a(linear)feature extraction operator.The main roleof F in(4)is to provide a perceptually meaningful constraint2on how closely the coefficientsαmust approximate y.We willdiscuss the choice of F in Section III.Although the optimization problem(4)is NP-hard in gen-eral,recent results[23],[24]suggest that as long as the desiredcoefficientsαare sufficiently sparse,they can be efficientlyrecovered by instead minimizing the 1-norm3,as follows:min α 1s.t. F D lα−F y 22≤ .(5)Lagrange multipliers offer an equivalent formulationminαF D lα−F y 22+λ α 1,(6)where the parameterλbalances sparsity of the solutionandfidelity of the approximation to y.Notice that this isessentially a linear regression regularized with 1-norm on thecoefficients,known in statistical literature as the Lasso[27].Solving(6)individually for each local patch does notguarantee the compatibility between adjacent patches.Weenforce compatibility between adjacent patches using a one-pass algorithm similar to that of[28].4The patches areprocessed in raster-scan order in the image,from left to rightand top to bottom.We modify(5)so that the super-resolutionreconstruction D hαof patch y is constrained to closelyagree with the previously computed adjacent high-resolutionpatches.The resulting optimization problem ismin α 1s.t. F D lα−F y 22≤ 1,P D hα−w 22≤ 2,(7)where the matrix P extracts the region of overlap betweenthe current target patch and previously reconstructed high-resolution image,and w contains the values of the previouslyreconstructed high-resolution image on the overlap.The con-strained optimization(7)can be similarly reformulated as:minα˜Dα−˜y 22+λ α 1,(8)where˜D=F D lβP D hand˜y=F yβw.The parameterβcontrols the tradeoff between matching the low-resolution input andfinding a high-resolution patch that is compatible with its neighbors.In all our experiments,we simply set β=1.Given the optimal solutionα∗to(8),the high-resolution patch can be reconstructed as x=D hα∗.2Traditionally,one would seek the sparsestαs.t. 
D lα−y 2≤ .For super-resolution,it is more appropriate to replace this2-norm with a quadratic norm · F T F that penalizes visually salient high-frequency errors.3There are also some recent works showing certain non-convex optimization problems can produce superior sparse solutions to the 1convex problem,e.g., [25]and[26].4There are different ways to enforce compatibility.In[11],the values in the overlapped regions are simply averaged,which will result in blurring effects. The greedy one-pass algorithm[28]is shown to work almost as well as the use of a full MRF model[9].Our algorithm,not based on the MRF model, is essentially the same by trusting partially the previously recovered high resolution image patches in the overlapped regions.Algorithm1(Super-Resolution via Sparse Representation). 1:Input:training dictionaries D h and D l,a low-resolutionimage Y.2:For each3×3patch y of Y,taken starting from the upper-left corner with1pixel overlap in each direction,•Compute the mean pixel value m of patch y.•Solve the optimization problem with˜D and˜y definedin(8):minα ˜Dα−˜y 22+λ α 1.•Generate the high-resolution patch x=D hα∗.Putthe patch x+m into a high-resolution image X0. 3:End4:Using gradient descent,find the closest image to X0 which satisfies the reconstruction constraint:X∗=arg minXSH X−Y 22+c X−X0 22.5:Output:super-resolution image X∗.2)Enforcing global reconstruction constraint:Notice that (5)and(7)do not demand exact equality between the low-resolution patch y and its reconstruction D lα.Because of this,and also because of noise,the high-resolution image X0produced by the sparse representation approach of the previous section may not satisfy the reconstruction constraint (2)exactly.We eliminate this discrepancy by projecting X0 onto the solution space of SH X=Y,computingX∗=arg minXSH X−Y 22+c X−X0 22.(9) The solution to this optimization problem can be efficiently computed using gradient descent.The update equation for this iterative method isX t+1=X t+ν[H T S T(Y−SH X t)+c(X−X0)],(10) where X t is the estimate of the high-resolution image after the t-th iteration,νis the step size of the gradient descent. 
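The two stages of Algorithm 1 can be sketched compactly in NumPy: solve the ℓ1-regularized patch problem (6) with a plain ISTA loop (standing in for the convex solvers the authors use, and omitting the feature operator F), reconstruct each high-resolution patch as x = D_h α*, and then enforce the global constraint (9) by gradient descent as in (10). The dictionary sizes, patch sizes, λ, c, ν, and the Gaussian blur standing in for H are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, y, lam, n_iter=200):
    """ISTA for min_a ||D a - y||_2^2 + lam * ||a||_1  (the Lasso form in Eq. (6))."""
    L = np.linalg.norm(D, 2) ** 2                      # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - y) / L, lam / (2 * L))
    return a

def degrade(X, scale=3, sigma=1.2):
    """Observation model Y = S H X: Gaussian blur (H) followed by decimation (S)."""
    return gaussian_filter(X, sigma)[::scale, ::scale]

def degrade_adjoint(Y, shape, scale=3, sigma=1.2):
    """Adjoint H^T S^T used in the gradient of ||S H X - Y||_2^2."""
    up = np.zeros(shape)
    up[::scale, ::scale] = Y                           # S^T: zero-filled upsampling
    return gaussian_filter(up, sigma)                  # H^T (the Gaussian blur is self-adjoint)

def back_project(X0, Y, c=0.1, nu=1.0, n_iter=100):
    """Gradient descent for X* = argmin_X ||S H X - Y||^2 + c ||X - X0||^2  (Eqs. (9)-(10))."""
    X = X0.copy()
    for _ in range(n_iter):
        X = X + nu * (degrade_adjoint(Y - degrade(X), X.shape) - c * (X - X0))
    return X

# Toy usage: patch-wise sparse coding with random coupled dictionaries (learned jointly in the paper),
# followed by the global back-projection of a naive upsampled estimate.
Dl, Dh = np.random.randn(9, 512), np.random.randn(81, 512)   # 3x3 LR / 9x9 HR patch dictionaries
alpha = sparse_code(Dl, np.random.randn(9), lam=0.1)
x_patch = Dh @ alpha                                          # x = D_h alpha*

X_true = np.random.rand(90, 90)
Y = degrade(X_true)
X0 = np.repeat(np.repeat(Y, 3, axis=0), 3, axis=1)            # crude stand-in for the sparse estimate
X_star = back_project(X0, Y)
```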
We take result X∗from the above optimization as our final estimate of the high-resolution image.This image is as close as possible to the initial super-resolution X0given by sparsity,while respecting the reconstruction constraint.The entire super-resolution process is summarized as Algorithm1.3)Global optimization interpretation:The simple SR algo-rithm outlined in the previous two subsections can be viewed as a special case of a more general sparse representation framework for inverse problems in image processing.Related ideas have been profitably applied in image compression, denoising[19],and restoration[20].In addition to placing our work in a larger context,these connections suggest means of further improving the performance,at the cost of increased computational complexity.Given sufficient computational resources,one could in prin-ciple solve for the coefficients associated with all patches simultaneously.Moreover,the entire high-resolution image X itself can be treated as a variable.Rather than demanding that X be perfectly reproduced by the sparse coefficientsα,we can penalize the difference between X and the high-resolution image given by these coefficients,allowing solutions that5are not perfectly sparse,but better satisfy the reconstruction constraints.This leads to a large optimization problem:X ∗=arg min X ,{αij }SH X −Y 22+λi,jαij 0+γi,jD h αij −P ij X 22+τρ(X ) .(11)Here,αij denotes the representation coefficients for the (i,j )th patch of X ,and P ij is a projection matrix that selects the (i,j )th patch from X .ρ(X )is a penalty function that encodes additional prior knowledge about the high-resolution image.This function may depend on the image category,or may take the form of a generic regularization term (e.g.,Huber MRF,Total Variation,Bilateral Total Variation).Algorithm 1can be interpreted as a computationally efficient approximation to (11).The sparse representation step recovers the coefficients αby approximately minimizing the sum of the second and third terms of (11).The sparsity term αij 0is relaxed to αij 1,while the high-resolution fidelity term D h αij −P ij X 2is approximated by its low-resolution version F D l αij −F y ij 2.Notice,that if the sparse coefficients αare fixed,the third term of (11)essentially penalizes the difference between the super-resolution image X and the reconstruction given by the coefficients: i,j D h αij −P ij X 22≈ X 0−X 22.Hence,for small γ,the back-projection step of Algorithm 1approximately minimizes the sum of the first and third terms of (11).Algorithm 1does not,however,incorporate any prior be-sides sparsity of the representation coefficients –the term ρ(X )is absent in our approximation.In Section IV we will see that sparsity in a relevant dictionary is a strong enough prior that we can already achieve good super-resolution per-formance.Nevertheless,in settings where further assumptions on the high-resolution signal are available,these priors can be incorperated into the global reconstruction step of our algorithm.B.Face super-resolution from SparsityFace image resolution enhancement is usually desirable in many surveillance scenarios,where there is always a large distance between the camera and the objects (people)of interest.Unlike the generic image super-resolution discussed earlier,face images are more regular in structure and thus should be easier to handle.Indeed,for face super-resolution,we can deal with lower resolution input images.The basic idea is first to use the face prior to zoom the input to a reasonable medium 
resolution,and then to employ the local sparsity prior model to recover details.To be precise,the solution is also approached in two steps:1)global model:use reconstruction constraint to recover a medium high-resolution face image,but the solution is searched only in the face subspace;and 2)local model:use the local sparse model to recover the image details.a)Non-negative matrix factorization:In face super-resolution,the most frequently used subspace method for mod-eling the human face is Principal Component Analysis (PCA),which chooses a low-dimensional subspace that captures as much of the variance as possible.However,the PCA bases are holistic,and tend to generate smooth faces similar to the mean.Moreover,because principal component representations allow negative coefficients,the PCA reconstruction is often hard to interpret.Even though faces are objects with lots of variance,they are made up of several relatively independent parts such as eyes,eyebrows,noses,mouths,checks and chins.Nonnegative Matrix Factorization (NMF)[29]seeks a representation of the given signals as an additive combination of local features.To find such a part-based subspace,NMF is formulated as the following optimization problem:arg min U,VX −UV 22s.t.U ≥0,V ≥0,(12)where X ∈R n ×m is the data matrix,U ∈R n ×r is the basismatrix and V ∈R r ×m is the coefficient matrix.In our context here,X simply consists of a set of pre-aligned high-resolution training face images as its column vectors.The number of the bases r can be chosen as n ∗m/(n +m )which is smaller than n and m ,meaning a more compact representation.It can be shown that a locally optimum of (12)can be obtained via the following update rules:V ij ←−V ij (U T X )ij(U TUV )ij U ki ←−U ki (XV T )ki(UV V T )ki,(13)where 1≤i ≤r ,1≤j ≤m and 1≤k ≤n .The obtained basis matrix U is often sparse and localized.b)Two step face super-resolution:Let X and Y denote the high resolution and low resolution faces respectively.Y is obtained from X by smoothing and downsampling as in Eq.2.We want to recover X from the observation Y .In this paper,we assume Y has been pre-aligned to the training database by either manually labeling the feature points or with some automatic face alignment algorithm such as the method used in [14].We can achieve the optimal solution for X based on the Maximum a Posteriori (MAP)criteria,X ∗=arg max Xp (Y |X )p (X ).(14)p (Y |X )models the image observation process,usually with Gaussian noise assumption on the observation Y ,p (Y |X )=1/Z exp(− SHU c −Y 22/(2∗σ2))with Z being a nor-malization factor.p (X )is a prior on the underlying high resolution image X ,typically in the exponential form p (X )=exp(−cρ(X )).Using the rules in (13),we can obtain the basis matrix U ,which is composed of sparse bases.Let Ωdenote the face subspace spanned by U .Then in the subspace Ω,the super-resolution problem in (14)can be formulated using the reconstruction constraints as:c ∗=arg min cSHU c −Y 22+ηρ(U c )s.t.c ≥0,(15)where ρ(U c )is a prior term regularizing the high resolution solution,c ∈R r ×1is the coefficient vector in the subspace Ω。
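A compact sketch of the multiplicative updates in (13) (Lee and Seung's NMF) is given below. The random matrix stands in for the pre-aligned high-resolution face images, and the matrix sizes and iteration count are illustrative choices rather than values from the paper.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9):
    """Multiplicative updates for min ||X - U V||_F^2  s.t.  U >= 0, V >= 0  (Eqs. (12)-(13))."""
    n, m = X.shape
    U = np.random.rand(n, r)
    V = np.random.rand(r, m)
    for _ in range(n_iter):
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # V_ij <- V_ij (U^T X)_ij / (U^T U V)_ij
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # U_ki <- U_ki (X V^T)_ki / (U V V^T)_ki
    return U, V

# Toy usage: columns of X stand in for aligned high-resolution face images.
X = np.random.rand(1024, 200)                  # n = 1024 pixels per face, m = 200 faces
r = (1024 * 200) // (1024 + 200)               # r = n*m/(n+m), as suggested in the text
U, V = nmf(X, r)                               # U spans the face subspace Omega
```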
[Slide fragments on sparse representation; only the recoverable content is kept below.]

Sparse Representation — Notice: Our Task is Impossible. The sparse coding problem is min ‖α‖0 s.t. ‖Dα − x‖2 ≤ ε. Here is a recipe for solving this problem by exhaustive search: set L = 1, gather all the supports {Si}i of cardinality L (there are (K choose L) such supports), test whether any of them meets the error target, and otherwise increase L and repeat. Clearly, the combinatorial number of supports makes this intractable, and it might happen that an approximate solver returns a solution α̂ that differs from the generating α0.

Uniqueness Rule. Suppose this problem has been approximated somehow and we obtain a solution α̂ = argmin ‖α‖0 s.t. x = Dα. Let us search for the sparsest solution of x = Dα by relaxing the count with an ℓp penalty, f_p(α) = ‖α‖p^p = Σ_j |α_j|^p; as p → 0 we get a count of the non-zeros in the vector, ‖α‖0 = lim_{p→0} Σ_j |α_j|^p.

Greedy pursuit. Next steps: given the previously found atoms, find the next one to best fit the residual. The algorithm stops when the error ‖Dα − x‖2 is below the destination threshold.
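The greedy pursuit described above corresponds to (orthogonal) matching pursuit. A minimal NumPy sketch, assuming unit-norm dictionary atoms, is given below; the dictionary, sparsity level, and threshold are illustrative.

```python
import numpy as np

def omp(D, x, eps):
    """Greedy pursuit for min ||a||_0  s.t.  ||D a - x||_2 <= eps (columns of D assumed unit norm)."""
    residual = x.copy()
    support = []
    a = np.zeros(D.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < D.shape[1]:
        # Given the previously found atoms, pick the one that best fits the residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # Re-fit the coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a[:] = 0.0
        a[support] = coef
        residual = x - D @ a
    return a

# Toy usage with a random overcomplete dictionary.
D = np.random.randn(30, 100)
D /= np.linalg.norm(D, axis=0)
x = D[:, [3, 40, 77]] @ np.array([1.5, -2.0, 0.7])   # a 3-sparse signal
a_hat = omp(D, x, eps=1e-6)
```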
int.j.geographical information science,2002vol.16,no.1,93±109Research ArticleEŒects of spatial aggregation approaches on classi ed satellite imageryHONG S.HESchool of Natural Resources,University of Missouri-Columbia,203M ABNRBuilding,Columbia,MO65211,USA;e-mail:Heh@missouri.ed uSTEPHEN J.VENTURAEnvironmental Studies and Soil Science,Land Information and ComputerGraphics Facility,University of Wisconsin-Madison,1525Observatory Dr.,Madison,WI53706,USAand DAVID J.MLADENOFFDepartment of Forest Ecology&Management,University of Wisconsin-Madison,1630Linden Dr.,Madison,WI53706,USA(Received10June2000;accepted21March2001)Abstract.Spatial aggregations of raster data based on the majority rule havebeen typically used in landscape ecological studies.It is generally acknowledgedthat(1)dominant classes increase in abundance while minor classes decrease inabundance or even disappear through aggregation processes;and(2)spatialpatterns also change with aggregations.In this paper,we examined an alternative,random rule-based aggregation and its eŒects on cover type abundance andlandscape patterns,in comparison with the majority rule-based aggregation.Weaggregated a classi ed TM imagery(about1.5million ha)from30m(42313717pixels)incrementally to990m resolution(132pixels116pixels).Cover typeproportion,mean patch size ratio,aggregation index(AI),and fractal dimension(FD)were used to assess the eŒects of aggregation.To compare landscapes underdiŒerent resolutions,we assumed that the landscapes were least distorted if(1)thecover type proportions and mean patch size ratios among classes were maintained,and(2)all cover types responded in the same way for a given index as aggregationlevels increased.For example,distortion is introduced by aggregation if somecover types increase their AI values with increasing aggregation levels while othercover types decrease.Our study indicated that the two spatial aggregation tech-niques led to diŒerent results in cover type proportions and they altered spatialpattern in opposite ways.The majority rule-based aggregations caused distortionsof cover type proportions and spatial patterns.Generally,the majority rule-basedaggregation ltered out minor classes and produced clumped landscapes.Suchlandscape representation s are relatively easy for interpreting and,therefore,aresuitable for land managers to conceptualize spatial patterns of a study region.Bycontrast,the random rule-based aggregations maintained cover type proportionsaccurately,but tended to make spatial patterns change toward disaggregation.Overall,the measurements of landscape indices used in this study indicated thatInternationa l Journal of Geographica l Information ScienceISSN1365-8816print/ISSN1362-3087online©2002Taylor&Francis Ltd/journalsDOI:10.1080/1365881011007597894H.S.He et al.the random rule-based aggregation maintains spatial patterning better than themajority rule-based aggregation.Random rule-based aggregations are moresuitable for studies in which the accuracy of spatially explicit information is ofconcern.They can be very useful in scaling data of ne resolution to coarseresolution,while retaining cover type proportions.1.IntroductionIssues of biodiversity,wildlife conservation,and landscape change increasingly need to be examined at large spatial extents and multiple scales(resolutions).These often involve the use of satellite imagery covering large areas and the aggregation of the data from ne resolutions to coarse resolutions(Rastetter et al.1992,Hess 1994,Bian and Butler1999,Leibowitz and 
Hyman1999).Aggregation is particularly important when data management and processing limitations constrain the use of ne resolution data,such as multiple full resolution Landsat scenes involving giga-bytes of data.For satellite data as well as other raster data,aggregations partition the input grid into blocks and determine the value for each block in the output grid.Such operations can be either numerical or categorical,depending on how the value for each output block is determined.Numerical aggregation involves generating new values by applying various mathematical methods to input data such as sum,mean, median,or a user-de ned algorithm.For example,the mean is often used when aggregating imagery of NDVI to coarser resolutions(Townshend and Justice1990, Jelinski and Wu1996).Categorical aggregation generates output values by logical processing such as majority or predominant input category,or spatial processing such as assigning output values based on the input cell at the centre of the output grid cell.These are commonly used for classi ed satellite imagery or other thematic raster data.Scientists have long noted data distortion or even data loss associated with aggregation or scaling(Gardner et al.1982,Johnson and Howarth1987,Woodcock and Strahler1987,Hall et al.1988,Wiens and Milne1989,Levin1992,Milne1992). EŒorts have been made to quantify such phenomena and to predict its eŒects across scales(Turner et al.1989a,Baker1993,Moody and Woodcock1995,Collins and Woodcock1999).While many studies have dealt with numerical aggregation such as the zoning or modi able area unit problems(Fortheringham and Wong1991, Jelinski and Wu1996),some studies have used categorical aggregation(Moody and Woodcock1995).The consensus from the studies of categorical aggregation is that (1)dominant classes become more dominant in terms of abundance or percentage cover,while minor classes become less common through aggregation processes;and (2)the rates of increase or decrease in abundance of various classes are strongly related to the spatial patterns,which also change with data aggregations.For example,Turner et al.(1989b)found that the abundance of minor land cover types decrease more slowly if they are clumped than if they are very dispersed as aggrega-tion level increases.Moody and Woodcock(1995)found that classes with moderately sized patches tend to increase in abundance much more dramatically than classes with very large patches.While these studies suggest the possibility of predicting the distortion or loss of information based on initial proportions and spatial patterns of cover types(Turner et al.1989a,Moody and Woodcock1994,1995),they have not examined alternative aggregation techniques.Two common methods of categorical aggregation are based on either a majorityE V ects of spatial aggregation on classi ed satellite imagery95 or a random rule.The majority rule nds the majority value(the value that appears most often,but not necessarily>50%)for the speci ed cells of the input grid,and records that value in the corresponding aggregated cell on the output grid.Most studies examining categorical data aggregation were based the majority rule(Turner et al.1989a,1989b,Moody and Woodcock1994,1995,Benson and MacKenzie1995, MladenoŒet al.1997).Because of this rule,dominant classes are expected to become more dominant and minor classes decrease and in some cases disappear entirely from the landscape.Random rule-based aggregations have not been evaluated as extensively as major-ity rule-based methods.The random 
rule assigns the value for the output cell by randomly picking a value among the speci ed cells of the input grid.Because the probability that a class is selected to represent an aggregated area is proportional to its presence in the input data,such aggregations are less likely to create data distortion or loss than those based on the majority rule,except under extreme conditions,such as when the starting map is aggregated to only a few pixels.Since cover type proportions of the random rule-based aggregation are scale independent, it is possible to examine directly the relationships between the levels of aggregation (scale)and resulting landscape patterns.Therefore,the objective of this paper is to compare the eŒects of majority and random rule-based aggregations.Speci cally we intend to use a vegetation map based on a classi ed Landsat Thematic Mapper (TM)satellite image to examine the distortions introduced by data aggregation in terms of both cover type quantities and landscape patterns.2.Methods2.1.Study area and the classi ed T M L andsat imageOur study area is about1.3million ha and is located in northern Wisconsin, USA(44N,91W)( gure1).It comprises62%forest,10%lowland,8%agriculture, 5%water,and14%other land covers(Wolter et al.1995).Forests in the study area are early successional(He et al.1998)and fall in the transitional zone between boreal forest to the north and temperate forests to the south(Curtis1959,Pastor and MladenoŒ1992).The forested areas have multiple ownerships,with about40%state and national forest,35%county forest,and25%privately owned forest.Publicly owned forests tend to occur as large patches,while privately owned forests occur as small patches.A full scene of Landsat TM satellite imagery of this region was classi ed at tree species level(Wolter et al.1995).The sub-area used in the present study contains 4231rows and3717columns of pixels with a map resolution of30m.For this study, we derived three major forest classes and a water class from the classi ed TM image. 
Maple,largely comprised of sugar maple(Acer saccharum),is the most common and clumped cover type,occurring on about26%of the landscape.Oak,primarily comprised of northern red oak(Quercus rubra),is the least common and most dispersed cover type,occurring on about4%of the landscape.Mixed deciduous and coniferous forest occurs as a moderately common(17%)and clumped cover type.It contains various combinations of conifer species including white pine(Pinus strobus), red pine(P.resinosa),and balsam r(Abies balsamea),and deciduous species including yellow birch(Betula alleghaniensis),paper birch(B.papyrifera),and basswood(T ilia americana).Water occurs on about5%of the landscape,similar in frequency to oak but it is the most clumped cover type.96H.S.He et al.Figure1.Study area,about 1.3million ha,located in northern Wisconsin,USA(44N,91W).2.2.L andscape metricsLandscape metrics commonly used to assess the eŒects of data aggregation include Shannon-Weave r diversity and evenness indices,dominance,contagion index, average patch size,and fractal dimension index(Turner et al.1989b,Cullinan and Thomas1992,Benson and MacKenzie1995,Moody and Woodcock1995,Wickham and Ritters1995,MladenoŒet al.1997).Diversity,evenness,and dominance indices are based on the proportion of each cover type on the landscape(Gustafson1998, O’Neill et al.1988),but inconsistent results in assessing spatial aggregation of diŒerent landscapes can be found by using these indices(Turner1989b,Wickham and Ritters1995,Moody and Woodcock1995),because these indices do not incorp-orate spatial attributes,which are altered by data aggregations.The contagion index (CI)is a spatial index and was developed to measure the degree of clumpiness of overall landscape patterns(O’Neill et al.1988).CI is sensitive to the number of cover types in the data set(Li and Reynolds1993).It can be di cult to interpret and has been found to be inconsistent especially when linking the measurement to a speci c cover type(Gardner and O’Neill1991,Gustafson and Parker1992, Wickham and Riitters1995,Ritters et al.1996,Schumaker1996).Average patch size is an intuitive index to measure aggregation and is especially suitable for categorical maps.Increasing levels of aggregation result in an increase in minimum patch sizes.Therefore,average patch sizes of all classes are expected to increase as well with the levels of aggregation.We calculated ratios of the average patch size among the four cover types.Fractal dimension(FD)for a given cover type can be calculated using the area/E V ects of spatial aggregation on classi ed satellite imagery97 perimeter ratio relationship as two times the slope of the regression line found by regressing log(area)on log(perimeter)(Krummel et al.1987).It has been found useful for assessing the eŒects of data aggregation(Wiens and Milne1989,Cullinan and Thomas1992,Jelinski and Wu1996,Emerson et al.1999).The complexity of patch shapes decreases with increasing levels of aggregation(pixel size),resulting in a decrease in FD value(Burrough1981,McGarigal and Marks1994).He et al.(2000)developed a raster data based aggregation index(AI)similar to the class interspersion(CL)metric(Miller et al.1997).AI for class i is derived as:AI i e i,i/max e i,Iwhere,e i,i,the total edges shared by class i itself and max e i,i is the largest possible number of shared edges for class i.Given a class i of area A i,the maximum aggregation level is reached when A clumps into one patch that has the largest e i,i (e.g.it does not have to be a square).In raster data,if n 
is the side of largest integer square smaller than A i,and m A i n2,then the largest number of shared edges for class i,max e i,will take one of the three forms:max e i,i2n(n1),when m0,ormax e i,i2n(n1)2m1,when m<n,ormax e i,i2n(n1)2m2,when m>0.Alternatively,AI can be also calculated using the following single equation.Given a polygon containing P pixels let X be the edge length of the rst square that fully contains area P.Then the denominator for AI isAI int A X2P X B2X2P(1)Note that the rst term is done using integer math(so,for instance,it truncates 2.99to2).AI assumes that a class with the highest level of aggregation(AI1)is comprised of pixels sharing the most possible edges.A class completely disaggregated (pixels share no edges)has the lowest level of aggregation(AI0).AI is class speci c, independent of landscape composition,and suitable for satellite data.It has been shown that AI can be used to compare classes from diŒerent landscapes,or the same class from a landscape at diŒerent resolutions(He et al.2000).AI values of a given class larger than the corresponding lowest value would indicate a non-random pattern with some levels of aggregation(He et al.2000).In this study,we use AI,fractal dimension(FD),average patch size,as well as cover type proportions to assess the eŒects of data aggregation.To compare distor-tions on landscape patterns among diŒerent resolutions,we assume that landscape patterns are least distorted if(1)the proportions of cover types and the ratios of average patch sizes among classes are maintained somewhat constant,and(2)the fractal dimension and aggregation index measured for the four cover types respond in the same fashion(decrease or increase)to the levels of aggregations.3.Aggregation scenariosWe aggregated the initial,classi ed TM satellite image based on either majority or random rule(hereafter referred to as RealM and RealR scenarios,respectively). The aggregations were sequential,from30to990m resolutions,a total of32iterations (30-m increment per iteration).Therefore,no pixel was divided by the aggregation98H.S.He et al.procedure and data were not distorted owing to the grouping scheme.In addition, the range of spatial resolutions cover the span of most existing studies on data aggregations(Turner et al.1989,Cullinan and Thomas1992,Benson and MacKenzie 1995,Moody and Woodcock1995,Wickham and Ritters1995,Ritters et al.1997) as well as the250-m and1000-m resolutions of Moderate Resolution Imaging Spectroradiometer(MODIS)(NASA’s Earth Observing System2000)and the Advanced Very High Resolution Radiometer(AVHRR),platforms used for broad-area ecological studies.To minimize the impacts of spatial patterns and provide base-level comparisons to measurements from the scenario RealM and RealR,we randomized the pixels on the initial classi ed TM image but kept the proportion of each cover type unchanged. 
This created a random image with no spatial patterns.Corresponding to the RealM and RealR scenarios,we applied both majority and random rule-based aggregations on the randomized TM image.We did not nd any comparative diŒerence in any measurements for these two randomized map treatments.Therefore,only one of them was used,hereafter referred to as the Random scenario.For all scenarios,we measured cover type percentage area,AI,FD,and average patch size.4.Results4.1.Percentage cover typeFor the RealM scenario,the abundance(percentage cover)of maple,the most common and contagious cover type,increased with the level of aggregation(increas-ing cell size).Such increases appeared to be more rapid for the initial aggregation iterations,while maple abundance increased by about50%after six iterations when cell size reached210m.For the remaining26aggregation iterations,its abundance increased gradually by another12%( gure2(a)).Oak,the least common and most dispersed cover type,exhibited the opposite trend to maple.Its abundance decreased rapidly for the rst few iterations of aggregation.When aggregated to990-m reso-lution,only22%of the original oak remained on the landscape( gure2(a)).The abundance of water cover type maintained its level(5%)throughout the aggregation process( gure2(a)).The abundance of the mixed deciduous and coniferous cover type lost to maple but was compensated from other minor classes.Therefore,it also remained relatively consistent during the aggregation process( gure2(a)).These results are consistent with ndings reported elsewhere where majority rule-based aggregation was examined(e.g.Turner et al.1989,Moody and Woodcock1995,He et al.2000).For the RealR scenario,abundance for all cover types remained unchanged,with average variations less than1%( gure2(b)).This suggests that no distortion and information loss in the amount of percentage cover be found when aggregating the classi ed TM satellite imagery from its nest resolution(30m)to very coarse reso-lution(990m).As expected,results of cover type percentage areas for the Random scenario remained constant across all scales examined and were identical to that of the RealR scenario( gure2(c)).4.2.Aggregation indexBefore the aggregations,measurements of AI at the30-m map resolution for the RealM and RealR scenarios indicated that maple had the highest AI value(0.54), while oak had the lowest AI(0.35).Water had a higher AI(0.47)than that of the mixed deciduous and coniferous class(AI0.42).For the Random scenario,the AIE V ects of spatial aggregation on classi ed satellite imagery99Figure2.Cover type proportions of the classi ed TM satellite image aggregated from30-m to990-m map resolution under the scenarios of(a)the real landscape with the major-ity rule-based aggregation(RealM),(b)the real landscape with the random rule-based aggregation(RealR),and(c)the random landscape with random rule-based aggregation(Random).of each cover type was signi cantly lower than the corresponding ones under the RealM and RealR scenarios.AI s of maple(0.26),mixed deciduous and coniferous (0.17),water(0.05),and oak(0.04)were the lowest possible and exactly corresponded to the proportions of cover types(percentage cover/100).For the RealM scenario,AI s for the least common classes(water and oak) decreased and for the mid-and most common classes increased across all spatial scales examined( gure3(a)).In other words,aggregation based on the majority rule made the dominant classes more aggregated and minor class more disaggregated ( gure4).For the Random 
scenario,AI s of all classes measured using the correspond-ing map resolutions remained constant( gure3(c)).In these cases,scaling to a coarseH.S.He et al.100Figure3.Aggregation Index(AI)measured using corresponding map resolutions for the classi ed TM satellite image aggregated from30to990m map resolution under the scenarios of(a)the real landscape with the majority rule-based aggregation(RealM),(b)the real landscape with the random rule-based aggregation(RealR),and(c)therandom landscape with random rule-based aggregation(Random).resolution,the resulting landscape becomes more aggregated when measured using the ner input map resolution.However,the resulting landscape is still random after the aggregation and,therefore,AI measured at the resulting map resolution remained unchanged and re ected the cover type proportions.This was seen across the spatial scales examined( gure3(c)).For the RealR scenario,the random rule-based aggregations randomly assigned values to output maps.This caused disaggregation of the existing cover type patches and resulted in less aggregated landscape patterns.Our results showed that AI s decreased for all classes,but more rapidly for the highly aggregated class(water) than for other classes( gure3(b)).By990-m resolution,AI decreased by64%,71%, 47%,and36%for water,oak,mixed deciduous and coniferous,and maple,respect-E V ects of spatial aggregation on classi ed satellite imagery101H.S.He et al.102ively.The major decreases occurred during the initial iterations of aggregation.The rates of such decreases in AI became moderate as the level of aggregation increased.4.3.Fractal dimensionExcept for the most common class under the RealM scenario,FD measured across the range of resolutions decreased for all classes under both scenarios as the levels of aggregation increased.This is consistent with the results reported elsewhere (Milne1988,O’Neill et al.1991)suggesting that aggregations result in simpli ed spatial patterns( gures5(a–c)).In the RealM scenario,the majority rule-based aggregation ltered out most small patches of minor classes and those left were in simple shapes( gure5(a)).FD for maple remained relatively unchanged in contrast to decreases for all other classes for this scenario.This may be related to overallFigure5.Fractal dimensions(FD)measured using corresponding map resolutions for the classi ed TM satellite image aggregated from30to990m map resolution under the scenarios of(a)the real landscape with the majority rule-based aggregation(RealM),(b)the real landscape with the random rule-based aggregation(RealR),and(c)therandom landscape with random rule-based aggregation(Random).landscape con gurations that made FD of maple patches less variable than all other classes.For the RealR scenario,FD s of all four classes exhibited a near linear decreasing pattern with the increasing levels of aggregation( gure5(b)).However, the slopes of such decreases are slower than those under the RealM scenario.This suggests that the RealR scenario creates less distortion in FD than the RealM scenario.4.4.Average patch sizeFor all scenarios,average patch sizes of each class increased with the increasing levels of aggregation,as explained in the Methods section( gures6(a,b)).Patch size variations re ected by the coe cients of variation(cv)decreased( gures6(c,d)), indicating that aggregating to coarse resolutions decreased patch diversity.The average patch size coe cient of variation decreased more for the majority rule-based aggregation than for the random rule-based 
aggregation.For instance,in the RealMFigure6.Mean patch size measured using corresponding map resolutions for the classi ed TM satellite image aggregated from30-m to990-m map resolution under the scenarios of(a)the real landscape with the majority rule-based aggregation(RealM),(b)the real landscape with the random rule-based aggregation(RealR),and coe cient of variation of mean patch size(cv)for(c)the real landscape with the majority rule-based aggrega-tion(RealM),(c)the real landscape with the random rule-based aggregation(RealR).scenario,the average patch size of water was5.1ha at30-m resolution and its coe cient of variation was176.70( gure6).After the rst iteration of aggregation (at60-m resolution),the average patch size increased to5.8ha,but the coe cient of variation decreased dramatically to6.77.Such a decrease was much faster than the corresponding one(cv38.03at60-m resolution)under the RealR scenario.In general,the coe cient of variation diminished under increasing aggregation levels, with the RealR scenario maintaining higher cv values than the RealM scenario ( gures6(c,d)).To compare eŒects of aggregation on average patch sizes,we also calculated ratios of the average patch sizes among the four cover types(table1). The RealM scenario altered the ratios of all cover types.Average patch size of maple increased more rapidly against all other classes as the levels of aggregation increased. At30-m resolution,the ratio of water:maple is1.14,mixed:maple0.47,oak:maple 0.29;while at990-m resolution,these ratios decreased to0.10,0.21,and0.04,respect-ively.Water:maple and oak:maple decreased the most with the increasing levels of aggregation.It was obvious that the average patch sizes of less common classes such as oak and water increased more slowly against the medium and most common classes,with the ratios against the latter classes decreasing.The ratio of these two minor classes remained somewhat constant.For the RealR scenario,there are only minor changes in average patch size ratios among mixed:maple,oak:maple,and therefore,oak:mixed.The only large change was the average patch size of water,which decreased more rapidly than all other classes.Water/maple ratio started at1.14at30-m resolution and decreased to0.32 at990-m resolution(table1).However,the water/maple ratio under the RealR scenario was still higher than that under the RealM scenario.4.5.Map delity through aggregationThere is no doubt that information is lost during data aggregation.However, there is no single measurement of map delity,de ned in this case as the correspond-ence of an aggregated image to the source image.The value of information preserved during data aggregation obviously depends upon how the derived maps are used. 
For example,if land cover proportion is the only concern,random rule-based aggregation does not change map delity.A general evaluation of map delity can be done using correlation analysis between the original map and the aggregated maps( gure7).The results indicate that map delity decreased substantially with increasing level of aggregation.Maps derived from the majority rule-based aggregation had lower delity than those under random rule-based aggregation,especially at the early steps of data aggregation.At the later steps,both methods show similar levels of decreasing map delity.At 990-m resolution,correlation coe cient(R)reaches about0.33,indicating that about 67%of the original information is lost.5.DiscussionOur study indicates that diŒerent spatial aggregation techniques across a broad range of spatial scales(30–990m)lead to signi cantly diŒerent results in terms of cover type proportions and spatial patterns.Majority rule-based aggregations cause distortions of cover type percentage areas as has been previously found.The most common and contagious classes and the least common and dispersed classes are aŒected the most.By contrast,random rule-based aggregations do not produce such distortions,as proportions of various cover types remain relatively stable overTable1.Average patch sizes ratio and coe cients of variation(cv)of four cover types under the both majority and random rule-based aggregations.RealM RealR Resolution Water/Mixed/Oak/Water/Water/Oak/Water/Mixed/Oak/ (m)maple maple maple mixed oak mixed maple maple maple 30 1.140.470.29 2.42 3.870.62 1.140.470.29 600.170.290.080.57 1.990.290.870.420.23 900.410.390.16 1.04 2.480.420.730.410.22 1200.350.380.150.91 2.270.400.630.410.21 1500.320.390.150.81 2.140.380.570.410.22 1800.290.380.140.75 2.030.370.510.410.22 2100.270.390.130.70 2.100.330.490.420.22 2400.250.360.120.71 2.100.340.450.420.23 2700.240.350.120.68 2.070.330.440.410.23 3000.230.340.100.68 2.280.300.430.430.23 3300.240.360.110.66 2.150.310.420.430.23 3600.220.350.110.63 2.110.300.400.410.23 3900.200.330.090.61 2.310.260.410.450.24 4200.190.320.080.60 2.280.260.390.430.23 4500.170.290.080.60 2.240.270.400.440.24 4800.170.290.080.57 1.990.290.390.440.24 5100.170.310.090.55 2.000.280.370.430.23 5400.140.280.070.51 2.080.240.370.440.23 5700.150.270.070.57 2.200.260.380.460.23 6000.160.270.080.57 2.040.280.350.420.24 6300.130.270.060.49 2.130.230.360.460.24 6600.120.250.060.48 2.100.230.350.430.23 6900.130.250.050.51 2.420.210.330.410.22 7200.110.210.060.51 1.870.270.340.430.22 7500.110.240.060.48 2.050.240.370.480.27 7800.120.280.050.44 2.350.190.320.460.23 8100.120.230.050.52 2.600.200.370.470.25 8400.120.220.050.52 2.360.220.360.470.24 8700.090.190.040.46 2.050.220.320.460.23 9000.100.220.050.46 2.140.220.330.460.25 9300.100.230.040.42 2.190.190.320.470.23 9600.100.200.040.49 2.440.200.300.440.24 9900.100.210.040.47 2.350.200.320.450.24Figure7.Correlation analysis of maps derived from the majority rule-based aggregation (RealM)has lower delity than those under random rule-based aggregation(RealR).at the early steps of data aggregation.At the later steps,both methods do not diŒer much.。
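To make the comparison concrete, the sketch below implements the two categorical aggregation rules and one plausible reading of the correlation-based fidelity check; the paper does not spell out that computation exactly, so expanding the coarse map back to the source resolution by nearest neighbour before correlating is an assumption, and the class codes, map size, and single aggregation factor are illustrative.

```python
import numpy as np

def aggregate(grid, factor, rule="majority", rng=None):
    """Aggregate a categorical raster over non-overlapping factor x factor blocks."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = np.empty((grid.shape[0] // factor, grid.shape[1] // factor), dtype=grid.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = grid[i*factor:(i+1)*factor, j*factor:(j+1)*factor].ravel()
            if rule == "majority":
                vals, counts = np.unique(block, return_counts=True)
                out[i, j] = vals[np.argmax(counts)]   # most frequent class in the block
            else:                                     # random rule
                out[i, j] = rng.choice(block)         # one input cell chosen at random
    return out

def map_fidelity(original, aggregated, factor):
    """Correlation and cell-wise agreement between the source map and an aggregated map."""
    expanded = np.kron(aggregated, np.ones((factor, factor), dtype=aggregated.dtype))
    src = original[:expanded.shape[0], :expanded.shape[1]]   # crop to the common extent
    r = np.corrcoef(src.ravel(), expanded.ravel())[0, 1]
    return r, float(np.mean(src == expanded))

# Toy usage: a 4-class map aggregated one level (analogous to 30 m -> 60 m).
fine = np.random.default_rng(1).integers(0, 4, size=(120, 120))
for rule in ("majority", "random"):
    coarse = aggregate(fine, 2, rule=rule)
    print(rule, map_fidelity(fine, coarse, 2))
```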
Advances in Geosciences,4,17–22,2005 SRef-ID:1680-7359/adgeo/2005-4-17 European Geosciences Union©2005Author(s).This work is licensed under a Creative CommonsLicense.Advances in GeosciencesIncorporating level set methods in Geographical Information Systems(GIS)for land-surface process modelingD.PullarGeography Planning and Architecture,The University of Queensland,Brisbane QLD4072,Australia Received:1August2004–Revised:1November2004–Accepted:15November2004–Published:9August2005nd-surface processes include a broad class of models that operate at a landscape scale.Current modelling approaches tend to be specialised towards one type of pro-cess,yet it is the interaction of processes that is increasing seen as important to obtain a more integrated approach to land management.This paper presents a technique and a tool that may be applied generically to landscape processes. The technique tracks moving interfaces across landscapes for processes such as waterflow,biochemical diffusion,and plant dispersal.Its theoretical development applies a La-grangian approach to motion over a Eulerian grid space by tracking quantities across a landscape as an evolving front. An algorithm for this technique,called level set method,is implemented in a geographical information system(GIS).It fits with afield data model in GIS and is implemented as operators in map algebra.The paper describes an implemen-tation of the level set methods in a map algebra program-ming language,called MapScript,and gives example pro-gram scripts for applications in ecology and hydrology.1IntroductionOver the past decade there has been an explosion in the ap-plication of models to solve environmental issues.Many of these models are specific to one physical process and of-ten require expert knowledge to use.Increasingly generic modeling frameworks are being sought to provide analyti-cal tools to examine and resolve complex environmental and natural resource problems.These systems consider a vari-ety of land condition characteristics,interactions and driv-ing physical processes.Variables accounted for include cli-mate,topography,soils,geology,land cover,vegetation and hydro-geography(Moore et al.,1993).Physical interactions include processes for climatology,hydrology,topographic landsurface/sub-surfacefluxes and biological/ecological sys-Correspondence to:D.Pullar(d.pullar@.au)tems(Sklar and Costanza,1991).Progress has been made in linking model-specific systems with tools used by environ-mental managers,for instance geographical information sys-tems(GIS).While this approach,commonly referred to as loose coupling,provides a practical solution it still does not improve the scientific foundation of these models nor their integration with other models and related systems,such as decision support systems(Argent,2003).The alternative ap-proach is for tightly coupled systems which build functional-ity into a system or interface to domain libraries from which a user may build custom solutions using a macro language or program scripts.The approach supports integrated models through interface specifications which articulate the funda-mental assumptions and simplifications within these models. 
The problem is that there are no environmental modelling systems which are widely used by engineers and scientists that offer this level of interoperability,and the more com-monly used GIS systems do not currently support space and time representations and operations suitable for modelling environmental processes(Burrough,1998)(Sui and Magio, 1999).Providing a generic environmental modeling framework for practical environmental issues is challenging.It does not exist now despite an overwhelming demand because there are deep technical challenges to build integrated modeling frameworks in a scientifically rigorous manner.It is this chal-lenge this research addresses.1.1Background for ApproachThe paper describes a generic environmental modeling lan-guage integrated with a Geographical Information System (GIS)which supports spatial-temporal operators to model physical interactions occurring in two ways.The trivial case where interactions are isolated to a location,and the more common and complex case where interactions propa-gate spatially across landscape surfaces.The programming language has a strong theoretical and algorithmic basis.The-oretically,it assumes a Eulerian representation of state space,Fig.1.Shows a)a propagating interface parameterised by differ-ential equations,b)interface fronts have variable intensity and may expand or contract based onfield gradients and driving process. but propagates quantities across landscapes using Lagrangian equations of motion.In physics,a Lagrangian view focuses on how a quantity(water volume or particle)moves through space,whereas an Eulerian view focuses on a localfixed area of space and accounts for quantities moving through it.The benefit of this approach is that an Eulerian perspective is em-inently suited to representing the variation of environmen-tal phenomena across space,but it is difficult to conceptu-alise solutions for the equations of motion and has compu-tational drawbacks(Press et al.,1992).On the other hand, the Lagrangian view is often not favoured because it requires a global solution that makes it difficult to account for local variations,but has the advantage of solving equations of mo-tion in an intuitive and numerically direct way.The research will address this dilemma by adopting a novel approach from the image processing discipline that uses a Lagrangian ap-proach over an Eulerian grid.The approach,called level set methods,provides an efficient algorithm for modeling a natural advancing front in a host of settings(Sethian,1999). 
The reason the method works well over other approaches is that the advancing front is described by equations of motion (Lagrangian view),but computationally the front propagates over a vectorfield(Eulerian view).Hence,we have a very generic way to describe the motion of quantities,but can ex-plicitly solve their advancing properties locally as propagat-ing zones.The research work will adapt this technique for modeling the motion of environmental variables across time and space.Specifically,it will add new data models and op-erators to a geographical information system(GIS)for envi-ronmental modeling.This is considered to be a significant research imperative in spatial information science and tech-nology(Goodchild,2001).The main focus of this paper is to evaluate if the level set method(Sethian,1999)can:–provide a theoretically and empirically supportable methodology for modeling a range of integral landscape processes,–provide an algorithmic solution that is not sensitive to process timing,is computationally stable and efficient as compared to conventional explicit solutions to diffu-sive processes models,–be developed as part of a generic modelling language in GIS to express integrated models for natural resource and environmental problems?The outline for the paper is as follow.The next section will describe the theory for spatial-temporal processing us-ing level sets.Section3describes how this is implemented in a map algebra programming language.Two application examples are given–an ecological and a hydrological ex-ample–to demonstrate the use of operators for computing reactive-diffusive interactions in landscapes.Section4sum-marises the contribution of this research.2Theory2.1IntroductionLevel set methods(Sethian,1999)have been applied in a large collection of applications including,physics,chemistry,fluid dynamics,combustion,material science,fabrication of microelectronics,and computer vision.Level set methods compute an advancing interface using an Eulerian grid and the Lagrangian equations of motion.They are similar to cost distance modeling used in GIS(Burroughs and McDonnell, 1998)in that they compute the spread of a variable across space,but the motion is based upon partial differential equa-tions related to the physical process.The advancement of the interface is computed through time along a spatial gradient, and it may expand or contract in its extent.See Fig.1.2.2TheoryThe advantage of the level set method is that it models mo-tion along a state-space gradient.Level set methods start with the equation of motion,i.e.an advancing front with velocity F is characterised by an arrival surface T(x,y).Note that F is a velocityfield in a spatial sense.If F was constant this would result in an expanding series of circular fronts,but for different values in a velocityfield the front will have a more contorted appearance as shown in Fig.1b.The motion of thisinterface is always normal to the interface boundary,and its progress is regulated by several factors:F=f(L,G,I)(1)where L=local properties that determine the shape of advanc-ing front,G=global properties related to governing forces for its motion,I=independent properties that regulate and influ-ence the motion.If the advancing front is modeled strictly in terms of the movement of entity particles,then a straightfor-ward velocity equation describes its motion:|∇T|F=1given T0=0(2) where the arrival function T(x,y)is a travel cost surface,and T0is the initial position of the interface.Instead we use level sets to describe the interface 
as a complex function.The level set functionφis an evolving front consistent with the under-lying viscosity solution defined by partial differential equa-tions.This is expressed by the equation:φt+F|∇φ|=0givenφ(x,y,t=0)(3)whereφt is a complex interface function over time period 0..n,i.e.φ(x,y,t)=t0..tn,∇φis the spatial and temporal derivatives for viscosity equations.The Eulerian view over a spatial domain imposes a discretisation of space,i.e.the raster grid,which records changes in value z.Hence,the level set function becomesφ(x,y,z,t)to describe an evolv-ing surface over time.Further details are given in Sethian (1999)along with efficient algorithms.The next section de-scribes the integration of the level set methods with GIS.3Map algebra modelling3.1Map algebraSpatial models are written in a map algebra programming language.Map algebra is a function-oriented language that operates on four implicit spatial data types:point,neighbour-hood,zonal and whole landscape surfaces.Surfaces are typ-ically represented as a discrete raster where a point is a cell, a neighbourhood is a kernel centred on a cell,and zones are groups of mon examples of raster data include ter-rain models,categorical land cover maps,and scalar temper-ature surfaces.Map algebra is used to program many types of landscape models ranging from land suitability models to mineral exploration in the geosciences(Burrough and Mc-Donnell,1998;Bonham-Carter,1994).The syntax for map algebra follows a mathematical style with statements expressed as equations.These equations use operators to manipulate spatial data types for point and neighbourhoods.Expressions that manipulate a raster sur-face may use a global operation or alternatively iterate over the cells in a raster.For instance the GRID map algebra (Gao et al.,1993)defines an iteration construct,called do-cell,to apply equations on a cell-by-cell basis.This is triv-ially performed on columns and rows in a clockwork manner. However,for environmental phenomena there aresituations Fig.2.Spatial processing orders for raster.where the order of computations has a special significance. For instance,processes that involve spreading or transport acting along environmental gradients within the landscape. 
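As a numerical illustration of Eq. (3), the sketch below evolves φ_t + F|∇φ| = 0 on a raster with a first-order upwind (Godunov) gradient of the kind described by Sethian (1999); the constant speed field, grid, periodic boundaries, and time step are illustrative assumptions, and real level-set codes add narrow-band and reinitialization machinery omitted here.

```python
import numpy as np

def evolve(phi, F, dt, dx=1.0, steps=50):
    """Explicit upwind update for phi_t + F |grad phi| = 0 (outward motion of the front for F > 0)."""
    for _ in range(steps):
        # One-sided differences (np.roll gives periodic boundaries, fine for this toy).
        dxm = (phi - np.roll(phi,  1, axis=1)) / dx   # backward x
        dxp = (np.roll(phi, -1, axis=1) - phi) / dx   # forward  x
        dym = (phi - np.roll(phi,  1, axis=0)) / dx   # backward y
        dyp = (np.roll(phi, -1, axis=0) - phi) / dx   # forward  y
        # Godunov upwind approximation of |grad phi| for an expanding front (F >= 0).
        grad = np.sqrt(np.maximum(dxm, 0)**2 + np.minimum(dxp, 0)**2 +
                       np.maximum(dym, 0)**2 + np.minimum(dyp, 0)**2)
        phi = phi - dt * F * grad
    return phi

# Toy usage: a circular front (zero level set of a signed distance function) expanding at unit speed.
n = 100
y, x = np.mgrid[0:n, 0:n]
phi0 = np.sqrt((x - n/2)**2 + (y - n/2)**2) - 10.0   # negative inside a radius-10 circle
phi  = evolve(phi0, F=1.0, dt=0.5, steps=40)          # the front has moved outward roughly 20 cells
```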
Therefore special control needs to be exercised on the order of execution.Burrough(1998)describes two extra control mechanisms for diffusion and directed topology.Figure2 shows the three principle types of processing orders,and they are:–row scan order governed by the clockwork lattice struc-ture,–spread order governed by the spreading or scattering ofa material from a more concentrated region,–flow order governed by advection which is the transport of a material due to velocity.Our implementation of map algebra,called MapScript (Pullar,2001),includes a special iteration construct that sup-ports these processing orders.MapScript is a lightweight lan-guage for processing raster-based GIS data using map alge-bra.The language parser and engine are built as a software component to interoperate with the IDRISI GIS(Eastman, 1997).MapScript is built in C++with a class hierarchy based upon a value type.Variants for value types include numeri-cal,boolean,template,cells,or a grid.MapScript supports combinations of these data types within equations with basic arithmetic and relational comparison operators.Algebra op-erations on templates typically result in an aggregate value assigned to a cell(Pullar,2001);this is similar to the con-volution integral in image algebras(Ritter et al.,1990).The language supports iteration to execute a block of statements in three ways:a)docell construct to process raster in a row scan order,b)dospread construct to process raster in a spreadwhile(time<100)dospreadpop=pop+(diffuse(kernel*pop))pop=pop+(r*pop*dt*(1-(pop/K)) enddoendwhere the diffusive constant is stored in thekernel:Fig.3.Map algebra script and convolution kernel for population dispersion.The variable pop is a raster,r,K and D are constants, dt is the model time step,and the kernel is a3×3template.It is assumed a time step is defined and the script is run in a simulation. Thefirst line contained in the nested cell processing construct(i.e. 
dospread)is the diffusive term and the second line is the population growth term.order,c)doflow to process raster byflow order.Examples are given in subsequent sections.Process models will also involve a timing loop which may be handled as a general while(<condition>)..end construct in MapScript where the condition expression includes a system time variable.This time variable is used in a specific fashion along with a system time step by certain operators,namely diffuse()andfluxflow() described in the next section,to model diffusion and advec-tion as a time evolving front.The evolving front represents quantities such as vegetation growth or surface runoff.3.2Ecological exampleThis section presents an ecological example based upon plant dispersal in a landscape.The population of a species follows a controlled growth rate and at the same time spreads across landscapes.The theory of the rate of spread of an organism is given in Tilman and Kareiva(1997).The area occupied by a species grows log-linear with time.This may be modelled by coupling a spatial diffusion term with an exponential pop-ulation growth term;the combination produces the familiar reaction-diffusion model.A simple growth population model is used where the reac-tion term considers one population controlled by births and mortalities is:dN dt =r·N1−NK(4)where N is the size of the population,r is the rate of change of population given in terms of the difference between birth and mortality rates,and K is the carrying capacity.Further dis-cussion of population models can be found in Jrgensen and Bendoricchio(2001).The diffusive term spreads a quantity through space at a specified rate:dudt=Dd2udx2(5) where u is the quantity which in our case is population size, and D is the diffusive coefficient.The model is operated as a coupled computation.Over a discretized space,or raster,the diffusive term is estimated using a numerical scheme(Press et al.,1992).The distance over which diffusion takes place in time step dt is minimally constrained by the raster resolution.For a stable computa-tional process the following condition must be satisfied:2Ddtdx2≤1(6) This basically states that to account for the diffusive pro-cess,the term2D·dx is less than the velocity of the advancing front.This would not be difficult to compute if D is constant, but is problematic if D is variable with respect to landscape conditions.This problem may be overcome by progressing along a diffusive front over the discrete raster based upon distance rather than being constrained by the cell resolution.The pro-cessing and diffusive operator is implemented in a map al-gebra programming language.The code fragment in Fig.3 shows a map algebra script for a single time step for the cou-pled reactive-diffusion model for population growth.The operator of interest in the script shown in Fig.3is the diffuse operator.It is assumed that the script is run with a given time step.The operator uses a system time step which is computed to balance the effect of process errors with effi-cient computation.With knowledge of the time step the it-erative construct applies an appropriate distance propagation such that the condition in Eq.(3)is not violated.The level set algorithm(Sethian,1999)is used to do this in a stable and accurate way.As a diffusive front propagates through the raster,a cost distance kernel assigns the proper time to each raster cell.The time assigned to the cell corresponds to the minimal cost it takes to reach that cell.Hence cell pro-cessing is controlled by 
3.3 Hydrological example

This section presents a hydrological example based upon surface dispersal of excess rainfall across the terrain. The movement of water is described by the continuity equation:

    ∂h/∂t = e_t − ∇·q_t    (7)

where h is the water depth (m), e_t is the rainfall excess (m/s), and q_t is the discharge (m/hr) at time t. Discharge is assumed to have steady uniform flow conditions, and is determined by Manning's equation:

    q_t = v_t h_t = (1/n) h_t^(5/3) s^(1/2)    (8)

where v_t is the flow velocity (m/s), h_t is the water depth, and s is the surface slope (m/m). An explicit method of calculation is used to compute velocity and depth over raster cells, and the equations are solved at each time step. A conservative form of a finite difference method solves for q_t in Eq. (7). To simplify the discussion we describe quasi-one-dimensional equations for the flow problem. The actual numerical computations are normally performed on an Eulerian grid (Julien et al., 1995).

Finite-element approximations are made to solve the above partial differential equations for the one-dimensional case of flow along a strip of unit width. This leads to a coupled model with one term to maintain the continuity of flow and another term to compute the flow. In addition, all calculations must progress from an uphill cell to the downslope cell. This is implemented in map algebra by an iteration construct, called doflow, which processes a raster in flow order. Flow distance is measured in cell size Δx per unit length. One strip is processed during a time interval Δt (Fig. 4, which shows the computation of the current cell at (x + Δx, t, t + Δt)). The conservative solution for the continuity term, using a first-order approximation for Eq. (7), is derived as:

    h(x+Δx, t+Δt) = h(x+Δx, t) − [q(x+Δx, t) − q(x, t)] Δt/Δx    (9)

where the inflow q(x, t) and the outflow q(x+Δx, t) in the second term are calculated using Eq. (8) as:

    q(x, t) = v(x, t) · h_t    (10)

The calculations approximate discharge from the previous time interval. Discharge is dynamically determined within the continuity equation by water depth. The rate of change in the state variables needs to satisfy the stability condition v·Δt/Δx ≤ 1 to maintain numerical stability. The physical interpretation of this is that a finite volume of water would flow across and out of a cell within the time step Δt. Typically the cell resolution is fixed for the raster, and adjusting the time step requires restarting the simulation cycle.

    while (time < 120)
      doflow(dem)
        fvel = 1/n * pow(depth, m) * sqrt(grade)
        depth = depth + (depth * fluxflow(fvel))
      enddo
    end

Fig. 5. Map algebra script for excess rainfall flow computed over a 120-minute event. The variables depth and grade are rasters, fvel is the flow velocity, and n and m are constants in Manning's equation. It is assumed a time step is defined and the script is run in a simulation. The first line in the nested cell processing construct (i.e. doflow) computes the flow velocity and the second line computes the change in depth from the previous value plus any net change (inflow − outflow) due to velocity flux across the cell.
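A minimal one-dimensional NumPy sketch of the explicit scheme in Eqs. (8)-(10) is given below. It illustrates the fixed-time-step approach whose limitations are discussed next, not the adaptive fluxflow operator, and the slope, roughness and rainfall values are assumptions made for the example only.

    import numpy as np

    # Illustrative parameters (assumptions)
    ncells, dx = 200, 10.0         # strip of cells, cell size (m)
    n_manning = 0.03               # Manning roughness
    slope = np.full(ncells, 0.01)  # surface slope (m/m)
    rain = 1e-5                    # rainfall excess (m/s)
    dt, t_end = 1.0, 120 * 60.0    # time step (s), 120-minute event

    depth = np.zeros(ncells)       # water depth h (m)

    t = 0.0
    while t < t_end:
        # Manning's equation, Eq. (8): q = v*h = (1/n) h^(5/3) s^(1/2)
        q = (1.0 / n_manning) * depth ** (5.0 / 3.0) * np.sqrt(slope)
        v = np.where(depth > 0, q / np.maximum(depth, 1e-12), 0.0)
        assert v.max() * dt / dx <= 1.0, "reduce dt"   # stability, v*dt/dx <= 1
        # Continuity update, Eq. (9): rainfall plus inflow from the uphill cell
        inflow = np.concatenate(([0.0], q[:-1]))
        depth = depth + rain * dt - (q - inflow) * dt / dx
        depth = np.maximum(depth, 0.0)
        t += dt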
Flow velocities change dramatically over the course of a storm event, and it is problematic to set an appropriate time step that is both efficient and yields a stable result.

The hydrological model has been implemented in a map algebra programming language (Pullar, 2003). To overcome the problem mentioned above we have added high-level operators to compute the flow as an advancing front over a landscape. The time step advances this front adaptively across the landscape based upon the flow velocity. The level set algorithm (Sethian, 1999) is used to do this in a stable and accurate way. The map algebra script is given in Fig. 5. The important operator is the fluxflow operator. It computes the advancing front for water flow across a DEM by hydrological principles, and computes the local drainage flux rate for each cell. The flux rate is used to compute the net change in a cell in terms of flow depth over an adaptive time step.

4 Conclusions

The paper has described an approach to extend the functionality of tightly coupled environmental models in GIS (Argent, 2004). A long-standing criticism of GIS has been its inability to handle dynamic spatial models. Other researchers have also addressed this issue (Burrough, 1998). The contribution of this paper is to describe how level set methods are: i) an appropriate scientific basis, and ii) able to perform stable time-space computations for modelling landscape processes. The level set method provides the following benefits:
– it more directly models the motion of spatial phenomena and may handle both expanding and contracting interfaces,
– it is based upon differential equations related to the spatial dynamics of physical processes.
Despite the potential for using level set methods in GIS and land-surface process modelling, there are no commercial or research systems that use this method. Commercial systems such as GRID (Gao et al., 1993), and research systems such as PCRaster (Wesseling et al., 1996), offer flexible and powerful map algebra programming languages. But operations that involve reaction-diffusive processing are specific to one context, such as groundwater flow. We believe the level set method offers a more generic approach that allows a user to program flow and diffusive landscape processes for a variety of application contexts. We have shown that it provides an appropriate theoretical underpinning and may be efficiently implemented in a GIS. We have demonstrated its application for two landscape processes – albeit relatively simple examples – but these may be extended to deal with more complex and dynamic circumstances. The validation for improved environmental modelling tools ultimately rests in their uptake and usage by scientists and engineers. The tool may be accessed from the web site .au/projects/mapscript/ (version with enhancements available April 2005) for use with IDRISI GIS (Eastman, 1997) and in the future with ArcGIS. It is hoped that a larger community of users will make use of the methodology and implementation for a variety of environmental modelling applications.

Edited by: P. Krause, S. Kralisch, and W. Flügel
Reviewed by: anonymous referees

References

Argent, R.: An Overview of Model Integration for Environmental Applications, Environmental Modelling and Software, 19, 219–234, 2004.
Bonham-Carter, G. F.: Geographic Information Systems for Geoscientists, Elsevier Science Inc., New York, 1994.
Burrough, P. A.: Dynamic Modelling and Geocomputation, in: Geocomputation: A Primer, edited by: Longley, P. A., et al., Wiley, England, 165–191, 1998.
Burrough, P. A. and McDonnell, R.: Principles of Geographic Information Systems, Oxford University Press, New York, 1998.
Gao, P., Zhan, C., and Menon, S.: An Overview of Cell-Based Modeling with GIS, in: Environmental Modeling with GIS, edited by: Goodchild, M. F., et al., Oxford University Press, 325–331, 1993.
Goodchild, M.: A Geographer Looks at Spatial Information Theory, in: COSIT – Spatial Information Theory, edited by: Goos, G., Hertmanis, J., and van Leeuwen, J., LNCS 2205, 1–13, 2001.
Jørgensen, S. and Bendoricchio, G.: Fundamentals of Ecological Modelling, Elsevier, New York, 2001.
Julien, P. Y., Saghafian, B., and Ogden, F.: Raster-Based Hydrologic Modelling of Spatially-Varied Surface Runoff, Water Resources Bulletin, 31(3), 523–536, 1995.
Moore, I. D., Turner, A., Wilson, J., Jenson, S., and Band, L.: GIS and Land-Surface-Subsurface Process Modeling, in: Environmental Modeling with GIS, edited by: Goodchild, M. F., et al., Oxford University Press, New York, 1993.
Press, W., Flannery, B., Teukolsky, S., and Vetterling, W.: Numerical Recipes in C: The Art of Scientific Computing, 2nd Ed., Cambridge University Press, Cambridge, 1992.
Pullar, D.: MapScript: A Map Algebra Programming Language Incorporating Neighborhood Analysis, GeoInformatica, 5(2), 145–163, 2001.
Pullar, D.: Simulation Modelling Applied To Runoff Modelling Using MapScript, Transactions in GIS, 7(2), 267–283, 2003.
Ritter, G., Wilson, J., and Davidson, J.: Image Algebra: An Overview, Computer Vision, Graphics, and Image Processing, 4, 297–331, 1990.
Sethian, J. A.: Level Set Methods and Fast Marching Methods, Cambridge University Press, Cambridge, 1999.
Sklar, F. H. and Costanza, R.: The Development of Dynamic Spatial Models for Landscape Ecology: A Review and Progress, in: Quantitative Methods in Ecology, Springer-Verlag, New York, 239–288, 1991.
Sui, D. and Maggio, R.: Integrating GIS with Hydrological Modeling: Practices, Problems, and Prospects, Computers, Environment and Urban Systems, 23(1), 33–51, 1999.
Tilman, D. and Kareiva, P.: Spatial Ecology: The Role of Space in Population Dynamics and Interspecific Interactions, Princeton University Press, Princeton, New Jersey, USA, 1997.
Wesseling, C. G., Karssenberg, D., Burrough, P. A., and van Deursen, W. P.: Integrating Dynamic Environmental Models in GIS: The Development of a Dynamic Modelling Language, Transactions in GIS, 1(1), 40–48, 1996.
(StOMP) Sparse Solution of Underdetermined Linear Equations by Stagewise Orthogonal Matching Pursuit
Sparse Solution of Underdetermined Linear Equationsby Stagewise Orthogonal Matching PursuitDavid L.Donoho 1,Yaakov Tsaig 2,Iddo Drori 1,Jean-Luc Starck 3March 2006AbstractFinding the sparsest solution to underdetermined systems of linear equations y =Φx is NP-hard in general.We show here that for systems with ‘typical’/‘random’Φ,a good approximation to the sparsest solution is obtained by applying a fixed number of standard operations from linear algebra.Our proposal,Stagewise Orthogonal Matching Pursuit (StOMP),successively transforms the signal into a negligible residual.Starting with initial residual r 0=y ,at the s -th stage it forms the ‘matched filter’ΦT r s −1,identifies all coordinates with amplitudes exceeding a specially-chosen threshold,solves a least-squares problem using the selected coordinates,and subtracts the least-squares fit,producing a new residual.After a fixed number of stages (e.g.10),it stops.In contrast to Orthogonal Matching Pursuit (OMP),many coefficients can enter the model at each stage in StOMP while only one enters per stage in OMP;and StOMP takes a fixed number of stages (e.g.10),while OMP can take many (e.g.n ).StOMP runs much faster than competing proposals for sparse solutions,such as 1minimization and OMP,and so is attractive for solving large-scale problems.We use phase diagrams to compare algorithm performance.The problem of recovering a k -sparse vector x 0from (y,Φ)where Φis random n ×N and y =Φx 0is represented by a point (n/N,k/n )in this diagram;here the interesting range is k <n <N .For n large,StOMP correctly recovers (an approximation to)the sparsest solution of y =Φx over a region of the sparsity/indeterminacy plane comparable to the region where 1minimization is successful.In fact,StOMP outperforms both 1minimization and OMP for extremely underdetermined problems.We rigorously derive a conditioned Gaussian distribution for the matched filtering coefficients at each stage of the procedure and rigorously establish a large-system limit for the performance variables of StOMP.We precisely calculate large-sample phase transitions;these provide asymptot-ically precise limits on the number of samples needed for approximate recovery of a sparse vector by StOMP.We give numerical examples showing that StOMP rapidly and reliably finds sparse solutions in compressed sensing,decoding of error-correcting codes,and overcomplete representation.Keywords:compressed sensing,decoding error-correcting codes,sparse overcomplete representation.phase transition,large-system limit.random matrix theory.Gaussian approximation. 
1minimization.stepwise regression.thresholding,false discovery rate,false alarm rate.MIMO channel,mutual access interference,successive interference cancellation.iterative decoding.Acknowledgements This work was supported by grants from NIH,ONR-MURI,a DARPA BAA,and NSF DMS 00-77261,DMS 01-40698(FRG)and DMS 05-05303.1:Department of Statistics,Stanford University,Stanford CA,943052:Institute for Computational Mathematics in Engineering,Stanford University,Stanford CA,943053:DAPNIA/SEDI-SAP,Service d’Astrophysique,Centre Europeen d’Astronomie/Saclay,F-91191Gif-sur-Yvette Cedex France.欠定的可以忽略的渐近的1IntroductionThe possibility of exploiting sparsity in signal processing is attracting growing attention.Over the years, several applications have been found where signals of interest have sparse representations and exploiting this sparsity offers striking benefits;see for example[11,28,26,25,7].At the ICASSP2005conference a special session addressed the theme of exploiting sparsity,and a recent international workshop,SPARS05, was largely devoted to this topic.Very recently,considerable attention has focused on the following Sparse Solutions Problem(SSP). We are given an n×N matrixΦwhich is in some sense‘random’,for example a matrix with iid Gaussian entries.We are also given an n-vector y and we know that y=Φx0where x0is an unknown sparse vector.We wish to recover x0;however,crucially,n<N,the system of equations is underdetermined and so of course,this is not a properly-stated problem in linear algebra.Nevertheless,sparsity of x0is a powerful property that sometimes allows unique solutions.Applications areas for which this model is relevant include:App1:Compressed Sensing.x0represents the coefficients of a signal or image in a known basis which happens to sparsely represent that signal or image.Φencodes a measurement operator,i.e.an operator yielding linear combinations of the underlying object.Here n<N means that we collect fewer data than unknowns.Despite the indeterminacy,sparsity of x0allows for accurate recon-struction of the object from what would naively seem to be‘too few samples’[17,8,48].App2:Error rmation is transmitted in a coded block in which a small fraction of the entries may be corrupted.From the received data,one constructs a system y=Φx0;here x0 represents the values of errors which must be identifed/corrected,y represents(generalized)check sums,andΦrepresents a generalized checksum operator.It is assumed that the number of errors is smaller than a threshold,and so x0is sparse.This sparsity allows to perfectly correct the gross errors[9,48,28].App3:Sparse Overcomplete Representation.x0represents the synthesis coefficients of a signal y,which is assumed to be sparsely represented from terms in an overcomplete expansion;those terms are the columns ofΦ.The sparsity allows to recover a unique representation using only a few terms, despite the fact that representation is in general nonunique[43,11,21,20,50,51].In these applications,several algorithms are available to pursue sparse solutions;in some cases attractive theoretical results are known,guaranteeing that the solutions found are the sparsest possible solutions. 
For example,consider the algorithm of 1minimization,whichfinds the solution to y=Φx having minimal 1norm.Also called Basis Pursuit(BP)[11],this method enjoys some particularly striking theoretical properties,such as rigorous proofs of exact reconstruction under seemingly quite general circumstances[21,35,32,7,16,8,17,18]Unfortunately,some of the most powerful theoretical results are associated with fairly heavy com-putationally burdens.The research reported here began when,in applying the theory of compressed sensing to NMR spectroscopy,we found that a straightforward application of the 1minimization ideas in[17,58]required several days solution time per(multidimensional)spectrum.This seemed prohibitive computational expense to us,even though computer time is relatively less precious than spectrometer time.In fact,numerous researchers have claimed that general-purpose 1minimization is much too slow for large-scale applications.Some have advocated a heuristic approach,Orthogonal Matching Pursuit (OMP),(also called greedy approximation and stepwise regression in otherfields)[43,52,53,55,54], which though often effective in empirical work,does not offer the strong theoretical guarantees that attach to 1minimization.(For other heuristic approaches,see[50,51,29].)In this paper we describe Stagewise Orthogonal Matching Pursuit(StOMP),a method for approx-imate sparse solution of underdetermined systems with the property either thatΦis‘random’or that the nonzeros in x0are randomly located,or both.StOMP is significantly faster than the earlier methods BP and OMP on large-scale problems with sparse solutions.Moreover,StOMP permits a theoretical analysis showing that StOMP is similarly succcessful to BP atfinding sparse solutions.Our analysis uses the notion of comparison of phase transitions as a performance metric.We con-sider the phase diagram,a2D graphic with coordinates measuring the relative sparsity of x0(numberof nonzeros in x0/number of rows inΦ),as well as the indeterminacy of the system y=Φx(number of rows inΦ/number of columns inΦ).StOMP’s large-n performance exhibits two phases(success/failure) in this diagram,as does the performance of BP.The“success phase”(the region in the phase diagram where StOMP successfullyfinds sparse solutions)is large and comparable in size to the success phase for 1minimization.In a sense StOMP is more effective atfinding sparse solutions to large extremely under-determined problems than either 1minimization or OMP;its phase transition boundary is even higher at extreme sparsity than the other methods.Moreover,StOMP takes a few seconds to solve problems that may require days for 1solution.As a result StOMP is well suited to large-scale applications.Indeed we give several explicitly worked-out examples of realistic size illustrating the performance benefits of this approach.Our analysis suggests the slogannoiseless underdetermined problems behave like noisy well-determined problems,i.e.coping with incompleteness of the measurement data is(for‘randomΦ’)similar to coping with Gaus-sian noise in complete measurements.At each StOMP stage,the usual set of matchedfilter coefficients is a mixture of‘noise’caused by cross-talk(non-orthogonality)and true signal;setting an appropriate threshold,we can subtract identified signal,causing a reduction in cross-talk at the next iteration.This is more than a slogan;we develop a theoretical framework for rigorous asymptotic analysis.Theorems 1-3below allow us to track explicitly the successful capture of signal,and the 
reduction in cross-talk, stage by stage, rigorously establishing (asymptotic) success below phase transition, together with the failure that occurs above phase transition. The theory agrees with empirical finite-n results.

Our paper is organized as follows. Section 2 presents background on the sparse solutions problem; Section 3 introduces the StOMP algorithm and documents its favorable performance; Section 4 develops a performance measurement approach based on the phase diagram and phase transition. Section 5 analyzes the computational complexity of the algorithm. Section 6 develops an analytic large-system limit for predicting phase transitions which agree with empirical performance characteristics of StOMP. Section 7 develops the Gaussian noise viewpoint which justifies our thresholding rules. Section 8 describes the performance of StOMP under variations [adding noise, of distribution of nonzero coefficients, of matrix ensemble] and documents the good performance of StOMP under all these variations. Section 9 presents computational examples in applications App1–App3 that document the success of the method in simulated model problems. Section 10 describes the available software package which reproduces the results in this paper, and Section 11 discusses the relationship of our results to related ideas in multiuser detection theory and to previous work on the sparse solutions problem.

2 Sparse Solution Preliminaries

Recall the Sparse Solutions Problem (SSP) mentioned in the introduction. In the SSP, an unknown vector x0 ∈ R^N is of interest; it is assumed sparse, which is to say that the number k of nonzeros is much smaller than N. We have the linear measurements y = Φx0 where Φ is a known n by N matrix, and wish to recover x0.

Of course, if Φ were a nonsingular square matrix, with n = N, we could easily recover x0 from y; but we are interested in the case where n < N. Elementary linear algebra tells us that x0 is then not uniquely recoverable from y by linear algebraic means, as the equation y = Φx may have many solutions. However, we are seeking a sparse solution, and for certain matrices Φ, sparsity will prove a powerful constraint. Some of the rapidly accumulating literature documenting this phenomenon includes [21, 20, 32, 55, 56, 50, 51, 8, 18, 16, 57, 58, 48].

For now, we consider a specific collection of matrices where sparsity proves valuable. Until we say otherwise, let Φ be a random matrix taken from the Uniform Spherical Ensemble (USE); the columns of Φ are iid points on the unit sphere S^(n−1) [16, 17]. Later, several other ensembles will be introduced.

3 Stagewise Orthogonal Matching Pursuit

StOMP aims to achieve an approximate solution to y = Φx0 where Φ comes from the USE and x0 is sparse. In this section, we describe its basic ingredients. In later sections we document and analyse its performance.

Figure 1: Schematic representation of the StOMP algorithm (matched filter Φ^T r_{s−1}, thresholding, projection (Φ_{I_s}^T Φ_{I_s})^{−1} Φ_{I_s}^T y, interference construction Φ x_s).

3.1 The Procedure

StOMP operates in S stages, building up a sequence of approximations x_0, x_1, ... by removing detected structure from a sequence of residual vectors r_1, r_2, .... Figure 1 gives a diagrammatic representation.

StOMP starts with the initial 'solution' x_0 = 0 and initial residual r_0 = y. The stage counter s starts at s = 1. The algorithm also maintains a sequence of estimates I_1, ..., I_s of the locations of the nonzeros in x0. The s-th stage applies matched filtering to the current residual, getting a vector of residual correlations

    c_s = Φ^T r_{s−1},

which we think of as containing a small number of significant nonzeros in a vector disturbed by Gaussian noise in each entry. The procedure next performs hard thresholding to find the significant nonzeros; the thresholds are specially chosen based on the assumption of Gaussianity [see below]. Thresholding yields a small set J_s of "large" coordinates:

    J_s = {j : |c_s(j)| > t_s σ_s};

here σ_s is a formal noise level and t_s is a threshold parameter. We merge the subset of newly selected coordinates with the previous support estimate, thereby updating the estimate:

    I_s = I_{s−1} ∪ J_s.

We then project the vector y on the columns of Φ belonging to the enlarged support. Letting Φ_I denote the n × |I| matrix with columns chosen using index set I, we have the new approximation x_s supported in I_s with coefficients given by

    (x_s)_{I_s} = (Φ_{I_s}^T Φ_{I_s})^{−1} Φ_{I_s}^T y.

The updated residual is

    r_s = y − Φ x_s.

We check a stopping condition and, if it is not yet time to stop, we set s := s + 1 and go to the next stage of the procedure. If it is time to stop, we set x̂_S = x_s as the final output of the procedure.

Remarks:

1. The procedure resembles Orthogonal Matching Pursuit (known to statisticians as Forward Stepwise Regression). In fact the two would give identical results if S were equal to n and if, by coincidence, the threshold t_s were set in such a way that a single new term were obtained in J_s at each stage.

2. The thresholding strategy used in StOMP (to be described below) aims to have numerous terms enter at each stage, and aims to have a fixed number of stages. Hence the results will be different from OMP.

3. The formal noise level is σ_s = ‖r_s‖_2/√n, and typically the threshold parameter takes values in the range 2 ≤ t_s ≤ 3.

4. There are strong connections to: stagewise/stepwise regression in statistical model building; successive interference cancellation multiuser detectors in digital communications; and iterative decoders in error-control coding. See the discussion in Section 11.

Our recommended choice of S (10) and our recommended threshold-setting procedures (see Section 3.4 below) have been designed so that when x0 is sufficiently sparse, the following two conditions are likely to hold upon termination:
• All nonzeros in x0 are selected in I_S.
• I_S has no more than n entries.
The next lemma motivates this design criterion. Recall that Φ is sampled from the USE and so the columns of Φ are in general position. The following is proved in Appendix A.

Lemma 3.1 Let the columns of Φ be in general position. Let I_S denote the support of x̂_S. Suppose that the support I_0 of x0 is a subset of I_S. Suppose in addition that #I_S ≤ n. Then we have perfect recovery:

    x̂_S = x0.    (3.1)

3.2 An Example

We give a simple example showing that the procedure works in a special case. We generated a coefficient vector x0 with k = 32 nonzeros, having amplitudes uniformly distributed on [0, 1]. We sampled a matrix Φ at random from the USE with n = 256, N = 1024, and computed a linear measurement vector y = Φx0. Thus the problem of recovering x0 given y is 1:4 underdetermined (i.e. δ = n/N = .25), with underlying sparsity measure ρ = k/n = .125. To this SSP, we applied StOMP coupled with the CFAR threshold selection rule to be discussed below. The results are illustrated in Figure 2.

Figure 2: Progression of the StOMP algorithm. Panels (a), (d), (g): successive matched filtering outputs c_1, c_2, c_3; Panels (b), (e), (h): successive thresholding results; Panels (c), (f), (i): successive partial solutions. In this example, k = 32, n = 256, N = 1024.

Panels (a)-(i) depict each matched filtering output, its hard thresholding and the evolving approximation. As can be seen, after 3 stages a result is obtained which is quite sparse and, as the final panel shows, matches well the object x0 which truly generated the data. In fact, the error at the end of the third stage measures ‖x̂_3 − x0‖_2/‖x0‖_2 = 0.022, i.e. a mere 3 stages were required to achieve an accuracy of 2 decimal digits.
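The stages described in Sect. 3.1 map directly onto a few lines of linear algebra. The following is a minimal NumPy sketch of that iteration, written from the description above rather than taken from the authors' software: the fixed threshold parameter t = 2.5, the simple stopping rule, and the Gaussian-column stand-in for the USE are illustrative assumptions, and the CFAR/CFDR threshold rules of Sect. 3.4 are not implemented.

    import numpy as np

    def stomp(Phi, y, n_stages=10, t=2.5):
        # Stagewise OMP sketch: matched filter, threshold, project, subtract.
        n, N = Phi.shape
        x = np.zeros(N)
        residual = y.copy()
        support = np.array([], dtype=int)
        for _ in range(n_stages):
            c = Phi.T @ residual                              # matched filter
            sigma = np.linalg.norm(residual) / np.sqrt(n)     # formal noise level
            new = np.flatnonzero(np.abs(c) > t * sigma)       # hard threshold
            support = np.union1d(support, new)
            if support.size == 0 or support.size > n:
                break
            coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
            x = np.zeros(N)
            x[support] = coef                                 # LS fit on support
            residual = y - Phi @ x                            # updated residual
            if np.linalg.norm(residual) < 1e-12 * np.linalg.norm(y):
                break
        return x

    # Toy instance mirroring the example above (k = 32, n = 256, N = 1024)
    rng = np.random.default_rng(0)
    n, N, k = 256, 1024, 32
    Phi = rng.standard_normal((n, N))
    Phi /= np.linalg.norm(Phi, axis=0)        # columns on the unit sphere
    x0 = np.zeros(N)
    x0[rng.choice(N, k, replace=False)] = rng.uniform(0, 1, k)
    x_hat = stomp(Phi, Phi @ x0)
    print(np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))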
3.3 Approximate Gaussianity of Residual MAI

At the heart of our procedure are two thresholding schemes often used in Gaussian noise removal. (N.B. at this point we assume there is no noise in y!) To explain the relevance of Gaussian 'noise' concepts, note that at stage 1 the algorithm is computing x̃ = Φ^T y; this is essentially the usual matched filter estimate of x0. If y = Φx0 and x0 vanishes except in one coordinate, the matched filter output x̃ equals x0 perfectly. Hence z = x̃ − x0 is a measure of the disturbance to exact reconstruction caused by multiple nonzeros in x0. The same notion arises in digital communications, where it is called Multiple-Access Interference (MAI) [60]. Perhaps surprisingly (because there is no noise in the problem), the MAI in our setting typically has a Gaussian behavior. More specifically, if Φ is a matrix from the USE and if n and N are both large, then the entries in the MAI vector z have a histogram which is nearly Gaussian with standard deviation

    σ ≈ ‖x0‖_2 / √n.    (3.2)

The heuristic justification is as follows. The MAI has the form

    z(j) = x̃(j) − x0(j) = Σ_{ℓ≠j} ⟨φ_j, φ_ℓ⟩ x0(ℓ).

The thing we regard as 'random' in this expression is the matrix Φ. The term ξ_{jk} ≡ ⟨φ_j, φ_k⟩ measures the projection of a random point on the sphere S^(n−1) onto another random point. This random variable has approximately a Gaussian distribution N(0, 1/n). For Φ from the USE, for a given fixed φ_j, the different random variables (ξ_{jk} : k ≠ j) are independently distributed. Hence the quantity z(j) is an iid sum of approximately normal r.v.'s, and so, by standard arguments, should be approximately normal with mean 0 and variance

    σ_j² = Var[ Σ_{ℓ≠j} ξ_{jℓ} x0(ℓ) ] = ( Σ_{ℓ≠j} x0(ℓ)² ) · Var(ξ_{j1}) ≈ n^{−1} ‖x0‖_2².

Setting σ² = ‖x0‖_2²/n, this justifies (3.2).

Computational experiments validate the Gaussian approximation for the MAI. In Figure 3, Panels (a), (d), (g) display Gaussian QQ-plots of the MAI in the sparse case with k/n = .125, .1875 and .25, in the Uniform Spherical Ensemble with n = 256 and N = 1024. In each case, the QQ-plot appears straight, as the Gaussian model would demand.

Figure 3: QQ plots comparing MAI with the Gaussian distribution. Left column: k/n = .125, middle column: k/n = .1875, right column: k/n = .25. Top row: USE, middle row: RSE, bottom row: URP. The RSE and URP ensembles are discussed in Section 8. The plots all appear nearly linear, indicating that the MAI has a nearly Gaussian distribution.

Through the rest of this paper, the phrase Gaussian approximation means that the MAI has an approximately Gaussian marginal distribution. (The reader interested in formal proofs of Gaussian approximation can consult the literature of multiuser detection, e.g. [46, 61, 12]; such a proof is implicit in the proofs of Theorems 1 and 2 below. The connection between our work and MUD theory will be amplified in Section 11 below.)

Properly speaking, the term 'MAI' applies only at stage 1 of StOMP. At later stages there is residual MAI, i.e. MAI which has not yet been cancelled. This can be defined as

    z_s(j) = x0(j) − φ_j^T r_s / ‖P_{I_{s−1}} φ_j‖_2²,    j ∉ I_{s−1};

the coordinates j ∈ I_{s−1} are ignored at stage s; the residual in those coordinates is deterministically 0. Empirically, residual MAI also has a Gaussian behavior. Figure 4 shows quantile-quantile plots for the first few stages of the CFAR variant, comparing the residual MAI with a standard normal distribution; a simple numerical check of the stage-1 approximation (3.2) is sketched below.
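The heuristic calculation above is easy to check numerically. The short sketch below is an independent illustration (not the authors' experiment): it draws a USE-like matrix, forms the stage-1 MAI on the zero coordinates of x0, and compares its empirical standard deviation with ‖x0‖_2/√n from Eq. (3.2).

    import numpy as np

    rng = np.random.default_rng(1)
    n, N, k = 256, 1024, 32
    Phi = rng.standard_normal((n, N))
    Phi /= np.linalg.norm(Phi, axis=0)     # iid unit-norm columns (USE-like)

    x0 = np.zeros(N)
    x0[rng.choice(N, k, replace=False)] = rng.standard_normal(k)

    z = Phi.T @ (Phi @ x0) - x0            # multiple-access interference
    print("empirical std :", z[x0 == 0].std())
    print("Eq. (3.2) std :", np.linalg.norm(x0) / np.sqrt(n))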
The plots are effectively straight lines,illustrating the Gaussian ter,we provide theoretical support for a perturbed Gaussian approximation to residual MAI.3.4Threshold SelectionOur threshold selection proposal is inspired by the Gaussian behavior of residual MAI.We view the vector of correlations c s at stage s as consisting of a small number of‘truly nonzero’entries,combined with a large number of‘Gaussian noise’entries.The problem of separating‘signal’from‘noise’in such problems has generated a large literature including the papers[24,27,26,1,23,37],which influenced our way of thinking.We adopt language from statistical decision theory[39]and thefield of multiple comparisons[38]. Recall that the support I0of x0is being(crudely)estimated in the StOMP algorithm.If a coordinate belonging to I0does not appear in I S,we call this a missed detection.If a coordinate not in I0does appear in I S we call this a false alarm.The coordinates in I S we call discoveries,and the coordinates in I S\I0we call false discoveries.(Note:false alarms are also false discoveries.The terminological distinction is relevant when we normalize to form a rate;thus the false alarm rate is the number of false alarms divided by the number of coordinates not in I0;the false discovery rate is the fraction of false discoveries within I S.)We propose two strategies for setting the threshold.Ultimately,each strategy should land us in a position to apply Lemma3.1:i.e.to arrive at a state where#I S≤n and there are no missed detections. Then,Lemma3.1assures us,we perfectly recover:ˆx S=x.The two strategies are:•False Alarm Control.We attempt to guarantee that the number of total false alarms,across all stages,does not exceed the natural codimension of the problem,defined as n−k.Subject to this, we attempt to make the maximal number of discoveries possible.To do so,we choose a threshold so the False Alarm rate at each stage does not exceed a per-stage budget.•False Discovery Control.We attempt to arrange that the number of False Discoveries cannot exceedFigure4:QQ plots comparing residual MAI with Gaussian distribution.Quantiles of residual MAI at different stages of StOMP are plotted against Gaussian quantiles.Near-linearity indicates approximate Gaussianity.afixed fraction q of all discoveries,and to make the maximum number of discoveries possible subject to that constraint.This leads us to consider Simes’rule[2,1].The False Alarm Control strategy requires knowledge of the number of nonzeros k or some upper bound.False Discovery Control does not require such knowledge,which makes it more convenient for applications,if slightly more complex to implement and substantially more complex to analyse[1].The choice of strategy matters;the basic StOMP algorithm behaves differently depending on the threshold strategy,as we will see below.Implementation details are available by downloading the software we have used to generate the results in this paper;see Section10below.4Performance Analysis by Phase TransitionWhen does StOMP work?To discuss this,we use the notions of phase diagram and phase transition.4.1Problem Suites,Performance MeasuresBy problem suite S(k,n,N)we mean a collection of Sparse Solution Problems defined by two ingredients: (a)an ensemble of random matricesΦof size n by N;(b)an ensemble of k-sparse vectors x0.By standard problem suite S st(k,n,N)we mean the suite withΦsampled from the uniform spherical ensemble,with x0a random variable having k nonzeros sampled iid from a standard N(0,1)distribution.For a given problem 
suite,a specific algorithm can be run numerous times on instances sampled from the problem suite.Its performance on each realization can then be measured according to some numerical or qualitative criterion.If we are really ambitious,and insist on perfect recovery,we use the performancemeasure1{ˆxS =x0}.More quantitative is the 0-norm, ˆx S−x0 0,the number of sites at which the twovectors disagree.Both these measures are inappropriate for use withfloating point arithmetic,which does not produce exact agreement.We prefer to use instead 0, ,the number of sites at which the reconstruction and the target disagree by more than =10−4.We can also use the quantitative measure relerr2= ˆx S−x0 2/ x0 2,declaring success when the measure is smaller than afixed threshold(say ).For a qualitative performance indicator we simply report the fraction of realizations where the qual-itative condition was true;for a quantitative performance measure,we present the mean value across instances at a given k,n,N.Figure5:Phase Diagram for 1minimization.Shaded attribute is the number of coordinates of recon-struction which differ from optimally sparse solution by more than10−4.The diagram displays a rapid transition from perfect reconstruction to perfect disagreement.Overlaid red curve is theoretical curve ρ1.4.2Phase DiagramA phase diagram depicts performance of an algorithm at a sequence of problem suites S(k,n,N).The average value of some performance measure as displayed as a function ofρ=k/n andδ=n/N.Both of these variablesρ,δ∈[0,1],so the diagram occupies the unit square.To illustrate such a phase diagram,consider a well-studied case where something interesting happens. Let x1solve the optimization problem:(P1)min x 1subject to y=Φx.As mentioned earlier,if y=Φx0where x0has k nonzeros,we mayfind that x1=x0exactly when k is small enough.Figure5displays a grid ofδ−ρvalues,withδranging through50equispaced points in the interval[.05,.95]andρranging through50equispaced points in[.05,.95];here N=800.Each point on the grid shows the mean number of coordinates at which original and reconstruction differ by more than10−4,averaged over100independent realizations of the standard problem suite S st(k,n,N). The experimental setting just described,i.e.theδ−ρgrid setup,the values of N,and the number of realizations,is used to generate phase diagrams later in this paper,although the problem suite being used may change.This diagram displays a phase transition.For smallρ,it seems that high-accuracy reconstruction is obtained,while for largeρreconstruction fails.The transition from success to failure occurs at different ρfor different values ofδ.This empirical observation is explained by a theory that accurately predicts the location of the observed phase transition and shows that,asymptotically for large n,this transition is perfectly sharp. 
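The empirical phase diagrams described here are straightforward, if computationally heavy, to reproduce in outline. The sketch below shows the general structure of such an experiment, assuming a generic solver(Phi, y) routine (for instance the StOMP sketch given earlier); the grid resolution, problem size and success tolerance are illustrative and much smaller than the N = 800, 100-replication setting used in the paper.

    import numpy as np

    def empirical_phase_diagram(solver, N=200, reps=10, grid=10, tol=1e-4):
        # Fraction of successful recoveries on a (delta, rho) grid.
        rng = np.random.default_rng(2)
        deltas = np.linspace(0.1, 0.9, grid)   # delta = n / N
        rhos = np.linspace(0.1, 0.9, grid)     # rho = k / n
        success = np.zeros((grid, grid))
        for i, delta in enumerate(deltas):
            n = int(delta * N)
            for j, rho in enumerate(rhos):
                k = max(1, int(rho * n))
                for _ in range(reps):
                    Phi = rng.standard_normal((n, N))
                    Phi /= np.linalg.norm(Phi, axis=0)
                    x0 = np.zeros(N)
                    x0[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
                    x_hat = solver(Phi, Phi @ x0)
                    relerr = np.linalg.norm(x_hat - x0) / np.linalg.norm(x0)
                    success[j, i] += (relerr < tol) / reps
        return deltas, rhos, success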
Suppose that problem(y,Φ)is drawn at random from the standard problem suite,and consider the event E k,n,N that x0=x1i.e.that 1minimization exactly recovers x0.The paper[19]defines a functionρ1(δ)(called thereρW)with the following property.Consider sequences of(k n),(N n)obeying k n/n→ρand n/N n→δ.Suppose thatρ<ρ1(δ).Then as n→∞P rob(E kn ,n,N n)→1.On the other hand,suppose thatρ>ρ1(δ).Then as n→∞P rob(E kn ,n,N n)→0.The theoretical curve(δ,ρ1(δ))described there is overlaid on Figure5,showing good agreement betweenasymptotic theory and experimental results.Figure6:Phase diagram for CFAR thresholding.Overlaid red curve is heuristically-derived analytical curveρF AR(see Appendix B).Shaded attribute:number of coordinates wrong by more than10−4 relative error.4.3Phase Diagrams for StOMPWe now use phase diagrams to study the behavior of StOMP.Figure6displays performance of StOMP with CFAR thresholding with per-iteration false alarm rate(n−k)/(S(N−k)).The problem suite and un-derlying problem size,N=800,are the same as in Figure5.The shaded attribute again portrays the number of entries where the reconstruction misses by more than10−4.Once again,for very sparse problems(ρsmall),the algorithm is successful at recovering(a good approximation to)x0,while for less sparse problems(ρlarge),the algorithm fails.Superposed on this display is the graph of a heuristically-derived functionρF AR,which we call the Predicted Phase transition for CFAR thresholding.Again the agreement between the simulation results and the predicted transition is reasonably good.AppendixB explains the calculation of this predicted transition,although it is best read only afterfirst reading Section6.Figure7shows the number of mismatches for the StOMP algorithm based on CFDR thresholding with False Discovery Rate q=1/2.Here N=800and the display shows that,again,for very sparse problems(ρsmall),the algorithm is successful at recovering(a good approximation to)x0,while for less sparse problemsρlarge,the algorithm fails.Superposed on this display is the graph of a heuristically-derived functionρF DR,which we call the Predicted Phase transition for CFDR thresholding.Again the agreement between the simulation results and the predicted transition is reasonably good,though visibly not quite as good as in the CFAR case.5ComputationSince StOMP seems to work reasonably well,it makes sense to study how rapidly it runs.5.1Empirical ResultsTable1shows the running times for StOMP equipped with CFAR and CFDR thresholding,solving an instance of the problem suite S st(k,n,N).We compare thesefigures with the time needed to solve the same problem instance via 1minimization and OMP.Here 1minimization is implemented using Michael Saunders’PDCO solver[49].The simulations used to generate thefigures in the table were all executed on a3GHz Xeon workstation,comparable with current desktop CPUs.Table1suggests that a tremendous saving in computation time is achieved when using the StOMP scheme over traditional 1minimization.The conclusion is that CFAR-and CFDR-based methods have a large。
Sparse Representations and the Basis Pursuit Algorithm
Michael Elad
The CS Department, The Technion – Israel Institute of Technology, Haifa 32000, Israel. IPAM – MGA Program.
• Signal Prior: in inverse problems, seek a solution that has a sparse representation over a predetermined dictionary, and in this way regularize the problem (just as TV, bilateral, Beltrami flow, wavelet, and other priors are used).
Signals' Origin in Sparse-Land
We shall assume that our signals of interest emerge from a random generator machine M
[Slide figure: a random signal generator M emits signals x.]
Sparse representations for Signals – Theory and Applications
3. Success of BP/MP for Inverse Problems
Uniqueness, stability of BP and MP
4. Applications
Image separation and inpainting
A multiscale framework for blind separation of linearly mixed signals
The functions ϕ_k(ξ) are called atoms or elements of the representation space and may constitute a basis or a frame. These elements do not necessarily have to be linearly independent and, instead, may form an overcomplete set (or dictionary), for example, wavelet-related dictionaries: wavelet packets, stationary wavelets, etc. (see, for example, Coifman and Wickerhauser, 1992, Mallat, 1998, Stanhill and Zeevi, 1996). The corresponding representation of the mixtures, according to the same signal dictionary, is:

    x_m(ξ) = Σ_k y_mk ϕ_k(ξ),    (3)

where y_mk are the decomposition coefficients of the mixtures. Often, overcomplete dictionaries, e.g. wavelet packets, contain bases as their subdictionaries, with the corresponding elements being orthogonal. Let us consider the case wherein the subset of functions {ϕ_k : k ∈ Ω} is constructed from the mutually orthogonal elements of the dictionary. This subset can be either complete or undercomplete, since only synthesis coefficients are used in the estimation of the mixing matrix, and sources are recovered without using these coefficients, as explained below. A more general case, wherein the above subset of functions is overcomplete, leads to the maximum a posteriori approach to the BSS problem (Zibulevsky et al., 2001). This approach arrives at a non-convex objective function, which makes convergence unstable when optimization starts far from the solution (we use a more robust Maximum Likelihood formulation of the problem).

Let us define vectors y_k and c_k, k ∈ Ω, to be constructed from the k-th coefficients of the mixtures and of the sources, respectively. From (1) and (2), and using the orthogonality of the subset of functions {ϕ_k : k ∈ Ω}, the relation between the decomposition coefficients of the mixtures and of the sources, when the noise is small, is y_k ≈ A c_k. Note that the relation between the decomposition coefficients of the mixtures and the sources is exactly the same as in the original domain of signals, where x_ξ ≈ A s_ξ. Then, estimation of the mixing matrix and of the sources is performed using the decomposition coefficients of the mixtures y_k instead of the mixtures x_ξ. The property of sparsity often yields much better source separation than standard ICA, and can work well even with more sources than mixtures (Bofill and Zibulevsky, 2001). In many cases there are distinct groups of coefficients wherein sources have different sparsity properties. The proposed multiscale, or multiresolution, approach to the BSS is based on selecting only a subset of features, or coefficients, Y = {y_k : k ∈ Ω}, which is best suited for separation, with r
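The excerpt breaks off here, but the key relation y_k ≈ A c_k already suggests how the mixing matrix can be estimated in a sparse transform domain: where a single source dominates a coefficient, the coefficient vector of the mixtures points along the corresponding column of A. The sketch below is an illustrative reconstruction of that idea, not the authors' algorithm; finite differences stand in for the wavelet coefficients y_mk, the mixing matrix and sources are synthetic assumptions, and only the mixing-matrix estimate is shown.

    import numpy as np

    rng = np.random.default_rng(3)
    T = 2000

    def block_source(n_blocks):
        # piecewise-constant source: sparse in the finite-difference domain
        cuts = np.sort(rng.choice(np.arange(1, T), n_blocks - 1, replace=False))
        lengths = np.diff(np.concatenate(([0], cuts, [T])))
        return np.repeat(rng.standard_normal(n_blocks), lengths)

    S = np.vstack([block_source(20), block_source(40)])   # two sources
    A = np.array([[0.9, 0.5],
                  [0.4, 0.8]])                            # unknown mixing matrix
    X = A @ S                                             # observed mixtures

    # Crude sparsifying transform standing in for the coefficients y_mk
    Y = np.diff(X, axis=1)

    # Keep coefficients where one source dominates: there y_k ~ (column of A) * c_k
    sel = Y[:, np.linalg.norm(Y, axis=0) > 1e-8]

    # Cluster coefficient directions (folded onto a half-circle) to estimate A
    ang = np.mod(np.arctan2(sel[1], sel[0]), np.pi)
    centers = np.array([ang.min(), ang.max()])
    for _ in range(20):                                   # tiny 1-D k-means
        lab = np.abs(ang[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([ang[lab == c].mean() for c in (0, 1)])
    A_hat = np.column_stack([np.cos(centers), np.sin(centers)])

    print(A / np.linalg.norm(A, axis=0))   # true columns (normalized)
    print(A_hat)                           # estimated columns (up to order/sign)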
By a representation of s in D we mean a (column) vector α ∈ R^K (resp., in C^K) such that s = Dα. We notice that when K > N, the vectors of D are no longer linearly independent and the representation of s is not unique. The hope is that among all possible representations of s there is a very sparse representation, i.e., a representation with few nonzero coefficients. The tradeoff is that we have to search all possible representations of s to find the sparse representations, and then determine whether there is a unique sparsest representation. Following [1] and [2], we will measure the sparsity of a representation s = Dα by two quantities: the ℓ0 and the ℓ1 norm of α, respectively (the ℓ0-norm simply counts the number of nonzero entries of a vector). This leads to the following two minimization problems to determine the sparsest representation of s:

    minimize ‖α‖_0 subject to s = Dα    (1)

and

    minimize ‖α‖_1 subject to s = Dα.    (2)

It turns out that the optimization problem (2) is much easier to handle than (1) through the use of linear programming (LP), so it is important to know the relationship between the solution(s) of (1) and (2), and to determine sufficient conditions for the two problems to have the same unique solution. This problem has been studied in detail in [1] and later has been refined in [2] in the special case where the dictionary D is the union of two orthonormal bases. In what follows, we generalize the results of [1] and [2] to arbitrary dictionaries. The case where D is the union of L ≥ 2 orthonormal bases for H is studied in detail. This leads to a natural generalization of the recent results from [2] valid for L = 2. In Section II, we provide conditions for a solution of the problem
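Problem (2) can indeed be recast as a linear program: writing α = u − v with u, v ≥ 0, minimizing ‖α‖_1 subject to s = Dα becomes minimizing Σ(u + v) subject to [D, −D][u; v] = s. The following SciPy sketch of this standard reformulation is illustrative only, using a random unit-norm dictionary rather than any particular construction from the correspondence.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(4)
    N, K = 32, 64                       # signal dimension, dictionary size (K > N)
    D = rng.standard_normal((N, K))
    D /= np.linalg.norm(D, axis=0)      # unit-norm atoms

    alpha0 = np.zeros(K)                # a sparse representation to recover
    alpha0[rng.choice(K, 4, replace=False)] = rng.standard_normal(4)
    s = D @ alpha0

    # l1 minimization (2) as an LP: alpha = u - v with u, v >= 0
    c = np.ones(2 * K)
    A_eq = np.hstack([D, -D])
    res = linprog(c, A_eq=A_eq, b_eq=s, bounds=(0, None), method="highs")
    alpha_hat = res.x[:K] - res.x[K:]
    print(np.max(np.abs(alpha_hat - alpha0)))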
Sparse Representations in Unions of Bases
Rémi Gribonval, Member, IEEE, and Morten Nielsen
Abstract—The purpose of this correspondence is to generalize a result by Donoho and Huo and Elad and Bruckstein on sparse representations of signals in a union of two orthonormal bases for R^N. We consider general (redundant) dictionaries for R^N, and derive sufficient conditions for having unique sparse representations of signals in such dictionaries. The special case where the dictionary is given by the union of L ≥ 2 orthonormal bases for R^N is studied in more detail. In particular, it is proved that the result of Donoho and Huo, concerning the replacement of the ℓ0 optimization problem with a linear programming problem when searching for sparse representations, has an analog for dictionaries that may be highly redundant.

Index Terms—Dictionaries, Grassmannian frames, linear programming, mutually incoherent bases, nonlinear approximation, sparse representations.