Reference (10): Semi-supervised and unsupervised extreme learning machines
Semi-supervised and unsupervised extreme learning machines

Gao Huang, Shiji Song, Jatinder N. D. Gupta, and Cheng Wu

Abstract—Extreme learning machines (ELMs) have proven to be an efficient and effective learning paradigm for pattern classification and regression. However, ELMs are primarily applied to supervised learning problems. Only a few existing research studies have used ELMs to explore unlabeled data. In this paper, we extend ELMs for both semi-supervised and unsupervised tasks based on manifold regularization, thus greatly expanding the applicability of ELMs. The key advantages of the proposed algorithms are: 1) both the semi-supervised ELM (SS-ELM) and the unsupervised ELM (US-ELM) exhibit the learning capability and computational efficiency of ELMs; 2) both algorithms naturally handle multi-class classification or multi-cluster clustering; and 3) both algorithms are inductive and can handle unseen data at test time directly. Moreover, it is shown in this paper that all the supervised, semi-supervised and unsupervised ELMs can actually be put into a unified framework. This provides new perspectives for understanding the mechanism of random feature mapping, which is the key concept in ELM theory. An empirical study on a wide range of data sets demonstrates that the proposed algorithms are competitive with state-of-the-art semi-supervised or unsupervised learning algorithms in terms of accuracy and efficiency.

Index Terms—Clustering, embedding, extreme learning machine, manifold regularization, semi-supervised learning, unsupervised learning.

I. INTRODUCTION

Single layer feedforward networks (SLFNs) have been intensively studied during the past several decades. Most of the existing learning algorithms for training SLFNs, such as the famous back-propagation algorithm [1] and the Levenberg-Marquardt algorithm [2], adopt gradient methods to optimize the weights in the network. Some existing works also use forward selection or backward elimination approaches to construct the network dynamically during the training process [3]–[7]. However, neither the gradient-based methods nor the grow/prune methods guarantee a globally optimal solution. Although various methods, such as genetic and evolutionary algorithms, have been proposed to handle the local minimum problem, they basically introduce high computational cost.

(This work was supported by the National Natural Science Foundation of China under Grant 61273233, the Research Fund for the Doctoral Program of Higher Education under Grants 20120002110035 and 20130002130010, the National Key Technology R&D Program under Grant 2012BAF01B03, the Project of China Ocean Association under Grant DY125-25-02, and the Tsinghua University Initiative Scientific Research Program under Grant 2011THZ07132. Gao Huang, Shiji Song, and Cheng Wu are with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: huang-g09@; shijis@; wuc@). Jatinder N. D. Gupta is with the College of Business Administration, The University of Alabama in Huntsville, Huntsville, AL 35899, USA (e-mail: guptaj@).)
One of the most successful algorithms for training SLFNs is the support vector machine (SVM) [8], [9], which is a maximal margin classifier derived under the framework of structural risk minimization (SRM). The dual problem of SVMs is a quadratic program and can be solved conveniently. Due to its simplicity and stable generalization performance, the SVM has been widely studied and applied to various domains [10]–[14].

Recently, Huang et al. [15], [16] proposed the extreme learning machine (ELM) for training SLFNs. In contrast to most of the existing approaches, ELMs only update the output weights between the hidden layer and the output layer, while the parameters of the hidden layer, i.e., the input weights and biases, are randomly generated. By adopting a squared loss on the prediction error, the training of the output weights turns into a regularized least squares (or ridge regression) problem which can be solved efficiently in closed form. It has been shown that even without updating the parameters of the hidden layer, an SLFN with randomly generated hidden neurons and tunable output weights maintains its universal approximation capability [17]–[19]. Compared to gradient-based algorithms, ELMs are much more efficient and usually lead to better generalization performance [20]–[22]. Compared to SVMs, solving the regularized least squares problem in ELMs is also faster than solving the quadratic programming problem in standard SVMs. Moreover, ELMs can be used for multi-class classification problems directly. The prediction accuracy achieved by ELMs is comparable with or even higher than that of SVMs [16], [22]–[24].

The differences and similarities between ELMs and SVMs are discussed in [25] and [26], and new algorithms have been proposed by combining the advantages of both models. In [25], an extreme SVM (ESVM) model is proposed by combining ELMs and the proximal SVM (PSVM). The ESVM algorithm is shown to be more accurate than the basic ELM model due to the introduced regularization technique, and much more efficient than SVMs since there is no kernel matrix multiplication in ESVM. In [26], the traditional RBF kernel is replaced by the ELM kernel, leading to an efficient algorithm whose accuracy matches that of SVMs.

In the past years, researchers from various fields have made substantial contributions to ELM theory and applications. For example, the universal approximation ability of ELMs has been further studied in a classification context [23]. The generalization error bound of ELMs has been investigated from the perspective of Vapnik-Chervonenkis (VC) dimension theory and the initial localized generalization error model (LGEM) [27], [28]. Various extensions have been made to the basic ELM to make it more efficient and more suitable for specific problems, such as ELMs for online sequential data [29]–[31], ELMs for noisy/missing data [32]–[34], and ELMs for imbalanced data [35]. From the implementation aspect, ELMs have recently been implemented using parallel techniques [36], [37], and realized on hardware [38], which makes ELMs feasible for large data sets and real-time reasoning.
Though ELMs have become popular in a wide range of domains, they are primarily used for supervised learning tasks such as classification and regression, which greatly limits their applicability. In some cases, such as text classification, information retrieval and fault diagnosis, obtaining labels for fully supervised learning is time consuming and expensive, while a multitude of unlabeled data are easy and cheap to collect.

To overcome the disadvantage of supervised learning algorithms that they cannot make use of unlabeled data, semi-supervised learning (SSL) has been proposed to leverage both labeled and unlabeled data [39], [40]. SSL algorithms assume that the input patterns from both labeled and unlabeled data are drawn from the same marginal distribution. Therefore, the unlabeled data naturally provide useful information for exploring the data structure in the input space. By assuming that the input data follow some cluster structure or manifold in the input space, SSL algorithms can incorporate both labeled and unlabeled data into the learning process. Since SSL requires less effort to collect labeled data and can offer higher accuracy, it has been applied to various domains [41]–[43]. In some other cases where no labeled data are available, people may be interested in exploring the underlying structure of the data. To this end, unsupervised learning (USL) techniques, such as clustering, dimension reduction and data representation, are widely used to fulfill these tasks.

In this paper, we extend ELMs to handle both semi-supervised and unsupervised learning problems by introducing the manifold regularization framework. Both the proposed semi-supervised ELM (SS-ELM) and unsupervised ELM (US-ELM) inherit the computational efficiency and the learning capability of traditional ELMs. Compared with existing algorithms, SS-ELM and US-ELM are not only inductive (extending straightforwardly to out-of-sample examples at test time), but can also be used for multi-class classification or multi-cluster clustering directly. We test our algorithms on a variety of data sets, and make comparisons with other related algorithms. The results show that the proposed algorithms are competitive with state-of-the-art algorithms in terms of accuracy and efficiency.

It is worth mentioning that all the supervised, semi-supervised and unsupervised ELMs can actually be put into a unified framework: all the algorithms consist of two stages, 1) random feature mapping and 2) output weights solving. The first stage constructs the hidden layer using randomly generated hidden neurons. This is the key concept in ELM theory, which distinguishes it from many existing feature learning methods. Generating the feature mapping randomly enables ELMs to perform fast nonlinear feature learning and alleviates the problem of over-fitting. The second stage solves the weights between the hidden layer and the output layer, and this is where the main difference between the supervised, semi-supervised and unsupervised ELMs lies. We believe that the unified framework for the three types of ELMs might provide a new perspective to understand the underlying behavior of the random feature mapping in ELMs.

The rest of the paper is organized as follows. In Section II, we give a brief review of related existing literature on semi-supervised and unsupervised learning. Sections III and IV introduce the basic formulation of ELMs and the manifold regularization framework, respectively. We present the proposed SS-ELM and US-ELM algorithms in Sections V and VI. Experimental results are given in Section VII, and
Section VIII concludes the paper.

II. RELATED WORKS

Only a few existing research studies on ELMs have dealt with the problem of semi-supervised learning or unsupervised learning. In [44] and [45], the manifold regularization framework was introduced into the ELM model to leverage both labeled and unlabeled data, thus extending ELMs to semi-supervised learning. However, both of these two works are limited to binary classification problems, and thus they do not explore the full power of ELMs. Moreover, both algorithms are only effective when the number of training patterns is larger than the number of hidden neurons. Unfortunately, this condition is usually violated in semi-supervised learning, since the training data are relatively scarce compared to the hidden neurons, whose number is commonly set to several hundred or several thousand. Recently, a co-training approach has been proposed to train ELMs in a semi-supervised setting [46]. In this algorithm, the labeled training set is augmented gradually by moving a small set of the most confidently predicted unlabeled data to the labeled set at each loop, and ELMs are trained repeatedly on the pseudo-labeled set. Since the algorithm needs to train ELMs repeatedly, it introduces considerable extra computational cost.

The proposed SS-ELM is related to a few other manifold-assumption-based semi-supervised learning algorithms, such as the Laplacian support vector machine (LapSVM) [47], the Laplacian regularized least squares (LapRLS) [47], semi-supervised neural networks (SSNNs) [48], and semi-supervised deep embedding [49]. It has been shown in these works that manifold regularization is effective in a wide range of domains and often leads to state-of-the-art performance in terms of accuracy and efficiency.

The US-ELM proposed in this paper is related to Laplacian Eigenmaps (LE) [50] and spectral clustering (SC) [51] in that they all use spectral techniques for embedding and clustering. In all these algorithms, an affinity matrix is first built from the input patterns. SC performs eigen-decomposition on the normalized affinity matrix, and then embeds the original data into a d-dimensional space using the first d eigenvectors (each row is normalized to have unit length and represents a point in the embedded space) corresponding to the d largest eigenvalues. The LE algorithm performs generalized eigen-decomposition on the graph Laplacian, and uses the d eigenvectors corresponding to the second through the (d+1)th smallest eigenvalues for embedding. When LE and SC are used for clustering, k-means is adopted to cluster the data in the embedded space. Similar to LE and SC, the US-ELM is also based on the affinity matrix, and its training is converted into solving a generalized eigen-decomposition problem. However, the eigenvectors obtained in US-ELM are not used for data representation directly, but as the parameters of the network, i.e., the output weights. Note that once the US-ELM model is trained, it can be applied to any data presented in the original input space. In this way, US-ELM provides a straightforward way to handle new patterns without recomputing eigenvectors as in LE and SC.

III. EXTREME LEARNING MACHINES

Consider a supervised learning problem where we have a training set with $N$ samples, $\{X, Y\} = \{x_i, y_i\}_{i=1}^{N}$. Here $x_i \in \mathbb{R}^{n_i}$, and $y_i$ is an $n_o$-dimensional binary vector with only one entry (corresponding to the class that $x_i$ belongs to) equal to one for multi-class classification tasks, or $y_i \in \mathbb{R}^{n_o}$ for regression tasks, where $n_i$ and $n_o$ are the dimensions of the input and output, respectively. ELMs aim to
learn a decision rule or an approximation function based on the training data.

Generally, the training of ELMs consists of two stages. The first stage constructs the hidden layer using a fixed number of randomly generated mapping neurons, which can be any nonlinear piecewise continuous functions, such as the Sigmoid function and the Gaussian function given below.

1) Sigmoid function: $g(x; \theta) = \dfrac{1}{1 + \exp(-(a^T x + b))}$;  (1)

2) Gaussian function: $g(x; \theta) = \exp(-b \|x - a\|^2)$;  (2)

where $\theta = \{a, b\}$ are the parameters of the mapping function and $\|\cdot\|$ denotes the Euclidean norm.

A notable feature of ELMs is that the parameters of the hidden mapping functions can be randomly generated according to any continuous probability distribution, e.g., the uniform distribution on (-1, 1). This makes ELMs distinct from traditional feedforward neural networks and SVMs. The only free parameters that need to be optimized in the training process are the output weights between the hidden neurons and the output nodes. By doing so, training ELMs is equivalent to solving a regularized least squares problem, which is considerably more efficient than the training of SVMs or backpropagation algorithms.

In the first stage, a number of hidden neurons which map the data from the input space into an $n_h$-dimensional feature space ($n_h$ is the number of hidden neurons) are randomly generated. We denote by $h(x_i) \in \mathbb{R}^{1 \times n_h}$ the output vector of the hidden layer with respect to $x_i$, and by $\beta \in \mathbb{R}^{n_h \times n_o}$ the output weights that connect the hidden layer with the output layer. Then, the outputs of the network are given by

$$f(x_i) = h(x_i)\beta, \quad i = 1, \ldots, N. \quad (3)$$

In the second stage, ELMs solve the output weights by minimizing the sum of the squared prediction errors, which leads to the following formulation

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o}} \ \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|e_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = y_i^T - e_i^T, \quad i = 1, \ldots, N, \quad (4)$$

where the first term in the objective function is a regularization term which controls the complexity of the model, $e_i \in \mathbb{R}^{n_o}$ is the error vector with respect to the $i$th training pattern, and $C$ is a penalty coefficient on the training errors.

By substituting the constraints into the objective function, we obtain the following equivalent unconstrained optimization problem:

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o}} \ L_{\mathrm{ELM}} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|Y - H\beta\|^2, \quad (5)$$

where $H = [h(x_1)^T, \ldots, h(x_N)^T]^T \in \mathbb{R}^{N \times n_h}$.

The above problem is widely known as ridge regression or regularized least squares. By setting the gradient of $L_{\mathrm{ELM}}$ with respect to $\beta$ to zero, we have

$$\nabla L_{\mathrm{ELM}} = \beta - C H^T(Y - H\beta) = 0. \quad (6)$$

If $H$ has more rows than columns and is of full column rank, which is usually the case when the number of training patterns is larger than the number of hidden neurons, the above equation is overdetermined, and we have the following closed form solution for (5):

$$\beta^* = \left(H^T H + \frac{I_{n_h}}{C}\right)^{-1} H^T Y, \quad (7)$$

where $I_{n_h}$ is an identity matrix of dimension $n_h$. Note that in practice, rather than explicitly inverting the $n_h \times n_h$ matrix in the above expression, we can use Gaussian elimination to directly solve a set of linear equations in a more efficient and numerically stable manner.

If the number of training patterns is less than the number of hidden neurons, then $H$ has more columns than rows, which often leads to an underdetermined least squares problem. In this case, $\beta$ may have an infinite number of solutions. To handle this problem, we restrict $\beta$ to be a linear combination of the rows of $H$: $\beta = H^T\alpha$ ($\alpha \in \mathbb{R}^{N \times n_o}$). Notice that when $H$ has more columns than rows and is of full row rank, then $H H^T$ is invertible. Multiplying both sides of (6) by $(H H^T)^{-1}H$, we get

$$\alpha - C(Y - H H^T\alpha) = 0, \quad (8)$$

which yields

$$\beta^* = H^T\alpha^* = H^T\left(H H^T + \frac{I_N}{C}\right)^{-1} Y, \quad (9)$$

where $I_N$ is an identity matrix of dimension $N$. Therefore, when training patterns are plentiful compared to the hidden neurons, we use (7) to compute the output weights; otherwise we use (9).
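The two-stage procedure above admits a very short implementation. The following is a minimal NumPy sketch of Eqs. (1), (3), (7) and (9); the function and variable names are ours, not the authors' reference code, and the sigmoid hidden layer and uniform (-1, 1) initialization are the choices described in the text.

```python
import numpy as np

def elm_train(X, Y, n_h=1000, C=1.0, seed=0):
    """Minimal ELM sketch: random sigmoid hidden layer + ridge-regression output weights.

    X: (N, n_i) inputs; Y: (N, n_o) targets (one-hot rows for classification).
    Returns (A, b, beta) so that predictions are sigmoid(X @ A + b) @ beta.
    """
    rng = np.random.default_rng(seed)
    N, n_i = X.shape
    # Stage 1: random input weights and biases drawn from U(-1, 1), never updated.
    A = rng.uniform(-1.0, 1.0, size=(n_i, n_h))
    b = rng.uniform(-1.0, 1.0, size=n_h)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))        # hidden-layer output matrix, Eq. (1)

    # Stage 2: closed-form output weights, Eq. (7) or Eq. (9) depending on N vs n_h.
    if N >= n_h:
        beta = np.linalg.solve(H.T @ H + np.eye(n_h) / C, H.T @ Y)   # Eq. (7)
    else:
        beta = H.T @ np.linalg.solve(H @ H.T + np.eye(N) / C, Y)     # Eq. (9)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                                # Eq. (3); take argmax over columns for class labels
```

Both branches solve a linear system rather than forming an explicit inverse, matching the numerical advice in the text.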
IV. THE MANIFOLD REGULARIZATION FRAMEWORK

Semi-supervised learning is built on the following two assumptions: (1) both the labeled data $X_l$ and the unlabeled data $X_u$ are drawn from the same marginal distribution $P_X$; and (2) if two points $x_1$ and $x_2$ are close to each other, then the conditional probabilities $P(y|x_1)$ and $P(y|x_2)$ should be similar as well. The latter assumption is widely known as the smoothness assumption in machine learning. To enforce this assumption on the data, the manifold regularization framework proposes to minimize the following cost function

$$L_m = \frac{1}{2}\sum_{i,j} w_{ij}\,\|P(y|x_i) - P(y|x_j)\|^2, \quad (10)$$

where $w_{ij}$ is the pair-wise similarity between two patterns $x_i$ and $x_j$. Note that the similarity matrix $W = [w_{ij}]$ is usually sparse, since we only place a nonzero weight between two patterns $x_i$ and $x_j$ if they are close, e.g., $x_i$ is among the $k$ nearest neighbors of $x_j$ or $x_j$ is among the $k$ nearest neighbors of $x_i$. The nonzero weights are usually computed using the Gaussian function $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, or simply fixed to 1.

Intuitively, the formulation (10) penalizes large variation in the conditional probability $P(y|x)$ when $x$ has a small change. This requires that $P(y|x)$ vary smoothly along the geodesics of $P(x)$.

Since it is difficult to compute the conditional probability, we can approximate (10) with the following expression:

$$\hat{L}_m = \frac{1}{2}\sum_{i,j} w_{ij}\,\|\hat{y}_i - \hat{y}_j\|^2, \quad (11)$$

where $\hat{y}_i$ and $\hat{y}_j$ are the predictions with respect to patterns $x_i$ and $x_j$, respectively. It is straightforward to rewrite the above expression in matrix form:

$$\hat{L}_m = \mathrm{Tr}(\hat{Y}^T L \hat{Y}), \quad (12)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, $L = D - W$ is known as the graph Laplacian, and $D$ is a diagonal matrix with diagonal elements $D_{ii} = \sum_{j=1}^{l+u} w_{ij}$. As discussed in [52], instead of using $L$ directly, we can normalize it as $D^{-1/2} L D^{-1/2}$ or replace it by $L^p$ ($p$ an integer), based on some prior knowledge.

V. SEMI-SUPERVISED ELM

In the semi-supervised setting, we have few labeled data and plenty of unlabeled data. We denote the labeled data in the training set as $\{X_l, Y_l\} = \{x_i, y_i\}_{i=1}^{l}$, and the unlabeled data as $X_u = \{x_i\}_{i=1}^{u}$, where $l$ and $u$ are the numbers of labeled and unlabeled data, respectively.

The proposed SS-ELM incorporates manifold regularization to leverage unlabeled data and improve the classification accuracy when labeled data are scarce. By modifying the ordinary ELM formulation (4), we give the formulation of SS-ELM as:

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o}} \ \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\sum_{i=1}^{l} C_i\|e_i\|^2 + \frac{\lambda}{2}\mathrm{Tr}(F^T L F)$$
$$\text{s.t.} \quad h(x_i)\beta = y_i^T - e_i^T, \quad i = 1, \ldots, l; \qquad f_i = h(x_i)\beta, \quad i = 1, \ldots, l+u, \quad (13)$$

where $L \in \mathbb{R}^{(l+u)\times(l+u)}$ is the graph Laplacian built from both labeled and unlabeled data, $F \in \mathbb{R}^{(l+u)\times n_o}$ is the output matrix of the network with its $i$th row equal to $f(x_i)$, and $\lambda$ is a tradeoff parameter.

Note that, similar to the weighted ELM algorithm (W-ELM) introduced in [35], here we associate a different penalty coefficient $C_i$ with the prediction errors of patterns from different classes. This is because we found that when the data is skewed, i.e., some classes have significantly more training patterns than others, traditional ELMs tend to fit the classes holding the majority of patterns quite well but fit the other classes poorly. This usually leads to poor generalization performance on the testing set (the overall prediction accuracy may be high, but some classes are neglected). Therefore, we propose to alleviate this problem by re-weighting instances from different classes.
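The graph Laplacian $L = D - W$ used in (12) and (13) can be built with a few lines of code. Below is a brute-force NumPy sketch of the k-NN affinity construction described above (Gaussian weights, symmetrized neighborhoods); the function name and defaults are ours, and a real implementation would use sparse matrices for large data sets.

```python
import numpy as np

def graph_laplacian(X, k=10, sigma=1.0, normalized=False):
    """Sketch of the graph Laplacian L = D - W for the manifold regularizer.

    W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) if x_j is among the k nearest
    neighbors of x_i (or vice versa), else 0.
    """
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # dense pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]                          # k nearest neighbors, skipping the point itself
        W[i, idx] = np.exp(-d2[i, idx] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)          # symmetrize: keep an edge if either point is a neighbor of the other
    D = np.diag(W.sum(axis=1))
    L = D - W
    if normalized:                  # optional D^{-1/2} L D^{-1/2} normalization mentioned in the text
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(W.sum(axis=1), 1e-12, None)))
        L = d_inv_sqrt @ L @ d_inv_sqrt
    return L
```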
Suppose that $x_i$ belongs to class $t_i$, which has $N_{t_i}$ training patterns. Then we associate $e_i$ with a penalty of

$$C_i = \frac{C_0}{N_{t_i}}, \quad (14)$$

where $C_0$ is a user-defined parameter as in traditional ELMs. In this way, the patterns from the dominant classes will not be overfitted by the algorithm, and the patterns from a class with fewer samples will not be neglected.

We substitute the constraints into the objective function, and rewrite the above formulation in a matrix form:

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o}} \ \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\left\|C^{\frac{1}{2}}(\tilde{Y} - H\beta)\right\|^2 + \frac{\lambda}{2}\mathrm{Tr}(\beta^T H^T L H \beta), \quad (15)$$

where $\tilde{Y} \in \mathbb{R}^{(l+u)\times n_o}$ is the training target with its first $l$ rows equal to $Y_l$ and the rest equal to 0, and $C$ is an $(l+u)\times(l+u)$ diagonal matrix with its first $l$ diagonal elements $[C]_{ii} = C_i$, $i = 1, \ldots, l$, and the rest equal to 0.

Again, we compute the gradient of the objective function with respect to $\beta$:

$$\nabla L_{\mathrm{SS\text{-}ELM}} = \beta - H^T C(\tilde{Y} - H\beta) + \lambda H^T L H \beta. \quad (16)$$

By setting the gradient to zero, we obtain the solution to the SS-ELM:

$$\beta^* = \left(I_{n_h} + H^T C H + \lambda H^T L H\right)^{-1} H^T C \tilde{Y}. \quad (17)$$

As in Section III, if the number of labeled data is fewer than the number of hidden neurons, which is common in SSL, we have the following alternative solution:

$$\beta^* = H^T\left(I_{l+u} + C H H^T + \lambda L H H^T\right)^{-1} C \tilde{Y}, \quad (18)$$

where $I_{l+u}$ is an identity matrix of dimension $l+u$.

Note that by setting $\lambda$ to zero and the diagonal elements $C_i$ ($i = 1, \ldots, l$) to the same constant, (17) and (18) reduce to the solutions of traditional ELMs, (7) and (9), respectively.

Based on the above discussion, the SS-ELM algorithm is summarized as Algorithm 1.

Algorithm 1: The SS-ELM algorithm
Input: The labeled patterns $\{X_l, Y_l\} = \{x_i, y_i\}_{i=1}^{l}$; the unlabeled patterns $X_u = \{x_i\}_{i=1}^{u}$.
Output: The mapping function of SS-ELM, $f: \mathbb{R}^{n_i} \to \mathbb{R}^{n_o}$.
Step 1: Construct the graph Laplacian $L$ from both $X_l$ and $X_u$.
Step 2: Initiate an ELM network of $n_h$ hidden neurons with random input weights and biases, and calculate the output matrix of the hidden neurons $H \in \mathbb{R}^{(l+u)\times n_h}$.
Step 3: Choose the tradeoff parameters $C_0$ and $\lambda$.
Step 4: If $n_h \le N$, compute the output weights $\beta$ using (17); else compute the output weights $\beta$ using (18).
Return: The mapping function $f(x) = h(x)\beta$.
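A compact NumPy sketch of Steps 3-4 is given below, implementing the class-balanced penalties of (14) and the two closed-form solutions (17) and (18). The hidden-layer matrix H and the Laplacian L are assumed to come from sketches like the ones above; names and the exact interface are illustrative, not the authors' reference code.

```python
import numpy as np

def ss_elm_solve(H, Y_l, L, C0=1.0, lam=0.01):
    """SS-ELM output weights, Eqs. (14)-(18).

    H: (l+u, n_h) hidden-layer outputs, labeled rows first.
    Y_l: (l, n_o) one-hot targets for the labeled rows; L: (l+u, l+u) graph Laplacian.
    """
    n, n_h = H.shape
    l, n_o = Y_l.shape
    Y_tilde = np.zeros((n, n_o))
    Y_tilde[:l] = Y_l

    # Eq. (14): per-class penalties C_i = C0 / N_{t_i}; unlabeled rows get weight 0.
    counts = Y_l.sum(axis=0)                          # N_t for each class (one-hot targets)
    c_diag = np.zeros(n)
    c_diag[:l] = C0 / counts[Y_l.argmax(axis=1)]
    C = np.diag(c_diag)

    if n_h <= n:                                      # Eq. (17)
        A = np.eye(n_h) + H.T @ C @ H + lam * H.T @ L @ H
        return np.linalg.solve(A, H.T @ C @ Y_tilde)
    else:                                             # Eq. (18)
        A = np.eye(n) + C @ H @ H.T + lam * L @ H @ H.T
        return H.T @ np.linalg.solve(A, C @ Y_tilde)
```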
VI. UNSUPERVISED ELM

In this section, we introduce the US-ELM algorithm for unsupervised learning. In an unsupervised setting, the entire training data $X = \{x_i\}_{i=1}^{N}$ are unlabeled ($N$ is the number of training patterns), and our target is to find the underlying structure of the original data.

The formulation of US-ELM follows from the formulation of SS-ELM. When there is no labeled data, (15) reduces to

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o}} \ \|\beta\|^2 + \lambda\,\mathrm{Tr}(\beta^T H^T L H \beta). \quad (19)$$

Notice that the above formulation always attains its minimum at $\beta = 0$. As suggested in [50], we have to introduce additional constraints to avoid a degenerate solution. Specifically, the formulation of US-ELM is given by

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o}} \ \|\beta\|^2 + \lambda\,\mathrm{Tr}(\beta^T H^T L H \beta) \quad \text{s.t.} \quad (H\beta)^T H\beta = I_{n_o}. \quad (20)$$

Theorem 1: An optimal solution to problem (20) is given by choosing $\beta$ as the matrix whose columns are the eigenvectors (normalized to satisfy the constraint) corresponding to the first $n_o$ smallest eigenvalues of the generalized eigenvalue problem

$$\left(I_{n_h} + \lambda H^T L H\right) v = \gamma\, H^T H\, v. \quad (21)$$

Proof: We can rewrite problem (20) as

$$\min_{\beta \in \mathbb{R}^{n_h \times n_o},\ \beta^T B \beta = I_{n_o}} \ \mathrm{Tr}(\beta^T A \beta), \quad (22)$$

where $A = I_{n_h} + \lambda H^T L H$ and $B = H^T H$. It is easy to verify that both $A$ and $B$ are Hermitian matrices. Thus, according to the Rayleigh-Ritz theorem [53], the above trace minimization problem attains its optimum if and only if the column span of $\beta$ is the minimum span of the eigenspace corresponding to the smallest $n_o$ eigenvalues of (21). Therefore, by stacking the normalized eigenvectors of (21) corresponding to the smallest $n_o$ generalized eigenvalues, we obtain an optimal solution to (20).

In the Laplacian eigenmaps algorithm, the first eigenvector is discarded since it is always a constant vector proportional to $\mathbf{1}$ (corresponding to the smallest eigenvalue 0) [50]. In the US-ELM algorithm, the first eigenvector of (21) also leads to small variations in the embedding and is not useful for data representation. Therefore, we suggest discarding this trivial solution as well. Let $\gamma_1, \gamma_2, \ldots, \gamma_{n_o+1}$ ($\gamma_1 \le \gamma_2 \le \ldots \le \gamma_{n_o+1}$) be the $(n_o+1)$ smallest eigenvalues of (21) and $v_1, v_2, \ldots, v_{n_o+1}$ be their corresponding eigenvectors. Then, the solution to the output weights $\beta$ is given by

$$\beta^* = [\tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_{n_o+1}], \quad (23)$$

where $\tilde{v}_i = v_i / \|H v_i\|$, $i = 2, \ldots, n_o+1$, are the normalized eigenvectors.

If the number of training patterns is fewer than the number of hidden neurons, problem (21) is underdetermined. In this case, we have the following alternative formulation by using the same trick as in the previous sections:

$$\left(I_{N} + \lambda L H H^T\right) u = \gamma\, H H^T u. \quad (24)$$

Again, let $u_1, u_2, \ldots, u_{n_o+1}$ be the generalized eigenvectors corresponding to the $(n_o+1)$ smallest eigenvalues of (24); then the final solution is given by

$$\beta^* = H^T[\tilde{u}_2, \tilde{u}_3, \ldots, \tilde{u}_{n_o+1}], \quad (25)$$

where $\tilde{u}_i = u_i / \|H H^T u_i\|$, $i = 2, \ldots, n_o+1$, are the normalized eigenvectors.

If our task is clustering, we can adopt the k-means algorithm to perform clustering in the embedded space. We summarize the proposed US-ELM in Algorithm 2.

Algorithm 2: The US-ELM algorithm
Input: The training data $X \in \mathbb{R}^{N \times n_i}$.
Output: For the embedding task, the embedding in an $n_o$-dimensional space, $E \in \mathbb{R}^{N \times n_o}$; for the clustering task, the label vector of cluster indices, $y \in \mathbb{N}_{+}^{N \times 1}$.
Step 1: Construct the graph Laplacian $L$ from $X$.
Step 2: Initiate an ELM network of $n_h$ hidden neurons with random input weights, and calculate the output matrix of the hidden neurons $H \in \mathbb{R}^{N \times n_h}$.
Step 3: If $n_h \le N$, find the generalized eigenvectors $v_2, v_3, \ldots, v_{n_o+1}$ of (21) corresponding to the second through the $(n_o+1)$th smallest eigenvalues, and let $\beta = [\tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_{n_o+1}]$, where $\tilde{v}_i = v_i / \|H v_i\|$. Else, find the generalized eigenvectors $u_2, u_3, \ldots, u_{n_o+1}$ of (24) corresponding to the second through the $(n_o+1)$th smallest eigenvalues, and let $\beta = H^T[\tilde{u}_2, \tilde{u}_3, \ldots, \tilde{u}_{n_o+1}]$, where $\tilde{u}_i = u_i / \|H H^T u_i\|$.
Step 4: Calculate the embedding matrix $E = H\beta$.
Step 5 (for clustering only): Treat each row of $E$ as a point, and cluster the $N$ points into $K$ clusters using the k-means algorithm. Let $y$ be the label vector of cluster indices for all the points.
Return: $E$ (for the embedding task) or $y$ (for the clustering task).

Table I. Details of the data sets used for semi-supervised learning

  Data set    Classes  Dimension  |L|  |U|   |V|  |T|
  G50C        2        50         50   314   50   136
  COIL20(B)   2        1024       40   1000  40   360
  USPST(B)    2        256        50   1409  50   498
  COIL20      20       1024       40   1000  40   360
  USPST       10       256        50   1409  50   498

Remark: Comparing the supervised ELM, the semi-supervised ELM and the unsupervised ELM, we can observe that all the algorithms have two similar stages in the training process, that is, the random feature learning stage and the output weights learning stage. Under this two-stage framework, it is easy to find the differences and similarities between the three algorithms. Actually, all the algorithms share the same stage of random feature learning, and this is the essence of ELM theory. This also means that no matter whether the task is a supervised, semi-supervised or unsupervised learning problem, we can always follow the same steps to generate the hidden layer.
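A short sketch of Algorithm 2 (the $n_h \le N$ branch) is given below. It assumes SciPy and scikit-learn are available for the generalized symmetric eigenproblem and for k-means; the function names, the small diagonal jitter on $H^T H$, and the default hyperparameters are our own choices, not the authors'.

```python
import numpy as np
from scipy.linalg import eigh          # generalized symmetric eigensolver, eigenvalues ascending
from sklearn.cluster import KMeans     # only needed for the clustering variant

def us_elm_embed(H, L, n_o=2, lam=0.1):
    """US-ELM embedding, Eqs. (20)-(23): solve (I + lam H^T L H) v = gamma (H^T H) v,
    discard the trivial first eigenvector, normalize so that ||H v_i|| = 1, return E = H beta."""
    n_h = H.shape[1]
    A = np.eye(n_h) + lam * H.T @ L @ H
    B = H.T @ H + 1e-10 * np.eye(n_h)          # small jitter in case H^T H is rank-deficient (our assumption)
    _, V = eigh(A, B)
    V = V[:, 1:n_o + 1]                        # drop the first eigenvector, keep the next n_o
    beta = V / np.linalg.norm(H @ V, axis=0)   # column-wise normalization, Eq. (23)
    return H @ beta

def us_elm_cluster(H, L, n_clusters, lam=0.1):
    """Clustering variant (Algorithm 2, Step 5): k-means on the US-ELM embedding."""
    E = us_elm_embed(H, L, n_o=n_clusters, lam=lam)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(E)
```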
The differences between the three types of ELMs lie in the second stage, i.e., how the output weights are computed. In the supervised ELM and SS-ELM, the output weights are trained by solving a regularized least squares problem, while the output weights in the US-ELM are obtained by solving a generalized eigenvalue problem. The unified framework for the three types of ELMs might provide new perspectives to further develop the ELM theory.

VII. EXPERIMENTAL RESULTS

We evaluated our algorithms on a wide range of semi-supervised and unsupervised tasks. Comparisons were made with related state-of-the-art algorithms, e.g., the Transductive SVM (TSVM) [54], LapSVM [47] and LapRLS [47] for semi-supervised learning, and Laplacian Eigenmaps (LE) [50], spectral clustering (SC) [51] and the deep autoencoder (DA) [55] for unsupervised learning. All algorithms were implemented using Matlab R2012a on a 2.60 GHz machine with 4 GB of memory.

Table III. Training time (in seconds) comparison of TSVM, LapRLS, LapSVM and SS-ELM

  Data set    TSVM   LapRLS  LapSVM  SS-ELM
  G50C        0.324  0.041   0.045   0.035
  COIL20(B)   16.82  0.512   0.459   0.516
  USPST(B)    68.44  0.921   0.947   1.029
  COIL20      18.43  5.841   4.946   0.814
  USPST       68.14  7.121   7.259   1.373

A. Semi-supervised learning results

1) Data sets: We tested the SS-ELM on five popular semi-supervised learning benchmarks, which have been widely used for evaluating semi-supervised algorithms [52], [56], [57].

• The G50C is a binary classification data set in which each class is generated by a 50-dimensional multivariate Gaussian distribution. This classification problem is explicitly designed so that the true Bayes error is 5%.
• The Columbia Object Image Library (COIL20) is a multi-class image classification data set which consists of 1440 gray-scale images of 20 objects. Each pattern is a 32×32 gray-scale image of one object taken from a specific view. The COIL20(B) data set is a binary classification task obtained from COIL20 by grouping the first 10 objects as Class 1 and the last 10 objects as Class 2.
• The USPST data set is a subset (the testing set) of the well-known handwritten digit recognition data set USPS. The USPST(B) data set is a binary classification task obtained from USPST by grouping the first 5 digits as Class 1 and the last 5 digits as Class 2.

2) Experimental setup: We followed the experimental setup in [57] to evaluate the semi-supervised algorithms. Specifically, each of the data sets is split into 4 folds, one of which was used for testing (denoted by T) and the remaining 3 folds for training. Each of the folds was used as the testing set once (4-fold cross-validation). As in [57], this random fold generation process was repeated 3 times, resulting in 12 different splits in total. Every training set was further partitioned into a labeled set L, a validation set V, and an unlabeled set U. When we train a semi-supervised learning algorithm, the labeled data from L and the unlabeled data from U were used. The validation set, which consists of labeled data, was only used for model selection, i.e., finding the optimal hyperparameters C0 and λ in the SS-ELM algorithm. The characteristics of the data sets used in our experiment are summarized in Table I.
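For concreteness, the splitting protocol just described could be scripted roughly as follows. This is an illustrative sketch under our own naming, not the authors' experiment code, and the fold sizes per data set would be set to match Table I.

```python
import numpy as np

def make_ssl_splits(n, n_labeled, n_val, n_repeats=3, n_folds=4, seed=0):
    """Yield (labeled, unlabeled, validation, test) index arrays: n_repeats random
    4-fold partitions, each fold serving once as test set T, the remaining folds
    split into L (n_labeled), V (n_val) and U (the rest)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        folds = np.array_split(perm, n_folds)
        for t in range(n_folds):
            test = folds[t]
            train = rng.permutation(np.concatenate([folds[j] for j in range(n_folds) if j != t]))
            labeled = train[:n_labeled]
            val = train[n_labeled:n_labeled + n_val]
            unlabeled = train[n_labeled + n_val:]
            yield labeled, unlabeled, val, test
```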
The training of SS-ELM consists of two stages: 1) generating the random hidden layer; and 2) training the output weights using (17) or (18). In the first stage, we adopted the Sigmoid function for the nonlinear mapping, and the input weights and biases were generated according to the uniform distribution on (-1, 1). The number of hidden neurons n_h was fixed to 1000 for G50C, and to 2000 for the other four data sets. In the second stage, we first need to build the graph Laplacian L. We followed the methods discussed in [52] and [57] to compute L, and the hyperparameter settings can be found in [47], [52] and [57]. The tradeoff parameters C and λ were selected from
A plain-language explanation of embedding results
A plain-language explanation of embedding results is as follows:
An embedding is a method for mapping data from a high-dimensional space into a lower-dimensional space; the result can be viewed as a dimensionality-reduced representation of the data.
For word or text data, an embedding represents each word or piece of text as a vector that captures its semantic and contextual information.
Through training, embeddings learn the similarity and relatedness between words or texts, producing vectors in which semantically similar items lie close together.
These vectors can then be used for many tasks, such as clustering, classification, and text-similarity comparison.
In text classification, embeddings turn each text into a vector, and those vectors are fed to a classifier.
In clustering, embeddings allow similar texts to be grouped together.
In text-similarity comparison, embeddings make it possible to measure how similar two texts are.
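As a minimal illustration of that last use case, the sketch below compares embedding vectors with cosine similarity. The vectors here are invented 4-dimensional placeholders, not the output of any particular embedding model (real models produce hundreds of dimensions).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: near 1.0 = very similar direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embeddings for three sentences, for illustration only.
emb = {
    "the cat sat on the mat":    np.array([0.90, 0.10, 0.30, 0.00]),
    "a kitten rests on a rug":   np.array([0.80, 0.20, 0.35, 0.05]),
    "stock prices fell sharply": np.array([0.05, 0.90, 0.00, 0.40]),
}

pairs = [("the cat sat on the mat", "a kitten rests on a rug"),
         ("the cat sat on the mat", "stock prices fell sharply")]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {cosine_similarity(emb[a], emb[b]):.3f}")
```

The semantically related pair scores much higher than the unrelated pair, which is exactly the behavior the explanation above describes.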
Differentiating the four parameters of a logistic model
The "four logistic parameters" here refer to four parameters of a logistic-regression-style model: the input-layer weight vector $w$, the bias $b$, the output-layer weight vector $w'$, and the bias $b'$.
The procedure for differentiating with respect to these four parameters is as follows:
1. Call `torch.autograd.backward()` on the loss to run back-propagation. It returns nothing, but it updates the gradient attribute of every leaf node; its argument list includes `grad_tensors`.
2. Call `torch.autograd.grad()` to compute gradients directly. It returns the derivatives as tensors. Its arguments include `outputs` (the quantity being differentiated), `inputs` (the variables to differentiate with respect to), `create_graph=True` (keeps the computation graph so higher-order derivatives can be taken), `retain_graph` (similar purpose), and `grad_outputs` (the counterpart of `grad_tensors` above).
Note that when computing gradients repeatedly, the accumulated gradients must be zeroed first; they are not cleared automatically.
When differentiating the four logistic parameters, choose whichever of these methods fits your needs.
If you would like to know more about differentiating the four logistic parameters, feel free to ask.
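A small PyTorch sketch consistent with the two options above is shown below. The tiny two-layer logistic model and the random data are invented for illustration; only the `torch.autograd` calls themselves come from the description.

```python
import torch

# A tiny two-layer logistic model with the four parameters named above: w, b, w2, b2.
torch.manual_seed(0)
x = torch.randn(8, 3)                      # invented mini-batch
y = torch.randint(0, 2, (8, 1)).float()    # invented binary labels

w  = torch.randn(3, 4, requires_grad=True)
b  = torch.zeros(4, requires_grad=True)
w2 = torch.randn(4, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

h = torch.sigmoid(x @ w + b)
p = torch.sigmoid(h @ w2 + b2)
loss = torch.nn.functional.binary_cross_entropy(p, y)

# Option 1: backward() fills the .grad attribute of each leaf tensor.
loss.backward(retain_graph=True)
print(w.grad.shape, b.grad.shape, w2.grad.shape, b2.grad.shape)

# Option 2: torch.autograd.grad() returns the derivatives directly;
# create_graph=True keeps the graph so higher-order derivatives are possible.
grads = torch.autograd.grad(loss, (w, b, w2, b2), create_graph=True)
print([g.shape for g in grads])

# Zero the accumulated gradients before the next backward pass.
for param in (w, b, w2, b2):
    param.grad = None
```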
A Caffe re-implementation of the GoogLeNet deep learning model
Going deeper with convolutions

Christian Szegedy (Google Inc.), Wei Liu (University of North Carolina, Chapel Hill), Yangqing Jia (Google Inc.), Pierre Sermanet (Google Inc.), Scott Reed (University of Michigan), Dragomir Anguelov (Google Inc.), Dumitru Erhan (Google Inc.), Vincent Vanhoucke (Google Inc.), Andrew Rabinovich (Google Inc.)

Abstract

We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.

1 Introduction

In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. The biggest gains in object detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but can be put to real-world use, even on large datasets, at a reasonable cost.

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12] in conjunction with the famous "we need to go deeper" internet meme [1]. In our case, the word "deep" is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the "Inception module" and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are
experimentally verified on the ILSVRC2014classification and detection challenges,on which it significantly outperforms the current state of the art.2Related WorkStarting with LeNet-5[10],convolutional neural networks(CNN)have typically had a standard structure–stacked convolutional layers(optionally followed by contrast normalization and max-pooling)are followed by one or more fully-connected layers.Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge[9,21].For larger datasets such as Imagenet,the recent trend has been to increase the number of layers[12]and layer size[21,14], while using dropout[7]to address the problem of overfitting.Despite concerns that max-pooling layers result in loss of accurate spatial information,the same convolutional network architecture as[9]has also been successfully employed for localization[9, 14],object detection[6,14,18,5]and human pose estimation[19].Inspired by a neuroscience model of the primate visual cortex,Serre et al.[15]use a series offixed Gaborfilters of different sizes in order to handle multiple scales,similarly to the Inception model.However,contrary to thefixed 2-layer deep model of[15],allfilters in the Inception model are learned.Furthermore,Inception layers are repeated many times,leading to a22-layer deep model in the case of the GoogLeNet model.Network-in-Network is an approach proposed by Lin et al.[12]in order to increase the representa-tional power of neural networks.When applied to convolutional layers,the method could be viewed as additional1×1convolutional layers followed typically by the rectified linear activation[9].This enables it to be easily integrated in the current CNN pipelines.We use this approach heavily in our architecture.However,in our setting,1×1convolutions have dual purpose:most critically,they are used mainly as dimension reduction modules to remove computational bottlenecks,that would otherwise limit the size of our networks.This allows for not just increasing the depth,but also the width of our networks without significant performance penalty.The current leading approach for object detection is the Regions with Convolutional Neural Net-works(R-CNN)proposed by Girshick et al.[6].R-CNN decomposes the overall detection problem into two subproblems:tofirst utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion,and to then use CNN classifiers to identify object categories at those locations.Such a two stage approach leverages the accuracy of bound-ing box segmentation with low-level cues,as well as the highly powerful classification power of state-of-the-art CNNs.We adopted a similar pipeline in our detection submissions,but have ex-plored enhancements in both stages,such as multi-box[5]prediction for higher object bounding box recall,and ensemble approaches for better categorization of bounding box proposals.3Motivation and High Level ConsiderationsThe most straightforward way of improving the performance of deep neural networks is by increas-ing their size.This includes both increasing the depth–the number of levels–of the network and its width:the number of units at each level.This is as an easy and safe way of training higher quality models,especially given the availability of a large amount of labeled training data.However this simple solution comes with two major drawbacks.Bigger size typically 
means a larger number of parameters,which makes the enlarged network more prone to overfitting,especially if the number of labeled examples in the training set is limited. This can become a major bottleneck,since the creation of high quality training sets can be tricky(a)Siberian husky(b)Eskimo dogFigure1:Two distinct classes from the1000classes of the ILSVRC2014classification challenge.and expensive,especially if expert human raters are necessary to distinguish betweenfine-grained visual categories like those in ImageNet(even in the1000-class ILSVRC subset)as demonstrated by Figure1.Another drawback of uniformly increased network size is the dramatically increased use of compu-tational resources.For example,in a deep vision network,if two convolutional layers are chained, any uniform increase in the number of theirfilters results in a quadratic increase of computation.If the added capacity is used inefficiently(for example,if most weights end up to be close to zero), then a lot of computation is wasted.Since in practice the computational budget is alwaysfinite,an efficient distribution of computing resources is preferred to an indiscriminate increase of size,even when the main objective is to increase the quality of results.The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures,even inside the convolutions.Besides mimicking biological systems,this would also have the advantage offirmer theoretical underpinnings due to the ground-breaking work of Arora et al.[2].Their main result states that if the probability distribution of the data-set is representable by a large,very sparse deep neural network,then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.Although the strict math-ematical proof requires very strong conditions,the fact that this statement resonates with the well known Hebbian principle–neurons thatfire together,wire together–suggests that the underlying idea is applicable even under less strict conditions,in practice.On the downside,todays computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures.Even if the number of arithmetic operations is reduced by100×,the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off.The gap is widened even further by the use of steadily improving, highly tuned,numerical libraries that allow for extremely fast dense matrix multiplication,exploit-ing the minute details of the underlying CPU or GPU hardware[16,9].Also,non-uniform sparse models require more sophisticated engineering and computing infrastructure.Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of em-ploying convolutions.However,convolutions are implemented as collections of dense connections to the patches in the earlier layer.ConvNets have traditionally used random and sparse connection tables in the feature dimensions since[11]in order to break the symmetry and improve learning,the trend changed back to full connections with[9]in order to better optimize parallel computing.The uniformity of the structure and a large number offilters and greater batch size allow for utilizing efficient dense computation.This raises the question whether there is any hope for a next,intermediate step:an 
architecture that makes use of the extra sparsity,even atfilter level,as suggested by the theory,but exploits ourcurrent hardware by utilizing computations on dense matrices.The vast literature on sparse matrix computations(e.g.[3])suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication.It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.The Inception architecture started out as a case study of thefirst author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by[2]for vision networks and covering the hypothesized outcome by dense,read-ily available components.Despite being a highly speculative undertaking,only after two iterations on the exact choice of topology,we could already see modest gains against the reference architec-ture based on[12].After further tuning of learning rate,hyperparameters and improved training methodology,we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for[6]and[5].Interestingly,while most of the original architectural choices have been questioned and tested thoroughly,they turned out to be at least locally optimal.One must be cautious though:although the proposed architecture has become a success for computer vision,it is still questionable whether its quality can be attributed to the guiding principles that have lead to its construction.Making sure would require much more thorough analysis and verification: for example,if automated tools based on the principles described below wouldfind similar,but better topology for the vision networks.The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with very differently looking global architecture.At very least,the initial success of the Inception architecture yieldsfirm motivation for exciting future work in this direction.4Architectural DetailsThe main idea of the Inception architecture is based onfinding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.Note that assuming translation invariance means that our network will be built from convolutional building blocks.All we need is tofind the optimal local construction and to repeat it spatially.Arora et al.[2]suggests a layer-by layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. 
These clusters form the units of the next layer and are connected to the units in the previous layer.We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped intofilter banks.In the lower layers(the ones close to the input)correlated units would concentrate in local regions.This means,we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of1×1convolutions in the next layer,as suggested in[12].However,one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches,and there will be a decreasing number of patches over larger and larger regions.In order to avoid patch-alignment issues,current incarnations of the Inception architecture are restricted tofilter sizes1×1, 3×3and5×5,however this decision was based more on convenience rather than necessity.It also means that the suggested architecture is a combination of all those layers with their outputfilter banks concatenated into a single output vector forming the input of the next stage.Additionally, since pooling operations have been essential for the success in current state of the art convolutional networks,it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect,too(see Figure2(a)).As these“Inception modules”are stacked on top of each other,their output correlation statistics are bound to vary:as features of higher abstraction are captured by higher layers,their spatial concentration is expected to decrease suggesting that the ratio of3×3and5×5convolutions should increase as we move to higher layers.One big problem with the above modules,at least in this na¨ıve form,is that even a modest number of 5×5convolutions can be prohibitively expensive on top of a convolutional layer with a large number offilters.This problem becomes even more pronounced once pooling units are added to the mix: their number of outputfilters equals to the number offilters in the previous stage.The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable(a)Inception module,na¨ıve version(b)Inception module with dimension reductionsFigure2:Inception moduleincrease in the number of outputs from stage to stage.Even while this architecture might cover the optimal sparse structure,it would do it very inefficiently,leading to a computational blow up within a few stages.This leads to the second idea of the proposed architecture:judiciously applying dimension reduc-tions and projections wherever the computational requirements would increase too much otherwise. 
This is based on the success of embeddings:even low dimensional embeddings might contain a lot of information about a relatively large image patch.However,embeddings represent information in a dense,compressed form and compressed information is harder to model.We would like to keep our representation sparse at most places(as required by the conditions of[2])and compress the signals only whenever they have to be aggregated en masse.That is,1×1convolutions are used to compute reductions before the expensive3×3and5×5convolutions.Besides being used as reduc-tions,they also include the use of rectified linear activation which makes them dual-purpose.The final result is depicted in Figure2(b).In general,an Inception network is a network consisting of modules of the above type stacked upon each other,with occasional max-pooling layers with stride2to halve the resolution of the grid.For technical reasons(memory efficiency during training),it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary,simply reflecting some infrastructural inefficiencies in our current implementation.One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity.The ubiquitous use of dimension reduction allows for shielding the large number of inputfilters of the last stage to the next layer,first reducing their dimension before convolving over them with a large patch size.Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties.Another way to utilize the inception architecture is to create slightly inferior,but computationally cheaper versions of it.We have found that all the included the knobs and levers allow for a controlled balancing of computational resources that can result in networks that are2−3×faster than similarly performing networks with non-Inception architecture,however this requires careful manual design at this point. 
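To make the module just described concrete, here is a compact PyTorch sketch of an Inception module with dimension reductions (Figure 2(b)): 1×1, 3×3 and 5×5 branches plus a parallel pooling path, with 1×1 reductions before the expensive convolutions. This is not the authors' Caffe implementation; the class and argument names are ours, and the channel counts in the usage example follow the inception(3a) row of Table 1.

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Inception module with dimension reductions: four parallel branches whose
    outputs are concatenated along the channel dimension."""

    def __init__(self, in_ch, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, n1x1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, n3x3red, 1), nn.ReLU(inplace=True),   # 1x1 reduction
                                nn.Conv2d(n3x3red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, n5x5red, 1), nn.ReLU(inplace=True),   # 1x1 reduction
                                nn.Conv2d(n5x5red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),                  # parallel pooling path
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# inception(3a): 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels at 28x28.
m = Inception(192, n1x1=64, n3x3red=96, n3x3=128, n5x5red=16, n5x5=32, pool_proj=32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```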
5GoogLeNetWe chose GoogLeNet as our team-name in the ILSVRC14competition.This name is an homage to Yann LeCuns pioneering LeNet5network[10].We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition.We have also used a deeper and wider Inception network,the quality of which was slightly inferior,but adding it to the ensemble seemed to improve the results marginally.We omit the details of that network,since our experiments have shown that the influence of the exact architectural parameters is relativelytype patch size/strideoutputsizedepth#1×1#3×3reduce#3×3#5×5reduce#5×5poolprojparams opsconvolution7×7/2112×112×641 2.7K34M max pool3×3/256×56×640convolution3×3/156×56×192264192112K360M max pool3×3/228×28×1920inception(3a)28×28×25626496128163232159K128M inception(3b)28×28×4802128128192329664380K304M max pool3×3/214×14×4800inception(4a)14×14×512219296208164864364K73M inception(4b)14×14×5122160112224246464437K88M inception(4c)14×14×5122128128256246464463K100M inception(4d)14×14×5282112144288326464580K119M inception(4e)14×14×832225616032032128128840K170M max pool3×3/27×7×8320inception(5a)7×7×8322256160320321281281072K54M inception(5b)7×7×10242384192384481281281388K71M avg pool7×7/11×1×10240dropout(40%)1×1×10240linear1×1×100011000K1M softmax1×1×10000Table1:GoogLeNet incarnation of the Inception architectureminor.Here,the most successful particular instance(named GoogLeNet)is described in Table1for demonstrational purposes.The exact same topology(trained with different sampling methods)was used for6out of the7models in our ensemble.All the convolutions,including those inside the Inception modules,use rectified linear activation. The size of the receptivefield in our network is224×224taking RGB color channels with mean sub-traction.“#3×3reduce”and“#5×5reduce”stands for the number of1×1filters in the reduction layer used before the3×3and5×5convolutions.One can see the number of1×1filters in the pro-jection layer after the built-in max-pooling in the pool proj column.All these reduction/projection layers use rectified linear activation as well.The network was designed with computational efficiency and practicality in mind,so that inference can be run on individual devices including even those with limited computational resources,espe-cially with low-memory footprint.The network is22layers deep when counting only layers with parameters(or27layers if we also count pooling).The overall number of layers(independent build-ing blocks)used for the construction of the network is about100.However this number depends on the machine learning infrastructure system used.The use of average pooling before the classifier is based on[12],although our implementation differs in that we use an extra linear layer.This enables adapting andfine-tuning our networks for other label sets easily,but it is mostly convenience and we do not expect it to have a major effect.It was found that a move from fully connected layers to average pooling improved the top-1accuracy by about0.6%,however the use of dropout remained essential even after removing the fully connected layers.Given the relatively large depth of the network,the ability to propagate gradients back through all the layers in an effective manner was a concern.One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative.By adding auxiliary classifiers 
connected to these intermediate layers,we would expect to encourage discrimination in the lower stages in the classifier,increase the gradient signal that gets propagated back,and provide additional regulariza-tion.These classifiers take the form of smaller convolutional networks put on top of the output of the Inception(4a)and(4d)modules.During training,their loss gets added to the total loss of the network with a discount weight(the losses of the auxiliary classifiers were weighted by0.3).At inference time,these auxiliary networks are discarded.The exact structure of the extra network on the side,including the auxiliary classifier,is as follows:•An average pooling layer with5×5filter size and stride3,resulting in an4×4×512output for the(4a),and4×4×528for the(4d)stage.Figure3:GoogLeNet network with all the bells and whistles•A1×1convolution with128filters for dimension reduction and rectified linear activation.•A fully connected layer with1024units and rectified linear activation.•A dropout layer with70%ratio of dropped outputs.•A linear layer with softmax loss as the classifier(predicting the same1000classes as the main classifier,but removed at inference time).A schematic view of the resulting network is depicted in Figure3.6Training MethodologyOur networks were trained using the DistBelief[4]distributed machine learning system using mod-est amount of model and data-parallelism.Although we used CPU based implementation only,a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week,the main limitation being the memory usage.Our training used asynchronous stochastic gradient descent with0.9momentum[17],fixed learning rate schedule(de-creasing the learning rate by4%every8epochs).Polyak averaging[13]was used to create thefinal model used at inference time.Our image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options,sometimes in conjunction with changed hyperparameters,like dropout and learning rate,so it is hard to give a definitive guidance to the most effective single way to train these networks.To complicate matters further,some of the models were mainly trained on smaller relative crops,others on larger ones,inspired by[8]. Still,one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between8%and100%of the image area and whose aspect ratio is chosen randomly between3/4and4/3.Also,we found that the photometric distortions by Andrew Howard[8]were useful to combat overfitting to some extent.In addition,we started to use random interpolation methods(bilinear,area,nearest neighbor and cubic, with equal probability)for resizing relatively late and in conjunction with other hyperparameter changes,so we could not tell definitely whether thefinal results were affected positively by their use.7ILSVRC2014Classification Challenge Setup and ResultsThe ILSVRC2014classification challenge involves the task of classifying the image into one of 1000leaf-node categories in the Imagenet hierarchy.There are about1.2million images for training, 50,000for validation and100,000images for testing.Each image is associated with one ground truth category,and performance is measured based on the highest scoring classifier predictions. 
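The auxiliary classifier described earlier in this section can be sketched as follows. Layer sizes (5×5 average pooling with stride 3, a 1×1 convolution with 128 filters, a 1024-unit fully connected layer, 70% dropout, and a 1000-way linear classifier) follow the bullet list above; the class name, the loss-combination snippet, and the use of PyTorch rather than the authors' training system are our own illustrative choices.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Sketch of the side classifier attached to inception (4a)/(4d) during training only."""

    def __init__(self, in_ch, n_classes=1000):
        super().__init__()
        self.head = nn.Sequential(
            nn.AvgPool2d(kernel_size=5, stride=3),      # 14x14 feature map -> 4x4
            nn.Conv2d(in_ch, 128, kernel_size=1),       # 1x1 conv, 128 filters
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.7),                          # 70% dropout
            nn.Linear(1024, n_classes),                 # softmax is folded into the loss below
        )

    def forward(self, x):
        return self.head(x)

# During training the auxiliary losses are added with a discount weight of 0.3, e.g.:
#   total_loss = main_loss + 0.3 * aux_loss_4a + 0.3 * aux_loss_4d
aux4a = AuxClassifier(in_ch=512)                        # inception(4a) output has 512 channels
logits = aux4a(torch.randn(2, 512, 14, 14))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 7]))
```

At inference time these side networks are simply discarded, as stated in the text.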
Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top 5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.

We participated in the challenge with no external data used for training. In addition to the training techniques mentioned earlier in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we elaborate below.

1. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, mainly because of an oversight) and learning rate policies, and they only differ in sampling methodologies and the random order in which they see input images.

2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, and take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year's entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).

3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to inferior performance compared with simple averaging.

In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission. Our final submission to the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among all participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about a 40% relative reduction compared to the previous year's best approach (Clarifai), both of which used external data for training the classifiers. Table 2 shows the statistics of some of the top-performing approaches.

Table 2: Classification performance

Team        | Year | Place | Error (top-5) | Uses external data
SuperVision | 2012 | 1st   | 16.4%         | no
SuperVision | 2012 | 1st   | 15.3%         | Imagenet 22k
Clarifai    | 2013 | 1st   | 11.7%         | no
Clarifai    | 2013 | 1st   | 11.2%         | Imagenet 22k
MSRA        | 2014 | 3rd   | 7.35%         | no
VGG         | 2014 | 2nd   | 7.32%         | no
GoogLeNet   | 2014 | 1st   | 6.67%         | no

Table 3: GoogLeNet classification performance breakdown

Number of models | Number of crops | Cost | Top-5 error | Compared to base
1                | 1               | 1    | 10.07%      | base
1                | 10              | 10   | 9.15%       | -0.92%
1                | 144             | 144  | 7.89%       | -2.18%
7                | 1               | 7    | 8.09%       | -1.98%
7                | 10              | 70   | 7.62%       | -2.45%
7                | 144             | 1008 | 6.67%       | -3.45%

We also analyze and report in Table 3 the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image. When we use one model, we chose the one with the lowest top-1 error rate on the validation
data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.

8 ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain
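To make the 50% overlap criterion concrete, the following sketch computes the Jaccard index (intersection-over-union) between two axis-aligned boxes. The (x1, y1, x2, y2) corner convention and the function name are our own illustrative choices, not part of the challenge definition.

```python
def jaccard_index(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection with the correct class counts as a true positive when
# jaccard_index(predicted_box, ground_truth_box) >= 0.5.
```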
On Recovery of Sparse Signals Via ℓ1 Minimization
3388IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009On Recovery of Sparse Signals Via `1 MinimizationT. Tony Cai, Guangwu Xu, and Jun Zhang, Senior Member, IEEEAbstract—This paper considers constrained `1 minimization methods in a unified framework for the recovery of high-dimensional sparse signals in three settings: noiseless, bounded error, and Gaussian noise. Both `1 minimization with an ` constraint (Dantzig selector) and `1 minimization under an `2 constraint are considered. The results of this paper improve the existing results in the literature by weakening the conditions and tightening the error bounds. The improvement on the conditions shows that signals with larger support can be recovered accurately. In particular, our results illustrate the relationship between `1 minimization with an `2 constraint and `1 minimization with an ` constraint. This paper also establishes connections between restricted isometry property and the mutual incoherence property. Some results of Candes, Romberg, and Tao (2006), Candes and Tao (2007), and Donoho, Elad, and Temlyakov (2006) are extended.11linear equations with more variables than the number of equations. It is clear that the problem is ill-posed and there are generally infinite many solutions. However, in many applications the vector is known to be sparse or nearly sparse in the sense that it contains only a small number of nonzero entries. This sparsity assumption fundamentally changes the problem. Although there are infinitely many general solutions, under regularity conditions there is a unique sparse solution. Indeed, in many cases the unique sparse solution can be found exactly through minimization subject to (I.2)Index Terms—Dantzig selector`1 minimization, restricted isometry property, sparse recovery, sparsity.I. INTRODUCTION HE problem of recovering a high-dimensional sparse signal based on a small number of measurements, possibly corrupted by noise, has attracted much recent attention. This problem arises in many different settings, including compressed sensing, constructive approximation, model selection in linear regression, and inverse problems. Suppose we have observations of the formTThis minimization problem has been studied, for example, in Fuchs [13], Candes and Tao [5], and Donoho [8]. Understanding the noiseless case is not only of significant interest in its own right, it also provides deep insight into the problem of reconstructing sparse signals in the noisy case. See, for example, Candes and Tao [5], [6] and Donoho [8], [9]. minWhen noise is present, there are two well-known imization methods. One is minimization under an constraint on the residuals -Constraint subject to (I.3)(I.1) with is given and where the matrix is a vector of measurement errors. The goal is to reconstruct the unknown vector . Depending on settings, the error vector can either be zero (in the noiseless case), bounded, or . It is now well understood that Gaussian where minimization provides an effective way for reconstructing a sparse signal in all three settings. See, for example, Fuchs [13], Candes and Tao [5], [6], Candes, Romberg, and Tao [4], Tropp [18], and Donoho, Elad, and Temlyakov [10]. A special case of particular interest is when no noise is present . This is then an underdetermined system of in (I.1) andManuscript received May 01, 2008; revised November 13, 2008. Current version published June 24, 2009. The work of T. 
Cai was supported in part by the National Science Foundation (NSF) under Grant DMS-0604954, the work of G. Xu was supported in part by the National 973 Project of China (No. 2007CB807903). T. T. Cai is with the Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: tcai@wharton.upenn. edu). G. Xu and J. Zhang are with the Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53211 USA (e-mail: gxu4uwm@; junzhang@uwm. edu). Communicated by J. Romberg, Associate Editor for Signal Processing. Digital Object Identifier 10.1109/TIT.2009.2021377Writing in terms of the Lagrangian function of ( -Constraint), this is closely related to finding the solution to the regularized least squares (I.4) The latter is often called the Lasso in the statistics literature (Tibshirani [16]). Tropp [18] gave a detailed treatment of the regularized least squares problem. Another method, called the Dantzig selector, was recently proposed by Candes and Tao [6]. The Dantzig selector solves the sparse recovery problem through -minimization with a constraint on the correlation between the residuals and the column vectors of subject to (I.5)Candes and Tao [6] showed that the Dantzig selector can be computed by solving a linear program subject to and where the optimization variables are . Candes and Tao [6] also showed that the Dantzig selector mimics the perfor. mance of an oracle procedure up to a logarithmic factor0018-9448/$25.00 © 2009 IEEEAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3389It is clear that some regularity conditions are needed in order for these problems to be well behaved. Over the last few years, many interesting results for recovering sparse signals have been obtained in the framework of the Restricted Isometry Property (RIP). In their seminal work [5], [6], Candes and Tao considered sparse recovery problems in the RIP framework. They provided beautiful solutions to the problem under some conditions on the so-called restricted isometry constant and restricted orthogonality constant (defined in Section II). These conditions essentially require that every set of columns of with certain cardinality approximately behaves like an orthonormal system. Several different conditions have been imposed in various settings. For example, the condition was used in Candes and in Candes, Romberg, and Tao [5], Tao [4], in Candes and Tao [6], and in Candes [3], where is the sparsity index. A natural question is: Can these conditions be weakened in a unified way? Another widely used condition for sparse recovery is the Mutual Incoherence Property (MIP) which requires the to be pairwise correlations among the column vectors of small. See [10], [11], [13], [14], [18]. minimization methods in a In this paper, we consider single unified framework for sparse recovery in three cases, minnoiseless, bounded error, and Gaussian noise. Both constraint (DS) and minimization imization with an constraint ( -Constraint) are considered. Our under the results improve on the existing results in [3]–[6] by weakening the conditions and tightening the error bounds. In particular, miniour results clearly illustrate the relationship between mization with an constraint and minimization with an constraint (the Dantzig selector). 
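For reference, the three ℓ1 programs discussed here can be written compactly as follows. This is a standard rendering consistent with the descriptions above; the notation y = Fβ + z for the measurement model is our assumption and the symbols may differ slightly from the original typesetting.

```latex
% Noiseless equality constraint, l2 residual constraint, and the Dantzig selector.
\begin{align*}
\text{(Exact)} \quad
  & \hat\beta = \arg\min_{\gamma} \|\gamma\|_1
    \quad \text{subject to } F\gamma = y, \\[4pt]
\text{($\ell_2$-Constraint)} \quad
  & \hat\beta = \arg\min_{\gamma} \|\gamma\|_1
    \quad \text{subject to } \|y - F\gamma\|_2 \le \varepsilon, \\[4pt]
\text{(Dantzig selector, DS)} \quad
  & \hat\beta = \arg\min_{\gamma} \|\gamma\|_1
    \quad \text{subject to } \|F^{T}(y - F\gamma)\|_\infty \le \lambda .
\end{align*}
```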
In addition, we also establish connections between the concepts of RIP and MIP. As an application, we present an improvement to a recent result of Donoho, Elad, and Temlyakov [10]. In all cases, we solve the problems under the weaker condition (I.6) The improvement on the condition shows that for fixed and , signals with larger support can be recovered. Although our main interest is on recovering sparse signals, we state the results in the general setting of reconstructing an arbitrary signal. It is sometimes convenient to impose conditions that involve only the restricted isometry constant . Efforts have been made in this direction in the literature. In [7], the recovery result was . In [3], the weaker established under the condition was used. Similar conditions have condition also been used in the construction of (random) compressed and sensing matrices. For example, conditions were used in [15] and [1], respectively. We shall remark that, our results implies that the weaker conditionsparse recovery problem. We begin the analysis of minimization methods for sparse recovery by considering the exact recovery in the noiseless case in Section III. Our result improves the main result in Candes and Tao [5] by using weaker conditions and providing tighter error bounds. The analysis of the noiseless case provides insight to the case when the observations are contaminated by noise. We then consider the case of bounded error in Section IV. The connections between the RIP and MIP are also explored. Sparse recovery with Gaussian noise is treated in Section V. Appendices A–D contain the proofs of technical results. II. PRELIMINARIES In this section, we first introduce basic notation and definitions, and then present a technical inequality which will be used in proving our main results. . Let be a vector. The Let support of is the subset of defined byFor an integer , a vector is said to be -sparse if . For a given vector we shall denote by the vector with all but the -largest entries (in absolute value) set to zero and define , the vector with the -largest entries (in absolute value) set to zero. We shall use to denote the -norm of the vector . the standard notation Let the matrix and , the -restricted is defined to be the smallest constant isometry constant such that (II.1) for every -sparse vector . If , we can define another quantity, the -restricted orthogonality constant , as the smallest number that satisfies (II.2) for all and such that and are -sparse and -sparse, respectively, and have disjoint supports. Roughly speaking, the and restricted orthogonality constant isometry constant measure how close subsets of cardinality of columns of are to an orthonormal system. and For notational simplicity we shall write for for hereafter. It is easy to see that and are monotone. That is if if Candes and Tao [5] showed that the constants related by the following inequalities and (II.3) (II.4) are (II.5)suffices in sparse signal reconstruction. The paper is organized as follows. In Section II, after basic notation and definitions are reviewed, we introduce an elementary inequality, which allow us to make finer analysis of theAs mentioned in the Introduction, different conditions on and have been used in the literature. It is not always immediately transparent which condition is stronger and which is weaker. We shall present another important property on andAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. 
Restrictions apply.3390IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009which can be used to compare the conditions. In addition, it is especially useful in producing simplified recovery conditions. Proposition 2.1: If , then (II.6)Theorem 3.1 (Candes and Tao [5]): Let satisfies. Suppose (III.1)Let be a -sparse vector and minimizer to the problem. Then .is the uniqueIn particular,.A proof of the proposition is provided in Appendix A. Remark: Candes and Tao [6] imposes in a more recent paper Candes [3] uses consequence of Proposition 2.1 is that a strictly stronger condition than sition 2.1 yields implies that and . A direct is in fact since Propowhich means .As mentioned in the Introduction, other conditions on and have also been used in the literature. Candes, Romberg, and . Candes and Tao Tao [4] uses the condition [6] considers the Gaussian noise case. A special case with noise of Theorem 1.1 in that paper improves Theorem 3.1 level to by weakening the condition from . Candes [3] imposes the condition . We shall show below that these conditions can be uniformly improved by a transparent argument. A direct application of Proposition 2.2 yields the following result which weakens the above conditions toWe now introduce a useful elementary inequality. This inequality allows us to perform finer estimation on , norms. It will be used in proving our main results. Proposition 2.2: Let be a positive integer. Then any descending chain of real numbersNote that it follows from (II.3) and (II.4) that , and . So the condition is weaker . It is also easy to see from (II.5) and than is also weaker than (II.6) that the condition and the other conditions mentioned above. Theorem 3.2: Let . Suppose satisfiessatisfiesand obeys The proof of Proposition 2.2 is given in Appendix B. where III. SIGNAL RECOVERY IN THE NOISELESS CASE As mentioned in the Introduction, we shall give a unified minimization with an contreatment for the methods of straint and minimization with an constraint for recovery of sparse signals in three cases: noiseless, bounded error, and Gaussian noise. We begin in this section by considering the simplest setting: exact recovery of sparse signals when no noise is present. This is an interesting problem by itself and has been considered in a number of papers. See, for example, Fuchs [13], Donoho [8], and Candes and Tao [5]. More importantly, the solutions to this “clean” problem shed light on the noisy case. Our result improves the main result given in Candes and Tao [5]. The improvement is obtained by using the technical inequalities we developed in previous section. Although the focus is on recovering sparse signals, our results are stated in the general setting of reconstructing an arbitrary signal. with and suppose we are given and Let where for some unknown vector . The goal is to recover exactly when it is sparse. Candes and Tao [5] showed that a sparse solution can be obtained by minimization which is then solved via linear programming.. Then the minimizerto the problem., i.e., the In particular, if is a -sparse vector, then minimization recovers exactly. This theorem improves the results in [5], [6]. The improvement on the condition shows that for fixed and , signals with larger support can be recovered accurately. Remark: It is sometimes more convenient to use conditions only involving the restricted isometry constant . Note that the condition (III.2) implies . This is due to the factby Proposition 2.1. Hence, Theorem 3.2 holds under the condican also be used. 
tion (III.2). The condition Proof of Theorem 3.2: The proof relies on Proposition 2.2 and makes use of the ideas from [4]–[6]. In this proof, we shall also identify a vector as a function by assigning .Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3391Let Letbe a solution to the and letminimization problem (Exact). be the support of . WriteClaim:such that. Let In fact, from Proposition 2.2 and the fact that , we have(III.4)For a subset characteristic function of, we use , i.e., if if .to denote theFor each , let . Then is decomposed to . Note that ’s are pairwise disjoint, , and for . We first consider the case where is divisible by . For each , we divide into two halves in the following manner: with where is the first half of , i.e., andIt then follows from Proposition 2.2 thatProposition 2.2 also yieldsand We shall treat four equal parts. as a sum of four functions and divide withintofor any and. ThereforeWe then define that Note thatfor .by. It is clear(III.3)In fact, since, we haveSince, this yieldsIn the rest of our proof we write . Note that . So we get the equation at the top of the following page. This yieldsThe following claim follows from our Proposition 2.2.Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.3392IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009It then follows from (III.4) thatIV. RECOVERY OF SPARSE SIGNALS IN BOUNDED ERROR We now turn to the case of bounded error. The results obtained in this setting have direct implication for the case of Gaussian noise which will be discussed in Section V. and let Letwhere the noise is bounded, i.e., for some bounded set . In this case the noise can either be stochastic or deterministic. The minimization approach is to estimate by the minimizer of subject to We now turn to the case that is not divisible by . Let . Note that in this case and are understood as and , respectively. So the proof for the previous case works if we set and for and We specifically consider two cases: and . The first case is closely connected to the Dantzig selector in the Gaussian noise setting which will be discussed in more detail in Section V. Our results improve the results in Candes, Romberg, and Tao [4], Candes and Tao [6], and Donoho, Elad, and Temlyakov [10]. We shall first consider where satisfies shallLet be the solution to the (DS) problem given in (I.1). The Dantzig selector has the following property. Theorem 4.1: Suppose satisfying . If and with (IV.1) then the solution In this case, we need to use the following inequality whose proof is essentially the same as Proposition 2.2: For any descending chain of real numbers , we have to (DS) obeys (IV.2) with In particular, if . and is a -sparse vector, then .andRemark: Theorem 4.1 is comparable to Theorem 1.1 of Candes and Tao [6], but the result here is a deterministic one instead of a probabilistic one as bounded errors are considered. The proof in Candes and Tao [6] can be adapted to yield aAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3393similar result for bounded errors under the stronger condition . Proof of Theorem 4.1: We shall use the same notation as in the proof of Theorem 3.2. 
Since , letting and following essentially the same steps as in the first part of the proof of Theorem 3.2, we getwhere the noise satisfies . Once again, this problem can be solved through constrained minimization subject to (IV.3)An alternative to the constrained minimization approach is the so-called Lasso given in (I.4). The Lasso recovers a sparse regularized least squares. It is closely connected signal via minimization. The Lasso is a popular to the -constrained method in statistics literature (Tibshirani [16]). See Tropp [18] regularized least squares for a detailed treatment of the problem. By using a similar argument, we have the following result on the solution of the minimization (IV.3). Theorem 4.2: Let vector and with . Suppose . If is a -sparseIf that, then and for every , and we have. The latter forces . Otherwise(IV.4) then for any To finish the proof, we observe the following. 1) . be the submatrix obIn fact, let tained by extracting the columns of according to the in, as in [6]. Then dices in , the minimizer to the problem (IV.3) obeys (IV.5) with .imProof of Theorem 4.2: Notice that the condition , so we can use the first part of the proof plies that of Theorem 3.2. The notation used here is the same as that in the proof of Theorem 3.2. First, we haveand2) In factNote thatSoWe get the result by combining 1) and 2). This completes the proof. We now turn to the second case where the noise is bounded with . The problem is to in -norm. Let from recover the sparse signalRemark: Candes, Romberg, and Tao [4] showed that, if , then(The was set to be This impliesin [4].) Now suppose which yields. ,Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.3394IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009since and Theorem 4.2 that, with. It then follows fromProof of Theorem 4.3: It follows from Proposition 4.1 thatfor all -sparse vector where . Therefore, Theorem 4.2 improves the above result in Candes, Romberg, and Tao [4] by enlarging the support of by 60%. Remark: Similar to Theorems 3.2 and 4.1, we can have the estimation without assuming that is -sparse. In the general case, we haveSince Theorem 4.2, the conditionholds. ByA. Connections Between RIP and MIP In addition to the restricted isometry property (RIP), another commonly used condition in the sparse recovery literature is the mutual incoherence property (MIP). The mutual incoherence property of requires that the coherence bound (IV.6) be small, where are the columns of ( ’s are also assumed to be of length in -norm). Many interesting results on sparse recovery have been obtained by imposing conand the sparsity , see [10], ditions on the coherence bound [11], [13], [14], [18]. For example, a recent paper, Donoho, is a -sparse Elad, and Temlyakov [10] proved that if with , then for any , the vector and minimizer to the problem ( -Constraint) satisfiesRemarks: In this theorem, the result of Donoho, Elad, and Temlyakov [10] is improved in the following ways. to 1) The sparsity is relaxed from . So roughly speaking, Theorem 4.3 improves the result in Donoho, Elad, and Temlyakov [10] by enlarging the support of by 47%. is usually very 2) It is clear that larger is preferred. Since small, the bound is tightened from to , as is close to .V. RECOVERY OF SPARSE SIGNALS IN GAUSSIAN NOISE We now turn to the case where the noise is Gaussian. Suppose we observe (V.1) and wish to recover from and . 
We assume that is known and that the columns of are standardized to have unit norm. This is a case of significant interest, in particular in statistics. Many methods, including the Lasso (Tibshirani [16]), LARS (Efron, Hastie, Johnstone, and Tibshirani [12]) and Dantzig selector (Candes and Tao [6]), have been introduced and studied. The following results show that, with large probability, the Gaussian noise belongs to bounded sets. Lemma 5.1: The Gaussian error satisfies (V.2) and (V.3) Inequality (V.2) follows from standard probability calculations and inequality (V.3) is proved in Appendix D. Lemma 5.1 suggests that one can apply the results obtained in the previous section for the bounded error case to solve the Gaussian noise problem. Candes and Tao [6] introduced the Dantzig selector for sparse recovery in the Gaussian noise setting. Given the observations in (V.1), the Dantzig selector is the minimizer of subject to where . (V.4)with, provided.We shall now establish some connections between the RIP and MIP and show that the result of Donoho, Elad, and Temlyakov [10] can be improved under the RIP framework, by using Theorem 4.2. The following is a simple result that gives RIP constants from MIP. The proof can be found in Appendix C. It is remarked that the first inequality in the next proposition can be found in [17]. Proposition 4.1: Let be the coherence bound for and Now we are able to show the following result. Theorem 4.3: Suppose with satisfying (or, equivalently, the minimizer is a -sparse vector and . Let . If ), then for any . Then (IV.7),to the problem ( -Constraint) obeys (IV.8)with.Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3395In the classical linear regression problem when , the least squares estimator is the solution to the normal equation (V.5) in the convex program The constraint (DS) can thus be viewed as a relaxation of the normal (V.3). And similar to the noiseless case, minimization leads to the “sparsest” solution over the space of all feasible solutions. Candes and Tao [6] showed the following result. Theorem 5.1 (Candes and Tao [6]): Suppose -sparse vector. Let be such that Choose the Dantzig selector is asince . The improvement on the error bound is minor. The improvement on the condition is more significant as it shows signals with larger support can be recovered accurately for fixed and . Remark: Similar to the results obtained in the previous sections, if is not necessarily -sparse, in general we have, with probabilitywhere probabilityand, and within (I.1). Then with large probability, obeys (V.6) where and .with.1As mentioned earlier, the Lasso is another commonly used method in statistics. The Lasso solves the regularized least squares problem (I.4) and is closely related to the -constrained minimization problem ( -Constraint). In the Gaussian error be the mincase, we shall consider a particular setting. Let imizer of subject to (V.7)Remark: Candes and Tao [6] also proved an Oracle Inequality in the Gaussian noise setting under for the Dantzig selector . With some additional work, our the condition approach can be used to improve [6, Theorems 1.2 and 1.3] by . weakening the condition to APPENDIX A PROOF OF PROPOSITION 2.1 Let be -sparse and be supports are disjoint. Decompose such that is -sparse. Suppose their aswhere . 
Combining our results from the last section together with Lemma 5.1, we have the following results on the Dantzig seand the estimator obtained from minimizalector tion under the constraint. Again, these results improve the previous results in the literature by weakening the conditions and providing more precise bounds. Theorem 5.2: Suppose matrix satisfies Then with probability obeys (V.8) with obeys (V.9) with . , and with probability at least , is a -sparse vector and the-sparse for and for . Using the Cauchy–Schwartz inequality, we have, the Dantzig selectorThis yields we also have. Since . APPENDIX B PROOF OF PROPOSITION 2.2,Remark: In comparison to Theorem 5.1, our result in Theto orem 5.2 weakens the condition from and improves the constant in the bound from to . Note thatCandes and Tao [6], the constant C was stated as C appears that there was a typo and the constant C should be C .1InLet)= . It = 4=(1 0 0Authorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.3396IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 7, JULY 2009where eachis given (and bounded) byThereforeand the inequality is proved. Without loss of generality, we assume that is even. Write APPENDIX C PROOF OF PROPOSITION 4.1 Let be a -sparse vector. Without loss of generality, we as. A direct calculation shows sume that thatwhereandNow let us bound the second term. Note thatNowThese give usand henceFor the second inequality, we notice that follows from Proposition 2.1 that. It thenand APPENDIX D PROOF OF LEMMA 5.1 The first inequality is standard. For completeness, we give . Then each a short proof here. Let . Hence marginally has Gaussian distributionAuthorized licensed use limited to: University of Pennsylvania. Downloaded on June 17, 2009 at 09:48 from IEEE Xplore. Restrictions apply.CAI et al.: ON RECOVERY OF SPARSE SIGNALS VIAMINIMIZATION3397where the last step follows from the Gaussian tail probability bound that for a standard Gaussian variable and any constant , . is We now prove inequality (V.3). Note that a random variable. It follows from Lemma 4 in Cai [2] that for anyHencewhere. It now follows from the fact that[6] E. J. Candes and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger than n (with discussion),” Ann. Statist., vol. 35, pp. 2313–2351, 2007. [7] A. Cohen, W. Dahmen, and R. Devore, “Compressed Sensing and Best k -Term Approximation” 2006, preprint. [8] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ` -norm solution is also the sparsest solution,” Commun. Pure Appl. Math., vol. 59, pp. 797–829, 2006. [9] D. L. Donoho, “For most large underdetermined systems of equations, the minimal ` -norm near-solution approximates the sparsest near-solution,” Commun. Pure Appl. Math., vol. 59, pp. 907–934, 2006. [10] D. L. Donoho, M. Elad, and V. N. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006. [11] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001. [12] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression (with discussion,” Ann. Statist., vol. 32, pp. 407–451, 2004. [13] J.-J. Fuchs, “On sparse representations in arbitrary redundant bases,” IEEE Trans. Inf. Theory, vol. 50, no. 6, pp. 1341–1344, Jun. 2004. [14] J.-J. 
Fuchs, “Recovery of exact sparse representations in the presence of bounded noise,” IEEE Trans. Inf. Theory, vol. 51, no. 10, pp. 3601–3608, Oct. 2005. [15] M. Rudelson and R. Vershynin, “Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements,” in Proc. 40th Annu. Conf. Information Sciences and Systems, Princeton, NJ, Mar. 2006, pp. 207–212. [16] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. Ser. B, vol. 58, pp. 267–288, 1996. [17] J. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004. [18] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1030–1051, Mar. 2006. T. Tony Cai received the Ph.D. degree from Cornell University, Ithaca, NY, in 1996. He is currently the Dorothy Silberberg Professor of Statistics at the Wharton School of the University of Pennsylvania, Philadelphia. His research interests include high-dimensional inference, large-scale multiple testing, nonparametric function estimation, functional data analysis, and statistical decision theory. Prof. Cai is the recipient of the 2008 COPSS Presidents’ Award and a fellow of the Institute of Mathematical Statistics.Inequality (V.3) now follows by verifying directly that for all . ACKNOWLEDGMENT The authors wish to thank the referees for thorough and useful comments which have helped to improve the presentation of the paper. REFERENCES[1] W. Bajwa, J. Haupt, J. Raz, S. Wright, and R. Nowak, “Toeplitz-structured compressed sensing matrices,” in Proc. IEEE SSP Workshop, Madison, WI, Aug. 2007, pp. 294–298. [2] T. Cai, “On block thresholding in wavelet regression: Adaptivity, block size and threshold level,” Statist. Sinica, vol. 12, pp. 1241–1273, 2002. [3] E. J. Candes, “The restricted isometry property and its implications for compressed sensing,” Compte Rendus de l’ Academie des Sciences Paris, ser. I, vol. 346, pp. 589–592, 2008. [4] E. J. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Appl. Math., vol. 59, pp. 1207–1223, 2006. [5] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.Guangwu Xu received the Ph.D. degree in mathematics from the State University of New York (SUNY), Buffalo. He is now with the Department of Electrical engineering and Computer Science, University of Wisconsin-Milwaukee. His research interests include cryptography and information security, computational number theory, algorithms, and functional analysis.Jun Zhang (S’85–M’88–SM’01) received the B.S. degree in electrical and computer engineering from Harbin Shipbuilding Engineering Institute, Harbin, China, in 1982 and was admitted to the graduate program of the Radio Electronic Department of Tsinghua University. After a brief stay at Tsinghua, he came to the U.S. for graduate study on a scholarship from the Li Foundation, Glen Cover, New York. He received the M.S. and Ph.D. degrees, both in electrical engineering, from Rensselaer Polytechnic Institute, Troy, NY, in 1985 and 1988, respectively. He joined the faculty of the Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, and currently is a Professor. His research interests include image processing and computer vision, signal processing and digital communications. Prof. 
Zhang has been an Associate Editor of IEEE TRANSACTIONS ON IMAGE PROCESSING.
Connective words expressing logical relationships in English writing
"Ratio" is used to indicate a situation or choice that is opposite to the one mentioned earlier,
and is commonly used in comparative structures.
Instead: Used to indicate a situation or choice that is opposite to the one mentioned earlier.
Classification
According to the logical relationship expressed, conjunctions can be divided into the following categories
Transition (adversative) relationship
For example, "however", "but", "nevertheless", etc., used to express a turning point or contrast in meaning.
Lead to
Refers to leading to a certain outcome or consequence, commonly used in spoken and written language, emphasizing the immediacy of the outcome.
04 Coordinating (parallel) connective words
On the contrary
Image Alignment and Stitching: A Tutorial
Richard Szeliski. Technical Report MSR-TR-2004-92. Last updated December 10, 2006.
This tutorial reviews image alignment and image stitching algorithms. Image alignment algorithms can discover the correspondence relationships among images with varying degrees of overlap. They are ideally suited for applications such as video stabilization, summarization, and the creation of panoramic mosaics. Image stitching algorithms take the alignment estimates produced by such registration algorithms and blend the images in a seamless manner, taking care to deal with potential problems such as blurring or ghosting caused by parallax and scene movement as well as varying image exposures. This tutorial reviews the basic motion models underlying alignment and stitching algorithms, describes effective direct (pixel-based) and feature-based alignment algorithms, and describes blending algorithms used to produce seamless mosaics. It closes with a discussion of open research problems in the area.
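As an illustration of the feature-based alignment step reviewed above, the sketch below estimates a homography between two overlapping images and warps one onto the other using OpenCV. It is a minimal example under our own choices (ORB features, brute-force matching, RANSAC for robust estimation), not the specific pipeline of the tutorial.

```python
import cv2
import numpy as np

def align_pair(img_ref, img_mov):
    """Estimate a homography mapping img_mov onto img_ref and warp it."""
    orb = cv2.ORB_create(4000)                       # detect and describe features
    k1, d1 = orb.detectAndCompute(img_ref, None)
    k2, d2 = orb.detectAndCompute(img_mov, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d2, d1), key=lambda m: m.distance)

    src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Robustly fit the 3x3 homography; RANSAC rejects mismatched feature pairs.
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    h, w = img_ref.shape[:2]
    warped = cv2.warpPerspective(img_mov, H, (w, h))
    return H, warped
```

A full stitcher would additionally blend the overlapping pixels (e.g. feathering or multi-band blending) to suppress seams, ghosting, and exposure differences, as discussed in the tutorial.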
On Single Image Scale-Up using Sparse-Representations (paper by Michael Elad)
Roman Zeyde, Michael Elad and Matan Protter
for an additive i.i.d. white Gaussian noise, denoted by v ∼ N(0, σ²I). Given z_l, the problem is to find ŷ ∈ R^{N_h} such that ŷ ≈ y_h. Due to the
hereafter that H applies a known low-pass filter to the image, and S performs
a decimation by an integer factor s, by discarding rows/columns from the input
{romanz,elad,matanpr}@cs.technion.ac.il
Abstract. This paper deals with the single image scale-up problem using sparse-representation modeling. The goal is to recover an original image from its blurred and down-scaled noisy version. Since this problem is highly ill-posed, a prior is needed in order to regularize it. The literature offers various ways to address this problem, ranging from simple linear space-invariant interpolation schemes (e.g., bicubic interpolation), to spatially-adaptive and non-linear filters of various sorts. We embark from a recently-proposed successful algorithm by Yang et al. [13,14], and similarly assume a local Sparse-Land model on image patches, serving as regularization. Several important modifications to the above-mentioned solution are introduced, and are shown to lead to improved results. These modifications include a major simplification of the overall process both in terms of the computational complexity and the algorithm architecture, using a different training approach for the dictionary-pair, and introducing the ability to operate without a training set by bootstrapping the scale-up task from the given low-resolution image. We demonstrate the results on true images, showing both visual and PSNR improvements.
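The degradation model implied by the abstract and the fragments above (a known blur H, decimation S by an integer factor s, plus white Gaussian noise v) can be sketched as follows. The Gaussian blur kernel and the parameter values are illustrative assumptions, not the specific operators used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(y_h, s=3, blur_sigma=1.2, noise_sigma=2.0, rng=None):
    """Simulate z_l = S H y_h + v: blur the high-resolution image (H), decimate by an
    integer factor s by discarding rows/columns (S), and add white Gaussian noise (v)."""
    rng = np.random.default_rng() if rng is None else rng
    blurred = gaussian_filter(y_h.astype(np.float64), sigma=blur_sigma)   # H y_h
    decimated = blurred[::s, ::s]                                         # S H y_h
    return decimated + rng.normal(0.0, noise_sigma, decimated.shape)      # + v
```

The scale-up task is then to estimate ŷ ≈ y_h from the observed z_l, which is where the sparse-representation prior on image patches enters.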
A Fast and Accurate Plane Detection Algorithm for Large Noisy Point Clouds Using Filtered Normals
A Fast and Accurate Plane Detection Algorithm for Large Noisy Point CloudsUsing Filtered Normals and Voxel GrowingJean-Emmanuel DeschaudFranc¸ois GouletteMines ParisTech,CAOR-Centre de Robotique,Math´e matiques et Syst`e mes60Boulevard Saint-Michel75272Paris Cedex06jean-emmanuel.deschaud@mines-paristech.fr francois.goulette@mines-paristech.frAbstractWith the improvement of3D scanners,we produce point clouds with more and more points often exceeding millions of points.Then we need a fast and accurate plane detection algorithm to reduce data size.In this article,we present a fast and accurate algorithm to detect planes in unorganized point clouds usingfiltered normals and voxel growing.Our work is based on afirst step in estimating better normals at the data points,even in the presence of noise.In a second step,we compute a score of local plane in each point.Then, we select the best local seed plane and in a third step start a fast and robust region growing by voxels we call voxel growing.We have evaluated and tested our algorithm on different kinds of point cloud and compared its performance to other algorithms.1.IntroductionWith the growing availability of3D scanners,we are now able to produce large datasets with millions of points.It is necessary to reduce data size,to decrease the noise and at same time to increase the quality of the model.It is in-teresting to model planar regions of these point clouds by planes.In fact,plane detection is generally afirst step of segmentation but it can be used for many applications.It is useful in computer graphics to model the environnement with basic geometry.It is used for example in modeling to detect building facades before classification.Robots do Si-multaneous Localization and Mapping(SLAM)by detect-ing planes of the environment.In our laboratory,we wanted to detect small and large building planes in point clouds of urban environments with millions of points for modeling. As mentioned in[6],the accuracy of the plane detection is important for after-steps of the modeling pipeline.We also want to be fast to be able to process point clouds with mil-lions of points.We present a novel algorithm based on re-gion growing with improvements in normal estimation and growing process.For our method,we are generic to work on different kinds of data like point clouds fromfixed scan-ner or from Mobile Mapping Systems(MMS).We also aim at detecting building facades in urban point clouds or little planes like doors,even in very large data sets.Our input is an unorganized noisy point cloud and with only three”in-tuitive”parameters,we generate a set of connected compo-nents of planar regions.We evaluate our method as well as explain and analyse the significance of each parameter. 
2.Previous WorksAlthough there are many methods of segmentation in range images like in[10]or in[3],three have been thor-oughly studied for3D point clouds:region-growing, hough-transform from[14]and Random Sample Consen-sus(RANSAC)from[9].The application of recognising structures in urban laser point clouds is frequent in literature.Bauer in[4]and Boulaassal in[5]detect facades in dense3D point cloud by a RANSAC algorithm.V osselman in[23]reviews sur-face growing and3D hough transform techniques to de-tect geometric shapes.Tarsh-Kurdi in[22]detect roof planes in3D building point cloud by comparing results on hough-transform and RANSAC algorithm.They found that RANSAC is more efficient than thefirst one.Chao Chen in[6]and Yu in[25]present algorithms of segmentation in range images for the same application of detecting planar regions in an urban scene.The method in[6]is based on a region growing algorithm in range images and merges re-sults in one labelled3D point cloud.[25]uses a method different from the three we have cited:they extract a hi-erarchical subdivision of the input image built like a graph where leaf nodes represent planar regions.There are also other methods like bayesian techniques. In[16]and[8],they obtain smoothed surface from noisy point clouds with objects modeled by probability distribu-tions and it seems possible to extend this idea to point cloud segmentation.But techniques based on bayesian statistics need to optimize global statistical model and then it is diffi-cult to process points cloud larger than one million points.We present below an analysis of the two main methods used in literature:RANSAC and region-growing.Hough-transform algorithm is too time consuming for our applica-tion.To compare the complexity of the algorithm,we take a point cloud of size N with only one plane P of size n.We suppose that we want to detect this plane P and we define n min the minimum size of the plane we want to detect.The size of a plane is the area of the plane.If the data density is uniform in the point cloud then the size of a plane can be specified by its number of points.2.1.RANSACRANSAC is an algorithm initially developped by Fis-chler and Bolles in[9]that allows thefitting of models with-out trying all possibilities.RANSAC is based on the prob-ability to detect a model using the minimal set required to estimate the model.To detect a plane with RANSAC,we choose3random points(enough to estimate a plane).We compute the plane parameters with these3points.Then a score function is used to determine how the model is good for the remaining ually,the score is the number of points belonging to the plane.With noise,a point belongs to a plane if the distance from the point to the plane is less than a parameter γ.In the end,we keep the plane with the best score.Theprobability of getting the plane in thefirst trial is p=(nN )3.Therefore the probability to get it in T trials is p=1−(1−(nN )3)ing equation1and supposing n minN1,we know the number T min of minimal trials to have a probability p t to get planes of size at least n min:T min=log(1−p t)log(1−(n minN))≈log(11−p t)(Nn min)3.(1)For each trial,we test all data points to compute the score of a plane.The RANSAC algorithm complexity lies inO(N(Nn min )3)when n minN1and T min→0whenn min→N.Then RANSAC is very efficient in detecting large planes in noisy point clouds i.e.when the ratio n minN is 1but very slow to detect small planes in large pointclouds i.e.when n minN 1.After selecting the best model,another step is to extract the largest 
connected component of each plane.Connnected components mean that the min-imum distance between each point of the plane and others points is smaller(for distance)than afixed parameter.Schnabel et al.[20]bring two optimizations to RANSAC:the points selection is done locally and the score function has been improved.An octree isfirst created from point cloud.Points used to estimate plane parameters are chosen locally at a random depth of the octree.The score function is also different from RANSAC:instead of testing all points for one model,they test only a random subset and find the score by interpolation.The algorithm complexity lies in O(Nr4Ndn min)where r is the number of random subsets for the score function and d is the maximum octree depth. Their algorithm improves the planes detection speed but its complexity lies in O(N2)and it becomes slow on large data sets.And again we have to extract the largest connected component of each plane.2.2.Region GrowingRegion Growing algorithms work well in range images like in[18].The principle of region growing is to start with a seed region and to grow it by neighborhood when the neighbors satisfy some conditions.In range images,we have the neighbors of each point with pixel coordinates.In case of unorganized3D data,there is no information about the neighborhood in the data structure.The most common method to compute neighbors in3D is to compute a Kd-tree to search k nearest neighbors.The creation of a Kd-tree lies in O(NlogN)and the search of k nearest neighbors of one point lies in O(logN).The advantage of these region growing methods is that they are fast when there are many planes to extract,robust to noise and extract the largest con-nected component immediately.But they only use the dis-tance from point to plane to extract planes and like we will see later,it is not accurate enough to detect correct planar regions.Rabbani et al.[19]developped a method of smooth area detection that can be used for plane detection.Theyfirst estimate the normal of each point like in[13].The point with the minimum residual starts the region growing.They test k nearest neighbors of the last point added:if the an-gle between the normal of the point and the current normal of the plane is smaller than a parameterαthen they add this point to the smooth region.With Kd-tree for k nearest neighbors,the algorithm complexity is in O(N+nlogN). 
The complexity seems to be low but in worst case,when nN1,example for facade detection in point clouds,the complexity becomes O(NlogN).3.Voxel Growing3.1.OverviewIn this article,we present a new algorithm adapted to large data sets of unorganized3D points and optimized to be accurate and fast.Our plane detection method works in three steps.In thefirst part,we compute a better esti-mation of the normal in each point by afiltered weighted planefitting.In a second step,we compute the score of lo-cal planarity in each point.We select the best seed point that represents a good seed plane and in the third part,we grow this seed plane by adding all points close to the plane.Thegrowing step is based on a voxel growing algorithm.The filtered normals,the score function and the voxel growing are innovative contributions of our method.As an input,we need dense point clouds related to the level of detail we want to detect.As an output,we produce connected components of planes in the point cloud.This notion of connected components is linked to the data den-sity.With our method,the connected components of planes detected are linked to the parameter d of the voxel grid.Our method has 3”intuitive”parameters :d ,area min and γ.”intuitive”because there are linked to physical mea-surements.d is the voxel size used in voxel growing and also represents the connectivity of points in detected planes.γis the maximum distance between the point of a plane and the plane model,represents the plane thickness and is linked to the point cloud noise.area min represents the minimum area of planes we want to keep.3.2.Details3.2.1Local Density of Point CloudsIn a first step,we compute the local density of point clouds like in [17].For that,we find the radius r i of the sphere containing the k nearest neighbors of point i .Then we cal-culate ρi =kπr 2i.In our experiments,we find that k =50is a good number of neighbors.It is important to know the lo-cal density because many laser point clouds are made with a fixed resolution angle scanner and are therefore not evenly distributed.We use the local density in section 3.2.3for the score calculation.3.2.2Filtered Normal EstimationNormal estimation is an important part of our algorithm.The paper [7]presents and compares three normal estima-tion methods.They conclude that the weighted plane fit-ting or WPF is the fastest and the most accurate for large point clouds.WPF is an idea of Pauly and al.in [17]that the fitting plane of a point p must take into consider-ation the nearby points more than other distant ones.The normal least square is explained in [21]and is the mini-mum of ki =1(n p ·p i +d )2.The WPF is the minimum of ki =1ωi (n p ·p i +d )2where ωi =θ( p i −p )and θ(r )=e −2r 2r2i .For solving n p ,we compute the eigenvec-tor corresponding to the smallest eigenvalue of the weightedcovariance matrix C w = ki =1ωi t (p i −b w )(p i −b w )where b w is the weighted barycenter.For the three methods ex-plained in [7],we get a good approximation of normals in smooth area but we have errors in sharp corners.In fig-ure 1,we have tested the weighted normal estimation on two planes with uniform noise and forming an angle of 90˚.We can see that the normal is not correct on the corners of the planes and in the red circle.To improve the normal calculation,that improves the plane detection especially on borders of planes,we propose a filtering process in two phases.In a first step,we com-pute the weighted normals (WPF)of each point like we de-scribed it above by minimizing ki =1ωi (n p ·p i 
+d )2.In a second step,we compute the filtered normal by us-ing an adaptive local neighborhood.We compute the new weighted normal with the same sum minimization but keep-ing only points of the neighborhood whose normals from the first step satisfy |n p ·n i |>cos (α).With this filtering step,we have the same results in smooth areas and better results in sharp corners.We called our normal estimation filtered weighted plane fitting(FWPF).Figure 1.Weighted normal estimation of two planes with uniform noise and with 90˚angle between them.We have tested our normal estimation by computing nor-mals on synthetic data with two planes and different angles between them and with different values of the parameter α.We can see in figure 2the mean error on normal estimation for WPF and FWPF with α=20˚,30˚,40˚and 90˚.Us-ing α=90˚is the same as not doing the filtering step.We see on Figure 2that α=20˚gives smaller error in normal estimation when angles between planes is smaller than 60˚and α=30˚gives best results when angle between planes is greater than 60˚.We have considered the value α=30˚as the best results because it gives the smaller mean error in normal estimation when angle between planes vary from 20˚to 90˚.Figure 3shows the normals of the planes with 90˚angle and better results in the red circle (normals are 90˚with the plane).3.2.3The score of local planarityIn many region growing algorithms,the criteria used for the score of the local fitting plane is the residual,like in [18]or [19],i.e.the sum of the square of distance from points to the plane.We have a different score function to estimate local planarity.For that,we first compute the neighbors N i of a point p with points i whose normals n i are close toFigure parison of mean error in normal estimation of two planes with α=20˚,30˚,40˚and 90˚(=Nofiltering).Figure 3.Filtered Weighted normal estimation of two planes with uniform noise and with 90˚angle between them (α=30˚).the normal n p .More precisely,we compute N i ={p in k neighbors of i/|n i ·n p |>cos (α)}.It is a way to keep only the points which are probably on the local plane before the least square fitting.Then,we compute the local plane fitting of point p with N i neighbors by least squares like in [21].The set N i is a subset of N i of points belonging to the plane,i.e.the points for which the distance to the local plane is smaller than the parameter γ(to consider the noise).The score s of the local plane is the area of the local plane,i.e.the number of points ”in”the plane divided by the localdensity ρi (seen in section 3.2.1):the score s =card (N i)ρi.We take into consideration the area of the local plane as the score function and not the number of points or the residual in order to be more robust to the sampling distribution.3.2.4Voxel decompositionWe use a data structure that is the core of our region growing method.It is a voxel grid that speeds up the plane detection process.V oxels are small cubes of length d that partition the point cloud space.Every point of data belongs to a voxel and a voxel contains a list of points.We use the Octree Class Template in [2]to compute an Octree of the point cloud.The leaf nodes of the graph built are voxels of size d .Once the voxel grid has been computed,we start the plane detection algorithm.3.2.5Voxel GrowingWith the estimator of local planarity,we take the point p with the best score,i.e.the point with the maximum area of local plane.We have the model parameters of this best seed plane and we start with an empty set E of points 
belonging to the plane.The initial point p is in a voxel v 0.All the points in the initial voxel v 0for which the distance from the seed plane is less than γare added to the set E .Then,we compute new plane parameters by least square refitting with set E .Instead of growing with k nearest neighbors,we grow with voxels.Hence we test points in 26voxel neigh-bors.This is a way to search the neighborhood in con-stant time instead of O (logN )for each neighbor like with Kd-tree.In a neighbor voxel,we add to E the points for which the distance to the current plane is smaller than γand the angle between the normal computed in each point and the normal of the plane is smaller than a parameter α:|cos (n p ,n P )|>cos (α)where n p is the normal of the point p and n P is the normal of the plane P .We have tested different values of αand we empirically found that 30˚is a good value for all point clouds.If we added at least one point in E for this voxel,we compute new plane parameters from E by least square fitting and we test its 26voxel neigh-bors.It is important to perform plane least square fitting in each voxel adding because the seed plane model is not good enough with noise to be used in all voxel growing,but only in surrounding voxels.This growing process is faster than classical region growing because we do not compute least square for each point added but only for each voxel added.The least square fitting step must be computed very fast.We use the same method as explained in [18]with incre-mental update of the barycenter b and covariance matrix C like equation 2.We know with [21]that the barycen-ter b belongs to the least square plane and that the normal of the least square plane n P is the eigenvector of the smallest eigenvalue of C .b0=03x1C0=03x3.b n+1=1n+1(nb n+p n+1).C n+1=C n+nn+1t(pn+1−b n)(p n+1−b n).(2)where C n is the covariance matrix of a set of n points,b n is the barycenter vector of a set of n points and p n+1is the (n+1)point vector added to the set.This voxel growing method leads to a connected com-ponent set E because the points have been added by con-nected voxels.In our case,the minimum distance between one point and E is less than parameter d of our voxel grid. That is why the parameter d also represents the connectivity of points in detected planes.3.2.6Plane DetectionTo get all planes with an area of at least area min in the point cloud,we repeat these steps(best local seed plane choice and voxel growing)with all points by descending order of their score.Once we have a set E,whose area is bigger than area min,we keep it and classify all points in E.4.Results and Discussion4.1.Benchmark analysisTo test the improvements of our method,we have em-ployed the comparative framework of[12]based on range images.For that,we have converted all images into3D point clouds.All Point Clouds created have260k points. After our segmentation,we project labelled points on a seg-mented image and compare with the ground truth image. We have chosen our three parameters d,area min andγby optimizing the result of the10perceptron training image segmentation(the perceptron is portable scanner that pro-duces a range image of its environment).Bests results have been obtained with area min=200,γ=5and d=8 (units are not provided in the benchmark).We show the re-sults of the30perceptron images segmentation in table1. 
GT Regions is the mean number of ground truth planes over the 30 ground truth range images. Correct detection, over-segmentation, under-segmentation, missed and noise are the mean numbers of correct, over-segmented, under-segmented, missed and noise planes detected by the methods. The 80% tolerance is the minimum percentage of points that must be detected, compared with the ground truth, for a detection to count as correct. More details are given in [12]. UE is a method from [12]; UFPR is a method from [10]. It is important to notice that UE and UFPR are range image methods, whereas our method is not designed for range images but for 3D point clouds. Nevertheless, it is a good benchmark for comparison, and we see in Table 1 that the accuracy of our method is very close to the state of the art in range image segmentation.

To evaluate the different improvements of our algorithm, we have tested different variants of our method: without normals (only with the distance from points to the plane), without voxel growing (with a classical region growing by k neighbors), without our FWPF normal estimation (with WPF normal estimation), and without our score function (with the residual score function). The comparison is shown in Table 2. We can see the difference in computation time between region growing and voxel growing. We have tested our algorithm with and without normals and found that the same accuracy cannot be achieved without normal computation. There is also a big difference in correct detection between WPF and our FWPF normal estimation, as we can see in Figure 4: our FWPF normals bring a real improvement in the estimation of plane borders. Black points in the figure are non-classified points.

Figure 5. Correct detection of our segmentation algorithm when the voxel size d changes.

We would like to discuss the influence of the parameters on our algorithm. We have three parameters: area_min, which represents the minimum area of the planes we want to keep; γ, which represents the thickness of the plane (it is generally closely tied to the noise in the point cloud, and especially to the standard deviation σ of the noise); and d, which is the minimum distance from a point to the rest of the plane.
These three parameters depend on the point cloud features and the desired segmentation. For example, if there is a lot of noise, we must choose a high γ value. If we want to detect only large planes, we set a large area_min value. We also analyze the robustness of our algorithm to the voxel size d, i.e. the ratio of points to voxels. Figure 5 shows the variation of the correct detection when the value of d changes. The method appears robust when d is between 4 and 10, but the quality decreases when d is over 10. This is due to the fact that, for a large voxel size d, some planes from different objects are merged into one plane.

Method      | GT Regions | Correct detection | Over-segmentation | Under-segmentation | Missed | Noise | Duration (s)
UE          | 14.6       | 10.0              | 0.2               | 0.3                | 3.8    | 2.1   | -
UFPR        | 14.6       | 11.0              | 0.3               | 0.1                | 3.0    | 2.5   | -
Our method  | 14.6       | 10.9              | 0.2               | 0.1                | 3.3    | 0.7   | 308
Table 1. Average results of different segmenters at 80% compare tolerance.

Our method                 | GT Regions | Correct detection | Over-segmentation | Under-segmentation | Missed | Noise | Duration (s)
without normals            | 14.6       | 5.67              | 0.1               | 0.1                | 9.4    | 6.5   | 70
without voxel growing      | 14.6       | 10.7              | 0.2               | 0.1                | 3.4    | 0.8   | 605
without FWPF               | 14.6       | 9.3               | 0.2               | 0.1                | 5.0    | 1.9   | 195
without our score function | 14.6       | 10.3              | 0.2               | 0.1                | 3.9    | 1.2   | 308
with all improvements      | 14.6       | 10.9              | 0.2               | 0.1                | 3.3    | 0.7   | 308
Table 2. Average results of variants of our segmenter at 80% compare tolerance.

4.1.1 Large scale data

We have tested our method on different kinds of data. We have segmented the urban data of Figure 6 from our Mobile Mapping System (MMS) described in [11]. The mobile system generates 10k pts/s with a density of 50 pts/m² and very noisy data (σ = 0.3 m). For this point cloud, we want to detect building facades. We have chosen area_min = 10 m² and d = 1 m to obtain large connected components, and γ = 0.3 m to cope with the noise.

We have tested our method on a point cloud from the Trimble VX scanner in Figure 7. It is a point cloud of 40k points with only 20 pts/m² and less noise because it comes from a fixed scanner (σ = 0.2 m). In that case, we also wanted to detect building facades and kept the same parameters except γ = 0.2 m, because there is less noise. We see in Figure 7 that we have detected two facades. By setting a larger voxel size, such as d = 10 m, we detect only one plane. We choose d, like area_min and γ, according to the desired segmentation and to the level of detail we want to extract from the point cloud.

We also tested our algorithm on the point cloud from the LEICA Cyrax scanner in Figure 8. This point cloud was taken from the AIM@SHAPE repository [1]. It is a very dense point cloud from multiple fixed scanner positions, with about 400 pts/m² and very little noise (σ = 0.02 m). In this case, we wanted to detect all the small planes to model the church in planar regions. That is why we have chosen d = 0.2 m, area_min = 1 m² and γ = 0.02 m.

In Figures 6, 7 and 8 we show, on the left, the input point cloud and, on the right, only the points detected as belonging to a plane (planes are shown in random colors). The red points in these figures are seed plane points. We can see in these figures that planes are very well detected, even with high noise.
Table 3 shows the information on the point clouds, the results with the number of planes detected, and the duration of the algorithm. The time includes the computation of the FWPF normals of the point cloud. We can see in Table 3 that our algorithm runs linearly in time with respect to the number of points, and the choice of parameters has little influence on the computation time. The computation time is about one millisecond per point whatever the size of the point cloud (we used a PC with a QuadCore Q9300 and 2 GB of RAM). The algorithm has been implemented using only one thread and in-core processing. Our goal is to compare the improvement of plane detection between classical region growing and our region growing with better normals for more accurate planes and voxel growing for faster detection. Our method seems to be compatible with an out-of-core implementation as described in [24] or [15].

                 | MMS Street | VX Street | Church
Size (points)    | 398k       | 42k       | 7.6M
Mean density     | 50 pts/m²  | 20 pts/m² | 400 pts/m²
Number of planes | 20         | 21        | 42
Total duration   | 452 s      | 33 s      | 6900 s
Time/point       | 1 ms       | 1 ms      | 1 ms
Table 3. Results on different data.

5. Conclusion

In this article, we have proposed a new method of plane detection that is fast and accurate even in the presence of noise. We demonstrate its efficiency on different kinds of data and its speed on large data sets with millions of points. Our voxel growing method has a complexity of O(N); it is able to detect large and small planes in very large data sets and can extract them directly as connected components.

Figure 4. Ground truth; our segmentation without and with filtered normals.

Figure 6. Plane detection in a street point cloud generated by the MMS (d = 1 m, area_min = 10 m², γ = 0.3 m).

References
[1] AIM@SHAPE repository.
[2] Octree class template. /code/octree.html
[3] A. Bab-Hadiashar and N. Gheissari. Range image segmentation using surface selection criterion. 2006. IEEE Transactions on Image Processing.
[4] J. Bauer, K. Karner, K. Schindler, A. Klaus, and C. Zach. Segmentation of building models from dense 3D point-clouds. 2003. Workshop of the Austrian Association for Pattern Recognition.
[5] H. Boulaassal, T. Landes, P. Grussenmeyer, and F. Tarsha-Kurdi. Automatic segmentation of building facades using terrestrial laser data. 2007. ISPRS Workshop on Laser Scanning.
[6] C. C. Chen and I. Stamos. Range image segmentation for modeling and object detection in urban scenes. 2007. 3DIM 2007.
[7] T. K. Dey, G. Li, and J. Sun. Normal estimation for point clouds: a comparison study for a Voronoi based method. 2005. Eurographics Symposium on Point-Based Graphics.
[8] J. R. Diebel, S. Thrun, and M. Brunig. A Bayesian method for probable surface reconstruction and decimation. 2006. ACM Transactions on Graphics (TOG).
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. 1981. Communications of the ACM.
[10] P. F. U. Gotardo, O. R. P. Bellon, and L. Silva. Range image segmentation by surface extraction using an improved robust estimator. 2003. Proceedings of Computer Vision and Pattern Recognition.
[11] F. Goulette, F. Nashashibi, I. Abuhadrous, S. Ammoun, and C. Laurgeau. An integrated on-board laser range sensing system for on-the-way city and road modelling. 2007. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences.
[12] A. Hoover, G. Jean-Baptiste, et al. An experimental comparison of range image segmentation algorithms. 1996. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[13] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points. 1992. International Conference on
Computer Graphics and Interactive Techniques.
[14] P. Hough. Method and means for recognizing complex patterns. 1962. US Patent.
[15] M. Isenburg, P. Lindstrom, S. Gumhold, and J. Snoeyink. Large mesh simplification using processing sequences. 2003.
Empirical processes of dependent random variables
… Processes that have been discussed include linear processes and Gaussian processes; see Dehling and Taqqu (1989) and Csörgő and Mielniczuk (1996) for long- and short-range dependent subordinated Gaussian processes, and Ho and Hsing (1996) and Wu (2003a) for long-range dependent linear processes. A collection of recent results is presented in Dehling, Mikosch and Sorensen (2002). In that collection Dedecker and Louhichi (2002) made an important generalization of Ossiander's (1987) result. Here we investigate the empirical central limit problem for dependent random variables from another angle that avoids strong mixing conditions. In particular, we apply a martingale method and establish a weak convergence theory for stationary, causal processes. Our results are comparable with the theory for independent random variables in that the imposed moment conditions are optimal or almost optimal. We show that, if the process is short-range dependent in a certain sense, then the limiting behavior is similar to that of iid random variables in that the limiting distribution is a Gaussian process and the norming sequence is √n. For long-range dependent linear processes, one needs to apply asymptotic expansions to obtain √n-norming limit theorems (Section 6.2.2). The paper is structured as follows. In Section 2 we introduce some mathematical preliminaries necessary for the weak convergence theory and illustrate the essence of our approach. Two types of empirical central limit theorems are established. Empirical processes indexed by indicators of left half lines, absolutely continuous functions, and piecewise differentiable functions are discussed in Sections 3, 4 and 5 respectively. Applications to linear processes and iterated random functions are made in Section 6. Section 7 presents some integral and maximal inequalities that may be of independent interest. Some proofs are given in Sections 8 and 9.

2 Preliminaries

… the marginal and empirical distribution functions. Let G be a class of measurable functions from R to R. The centered G-indexed empirical process is given by

(P_n − P)g = (1/n) Σ_{i=1}^{n} [g(X_i) − E g(X_i)].
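As a concrete (if simplistic) illustration of this definition, the following sketch evaluates the centered empirical process when G is the class of indicators of left half-lines, in which case (P_n − P)g reduces to the empirical distribution function minus the true marginal distribution function. The iid Gaussian sample, the grid, and the function names are placeholders; the paper's interest is in dependent, causal sequences, for which the same statistic is computed but its limit theory changes.

```python
import numpy as np
from scipy.stats import norm

def centered_empirical_process(sample, marginal_cdf, grid):
    """(P_n - P)g for g = 1_{(-inf, t]}, evaluated at each t in `grid`:
    empirical CDF of the sample minus the true marginal CDF."""
    sample = np.asarray(sample)
    ecdf = (sample[None, :] <= grid[:, None]).mean(axis=1)
    return ecdf - marginal_cdf(grid)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)               # placeholder iid sample
t = np.linspace(-3.0, 3.0, 61)
d = centered_empirical_process(x, norm.cdf, t)
# sqrt(n) * d is the object whose weak convergence is studied.
```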
general-specific-text
• introduction to a paper
• background in a research paper
• opening paragraphs for a discussion or an …
General-Specific Texts
• The general-to-specific approach opens an idea with a general statement and then leads into details that support and explain the general statement.
Fig. 1. Shape of GS texts.
Writing
(1) Writing is a complex sociocognitive process involving the construction of recorded messages on paper or on some other material and, more recently, on a computer screen. (2) The skills needed to write range from making the appropriate graphic marks, through utilizing the resources of the chosen language, to anticipating the reactions of the intended …
An API for manipulating matrices stored by blocks
Abstract We discuss an API that simplifies the specification of a data structure for storing hierarchical matrices (matrices stored recursively by blocks) and the manipulation of such matrices. This work addresses a recent demand for libraries that support such storage of matrices for performance reasons. We believe a move towards such libraries has been slow largely because of the difficulty that has been encountered when implementing them using more traditional coding styles. The impact on ease of coding and performance is demonstrated in examples and experiments. The applicability of the approach for sparse matrices is also discussed.
∗ This work was supported in part by NSF contracts ACI-0305163 and CCF-0342369 and an equipment donation from Hewlett-Packard.
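This is not the API proposed in the paper. Purely as a toy illustration of the idea it describes, a matrix stored recursively by blocks together with an algorithm (here a matrix–vector product) that recurses over the same structure, one might write something like the following; the 2×2 splitting rule, the leaf size, and the class name are all invented for the example.

```python
import numpy as np

class BlockMatrix:
    """A node is either a dense leaf block or a 2x2 grid of child BlockMatrix
    nodes, i.e. a (square) matrix stored recursively by blocks."""

    def __init__(self, dense, leaf_size=64):
        n = dense.shape[0]
        if n <= leaf_size:
            self.block, self.children = np.ascontiguousarray(dense), None
        else:
            h = n // 2
            self.block, self.split = None, h
            self.children = [
                [BlockMatrix(dense[:h, :h], leaf_size), BlockMatrix(dense[:h, h:], leaf_size)],
                [BlockMatrix(dense[h:, :h], leaf_size), BlockMatrix(dense[h:, h:], leaf_size)],
            ]

    def matvec(self, x):
        """y = A @ x, recursing over the block structure instead of flat storage."""
        if self.children is None:
            return self.block @ x
        h = self.split
        top = self.children[0][0].matvec(x[:h]) + self.children[0][1].matvec(x[h:])
        bottom = self.children[1][0].matvec(x[:h]) + self.children[1][1].matvec(x[h:])
        return np.concatenate([top, bottom])

A = np.arange(256.0).reshape(16, 16)
y = BlockMatrix(A, leaf_size=4).matvec(np.ones(16))   # equals A @ np.ones(16)
```

Storing leaves contiguously is what such libraries exploit for performance; the recursion shown here simply mirrors the storage layout.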
Survey of clustering data mining techniques
A Survey of Clustering Data Mining Techniques. Pavel Berkhin, Yahoo!, Inc. (pberkhin@…)

Summary. Clustering is the division of data into groups of similar objects. It disregards some details in exchange for data simplification. Informally, clustering can be viewed as data modeling concisely summarizing the data, and, therefore, it relates to many disciplines from statistics to numerical analysis. Clustering plays an important role in a broad range of applications, from information retrieval to CRM. Such applications usually deal with large datasets and many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

1 Introduction

The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: it represents many data objects by few clusters, and hence models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining applications add three complications to this general picture: (a) large databases, (b) many attributes, (c) attributes of different types. This imposes severe computational requirements on data analysis. Data mining applications include scientific data exploration, information retrieval, text mining, spatial databases, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. They present real challenges to classic clustering algorithms.
These challenges led to the emergence of powerful, broadly applicable data mining clustering methods developed on the foundation of classic techniques. They are the subject of this survey.

1.1 Notations

To fix the context and clarify terminology, consider a dataset X consisting of data points (i.e., objects, instances, cases, patterns, tuples, transactions) x_i = (x_i1, ..., x_id), i = 1:N, in attribute space A, where each component x_il ∈ A_l, l = 1:d, is a numerical or nominal categorical attribute (i.e., feature, variable, dimension, component, field). For a discussion of attribute data types see [106]. Such point-by-attribute data format conceptually corresponds to an N × d matrix and is used by a majority of the algorithms reviewed below. However, data of other formats, such as variable length sequences and heterogeneous data, are not uncommon. The simplest subset in an attribute space is a direct Cartesian product of sub-ranges C = ∏ C_l ⊂ A, C_l ⊂ A_l, called a segment (i.e., cube, cell, region). A unit is an elementary segment whose sub-ranges consist of a single category value, or of a small numerical bin. Describing the number of data points per every unit represents an extreme case of clustering, a histogram. This is a very expensive representation, and not a very revealing one. User-driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains. Unlike segmentation, clustering is assumed to be automatic, and so it is a machine learning technique. The ultimate goal of clustering is to assign points to a finite system of k subsets (clusters). Usually (but not always) the subsets do not intersect, and their union is equal to the full dataset with the possible exception of outliers:

X = C_1 ∪ ... ∪ C_k ∪ C_outliers,   C_i ∩ C_j = ∅ for i ≠ j.

1.2 Clustering Bibliography at a Glance

General references regarding clustering include [110], [205], [116], [131], [63], [72], [165], [119], [75], [141], [107], [91]. A very good introduction to contemporary data mining clustering techniques can be found in the textbook [106]. There is a close relationship between clustering and many other fields. Clustering has always been used in statistics [10] and science [158]. The classic introduction into the pattern recognition framework is given in [64]. Typical applications include speech and character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [117]. For statistical approaches to pattern recognition see [56] and [85]. Clustering can be viewed as a density estimation problem. This is the subject of traditional multivariate statistical estimation [197]. Clustering is also widely used for data compression in image processing, which is also known as vector quantization [89]. Data fitting in numerical analysis provides still another venue in data modeling [53]. This survey's emphasis is on clustering in data mining. Such clustering is characterized by large datasets with many attributes of different types.
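As a small, self-contained illustration of the point-by-attribute format and of a clustering viewed as a partition of X, consider the following sketch; the label array, the number of clusters, and the outlier convention are illustrative choices, not something prescribed by the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 500, 4, 3
X = rng.normal(size=(N, d))            # point-by-attribute format: an N x d matrix

# A (hard) clustering assigns each point to one of k clusters; here the label -1
# is reserved for outliers, mirroring X = C_1 U ... U C_k U C_outliers.
labels = rng.integers(-1, k, size=N)
clusters = [X[labels == j] for j in range(k)]
outliers = X[labels == -1]

# The C_j together with the outliers cover X, and distinct C_j do not intersect.
assert sum(len(C) for C in clusters) + len(outliers) == N
```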
Though we do not even try to review particular applications,many important ideas are related to the specificfields.Clustering in data mining was brought to life by intense developments in information retrieval and text mining[52], [206],[58],spatial database applications,for example,GIS or astronomical data,[223],[189],[68],sequence and heterogeneous data analysis[43],Web applications[48],[111],[81],DNA analysis in computational biology[23],and many others.They resulted in a large amount of application-specific devel-opments,but also in some general techniques.These techniques and classic clustering algorithms that relate to them are surveyed below.1.3Plan of Further PresentationClassification of clustering algorithms is neither straightforward,nor canoni-cal.In reality,different classes of algorithms overlap.Traditionally clustering techniques are broadly divided in hierarchical and partitioning.Hierarchical clustering is further subdivided into agglomerative and divisive.The basics of hierarchical clustering include Lance-Williams formula,idea of conceptual clustering,now classic algorithms SLINK,COBWEB,as well as newer algo-rithms CURE and CHAMELEON.We survey these algorithms in the section Hierarchical Clustering.While hierarchical algorithms gradually(dis)assemble points into clusters (as crystals grow),partitioning algorithms learn clusters directly.In doing so they try to discover clusters either by iteratively relocating points between subsets,or by identifying areas heavily populated with data.Algorithms of thefirst kind are called Partitioning Relocation Clustering. They are further classified into probabilistic clustering(EM framework,al-gorithms SNOB,AUTOCLASS,MCLUST),k-medoids methods(algorithms PAM,CLARA,CLARANS,and its extension),and k-means methods(differ-ent schemes,initialization,optimization,harmonic means,extensions).Such methods concentrate on how well pointsfit into their clusters and tend to build clusters of proper convex shapes.Partitioning algorithms of the second type are surveyed in the section Density-Based Partitioning.They attempt to discover dense connected com-ponents of data,which areflexible in terms of their shape.Density-based connectivity is used in the algorithms DBSCAN,OPTICS,DBCLASD,while the algorithm DENCLUE exploits space density functions.These algorithms are less sensitive to outliers and can discover clusters of irregular shape.They usually work with low-dimensional numerical data,known as spatial data. 
Spatial objects could include not only points,but also geometrically extended objects(algorithm GDBSCAN).4Pavel BerkhinSome algorithms work with data indirectly by constructing summaries of data over the attribute space subsets.They perform space segmentation and then aggregate appropriate segments.We discuss them in the section Grid-Based Methods.They frequently use hierarchical agglomeration as one phase of processing.Algorithms BANG,STING,WaveCluster,and FC are discussed in this section.Grid-based methods are fast and handle outliers well.Grid-based methodology is also used as an intermediate step in many other algorithms (for example,CLIQUE,MAFIA).Categorical data is intimately connected with transactional databases.The concept of a similarity alone is not sufficient for clustering such data.The idea of categorical data co-occurrence comes to the rescue.The algorithms ROCK,SNN,and CACTUS are surveyed in the section Co-Occurrence of Categorical Data.The situation gets even more aggravated with the growth of the number of items involved.To help with this problem the effort is shifted from data clustering to pre-clustering of items or categorical attribute values. Development based on hyper-graph partitioning and the algorithm STIRR exemplify this approach.Many other clustering techniques are developed,primarily in machine learning,that either have theoretical significance,are used traditionally out-side the data mining community,or do notfit in previously outlined categories. The boundary is blurred.In the section Other Developments we discuss the emerging direction of constraint-based clustering,the important researchfield of graph partitioning,and the relationship of clustering to supervised learning, gradient descent,artificial neural networks,and evolutionary methods.Data Mining primarily works with large databases.Clustering large datasets presents scalability problems reviewed in the section Scalability and VLDB Extensions.Here we talk about algorithms like DIGNET,about BIRCH and other data squashing techniques,and about Hoffding or Chernoffbounds.Another trait of real-life data is high dimensionality.Corresponding de-velopments are surveyed in the section Clustering High Dimensional Data. 
The trouble comes from a decrease in metric separation when the dimension grows.One approach to dimensionality reduction uses attributes transforma-tions(DFT,PCA,wavelets).Another way to address the problem is through subspace clustering(algorithms CLIQUE,MAFIA,ENCLUS,OPTIGRID, PROCLUS,ORCLUS).Still another approach clusters attributes in groups and uses their derived proxies to cluster objects.This double clustering is known as co-clustering.Issues common to different clustering methods are overviewed in the sec-tion General Algorithmic Issues.We talk about assessment of results,de-termination of appropriate number of clusters to build,data preprocessing, proximity measures,and handling of outliers.For reader’s convenience we provide a classification of clustering algorithms closely followed by this survey:•Hierarchical MethodsA Survey of Clustering Data Mining Techniques5Agglomerative AlgorithmsDivisive Algorithms•Partitioning Relocation MethodsProbabilistic ClusteringK-medoids MethodsK-means Methods•Density-Based Partitioning MethodsDensity-Based Connectivity ClusteringDensity Functions Clustering•Grid-Based Methods•Methods Based on Co-Occurrence of Categorical Data•Other Clustering TechniquesConstraint-Based ClusteringGraph PartitioningClustering Algorithms and Supervised LearningClustering Algorithms in Machine Learning•Scalable Clustering Algorithms•Algorithms For High Dimensional DataSubspace ClusteringCo-Clustering Techniques1.4Important IssuesThe properties of clustering algorithms we are primarily concerned with in data mining include:•Type of attributes algorithm can handle•Scalability to large datasets•Ability to work with high dimensional data•Ability tofind clusters of irregular shape•Handling outliers•Time complexity(we frequently simply use the term complexity)•Data order dependency•Labeling or assignment(hard or strict vs.soft or fuzzy)•Reliance on a priori knowledge and user defined parameters •Interpretability of resultsRealistically,with every algorithm we discuss only some of these properties. 
The list is in no way exhaustive.For example,as appropriate,we also discuss algorithms ability to work in pre-defined memory buffer,to restart,and to provide an intermediate solution.6Pavel Berkhin2Hierarchical ClusteringHierarchical clustering builds a cluster hierarchy or a tree of clusters,also known as a dendrogram.Every cluster node contains child clusters;sibling clusters partition the points covered by their common parent.Such an ap-proach allows exploring data on different levels of granularity.Hierarchical clustering methods are categorized into agglomerative(bottom-up)and divi-sive(top-down)[116],[131].An agglomerative clustering starts with one-point (singleton)clusters and recursively merges two or more of the most similar clusters.A divisive clustering starts with a single cluster containing all data points and recursively splits the most appropriate cluster.The process contin-ues until a stopping criterion(frequently,the requested number k of clusters) is achieved.Advantages of hierarchical clustering include:•Flexibility regarding the level of granularity•Ease of handling any form of similarity or distance•Applicability to any attribute typesDisadvantages of hierarchical clustering are related to:•Vagueness of termination criteria•Most hierarchical algorithms do not revisit(intermediate)clusters once constructed.The classic approaches to hierarchical clustering are presented in the sub-section Linkage Metrics.Hierarchical clustering based on linkage metrics re-sults in clusters of proper(convex)shapes.Active contemporary efforts to build cluster systems that incorporate our intuitive concept of clusters as con-nected components of arbitrary shape,including the algorithms CURE and CHAMELEON,are surveyed in the subsection Hierarchical Clusters of Arbi-trary Shapes.Divisive techniques based on binary taxonomies are presented in the subsection Binary Divisive Partitioning.The subsection Other Devel-opments contains information related to incremental learning,model-based clustering,and cluster refinement.In hierarchical clustering our regular point-by-attribute data representa-tion frequently is of secondary importance.Instead,hierarchical clustering frequently deals with the N×N matrix of distances(dissimilarities)or sim-ilarities between training points sometimes called a connectivity matrix.So-called linkage metrics are constructed from elements of this matrix.The re-quirement of keeping a connectivity matrix in memory is unrealistic.To relax this limitation different techniques are used to sparsify(introduce zeros into) the connectivity matrix.This can be done by omitting entries smaller than a certain threshold,by using only a certain subset of data representatives,or by keeping with each point only a certain number of its nearest neighbors(for nearest neighbor chains see[177]).Notice that the way we process the original (dis)similarity matrix and construct a linkage metric reflects our a priori ideas about the data model.A Survey of Clustering Data Mining Techniques7With the(sparsified)connectivity matrix we can associate the weighted connectivity graph G(X,E)whose vertices X are data points,and edges E and their weights are defined by the connectivity matrix.This establishes a connection between hierarchical clustering and graph partitioning.One of the most striking developments in hierarchical clustering is the algorithm BIRCH.It is discussed in the section Scalable VLDB Extensions.Hierarchical clustering initializes a cluster system as a set of singleton 
clusters (agglomerative case) or a single cluster of all points (divisive case) and proceeds iteratively merging or splitting the most appropriate cluster(s) until the stopping criterion is achieved. The appropriateness of a cluster(s) for merging or splitting depends on the (dis)similarity of cluster(s) elements. This reflects a general presumption that clusters consist of similar points. An important example of dissimilarity between two points is the distance between them.

To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric. The type of linkage metric significantly affects hierarchical algorithms, because it reflects a particular concept of closeness and connectivity. Major inter-cluster linkage metrics [171], [177] include single link, average link, and complete link. The underlying dissimilarity measure (usually, distance) is computed for every pair of nodes with one node in the first set and another node in the second set. A specific operation such as minimum (single link), average (average link), or maximum (complete link) is applied to the pair-wise dissimilarity measures:

d(C_1, C_2) = Op{ d(x, y) : x ∈ C_1, y ∈ C_2 }.

Early examples include the algorithm SLINK [199], which implements single link (Op = min), Voorhees' method [215], which implements average link (Op = Avr), and the algorithm CLINK [55], which implements complete link (Op = max). It is related to the problem of finding the Euclidean minimal spanning tree [224] and has O(N²) complexity. The methods using inter-cluster distances defined in terms of pairs of nodes (one in each respective cluster) are called graph methods. They do not use any cluster representation other than a set of points. This name naturally relates to the connectivity graph G(X, E) introduced above, because every data partition corresponds to a graph partition. Such methods can be augmented by so-called geometric methods, in which a cluster is represented by its central point. Under the assumption of numerical attributes, the center point is defined as a centroid or an average of two cluster centroids subject to agglomeration. This results in centroid, median, and minimum variance linkage metrics. All of the above linkage metrics can be derived from the Lance-Williams updating formula [145],

d(C_i ∪ C_j, C_k) = a(i) d(C_i, C_k) + a(j) d(C_j, C_k) + b · d(C_i, C_j) + c · |d(C_i, C_k) − d(C_j, C_k)|.

Here a, b, c are coefficients corresponding to a particular linkage. This formula expresses a linkage metric between a union of the two clusters and the third cluster in terms of the underlying nodes. The Lance-Williams formula is crucial to making the (dis)similarity computations feasible. Surveys of linkage metrics can be found in [170], [54]. When distance is used as a base measure, linkage metrics capture inter-cluster proximity. However, a similarity-based view that results in intra-cluster connectivity considerations is also used, for example, in the original average link agglomeration (Group-Average Method) [116].

Under reasonable assumptions, such as the reducibility condition (graph methods satisfy this condition), linkage metric methods suffer from O(N²) time complexity [177]. Despite the unfavorable time complexity, these algorithms are widely used. As an example, the algorithm AGNES (AGglomerative NESting) [131] is used in S-Plus. When the connectivity N × N matrix is sparsified, graph methods directly dealing with the connectivity graph G can be used. In particular, the hierarchical divisive MST (Minimum Spanning
Tree)algorithm is based on graph parti-tioning [116].2.1Hierarchical Clusters of Arbitrary ShapesFor spatial data,linkage metrics based on Euclidean distance naturally gener-ate clusters of convex shapes.Meanwhile,visual inspection of spatial images frequently discovers clusters with curvy appearance.Guha et al.[99]introduced the hierarchical agglomerative clustering algo-rithm CURE (Clustering Using REpresentatives).This algorithm has a num-ber of novel features of general importance.It takes special steps to handle outliers and to provide labeling in assignment stage.It also uses two techniques to achieve scalability:data sampling (section 8),and data partitioning.CURE creates p partitions,so that fine granularity clusters are constructed in parti-tions first.A major feature of CURE is that it represents a cluster by a fixed number,c ,of points scattered around it.The distance between two clusters used in the agglomerative process is the minimum of distances between two scattered representatives.Therefore,CURE takes a middle approach between the graph (all-points)methods and the geometric (one centroid)methods.Single and average link closeness are replaced by representatives’aggregate closeness.Selecting representatives scattered around a cluster makes it pos-sible to cover non-spherical shapes.As before,agglomeration continues until the requested number k of clusters is achieved.CURE employs one additional trick:originally selected scattered points are shrunk to the geometric centroid of the cluster by a user-specified factor α.Shrinkage suppresses the affect of outliers;outliers happen to be located further from the cluster centroid than the other scattered representatives.CURE is capable of finding clusters of different shapes and sizes,and it is insensitive to outliers.Because CURE uses sampling,estimation of its complexity is not straightforward.For low-dimensional data authors provide a complexity estimate of O (N 2sample )definedA Survey of Clustering Data Mining Techniques9 in terms of a sample size.More exact bounds depend on input parameters: shrink factorα,number of representative points c,number of partitions p,and a sample size.Figure1(a)illustrates agglomeration in CURE.Three clusters, each with three representatives,are shown before and after the merge and shrinkage.Two closest representatives are connected.While the algorithm CURE works with numerical attributes(particularly low dimensional spatial data),the algorithm ROCK developed by the same researchers[100]targets hierarchical agglomerative clustering for categorical attributes.It is reviewed in the section Co-Occurrence of Categorical Data.The hierarchical agglomerative algorithm CHAMELEON[127]uses the connectivity graph G corresponding to the K-nearest neighbor model spar-sification of the connectivity matrix:the edges of K most similar points to any given point are preserved,the rest are pruned.CHAMELEON has two stages.In thefirst stage small tight clusters are built to ignite the second stage.This involves a graph partitioning[129].In the second stage agglomer-ative process is performed.It utilizes measures of relative inter-connectivity RI(C i,C j)and relative closeness RC(C i,C j);both are locally normalized by internal interconnectivity and closeness of clusters C i and C j.In this sense the modeling is dynamic:it depends on data locally.Normalization involves certain non-obvious graph operations[129].CHAMELEON relies heavily on graph partitioning implemented in the library HMETIS(see the section6). 
The agglomerative process depends on user-provided thresholds. A decision to merge is made based on the combination

RI(C_i, C_j) · RC(C_i, C_j)^α

of local measures. The algorithm does not depend on assumptions about the data model. It has been proven to find clusters of different shapes, densities, and sizes in 2D (two-dimensional) space. It has a complexity of O(Nm + N log(N) + m² log(m)), where m is the number of sub-clusters built during the first initialization phase. Figure 1(b) (analogous to the one in [127]) clarifies the difference with CURE. It presents a choice of four clusters (a)-(d) for a merge. While CURE would merge clusters (a) and (b), CHAMELEON makes the intuitively better choice of merging (c) and (d).

2.2 Binary Divisive Partitioning

In linguistics, information retrieval, and document clustering applications, binary taxonomies are very useful. Linear algebra methods based on singular value decomposition (SVD) are used for this purpose in collaborative filtering and information retrieval [26]. Application of SVD to hierarchical divisive clustering of document collections resulted in the PDDP (Principal Direction Divisive Partitioning) algorithm [31]. In our notation, an object x is a document, the l-th attribute corresponds to a word (index term), and a matrix entry x_il is a measure (e.g. TF-IDF) of the frequency of term l in document x.

Fig. 1. Agglomeration in clusters of arbitrary shapes: (a) algorithm CURE, (b) algorithm CHAMELEON.

PDDP constructs the SVD decomposition of the matrix

(X − e x̄),   x̄ = (1/N) Σ_{i=1:N} x_i,   e = (1, ..., 1)^T.

This algorithm bisects the data in Euclidean space by a hyperplane that passes through the data centroid orthogonally to the eigenvector with the largest singular value. A k-way split is also possible if the k largest singular values are considered. Bisecting is a good way to categorize documents and it yields a binary tree. When k-means (2-means) is used for bisecting, the dividing hyperplane is orthogonal to the line connecting the two centroids. The comparative study of SVD vs. k-means approaches [191] can be used for further references. Hierarchical divisive bisecting k-means was proven [206] to be preferable to PDDP for document clustering.

While PDDP or 2-means are concerned with how to split a cluster, the problem of which cluster to split is also important. Simple strategies are: (1) split each node at a given level, (2) split the cluster with the highest cardinality, and (3) split the cluster with the largest intra-cluster variance. All three strategies have problems; for a more detailed analysis of this subject and better strategies, see [192].

2.3 Other Developments

One of the early agglomerative clustering algorithms, Ward's method [222], is based not on a linkage metric but on the objective function used in k-means. The merger decision is viewed in terms of its effect on the objective function.

The popular hierarchical clustering algorithm for categorical data COBWEB [77] has two very important qualities. First, it utilizes incremental learning. Instead of following divisive or agglomerative approaches, it dynamically builds a dendrogram by processing one data point at a time. Second, COBWEB is an example of conceptual or model-based learning. This means that each cluster is considered as a model that can be described intrinsically, rather than as a collection of points assigned to it. COBWEB's dendrogram is called a classification tree. Each tree node (cluster) C is associated with the conditional probabilities for categorical attribute-value pairs,

Pr(x_l = ν_lp | C),   l = 1:d,   p = 1:|A_l|.

This can easily be recognized as a C-specific Naïve Bayes classifier. During
the classification tree construction, every new point is descended along the tree and the tree is potentially updated (by an insert/split/merge/create operation). Decisions are based on the category utility [49]:

CU{C_1, ..., C_k} = (1/k) Σ_{j=1:k} CU(C_j),
CU(C_j) = Σ_{l,p} ( Pr(x_l = ν_lp | C_j)² − Pr(x_l = ν_lp)² ).

Category utility is similar to the GINI index. It rewards clusters C_j for increases in the predictability of the categorical attribute values ν_lp. Being incremental, COBWEB is fast with a complexity of O(tN), though it depends non-linearly on tree characteristics packed into a constant t. There is a similar incremental hierarchical algorithm for all numerical attributes called CLASSIT [88]. CLASSIT associates normal distributions with cluster nodes. Both algorithms can result in highly unbalanced trees.

Chiu et al. [47] proposed another conceptual or model-based approach to hierarchical clustering. This development contains several different useful features, such as the extension of scalability preprocessing to categorical attributes, outlier handling, and a two-step strategy for monitoring the number of clusters including BIC (defined below). A model associated with a cluster covers both numerical and categorical attributes and constitutes a blend of Gaussian and multinomial models. Denote the corresponding multivariate parameters by θ. With every cluster C we associate the logarithm of its (classification) likelihood

l_C = Σ_{x_i ∈ C} log(p(x_i | θ)).

The algorithm uses maximum likelihood estimates for the parameter θ. The distance between two clusters is defined (instead of a linkage metric) as the decrease in log-likelihood

d(C_1, C_2) = l_{C_1} + l_{C_2} − l_{C_1 ∪ C_2}

caused by merging of the two clusters under consideration. The agglomerative process continues until the stopping criterion is satisfied. As such, determination of the best k is automatic. This algorithm has a commercial implementation (in SPSS Clementine). The complexity of the algorithm is linear in N for the summarization phase.

Traditional hierarchical clustering does not change point membership in once-assigned clusters due to its greedy approach: after a merge or a split is selected, it is not refined. Though COBWEB does reconsider its decisions, its
On Sequential Monte Carlo Sampling Methods for Bayesian Filtering
methods, see (Akashi et al., 1975), (Handschin et al., 1969), (Handschin 1970), (Zaritskii et al., 1975). Possibly owing to the severe computational limitations of the time, these Monte Carlo algorithms have been largely neglected until recently. In the late 80's, massive increases in computational power allowed the rebirth of numerical integration methods for Bayesian filtering (Kitagawa 1987). Current research has now focused on MC integration methods, which have the great advantage of not being subject to the assumption of linearity or Gaussianity in the model, and relevant work includes (Müller 1992), (West, 1993), (Gordon et al., 1993), (Kong et al., 1994), (Liu et al., 1998). The main objective of this article is to include in a unified framework many old and more recent algorithms proposed independently in a number of applied science areas. Both (Liu et al., 1998) and (Doucet, 1997), (Doucet, 1998) underline the central rôle of sequential importance sampling in Bayesian filtering. However, contrary to (Liu et al., 1998), which emphasizes the use of hybrid schemes combining elements of importance sampling with Markov Chain Monte Carlo (MCMC), we focus here on computationally cheaper alternatives. We also describe how it is possible to improve current existing methods via Rao-Blackwellisation for a useful class of dynamic models. Finally, we show how to extend these methods to compute the prediction and fixed-interval smoothing distributions as well as the likelihood. The paper is organised as follows. In section 2, we briefly review the Bayesian filtering problem, and classical Bayesian importance sampling is proposed for its solution. We then present a sequential version of this method which allows us to obtain a general recursive MC filter: the sequential importance sampling (SIS) filter. Under a criterion of minimum conditional variance of the importance weights, we obtain the optimal importance function for this method. Unfortunately, for numerous models of applied interest the optimal importance function leads to non-analytic importance weights, and hence we propose several suboptimal distributions and show how to obtain as special cases many of the algorithms presented in the literature. Firstly we consider local linearisation methods of either the state space model …
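For orientation, here is a minimal sequential importance sampling sketch in the spirit of the SIS filter described above, using the prior (state transition) density as the importance function, i.e. one of the suboptimal choices discussed rather than the optimal importance function or the Rao-Blackwellised refinements. The model callbacks, the one-dimensional state, and the resampling threshold are placeholder assumptions, not the paper's algorithm.

```python
import numpy as np

def sis_filter(observations, sample_x0, sample_transition, likelihood,
               n_particles=500, resample=True, seed=0):
    """Sequential importance sampling with the prior as importance function;
    optional resampling when the effective sample size drops, which turns
    plain SIS into the familiar SIR/bootstrap filter."""
    rng = np.random.default_rng(seed)
    x = sample_x0(n_particles, rng)                 # particles for the scalar state
    w = np.full(n_particles, 1.0 / n_particles)     # normalized importance weights
    filtered_means = []
    for y in observations:
        x = sample_transition(x, rng)               # draw x_t ~ p(x_t | x_{t-1})
        w = w * likelihood(y, x)                    # weight update: w *= p(y_t | x_t)
        w = w / w.sum()
        filtered_means.append(np.sum(w * x))        # estimate of E[x_t | y_{1:t}]
        if resample and 1.0 / np.sum(w ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x, w = x[idx], np.full(n_particles, 1.0 / n_particles)
    return np.array(filtered_means)
```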
Understanding sinusoidal position embeddings

Example 1: In recent years, deep learning has achieved major breakthroughs across many fields, and attention mechanisms in particular have attracted wide interest. In areas such as natural language processing and computer vision, attention mechanisms are widely used to improve model performance. In the Transformer model, positional encoding is an essential component: it supplies information about the positions of the words (or pixels) in a sequence, so that the model can better capture the relationships between different positions. Sinusoidal position embeddings are one of the commonly used positional encoding methods.

So what are sinusoidal position embeddings? In the Transformer, every position is encoded as a vector, and sinusoidal position embeddings generate these vectors from sine and cosine functions. Concretely, given a position i and an embedding dimension d, the method produces a d-dimensional vector whose entries are computed from sine and cosine functions of the position. In this way each position receives a unique position vector, which helps the model distinguish the relationships between different positions.

Sinusoidal position embeddings require no learned parameters: the vectors are computed deterministically from fixed sine and cosine functions, so nothing has to be designed or tuned by hand for a particular dataset, and the same encoding can be reused across tasks; this is also commonly cited as helping the model generalize to sequence lengths not seen during training. They are also periodic: because they are built from sine and cosine functions, the position vectors vary periodically along the different embedding dimensions, which can help the model capture periodic regularities in a sequence.

In short, sinusoidal position embeddings are a simple and effective positional encoding method that helps a model understand the relationships between positions in a sequence. In practice, researchers and engineers can choose a positional encoding method suited to the task and dataset at hand in order to improve model performance and generalization.
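A compact numpy version of the scheme described above is shown below; the 10000 base and the even/odd sine–cosine interleaving follow the original Transformer formulation, while the sequence length and model dimension in the example are arbitrary.

```python
import numpy as np

def sinusoidal_position_embeddings(n_positions, d_model):
    """PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))   (d_model assumed even)."""
    positions = np.arange(n_positions)[:, None]            # shape (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]               # even dimension indices 2i
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_position_embeddings(128, 512)   # one fixed 512-d vector per position
```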
Interpreting the GCAE model formulation

The GCAE (Global Context-Aware Embedding) model is a model for text modeling and recommendation that aims to capture global contextual information and produce meaningful embedding representations. Its formulation has the following key parts:

1. Context encoder: converts an input text sequence into a sentence-level embedding. The contextual representation is obtained as a weighted average of the embeddings of the words in the sentence, so that each sentence is represented by a fixed-length vector.

2. Global context encoder: takes global contextual information into account and incorporates it into each sentence's embedding. Through learned global-context weights, the model can better understand the overall semantics of the text and capture correlations with the preceding text.

3. Local context encoder: focuses on local contextual information within a sentence and models the relationship between each word and its surrounding words. Using techniques such as convolutional neural networks, it can effectively capture local dependencies inside the sentence.

4. Embedding representation: the outputs of the three parts above are merged into a sentence-level embedding that carries both global and local contextual information. Such an embedding expresses the meaning of the text and its semantic relationships more completely, which helps downstream tasks.

By combining the context encoder, the global context encoder, and the local context encoder, the GCAE model offers an effective way to capture both global and local context in text. It not only models semantic relationships more accurately, but also yields expressive text embeddings that provide richer features for applications such as recommender systems.
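The description above is high-level, so the following is only a schematic sketch of the three ingredients it names: a weighted-average sentence encoder, a global context vector, and a convolutional local-context encoder whose outputs are concatenated. All function names, shapes, and the single-filter convolution are invented for illustration and do not reproduce the GCAE model itself.

```python
import numpy as np

def context_encoder(word_embeddings, weights):
    """Sentence-level representation: weighted average of the word embeddings."""
    w = weights / weights.sum()
    return (w[:, None] * word_embeddings).sum(axis=0)

def local_context_encoder(word_embeddings, conv_filter):
    """Single-filter 1-D convolution over the word sequence followed by ReLU,
    standing in for the CNN-based local encoder described above."""
    k = conv_filter.shape[0]
    feats = [float(word_embeddings[t:t + k].ravel() @ conv_filter.ravel())
             for t in range(len(word_embeddings) - k + 1)]
    return np.maximum(np.asarray(feats), 0.0)

def combined_embedding(word_embeddings, global_context, weights, conv_filter):
    """Concatenate sentence-level, global-context and (max-pooled) local features."""
    local = local_context_encoder(word_embeddings, conv_filter)
    pooled = np.array([local.max() if local.size else 0.0])
    return np.concatenate([context_encoder(word_embeddings, weights),
                           global_context, pooled])
```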
Data and signal processing using photochromic molecules
Cite this:mun .,2012,48,1947–1957Data and signal processing using photochromic moleculesDevens Gust,*a Joakim Andreasson,b Uwe Pischel,c Thomas A.Moore a and Ana L.Moore aReceived 26th August 2011,Accepted 14th November 2011DOI:10.1039/c1cc15329cPhotochromes are chromophores that are reversibly isomerized between two metastable forms using light,or light and heat.When photochromes are covalently linked to other chromophores,they can act as molecular photonic analogues of electronic transistors.As bistable switches,they can be incorporated into the design of molecules capable of binary arithmetic and both combinatorial and sequential digital logic operations.Small ensembles of such molecules canperform analogue signal modulation similar to that carried out by transistor amplifiers.Examples of molecules that perform multiple logic functions,act as control elements for fluorescent reporters,and mimic natural photoregulatory functions are presented.IntroductionThe transistor (and its predecessor the vacuum tube)have allowed the development of electronics and electronic data processing.A field effect transistor acts as an amplifier,in which current flowing from an input (source)to an output (drain)is controlled by the voltage at a second input (gate).In addition to this analogue function,transistors can also act as binary digital switches.Application of a gate voltage above a suitable threshold level switches on the transistor currentbetween the source and drain.A gate voltage less than the threshold switches the current ‘‘off,’’which means that the current is reduced to a lower-level ‘‘leakage’’current.Suitably designed molecules (or strictly speaking,ensembles of molecules)can behave as photochemical analogues of transistors.In digital applications they can function as binary switches and much more complex Boolean logic gates.In analogue applications,they can both act as photonic ‘‘amplifiers’’and mimic the kinds of control functions found in biology.Below,we exemplify these types of molecular behaviour with molecules from our laboratories,and reference some of the excellent research of others working in this area.Photochemically active molecules as digital switches and logic devicesBroadly defined,a binary switch is a system that can be placed reversibly in either of two states by the application of an input,coupled with a means of observing in which state the switchaDepartment of Chemistry and Biochemistry,Arizona State University,Tempe,AZ 85202,USA.E-mail:gust@;Fax:+1-480-965-5927;Tel:+1-480-965-4547bDepartment of Chemical and Biological Engineering,PhysicalChemistry,Chalmers University of Technology,SE-41296Go ¨teborg,Sweden.E-mail:a-son@chalmers.se;Tel:+46317722838cCenter for Research in Sustainable Chemistry and Department of Chemical Engineering,Physical Chemistry,and Organic Chemistry,University of Huelva,Campus de El Carmen s/n,E-21071Huelva,SpainDevens GustDevens Gust is Regents’Professor and Foundation Professor of Chemistry in the Department of Chemistry and Biochemistry at Arizona State University.He directs the Center for Bio-Inspired Solar Fuel Production,and has research interests in organic photochemistry,molecular logic,and artificial photosynthesis.Joakim Andreasson Joakim Andre ´asson is associate professor in Physical Chemistry at Chalmers University of Technology.His research focuses on the application of photochromic molecules in the fields of molecular logic and light-activated anticancer drugs.ChemCommDynamic Article Links/chemcommFEATURE ARTICLED o w n l o a d e d b 
y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329CView Online / Journal Homepage / Table of Contents for this issueexists.Thus,there are a huge number of switching operations possible in molecules,such as protonation,binding of a ligand or substrate,oxidation or reduction,isomerization,base paring,light absorption to produce an excited state,etc .For example,the common indicator dye phenolphthalein is colourless in acidic solution,but turns pink upon addition of base,allowing it to act as a simple single-throw-single-pole binary switch whose state is detected by light absorption.The ability to employ molecules as switches has inspired chemists to consider using molecules as replacements for transistors for carrying out digital operations.A significant literature on molecular logic has developed,and the subject has been reviewed.1–22Molecules that act as switches using only light as inputs and outputs are particularly attractive for molecular logic applications.They have several actual and potential advantages: Light does not require addition of chemicals or build-up and removal of products from repeated cycling of the switch. Physical access to the switching element,other than light transmission,is not necessary.This allows the placement of many switching elements in planar or three-dimensional arrays.Molecular switches can be small,and in principle function at the diffraction limit,or even below if near-field optical techniques are employed.Uwe PischelUwe Pischel obtained his PhD in Chemistry at the University of Basel (Switzerland)in 2001and is currently Lecturer in Organic Chemistry at the University of Huelva.His research interests focus on organic photochemistry,molecular information processing,and supramolecular chemistry.Thomas A.MooreThomas Moore is Regents’Professor in the Department of Chemistry and Biochemistry and Distinguished Sustainability Scholar in the Global Institute of Sustainability at Arizona State University.He directs the Center for Bioenergy and Photosynthesis and has research interests in bio-inspired technology for sustainable development.Ana L.MooreAna Moore is Regents’Professor in the Department of Chemistry and Biochemis-try.Her research interests are in the design of molecules for solar energy conversion and photoprotection.D o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329CMolecular photoswitches can be fast.Light can be delivered as pulses,and reversible photochemical reactions such as photoisomerizations occur rapidly in order to compete with deactivation of excited states by other pathways.Typically,this means the ps or ns time scales.The actual time necessary to switch a small ensemble of molecules depends on the light flux,the molar extinction coefficient,and the quantum yield of the switching reaction.In addition,unimolecular reactions activated by light do not require diffusion,and reaction rates are therefore not diffusion limited.Light-activated molecules can be cycled many times,although in practice this is often limited by competing uni-molecular reactions,or photodestruction by oxygen.In 2006we reviewed some of the applications of photo-chemistry and photochemical reactions to molecular logic.1Significant progress has been made since then.Illustrative examples from our 
laboratories appear below.Simple molecular switchesOne of the most useful classes of light-operated molecular switches is the photochromes.Photochromic compounds exist in two metastable isomeric forms,and light can be used to interconvert them by photoisomerization.An example is fulgimide 1(Fig.1).23,24The compound exists in solution as an open form FGo (actually a mixture of E -and Z -isomers)with an absorption maximum at 380nm (see inset,Fig.2a).Irradiation at,for example,366nm leads to cyclization to the closed isomer FGc.A photostationary distribution consisting of mainly FGc results,and displays absorption maxima at 400and 510nm (Fig.2a).Photochrome FGc fluoresces (l max =600nm).Irradiation of FGc in the visible (e.g.560nm)results in photoisomerization back to FGo.Thus,1is a photochemically operated molecular switch whose inputs and outputs are light.In order to perform more sophisticated logic operations,the output of one molecular switch must be transmitted to another.That is,the two switches must be ‘‘wired together’’to form a more complex system.This can be achieved by chemically linking the switches together in a single molecular device.In such a molecule,two chromophores can ‘‘communicate’’by energy transfer or electron transfer between them.For example,the isomerization state of a photochrome can affect the properties of an adjacent chromophore,thus transferring information between the two.Such communication is illustrated by porphyrin-fulgimide (P-FG)dyad 2(Fig.1).23The absorption spectrum of P-FGo,where the fulgimide is in the open form,shows that the absorption properties of both chromophores are virtually unchanged from those of model compounds (Fig.2a).The porphyrin Q-bands (648,591,547and 513nm)and Soret (418nm)band are present,as well as absorption in the ultraviolet by the fulgimide.Irradiation at 366nm results in photoisomerization of the fulgimide,giving a photostationary distribution con-taining mainly P-FGc,as evidenced by increased absorption between 430and 600nm and decreased absorption between 320and 400nm (Fig.2).The P-FGc is thermally stable,but may be readily isomerized back to P-FGo by irradiation in the visible (e.g.560nm).Both isomers display normal free base porphyrin emission (l max =650and 720nm,Fig.2b),but nofluorescence from the fulgimide of P-FGc is observed.Fluores-cence excitation experiments demonstrate that this is because singlet–singlet energy transfer from FGc to P occurs with B 100%efficiency.This conclusion is supported by time-resolved emission experiments.23While the fluorescence of FGc is quenched in this dyad,the porphyrin fluorescence is enhanced due to the energy transfer when the molecule is excited at wavelengths where FGc absorbs.For example,with excitation at 470nm where FGc absorbs strongly,porphyrin emission from P-FGc is 2.5times more intense than that from P-FGo.Through energy transfer,the porphyrin moiety of 2receives information about the isomerization state of the fulgimide,and its fluores-cence output is adjusted in response to this information.Photochemical molecular logic gatesThis ability of photochromic molecules to both act as molecular switches and communicate with other chromophores makes possible the construction of all-photonic Boolean logic gates.These gates have inputs that embody two or more simpleD o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329Con–offswitches and 
internal ‘‘circuitry’’that generates a certain output based on the state of the input switches.A particularly striking example of such a molecule is 3(Fig.3),which consists of three covalently-linked photochromes.24,25Molecule 3bears two fulgimides and a dithienylethene (DTE).Each of these photochromes may exist in two photo-isomeric forms,as shown in Fig.3.The molecule may there-fore exist in six constitutionally isomeric forms (there are stereoisomers as well,but these are not relevant for this discussion).For our purposes,only the four isomers in which the two identical fulgimides are in the same form (open or closed)are relevant:FGo-DTEo,FGc-DTEo,FGo-DTEc,and FGc-DTEc.As shown in Fig.4,each isomer of 3has a unique spectral signature in absorption and emission.Thus,if 3is present as a solution in a suitable solvent,such as 2-methylte-trahydrofuran,wavelengths may be identified which will convert any mixture of isomers to a solution that is very highly enriched in any of the four forms.Such wavelengths and the corresponding interconversions are identified in Fig.5.It is important to note that isomer FGc-DTEc does not show FGc emission,whereas FGc-DTEo does.This is because the molecule was designed so that singlet excitation energy is rapidly (t o 5ps)transferred from FGc to DTEc,thus quenching the fluorescence.As will be explained below,this behaviour is achievable only in a covalently linked system,and is vital for several logic functions.Simple logic gatesThe photochemistry of 3forms the basis for molecular logic.Irradiation into the various absorption bands constitutes device inputs.Each input causes photoisomerization,which comprises the switching operation central to binary logic.After each input is turned off,the molecule remains in the selected state,recording the result of an input or series of inputs and allowing subsequent readout.The choice of readout (absorbance or emission at a particular wavelength)selects the logic operation performed.An AND gate is an example of a simple logic gate.As shown in the truth table in Table 1,the gate has two binaryFig.4Absorption and emission spectra of different forms of FG-DTE triad 3.Solid line:thermally stable FGo-DTEo.Dotted line:FGc-DTEo.Dash-dot:FGo-DTEc.Dash-dot-dot:FGc-DTEc.Also shown is the emission from FGc in the FGc-DTEo form of the molecule (squares).The coloured vertical lines indicate wavelengths that are used as outputs for the various binary arithmetic functions of the triad.Fig.5Photochemical isomerizations among the 4isomers of 3that are relevant for the logic operations discussed.Green light signifies (460o l o 590nm)and red light indicates l 4615nm.D o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329Cinputs,either of which may be in the OFF (0)or ON (1)state.The gate outputs a signal reflecting the result of the AND logic operation:the output switches ON (1)only if both inputs have been switched ON.A solution of 3can operate as a molecular photonic AND gate,as shown in Fig.6a.The initial state is FGo-DTEo,inputs are light pulses at 397nm and 302nm,and the output is absorbance at 535nm above a threshold value (dashed line in Fig.6a).An output above the threshold is observed only after both inputs have been applied.Chemically,this occurs because only the two inputs together,applied at the same time or in any sequence,generate isomer FGc-DTEc (Fig.5).This is the only isomer having 
the necessary absorbance at 535nm (Fig.4).Other combinations of inputs give outputs at levels below the threshold,corresponding to subthreshold leakage currents in a transistor.Many examples of molecular AND gates,and indeed other Boolean logic gates,have been reported.Triad 3,however,is unique in that it can be reconfigured to carry out a total of 13different logic operations.All of these operations have the same initial state,FGo-DTEo.Reconfiguring simply requires changing the input and output wavelengths as required.In fact,the compound can even perform several operations simultaneously by monitoring several outputs at once.Table 1shows three different logic gates that are important for performing binary arithmetic.Triad 3can carry out all of these,as shown in Fig.6,using the same initial FGo-DTEo state and the same inputs (397nm and 302nm).For example,an XOR (exclusive OR)gate delivers an ON output (1)only when either input is turned ON,but not when neither or both are ON.For triad 3,the XOR output is the absolute value of the absorbance change at 393nm,|D A |(Fig.6b).An INH (inhibit)gate gives an output of 1only when one particular input (not the other or both)is applied.Triad 3can function as two different INH gates as shown in Table 1and Fig.6c and d.Each gate responds to a different input (302nm or 397nm),and the outputs differ (absorbance at 393nm and emission at 624nm,respectively).The photochemistry responsible for these different gate functions is apparent from Fig.5.Adders and subtractorsJust as simple binary switches may be combined to act as more complex logic gates,these logic gates may be combined to implement yet more complex operations.Because 3can perform several logic functions,the device is capable of carrying out more complex binary arithmetic.For example,combination of AND and XOR gates which use the same two inputs produces ahalf-adder.The half-adder adds two binary digits represented by the two inputs,each of which may be OFF (binary 0)or ON (binary 1).The readout of XOR represents the sum digit,and that of AND the carry digit.If neither input is ON,both gates read out 0.This gives the binary sum 00,representing the decimal sum of 0+0=0.If either input is turned ON,AND reports a 0whereas XOR reports a 1.Now,the binary sum reads 01(0+1=1in the decimal notation).If both inputs are ON,XOR reads out 0,and the AND gate delivers a 1output.The binary output combination 10represents the decimal 2(1+1=2).Inspection of Fig.6a and b demonstrate that 3functions as required for a half-adder.If instead,the XOR function of 3is paired with that of the INH1function,a binary half-subtractor is generated.The inputs remain as previously described.The output of the XOR gate (Fig.6b)represents the difference output,and the INH1outputTable 1Truth table for binary arithmetic functions.Wavelengths (nm)for the various inputs and outputs (absorption A or emission Em )are shown on the third line Inputs Outputs a b AND XOR INH1INH2397302A 535|D A |393A 393Em 624000000100101010110111D o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329Cgives the borrow output.If instead,XOR and INH2are used,the order of the subtrahend and minuend is switched.25Non-arithmetic logicMolecular logic may be used for non-arithmetic purposes as well.For example,3may function as a digital 2:1multiplexer (Fig.7a).Multiplexers are analogous to mechanical rotary switches that 
connect any one of several inputs to a single output.They are used to combine several data input streams into a single line for transmission.The initial state for the operation of 3as a multiplexer is FGo-DTEo.The two data inputs are 397nm (In1)and red light (4615nm,In2).There isa third input,the selector (Sel,366nm)which determines the input from which the data will be taken for transmission to the output (fluorescence at 624nm).As shown in Fig.7a,if Sel is not applied,the output is ON only when In1is ON,and the state of In2is ignored.Alternatively,if Sel is ON,the output reflects the state of IN2,and In1is ignored.Multiplexed signals must be separated again in order to make sense of the separate data streams.This is the job of a demultiplexer.Triad 3can function as a 1:2digital demulti-plexer.The initial state is again FGo-DTEo.There are only one data input,In (397nm),and two data outputs,O1(624nm emission)and O2(absorbance at 535nm).An additional input (302nm)serves as the address (Ad),which determines to which output the input is sent.As shown in Fig.7b,the state of O1reflects that of In when Ad is OFF,and O2has a zero output.When Ad has been applied,the state of In appears in O2,and O1has a zero output.Several other logic operations are available with 3.For example,the molecule can act as a single-bit 4-to-2encoder and 2-to-4decoder.11These devices translate numbers in base-10into binary numbers,and vice versa .The initial state for both functions is FGo-DTEo.Briefly,the encoder inputs are light at four different wavelengths,460,397,302and 366nm,and the outputs are absorbance at 475and 625nm.When acting as a decoder,the inputs are light at 397and 302nm,and the outputs are absorbance at 393and 535nm,transmittance at 535nm,and fluorescence at 624nm.A set of dual transfer gates is also possible.These simply transfer the state of an input to that of an output with no logical change.These two gates comprise a logically reversible system,26where each combination of inputs gives a unique output.25Sequential logicThe logic functions described thus far are mainly combinatorial functions,where the state of the output is determined solely by the input combination,with no dependence on the order in which the inputs are applied.In sequential logic,the output depends on the order in which inputs are given.A simple example is the keypad lock,which opens only when the correct inputs are applied in the correct sequence.Triad 3functions as a keypad lock with the initial state FGo-DTEo.There are two inputs:red light at 4615nm and UV light at 366nm.The output of the lock is fluorescence emission of FGc at 624nm.Fig.8shows this output as a function of not only the state of the two inputs,but also the order in which they are applied.It is clear from the figure that the output requires not only that both inputs be turned ON,but that the UV input be applied before the red light is switched on.Thus,of the 8possible ordered combinations of the two inputs,only 1opens the lock.Advantages and limitationsTriad 3illustrates the current stage of development of photo-chemical molecular logic in a single molecule.Some important features are:Simple photochromic binary switches may be combined to form more complex logic gates (e.g.,XOR).This is facilitatedD o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329Cby communication among photochromes and other chromo-phores by energy 
transfer and electron transfer.Simple logic gates may be combined within a molecule,functionally integrating them in order to perform more complex mathematical operations (e.g.,half-adder).Non-arithmetic logic operations are possible (e.g.,multiplexer).Sequential logic is possible (e.g.keypad lock).In all these operations,the individual photochromes record the last input for later use or readout.The actual input may be applied as a pulse of light,and all inputs do not need to be applied at the same time.The triad is a multifunctional system that can be readily reconfigured to carry out a variety of logic functions.All 13functions have the same initial state,and different applications require only choosing different photonic readouts and/or inputs. The triad may be cycled many times between the same or different applications by resetting to the initial state with green light.These qualities demonstrate that the triad and similar systems are true nanoscale information processing units.Although a useful device will in general require a small ensemble of molecules rather than a single molecule,the volume necessary can still be at the diffraction limit or,in principle,below.It is unlikely that the first applications of such molecules will be in the construction of analogues of modern electronic computers.One reason for this is that photodamage ultimately occurs,limiting the number of cycles that the molecules can endure.Restricting oxygen access can greatly diminish photodamage,but photochemical side reactions will eventually take a toll,even if their quantum yields are extremely small.Secondly,computing applications will generally require concatenation of switches,gates,etc.so that the output of one device forms the input of the next.This problem is circumvented to some extent in molecules like the triad by combining several logic elements in the same molecule,so that the components may communicate.However,this approach via functional integration is limited.Concatenation could be achieved with these photonic systems by using external electronic devices to receive the output fromone operation (e.g.fluorescence or absorption),and generate a new input (light pulse)based on the output received.In this way,a functional computer could be made based on these molecules.However,a more direct method for communication between molecules would be preferable.There are,however,applications for which molecular photonic systems are much better suited than are electronic devices.Because they operate at the nanoscale and require no wires or similar connections,they may be applied in an almost unlimited variety of milieus.For example,molecular photonic switches have been used to specifically label micro-objects so that they can be differentiated in solution.27Molecular logic systems are in principle compatible with biological systems.Photonic switches could be used to initiate drug delivery only at certain sites (for example a volume element where light beams of two different colours intersect).They could also be used to track the history of nanoparticles (biological or non-biological)circulating in complex flow systems.Because they are non-metallic and are not influenced by low-frequency electro-magnetic radiation,they can be applied where conventional electronics are not suitable.Doubtless many more possible applications will arise as the field becomes more mature.Photochemically active molecules as analogue devicesAs mentioned in the Introduction,transistors can act both as digital switches and as analogue 
amplifiers or signal transducers.In the latter mode,the device again has inputs and outputs,but the output magnitude is a continuously variable function of the magnitude of one of the inputs.The triode amplifier tube or its transistor analogue are examples.Although the properties of a single molecule are quantized by nature,it is possible to employ ensembles of photochemically active molecules in this analogue fashion.Two examples will be presented.A photochemical ‘‘triode’’molecular signal transducer Molecular hexad 4(Fig.9)features five bisphenylethynyl-anthracene (BPEA)fluorophores and a dithienylethene photo-chrome organized by a central hexaphenylbenzene core.28The core is relatively rigid,as rotation of the aryl groups linked to the central phenyl ring is slow,and serves to restrict inter-chromophore distances and orientations,which in turn affect energy transfer rates.29The BPEA units are strong fluorophores that absorb in the 430nm region and emit in the 520nm region with a quantum yield ofB 0.94.The BPEA singlet excited state lifetime is 2.8ns.The DTE in the open form (DTEo,see 4o ,Fig.9)does not absorb in the visible region,and has no effect on the photophysics of the BPEA moieties.In addition to fluorescing,the BPEA units exchange singlet excitation energy with time constants of 0.4ps and B 60ps,as determined by fluorescence anisotropy measurements.30As mentioned above,UV light photoisomerizes the DTE to the closed form DTEc (as in 4c ),which has a broad absorption in the 600nm region (see Fig.4).The DTEc is thermally stable,but isomerized back to DTEo with red light,as discussed above.In 4c ,the DTEc absorbs strongly in the wavelength region where BPEA emits.Thus,it is ideally suited to accept singlet excitation energy from the BPEA singlet states.This energy transfer occurs with time constants of o 13ps,and with a quantum yield of essentiallyD o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329Cunity.Thus,although 4o is strongly fluorescent when the BPEA moieties are excited,4c is essentially non-fluorescent due to energy transfer quenching.These photochemical properties allow 4to act as a molecular photonic analogue of a transistor amplifier or triode tube.The input (corresponding to the source contact of the transistor)is steady-state blue light at 350nm which serves to excite BPEA green fluorescence.This fluorescence is monitored at 520nm and comprises the output of the device (drain).The 350nm light also photoisomerizes an ensemble of molecules of 4to a mixture highly enriched in 4c ,which is essentially non-fluorescent as discussed above.Thus,any mixture of 4o and 4c is rapidly converted to nonemissive 4c when exposed to the UV light.A second input,corresponding to the gate of a transistor,is red light of 4610nm.As noted above,light of these wavelengths is not absorbed by and cannot excite emission from BPEA,but does photoisomerize 4c back to the fluorescent 4o form.If the intensity of this red light is modulated while the intensity of the 350nm light is kept constant,the photostationary distribution of 4o and 4c in solution is likewise modulated,and the fluorescence intensity from the BPEA chromophores is modulated accordingly.If the red light is modulated according to some waveform,the BPEA emission is likewise modulated.A few examples of the operation of the molecular photonic transistor amplifier are shown in Fig.10.In all 
these experiments,a solution of 4in 2-methyltetrahydrofuran was irradiated with 350nm light of constant intensity and red light with modulated intensity,and the BPEA fluorescent output was monitored at 520nm.In Fig.10a,the red light was modulated with a sine wave having a period of 3600s.The BPEA emission traces out a sine wave with the same period.In Fig.10b,the modulation period is decreased to 960s.The BPEA fluorescence output is again a sine wave of this period,but there is a significant phaseshift of 601.This occurs because at the light intensities employed,each change in red light intensity initiates a change in the photostationary distribution of 4o and 4c ,and photo-isomerization does not quite ‘‘catch up’’to the new photo-stationary distribution before the red light intensity changes once again.This effect becomes more pronounced as the modulation frequency is increased.In addition to frequency modulation,amplitude modulation is possible.Fig.10c shows BPEA emission when the sample is irradiated with red light whose intensity was modulated with the product of two sine waves of periods 200s and 2000s.The modulation effects observed with 4are unique.Although it might seem that a similar result could be accomplished by simply modulating the UV light beam that excites fluorescence,the hexad operates in a fundamentally different way.Fluorescence in 4excited through light absorption by a shorter-wavelength electronic transition is modulated by light absorbed by a second,longer-wavelength transition in a second chromo-phore.Such a result is thermodynamically precluded in a single-chromophore system,or a multichromophoric system in which energy is simply transferred from an absorbing moiety to a fluorophore.It can occur in 4because the long-wavelength transition modulates the population ratio of the two isomeric forms of the DTE,only one of which quenches BPEA fluorescence.This unique mode of operation also allows generation of an output waveform that is phase shifted relative to the modulating waveform,or is even of a different shape.28Simple modulation of light exciting a fluorophore cannot accomplish these things.These phenomena might be useful for limiting interference from background fluorescence in a variety of detection and labeling applications in biomedical imaging and elsewhere.28Phase-sensitive detection of fluorescence from a shorter-wavelength emitter at the modulation frequency of the longer-wavelength light absorbed by the modulatingFig.9Molecular ‘‘triode’’4.In 4o ,the dithienylethene photochrome is in the open form,which absorbs only UV light.In 4c ,the dithienylethene is in the closed,cyclic form that absorbs in the red region.D o w n l o a d e d b y U n i v e r s i t y o f H o n g K o n g o n 26 M a r c h 2012P u b l i s h e d o n 05 D e c e m b e r 2011 o n h t t p ://p u b s .r s c .o r g | d o i :10.1039/C 1C C 15329C。
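The phase lag described above arises because the ensemble relaxes toward the photostationary distribution at a finite rate. The following is a minimal numerical sketch of that behaviour, not the authors' model: the rate constants, modulation depth and 3600 s period are illustrative assumptions, and fluorescence is taken as proportional to the open-isomer fraction since 4c is quenched.

```python
import math

# Toy two-state photokinetics for the hexad ensemble: fraction c of closed isomer 4c.
# Steady UV (350 nm) drives 4o -> 4c at rate k_uv; red light drives 4c -> 4o at rate
# k_red(t), which is modulated sinusoidally. Fluorescence output ~ (1 - c), as 4c is quenched.
k_uv = 0.01            # s^-1, assumed closing rate under constant 350 nm light
k_red_mean = 0.01      # s^-1, assumed mean opening rate under the red beam
depth = 0.8            # assumed modulation depth of the red-light intensity
period = 3600.0        # s, modulation period (slow-modulation case)
dt = 1.0               # s, integration step

c = 0.0                # start with the ensemble fully in the fluorescent open form 4o
trace = []
for step in range(int(3 * period / dt)):     # integrate over three modulation cycles
    t = step * dt
    k_red = k_red_mean * (1.0 + depth * math.sin(2.0 * math.pi * t / period))
    # Simple first-order kinetics toward the instantaneous photostationary state.
    c += (k_uv * (1.0 - c) - k_red * c) * dt
    trace.append((t, 1.0 - c))               # (time, relative fluorescence output)

# The fluorescence follows the red-light waveform with a phase lag that grows as the
# modulation period becomes short compared with 1/(k_uv + k_red).
print(trace[::600][:10])
```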
Coupled Binary Embedding forLarge-Scale Image RetrievalLiang Zheng,Shengjin Wang,Member,IEEE,and Qi Tian,Senior Member,IEEEAbstract—Visual matching is a crucial step in image retrieval based on the bag-of-words(BoW)model.In the baseline method, two keypoints are considered as a matching pair if their SIFT descriptors are quantized to the same visual word.However,the SIFT visual word has two limitations.First,it loses most of its discriminative power during quantization.Second,SIFT only describes the local texture feature.Both drawbacks impair the discriminative power of the BoW model and lead to false positive matches.To tackle this problem,this paper proposes to embed multiple binary features at indexing level.To model correlation between features,a multi-IDF scheme is introduced,through which different binary features are coupled into the inverted file.We show that matching verification methods based on binary features,such as Hamming embedding,can be effectively incorporated in our framework.As an extension,we explore the fusion of binary color feature into image retrieval.The joint integration of the SIFT visual word and binary features greatly enhances the precision of visual matching,reducing the impact of false positive matches.Our method is evaluated through extensive experiments on four benchmark datasets(Ukbench,Holidays, DupImage,and MIR Flickr1M).We show that our method significantly improves the baseline approach.In addition,large-scale experiments indicate that the proposed method requires acceptable memory usage and query time compared with other approaches.Further,when global color feature is integrated,our method yields competitive performance with the state-of-the-arts.Index Terms—Feature fusion,coupled binary embedding, multi-IDF,image retrieval.I.I NTRODUCTIONT HIS paper focuses on the task of large scale partial-duplicate image retrieval.Given a query image,our target is tofind images containing the same object or scene in a large database in real time.Due to the low descriptive power of texts Manuscript received January1,2014;revised April14,2014;accepted June2,2014.Date of publication June12,2014;date of current version July1, 2014.This work was supported in part by the National High Technology Research and Development Program of China(863program)under Grant 2012AA011004and in part by the National Science and Technology Support Program under Grant2013BAK02B04.The work of Q.Tian was supported in part by the Army Research Office under Grant W911NF-12-1-0057,in part by the Faculty Research Awards through the NEC Laboratories of America,in part by the2012UTSA START-R Research Award,and in part by the National Science Foundation of China under Grant61128007.The associate editor coordinating the review of this manuscript and approving it for publication was Mr.Pierre-Marc Jodoin.(Corresponding authors:Shengjin Wang and Qi Tian).L.Zheng and S.Wang are with the State Key Laboratory of Intelligent Technology and Systems,Tsinghua National Laboratory for Information Science and Technology,Department of Electrical Engineering,Tsinghua University,Beijing100084,China(e-mail:zheng-l06@; wgsgj@).Q.Tian is with the Department of Computer Science,University of Texas at San Antonio,San Antonio,TX78249-1604USA(e-mail:qitian@). 
Color versions of one or more of thefigures in this paper are available online at .Digital Object Identifier10.1109/TIP.2014.2330763or tags[2],[3],content based image retrieval(CBIR)has been a hot topic in computer vision community.One of the most popular approaches to perform such a task is the Bag-of-Words(BoW)model[4].The introduction of the SIFT descriptor[5]has enabled accurate partial-duplicate image retrieval based on feature matching.Specifically,the BoW modelfirst constructs a codebook via unsupervised clustering algorithms[6],[7].Then,an image is represented as a histogram of visual words,produced by feature quantization. Each bin of the histogram is weighted with tf-idf score[4]or its variants[1],[8],[9].With the invertedfile data structure, images are indexed for efficient retrieval.Essentially,one key issue of the BoW model involves visual word matching between images.Accurate feature matching leads to high image retrieval performance.However,two drawbacks compromise this procedure.First,in quantization, a128-D double SIFT feature is quantized to a single integer. Although it enables efficient online retrieval,the discriminative power of SIFT feature is largely lost.Features that lie away from each other may actually fall into the same cell,thus producing false positive matches.Second,the state-of-the-art systems rely on the SIFT descriptor,which only describes the local gradient distribution,with rare description of other characteristics,such as color,of this local region.As a result, regions which are similar in texture space but different in color space may also be considered as a true match.Both drawbacks lead to false positive matches and impair the image retrieval accuracy.Therefore,it is undesirable to take visual word index as the only ticket to visual matching.Instead,the matching procedure should be further checked by other cues,which should be efficient in terms of both memory and time.A reasonable choice to address the above problem involves the usage of binary features.Typically,the binary features are extracted along with SIFT,and embedded into the invertedfile. The reason why binary feature can be employed for matching verification is two-fold.First,compared withfloating-point vectors of the same length,binary features consume much less memory.For example,for a128-D vector,it takes512bytes and16bytes for thefloating-point and binary features,respec-tively.Second,during matching verification,the Hamming distance between two binary features can be efficiently calcu-lated via xor operations,while the Euclidean distance between floating-point vectors is very expensive to compute.Previous work of this line includes Hamming Embedding(HE)[1]and its variants[10],[11],which use binary SIFT features for verification.Meanwhile,binary features also include spatial context[12],heterogeneous feature such as color[13],etc. 
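For readers less familiar with the BoW pipeline summarized above, the following minimal sketch (not the authors' code; the codebook size and the random data are toy assumptions) quantizes local descriptors to their nearest visual word and builds tf-idf-weighted image vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a "codebook" of K centroids and a few images' worth of 128-D descriptors.
K, D = 32, 128
codebook = rng.normal(size=(K, D))
images = [rng.normal(size=(rng.integers(50, 150), D)) for _ in range(10)]

def quantize(descriptors, codebook):
    """Assign each descriptor to its nearest centroid (hard assignment)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Term frequencies per image and document frequencies per visual word.
tf = np.zeros((len(images), K))
for i, desc in enumerate(images):
    words, counts = np.unique(quantize(desc, codebook), return_counts=True)
    tf[i, words] = counts
df = (tf > 0).sum(axis=0)                        # n_i: images in which word i appears
idf = np.log(len(images) / np.maximum(df, 1))    # inverse document frequency weighting

bow = tf * idf                                   # tf-idf weighted BoW vectors
bow /= np.linalg.norm(bow, axis=1, keepdims=True) + 1e-12
print(bow.shape)                                 # (10, 32)
```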
In light of the effectiveness of binary features,this paper proposes to refine visual matching via the embedding of1057-7149©2014IEEE.Personal use is permitted,but republication/redistribution requires IEEE permission.See /publications_standards/publications/rights/index.html for more information.Fig.1.Two sample image retrieval results from the Holidays dataset.Foreach query,the baseline results(thefirst row)and results with color fusion(the second row)are demonstrated.The results start from the second imagein the rank list.multiple binary features.On one hand,binary features providecomplementary clues to rebuild the discriminative power ofSIFT visual word.On the other hand,in this feature fusionprocess,binary features are coupled by links derived froma virtual multi-index structure.In this structure,SIFT visualword and other binary features are combined at indexing levelby taking each feature as one dimension of the virtual multi-index.Therefore,the image retrieval process votes for candi-date images not only similar in local texture feature,but alsoconsistent in other feature spaces.With the concept of multi-index,a novel IDF scheme,called multi-IDF,is introduced.We show that binary feature verification methods such asHamming Embedding,can be effectively incorporated in ourframework.Moreover,we extend the proposed framework byembedding binary color feature.This paper argues that feature fusion by coupled binaryfeature embedding significantly enhances the discriminativepower of SIFT visual word.First,SIFT binary feature retainsmore information from the original feature,providing effectivecheck for visual word matching.Second,color binary featuregives complementary clues to SIFT feature(see Fig.2forthe effects of color fusion).Both aspects serve to improvefeature matching accuracy.Extensive experiments on fourimage retrieval datasets confirm that the proposed method dra-matically improves image retrieval accuracy,while remainingefficient as well.Fig.1gives some examples where our methodreturns challenging images candidates while the conventionalSIFT-based model fails.The rest of the paper is organized as follows.After a briefreview of related work in Section II,we introduce the proposedbinary feature embedding method based on virtual multi-indexin Section III.Section IV presents the experimental resultson four benchmark datasets for image retrieval applications.Finally,conclusions are given in Section V.II.R ELATED W ORKThis paper aims at improving BoW-based image retrievalvia indexing-level feature fusion.So we briefly reviewfourFig.2.An example of image matching using the baseline(left)and theproposed fusion(right)method.For each image pair,the query image is onthe left.Thefirst row represents matching between relevant image,while thesecond row contains irrelevant ones.We also show the ranks of the candidateimages.We can see that the fusion of color information improves performancesignificantly.closely related aspects,i.e,feature quantization,spatial con-straint encoding,feature fusion,and indexing strategy.A.Feature QuantizationUsually,hundreds or thousands of local features, e.g,SIFT[5]or its variants[14],[15]are extracted in an image.To reduce memory cost and speed up image matching,eachSIFT feature is assigned to one or a few nearest centroidsin the codebook via approximate nearest neighbor(ANN)algorithms[6],[7].This process is featured by a significantinformation loss from a128-D double vector to a1-D integer.To reduce quantization error,multiple assignment[16]or 
softquantization[17]is employed,which instead increase thequery time and memory overload.Another choice includesthe Fisher Vector(FV)[18].In FV,the Gaussian MixtureModel(GMM)is used to train a codebook.The quantizationprocess is performed softly by estimating the probability thata given feature falls into each Gaussian mixture.Quantizationerror can also be tackled using binary features.Hammingembedding[1]generates binary signatures coupling SIFTvisual word for matching verification.These binary featuresprovide information tofilter out false matches,rebuildingthe discriminative power of SIFT visual word.Quantizationartifact can also be addressed by modeling spatial constraintsamong local features[12],[19],[20].Another recent trendincludes designing codebook-free methods[21],[22]for effi-cient feature quantization.B.Feature FusionThe combination of multiple features has been demonstratedto obtain superior performance in various tasks,such astopic modeling[23],[24],boundary detection[25],characterrecognition[26],object classification[27]and detection[28],[29]tasks.Typically,early and late fusions are the twomain approaches.Early fusion[27]refers to fusing multiplefeatures at pixel level,while late fusion[30]learns semanticconcepts directly from unimodal features.In thefield ofinstance retrieval,feature fusion is no easy task due to thelack of sufficient training data.Douze et al.[31]combinefishervector and attributes in a manner equivalent to early fusion. Zhang et al.[32]perform late fusion by combining rank lists of BoW and global features by graph fusion.In[33], co-indexing is employed to augment the inverted index with globally similar images.In[34],multi-level features including the popular CNN feature[35]are integrated,and state-of-the-art results on benchmarks are reported on benchmarks. Wengert et al.[13]use global and local color features to pro-vide complementary information.Their work is similar to ours in that binary color feature is also integrated into the inverted file.However,in their work the trade-off between color and SIFT hamming embeddings is heuristic and dependent on the dataset.Instead this paper focuses on the indexing-level feature fusion by modeling correlations between features,which could be generalized on different datasets,providing a different view from previous works.C.Indexing StrategyThe invertedfile structure[4]greatly promotes the efficiency of large scale image retrieval[36].In essence,the inverted file stores image IDs where the corresponding visual word appears.Modified invertedfile may also include other cues for further visual match check,such as binary Hamming codes [1],[37],feature position,scale,and orientation[38],[39],etc. 
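As a concrete illustration of a keypoint-level inverted file augmented with binary signatures, here is a simplified sketch in the spirit of Hamming-Embedding-style verification. It is not the paper's implementation: the 8-bit toy signatures, the threshold, and the Gaussian weighting are illustrative assumptions.

```python
import math
from collections import defaultdict

# Inverted file: visual word id -> list of postings (image_id, binary signature as an int).
inverted_file = defaultdict(list)

def index_keypoint(word_id, image_id, signature):
    """Store one database keypoint under its quantized visual word."""
    inverted_file[word_id].append((image_id, signature))

def hamming(a, b):
    """Hamming distance between two signatures stored as Python ints."""
    return bin(a ^ b).count("1")

def query(keypoints, sigma=16.0, kappa=24, idf=None):
    """Score database images; `keypoints` is a list of (word_id, signature) for the query."""
    idf = idf or {}
    scores = defaultdict(float)
    for word_id, q_sig in keypoints:
        w = idf.get(word_id, 1.0)                   # optional per-word IDF weight
        for image_id, db_sig in inverted_file[word_id]:
            d = hamming(q_sig, db_sig)
            if d < kappa:                           # reject matches with large Hamming distance
                scores[image_id] += math.exp(-d * d / sigma ** 2) * w
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Tiny usage example with made-up signatures (image 0 should rank first).
index_keypoint(word_id=7, image_id=0, signature=0b1010_1100)
index_keypoint(word_id=7, image_id=1, signature=0b0101_0011)
print(query([(7, 0b1010_1101)]))
```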
For example,Zhou et al.perform on-the-fly spatial coding with the metadata stored in the invertedfile.Zheng et al.[37] employ indexing-level feature fusion with a2D invertedfile, and greatly improve retrieval efficiency.In[40],a Bayes prob-abilistic model is proposed to merge multiple inverted indices, while cross-indexing[41]traverses two inverted indices in an iterative manner.The closest inspiring work to ours includes[42],which addresses ANN problem via inverted multi-indices built on product quantization(PQ)[43].In their work,each dimension of the multi-index corresponds to a segment of the SIFT feature vector,so the multi-index is a product of the“de-composition”of the SIFT feature.Opposite to[42],in this work,we“compose”SIFT and color features into the multi-index,implicitly performing feature fusion at indexing level.The multi-index used in this paper serves as an illustration of the coupling mechanism and derives the multi-IDF formula,which is a bridge between features.III.P ROPOSED A PPROACHThis section provides a formal description of the proposed framework for binary feature embedding.A.Binary Feature Verification RevisitThe SIFT visual word is such a weak discriminator that false positive matches occur prevalently:dissimilar SIFT fea-tures are assigned to the same visual word,and vice versa. To rebuild its discriminative power,binary features are employed to provide further verification for visual word matching pairs.The Hamming Embedding(HE)proposed in[1]suggests a way to inject SIFT binary feature into the retrieval system. This paper,however,exploits the embedding of multiple binary signatures from heterogeneous features.A binary feature can be generated as,f=(f1,f2,...,f n)T q(·)−→b=(b1,b2,...,b m)(1) where an n-D feature f is projected into an m-D binary signature b,which is stored in the invertedfile.During online query,given a query feature x,its matching strength with a indexed feature y which is quantized to the same visual word with x can be written as,f(x,y)=exp−d2b/σ2,if d b<κ,0otherwise,(2)where d b denotes the Hamming distance between the binary signatures of x and y,σis a weighting parameter,andκis a predefined threshold.If d b exceedsκ,then x and y are viewed as a false match and rejected.In this scenario,given a query image Q,and a database image I,the similarity function between them can be formu-lated as,sim(Q,I)=x∈Q,y∈If(x,y)·id f2Q 2 I 2,(3) where id f stands for the inverse document frequency(IDF) of the corresponding visual word.For a conventional inverted file,the IDF value of visual word w i is formulated as,id f(w i)=logNn i(4) where N represents the overall number of images in the corpus,and n i denotes the number of images in which w i appears.The basic idea of IDF is to assign more weight to rare words,and less weight to frequent words.In this paper,the binary features used are not limited to those derived from the original SIFT[1].It also involves other heterogenous binary features,such as color feature,or potentially,the recently proposed ORB[44],BRISK[45], FREAK[46],etc.In essence,our task is to perform feature fusion on the indexing level bridged by the introduction of multi-IDF scheme,which differs from previous works on binary verification significantly.anization of the Inverted FileThe invertedfile is prevalently used to index database images in the BoW-based image retrieval pipeline.This data structure not only calculates the inner product between images explicitly,but,more importantly,enables efficient online retrieval process.We assume 
that an image collection possesses N images denoted as D={I i}N i=1.Each image I i has a set of keypoints{x j}d i j=1,where d i is the number of key-points in I i.Given a codebook{w i}K k=1of size K,image I i is quantized to a vector representation v i=[v i,1,v i,2,...,v i,K]T, where v i,k stands for the response of visual word w k in I i.A conventional invertedfile can be denoted as W={W1,W2,...,W K}.In W,each entry W i contains a list of postings.For an image-level invertedfile,each posting stores the image ID and the TF score.For a keypoint-level invertedfile,each posting stores the image ID and other metadata[1],[38],[47]associated with the indexed keypoint.Fig.3.Structure of the keypoint-level invertedfile.Each posting in the list stores the information of an indexed keypoint,e.g.,the image ID,binary features,etc.In this paper,a keypoint-level invertedfile is organizedas shown in Fig.3.Similar to Hamming Embedding[1], binary features of each SIFT visual word are stored in thecorresponding entry of the1-D invertedfile.However,our work differs from[1]in two aspects.First,we illustrate thefeasibility of multiple feature fusion at indexing level.Second,a multi-IDF scheme(see Section III-D)is employed,which takes advantage of the multi-index structure(see Section III-C)and feature correlation.C.A Multi-Index IllustrationIn this section,we provide an alternative explanation of the proposed method from the perspective of multi-index structure,which stands as the foundation of the multi-IDF formula(seeSection III-D).With M kinds of features,the“dimension”of the multi-index is M.Each dimension corresponds to a conventionalinvertedfile of feature F m,m=1,2,...,M.Then build-ing the multi-index can be processed as follows.First,for each keypoint{x i}in an image,multiple descriptors (f0i,f1i,...,f M i)are computed.Then,the descriptors asso-ciated with a keypoint are quantized into a visual word tuple (w0i,w1i,...,w M i)using codebooks{C m}M m=0of each feature. Finally,for each tuple,an entry in the multi-index is identified, where the metadata of this keypoint can be stored.During online retrieval,each image is represented by a bag of word tuples as in the offline phase.In this manner,every keypoint is described by multiple features.Then,for each word tuple,wefind and vote for the candidate images from the corresponding entry in the multi-index.1)Embedding Binary Features:This paper embeds binaryfeatures into the SIFT visual word framework.Due to its bit-wise nature,each binary feature equals to a decimal number.So the binary feature itself can be viewed as a visual word: there is no need to train a codebook explicitly[21].The reason why we use binary features instead of traditional visual words is that a coarser-to-fine mechanism is implied in binary features.Basically,the Hamming distance between two binary features represents their similarity,while the traditional visual word only allows a“hard”matching mode.A binary feature of k bits corresponds to a codebook of size2k.Consequently, 1It refers to neighborhood in the feature space,but for better illustration, we place them as neighborhood in the invertedfile.Fig.4.An example of binary multi-index fusing SIFT and color feature. 
The codebook sizes are1M and222,respectively.During online retrieval,the entry corresponding to word tuple(S4,C4)is located.Then,the neighborhood entries1are also checked as an implementation of Multiple Assignment(MA).Color indicates the weight of these entries.A darker color signifies a largerweight.we can adapt each binary feature to the virtual multi-index easily.Specifically,the SIFT binary features involved in[1] can be coupled with the SIFT visual word in the virtual multi-index as well.To boost the recall of candidate images,Multiple Assign-ment(MA)[16],[17]is employed.Suppose r0,r1,...,r M nearest neighbors are used in the quantization of query fea-tures,then a weight is computed according to its distance to the cluster center.Similar to[17],the weight takes the form of exp(−d ),where d is the distance to the center,and is the weighting factor.For the SIFT visual word,we adopt the Euclidean distance,while Hamming distance is used for binary features.A2D multi-index fusing SIFT and color feature is illustrated in Fig.4.Note that,the memory cost of the multi-index grows rapidly with the number of dimensions,leading to memory ineffi-ciency.As a result,in practice,the multi-index structure is used to calculate the multi-IDF only,and the invertedfile in Fig.3is actually being used.D.Multi-IDF FormulaIn this section,we introduce the multi-IDF formula which bridges various features in the multi-index.1)Conventional IDF:In essence,as Robertson[48]sug-gests,IDF weight can be interpreted in terms of probability, i.e.,the probability that a random image I contains visual word w i,estimated asP(w i)=P(w i occurs in I)≈n iN(5) In Eq.5,it is assumed that the occurrences of different visual words across the image database are statistically independent.Note that the independence assumption is also taken in the whole BoW model.Under this assumption,the IDF value can be estimated as the inverse of the fraction in Eq.5,plus a logarithm operator,id f(w i)=−log P(w i)=log Nn i(6)Note that,the reason why a log(·)operator is used associates with the idea of addition-based scoring function(see Eq.8 below).2)Proposed IDF:Without loss of generality,we illustrate the case of2D multi-index.For each entry,its IDF value is determined by the probability that a visual word tuple(w0i,w1j) occurs in an arbitrary image I.For simplicity,assume featuresF0and F1are independent,i.e,at each keypoint,the value distribution of the two features does not affect each other. 
Therefore,the occurrence of visual words w0i and w1j are also independent.We can deriveP(w0i,w1j)=P(w0i)P(w1j)≈n0iN·n1jN(7)where n0i and n1j stand for the number of images containing visual word w i and w j for feature F0and F1,respectively.2 Following Eq.6,the IDF value of this entry can be calculated as,id f(w0i,w1j)indep=−log P(w0i,w1j)=−(log P(w0i)+log P(w1j))=id f(w0i)+id f(w1j)(8) The advantage of the multi-IDF is two-fold.First,it greatly reduces the computational complexity from O(K(0)·K(1))to O(K(0)+K(1)):only the IDF value of each individual feature codebook needs to be calculated.In the case of higher-order multi-index,the computation efficiency would be more prominent.Secondly,multi-IDF in Eq.8can be adopted easily when more features are fused under the assumption of independence between features.Nonetheless,when the independence assumption does not hold,the multi-IDF formula would undergo some small changes.Consider the extreme case where two features are perfectly dependent,e.g,they are the same feature,the IDF formula is the same as the conventional IDF,id f(w0i,w1j)dep=id f(w0i)=id f(w1j)(9) In light of Eq.8and Eq.9,IDF value for two partially dependent features can be viewed as a weighted sum of each individual IDF,id f(w0i,w1j)=id f(w0i)+t·id f(w1j),t∈[0,1](10) where the parameter t measures the independence of features F0and rger independence leads to a larger t.Perfect 2We note that Eq.7calculates the probability that the two words occur in the same image,not the same spot.However,the two probabilities differ in multiplying a constant¯n,i.e,the average number of visual words or keypoints in an image,which can be omittedafterwards.Fig.5.An example of visual match.Top:A matched SIFT pair between two images.The11-D color name descriptors of the matched keypoints in the left(middle)and right(bottom)images are presented below.Also shown are the prototypes of the11basic colors(colored discs).In this example,the two matched keypoints differ a lot in color,thus considered as a false positive match.independence and dependence correspond to1and0,reducing Eq.10to Eq.8and Eq.9,respectively.Eq.10can be generalized to the multiple feature case as,id f(w0i,w1j1,...w M jM)=id f(w0i)+Mm=1t m·id f(w m jm)(11)where t m∈[0,1]is the correlation of F0and F m. 
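A compact sketch of the multi-IDF weighting just derived follows; the image counts, codebook sizes and the correlation parameter t are assumptions chosen purely for illustration.

```python
import math

def idf(n_i, N):
    """Conventional IDF of a visual word that appears in n_i of N database images."""
    return math.log(N / n_i)

def multi_idf(word_counts, N, t):
    """
    Multi-IDF of a word tuple (w0, w1, ..., wM): the principal feature's IDF plus
    correlation-weighted IDFs of the auxiliary features (Eq. 11-style combination).
    word_counts[m] = number of images containing word w_m of feature F_m,
    t[m] in [0, 1]  = estimated independence weight of feature F_m w.r.t. F_0.
    """
    value = idf(word_counts[0], N)
    for n_m, t_m in zip(word_counts[1:], t):
        value += t_m * idf(n_m, N)
    return value

# Toy example: 10,000 database images; the SIFT word occurs in 50 of them,
# the binary colour word in 400.  t = 1 -> independent features, t = 0 -> fully dependent.
N = 10_000
print(multi_idf([50, 400], N, t=[1.0]))   # sum of both IDFs
print(multi_idf([50, 400], N, t=[0.0]))   # reduces to the conventional SIFT IDF
print(multi_idf([50, 400], N, t=[0.6]))   # partially dependent features
```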
In this paper,we take F0as the principal feature(SIFT visual word),and F m as the auxiliary feature(binary features).Note that,when multiple features are fused in the multi-index,it is a multiple dimensional indexing structure,in which each dimension corresponds a feature.Specifically,if we set t m to0,the proposed framework reduces to case of binary feature verification.Coupled with the multi-index,Eq.11helps generalize binary feature embedding by building links among these features.The effectiveness of the multi-IDF scheme is evaluated in Section IV-C.E.Fusion of Color FeatureIn this paper,we embed color feature with SIFT at indexing level.Fig.2and Fig.5provide examples of how color feature helpsfilter our false matches.1)Color Descriptor:This paper employs the Color Names (CN)descriptor[28]for two reasons.First,it is shown in[28]that CN has superior performance compared with several commonly used color descriptors such as the Robust hue descriptor[49]and Opponent derivative descriptor [49].Second,although colored SIFT descriptors such as HSV-SIFT[50]and HueSIFT[51]provide color information, the descriptors typically lose some invariance properties and are high-dimensional[52].Basically,the CN descriptor assigns to each pixel a11-D vector,of which each dimension encodes one of the eleven basic colors:black,blue,brown,grey,green,orange,pink, purple,red,white and yellow.The effectiveness of CN has been validated in image classification and detection applica-tions[27],[28].We further test it in the scenario of image retrieval.2)Feature Extraction:At each keypoint,two descriptors are extracted,i.e.,a SIFT descriptor and a CN descriptor.In this scenario,SIFT is extracted with the standard algorithm[5]. As with CN,wefirst compute CN vectors of pixels surround-ing the keypoint,with the area proportional to the scale of the keypoint.Then,we take the average CN vector as the color feature.The two descriptors of a keypoint are individually quantized,binarized,and fed into our model,respectively. 
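A simplified sketch of the colour-feature extraction step described above follows. It is not the authors' code: `pixel_to_cn` is a stand-in for the learned 11-D color-name assignment, and the patch-radius rule is an assumption.

```python
import numpy as np

N_COLOR_NAMES = 11  # black, blue, brown, grey, green, orange, pink, purple, red, white, yellow

def pixel_to_cn(rgb):
    """Stand-in for the learned color-name assignment: returns an 11-D probability vector.
    A real implementation would look the RGB value up in the trained CN table."""
    rng = np.random.default_rng(int(rgb[0]) * 65536 + int(rgb[1]) * 256 + int(rgb[2]))
    v = rng.random(N_COLOR_NAMES)
    return v / v.sum()

def cn_descriptor(image, x, y, scale, radius_per_scale=6):
    """Average 11-D CN vector over a square patch whose size grows with the keypoint scale."""
    r = max(1, int(round(radius_per_scale * scale)))
    h, w, _ = image.shape
    x0, x1 = max(0, x - r), min(w, x + r + 1)
    y0, y1 = max(0, y - r), min(h, y + r + 1)
    patch = image[y0:y1, x0:x1].reshape(-1, 3)
    return np.mean([pixel_to_cn(p) for p in patch], axis=0)

# Usage with a random toy image and one keypoint.
img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3))
print(cn_descriptor(img, x=32, y=32, scale=1.5))   # an 11-D vector summing to ~1
```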
3)Binarization:Because the CN descriptor has explicit semantic meaning in each dimension,we do not adopt the classical clustering method to perform quantization.Instead, we directly convert a CN vector into a binary feature,which itself can be viewed as a distinct visual word[21].Specifically, we try two binarization schemes,producing11-bit vector b(11) and22-bit vector b(22),respectively.Suppose the CN vector is presented as(f1,f2,...,f11)T,and the binarization can be processed as follows,b(11) i =1,if f i≥ˆth,0,if f i<ˆth(12)(b(22)i ,b(22)i+11)=⎧⎪⎨⎪⎩(1,1),if f i>ˆth1,(1,0),ifˆth2<f i≤ˆth1,(0,0),if f i≤ˆth2(13)where b i(i=1,2,...,11)is the i th entry of the resulting binary feature.Thresholdsˆth=g3,ˆth1=g2,ˆth2=g5,where (g1,g2,...,g11)is the sorted vector of(f1,f2,...,f11) in descending order.A comparison of the two quantizers (Eq.12and Eq.13)is shown in Section IV-C.Intuitively,the binarization schemes introduced in Eq.12and Eq.13represent a uniform partition(similar to the lattice partitioning in[53]) of the CN feature space.In fact,a vital difference between the CN vector and the SIFT vector is that each entry of the CN vector has explicit physical meaning.In Eq.13,we propose to assign(1,1)to the two most salient colors of a local region and(0,0)to the least dominant colors.We speculate that two regions are of a similar color if they have the same dominant colors,and vice versa.Another consideration of Eq.13is that the minor colors may be subject to the impact of illumination or view changes.In this manner,this binarization scheme is more robust to image variations.On the other hand,one may ask why we choose g2and g5instead of a moreuniform Fig.6.Sample images used to calculate the correlation coefficients.In this example,images in each row are obtained by the same text query.The text queries used to crawl these images are(from top to bottom):“Asia”,“Atlanta”,“Babados”,and“Brazil”.threshold setting.In fact,region for CN extraction is quite small:in typical cases it covers tens of pixels.In such a small patch,the number of dominant colors is small(see Fig.5 for an example).After counting the several dominant colors, approximately half of the CN dimensions are close to zero. Therefore,we use the presented thresholds which also yield satisfying performance in the experiments.Therefore,in this paper,we do not employ the binarization method proposed in[1].Nevertheless,this paper provides a comparison with the standard Locality-Sensitive Hashing (LSH)[54]as well as the state-of-the-art Kernelized Locality-Sensitive Hashing(KLSH)[55]methods in Section IV-C. 4)Estimation of Feature Correlation:To determine t in Eq.10,we propose a simple scheme to measure the correlation between features.We take as an example calculating the feature correlation between SIFT visual word and binary CN.To this end,we crawled200K high-resolution images uploaded by users from Flickr using the names of60countries and regions across the world.These images are generally high-resolution,with the most common size of1024×768. The content of the images are very diverse,from scenes to objects,which can be viewed as a good representation of natural images.Some sample images are shown in Fig.6. From the images,we extract over2×109(SIFT,CN)feature tuples.Then,we perform quantization on the two features: classical codebook for SIFT,and22-bit quantizer for CN. 
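The two binarization schemes described above translate directly into code. This is a sketch, not the released implementation; the example CN vector is made up for illustration.

```python
import numpy as np

def binarize_cn_11(f):
    """11-bit quantizer: bit i is 1 iff f_i >= g3, the third-largest CN value."""
    g = np.sort(f)[::-1]          # (g1, ..., g11) in descending order
    return (f >= g[2]).astype(np.uint8)

def binarize_cn_22(f):
    """22-bit quantizer: (1,1) for the most salient colours (f_i > g2),
    (1,0) for intermediate values (g5 < f_i <= g2), (0,0) otherwise."""
    g = np.sort(f)[::-1]
    hi = (f > g[1]).astype(np.uint8)                    # f_i > g2
    mid = ((f > g[4]) & (f <= g[1])).astype(np.uint8)   # g5 < f_i <= g2
    return np.concatenate([hi | mid, hi])               # bits (b_1..b_11, b_12..b_22)

# Example: a CN vector dominated by "red" and "white".
cn = np.array([0.02, 0.01, 0.03, 0.05, 0.02, 0.01, 0.02, 0.01, 0.45, 0.35, 0.03])
print(binarize_cn_11(cn))
print(binarize_cn_22(cn))
```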
Further,two histograms are calculated using these data:For feature tuples with the same/different SIFT visual word, compute the Hamming distance of CN features,and calculate the normalized distance histogram.Finally,we calculate the correlation coefficient of the two histograms,which serves as the estimation of feature correlation parameter t.Briefly,the intuition is that,for highly independent features, the two histograms should be very similar:whether or not the SIFT visual words of a keypoint pair are the same,the Hamming distance of the other feature is not affected.In this case,the correlation coefficient of the two histograms is close to1,which also means a larger t.The“dependent features”case can be analysed in a similar way.Results for feature correlation calculation is presented in Fig.7.Specifically,we present the statistical results obtained on the Flickr200K and Holidays datasets.Note that,the reason why the Holidays dataset is included is that the correlation coefficient may be subject to dataset bias,or dataset dependency.The correlation。
Theorem 12.12, p. 120 in [St] studies the syzygies of the ideal $I_A$; for each $b \in \mathbb{N}A$, let $\Delta_b$ be the simplicial complex on the set $\{1, \ldots, n\}$ defined as follows: $\Delta_b = \{F \subset \{1, \ldots, n\} : b -$
1 Introduction
Let $M$ be a very ample line bundle on a smooth complex projective variety $Y$ and let $\varphi_M : Y \rightarrow \mathbb{P}(H^0(Y, M)^*)$ be the map associated to $M$. We recall the definition of Property $N_p$ of Green-Lazarsfeld, studied for the first time by Green in [Gr1-2] (see also [G-L], [Gr3]): let $Y$ be a smooth complex projective variety and let $L$ be a very ample line bundle on $Y$ defining an embedding $\varphi_L : Y \hookrightarrow \mathbb{P} = \mathbb{P}(H^0(Y, L)^*)$; set $S = S(L) = \mathrm{Sym}^* H^0(L)$, the homogeneous coordinate ring of the projective space $\mathbb{P}$, and consider the graded $S$-module $G = G(L) = \oplus_n H^0(Y, L^n)$; let
$$E_* : \quad 0 \longrightarrow E_n \longrightarrow E_{n-1} \longrightarrow \ldots \longrightarrow E_0 \longrightarrow G \longrightarrow 0$$
be a minimal graded free resolution of $G$; the line bundle $L$ satisfies Property $N_p$ ($p \in \mathbb{N}$) if and only if
$$E_0 = S, \qquad E_i = \oplus S(-i-1) \ \text{ for } 1 \leq i \leq p.$$
(Thus $L$ satisfies Property $N_0$ if and only if $Y \subset \mathbb{P}(H^0(L)^*)$ is projectively normal, i.e. $L$ is normally generated; $L$ satisfies Property $N_1$ if and only if $L$ satisfies Property $N_0$ and the homogeneous ideal $I$ of $Y \subset \mathbb{P}(H^0(L)^*)$ is generated by quadrics; $L$ satisfies Property $N_2$ if and only if $L$ satisfies Property $N_1$ and the module of syzygies among quadratic generators $Q_i \in I$ is spanned by relations of the form $\sum L_i Q_i = 0$, where the $L_i$ are linear polynomials; and so on.)
Now let $L = \mathcal{O}_{\mathbb{P}^{n_1} \times \ldots \times \mathbb{P}^{n_d}}(a_1, \ldots, a_d)$, where $d, a_1, \ldots, a_d, n_1, \ldots, n_d$ are positive integers. The known results on the syzygies in this case are the following.
Case $d = 1$, i.e. the case of the Veronese embedding:
Theorem 1 (Green) [Gr1-2]. Let $a$ be a positive integer. The line bundle $\mathcal{O}_{\mathbb{P}^n}(a)$ satisfies Property $N_a$.
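Before the remaining known results, a worked example may help fix the definition of Property $N_p$. This example is not part of the original text; it is the standard smallest Segre case, written in the usual coordinates on $\mathbb{P}^3$.

```latex
% Segre embedding of P^1 x P^1: the smooth quadric surface in P^3.
% L = O(1,1); H^0(L) has basis x_0y_0, x_0y_1, x_1y_0, x_1y_1, giving coordinates z_{00},...,z_{11}.
\[
  \varphi_L\bigl([x_0:x_1],[y_0:y_1]\bigr) = [x_0y_0 : x_0y_1 : x_1y_0 : x_1y_1] \in \mathbb{P}^3 .
\]
% The homogeneous ideal is generated by a single quadric,
\[
  I = \bigl( z_{00}z_{11} - z_{01}z_{10} \bigr) \subset S = \mathbb{C}[z_{00},z_{01},z_{10},z_{11}],
\]
% and the minimal graded free resolution of G \cong S/I is
\[
  0 \longrightarrow S(-2) \longrightarrow S \longrightarrow G \longrightarrow 0 ,
\]
% so E_0 = S and E_1 = S(-2): Property N_p holds for every p, since there are no higher syzygies.
```

For this smallest case there are no higher syzygy modules at all; the interesting behaviour studied in this paper begins with $d \geq 3$ factors.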
On syzygies of Segre embeddings
arXiv:math/0102122v1 [math.AG] 16 Feb 2001
Elena Rubei
Abstract. We study the syzygies of the ideals of the Segre embeddings. Let $d \in \mathbb{N}$, $d \geq 3$; we prove that the line bundle $\mathcal{O}(1, \ldots, 1)$ on $\mathbb{P}^1 \times \ldots \times \mathbb{P}^1$ ($d$ copies) satisfies Property $N_p$ of Green-Lazarsfeld if and only if $p \leq 3$. Besides, we prove that if a projective variety does not satisfy Property $N_p$ for some $p$, then its product with any other projective variety does not satisfy Property $N_p$ either. From this we also deduce other corollaries about syzygies of Segre embeddings.
2 Proof of Theorem 6
First we have to recall some facts on toric ideals from [St]. Let $k \in \mathbb{N}$. Let $A = \{a_1, \ldots, a_n\}$ be a subset of $\mathbb{Z}^k$. The toric ideal $I_A$ is defined as the ideal in $\mathbb{C}[x_1, \ldots, x_n]$ generated as a vector space by the binomials
Address: Elena Rubei, Dipartimento di Matematica/A, 50134 Firenze, Italia. E-mail address: rubei@math.unifi.it. 2000 Mathematical Subject Classification: 14M25, 13D02.
Theorem 2 (Ottaviani-Paoletti) [O-P]. If $n \geq 2$, $a \geq 3$ and the bundle $\mathcal{O}_{\mathbb{P}^n}(a)$ satisfies Property $N_p$, then $p \leq 3a - 3$.
Theorem 3 (Josefiak-Pragacz-Weyman) [J-P-W]. The bundle $\mathcal{O}_{\mathbb{P}^n}(2)$ satisfies Property $N_p$ if and only if $p \leq 5$ when $n \geq 3$, and for all $p$ when $n = 2$.
(See [O-P] for a more complete bibliography.)
Case $d = 2$:
Theorem 4 (Gallego-Purnaprajna) [G-P]. Let $a, b \geq 2$. The line bundle $\mathcal{O}_{\mathbb{P}^1 \times \mathbb{P}^1}(a, b)$ satisfies Property $N_p$ if and only if $p \leq 2a + 2b - 3$.
Theorem 5 (Lascoux-Pragacz-Weyman) [Las], [P-W]. Let $n_1, n_2 \geq 2$. The line bundle $\mathcal{O}_{\mathbb{P}^{n_1} \times \mathbb{P}^{n_2}}(1, 1)$ satisfies Property $N_p$ if and only if $p \leq 3$.
Here we consider $\mathcal{O}(1, \ldots, 1)$ on $\mathbb{P}^1 \times \ldots \times \mathbb{P}^1$ ($d$ times, for any $d$). We prove (Section 2):
Theorem 6. The line bundle $\mathcal{O}(1, \ldots, 1)$ on $\mathbb{P}^1 \times \ldots \times \mathbb{P}^1$ ($d$ times) satisfies Property $N_3$ for any $d$.
Besides we prove (Section 3):
Proposition 7. Let $X$ and $Y$ be two projective varieties and let $L$ be a line bundle on $X$ and $M$ a line bundle on $Y$. Let $\pi_X : X \times Y \rightarrow X$ and $\pi_Y : X \times Y \rightarrow Y$ be the canonical projections. Suppose $L$ and $M$ satisfy Property $N_1$. Let $p \geq 2$. If $L$ does not satisfy Property $N_p$, then $\pi_X^* L \otimes \pi_Y^* M$ does not satisfy Property $N_p$, either.
Corollary 8. Let $a_1, \ldots, a_d$ be positive integers with $a_1 \leq a_2 \leq \ldots \leq a_d$. Suppose $k = \max\{i \mid a_i = 1\}$. If $k \geq 3$, the line bundle $\mathcal{O}_{\mathbb{P}^{n_1} \times \ldots \times \mathbb{P}^{n_d}}(a_1, \ldots, a_d)$ does not satisfy Property $N_4$, and if $d - k \geq 2$ it does not satisfy Property $N_{2a_{k+1} + 2a_{k+2} - 2}$.
In particular, from Corollary 8 and Theorem 6, we have:
Corollary 9. Let $d \geq 3$. The line bundle $\mathcal{O}_{\mathbb{P}^1 \times \ldots \times \mathbb{P}^1}(1, \ldots, 1)$ ($d$ times) satisfies Property $N_p$ if and only if $p \leq 3$.
$$x_1^{u_1} \cdots x_n^{u_n} - x_1^{v_1} \cdots x_n^{v_n}$$
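To make the definition of $I_A$ concrete, here is a small self-contained sketch, not taken from the paper: the configuration $A$ below is the standard choice associated to the Segre embedding of $\mathbb{P}^1 \times \mathbb{P}^1$, used purely for illustration, and the script enumerates the degree-2 binomials $x^u - x^v$ with $Au = Av$.

```python
from itertools import combinations_with_replacement

# Standard vector configuration A in Z^4 for the Segre embedding of P^1 x P^1:
# the column for the monomial x_i * y_j records its bidegree, e_i + e_{2+j}.
A = [(1, 0, 1, 0),   # x0*y0
     (1, 0, 0, 1),   # x0*y1
     (0, 1, 1, 0),   # x1*y0
     (0, 1, 0, 1)]   # x1*y1
n = len(A)

def A_times(u):
    """Compute A*u in Z^4 for an exponent vector u in N^n."""
    return tuple(sum(u[j] * A[j][i] for j in range(n)) for i in range(len(A[0])))

def binomials_of_degree(deg):
    """List pairs (u, v), u != v, |u| = |v| = deg, with A*u = A*v."""
    monomials = list(combinations_with_replacement(range(n), deg))
    found = []
    for mu, mv in combinations_with_replacement(monomials, 2):
        if mu == mv:
            continue
        u = [mu.count(j) for j in range(n)]
        v = [mv.count(j) for j in range(n)]
        if A_times(u) == A_times(v):
            found.append((u, v))
    return found

for u, v in binomials_of_degree(2):
    lhs = "*".join(f"x{j+1}" for j in range(n) for _ in range(u[j]))
    rhs = "*".join(f"x{j+1}" for j in range(n) for _ in range(v[j]))
    print(f"{lhs} - {rhs}")   # expected single degree-2 generator: x1*x4 - x2*x3
```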