Pitch-based segmentation and recognition of dot-matrix text


A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago, pff@
David McAllester, Toyota Technological Institute at Chicago, mcallester@
Deva Ramanan, UC Irvine, dramanan@

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection
challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by “conceptually weaker” models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection.
We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining “hard negative” examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a bounding box for each positive example of an object. We treat the position of each object part as a latent variable. We also treat the exact location of the object as a latent variable, requiring only that our classifier select a window that has large overlap with the labeled bounding box.

A latent SVM, like a hidden CRF [19], leads to a non-convex training problem. However, unlike a hidden CRF, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive training examples. This leads to a general coordinate descent algorithm for latent SVMs.

System Overview. Our system uses a scanning window approach. A model for an object consists of a global “root” filter and several part models. Each part model specifies a spatial model and a part filter. The spatial model defines a set of allowed placements for a part relative to a detection window, and a deformation cost for each placement.

The score of a detection window is the score of the root filter on the window plus the sum over parts, of the maximum over placements of that part, of the part filter score on the resulting subwindow minus the deformation cost. This is similar to classical part-based models [10, 13]. Both root and part filters are scored by computing the dot product between a set of weights and histogram of gradient (HOG) features within a window. The root filter is equivalent to a Dalal-Triggs
model [5]. The features for the part filters are computed at twice the spatial resolution of the root filter. Our model is defined at a fixed scale, and we detect objects by searching over an image pyramid.

In training we are given a set of images annotated with bounding boxes around each instance of an object. We reduce the detection problem to a binary classification problem. Each example $x$ is scored by a function of the form $f_\beta(x) = \max_z \beta \cdot \Phi(x, z)$. Here $\beta$ is a vector of model parameters and $z$ are latent values (e.g. the part placements). To learn a model we define a generalization of SVMs that we call latent variable SVM (LSVM). An important property of LSVMs is that the training problem becomes convex if we fix the latent values for positive examples. This can be used in a coordinate descent algorithm.

In practice we iteratively apply classical SVM training to triples $(\langle x_1, z_1, y_1 \rangle, \ldots, \langle x_n, z_n, y_n \rangle)$ where $z_i$ is selected to be the best scoring latent label for $x_i$ under the model learned in the previous iteration. An initial root filter is generated from the bounding boxes in the PASCAL dataset. The parts are initialized from this root filter.

2. Model

The underlying building blocks for our models are the Histogram of Oriented Gradient (HOG) features from [5]. We represent HOG features at two different scales. Coarse features are captured by a rigid template covering an entire detection window. Finer scale features are captured by part templates that can be moved with respect to the detection window. The spatial model for the part locations is equivalent to a star graph or 1-fan [3] where the coarse template serves as a reference position.

Figure 2. The HOG feature pyramid and an object hypothesis defined in terms of a placement of the root filter (near the top of the pyramid) and the part filters (near the bottom of the pyramid).

2.1. HOG Representation

We follow the construction in [5] to define a dense representation of an image at a particular resolution. The image is first divided into 8x8 non-overlapping pixel regions, or cells. For each cell we accumulate a 1D histogram of gradient orientations over pixels in that cell. These histograms capture local shape properties but are also somewhat invariant to small deformations.

The gradient at each pixel is discretized into one of nine orientation bins, and each pixel “votes” for the orientation of its gradient, with a strength that depends on the gradient magnitude. For color images, we compute the gradient of each color channel and pick the channel with highest gradient magnitude at each pixel. Finally, the histogram of each cell is normalized with respect to the gradient energy in a neighborhood around it. We look at the four $2 \times 2$ blocks of cells that contain a particular cell and normalize the histogram of the given cell with respect to the total energy in each of these blocks. This leads to a vector of length $9 \times 4$ representing the local gradient information inside a cell.

We define a HOG feature pyramid by computing HOG features of each level of a standard image pyramid (see Figure 2). Features at the top of this pyramid capture coarse gradients histogrammed over fairly large areas of the input image while features at the bottom of the pyramid capture finer gradients histogrammed over small areas.

2.2. Filters

Filters are rectangular templates specifying weights for subwindows of a HOG pyramid. A $w$ by $h$ filter $F$ is a vector with $w \times h \times 9 \times 4$ weights. The score of a filter is defined by taking the dot product of the weight vector and the features in a $w \times h$ subwindow of a HOG pyramid.

The system in [5] uses a single filter to define
an object model. That system detects objects from a particular class by scoring every $w \times h$ subwindow of a HOG pyramid and thresholding the scores.

Let $H$ be a HOG pyramid and $p = (x, y, l)$ be a cell in the $l$-th level of the pyramid. Let $\phi(H, p, w, h)$ denote the vector obtained by concatenating the HOG features in the $w \times h$ subwindow of $H$ with top-left corner at $p$. The score of $F$ on this detection window is $F \cdot \phi(H, p, w, h)$. Below we use $\phi(H, p)$ to denote $\phi(H, p, w, h)$ when the dimensions are clear from context.

2.3. Deformable Parts

Here we consider models defined by a coarse root filter that covers the entire object and higher resolution part filters covering smaller parts of the object. Figure 2 illustrates a placement of such a model in a HOG pyramid. The root filter location defines the detection window (the pixels inside the cells covered by the filter). The part filters are placed several levels down in the pyramid, so the HOG cells at that level have half the size of cells in the root filter level.

We have found that using higher resolution features for defining part filters is essential for obtaining high recognition performance. With this approach the part filters represent finer resolution edges that are localized to greater accuracy when compared to the edges represented in the root filter. For example, consider building a model for a face.
The root filter could capture coarse resolution edges such as the face boundary while the part filters could capture details such as eyes, nose and mouth.

The model for an object with $n$ parts is formally defined by a root filter $F_0$ and a set of part models $(P_1, \ldots, P_n)$ where $P_i = (F_i, v_i, s_i, a_i, b_i)$. Here $F_i$ is a filter for the $i$-th part, $v_i$ is a two-dimensional vector specifying the center for a box of possible positions for part $i$ relative to the root position, $s_i$ gives the size of this box, while $a_i$ and $b_i$ are two-dimensional vectors specifying coefficients of a quadratic function measuring a score for each possible placement of the $i$-th part. Figure 1 illustrates a person model.

A placement of a model in a HOG pyramid is given by $z = (p_0, \ldots, p_n)$, where $p_i = (x_i, y_i, l_i)$ is the location of the root filter when $i = 0$ and the location of the $i$-th part when $i > 0$. We assume the level of each part is such that a HOG cell at that level has half the size of a HOG cell at the root level. The score of a placement is given by the scores of each filter (the data term) plus a score of the placement of each part relative to the root (the spatial term),

$$\sum_{i=0}^{n} F_i \cdot \phi(H, p_i) + \sum_{i=1}^{n} \left[ a_i \cdot (\tilde{x}_i, \tilde{y}_i) + b_i \cdot (\tilde{x}_i^2, \tilde{y}_i^2) \right], \tag{1}$$

where $(\tilde{x}_i, \tilde{y}_i) = ((x_i, y_i) - 2(x, y) + v_i)/s_i$ gives the location of the $i$-th part relative to the root location. Both $\tilde{x}_i$ and $\tilde{y}_i$ should be between $-1$ and $1$.

There is a large (exponential) number of placements for a model in a HOG pyramid. We use dynamic programming and distance transform techniques [9, 10] to compute the best location for the parts of a model as a function of the root location. This takes $O(nk)$ time, where $n$ is the number of parts in the model and $k$ is the number of cells in the HOG pyramid. To detect objects in an image we score root locations according to the best possible placement of the parts and threshold this score.

The score of a placement $z$ can be expressed in terms of the dot product, $\beta \cdot \psi(H, z)$, between a vector of model parameters $\beta$ and a vector $\psi(H, z)$,

$$\beta = (F_0, \ldots, F_n, a_1, b_1, \ldots, a_n, b_n),$$

$$\psi(H, z) = (\phi(H, p_0), \phi(H, p_1), \ldots, \phi(H, p_n), \tilde{x}_1, \tilde{y}_1, \tilde{x}_1^2, \tilde{y}_1^2, \ldots, \tilde{x}_n, \tilde{y}_n, \tilde{x}_n^2, \tilde{y}_n^2).$$

We use this representation for learning the model parameters as it makes a connection between our deformable models and linear classifiers.

One interesting aspect of the spatial models defined here is that we allow for the coefficients $(a_i, b_i)$ to be negative. This is more general than the quadratic “spring” cost that has been used in previous work.

3. Learning

The PASCAL training data consists of a large set of images with bounding boxes around each instance of an object. We reduce the problem of learning a deformable part model with this data to a binary classification problem. Let $D = (\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle)$ be a set of labeled examples where $y_i \in \{-1, 1\}$ and $x_i$ specifies a HOG pyramid, $H(x_i)$, together with a range, $Z(x_i)$, of valid placements for the root and part filters. We construct a positive example from each bounding box in the training set. For these examples we define $Z(x_i)$ so the root filter must be placed to overlap the bounding box by at least 50%. Negative examples come from images that do not contain the target object.
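As an illustration of Eq. (1) and of its linear form $\beta \cdot \psi(H, z)$, the following is a minimal sketch in plain Python. It assumes that each filter $F_i$ and each feature vector $\phi(H, p_i)$ is already available as a flat list of numbers; the function names and the data layout are hypothetical and are not taken from the authors' implementation. The displacement formula is implemented exactly as written in the text.

```python
def dot(u, w):
    """Dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, w))


def normalized_displacement(part_xy, root_xy, v, s):
    """(x~_i, y~_i) = ((x_i, y_i) - 2 (x, y) + v_i) / s_i, as written in the text."""
    return ((part_xy[0] - 2 * root_xy[0] + v[0]) / s,
            (part_xy[1] - 2 * root_xy[1] + v[1]) / s)


def placement_score(filters, features, parts, root_xy, part_xys):
    """Eq. (1): data term plus spatial term.

    filters[i], features[i] stand in for F_i and phi(H, p_i), i = 0..n.
    parts[i - 1] = (v_i, s_i, a_i, b_i) for i = 1..n (hypothetical layout).
    """
    data = sum(dot(F, phi) for F, phi in zip(filters, features))
    spatial = 0.0
    for (v, s, a, b), xy in zip(parts, part_xys):
        xt, yt = normalized_displacement(xy, root_xy, v, s)
        spatial += dot(a, (xt, yt)) + dot(b, (xt * xt, yt * yt))
    return data + spatial


def as_linear(filters, features, parts, root_xy, part_xys):
    """Build beta and psi(H, z) such that dot(beta, psi) equals Eq. (1)."""
    beta = [w for F in filters for w in F]          # F_0, ..., F_n
    psi = [f for phi in features for f in phi]      # phi(H, p_0), ..., phi(H, p_n)
    for (v, s, a, b), xy in zip(parts, part_xys):
        xt, yt = normalized_displacement(xy, root_xy, v, s)
        beta += [a[0], a[1], b[0], b[1]]            # a_i, b_i
        psi += [xt, yt, xt * xt, yt * yt]           # x~, y~, x~^2, y~^2
    return beta, psi
```

Because the score is linear in $\beta$ once the placement $z$ is fixed, `dot(*as_linear(...))` and `placement_score(...)` agree, which is the property that lets a linear classifier be trained on $\psi(H, z)$.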
Each placement of the root filter in such an image yields a negative training example.

Note that for the positive examples we treat both the part locations and the exact location of the root filter as latent variables. We have found that allowing uncertainty in the root location during training significantly improves the performance of the system (see Section 4).

3.1. Latent SVMs

A latent SVM is defined as follows. We assume that each example $x$ is scored by a function of the form,

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z), \tag{2}$$

where $\beta$ is a vector of model parameters and $z$ is a set of latent values. For our deformable models we define $\Phi(x, z) = \psi(H(x), z)$ so that $\beta \cdot \Phi(x, z)$ is the score of placing the model according to $z$.

In analogy to classical SVMs we would like to train $\beta$ from labeled examples $D = (\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle)$ by optimizing the following objective function,

$$\beta^*(D) = \operatorname*{argmin}_{\beta} \; \lambda \|\beta\|^2 + \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)). \tag{3}$$

By restricting the latent domains $Z(x_i)$ to a single choice, $f_\beta$ becomes linear in $\beta$, and we obtain linear SVMs as a special case of latent SVMs. Latent SVMs are instances of the general class of energy-based models [18].

3.2. Semi-Convexity

Note that $f_\beta(x)$ as defined in (2) is a maximum of functions each of which is linear in $\beta$. Hence $f_\beta(x)$ is convex in $\beta$. This implies that the hinge loss $\max(0, 1 - y_i f_\beta(x_i))$ is convex in $\beta$ when $y_i = -1$. That is, the loss function is convex in $\beta$ for negative examples. We call this property of the loss function semi-convexity.

Consider an LSVM where the latent domains $Z(x_i)$ for the positive examples are restricted to a single choice. The loss due to each positive example is now convex. Combined with the semi-convexity property, (3) becomes convex in $\beta$.

If the labels for the positive examples are not fixed we can compute a local optimum of (3) using a coordinate descent algorithm:

1. Holding $\beta$ fixed, optimize the latent values for the positive examples, $z_i = \operatorname*{argmax}_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)$.
2. Holding $\{z_i\}$ fixed for positive examples, optimize $\beta$ by solving the convex problem defined above.

It can be shown that both steps always improve or maintain the value of the objective function in (3). If both steps maintain the value we have a strong local optimum of (3), in the sense that Step 1 searches over an exponentially large space of latent labels for positive examples while Step 2 simultaneously searches over weight vectors and an exponentially large space of latent labels for negative examples.

3.3. Data Mining Hard Negatives

In object detection the vast majority of training examples are negative. This makes it infeasible to consider all negative examples at a time. Instead, it is common to construct training data consisting of the positive instances and “hard negative” instances, where the hard negatives are data mined from the very large set of possible negative examples.

Here we describe a general method for data mining examples for SVMs and latent SVMs. The method iteratively solves subproblems using only hard instances. The innovation of our approach is a theoretical guarantee that it leads to the exact solution of the training problem defined using the complete training set. Our results require the use of a margin-sensitive definition of hard examples.

The results described here apply both to classical SVMs and to the problem defined by Step 2 of the coordinate descent algorithm for latent SVMs. We omit the proofs of the theorems due to lack of space. These results are related to working set methods [17].

We define the hard instances of $D$ relative to $\beta$ as,

$$M(\beta, D) = \{ \langle x, y \rangle \in D \mid y f_\beta(x) \le 1 \}. \tag{4}$$

That is, $M(\beta, D)$ are training examples that are incorrectly classified or near the margin of the classifier defined by $\beta$. We can show that $\beta^*(D)$ only depends on hard instances.
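The definitions in (2), (3) and (4) can be made concrete with a small sketch. It assumes, purely for illustration, that each example is stored as a list of candidate feature vectors $\Phi(x, z)$, one per latent value $z \in Z(x)$, together with its label $y$; this data layout and the function names are hypothetical and do not come from the paper.

```python
def dot(u, w):
    """Dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, w))


def latent_score(beta, candidates):
    """Eq. (2): f_beta(x) = max over z of beta . Phi(x, z).

    candidates holds one feature vector Phi(x, z) per latent value z.
    Returns the best score and the index of the maximizing z.
    """
    scores = [dot(beta, phi) for phi in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def objective(beta, examples, lam):
    """Eq. (3): lam * ||beta||^2 plus the sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - y * latent_score(beta, cands)[0])
                for cands, y in examples)
    return lam * dot(beta, beta) + hinge


def hard_instances(beta, examples):
    """Eq. (4): examples with y * f_beta(x) <= 1."""
    return [(cands, y) for cands, y in examples
            if y * latent_score(beta, cands)[0] <= 1.0]
```

Examples outside `hard_instances` contribute zero hinge loss, which gives some intuition for why the optimum can be determined by the hard instances alone.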
Theorem 1. Let $C$ be a subset of the examples in $D$. If $M(\beta^*(D), D) \subseteq C$ then $\beta^*(C) = \beta^*(D)$.

This implies that in principle we could train a model using a small set of examples. However, this set is defined in terms of the optimal model $\beta^*(D)$.

Given a fixed $\beta$ we can use $M(\beta, D)$ to approximate $M(\beta^*(D), D)$. This suggests an iterative algorithm where we repeatedly compute a model from the hard instances defined by the model from the last iteration. This is further justified by the following fixed-point theorem.

Theorem 2. If $\beta^*(M(\beta, D)) = \beta$ then $\beta = \beta^*(D)$.

Let $C$ be an initial “cache” of examples. In practice we can take the positive examples together with random negative examples. Consider the following iterative algorithm:

1. Let $\beta := \beta^*(C)$.
2. Shrink $C$ by letting $C := M(\beta, C)$.
3. Grow $C$ by adding examples from $M(\beta, D)$ up to a memory limit $L$.

Theorem 3. If $|C| < L$ after each iteration of Step 2, the algorithm will converge to $\beta = \beta^*(D)$ in finite time.

3.4. Implementation details

Many of the ideas discussed here are only approximately implemented in our current system. In practice, when training a latent SVM we iteratively apply classical SVM training to triples $\langle x_1, z_1, y_1 \rangle, \ldots, \langle x_n, z_n, y_n \rangle$ where $z_i$ is selected to be the best scoring latent label for $x_i$ under the model trained in the previous iteration. Each of these triples leads to an example $\langle \Phi(x_i, z_i), y_i \rangle$ for training a linear classifier. This allows us to use a highly optimized SVM package (SVMLight [17]). On a single CPU, the entire training process takes 3 to 4 hours per object class in the PASCAL datasets, including initialization of the parts.

Root Filter Initialization: For each category, we automatically select the dimensions of the root filter by looking at statistics of the bounding boxes in the training data (see footnote 1). We train an initial root filter $F_0$ using an SVM with no latent variables. The positive examples are constructed from the unoccluded training examples (as labeled in the PASCAL data). These examples are anisotropically scaled to the size and aspect ratio of the filter. We use random subwindows
from negative images to generate negative examples.

Root Filter Update: Given the initial root filter trained as above, for each bounding box in the training set we find the best-scoring placement for the filter that significantly overlaps with the bounding box. We do this using the original, un-scaled images. We retrain $F_0$ with the new positive set and the original random negative set, iterating twice.

Part Initialization: We employ a simple heuristic to initialize six parts from the root filter trained above. First, we select an area $a$ such that $6a$ equals 80% of the area of the root filter. We greedily select the rectangular region of area $a$ from the root filter that has the most positive energy. We zero out the weights in this region and repeat until six parts are selected. The part filters are initialized from the root filter values in the subwindow selected for the part, but filled in to handle the higher spatial resolution of the part. The initial deformation costs measure the squared norm of a displacement with $a_i = (0, 0)$ and $b_i = -(1, 1)$.

Model Update: To update a model we construct new training data triples. For each positive bounding box in the training data, we apply the existing detector at all positions and scales with at least a 50% overlap with the given bounding box. Among these we select the highest scoring placement as the positive example corresponding to this training bounding box (Figure 3). Negative examples are selected by finding high scoring detections in images not containing the target object. We add negative examples to a cache until we encounter file size limits. A new model is trained by running SVMLight on the positive and negative examples, each labeled with part placements. We update the model 10 times using the cache scheme described above. In each iteration we keep the hard instances from the previous cache and add as many new hard instances as possible within the memory limit. Toward the final iterations, we are able to include all hard instances, $M(\beta, D)$, in the cache.

Footnote 1: We picked a simple heuristic by cross-validating over 5 object classes. We set the model aspect to be the most common (mode) aspect in the data. We set the model size to be the largest size not larger than 80% of the data.

Figure 3. The image on the left shows the optimization of the latent variables for a positive example. The dotted box is the bounding box label provided in the PASCAL training set. The large solid box shows the placement of the detection window while the smaller solid boxes show the placements of the parts. The image on the right shows a hard-negative example.

4. Results

We evaluated our system using the PASCAL VOC 2006 and 2007 comp3 challenge datasets and protocol. We refer to [7, 8] for details, but emphasize that both challenges are widely acknowledged as difficult testbeds for object detection. Each dataset contains several thousand images of real-world scenes. The datasets specify ground-truth bounding boxes for several object classes, and a detection is considered correct when it overlaps more than 50% with a ground-truth bounding box. One scores a system by the average precision (AP) of its precision-recall curve across a test set.

Recent work in pedestrian detection has tended to report detection rates versus false positives per window, measured with cropped positive examples and negative images without objects of interest. These scores are tied to the resolution of the scanning window search and ignore effects of non-maximum suppression, making it difficult to compare different systems. We believe the PASCAL scoring method gives a more reliable measure of performance.

The 2007 challenge has 20 object categories. We entered a preliminary version of our system in the official competition, and obtained the best score in 6 categories. Our current system obtains the highest score in 10 categories, and the second highest score in 6 categories. Table 1 summarizes the results.

Our system performs well on rigid objects such as cars and sofas as well as highly deformable objects such as persons and horses. We also
note that our system is successful when given a large or small amount of training data. There are roughly 4700 positive training examples in the person category but only 250 in the sofa category. Figure 4 shows some of the models we learned. Figure 5 shows some example detections.

Class    Our rank  Our score  INRIA Normal  INRIA Plus  MPI Center  MPI ESSOL  TKK
aero     3         .180       .092          .136        .060        .152       .186
bike     1         .411       .246          .287        .110        .157       .078
bird     2         .092       .012          .041        .028        .098       .043
boat     1         .098       .002          .025        .031        .016       .072
bottle   1         .249       .068          .077        .000        .001       .002
bus      2         .349       .197          .279        .164        .186       .116
car      2         .396       .265          .294        .172        .120       .184
cat      4         .110       .018          .132        .208        .240       .050
chair    1         .155       .097          .106        .002        .007       .028
cow      1         .165       .039          .127        .044        .061       .100
table    1         .110       .017          .067        .049        .098       .086
dog      4         .062       .016          .071        .141        .162       .126
horse    2         .301       .225          .335        .198        .034       .186
mbike    2         .337       .153          .249        .170        .208       .135
person   1         .267       .121          .092        .091        .117       .061
plant    1         .140       .093          .072        .004        .002       .019
sheep    2         .141       .002          .011        .091        .046       .036
sofa     1         .156       .102          .092        .034        .147       .058
train    4         .206       .157          .242        .237        .110       .067
tv       1         .336       .242          .275        .051        .054       .090

Methods tested on only some classes, with scores in class order (the extracted text does not preserve which classes they belong to):
Darmstadt: .301
IRISA: .281 .318 .026 .097 .119 .289 .227 .221 .175 .253
Oxford: .262 .409 .393 .432 .375 .334

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold in the original table. Our current system ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.

Figure 4. Some models learned from the PASCAL VOC 2007 dataset (panels: Bottle, Car, Bicycle, Sofa). We show the total energy in each orientation of the HOG cells in the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent “cheap” placements, and dark values represent “expensive” placements.

We evaluated different components of our system on the longer-established 2006 person dataset. The top AP score in the PASCAL competition was .16, obtained using a rigid template model of HOG features
[5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.

We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that $s_i$ is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting $s_i$ to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at $s_i = 1$, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost. The following table shows AP as a function of freely allowable deformation in the first three columns. The last column gives the performance when using a quadratic deformation cost and an allowable displacement of 2 HOG cells.

s_i   0     1     2     3     2 + quadratic cost
AP    .27   .33   .31   .31   .34

Figure 5. Some results from the PASCAL 2007 dataset. Each row shows detections using a model for a specific class (Person, Bottle, Car, Sofa, Bicycle, Horse). The first three columns show correct detections while the last column shows false positives. Our system is able to detect objects over a wide range of scales (such as the cars) and poses (such as the horses). The system can also detect partially occluded objects such as a person behind a bush. Note how the false detections are often quite reasonable, for example detecting a bus with the car model, a bicycle sign with the bicycle model, or a dog with the horse model. In general the part filters represent meaningful object parts that are well localized in each detection such as the head in the person model.

Figure 6. Evaluation of our system on the PASCAL VOC 2006 person dataset. Root uses only a root filter and no latent placement of the detection windows on positive examples. Root+Latent uses a root filter with latent placement of the detection windows. Parts+Latent is a part-based system with latent detection windows but no root filter. Root+Parts+Latent includes both root and part filters, and latent placement of the detection windows.

5. Discussion

We introduced a general framework for training SVMs with latent structure. We used it to build a recognition system based on multiscale, deformable models. Experimental results on difficult benchmark data suggest our system is the current state-of-the-art in object detection.

LSVMs allow for exploration of additional latent structure for recognition. One can consider deeper part hierarchies (parts with parts), mixture models (frontal vs. side cars), and three-dimensional pose. We would like to train and detect multiple classes together using a shared vocabulary of parts (perhaps visual words). We also plan to use A* search [11] to efficiently search over latent parameters during detection.

References

[1] Y. Amit and A. Trouve. POP: Patchwork of parts models for object recognition. IJCV, 75(2):267–282, November 2007.
[2] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In ECCV, pages II:628–641, 1998.
[3] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR, pages 10–17, 2005.
[4] D. Crandall and D. Huttenlocher. Weakly supervised learning of
part-based spatial models for visual object recognition. In ECCV, pages I:16–29, 2006.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I:886–893, 2005.
[6] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts. In CVPR, 2007.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. /challenges/VOC/voc2007/workshop.
[8] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. /challenges/VOC/voc2006/results.pdf.
[9] P. Felzenszwalb and D. Huttenlocher. Distance transforms of sampled functions. Cornell Computing and Information Science Technical Report TR2004-1963, September 2004.
[10] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1), 2005.
[11] P. Felzenszwalb and D. McAllester. The generalized A* architecture. JAIR, 29:153–190, 2007.
[12] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[13] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computer, 22(1):67–92, January 1973.
[14] A. Holub and P. Perona. A discriminative framework for modelling object classes. In CVPR, pages I:664–671, 2005.
[15] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45–68, June 2001.
[16] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, pages II:2145–2152, 2006.
[17] T. Joachims. Making large-scale svm learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[18] Y. LeCun, S. Chopra, R. Hadsell, R. Marc'Aurelio, and F. Huang. A tutorial on energy-based learning. In G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, editors, Predicting Structured Data. MIT Press, 2006.
[19] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. PAMI, 29(10):1848–1852, October 2007.
[20] D. Ramanan. Using segmentation to verify object hypotheses. In CVPR, pages 1–8, 2007.
[21] D. Ramanan and C. Sminchisescu. Training deformable models for localization. In CVPR, pages I:206–213, 2006.
[22] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151–177, February 2004.
[23] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73(2):213–238, June 2007.
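As a supplementary note on the 50% overlap criterion used in Sections 3 and 4 above, the sketch below measures overlap as intersection over union, which is the convention of the PASCAL VOC evaluation. The paper text does not spell out the formula, so treating "overlap" this way (also for training) is an assumption, as are the box format `(x1, y1, x2, y2)` with continuous coordinates and the function names.

```python
def box_area(box):
    """Area of a box (x1, y1, x2, y2); degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def overlap(a, b):
    """Intersection over union of two boxes (assumed overlap measure)."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_correct(detection, ground_truth, threshold=0.5):
    """A detection counts as correct when it overlaps MORE than the threshold."""
    return overlap(detection, ground_truth) > threshold
```

Note the strict inequality: a detection with exactly 50% overlap is not counted as correct under the wording "overlaps more than 50%".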

Deep Learning Modulation Recognition Algorithm Based on Time-Frequency Analysis

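In communication systems, transmitted signals are usually modulated with different modulation schemes to achieve efficient data transmission.

Modulation recognition is an intermediate step between signal detection and signal demodulation [1]. Automatic modulation recognition (AMR) identifies the modulation information of a signal and plays a key role in practical civilian and military applications such as cognitive radio, signal identification, and spectrum monitoring [2].

AMR algorithms generally fall into two categories: likelihood-based methods and feature-based methods.

Likelihood-based methods use probability theory, hypothesis testing theory, and an appropriate decision criterion to solve the modulation recognition problem [3], but they are computationally very expensive. Feature-based methods extract features from the modulated signal and classify them, and their recognition performance depends on the number of extracted features.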

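Various statistical features of the modulated signal, such as its instantaneous amplitude, phase, and frequency, have been used to recognize the modulation scheme, for example cyclostationary properties and higher-order statistics [4-5].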
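To make the instantaneous features mentioned above concrete, the following minimal sketch computes them from complex baseband samples $I + jQ$. The text gives no formulas, so the definitions used here (magnitude, argument, and the phase increment between consecutive samples scaled to hertz) are standard textbook choices stated as assumptions; the function name and the `sample_rate` parameter are hypothetical.

```python
import cmath
import math


def instantaneous_features(iq, sample_rate):
    """Instantaneous amplitude, phase (rad) and frequency (Hz) of complex samples.

    The frequency is estimated from the phase increment between consecutive
    samples, so it has one element fewer than the input and is only
    unambiguous for |f| < sample_rate / 2.
    """
    amplitude = [abs(s) for s in iq]
    phase = [cmath.phase(s) for s in iq]
    frequency = [
        # angle of x[n+1] * conj(x[n]) is the wrapped phase step
        cmath.phase(iq[n + 1] * iq[n].conjugate()) * sample_rate / (2 * math.pi)
        for n in range(len(iq) - 1)
    ]
    return amplitude, phase, frequency
```

Taking the angle of the product with the conjugate avoids explicit phase unwrapping. Statistics of these three sequences (for example their variances) are the kind of hand-crafted features a feature-based classifier would consume.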

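Existing feature-based classifiers include decision tree algorithms and machine learning algorithms such as naive Bayes and artificial neural networks.

In recent years, deep learning, a powerful machine learning approach, has achieved great success in areas such as image classification and speech recognition.

Deep learning methods cascade multiple layers of nonlinear processing units that automatically optimize feature extraction from the input data so as to minimize the classification error.

Deep learning methods have also been applied to modulation recognition [6-7].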

文献[8]论述了在无线电信号处理中的深度学习新兴应用,并使用GNU Ra⁃dio 软件生成了具有同相(in-phase ,I )和正交(quadrature ,Q )分量的调制信号的开放数据集,从而进行调制识别;文献[9]在开源数据集上应用CNN 网络结构进行识别,相对于基于循环特征的识别算法,识别性能更优;文献[10]提出使用基于CNN 、起始网络(inception network )、卷积长短时记忆全连接网络(ConVoluional ,Long Short -Term Memory ,Fully Connected Deep Neural Network ,CLDNN )和残差神经网络(ResidualNetwork ,ResNet )对调制信号数据集进行识别同时对各自识别性能进行比较;文献[11]提出将调制信号的同相、正交分量以及四阶高阶累积量一起组成数据集来提升调制识别性能。
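The fourth-order cumulants mentioned for [11] can be estimated directly from I/Q samples. The sketch below uses the standard sample estimates of C40 and C42 for a zero-mean complex sequence; which cumulants [11] actually uses is not stated in the text, so this choice and the function name `fourth_order_cumulants` are assumptions.

```python
import numpy as np

def fourth_order_cumulants(x):
    """Sample estimates of C40 and C42 for complex samples x.

    C40 = E[x^4] - 3 E[x^2]^2
    C42 = E[|x|^4] - |E[x^2]|^2 - 2 E[|x|^2]^2
    (textbook definitions for a zero-mean sequence; an illustrative choice).
    """
    x = np.asarray(x, dtype=complex)
    x = x - x.mean()                      # enforce zero mean
    m20 = np.mean(x ** 2)
    m21 = np.mean(np.abs(x) ** 2)
    c40 = np.mean(x ** 4) - 3.0 * m20 ** 2
    c42 = np.mean(np.abs(x) ** 4) - np.abs(m20) ** 2 - 2.0 * m21 ** 2
    return c40, c42

# Noise-free, unit-power constellations with every symbol equally represented.
bpsk = np.tile([1.0, -1.0], 500)
qpsk = np.tile(np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4))), 250)
bpsk_c40, bpsk_c42 = fourth_order_cumulants(bpsk)
qpsk_c40, qpsk_c42 = fourth_order_cumulants(qpsk)
```

The two constellations give clearly different cumulant values, which is why such statistics are useful as additional inputs alongside the raw I and Q components.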

Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram

Fast3D Recognition and Pose Using the Viewpoint Feature Histogram Radu Bogdan Rusu,Gary Bradski,Romain Thibaux,John HsuWillow Garage68Willow Rd.,Menlo Park,CA94025,USA{rusu,bradski,thibaux,hsu}@Abstract—We present the Viewpoint Feature Histogram (VFH),a descriptor for3D point cloud data that encodes geometry and viewpoint.We demonstrate experimentally on a set of60objects captured with stereo cameras that VFH can be used as a distinctive signature,allowing simultaneous recognition of the object and its pose.The pose is accurate enough for robot manipulation,and the computational cost is low enough for real time operation.VFH was designed to be robust to large surface noise and missing depth information in order to work reliably on stereo data.I.I NTRODUCTIONAs part of a long term goal to develop reliable capabilities in the area of perception for mobile manipulation,we address a table top manipulation task involving objects that can be manipulated by one robot hand.Our robot is shown in Fig.1. In order to manipulate an object,the robot must reliably identify it,as well as its6degree-of-freedom(6DOF)pose. 
This paper proposes a method to identify both at the same time,reliably and at high speed.We make the following assumptions.•Objects are rigid and relatively Lambertian.They can be shiny,but not reﬂective or transparent.•Objects are in light clutter.They can be easily seg-mented in3D and can be grabbed by the robot hand without obstruction.•The item of interest can be grabbed directly,so it is not occluded.•Items can be grasped even given an approximate pose.The gripper on our robot can open to9cm and each grip is2.5cm wide which allows an object8.5cm wide object to be grasped when the pose is off by+/-10 degrees.Despite these assumptions our problem has several prop-erties that make the task difﬁcult.•The objects need not contain texture.•Our dataset includes objects of very similar shapes,for example many slight variations of typical wine glasses.•To be usable,the recognition accuracy must be very high,typically much higher than,say,for image retrieval tasks,since false positives have very high costs and so must be kept extremely rare.•To interact usefully with humans,recognition cannot take more than a fraction of a second.This puts constraints on computation,but more importantly this precludes the use of accurate but slow3Dacquisition Fig.1.A PR2robot from Willow Garage,showing its grippers and stereo camerasusing lasers.Instead we rely on stereo data,which suffers from higher noise and missing data.Our focus is perception for mobile manipulation.Working on a mobile versus a stationary robot means that we can’t depend on instrumenting the external world with active vision systems or special lighting,but we can put such devices on the robot.In our case,we use projected texture1 to yield dense stereo depth maps at30Hz.We also cannot ensure environmental conditions.We may move from a sunlit room to a dim hallway into a room with no light at all.The projected texture gives us a fair amount of resilience to local lighting conditions as well.1Not structured 
light,this is random textureAlthough this paper focuses on3D depth features,2D imagery is clearly important,for example for shiny and transparent objects,or to distinguish items based on texture such as telling apart a Coke can from a Diet Coke can.In our case,the textured light alternates with no light to allow for2D imagery aligned with the texture based dense depth, however adding2D visual features will be studied in future work.Here,we look for an effective purely3D feature. Our philosophy is that one should use or design a recogni-tion algorithm thatﬁts one’s engineering needs such as scal-ability,training speed,incremental training needs,and so on, and thenﬁnd features that make the recognition performance of that architecture meet one’s speciﬁcations.For reasons of online training,and because of large memory availability, we choose fast approximate K-Nearest Neighbors(K-NN) implemented in the FLANN library[1]as our recognition architecture.The key contribution of this paper is then the design of a new,computationally efﬁcient3D feature that yields object recognition and6DOF pose.The structure of this paper is as follows:Related work is described in Section II.Next,we give a brief description of our system architecture in Section III.We discuss our surface normal and segmentation algorithm in Section IV followed by a discussion of the Viewpoint Feature Histogram in Section V.Experimental setup and resulting computational and recognition performance are described in Section VI. Conclusions and future work are discussed in Section VII.II.R ELATED W ORKThe problem that we are trying to solve requires global (3D object level)classiﬁcation based on estimated features. 
This has been under investigation for a long time in various researchﬁelds,such as computer graphics,robotics,and pattern matching,see[2]–[4]for comprehensive reviews.We address the most relevant work below.Some of the widely used3D point feature extraction approaches include:spherical harmonic invariants[5],spin images[6],curvature maps[7],or more recently,Point Feature Histograms(PFH)[8],and conformal factors[9]. Spherical harmonic invariants and spin images have been successfully used for the problem of object recognition for densely sampled datasets,though their performance seems to degrade for noisier and sparser datasets[4].Our stereo data is noisier and sparser than typical line scan data which motivated the use of our new features.Conformal factors are based on conformal geometry,which is invariant to isometric transformations,and thus obtains good results on databases of watertight models.Its main drawback is that it can only be applied to manifold meshes which can be problematic in stereo.Curvature maps and PFH descriptors have been studied in the context of local shape comparisons for data registration.A side study[10]applied the PFH descriptors to the problem of surface classiﬁcation into3D geometric primitives,although only for data acquired using precise laser sensors.A different pointﬁngerprint representation using the projections of geodesic circles onto the tangent plane at a point p i was proposed in[11]for the problem of surface registration.As the authors note,geodesic distances are more sensitive to surface sampling noise,and thus are unsuitable for real sensed data without a priori smoothing and reconstruction.A decomposition of objects into parts learned using spin images is presented in[12]for the problem of vehicle identiﬁcation.Methods relying on global features include descriptors such as Extended Gaussian Images(EGI)[13],eigen shapes[14],or shape distributions[15].The latter samples statistics of the entire object and represents them as 
distri-butions of shape properties,however they do not take into account how the features are distributed over the surface of the object.Eigen shapes show promising results but they have limits on their discrimination ability since important higher order variances are discarded.EGIs describe objects based on the unit normal sphere,but have problems handling arbitrarily curved objects.The work in[16]makes use of spin-image signatures and normal-based signatures to achieve classiﬁcation rates over 90%with synthetic and CAD model datasets.The datasets used however are very different than the ones acquired using noisy640×480stereo cameras such as the ones used in our work.In addition,the authors do not provide timing information on the estimation and matching parts which is critical for applications such as ours.A system for fully automatic3D model-based object recognition and segmentation is presented in[17]with good recognition rates of over95%for a database of55objects.Unfortunately,the computational performance of the proposed method is not suitable for real-time as the authors report the segmentation of an object model in a cluttered scene to be around2 minutes.Moreover,the objects in the database are scanned using a high resolution Minolta scanner and their geometric shapes are very different.As shown in Section VI,the objects used in our experiments are much more similar in terms of geometry,so such a registration-based method would fail. 
In[18],the authors propose a system for recognizing3D objects in photographs.The techniques presented can only be applied in the presence of texture information,and require a cumbersome generation of models in an ofﬂine step,which makes this unsuitable for our work.As previously presented,our requirements are real-time object recognition and pose identiﬁcation from noisy real-world datasets acquired using projective texture stereo cam-eras.Our3D object classiﬁcation is based on an extension of the recently proposed Fast Point Feature Histogram(FPFH) descriptors[8],which record the relative angular directions of surface normals with respect to one another.The FPFH performs well in classiﬁcation applications and is robust to noise but it is invariant to viewpoint.This paper proposes a novel descriptor that encodes the viewpoint information and has two parts:(1)an extended FPFH descriptor that achieves O(k∗n)to O(n)speed up over FPFHs where n is the number of points in the point cloud and k is how many points used in each local neighborhood;(2)a new signature that encodes important statistics between the viewpoint and the surface normals on the object.We callthis new feature the Viewpoint Feature Histogram(VFH)as detailed below.III.A RCHITECTUREOur system architecture employs the following processing steps:•Synchronized,calibrated and epipolar aligned left and right images of the scene are acquired.•A dense depth map is computed from the stereo pair.•Surface normals in the scene are calculated.•Planes are identiﬁed and segmented out and the remain-ing point clouds from non-planar objects are clustered in Euclidean space.•The Viewpoint Feature Histogram(VFH)is calculated over large enough objects(here,objects having at least 100points).–If there are multiple objects in a scene,they are processed front to back relative to the camera.–Occluded point clouds with less than75%of the number of points of the frontal objects are notedbut not identiﬁed.•Fast approximate K-NN is 
used to classify the object and its view.Some steps from the early processing pipeline are shown in Figure2.Shown left to right,top to bottom in thatﬁgure are: a moderately complex scene with many different vertical and horizontal surfaces,the resulting depth map,the estimated surface normals and the objects segmented from the planar surfaces in thescene.Fig.2.Early processing steps row wise,top to bottom:A scene,its depthmap,surface normals and segmentation into planes and outlier objects.For computing3D depth maps,we use640x480stereowith textured light.The textureﬂashes on only very brieﬂyas the cameras take a picture resulting in lights that look dimto the human eye but bright to the camera.Textureﬂashesonly every other frame so that raw imagery without texturecan be gathered alternating with densely textured scenes.Thestereo has a38degreeﬁeld of view and is designed for closein manipulation tasks,thus the objects that we deal with arefrom0.5to1.5meters away.The stereo algorithm that weuse was developed in[19]and uses the implementation in theOpenCV library[20]as described in detail in[21],runningat30Hz.IV.S URFACE N ORMALS AND3D S EGMENTATIONWe employ segmentation prior to the actual feature es-timation because in robotic manipulation scenarios we areonly interested in certain precise parts of the environment,and thus computational resources can be saved by tacklingonly those parts.Here,we are looking to manipulate reach-able objects that lie on horizontal surfaces.Therefore,oursegmentation scheme proceeds at extracting these horizontalsurfacesﬁrst.Fig.3.From left to right:raw point cloud dataset,planar and clustersegmentation,more complex segmentation.Compared to our previous work[22],we have improvedthe planar segmentation algorithms by incorporating surfacenormals into the sample selection and model estimationsteps.We also took care to carefully build SSE aligneddata structures in memory for any computationally expensiveoperation.By rejecting candidates 
which do not supportour constraints,our system can segment data at about7Hz,including normal estimation,on a regular Core2Duo laptopusing a single core.To get frame rate performance(realtime),we use a voxelized data structure over the input point cloudand downsample with a leaf size of0.5cm.The surfacenormals are therefore estimated only for the downsampledresult,but using the information in the original point cloud.The planar components are extracted using a RMSAC(Ran-domized MSAC)method that takes into account weightedaverages of distances to the model together with the angleof the surface normals.We then select candidate table planesusing a heuristic combining the number of inliers whichsupport the planar model as well as their proximity to thecamera viewpoint.This approach emphasizes the part of thespace where the robot manipulators can reach and grasp theobjects.The segmentation of object candidates supported by thetable surface is performed by looking at points whose projec-tion falls inside the bounding2D polygon for the table,andapplying single-link clustering.The result of these processingsteps is a set of Euclidean point clusters.This works toreliably segment objects that are separated by about half theirminimum radius from each other.An can be seen in Figure3.To resolve further ambiguities with to the chosen candidate clusters,such as objects stacked on other planar objects(such as books),we repeat the mentioned step by treating each additional horizontal planar structure on top of the table candidates as a table itself and repeating the segmentation step(see results in Figure3).We emphasize that this segmentation step is of extreme importance for our application,because it allows our methods to achieve favorable computational performances by extract-ing only the regions of interest in a scene(i.e.,objects that are to be manipulated,located on horizontal surfaces).In cases where our“light clutter”assumption does not hold and the geometric Euclidean 
clustering is prone to failure, a more sophisticated segmentation scheme based on texture properties could be implemented.

V. VIEWPOINT FEATURE HISTOGRAM

In order to accurately and robustly classify points with respect to their underlying surface, we borrow ideas from the recently proposed Point Feature Histogram (PFH) [10]. The PFH is a histogram that collects the pairwise pan, tilt and yaw angles between every pair of normals on a surface patch (see Figure 4). In detail, for a pair of 3D points p_i, p_j and their estimated surface normals n_i, n_j, the set of normal angular deviations can be estimated as:

$\alpha = v \cdot n_j, \qquad \phi = \frac{u \cdot (p_j - p_i)}{d}, \qquad \theta = \arctan(w \cdot n_j,\; u \cdot n_j) \qquad (1)$

where u, v, w represent a Darboux frame coordinate system chosen at p_i, and d = ‖p_j − p_i‖ is the distance between the two points. Then, the Point Feature Histogram at a patch of points P = {p_i} with i = {1 ··· n} captures all the sets of ⟨α, φ, θ⟩ between all pairs of ⟨p_i, p_j⟩ from P, and bins the results in a histogram. The bottom left part of Figure 4 presents the selection of the Darboux frame and a graphical representation of the three angular features.

Because all possible pairs of points are considered, the computational complexity of a PFH is O(n^2) in the number of surface normals n. In order to make a more efficient algorithm, the Fast Point Feature Histogram [8] was developed. The FPFH measures the same angular features as PFH, but estimates the sets of values only between every point and its k nearest neighbors, followed by a reweighting of the resultant histogram of a point with the neighboring histograms, thus reducing the computational complexity to O(k∗n).

Our past work [22] has shown that a global descriptor (GFPFH) can be constructed from the classification results of many local FPFH features, and used on a wide range of confusable objects (20 different types of glasses, bowls, mugs) in 500 scenes, achieving 96.69% on object class recognition. However, the categorized objects were only split into 4 distinct classes, which leaves the scaling problem open. Moreover, the GFPFH is susceptible to the errors of the local 
classiﬁcation results,and is more cumbersome to estimate.In any case,for manipulation,we require that the robot not only identiﬁes objects,but also recognizes their6DOF poses for grasping.FPFH is invariant both to object scale (distance)and object pose and so cannot achieve the latter task.In this work,we decided to leverage the strong recognition results of FPFH,but to add in viewpoint variance while retaining invariance to scale,since the dense stereo depth map gives us scale/distance directly.Our contribution to the problem of object recognition and pose identiﬁcation is to extend the FPFH to be estimated for the entire object cluster (as seen in Figure4),and to compute additional statistics between the viewpoint direction and the normals estimated at each point.To do this,we used the key idea of mixing the viewpoint direction directly into the relative normal angle calculation in the FPFH.Figure6presents this idea with the new feature consisting of two parts:(1)a viewpoint direction component(see Figure5)and(2)a surface shape component comprised of an extended FPFH(see Figure4).The viewpoint component is computed by collecting a histogram of the angles that the viewpoint direction makes with each normal.Note,we do not mean the view angle to each normal as this would not be scale invariant,but instead we mean the angle between the central viewpoint direction translated to each normal.The second component measures the relative pan,tilt and yaw angles as described in[8],[10] but now measured between the viewpoint direction at the central point and each of the normals on the surface.We call the new assembled feature the Viewpoint Feature Histogram (VFH).Figure6presents the resultant assembled VFH for a random object.piαFig.5.The Viewpoint Feature Histogram is created from the extendedFast Point Feature Histogram as seen in Figure4together with the statisticsof the relative angles between each surface normal to the central viewpointdirection.The computational 
complexity of VFH is O(n). In our experiments, we divided the viewpoint angles into 128 bins and the α, φ and θ angles into 45 bins each, for a total of 263 dimensions. The estimation of a VFH takes about 0.3 ms on average on a 2.23 GHz single core of a Core 2 Duo machine using optimized SSE instructions.

Fig. 4. The extended Fast Point Feature Histogram collects the statistics of the relative angles between the surface normals at each point to the surface normal at the centroid of the object. The bottom left part of the figure describes the three angular features for an example pair of points. [Figure labels: points p_1 to p_11 around the centroid c, with n_c = u, v = (p_5 − c) × u, w = u × v, and the angles α, φ, θ.]

Fig. 6. An example of the resultant Viewpoint Feature Histogram for one of the objects used. Note the two concatenated components. [Panel labels: viewpoint component, extended FPFH component.]

VI. VALIDATION AND EXPERIMENTAL RESULTS

To evaluate our proposed descriptor and system architecture, we collected a large dataset consisting of over 60 IKEA kitchenware objects, as shown in Figure 8. These objects consisted of many kinds each of: wine glasses, tumblers, drinking glasses, mugs, bowls, and a couple of boxes. In each of these categories, many of the objects were distinguished only by subtle variations in shape, as can be seen for example in the confusions in Figure 10. We captured over 54000 scenes of these objects by spinning them on a turn table 180° (see footnote 2) at each of 2 offsets on a platform that tilted 0, 8, 16, 22 and 30 degrees. Each 180° rotation was captured with about 90 images. The turn table is shown in Fig. 7. We additionally worked with a subset of 20 objects in 500 lightly cluttered scenes with varying arrangements of horizontal and vertical surfaces, using the same data set provided in [22]. No pose information was available for this second dataset, so we only ran experiments separately for object recognition results.

Footnote 2: We didn't go 360 degrees so that we could keep the calibration box in view.

Fig. 7. The turn table used to collect views of objects with known orientation.

The complete 
source code used to generate our experimental results together with both object databases are available under a BSD open source license in our ROS repository at Willow Garage (footnote 3). We are currently taking steps towards creating a web page with complete tutorials on how to fully replicate the experiments presented herein.

Both the objects in the [22] dataset and the ones we acquired constitute valid examples of objects of daily use that our robot needs to be able to reliably identify and manipulate. While 60 objects is far from the number of objects the robot eventually needs to be able to recognize, it may be enough if we assume that the robot knows what context (kitchen table, workbench, coffee table) it is in, so that it needs only discriminate among a small context-dependent set of objects.

Fig. 8. The complete set of IKEA objects used for the purpose of our experiments. All transparent glasses have been painted white to obtain 3D information during the acquisition process.

TABLE I. Results for object recognition and pose detection over 54000 scenes plus 500 lightly cluttered scenes.

Method    Object Recognition    Pose Estimation
VFH       98.52%                98.52%
Spin      75.3%                 61.2%

The geometric variations between objects are subtle, and the data acquired is noisy due to the stereo sensor characteristics, yet the perception system has to work well enough to differentiate between, say, glasses that look similar but serve different purposes (e.g., a wine glass versus a brandy glass). As presented in Section II, the performance of the 3D descriptors proposed in the literature degrades on noisier datasets. One of the most popular 3D descriptors to date used on datasets acquired using sensing devices similar to ours (e.g., similar noise characteristics) is the spin image [6]. To validate the VFH feature we thus compare it to the spin image, by running the same experiments multiple times. 
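To make the descriptor being validated more concrete, the two components described in Section V can be sketched as follows. This is a minimal illustration, not the authors' implementation: it follows the Darboux frame shown in Fig. 4 (u = n_c, v = (p_i − c) × u, w = u × v), uses the 128 + 3 × 45 = 263 bin layout reported above, and assumes the histogram ranges, the per-histogram normalization, the ordering of the components, and the function name `vfh_sketch`.

```python
import numpy as np

def _normalized_hist(values, bins, lo, hi):
    """Histogram of values over [lo, hi], scaled so the bins sum to 1."""
    counts, _ = np.histogram(np.clip(values, lo, hi), bins=bins, range=(lo, hi))
    return counts / max(counts.sum(), 1)

def vfh_sketch(points, normals, viewpoint):
    """Illustrative 263-bin viewpoint + shape histogram for one object cluster.

    points, normals : (n, 3) arrays with unit normals; viewpoint : (3,) position.
    Bin ranges and normalization are assumptions, not taken from the paper.
    """
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    c = points.mean(axis=0)                       # centroid of the cluster
    n_c = normals.mean(axis=0)
    n_c = n_c / np.linalg.norm(n_c)               # normal assigned to the centroid

    # Viewpoint component: angle between the central viewpoint direction
    # (translated to every point) and each surface normal, 128 bins.
    view_dir = np.asarray(viewpoint, dtype=float) - c
    view_dir = view_dir / np.linalg.norm(view_dir)
    view_hist = _normalized_hist(normals @ view_dir, 128, -1.0, 1.0)

    # Shape component: alpha, phi, theta of Eq. (1) between the centroid
    # and each point, using the Darboux frame u, v, w from Fig. 4.
    diff = points - c
    dist = np.linalg.norm(diff, axis=1)
    u = n_c
    v = np.cross(diff, u)
    v_len = np.linalg.norm(v, axis=1)
    ok = v_len > 1e-12                            # skip degenerate frames
    v = v[ok] / v_len[ok, None]
    w = np.cross(u, v)
    n = normals[ok]
    alpha = np.einsum("ij,ij->i", v, n)
    phi = (diff[ok] @ u) / dist[ok]
    theta = np.arctan2(np.einsum("ij,ij->i", w, n), n @ u)
    shape_hist = np.concatenate([
        _normalized_hist(alpha, 45, -1.0, 1.0),
        _normalized_hist(phi, 45, -1.0, 1.0),
        _normalized_hist(theta, 45, -np.pi, np.pi),
    ])
    return np.concatenate([view_hist, shape_hist])

# Example: a noisy-free hemisphere seen from above (normals equal positions).
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
pts[:, 2] = np.abs(pts[:, 2])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
descriptor = vfh_sketch(pts, pts.copy(), viewpoint=[0.0, 0.0, 3.0])
```

Every point contributes once to each histogram, so the cost grows linearly with the number of points, in line with the O(n) complexity stated for VFH.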
For the reasons given in Section I,we base our recogni-tion architecture on fast approximate K-Nearest Neighbors (KNN)searches using kd-trees[1].The construction of the tree and the search of the nearest neighbors places an equal weight on each histogram bin in the VFH and spin images features.Figure11shows time stop sequentially aggregated exam-ples of the training set.Figure12shows example recognition results for VFH.Andﬁnally,Figure10gives some idea of the performance differences between VFH and spin images. The object recognition rates over the lightly cluttered dataset were98.1%for VFH and73.2%for spin images.The overall recognition rates for VFH andSpin imagesare shown inTable I where VFH handily outperforms spin images for both object recognition and pose.Fig.9.Data training performed in simulation.Theﬁgure presents a snapshot of the simulation with a water bottle from the object model database and the corresponding stereo point cloud output.VII.C ONCLUSIONS AND F UTURE W ORKIn this paper we presented a novel3D feature descriptor, the Viewpoint Feature Histogram(VFH),useful for object recognition and6DOF pose identiﬁcation for application where a priori segmentation is possible.The high recognition performance and fast computational properties,demonstrated the superiority of VFH over spin images on a large scale dataset consisting of over54000scenes with pared to other similar initiatives,our architecture works well with noisy data acquired using standard stereo cameras in real-time,and can detect subtle variations in the geometry of objects.Moreover,we presented an integrated approach for both recognition and6DOF pose identiﬁcation for untextured objects,the latter being of extreme importance for mobile manipulation and grasping applications.Fig.10.VFH consistently outperforms spin images for both recognition and for pose.The bottom of the ﬁgure presents an example result of VFH run on a mug.The bottom left corner is the learned models and the matches go 
from best to worse from left to right across the bottom followed by left to right across the top.The top part of the ﬁgure presents the results obtained using a spin image.For VFH,3of 5object recognition and 3of 5pose results are correct.For spin images,2of 5object recognition results are correct and 0of 5pose results arecorrect.Fig.11.Sequence examples of object training with calibration box on the outside.An automatic training pipeline can be integrated with our 3D simulator based on Gazebo [23]as depicted in ﬁgure 9,where the stereo point cloud is generated from perfectly rectiﬁed camera images.We are currently working on making both the fully an-notated database of objects together with the source codeof VFH available to the research community as open source.The preliminary results of our efforts can already be checked from the trunk of our Willow Garage ROS repository,but we are taking steps towards generating a set of tutorials on how to replicate and extend the experiments presented in this paper.R EFERENCES[1]M.Muja and D.G.Lowe,“Fast approximate nearest neighbors withautomatic algorithm conﬁguration,”VISAPP ,2009.[2]J.W.Tangelder and R.C.Veltkamp,“A Survey of Content Based3D Shape Retrieval Methods,”in SMI ’04:Proceedings of the Shape Modeling International ,2004,pp.145–156.[3] A.K.Jain and C.Dorai,“3D object recognition:Representation andmatching,”Statistics and Computing ,vol.10,no.2,pp.167–182,2000.[4] A.D.Bimbo and P.Pala,“Content-based retrieval of 3D models,”ACM Trans.Multimedia mun.Appl.,vol.2,no.1,pp.20–43,2006.[5]G.Burel and H.H´e nocq,“Three-dimensional invariants and theirapplication to object recognition,”Signal Process.,vol.45,no.1,pp.1–22,1995.[6] A.Johnson and M.Hebert,“Using spin images for efﬁcient objectrecognition in cluttered 3D scenes,”IEEE Transactions on Pattern Analysis and Machine Intelligence ,May 1999.[7]T.Gatzke,C.Grimm,M.Garland,and S.Zelinka,“Curvature Mapsfor Local Shape Comparison,”in SMI ’05:Proceedings of the 
Inter-national Conference on Shape Modeling and Applications 2005(SMI’05),2005,pp.246–255.[8]R.B.Rusu,N.Blodow,and M.Beetz,“Fast Point Feature Histograms(FPFH)for 3D Registration,”in ICRA ,2009.[9] B.-C.M.and G.C.,“Characterizing shape using conformal factors,”in Eurographics Workshop on 3D Object Retrieval ,2008.[10]R.B.Rusu,Z.C.Marton,N.Blodow,and M.Beetz,“LearningInformative Point Classes for the Acquisition of Object Model Maps,”in In Proceedings of the 10th International Conference on Control,Automation,Robotics and Vision (ICARCV),2008.[11]Y .Sun and M.A.Abidi,“Surface matching by 3D point’s ﬁngerprint,”in Proc.IEEE Int’l Conf.on Computer Vision ,vol.II,2001,pp.263–269.[12] D.Huber,A.Kapuria,R.R.Donamukkala,and M.Hebert,“Parts-based 3D object classiﬁcation,”in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 04),June 2004.[13] B.K.P.Horn,“Extended Gaussian Images,”Proceedings of the IEEE ,vol.72,pp.1671–1686,1984.[14]R.J.Campbell and P.J.Flynn,“Eigenshapes for 3D object recognitionin range data,”in Computer Vision and Pattern Recognition,1999.IEEE Computer Society Conference on.,pp.505–510.[15]R.Osada,T.Funkhouser,B.Chazelle,and D.Dobkin,“Shape distri-butions,”ACM Transactions on Graphics ,vol.21,pp.807–832,2002.[16]X.Li and I.Guskov,“3D object recognition from range images usingpyramid matching,”in ICCV07,2007,pp.1–6.[17] A.S.Mian,M.Bennamoun,and R.Owens,“Three-dimensionalmodel-based object recognition and segmentation in cluttered scenes,”IEEE Trans.Pattern Anal.Mach.Intell ,vol.28,pp.1584–1601,2006.[18] F.Rothganger,zebnik,C.Schmid,and J.Ponce,“3d objectmodeling and recognition using local afﬁne-invariant image descriptors and multi-view spatial constraints,”International Journal of Computer Vision ,vol.66,p.2006,2006.[19]K.Konolige,“Small vision systems:hardware and implementation,”in In Eighth International Symposium on Robotics Research ,1997,pp.111–116.[20]“OpenCV ,Open source Computer Vision 
library,”in/wiki/,2009.[21]G.Bradski and A.Kaehler,“Learning OpenCV:Computer Vision withthe OpenCV Library,”in O’Reilly Media,Inc.,2008,pp.415–453.[22]R. B.Rusu, A.Holzbach,M.Beetz,and G.Bradski,“Detectingand segmenting objects for mobile manipulation,”in ICCV S3DV workshop ,2009.[23]N.Koenig and A.Howard,“Design and use paradigms for gazebo,an open-source multi-robot simulator,”in IEEE/RSJ International Conference on Intelligent Robots and Systems ,Sendai,Japan,Sep 2004,pp.2149–2154.。

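Principles of the tracking-by-segmentation method

"Tracking by segmentation" is a method used for object tracking in computer vision.

Its principle is to segment the target out of each video frame and then track the motion of that target across consecutive frames.

The basic steps of the tracking-by-segmentation method are as follows:

Target segmentation: First, the image region containing the target object is segmented from the video frame. This usually requires an image segmentation algorithm, such as background subtraction, threshold segmentation, edge detection, or semantic segmentation. The purpose of target segmentation is to separate the target from the background so that it can be tracked.

Feature extraction: Once the target has been segmented successfully, features are extracted from the target region to describe its appearance and shape. These features can include color histograms, texture features, shape descriptors, and so on. They are used to match the target in subsequent frames.

Motion estimation: In the following video frames, the motion of the target is estimated by comparing the target features in the current frame with those in previous frames. This can be done with different methods, such as optical flow estimation or appearance model matching. With motion estimation, the system can predict the position of the target in the next frame.

Target matching and tracking: Using the target's features and motion information, the target is matched and tracked across consecutive frames. Target matching is a key step: it determines the position of the target in the new frame and so ensures the continuity of the track. Matching can be implemented with various methods, including correlation filters, Kalman filters, and particle filters.

Updating the target model: Over time, the appearance of the target may change, for example because of changing illumination, occlusion, or the motion of the target itself. The target model therefore needs to be updated regularly to keep the tracking accurate. This may involve online learning or model adaptation techniques.

Termination conditions: Tracking can end when certain termination conditions are met, for example when the target is no longer visible, when tracking fails, or when the user stops the tracking. On termination, the system may output the tracking result or summarize the trajectory of the target.

The advantage of the tracking-by-segmentation method is that it can track targets against complex backgrounds and is relatively robust to changes in the target's appearance and shape.

However, it also faces challenges: occlusion, illumination changes, and changes in target shape can all cause tracking to fail.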
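The steps listed above can be sketched as a small loop. This is a deliberately simplified illustration under stated assumptions: segmentation is a fixed intensity threshold, the only feature is the mask centroid, motion is predicted with a constant-velocity model, and matching is a distance gate around the prediction. The names `segment`, `centroid`, `track`, `make_frames`, and the `gate` parameter are invented for this example.

```python
import numpy as np

def segment(frame, threshold=0.5):
    """Target segmentation: a plain intensity threshold (assumed for the demo)."""
    return frame > threshold

def centroid(mask):
    """Feature extraction: centroid (row, col) of the mask, or None if empty."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return np.array([rows.mean(), cols.mean()])

def track(frames, threshold=0.5, gate=5.0):
    """Follow one target through the frames and return its trajectory."""
    trajectory = []
    velocity = np.zeros(2)
    for frame in frames:
        position = centroid(segment(frame, threshold))
        if position is None:
            break                                  # termination: target not visible
        if trajectory:
            predicted = trajectory[-1] + velocity  # motion estimation
            if np.linalg.norm(position - predicted) > gate:
                break                              # termination: match failed
            velocity = position - trajectory[-1]   # model update
        trajectory.append(position)
    return np.array(trajectory)

def make_frames(count=5, size=20):
    """Synthetic sequence: a 3x3 bright square moving 2 pixels right per frame."""
    frames = []
    for k in range(count):
        frame = np.zeros((size, size))
        frame[5:8, 2 + 2 * k:5 + 2 * k] = 1.0
        frames.append(frame)
    frames.append(np.zeros((size, size)))          # final frame: target gone
    return frames

path = track(make_frames())
```

A practical tracker would replace each stage with one of the stronger options named in the text, for example semantic segmentation for the mask or a Kalman filter for the prediction, while keeping the same loop structure.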

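Acoustic Emission Feature Extraction Based on Fractal Dimension and Independent Component Analysis

Liu Guohua, Huang Pingjie, Gong Xiang, Gu Jiang, Zhou Zekui
(State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, Zhejiang 310027)

Abstract: To address the influence of noise on the fractal dimension of acoustic emission signals, a feature extraction method for the acoustic emission signals of structural materials based on fractal dimension and independent component analysis (ICA) is proposed. The paper first introduces the concept of fractal dimension, and theoretically [...]

[...] the phenomenon in which energy is released in the form of elastic waves when [...] cracks. Because acoustic emission signals originate from the defect itself, interfere little with the object under test, and allow long-term continuous monitoring of an object's safety, they have broad application prospects in nondestructive testing. Current methods for acquiring and processing acoustic emission signals fall into two main categories: (1) representing the features of the acoustic emission signal by several simplified waveform feature parameters, which are then analyzed and processed; (2) storing and recording the waveform of the acoustic emission signal and performing spectral analysis on it. Simplified waveform feature parameter analysis is the classic acoustic emission signal analysis method that has been widely used since the 1950s, and currently in acoustic emission testing [...]

1 Fractal-dimension characteristics of acoustic emission signals

Fractal geometry is a branch of mathematics developed in recent years. The method processes signals that exhibit self-similarity or statistical self-similarity [...]

[...] obvious differences, yet also a certain similarity. However, experiments have shown [4-5] that acoustic emission signals are fractal not only in their distribution in the time domain but also in their distribution in space. In practice, acoustic emission signals undergo frequent refraction and reflection along their spatial propagation path and are mixed with various kinds of noise, so a simple fractal-dimension analysis easily leads to misjudgment of the final result. This paper therefore uses independent component analysis (ICA) to preprocess the signal, so that the acoustic emission signal separated from the noise sources is extracted and then subjected to fractal-dimension processing, which in turn can improve the accuracy of integrity and safety inspection of structural materials.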

A Fast and Accurate Plane Detection Algorithm for Large Noisy Point Clouds Using Filtered Normals

A Fast and Accurate Plane Detection Algorithm for Large Noisy Point Clouds Using Filtered Normals and Voxel Growing

Jean-Emmanuel Deschaud, François Goulette
Mines ParisTech, CAOR - Centre de Robotique, Mathématiques et Systèmes, 60 Boulevard Saint-Michel, 75272 Paris Cedex 06
jean-emmanuel.deschaud@mines-paristech.fr, francois.goulette@mines-paristech.fr

Abstract

With the improvement of 3D scanners, we produce point clouds with more and more points, often exceeding millions of points. We therefore need a fast and accurate plane detection algorithm to reduce data size. In this article, we present a fast and accurate algorithm to detect planes in unorganized point clouds using filtered normals and voxel growing. Our work is based on a first step of estimating better normals at the data points, even in the presence of noise. In a second step, we compute a local plane score at each point. Then, we select the best local seed plane and, in a third step, start a fast and robust region growing by voxels that we call voxel growing. We have evaluated and tested our algorithm on different kinds of point clouds and compared its performance to other algorithms.

1. Introduction

With the growing availability of 3D scanners, we are now able to produce large datasets with millions of points. It is necessary to reduce data size, to decrease the noise and at the same time to increase the quality of the model. It is interesting to model planar regions of these point clouds by planes. In fact, plane detection is generally a first step of segmentation, but it can be used for many applications. It is useful in computer graphics to model the environment with basic geometry. It is used, for example, in modeling to detect building facades before classification. Robots do Simultaneous Localization and Mapping (SLAM) by detecting planes of the environment. In our laboratory, we wanted to detect small and large building planes in point clouds of urban environments with millions of points for modeling. 
As mentioned in[6],the accuracy of the plane detection is important for after-steps of the modeling pipeline.We also want to be fast to be able to process point clouds with mil-lions of points.We present a novel algorithm based on re-gion growing with improvements in normal estimation and growing process.For our method,we are generic to work on different kinds of data like point clouds fromﬁxed scan-ner or from Mobile Mapping Systems(MMS).We also aim at detecting building facades in urban point clouds or little planes like doors,even in very large data sets.Our input is an unorganized noisy point cloud and with only three”in-tuitive”parameters,we generate a set of connected compo-nents of planar regions.We evaluate our method as well as explain and analyse the signiﬁcance of each parameter. 2.Previous WorksAlthough there are many methods of segmentation in range images like in[10]or in[3],three have been thor-oughly studied for3D point clouds:region-growing, hough-transform from[14]and Random Sample Consen-sus(RANSAC)from[9].The application of recognising structures in urban laser point clouds is frequent in literature.Bauer in[4]and Boulaassal in[5]detect facades in dense3D point cloud by a RANSAC algorithm.V osselman in[23]reviews sur-face growing and3D hough transform techniques to de-tect geometric shapes.Tarsh-Kurdi in[22]detect roof planes in3D building point cloud by comparing results on hough-transform and RANSAC algorithm.They found that RANSAC is more efﬁcient than theﬁrst one.Chao Chen in[6]and Yu in[25]present algorithms of segmentation in range images for the same application of detecting planar regions in an urban scene.The method in[6]is based on a region growing algorithm in range images and merges re-sults in one labelled3D point cloud.[25]uses a method different from the three we have cited:they extract a hi-erarchical subdivision of the input image built like a graph where leaf nodes represent planar regions.There are also other methods like 
Bayesian techniques. In [16] and [8], smoothed surfaces are obtained from noisy point clouds with objects modeled by probability distributions, and it seems possible to extend this idea to point cloud segmentation. But techniques based on Bayesian statistics need to optimize a global statistical model, and it is then difficult to process point clouds larger than one million points.

We present below an analysis of the two main methods used in the literature: RANSAC and region growing. The Hough-transform algorithm is too time consuming for our application. To compare the complexity of the algorithms, we take a point cloud of size $N$ with only one plane $P$ of size $n$. We suppose that we want to detect this plane $P$ and we define $n_{min}$ as the minimum size of the plane we want to detect. The size of a plane is the area of the plane. If the data density is uniform in the point cloud, then the size of a plane can be specified by its number of points.

2.1. RANSAC

RANSAC is an algorithm initially developed by Fischler and Bolles in [9] that allows the fitting of models without trying all possibilities. RANSAC is based on the probability of detecting a model using the minimal set required to estimate the model. To detect a plane with RANSAC, we choose 3 random points (enough to estimate a plane). We compute the plane parameters with these 3 points. Then a score function is used to determine how good the model is for the remaining points. Usually, the score is the number of points belonging to the plane. With noise, a point belongs to a plane if the distance from the point to the plane is less than a parameter $\gamma$. In the end, we keep the plane with the best score. The probability of getting the plane in the first trial is $p = (n/N)^3$. Therefore the probability of getting it in $T$ trials is $p = 1 - (1 - (n/N)^3)^T$. Using equation 1 and supposing $n_{min}/N \ll 1$, we know the number $T_{min}$ of minimal trials to have a probability $p_t$ to get planes of size at least $n_{min}$:

$$T_{min} = \frac{\log(1 - p_t)}{\log\left(1 - (n_{min}/N)^3\right)} \approx \log\left(\frac{1}{1 - p_t}\right)\left(\frac{N}{n_{min}}\right)^3. \qquad (1)$$

For each trial, we test all data points to compute the
score of a plane. The RANSAC algorithm complexity lies in $O(N (N/n_{min})^3)$ when $n_{min}/N \ll 1$, and $T_{min} \to 0$ when $n_{min} \to N$. RANSAC is therefore very efficient in detecting large planes in noisy point clouds, i.e. when the ratio $n_{min}/N$ is close to 1, but very slow to detect small planes in large point clouds, i.e. when $n_{min}/N \ll 1$. After selecting the best model, another step is to extract the largest connected component of each plane. Connected components mean that the minimum distance between each point of the plane and the other points is smaller than a fixed parameter.

Schnabel et al. [20] bring two optimizations to RANSAC: the point selection is done locally and the score function has been improved. An octree is first created from the point cloud. Points used to estimate plane parameters are chosen locally at a random depth of the octree. The score function is also different from RANSAC: instead of testing all points for one model, they test only a random subset and find the score by interpolation. The algorithm complexity lies in $O(N r \frac{4 N d}{n_{min}})$ where $r$ is the number of random subsets for the score function and $d$ is the maximum octree depth.
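As a numerical check on the trial count of equation 1 in section 2.1, the following minimal sketch evaluates both the exact expression and its small-ratio approximation. The function names and the sample sizes are illustrative assumptions, not part of the original paper.

```python
import math


def ransac_min_trials(n_min, n_total, p_t):
    """Exact trial count from equation 1, rounded up to a whole trial."""
    p_single = (n_min / n_total) ** 3  # one trial must draw 3 points on the plane
    if p_single >= 1.0:
        return 1  # every sample lies on the plane
    return math.ceil(math.log(1.0 - p_t) / math.log(1.0 - p_single))


def ransac_min_trials_approx(n_min, n_total, p_t):
    """Approximation of equation 1, valid when n_min / n_total is small."""
    return math.log(1.0 / (1.0 - p_t)) * (n_total / n_min) ** 3


if __name__ == "__main__":
    # Hypothetical cloud of one million points, 99% success probability.
    for n_min in (500_000, 100_000, 10_000):
        exact = ransac_min_trials(n_min, 1_000_000, 0.99)
        approx = ransac_min_trials_approx(n_min, 1_000_000, 0.99)
        print(n_min, exact, round(approx))
```

The cubic growth in $N/n_{min}$ is what makes plain RANSAC slow for small planes in large clouds: each trial additionally costs a pass over all $N$ points.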
Their algorithm improves the planes detection speed but its complexity lies in O(N2)and it becomes slow on large data sets.And again we have to extract the largest connected component of each plane.2.2.Region GrowingRegion Growing algorithms work well in range images like in[18].The principle of region growing is to start with a seed region and to grow it by neighborhood when the neighbors satisfy some conditions.In range images,we have the neighbors of each point with pixel coordinates.In case of unorganized3D data,there is no information about the neighborhood in the data structure.The most common method to compute neighbors in3D is to compute a Kd-tree to search k nearest neighbors.The creation of a Kd-tree lies in O(NlogN)and the search of k nearest neighbors of one point lies in O(logN).The advantage of these region growing methods is that they are fast when there are many planes to extract,robust to noise and extract the largest con-nected component immediately.But they only use the dis-tance from point to plane to extract planes and like we will see later,it is not accurate enough to detect correct planar regions.Rabbani et al.[19]developped a method of smooth area detection that can be used for plane detection.Theyﬁrst estimate the normal of each point like in[13].The point with the minimum residual starts the region growing.They test k nearest neighbors of the last point added:if the an-gle between the normal of the point and the current normal of the plane is smaller than a parameterαthen they add this point to the smooth region.With Kd-tree for k nearest neighbors,the algorithm complexity is in O(N+nlogN). 
The complexity seems to be low, but in the worst case, when $n/N \approx 1$, for example for facade detection in point clouds, the complexity becomes $O(N \log N)$.

3. Voxel Growing

3.1. Overview

In this article, we present a new algorithm adapted to large data sets of unorganized 3D points and optimized to be accurate and fast. Our plane detection method works in three steps. In the first part, we compute a better estimation of the normal in each point by a filtered weighted plane fitting. In a second step, we compute the score of local planarity in each point. We select the best seed point that represents a good seed plane and, in the third part, we grow this seed plane by adding all points close to the plane. The growing step is based on a voxel growing algorithm. The filtered normals, the score function and the voxel growing are innovative contributions of our method.

As an input, we need dense point clouds related to the level of detail we want to detect. As an output, we produce connected components of planes in the point cloud. This notion of connected components is linked to the data density. With our method, the connected components of planes detected are linked to the parameter $d$ of the voxel grid.

Our method has 3 "intuitive" parameters: $d$, $area_{min}$ and $\gamma$. They are "intuitive" because they are linked to physical measurements. $d$ is the voxel size used in voxel growing and also represents the connectivity of points in detected planes. $\gamma$ is the maximum distance between a point of a plane and the plane model; it represents the plane thickness and is linked to the point cloud noise. $area_{min}$ represents the minimum area of planes we want to keep.

3.2. Details

3.2.1 Local Density of Point Clouds

In a first step, we compute the local density of point clouds like in [17]. For that, we find the radius $r_i$ of the sphere containing the $k$ nearest neighbors of point $i$. Then we calculate $\rho_i = \frac{k}{\pi r_i^2}$. In our experiments, we find that $k = 50$ is a good number of neighbors. It is important to know the local density because many laser point clouds are made with a
fixed resolution angle scanner and are therefore not evenly distributed. We use the local density in section 3.2.3 for the score calculation.

3.2.2 Filtered Normal Estimation

Normal estimation is an important part of our algorithm. The paper [7] presents and compares three normal estimation methods. They conclude that weighted plane fitting (WPF) is the fastest and the most accurate for large point clouds. WPF is an idea of Pauly et al. in [17]: the fitting plane of a point $p$ must take the nearby points into consideration more than the distant ones. The normal least square is explained in [21] and is the minimum of $\sum_{i=1}^{k} (n_p \cdot p_i + d)^2$. The WPF is the minimum of $\sum_{i=1}^{k} \omega_i (n_p \cdot p_i + d)^2$ where $\omega_i = \theta(\|p_i - p\|)$ and $\theta(r) = e^{-2r^2/r_i^2}$. For solving $n_p$, we compute the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix $C_w = \sum_{i=1}^{k} \omega_i (p_i - b_w)(p_i - b_w)^T$ where $b_w$ is the weighted barycenter. For the three methods explained in [7], we get a good approximation of normals in smooth areas but we have errors in sharp corners. In figure 1, we have tested the weighted normal estimation on two planes with uniform noise and forming an angle of 90°. We can see that the normal is not correct on the corners of the planes and in the red circle.

To improve the normal calculation, which in turn improves the plane detection especially on borders of planes, we propose a filtering process in two phases. In a first step, we compute the weighted normals (WPF) of each point as described above by minimizing $\sum_{i=1}^{k} \omega_i (n_p \cdot p_i + d)^2$. In a second step, we compute the filtered normal by using an adaptive local neighborhood. We compute the new weighted normal with the same sum minimization but keeping only points of the neighborhood whose normals from the first step satisfy $|n_p \cdot n_i| > \cos(\alpha)$. With this filtering step, we have the same results in smooth areas and better results in sharp corners. We call our normal estimation filtered weighted plane fitting (FWPF).

Figure 1. Weighted normal estimation of two planes with uniform noise and with 90° angle between them.

We have tested our normal estimation by computing normals on synthetic data with two planes and different angles between them and with different values of the parameter $\alpha$. We can see in figure 2 the mean error on normal estimation for WPF and FWPF with $\alpha$ = 20°, 30°, 40° and 90°. Using $\alpha$ = 90° is the same as not doing the filtering step. We see in figure 2 that $\alpha$ = 20° gives a smaller error in normal estimation when the angle between planes is smaller than 60°, and $\alpha$ = 30° gives the best results when the angle between planes is greater than 60°. We have considered the value $\alpha$ = 30° as the best choice because it gives the smallest mean error in normal estimation when the angle between planes varies from 20° to 90°. Figure 3 shows the normals of the planes with 90° angle and better results in the red circle (normals are at 90° to the plane).

Figure 2. Comparison of mean error in normal estimation of two planes with $\alpha$ = 20°, 30°, 40° and 90° (= no filtering).

Figure 3. Filtered weighted normal estimation of two planes with uniform noise and with 90° angle between them ($\alpha$ = 30°).

3.2.3 The score of local planarity

In many region growing algorithms, the criterion used for the score of the local fitting plane is the residual, like in [18] or [19], i.e. the sum of the squared distances from points to the plane. We have a different score function to estimate local planarity. For that, we first compute the neighbors $N_i$ of a point $p$ with points $i$ whose normals $n_i$ are close to the normal $n_p$. More precisely, we compute $N_i = \{p \text{ in } k \text{ neighbors of } i \;/\; |n_i \cdot n_p| > \cos(\alpha)\}$. It is a way to keep only the points which are probably on the local plane before the least square fitting. Then, we compute the local plane fitting of point $p$ with $N_i$ neighbors by least squares like in [21]. The set $N'_i$ is the subset of $N_i$ of points belonging to the plane, i.e. the points for which the distance to the local plane is smaller than the parameter $\gamma$ (to account for the noise). The score $s$ of the local plane is the
area of the local plane, i.e. the number of points "in" the plane divided by the local density $\rho_i$ (seen in section 3.2.1): the score $s = \frac{card(N'_i)}{\rho_i}$. We take the area of the local plane as the score function, and not the number of points or the residual, in order to be more robust to the sampling distribution.

3.2.4 Voxel decomposition

We use a data structure that is the core of our region growing method. It is a voxel grid that speeds up the plane detection process. Voxels are small cubes of length $d$ that partition the point cloud space. Every data point belongs to a voxel and a voxel contains a list of points. We use the Octree Class Template in [2] to compute an octree of the point cloud. The leaf nodes of the graph built are voxels of size $d$. Once the voxel grid has been computed, we start the plane detection algorithm.

3.2.5 Voxel Growing

With the estimator of local planarity, we take the point $p$ with the best score, i.e. the point with the maximum area of local plane. We have the model parameters of this best seed plane and we start with an empty set $E$ of points belonging to the plane. The initial point $p$ is in a voxel $v_0$. All the points in the initial voxel $v_0$ for which the distance from the seed plane is less than $\gamma$ are added to the set $E$. Then, we compute new plane parameters by least square refitting with set $E$. Instead of growing with $k$ nearest neighbors, we grow with voxels. Hence we test points in the 26 voxel neighbors. This is a way to search the neighborhood in constant time instead of $O(\log N)$ for each neighbor like with a Kd-tree. In a neighbor voxel, we add to $E$ the points for which the distance to the current plane is smaller than $\gamma$ and the angle between the normal computed in each point and the normal of the plane is smaller than a parameter $\alpha$: $|\cos(n_p, n_P)| > \cos(\alpha)$ where $n_p$ is the normal of the point $p$ and $n_P$ is the normal of the plane $P$. We have tested different values of $\alpha$ and we empirically found that 30° is a good value for all point clouds. If we added
at least one point in $E$ for this voxel, we compute new plane parameters from $E$ by least square fitting and we test its 26 voxel neighbors. It is important to perform plane least square fitting at each voxel addition because, with noise, the seed plane model is not good enough to be used throughout the voxel growing, but only in the surrounding voxels. This growing process is faster than classical region growing because we do not compute a least square fit for each point added but only for each voxel added.

The least square fitting step must be computed very fast. We use the same method as explained in [18] with incremental update of the barycenter $b$ and covariance matrix $C$ like equation 2. We know with [21] that the barycenter $b$ belongs to the least square plane and that the normal of the least square plane $n_P$ is the eigenvector of the smallest eigenvalue of $C$.

$$b_0 = 0_{3 \times 1}, \qquad C_0 = 0_{3 \times 3},$$
$$b_{n+1} = \frac{1}{n+1}\left(n\, b_n + p_{n+1}\right),$$
$$C_{n+1} = C_n + \frac{n}{n+1}\,(p_{n+1} - b_n)(p_{n+1} - b_n)^T, \qquad (2)$$

where $C_n$ is the covariance matrix of a set of $n$ points, $b_n$ is the barycenter vector of a set of $n$ points and $p_{n+1}$ is the $(n+1)$th point vector added to the set.

This voxel growing method leads to a connected component set $E$ because the points have been added by connected voxels. In our case, the minimum distance between one point and $E$ is less than the parameter $d$ of our voxel grid. That is why the parameter $d$ also represents the connectivity of points in detected planes.

3.2.6 Plane Detection

To get all planes with an area of at least $area_{min}$ in the point cloud, we repeat these steps (best local seed plane choice and voxel growing) with all points by descending order of their score. Once we have a set $E$ whose area is bigger than $area_{min}$, we keep it and classify all points in $E$.

4. Results and Discussion

4.1. Benchmark analysis

To test the improvements of our method, we have employed the comparative framework of [12] based on range images. For that, we have converted all images into 3D point clouds. All point clouds created have 260k points.
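The incremental update of equation 2 in section 3.2.5 can be illustrated with a short sketch. It is an assumption-laden illustration rather than the authors' implementation: the function names are hypothetical, points are plain 3-element lists, and $C$ is accumulated exactly as in equation 2, i.e. as an unnormalized sum of outer products (dividing by $n$ would not change its eigenvectors).

```python
def update_plane_stats(n, b, C, p):
    """Add point p to a set of n points with barycenter b and matrix C (equation 2)."""
    diff = [p[k] - b[k] for k in range(3)]  # p_{n+1} - b_n
    w = n / (n + 1.0)
    C_new = [[C[r][c] + w * diff[r] * diff[c] for c in range(3)] for r in range(3)]
    b_new = [(n * b[k] + p[k]) / (n + 1.0) for k in range(3)]
    return n + 1, b_new, C_new


def accumulate(points):
    """Start from b_0 = 0, C_0 = 0 and feed the points one at a time."""
    n, b, C = 0, [0.0, 0.0, 0.0], [[0.0] * 3 for _ in range(3)]
    for p in points:
        n, b, C = update_plane_stats(n, b, C, p)
    return n, b, C


def batch_plane_stats(points):
    """Reference computation over the whole set, for comparison only."""
    n = len(points)
    b = [sum(p[k] for p in points) / n for k in range(3)]
    C = [[sum((p[r] - b[r]) * (p[c] - b[c]) for p in points) for c in range(3)]
         for r in range(3)]
    return b, C


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1), (0.0, 1.0, -0.1), (1.0, 1.0, 0.0)]
    print(accumulate(pts))
    print(batch_plane_stats(pts))
```

Each added point costs a constant number of operations, so refitting after every voxel only requires one small eigen-decomposition of the 3x3 matrix rather than a pass over all points of $E$.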
After our segmentation,we project labelled points on a seg-mented image and compare with the ground truth image. We have chosen our three parameters d,area min andγby optimizing the result of the10perceptron training image segmentation(the perceptron is portable scanner that pro-duces a range image of its environment).Bests results have been obtained with area min=200,γ=5and d=8 (units are not provided in the benchmark).We show the re-sults of the30perceptron images segmentation in table1. GT Regions are the mean number of ground truth planes over the30ground truth range images.Correct detection, over-segmentation,under-segmentation,missed and noise are the mean number of correct,over,under,missed and noised planes detected by methods.The tolerance80%is the minimum percentage of points we must have detected comparing to the ground truth to have a correct detection. More details are in[12].UE is a method from[12],UFPR is a method from[10]. It is important to notice that UE and UFPR are range image methods and our method is not well suited for range images but3D Point Cloud.Nevertheless,it is a good benchmark for comparison and we see in table1that the accuracy of our method is very close to the state of the art in range image segmentation.To evaluate the different improvements of our algorithm, we have tested different variants of our method.We have tested our method without normals(only with distance from points to plane),without voxel growing(with a classical region growing by k neighbors),without our FWPF nor-mal estimation(with WPF normal estimation),without our score function(with residual score function).The compari-son is visible on table2.We can see the difference of time computing between region growing and voxel growing.We have tested our algorithm with and without normals and we found that the accuracy cannot be achieved whithout normal computation.There is also a big difference in the correct de-tection between WPF and our FWPF normal estimation as we 
can see in figure 4. Our FWPF normal brings a real improvement in border estimation of planes. Black points in the figure are non-classified points.

Figure 5. Correct detection of our segmentation algorithm when the voxel size d changes.

We would like to discuss the influence of parameters on our algorithm. We have three parameters: $area_{min}$, which represents the minimum area of the plane we want to keep; $\gamma$, which represents the thickness of the plane (it is generally closely tied to the noise in the point cloud and especially the standard deviation $\sigma$ of the noise); and $d$, which is the minimum distance from a point to the rest of the plane. These three parameters depend on the point cloud features and the desired segmentation. For example, if we have a lot of noise, we must choose a high $\gamma$ value. If we want to detect only large planes, we set a large $area_{min}$ value. We also focus our analysis on the robustness of the voxel size $d$ in our algorithm, i.e. the ratio of points vs voxels. We can see in figure 5 the variation of the correct detection when we change the value of $d$. The method seems to be robust when $d$ is between 4 and 10 but the quality decreases when $d$ is over 10. This is because, for a large voxel size $d$, some planes from different objects are merged into one plane.

Table 1. Average results of different segmenters at 80% compare tolerance.

Method       GT Regions  Correct detection  Over-segmentation  Under-segmentation  Missed  Noise  Duration (in s)
UE           14.6        10.0               0.2                0.3                 3.8     2.1    -
UFPR         14.6        11.0               0.3                0.1                 3.0     2.5    -
Our method   14.6        10.9               0.2                0.1                 3.3     0.7    308

Table 2. Average results of variants of our segmenter at 80% compare tolerance.

Our method                  GT Regions  Correct detection  Over-segmentation  Under-segmentation  Missed  Noise  Duration (in s)
without normals             14.6        5.67               0.1                0.1                 9.4     6.5    70
without voxel growing       14.6        10.7               0.2                0.1                 3.4     0.8    605
without FWPF                14.6        9.3                0.2                0.1                 5.0     1.9    195
without our score function  14.6        10.3               0.2                0.1                 3.9     1.2    308
with all improvements       14.6        10.9               0.2                0.1                 3.3     0.7    308

4.1.1 Large scale data

We have tested our method on different kinds of data. We have
segmented urban data inﬁgure6from our Mobile Mapping System(MMS)described in[11].The mobile sys-tem generates10k pts/s with a density of50pts/m2and very noisy data(σ=0.3m).For this point cloud,we want to de-tect building facades.We have chosen area min=10m2, d=1m to have large connected components andγ=0.3m to cope with the noise.We have tested our method on point cloud from the Trim-ble VX scanner inﬁgure7.It is a point cloud of size40k points with only20pts/m2with less noise because it is a ﬁxed scanner(σ=0.2m).In that case,we also wanted to detect building facades and keep the same parameters ex-ceptγ=0.2m because we had less noise.We see inﬁg-ure7that we have detected two facades.By setting a larger voxel size d value like d=10m,we detect only one plane. We choose d like area min andγaccording to the desired segmentation and to the level of detail we want to extract from the point cloud.We also tested our algorithm on the point cloud from the LEICA Cyrax scanner inﬁgure8.This point cloud has been taken from AIM@SHAPE repository[1].It is a very dense point cloud from multipleﬁxed position of scanner with about400pts/m2and very little noise(σ=0.02m). In this case,we wanted to detect all the little planes to model the church in planar regions.That is why we have chosen d=0.2m,area min=1m2andγ=0.02m.Inﬁgures6,7and8,we have,on the left,input point cloud and on the right,we only keep points detected in a plane(planes are in random colors).The red points in theseﬁgures are seed plane points.We can see in theseﬁg-ures that planes are very well detected even with high noise. 
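To make the role of the voxel size $d$ concrete, the sketch below maps points to voxels and lists the occupied voxels among the 26 neighbors of a given voxel. It uses a hash map keyed by integer voxel coordinates, which is an assumption made for brevity; the paper itself builds its voxels as the leaves of an octree [2]. All names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

# The 26 offsets around a voxel: every combination of -1, 0, 1 except (0, 0, 0).
NEIGHBOR_OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def voxel_key(point, d):
    """Integer coordinates of the cube of side d that contains the point."""
    return tuple(math.floor(c / d) for c in point)


def build_voxel_grid(points, d):
    """Map each occupied voxel to the list of indices of the points it contains."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[voxel_key(p, d)].append(idx)
    return grid


def occupied_neighbors(grid, key):
    """Occupied voxels among the 26 neighbors of key, found in constant time."""
    found = []
    for dx, dy, dz in NEIGHBOR_OFFSETS:
        k = (key[0] + dx, key[1] + dy, key[2] + dz)
        if k in grid:  # membership test does not create new entries
            found.append(k)
    return found


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (1.2, 0.1, 0.1), (5.0, 5.0, 5.0)]
    print(dict(build_voxel_grid(pts, 1.0)))
    print(dict(build_voxel_grid(pts, 10.0)))
```

With a small $d$ the distant point sits in a voxel that is not adjacent to the others, so growing cannot reach it; with a large $d$ all three points share one voxel, which mirrors the observation that a larger voxel size merges separate facades into one plane.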
Table 3 shows the information on the point clouds, together with the number of planes detected and the duration of the algorithm. The time includes the computation of the FWPF normals of the point cloud. We can see in table 3 that our algorithm performs linearly in time with respect to the number of points. The choice of parameters has little influence on the computation time. The computation time is about one millisecond per point whatever the size of the point cloud (we used a PC with a QuadCore Q9300 and 2 GB of RAM). The algorithm has been implemented using only one thread and in-core processing. Our goal is to compare the improvement of plane detection between classical region growing and our region growing, with better normals for more accurate planes and voxel growing for faster detection. Our method seems to be compatible with an out-of-core implementation like the ones described in [24] or in [15].

Table 3. Results on different data.

                  MMS Street   VX Street    Church
Size (points)     398k         42k          7.6M
Mean Density      50 pts/m2    20 pts/m2    400 pts/m2
Number of Planes  20           21           42
Total Duration    452 s        33 s         6900 s
Time/point        1 ms         1 ms         1 ms

5. Conclusion

In this article, we have proposed a new method of plane detection that is fast and accurate even in presence of noise.
We demonstrate its efﬁciency with different kinds of data and its speed in large data sets with millions of points.Our voxel growing method has a complexity of O(N)and it is able to detect large and small planes in very large data sets and can extract them directly in connected components.Figure 4.Ground truth,Our Segmentation without and with ﬁlterednormals.Figure 6.Planes detection in street point cloud generated by MMS (d =1m,area min =10m 2,γ=0.3m ).References[1]Aim@shape repository /.6[2]Octree class template /code/octree.html.4[3] A.Bab-Hadiashar and N.Gheissari.Range image segmen-tation using surface selection criterion.2006.IEEE Trans-actions on Image Processing.1[4]J.Bauer,K.Karner,K.Schindler,A.Klaus,and C.Zach.Segmentation of building models from dense 3d point-clouds.2003.Workshop of the Austrian Association for Pattern Recognition.1[5]H.Boulaassal,ndes,P.Grussenmeyer,and F.Tarsha-Kurdi.Automatic segmentation of building facades using terrestrial laser data.2007.ISPRS Workshop on Laser Scan-ning.1[6] C.C.Chen and I.Stamos.Range image segmentationfor modeling and object detection in urban scenes.2007.3DIM2007.1[7]T.K.Dey,G.Li,and J.Sun.Normal estimation for pointclouds:A comparison study for a voronoi based method.2005.Eurographics on Symposium on Point-Based Graph-ics.3[8]J.R.Diebel,S.Thrun,and M.Brunig.A bayesian methodfor probable surface reconstruction and decimation.2006.ACM Transactions on Graphics (TOG).1[9]M.A.Fischler and R.C.Bolles.Random sample consen-sus:A paradigm for model ﬁtting with applications to image analysis and automated munications of the ACM.1,2[10]P.F.U.Gotardo,O.R.P.Bellon,and L.Silva.Range imagesegmentation by surface extraction using an improved robust estimator.2003.Proceedings of Computer Vision and Pat-tern Recognition.1,5[11] F.Goulette,F.Nashashibi,I.Abuhadrous,S.Ammoun,andurgeau.An integrated on-board laser range sensing sys-tem for on-the-way city and road modelling.2007.Interna-tional Archives of the 
Photogrammetry,Remote Sensing and Spacial Information Sciences.6[12] A.Hoover,G.Jean-Baptiste,and al.An experimental com-parison of range image segmentation algorithms.1996.IEEE Transactions on Pattern Analysis and Machine Intelligence.5[13]H.Hoppe,T.DeRose,T.Duchamp,J.McDonald,andW.Stuetzle.Surface reconstruction from unorganized points.1992.International Conference on Computer Graphics and Interactive Techniques.2[14]P.Hough.Method and means for recognizing complex pat-terns.1962.In US Patent.1[15]M.Isenburg,P.Lindstrom,S.Gumhold,and J.Snoeyink.Large mesh simpliﬁcation using processing sequences.2003.。

Human Pose Estimation in Static Images Based on Region Segmentation and Monte Carlo Sampling

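gives more accurate estimation results on the dataset, and the running time is also reduced by _2%.
Keywords: static images; human pose estimation; region segmentation; Monte Carlo sampling; belief propagation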
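Representative work on posterior probability maximization includes the references cited.
changes, variations in people's clothing, background clutter, and occlusion all make this problem very challenging.
Considering that the parts of the human body have relatively fixed connection relations
Received: 001-521.10.
Human pose estimation in static images is an important problem in computer vision, with wide applications in content-based image retrieval and filtering. Given the articulated, multi-joint nature of the human body, estimating a person's pose generally requires dividing the person into several parts that are detected separately (usually split at the joints). The problem of human pose estimation in static images can therefore be treated as detecting each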

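Research on Image Semantic Segmentation and Feature Extraction Methods in Autonomous Driving

Autonomous driving is one of today's most active research directions in technology, and image semantic segmentation and feature extraction are among its key techniques.

This article reviews the current state of research and the methods for image semantic segmentation and feature extraction in autonomous driving.

Image semantic segmentation is the task of classifying every pixel in an image, with the goal of labeling the different objects in the image.

In autonomous driving, semantic segmentation distinguishes objects such as roads, vehicles, and pedestrians, so that the environment around the vehicle can be better understood and perceived.

Methods for image semantic segmentation fall mainly into traditional machine-learning-based methods and deep-learning-based methods.

Traditional machine-learning-based methods mainly use feature extraction algorithms such as SIFT and HOG to extract features from the image, and then classify them with a machine learning algorithm.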
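As an illustration of what a hand-designed feature looks like, the following sketch computes a magnitude-weighted histogram of gradient orientations for a single image cell, which is the core ingredient of HOG. It is deliberately simplified and is not the full HOG descriptor: block normalization and bin interpolation are omitted, and the function name and the 9-bin choice are assumptions.

```python
import math


def orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations (0-180 degrees) for one cell.

    cell is a list of rows of grey values; only interior pixels are used so that
    central differences are always defined.
    """
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # horizontal central difference
            gy = cell[r + 1][c] - cell[r - 1][c]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // bin_width), n_bins - 1)] += magnitude
    return hist


if __name__ == "__main__":
    vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
    print(orientation_histogram(vertical_edge))
```

A classifier such as an SVM would then be trained on vectors of this kind, which is why the quality of the result depends so strongly on how well the designer chose the feature.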

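This approach relies on hand-designed features and complex classifiers, is easily affected by factors such as image quality, illumination, and viewing angle, and adapts poorly to complex scenes.

Deep-learning-based methods, by contrast, train a convolutional neural network (CNN) end to end on images and can learn image features automatically.

This approach needs no hand-designed features and offers better adaptability and robustness.

In autonomous driving, researchers build deep neural networks such as FCN and SegNet to perform image semantic segmentation.

These networks adopt an encoder-decoder structure that extracts features from the image and reconstructs it, so that the network better understands the semantic information in the image and performs pixel-level classification.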
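The final pixel-level classification step can be sketched independently of any particular network. Assuming a decoder has already produced one score map per class (the scores and class names below are made up for illustration), each pixel simply takes the index of its highest-scoring class:

```python
CLASS_NAMES = ["road", "vehicle", "pedestrian"]  # hypothetical label set


def label_map(class_scores):
    """Turn class_scores[c][y][x] into labels[y][x], the best class per pixel."""
    n_classes = len(class_scores)
    height, width = len(class_scores[0]), len(class_scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: class_scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    scores = [
        [[0.90, 0.20], [0.10, 0.30]],  # road
        [[0.05, 0.70], [0.20, 0.30]],  # vehicle
        [[0.05, 0.10], [0.70, 0.40]],  # pedestrian
    ]
    for row in label_map(scores):
        print([CLASS_NAMES[c] for c in row])
```

In a real system the score maps would come from the decoder's last layer at full image resolution; this sketch only shows how a dense label image follows from them.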

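Besides image semantic segmentation, feature extraction is also an important task in autonomous driving.

Feature extraction means extracting useful feature information from images or sensor data for use in the decision-making and control of the autonomous driving system.

Commonly used features include texture features and color features; extracting them makes it possible to analyze and understand the image.

Traditional feature extraction methods mostly use hand-designed feature extractors such as HOG and SIFT.

These methods require manually designed feature extractors and adapt poorly to complex scenes.

Deep-learning-based feature extraction methods, by contrast, learn image features automatically in a data-driven way.

Researchers design convolutional neural networks such as VGG and ResNet to extract feature vectors from images.

By stacking multiple layers and applying convolution operations, these networks automatically extract abstract features from the image, giving the features better discriminability and expressive power.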
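The idea of stacking convolution layers can be shown with a toy example in plain Python. This is only a sketch under simplifying assumptions: a single channel, hand-picked kernels instead of learned ones, no padding or stride, and, as in most deep learning libraries, the "convolution" is computed as a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[y + i][x + j] * kernel[i][j] for i in range(kh) for j in range(kw))
         for x in range(out_w)]
        for y in range(out_h)
    ]


def relu(feature_map):
    """Element-wise non-linearity applied between layers."""
    return [[max(0.0, v) for v in row] for row in feature_map]


if __name__ == "__main__":
    image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    layer1 = relu(conv2d_valid(image, [[-1.0, 1.0]]))     # responds to a left-to-right step
    layer2 = relu(conv2d_valid(layer1, [[1.0], [1.0]]))   # pools the response vertically
    print(layer1)
    print(layer2)
```

The second layer responds to a longer vertical edge than the first layer can see on its own, which is a small-scale version of how deeper layers build more abstract features from simpler ones.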

Research on an Automatic Speaker Recognition Method Based on Independent Component Analysis

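companies and individuals are paying ever more attention to information security, so more numerous and more reliable identification measures are needed to keep key data safe, and speaker recognition offers these sectors a new security solution.
signals. The ICA problem is to estimate the mixing matrix and the independent components s using only the information in the observed signals $x_i$ ($i = 1, 2, \ldots, N$), so a separating matrix must be found that yields the most
matrix space to search for the unmixing matrix, which reduces the number of variables
《仪器仪表与分析监测》, 2101, No. 1
and simplifies the solution of the problem.
The usual way to whiten is eigen-decomposition of the covariance matrix, $E\{XX^T\} = BDB^T$, where $B$ is the orthogonal matrix of eigenvectors of $E\{XX^T\}$ and $D$ is the diagonal matrix of its eigenvalues.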
Ｅ｛Ｊ｝Ｅ｛ｙ＝Ｗ｝＝唧，：（）４
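1 Separation of mixed speech
1.1 Independent component analysis (ICA)
That is, the matrix is orthogonal, so the search only needs to take place among orthogonal matrices
Independent component analysis (ICA)
model (GMM), and proposes a solution.
processing, which consists of two parts, namely centering and whitening. Centering and whitening apply a linear transformation to the observation vector so that the resulting data have zero mean and identity covariance, $E\{x\} = 0$ and $E\{xx^T\} = I$. Under the assumption that the source signals have unit variance and are mutually statistically independent, it follows that
A Gaussian mixture model is then used to recognize the separated speech. Simulation results show that, for target recognition among overlapping speakers, this method can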

3D Point Cloud Segmentation Based on an Improved Local Surface Convexity Algorithm

Fig. 1 Local surface convexity
{ } ｓｉｇｍ［－ｎＴｉｎｊ，－ｃｏｓ（ｖｎＳｉｍ），ｖｎＳｉｍＦ］
ｃｉ，ｊ＝ｍａｘｓｉｇｍ［ｍａｘ（‖ｎｄＴｉｄｉ，ｉｊ，‖ｊ，‖ｎｄＴｊｄｊ，ｉｊ，‖ｉ），ｃｏｓ（９０°－ｖｃｏｎｖ），ｖｃｏｎｖＦ］，
（１）
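(2) Boundary determination

$$\mathrm{sigm}(x, \theta, m) = 0.5 - 0.5\,\frac{(x - \theta)\,m}{\sqrt{1 + (x - \theta)^2 m^2}}, \qquad (2)$$

where $\theta$ is the effective threshold and $m$ is a range parameter that affects the slope of the tangent at the threshold.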
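The soft threshold function of equation (2) is easy to evaluate directly. The sketch below is a plain transcription of that formula; the sample values of the threshold and the slope parameter are arbitrary and only serve to show the shape of the curve.

```python
import math


def sigm(x, theta, m):
    """Soft threshold of equation (2): 0.5 at x = theta, decreasing in x for m > 0."""
    scaled = (x - theta) * m
    return 0.5 - 0.5 * scaled / math.sqrt(1.0 + scaled * scaled)


if __name__ == "__main__":
    theta, m = 1.0, 5.0  # illustrative values, not taken from the paper
    for x in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(x, round(sigm(x, theta, m), 4))
```

The function stays strictly between 0 and 1, equals 0.5 exactly at the threshold, and a larger $m$ makes the transition around the threshold sharper.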
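Because neighboring points in a 3D point cloud do not necessarily belong to the same object, object boundary determination must be applied to the points of a locally connected point set to judge whether the set belongs to a single object. Discontinuities in depth values indicate the presence of an object boundary and whether neighboring pixels belong to the same part; for any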
{ ｌｉ，ｊ＝ｍｉｎｓｉｇｍ［｜ｍ（ｉｒｎｉ｛－ｒｉ，ｒｊｒ）ｊ｝｜，ｖｒＤｉｆｆ，ｖｒ２Ｄｉｆｆ］，
ｓｉｇｍ［｜（ｒｉ－（ｒｊｒ）ｈ－－（ｒｊｒ）ｈ－ｒｊ）｜，ｖｒＤｉｆｆ，ｖｒＮＦ（ｒｉ）］，
} ｓｉｇｍ［｜（ｒｉ－（ｒｊｒ）ｊ－－ｒ（ｋｒ）ｊ－ｒｋ）｜，ｖｒＮＤｉｆｆ，ｖｒＮＦ（ｒｉ）］
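(1. State Key Laboratory of Laser Interaction with Matter, Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China)
Abstract: Point cloud segmentation is the basis for processing steps such as point cloud classification, recognition, and 3D reconstruction, and the segmentation result strongly affects subsequent applications. This paper proposes using connected point sets to improve the neighbor-point relation in the local surface convexity algorithm, in order to address the over-segmentation and under-segmentation that current point cloud segmentation algorithms for laser 3D imaging systems exhibit on scattered point clouds of complex environments. A main vertex and its surrounding points form a connected set, which serves as the local sub-point set for the segmentation decision and yields effective segmentation regions. The method solves the problem that common point cloud segmentation methods cannot effectively segment irregularly shaped objects, and improves segmentation accuracy. Experimental results show that, compared with the min-cut algorithm and the region growing algorithm, the improved local surface convexity algorithm based on connected point sets segments real road environment information better and can, to some extent, avoid over-segmentation and under-segmentation, which shows that the method is suitable for segmenting scattered point cloud data of complex environments. Keywords: laser 3D imaging; point cloud segmentation; connected point set; local surface convexity. CLC number: TN958.98. Document code: A. doi: 10.3788/CO.20171003.0348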

Discriminatively Trained Sparse Code Gradients for Contour Detection

Xiaofeng Ren and Liefeng Bo
Intel Science and Technology Center for Pervasive Computing, Intel Labs
Seattle, WA 98195, USA
{xiaofeng.ren,liefeng.bo}@

Abstract

Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset.

1 Introduction

Contour detection is a fundamental problem in vision. Accurately finding both object boundaries and interior contours has far reaching implications for many vision tasks including segmentation, recognition and scene understanding. High-quality image segmentation has increasingly been relying on contour analysis, such as in the widely used system of Global Pb [2]. Contours and segmentations have also seen extensive uses in shape matching and object recognition [8, 9]. Accurately finding
contours in natural images is a challenging problem and has been extensively studied. With the availability of datasets with human-marked groundtruth contours, a variety of approaches have been proposed and evaluated (see a summary in [2]), such as learning to classify [17, 20, 16], contour grouping [23, 31, 12], multi-scale features [21, 2], and hierarchical region analysis [2]. Most of these approaches have one thing in common [17, 23, 31, 21, 12, 2]: they are built on top of a set of gradient features [17] measuring local contrast of oriented discs, using chi-square distances of histograms of color and textons. Despite various efforts to use generic image features [5] or learn them [16], these hand-designed gradients are still widely used after a decade and support top-ranking algorithms on the Berkeley benchmarks [2].

Figure 1 (panel labels: image patch: gray, ab; depth patch (optional): depth, surface normal; local sparse coding; per-pixel sparse codes; multi-scale pooling; oriented gradients; power transforms; linear SVM; RGB-(D) contours): We combine sparse coding and oriented gradients for contour analysis on color as well as depth images. Sparse coding automatically learns a rich representation of patches from data. With multi-scale pooling, oriented gradients efficiently capture local contrast and lead to much more accurate contour detection than those using hand-designed features including Global Pb (gPb) [2].

In this work, we demonstrate that contour detection can be vastly improved by replacing the hand-designed Pb gradients of [17] with rich representations that are automatically learned from data. We use sparse coding, in particular Orthogonal Matching Pursuit [18] and K-SVD [1], to learn such representations on patches. Instead of a direct classification of patches [16], the sparse codes on the pixels are pooled over multi-scale half-discs for each orientation, in the spirit of the Pb gradients, before being classified with a linear SVM. The SVM outputs are then smoothed and non-max suppressed over
orientations, as commonly done, to produce the final contours (see Fig. 1).

Our sparse code gradients (SCG) are much more effective in capturing local contour contrast than existing features. By only changing local features and keeping the smoothing and globalization parts fixed, we improve the F-measure on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb), a substantial step toward human-level accuracy (see the precision-recall curves in Fig. 4). Large improvements in accuracy are also observed on other datasets including MSRC2 and PASCAL2008. Moreover, our approach is built on unsupervised feature learning and can directly apply to novel sensor data such as RGB-D images from Kinect-style depth cameras. Using the NYU Depth dataset [27], we verify that our SCG approach combines the strengths of color and depth contour detection and outperforms an adaptation of gPb to RGB-D by a large margin.

2 Related Work

Contour detection has a long history in computer vision as a fundamental building block. Modern approaches to contour detection are evaluated on datasets of natural images against human-marked groundtruth. The Pb work of Martin et al. [17] combined a set of gradient features, using brightness, color and textons, to outperform the Canny edge detector on the Berkeley Benchmark (BSDS). Multi-scale versions of Pb were developed and found beneficial [21,2]. Building on top of the Pb gradients, many approaches studied the globalization aspects, i.e. moving beyond local classification and enforcing consistency and continuity of contours. Ren et al. developed CRF models on superpixels to learn junction types [23]. Zhu et al. used circular embedding to enforce orderings of edgels [31]. The gPb work of Arbelaez et al. computed gradients on eigenvectors of the affinity graph and combined them with local cues [2]. In addition to Pb gradients, Dollar et al. [5] learned boosted trees on generic features such as gradients and Haar wavelets, Kokkinos used SIFT features on edgels [12], and Prasad et al. [20] used raw pixels in class-specific
settings. One closely related work was the discriminative sparse models of Mairal et al. [16], which used K-SVD to represent multi-scale patches and had moderate success on the BSDS. A major difference of our work is the use of oriented gradients: compared to directly classifying a patch, measuring contrast between oriented half-discs is a much easier problem and can be effectively learned.

Sparse coding represents a signal by reconstructing it using a small set of basis functions. It has seen wide use in vision, for example for faces [28] and recognition [29]. Similar to deep network approaches [11,14], recent works tried to avoid feature engineering and employed sparse coding of image patches to learn features from "scratch", for texture analysis [15] and object recognition [30,3]. In particular, Orthogonal Matching Pursuit [18] is a greedy algorithm that incrementally finds sparse codes, and K-SVD is also efficient and popular for dictionary learning. Closely related to our work but on the different problem of recognition, Bo et al. used matching pursuit and K-SVD to learn features in a coding hierarchy [3] and are extending their approach to RGB-D data [4].

Thanks to the mass production of Kinect, active RGB-D cameras became affordable and were quickly adopted in vision research and applications. The Kinect pose estimation of Shotton et
al. used random forests to learn from a huge amount of data [25]. Henry et al. used RGB-D cameras to scan large environments into 3D models [10]. RGB-D data were also studied in the context of object recognition [13] and scene labeling [27,22]. In-depth studies of contour and segmentation problems for depth data are much in need given the fast growing interest in RGB-D perception.

3 Contour Detection using Sparse Code Gradients

We start by examining the processing pipeline of Global Pb (gPb) [2], a highly influential and widely used system for contour detection. The gPb contour detection has two stages: local contrast estimation at multiple scales, and globalization of the local cues using spectral grouping. The core of the approach lies within its use of local cues in oriented gradients. Originally developed in [17], this set of features uses relatively simple pixel representations (histograms of brightness, color and textons) and similarity functions (chi-square distance, manually chosen), compared to recent advances in using rich representations for high-level recognition (e.g. [11,29,30,3]).

We set out to show that both the pixel representation and the aggregation of pixel information in local neighborhoods can be much improved and, to a large extent, learned from and adapted to input data.
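For reference, the hand-designed similarity function mentioned above, the chi-square distance between two histograms g and h, is commonly written as χ²(g, h) = ½ Σ_i (g_i − h_i)² / (g_i + h_i). The following is a minimal Python sketch of that formula, not the gPb code itself; the function name `chi_square_distance` and the small `guard` term that avoids division by zero on empty bins are assumptions made for illustration.

```python
def chi_square_distance(g, h, guard=1e-10):
    """Chi-square distance between two histograms of equal length.

    `guard` is a hypothetical safeguard against empty bins in both
    histograms; it is not part of the textbook formula.
    """
    if len(g) != len(h):
        raise ValueError("histograms must have the same number of bins")
    return 0.5 * sum((a - b) ** 2 / (a + b + guard) for a, b in zip(g, h))


if __name__ == "__main__":
    # Two half-disc histograms with identical content have zero contrast.
    print(chi_square_distance([0.5, 0.5], [0.5, 0.5]))
    # Histograms with disjoint support have the largest contrast.
    print(chi_square_distance([1.0, 0.0], [0.0, 1.0]))
```

The distance collapses each pair of histograms to a single scalar per cue and scale, which is the early collapse that the learned representation in this section is designed to avoid.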
For pixel representation, in Section 3.1 we show how to use Orthogonal Matching Pursuit [18] and K-SVD [1], efficient sparse coding and dictionary learning algorithms that readily apply to low-level vision, to extract sparse codes at every pixel. This sparse coding approach can be viewed as similar in spirit to the use of filterbanks but avoids manual choices and thus directly applies to the RGB-D data from Kinect. We show learned dictionaries for a number of channels that exhibit different characteristics: grayscale/luminance, chromaticity (ab), depth, and surface normal.

In Section 3.2 we show how the pixel-level sparse codes can be integrated through multi-scale pooling into a rich representation of oriented local neighborhoods. By computing oriented gradients on this high dimensional representation and using a double power transform to code the features for linear classification, we show a linear SVM can be efficiently and effectively trained for each orientation to classify contour vs non-contour, yielding local contrast estimates that are much more accurate than the hand-designed features in gPb.

3.1 Local Sparse Representation of RGB-(D) Patches

K-SVD and Orthogonal Matching Pursuit. K-SVD [1] is a popular dictionary learning algorithm that generalizes K-Means and learns dictionaries of codewords from unsupervised data. Given a set of image patches $Y = [y_1, \cdots, y_n]$, K-SVD jointly finds a dictionary $D = [d_1, \cdots, d_m]$ and an associated sparse code matrix $X = [x_1, \cdots, x_n]$ by minimizing the reconstruction error

$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|x_i\|_0 \le K; \quad \forall j,\ \|d_j\|_2 = 1 \qquad (1)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $x_i$ are the columns of $X$, the zero-norm $\|\cdot\|_0$ counts the non-zero entries in the sparse code $x_i$, and $K$ is a predefined sparsity level (number of non-zero entries). This optimization can be solved in an alternating manner. Given the dictionary D, optimizing the sparse code matrix X can be decoupled into sub-problems, each solved with Orthogonal Matching Pursuit (OMP) [18], a greedy algorithm for finding sparse codes. Given the codes X, the
dictionary D and its associated sparse coefficients are updated sequentially by singular value decomposition. For our purpose of representing local patches, the dictionary D has a small size (we use 75 for 5x5 patches) and does not require a lot of sample patches, and it can be learned in a matter of minutes.

Once the dictionary D is learned, we again use the Orthogonal Matching Pursuit (OMP) algorithm to compute sparse codes at every pixel. This can be efficiently done with convolution and a batch version of the OMP algorithm [24]. For a typical BSDS image of resolution 321x481, the sparse code extraction is efficient and takes 1∼2 seconds.

Sparse Representation of RGB-D Data. One advantage of unsupervised dictionary learning is that it readily applies to novel sensor data, such as the color and depth frames from a Kinect-style RGB-D camera. We learn K-SVD dictionaries for up to four channels of color and depth: grayscale for luminance, chromaticity ab for color in the Lab space, depth (distance to camera) and surface normal (3-dim). The learned dictionaries are visualized in Fig. 2. These dictionaries are interesting to look at and qualitatively distinctive: for example, the surface normal codewords tend to be more smooth due to flat surfaces, the depth codewords are also more smooth but with speckles, and the chromaticity codewords respect the opponent color pairs. The channels are coded separately.

Figure 2: K-SVD dictionaries learned for four different channels: grayscale and chromaticity (in ab) for an RGB image (a, b), and depth and surface normal for a depth image (c, d). We use a fixed dictionary size of 75 on 5x5 patches. The ab channel is visualized using a constant luminance of 50. The 3-dimensional surface normal (xyz) is visualized in RGB (i.e. blue for frontal-parallel surfaces). Panels: (a) Grayscale, (b) Chromaticity (ab), (c) Depth, (d) Surface normal.

3.2 Coding Multi-Scale Neighborhoods for Measuring Contrast

Multi-Scale Pooling over Oriented Half-Discs. Over decades of research on contour detection and related topics, a
number of fundamental observations have been made, repeatedly: (1) contrast is the key to differentiate contour vs non-contour; (2) orientation is important for respecting contour continuity; and (3) multi-scale is useful. We do not wish to throw out these principles. Instead, we seek to adopt these principles for our case of high dimensional representations with sparse codes.

Each pixel is represented by sparse codes extracted from a small patch (5-by-5) around it. To aggregate pixel information, we use oriented half-discs as used in gPb (see an illustration in Fig. 1). Each orientation is processed separately. For each orientation, at each pixel $p$ and scale $s$, we define two half-discs (rectangles) $N^a$ and $N^b$ of size $s$-by-$(2s+1)$, on both sides of $p$, rotated to that orientation. For each half-disc $N$, we use average pooling on non-zero entries (i.e. a hybrid of average and max pooling) to generate its representation

$$F(N) = \left[ \frac{\sum_{i \in N} |x_{i1}|}{\sum_{i \in N} I_{|x_{i1}|>0}}, \cdots, \frac{\sum_{i \in N} |x_{im}|}{\sum_{i \in N} I_{|x_{im}|>0}} \right] \qquad (2)$$

where $x_{ij}$ is the $j$-th entry of the sparse code $x_i$, and $I$ is the indicator function of whether $x_{ij}$ is non-zero. We rotate the image (after sparse coding) and use integral images for fast computations (on both $|x_{ij}|$ and $I_{|x_{ij}|>0}$), whose costs are independent of the size of $N$.

For two oriented half-discs $N^a$ and $N^b$ at a scale $s$, we compute a difference (gradient) vector $D$

$$D(N^a_s, N^b_s) = \left| F(N^a_s) - F(N^b_s) \right| \qquad (3)$$

where $|\cdot|$ is an element-wise absolute value operation. We divide $D(N^a_s, N^b_s)$ by their norms $\|F(N^a_s)\| + \|F(N^b_s)\| + \epsilon$, where $\epsilon$ is a positive number. Since the magnitude of sparse codes varies over a wide range due to local variations in illumination as well as occlusion, this step makes the appearance features robust to such variations and increases their discriminative power, as commonly done in both contour detection and object recognition. This value is not hard to set, and we find a value of $\epsilon = 0.5$ is better than, for instance, $\epsilon = 0$.

At this stage, one could train a classifier on D for each scale to
convert it to a scalar value of contrast, which would resemble the chi-square distance function in gPb. Instead, we find that it is much better to avoid doing so separately at each scale and to combine multi-scale features in a joint representation, so as to allow interactions both between codewords and between scales. That is, our final representation of the contrast at a pixel $p$ is the concatenation of sparse codes pooled at all the scales $s \in \{1, \cdots, S\}$ (we use $S = 4$):

$$D_p = \left[ D(N^a_1, N^b_1), \cdots, D(N^a_S, N^b_S);\ F(N^a_1 \cup N^b_1), \cdots, F(N^a_S \cup N^b_S) \right] \qquad (4)$$

In addition to the difference $D$, we also include a union term $F(N^a_s \cup N^b_s)$, which captures the appearance of the whole disc (union of the two half-discs) and is normalized by $\|F(N^a_s)\| + \|F(N^b_s)\| + \epsilon$.

Double Power Transform and Linear Classifiers. The concatenated feature $D_p$ (non-negative) provides multi-scale contrast information for classifying whether $p$ is a contour location for a particular orientation. As $D_p$ is high dimensional (1200 and above in our experiments) and we need to do it at every pixel and every orientation, we prefer using linear SVMs for both efficient testing as well as training. Directly learning a linear function on $D_p$, however, does not work very well. Instead, we apply a double power transformation to make the features more suitable for linear SVMs

$$\tilde{D}_p = \left[ D_p^{\alpha_1}, D_p^{\alpha_2} \right] \qquad (5)$$

where $0 < \alpha_1 < \alpha_2 < 1$. Empirically, we find that the double power transform works much better than either no transform or a single power transform $\alpha$, as sometimes done in other classification contexts. Perronnin et al. [19] provided an intuition for why a power transform helps classification: it "re-normalizes" the distribution of the features into a more Gaussian form. One plausible intuition for a double power transform is that the optimal exponent $\alpha$ may be different across feature dimensions. By putting two power transforms of $D_p$ together, we allow the classifier to pick its linear combination, different for each dimension, during the stage of supervised
training.

From Local Contrast to Global Contours. We intentionally only change the local contrast estimation in gPb and keep the other steps fixed. These steps include: (1) the Savitzky-Golay filter to smooth responses and find peak locations; (2) non-max suppression over orientations; and (3) optionally, we apply the globalization step in gPb that computes a spectral gradient from the local gradients and then linearly combines the spectral gradient with the local ones. A sigmoid transform step is needed to convert the SVM outputs on $D_p$ before computing spectral gradients.

4 Experiments

We use the evaluation framework of, and extensively compare to, the publicly available Global Pb (gPb) system [2], widely used as the state of the art for contour detection (see footnote 1). All the results reported on gPb are from running the gPb contour detection and evaluation codes (with default parameters), and accuracies are verified against the published results in [2]. The gPb evaluation includes a number of criteria, including precision-recall (P/R) curves from contour matching (Fig. 4), F-measures computed from P/R (Tables 1, 2, 3) with a fixed contour threshold (ODS) or per-image thresholds (OIS), as well as average precisions (AP) from the P/R curves.

Benchmark Datasets. The main dataset we use is the BSDS500 benchmark [2], an extension of the original BSDS300 benchmark and commonly used for contour evaluation. It includes 500 natural images of roughly resolution 321x481, including 200 for training, 100 for validation, and 200 for testing. We conduct both color and grayscale experiments (where we convert the BSDS500 images to grayscale and retain the groundtruth). In addition, we also use the MSRC2 and PASCAL2008 segmentation datasets [26,6], as done in the gPb work [2]. The MSRC2 dataset has 591 images of resolution 200x300; we randomly choose half for training and half for testing. The PASCAL2008 dataset includes 1023 images in its training and validation sets, roughly of resolution 350x500. We randomly choose half for training and half for
testing.

For RGB-D contour detection, we use the NYU Depth dataset (v2) [27], which includes 1449 pairs of color and depth frames of resolution 480x640, with groundtruth semantic regions. We choose 60% of the images for training and 40% for testing, as in its scene labeling setup. The Kinect images are of lower quality than BSDS, and we resize the frames to 240x320 in our experiments.

Training Sparse Code Gradients. Given sparse codes from K-SVD and Orthogonal Matching Pursuit, we train the Sparse Code Gradients classifiers, one linear SVM per orientation, from sampled locations. For positive data, we sample groundtruth contour locations and estimate the orientations at these locations using groundtruth. For negative data, locations and orientations are random. We subtract the mean from the patches in each data channel. For BSDS500, we typically have 1.5 to 2 million data points. We use 4 spatial scales, at half-disc sizes 2, 4, 7, 25. For a dictionary size of 75 and 4 scales, the feature length for one data channel is 1200. For full RGB-D data, the dimension is 4800. For BSDS500, we train only using the 200 training images. We modify liblinear [7] to take dense matrices (features are dense after pooling) and single-precision floats.

Footnote 1: In this work we focus on contour detection and do not address how to derive segmentations from contours.

Figure 3: Analysis of our sparse code gradients, using average precision of classification on sampled boundaries. (a) The effect of single-scale vs multi-scale pooling (accumulated from the smallest). (b) Accuracy increasing with dictionary size, for four orientation channels. (c) The effect of the sparsity level K, which exhibits different behavior for grayscale and chromaticity. (Vertical axes: average precision. Horizontal axes: (a) pooling disc size in pixels, (c) sparsity level.)

Table 1: F-measure evaluation on the BSDS500 benchmark [2], comparing to gPb on grayscale and color images, both for local contour detection as well as for global detection (i.e. combined with the spectral gradient analysis in [2]).

BSDS500              ODS   OIS   AP
local   gPb (gray)   .67   .69   .68
        SCG (gray)   .69   .71   .71
        gPb (color)  .70   .72   .71
        SCG (color)  .72   .74   .75
global  gPb (gray)   .69   .71   .67
        SCG (gray)   .71   .73   .74
        gPb (color)  .71   .74   .72
        SCG (color)  .74   .76   .77

Figure 4: Precision-recall curves of SCG vs gPb on BSDS500, for grayscale and color images. We make a substantial step beyond the current state of the art toward reaching human-level accuracy (green dot). (Axes: recall, precision.)

Looking under the Hood. We empirically analyze a number of settings in our Sparse Code Gradients. In particular, we want to understand how the choices in the local sparse coding affect contour classification. Fig. 3 shows the effects of multi-scale pooling, dictionary size, and sparsity level (K). The numbers reported are intermediate results, namely the mean of the average precision of four oriented gradient classifiers (0, 45, 90, 135 degrees) on sampled locations (grayscale unless otherwise noted, on validation). As a reference, the average precision of gPb on this task is 0.878.

For multi-scale pooling, the single best scale for the half-disc filter is about 4x8, consistent with the settings in gPb. For accumulated scales (using all the scales from the smallest up to the current level), the accuracy continues to increase and does not seem to be saturated, suggesting the use of larger scales. The dictionary size has a minor impact, and there is a small (yet observable) benefit to using dictionaries larger than 75, particularly for diagonal orientations (45 and 135 degrees). The sparsity level K is a more intriguing issue. In Fig. 3(c), we see that for grayscale only, K = 1 (normalized nearest neighbor) does quite well; on the other hand, color needs a larger K, possibly because ab is a nonlinear space. When combining grayscale and color, it seems that we want K to be at least 3. It also varies with orientation: horizontal and vertical edges require a smaller K than diagonal edges. (If using K = 1, our final F-measure on BSDS500 is 0.730.)

We also empirically evaluate
the double power transform vs single power transform vs no transform. With no transform, the average precision is 0.865. With a single power transform, the best choice of the exponent is around 0.4, with average precision 0.884. A double power transform (with exponents 0.25 and 0.75, which can be computed through sqrt) improves the average precision to 0.900, which translates to a large improvement in contour detection accuracy.

Table 2: F-measure evaluation comparing our SCG approach to gPb on two additional image datasets with contour groundtruth: MSRC2 [26] and PASCAL2008 [6].

MSRC2        ODS   OIS   AP
gPb          .37   .39   .22
SCG          .43   .43   .33

PASCAL2008   ODS   OIS   AP
gPb          .34   .38   .20
SCG          .37   .41   .27

Table 3: F-measure evaluation on RGB-D contour detection using the NYU dataset (v2) [27]. We compare to gPb using color image only, depth only, as well as color+depth.

RGB-D (NYU v2)   ODS   OIS   AP
gPb (color)      .51   .52   .37
SCG (color)      .55   .57   .46
gPb (depth)      .44   .46   .28
SCG (depth)      .53   .54   .45
gPb (RGB-D)      .53   .54   .40
SCG (RGB-D)      .62   .63   .54

Figure 5: Examples from the BSDS500 dataset [2]. (Top) Image; (Middle) gPb output; (Bottom) SCG output (this work). Our SCG operator learns to preserve fine details (e.g. windmills, faces, fish fins) while at the same time achieving higher precision on large-scale contours (e.g. back of zebras). (Contours are shown in double width for the sake of visualization.)

Image Benchmarking Results. In Table 1 and Fig. 4 we show the precision-recall of our Sparse Code Gradients vs gPb on the BSDS500 benchmark. We conduct four sets of experiments, using color or grayscale images, with or without the globalization component (for which we use exactly the same setup as in gPb). Using Sparse Code Gradients leads to a significant improvement in accuracy in all four cases. The local version of our SCG operator, i.e. only using local contrast, is already better (F = 0.72) than gPb with globalization (F = 0.71). The full version, local SCG plus spectral gradient (computed from local SCG), reaches an F-measure of 0.739, a large step forward from gPb, as seen in the precision-recall curves in Fig. 4. On BSDS300, our F-measure is 0.715.

We observe that SCG seems to pick up fine-scale details much better than gPb, hence the much higher recall rate, while maintaining higher precision over the entire range. This can be seen in the examples shown in Fig. 5. While our scale range is similar to that of gPb, the multi-scale pooling scheme allows the flexibility of learning the balance of scales separately for each code word, which may help in detecting the details. The supplemental material contains more comparison examples.

In Table 2 we show the benchmarking results for two additional datasets, MSRC2 and PASCAL2008. Again we observe large improvements in accuracy, in spite of the somewhat different natures of the scenes in these datasets. The improvement on MSRC2 is much larger, partly because the images are smaller, hence the contours are smaller in scale and may be over-smoothed in gPb.
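The remark that the exponents 0.25 and 0.75 "can be computed through sqrt" follows from x^0.25 = sqrt(sqrt(x)) and x^0.75 = sqrt(x) · sqrt(sqrt(x)). The sketch below illustrates that identity on a plain Python list; the function name `double_power_transform` is hypothetical, and the code is an illustration under the stated assumption of non-negative features rather than the authors' implementation.

```python
import math


def double_power_transform(features):
    """Concatenate x**0.25 and x**0.75 for non-negative features.

    Only square roots are used: x**0.25 = sqrt(sqrt(x)) and
    x**0.75 = sqrt(x) * sqrt(sqrt(x)).
    """
    quarter_powers = []
    three_quarter_powers = []
    for x in features:
        if x < 0:
            raise ValueError("features must be non-negative")
        root = math.sqrt(x)            # x ** 0.5
        fourth_root = math.sqrt(root)  # x ** 0.25
        quarter_powers.append(fourth_root)
        three_quarter_powers.append(root * fourth_root)  # x ** 0.75
    # The output is twice as long as the input.
    return quarter_powers + three_quarter_powers


if __name__ == "__main__":
    print(double_power_transform([0.0, 1.0, 16.0]))
```

Because the two transformed copies are concatenated, the feature length doubles, and a linear classifier can weight the two exponents differently in each dimension.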
As for computational cost, using integral images, local SCG takes ∼100 seconds to compute on a single-thread Intel Core i5-2500 CPU on a BSDS image. It is slower than but comparable to the highly optimized multi-thread C++ implementation of gPb (∼60 seconds).

Figure 6: Examples of RGB-D contour detection on the NYU dataset (v2) [27]. The five panels are: input image, input depth, image-only contours, depth-only contours, and color+depth contours. Color is good at picking up details such as photos on the wall, and depth is useful where color is uniform (e.g. corner of a room, row 1) or illumination is poor (e.g. chair, row 2).

RGB-D Contour Detection. We use the second version of the NYU Depth Dataset [27], which has higher quality groundtruth than the first version. A median filter is applied to remove double contours (boundaries from two adjacent regions) within 3 pixels. For the RGB-D baseline, we use a simple adaptation of gPb: the depth values are in meters and used directly as a grayscale image in gPb gradient computation. We use a linear combination to put (soft) color and depth gradients together in gPb before non-max suppression, with the weight set from validation.

Table 3 lists the precision-recall evaluations of SCG vs gPb for RGB-D contour detection. All the SCG settings (such as scales and dictionary sizes) are kept the same as for BSDS. SCG again outperforms gPb in all the cases. In particular, we are much better for depth-only contours, for which gPb is not designed. Our approach learns the low-level representations of depth data fully automatically and does not require any manual tweaking. We also achieve a much larger boost by combining color and depth, demonstrating that color and depth channels contain complementary information and are both critical for RGB-D contour detection. Qualitatively, it is easy to see that RGB-D combines the strengths of color and depth and is a promising direction for contour and segmentation tasks and indoor scene analysis in general [22]. Fig. 6 shows a few examples of RGB-D contours from
our SCG operator. There are plenty of such cases where color alone or depth alone would fail to extract contours for meaningful parts of the scenes, and color+depth would succeed.

5 Discussions

In this work we successfully showed how to learn and code local representations to extract contours in natural images. Our approach combined the proven concept of oriented gradients with powerful representations that are automatically learned through sparse coding. Sparse Code Gradients (SCG) performed significantly better than hand-designed features that were in use for a decade, and pushed contour detection much closer to human-level accuracy as illustrated on the BSDS500 benchmark.

Compared to hand-designed features (e.g. Global Pb [2]), we maintain the high dimensional representation from pooling oriented neighborhoods and do not collapse them prematurely (such as computing chi-square distance at each scale). This passes a richer set of information into learning contour classification, where a double power transform effectively codes the features for linear classification. Compared to previous learning approaches (e.g. discriminative dictionaries in [16]), our uses of multi-scale pooling and oriented gradients lead to much higher classification accuracies.

Our work opens up future possibilities for learning contour detection and segmentation. As we illustrated, there is a lot of information locally that is waiting to be extracted, and a learning approach such as sparse coding provides a principled way to do so, where rich representations can be automatically constructed and adapted. This is particularly important for novel sensor data such as RGB-D, for which we have less understanding but increasingly more need.
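The pooling and normalized-difference steps of Section 3.2 (Eqs. 2 and 3) can be sketched in a few lines of Python. This is a minimal illustration on lists of per-pixel sparse codes, not the integral-image implementation described in the paper. Three points are assumptions: the function names are hypothetical, the norm is taken to be Euclidean (the text only says "norms"), and a dimension with no non-zero entries in a half-disc is pooled to 0.

```python
import math


def pool_half_disc(codes):
    """Eq. (2): per-dimension average of |x_ij| over non-zero entries only."""
    dim = len(codes[0])
    pooled = []
    for j in range(dim):
        values = [abs(code[j]) for code in codes if code[j] != 0]
        # Assumption: a dimension that never fires in this half-disc pools to 0.
        pooled.append(sum(values) / len(values) if values else 0.0)
    return pooled


def l2_norm(vector):
    """Euclidean norm (assumed; the norm type is not stated in the text)."""
    return math.sqrt(sum(v * v for v in vector))


def normalized_gradient(codes_a, codes_b, eps=0.5):
    """Eq. (3) followed by division by ||F(N_a)|| + ||F(N_b)|| + eps."""
    f_a = pool_half_disc(codes_a)
    f_b = pool_half_disc(codes_b)
    scale = l2_norm(f_a) + l2_norm(f_b) + eps
    return [abs(a - b) / scale for a, b in zip(f_a, f_b)]


if __name__ == "__main__":
    half_disc_a = [[1.0, 0.0], [3.0, 0.0]]  # two pixels, 2-dim sparse codes
    half_disc_b = [[0.0, 2.0], [0.0, 0.0]]
    print(normalized_gradient(half_disc_a, half_disc_b))
```

The default `eps=0.5` is the value the text reports as working better than 0; the full feature of Eq. (4) would concatenate such vectors over all scales together with the pooled union terms.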

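CVPR 2013 Summary

The CVPR 2013 results came out not long ago. First, congratulations to a junior labmate of mine, who has already graduated and started working, on getting a paper accepted.

The complete paper list has been published on the CVPR home page (). Today I am collecting some of the papers I find interesting. Most of the download links are not available yet, but the list is still a way to follow the latest developments.

I will add the download links one by one as they appear.

Since there are no download links, I can only guess each paper's content from its title and authors.

Some guesses will inevitably be off; I will correct them after reading the papers.

Saliency

Saliency Aggregation: A Data-driven Approach. Long Mai, Yuzhen Niu, Feng Liu. I have not found any related material yet; it is probably saliency detection by adaptive fusion of multiple cues.

PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors. Keyang Shi, Keze Wang, Jiangbo Lu, Liang Lin. Neither of the two cues here looks new, so the strength is probably the aggregation framework. It is also pixel-level, so it can probably reach results comparable to segmentation or matting.

Looking Beyond the Image: Unsupervised Learning for Object Saliency and Detection. Parthipan Siva, Chris Russell, Tao Xiang, … Learning-based saliency detection.

Learning video saliency from human gaze using candidate selection. …, Dan Goldman, Eli Shechtman, Lihi Zelnik-Manor. This one is on video saliency; it probably selects salient objects in video.

Hierarchical Saliency Detection. Qiong Yan, Li Xu, Jianping Shi, Jiaya Jia. Jiaya Jia's students have started working on saliency as well; a multi-scale method.

Saliency Detection via Graph-Based Manifold Ranking. Chuan Yang, Lihe Zhang, Huchuan Lu, Ming-Hsuan Yang, Xiang Ruan. This probably extends the classic graph-based saliency method and likely uses saliency propagation.

Salient object detection: a discriminative regional feature integration approach. …, Jingdong Wang, Zejian Yuan, …, Nanning Zheng. A saliency detection method based on adaptive fusion of multiple features.

Submodular Salient Region Detection. …, Larry Davis. Another paper from a big-name group; the formulation is also novel, using submodularity.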

An Improved Isolated-Word Speech Recognition Method Based on LP Cepstral Features

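… the tedious handling of phase wrapping in …. Compared with the FFT complex cepstrum features, the LP cepstral features yield a spectral envelope that reproduces the spectral peaks better, at only half the computational cost of the latter. Building on LPCC features, this paper takes the Mel frequency scale into account and combines the two …

1 Speech recognition flow

Figure 1 shows the speech recognition flow.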
Journal of Taiyuan University of Technology, Vol. 37, No. 5, September 2006
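… filter is used to implement pre-emphasis, generally a first-order F… … a set of parameters, obtained when speech is analysed with the …P… method, describing certain correlation properties between adjacent speech samples. Linear prediction analysis is based on the following basic concept: a speech sample can be … by past …

Article ID: 07932(0)5008010—42060—5—3

Hou Xuemei, Zhang Xueying, Zhao Gaofeng
(College of Information Engineering, Taiyuan University of Technology, Taiyuan, Shanxi 030024)

… system's perception experiments on frequency and amplitude; speech features extracted at this scale better match the auditory characteristics of the human ear. The authors combine the Mel frequency with the LP cepstrum to form LP Mel cepstral coefficients (LPMCC) as the feature parameters, and use a radial basis function (RBF) neural …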

A survey of content based 3d shape retrieval methods

A Survey of Content Based 3D Shape Retrieval Methods

Johan W.H. Tangelder and Remco C. Veltkamp
Institute of Information and Computing Sciences, Utrecht University
hanst@cs.uu.nl, Remco.Veltkamp@cs.uu.nl

Abstract

Recent developments in techniques for modeling, digitizing and visualizing 3D shapes have led to an explosion in the number of available 3D models on the Internet and in domain-specific databases. This has led to the development of 3D shape retrieval systems that, given a query object, retrieve similar 3D objects. For visualization, 3D shapes are often represented as a surface, in particular polygonal meshes, for example in VRML format. Often these models contain holes, intersecting polygons, are not manifold, and do not enclose a volume unambiguously. By contrast, 3D volume models, such as solid models produced by CAD systems, or voxel models, enclose a volume properly. This paper surveys the literature on methods for content based 3D retrieval, taking into account the applicability to surface models as well as to volume models. The methods are evaluated with respect to several requirements of content based 3D shape retrieval, such as: (1) shape representation requirements, (2) properties of dissimilarity measures, (3) efficiency, (4) discrimination abilities, (5) ability to perform partial matching, (6) robustness, and (7) necessity of pose normalization. Finally, the advantages and limits of the several approaches in content based 3D shape retrieval are discussed.

1. Introduction

The advancement of modeling, digitizing and visualizing techniques for 3D shapes has led to an increasing amount of 3D models, both on the Internet and in domain-specific databases. This has led to the development of the first experimental search engines for 3D shapes, such as the 3D model search engine at Princeton University [2,57], the 3D model retrieval system at the National Taiwan University [1,17], the Ogden IV system at the National Institute of Multimedia Education, Japan [62,77], the 3D retrieval engine at Utrecht
University [4,78], and the 3D model similarity search engine at the University of Konstanz [3,84].

Laser scanning has been applied to obtain archives recording cultural heritage like the Digital Michelangelo Project [25,48], and the Stanford Digital Formae Urbis Romae Project [75]. Furthermore, archives containing domain-specific shape models are now accessible by the Internet. Examples are the National Design Repository, an online repository of CAD models [59,68], and the Protein Data Bank, an online archive of structural data of biological macromolecules [10,80].

Unlike text documents, 3D models are not easily retrieved. Attempting to find a 3D model using textual annotation and a conventional text-based search engine would not work in many cases. The annotations added by human beings depend on language, culture, age, sex, and other factors. They may be too limited or ambiguous. In contrast, content based 3D shape retrieval methods, which use shape properties of the 3D models to search for similar models, work better than text based methods [58].

Matching is the process of determining how similar two shapes are. This is often done by computing a distance. A complementary process is indexing. In this paper, indexing is understood as the process of building a data structure to speed up the search. Note that the term indexing is also often used for the identification of features in models, or multimedia documents in general. Retrieval is the process of searching and delivering the query results. Matching and indexing are often part of the retrieval process.

Recently, a lot of researchers have investigated the specific problem of content based 3D shape retrieval. Also, an extensive amount of literature can be found in the related fields of computer vision, object recognition and geometric modelling. Survey papers on this literature have been provided by Besl and Jain [11], Loncaric [50] and Campbell and Flynn [16]. For an overview of 2D shape matching methods we refer the reader to the paper by Veltkamp
[82]. Unfortunately, most 2D methods do not generalize directly to 3D model matching. Work in progress by Iyer et al. [40] provides an extensive overview of 3D shape searching techniques. Atmosukarto and Naval [6] describe a number of 3D model retrieval systems and methods, but do not provide a categorization and evaluation.

In contrast, this paper evaluates 3D shape retrieval methods with respect to several requirements on content based 3D shape retrieval, such as: (1) shape representation requirements, (2) properties of dissimilarity measures, (3) efficiency, (4) discrimination abilities, (5) ability to perform partial matching, (6) robustness, and (7) necessity of pose normalization. In section 2 we discuss several aspects of 3D shape retrieval. The literature on 3D shape matching methods is discussed in section 3 and evaluated in section 4.

2. 3D shape retrieval aspects

In this section we discuss several issues related to 3D shape retrieval.

2.1. 3D shape retrieval framework

At a conceptual level, a typical 3D shape retrieval framework as illustrated by fig. 1 consists of a database with an index structure created offline and an online query engine. Each 3D model has to be identified with a shape descriptor, providing a compact overall description of the shape. To efficiently search a large collection online, an indexing data structure and searching algorithm should be available.
The online query engine computes the query descriptor,and models similar to the query model are retrieved by match-ing descriptors to the query descriptor from the index struc-ture of the database.The similarity between two descriptors is quantiﬁed by a dissimilarity measure.Three approaches can be distinguished to provide a query object:(1)browsing to select a new query object from the obtained results,(2) a direct query by providing a query descriptor,(3)query by example by providing an existing3D model or by creating a3D shape query from scratch using a3D tool or sketch-ing2D projections of the3D model.Finally,the retrieved models can be visualized.2.2.Shape representationsAn important issue is the type of shape representation(s) that a shape retrieval system accepts.Most of the3D models found on the World Wide Web are meshes deﬁned in aﬁle format supporting visual appearance.Currently,the most common format used for this purpose is the Virtual Real-ity Modeling Language(VRML)format.Since these mod-els have been designed for visualization,they often contain only geometry and appearance attributes.In particular,they are represented by“polygon soups”,consisting of unorga-nized sets of polygons.Also,in general these models are not“watertight”meshes,i.e.they do not enclose a volume. 
By contrast, for volume models, retrieval methods that depend on a properly defined volume can be applied.

2.3. Measuring similarity

In order to measure how similar two objects are, it is necessary to compute distances between pairs of descriptors using a dissimilarity measure. Although the term similarity is often used, dissimilarity corresponds to the notion of distance: a small distance means small dissimilarity and large similarity.

A dissimilarity measure can be formalized by a function defined on pairs of descriptors indicating the degree of their resemblance. Formally speaking, a dissimilarity measure d on a set S is a non-negative valued function d: S × S → R+ ∪ {0}. Function d may have some of the following properties:

i. Identity: For all x ∈ S, d(x, x) = 0.
ii. Positivity: For all x ≠ y in S, d(x, y) > 0.
iii. Symmetry: For all x, y ∈ S, d(x, y) = d(y, x).
iv. Triangle inequality: For all x, y, z ∈ S, d(x, z) ≤ d(x, y) + d(y, z).
v. Transformation invariance: For a chosen transformation group G, for all x, y ∈ S, g ∈ G, d(g(x), g(y)) = d(x, y).

The identity property says that a shape is completely similar to itself, while the positivity property claims that different shapes are never completely similar. Positivity is a very strong requirement for a high-level shape descriptor, and is often not satisfied. However, this is not a severe drawback if the loss of uniqueness depends on negligible details.

Symmetry is not always wanted. Indeed, human perception does not always find that shape x is as similar to shape y as y is to x. In particular, a variant x of a prototype y is often found to be more similar to y than vice versa [81].

Dissimilarity measures for partial matching, giving a small distance d(x, y) if a part of x matches a part of y, do not obey the triangle inequality.

Transformation invariance has to be satisfied if the comparison and the extraction of shape descriptors have to be independent of the place, orientation and scale of the object in its Cartesian coordinate system. If we want a dissimilarity measure to be unaffected by any transformation on x, then we may use the following alternative formulation of (v): Transformation invariance: For a chosen transformation group G, for all x, y ∈ S, g ∈ G, d(g(x), y) = d(x, y).

When all the properties (i)-(iv) hold, the dissimilarity measure is called a metric. Other combinations are possible: a pseudo-metric is a dissimilarity measure that obeys (i), (iii) and (iv), while a semi-metric obeys only (i), (ii) and (iii). If a dissimilarity measure is a pseudo-metric, the triangle inequality can be applied to make retrieval more efficient [7, 83].

2.4. Efficiency

For large shape collections, it is inefficient to sequentially match all objects in the database with the query object. Because retrieval should be fast, efficient indexing and search structures are needed. Since for query by example the shape descriptor is computed online, it is reasonable to require that the shape descriptor computation is fast enough for interactive querying.

2.5. Discriminative power

A shape descriptor should capture properties that discriminate objects well. However, the judgement of the similarity of the shapes of two 3D objects is somewhat subjective, depending on the user preference or the application at hand. For example, in solid modeling applications topological properties such as the number of holes in a model are often more important than minor differences in shape. By contrast, if a user searches for models that look visually similar, the existence of a small hole in the model may be of no importance to the user.

2.6. Partial matching

In contrast to global shape matching, partial matching finds a shape of which a part is similar to a part of another shape. Partial matching can be applied if 3D shape models are not complete, e.g. for objects obtained by laser scanning from one or two directions only. Another application is the search for "3D scenes" containing an instance of the query object. Also, this feature can potentially give the user flexibility towards the matching problem, if parts of interest of an
object can be selected or weighted by the user.

2.7. Robustness

It is often desirable that a shape descriptor is insensitive to noise and small extra features, and robust against arbitrary topological degeneracies, e.g. if the model is obtained by laser scanning. Also, if a model is given in multiple levels of detail, representations of different levels should not differ significantly from the original model.

2.8. Pose normalization

In the absence of prior knowledge, 3D models have arbitrary scale, orientation and position in 3D space. Because not all dissimilarity measures are invariant under rotation and translation, it may be necessary to place the 3D models into a canonical coordinate system. This should be the same for a translated, rotated or scaled copy of the model.

A natural choice is to first translate the center to the origin. For volume models it is natural to translate the center of mass to the origin. But for meshes this is in general not possible, because they need not enclose a volume. For meshes, an alternative is to translate the center of mass of all the faces to the origin. For example, the Principal Component Analysis (PCA) method computes for each model the principal axes of inertia e1, e2 and e3 and their eigenvalues λ1, λ2 and λ3, and imposes the conditions necessary to obtain a right-handed coordinate system. These principal axes define an orthogonal coordinate system (e1, e2, e3), with λ1 ≥ λ2 ≥ λ3. Next, the polyhedral model is rotated around the origin such that the coordinate system (ex, ey, ez) coincides with the coordinate system (e1, e2, e3).

The PCA algorithm for pose estimation is fairly simple and efficient. However, if the eigenvalues are equal, principal axes may switch without affecting the eigenvalues. Similar eigenvalues may imply an almost symmetrical mass distribution around an axis (e.g. nearly cylindrical shapes) or around the center of mass (e.g. nearly spherical shapes).
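The translation and rotation steps of PCA-based pose normalization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any cited paper: the model is treated as a plain point set (no face-area weighting), scaling is omitted, and the principal axes are found by power iteration with deflation, a technique chosen here only to keep the example dependency-free. All function names are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]

def covariance(points):
    """3x3 covariance matrix of points that are already centered."""
    n = len(points)
    return [[sum(p[i] * p[j] for p in points) / n for j in range(3)]
            for i in range(3)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def unit(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def dominant_axis(m, iterations=500):
    """Power iteration: (eigenvalue, eigenvector) for the largest eigenvalue.
    Assumes that eigenvalue is non-zero and separated from the others."""
    v = unit([1.0, 2.0, 3.0])  # arbitrary start vector (assumption)
    for _ in range(iterations):
        v = unit(mat_vec(m, v))
    mv = mat_vec(m, v)
    return sum(v[i] * mv[i] for i in range(3)), v

def pose_normalize(points):
    """Translate the centroid to the origin, then rotate onto the principal axes."""
    c = centroid(points)
    centered = [[p[i] - c[i] for i in range(3)] for p in points]
    cov = covariance(centered)
    lam1, e1 = dominant_axis(cov)
    # Deflation: remove the e1 component so the next power iteration finds e2.
    deflated = [[cov[i][j] - lam1 * e1[i] * e1[j] for j in range(3)]
                for i in range(3)]
    _, e2 = dominant_axis(deflated)
    # e3 = e1 x e2 makes (e1, e2, e3) a right-handed system.
    e3 = [e1[1] * e2[2] - e1[2] * e2[1],
          e1[2] * e2[0] - e1[0] * e2[2],
          e1[0] * e2[1] - e1[1] * e2[0]]
    axes = (e1, e2, e3)
    return [[sum(p[k] * axis[k] for k in range(3)) for axis in axes]
            for p in centered]
```

When two eigenvalues are nearly equal, the axis returned by power iteration depends on the start vector, which is the same ambiguity described above for nearly cylindrical or nearly spherical shapes.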
Fig.2illustrates the problem.3.Shape matching methodsIn this section we discuss3D shape matching methods. We divide shape matching methods in three broad cate-gories:(1)feature based methods,(2)graph based meth-ods and(3)other methods.Fig.3illustrates a more detailed categorization of shape matching methods.Note,that the classes of these methods are not completely disjoined.For instance,a graph-based shape descriptor,in some way,de-scribes also the global feature distribution.By this point of view the taxonomy should be a graph.3.1.Feature based methodsIn the context of3D shape matching,features denote ge-ometric and topological properties of3D shapes.So3D shapes can be discriminated by measuring and comparing their features.Feature based methods can be divided into four categories according to the type of shape features used: (1)global features,(2)global feature distributions,(3)spa-tial maps,and(4)local features.Feature based methods from theﬁrst three categories represent features of a shape using a single descriptor consisting of a d-dimensional vec-tor of values,where the dimension d isﬁxed for all shapes.The value of d can easily be a few hundred.The descriptor of a shape is a point in a high dimensional space,and two shapes are considered to be similar if they are close in this space.Retrieving the k best matches for a3D query model is equivalent to solving the k nearest neighbors -ing the Euclidean distance,matching feature descriptors can be done efﬁciently in practice by searching in multiple1D spaces to solve the approximate k nearest neighbor prob-lem as shown by Indyk and Motwani[36].In contrast with the feature based methods from theﬁrst three categories,lo-cal feature based methods describe for a number of surface points the3D shape around the point.For this purpose,for each surface point a descriptor is used instead of a single de-scriptor.3.1.1.Global feature based similarityGlobal features characterize the global shape of a3D model. 
Examples of these features are the statistical moments of the boundary or the volume of the model,volume-to-surface ra-tio,or the Fourier transform of the volume or the boundary of the shape.Zhang and Chen[88]describe methods to com-pute global features such as volume,area,statistical mo-ments,and Fourier transform coefﬁcients efﬁciently.Paquet et al.[67]apply bounding boxes,cords-based, moments-based and wavelets-based descriptors for3D shape matching.Corney et al.[21]introduce convex-hull based indices like hull crumpliness(the ratio of the object surface area and the surface area of its convex hull),hull packing(the percentage of the convex hull volume not occupied by the object),and hull compactness(the ratio of the cubed sur-face area of the hull and the squared volume of the convex hull).Kazhdan et al.[42]describe a reﬂective symmetry de-scriptor as a2D function associating a measure of reﬂec-tive symmetry to every plane(speciﬁed by2parameters) through the model’s centroid.Every function value provides a measure of global shape,where peaks correspond to the planes near reﬂective symmetry,and valleys correspond to the planes of near anti-symmetry.Their experimental results show that the combination of the reﬂective symmetry de-scriptor with existing methods provides better results.Since only global features are used to characterize the overall shape of the objects,these methods are not very dis-criminative about object details,but their implementation is straightforward.Therefore,these methods can be used as an activeﬁlter,after which more detailed comparisons can be made,or they can be used in combination with other meth-ods to improve results.Global feature methods are able to support user feed-back as illustrated by the following research.Zhang and Chen[89]applied features such as volume-surface ratio, moment invariants and Fourier transform coefﬁcients for 3D shape retrieval.They improve the retrieval performance by an active learning phase in which a 
human annotator as-signs attributes such as airplane,car,body,and so on to a number of sample models.Elad et al.[28]use a moments-based classiﬁer and a weighted Euclidean distance measure. Their method supports iterative and interactive database searching where the user can improve the weights of the distance measure by marking relevant search results.3.1.2.Global feature distribution based similarityThe concept of global feature based similarity has been re-ﬁned recently by comparing distributions of global features instead of the global features directly.Osada et al.[66]introduce and compare shape distribu-tions,which measure properties based on distance,angle, area and volume measurements between random surface points.They evaluate the similarity between the objects us-ing a pseudo-metric that measures distances between distri-butions.In their experiments the D2shape distribution mea-suring distances between random surface points is most ef-fective.Ohbuchi et al.[64]investigate shape histograms that are discretely parameterized along the principal axes of inertia of the model.The shape descriptor consists of three shape histograms:(1)the moment of inertia about the axis,(2) the average distance from the surface to the axis,and(3) the variance of the distance from the surface to the axis. 
Their experiments show that the axis-parameterized shape features work only well for shapes having some form of ro-tational symmetry.Ip et al.[37]investigate the application of shape distri-butions in the context of CAD and solid modeling.They re-ﬁned Osada’s D2shape distribution function by classifying2random points as1)IN distances if the line segment con-necting the points lies complete inside the model,2)OUT distances if the line segment connecting the points lies com-plete outside the model,3)MIXED distances if the line seg-ment connecting the points lies passes both inside and out-side the model.Their dissimilarity measure is a weighted distance measure comparing D2,IN,OUT and MIXED dis-tributions.Since their method requires that a line segment can be classiﬁed as lying inside or outside the model it is required that the model deﬁnes a volume properly.There-fore it can be applied to volume models,but not to polyg-onal soups.Recently,Ip et al.[38]extend this approach with a technique to automatically categorize a large model database,given a categorization on a number of training ex-amples from the database.Ohbuchi et al.[63],investigate another extension of the D2shape distribution function,called the Absolute Angle-Distance histogram,parameterized by a parameter denot-ing the distance between two random points and by a pa-rameter denoting the angle between the surfaces on which two random points are located.The latter parameter is ac-tually computed as an inner product of the surface normal vectors.In their evaluation experiment this shape distribu-tion function outperformed the D2distribution function at about1.5times higher computational costs.Ohbuchi et al.[65]improved this method further by a multi-resolution ap-proach computing a number of alpha-shapes at different scales,and computing for each alpha-shape their Absolute Angle-Distance descriptor.Their experimental results show that this approach outperforms the Angle-Distance descrip-tor at the cost of 
high processing time needed to compute the alpha-shapes.Shape distributions distinguish models in broad cate-gories very well:aircraft,boats,people,animals,etc.How-ever,they perform often poorly when having to discrimi-nate between shapes that have similar gross shape proper-ties but vastly different detailed shape properties.3.1.3.Spatial map based similaritySpatial maps are representations that capture the spatial lo-cation of an object.The map entries correspond to physi-cal locations or sections of the object,and are arranged in a manner that preserves the relative positions of the features in an object.Spatial maps are in general not invariant to ro-tations,except for specially designed maps.Therefore,typ-ically a pose normalization is doneﬁrst.Ankerst et al.[5]use shape histograms as a means of an-alyzing the similarity of3D molecular surfaces.The his-tograms are not built from volume elements but from uni-formly distributed surface points taken from the molecular surfaces.The shape histograms are deﬁned on concentric shells and sectors around a model’s centroid and compare shapes using a quadratic form distance measure to compare the histograms taking into account the distances between the shape histogram bins.Vrani´c et al.[85]describe a surface by associating to each ray from the origin,the value equal to the distance to the last point of intersection of the model with the ray and compute spherical harmonics for this spherical extent func-tion.Spherical harmonics form a Fourier basis on a sphere much like the familiar sine and cosine do on a line or a cir-cle.Their method requires pose normalization to provide rotational invariance.Also,Yu et al.[86]propose a descrip-tor similar to a spherical extent function and a descriptor counting the number of intersections of a ray from the ori-gin with the model.In both cases the dissimilarity between two shapes is computed by the Euclidean distance of the Fourier transforms of the descriptors of the shapes.Their 
method requires pose normalization to provide rotational in-variance.Kazhdan et al.[43]present a general approach based on spherical harmonics to transform rotation dependent shape descriptors into rotation independent ones.Their method is applicable to a shape descriptor which is deﬁned as either a collection of spherical functions or as a function on a voxel grid.In the latter case a collection of spherical functions is obtained from the function on the voxel grid by restricting the grid to concentric spheres.From the collection of spher-ical functions they compute a rotation invariant descriptor by(1)decomposing the function into its spherical harmon-ics,(2)summing the harmonics within each frequency,and computing the L2-norm for each frequency component.The resulting shape descriptor is a2D histogram indexed by ra-dius and frequency,which is invariant to rotations about the center of the mass.This approach offers an alternative for pose normalization,because their method obtains rotation invariant shape descriptors.Their experimental results show indeed that in general the performance of the obtained ro-tation independent shape descriptors is better than the cor-responding normalized descriptors.Their experiments in-clude the ray-based spherical harmonic descriptor proposed by Vrani´c et al.[85].Finally,note that their approach gen-eralizes the method to compute voxel-based spherical har-monics shape descriptor,described by Funkhouser et al.[30],which is deﬁned as a binary function on the voxel grid, where the value at each voxel is given by the negatively ex-ponentiated Euclidean Distance Transform of the surface of a3D model.Novotni and Klein[61]present a method to compute 3D Zernike descriptors from voxelized models as natural extensions of spherical harmonics based descriptors.3D Zernike descriptors capture object coherence in the radial direction as well as in the direction along a sphere.Both 3D Zernike descriptors and spherical harmonics based 
de-scriptors achieve rotation invariance.However,by sampling the space only in radial direction the latter descriptors donot capture object coherence in the radial direction,as illus-trated byﬁg.4.The limited experiments comparing spherical harmonics and3D Zernike moments performed by Novotni and Klein show similar results for a class of planes,but better results for the3D Zernike descriptor for a class of chairs.Vrani´c[84]expects that voxelization is not a good idea, because manyﬁne details are lost in the voxel grid.There-fore,he compares his ray-based spherical harmonic method [85]and a variation of it using functions deﬁned on concen-tric shells with the voxel-based spherical harmonics shape descriptor proposed by Funkhouser et al.[30].Also,Vrani´c et al.[85]accomplish pose normalization using the so-called continuous PCA algorithm.In the paper it is claimed that the continuous PCA is better as the conventional PCA and better as the weighted PCA,which takes into account the differing sizes of the triangles of a mesh.In contrast with Kazhdan’s experiments[43]the experiments by Vrani´c show that for ray-based spherical harmonics using the con-tinuous PCA without voxelization is better than using rota-tion invariant shape descriptors obtained using voxelization. 
Perhaps,these results are opposite to Kazhdan results,be-cause of the use of different methods to compute the PCA or the use of different databases or both.Kriegel et al.[46,47]investigate similarity for voxelized models.They obtain a spatial map by partitioning a voxel grid into disjoint cells which correspond to the histograms bins.They investigate three different spatial features asso-ciated with the grid cells:(1)volume features recording the fraction of voxels from the volume in each cell,(2) solid-angle features measuring the convexity of the volume boundary in each cell,(3)eigenvalue features estimating the eigenvalues obtained by the PCA applied to the voxels of the model in each cell[47],and a fourth method,using in-stead of grid cells,a moreﬂexible partition of the voxels by cover sequence features,which approximate the model by unions and differences of cuboids,each containing a number of voxels[46].Their experimental results show that the eigenvalue method and the cover sequence method out-perform the volume and solid-angle feature method.Their method requires pose normalization to provide rotational in-variance.Instead of representing a cover sequence with a single feature vector,Kriegel et al.[46]represent a cover sequence by a set of feature vectors.This approach allows an efﬁcient comparison of two cover sequences,by compar-ing the two sets of feature vectors using a minimal match-ing distance.The spatial map based approaches show good retrieval results.But a drawback of these methods is that partial matching is not supported,because they do not encode the relation between the features and parts of an object.Fur-ther,these methods provide no feedback to the user about why shapes match.3.1.4.Local feature based similarityLocal feature based methods provide various approaches to take into account the surface shape in the neighbourhood of points on the boundary of the shape.Shum et al.[74]use a spherical coordinate system to map the surface curvature 
of3D objects to the unit sphere. By searching over a spherical rotation space a distance be-tween two curvature distributions is computed and used as a measure for the similarity of two objects.Unfortunately, the method is limited to objects which contain no holes, i.e.have genus zero.Zaharia and Prˆe teux[87]describe the 3D Shape Spectrum Descriptor,which is deﬁned as the histogram of shape index values,calculated over an en-tire mesh.The shape index,ﬁrst introduced by Koenderink [44],is deﬁned as a function of the two principal curvatures on continuous surfaces.They present a method to compute these shape indices for meshes,byﬁtting a quadric surface through the centroids of the faces of a mesh.Unfortunately, their method requires a non-trivial preprocessing phase for meshes that are not topologically correct or not orientable.Chua and Jarvis[18]compute point signatures that accu-mulate surface information along a3D curve in the neigh-bourhood of a point.Johnson and Herbert[41]apply spin images that are2D histograms of the surface locations around a point.They apply spin images to recognize models in a cluttered3D scene.Due to the complexity of their rep-resentation[18,41]these methods are very difﬁcult to ap-ply to3D shape matching.Also,it is not clear how to deﬁne a dissimilarity function that satisﬁes the triangle inequality.K¨o rtgen et al.[45]apply3D shape contexts for3D shape retrieval and matching.3D shape contexts are semi-local descriptions of object shape centered at points on the sur-face of the object,and are a natural extension of2D shape contexts introduced by Belongie et al.[9]for recognition in2D images.The shape context of a point p,is deﬁned as a coarse histogram of the relative coordinates of the re-maining surface points.The bins of the histogram are de-。

Real-time image semantic segmentation based on improved BiSeNet

Optics and Precision Engineering, Vol. 31, No. 8, Apr. 2023

Real-time semantic segmentation based on improved BiSeNet

REN Fenglei1,2, YANG Lu1,2*, ZHOU Haibo1,2, ZHANG Shiyv1,2, HE Xin3, XU Wenxue4
(1. Tianjin Key Laboratory for Advanced Mechatronic System Design and Intelligent Control, School of Mechanical Engineering, Tianjin University of Technology, Tianjin 300384, China; 2. National Demonstration Center for Experimental Mechanical and Electrical Engineering Education, Tianjin University of Technology, Tianjin 300384, China; 3. Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; 4. Transcend Communication Technology Tianjin Co., Ltd, Tianjin 300384, China)
* Corresponding author, E-mail: yanglu8206@

Abstract: To improve the performance of image semantic segmentation so that it meets both the accuracy and the real-time requirements of practical applications, this study proposes a real-time semantic segmentation algorithm based on an improved BiSeNet. First, the heads of the two branches are shared, which eliminates redundant channels and parameters in BiSeNet while effectively extracting rich shallow features. Subsequently, the shared layers are split into two branches, the detail branch and the semantic branch, which extract detailed spatial information and contextual semantic information, respectively. Furthermore, both a channel attention mechanism and a spatial attention mechanism are introduced at the tail of the semantic branch to enhance the feature representation; BiSeNet is thus optimized with dual attention mechanisms to extract contextual semantic features more effectively. Finally, the features of the detail branch and the semantic branch are fused and up-sampled to the resolution of the input image to obtain the semantic segmentation. The proposed algorithm achieves 77.2% mIoU at 95.3 FPS on the Cityscapes dataset and 73.8% mIoU at 179.1 FPS on the CamVid dataset. The experiments demonstrate that the proposed algorithm achieves a good trade-off between accuracy and efficiency, and that its segmentation performance is significantly improved compared with BiSeNet and other existing algorithms.

Key words: semantic segmentation; attention mechanism; real time; deep learning

CLC number: TP394.1. Document code: A. doi: 10.37188/OPE.20233108.1217

Article ID: 1004-924X(2023)08-1217-11. Received: 2022-09-12; revised: 2022-10-01. Funding: National Natural Science Foundation of China (No. 51275209); Key Project of the Natural Science Foundation of Tianjin (No. 17JCZDJC30400); Key-Area Research and Development Program of Guangdong Province (No. 2019B090922002).

1 Introduction

Image semantic segmentation, an important technique in computer vision, aims to classify every image pixel into its semantic category. It is very widely used in autonomous driving, medical detection, robot navigation, scene parsing, human-computer interaction and other fields [1-5].

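Semantic segmentation combined with depth estimation

In recent years, with the rapid development of artificial intelligence, semantic segmentation and depth estimation have become popular research directions in computer vision.

Semantic segmentation assigns every pixel in an image to a specific semantic category, while depth estimation analyzes an image to obtain the depth of every pixel in the scene.

Combining the two techniques provides more accurate and more complete information for image understanding and its applications.

Semantic segmentation is an important task in computer vision: it labels every pixel of an image with a semantic category such as person, car or tree.

Traditional semantic segmentation methods classify pixels mainly from features such as image texture, color and shape, but this approach is quite limited and cannot reliably recognize complex scenes.

The rise of deep learning brought a breakthrough for semantic segmentation.

By building deep neural networks, deep learning learns feature representations of images automatically and optimizes the parameters on large amounts of training data, which yields efficient and accurate semantic segmentation.

Deep learning models such as FCN, U-Net and DeepLab have been highly successful at semantic segmentation.

These models extract image features with convolutional neural networks and combine them with context information to classify the image at the pixel level.

However, semantic segmentation alone cannot provide three-dimensional information about the objects in an image; this requires depth estimation.

The goal of depth estimation is to infer the depth value of every pixel by analyzing cues such as texture, projective relations and color in the image.

Depth estimation can be used in 3D reconstruction, virtual reality, autonomous driving and other fields, providing more precise scene understanding and perception.

Combining semantic segmentation and depth estimation lets the two tasks reinforce each other, improving the accuracy and efficiency of image understanding.

First, semantic segmentation can supply context information to depth estimation and improve the quality of the depth map.

It can help the depth estimation network separate the boundaries of different objects and avoid blur and errors in the estimated depth.

Second, depth estimation can supply spatial information to semantic segmentation and improve its precision.

It can help the segmentation network better understand the spatial position and size of objects, improving the accuracy and robustness of the segmentation.

To combine semantic segmentation and depth estimation, researchers have proposed a range of methods and models.

A common approach is to train depth estimation as an auxiliary task of semantic segmentation.

By training the depth estimation and semantic segmentation networks jointly, the feature representations of the two tasks can be shared and fused, improving the performance of the overall model.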
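One common way to train a main task with an auxiliary task is to minimize a weighted sum of the two losses, so that gradients from both tasks reach the shared features. The sketch below shows only that combined objective for a handful of pixels. The text above does not name specific losses, so the per-pixel cross-entropy, the mean absolute depth error, the weight `depth_weight` and all function names are assumptions made for illustration.

```python
import math

def cross_entropy(class_probs, label):
    """Negative log-likelihood of the true class for one pixel."""
    return -math.log(class_probs[label])

def joint_loss(seg_probs, seg_labels, depth_pred, depth_true, depth_weight=0.5):
    """Mean segmentation loss plus a weighted mean absolute depth error.

    seg_probs:  per-pixel lists of class probabilities
    seg_labels: per-pixel true class indices
    depth_pred, depth_true: per-pixel depth values
    depth_weight: hypothetical weight of the auxiliary depth task
    """
    n = len(seg_labels)
    seg = sum(cross_entropy(p, y) for p, y in zip(seg_probs, seg_labels)) / n
    depth = sum(abs(a - b) for a, b in zip(depth_pred, depth_true)) / n
    return seg + depth_weight * depth
```

Setting `depth_weight` to zero recovers plain segmentation training; larger values make the shared features serve depth estimation more strongly.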

Audio retrieval based on fractal features


Real-time image semantic segmentation based on attention and multi-label classification

Journal of Computer-Aided Design & Computer Graphics, Vol. 33, No. 1, Jan. 2021

Real-Time Image Semantic Segmentation Based on Attention Mechanism and Multi-Label Classification

Gao Xiang, Li Chungeng*, and An Jubai
(College of Information Sciences and Technology, Dalian Maritime University, Dalian 116026)
(********************.cn)

CLC number: TP391.4. DOI: 10.3724/SP.J.1089.2021.18233

Abstract: Many current real-time semantic segmentation algorithms have low accuracy, especially for blurred boundary pixels. We propose a high-precision real-time semantic segmentation algorithm based on a cross-level attention mechanism and multi-label classification. The procedure starts by optimizing DeepLabv3 to reach real-time segmentation speed. A cross-level attention module is then added so that the high-level features provide pixel-level attention for the low-level features, which suppresses the output of inaccurate semantic information in the low-level features. In the training phase, a multi-label classification loss function is introduced to assist the supervised training. The experimental results on the Cityscapes and CamVid datasets show that the segmentation accuracy is 68.1% and 74.1%, respectively, and the segmentation speed is 42 frames/s and 89 frames/s, respectively.
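It achieves a good balance between segmentation speed and accuracy, improves edge segmentation, and is robust in complex scene segmentation.

Key words: convolutional neural networks; real-time semantic segmentation; multi-label classification; cross-level attention mechanism

Received: 2020-02-16; revised: 2020-05-23. Funding: National Natural Science Foundation of China (61471079). Gao Xiang (born 1994), female, master's student, main research interest: deep-learning image semantic segmentation; Li Chungeng (born 1969), male, Ph.D., associate professor, master's supervisor, corresponding author, main research interests: digital image processing and video-based moving-target tracking; An Jubai (born 1958), male, Ph.D., professor, doctoral supervisor, main research interests: pattern recognition and marine remote-sensing image analysis.

Image semantic segmentation is an important computer vision technique. Compared with image classification and object detection, it is a finer-grained, pixel-level classification technique [1]. It is cheap to implement and easy to deploy in production environments, so in fields such as autonomous driving and robot vision [2-3] it is often used in perception systems for drivable areas. These applications place high demands on fast interaction or response speed.

Reference [4] proposed fully convolutional networks (FCN) for end-to-end semantic segmentation. Convolution and pooling layers progressively downsample the input image to obtain highly robust features, but they also reduce the feature resolution, so object boundaries are not segmented finely. Later, to recover high-resolution features more accurately, references [5-6] used an encoder to obtain the semantic information of deep features and a decoder to fuse shallow and deep features, progressively restoring spatial and detail information. Reference [7] proposed dropping the last two downsampling operations of the encoder, using atrous convolution to keep the overall receptive field unchanged, and adding a fully connected conditional random field at the end of the network to refine the segmentation further. To avoid problems such as shrinking feature maps and poor localization accuracy, DeepLabv2 [8] combined atrous convolution with spatial pyramid pooling and proposed the atrous spatial pyramid pooling (ASPP) module, which integrates multi-scale features and enlarges the receptive field, thereby improving accuracy. Building on these two methods, DeepLabv3 [9] further examined how parallel and cascaded arrangements of atrous convolutions affect segmentation and improved the ASPP module to capture information from different receptive fields, which improves the segmentation of objects at different scales and yields better results. Reference [10] argues that rich context information increases the informativeness and class discriminability of a network, giving the model better segmentation ability.

The work above mainly addresses the loss of spatial information caused by downsampling. It improves accuracy, but segmentation is slow and cannot meet the needs of real-time tasks. Most current real-time semantic segmentation algorithms reach real-time speed at the expense of accuracy. The real-time algorithm SegNet [11] uses an encoder-decoder structure: the encoder extracts features through repeated convolution and pooling, and the decoder performs nonlinear upsampling using the pooling indices, which reduces memory use and increases speed. In pursuit of a lightweight model, the real-time algorithm ENet [12] drops the last downsampling stage; its receptive field is then too small to cover large objects, which lowers accuracy, but the algorithm is fast. The real-time algorithm BiSeNet [13] processes high-resolution images with a shallow network and proposes a fast-downsampling deep network to balance classification ability and receptive field size; it achieves both high speed and high accuracy.

Reference [14] points out that shallow features contain inaccurate semantic information, and that directly adding deep and shallow features introduces a large amount of noise, lowering the segmentation accuracy. To address this, this paper uses an attention mechanism to assign pixel-level weights to shallow features, which suppresses the output of inaccurate semantic information.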
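In addition, this paper observes that in the feature map obtained after an input image is downsampled by the network, each feature point corresponds spatially to a region made up of several image pixels, and these pixels may belong to different classes. A multi-label classification loss is therefore used to supervise training explicitly, so that each feature point can carry information about several classes. This improves the accuracy of the feature semantics and, in turn, the segmentation accuracy.

1 Proposed algorithm

To obtain real-time speed together with high accuracy, this paper starts from DeepLabv3, a state-of-the-art semantic segmentation algorithm, and makes its feature extraction network lightweight by using ResNet34 [15] as the backbone; it adds a decoder built as a feature pyramid so that the network runs in real time; and it takes the optimized network as the base network, adding an attention module and multi-label loss supervision to improve accuracy further. This section describes these three improvements in detail.

The overall structure of the algorithm is shown in Fig. 1, where backbone denotes ResNet34, /2 and ×2 denote 2× downsampling and 2× upsampling of the feature map, dashed arrows indicate that the multi-label classification loss supervises training at that stage, and "+" denotes fusion of feature maps by addition.

Fig. 1 Overall structure of the proposed algorithm

1.1 Network structure

Objects in natural images often have different scales and aspect ratios. In street scenes, for example, sky, buildings and roads differ greatly in scale from street lamps and billboards. A network with an unsuitable receptive field cannot pay balanced attention to objects at different scales: a network with a small receptive field focuses on small objects or splits a large object into several parts, whereas a network with a large receptive field ignores small objects. Obtaining multiple receptive fields is therefore important for fine segmentation. DeepLabv3 convolves the image by exploiting the spatial locality of images and the sampling pattern of atrous convolution, which captures several receptive fields while preserving the spatial information of the features. However, a large number of parameters then take dot products with high-resolution features, which slows segmentation. This paper optimizes the speed while keeping the multi-receptive-field property of DeepLabv3. As shown in Fig. 1, the relatively lightweight ResNet34 first replaces the ResNet101 [15] of DeepLabv3 as the feature encoder; a feature pyramid network (FPN) [16] is then added as the decoder, upsampling layer by layer to restore the spatial and semantic information of the features; finally, the number of parameters in the atrous pyramid pooling network is reduced. The residual connections of ResNet34 prevent gradients from vanishing during backpropagation. FPN is a way of fusing feature maps from different levels; using it in the decoder reuses shallow features to repair the spatial detail of deep feature maps, which further increases feature robustness. As shown in Fig. 2, ASPP consists of parallel convolutions with different atrous rates r. Because the features are already robust after FPN processing, an ASPP module with fewer parameters is enough to segment objects at multiple scales.

Fig. 2 ASPP

In terms of composition, the network is a stack of four basic units: convolution layers, activation layers, atrous convolution layers and batch normalization (BN) layers [17]. Convolution layers extract image features; activation layers increase the nonlinearity of the network; atrous convolution layers enlarge the receptive field and make the features more robust while preserving their spatial information [7]; and BN standardizes the data passed between layers to remove internal covariate shift [17], which improves convergence speed and accuracy.

1.2 Cross-level attention module

Reference [14] points out that shallow features contain inaccurate semantic information, yet FPN adds deep and shallow feature maps directly. This fusion carries erroneous or redundant information from the shallow features into the deep features and harms accuracy. For this reason, a cross-level attention module (cross-level attention mechanism, CAM) is introduced into the fusion of deep and shallow features to suppress erroneous or redundant information. An attention mechanism [18] works much like a human observing an environment: it attends only to the particularly important parts, gathers the important information from each, suppresses features that contribute little to the current recognition, and strengthens the effective ones. Attention features strengthen the representational power of the model, integrate different information, and improve its understanding [19].

The proposed CAM is shown in Fig. 3. The deep features pass through a 3×3 convolution, BN and an activation to produce an interpretable weight matrix matching the scale of the shallow feature map in the encoder. This matrix is multiplied with the shallow feature map, and the weighted shallow feature map is then added to the deep feature map. In this simple way the module uses deep features to guide the weighting of shallow features, giving the shallow feature map pixel-level attention so that it focuses on the more informative feature points, that is, it expresses the important information as fully as possible with a limited number of parameters. The module better balances pruning the model architecture against strengthening its expressive power.

Fig. 3 CAM

1.3 Joint multi-label classification supervision

Unlike traditional single-label learning, where each sample is associated with only one class, multi-label learning [20] must output several labels, and each instance can be associated with a set of labels.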
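Let $X = \mathbb{R}^n$ denote the n-dimensional instance space and $Y = \{y_1, y_2, \ldots, y_q\}$ the label space, which contains q possible label classes. The task of multi-label learning is to learn, from a multi-label training set $D = \{(x_i, Y_i) \mid 1 \le i \le m\}$, a function $f: X \to 2^Y$. For any test instance $x \in X$, the multi-label classifier $f(\cdot)$ predicts the label set $f(x) \subseteq Y$. As shown in Fig. 4, a single-label task might label the position indicated by the arrow as either horse or person, whereas a multi-label task assigns both classes at once and uses them as the network's label, supervising the network to learn the features of both classes.

Figure 4 Illustration of multi-label classification

In current semantic segmentation, upsampling features by linear interpolation to restore semantic information resembles feature extraction in a multi-label classification task. As shown in Fig. 4, in the encoder the input image is convolved and downsampled to a feature map 1/K the size of the input, where K is the downsampling factor. In Fig. 4 we divide the image into grid cells by spatial position according to the downsampling factor. From the mapping between the output features of a convolutional network and the original image, each feature point corresponds one-to-one to a grid cell and one-to-many to the pixels in that cell. Fig. 4 shows that some cells contain only background pixels, while others contain pixels of three classes (horse, person and background), so some cells contain boundaries where several classes meet. In the decoder, linear interpolation is used to upsample the feature points and raise the resolution of the feature map. Because linear interpolation is based on a space-invariant model, it cannot capture rapid changes at edges and blurs them [21].

Therefore, to further improve the accuracy of upsampled features in the decoder, we introduce a multi-label classification loss that explicitly supervises training when the feature maps at 1/32 and 1/16 of the input resolution are upsampled. This keeps the semantic information of each feature point consistent with the class information of the pixels in its grid cell, so class information stays accurate while resolution is restored, and segmentation speed is not reduced. We use the multi-label classification loss together with the cross-entropy loss to supervise the network and correct object boundary information. The loss function is

$$L = \lambda_1 L_{CE}(\hat{y}, y) + \lambda_2 L_{BCE}(\hat{s}_{16}, s_{16}) + \lambda_3 L_{BCE}(\hat{s}_{32}, s_{32}),$$

where $L_{CE}$ is the cross-entropy loss; $L_{BCE}$ is the multi-label loss, for which we use binary cross-entropy; $y, s_{16}, s_{32} \in \mathbb{R}^{H \times W}$ are the ground-truth labels, with $s_{16}$ and $s_{32}$ at 1/16 and 1/32 of the original label resolution; $\hat{y}, \hat{s}_{16}, \hat{s}_{32} \in \mathbb{R}^{H \times W}$ are the corresponding predictions; and $\lambda_1, \lambda_2, \lambda_3$ are three hyperparameters that weight the three loss terms.

2 Experiments and analysis

2.1 Evaluation metrics

We use mean intersection-over-union (mIoU), per-image processing time t (ms) and processing speed v (frames/s) as performance metrics. mIoU is the standard measure for semantic segmentation: it is the ratio of the intersection to the union of two sets, computed as the intersection-over-union (IoU) within each pixel class and then averaged over classes. Speed is measured by t and v. mIoU and v are computed as

$$\mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},$$

$$v = \frac{N}{\sum_{i=1}^{N} t_i},$$

where $p_{ii}$ is the number of correctly segmented pixels; $p_{ij}$ is the number of pixels of class i predicted as class j; $p_{ji}$ is the number of pixels of class j predicted as class i; N is the number of images; and $t_i$ is the time taken to process image i.

2.2 Data and experimental environment

The Cityscapes dataset [22] contains street-scene images taken in different cities and seasons: 5 000 finely annotated and 20 000 coarsely annotated images, each 1 024 × 2 048 pixels, covering 19 street-scene classes. We use only the 5 000 finely annotated images, 3 475 for training and 1 525 for testing. Ground truth for the test data is not released, so results must be submitted to the official evaluation server.

The CamVid dataset [23] contains 701 images of 720 × 960 pixels extracted from video sequences, of which 367 are used for training, 101 for validation and 233 for testing. We evaluate 11 semantic classes.

All experiments run on Ubuntu 18.04, Python 3.7.4 and PyTorch 1.1.0, with NVIDIA Titan RTX and GTX 1060 GPUs. The encoder backbone is a ResNet34 pretrained on ImageNet [24].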
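The evaluation metrics of Section 2.1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the evaluation code used in the experiments: the function names are hypothetical, the confusion matrix is assumed to be given as a nested list with rows as true classes and columns as predicted classes, classes absent from both prediction and ground truth are skipped rather than counted as zero, and times are assumed to be in milliseconds, so the speed is multiplied by 1000 to obtain frames per second.

```python
def mean_iou(confusion):
    """mIoU from a square confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    num_classes = len(confusion)
    ious = []
    for i in range(num_classes):
        true_positive = confusion[i][i]
        row_sum = sum(confusion[i])                                 # sum_j p_ij
        col_sum = sum(confusion[j][i] for j in range(num_classes))  # sum_j p_ji
        union = row_sum + col_sum - true_positive
        if union > 0:  # assumption: ignore classes that never occur
            ious.append(true_positive / union)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)

def frames_per_second(times_ms):
    """Speed v = N / sum(t_i), converted from frames/ms to frames/s."""
    if not times_ms:
        raise ValueError("times_ms must not be empty")
    return 1000.0 * len(times_ms) / sum(times_ms)

# Toy two-class example: IoU values are 3/5 and 5/7.
toy_confusion = [[3, 1],
                 [1, 5]]
toy_miou = mean_iou(toy_confusion)
```

As a consistency check against the reported numbers, a per-image time of 23.6 ms corresponds to about 42 frames/s and 11.2 ms to about 89 frames/s.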
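The initial learning rate is 0.005 and follows a polynomial decay schedule. Weight decay uses L2 regularization with a coefficient of 0.000 5, and momentum is 0.9.

2.3 Performance analysis and comparison

We first evaluate the effectiveness of each module in terms of speed and accuracy on the Cityscapes and CamVid datasets, then compare with FCN-8s, DeepLabv2, ENet, SegNet, ICNet [25] and BiSeNet, and finally analyze segmentation performance through visual results.

2.3.1 Module effectiveness

To evaluate the proposed CAM, we first evaluate the lightweight DeepLabv3 model, denoted Baseline, and then the model with CAM applied, denoted Baseline+CAM. Table 1 shows the ablation results on an NVIDIA Titan RTX. On Cityscapes and CamVid, adding CAM raises mIoU by 1.2 and 1.4 percentage points and increases the per-image processing time by 1.7 ms and 1.4 ms, respectively. The attention module therefore improves accuracy at very little computational cost, and the result indicates that deep feature maps can effectively guide shallow feature maps to retain useful information and keep out excessive interference.

Table 1 Effectiveness of each module

                      Cityscapes (768×1 536)    CamVid (720×960)
Algorithm             mIoU/%    t/ms            mIoU/%    t/ms
Baseline              65.3      21.9            71.5      9.8
Baseline+CAM          66.5      23.6            72.9      11.2
Baseline+CAM+L_ML     68.1      23.6            74.1      11.2

To evaluate the multi-label classification loss as auxiliary supervision (denoted Baseline+CAM+L_ML), we compare the algorithm trained with the cross-entropy loss only against the algorithm that also uses the multi-label classification loss in the decoder. For the hyperparameters of the multi-label loss, we compared network performance under several empirically chosen settings and selected the better one: $\lambda_1 = 1$, $\lambda_2 = 0.3$, $\lambda_3 = 0.7$. Table 1 shows that, relative to training with cross-entropy only, supervision with the multi-label classification loss raises mIoU on Cityscapes and CamVid by 1.6 and 1.2 percentage points, respectively, with no effect on running speed. This confirms the effectiveness of supervising decoder training with a multi-label classification loss. More generally, changing how the network is supervised is a very effective way to improve real-time semantic segmentation, because it raises accuracy without affecting speed.

2.3.2 Overall analysis and comparison

To verify the effectiveness of the proposed algorithm, we compare its accuracy and speed with DeepLabv2, FCN-8s, SegNet and ENet on the Cityscapes and CamVid datasets. mIoU is used throughout to measure accuracy, and the training settings are given in Section 2.2.

For accuracy and speed, we train and test on Cityscapes at 768 × 1 536 and 512 × 1 024 pixels, and on CamVid at 720 × 960 and 384 × 480 pixels. Table 2 lists part of the results on an NVIDIA Titan RTX. The proposed algorithm runs at 42 frames/s and 89 frames/s on the two datasets. It is somewhat slower than ENet, but its mIoU is 9.8 and 5.8 percentage points higher on the two datasets. Compared with SegNet, it is faster and its mIoU is 11.1 and 8.9 percentage points higher. The proposed algorithm therefore compares well with existing real-time segmentation algorithms in both speed and accuracy. Compared with the non-real-time models DeepLabv2 and FCN-8s, it also has a clear advantage in accuracy.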
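Overall, the proposed algorithm strikes a good balance between segmentation speed and accuracy and delivers accurate, efficient segmentation.

Table 2 Performance of different algorithms on the two datasets

                  Cityscapes (768×1 536)               CamVid (720×960)
Algorithm         mIoU/%   t/ms      v/(frames/s)      mIoU/%   t/ms    v/(frames/s)
DeepLabv2 [8]     63.1     4 000.2   <1                71.3     830.0   1
FCN-8s [4]        60.4     250.0     4                 66.9     76.9    13
SegNet [11]       57.0     31.3      32                65.2     13.5    74
ENet [12]         58.3     11.9      83                68.3     7.3     136
Ours-ResNet34     68.1     23.6      42                74.1     11.2    89

Fig. 5 shows the training loss curves of the proposed algorithm on the two datasets; the curve without joint multi-label classification uses the cross-entropy loss only. With the proposed multi-label classification loss as auxiliary supervision, the training loss decreases steadily and behaves better than with cross-entropy alone, so the loss effectively supervises multi-label classification on low-resolution feature maps.

Figure 5 Loss curves of different algorithms on the two datasets (a. Cityscapes dataset; b. CamVid dataset)

Tables 3 and 4 compare performance on a GTX 1060 GPU with 6 GB of memory, with no data augmentation during training; Figs. 6 to 8 show the corresponding segmentation results. Table 3 and Fig. 6 compare the proposed algorithm (Ours-ResNet34) with ICNet and BiSeNet on Cityscapes. According to the Cityscapes and CamVid results in Table 3, the proposed algorithm has a clear advantage over ICNet in both accuracy and speed. Compared with BiSeNet, it is more accurate but slightly slower. In the segmentation results of Fig. 6, ICNet loses the most detail, BiSeNet handles detail better than ICNet, and the proposed algorithm handles it better than BiSeNet, as the pole-like objects in Fig. 6 show. All three real-time algorithms segment the overall image well, but rows 2 and 3 of Fig. 6 show that the proposed algorithm and BiSeNet segment the road better than ICNet. Baseline-ResNet34 handles detail better than ICNet but segments the overall image poorly.

Table 3 Performance of different algorithms on the two datasets

                  Cityscapes (512×1 024)             CamVid (360×480)
Algorithm         mIoU/%   t/ms   v/(frames/s)       mIoU/%   t/ms   v/(frames/s)
ICNet [25]        65.3     60.0   16                 63.4     40.0   25
BiSeNet [13]      66.2     20.6   51                 65.1     14.9   67
Ours-ResNet34     66.8     23.0   43                 65.3     17.0   59

Table 4 Performance of the proposed algorithm and DeepLabv3 with MobileNetv2

                          Cityscapes (512×1 024)             CamVid (360×480)
Algorithm                 mIoU/%   t/ms   v/(frames/s)       mIoU/%   t/ms   v/(frames/s)
Deeplabv3-MobileNetv2     64.3     30.2   31                 62.3     22.2   45
Ours-MobileNetv2          65.4     20.0   50                 63.9     15.9   63

Figure 6 Results of real-time semantic segmentation algorithms (a. Input image; b. Baseline-ResNet34; c. ICNet [25]; d. BiSeNet [13]; e. Ours-ResNet34)

Fig. 7 compares the segmentation results of the proposed algorithm and DeepLabv3 when both use MobileNetv2 [26] as the feature extraction network; the corresponding data are in Table 4. Fig. 7 shows that the proposed algorithm is slightly better on detail, whereas DeepLabv3 loses more small objects and details. Both models segment the overall image well.

Figs. 6 and 7 together show how the choice of ResNet34 or MobileNetv2 as the feature extraction network affects the proposed algorithm. The model with ResNet34 segments image detail better than the model with MobileNetv2, while the two segment the overall image about equally well. Tables 3 and 4 show that Ours-ResNet34 and Ours-MobileNetv2 run at similar speeds.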
We attribute this to the test batch size, which is 1 in our experiments. With a larger batch, the lower memory footprint of Ours-MobileNetv2 becomes a clearer advantage, more images can be processed in parallel, and its speed should increase markedly.

Figure 7 Comparison of Deeplabv3 and the proposed algorithm on the Cityscapes dataset (a. Input image; b. Deeplabv3-MobileNetv2; c. Ours-MobileNetv2)

We next analyze the proposed algorithm using Fig. 8. Fig. 8c shows Baseline, whose feature extraction network is ResNet34 and which has neither CAM nor multi-label classification. Fig. 8d shows DeepLabv3 with MobileNetv2 as the feature extraction network. Figs. 8e and 8f show the models that use MobileNetv2 and ResNet34, respectively, combined with CAM and multi-label classification, that is, our final models. CamVid images contain a lot of detail and the experimental resolution is only 360 × 480 pixels, so experiments on this dataset particularly test a model's ability to segment detail. Fig. 8 shows that Baseline loses the most detail, while the proposed algorithm and DeepLabv3 segment detail better than Baseline, which demonstrates the effectiveness of the proposed algorithm.

Figure 8 Comparison of DeepLabv3 and the proposed algorithm on the CamVid dataset (a. Input image; b. Ground truth; c. Baseline; d. DeepLabv3; e. Ours-MobileNetv2; f. Ours-ResNet34)

3 Conclusion

This paper proposes a real-time image semantic segmentation network based on an attention mechanism and multi-label classification. We first optimize the DeepLabv3 architecture until it meets real-time requirements, and on that basis design a cross-level feature attention module and a multi-label classification loss. CAM uses deep features, which carry rich semantic information, to weight shallow features at the pixel level and so select spatial information more precisely. The multi-label classification loss provides auxiliary supervision that keeps class information accurate while feature-map resolution is restored. Together they make the segmentation of pixels at class boundaries finer. Comparative experiments on the Cityscapes and CamVid datasets show that the proposed algorithm segments complex scene images more accurately and markedly improves segmentation at class edges, and that it is a real-time image semantic segmentation algorithm with high accuracy and high speed.

References:

[1] Csurka G, Perronnin F. An efficient approach to semantic segmentation[J]. International Journal of Computer Vision, 2011, 95(2): 198-212
[2] He Y H, Wang H, Zhang B. Color-based road detection in urban traffic scenes[J]. IEEE Transactions on Intelligent Transportation Systems, 2004, 5(4): 309-318
[3] An Zhe, Xu Xiping, Yang Jinhua, et al. Design of augmented reality head-up display system based on image semantic segmentation[J]. Acta Optica Sinica, 2018, 38(7): 77-83 (in Chinese)
[4] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C] //Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2015: 3431-3440
[5] Lin G S, Milan A, Shen C H, et al. Refinenet: multi-path refinement networks for high-resolution semantic segmentation[C] //Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE Computer Society Press, 2017: 5168-5177
[6] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C] //Proceedings of Medical Image Computing and Computer Assisted Intervention. Heidelberg: Springer, 2015: 234-241
[7] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected crfs[OL]. [2020-02-16]. https:///abs/1412.7062, 2014
[8] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848
[9] Chen L C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation[OL]. [2020-02-16]. https:///abs/1706.05587
[10] Yue Shiyi. Image semantic segmentation based on hierarchical context information[J]. Laser & Optoelectronics Progress, 2019, 56(24): 107-115 (in Chinese)
[11] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495
[12] Paszke A, Chaurasia A, Kim S, et al. ENet: a deep neural network architecture for real-time semantic segmentation[OL]. [2020-02-16]. https:///abs/1606.02147
[13] Yu C Q, Wang J B, Peng C, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation[C] //Proceedings of the European Conference on Computer Vision. Heidelberg: Springer, 2018: 334-349
[14] Ghiasi G, Fowlkes C C. Laplacian pyramid reconstruction and refinement for semantic segmentation[C] //Proceedings of the European Conference on Computer Vision. Heidelberg: Springer, 2016: 519-534

Pitch angle estimation based on optical flow tracking
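Optical flow tracking is a computer vision technique for estimating pixel displacement between adjacent image frames. In aerospace it is widely used for attitude estimation and control of aircraft. Estimating the aircraft's pitch angle is critical, because pitch directly affects stability and flight performance.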

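The basic principle of optical flow tracking is to compute pixel displacement by comparing changes in pixel intensity between adjacent frames. To estimate pitch, the algorithm takes the sequence of consecutive images captured by the onboard camera and infers the change in pitch by tracking the image trajectories of selected pixels. This approach does not rely on inertial measurements, which is a major advantage in certain applications, such as low-cost UAV systems or space applications in which inertial sensing is constrained.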

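An optical flow tracking implementation typically comprises feature point detection, matching and displacement computation. Pitch estimation must also account for the camera's intrinsic and extrinsic parameters and for factors such as ground motion. Because optical flow tracking is sensitive to illumination changes and missing texture, complex environments require additional processing and optimization.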
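As a rough illustration of the displacement-to-pitch step, the Python sketch below converts matched feature points from two consecutive frames into a pitch change. It is a simplified sketch under strong assumptions that the article does not state: a pinhole camera with known vertical focal length `fy` and principal-point row `cy` (both in pixels), a static and distant scene, camera motion that is a pure pitch rotation between the two frames, and image rows that increase downward so that a positive result means the camera pitched up. Feature detection and matching are assumed to have been done already, and the function name is hypothetical.

```python
import math
from statistics import median

def pitch_change_from_flow(points_prev, points_next, fy, cy):
    """Estimate the pitch change (radians) between two frames.

    points_prev / points_next are lists of matched (x, y) pixel coordinates.
    Each match votes with the change of its vertical viewing angle; the
    median of the votes limits the influence of bad matches.
    """
    if len(points_prev) != len(points_next) or not points_prev:
        raise ValueError("need two non-empty point lists of equal length")
    votes = []
    for (_, y_prev), (_, y_next) in zip(points_prev, points_next):
        angle_prev = math.atan2(y_prev - cy, fy)  # angle below the optical axis
        angle_next = math.atan2(y_next - cy, fy)
        votes.append(angle_next - angle_prev)
    return median(votes)

# Synthetic check: simulate a 0.02 rad pitch-up for points on the centre column.
fy_px, cy_px, true_delta = 500.0, 240.0, 0.02
angles = [-0.2, -0.1, 0.0, 0.1, 0.2]
prev_pts = [(320.0, cy_px + fy_px * math.tan(a)) for a in angles]
next_pts = [(320.0, cy_px + fy_px * math.tan(a + true_delta)) for a in angles]
prev_pts.append((100.0, 200.0))  # one deliberately wrong match
next_pts.append((100.0, 260.0))
estimated_delta = pitch_change_from_flow(prev_pts, next_pts, fy_px, cy_px)
```

The viewing-angle relation is exact only for points on the image's centre column; for off-centre points, translation or ground motion it is an approximation, which is one reason a real system would need the extrinsic calibration and the extra processing mentioned above.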

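In recent years, with advances in computer vision and machine learning, deep-learning-based optical flow algorithms have also been applied to aircraft attitude estimation. These algorithms estimate pixel displacement more accurately and are more robust in complex scenes.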

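In summary, pitch estimation based on optical flow tracking is an effective method of aircraft attitude estimation. It reduces, to some extent, the reliance on an inertial measurement unit and helps improve stability and flight performance. As the technology advances, the method is likely to find wider application.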