On Using Histograms of Local Invariant Features for Image Retrieval
A Collection of Computer Vision Code
Computer vision is a broad discipline that combines traditional photogrammetry, modern computer and information technology, artificial intelligence, and other fields. It is an under-explored continent: the road is long, but many people are making the journey!
This article is reposted from CSDN (at /whucv/article/details/7907391). It is a good summary of algorithms and code, reproduced here for study and reference.
BRIEF: Binary Robust Independent Elementary Features
http://cvlab.epfl.ch/research/detect/brief/
Dimension Reduction
Diffusion maps
/~annlee/software.htm
Dimension Reduction
Dimensionality Reduction Toolbox
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Shared Matting
E. S. L. Gastal and M. M. Oliveira, Computer Graphics Forum, 2010
http://www.inf.ufrgs.br/~eslgastal/SharedMatting/
Alpha Matting
Bayesian Matting
Camera Calibration
EpipolarGeometry Toolbox
G. L. Mariottini, D. Prattichizzo, EGT: a Toolbox for Multiple View Geometry and Visual Servoing, IEEE Robotics & Automation Magazine, 2005
A performance evaluation of local descriptors
A performance evaluation of local descriptors

Krystian Mikolajczyk (Dept. of Engineering Science, University of Oxford, Oxford OX1 3PJ, United Kingdom, km@) and Cordelia Schmid (INRIA Rhône-Alpes, 655 av. de l'Europe, 38330 Montbonnot, France, schmid@inrialpes.fr). Corresponding author: K. Mikolajczyk (km@).

Abstract: In this paper we compare the performance of descriptors computed for local interest regions, as for example extracted by the Harris-Affine detector [32]. Many different descriptors have been proposed in the literature. However, it is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [3], steerable filters [12], PCA-SIFT [19], differential invariants [20], spin images [21], SIFT [26], complex filters [37], moment invariants [43], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor, and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

Index Terms: Local descriptors, interest points, interest regions, invariance, matching, recognition.

I. INTRODUCTION

Local photometric descriptors computed for interest regions have proved to be very successful in applications such as wide baseline matching [37, 42], object recognition [10, 25], texture recognition [21], image retrieval [29, 38], robot localization [40], video data mining [41], building panoramas [4], and recognition of object categories [8, 9, 22, 35]. They are distinctive, robust to occlusion and do not require segmentation. Recent work has concentrated on making these descriptors invariant to image transformations. The idea is to detect image regions covariant to a class of transformations, which are then used as support regions to compute invariant descriptors.
Given invariant region detectors, the remaining questions are which is the most appropriate descriptor to characterize the regions, and does the choice of the descriptor depend on the region detector. There is a large number of possible descriptors and associated distance measures which emphasize different image properties like pixel intensities, color, texture, edges etc. In this work we focus on descriptors computed on gray-value images. The evaluation of the descriptors is performed in the context of matching and recognition of the same scene or object observed under different viewing conditions. We have selected a number of descriptors which have previously shown a good performance in such a context and compare them using the same evaluation scenario and the same test data. The evaluation criterion is recall-precision, i.e. the number of correct and false matches between two images. Another possible evaluation criterion is the ROC (Receiver Operating Characteristics) in the context of image retrieval from databases [6, 31]. The detection rate is equivalent to recall but the false positive rate is computed for a database of images instead of a single image pair. It is therefore difficult to predict the actual number of false matches for a pair of similar images. Local features were also successfully used for object category recognition and classification. The comparison of descriptors in this context requires a different evaluation setup. However, it is unclear how to select a representative set of images for an object category and how to prepare the ground truth, since there is no linear transformation relating images within a category. A possible solution is to select manually a few corresponding points and apply loose constraints to verify correct matches, as proposed in [18].

In this paper the comparison is carried out for different descriptors, different interest regions and for different matching strategies. Compared to our previous work [31], this paper performs a more exhaustive evaluation and introduces a new descriptor. Several descriptors and detectors have been added to the comparison and the data set contains a larger variety of scene types and transformations. We have modified the evaluation criterion and now use recall-precision for image pairs. The ranking of the top descriptors is the same as in the ROC-based evaluation [31]. Furthermore, our new descriptor, gradient location and orientation histogram (GLOH), which is an extension of the SIFT descriptor, is shown to outperform SIFT as well as the other descriptors.

A. Related work

Performance evaluation has gained more and more importance in computer vision [7]. In the context of matching and recognition several authors have evaluated interest point detectors [14, 30, 33, 39]. The performance is measured by the repeatability rate, that is the percentage of points simultaneously present in two images. The higher the repeatability rate between two images, the more points can potentially be matched and the better are the matching and recognition results.
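As a minimal illustration of this measure (a sketch, not the protocol of [14, 30, 33, 39]), the code below computes a repeatability rate for keypoints detected in two images related by a known homography; the point arrays, the 3x3 matrix H, and the 2-pixel tolerance are assumptions made for the example.

    import numpy as np

    def repeatability(pts_a, pts_b, H, tol=2.0):
        """Fraction of points in image A whose projection into image B
        lands within `tol` pixels of some detected point in B."""
        # Project points from image A into image B with the homography.
        pts_h = np.hstack([pts_a, np.ones((len(pts_a), 1))])   # to homogeneous
        proj = (H @ pts_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]                      # back to Cartesian
        # For each projected point, distance to the nearest detection in B.
        d = np.linalg.norm(proj[:, None, :] - pts_b[None, :, :], axis=2)
        repeated = (d.min(axis=1) <= tol).sum()
        return repeated / len(pts_a)

A returned value of 0.8 would mean that 80% of the points detected in the first image reappear within tolerance in the second.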
Very little work has been done on the evaluation of local descriptors in the context of matching and recognition. Carneiro and Jepson [6] evaluate the performance of point descriptors using ROC (Receiver Operating Characteristics). They show that their phase-based descriptor performs better than differential invariants. In their comparison interest points are detected by the Harris detector and the image transformations are generated artificially. Recently, Ke and Sukthankar [19] have developed a descriptor similar to the SIFT descriptor. It applies Principal Components Analysis (PCA) to the normalized image gradient patch and performs better than the SIFT descriptor on artificially generated data. The criterion recall-precision and image pairs were used to compare the descriptors.

Local descriptors (also called filters) have also been evaluated in the context of texture classification. Randen and Husoy [36] compare different filters for one texture classification algorithm. The filters evaluated in this paper are Laws masks, Gabor filters, wavelet transforms, DCT, eigenfilters, linear predictors and optimized finite impulse response filters. No single approach is identified as best. The classification error depends on the texture type and the dimensionality of the descriptors. Gabor filters were in most cases outperformed by the other filters. Varma and Zisserman [44] also compared different filters for texture classification and showed that MRF perform better than Gaussian based filter banks. Lazebnik et al. [21] propose a new invariant descriptor called "spin image" and compare it with Gabor filters in the context of texture classification. They show that the region-based spin image outperforms the point-based Gabor filter. However, the texture descriptors and the results for texture classification cannot be directly transposed to region descriptors. The regions often contain a single structure without repeated patterns, and the statistical dependency frequently explored in texture descriptors cannot be used in this context.

B. Overview

In section II we present a state of the art on local descriptors. Section III describes the implementation details for the detectors and descriptors used in our comparison as well as our evaluation criterion and the data set. In section IV we present the experimental results. Finally, we discuss the results.

II. DESCRIPTORS

Many different techniques for describing local image regions have been developed. The simplest descriptor is a vector of image pixels. Cross-correlation can then be used to compute a similarity score between two descriptors. However, the high dimensionality of such a description results in a high computational complexity for recognition. Therefore, this technique is mainly used for finding correspondences between two images. Note that the region can be sub-sampled to reduce the dimension. Recently, Ke and Sukthankar [19] proposed to use the image gradient patch and to apply PCA to reduce the size of the descriptor.

Distribution-based descriptors. These techniques use histograms to represent different characteristics of appearance or shape. A simple descriptor is the distribution of the pixel intensities represented by a histogram. A more expressive representation was introduced by Johnson and Hebert [17] for 3D object recognition in the context of range data. Their representation (spin image) is a histogram of the relative positions in the neighborhood of a 3D interest point. This descriptor was recently adapted to images [21]. The two dimensions of the histogram are distance from the center point and the intensity
value. Zabih and Woodfill [45] have developed an approach robust to illumination changes. It relies on histograms of ordering and reciprocal relations between pixel intensities which are more robust than raw pixel intensities. The binary relations between intensities of several neighboring pixels are encoded by binary strings and a distribution of all possible combinations is represented by histograms. This descriptor is suitable for texture representation but a large number of dimensions is required to build a reliable descriptor [34].

Lowe [25] proposed a scale invariant feature transform (SIFT), which combines a scale invariant region detector and a descriptor based on the gradient distribution in the detected regions. The descriptor is represented by a 3D histogram of gradient locations and orientations, see figure 1 for illustration. The contribution to the location and orientation bins is weighted by the gradient magnitude. The quantization of gradient locations and orientations makes the descriptor robust to small geometric distortions and small errors in the region detection. Geometric histogram [1] and shape context [3] implement the same idea and are very similar to the SIFT descriptor. Both methods compute a 3D histogram of location and orientation for edge points where all the edge points have equal contribution in the histogram. These descriptors were successfully used, for example, for shape recognition of drawings for which edges are reliable features.

Spatial-frequency techniques. Many techniques describe the frequency content of an image. The Fourier transform decomposes the image content into the basis functions. However, in this representation the spatial relations between points are not explicit and the basis functions are infinite, therefore difficult to adapt to a local approach. The Gabor transform [13] overcomes these problems, but a large number of Gabor filters is required to capture small changes in frequency and orientation. Gabor filters and wavelets [27] are frequently explored in the context of texture classification.

Differential descriptors. A set of image derivatives computed up to a given order approximates a point neighborhood. The properties of local derivatives (local jet) were investigated by Koenderink [20]. Florack et al. [11] derived differential invariants, which combine components of the local jet to obtain rotation invariance. Freeman and Adelson [12] developed steerable filters, which steer derivatives in a particular direction given the components of the local jet. Steering derivatives in the direction of the gradient makes them invariant to rotation. A stable estimation of the derivatives is obtained by convolution with Gaussian derivatives. Figure 2(a) shows Gaussian derivatives up to order 4. Baumberg [2] and Schaffalitzky and Zisserman [37] proposed to use complex filters derived from the family K(x, y) = f(x, y)e^{imθ}, where θ is the orientation. For the function f, Baumberg uses Gaussian derivatives and Schaffalitzky and Zisserman apply a polynomial (cf.
section III-B and figure 2(b)). These filters differ from the Gaussian derivatives by a linear coordinates change in filter response space.

Other techniques. Generalized moment invariants have been introduced by Van Gool et al. [43] to describe the multi-spectral nature of the image data. The invariants combine central moments defined by M^a_pq = ∫∫ x^p y^q [I(x, y)]^a dx dy with order p + q and degree a. The moments characterize shape and intensity distribution in a region. They are independent and can be easily computed for any order and degree. However, the moments of high order and degree are sensitive to small geometric and photometric distortions. Computing the invariants reduces the number of dimensions. These descriptors are therefore more suitable for color images where the invariants can be computed for each color channel and between the channels.

III. EXPERIMENTAL SETUP

In the following we first describe the region detectors used in our comparison and the region normalization necessary for computing the descriptors. We then give implementation details for the evaluated descriptors. Finally, we discuss the evaluation criterion and the image data used in the tests.

A. Support regions

Region detectors use different image measurements and are either scale or affine invariant. Lindeberg [23] has developed a scale-invariant "blob" detector, where a "blob" is defined by a maximum of the normalized Laplacian in scale-space. Lowe [25] approximates the Laplacian with difference-of-Gaussian (DoG) filters and also detects local extrema in scale-space. Lindeberg and Gårding [24] make the blob detector affine-invariant using an affine adaptation process based on the second moment matrix. Mikolajczyk and Schmid [29, 30] use a multi-scale version of the Harris interest point detector to localize interest points in space and then employ Lindeberg's scheme for scale selection and affine adaptation. A similar idea was explored by Baumberg [2] as well as Schaffalitzky and Zisserman [37]. Tuytelaars and Van Gool [42] construct two types of affine-invariant regions, one based on a combination of interest points and edges and the other one based on image intensities. Matas et al. [28] introduced Maximally Stable Extremal Regions extracted with a watershed-like segmentation algorithm. Kadir et al. [18] measure the entropy of pixel intensity histograms computed for elliptical regions to find local maxima in affine transformation space. A comparison of state-of-the-art affine region detectors can be found in [33].

1) Region detectors: The detectors provide the regions which are used to compute the descriptors. If not stated otherwise the detection scale determines the size of the region. In this evaluation we have used five detectors:

Harris points [15] are invariant to rotation. The support region is a fixed size neighborhood of 41x41 pixels centered at the interest point.

Harris-Laplace regions [29] are invariant to rotation and scale changes. The points are detected by the scale-adapted Harris function and selected in scale-space by the Laplacian-of-Gaussian operator. Harris-Laplace detects corner-like structures.

Hessian-Laplace regions [25, 32] are invariant to rotation and scale changes. Points are localized in space at the local maxima of the Hessian determinant and in scale at the local maxima of the Laplacian-of-Gaussian. This detector is similar to the DoG approach [26], which localizes points at local scale-space maxima of the difference-of-Gaussian. Both approaches detect the same blob-like structures. However, Hessian-Laplace obtains a higher localization accuracy in scale-space, as DoG also responds to edges and detection is
unstable in this case. The scale selection accuracy is also higher than in the case of the Harris-Laplace detector. Laplacian scale selection acts as a matched filter and works better on blob-like structures than on corners since the shape of the Laplacian kernel fits to the blobs. The accuracy of the detectors affects the descriptor performance.

Harris-Affine regions [32] are invariant to affine image transformations. Localization and scale are estimated by the Harris-Laplace detector. The affine neighborhood is determined by the affine adaptation process based on the second moment matrix.

Hessian-Affine regions [32, 33] are invariant to affine image transformations. Localization and scale are estimated by the Hessian-Laplace detector and the affine neighborhood is determined by the affine adaptation process.

Note that Harris-Affine differs from Harris-Laplace by the affine adaptation, which is applied to Harris-Laplace regions. In this comparison we use the same regions except that for Harris-Laplace the region shape is circular. The same holds for the Hessian-based detector. Thus the number of regions is the same for affine and scale invariant detectors. Implementation details for these detectors as well as default thresholds are described in [32]. The number of detected regions varies from 200 to 3000 per image depending on the content.

2) Region normalization: The detectors provide circular or elliptic regions of different size, which depends on the detection scale. Given a detected region it is possible to change its size or shape by scale or affine covariant construction. Thus, we can modify the set of pixels which contribute to the descriptor computation. Typically, larger regions contain more signal variations. Hessian-Affine and Hessian-Laplace detect mainly blob-like structures for which the signal variations lie on the blob boundaries. To include these signal changes into the description, the measurement region is 3 times larger than the detected region. This factor is used for all scale and affine detectors. All the regions are mapped to a circular region of constant radius to obtain scale and affine invariance. The size of the normalized region should not be too small in order to represent the local structure at a sufficient resolution. In all experiments this size is arbitrarily set to 41 pixels. A similar patch size was used in [19]. Regions which are larger than the normalized size are smoothed before the size normalization. The parameter of the smoothing Gaussian kernel is given by the ratio measurement/normalized region size.

Spin images, differential invariants and complex filters are invariant to rotation. To obtain rotation invariance for the other descriptors the normalized regions are rotated in the direction of the dominant gradient orientation, which is computed in a small neighborhood of the region center. To estimate the dominant orientation we build a histogram of gradient angles weighted by the gradient magnitude and select the orientation corresponding to the largest histogram bin, as suggested in [25].

Illumination changes can be modeled by an affine transformation of the pixel intensities. To compensate for such affine illumination changes the image patch is normalized with mean and standard deviation of the pixel intensities within the region. The regions which are used for descriptor evaluation are normalized with this method if not stated otherwise.
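A minimal NumPy sketch of this affine photometric normalization; the function name and the epsilon guard against constant patches are assumptions, not from the paper.

    import numpy as np

    def normalize_patch(patch, eps=1e-8):
        """Compensate an affine illumination change I' = a*I + b by
        mapping the patch to zero mean and unit standard deviation."""
        patch = patch.astype(np.float64)
        return (patch - patch.mean()) / (patch.std() + eps)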
Derivative-based descriptors (steerable filters, differential invariants) can also be normalized by computing illumination invariants. The offset is eliminated by the differentiation operation. The invariance to linear intensity scaling is obtained by dividing the higher order derivatives by the gradient magnitude raised to the appropriate power. A similar normalization is possible for moments and complex filters, but has not been implemented here.

B. Descriptors

In the following we present the implementation details for the descriptors used in our experimental evaluation. We use ten different descriptors: SIFT [25], gradient location and orientation histogram (GLOH), shape context [3], PCA-SIFT [19], spin images [21], steerable filters [12], differential invariants [20], complex filters [37], moment invariants [43], and cross-correlation of sampled pixel values. Gradient location and orientation histogram (GLOH) is a new descriptor which extends SIFT by changing the location grid and using PCA to reduce the size.

SIFT descriptors are computed for normalized image patches with the code provided by Lowe [25]. A descriptor is a 3D histogram of gradient location and orientation, where location is quantized into a 4x4 location grid and the gradient angle is quantized into 8 orientations. The resulting descriptor is of dimension 128. Figure 1 illustrates the approach. Each orientation plane represents the gradient magnitude corresponding to a given orientation. To obtain illumination invariance, the descriptor is normalized by the square root of the sum of squared components.

Gradient location-orientation histogram (GLOH) is an extension of the SIFT descriptor designed to increase its robustness and distinctiveness. We compute the SIFT descriptor for a log-polar location grid with 3 bins in radial direction (the radius set to 6, 11 and 15) and 8 in angular direction (cf. figure 1(e)), which results in 17 location bins. Note that the central bin is not divided in angular directions. The gradient orientations are quantized in 16 bins. This gives a 272-bin histogram. The size of this descriptor is reduced with PCA. The covariance matrix for PCA is estimated on 47000 image patches collected from various images (see section III-C.1). The 128 largest eigenvectors are used for description.

Fig. 1. SIFT descriptor. (a) Detected region. (b) Gradient image and location grid. (c) Dimensions of the histogram. (d) 4 of 8 orientation planes. (e) Cartesian and the log-polar location grids. The log-polar grid shows 9 location bins used in shape context (4 in angular direction).
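To make the construction concrete, here is a rough NumPy sketch of a SIFT-like 3D histogram over a normalized 41x41 patch (4x4 location grid, 8 orientation bins, magnitude-weighted votes). It omits Lowe's Gaussian weighting and trilinear interpolation, and the function name and binning details are illustrative assumptions rather than the code provided by Lowe.

    import numpy as np

    def sift_like_descriptor(patch, grid=4, n_orient=8):
        """3D histogram of gradient location (grid x grid) and orientation."""
        gy, gx = np.gradient(patch.astype(np.float64))
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)        # orientation in [0, 2*pi)

        h, w = patch.shape
        hist = np.zeros((grid, grid, n_orient))
        ys, xs = np.mgrid[0:h, 0:w]
        # Quantize position into the location grid and angle into orientation bins.
        row = np.minimum(ys * grid // h, grid - 1)
        col = np.minimum(xs * grid // w, grid - 1)
        ob = np.minimum((ang / (2 * np.pi) * n_orient).astype(int), n_orient - 1)
        np.add.at(hist, (row, col, ob), mag)          # magnitude-weighted votes

        d = hist.ravel()                              # dimension 128 for 4x4x8
        return d / (np.linalg.norm(d) + 1e-8)         # normalize for illumination

A GLOH-style variant would replace the Cartesian (row, col) quantization with log-polar radius/angle bins and reduce the resulting 272-dimensional histogram with PCA.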
Shape context is similar to the SIFT descriptor, but is based on edges. Shape context is a 3D histogram of edge point locations and orientations. Edges are extracted by the Canny [5] detector. Location is quantized into 9 bins of a log-polar coordinate system as displayed in figure 1(e), with the radius set to 6, 11 and 15 and orientation quantized into 4 bins (horizontal, vertical and two diagonals). We therefore obtain a 36 dimensional descriptor. In our experiments we weight a point contribution to the histogram with the gradient magnitude. This has shown to give better results than using the same weight for all edge points, as proposed in [3]. Note that the original shape context was computed only for edge point locations and not for orientations.

PCA-SIFT descriptor is a vector of image gradients in x and y direction computed within the support region. The gradient region is sampled at 39x39 locations, therefore the vector is of dimension 3042. The dimension is reduced to 36 with PCA.

Spin image is a histogram of quantized pixel locations and intensity values. The intensity of a normalized patch is quantized into 10 bins. A 10-bin normalized histogram is computed for each of 5 rings centered on the region. The dimension of the spin descriptor is 50.

Fig. 2. Derivative-based filters. (a) Gaussian derivatives up to 4th order. (b) Complex filters up to 6th order. Note that the displayed filters are not weighted by a Gaussian, for figure clarity.
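A rough sketch of the spin-image computation just described (10 intensity bins x 5 rings, dimension 50); the uniform ring spacing and per-ring normalization choices here are assumptions for illustration.

    import numpy as np

    def spin_image(patch, n_rings=5, n_bins=10):
        """Histogram of quantized pixel location (distance from center)
        and intensity value; returns a descriptor of dimension 50."""
        patch = patch.astype(np.float64)
        # Normalize intensities to [0, 1) and quantize into 10 bins.
        p = (patch - patch.min()) / (np.ptp(patch) + 1e-8)
        ibin = np.minimum((p * n_bins).astype(int), n_bins - 1)

        h, w = patch.shape
        ys, xs = np.mgrid[0:h, 0:w]
        r = np.hypot(ys - (h - 1) / 2, xs - (w - 1) / 2)
        ring = np.minimum((r / (r.max() + 1e-8) * n_rings).astype(int), n_rings - 1)

        desc = np.zeros((n_rings, n_bins))
        np.add.at(desc, (ring, ibin), 1.0)
        desc /= desc.sum(axis=1, keepdims=True) + 1e-8   # normalize each ring histogram
        return desc.ravel()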
Steerable filters and differential invariants use derivatives computed by convolution with Gaussian derivatives, for an image patch of size 41. Changing the orientation of derivatives as proposed in [12] gives equivalent results to computing the local jet on rotated image patches. We use the second approach. The derivatives are computed up to 4th order, that is the descriptor has dimension 14. Figure 2(a) shows 8 of 14 derivatives; the remaining derivatives are obtained by rotation. The differential invariants are computed up to 3rd order (dimension 8). We compare steerable filters and differential invariants computed up to the same order (cf. section IV-A.3).

Complex filters are derived from the family K_mn(x, y) = (x + iy)^m (x − iy)^n G(x, y), where G(x, y) is a Gaussian. The original implementation [37] has been used for generating the kernels. The kernels are computed for a unit disk of radius 1 and sampled at 41x41 locations. We use 15 filters defined by m + n ≤ 6 (swapping m and n just gives complex conjugate filters); the response of the filter with m = n = 0 is the average intensity of the region. Figure 2(b) shows 8 of 15 filters. Rotation changes the phase but not the magnitude of the response, therefore we use the modulus of each complex filter response.

Moment invariants are computed up to 2nd order and 2nd degree. The moments are computed for derivatives of an image patch.

C. Data set and evaluation criterion

1) Data set: The descriptors are evaluated on real images with different geometric and photometric transformations (see figure 3): rotation (a)&(b); scale change (c)&(d); viewpoint change (e)&(f); image blur (g)&(h); JPEG compression (i); and illumination (j). In the case of rotation, scale change, viewpoint change and blur, we use two different scene types. One scene type contains structured scenes, that is homogeneous regions with distinctive edge boundaries (e.g. graffiti, buildings), and the other contains repeated textures of different forms. This allows to analyze the influence of image transformation and scene type separately.

Image rotations are obtained by rotating the camera around its optical axis in the range of 30 and 45 degrees. Scale change and blur sequences are acquired by varying the camera zoom and focus respectively. The scale changes are in the range of 2-2.5. In the case of the viewpoint change sequences the camera position varies from a fronto-parallel view to one with significant foreshortening at approximately 50-60 degrees. The light changes are introduced by varying the camera aperture. The JPEG sequence is generated with a standard xv image browser with the image quality parameter set to 5%. The images are either of planar scenes or the camera position was fixed during acquisition. The images are therefore always related by a homography (plane projective transformation). The ground truth homographies are computed in two steps. First, an approximation of the homography is computed using manually selected correspondences. The transformed image is warped with this homography so that it is roughly aligned with the reference image. Second, a robust small baseline homography estimation algorithm is used to compute an accurate residual homography between the reference image and the warped image, with automatically detected and matched interest points [16]. The composition of the approximate and residual homography results in an accurate homography between the images.

In section IV we display the results for image pairs from figure 3. The transformation between these images is significant enough to introduce some noise in the detected regions. Yet, many correspondences are found and the matching results are stable. Typically, the descriptor performance is higher for small image transformations but the ranking remains the same. There are few corresponding regions for large transformations and the recall-precision curves are not smooth. A data set different from the test data was used to estimate the covariance matrices for PCA and descriptor normalization. In both cases we have used 21 image sequences of different planar scenes which are viewed under all the transformations for which we evaluate the descriptors (the data set is available at /~vgg/research/affine).

Fig. 3. Data set. Examples of images used for the evaluation: (a)(b) Rotation, (c)(d) Zoom+rotation, (e)(f) Viewpoint change, (g)(h) Image blur, (i) JPEG compression, (j) Light change.

2) Evaluation criterion: We use a criterion similar to the one proposed in [19]. It is based on the number of correct matches and the number of false matches obtained for an image pair.
Two regions A and B are matched if the distance between their descriptors D_A and D_B is below a threshold t. Each descriptor from the reference image is compared with each descriptor from the transformed one and we count the number of correct matches as well as the number of false matches. The value of t is varied to obtain the curves. The results are presented with recall versus 1-precision. Recall is the number of correctly matched regions with respect to the number of corresponding regions between two images of the same scene:

    recall = #correct matches / #correspondences
    1 − precision = #false matches / (#correct matches + #false matches)

Given recall, 1-precision and the number of corresponding regions, the number of correct matches can be determined by #correct = recall × #correspondences, and the number of false matches by #false = (1 − precision)/precision × recall × #correspondences. For example, there are 3708 corresponding regions between the images used to generate figure 4(a). For a point on the GLOH curve with recall of 0.3 and 1-precision of 0.6, the number of correct matches is 0.3 × 3708 ≈ 1112, and the number of false matches is (0.6/0.4) × 1112 ≈ 1669. Note that recall and 1-precision are independent terms. Recall is computed with respect to the number of corresponding regions and 1-precision with respect to the total number of matches.

Before we start the evaluation we discuss the interpretation of figures and possible curve shapes. A perfect descriptor would give a recall equal to 1 for any precision. In practice, recall increases for an increasing distance threshold, as noise which is introduced by image transformations and region detection increases the distance between similar descriptors. Horizontal curves indicate that the recall is attained with a high precision and is limited by the specificity of the scene, i.e. the detected structures are very similar to each other and the descriptor cannot distinguish them. Another possible reason for non-increasing recall is that the remaining corresponding regions are very different from each other (partial overlap close to 50%) and therefore the descriptors are different. A slowly increasing curve shows that the descriptor is affected by the image degradation (viewpoint change, blur, noise etc.). If curves corresponding to different descriptors are far apart and have different slopes, then the distinctiveness and robustness of the descriptors is different for a given image transformation or scene type.

IV. EXPERIMENTAL RESULTS

In this section we present and discuss the experimental results of the evaluation. The performance is compared for affine transformations, scale changes, rotation, blur, JPEG compression and illumination changes. In the case of affine transformations we also examine different matching strategies, the influence of the overlap error and the dimension of the descriptor.

A. Affine transformations

In this section we evaluate the performance for viewpoint changes of approximately 50 degrees. This introduces a perspective transformation which can locally be approximated by an affine transformation. This is the most challenging transformation of the ones evaluated in this paper. Note that there are also some scale and brightness changes in the test images, see figure 3(e)(f). In the following we first examine different matching approaches. Second, we investigate the influence of the overlap error and the dimension of the descriptor.
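A compact sketch of the recall/1-precision criterion described above, hedged: it assumes descriptors as rows of NumPy arrays and a boolean correspondence matrix derived from the ground-truth homography and overlap criterion (not computed here), and it sweeps the distance threshold t to trace the curve.

    import numpy as np

    def recall_precision_curve(desc_ref, desc_trans, is_corr, n_thresholds=50):
        """Sweep a descriptor-distance threshold and return a list of
        (recall, 1-precision) pairs, following the criterion above."""
        # Euclidean distances between all descriptor pairs.
        d = np.linalg.norm(desc_ref[:, None, :] - desc_trans[None, :, :], axis=2)
        n_corr = is_corr.sum()               # number of corresponding regions
        curve = []
        for t in np.linspace(d.min(), d.max(), n_thresholds):
            matched = d < t                  # all matches at this threshold
            correct = np.logical_and(matched, is_corr).sum()
            false = matched.sum() - correct
            total = correct + false
            recall = correct / n_corr
            one_minus_precision = false / total if total > 0 else 0.0
            curve.append((recall, one_minus_precision))
        return curve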
A Discriminatively Trained, Multiscale, Deformable Part Model
A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb (University of Chicago, pff@), David McAllester (Toyota Technological Institute at Chicago, mcallester@), Deva Ramanan (UC Irvine, dramanan@)

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

(This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.)

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1-3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by "conceptually weaker" models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection.
We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining "hard negative" examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a bounding box for each positive example of an object. We treat the position of each object part as a latent variable. We also treat the exact location of the object as a latent variable, requiring only that our classifier select a window that has large overlap with the labeled bounding box.

A latent SVM, like a hidden CRF [19], leads to a non-convex training problem. However, unlike a hidden CRF, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive training examples. This leads to a general coordinate descent algorithm for latent SVMs.

System Overview. Our system uses a scanning window approach. A model for an object consists of a global "root" filter and several part models. Each part model specifies a spatial model and a part filter. The spatial model defines a set of allowed placements for a part relative to a detection window, and a deformation cost for each placement.

The score of a detection window is the score of the root filter on the window plus the sum over parts, of the maximum over placements of that part, of the part filter score on the resulting subwindow minus the deformation cost. This is similar to classical part-based models [10, 13]. Both root and part filters are scored by computing the dot product between a set of weights and histogram of gradient (HOG) features within a window. The root filter is equivalent to a Dalal-Triggs model [5]. The features for the part filters are computed at twice the spatial resolution of the root filter. Our model is defined at a fixed scale, and we detect objects by searching over an image pyramid.

In training we are given a set of images annotated with bounding boxes around each instance of an object. We reduce the detection problem to a binary classification problem. Each example x is scored by a function of the form, f_β(x) = max_z β·Φ(x, z). Here β is a vector of model parameters and z are latent values (e.g. the part placements). To learn a model we define a generalization of SVMs that we call latent variable SVM (LSVM). An important property of LSVMs is that the training problem becomes convex if we fix the latent values for positive examples. This can be used in a coordinate descent algorithm.

In practice we iteratively apply classical SVM training to triples ⟨x_1, z_1, y_1⟩, ..., ⟨x_n, z_n, y_n⟩ where z_i is selected to be the best scoring latent label for x_i under the model learned in the previous iteration. An initial root filter is generated from the bounding boxes in the PASCAL dataset. The parts are initialized from this root filter.
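Since both root and part filters score HOG features, here is a rough sketch of the basic HOG cell histogram construction detailed in section 2.1 below. It is a simplified illustration (gray-scale only, no block normalization), not the authors' implementation, and the function name and unsigned-orientation choice are assumptions.

    import numpy as np

    def hog_cells(image, cell=8, n_bins=9):
        """Per-cell orientation histograms: each pixel votes for the bin of
        its gradient orientation, weighted by gradient magnitude."""
        img = image.astype(np.float64)
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % np.pi              # orientation in [0, pi)
        obin = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

        h, w = img.shape
        h, w = h - h % cell, w - w % cell             # drop partial cells
        hist = np.zeros((h // cell, w // cell, n_bins))
        ys, xs = np.mgrid[0:h, 0:w]
        np.add.at(hist, (ys // cell, xs // cell, obin[:h, :w]), mag[:h, :w])
        return hist

A full pipeline would additionally normalize each cell against its four surrounding 2x2 blocks (giving 9x4 values per cell) and repeat the computation at every level of an image pyramid.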
2. Model

The underlying building blocks for our models are the Histogram of Oriented Gradient (HOG) features from [5]. We represent HOG features at two different scales. Coarse features are captured by a rigid template covering an entire detection window. Finer scale features are captured by part templates that can be moved with respect to the detection window. The spatial model for the part locations is equivalent to a star graph or 1-fan [3] where the coarse template serves as a reference position.

Figure 2. The HOG feature pyramid and an object hypothesis defined in terms of a placement of the root filter (near the top of the pyramid) and the part filters (near the bottom of the pyramid).

2.1. HOG Representation

We follow the construction in [5] to define a dense representation of an image at a particular resolution. The image is first divided into 8x8 non-overlapping pixel regions, or cells. For each cell we accumulate a 1D histogram of gradient orientations over pixels in that cell. These histograms capture local shape properties but are also somewhat invariant to small deformations.

The gradient at each pixel is discretized into one of nine orientation bins, and each pixel "votes" for the orientation of its gradient, with a strength that depends on the gradient magnitude. For color images, we compute the gradient of each color channel and pick the channel with highest gradient magnitude at each pixel. Finally, the histogram of each cell is normalized with respect to the gradient energy in a neighborhood around it. We look at the four 2x2 blocks of cells that contain a particular cell and normalize the histogram of the given cell with respect to the total energy in each of these blocks. This leads to a vector of length 9x4 representing the local gradient information inside a cell.

We define a HOG feature pyramid by computing HOG features of each level of a standard image pyramid (see Figure 2). Features at the top of this pyramid capture coarse gradients histogrammed over fairly large areas of the input image while features at the bottom of the pyramid capture finer gradients histogrammed over small areas.

2.2. Filters

Filters are rectangular templates specifying weights for subwindows of a HOG pyramid. A w by h filter F is a vector with w×h×9×4 weights. The score of a filter is defined by taking the dot product of the weight vector and the features in a w×h subwindow of a HOG pyramid.

The system in [5] uses a single filter to define an object model. That system detects objects from a particular class by scoring every w×h subwindow of a HOG pyramid and thresholding the scores.

Let H be a HOG pyramid and p = (x, y, l) be a cell in the l-th level of the pyramid. Let φ(H, p, w, h) denote the vector obtained by concatenating the HOG features in the w×h subwindow of H with top-left corner at p. The score of F on this detection window is F·φ(H, p, w, h). Below we use φ(H, p) to denote φ(H, p, w, h) when the dimensions are clear from context.

2.3. Deformable Parts

Here we consider models defined by a coarse root filter that covers the entire object and higher resolution part filters covering smaller parts of the object. Figure 2 illustrates a placement of such a model in a HOG pyramid. The root filter location defines the detection window (the pixels inside the cells covered by the filter). The part filters are placed several levels down in the pyramid, so the HOG cells at that level have half the size of cells in the root filter level.

We have found that using higher resolution features for defining part filters is essential for obtaining high recognition performance. With this approach the part filters represent finer resolution edges that are localized to greater accuracy when
compared to the edges represented in the root filter. For example, consider building a model for a face. The root filter could capture coarse resolution edges such as the face boundary while the part filters could capture details such as eyes, nose and mouth.

The model for an object with n parts is formally defined by a root filter F_0 and a set of part models (P_1, ..., P_n) where P_i = (F_i, v_i, s_i, a_i, b_i). Here F_i is a filter for the i-th part, v_i is a two-dimensional vector specifying the center for a box of possible positions for part i relative to the root position, s_i gives the size of this box, while a_i and b_i are two-dimensional vectors specifying coefficients of a quadratic function measuring a score for each possible placement of the i-th part. Figure 1 illustrates a person model.

A placement of a model in a HOG pyramid is given by z = (p_0, ..., p_n), where p_i = (x_i, y_i, l_i) is the location of the root filter when i = 0 and the location of the i-th part when i > 0. We assume the level of each part is such that a HOG cell at that level has half the size of a HOG cell at the root level. The score of a placement is given by the scores of each filter (the data term) plus a score of the placement of each part relative to the root (the spatial term),

    Σ_{i=0..n} F_i·φ(H, p_i) + Σ_{i=1..n} [ a_i·(x̃_i, ỹ_i) + b_i·(x̃_i², ỹ_i²) ],   (1)

where (x̃_i, ỹ_i) = ((x_i, y_i) − 2(x, y) + v_i)/s_i gives the location of the i-th part relative to the root location. Both x̃_i and ỹ_i should be between −1 and 1.

There is a large (exponential) number of placements for a model in a HOG pyramid. We use dynamic programming and distance transform techniques [9, 10] to compute the best location for the parts of a model as a function of the root location. This takes O(nk) time, where n is the number of parts in the model and k is the number of cells in the HOG pyramid. To detect objects in an image we score root locations according to the best possible placement of the parts and threshold this score.

The score of a placement z can be expressed in terms of the dot product, β·ψ(H, z), between a vector of model parameters β and a vector ψ(H, z),

    β = (F_0, ..., F_n, a_1, b_1, ..., a_n, b_n),
    ψ(H, z) = (φ(H, p_0), φ(H, p_1), ..., φ(H, p_n), x̃_1, ỹ_1, x̃_1², ỹ_1², ..., x̃_n, ỹ_n, x̃_n², ỹ_n²).

We use this representation for learning the model parameters as it makes a connection between our deformable models and linear classifiers.

One interesting aspect of the spatial models defined here is that we allow for the coefficients (a_i, b_i) to be negative. This is more general than the quadratic "spring" cost that has been used in previous work.
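A naive sketch of the placement score in equation (1), hedged: it evaluates one given placement directly with loops rather than the distance-transform optimization used in the paper, and it assumes precomputed HOG feature arrays per pyramid level; all names are illustrative.

    import numpy as np

    def placement_score(filters, anchors, sizes, a, b, feats, z):
        """Score of one placement z = [(x_0, y_0, l_0), ..., (x_n, y_n, l_n)]
        following eq. (1): filter data terms plus quadratic spatial terms.
        filters[i]: weight array (h_i, w_i, d); feats[l]: HOG array for level l;
        anchors[i] = v_i, sizes[i] = s_i; a[i], b[i]: 2-vectors."""
        def filt_score(F, level, x, y):
            h, w, _ = F.shape
            window = feats[level][y:y + h, x:x + w, :]   # subwindow at cell (x, y)
            return float(np.sum(F * window))

        x0, y0, l0 = z[0]
        score = filt_score(filters[0], l0, x0, y0)       # root filter (data term)
        for i in range(1, len(filters)):
            xi, yi, li = z[i]
            # Part location relative to the root, in part-level cells (s_i > 0).
            rel = (np.array([xi, yi]) - 2 * np.array([x0, y0]) + anchors[i]) / sizes[i]
            score += filt_score(filters[i], li, xi, yi)  # part data term
            score += float(a[i] @ rel + b[i] @ rel**2)   # spatial (deformation) term
        return score

In the actual system the inner maximization over part placements is computed for all root locations at once with generalized distance transforms, which is what gives the O(nk) detection time.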
3. Learning

The PASCAL training data consists of a large set of images with bounding boxes around each instance of an object. We reduce the problem of learning a deformable part model with this data to a binary classification problem. Let D = (⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩) be a set of labeled examples where y_i ∈ {−1, 1} and x_i specifies a HOG pyramid, H(x_i), together with a range, Z(x_i), of valid placements for the root and part filters. We construct a positive example from each bounding box in the training set. For these examples we define Z(x_i) so the root filter must be placed to overlap the bounding box by at least 50%. Negative examples come from images that do not contain the target object. Each placement of the root filter in such an image yields a negative training example.

Note that for the positive examples we treat both the part locations and the exact location of the root filter as latent variables. We have found that allowing uncertainty in the root location during training significantly improves the performance of the system (see Section 4).

3.1. Latent SVMs

A latent SVM is defined as follows. We assume that each example x is scored by a function of the form,

    f_β(x) = max_{z ∈ Z(x)} β·Φ(x, z),   (2)

where β is a vector of model parameters and z is a set of latent values. For our deformable models we define Φ(x, z) = ψ(H(x), z) so that β·Φ(x, z) is the score of placing the model according to z.

In analogy to classical SVMs we would like to train β from labeled examples D = (⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩) by optimizing the following objective function,

    β*(D) = argmin_β λ||β||² + Σ_{i=1..n} max(0, 1 − y_i f_β(x_i)).   (3)

By restricting the latent domains Z(x_i) to a single choice, f_β becomes linear in β, and we obtain linear SVMs as a special case of latent SVMs. Latent SVMs are instances of the general class of energy-based models [18].

3.2. Semi-Convexity

Note that f_β(x) as defined in (2) is a maximum of functions each of which is linear in β. Hence f_β(x) is convex in β. This implies that the hinge loss max(0, 1 − y_i f_β(x_i)) is convex in β when y_i = −1. That is, the loss function is convex in β for negative examples. We call this property of the loss function semi-convexity.

Consider an LSVM where the latent domains Z(x_i) for the positive examples are restricted to a single choice. The loss due to each positive example is now convex. Combined with the semi-convexity property, (3) becomes convex in β.

If the labels for the positive examples are not fixed we can compute a local optimum of (3) using a coordinate descent algorithm:

1. Holding β fixed, optimize the latent values for the positive examples z_i = argmax_{z ∈ Z(x_i)} β·Φ(x_i, z).
2. Holding {z_i} fixed for positive examples, optimize β by solving the convex problem defined above.

It can be shown that both steps always improve or maintain the value of the objective function in (3). If both steps maintain the value we have a strong local optimum of (3), in the sense that Step 1 searches over an exponentially large space of latent labels for positive examples while Step 2 simultaneously searches over weight vectors and an exponentially large space of latent labels for negative examples.

3.3. Data Mining Hard Negatives

In object detection the vast majority of training examples are negative. This makes it infeasible to consider all negative examples at a time. Instead, it is common to construct training data consisting of the positive instances and "hard negative" instances, where the hard negatives are data mined from the very large set of possible negative examples.

Here we describe a general method for data mining examples for SVMs and latent SVMs. The method iteratively solves subproblems using only hard instances. The innovation of our approach is a theoretical guarantee that it leads to the exact solution of the training problem defined using the complete training set. Our results require the use of a margin-sensitive definition of hard examples.

The results described here apply both to classical SVMs and to the problem defined by Step 2 of the coordinate descent algorithm for latent SVMs. We omit the proofs of the theorems due to lack of space. These results are related to working set methods [17].

We define the hard instances of D relative to β as,

    M(β, D) = { ⟨x, y⟩ ∈ D | y f_β(x) ≤ 1 }.   (4)

That is, M(β, D) are training examples that are incorrectly classified or near the margin of the
classifier defined by β. We can show that β*(D) only depends on hard instances.

Theorem 1. Let C be a subset of the examples in D. If M(β*(D), D) ⊆ C then β*(C) = β*(D).

This implies that in principle we could train a model using a small set of examples. However, this set is defined in terms of the optimal model β*(D).

Given a fixed β we can use M(β, D) to approximate M(β*(D), D). This suggests an iterative algorithm where we repeatedly compute a model from the hard instances defined by the model from the last iteration. This is further justified by the following fixed-point theorem.

Theorem 2. If β*(M(β, D)) = β then β = β*(D).

Let C be an initial "cache" of examples. In practice we can take the positive examples together with random negative examples. Consider the following iterative algorithm:

1. Let β := β*(C).
2. Shrink C by letting C := M(β, C).
3. Grow C by adding examples from M(β, D) up to a memory limit L.

Theorem 3. If |C| < L after each iteration of Step 2, the algorithm will converge to β = β*(D) in finite time.

3.4. Implementation details

Many of the ideas discussed here are only approximately implemented in our current system. In practice, when training a latent SVM we iteratively apply classical SVM training to triples ⟨x_1, z_1, y_1⟩, ..., ⟨x_n, z_n, y_n⟩ where z_i is selected to be the best scoring latent label for x_i under the model trained in the previous iteration. Each of these triples leads to an example ⟨Φ(x_i, z_i), y_i⟩ for training a linear classifier. This allows us to use a highly optimized SVM package (SVMLight [17]). On a single CPU, the entire training process takes 3 to 4 hours per object class in the PASCAL datasets, including initialization of the parts.

Root Filter Initialization: For each category, we automatically select the dimensions of the root filter by looking at statistics of the bounding boxes in the training data (footnote 1). We train an initial root filter F_0 using an SVM with no latent variables. The positive examples are constructed from the unoccluded training examples (as labeled in the PASCAL data). These examples are anisotropically scaled to the size and aspect ratio of the filter. We use random subwindows from negative images to generate negative examples.

Root Filter Update: Given the initial root filter trained as above, for each bounding box in the training set we find the best-scoring placement for the filter that significantly overlaps with the bounding box. We do this using the original, un-scaled images. We retrain F_0 with the new positive set and the original random negative set, iterating twice.

Part Initialization: We employ a simple heuristic to initialize six parts from the root filter trained above. First, we select an area a such that 6a equals 80% of the area of the root filter. We greedily select the rectangular region of area a from the root filter that has the most positive energy. We zero out the weights in this region and repeat until six parts are selected. The part filters are initialized from the root filter values in the subwindow selected for the part, but filled in to handle the higher spatial resolution of the part. The initial deformation costs measure the squared norm of a displacement with a_i = (0, 0) and b_i = −(1, 1).

Model Update: To update a model we construct new training data triples. For each positive bounding box in the training data, we apply the existing detector at all positions and scales with at least a 50% overlap with the given bounding box. Among these we select the highest scoring placement as the positive example corresponding to this training bounding box (Figure 3). Negative examples are selected by finding high scoring detections in images not containing the target object.
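A schematic of the cache-based hard-negative mining loop from section 3.3, hedged: `train_svm` and `score` stand in for SVMLight and the model scoring function f_β, the initial cache split is arbitrary, and the margin test follows the definition of M(β, D) in (4).

    def mine_hard_negatives(train_svm, score, positives, negatives,
                            limit, n_iters=10):
        """Cache-based training (section 3.3): train on the cache, shrink it
        to hard instances, then grow it from the full negative set.
        An example (x, y) is 'hard' for beta when y * score(beta, x) <= 1."""
        cache = list(positives) + list(negatives[:limit // 2])   # initial cache C
        beta = None
        for _ in range(n_iters):
            beta = train_svm(cache)                    # 1. beta := beta*(C)
            # 2. Shrink: keep only the hard instances M(beta, C).
            cache = [(x, y) for (x, y) in cache if y * score(beta, x) <= 1]
            # 3. Grow: scan the full negative set for new hard examples,
            #    up to the memory limit L.
            for x, y in negatives:
                if len(cache) >= limit:
                    break
                if y * score(beta, x) <= 1:
                    cache.append((x, y))
            # (A real implementation would avoid re-adding duplicates.)
        return beta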
We add negative examples to a cache until we encounter file size limits. A new model is trained by running SVMLight on the positive and negative examples, each labeled with part placements. We update the model 10 times using the cache scheme described above. In each iteration we keep the hard instances from the previous cache and add as many new hard instances as possible within the memory limit. Toward the final iterations, we are able to include all hard instances, M(β, D), in the cache.

Footnote 1: We picked a simple heuristic by cross-validating over 5 object classes. We set the model aspect to be the most common (mode) aspect in the data. We set the model size to be the largest size not larger than 80% of the data.

Figure 3. The image on the left shows the optimization of the latent variables for a positive example. The dotted box is the bounding box label provided in the PASCAL training set. The large solid box shows the placement of the detection window while the smaller solid boxes show the placements of the parts. The image on the right shows a hard-negative example.

4. Results

We evaluated our system using the PASCAL VOC 2006 and 2007 comp3 challenge datasets and protocol. We refer to [7, 8] for details, but emphasize that both challenges are widely acknowledged as difficult testbeds for object detection. Each dataset contains several thousand images of real-world scenes. The datasets specify ground-truth bounding boxes for several object classes, and a detection is considered correct when it overlaps more than 50% with a ground-truth bounding box. One scores a system by the average precision (AP) of its precision-recall curve across a test set.

Recent work in pedestrian detection has tended to report detection rates versus false positives per window, measured with cropped positive examples and negative images without objects of interest. These scores are tied to the resolution of the scanning window search and ignore effects of non-maximum suppression, making it difficult to compare different systems. We believe the PASCAL scoring method gives a more reliable measure of performance.

The 2007 challenge has 20 object categories. We entered a preliminary version of our system in the official competition, and obtained the best score in 6 categories. Our current system obtains the highest score in 10 categories, and the second highest score in 6 categories. Table 1 summarizes the results.

Our system performs well on rigid objects such as cars and sofas as well as highly deformable objects such as persons and horses. We also note that our system is successful when given a large or small amount of training data. There are roughly 4700 positive training examples in the person category but only 250 in the sofa category. Figure 4 shows some of the models we learned. Figure 5 shows some example detections.

We evaluated different components of our system on the longer-established 2006 person dataset.

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.

    Classes:      aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
    Our rank:     3    1    2    1    1    2    2    4    1    1    1    4    2    2    1    1    2    1    4    1
    Our score:    .180 .411 .092 .098 .249 .349 .396 .110 .155 .165 .110 .062 .301 .337 .267 .140 .141 .156 .206 .336
    INRIA Normal: .092 .246 .012 .002 .068 .197 .265 .018 .097 .039 .017 .016 .225 .153 .121 .093 .002 .102 .157 .242
    INRIA Plus:   .136 .287 .041 .025 .077 .279 .294 .132 .106 .127 .067 .071 .335 .249 .092 .072 .011 .092 .242 .275
    MPI Center:   .060 .110 .028 .031 .000 .164 .172 .208 .002 .044 .049 .141 .198 .170 .091 .004 .091 .034 .237 .051
    MPI ESSOL:    .152 .157 .098 .016 .001 .186 .120 .240 .007 .061 .098 .162 .034 .208 .117 .002 .046 .147 .110 .054
    TKK:          .186 .078 .043 .072 .002 .116 .184 .050 .028 .100 .086 .126 .186 .135 .061 .019 .036 .058 .067 .090

Darmstadt, IRISA and Oxford entered only subsets of the classes; their scores, in class order over the classes they entered, were: Darmstadt .301; IRISA .281 .318 .026 .097 .119 .289 .227 .221 .175 .253; Oxford .262 .409 .393 .432 .375 .334.
Figure 4 (Bottle, Car, Bicycle, Sofa). Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent "cheap" placements, and dark values represent "expensive" placements.

The top AP score in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.

We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that s_i is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting s_i to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at s_i = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost.

Figure 5. Some results from the PASCAL 2007 dataset. Each row shows detections using a model for a specific class (Person, Bottle, Car, Sofa, Bicycle, Horse). The first three columns show correct detections while the last column shows false positives. Our system is able to detect objects over a wide range of scales (such as the cars) and poses (such as the horses). The system can also detect partially occluded objects such as a person behind a bush. Note how the false detections are often quite reasonable, for example detecting a bus with the car model, a bicycle sign with the bicycle model, or a dog with the horse model. In general the part filters represent meaningful object parts that are well localized in each detection such as the head in the person model.

Figure 6. Evaluation of our system on the PASCAL VOC 2006 person dataset. Root uses only a root filter and no latent placement of the detection windows on positive examples. Root+Latent uses a root filter with latent placement of the detection windows.
Parts+Latent is a part-based system with latent detection windows but no root filter. Root+Parts+Latent includes both root and part filters, and latent placement of the detection windows.

The following table shows AP as a function of freely allowable deformation in the first three columns. The last column gives the performance when using a quadratic deformation cost and an allowable displacement of 2 HOG cells.

s_i    0      1      2      3      2 + quadratic cost
AP     .27    .33    .31    .31    .34

5. Discussion

We introduced a general framework for training SVMs with latent structure. We used it to build a recognition system based on multiscale, deformable models. Experimental results on difficult benchmark data suggest our system is the current state-of-the-art in object detection.

LSVMs allow for exploration of additional latent structure for recognition. One can consider deeper part hierarchies (parts with parts), mixture models (frontal vs. side cars), and three-dimensional pose. We would like to train and detect multiple classes together using a shared vocabulary of parts (perhaps visual words). We also plan to use A* search [11] to efficiently search over latent parameters during detection.
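Before the reference list, here is a minimal sketch of the cache-based hard-negative training loop described earlier in this paper. It is illustrative only: the latent part placements are collapsed into precomputed window feature vectors, scikit-learn's LinearSVC stands in for SVMLight, and the margin threshold, C value, and cache budget are my assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_hard_negatives(clf, neg_pool, budget):
    # Hard negatives: negative windows the current model does not score
    # below the margin, i.e. decision value >= -1 (assumed margin rule).
    scores = clf.decision_function(neg_pool)
    return neg_pool[scores >= -1.0][:budget]

def train_with_cache(pos, neg_pool, rounds=10, budget=5000, seed=0):
    # Seed the cache with a random slice of the negative pool.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(neg_pool), size=min(budget, len(neg_pool)), replace=False)
    cache = neg_pool[idx]
    clf = None
    for _ in range(rounds):
        X = np.vstack([pos, cache])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(cache))])
        clf = LinearSVC(C=0.002).fit(X, y)        # stands in for SVMLight
        # Keep the hard instances from the previous cache ...
        keep = cache[clf.decision_function(cache) >= -1.0]
        # ... and add as many new hard instances as the budget allows.
        new = mine_hard_negatives(clf, neg_pool, budget - len(keep))
        cache = np.vstack([keep, new]) if len(new) else keep
        if not len(cache):                        # degenerate toy case
            break
    return clf
```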
References

[1] Y. Amit and A. Trouve. POP: Patchwork of parts models for object recognition. IJCV, 75(2):267-282, November 2007.
[2] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In ECCV, pages II:628-641, 1998.
[3] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR, pages 10-17, 2005.
[4] D. Crandall and D. Huttenlocher. Weakly supervised learning of part-based spatial models for visual object recognition. In ECCV, pages I:16-29, 2006.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I:886-893, 2005.
[6] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts. In CVPR, 2007.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. /challenges/VOC/voc2007/workshop.
[8] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. /challenges/VOC/voc2006/results.pdf.
[9] P. Felzenszwalb and D. Huttenlocher. Distance transforms of sampled functions. Cornell Computing and Information Science Technical Report TR2004-1963, September 2004.
[10] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1), 2005.
[11] P. Felzenszwalb and D. McAllester. The generalized A* architecture. JAIR, 29:153-190, 2007.
[12] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[13] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 22(1):67-92, January 1973.
[14] A. Holub and P. Perona. A discriminative framework for modelling object classes. In CVPR, pages I:664-671, 2005.
[15] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45-68, June 2001.
[16] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, pages II:2145-2152, 2006.
[17] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[18] Y. LeCun, S. Chopra, R. Hadsell, R. Marc'Aurelio, and F. Huang. A tutorial on energy-based learning. In G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, editors, Predicting Structured Data. MIT Press, 2006.
[19] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. PAMI, 29(10):1848-1852, October 2007.
[20] D. Ramanan. Using segmentation to verify object hypotheses. In CVPR, pages 1-8, 2007.
[21] D. Ramanan and C. Sminchisescu. Training deformable models for localization. In CVPR, pages I:206-213, 2006.
[22] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151-177, February 2004.
[23] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73(2):213-238, June 2007.
Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram
Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram

Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, John Hsu
Willow Garage, 68 Willow Rd., Menlo Park, CA 94025, USA
{rusu, bradski, thibaux, hsu}@

Abstract — We present the Viewpoint Feature Histogram (VFH), a descriptor for 3D point cloud data that encodes geometry and viewpoint. We demonstrate experimentally, on a set of 60 objects captured with stereo cameras, that VFH can be used as a distinctive signature, allowing simultaneous recognition of the object and its pose. The pose is accurate enough for robot manipulation, and the computational cost is low enough for real-time operation. VFH was designed to be robust to large surface noise and missing depth information in order to work reliably on stereo data.

I. INTRODUCTION

As part of a long-term goal to develop reliable capabilities in the area of perception for mobile manipulation, we address a table-top manipulation task involving objects that can be manipulated by one robot hand. Our robot is shown in Fig. 1. In order to manipulate an object, the robot must reliably identify it, as well as its 6 degree-of-freedom (6DOF) pose. This paper proposes a method to identify both at the same time, reliably and at high speed.

We make the following assumptions:
• Objects are rigid and relatively Lambertian. They can be shiny, but not reflective or transparent.
• Objects are in light clutter. They can be easily segmented in 3D and can be grabbed by the robot hand without obstruction.
• The item of interest can be grabbed directly, so it is not occluded.
• Items can be grasped even given an approximate pose. The gripper on our robot can open to 9 cm and each grip is 2.5 cm wide, which allows an 8.5 cm wide object to be grasped when the pose is off by +/- 10 degrees.

Despite these assumptions, our problem has several properties that make the task difficult:
• The objects need not contain texture.
• Our dataset includes objects of very similar shapes, for example many slight variations of typical wine glasses.
• To be usable, the recognition accuracy must be very high, typically much higher than, say, for image retrieval tasks, since false positives have very high costs and so must be kept extremely rare.
• To interact usefully with humans, recognition cannot take more than a fraction of a second. This puts constraints on computation, but more importantly it precludes the use of accurate but slow 3D acquisition using lasers. Instead we rely on stereo data, which suffers from higher noise and missing data.

Fig. 1. A PR2 robot from Willow Garage, showing its grippers and stereo cameras.

Our focus is perception for mobile manipulation. Working on a mobile versus a stationary robot means that we can't depend on instrumenting the external world with active vision systems or special lighting, but we can put such devices on the robot. In our case, we use projected texture¹ to yield dense stereo depth maps at 30 Hz. We also cannot ensure environmental conditions. We may move from a sunlit room to a dim hallway into a room with no light at all. The projected texture gives us a fair amount of resilience to local lighting conditions as well.

¹ Not structured light; this is random texture.

Although this paper focuses on 3D depth features, 2D imagery is clearly important, for example for shiny and transparent objects, or to distinguish items based on texture, such as telling apart a Coke can from a Diet Coke can. In our case, the textured light alternates with no light to allow for 2D imagery aligned with the texture-based dense depth; however, adding 2D visual features will be studied in future work.
Here, we look for an effective, purely 3D feature. Our philosophy is that one should use or design a recognition algorithm that fits one's engineering needs, such as scalability, training speed, incremental training needs, and so on, and then find features that make the recognition performance of that architecture meet one's specifications. For reasons of online training, and because of large memory availability, we choose fast approximate K-Nearest Neighbors (K-NN), implemented in the FLANN library [1], as our recognition architecture. The key contribution of this paper is then the design of a new, computationally efficient 3D feature that yields object recognition and 6DOF pose.

The structure of this paper is as follows. Related work is described in Section II. Next, we give a brief description of our system architecture in Section III. We discuss our surface normal and segmentation algorithm in Section IV, followed by a discussion of the Viewpoint Feature Histogram in Section V. Experimental setup and resulting computational and recognition performance are described in Section VI. Conclusions and future work are discussed in Section VII.

II. RELATED WORK

The problem that we are trying to solve requires global (3D object level) classification based on estimated features. This has been under investigation for a long time in various research fields, such as computer graphics, robotics, and pattern matching; see [2]-[4] for comprehensive reviews. We address the most relevant work below.

Some of the widely used 3D point feature extraction approaches include spherical harmonic invariants [5], spin images [6], curvature maps [7], or, more recently, Point Feature Histograms (PFH) [8] and conformal factors [9]. Spherical harmonic invariants and spin images have been successfully used for the problem of object recognition for densely sampled datasets, though their performance seems to degrade for noisier and sparser datasets [4]. Our stereo data is noisier and sparser than typical line-scan data, which motivated the use of our new features. Conformal factors are based on conformal geometry, which is invariant to isometric transformations, and thus obtain good results on databases of watertight models. Their main drawback is that they can only be applied to manifold meshes, which can be problematic in stereo. Curvature maps and PFH descriptors have been studied in the context of local shape comparisons for data registration. A side study [10] applied the PFH descriptors to the problem of surface classification into 3D geometric primitives, although only for data acquired using precise laser sensors. A different point fingerprint representation, using the projections of geodesic circles onto the tangent plane at a point p_i, was proposed in [11] for the problem of surface registration. As the authors note, geodesic distances are more sensitive to surface sampling noise, and thus are unsuitable for real sensed data without a priori smoothing and reconstruction. A decomposition of objects into parts learned using spin images is presented in [12] for the problem of vehicle identification.

Methods relying on global features include descriptors such as Extended Gaussian Images (EGI) [13], eigen shapes [14], or shape distributions [15]. The latter sample statistics of the entire object and represent them as distributions of shape properties; however, they do not take into account how the features are distributed over the surface of the object. Eigen shapes show promising results, but they have limits on their discrimination ability since important higher-order variances are discarded.
EGIs describe objects based on the unit normal sphere, but have problems handling arbitrarily curved objects.

The work in [16] makes use of spin-image signatures and normal-based signatures to achieve classification rates over 90% with synthetic and CAD model datasets. The datasets used, however, are very different from the ones acquired using noisy 640×480 stereo cameras such as the ones used in our work. In addition, the authors do not provide timing information for the estimation and matching parts, which is critical for applications such as ours. A system for fully automatic 3D model-based object recognition and segmentation is presented in [17], with good recognition rates of over 95% for a database of 55 objects. Unfortunately, the computational performance of the proposed method is not suitable for real-time use, as the authors report the segmentation of an object model in a cluttered scene to take around 2 minutes. Moreover, the objects in the database are scanned using a high-resolution Minolta scanner and their geometric shapes are very different. As shown in Section VI, the objects used in our experiments are much more similar in terms of geometry, so such a registration-based method would fail. In [18], the authors propose a system for recognizing 3D objects in photographs. The techniques presented can only be applied in the presence of texture information, and require a cumbersome generation of models in an offline step, which makes this unsuitable for our work.

As previously presented, our requirements are real-time object recognition and pose identification from noisy real-world datasets acquired using projective texture stereo cameras. Our 3D object classification is based on an extension of the recently proposed Fast Point Feature Histogram (FPFH) descriptors [8], which record the relative angular directions of surface normals with respect to one another. The FPFH performs well in classification applications and is robust to noise, but it is invariant to viewpoint. This paper proposes a novel descriptor that encodes the viewpoint information and has two parts: (1) an extended FPFH descriptor that achieves a speed-up from O(k·n) to O(n) over FPFH, where n is the number of points in the point cloud and k is the number of points used in each local neighborhood; and (2) a new signature that encodes important statistics between the viewpoint and the surface normals on the object. We call this new feature the Viewpoint Feature Histogram (VFH), as detailed below.

III. ARCHITECTURE

Our system architecture employs the following processing steps:
• Synchronized, calibrated and epipolar-aligned left and right images of the scene are acquired.
• A dense depth map is computed from the stereo pair.
• Surface normals in the scene are calculated.
• Planes are identified and segmented out, and the remaining point clouds from non-planar objects are clustered in Euclidean space.
• The Viewpoint Feature Histogram (VFH) is calculated over large enough objects (here, objects having at least 100 points).
  – If there are multiple objects in a scene, they are processed front to back relative to the camera.
  – Occluded point clouds with less than 75% of the number of points of the frontal objects are noted but not identified.
• Fast approximate K-NN is used to classify the object and its view.

Some steps from the early processing pipeline are shown in Figure 2. Shown left to right, top to bottom in that figure are: a moderately complex scene with many different vertical and horizontal surfaces, the resulting depth map, the estimated surface normals, and the objects segmented from the planar surfaces in the scene.
Fig. 2. Early processing steps, row-wise, top to bottom: a scene, its depth map, surface normals, and segmentation into planes and outlier objects.

For computing 3D depth maps, we use 640×480 stereo with textured light. The texture flashes on only very briefly as the cameras take a picture, resulting in lights that look dim to the human eye but bright to the camera. Texture flashes only every other frame, so that raw imagery without texture can be gathered alternating with densely textured scenes. The stereo rig has a 38 degree field of view and is designed for close-in manipulation tasks; thus the objects that we deal with are from 0.5 to 1.5 meters away. The stereo algorithm that we use was developed in [19] and uses the implementation in the OpenCV library [20], as described in detail in [21], running at 30 Hz.

IV. SURFACE NORMALS AND 3D SEGMENTATION

We employ segmentation prior to the actual feature estimation because in robotic manipulation scenarios we are only interested in certain precise parts of the environment, and thus computational resources can be saved by tackling only those parts. Here, we are looking to manipulate reachable objects that lie on horizontal surfaces. Therefore, our segmentation scheme proceeds by extracting these horizontal surfaces first.

Fig. 3. From left to right: raw point cloud dataset, planar and cluster segmentation, more complex segmentation.

Compared to our previous work [22], we have improved the planar segmentation algorithms by incorporating surface normals into the sample selection and model estimation steps. We also took care to carefully build SSE-aligned data structures in memory for any computationally expensive operation. By rejecting candidates which do not support our constraints, our system can segment data at about 7 Hz, including normal estimation, on a regular Core2Duo laptop using a single core. To get frame-rate performance (real time), we use a voxelized data structure over the input point cloud and downsample with a leaf size of 0.5 cm. The surface normals are therefore estimated only for the downsampled result, but using the information in the original point cloud. The planar components are extracted using a RMSAC (Randomized MSAC) method that takes into account weighted averages of distances to the model together with the angle of the surface normals. We then select candidate table planes using a heuristic combining the number of inliers which support the planar model as well as their proximity to the camera viewpoint. This approach emphasizes the part of the space where the robot manipulators can reach and grasp the objects.

The segmentation of object candidates supported by the table surface is performed by looking at points whose projection falls inside the bounding 2D polygon for the table, and applying single-link clustering. The result of these processing steps is a set of Euclidean point clusters. This works to reliably segment objects that are separated by about half their minimum radius from each other. An example can be seen in Figure 3. To resolve further ambiguities with respect to the chosen candidate clusters, such as objects stacked on other planar objects (such as books), we repeat the mentioned step by treating each additional horizontal planar structure on top of the table candidates as a table itself and repeating the segmentation step (see results in Figure 3).

We emphasize that this segmentation step is of extreme importance for our application, because it allows our methods to achieve favorable computational performance by extracting only the regions of interest in a scene (i.e., objects that are to be manipulated, located on horizontal surfaces). In cases where our "light clutter" assumption does not hold and the geometric Euclidean clustering is prone to failure, a more sophisticated segmentation scheme based on texture properties could be implemented.
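As a rough illustration of the plane-then-cluster scheme above, here is a minimal numpy/scipy sketch. It uses plain RANSAC and single-link clustering; the thresholds and function names are illustrative assumptions, not the normal-aware RMSAC implementation described in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def fit_plane_ransac(points, iters=200, dist_thresh=0.01, seed=0):
    """Plain RANSAC plane fit; returns a boolean inlier mask."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:         # degenerate sample
            continue
        n /= np.linalg.norm(n)
        inliers = np.abs((points - p0) @ n) < dist_thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best

def segment_objects(points, min_cluster_size=100, merge_dist=0.02):
    """Remove the dominant plane, then single-link cluster the remainder.
    (O(n^2) memory; fine only for small, downsampled clouds.)"""
    rest = points[~fit_plane_ransac(points)]
    labels = fcluster(linkage(rest, method="single"),
                      merge_dist, criterion="distance")
    return [rest[labels == k] for k in np.unique(labels)
            if (labels == k).sum() >= min_cluster_size]
```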
V. VIEWPOINT FEATURE HISTOGRAM

In order to accurately and robustly classify points with respect to their underlying surface, we borrow ideas from the recently proposed Point Feature Histogram (PFH) [10]. The PFH is a histogram that collects the pairwise pan, tilt and yaw angles between every pair of normals on a surface patch (see Figure 4). In detail, for a pair of 3D points ⟨p_i, p_j⟩ and their estimated surface normals ⟨n_i, n_j⟩, the set of normal angular deviations can be estimated as:

    α = v · n_j
    φ = u · (p_j − p_i) / d
    θ = arctan(w · n_j, u · n_j)        (1)

where (u, v, w) represents a Darboux frame coordinate system chosen at p_i, and d is the distance between p_i and p_j. Then, the Point Feature Histogram at a patch of points P = {p_i}, i = 1…n, captures all the sets of ⟨α, φ, θ⟩ between all pairs of ⟨p_i, p_j⟩ from P, and bins the results in a histogram. The bottom left part of Figure 4 presents the selection of the Darboux frame and a graphical representation of the three angular features.

Because all possible pairs of points are considered, the computational complexity of a PFH is O(n²) in the number of surface normals n. In order to obtain a more efficient algorithm, the Fast Point Feature Histogram [8] was developed. The FPFH measures the same angular features as PFH, but estimates the sets of values only between every point and its k nearest neighbors, followed by a reweighting of the resultant histogram of a point with the neighboring histograms, thus reducing the computational complexity to O(k·n).

Our past work [22] has shown that a global descriptor (GFPFH) can be constructed from the classification results of many local FPFH features, and used on a wide range of confusable objects (20 different types of glasses, bowls, mugs) in 500 scenes, achieving 96.69% on object class recognition. However, the categorized objects were only split into 4 distinct classes, which leaves the scaling problem open. Moreover, the GFPFH is susceptible to the errors of the local classification results, and is more cumbersome to estimate. In any case, for manipulation, we require that the robot not only identifies objects, but also recognizes their 6DOF poses for grasping. FPFH is invariant both to object scale (distance) and object pose, and so cannot achieve the latter task.

In this work, we decided to leverage the strong recognition results of FPFH, but to add in viewpoint variance while retaining invariance to scale, since the dense stereo depth map gives us scale/distance directly. Our contribution to the problem of object recognition and pose identification is to extend the FPFH to be estimated for the entire object cluster (as seen in Figure 4), and to compute additional statistics between the viewpoint direction and the normals estimated at each point. To do this, we used the key idea of mixing the viewpoint direction directly into the relative normal angle calculation in the FPFH.

Figure 6 presents this idea, with the new feature consisting of two parts: (1) a viewpoint direction component (see Figure 5) and (2) a surface shape component comprised of an extended FPFH (see Figure 4). The viewpoint component is computed by collecting a histogram of the angles that the viewpoint direction makes with each normal. Note, we do not mean the view angle to each normal, as this would not be scale invariant; instead we mean the angle between the central viewpoint direction translated to each normal. The second component measures the relative pan, tilt and yaw angles as described in [8], [10], but now measured between the viewpoint direction at the central point and each of the normals on the surface.
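To make eq. (1) concrete, the sketch below computes the ⟨α, φ, θ⟩ triplet for a single pair of oriented points, following the Darboux frame convention of Figure 4 (u along the source normal, v = (p_j − p_i) × u normalized, w = u × v). The function name and degenerate-case handling are mine, not the paper's or PCL's.

```python
import numpy as np

def pfh_angles(p_i, n_i, p_j, n_j):
    """Angular features (alpha, phi, theta) of eq. (1) for one pair of
    points with unit surface normals (illustrative sketch)."""
    d_vec = p_j - p_i
    d = np.linalg.norm(d_vec)
    # Darboux frame at p_i: u along the normal; assumes d_vec is not
    # parallel to u, otherwise v below degenerates.
    u = n_i
    v = np.cross(d_vec, u)
    v /= np.linalg.norm(v)
    w = np.cross(u, v)
    alpha = v @ n_j                       # "pan"-like deviation
    phi = (u @ d_vec) / d                 # "tilt"-like deviation
    theta = np.arctan2(w @ n_j, u @ n_j)  # "yaw"-like deviation
    return alpha, phi, theta

# Example pair; the VFH variant evaluates these angles between the
# central viewpoint direction and every surface normal on the cluster.
a, f, t = pfh_angles(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                     np.array([0.1, 0.0, 0.0]),
                     np.array([0.0, 0.1, 1.0]) / np.linalg.norm([0.0, 0.1, 1.0]))
```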
We call the new assembled feature the Viewpoint Feature Histogram (VFH). Figure 6 presents the resultant assembled VFH for a random object.

Fig. 4. The extended Fast Point Feature Histogram collects the statistics of the relative angles between the surface normals at each point and the surface normal at the centroid of the object. The bottom left part of the figure describes the three angular features for an example pair of points (with u = n_c, v = (p_5 − c) × u, w = u × v).

Fig. 5. The Viewpoint Feature Histogram is created from the extended Fast Point Feature Histogram, as seen in Figure 4, together with the statistics of the relative angles between each surface normal and the central viewpoint direction.

The computational complexity of VFH is O(n). In our experiments, we divided the viewpoint angles into 128 bins and the α, φ and θ angles into 45 bins each, for a total of 263 dimensions. The estimation of a VFH takes about 0.3 ms on average on a 2.23 GHz single core of a Core2Duo machine using optimized SSE instructions.

Fig. 6. An example of the resultant Viewpoint Feature Histogram for one of the objects used. Note the two concatenated components.

VI. VALIDATION AND EXPERIMENTAL RESULTS

To evaluate our proposed descriptor and system architecture, we collected a large dataset consisting of over 60 IKEA kitchenware objects, as shown in Figure 8. These objects consisted of many kinds each of: wine glasses, tumblers, drinking glasses, mugs, bowls, and a couple of boxes. In each of these categories, many of the objects were distinguished only by subtle variations in shape, as can be seen for example in the confusions in Figure 10. We captured over 54000 scenes of these objects by spinning them on a turn table 180°² at each of 2 offsets on a platform that tilted 0, 8, 16, 22 and 30 degrees. Each 180° rotation was captured with about 90 images. The turn table is shown in Fig. 7. We additionally worked with a subset of 20 objects in 500 lightly cluttered scenes with varying arrangements of horizontal and vertical surfaces, using the same dataset provided in [22]. No pose information was available for this second dataset, so we ran experiments on it for object recognition only.

² We didn't go 360 degrees so that we could keep the calibration box in view.

Fig. 7. The turn table used to collect views of objects with known orientation.

The complete source code used to generate our experimental results, together with both object databases, is available under a BSD open source license in our ROS repository at Willow Garage³. We are currently taking steps towards creating a web page with complete tutorials on how to fully replicate the experiments presented herein.

Both the objects in the [22] dataset as well as the ones we acquired constitute valid examples of objects of daily use that our robot needs to be able to reliably identify and manipulate. While 60 objects is far from the number of objects the robot eventually needs to be able to recognize, it may be enough if we assume that the robot knows what context (kitchen table, workbench, coffee table) it is in, so that it needs only discriminate among a small context-dependent set of objects.

Fig. 8. The complete set of IKEA objects used for the purpose of our experiments. All transparent glasses have been painted white to obtain 3D information during the acquisition process.

The geometric variations between objects are subtle, and the data acquired is noisy due to the stereo sensor characteristics, yet the perception system has to work well enough to differentiate between, say, glasses that look similar but serve different purposes (e.g., a wine glass versus a brandy glass). As presented in Section II, the performance of the 3D descriptors proposed in the literature degrades on noisier datasets. One of the most popular 3D descriptors to date used on datasets acquired with sensing devices similar to ours (e.g., similar noise characteristics) is the spin image [6]. To validate the VFH feature we thus compare it to the spin image, by running the same experiments multiple times. For the reasons given in Section I, we base our recognition architecture on fast approximate K-Nearest Neighbors (KNN) searches using kd-trees [1]. The construction of the tree and the search of the nearest neighbors place an equal weight on each histogram bin in the VFH and spin image features.

Figure 11 shows sequentially aggregated examples from the training set. Figure 12 shows example recognition results for VFH. And finally, Figure 10 gives some idea of the performance differences between VFH and spin images. The object recognition rates over the lightly cluttered dataset were 98.1% for VFH and 73.2% for spin images. The overall recognition rates for VFH and spin images are shown in Table I, where VFH handily outperforms spin images for both object recognition and pose.

TABLE I. Results for object recognition and pose detection over 54000 scenes plus 500 lightly cluttered scenes.

Method        Object Recognition    Pose Estimation
VFH           98.52%                98.52%
Spin images   75.3%                 61.2%

Fig. 9. Data training performed in simulation. The figure presents a snapshot of the simulation with a water bottle from the object model database and the corresponding stereo point cloud output.

VII. CONCLUSIONS AND FUTURE WORK

In this paper we presented a novel 3D feature descriptor, the Viewpoint Feature Histogram (VFH), useful for object recognition and 6DOF pose identification for applications where a priori segmentation is possible. The high recognition performance and fast computational properties demonstrated the superiority of VFH over spin images on a large-scale dataset consisting of over 54000 scenes with pose. Compared to other similar initiatives, our architecture works well with noisy data acquired using standard stereo cameras in real time, and can detect subtle variations in the geometry of objects. Moreover, we presented an integrated approach for both recognition and 6DOF pose identification for untextured objects, the latter being of extreme importance for mobile manipulation and grasping applications.

Fig. 10. VFH consistently outperforms spin images for both recognition and for pose. The bottom of the figure presents an example result of VFH run on a mug. The bottom left corner is the learned model, and the matches go from best to worst from left to right across the bottom, followed by left to right across the top. The top part of the figure presents the results obtained using a spin image. For VFH, 3 of 5 object recognition and 3 of 5 pose results are correct. For spin images, 2 of 5 object recognition results are correct and 0 of 5 pose results are correct.

Fig. 11. Sequence examples of object training with the calibration box on the outside.

An automatic training pipeline can be integrated with our 3D simulator based on Gazebo [23], as depicted in Figure 9, where the stereo point cloud is generated from perfectly rectified camera images.
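Stated as code, the recognition stage of Section VI reduces to nearest-neighbor lookup over stored VFH histograms. Below is a minimal sketch with scikit-learn's kd-tree K-NN standing in for FLANN; the database layout and names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# train_hists: (num_views, 263) VFH histograms from the turn-table capture;
# train_meta: parallel list of (object_id, pose_angle) labels.
def build_index(train_hists):
    # Equal weight on every histogram bin, as in the paper's setup.
    return NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(train_hists)

def classify(index, train_meta, query_hist):
    # The matched training view yields both the object identity and,
    # because views were captured at known turn-table angles, its pose.
    _, idx = index.kneighbors(query_hist.reshape(1, -1))
    return train_meta[idx[0, 0]]
```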
We are currently working on making both the fully annotated database of objects and the source code of VFH available to the research community as open source. The preliminary results of our efforts can already be checked from the trunk of our Willow Garage ROS repository, but we are taking steps towards generating a set of tutorials on how to replicate and extend the experiments presented in this paper.

REFERENCES

[1] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," VISAPP, 2009.
[2] J. W. Tangelder and R. C. Veltkamp, "A Survey of Content Based 3D Shape Retrieval Methods," in SMI '04: Proceedings of Shape Modeling International, 2004, pp. 145-156.
[3] A. K. Jain and C. Dorai, "3D object recognition: Representation and matching," Statistics and Computing, vol. 10, no. 2, pp. 167-182, 2000.
[4] A. D. Bimbo and P. Pala, "Content-based retrieval of 3D models," ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 20-43, 2006.
[5] G. Burel and H. Hénocq, "Three-dimensional invariants and their application to object recognition," Signal Process., vol. 45, no. 1, pp. 1-22, 1995.
[6] A. Johnson and M. Hebert, "Using spin images for efficient object recognition in cluttered 3D scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, May 1999.
[7] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, "Curvature Maps for Local Shape Comparison," in SMI '05: Proceedings of the International Conference on Shape Modeling and Applications 2005, pp. 246-255.
[8] R. B. Rusu, N. Blodow, and M. Beetz, "Fast Point Feature Histograms (FPFH) for 3D Registration," in ICRA, 2009.
[9] M. Ben-Chen and C. Gotsman, "Characterizing shape using conformal factors," in Eurographics Workshop on 3D Object Retrieval, 2008.
[10] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz, "Learning Informative Point Classes for the Acquisition of Object Model Maps," in Proceedings of the 10th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2008.
[11] Y. Sun and M. A. Abidi, "Surface matching by 3D point's fingerprint," in Proc. IEEE Int'l Conf. on Computer Vision, vol. II, 2001, pp. 263-269.
[12] D. Huber, A. Kapuria, R. R. Donamukkala, and M. Hebert, "Parts-based 3D object classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 04), June 2004.
[13] B. K. P. Horn, "Extended Gaussian Images," Proceedings of the IEEE, vol. 72, pp. 1671-1686, 1984.
[14] R. J. Campbell and P. J. Flynn, "Eigenshapes for 3D object recognition in range data," in Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, pp. 505-510.
[15] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM Transactions on Graphics, vol. 21, pp. 807-832, 2002.
[16] X. Li and I. Guskov, "3D object recognition from range images using pyramid matching," in ICCV07, 2007, pp. 1-6.
[17] A. S. Mian, M. Bennamoun, and R. Owens, "Three-dimensional model-based object recognition and segmentation in cluttered scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 1584-1601, 2006.
[18] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, "3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints," International Journal of Computer Vision, vol. 66, 2006.
[19] K. Konolige, "Small vision systems: hardware and implementation," in Eighth International Symposium on Robotics Research, 1997, pp. 111-116.
[20] "OpenCV, Open source Computer Vision library," /wiki/, 2009.
[21] G. Bradski and A. Kaehler, "Learning OpenCV: Computer Vision with the OpenCV Library," O'Reilly Media, Inc., 2008, pp. 415-453.
[22] R. B. Rusu, A. Holzbach, M. Beetz, and G. Bradski, "Detecting and segmenting objects for mobile manipulation," in ICCV S3DV workshop, 2009.
[23] N. Koenig and A. Howard, "Design and use paradigms for Gazebo, an open-source multi-robot simulator," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, Sep 2004, pp. 2149-2154.
Beyond Bags of Features: Spatial Pyramid Matching
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik¹ (slazebni@), ¹Beckman Institute, University of Illinois
Cordelia Schmid² (Cordelia.Schmid@inrialpes.fr), ²INRIA Rhône-Alpes, Montbonnot, France
Jean Ponce¹,³ (ponce@), ³Ecole Normale Supérieure, Paris, France

Abstract

This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba's "gist" and Lowe's SIFT descriptors.

1. Introduction

In this paper, we consider the problem of recognizing the semantic category of an image. For example, we may want to classify a photograph as depicting a scene (forest, street, office, etc.) or as containing a certain object of interest. For such whole-image categorization tasks, bag-of-features methods, which represent an image as an orderless collection of local features, have recently demonstrated impressive levels of performance [7, 22, 23, 25]. However, because these methods disregard all information about the spatial layout of the features, they have severely limited descriptive ability. In particular, they are incapable of capturing shape or of segmenting an object from its background. Unfortunately, overcoming these limitations to build effective structural object descriptions has proven to be quite challenging, especially when the recognition system must be made to work in the presence of heavy clutter, occlusion, or large viewpoint changes. Approaches based on generative part models [3, 5] and geometric correspondence search [1, 11] achieve robustness at significant computational expense. A more efficient approach is to augment a basic bag-of-features representation with pairwise relations between neighboring local features, but existing implementations of this idea [11, 17] have yielded inconclusive results. One other strategy for increasing robustness to geometric deformations is to increase the level of invariance of local features (e.g., by using affine-invariant detectors), but a recent large-scale evaluation [25] suggests that this strategy usually does not pay off.

Though we remain sympathetic to the goal of developing robust and geometrically invariant structural object representations, we propose in this paper to revisit "global" non-invariant representations based on aggregating statistics of local features over fixed subregions. We introduce a kernel-based recognition method that works by computing rough geometric correspondence on a global scale using an efficient approximation technique adapted from the pyramid matching scheme of Grauman and Darrell [7]. Our method involves repeatedly subdividing the image and computing histograms of local features at increasingly fine resolutions. As shown by experiments in Section 5, this simple operation suffices to significantly improve performance over a basic bag-of-features representation, and even over methods based on detailed geometric correspondence.

Previous research has shown that statistical properties of the scene considered in a holistic fashion, without any analysis of its constituent objects, yield a rich set of cues to its semantic category [13].
Our own experiments confirm that global representations can be surprisingly effective not only for identifying the overall scene, but also for categorizing images as containing specific objects, even when these objects are embedded in heavy clutter and vary significantly in pose and appearance. This said, we do not advocate the direct use of a global method for object recognition (except for very restricted sorts of imagery). Instead, we envision a subordinate role for this method. It may be used to capture the "gist" of an image [21] and to inform the subsequent search for specific objects (e.g., if the image, based on its global description, is likely to be a highway, we have a high probability of finding a car, but not a toaster). In addition, the simplicity and efficiency of our method, in combination with its tendency to yield unexpectedly high recognition rates on challenging data, could make it a good baseline for "calibrating" new datasets and for evaluating more sophisticated recognition approaches.

2. Previous Work

In computer vision, histograms have a long history as a method for image description (see, e.g., [16, 19]). Koenderink and Van Doorn [10] have generalized histograms to locally orderless images, or histogram-valued scale spaces (i.e., for each Gaussian aperture at a given location and scale, the locally orderless image returns the histogram of image features aggregated over that aperture). Our spatial pyramid approach can be thought of as an alternative formulation of a locally orderless image, where instead of a Gaussian scale space of apertures, we define a fixed hierarchy of rectangular windows. Koenderink and Van Doorn have argued persuasively that locally orderless images play an important role in visual perception. Our retrieval experiments (Fig. 4) confirm that spatial pyramids can capture perceptually salient features and suggest that "locally orderless matching" may be a powerful mechanism for estimating overall perceptual similarity between images.

It is important to contrast our proposed approach with multiresolution histograms [8], which involve repeatedly subsampling an image and computing a global histogram of pixel values at each new level. In other words, a multiresolution histogram varies the resolution at which the features (intensity values) are computed, but the histogram resolution (intensity scale) stays fixed. We take the opposite approach of fixing the resolution at which the features are computed, but varying the spatial resolution at which they are aggregated. This results in a higher-dimensional representation that preserves more information (e.g., an image consisting of thin black and white stripes would retain two modes at every level of a spatial pyramid, whereas it would become indistinguishable from a uniformly gray image at all but the finest levels of a multiresolution histogram). Finally, unlike a multiresolution histogram, a spatial pyramid, when equipped with an appropriate kernel, can be used for approximate geometric matching.

The operation of "subdivide and disorder" (i.e., partition the image into subblocks and compute histograms, or histogram statistics such as means, of local features in these subblocks) has been practiced numerous times in computer vision, both for global image description [6, 18, 20, 21] and for local description of interest regions [12].
Thus, though the operation itself seems fundamental, previous methods leave open the question of what is the right subdivision scheme (although a regular 4×4 grid seems to be the most popular implementation choice), and what is the right balance between "subdividing" and "disordering." The spatial pyramid framework suggests a possible way to address this issue: namely, the best results may be achieved when multiple resolutions are combined in a principled way. It also suggests that the reason for the empirical success of "subdivide and disorder" techniques is the fact that they actually perform approximate geometric matching.

3. Spatial Pyramid Matching

We first describe the original formulation of pyramid matching [7], and then introduce our application of this framework to create a spatial pyramid image representation.

3.1. Pyramid Match Kernels

Let X and Y be two sets of vectors in a d-dimensional feature space. Grauman and Darrell [7] propose pyramid matching to find an approximate correspondence between these two sets. Informally, pyramid matching works by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches that occur at each level of resolution. At any fixed resolution, two points are said to match if they fall into the same cell of the grid; matches found at finer resolutions are weighted more highly than matches found at coarser resolutions. More specifically, let us construct a sequence of grids at resolutions 0, ..., L, such that the grid at level ℓ has 2^ℓ cells along each dimension, for a total of D = 2^{dℓ} cells. Let H_X^ℓ and H_Y^ℓ denote the histograms of X and Y at this resolution, so that H_X^ℓ(i) and H_Y^ℓ(i) are the numbers of points from X and Y that fall into the i-th cell of the grid. Then the number of matches at level ℓ is given by the histogram intersection function [19]:

    I(H_X^ℓ, H_Y^ℓ) = Σ_{i=1}^{D} min( H_X^ℓ(i), H_Y^ℓ(i) ).        (1)

In the following, we will abbreviate I(H_X^ℓ, H_Y^ℓ) to I^ℓ. Note that the number of matches found at level ℓ also includes all the matches found at the finer level ℓ+1. Therefore, the number of new matches found at level ℓ is given by I^ℓ − I^{ℓ+1} for ℓ = 0, ..., L−1. The weight associated with level ℓ is set to 1/2^{L−ℓ}, which is inversely proportional to cell width at that level. Intuitively, we want to penalize matches found in larger cells because they involve increasingly dissimilar features. Putting all the pieces together, we get the following definition of a pyramid match kernel:

    κ^L(X, Y) = I^L + Σ_{ℓ=0}^{L−1} (1/2^{L−ℓ}) (I^ℓ − I^{ℓ+1})        (2)
              = (1/2^L) I^0 + Σ_{ℓ=1}^{L} (1/2^{L−ℓ+1}) I^ℓ.        (3)

Both the histogram intersection and the pyramid match kernel are Mercer kernels [7].

3.2. Spatial Matching Scheme

As introduced in [7], a pyramid match kernel works with an orderless image representation. It allows for precise matching of two collections of features in a high-dimensional appearance space, but discards all spatial information. This paper advocates an "orthogonal" approach: perform pyramid matching in the two-dimensional image space, and use traditional clustering techniques in feature space.¹ Specifically, we quantize all feature vectors into M discrete types, and make the simplifying assumption that only features of the same type can be matched to one another. Each channel m gives us two sets of two-dimensional vectors, X_m and Y_m, representing the coordinates of features of type m found in the respective images. The final kernel is then the sum of the separate channel kernels:

    K^L(X, Y) = Σ_{m=1}^{M} κ^L(X_m, Y_m).        (4)

¹ In principle, it is possible to integrate geometric information directly into the original pyramid matching framework by treating image coordinates as two extra dimensions in the feature space.

This approach has the advantage of maintaining continuity with the popular "visual vocabulary" paradigm; in fact, it reduces to a standard bag of features when L = 0.
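Eqs. (1)-(4) are compact enough to sketch directly. The illustrative Python below (not the authors' code) builds the weighted "long" concatenated histogram anticipated in the next paragraph, so that a single histogram intersection computes K^L; coordinates are assumed normalized to [0, 1) and features pre-quantized into M channels.

```python
import numpy as np

def spatial_pyramid_vector(coords, channels, M, L):
    """Concatenated, weighted spatial histograms: the "long vector" whose
    plain histogram intersection with another such vector equals K^L (eq. 4).
    coords: (N, 2) array in [0, 1); channels: (N,) int array in [0, M)."""
    parts = []
    for level in range(L + 1):
        cells = 2 ** level                       # grid is cells x cells
        # Weights from eq. (3): 1/2^L at level 0, 1/2^(L-l+1) for l >= 1.
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        bins = np.floor(coords * cells).clip(0, cells - 1).astype(int)
        cell_id = bins[:, 0] * cells + bins[:, 1]
        hist = np.zeros(M * cells * cells)
        np.add.at(hist, channels * cells * cells + cell_id, 1.0)
        parts.append(weight * hist)
    return np.concatenate(parts)                 # length M*(4^(L+1)-1)/3

def pyramid_match_kernel(x_vec, y_vec):
    # Histogram intersection (eq. 1); folding the level weights into the
    # vectors is valid because c*min(a, b) = min(ca, cb) for c > 0.
    return np.minimum(x_vec, y_vec).sum()
```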
Because the pyramid match kernel (3) is simply a weighted sum of histogram intersections, and because c·min(a, b) = min(ca, cb) for positive numbers, we can implement K^L as a single histogram intersection of "long" vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions (Fig. 1). For L levels and M channels, the resulting vector has dimensionality M Σ_{ℓ=0}^{L} 4^ℓ = M (4^{L+1} − 1)/3. Several experiments reported in Section 5 use the settings of M = 400 and L = 3, resulting in 34000-dimensional histogram intersections. However, these operations are efficient because the histogram vectors are extremely sparse (in fact, just as in [7], the computational complexity of the kernel is linear in the number of features). It must also be noted that we did not observe any significant increase in performance beyond M = 200 and L = 2, where the concatenated histograms are only 4200-dimensional.

Figure 1. Toy example of constructing a three-level pyramid. The image has three feature types, indicated by circles, diamonds, and crosses. At the top, we subdivide the image at three different levels of resolution. Next, for each level of resolution and each channel, we count the features that fall in each spatial bin. Finally, we weight each spatial histogram according to eq. (3).

The final implementation issue is that of normalization. For maximum computational efficiency, we normalize all histograms by the total weight of all features in the image, in effect forcing the total number of features in all images to be the same. Because we use a dense feature representation (see Section 4), and thus do not need to worry about spurious feature detections resulting from clutter, this practice is sufficient to deal with the effects of variable image size.

4. Feature Extraction

This section briefly describes the two kinds of features used in the experiments of Section 5. First, we have so-called "weak features," which are oriented edge points, i.e., points whose gradient magnitude in a given direction exceeds a minimum threshold. We extract edge points at two scales and eight orientations, for a total of M = 16 channels. We designed these features to obtain a representation similar to the "gist" [21] or to a global SIFT descriptor [12] of the image.

For better discriminative power, we also utilize higher-dimensional "strong features," which are SIFT descriptors of 16×16 pixel patches computed over a grid with spacing of 8 pixels. Our decision to use a dense regular grid instead of interest points was based on the comparative evaluation of Fei-Fei and Perona [4], who have shown that dense features work better for scene classification. Intuitively, a dense image description is necessary to capture uniform regions such as sky, calm water, or road surface (to deal with low-contrast regions, we skip the usual SIFT normalization procedure when the overall gradient magnitude of the patch is too weak). We perform k-means clustering of a random subset of patches from the training set to form a visual vocabulary. Typical vocabulary sizes for our experiments are M = 200 and M = 400.

Figure 2. Example images from the scene category database: office, kitchen, living room, bedroom, store, industrial, tall building*, inside city*, street*, highway*, coast*, open country*, mountain*, forest*, suburb. The starred categories originate from Oliva and Torralba [13].

Table 1. Classification results for the scene category database. Highest results for each kind of feature are shown in bold in the original.

            Weak features (M = 16)    Strong features (M = 200)    Strong features (M = 400)
L           Single-level  Pyramid     Single-level  Pyramid        Single-level  Pyramid
0 (1×1)     45.3±0.5                  72.2±0.6                     74.8±0.3
1 (2×2)     53.6±0.3      56.2±0.6    77.9±0.6      79.0±0.5       78.8±0.4      80.1±0.5
2 (4×4)     61.7±0.6      64.7±0.7    79.4±0.3      81.1±0.3       79.7±0.5      81.4±0.5
3 (8×8)     63.3±0.8      66.8±0.6    77.2±0.4      80.7±0.3       77.2±0.5      81.1±0.6

5. Experiments

In this section, we report results on three diverse datasets: fifteen scene categories [4], Caltech-101 [3], and Graz [14]. We perform all processing in grayscale, even when color images are available. All experiments are repeated ten times with different randomly selected training and test images, and the average of per-class recognition rates² is recorded for each run. The final result is reported as the mean and standard deviation of the results from the individual runs. Multi-class classification is done with a support vector machine (SVM) trained using the one-versus-all rule: a classifier is learned to separate each class from the rest, and a test image is assigned the label of the classifier with the highest response.

² The alternative performance measure, the percentage of all test images classified correctly, can be biased if test set sizes for different classes vary significantly. This is especially true of the Caltech-101 dataset, where some of the "easiest" classes are disproportionately large.
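The one-versus-all protocol just described is simple to reproduce. Here is a minimal sketch, assuming scikit-learn and the spatial pyramid vectors from the earlier sketch; the C value and helper names are illustrative, not from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(vecs_a, vecs_b):
    # Pairwise histogram-intersection kernel between two lists of long vectors.
    return np.array([[np.minimum(a, b).sum() for b in vecs_b] for a in vecs_a])

def train_one_vs_all(train_vecs, labels, C=1.0):
    K_train = gram_matrix(train_vecs, train_vecs)
    # One binary SVM per class, each fed the precomputed kernel matrix.
    return {c: SVC(kernel="precomputed", C=C).fit(K_train, labels == c)
            for c in np.unique(labels)}

def predict(classifiers, train_vecs, test_vecs):
    K_test = gram_matrix(test_vecs, train_vecs)
    # Assign the label of the classifier with the highest decision value.
    scores = {c: clf.decision_function(K_test) for c, clf in classifiers.items()}
    classes = list(scores)
    return np.array(classes)[np.argmax(np.stack([scores[c] for c in classes]), axis=0)]
```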
5.1. Scene Category Recognition

Our first dataset (Fig. 2) is composed of fifteen scene categories: thirteen were provided by Fei-Fei and Perona [4] (eight of these were originally collected by Oliva and Torralba [13]), and two (industrial and store) were collected by ourselves. Each category has 200 to 400 images, and average image size is 300×250 pixels. The major sources of the pictures in the dataset include the COREL collection, personal photographs, and Google image search. This is one of the most complete scene category datasets used in the literature thus far.

Table 1 shows detailed results of classification experiments using 100 images per class for training and the rest for testing (the same setup as [4]). First, let us examine the performance of strong features for L = 0 and M = 200, corresponding to a standard bag of features. Our classification rate is 72.2% (74.7% for the 13 classes inherited from Fei-Fei and Perona), which is much higher than their best result of 65.2%, achieved with an orderless method and a feature set comparable to ours. We conjecture that Fei-Fei and Perona's approach is disadvantaged by its reliance on latent Dirichlet allocation (LDA) [2], which is essentially an unsupervised dimensionality reduction technique and, as such, is not necessarily conducive to achieving the highest classification accuracy. To verify this, we have experimented with probabilistic latent semantic analysis (pLSA) [9], which attempts to explain the distribution of features in the image as a mixture of a few "scene topics" or "aspects" and performs very similarly to LDA in practice [17]. Following the scheme of Quelhas et al. [15], we run pLSA in an unsupervised setting to learn a 60-aspect model of half the training images. Next, we apply this model to the other half to obtain probabilities of topics given each image (thus reducing the dimensionality of the feature space from 200 to 60). Finally, we train the SVM on these reduced features and use them to classify the test set. In this setup, our average classification rate drops to 63.3% from the original 72.2%. For the 13 classes inherited from Fei-Fei and Perona, it drops to 65.9% from 74.7%, which is now very similar to their results. Thus, we can see that latent factor analysis techniques can adversely affect classification performance, which is also consistent with the results of Quelhas et al. [15].

Next, let us examine the behavior of spatial pyramid matching. For completeness, Table 1 lists the performance achieved using just the highest level of the pyramid (the "single-level" columns), as well as the performance of the complete matching scheme using multiple levels (the "pyramid" columns). For all three kinds of features, results improve dramatically as we go from L = 0 to a multi-level setup. Though matching at the highest pyramid level seems to account for most of the improvement, using all the levels together confers a statistically significant benefit. Referring back to Table 1, we can also see that the benefit of increasing the vocabulary size from M = 200 to M = 400 is small at L = 0 and all but eliminated at higher pyramid levels. Thus, we can conclude that the coarse-grained geometric cues provided by the pyramid have more discriminative power than an enlarged visual vocabulary. Of course, the optimal way to exploit structure both in the image and in the feature space may be to combine them in a unified multiresolution framework; this is subject for future research.

Fig. 3 shows a confusion table between the fifteen scene categories. Not surprisingly, confusion occurs between the indoor classes (kitchen, bedroom, living room), and also between some natural classes, such as coast and open country.

Figure 3. Confusion table of the scene category database. Average classification rates for individual classes are listed along the diagonal: office 92.7, kitchen 68.5, living room 60.4, bedroom 68.3, store 76.2, industrial 65.4, tall building 91.1, inside city 80.5, street 90.2, highway 86.6, coast 82.4, open country 70.5, mountain 88.8, forest 94.7, suburb 99.4. The entry in the i-th row and j-th column is the percentage of images from class i that were misidentified as class j.

Fig. 4 shows examples of image retrieval using the spatial pyramid kernel and strong features with M = 200. These examples give a sense of the kind of visual information captured by our approach. In particular, spatial pyramids seem successful at capturing the organization of major pictorial elements or "blobs," and the directionality of dominant lines and edges. Because the pyramid is based on features computed at the original image resolution, even high-frequency details can be preserved. For example, query image (b) shows white kitchen cabinet doors with dark borders. Three of the retrieved "kitchen" images contain similar cabinets, the "office" image shows a wall plastered with white documents in dark frames, and the "inside city" image shows a white building with darker window frames.

Figure 4. Retrieval from the scene category database. The query images are on the left, and the eight images giving the highest values of the spatial pyramid kernel (for L = 2, M = 200) are on the right. The actual class of incorrectly retrieved images is listed below them (for instance, for the kitchen query of row (a), most incorrect retrievals are living room images).

5.2. Caltech-101

Our second set of experiments is on the Caltech-101 database [3] (Fig. 5). This database contains from 31 to 800 images per category. Most images are medium resolution, i.e., about 300×300 pixels. Caltech-101 is probably the most diverse object database available today, though it is not without shortcomings. Namely, most images feature relatively little clutter, and the objects are centered and occupy most of the image. In addition, a number of categories, such as minaret (see Fig. 5), are affected by "corner" artifacts resulting from artificial image rotation. Though these artifacts are semantically irrelevant, they can provide stable cues resulting in misleadingly high recognition rates.

We follow the experimental setup of Grauman and Darrell [7] and J. Zhang et al. [25]: namely, we train on 30 images per class and test on the rest.
For efficiency, we limit the number of test images to 50 per class. Note that, because some categories are very small, we may end up with just a single test image per class. Table 2 gives a breakdown of classification rates for different pyramid levels for weak features and strong features with M = 200. The results for M = 400 are not shown because, just as for the scene category database, they do not bring any significant improvement. For L = 0, strong features give 41.2%, which is slightly below the 43% reported by Grauman and Darrell. Our best result is 64.6%, achieved with strong features at L = 2. This exceeds the highest classification rate previously published,³ that of 53.9% reported by J. Zhang et al. [25]. Berg et al. [1] report 48% accuracy using 15 training images per class. Our average recognition rate with this setup is 56.4%. The behavior of weak features on this database is also noteworthy: for L = 0, they give a classification rate of 15.5%, which is consistent with a naive graylevel correlation baseline [1], but in conjunction with a four-level spatial pyramid, their performance rises to 54%, on par with the best results in the literature.

³ See, however, H. Zhang et al. [24] in these proceedings, for an algorithm that yields a classification rate of 66.2±0.5% for 30 training examples, and 59.1±0.6% for 15 examples.

Table 2. Classification results for the Caltech-101 database.

       Weak features             Strong features (200)
L      Single-level  Pyramid     Single-level  Pyramid
0      15.5±0.9                  41.2±1.2
1      31.4±1.2      32.8±1.3    55.9±0.9      57.0±0.8
2      47.2±1.1      49.3±1.4    63.6±0.9      64.6±0.8
3      52.2±0.8      54.0±1.1    60.3±0.9      64.6±0.7

Fig. 5 shows a few of the "easiest" and "hardest" object classes for our method. The successful classes are either dominated by rotation artifacts (like minaret), have very little clutter (like windsor chair), or represent coherent natural "scenes" (like joshua tree and okapi). The least successful classes are either textureless animals (like beaver and cougar), animals that camouflage well in their environment (like crocodile), or "thin" objects (like ant). Table 3 shows the top five of our method's confusions, all of which are between closely related classes.

Figure 5. Caltech-101 results. Top: some classes on which our method (L = 2, M = 200) achieved high performance: minaret (97.6%), windsor chair (94.6%), joshua tree (87.9%), okapi (87.8%). Bottom: some classes on which our method performed poorly: cougar body (27.6%), beaver (27.5%), crocodile (25.0%), ant (25.0%).

Table 3. Top five confusions of our method (L = 2, M = 200) on the Caltech-101 database.

class 1 / class 2            class 1 misclassified    class 2 misclassified
                             as class 2               as class 1
ketch / schooner             21.6                     14.8
lotus / water lily           15.3                     20.0
crocodile / crocodile head   10.5                     10.0
crayfish / lobster           11.3                     9.1
flamingo / ibis              9.5                      10.4

To summarize, our method has outperformed both state-of-the-art orderless methods [7, 25] and methods based on precise geometric correspondence [1]. Significantly, all these methods rely on sparse features (interest points or sparsely sampled edge points). However, because of the geometric stability and lack of clutter of Caltech-101, dense features combined with global spatial relations seem to capture more discriminative information about the objects.

5.3. The Graz Dataset

As seen from Sections 5.1 and 5.2, our proposed approach does very well on global scene classification tasks, or on object recognition tasks in the absence of clutter with most of the objects assuming "canonical" poses. However, it was not designed to cope with heavy clutter and pose changes. It is interesting to see how well our algorithm can do by exploiting the global scene cues that still remain under these conditions.
how well our algorithm can do by exploiting the global scene cues that still remain under these conditions. Accordingly, our final set of experiments is on the Graz dataset [14] (Fig. 6), which is characterized by high intra-class variation. This dataset has two object classes, bikes (373 images) and persons (460 images), and a background class (270 images). The image resolution is 640 × 480, and the range of scales and poses at which exemplars are presented is very diverse, e.g., a "person" image may show a pedestrian in the distance, a side view of a complete body, or just a closeup of a head. For this database, we perform two-class detection (object vs. background) using an experimental setup consistent with that of Opelt et al. [14]. Namely, we train detectors for persons and bikes on 100 positive and 100 negative images (of which 50 are drawn from the other object class and 50 from the background), and test on a similarly distributed set. We generate ROC curves by thresholding raw SVM output, and report the ROC equal error rate averaged over ten runs.

Table 4 summarizes our results for strong features with M = 200. Note that the standard deviation is quite high because the images in the database vary greatly in their level of difficulty, so the performance for any single run is dependent on the composition of the training set (in particular, for L = 2, the performance for bikes ranges from 81% to 91%). For this database, the improvement from L = 0 to L = 2 is relatively small. This makes intuitive sense: when a class is characterized by high geometric variability, it is difficult to find useful global features. Despite this disadvantage of our method, we still achieve results very close to those of Opelt et al. [14], who use a sparse, locally invariant feature representation. In the future, we plan to combine spatial pyramids with invariant features for improved robustness against geometric changes.

6. Discussion

This paper has presented a "holistic" approach for image categorization based on a modification of pyramid match kernels [7]. Our method, which works by repeatedly subdividing an image and computing histograms of image features over the resulting subregions, has shown promising results.
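To make the construction concrete, here is a minimal sketch of the two ingredients the discussion relies on: the concatenated, level-weighted pyramid histogram and the histogram intersection used to compare two images. It is an illustration rather than the authors' implementation; `codes` (visual-word indices per feature) and `xs`, `ys` (feature coordinates normalized to [0, 1)) are assumed inputs, and the level weights follow the pyramid match weighting described in the paper.

```python
import numpy as np

def spatial_pyramid_histogram(codes, xs, ys, vocab_size, levels=2):
    """Concatenate weighted histograms of quantized features over a spatial
    pyramid: level 0 gets weight 1/2**levels, level l >= 1 gets
    1/2**(levels - l + 1)."""
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        w = 1.0 / 2 ** levels if level == 0 else 1.0 / 2 ** (levels - level + 1)
        cx = np.minimum((xs * cells).astype(int), cells - 1)
        cy = np.minimum((ys * cells).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                sel = codes[(cx == i) & (cy == j)]
                h, _ = np.histogram(sel, bins=np.arange(vocab_size + 1))
                parts.append(w * h)
    v = np.concatenate(parts).astype(float)
    return v / max(v.sum(), 1e-12)  # normalize so image pairs are comparable

def intersection_kernel(u, v):
    """Histogram intersection; on these weighted concatenations it plays the
    role of the pyramid match score."""
    return np.minimum(u, v).sum()
```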
Analysis of the Distribution of Neighboring Pixels in Images
Tan Xiao, Feng Jiuchao
[Abstract] Starting from the formation principle of digital images, the distribution of neighboring pixels is analyzed, and two new function models are proposed for the distribution of transitions between neighboring pixel values.
In the experiments, nonlinear least squares is used to fit the distributions measured on test images, and a difference energy function (DPF) is used to compare the proposed function models against the distribution functions of [8, 12].
The results show that the new function models achieve a better fit and agree more closely with the actual distribution of neighboring pixel values.
[Journal] Journal of Jilin University (Engineering and Technology Edition) [Year (volume), issue] 2011, 41(02) [Pages] 6 [Keywords] information processing; distribution function; nonlinear least squares; neighborhood; pixel; transition
In recent years, researchers have carried out a large amount of work based on neighborhood properties, which have been widely applied to image segmentation, image reconstruction and related fields [1-5]. At the same time, watermarking and steganalysis researchers exploit the fact that steganographic embedding weakens the correlation between neighboring pixels, either by directly using the transition relation between neighboring pixel values or by tracking changes in the gray-level co-occurrence matrix (GLCM) and the histogram characteristic function (HCF), in order to detect hidden data [6-11]. Clearly, the relation between neighboring pixels plays an important role in image analysis.
Some researchers [8, 12] hold that the distribution of transitions between neighboring pixel values follows a Gaussian or a Laplacian distribution, and have obtained good fits in different experiments. However, no published work has analyzed whether these distributions are actually correct. This paper analyzes the distribution function of transitions between neighboring pixel values in digital images. First, starting from the imaging principle of digital devices, the pixel-value distribution within a local region is analyzed, and by the law of large numbers these pixel values are shown to follow a Gaussian distribution. Then, the difference between the pixel distributions of different regions is analyzed; mathematical derivation yields the final expression of the distribution function, shows that transitions between neighboring pixel values are better described by a Laplacian distribution, and gives a more general distribution function.
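The fitting step described above is a standard nonlinear least squares problem. The sketch below shows one way to fit a zero-mean Laplacian to the empirical histogram of horizontal neighbor differences; it is an illustration, not the paper's code, and the function names and initial guess are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def laplace_pdf(d, b):
    # Zero-mean Laplacian model for the neighbor-pixel difference d
    return np.exp(-np.abs(d) / b) / (2.0 * b)

def fit_neighbor_differences(img):
    """Fit a Laplacian to the empirical distribution of horizontal
    neighbor-pixel differences by nonlinear least squares."""
    d = (img[:, 1:].astype(float) - img[:, :-1]).ravel()
    hist, edges = np.histogram(d, bins=np.arange(-255.5, 256.5), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    (b_hat,), _ = curve_fit(laplace_pdf, centers, hist, p0=[10.0])
    # Squared-error residual, standing in for a difference-energy score
    residual = np.sum((hist - laplace_pdf(centers, b_hat)) ** 2)
    return b_hat, residual
```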
1. The formation principle of digital images. A digital imaging device forms an image as follows: a photosensitive component (CCD or CMOS) samples the natural scene to obtain a sampled signal, which is then processed according to some algorithm (different devices use different algorithms) to produce the corresponding digital image [13].
Research on Cross-Domain Image Retrieval Algorithms Based on Visual Features
Abstract: With the continuous improvement and diversification of imaging sensors, cross-domain images of the same object, captured by different imaging platforms, in different spectral bands and under different imaging conditions, are becoming increasingly common. To use these digital resources efficiently, people often need to combine several different imaging sensors to obtain more comprehensive information. Cross-domain image retrieval studies how to retrieve images across different visual domains. It has become one of the hot research topics in computer vision and has wide applications, for example heterogeneous image registration and fusion, visual localization and navigation, and scene classification. An in-depth study of retrieval across visual domains therefore has both theoretical significance and practical value.
This thesis reviews the state of the art in cross-domain image retrieval, analyzes the intrinsic relations between images from different visual domains, and focuses on three key problems: cross-domain visual saliency detection, cross-domain feature extraction and description, and cross-domain image similarity measurement. Three methods are implemented: cross-domain visual retrieval based on saliency detection, cross-domain image retrieval based on a visual vocabulary translator, and cross-domain image retrieval based on co-occurrence features. The main contributions are as follows.
(1) The visual saliency of cross-domain images is analyzed, and a cross-domain visual retrieval method based on saliency detection is proposed. The method first uses the boundary connectivity of superpixel regions to assign a saliency value to each region and extract the main object region; it then refines the cross-domain features with a linear classifier and processes the database images at multiple scales; finally, it computes the similarity between images and returns the highest-scoring image as the retrieval result. The method effectively suppresses interference from the background and other irrelevant regions and improves retrieval accuracy.
(2) To address the problem that cross-domain image features differ too much to be matched directly, a cross-domain image retrieval method based on a visual vocabulary translator is proposed. Inspired by the mechanism of language translation, the method uses a visual vocabulary translator to establish connections between different visual domains. The translator has two main parts: a visual vocabulary tree, which can be seen as a dictionary for each visual domain, and index files attached to the leaf nodes of the tree, which store the translation relations between the visual domains (a sketch of the vocabulary-tree part follows this list). Through the visual vocabulary translator, the cross-domain retrieval problem is converted into a same-domain retrieval problem, realizing image retrieval across visual domains from a new angle. Experiments verify the performance of the algorithm.
(3) Exploiting the cross-domain co-occurrence correlation between different visual domains, a cross-domain image retrieval method based on visual co-occurrence features is proposed.
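A hierarchical k-means vocabulary tree, the "dictionary" half of the translator described in item (2), can be sketched as follows. This is an illustrative reconstruction, not the thesis code; the branching factor, depth and class names are assumptions, and the cross-domain index files that map leaf words between domains would be built on top of the word ids this tree assigns.

```python
import numpy as np
from sklearn.cluster import KMeans

class VocabTree:
    """Hierarchical k-means: one tree per visual domain; a leaf's path
    through the tree serves as the visual-word id."""
    def __init__(self, branch=8, max_depth=3, depth=0):
        self.branch, self.max_depth, self.depth = branch, max_depth, depth
        self.kmeans, self.children = None, []

    def fit(self, descriptors):
        if self.depth >= self.max_depth or len(descriptors) < self.branch:
            return self
        self.kmeans = KMeans(n_clusters=self.branch, n_init=3).fit(descriptors)
        for c in range(self.branch):
            child = VocabTree(self.branch, self.max_depth, self.depth + 1)
            child.fit(descriptors[self.kmeans.labels_ == c])
            self.children.append(child)
        return self

    def word(self, descriptor):
        # Greedy descent to a leaf; the tuple of branch choices is the word.
        if self.kmeans is None:
            return ()
        c = int(self.kmeans.predict(descriptor.reshape(1, -1))[0])
        return (c,) + self.children[c].word(descriptor)
```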
A Surveillance Camera Tampering Detection Model
Introduction. Surveillance video is an important source of information and evidence in criminal investigations, but offenders may well interfere with or even destroy cameras to conceal suspicious activity. Detecting camera tampering events effectively is therefore of great practical value. For current camera tampering detection methods, false detections in special scenes remain a major challenge, e.g., illumination changes, weather changes, crowd movement, and large objects passing through the view. For surveillance videos with little or no background texture, or that are dark or of low quality, most detection methods misclassify the footage as defocus tampering; and when the lens is covered with a textured object whose gray level and brightness are very similar to the image background, the occluder cannot be distinguished from the background. In addition, slowly covering the lens is also a challenging detection problem.
This paper builds a detection model with deep neural networks: an improved ConvGRU (Convolutional Gated Recurrent Unit) extracts the temporal features of the video and the global spatial dependencies of the images, and, combined with a Siamese architecture, yields the proposed SCG model.
Liu Xiaonan, Shao Peinan (The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China)
Abstract: To reduce the false detections caused by special scenes in surveillance tampering detection, this paper proposes an SCG (Siamese with Convolutional Gated Recurrent Unit) model based on the Siamese architecture, which uses the latent similarity between video clips to distinguish special scenes from tampering events. By integrating an improved ConvGRU network into the Siamese architecture, the model makes full use of the temporal correlation between frames of the surveillance video, and non-local blocks embedded between the GRU units let the network build spatial dependency responses over the image. Compared with a tampering detection model using conventional GRU modules, the model with the improved ConvGRU module improves accuracy by 4.22%. In addition, a residual attention module is introduced to improve the feature extraction network's perception of foreground changes; compared with the model without the attention module, this improves accuracy by a further 2.49%.
Keywords: Siamese; ConvGRU; non-local block; camera tampering; tampering detection
CLC number: TP391. Document code: A. Article ID: 1009-2552(2021)01-0090-07. DOI: 10.13274/ki.hdzj.2021.01.016.
About the author: Liu Xiaonan (b. 1994), female, master's student; research interests: computer vision and deep learning.
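The ConvGRU at the core of the model replaces the fully connected gates of a standard GRU with convolutions, so that the hidden state keeps its spatial layout. Below is a minimal PyTorch sketch of a generic ConvGRU cell under that assumption; it is not the authors' network, and the paper's non-local blocks between cells, residual attention module and Siamese pairing are omitted.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates computed by 2-D convolutions so the
    hidden state is a feature map rather than a flat vector."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, h=None):
        if h is None:
            h = torch.zeros(x.size(0), self.hid_ch, x.size(2), x.size(3),
                            device=x.device)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde  # update gate blends old and new state
```

A clip is processed by feeding frames one at a time and carrying `h` forward; in a Siamese setup, two clips share the same cell weights and their final states are compared.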
On the Interactions of Light Gravitinos
T. E. Clark(1), Taekoon Lee(2), S. T. Love(3), Guo-Hong Wu(4)
Department of Physics, Purdue University, West Lafayette, IN 47907-1396

Abstract
In models of spontaneously broken supersymmetry, certain light gravitino processes are governed by the coupling of its Goldstino components. The rules for constructing SUSY and gauge invariant actions involving the Goldstino couplings to matter and gauge fields are presented. The explicit operator construction is found to be at variance with some previously reported claims. A phenomenological consequence arising from light gravitino interactions in supernova is reexamined and scrutinized.

(1) e-mail address: clark@  (2) e-mail address: tlee@  (3) e-mail address: love@  (4) e-mail address: wu@

In the supergravity theories obtained from gauging a spontaneously broken global $N=1$ supersymmetry (SUSY), the Nambu-Goldstone fermion, the Goldstino [1, 2], provides the helicity $\pm 1/2$ degrees of freedom needed to render the spin-$3/2$ gravitino massive through the super-Higgs mechanism. For a light gravitino, the high energy (well above the gravitino mass) interactions of these helicity $\pm 1/2$ modes with matter will be enhanced according to the supersymmetric version of the equivalence theorem [3]. The effective action describing such interactions can then be constructed using the properties of the Goldstino fields. Currently studied gauge mediated supersymmetry breaking models [4] provide a realization of this scenario, as do certain no-scale supergravity models [5]. In the gauge mediated case, the SUSY is dynamically broken in a hidden sector of the theory by means of gauge interactions, resulting in a hidden sector Goldstino field. The spontaneous breaking is then mediated to the minimal supersymmetric standard model (MSSM) via radiative corrections in the standard model gauge interactions involving messenger fields which carry standard model vector representations. In such models, the supergravity contributions to the SUSY breaking mass splittings are small compared to these gauge mediated contributions. Being a gauge singlet, the gravitino mass arises only from the gravitational interaction and is thus far smaller than the scale $\sqrt{F}$, where $F$ is the Goldstino decay constant. Moreover, since the gravitino is the lightest of all hidden and messenger sector degrees of freedom, the spontaneously broken SUSY can be accurately described via a non-linear realization. Such a non-linear realization of SUSY on the Goldstino fields was originally constructed by Volkov and Akulov [1].

The leading term in a momentum expansion of the effective action describing the Goldstino self-dynamics at energy scales below $\sqrt{4\pi F}$ is uniquely fixed by the Volkov-Akulov effective Lagrangian [1], which takes the form

$$\mathcal{L}_{AV} = -\frac{F^2}{2}\,\det A. \qquad (1)$$

Here the Volkov-Akulov vierbein is defined as $A_\mu{}^\nu = \delta_\mu^\nu + \frac{i}{F^2}\,\lambda\overleftrightarrow{\partial_\mu}\sigma^\nu\bar\lambda$, with $\lambda$ ($\bar\lambda$) the Goldstino Weyl spinor field. This effective Lagrangian provides a valid description of the Goldstino self interactions independent of the particular (non-perturbative) mechanism by which the SUSY is dynamically broken. The supersymmetry transformations are nonlinearly realized on the Goldstino fields as $\delta^Q(\xi,\bar\xi)\lambda^\alpha = F\xi^\alpha + \Lambda^\rho\partial_\rho\lambda^\alpha$ and $\delta^Q(\xi,\bar\xi)\bar\lambda^{\dot\alpha} = F\bar\xi^{\dot\alpha} + \Lambda^\rho\partial_\rho\bar\lambda^{\dot\alpha}$, where $\xi^\alpha, \bar\xi^{\dot\alpha}$ are Weyl spinor SUSY transformation parameters and $\Lambda^\rho \equiv -\frac{i}{F}\left(\lambda\sigma^\rho\bar\xi - \xi\sigma^\rho\bar\lambda\right)$ is a Goldstino field dependent translation vector. Since the Volkov-Akulov Lagrangian transforms as the total divergence $\delta^Q(\xi,\bar\xi)\mathcal{L}_{AV} = \partial_\rho(\Lambda^\rho\mathcal{L}_{AV})$, the associated action $I_{AV} = \int d^4x\,\mathcal{L}_{AV}$ is SUSY invariant.

The supersymmetry algebra can also be nonlinearly realized on the matter (non-Goldstino) fields, generically denoted by $\phi_i$, where $i$ can represent any Lorentz or internal symmetry labels, as

$$\delta^Q(\xi,\bar\xi)\phi_i = \Lambda^\rho\partial_\rho\phi_i. \qquad (2)$$

This is referred to as the standard realization [6]-[9]. It can be used, along with space-time translations, to readily establish the SUSY algebra. Under the non-linear SUSY standard realization, the derivative of a matter field transforms as $\delta^Q(\xi,\bar\xi)(\partial_\nu\phi_i) = \Lambda^\rho\partial_\rho(\partial_\nu\phi_i) + (\partial_\nu\Lambda^\rho)(\partial_\rho\phi_i)$. In order to eliminate the second term on the right hand side and thus restore the standard SUSY realization, a SUSY covariant derivative is introduced and defined so as to transform analogously to $\phi_i$. To achieve this, we use the transformation property of the Volkov-Akulov vierbein and define the non-linearly realized SUSY covariant derivative [9]

$$D_\mu\phi_i = (A^{-1})_\mu{}^\nu\,\partial_\nu\phi_i, \qquad (3)$$

which varies according to the standard realization of SUSY: $\delta^Q(\xi,\bar\xi)(D_\mu\phi_i) = \Lambda^\rho\partial_\rho(D_\mu\phi_i)$. Any realization of the SUSY transformations can be converted to the standard realization. In particular, consider the gauge covariant derivative

$$(D_\mu\phi)_i \equiv \partial_\mu\phi_i + T^a_{ij}A^a_\mu\phi_j, \qquad (4)$$

with $a = 1, 2, \ldots, \dim G$. We seek a SUSY and gauge covariant derivative $(\mathcal{D}_\mu\phi)_i$, which transforms as the SUSY standard realization. Using the Volkov-Akulov vierbein, we define

$$(\mathcal{D}_\mu\phi)_i \equiv (A^{-1})_\mu{}^\nu (D_\nu\phi)_i, \qquad (5)$$

which has the desired transformation property, $\delta^Q(\xi,\bar\xi)(\mathcal{D}_\mu\phi)_i = \Lambda^\rho\partial_\rho(\mathcal{D}_\mu\phi)_i$, provided the vector potential has the SUSY transformation $\delta^Q(\xi,\bar\xi)A_\mu \equiv \Lambda^\rho\partial_\rho A_\mu + \partial_\mu\Lambda^\rho A_\rho$. Alternatively, we can introduce a redefined gauge field

$$V^a_\mu \equiv (A^{-1})_\mu{}^\nu A^a_\nu, \qquad (6)$$

which itself transforms as the standard realization, $\delta^Q(\xi,\bar\xi)V^a_\mu = \Lambda^\rho\partial_\rho V^a_\mu$, and in terms of which the standard realization SUSY and gauge covariant derivative then takes the form

$$(\mathcal{D}_\mu\phi)_i \equiv (A^{-1})_\mu{}^\nu\partial_\nu\phi_i + T^a_{ij}V^a_\mu\phi_j. \qquad (7)$$

Under gauge transformations parameterized by $\omega^a$, the original gauge field varies as $\delta^G(\omega)A^a_\mu = (D_\mu\omega)^a = \partial_\mu\omega^a + g f^{abc}A^b_\mu\omega^c$, while the redefined gauge field $V^a_\mu$ has the Goldstino dependent transformation $\delta^G(\omega)V^a_\mu = (A^{-1})_\mu{}^\nu(D_\nu\omega)^a$. For all realizations, the gauge transformation and SUSY transformation commutator yields a gauge variation with a SUSY transformed value of the gauge transformation parameter,

$$\left[\delta^G(\omega), \delta^Q(\xi,\bar\xi)\right] = \delta^G\left(\Lambda^\rho\partial_\rho\omega - \delta^Q(\xi,\bar\xi)\omega\right). \qquad (8)$$

If we further require the local gauge transformation parameter to also transform under the standard realization, so that $\delta^Q(\xi,\bar\xi)\omega^a = \Lambda^\rho\partial_\rho\omega^a$, then the gauge and SUSY transformations commute.

In order to construct an invariant kinetic energy term for the gauge fields, it is convenient for the gauge covariant antisymmetric tensor field strength to also be brought into the standard realization. The usual field strength $F^a_{\alpha\beta} = \partial_\alpha A^a_\beta - \partial_\beta A^a_\alpha + i f^{abc}A^b_\alpha A^c_\beta$ varies under SUSY transformations as $\delta^Q(\xi,\bar\xi)F^a_{\mu\nu} = \Lambda^\rho\partial_\rho F^a_{\mu\nu} + \partial_\mu\Lambda^\rho F^a_{\rho\nu} + \partial_\nu\Lambda^\rho F^a_{\mu\rho}$. A standard realization of the gauge covariant field strength tensor, $\mathcal{F}^a_{\mu\nu}$, can then be defined as

$$\mathcal{F}^a_{\mu\nu} = (A^{-1})_\mu{}^\alpha (A^{-1})_\nu{}^\beta F^a_{\alpha\beta}, \qquad (9)$$

so that $\delta^Q(\xi,\bar\xi)\mathcal{F}^a_{\mu\nu} = \Lambda^\rho\partial_\rho\mathcal{F}^a_{\mu\nu}$.

These standard realization building blocks, consisting of the gauge singlet Goldstino SUSY covariant derivatives $\mathcal{D}_\mu\lambda$, $\mathcal{D}_\mu\bar\lambda$, the matter fields $\phi_i$, their SUSY-gauge covariant derivatives $\mathcal{D}_\mu\phi_i$, and the field strength tensor $\mathcal{F}^a_{\mu\nu}$, along with their higher covariant derivatives, can be combined to make SUSY and gauge invariant actions. These invariant action terms then dictate the couplings of the Goldstino which, in general, carries the residual consequences of the spontaneously broken supersymmetry. A generic SUSY and gauge invariant action can be constructed [9] as

$$I_{\rm eff} = \int d^4x\,\det A\;\mathcal{L}_{\rm eff}(\mathcal{D}_\mu\lambda, \mathcal{D}_\mu\bar\lambda, \phi_i, \mathcal{D}_\mu\phi_i, \mathcal{F}_{\mu\nu}), \qquad (10)$$

where $\mathcal{L}_{\rm eff}$ is any gauge invariant function of the standard realization basic building blocks. Using the nonlinear SUSY transformations $\delta^Q(\xi,\bar\xi)\det A = \partial_\rho(\Lambda^\rho\det A)$ and $\delta^Q(\xi,\bar\xi)\mathcal{L}_{\rm eff} = \Lambda^\rho\partial_\rho\mathcal{L}_{\rm eff}$, it follows that $\delta^Q(\xi,\bar\xi)I_{\rm eff} = 0$.

It proves convenient to catalog the terms in the effective Lagrangian, $\mathcal{L}_{\rm eff}$, by an expansion in the number of Goldstino fields which appear when covariant derivatives are replaced by ordinary derivatives and the Volkov-Akulov vierbein appearing in the standard realization field strengths is set to unity. So doing, we expand

$$\mathcal{L}_{\rm eff} = \mathcal{L}^{(0)} + \mathcal{L}^{(1)} + \mathcal{L}^{(2)} + \cdots, \qquad (11)$$

where the subscript $n$ on $\mathcal{L}^{(n)}$ denotes that each independent SUSY invariant operator in that set begins with $n$ Goldstino fields.

$\mathcal{L}^{(0)}$ consists of all gauge and SUSY invariant operators made only from light matter fields and their SUSY covariant derivatives. Any Goldstino field appearing in $\mathcal{L}^{(0)}$ arises only from higher dimension terms in the matter covariant derivatives and/or the field strength tensor. Taking the light non-Goldstino fields to be those of the MSSM and retaining terms through mass dimension 4, $\mathcal{L}^{(0)}$ is well approximated by the Lagrangian of the minimal supersymmetric standard model, which includes the soft SUSY breaking terms, but in which all derivatives have been replaced by SUSY covariant ones and the field strength tensor replaced by the standard realization field strength:

$$\mathcal{L}^{(0)} = \mathcal{L}_{\rm MSSM}(\phi, \mathcal{D}_\mu\phi, \mathcal{F}_{\mu\nu}). \qquad (12)$$

Note that the coefficients of these terms are fixed by the normalization of the gauge and matter fields, their masses and self-couplings; that is, the normalization of the Goldstino independent Lagrangian.

The $\mathcal{L}^{(1)}$ terms in the effective Lagrangian begin with direct coupling of one Goldstino covariant derivative to the non-Goldstino fields. The general form of these terms, retaining operators through mass dimension 6, is given by

$$\mathcal{L}^{(1)} = \frac{1}{F}\left[\mathcal{D}_\mu\lambda^\alpha\, Q^\mu_{{\rm MSSM}\,\alpha} + \bar Q^\mu_{{\rm MSSM}\,\dot\alpha}\,\mathcal{D}_\mu\bar\lambda^{\dot\alpha}\right], \qquad (13)$$

where $Q^\mu_{{\rm MSSM}\,\alpha}$ and $\bar Q^\mu_{{\rm MSSM}\,\dot\alpha}$ contain the pure MSSM field contributions to the conserved gauge invariant supersymmetry currents, with once again all field derivatives being replaced by SUSY covariant derivatives and the vector field strengths in the standard realization. That is, it is this term in the effective Lagrangian which, using the Noether construction, produces the Goldstino independent piece of the conserved supersymmetry current. The Lagrangian $\mathcal{L}^{(1)}$ describes processes involving the emission or absorption of a single helicity $\pm 1/2$ gravitino.

Finally, the remaining terms in the effective Lagrangian all contain two or more Goldstino fields. In particular, $\mathcal{L}^{(2)}$ begins with the coupling of two Goldstino fields to matter or gauge fields. Retaining terms through mass dimension 8 and focusing only on the $\lambda$-$\bar\lambda$ terms, we can write

$$\mathcal{L}^{(2)} = \frac{1}{F^2}\,\mathcal{D}_\mu\lambda^\alpha\,\mathcal{D}_\nu\bar\lambda^{\dot\alpha}\,M^{\mu\nu}_{1\,\alpha\dot\alpha} + \frac{1}{F^2}\,\mathcal{D}_\mu\lambda^\alpha\overleftrightarrow{\mathcal{D}_\rho}\,\mathcal{D}_\nu\bar\lambda^{\dot\alpha}\,M^{\mu\nu\rho}_{2\,\alpha\dot\alpha} + \frac{1}{F^2}\,\mathcal{D}_\rho\mathcal{D}_\mu\lambda^\alpha\,\mathcal{D}_\nu\bar\lambda^{\dot\alpha}\,M^{\mu\nu\rho}_{3\,\alpha\dot\alpha}, \qquad (14)$$

where the standard realization composite operators that contain matter and gauge fields are denoted by the $M_i$. They can be enumerated by their operator dimension, Lorentz structure and field content. In the gauge mediated models, these terms are all generated by radiative corrections involving the standard model gauge coupling constants.

Let us now focus on the pieces of $\mathcal{L}^{(2)}$ which contribute to a local operator containing two gravitino fields and which is bilinear in a Standard Model fermion ($f, \bar f$). Those lowest dimension operators (which involve no derivatives on $f$ or $\bar f$) are all contained in the $M_1$ piece. After application of the Goldstino field equation (neglecting the gravitino mass) and making prodigious use of Fierz rearrangement identities, this set reduces to just 1 independent on-shell interaction term. In addition to this operator, there is also an operator bilinear in $f$ and $\bar f$ and containing 2 gravitinos which arises from the product of $\det A$ with $\mathcal{L}^{(0)}$. Combining the two independent on-shell interaction terms involving 2 gravitinos and 2 fermions, results in the
effective action

$$I_{f\bar f\tilde G\tilde G} = \int d^4x\left[-\frac{1}{2F^2}\,\lambda\overleftrightarrow{\partial^\mu}\sigma^\nu\bar\lambda\; f\overleftrightarrow{\partial_\nu}\sigma_\mu\bar f + \frac{C_{ff}}{F^2}\,(f\partial^\mu\lambda)\,(\bar f\partial_\mu\bar\lambda)\right], \qquad (15)$$

where $C_{ff}$ is a model dependent real coefficient. Note that the coefficient of the first operator is fixed by the normalization of the MSSM Lagrangian. This result is in accord with a recent analysis [10], where it was found that the fermion-Goldstino scattering amplitudes depend on only one parameter, which corresponds to the coefficient $C_{ff}$ in our notation.

In a similar manner, the lowest mass dimension operator contributing to the effective action describing the coupling of two on-shell gravitinos to a single photon arises from the $M_1$ and $M_3$ pieces of $\mathcal{L}^{(2)}$ and has the form

$$I_{\gamma\tilde G\tilde G} = \int d^4x\,\frac{C_\gamma}{F^2}\left[\partial^\mu\lambda\,\sigma^\rho\,\partial^\nu\bar\lambda\;\partial_\mu F_{\rho\nu} + {\rm h.c.}\right], \qquad (16)$$

with $C_\gamma$ a model dependent real coefficient and $F_{\mu\nu}$ the electromagnetic field strength. Note that the operator in the square bracket is odd under both parity (P) and charge conjugation (C). In fact, any operator arising from a gauge and SUSY invariant structure which is bilinear in two on-shell gravitinos and contains only a single photon is necessarily odd in both P and C. Thus the generation of any such operator requires a violation of both P and C. Using the Goldstino equation of motion, the analogous term containing $\tilde F_{\mu\nu}$ reduces to Eq. (16) with $C_\gamma \to -iC_\gamma$. Recently, there has appeared in the literature [11] the claim that there is a lower dimensional operator of the form $\frac{\tilde M^2}{F^2}\,\partial^\nu\lambda\,\sigma^\mu\bar\lambda\,F_{\mu\nu}$ which contributes to the single photon-2 gravitino interaction. Here $\tilde M$ is a model dependent SUSY breaking mass parameter which is roughly an order(s) of magnitude less than $\sqrt{F}$. From our analysis, we do not find such a term to be part of a SUSY invariant action piece and thus it should not be included in the effective action. Such a term is also absent if one employs the equivalent formalism of Wess and Samuel [6]. We have also checked, by an explicit graphical calculation using the correct non-linearly realized SUSY invariant action, that such a term does not appear via radiative corrections. This is also contrary to the previous claim.

There have been several recent attempts to extract a lower bound on the SUSY breaking scale using the supernova cooling rate [11, 12, 13]. Unfortunately, some of these estimates [11, 13] rely on the existence of the non-SUSY invariant dimension 6 operator referred to above. Using the correct low energy effective Lagrangian of gravitino interactions, the leading term coupling 2 gravitinos to a single photon contains an additional suppression factor of roughly $C_\gamma s/\tilde M^2$. Taking $\sqrt{s} \simeq 0.1$ GeV for the processes of interest and using $\tilde M \sim 100$ GeV, this introduces an additional suppression of at least $10^{-12}$ in the rate and obviates the previous estimates of a bound on $F$.

Assuming that the mass scales of gauginos and the superpartners of light fermions are above the core temperature of supernova, the gravitino cooling of supernova occurs mainly via gravitino pair production. It is interesting to compare the gravitino pair production cross section to that of the neutrino pair production, which is the main supernova cooling channel. We have seen that for low energy gravitino interactions with matter, the amplitude for gravitino pair production is proportional to $1/F^2$. A simple dimensional analysis then suggests the ratio of the cross sections is

$$\frac{\sigma_{\chi\chi}}{\sigma_{\nu\nu}} \sim \frac{s^2}{F^4 G_F^2}, \qquad (17)$$

where $G_F$ is the Fermi coupling and $\sqrt{s}$ is the typical energy scale of the particles in a supernova. Even with the most optimistic values for $F$, the gravitino production is too small to be relevant. For example, taking $\sqrt{F} = 100$ GeV and $\sqrt{s} = 0.1$ GeV, the ratio is of $O(10^{-11})$.
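A quick numerical check of Eq. (17), substituting the values quoted in the text, with $\sqrt{F} = 100$ GeV (so $F = 10^4\ {\rm GeV}^2$), $\sqrt{s} = 0.1$ GeV (so $s = 10^{-2}\ {\rm GeV}^2$) and $G_F \approx 1.17\times10^{-5}\ {\rm GeV}^{-2}$, reproduces the stated order of magnitude:

```latex
\frac{\sigma_{\chi\chi}}{\sigma_{\nu\nu}}
  \sim \frac{s^2}{F^4 G_F^2}
  = \frac{(10^{-2}\,\mathrm{GeV}^2)^2}
         {(10^{4}\,\mathrm{GeV}^2)^4\,(1.17\times 10^{-5}\,\mathrm{GeV}^{-2})^2}
  \approx 7\times 10^{-11} = O(10^{-11}).
```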
It seems, therefore, that such an astrophysical bound on the SUSY breaking scale is untenable in models where the gravitino is the only superparticle below the scale of the supernova core temperature.

We thank T. K. Kuo for useful conversations. This work was supported in part by the U.S. Department of Energy under grant DE-FG02-91ER40681 (Task B).

References
[1] D. V. Volkov and V. P. Akulov, Pis'ma Zh. Eksp. Teor. Fiz. 16 (1972) 621 [JETP Lett. 16 (1972) 438].
[2] P. Fayet and J. Iliopoulos, Phys. Lett. B51 (1974) 461.
[3] R. Casalbuoni, S. De Curtis, D. Dominici, F. Feruglio and R. Gatto, Phys. Lett. B215 (1988) 313.
[4] M. Dine and A. E. Nelson, Phys. Rev. D48 (1993) 1277; M. Dine, A. E. Nelson and Y. Shirman, Phys. Rev. D51 (1995) 1362; M. Dine, A. E. Nelson, Y. Nir and Y. Shirman, Phys. Rev. D53 (1996) 2658.
[5] J. Ellis, K. Enqvist and D. V. Nanopoulos, Phys. Lett. B147 (1984) 99.
[6] S. Samuel and J. Wess, Nucl. Phys. B221 (1983) 153.
[7] J. Wess and J. Bagger, Supersymmetry and Supergravity, second edition (Princeton University Press, Princeton, 1992).
[8] T. E. Clark and S. T. Love, Phys. Rev. D39 (1989) 2391.
[9] T. E. Clark and S. T. Love, Phys. Rev. D54 (1996) 5723.
[10] A. Brignole, F. Feruglio and F. Zwirner, hep-th/9709111.
[11] M. A. Luty and E. Ponton, hep-ph/9706268.
[12] J. A. Grifols, R. N. Mohapatra and A. Riotto, Phys. Lett. B400 (1997) 124; J. A. Grifols, R. N. Mohapatra and A. Riotto, Phys. Lett. B401 (1997) 283.
[13] J. A. Grifols, E. Masso and R. Toldra, hep-ph/970753; D. S. Dicus, R. N. Mohapatra and V. L. Teplitz, hep-ph/9708369.
New-curriculum senior high school English, Unit 10 "Connections" courseware (Beijing Normal University Press, Elective Compulsory Book 4)
Theme: people and society. Subject competency: quality of thinking. Difficulty: ★★★★★
[Passage guide] This passage briefly introduces the principle of the "small-world theory" and its wide application in social networks.
The Small-World Phenomenon and Decentralized Search
The small-world phenomenon—the principle that we are all linked by short chains of acquaintances, or "six degrees of separation"—is a fundamental issue in social networks. It is a basic statement about the abundance of short paths in a graph whose nodes are people, with links joining pairs who know one another. It is also a topic on which the feedback between social, mathematical, and computational issues has been particularly fluid.
○ Key sentence analysis
1.The small-world phenomenon—the principle that we are all linked by short chains of acquaintances, or “six degrees of separation”—is a fundamental issue in social networks.
Working much more recently, applied mathematicians Duncan Watts and Steve Strogatz proposed thinking about networks with this small-world property as a superposition: a highly clustered sub-network consisting of the "local acquaintances" of nodes, together with a collection of random long-range shortcuts that help produce short paths.
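The Watts-Strogatz superposition described above is easy to reproduce numerically. The short sketch below, an illustration using the networkx library rather than anything from the passage itself, builds such a network and checks the two signature properties: high clustering from the local-acquaintance lattice and short average paths from the random shortcuts.

```python
import networkx as nx

# Ring lattice of n people, each linked to k nearest neighbors; every edge
# is rewired into a random long-range shortcut with probability p.
n, k, p = 1000, 10, 0.1
G = nx.connected_watts_strogatz_graph(n, k, p, seed=42)

print(nx.average_clustering(G))            # stays high: clustered sub-network
print(nx.average_shortest_path_length(G))  # small: "six degrees" style paths
```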
Non-parametric Local Transforms for Computing Visual Correspondence
to that of Nishihara [12] and Seitz [14, 1]. Nishihara's transform is the sign bit of the image after convolution with a Laplacian, while Seitz's transform is the direction of the intensity gradient.
3 The rank transform and the census transform
We next describe two non-parametric local transforms. The first, called the rank transform, is a measure of local intensity. The second, called the census transform, is a non-parametric summary of local spatial structure.
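Both transforms depend only on whether neighboring pixels are darker than the center, which is what makes them non-parametric. A compact numpy sketch of the two (an illustration, not the paper's code; border pixels wrap around for brevity):

```python
import numpy as np

def rank_transform(img, radius=2):
    """Rank transform: count of window pixels whose intensity is below the
    center pixel's."""
    out = np.zeros(img.shape, dtype=np.int32)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy or dx:
                nbr = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
                out += (nbr < img).astype(np.int32)
    return out

def census_transform(img, radius=1):
    """Census transform: bit string of darker-than-center comparisons;
    candidate matches are compared with the Hamming distance."""
    out = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy or dx:
                nbr = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
                out = (out << np.uint64(1)) | (nbr < img).astype(np.uint64)
    return out
```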
Local Rademacher complexities
arXiv:math/0508275v1 [math.ST] 16 Aug 2005
The Annals of Statistics 2005, Vol. 33, No. 4, 1497-1537. DOI: 10.1214/009053605000000282. © Institute of Mathematical Statistics, 2005.

LOCAL RADEMACHER COMPLEXITIES
By Peter L. Bartlett, Olivier Bousquet and Shahar Mendelson
University of California at Berkeley, Max Planck Institute for Biological Cybernetics and Australian National University

We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.

1. Introduction. Estimating the performance of statistical procedures is useful for providing a better understanding of the factors that influence their behavior, as well as for suggesting ways to improve them. Although asymptotic analysis is a crucial first step toward understanding the behavior, finite sample error bounds are of more value as they allow the design of model selection (or parameter tuning) procedures. These error bounds typically have the following form: with high probability, the error of the estimator (typically a function in a certain class) is bounded by an empirical estimate of error plus a penalty term depending on the complexity of the class of functions that can be chosen by the algorithm. The differences between the true and empirical errors of functions in that class can be viewed as an empirical process. Many tools have been developed for understanding the behavior of such objects, and especially for evaluating their suprema, which can be thought of as a measure of how hard it is to estimate functions in the class at hand. The goal is thus to obtain the sharpest possible estimates on the complexity of function classes. A problem arises since the notion of complexity might depend on the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik-Chervonenkis dimension [35] or the metric entropy [28], typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the $L_2(P)$ distance, where $P$ is the underlying distribution, are not useful when $P$ is unknown. Thus, it is desirable to obtain data-dependent estimates which can readily be computed from the sample.

One of the most interesting data-dependent complexity estimates is the so-called Rademacher average associated with the class. Although known for a long time to be related to the expected supremum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25] and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, that is, they do not reflect the fact that the algorithm will likely pick functions that have a small error, and in particular, only a small subset of the function class will be used. As a result, the best error rate that can be obtained via the global Rademacher averages is at least of the order of $1/\sqrt{n}$. [...] general, power type inequalities. Their results, like those of van de Geer, are asymptotic.

In order to exploit this key property and have finite sample bounds, rather than considering the Rademacher averages of the entire class as the complexity measure, it is possible to consider the Rademacher averages of a small subset of the class, usually the intersection of the class with a ball centered at a function of interest. These local Rademacher averages can serve as a complexity measure; clearly, they are always smaller than the corresponding global averages. Several authors have considered the use of local estimates of the complexity of the function class in order to obtain better bounds. Before presenting their results, we introduce some notation which is used throughout the paper.

Let $(\mathcal{X}, P)$ be a probability space. Denote by $F$ a class of measurable functions from $\mathcal{X}$ to $\mathbb{R}$, and set $X_1,\ldots,X_n$ to be independent random variables distributed according to $P$. Let $\sigma_1,\ldots,\sigma_n$ be $n$ independent Rademacher random variables, that is, independent random variables for which $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$. For a function $f:\mathcal{X}\to\mathbb{R}$, define

$$P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i), \qquad Pf = \mathbb{E}f(X), \qquad R_n f = \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i).$$

For a class $F$, set $R_n F = \sup_{f\in F} R_n f$. Define $\mathbb{E}_\sigma$ to be the expectation with respect to the random variables $\sigma_1,\ldots,\sigma_n$, conditioned on all of the other random variables. The Rademacher average of $F$ is $\mathbb{E}R_n F$, and the empirical (or conditional) Rademacher average of $F$ is $\mathbb{E}_\sigma R_n F$.

[...] Koltchinskii and Panchenko consider the data-dependent quantity

$$\phi_n(r) = c_1\,\mathbb{E}_\sigma R_n\{f\in F: P_n f \le r\} + c_2\sqrt{rx/n} + c_3/n,$$

which can be computed from the data. For $\hat r_N$ defined by $\hat r_0 = 1$, $\hat r_{k+1} = \phi_n(\hat r_k)$, they show that with probability at least $1-2Ne^{-x}$,

$$P\hat f \le \hat r_N + 2x/n,$$

and if $\psi$ is a sub-root function satisfying $\psi(r) \ge \mathbb{E}_\sigma R_n\{f\in F: P_n f \le r\}$, and if the number of iterations $N$ is at least $1+\lceil\log_2\log_2(n/x)\rceil$, then with probability at least $1-Ne^{-x}$,

$$\hat r_N \le c\left(\hat r^* + \frac{x}{n}\right),$$

where $\hat r^*$ is the fixed point of $\psi$. Combining the above results, one has a procedure to obtain data-dependent error bounds that are of the order of the fixed point of the modulus of continuity at 0 of the empirical Rademacher averages. One limitation of this result is that it assumes that there is a function $f^*$ in the class with $Pf^* = 0$. In contrast, we are interested in prediction problems where $Pf$ is the error of an estimator, and in the presence of noise there may not be any perfect estimator (even the best in the class can have nonzero error).

More recently, Bousquet, Koltchinskii and Panchenko [9] have obtained a more general result avoiding the iterative procedure. Their result is that for functions with values in $[0,1]$, with probability at least $1-e^{-x}$,

$$\forall f\in F, \quad Pf \le c\left(P_n f + \hat r^* + \frac{x+\log\log n}{n}\right),$$

where $\hat r^*$ is the fixed point of a sub-root upper bound $\psi(r) \ge \mathbb{E}_\sigma R_n\{f\in F: P_n f \le r\}$. The main difference between this and the results of [16] is that there is no requirement that the class contain a perfect function. However, the local Rademacher averages are centered around the zero function instead of the one that minimizes $Pf$. As a consequence, the fixed point $\hat r^*$ cannot be expected to converge to zero when $\inf_{f\in F} Pf > 0$.

In order to remove this limitation, Lugosi and Wegkamp [19] use localized Rademacher averages of a small ball around the minimizer $\hat f$ of $P_n$. However, their result is restricted to nonnegative functions, and in particular functions with values in $\{0,1\}$. Moreover, their bounds also involve some global information, in the form of the shatter coefficients $S_F(X_1^n)$ of the function class (i.e., the cardinality of the coordinate projections of the class $F$ on the data $X_1^n$). They show that there are constants $c_1, c_2$ such that, with probability at least $1-8/n$, the empirical minimizer $\hat f$ satisfies

$$P\hat f \le \inf_{f\in F} Pf + 2\tilde\psi_n(\hat r_n),$$

where

$$\tilde\psi_n(r) = c_1\left(\mathbb{E}_\sigma R_n\{f\in F: P_n f \le 16 P_n\hat f + 15r\} + \log n\,\sqrt{\frac{P_n\hat f + r}{n}} + \frac{\log n}{n}\right)$$

and $\hat r_n = c_2(\log S_F(X_1^n) + \log n)/n$. The limitation of this result is that $\hat r_n$ has to be chosen according to the (empirically measured) complexity of the whole class, which may not be as sharp as the Rademacher averages, and in general is not a fixed point of $\tilde\psi_n$. Moreover, the balls over which the Rademacher averages are computed in $\tilde\psi_n$ contain a factor of 16 in front of $P_n\hat f$. As we explain later, this induces a lower bound on $\tilde\psi_n$ when there is no function with $Pf = 0$ in the class.

It seems that the only way to capture the right behavior in the general, noisy case is to analyze the increments of the empirical process, in other words, to directly consider the functions $f - f^*$. This approach was first proposed by Massart [22]; see also [26]. Massart introduces the assumption

$$\mathrm{Var}[\ell_f(X) - \ell_{f^*}(X)] \le d^2(f, f^*) \le B(P\ell_f - P\ell_{f^*}),$$

where $\ell_f$ is the loss associated with the function $f$ [in other words, $\ell_f(X,Y) = \ell(f(X),Y)$, which measures the discrepancy in the prediction made by $f$], $d$ is a pseudometric and $f^*$ minimizes the expected loss. (The previous results could also be stated in terms of loss functions, but we omitted this in order to simplify exposition. However, the extra notation is necessary to properly state Massart's result.) This is a more refined version of the assumption we mentioned earlier on the relationship between the variance and expectation of the increments of the empirical process. It is only satisfied for some loss functions $\ell$ and function classes $F$. Under this assumption, Massart considers a nondecreasing function $\psi$ satisfying

$$\psi(r) \ge \mathbb{E}\sup_{f\in F,\ d^2(f,f^*)\le r}|Pf - Pf^* - P_nf + P_nf^*| + c\sqrt{\frac{rx}{n}},$$

such that $\psi(r)/\sqrt{r}$ is nonincreasing (we refer to this property as the sub-root property later in the paper). Then, with probability at least $1-e^{-x}$,

$$\forall f\in F, \quad P\ell_f - P\ell_{f^*} \le c\left(r^* + \frac{x}{n}\right),$$

where $r^*$ is the fixed point of $\psi$. In many situations of interest, this bound suffices to prove minimax rates of convergence for penalized M-estimators. (Massart considers examples where the complexity term can be bounded using a priori global information about the function class.) However, the main limitation of this result is that it does not involve quantities that can be computed from the data.

Finally, as we mentioned earlier, Mendelson [26] gives an analysis similar to that of Massart, in a slightly less general case (with no noise in the target values, i.e., the conditional distribution of $Y$ given $X$ is concentrated at one point). Mendelson introduces the notion of the star-hull of a class of functions (see the next section for a definition) and considers Rademacher averages of this star-hull as a localized measure of complexity. His results also involve a priori knowledge of the class, such as the rate of growth of covering numbers.

We can now spell out our goal in more detail: in this paper we combine the increment-based approach of Massart and Mendelson (dealing with differences of functions, or more generally with bounded real-valued functions) with the empirical local Rademacher approach of Koltchinskii and Panchenko and of Lugosi and Wegkamp, in order to obtain data-dependent bounds which depend on a fixed point of the modulus of continuity of Rademacher averages computed around the empirically best function.

Our first main result (Theorem 3.3) is a distribution-dependent result involving the fixed point $r^*$ of a local Rademacher average of the star-hull of the class $F$. This shows that functions with the sub-root property can readily be obtained from Rademacher averages, while in previous work the appropriate functions were obtained only via global information about the class. The second main result (Theorems 4.1 and 4.2) is an empirical counterpart of the first one, where the complexity is the fixed point of an empirical local Rademacher average. We also show that this fixed point is within a constant factor of the nonempirical one. Equipped with this result, we can then prove (Theorem 5.4) a fully data-dependent analogue of Massart's result, where the Rademacher averages are localized around the minimizer of the empirical loss. We also show (Theorem 6.3) that in the context of classification, the local Rademacher averages of star-hulls can be approximated by solving a weighted empirical error minimization problem. Our final result (Corollary 6.7) concerns regression with kernel classes, that is, classes of functions that are generated by a positive definite kernel. These classes are widely used in interpolation and estimation problems as they yield computationally efficient algorithms. Our result gives a data-dependent complexity term that can be computed directly from the eigenvalues of the Gram matrix (the matrix whose entries are values of the kernel on the data).

The sharpness of our results is demonstrated by the fact that we recover, in the distribution-dependent case (treated in Section 4), results similar to those of Massart [22], which, in the situations where they apply, give the minimax optimal rates or the best known results. Moreover, the data-dependent bounds that we obtain as counterparts of these results have the same rate of convergence (see Theorem 4.2).

The paper is organized as follows. In Section 2 we present some preliminary results obtained from concentration inequalities, which we use throughout. Section 3 establishes error bounds using local Rademacher averages and explains how to compute their fixed points from "global information" (e.g., estimates of the metric entropy or of the combinatorial dimensions of the indexing class), in which case the optimal estimates can be recovered. In Section 4 we give a data-dependent error bound using empirical and local Rademacher averages, and show the connection between the fixed points of the empirical and nonempirical Rademacher averages. In Section 5 we apply our results to loss classes. We give estimates that generalize the results of Koltchinskii and Panchenko by eliminating the requirement that some function in the class have zero loss, and are more general than those of Lugosi and Wegkamp, since there is no need in our case to estimate global shatter coefficients of the class. We also give a data-dependent extension of Massart's result where the local averages are computed around the minimizer of the empirical loss. Finally, Section 6 shows that the problem of estimating these local Rademacher averages in classification reduces to weighted empirical risk minimization. It also shows that the local averages for kernel classes can be sharply bounded in terms of the eigenvalues of the Gram matrix.

2. Preliminary results. Recall that the star-hull of $F$ around $f_0$ is defined by

$$\mathrm{star}(F, f_0) = \{f_0 + \alpha(f - f_0): f\in F,\ \alpha\in[0,1]\}.$$

Throughout this paper, we will manipulate suprema of empirical processes, that is, quantities of the form $\sup_{f\in F}(Pf - P_nf)$. We will always assume they are measurable without explicitly mentioning it. In other words, we assume that the class $F$ and the distribution $P$ satisfy appropriate (mild) conditions for measurability of this supremum (we refer to [11, 28] for a detailed account of such issues).

The following theorem is the main result of this section and is at the core of all the proofs presented later. It shows that if the functions in a class have small variance, the maximal deviation between empirical means and true means is controlled by the Rademacher averages of $F$. In particular, the bound improves as the largest variance of a class member decreases.

Theorem 2.1. Let $F$ be a class of functions that map $\mathcal{X}$ into $[a,b]$. Assume that there is some $r>0$ such that for every $f\in F$, $\mathrm{Var}[f(X_i)] \le r$. Then, for every $x>0$, with probability at least $1-e^{-x}$,

$$\sup_{f\in F}(Pf - P_nf) \le \inf_{\alpha>0}\left(2(1+\alpha)\,\mathbb{E}R_nF + \sqrt{\frac{2rx}{n}} + (b-a)\left(\frac{1}{3}+\frac{1}{\alpha}\right)\frac{x}{n}\right),$$

and with probability at least $1-2e^{-x}$,

$$\sup_{f\in F}(Pf - P_nf) \le \inf_{\alpha\in(0,1)}\left(\frac{2(1+\alpha)}{1-\alpha}\,\mathbb{E}_\sigma R_nF + \sqrt{\frac{2rx}{n}} + (b-a)\left(\frac{1}{3}+\frac{1}{\alpha}+\frac{1+\alpha}{\alpha(1-\alpha)}\right)\frac{x}{n}\right).$$

Moreover, the same results hold for the quantity $\sup_{f\in F}(P_nf - Pf)$.

This theorem, which is proved in Appendix A.2, is a more or less direct consequence of Talagrand's inequality for empirical processes [30]. However, the actual statement presented here is new in the sense that it displays the best known constants. Indeed, compared to the previous result of Koltchinskii and Panchenko [16], which was based on Massart's version of Talagrand's inequality [21], we have used the most refined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages. This last inequality is a powerful tool to obtain data-dependent bounds, since it allows one to replace the Rademacher average (which measures the complexity of the class of functions) by its empirical version, which can be efficiently computed in some cases. Details about these inequalities are given in Appendix A.1.

When applied to the full function class $F$, the above theorem is not useful. Indeed, with only a trivial bound on the maximal variance, better results can be obtained via simpler concentration inequalities, such as the bounded difference inequality [23], which would give a deviation term of order $\sqrt{x/n}$.
However, by applying Theorem 2.1 to subsets of $F$ or to modified classes obtained from $F$, much better results can be obtained. Hence, the presence of an upper bound on the variance in the square root term is the key ingredient of this result.

A last preliminary result that we will require is the following consequence of Theorem 2.1, which shows that if the local Rademacher averages are small, then balls in $L_2(P)$ are probably contained in the corresponding empirical balls [i.e., in $L_2(P_n)$] with a slightly larger radius.

Corollary 2.2. Let $F$ be a class of functions that map $\mathcal{X}$ into $[-b,b]$ with $b>0$. For every $x>0$ and $r$ that satisfy

$$r \ge 10b\,\mathbb{E}R_n\{f: f\in F,\ Pf^2 \le r\} + \frac{11b^2x}{n},$$

then with probability at least $1-e^{-x}$,

$$\{f\in F: Pf^2 \le r\} \subseteq \{f\in F: P_nf^2 \le 2r\}.$$

Proof. Since the range of any function in the set $F_r = \{f^2: f\in F,\ Pf^2 \le r\}$ is contained in $[0,b^2]$, it follows that $\mathrm{Var}[f^2(X_i)] \le Pf^4 \le b^2Pf^2 \le b^2r$. Thus, by the first part of Theorem 2.1 (with $\alpha = 1/4$), with probability at least $1-e^{-x}$, every $f\in F_r$ satisfies

$$P_nf^2 \le r + \frac{5}{2}\,\mathbb{E}R_nF_r + \sqrt{\frac{2b^2rx}{n}} + \frac{13}{3}\cdot\frac{b^2x}{n},$$

which, under the assumption on $r$, is at most $2r$. [...]

Definition 3.1. A function $\psi: [0,\infty)\to[0,\infty)$ is sub-root if it is nonnegative, nondecreasing and if $\psi(r)/\sqrt{r}$ is nonincreasing for $r>0$. We only consider nontrivial sub-root functions, that is, sub-root functions that are not the constant function $\psi \equiv 0$.

Lemma 3.2. If $\psi: [0,\infty)\to[0,\infty)$ is a nontrivial sub-root function, then it is continuous on $[0,\infty)$ and the equation $\psi(r) = r$ has a unique positive solution. Moreover, if we denote the solution by $r^*$, then for all $r>0$, $r \ge \psi(r)$ if and only if $r^* \le r$.

The proof of this lemma is in Appendix A.2. In view of the lemma, we will simply refer to the quantity $r^*$ as the unique positive solution of $\psi(r) = r$, or as the fixed point of $\psi$.

3.1. Error bounds. We can now state and discuss the main result of this section. It is composed of two parts: in the first part, one requires a sub-root upper bound on the local Rademacher averages, and in the second part, it is shown that better results can be obtained when the class over which the averages are computed is enlarged slightly.

Theorem 3.3. Let $F$ be a class of functions with ranges in $[a,b]$ and assume that there are some functional $T: F\to\mathbb{R}_+$ and some constant $B$ such that for every $f\in F$, $\mathrm{Var}[f] \le T(f) \le B\,Pf$. Let $\psi$ be a sub-root function and let $r^*$ be the fixed point of $\psi$.

1. Assume that $\psi$ satisfies, for any $r \ge r^*$, $\psi(r) \ge B\,\mathbb{E}R_n\{f\in F: T(f) \le r\}$. Then, with $c_1 = 704$ and $c_2 = 26$, for any $K>1$ and every $x>0$, with probability at least $1-e^{-x}$,

$$\forall f\in F, \quad Pf \le \frac{K}{K-1}\,P_nf + \frac{c_1K}{B}\,r^* + \frac{x(11(b-a) + c_2BK)}{n}.$$

2. If, in addition, for $f\in F$ and $\alpha\in[0,1]$, $T(\alpha f) \le \alpha^2 T(f)$, and if $\psi$ satisfies, for any $r \ge r^*$, $\psi(r) \ge B\,\mathbb{E}R_n\{f\in\mathrm{star}(F,0): T(f) \le r\}$, then the same results hold true with $c_1 = 6$ and $c_2 = 5$.

The proof of this theorem is given in Section 3.2. We can compare the results to our starting point (Theorem 2.1). The improvement comes from the fact that the complexity term, which was essentially $\sup_r\psi(r)$ in Theorem 2.1 (if we had applied it to the class $F$ directly), is now reduced to $r^*$, the fixed point of $\psi$. So the complexity term is always smaller (later, we show how to estimate $r^*$). On the other hand, there is some loss since the constant in front of $P_nf$ is strictly larger than 1. Section 5.2 will show that this is not an issue in the applications we have in mind.

In Sections 5.1 and 5.2 we investigate conditions that ensure the assumptions of this theorem are satisfied, and we provide applications of this result to prediction problems. The condition that the variance is upper bounded by the expectation turns out to be crucial to obtain these results.

The idea behind Theorem 3.3 originates in the work of Massart [22], who proves a slightly different version of the first part. The difference is that we use local Rademacher averages instead of the expectation of the supremum of the empirical process on a ball. Moreover, we give smaller constants. As far as we know, the second part of Theorem 3.3 is new.

3.1.1. Choosing the function $\psi$. Notice that the function $\psi$ cannot be chosen arbitrarily and has to satisfy the sub-root property. One possible approach is to use classical upper bounds on the Rademacher averages, such as Dudley's entropy integral. This can give a sub-root upper bound and was used, for example, in [16] and in [22]. However, the second part of Theorem 3.3 indicates a possible choice for $\psi$, namely, one can take $\psi$ as the local Rademacher averages of the star-hull of $F$ around 0. The reason for this comes from the following lemma, which shows that if the class is star-shaped and $T(f)$ behaves as a quadratic function, the Rademacher averages are sub-root.

Lemma 3.4. If the class $F$ is star-shaped around $\hat f$ (which may depend on the data), and $T: F\to\mathbb{R}_+$ is a (possibly random) function that satisfies $T(\alpha f) \le \alpha^2 T(f)$ for any $f\in F$ and any $\alpha\in[0,1]$, then the (random) function $\psi$ defined for $r \ge 0$ by

$$\psi(r) = \mathbb{E}_\sigma R_n\{f\in F: T(f - \hat f) \le r\}$$

is sub-root and $r \mapsto \mathbb{E}\psi(r)$ is also sub-root.

This lemma is proved in Appendix A.2. Notice that making a class star-shaped only increases it, so that

$$\mathbb{E}R_n\{f\in\mathrm{star}(F,f_0): T(f)\le r\} \ge \mathbb{E}R_n\{f\in F: T(f)\le r\}.$$

However, this increase in size is moderate, as can be seen, for example, if one compares covering numbers of a class and its star-hull (see, e.g., [26], Lemma 4.5).

3.1.2. Some consequences. As a consequence of Theorem 3.3, we obtain an error bound when $F$ consists of uniformly bounded nonnegative functions.
Notice that in this case the variance is trivially bounded by a constant times the expectation and one can directly use $T(f) = Pf$.

Corollary 3.5. Let $F$ be a class of functions with ranges in $[0,1]$. Let $\psi$ be a sub-root function such that for all $r \ge 0$, $\mathbb{E}R_n\{f\in F: Pf \le r\} \le \psi(r)$, and let $r^*$ be the fixed point of $\psi$. Then, for any $K>1$ and every $x>0$, with probability at least $1-e^{-x}$, every $f\in F$ satisfies

$$Pf \le \frac{K}{K-1}\,P_nf + 704K\,r^* + \frac{x(11 + 26K)}{n}.$$

Also, with probability at least $1-e^{-x}$, every $f\in F$ satisfies

$$P_nf \le \frac{K+1}{K}\,Pf + 704K\,r^* + \frac{x(11 + 26K)}{n}.$$

Proof. When $f\in[0,1]$, we have $\mathrm{Var}[f] \le Pf$, so that the result follows from applying Theorem 3.3 with $T(f) = Pf$.

We also note that the same idea as in the proof of Theorem 3.3 gives a converse of Corollary 2.2, namely, that with high probability the intersection of $F$ with an empirical ball of a fixed radius is contained in the intersection of $F$ with an $L_2(P)$ ball with a slightly larger radius.

Lemma 3.6. Let $F$ be a class of functions that map $\mathcal{X}$ into $[-1,1]$. Fix $x>0$. If

$$r \ge 20\,\mathbb{E}R_n\{f: f\in\mathrm{star}(F,0),\ Pf^2 \le r\} + \frac{26x}{n},$$

then with probability at least $1-e^{-x}$, $\{f\in F: P_nf^2 \le r\} \subseteq \{f\in F: Pf^2 \le 2r\}$.

Corollary 3.7. Let $F$ be a class of $\{0,1\}$-valued functions with VC-dimension $d < \infty$. Then for all $K>1$ and every $x>0$, with probability at least $1-e^{-x}$, every $f\in F$ satisfies

$$Pf \le \frac{K}{K-1}\,P_nf + cK\,\frac{d\log(n/d) + x}{n}.$$

[...] The proof of Theorem 3.3 proceeds in several steps: (a) rescale (weight) the functions of $F$ so that the resulting class has uniformly controlled variance; (b) upper bound the Rademacher averages of this weighted class, by "peeling off" subclasses of $F$ according to the variance of their elements, and bounding the Rademacher averages of these subclasses using $\psi$; (c) use the sub-root property of $\psi$, so that its fixed point gives a common upper bound on the complexity of all the subclasses (up to some scaling); (d) finally, convert the upper bound for functions in the weighted class into a bound for functions in the initial class.

The idea of peeling, that is, of partitioning the class $F$ into slices where functions have variance within a certain range, is at the core of the proof of the first part of Theorem 3.3 [see, e.g., (3.1)]. However, it does not appear explicitly in the proof of the second part. One explanation is that when one considers the star-hull of the class, it is enough to consider two subclasses: the functions with $T(f) \le r$ and the ones with $T(f) > r$, and this is done by introducing the weighting factor $T(f)\vee r$. This idea was exploited in the work of Mendelson [26] and, more recently, in [4]. Moreover, when one considers the set $F_r = \mathrm{star}(F,0)\cap\{T(f)\le r\}$, any function $f'\in F$ with $T(f') > r$ will have a scaled down representative in that set. So even though it seems that we look at the class $\mathrm{star}(F,0)$ only locally, we still take into account all of the functions in $F$ (with appropriate scaling).

3.2. Proofs. Before presenting the proof, let us first introduce some additional notation. Given a class $F$, $\lambda > 1$ and $r > 0$, let $w(f) = \min\{r\lambda^k: k\in\mathbb{N},\ r\lambda^k \ge T(f)\}$ and set

$$G_r = \left\{\frac{r}{w(f)}\,f: f\in F\right\}, \qquad \tilde G_r = \left\{\frac{r}{T(f)\vee r}\,f: f\in F\right\},$$

and define

$$V_r^+ = \sup_{g\in G_r}(Pg - P_ng), \qquad V_r^- = \sup_{g\in G_r}(P_ng - Pg),$$

and similarly $\tilde V_r^+ = \sup_{g\in\tilde G_r}(Pg - P_ng)$ and $\tilde V_r^- = \sup_{g\in\tilde G_r}(P_ng - Pg)$.

Lemma 3.8. With the above notation, assume that there is a constant $B>0$ such that for every $f\in F$, $T(f) \le B\,Pf$. Fix $K>1$, $\lambda>0$ and $r>0$. If $V_r^+ \le r/(\lambda BK)$, then

$$\forall f\in F, \quad Pf \le \frac{K}{K-1}\,P_nf + \frac{r}{\lambda BK}.$$

Also, if $V_r^- \le r/(\lambda BK)$, then

$$\forall f\in F, \quad P_nf \le \frac{K+1}{K}\,Pf + \frac{r}{\lambda BK}.$$

Similarly, if $K>1$ and $r>0$ are such that $\tilde V_r^+ \le r/(BK)$, then $\forall f\in F$, $Pf \le \frac{K}{K-1}P_nf + \frac{r}{BK}$; and if $\tilde V_r^- \le r/(BK)$, then $\forall f\in F$, $P_nf \le \frac{K+1}{K}Pf + \frac{r}{BK}$.

Proof. Notice that for all $g\in G_r$, $Pg \le P_ng + V_r^+$. Fix $f\in F$ and define $g = rf/w(f)$. When $T(f) \le r$, $w(f) = r$, so that $g = f$. Thus, the fact that $Pg \le P_ng + V_r^+$ implies that $Pf \le P_nf + V_r^+ \le P_nf + r/(\lambda BK)$. On the other hand, if $T(f) > r$, then $w(f) = r\lambda^k$ with $k>0$ and $T(f)\in(r\lambda^{k-1}, r\lambda^k]$. Moreover, $g = f/\lambda^k$, $Pg \le P_ng + V_r^+$, and thus

$$Pf \le P_nf + \lambda^k V_r^+.$$

Using the fact that $T(f) > r\lambda^{k-1}$, it follows that

$$Pf \le P_nf + \lambda^k V_r^+ < P_nf + \lambda T(f)V_r^+/r \le P_nf + Pf/K.$$

Rearranging, $Pf \le \frac{K}{K-1}(P_nf + r/(\lambda BK))$; the other statements are proved in the same way.

Proof of Theorem 3.3, first part. Let $F(x,y) := \{f\in F: x \le T(f) \le y\}$ and define $k$ to be the smallest integer such that $r\lambda^{k+1} \ge Bb$. Then, peeling off the subclasses according to the variance,

$$\mathbb{E}R_nG_r \le \mathbb{E}R_nF(0,r) + \sum_{j=0}^{k}\lambda^{-j}\,\mathbb{E}\sup_{f\in F(r\lambda^j, r\lambda^{j+1})}R_nf \le \frac{1}{B}\left[\psi(r) + \sum_{j=0}^{k}\lambda^{-j}\psi(r\lambda^{j+1})\right]. \qquad (3.1)$$

By the sub-root property, $\psi(\beta r) \le \sqrt{\beta}\,\psi(r)$ for $\beta \ge 1$, and for $r \ge r^*$, $\psi(r) \le \sqrt{r/r^*}\,\psi(r^*) = \sqrt{rr^*}$. Applying Theorem 2.1 to $G_r$, for all $x>0$, with probability $1-e^{-x}$,

$$V_r^+ \le A\sqrt{r} + C, \qquad A = \frac{10(1+\alpha)\sqrt{r^*}}{B} + \sqrt{\frac{2x}{n}}, \qquad C = (b-a)\left(\frac{1}{3}+\frac{1}{\alpha}\right)\frac{x}{n}.$$

The solution $r_0$ of $A\sqrt{r} + C = r/(\lambda BK)$ satisfies $r_0 \ge \lambda^2A^2B^2K^2/2 \ge r^*$ and $r_0 \le (\lambda BK)^2A^2 + 2\lambda BKC$, so that applying Lemma 3.8, it follows that every $f\in F$ satisfies

$$Pf \le \frac{K}{K-1}\,P_nf + \lambda BK\left(\frac{100(1+\alpha)^2r^*}{B^2} + \frac{20(1+\alpha)}{B}\sqrt{\frac{2xr^*}{n}} + \frac{2x}{n}\right) + 2(b-a)\left(\frac{1}{3}+\frac{1}{\alpha}\right)\frac{x}{n}.$$

Using $\sqrt{2xr^*/n} \le Bx/(5n) + 5r^*/(2B)$ and choosing $\alpha$ and $\lambda$ appropriately completes the proof of the first statement. The second statement is proved in the same way, by considering $V_r^-$ instead of $V_r^+$.

Proof of Theorem 3.3, second part. The proof of this result uses the same argument; however, we consider the class $\tilde G_r$ defined above. One can easily check that $\tilde G_r \subseteq \{f\in\mathrm{star}(F,0): T(f)\le r\}$, and thus $\mathbb{E}R_n\tilde G_r \le \psi(r)/B$. Applying Theorem 2.1 to $\tilde G_r$, it follows that, for all $x>0$, with probability $1-e^{-x}$,

$$\tilde V_r^+ \le \frac{2(1+\alpha)\psi(r)}{B} + \sqrt{\frac{2rx}{n}} + \left(\frac{1}{3}+\frac{1}{\alpha}\right)\frac{x}{n},$$

and the reasoning is then the same as for the first part. [...]

Proof of Lemma 3.6. Clearly, if $f\in F$, then $f^2$ maps to $[0,1]$ and $\mathrm{Var}[f^2] \le Pf^2$. Thus, Theorem 2.1 can be applied to the class $G_r = \{rf^2/(Pf^2\vee r): f\in F\}$, whose functions have range in $[0,1]$ and variance bounded by $r$. Therefore, with probability at least $1-e^{-x}$, every $f\in F$ satisfies

$$\frac{r}{Pf^2\vee r}\left(Pf^2 - P_nf^2\right) \le 2(1+\alpha)\,\mathbb{E}R_nG_r + \sqrt{\frac{2rx}{n}} + \left(\frac{1}{3}+\frac{1}{\alpha}\right)\frac{x}{n}. \qquad (3.2)$$

Selecting $\alpha = 1/4$ and bounding $Pf^2\vee r$ [...] completes the argument.

4. Data-dependent error bounds. The results presented thus far use distribution-dependent measures of complexity of the class at hand. Indeed, the sub-root function $\psi$ of Theorem 3.3 is bounded in terms of the Rademacher averages of the star-hull of $F$, but these averages can only be computed if one knows the distribution $P$. Otherwise, we have seen that it is possible to compute an upper bound on the Rademacher averages using a priori global or distribution-free knowledge about the complexity of the class at hand (such as the VC-dimension). In this section we present error bounds that can be computed directly from the data, without a priori information. Instead of computing $\psi$, we compute an estimate $\hat\psi_n$ of it. The function $\hat\psi_n$ is defined using the data and is an upper bound on $\psi$ with high probability. To simplify the exposition we restrict ourselves to the case where the functions have a range which is symmetric around zero, say $[-1,1]$. Moreover, we can only treat the special case where $T(f) = Pf^2$, but this is a minor restriction, as in most applications this is the function of interest [i.e., for which one can show $T(f) \le B\,Pf$].

4.1. Results. We now present the main result of this section, which gives an analogue of the second part of Theorem 3.3, with a completely empirical bound (i.e., the bound can be computed from the data only).

Theorem 4.1. Let $F$ be a class of functions with ranges in $[-1,1]$ and assume that there is some constant $B$ such that for every $f\in F$, $Pf^2 \le B\,Pf$. Let $\hat\psi_n$ be a sub-root function and let $\hat r^*$ be the fixed point of $\hat\psi_n$. Fix $x>0$ and assume that $\hat\psi_n$ satisfies, for any $r \ge \hat r^*$,

$$\hat\psi_n(r) \ge c_1\,\mathbb{E}_\sigma R_n\{f\in\mathrm{star}(F,0): P_nf^2 \le 2r\} + \frac{c_2x}{n}.$$

Then, for any $K>1$, with probability at least $1-3e^{-x}$,

$$\forall f\in F, \quad Pf \le \frac{K}{K-1}\,P_nf + \frac{6K}{B}\,\hat r^* + \frac{x(11 + 5BK)}{n}.$$

Also, with probability at least $1-3e^{-x}$,

$$\forall f\in F, \quad P_nf \le \frac{K+1}{K}\,Pf + \frac{6K}{B}\,\hat r^* + \frac{x(11 + 5BK)}{n}.$$
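For a finite dictionary of functions represented by their values on the sample, the empirical local Rademacher average in Theorem 4.1 can be approximated by Monte Carlo over the signs, and the fixed point by simple iteration (Lemma 3.2 makes the iterates of a sub-root function well behaved). The sketch below is purely illustrative: the constants $c_1, c_2$ are dropped, the star-hull is not expanded, and all names are assumptions rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_local_rademacher(V, r, n_sigma=200):
    """V: (m, n) array whose row j holds the values of function f_j on the
    n data points. Estimates E_sigma sup over {f: P_n f^2 <= r} of
    (1/n) sum_i sigma_i f(X_i)."""
    m, n = V.shape
    ball = V[(V ** 2).mean(axis=1) <= r]       # restrict to the empirical ball
    if len(ball) == 0:
        return 0.0
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
    return (sigma @ ball.T / n).max(axis=1).mean()

def fixed_point(V, r0=1.0, iters=20):
    """Iterate r <- psi_hat(r) to approximate the fixed point r*."""
    r = r0
    for _ in range(iters):
        r = empirical_local_rademacher(V, r)
    return r
```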
Object recognition from local scale-invariant features
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca

Abstract
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.

1. Introduction
Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function.
Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.

2. Related research
Object recognition is widely used in the machine vision industry for the purposes of inspection, registration, and manipulation. However, current commercial systems for object recognition depend almost exclusively on correlation-based template matching. While very effective for certain engineered environments, where object pose and illumination are tightly controlled, template matching becomes computationally infeasible when object rotation, scale, illumination, and 3D pose are allowed to vary, and even more so when dealing with partial visibility and large model databases.

An alternative to searching all image locations for matches is to extract features from the image that are at least partially invariant to the image formation process and matching only to those features. Many candidate feature types have been proposed and explored, including line segments [6], groupings of edges [11, 14], and regions [2], among many other proposals. While these features have worked well for certain object classes, they are often not detected frequently enough or with sufficient stability to form a basis for reliable recognition.

There has been recent work on developing much denser collections of image features. One approach has been to use a corner detector (more accurately, a detector of peaks in local image variation) to identify repeatable image locations, around which local image properties can be measured.
Zhang et al. [23] used the Harris corner detector to identify feature locations for epipolar alignment of images taken from differing viewpoints. Rather than attempting to correlate regions from one image against all possible regions in a second image, large savings in computation time were achieved by only matching regions centered at corner points in each image.

For the object recognition problem, Schmid & Mohr [19] also used the Harris corner detector to identify interest points, and then created a local image descriptor at each interest point from an orientation-invariant vector of derivative-of-Gaussian image measurements. These image descriptors were used for robust object recognition by looking for multiple matching descriptors that satisfied object-based orientation and location constraints. This work was impressive both for the speed of recognition in a large database and the ability to handle cluttered images.

The corner detectors used in these previous approaches have a major failing, which is that they examine an image at only a single scale. As the change in scale becomes significant, these detectors respond to different image points. Also, since the detector does not provide an indication of the object scale, it is necessary to create image descriptors and attempt matching at a large number of scales. This paper describes an efficient method to identify stable key locations in scale space. This means that different scalings of an image will have no effect on the set of key locations selected. Furthermore, an explicit scale is determined for each point, which allows the image description vector for that point to be sampled at an equivalent scale in each image. A canonical orientation is determined at each location, so that matching can be performed relative to a consistent local 2D coordinate frame. This allows for the use of more distinctive image descriptors than the rotation-invariant ones used by Schmid and Mohr, and the descriptor is further modified to improve its stability to changes in affine projection and illumination.

Other approaches to appearance-based recognition include eigenspace matching [13], color histograms [20], and receptive field histograms [18]. These approaches have all been demonstrated successfully on isolated objects or pre-segmented images, but due to their more global features it has been difficult to extend them to cluttered and partially occluded images. Ohba & Ikeuchi [15] successfully apply the eigenspace approach to cluttered images by using many small local eigen-windows, but this then requires expensive search for each window in a new image, as with template matching.

3. Key localization
We wish to identify locations in image scale space that are invariant with respect to image translation, scaling, and rotation, and are minimally affected by noise and small distortions. Lindeberg [8] has shown that under some rather general assumptions on scale invariance, the Gaussian kernel and its derivatives are the only possible smoothing kernels for scale space analysis.

To achieve rotation invariance and a high level of efficiency, we have chosen to select key locations at maxima and minima of a difference of Gaussian function applied in scale space. This can be computed very efficiently by building an image pyramid with resampling between each level.
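The pyramid construction spelled out in the next paragraphs (initial 2x expansion, repeated smoothing with sigma = sqrt(2), subtraction, and 1.5-pixel resampling) can be sketched in a few lines. This is an illustration of the scheme, not Lowe's implementation; the cross-level extremum test and sub-pixel bookkeeping are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def dog_pyramid(image, levels=4):
    """Difference-of-Gaussian pyramid: smooth with sigma = sqrt(2) to get A,
    smooth again to get B (effective sigma = 2), store D = A - B, then
    resample B with a 1.5-pixel spacing for the next level."""
    img = zoom(image.astype(float), 2.0, order=1)   # initial 2x expansion
    pyramid = []
    for _ in range(levels):
        a = gaussian_filter(img, np.sqrt(2.0))
        b = gaussian_filter(a, np.sqrt(2.0))
        pyramid.append(a - b)
        img = zoom(b, 1.0 / 1.5, order=1)           # bilinear resampling
    return pyramid

def is_extremum_2d(d, y, x):
    """Test against the 8 neighbors at one level; the full method also
    checks the closest pixels at the adjacent pyramid levels."""
    patch = d[y - 1:y + 2, x - 1:x + 2]
    return d[y, x] == patch.max() or d[y, x] == patch.min()
```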
Crowley & Parker [4] and Lindeberg [9] have previously used the difference-of-Gaussian in scale space for other purposes. In the following, we describe a particularly efficient and stable method to detect and characterize the maxima and minima of this function.

As the 2D Gaussian function is separable, its convolution with the input image can be efficiently computed by applying two passes of the 1D Gaussian function in the horizontal and vertical directions:

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}$$

For key localization, all smoothing operations are done using $\sigma = \sqrt{2}$, which can be approximated with sufficient accuracy using a 1D kernel with 7 sample points.

The input image is first convolved with the Gaussian function using $\sigma = \sqrt{2}$ to give an image A. This is then repeated a second time with a further incremental smoothing of $\sigma = \sqrt{2}$ to give a new image, B, which now has an effective smoothing of $\sigma = 2$. The difference of Gaussian function is obtained by subtracting image B from A, resulting in a ratio of $2/\sqrt{2} = \sqrt{2}$ between the two Gaussians.

To generate the next pyramid level, we resample the already smoothed image B using bilinear interpolation with a pixel spacing of 1.5 in each direction. While it may seem more natural to resample with a relative scale of $\sqrt{2}$, the only constraint is that sampling be frequent enough to detect peaks. The 1.5 spacing means that each new sample will be a constant linear combination of 4 adjacent pixels. This is efficient to compute and minimizes aliasing artifacts that would arise from changing the resampling coefficients.

Maxima and minima of this scale-space function are determined by comparing each pixel in the pyramid to its neighbours. First, a pixel is compared to its 8 neighbours at the same level of the pyramid. If it is a maximum or minimum at this level, then the closest pixel location is calculated at the next lowest level of the pyramid, taking account of the 1.5 times resampling. If the pixel remains higher (or lower) than this closest pixel and its 8 neighbours, then the test is repeated for the level above. Since most pixels will be eliminated within a few comparisons, the cost of this detection is small and much lower than that of building the pyramid.

If the first level of the pyramid is sampled at the same rate as the input image, the highest spatial frequencies will be ignored. This is due to the initial smoothing, which is needed to provide separation of peaks for robust detection. Therefore, we expand the input image by a factor of 2, using bilinear interpolation, prior to building the pyramid. This gives on the order of 1000 key points for a typical 512×512 pixel image, compared to only a quarter as many without the initial expansion.

3.1. SIFT key stability

To characterize the image at each key location, the smoothed image A at each level of the pyramid is processed to extract image gradients and orientations. At each pixel $A_{ij}$, the image gradient magnitude $M_{ij}$ and orientation $R_{ij}$ are computed using pixel differences:

$$M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}$$
$$R_{ij} = \operatorname{atan2}\left(A_{ij} - A_{i+1,j},\; A_{i,j+1} - A_{ij}\right)$$

The pixel differences are efficient to compute and provide sufficient accuracy due to the substantial level of previous smoothing. The effective half-pixel shift in position is compensated for when determining key location. Robustness to illumination change is enhanced by thresholding the gradient magnitudes at a value of 0.1 times the maximum possible gradient value. This reduces the effect of a change in illumination direction for a surface with 3D relief, as an illumination change may result in large changes to gradient magnitude but is likely to have less influence on gradient orientation.
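A minimal sketch of this pixel-difference gradient computation, assuming a smoothed pyramid level with intensities in [0, 1] (so the largest possible gradient magnitude is sqrt(2)); the clipping factor follows the 0.1 threshold given above.

```python
import numpy as np

def gradients(a, threshold=0.1):
    """Pixel-difference gradient magnitude and orientation for a
    smoothed pyramid level `a` with intensities in [0, 1].

    Mirrors the M_ij / R_ij equations above; magnitudes are clipped at
    `threshold` times the maximum possible gradient value (sqrt(2) for
    unit-range intensities) to reduce illumination sensitivity.
    """
    d1 = a[:-1, :-1] - a[1:, :-1]    # A_ij - A_(i+1)j
    d2 = a[:-1, 1:] - a[:-1, :-1]    # A_i(j+1) - A_ij
    mag = np.sqrt(d1 ** 2 + d2 ** 2)
    ori = np.arctan2(d1, d2)
    max_grad = np.sqrt(2.0)
    return np.minimum(mag, threshold * max_grad), ori
```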
Figure 1: The second image was generated from the first by rotation, scaling, stretching, change of brightness and contrast, and addition of pixel noise. In spite of these changes, 78% of the keys from the first image have a closely matching key in the second image. These examples show only a subset of the keys to reduce clutter.

Each key location is assigned a canonical orientation so that the image descriptors are invariant to rotation. In order to make this as stable as possible against lighting or contrast changes, the orientation is determined by the peak in a histogram of local image gradient orientations. The orientation histogram is created using a Gaussian-weighted window with $\sigma$ of 3 times that of the current smoothing scale. These weights are multiplied by the thresholded gradient values and accumulated in the histogram at locations corresponding to the orientation $R_{ij}$. The histogram has 36 bins covering the 360 degree range of rotations, and is smoothed prior to peak selection.

The stability of the resulting keys can be tested by subjecting natural images to affine projection, contrast and brightness changes, and addition of noise. The location of each key detected in the first image can be predicted in the transformed image from knowledge of the transform parameters. This framework was used to select the various sampling and smoothing parameters given above, so that maximum efficiency could be obtained while retaining stability to changes.

Figure 1 shows a relatively small number of keys detected over a 2 octave range of only the larger scales (to avoid excessive clutter). Each key is shown as a square, with a line from the center to one side of the square indicating orientation. In the second half of this figure, the image is rotated by 15 degrees, scaled by a factor of 0.9, and stretched by a factor of 1.1 in the horizontal direction. The pixel intensities, in the range of 0 to 1, have 0.1 subtracted from their brightness values and the contrast reduced by multiplication by 0.9. Random pixel noise is then added to give less than 5 bits/pixel of signal. In spite of these transformations, 78% of the keys in the first image had closely matching keys in the second image at the predicted locations, scales, and orientations.

The overall stability of the keys to image transformations can be judged from Figure 2. Each entry in this table is generated from combining the results of 20 diverse test images and summarizes the matching of about 15,000 keys. Each line of the table shows a particular image transformation. The first column gives the percent of keys that have a matching key in the transformed image within $\sigma$ in location (relative to scale for that key) and a factor of 1.5 in scale. The second column gives the percent that match these criteria as well as having an orientation within 20 degrees of the prediction.

Figure 2: For various image transformations applied to a sample of 20 images, this table gives the percent of keys that are found at matching locations and scales (Match %) and that also match in orientation (Ori %).

  Image transformation             Match %   Ori %
  A. Increase contrast by 1.2       89.0     86.6
  B. Decrease intensity by 0.2      88.5     85.9
  C. Rotate by 20 degrees           85.4     81.0
  D. Scale by 0.7                   85.1     80.3
  E. Stretch by 1.2                 83.5     76.1
  F. Stretch by 1.5                 77.7     65.0
  G. Add 10% pixel noise            90.3     88.4
  H. All of A, B, C, D, E, G        78.6     71.8
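To make the canonical-orientation step of Section 3.1 concrete, here is a hedged sketch built on the gradient maps from the previous sketch; the paper does not specify the histogram-smoothing kernel, so a simple 3-bin moving average is assumed here.

```python
import numpy as np

def canonical_orientation(mag, ori, x, y, scale_sigma, nbins=36):
    """Assign a canonical orientation from a 36-bin gradient histogram.

    Thresholded gradient magnitudes near (x, y) are weighted by a
    Gaussian window whose sigma is 3 times the current smoothing scale,
    accumulated into 36 ten-degree bins, smoothed, and the peak bin's
    center angle is returned.
    """
    sigma = 3.0 * scale_sigma
    radius = int(np.ceil(3 * sigma))
    hist = np.zeros(nbins)
    for i in range(max(0, x - radius), min(mag.shape[0], x + radius + 1)):
        for j in range(max(0, y - radius), min(mag.shape[1], y + radius + 1)):
            w = np.exp(-((i - x) ** 2 + (j - y) ** 2) / (2 * sigma ** 2))
            b = int((ori[i, j] + np.pi) / (2 * np.pi) * nbins) % nbins
            hist[b] += w * mag[i, j]
    # simple (non-circular) smoothing before peak picking, an assumption
    hist = np.convolve(hist, np.ones(3) / 3, mode="same")
    return (np.argmax(hist) + 0.5) * (2 * np.pi / nbins) - np.pi
```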
4. Local image description

Given a stable location, scale, and orientation for each key, it is now possible to describe the local image region in a manner invariant to these transformations. In addition, it is desirable to make this representation robust against small shifts in local geometry, such as arise from affine or 3D projection. One approach to this is suggested by the response properties of complex neurons in the visual cortex, in which a feature position is allowed to vary over a small region while orientation and spatial frequency specificity are maintained. Edelman, Intrator & Poggio [5] have performed experiments that simulated the responses of complex neurons to different 3D views of computer graphic models, and found that the complex cell outputs provided much better discrimination than simple correlation-based matching. This can be seen, for example, if an affine projection stretches an image in one direction relative to another, which changes the relative locations of gradient features while having a smaller effect on their orientations and spatial frequencies.

This robustness to local geometric distortion can be obtained by representing the local image region with multiple images representing each of a number of orientations (referred to as orientation planes). Each orientation plane contains only the gradients corresponding to that orientation, with linear interpolation used for intermediate orientations. Each orientation plane is blurred and resampled to allow for larger shifts in positions of the gradients.

This approach can be efficiently implemented by using the same precomputed gradients and orientations for each level of the pyramid that were used for orientation selection.
For each keypoint, we use the pixel sampling from the pyramid level at which the key was detected. The pixels that fall in a circle of radius 8 pixels around the key location are inserted into the orientation planes. The orientation is measured relative to that of the key by subtracting the key's orientation. For our experiments we used 8 orientation planes, each sampled over a 4×4 grid of locations, with a sample spacing 4 times that of the pixel spacing used for gradient detection. The blurring is achieved by allocating the gradient of each pixel among its 8 closest neighbors in the sample grid, using linear interpolation in orientation and the two spatial dimensions. This implementation is much more efficient than performing explicit blurring and resampling, yet gives almost equivalent results.

In order to sample the image at a larger scale, the same process is repeated for a second level of the pyramid one octave higher. However, this time a 2×2 rather than a 4×4 sample region is used. This means that approximately the same image region will be examined at both scales, so that any nearby occlusions will not affect one scale more than the other. Therefore, the total number of samples in the SIFT key vector, from both scales, is $8 \times 4 \times 4 + 8 \times 2 \times 2$, or 160 elements, giving enough measurements for high specificity.
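A simplified sketch of this orientation-plane accumulation follows; for brevity it interpolates linearly in orientation only and uses nearest-bin voting spatially, whereas the paper interpolates in the two spatial dimensions as well.

```python
import numpy as np

def key_vector(mag, ori, key_x, key_y, key_ori,
               planes=8, grid=4, spacing=4, radius=8):
    """Simplified sketch of the orientation-plane descriptor.

    Gradients in a circle of `radius` pixels around the key are rotated
    relative to the key orientation and accumulated into `planes`
    orientation planes sampled on a `grid` x `grid` array.
    """
    desc = np.zeros((planes, grid, grid))
    half = grid * spacing / 2.0
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx * dx + dy * dy > radius * radius:
                continue
            i, j = key_x + dx, key_y + dy
            if not (0 <= i < mag.shape[0] and 0 <= j < mag.shape[1]):
                continue
            # orientation measured relative to that of the key
            rel = (ori[i, j] - key_ori) % (2 * np.pi)
            o = rel / (2 * np.pi) * planes
            o0, f = int(o) % planes, o - int(o)
            # spatial cell (nearest bin in this sketch)
            gx = min(max(int((dx + half) / spacing), 0), grid - 1)
            gy = min(max(int((dy + half) / spacing), 0), grid - 1)
            desc[o0, gx, gy] += (1 - f) * mag[i, j]          # linear
            desc[(o0 + 1) % planes, gx, gy] += f * mag[i, j]  # in orientation
    return desc.ravel()  # 8 x 4 x 4 = 128 values at this scale
```

Repeating the accumulation one octave higher with grid=2 and concatenating the two vectors gives the 8×4×4 + 8×2×2 = 160 elements described above.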
5. Indexing and matching

For indexing, we need to store the SIFT keys for sample images and then identify matching keys from new images. The problem of identifying the most similar keys for high dimensional vectors is known to have high complexity if an exact solution is required. However, a modification of the k-d tree algorithm called the best-bin-first search method (Beis & Lowe [3]) can identify the nearest neighbors with high probability using only a limited amount of computation. To further improve the efficiency of the best-bin-first algorithm, the SIFT key samples generated at the larger scale are given twice the weight of those at the smaller scale. This means that the larger scale is in effect able to filter the most likely neighbours for checking at the smaller scale. This also improves recognition performance by giving more weight to the least-noisy scale. In our experiments, it is possible to have a cut-off for examining at most 200 neighbors in a probabilistic best-bin-first search of 30,000 key vectors with almost no loss of performance compared to finding an exact solution.

An efficient way to cluster reliable model hypotheses is to use the Hough transform [1] to search for keys that agree upon a particular model pose. Each model key in the database contains a record of the key's parameters relative to the model coordinate system. Therefore, we can create an entry in a hash table predicting the model location, orientation, and scale from the match hypothesis. We use a bin size of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum model dimension for location. These rather broad bin sizes allow for clustering even in the presence of substantial geometric distortion, such as due to a change in 3D viewpoint. To avoid the problem of boundary effects in hashing, each hypothesis is hashed into the 2 closest bins in each dimension, giving a total of 16 hash table entries for each hypothesis.

6. Solution for affine parameters

The hash table is searched to identify all clusters of at least 3 entries in a bin, and the bins are sorted into decreasing order of size. Each such cluster is then subject to a verification procedure in which a least-squares solution is performed for the affine projection parameters relating the model to the image. The affine transformation of a model point $[x\ y]^T$ to an image point $[u\ v]^T$ can be written as

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$

where the model translation is $[t_x\ t_y]^T$ and the affine rotation, scale, and stretch are represented by the parameters $m_1, m_2, m_3, m_4$. We wish to solve for the transformation parameters. Each match contributes two rows to a linear system $A\mathbf{x} = \mathbf{b}$ in the six unknowns, so at least 3 matches are needed for a solution, and the least-squares estimate is given by the normal equations

$$\mathbf{x} = (A^T A)^{-1} A^T \mathbf{b}$$

Outliers can then be removed by checking for agreement between each image feature and the model, given the parameter solution: each match must agree within 15 degrees in orientation, a factor of $\sqrt{2}$ change in scale, and 0.2 times maximum model size in terms of location. If fewer than 3 points remain after discarding outliers, then the match is rejected. If any outliers are discarded, the least-squares solution is re-solved with the remaining points.
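The verification step reduces to an ordinary linear least-squares problem. A sketch follows, with NumPy's lstsq standing in for explicitly forming the normal equations:

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares affine fit  [u v]^T = M [x y]^T + t  from matches.

    Each (model, image) point match contributes two rows; at least 3
    matches are required to solve for the six parameters
    (m1, m2, m3, m4, tx, ty).
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        rows.append([x, y, 0, 0, 1, 0])
        rows.append([0, 0, x, y, 0, 1])
        rhs.extend([u, v])
    a = np.asarray(rows, dtype=float)
    b = np.asarray(rhs, dtype=float)
    params, *_ = np.linalg.lstsq(a, b, rcond=None)  # m1 m2 m3 m4 tx ty
    return params
```

In use, one would project the model points with the fitted parameters, discard matches violating the orientation, scale, and location tolerances above, and re-solve on the survivors.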
7. Experiments

The affine solution provides a good approximation to perspective projection of planar objects, so planar models provide a good initial test of the approach. The top row of Figure 3 shows three model images of rectangular planar faces of objects. The figure also shows a cluttered image containing the planar objects, and the same image is shown overlayed with the models following recognition. The model keys that are displayed are the ones used for recognition and final least-squares solution. Since only 3 keys are needed for robust recognition, it can be seen that the solutions are highly redundant and would survive substantial occlusion. Also shown are the rectangular borders of the model images, projected using the affine transform from the least-squares solution. These closely agree with the true borders of the planar regions in the image, except for small errors introduced by the perspective projection. Similar experiments have been performed for many images of planar objects, and the recognition has proven to be robust to at least a 60 degree rotation of the object in any direction away from the camera.

Although the model images and affine parameters do not account for rotation in depth of 3D objects, they are still sufficient to perform robust recognition of 3D objects over about a 20 degree range of rotation in depth away from each model view. An example of three model images is shown in the top row of Figure 4. The models were photographed on a black background, and object outlines extracted by segmenting out the background region. An example of recognition is shown in the same figure, again showing the SIFT keys used for recognition. The object outlines are projected using the affine parameter solution, but this time the agreement is not as close because the solution does not account for rotation in depth. Figure 5 shows more examples in which there is significant partial occlusion.

Figure 5: Examples of 3D object recognition with occlusion.

The images in these examples are of size 384×512 pixels. The computation times for recognition of all objects in each image are about 1.5 seconds on a Sun Sparc 10 processor, with about 0.9 seconds required to build the scale-space pyramid and identify the SIFT keys, and about 0.6 seconds to perform indexing and least-squares verification. This does not include time to pre-process each model image, which would be about 1 second per image, but would only need to be done once for initial entry into a model database.

The illumination invariance of the SIFT keys is demonstrated in Figure 6. The two images are of the same scene from the same viewpoint, except that the first image is illuminated from the upper left and the second from the center right. The full recognition system is run to identify the second image using the first image as the model, and the second image is correctly recognized as matching the first. Only SIFT keys that were part of the recognition are shown. There were 273 keys that were verified as part of the final match, which means that in each case not only was the same key detected at the same location, but it also was the closest match to the correct corresponding key in the second image. Any 3 of these keys would be sufficient for recognition. While matching keys are not found in some regions where highlights or shadows change (for example on the shiny top of the camera), in general the keys show good invariance to illumination change.

Figure 6: Stability of image keys is tested under differing illumination. The first image is illuminated from the upper left and the second from the center right. Keys shown in the bottom image were those used to match the second image to the first.

8. Connections to biological vision

The performance of human vision is obviously far superior to that of current computer vision systems, so there is potentially much to be gained by emulating biological processes. Fortunately, there have been dramatic improvements within the past few years in understanding how object recognition is accomplished in animals and humans.

Recent research in neuroscience has shown that object recognition in primates makes use of features of intermediate complexity that are largely invariant to changes in scale, location, and illumination (Tanaka [21], Perrett & Oram [16]). Some examples of such intermediate features found in inferior temporal cortex (IT) are neurons that respond to a dark five-sided star shape, a circle with a thin protruding element, or a horizontal textured region within a triangular boundary. These neurons maintain highly specific responses to shape features that appear anywhere within a large portion of the visual field and over a several octave range of scales (Ito et al. [7]). The complexity of many of these features appears to be roughly the same as for the current SIFT features, although there are also some neurons that respond to more complex shapes, such as faces. Many of the neurons respond to color and texture properties in addition to shape. The feature responses have been shown to depend on previous visual learning from exposure to specific objects containing the features (Logothetis, Pauls & Poggio [10]). These features appear to be derived in the brain by a highly computation-intensive parallel process, which is quite different from the staged filtering approach given in this paper.
However, the results are much the same: an image is transformed into a large set of local features that each match a small fraction of potential objects yet are largely invariant to common viewing transformations.

It is also known that object recognition in the brain depends on a serial process of attention to bind features to object interpretations, determine pose, and segment an object from a cluttered background [22]. This process is presumably playing the same role in verification as the parameter solving and outlier detection used in this paper, since the accuracy of interpretations can often depend on enforcing a single viewpoint constraint [11].

9. Conclusions and comments

The SIFT features improve on previous approaches by being largely invariant to changes in scale, illumination, and local affine distortions.
MATLAB image processing: translated literature (English original with Chinese translation)
Appendix A: English original

Scene recognition for mine rescue robot localization based on vision

CUI Yi-an (崔益安), CAI Zi-xing (蔡自兴), WANG Lu (王璐)

Abstract: A new scene recognition system is presented based on fuzzy logic and a hidden Markov model (HMM) that can be applied to mine rescue robot localization during emergencies. The system uses a monocular camera to acquire omni-directional images of the mine environment where the robot is located. By adopting the center-surround difference method, salient local image regions are extracted from the images as natural landmarks. These landmarks are organized using an HMM to represent the scene where the robot is, and a fuzzy logic strategy is used to match the scene and landmarks. In this way, the localization problem, which is the scene recognition problem in this system, can be converted into the evaluation problem of the HMM. These techniques give the system the ability to deal with changes in scale, 2D rotation and viewpoint. The results of experiments also prove that the system achieves a higher ratio of recognition and localization in both static and dynamic mine environments.

Key words: robot localization; scene recognition; salient image; matching strategy; fuzzy logic; hidden Markov model

1 Introduction

Search and rescue in disaster areas is a burgeoning and challenging subject in the domain of robotics [1]. Mine rescue robots are developed to enter mines during emergencies to locate possible escape routes for those trapped inside and to determine whether it is safe for humans to enter or not. Localization is a fundamental problem in this field. Localization methods based on cameras can be mainly classified into geometric, topological or hybrid ones [2]. With its feasibility and effectiveness, scene recognition has become one of the important technologies of topological localization.

Currently most scene recognition methods are based on global image features and have two distinct stages: offline training and online matching.

During the training stage, the robot collects images of the environment where it works and processes them to extract global features that represent the scene. Some approaches analyze the image data set directly to find primary features, such as the PCA method [3]. However, the PCA method is not effective in distinguishing the classes of features. Another type of approach uses appearance features including color, texture and edge density to represent the image. For example, ZHOU et al [4] used multidimensional histograms to describe global appearance features. This method is simple but sensitive to scale and illumination changes. In fact, all kinds of global image features suffer from changes in the environment.

LOWE [5] presented the SIFT method, which uses similarity-invariant descriptors formed from the characteristic scale and orientation at interest points to obtain the features. The features are invariant to image scaling, translation and rotation, and partially invariant to illumination changes. But SIFT may generate 1000 or more interest points, which may slow down the processor dramatically.

During the matching stage, the nearest neighbor strategy (NN) is widely adopted for its simplicity and intelligibility [6]. But it cannot capture the contribution of individual features to scene recognition. In experiments, NN is not good enough to express the similarity between two patterns.
Furthermore, the selected features cannot represent the scene thoroughly according to state-of-the-art pattern recognition, which makes recognition unreliable [7].

So in this work a new recognition system is presented, which is more reliable and effective when used in a complex mine environment. In this system, we improve invariance by extracting salient local image regions as landmarks, in place of the whole image, to deal with large changes in scale, 2D rotation and viewpoint. The number of interest points is also reduced effectively, which makes processing easier. A fuzzy recognition strategy is designed to recognize the landmarks in place of NN, which strengthens the contribution of individual features to scene recognition. Because of its ability to recover partial information, the hidden Markov model is adopted to organize those landmarks, which can capture the structure and relationships among them. So scene recognition can be transformed into the evaluation problem of an HMM, which makes recognition robust.

2 Salient local image region detection

Research on biological vision systems indicates that organisms (like drosophila) often pay attention to certain special regions in the scene, for their behavioral relevance or local image cues, while observing their surroundings [8]. These regions can be taken as natural landmarks to effectively represent and distinguish different environments. Inspired by this, we use the center-surround difference method to detect salient regions in multi-scale image spaces. The opponencies of color and texture are computed to create the saliency map. Following that, the sub-image centered at the salient position in S is taken as the landmark region. The size of the landmark region can be decided adaptively according to the changes of gradient orientation of the local image [11].

Mobile robot navigation requires that natural landmarks be detected stably when environments change to some extent. To validate the repeatability of landmark detection of our approach, we have done experiments on the cases of scale, 2D rotation and viewpoint changes, etc. Fig. 1 shows that the door is detected for its saliency when the viewpoint changes. More detailed analysis and results about scale and rotation can be found in our previous works [12].

Fig. 1: Experiment on viewpoint changes
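A minimal sketch of the center-surround difference idea for a single feature channel; the two Gaussian scales are assumptions for illustration, and the paper additionally combines color and texture opponency channels into the saliency map S.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(channel, center_sigma=1.0, surround_sigma=8.0):
    """Center-surround difference for one feature channel.

    A fine (center) and a coarse (surround) Gaussian blur are compared;
    large absolute differences mark salient locations.  Summing such
    maps over color and texture channels would give the saliency map S.
    """
    center = gaussian_filter(channel.astype(float), center_sigma)
    surround = gaussian_filter(channel.astype(float), surround_sigma)
    return np.abs(center - surround)

def salient_points(saliency, k=5, window=25):
    """Greedily pick the k strongest, non-overlapping saliency peaks."""
    s = saliency.copy()
    peaks = []
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(s), s.shape)
        peaks.append((i, j))
        s[max(0, i - window):i + window, max(0, j - window):j + window] = 0
    return peaks
```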
3 Scene recognition and localization

Different from other scene recognition systems, our system does not need offline training. In other words, our scenes are not classified in advance. As the robot wanders, scenes captured at fixed time intervals are used to build the vertices of a topological map, which represent the places where the robot has been. Although the map's geometric layout is ignored by the localization system, it is useful for visualization and debugging [13] and beneficial to path planning. So localization means searching for the best match of the current scene on the map. In this paper the hidden Markov model is used to organize the landmarks extracted from the current scene and to create the vertices of the topological map, for its ability to recover partial information.

Resembling a panoramic vision system, the robot looks around to get omni-directional images. From each image, salient local regions are detected and formed into a sequence, named the landmark sequence, whose order is the same as that of the image sequence. Then a hidden Markov model is created based on the landmark sequence involving k salient local image regions, which is taken as the description of the place where the robot is located. In our system the EVI-D70 camera has a view field of ±170°. Considering the overlap effect, we sample the environment every 45° to get 8 images.

Taking the 8 images as hidden states Si (1 ≤ i ≤ 8), the created HMM can be illustrated by Fig. 2. The parameters of the HMM, aij and bjk, are obtained by learning, using the Baum-Welch algorithm [14]. The threshold of convergence is set as 0.001. As for the edges of the topological map, we assign them distance information between two vertices. The distances can be computed according to odometry readings.

Fig. 2: HMM of environment

To locate itself on the topological map, the robot must run its 'eye' over the environment and extract a landmark sequence L1′-Lk′, then search the map for the best matched vertex (scene). Different from traditional probabilistic localization [15], in our system the localization problem can be converted into the evaluation problem of the HMM. The vertex with the greatest evaluation value, which must also be greater than a threshold, is taken as the best matched vertex, which indicates the most likely place where the robot is.

4 Match strategy based on fuzzy logic

One of the key issues in the image matching problem is choosing the most effective features or descriptors to represent the original image. Due to robot movement, the extracted landmark regions will change at the pixel level. So the descriptors or features chosen should be invariant to some extent with respect to changes of scale, rotation and viewpoint, etc. In this paper, we use 4 features commonly adopted in the community, briefly described as follows.

GO: Gradient orientation. It has been proved that illumination and rotation changes are likely to have less influence on it [5].

ASM and ENT: Angular second moment and entropy, which are two texture descriptors.

H: Hue, which is used to describe the fundamental information of the image.

Another key issue in the matching problem is choosing a good match strategy or algorithm. Usually the nearest neighbor strategy (NN) is used to measure the similarity between two patterns. But we have found in experiments that NN cannot adequately exhibit the individual descriptor's or feature's contribution to the similarity measurement. As indicated in Fig. 4, the input image Fig. 4(a) comes from a different view of Fig. 4(b). But the distance between Figs. 4(a) and (b) computed by Jeffrey divergence is larger than that to Fig. 4(c).

To solve the problem, we design a new match algorithm based on fuzzy logic for exhibiting the subtle changes of each feature (its detailed specification is omitted from this excerpt). The landmark in the database whose fused similarity degree is higher than any other's is taken as the best match. The match results of Figs. 4(b) and (c) are demonstrated in Fig. 3. As indicated, this method can measure the similarity between two patterns effectively.

Fig. 3: Similarity computed using fuzzy strategy
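Since the paper's precise fuzzy algorithm is not reproduced in this excerpt, the following is only an illustrative sketch of one way such a fusion could look, with Gaussian membership functions and feature weights that are purely assumed, not the authors' definitions.

```python
import numpy as np

def fuzzy_similarity(feat_a, feat_b, spreads):
    """Illustrative fuzzy fusion of per-feature similarities.

    `feat_a` and `feat_b` are dicts holding the four features used in
    the paper (GO, ASM, ENT, H).  Each absolute difference is mapped to
    a fuzzy membership degree in [0, 1] by a Gaussian membership
    function with a feature-specific spread; the fused similarity is
    the weighted mean.  Weights and memberships here are assumptions.
    """
    weights = {"GO": 0.4, "ASM": 0.2, "ENT": 0.2, "H": 0.2}  # assumed
    fused = 0.0
    for name, w in weights.items():
        diff = abs(feat_a[name] - feat_b[name])
        membership = np.exp(-(diff / spreads[name]) ** 2)
        fused += w * membership
    return fused  # in [0, 1]; the highest-scoring landmark wins
```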
5 Experiments and analysis

The localization system has been implemented on a mobile robot built by our laboratory. The vision system is composed of a CCD camera and an IVC-4200 frame-grabber. The image resolution is set to 400×320 and the sampling frequency is set to 10 frames/s. The computing system, carried by the robot, is composed of a 1 GHz processor and 512 MB of memory. At present the robot works in indoor environments.

Because an HMM is adopted to represent and recognize the scene, our system has the ability to capture the discrimination in the distribution of salient local image regions and to distinguish similar scenes effectively. Table 1 shows the recognition results for static environments including 5 laneways and a silo. 10 scenes are selected from each environment and HMMs are created for each scene. Then 20 scenes are collected when the robot subsequently enters each environment, to match against the 60 HMMs above.

In the table, "truth" means that the scene to be localized matches the right scene (the evaluation value of the HMM is 30% greater than the second highest evaluation). "Uncertainty" means that the evaluation value of the HMM exceeds the second highest evaluation by less than 10%. "Error match" means that the scene to be localized matches a wrong scene. In the table, the ratio of error matches is 0. But it is possible that the scene to be localized matches no scene at all, in which case new vertices are created. Furthermore, the "ratio of truth" for the silo is lower because salient cues are fewer in this kind of environment.

During automatic exploration, similar scenes can be combined. The process can be summarized as follows: when localization succeeds, the current landmark sequence is added, without repetition, to the accompanying observation sequence of the matched vertex according to orientation (including the angle of the image from which the salient local region comes and the heading of the robot). The parameters of the HMM are then learned again.

Compared with approaches using appearance features of the whole image (Method 2, M2), our system (M1) uses local salient regions to localize and map, which gives it more tolerance to the scale and viewpoint changes caused by the robot's movement, a higher ratio of recognition, and fewer vertices on the topological map. So our system has better performance in dynamic environments. These results can be seen in Table 2. Laneways 1, 2, 4 and 5 are in operation, where some miners are working, which confuses the robot.

6 Conclusions

1) Salient local image features are extracted to replace the whole image in recognition, which improves the tolerance to changes in scale, 2D rotation and viewpoint of the environment image.

2) Fuzzy logic is used to recognize the local image and emphasize each individual feature's contribution to recognition, which improves the reliability of the landmarks.

3) An HMM is used to capture the structure and relationships of those local images, which converts the scene recognition problem into the evaluation problem of the HMM.

4) The results of the above experiments demonstrate that the mine rescue robot scene recognition system achieves a higher ratio of recognition and localization.

Future work will focus on using the HMM to deal with the uncertainty of localization.
A Phase-Congruency-Based Matching Method for Heterogeneous Images
Zhao Chunyang; Zhao Huaici; Zhao Gang
Abstract: Heterogeneous image pairs differ markedly in luminance and contrast, so local feature matching methods based on gray values and gradients achieve low matching accuracy on them. To address this problem, a heterogeneous image matching method based on phase congruency and histograms of oriented gradients is proposed. First, the phase congruency method, which is invariant to luminance and contrast, extracts feature points and an edge image from each heterogeneous image; centered on each feature point, a 100×100 region of the edge image is taken as the feature region, a histogram of oriented gradients is accumulated over it, and a 64-dimensional descriptor is generated. Second, the normalized correlation function serves as the matching measure; a dual-point matching scheme keeps the two best candidate matches for each feature point, and RANSAC is applied to purify the matched pairs. Finally, matching points are precisely localized using local normalized mutual information combined with numerical optimization, raising the matching accuracy. Experimental results show that the method performs well on visible, near-infrared, medium-wave infrared, and long-wave infrared image pairs, with an average matching accuracy of up to 88%, 3.4 times that of the SURF matching method.
Journal: Laser & Infrared, 2014, No. 10: 1174-1178.
Keywords: heterogeneous image matching; phase congruency; histogram of oriented gradients; normalized cross-correlation; normalized mutual information
Affiliations: Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016; Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016; University of Chinese Academy of Sciences, Beijing 100049
CLC number: TP391
Heterogeneous image matching is widely used in medicine, remote sensing [1], image fusion, scene matching [2], and many other fields.
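The descriptor construction lends itself to a short sketch. The phase-congruency edge extraction itself is a nontrivial filter-bank computation and is not reproduced here; assuming an edge patch is already available, one plausible reading of the 64-dimensional descriptor is a 4×4 grid of cells over the 100×100 patch with a 4-bin orientation histogram per cell. The cell/bin split and the normalization are our assumptions, not stated in the abstract.

```python
import numpy as np

def hog_descriptor(edge_patch, cells=4, bins=4):
    """64-D gradient-orientation descriptor of a 100x100 edge patch.

    The cells x cells spatial grid and the bin count are assumptions; the
    abstract only fixes the patch size (100x100) and the length
    cells**2 * bins = 64 of the descriptor.
    """
    gy, gx = np.gradient(edge_patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    h, w = edge_patch.shape
    cy, cx = h // cells, w // cells
    desc = np.zeros((cells, cells, bins))
    for i in range(cells):
        for j in range(cells):
            m = mag[i*cy:(i+1)*cy, j*cx:(j+1)*cx].ravel()
            a = ang[i*cy:(i+1)*cy, j*cx:(j+1)*cx].ravel()
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            for b in range(bins):
                desc[i, j, b] = m[idx == b].sum()    # magnitude-weighted bins
    desc = desc.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc               # normalized 64-D vector

def ncc(d1, d2):
    """Normalized correlation of two descriptors (the matching measure)."""
    a, b = d1 - d1.mean(), d2 - d2.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

patch = np.random.rand(100, 100)   # stand-in for a phase-congruency edge patch
d = hog_descriptor(patch)          # 64-dimensional descriptor
```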
Vision-Based Ground Target Tracking for a Rotor UAV (in English)
I. INTRODUCTION. The UAV is one of the best platforms for performing dull, dirty, or dangerous (3D) tasks [1]. UAVs can be used in various applications where human intervention is impossible, which greatly expands the application space of visual tracking. Vision-based ground target tracking for UAVs has drawn great attention from experts in cybernetics and robotics, and has become one of the most active research directions in UAV applications. Currently, researchers from America, Britain, France, and Sweden are at the cutting edge of this field [2]. Typical visual tracking platforms for UAVs include the Scan Eagle, GTMax, RQ-11, RQ-16, and DragonFly. Thanks to advantages such as small size, light weight, flexibility, portability, and low cost, the rotor UAV has broad application prospects in traffic monitoring, resource exploration, power-line patrol, forest fire prevention, aerial photography, atmospheric monitoring, and similar fields [3]. A vision-based ground target tracking system for a rotor UAV acquires images from a camera installed on the low-flying UAV, recognizes the target in the images, estimates the target's motion state, and finally, according to the visual information, automatically regulates the pan-tilt-zoom (PTZ) camera to keep the target at the center of the camera view. In view of the current state of international research, the study of ground target tracking systems for
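The closed loop just described (detect the target, measure its offset from the image center, command the PTZ camera to re-center it) can be sketched in a few lines; the Box type, the proportional gain kp, and the returned pan/tilt commands below are illustrative placeholders, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Detected target bounding box in pixel coordinates (hypothetical)."""
    x: float
    y: float
    w: float
    h: float

def track_step(box: Box, frame_w: int, frame_h: int, kp: float = 0.05):
    """One iteration of the re-centering loop: given the detected target
    box, return (pan, tilt) corrections that drive the target toward the
    image center. The gain kp is a placeholder value."""
    ex = (box.x + box.w / 2.0) - frame_w / 2.0   # horizontal pixel error
    ey = (box.y + box.h / 2.0) - frame_h / 2.0   # vertical pixel error
    return kp * ex, kp * ey                       # PTZ pan/tilt commands

pan, tilt = track_step(Box(500, 300, 40, 40), 640, 480)  # re-centering commands
```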
Space Target Image Classification Based on Secondary Representation
Jiang Feiyun; Sun Rui; Zhang Xudong; Li Chao
Abstract: According to the characteristics of space target images, a novel method for space target image classification based on local invariant features is proposed. The method first extracts the local invariant features of each image and uses a Gaussian Mixture Model (GMM) to establish global visual modes. A co-occurrence matrix for the entire training set is then constructed by matching local invariant features to the visual modes by maximum a posteriori probability, and a Probabilistic Latent Semantic Analysis (PLSA) model is used to obtain the latent class vector of each image, yielding the secondary representation. Finally, an SVM is used to implement the classification. Experimental results demonstrate the effectiveness of the proposed method.
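The secondary-representation step can be sketched compactly. Below is a minimal PLSA EM loop over the images × visual-words co-occurrence matrix; it returns p(z|d), the latent class vector of each image, which is what the SVM is trained on. The vocabulary and topic sizes are illustrative; for the vocabulary step one could fit sklearn.mixture.GaussianMixture on the local descriptors, and for the last stage train sklearn.svm.SVC on the rows of p(z|d).

```python
import numpy as np

def plsa(counts, n_topics=10, n_iter=50, seed=0):
    """Minimal PLSA via EM on an images x visual-words count matrix.

    Model: p(w, d) proportional to sum_z p(z|d) p(w|z). Returns p(z|d),
    the latent-class vector of each image (the secondary representation).
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        joint /= joint.sum(2, keepdims=True) + 1e-12
        nz = counts[:, :, None] * joint            # expected counts n(d,w,z)
        # M-step: re-estimate p(w|z) and p(z|d) from expected counts
        p_w_z = nz.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nz.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d

Z = plsa(np.random.poisson(1.0, size=(20, 100)))  # 20 images, 100 visual words
# Each row of Z is an image's latent class vector, ready for an SVM.
```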
Action Recognition Based on the 3D Human Skeleton
Zhang Youmei; Chang Faliang; Liu Hongbin
Abstract: This paper presents an action recognition method based on the 3D human skeleton. The positions of the skeleton joints are first redefined to form a simplified volumetric skeleton model, and a reformative dynamic time warping (R-DTW) algorithm is then applied to align and recognize action sequences. Because people differ in body size, shape, and manner of movement, no two performances of the same action are identical; the simplified skeleton model effectively reduces this intra-class variability. Conventional DTW suffers from high computational complexity and low efficiency; on top of the conventional algorithm a two-stage "Planning & Refining" scheme is therefore designed, which markedly reduces computation and improves efficiency. Experiments on the MSR 3D Action dataset verify the method's effectiveness.
Journal: Acta Electronica Sinica, 2017, 45(4): 906-911.
Keywords: human skeleton; dynamic time warping; action recognition
Affiliation: School of Control Science and Engineering, Shandong University, Jinan 250061
CLC number: TP391.4
In recent years, human action recognition has been widely applied in human-computer interaction, intelligent surveillance, and related fields, and has become a research hotspot in machine vision and pattern recognition. Work on action recognition has centered on images and video; some recent studies combine images with depth information to raise recognition rates [1,2] or use depth information alone to speed up recognition [3,4]. Advances in depth cameras such as the Kinect have made this practical. Depth-augmented action data carries richer information but inevitably introduces redundancy, so extracting the key information that expresses an action has itself become a worthwhile research topic.
As early as 1975, Johansson [5] ran an experiment in which lights were attached to the main joints of the body in a dark room and human actions were recognized from the light points alone; the results verified the feasibility of recognizing actions from skeleton joints. Ziaeefard [6] extracted skeleton points from 2D body silhouettes, condensed whole video sequences into Cumulative Skeletonized Images (CSI), and recognized actions from histograms of the skeleton-point distribution derived from the CSI, achieving the highest recognition rate reported to date on the KTH action database. Li [3] proposed a simple and effective method for extracting 3D point-cloud pose models from depth maps and verified its ability to handle occlusion. Xia [4] extended Ziaeefard's method to three dimensions and used a Hidden Markov Model (HMM) for recognition.
This paper designs a simplified volumetric skeleton model around the spatial positions of the 3D skeleton joints; the model effectively reduces the intra-class variability caused by differences in body size, shape, and style of movement. To solve the time-varying matching of different samples, DTW is used to align the sequences, and the "Planning & Refining" scheme is introduced into its similarity-distance computation to raise efficiency while preserving recognition accuracy. The overall pipeline (Fig. 1, where each skeleton stands for one action sequence) first preprocesses the data, normalizing all skeletons to a common position; then builds the simplified volumetric skeleton model; and finally computes similarity distances between training and test samples with the improved DTW to match and recognize action sequences.
2.1 Data preprocessing. The experiments use the MSR 3D Action database, which is widely used to evaluate skeleton-based action recognition algorithms and contains 557 action sequences covering 20 human action classes. Performers do not hold one position when expressing some actions (such as jogging) but drift in different directions, so overall position varies considerably across performers and over time. To unify the sample data, every skeleton in each sample sequence is normalized to the same position, taking joint 7 (marked in Fig. 2) as the base point: with base coordinates Base = (x, y, z) and joint coordinates indexed by sample number i and joint number k, the preprocessed coordinates are obtained by translating every joint of a skeleton by the offset that moves its joint 7 onto Base (Eq. (1)).
2.2 The simplified volumetric skeleton model. Fig. 3(a) shows two skeletons S and S′ standing normally in the spatial coordinate system; their similarity distance is the sum, over corresponding joints, of the Euclidean distances between them (Eq. (2)). Corresponding joint pairs of two skeletons are offset in space, so computing the similarity distance from raw spatial coordinates produces sizeable errors. To address this, the simplified volumetric model (Fig. 3(b)) divides 3D space into grid cells: based on how densely the joints of the experimental data are distributed, the Z range 100-400 is divided into 10 intervals with step 30; in X and Y, the range 100-200 is divided into 5 intervals with step 20, and the border ranges 50-100 and 200-250 each form a single interval. Joint coordinates are then mapped into these cells, and the cell coordinates replace the raw spatial coordinates. The grid step is fixed: if it is too small the model approaches the raw coordinates, and if it is too large corresponding joint pairs can no longer be distinguished; the final step was chosen from the lengths of connected joint segments and from experiments with different steps. If two corresponding joints project into the same cell, their cell coordinates are identical; Figs. 3(c) and 3(d) show the model coordinates in the XZ and YZ planes, where joint pairs inside the same cell have Euclidean distance 0 while pairs in different cells still differ. The model therefore does not remove all offsets, but because matching accumulates the Euclidean distances of all corresponding pairs, removing even part of the offset effectively reduces the intra-class variability caused by differing skeleton sizes and shapes.
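The grid quantization of §2.2 is easy to state in code. A minimal numpy sketch using the bin edges given above (Z: 100-400 in steps of 30; X and Y: 100-200 in steps of 20, with the two border ranges kept as single bins); the joint array layout is an assumption.

```python
import numpy as np

# Bin edges from the paper: Z splits 100..400 with step 30 (10 intervals);
# X and Y split 100..200 with step 20, keeping 50..100 and 200..250 whole.
Z_EDGES = np.arange(100, 401, 30)
XY_EDGES = np.array([50, 100, 120, 140, 160, 180, 200, 250])

def quantize_skeleton(joints):
    """Map raw joint coordinates (K x 3 array of x, y, z) to grid-cell
    coordinates; corresponding joints falling in the same cell become
    identical, which is what reduces intra-class variability."""
    x = np.digitize(joints[:, 0], XY_EDGES)
    y = np.digitize(joints[:, 1], XY_EDGES)
    z = np.digitize(joints[:, 2], Z_EDGES)
    return np.stack([x, y, z], axis=1)          # integer cell coordinates

cells = quantize_skeleton(np.array([[130.0, 155.0, 210.0]]))  # one cell triple
```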
2.3 The improved DTW algorithm. DTW, proposed by Itakura [7], is widely used to match time series of unequal length, for example in speech recognition [8] and action recognition [9]. Action recognition with DTW is in essence template matching between a training sample (reference template) TR and a test sample (sample to be matched) TE. To match two sequences of unequal length, the similarity distance d(r_n, e_m) between every pair of frames is computed first, giving the distance matrix d_{N×M}, from which the final similarity distance Dist[TR, TE] between the two samples is obtained. DTW is simple but computationally heavy and slow, and many variants have targeted its cost [10-12]. Its inefficiency lies in the similarity-distance computation and the optimal-path matching, and the "Planning & Refining" scheme (Fig. 4) addresses both. Input dimensionality is a major cost factor, so path planning is first performed with only 5 joints (the head and the four limbs), which preserve the integrity of the body pose while lowering the dimensionality; this is the planning stage, Fig. 4(a). In Fig. 4(a), each cell is the similarity distance between two frames of the two action sequences; white cells are computations avoided by the path constraint, the remaining cells are matches computed from the 5 joints, and the blue cells form the final matching path. The planning stage outputs the best path Path of the time-warping match between the two samples: for every training sample i = 1..N_R and test sample j = 1..N_E (N_R and N_E are the numbers of training and test samples), the 5-joint versions of the two sequences are matched and the resulting path is stored in Path_{N_R×N_E}. To reflect action detail more completely, the refining stage then uses all the joints: following the planned path only (the black cells in Fig. 4(b); white cells need no further computation), the similarity distance between the two sequences is recomputed from the 20 skeleton joints, giving the final similarity distance Dist2(i, j) between TR_i and TE_j. The goal of the algorithm is to decide the action class of a test sample: once the minimum similarity distance Dist[TE, (TR_1, TR_2, ..., TR_N)] over all training samples is found, TE is assigned the class of its nearest training sample TR_n, i.e., Label(TE) = Label(TR_n).
3.1 Comparison with other action recognition methods. The experimental setup follows [2]: the actions in the database are divided into three groups, where AS1 and AS2 are meant to test the discrimination of rather similar actions and AS3 collects relatively complex ones (Table 1). Three experiments are run for each group: Test1 and Test2 randomly draw 1/3 and 2/3 of all data, respectively, for training, and cross-validation is additionally performed to verify the method's generalization ability. Comparative results appear in Fig. 5; according to them, the method of [13], which uses a new coordinate system to obtain angle information between key joints, performs best.
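A compact sketch of the "Planning & Refining" DTW follows: a first DTW pass on the 5-joint sequences yields a warping path by backtracking, and the final distance is then accumulated along that fixed path using all 20 joints. The frame distance, the joint indices, and the array shapes are illustrative assumptions.

```python
import numpy as np

def dtw_path(A, B):
    """Plain DTW between sequences A (n x d) and B (m x d); returns the
    optimal warping path recovered by backtracking the cost matrix."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: D[s])     # follow the cheapest step
    return path[::-1]

def plan_and_refine(tr, te, coarse_joints=(0, 3, 7, 12, 19)):
    """Stage 1 plans the path on 5 joints; stage 2 re-scores that fixed
    path with all joints. Sequences are (frames x joints x 3) arrays;
    the coarse joint indices (head and four limb ends) are illustrative."""
    path = dtw_path(tr[:, coarse_joints, :].reshape(len(tr), -1),
                    te[:, coarse_joints, :].reshape(len(te), -1))
    full_tr = tr.reshape(len(tr), -1)
    full_te = te.reshape(len(te), -1)
    return sum(np.linalg.norm(full_tr[i] - full_te[j]) for i, j in path)

tr = np.random.rand(30, 20, 3)
te = np.random.rand(36, 20, 3)
dist = plan_and_refine(tr, te)   # final similarity distance Dist2
```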
On Using Histograms of Local Invariant Features for Image Retrieval
Alaa Halawani and Hans Burkhardt
Albert-Ludwigs-University of Freiburg, Chair of Pattern Recognition and Image Processing, 79110 Freiburg, Germany
{halawani|burkhardt}@informatik.uni-freiburg.de

Abstract
In this paper we employ different methods for constructing histograms from invariant features that are computed locally around a set of salient points. These points represent, together with their neighborhood, the most important visual information in an image. The features used for constructing the histograms are evaluations of Haar integrals with nonlinear kernel functions. The resulting histograms are able to preserve the local structure of the image in addition to the fact that they are invariant to Euclidean motion. We study and compare the performance of the different histogram construction methods for a database that consists of 15000 images.
1 Introduction
Invariant image features based on Haar integrals were introduced by Schulz-Mirbach in [6]. These features have been used successfully in texture classification [5], pollen recognition [4], and image retrieval [10, 1]. For image retrieval, Siggelkow et al. [10] have used color and texture histograms that are based on Euclidean-invariant integral features. Unlike the ordinary histogram, the invariant integral features have the advantage of capturing the local structure held in the image. Experimental results have shown that these features demonstrate a very good capability in retrieving images. However, the main disadvantage is that the computation of the invariant features over the whole image is time consuming. In order to reduce the computational complexity, Siggelkow and Schael [9] have estimated the invariant features using the Monte Carlo method, computing the features for a set of randomly generated points and directions. Recently we have compared in [1] the work based on the Monte Carlo approximation of the invariant features with the extraction of these features from areas of high relevance in the image under consideration. These areas are image patches that are centered around so-called salient points. We have used the salient point extraction algorithm introduced in [2] in order to determine these patches. It was found that the computation of invariant features around the salient points enhances the performance of the retrieval system and makes it more robust for cases like object scaling and viewpoint changes. This previous work has concentrated on using a single kernel function to extract features and build the histogram. However, there is a possibility to use more than one kernel function to build up a feature space and use it for image retrieval. In this paper, we study both cases (using a single or a set of kernel functions) for the construction of the invariant-feature histogram for the purpose of content-based image retrieval. We concentrate on extracting the features around the salient points only. We use the HSV color space as it was found that it performs better than the RGB color space [1]. In the case of using a set of kernels to extract feature vectors around the points, we distribute the resulting vectors in several clusters and construct a histogram of the cluster numbers rather than the feature vectors themselves. This is done in order to overcome the high dimensionality of the feature vectors (which makes histogram construction difficult). We consider both one-dimensional and two-dimensional cluster-number histograms. The latter reflects the spatial relationship between the pattern at each point and the patterns of its neighboring points, based on cluster numbers assigned to these patterns. The paper is organized as follows: In section 2 we explain the process of calculating the invariant features. A brief description of the different ways used to construct the histograms is given in section 3. A summary of the experimental results is presented in section 4. Finally, a conclusion is given in section 5.

An invariant feature of an image I is obtained by integrating a nonlinear kernel function f over the transformation group G, which here consists of rotations and translations:

IF(I) = \frac{1}{2\pi MN} \int_{r=0}^{M} \int_{c=0}^{N} \int_{\theta=0}^{2\pi} f(g(r, c, \theta)\, I)\, d\theta\, dc\, dr \qquad (1)

where IF(I) is the invariant feature of the image, M, N are the dimensions of the image, and g is an element of the transformation group G. Because of the discrete nature of the image, IF is approximated by choosing r and c to be integers and by varying θ in a discrete manner, producing q samples:

IF(I) \approx \frac{1}{qMN} \sum_{r=0}^{M-1} \sum_{c=0}^{N-1} \sum_{j=0}^{q-1} f\!\left(g\!\left(r, c, \theta = j\,\tfrac{2\pi}{q}\right) I\right) \qquad (2)

Bilinear interpolation is applied when the samples do not fall onto the image grid. The above equation suggests that invariant features are computed by applying a nonlinear function, f, on the neighborhood of each pixel in the image, then summing up all the results to get a single value representing the invariant feature. Using several different functions finally builds up a feature space. Much of the local information is lost by summing up the local results, which makes the discrimination capability of the features weak. In order to preserve the local information, Siggelkow et al. [10] replaced the summation over (r, c) by histogramming:

IF(I) = \operatorname{hist}_{(r,c)} \left( \frac{1}{q} \sum_{j=0}^{q-1} f\!\left(g\!\left(r, c, \theta = j\,\tfrac{2\pi}{q}\right) I\right) \right) \qquad (3)
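To make the computation concrete, here is a minimal numpy sketch of equations (2) and (3). The kernel is illustrative: a monomial f that multiplies two gray values at fixed offsets from the pixel, in the spirit of the kernels used in this line of work; the offsets, the number of angle samples q, and the histogram bin count are our choices, not the paper's.

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinear interpolation of img at the real-valued point (y, x)."""
    h, w = img.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1] +
            dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def local_invariants(img, q=8, offsets=((0.0, 1.0), (2.0, 0.0))):
    """Per-pixel rotation-averaged kernel responses (inner sum of eq. (2)).

    For each pixel the two sample offsets are rotated through q angles,
    the kernel multiplies the two interpolated gray values, and the q
    results are averaged into one locally invariant value per pixel.
    """
    M, N = img.shape
    out = np.zeros((M, N))
    for r in range(M):
        for c in range(N):
            acc = 0.0
            for j in range(q):
                t = 2.0 * np.pi * j / q
                ct, st = np.cos(t), np.sin(t)
                v = 1.0
                for dy, dx in offsets:         # rotate each offset by theta
                    v *= bilinear(img, r + ct * dy - st * dx,
                                       c + st * dy + ct * dx)
                acc += v                       # illustrative monomial kernel f
            out[r, c] = acc / q
    return out

img = np.random.rand(64, 64)
loc = local_invariants(img)
IF_sum = loc.mean()                        # eq. (2): average over all pixels
IF_hist, _ = np.histogram(loc, bins=32)    # eq. (3): histogram instead of sum
```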