Abstract Combined Shape Descriptors of Face Region and Contour of Human Object in MPEG Comp


Foreign-Language Translation: Robust Analysis of Feature Spaces: Color Image Segmentation

Appendix 2: Foreign-Language Translation

Robust Analysis of Feature Spaces: Color Image Segmentation

Abstract

A general technique for the recovery of significant image features is presented. The technique is based on the mean shift algorithm, a simple nonparametric procedure for estimating density gradients. Drawbacks of the current methods (including robust clustering) are avoided. Feature spaces of any nature can be processed, and as an example, color image segmentation is discussed. The segmentation is completely autonomous; only its class is chosen by the user. Thus, the same program can produce a high quality edge image, or provide, by extracting all the significant colors, a preprocessor for content-based query systems. A 512 × 512 color image is analyzed in less than 10 seconds on a standard workstation. Gray level images are handled as color images having only the lightness coordinate.

Keywords: robust pattern analysis, low-level vision, content-based indexing

1 Introduction

Feature space analysis is a widely used tool for solving low-level image understanding tasks. Given an image, feature vectors are extracted from local neighborhoods and mapped into the space spanned by their components. Significant features in the image then correspond to high density regions in this space. Feature space analysis is the procedure of recovering the centers of the high density regions, i.e., the representations of the significant image features. Histogram-based techniques and the Hough transform are examples of this approach.

When the number of distinct feature vectors is large, the size of the feature space is reduced by grouping nearby vectors into a single cell. A discretized feature space is called an accumulator. Whenever the size of the accumulator cell is not adequate for the data, serious artifacts can appear. The problem was extensively studied in the context of the Hough transform. Thus, for satisfactory results a feature space should have a continuous coordinate system.
The content of a continuous feature space can be modeled as a sample from a multivariate, multimodal probability distribution. Note that for real images the number of modes can be very large, of the order of tens.

The highest density regions correspond to clusters centered on the modes of the underlying probability distribution. Traditional clustering techniques can be used for feature space analysis, but they are reliable only if the number of clusters is small and known a priori. Estimating the number of clusters from the data is computationally expensive and not guaranteed to produce a satisfactory result.

A much too often used assumption is that the individual clusters obey multivariate normal distributions, i.e., the feature space can be modeled as a mixture of Gaussians. The parameters of the mixture are then estimated by minimizing an error criterion. For example, a large class of thresholding algorithms is based on the Gaussian mixture model of the histogram. However, there is no theoretical evidence that an extracted normal cluster necessarily corresponds to a significant image feature. On the contrary, a strong artifact cluster may appear when several features are mapped into partially overlapping regions.

Nonparametric density estimation avoids the use of the normality assumption. The two families of methods, Parzen window and k-nearest neighbors, both require additional input information (type of the kernel, number of neighbors). This information must be provided by the user, and for multimodal distributions it is difficult to guess the optimal setting.

Nevertheless, a reliable general technique for feature space analysis can be developed using a simple nonparametric density estimation algorithm. In this paper we propose such a technique whose robust behavior is superior to methods employing robust estimators from statistics.

2 Requirements for Robustness

Estimation of a cluster center is called in statistics the multivariate location problem.
To be robust, an estimator must tolerate a percentage of outliers, i.e., data points not obeying the underlying distribution of the cluster. Numerous robust techniques were proposed, and in computer vision the most widely used is the minimum volume ellipsoid (MVE) estimator proposed by Rousseeuw.

The MVE estimator is affine equivariant (an affine transformation of the input is passed on to the estimate) and has a high breakdown point (it tolerates up to half the data being outliers). The estimator finds the center of the highest density region by searching for the minimal volume ellipsoid containing at least h data points. The multivariate location estimate is the center of this ellipsoid. To avoid combinatorial explosion a probabilistic search is employed. Let the dimension of the data be p. A small number of (p+1)-tuples of points are randomly chosen. For each (p+1)-tuple the mean vector and covariance matrix are computed, defining an ellipsoid. The ellipsoid is inflated to include h points, and the one having the minimum volume provides the MVE estimate.

Based on MVE, a robust clustering technique with applications in computer vision has been proposed. The data is analyzed under several "resolutions" by applying the MVE estimator repeatedly with h values representing fixed percentages of the data points. The best cluster then corresponds to the h value yielding the highest density inside the minimum volume ellipsoid. The cluster is removed from the feature space, and the whole procedure is repeated until the space is empty. The robustness of MVE should ensure that each cluster is associated with only one mode of the underlying distribution. The number of significant clusters is not needed a priori.

The robust clustering method was successfully employed for the analysis of a large variety of feature spaces, but was found to become less reliable once the number of modes exceeded ten. This is mainly due to the normality assumption embedded into the method.
The ellipsoid defining a cluster can also be viewed as the high confidence region of a multivariate normal distribution. Arbitrary feature spaces are not mixtures of Gaussians, and constraining the shape of the removed clusters to be elliptical can introduce serious artifacts. The effect of these artifacts propagates as more and more clusters are removed. Furthermore, the estimated covariance matrices are not reliable since they are based on only p + 1 points. Subsequent postprocessing based on all the points declared inliers cannot fully compensate for an initial error.

To be able to correctly recover a large number of significant features, the problem of feature space analysis must be solved in context. In image understanding tasks the data to be analyzed originates in the image domain. That is, the feature vectors satisfy additional, spatial constraints. While these constraints are indeed used in the current techniques, their role is mostly limited to compensating for feature allocation errors made during the independent analysis of the feature space. To be robust, the feature space analysis must fully exploit the image domain information.

As a consequence of the increased role of image domain information, the burden on the feature space analysis can be reduced. First all the significant features are extracted, and only then are the clusters containing the instances of these features recovered. The latter procedure uses image domain information and avoids the normality assumption.

Significant features correspond to high density regions, and to locate these regions a search window must be employed. The number of parameters defining the shape and size of the window should be minimal, and therefore whenever possible the feature space should be isotropic. A space is isotropic if the distance between two points is independent of the location of the point pair.
The most widely used isotropic space is the Euclidean space, where a sphere, having only one parameter (its radius), can be employed as the search window. The isotropy requirement determines the mapping from the image domain to the feature space. If the isotropy condition cannot be satisfied, a Mahalanobis metric should be defined from the statement of the task.

We conclude that robust feature space analysis requires a reliable procedure for the detection of high density regions. Such a procedure is presented in the next section.

3 Mean Shift Algorithm

A simple, nonparametric technique for estimation of the density gradient was proposed in 1975 by Fukunaga and Hostetler. The idea was recently generalized by Cheng.

Assume, for the moment, that the probability density function $p(x)$ of the p-dimensional feature vectors $x$ is unimodal. This condition is for the sake of clarity only and will be removed later. A sphere $S_x$ of radius $r$, centered on $x$, contains the feature vectors $y$ such that $\|y - x\| \le r$. The expected value of the vector $z = y - x$, given $x$ and $S_x$, is

$$\mu = E[z \mid S_x] = \int_{S_x} (y - x)\, p(y \mid S_x)\, dy = \int_{S_x} (y - x)\, \frac{p(y)}{p(y \in S_x)}\, dy \qquad (1)$$

If $S_x$ is sufficiently small we can approximate

$$p(y \in S_x) = p(x)\, V_{S_x}, \quad \text{where} \quad V_{S_x} = c \cdot r^p \qquad (2)$$

is the volume of the sphere. The first order approximation of $p(y)$ is

$$p(y) = p(x) + (y - x)^T \nabla p(x) \qquad (3)$$

where $\nabla p(x)$ is the gradient of the probability density function in $x$. Then

$$\mu = \int_{S_x} \frac{(y - x)(y - x)^T}{V_{S_x}}\, \frac{\nabla p(x)}{p(x)}\, dy \qquad (4)$$

since the first term vanishes. The value of the integral is

$$\mu = \frac{r^2}{p + 2}\, \frac{\nabla p(x)}{p(x)} \qquad (5)$$

or

$$E[x \mid x \in S_x] - x = \frac{r^2}{p + 2}\, \frac{\nabla p(x)}{p(x)} \qquad (6)$$

Thus, the mean shift vector, the vector of difference between the local mean and the center of the window, is proportional to the gradient of the probability density at $x$. The proportionality factor varies as the reciprocal of $p(x)$. This is beneficial when the highest density region of the probability density function is sought. Such a region corresponds to large $p(x)$ and small $\nabla p(x)$, i.e., to small mean shifts.
On the other hand, low density regions correspond to large mean shifts (amplified also by small $p(x)$ values). The shifts are always in the direction of the probability density maximum, the mode. At the mode the mean shift is close to zero. This property can be exploited in a simple, adaptive steepest ascent algorithm.

Mean Shift Algorithm
1. Choose the radius r of the search window.
2. Choose the initial location of the window.
3. Compute the mean shift vector and translate the search window by that amount.
4. Repeat until convergence.

To illustrate the ability of the mean shift algorithm, 200 data points were generated from two normal distributions, both having unit variance. The first hundred points belonged to a zero-mean distribution, the second hundred to a distribution having mean 3.5. The data is shown as a histogram in Figure 1. It should be emphasized that the feature space is processed as an ordered one-dimensional sequence of points, i.e., it is continuous. The mean shift algorithm starts from the location of the mode detected by the one-dimensional MVE mode detector, i.e., the center of the shortest rectangular window containing half the data points. Since the data is bimodal with nearby modes, the mode estimator fails and returns a location in the trough. The starting point is marked by the cross at the top of Figure 1.

Figure 1: An example of the mean shift algorithm.

In this synthetic data example no a priori information is available about the analysis window. Its size was taken equal to that returned by the MVE estimator, 3.2828. Other, more adaptive strategies for setting the search window size can also be defined.

Table 1: Evolution of Mean Shift Algorithm

In Table 1 the initial values and the final location, shown with a star at the top of Figure 1, are given.

The mean shift algorithm is the tool needed for feature space analysis. The unimodality condition can be relaxed by randomly choosing the initial location of the search window.
The algorithm then converges to the closest high density region. The outline of a general procedure is given below.

Feature Space Analysis
1. Map the image domain into the feature space.
2. Define an adequate number of search windows at random locations in the space.
3. Find the high density region centers by applying the mean shift algorithm to each window.
4. Validate the extracted centers with image domain constraints to provide the feature palette.
5. Allocate, using image domain information, all the feature vectors to the feature palette.

The procedure is very general and applicable to any feature space. In the next section we describe a color image segmentation technique developed based on this outline.

4 Color Image Segmentation

Image segmentation, partitioning the image into homogeneous regions, is a challenging task. The richness of visual information makes bottom-up, solely image driven approaches always prone to errors. To be reliable, the current systems must be large and incorporate numerous ad-hoc procedures. The paradigms of gray level image segmentation (pixel-based, area-based, edge-based) are also used for color images. In addition, the physics-based methods take into account information about the image formation processes as well. See, for example, the reviews. The proposed segmentation technique does not consider the physical processes; it uses only the given image, i.e., a set of RGB vectors. Nevertheless, it can be easily extended to incorporate supplementary information about the input.
Color similarity is used as the homogeneity criterion.

Since perfect segmentation cannot be achieved without a top-down, knowledge driven component, a bottom-up segmentation technique should
· only provide the input into the next stage where the task is accomplished using a priori knowledge about its goal; and
· eliminate, as much as possible, the dependence on user set parameter values.

Segmentation resolution is the most general parameter characterizing a segmentation technique. While this parameter has a continuous scale, three important classes can be distinguished.

Undersegmentation corresponds to the lowest resolution. Homogeneity is defined with a large tolerance margin and only the most significant colors are retained for the feature palette. The region boundaries in a correctly undersegmented image are the dominant edges in the image.

Oversegmentation corresponds to intermediate resolution. The feature palette is rich enough that the image is broken into many small regions from which any sought information can be assembled under knowledge control. Oversegmentation is the recommended class when the goal of the task is object recognition.

Quantization corresponds to the highest resolution. The feature palette contains all the important colors in the image. This segmentation class became important with the spread of image databases. The full palette, possibly together with the underlying spatial structure, is essential for content-based queries.

The proposed color segmentation technique operates in any of these three classes. The user only chooses the desired class; the specific operating conditions are derived automatically by the program.

Images are usually stored and displayed in the RGB space. However, to ensure the isotropy of the feature space, a uniform color space with the perceived color differences measured by Euclidean distances should be used. We have chosen the $L^*u^*v^*$ space, whose coordinates are related to the RGB values by nonlinear transformations.
The daylight standard $D_{65}$ was used as reference illuminant. The chromatic information is carried by $u^*$ and $v^*$, while the lightness coordinate $L^*$ can be regarded as the relative brightness. Psychophysical experiments show that the $L^*u^*v^*$ space may not be perfectly isotropic; however, it was found satisfactory for image understanding applications. The image capture/display operations also introduce deviations which are most often neglected.

The steps of color image segmentation are presented below. The acronyms ID and FS stand for image domain and feature space respectively. All feature space computations are performed in the $L^*u^*v^*$ space.

1. [FS] Definition of the segmentation parameters.
The user only indicates the desired class of segmentation. The class definition is translated into three parameters:
· the radius of the search window, $r$;
· the smallest number of elements required for a significant color, $N_{min}$;
· the smallest number of contiguous pixels required for a significant image region, $N_{con}$.
The size of the search window determines the resolution of the segmentation, smaller values corresponding to higher resolutions. The subjective (perceptual) definition of a homogeneous region seems to depend on the "visual activity" in the image. Within the same segmentation class an image containing large homogeneous regions should be analyzed at higher resolution than an image with many textured areas. The simplest measure of the "visual activity" can be derived from the global covariance matrix. The square root of its trace, $\sigma$, is related to the power of the signal (image). The radius $r$ is taken proportional to $\sigma$. The rules defining the three segmentation class parameters are given in Table 2.
These rules were used in the segmentation of a large variety of images, ranging from simple blood cells to complex indoor and outdoor scenes. When the goal of the task is well defined and/or all the images are of the same type, the parameters can be fine tuned.

Table 2: Segmentation Class Parameters

2. [ID+FS] Definition of the search window.
The initial location of the search window in the feature space is randomly chosen. To ensure that the search starts close to a high density region, several location candidates are examined. The random sampling is performed in the image domain and a few, M = 25, pixels are chosen. For each pixel, the mean of its 3 × 3 neighborhood is computed and mapped into the feature space. If the neighborhood belongs to a larger homogeneous region, with high probability the location of the search window will be as wanted. To further increase this probability, the window containing the highest density of feature vectors is selected from the M candidates.

3. [FS] Mean shift algorithm.
To locate the closest mode the mean shift algorithm is applied to the selected search window. Convergence is declared when the magnitude of the shift becomes less than 0.1.

4. [ID+FS] Removal of the detected feature.
The pixels yielding feature vectors inside the search window at its final location are discarded from both domains. Additionally, their 8-connected neighbors in the image domain are also removed, independent of the feature vector value. These neighbors can have "strange" colors due to the image formation process and their removal cleans the background of the feature space. Since all pixels are reallocated in Step 7, possible errors will be corrected.

5. [ID+FS] Iterations.
Repeat Steps 2 to 4 until the number of feature vectors in the selected search window no longer exceeds $N_{min}$.

6. [ID] Determining the initial feature palette.
In the feature space a significant color must be based on a minimum of $N_{min}$ vectors.
Similarly, to declare a color significant in the image domain, more than $N_{min}$ pixels of that color should belong to a connected component. From the extracted colors only those are retained for the initial feature palette which yield at least one connected component in the image of size larger than $N_{min}$. The neighbors removed at Step 4 are also considered when defining the connected components. Note that the threshold is not $N_{con}$, which is used only at the postprocessing stage.

7. [ID+FS] Determining the final feature palette.
The initial feature palette provides the colors allowed when segmenting the image. If the palette is not rich enough, the segmentation resolution was not chosen correctly and should be increased to the next class. All the pixels are reallocated based on this palette. First, the pixels yielding feature vectors inside the search windows at their final location are considered. These pixels are allocated to the color of the window center without taking into account image domain information. The windows are then inflated to double volume (their radius is multiplied by $\sqrt[3]{2}$). The newly incorporated pixels are retained only if they have at least one neighbor which was already allocated to that color. The mean of the feature vectors mapped into the same color is the value retained for the final palette. At the end of the allocation procedure a small number of pixels can remain unclassified. These pixels are allocated to the closest color in the final feature palette.

8. [ID+FS] Postprocessing.
This step depends on the goal of the task. The simplest procedure is the removal from the image of all small connected components of size less than $N_{con}$. These pixels are allocated to the majority color in their 3 × 3 neighborhood, or in the case of a tie to the closest color in the feature space.

In Figure 2 the house image containing 9603 different colors is shown. The segmentation results for the three classes and the region boundaries are given in Figure 5a-f.
Note that undersegmentation yields a good edge map, while in the quantization class the original image is closely reproduced with only 37 colors. A second example using the oversegmentation class is shown in Figure 3. Note the details on the fuselage.

5 Discussion

The simplicity of the basic computational module, the mean shift algorithm, enables the feature space analysis to be accomplished very fast. From a 512 × 512 pixel image a palette of 10-20 features can be extracted in less than 10 seconds on an Ultra SPARC 1 workstation. To achieve such a speed the implementation was optimized and, whenever possible, the feature space (containing fewer distinct elements than the image domain) was used for array scanning; lookup tables were employed instead of frequently repeated computations; direct addressing instead of nested pointers; fixed point arithmetic instead of floating point calculations; partial computation of the Euclidean distances, etc.

The analysis of the feature space is completely autonomous, due to the extensive use of image domain information. All the examples in this paper, and dozens more not shown here, were processed using the parameter values given in Table 2. Recently Zhu and Yuille described a segmentation technique incorporating complex global optimization methods (snakes, minimum description length) with sensitive parameters and thresholds. To segment a color image over a hundred iterations were needed. When the images used in that work were processed with the technique described in this paper, results of the same quality were obtained unsupervised and in less than a second. The new technique can be used unmodified for segmenting gray level images, which are handled as color images with only the $L^*$ coordinate. In Figure 6 an example is shown.

The result of segmentation can be further refined by local processing in the image domain.
For example, robust analysis of the pixels in a large connected component yields the inlier/outlier dichotomy, which can then be used to recover discarded fine details.

In conclusion, we have presented a general technique for feature space analysis with applications in many low-level vision tasks like thresholding, edge detection, and segmentation. The nature of the feature space is not restricted; currently we are working on applying the technique to range image segmentation, the Hough transform, and optical flow decomposition.

Figure 2: The house image, 255 × 192 pixels, 9603 colors.

Figure 3: Color image segmentation example. (a) Original image, 512 × 512 pixels, 77041 colors. (b) Oversegmentation: 21/21 colors.

Figure 4: Performance comparison. (a) Original image, 116 × 261 pixels, 200 colors. (b) Undersegmentation: 5/4 colors. Region boundaries.

Figure 5: The three segmentation classes for the house image. The right column shows the region boundaries. (a)(b) Undersegmentation. Number of colors extracted initially and in the feature palette: 8/8. (c)(d) Oversegmentation: 24/19 colors. (e)(f) Quantization: 49/37 colors.

Figure 6: Gray level image segmentation example. (a) Original image, 256 × 256 pixels. (b) Undersegmentation: 5 gray levels. (c) Region boundaries.
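The four-step mean shift procedure of Section 3 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the optimized implementation discussed above: the function name `mean_shift`, the parameters `tol` and `max_iter`, the random seed, the starting point 1.75 and the radius 1.0 are all choices made here for illustration. Only the structure of the data (two unit-variance normal samples of 100 points each, with means 0 and 3.5) follows the synthetic example in the text.

```python
import numpy as np


def mean_shift(points, start, radius, tol=1e-3, max_iter=500):
    """Translate a spherical window of fixed radius toward a density mode.

    Each iteration replaces the window center by the mean of the points
    inside the window, i.e. it moves the window by the mean shift vector.
    `tol` and `max_iter` are illustrative stopping parameters.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]  # treat 1-D data as n points of dimension 1
    center = np.atleast_1d(np.asarray(start, dtype=float))
    for _ in range(max_iter):
        inside = np.linalg.norm(points - center, axis=1) <= radius
        if not inside.any():
            break  # empty window: there is no local mean to move toward
        shift = points[inside].mean(axis=0) - center  # mean shift vector
        center = center + shift
        if np.linalg.norm(shift) < tol:
            break  # the shift is negligible: the window sits on a mode
    return center


# Synthetic bimodal data in the spirit of the example in Section 3.
rng = np.random.default_rng(0)  # the seed is an arbitrary choice
data = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.5, 1.0, 100)])

# Start in the trough between the two modes (assumed starting point).
mode = mean_shift(data, start=1.75, radius=1.0)
print("window center after convergence:", mode)
```

Which of the two modes the window reaches from the trough depends on the sample and on the radius, so no particular output is claimed here. Running the function from several random starting locations, as in step 2 of the feature space analysis outline, yields candidates for all the high density regions.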

A performance evaluation of local descriptors

Krystian Mikolajczyk and Cordelia Schmid

Krystian Mikolajczyk: Dept. of Engineering Science, University of Oxford, Oxford, OX1 3PJ, United Kingdom, km@
Cordelia Schmid: INRIA Rhône-Alpes, 655, av. de l'Europe, 38330 Montbonnot, France, schmid@inrialpes.fr

Corresponding author is K. Mikolajczyk, km@.

Abstract

In this paper we compare the performance of descriptors computed for local interest regions, as for example extracted by the Harris-Affine detector [32]. Many different descriptors have been proposed in the literature. However, it is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [3], steerable filters [12], PCA-SIFT [19], differential invariants [20], spin images [21], SIFT [26], complex filters [37], moment invariants [43], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor, and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

Index Terms

Local descriptors, interest points, interest regions, invariance, matching, recognition.

I. INTRODUCTION

Local photometric descriptors computed for interest regions have proved to be very successful in applications such as wide baseline matching [37, 42], object recognition [10, 25], texture recognition [21], image retrieval [29, 38], robot localization [40], video data mining [41], building panoramas [4], and recognition of object categories [8, 9, 22, 35]. They are distinctive, robust to occlusion and do not require segmentation. Recent work has concentrated
on making these descriptors invariant to image transformations. The idea is to detect image regions covariant to a class of transformations, which are then used as support regions to compute invariant descriptors.

Given invariant region detectors, the remaining questions are which is the most appropriate descriptor to characterize the regions, and does the choice of the descriptor depend on the region detector. There is a large number of possible descriptors and associated distance measures which emphasize different image properties like pixel intensities, color, texture, edges etc. In this work we focus on descriptors computed on gray-value images.

The evaluation of the descriptors is performed in the context of matching and recognition of the same scene or object observed under different viewing conditions. We have selected a number of descriptors which have previously shown a good performance in such a context and compare them using the same evaluation scenario and the same test data. The evaluation criterion is recall-precision, i.e. the number of correct and false matches between two images. Another possible evaluation criterion is the ROC (Receiver Operating Characteristics) in the context of image retrieval from databases [6, 31]. The detection rate is equivalent to recall but the false positive rate is computed for a database of images instead of a single image pair. It is therefore difficult to predict the actual number of false matches for a pair of similar images.

Local features were also successfully used for object category recognition and classification.
The comparison of descriptors in this context requires a different evaluation setup. However, it is unclear how to select a representative set of images for an object category and how to prepare the ground truth, since there is no linear transformation relating images within a category. A possible solution is to select manually a few corresponding points and apply loose constraints to verify correct matches, as proposed in [18].

In this paper the comparison is carried out for different descriptors, different interest regions and for different matching approaches. Compared to our previous work [31], this paper performs a more exhaustive evaluation and introduces a new descriptor. Several descriptors and detectors have been added to the comparison and the data set contains a larger variety of scene types and transformations. We have modified the evaluation criterion and now use recall-precision for image pairs. The ranking of the top descriptors is the same as in the ROC based evaluation [31]. Furthermore, our new descriptor, gradient location and orientation histogram (GLOH), which is an extension of the SIFT descriptor, is shown to outperform SIFT as well as the other descriptors.

A. Related work

Performance evaluation has gained more and more importance in computer vision [7]. In the context of matching and recognition several authors have evaluated interest point detectors [14, 30, 33, 39]. The performance is measured by the repeatability rate, that is the percentage of points simultaneously present in two images. The higher the repeatability rate between two images, the more points can potentially be matched and the better are the matching and recognition results.
Very little work has been done on the evaluation of local descriptors in the context of matching and recognition. Carneiro and Jepson [6] evaluate the performance of point descriptors using ROC (Receiver Operating Characteristics). They show that their phase-based descriptor performs better than differential invariants. In their comparison interest points are detected by the Harris detector and the image transformations are generated artificially. Recently, Ke and Sukthankar [19] have developed a descriptor similar to the SIFT descriptor. It applies Principal Components Analysis (PCA) to the normalized image gradient patch and performs better than the SIFT descriptor on artificially generated data. The criterion recall-precision and image pairs were used to compare the descriptors.

Local descriptors (also called filters) have also been evaluated in the context of texture classification. Randen and Husoy [36] compare different filters for one texture classification algorithm. The filters evaluated in their paper are Laws masks, Gabor filters, wavelet transforms, DCT, eigenfilters, linear predictors and optimized finite impulse response filters. No single approach is identified as best. The classification error depends on the texture type and the dimensionality of the descriptors. Gabor filters were in most cases outperformed by the other filters. Varma and Zisserman [44] also compared different filters for texture classification and showed that MRF perform better than Gaussian based filter banks. Lazebnik et al. [21] propose a new invariant descriptor called "spin image" and compare it with Gabor filters in the context of texture classification. They show that the region-based spin image outperforms the point-based Gabor filter. However, the texture descriptors and the results for texture classification cannot be directly transposed to region descriptors. The regions often contain a single structure without repeated patterns, and the statistical dependency frequently explored in texture descriptors cannot be used in this context.

B. Overview

In
section II we present a state of the art on local descriptors. Section III describes the implementation details for the detectors and descriptors used in our comparison as well as our evaluation criterion and the data set. In section IV we present the experimental results. Finally, we discuss the results.

II. DESCRIPTORS

Many different techniques for describing local image regions have been developed. The simplest descriptor is a vector of image pixels. Cross-correlation can then be used to compute a similarity score between two descriptors. However, the high dimensionality of such a description results in a high computational complexity for recognition. Therefore, this technique is mainly used for finding correspondences between two images. Note that the region can be sub-sampled to reduce the dimension. Recently, Ke and Sukthankar [19] proposed to use the image gradient patch and to apply PCA to reduce the size of the descriptor.

Distribution based descriptors. These techniques use histograms to represent different characteristics of appearance or shape. A simple descriptor is the distribution of the pixel intensities represented by a histogram. A more expressive representation was introduced by Johnson and Hebert [17] for 3D object recognition in the context of range data. Their representation (spin image) is a histogram of the relative positions in the neighborhood of a 3D interest point. This descriptor was recently adapted to images [21]. The two dimensions of the histogram are distance from the center point and the intensity value. Zabih and Woodfill [45] have developed an approach robust to illumination changes. It relies on histograms of ordering and reciprocal relations between pixel intensities which are more robust than raw pixel intensities. The binary relations between intensities of several neighboring pixels are encoded by binary strings and a distribution of all possible combinations is represented by histograms. This descriptor is suitable for texture representation but a large number
of dimensions is required to build a reliable descriptor [34].

Lowe [25] proposed a scale invariant feature transform (SIFT), which combines a scale invariant region detector and a descriptor based on the gradient distribution in the detected regions. The descriptor is represented by a 3D histogram of gradient locations and orientations, see figure 1 for illustration. The contribution to the location and orientation bins is weighted by the gradient magnitude. The quantization of gradient locations and orientations makes the descriptor robust to small geometric distortions and small errors in the region detection. Geometric histogram [1] and shape context [3] implement the same idea and are very similar to the SIFT descriptor. Both methods compute a 3D histogram of location and orientation for edge points where all the edge points have equal contribution in the histogram. These descriptors were successfully used, for example, for shape recognition of drawings for which edges are reliable features.

Spatial-frequency techniques. Many techniques describe the frequency content of an image. 
The Fourier transform decomposes the image content into the basis functions. However, in this representation the spatial relations between points are not explicit and the basis functions are infinite, therefore difficult to adapt to a local approach. The Gabor transform [13] overcomes these problems, but a large number of Gabor filters is required to capture small changes in frequency and orientation. Gabor filters and wavelets [27] are frequently explored in the context of texture classification.

Differential descriptors. A set of image derivatives computed up to a given order approximates a point neighborhood. The properties of local derivatives (local jet) were investigated by Koenderink [20]. Florack et al. [11] derived differential invariants, which combine components of the local jet to obtain rotation invariance. Freeman and Adelson [12] developed steerable filters, which steer derivatives in a particular direction given the components of the local jet. Steering derivatives in the direction of the gradient makes them invariant to rotation. A stable estimation of the derivatives is obtained by convolution with Gaussian derivatives. Figure 2(a) shows Gaussian derivatives up to order 4. Baumberg [2] and Schaffalitzky and Zisserman [37] proposed to use complex filters derived from the family $K(x, y, \theta) = f(x, y)\exp(i\theta)$, where $\theta$ is the orientation. For the function $f(x, y)$ Baumberg uses Gaussian derivatives and Schaffalitzky and Zisserman apply a polynomial (cf. 
section III-B and figure 2(b)). These filters differ from the Gaussian derivatives by a linear coordinates change in filter response space.

Other techniques. Generalized moment invariants have been introduced by Van Gool et al. [43] to describe the multi-spectral nature of the image data. The invariants combine central moments defined by $M^{a}_{pq} = \iint_{\Omega} x^{p} y^{q} [I(x, y)]^{a}\,dx\,dy$ with order $p + q$ and degree $a$. The moments characterize shape and intensity distribution in a region $\Omega$. They are independent and can be easily computed for any order and degree. However, the moments of high order and degree are sensitive to small geometric and photometric distortions. Computing the invariants reduces the number of dimensions. These descriptors are therefore more suitable for color images where the invariants can be computed for each color channel and between the channels.

III. EXPERIMENTAL SETUP

In the following we first describe the region detectors used in our comparison and the region normalization necessary for computing the descriptors. We then give implementation details for the evaluated descriptors. Finally, we discuss the evaluation criterion and the image data used in the tests.

A. Support regions

Region detectors use different image measurements and are either scale or affine invariant. 
Lindeberg [23] has developed a scale-invariant “blob” detector, where a “blob” is defined by a maximum of the normalized Laplacian in scale-space. Lowe [25] approximates the Laplacian with difference-of-Gaussian (DoG) filters and also detects local extrema in scale-space. Lindeberg and Gårding [24] make the blob detector affine-invariant using an affine adaptation process based on the second moment matrix. Mikolajczyk and Schmid [29, 30] use a multi-scale version of the Harris interest point detector to localize interest points in space and then employ Lindeberg’s scheme for scale selection and affine adaptation. A similar idea was explored by Baumberg [2] as well as Schaffalitzky and Zisserman [37]. Tuytelaars and Van Gool [42] construct two types of affine-invariant regions, one based on a combination of interest points and edges and the other one based on image intensities. Matas et al. [28] introduced Maximally Stable Extremal Regions extracted with a watershed-like segmentation algorithm. Kadir et al. [18] measure the entropy of pixel intensity histograms computed for elliptical regions to find local maxima in affine transformation space. A comparison of state-of-the-art affine region detectors can be found in [33].

1) Region detectors: The detectors provide the regions which are used to compute the descriptors. If not stated otherwise the detection scale determines the size of the region. In this evaluation we have used five detectors:

Harris points [15] are invariant to rotation. The support region is a fixed size neighborhood of 41x41 pixels centered at the interest point.

Harris-Laplace regions [29] are invariant to rotation and scale changes. The points are detected by the scale-adapted Harris function and selected in scale-space by the Laplacian-of-Gaussian operator. Harris-Laplace detects corner-like structures.

Hessian-Laplace regions [25, 32] are invariant to rotation and scale changes. Points are localized in space at the local maxima of the Hessian determinant and in scale at the local maxima of the 
Laplacian-of-Gaussian. This detector is similar to the DoG approach [26], which localizes points at local scale-space maxima of the difference-of-Gaussian. Both approaches detect the same blob-like structures. However, Hessian-Laplace obtains a higher localization accuracy in scale-space, as DoG also responds to edges and detection is unstable in this case. The scale selection accuracy is also higher than in the case of the Harris-Laplace detector. Laplacian scale selection acts as a matched filter and works better on blob-like structures than on corners since the shape of the Laplacian kernel fits to the blobs. The accuracy of the detectors affects the descriptor performance.

Harris-Affine regions [32] are invariant to affine image transformations. Localization and scale are estimated by the Harris-Laplace detector. The affine neighborhood is determined by the affine adaptation process based on the second moment matrix.

Hessian-Affine regions [32, 33] are invariant to affine image transformations. Localization and scale are estimated by the Hessian-Laplace detector and the affine neighborhood is determined by the affine adaptation process.

Note that Harris-Affine differs from Harris-Laplace by the affine adaptation, which is applied to Harris-Laplace regions. In this comparison we use the same regions except that for Harris-Laplace the region shape is circular. The same holds for the Hessian based detector. Thus the number of regions is the same for affine and scale invariant detectors. Implementation details for these detectors as well as default thresholds are described in [32]. The number of detected regions varies from 200 to 3000 per image depending on the content.

2) Region normalization: The detectors provide circular or elliptic regions of different size, which depends on the detection scale. Given a detected region it is possible to change its size or shape by scale or affine covariant construction. Thus, we can modify the set of pixels which contribute to the descriptor computation. Typically, larger regions contain 
more signal variations. Hessian-Affine and Hessian-Laplace detect mainly blob-like structures for which the signal variations lie on the blob boundaries. To include these signal changes into the description, the measurement region is 3 times larger than the detected region. This factor is used for all scale and affine detectors. All the regions are mapped to a circular region of constant radius to obtain scale and affine invariance. The size of the normalized region should not be too small in order to represent the local structure at a sufficient resolution. In all experiments this size is arbitrarily set to 41 pixels. A similar patch size was used in [19]. Regions which are larger than the normalized size are smoothed before the size normalization. The parameter of the smoothing Gaussian kernel is given by the ratio measurement/normalized region size.

Spin images, differential invariants and complex filters are invariant to rotation. To obtain rotation invariance for the other descriptors the normalized regions are rotated in the direction of the dominant gradient orientation, which is computed in a small neighborhood of the region center. To estimate the dominant orientation we build a histogram of gradient angles weighted by the gradient magnitude and select the orientation corresponding to the largest histogram bin, as suggested in [25].

Illumination changes can be modeled by an affine transformation of the pixel intensities. To compensate for such affine illumination changes the image patch is normalized with mean and standard deviation of the pixel intensities within the region. The regions which are used for descriptor evaluation are normalized with this method if not stated otherwise. Derivative-based descriptors (steerable filters, differential invariants) can also be normalized by computing illumination invariants. The offset is eliminated by the differentiation operation. 
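The two patch normalizations described above (mean and standard deviation for affine illumination changes, and a magnitude-weighted histogram of gradient angles for the dominant orientation) can be sketched as follows. This is only an illustrative sketch, not the code used in the evaluation: the function names, the choice of 36 histogram bins, and the use of the whole patch rather than a small neighborhood of the region center are assumptions made for the example.

```python
import numpy as np


def normalize_illumination(patch):
    """Cancel an affine intensity change a*I + b (a > 0) using patch mean and std."""
    patch = np.asarray(patch, dtype=np.float64)
    std = patch.std()
    if std == 0.0:
        # Flat patch: only the offset can be removed.
        return patch - patch.mean()
    return (patch - patch.mean()) / std


def dominant_orientation(patch, n_bins=36):
    """Centre angle (radians) of the strongest bin of a gradient-angle histogram.

    Each pixel votes with its gradient magnitude. n_bins=36 is an assumed
    value, and the whole patch is used here for simplicity.
    """
    patch = np.asarray(patch, dtype=np.float64)
    gy, gx = np.gradient(patch)      # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)       # angles in [-pi, pi]
    hist, edges = np.histogram(
        angle, bins=n_bins, range=(-np.pi, np.pi), weights=magnitude
    )
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```

Because the normalized patch is unchanged when the input is replaced by `a * patch + b` with `a > 0`, descriptors computed on it do not see such illumination changes; the returned angle would then be used to rotate the patch before computing descriptors that are not rotation invariant by construction.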
The invariance to linear scaling of the intensities by a constant factor is obtained by dividing the higher order derivatives by the gradient magnitude raised to the appropriate power. A similar normalization is possible for moments and complex filters, but has not been implemented here.

B. Descriptors

In the following we present the implementation details for the descriptors used in our experimental evaluation. We use ten different descriptors: SIFT [25], gradient location and orientation histogram (GLOH), shape context [3], PCA-SIFT [19], spin images [21], steerable filters [12], differential invariants [20], complex filters [37], moment invariants [43], and cross-correlation of sampled pixel values. Gradient location and orientation histogram (GLOH) is a new descriptor which extends SIFT by changing the location grid and using PCA to reduce the size.

SIFT descriptors are computed for normalized image patches with the code provided by Lowe [25]. A descriptor is a 3D histogram of gradient location and orientation, where location is quantized into a 4x4 location grid and the gradient angle is quantized into 8 orientations. The resulting descriptor is of dimension 128. Figure 1 illustrates the approach. Each orientation plane represents the gradient magnitude corresponding to a given orientation. To obtain illumination invariance, the descriptor is normalized by the square root of the sum of squared components.

Gradient location-orientation histogram (GLOH) is an extension of the SIFT descriptor designed to increase its robustness and distinctiveness. We compute the SIFT descriptor for a log-polar location grid with 3 bins in radial direction (the radius set to 6, 11 and 15) and 8 in angular direction (cf. figure 1(e)), which results in 17 location bins. Note that the central bin is not divided in angular directions. The gradient orientations are quantized in 16 bins. This gives a 272 bin histogram. The size of this descriptor is reduced with PCA. The covariance matrix for PCA is estimated on 47000 image patches collected from various images (see section III-C.1). The 128 
largest eigenvectors are used for description.

Fig. 1. SIFT descriptor. (a) Detected region. (b) Gradient image and location grid. (c) Dimensions of the histogram. (d) 4 of 8 orientation planes. (e) Cartesian and the log-polar location grids. The log-polar grid shows 9 location bins used in shape context (4 in angular direction).

Shape context is similar to the SIFT descriptor, but is based on edges. Shape context is a 3D histogram of edge point locations and orientations. Edges are extracted by the Canny [5] detector. Location is quantized into 9 bins of a log-polar coordinate system as displayed in figure 1(e) with the radius set to 6, 11 and 15 and orientation quantized into 4 bins (horizontal, vertical and two diagonals). We therefore obtain a 36 dimensional descriptor. In our experiments we weight a point contribution to the histogram with the gradient magnitude. This has been shown to give better results than using the same weight for all edge points, as proposed in [3]. Note that the original shape context was computed only for edge point locations and not for orientations.

PCA-SIFT descriptor is a vector of image gradients in x and y direction computed within the support region. The gradient region is sampled at 39x39 locations, therefore the vector is of dimension 3042. The dimension is reduced to 36 with PCA.

Spin image is a histogram of quantized pixel locations and intensity values. The intensity of a normalized patch is quantized into 10 bins. A 10 bin normalized histogram is computed for each of 5 rings centered on the region. The dimension of the spin descriptor is 50.

Fig. 2. Derivative based filters. (a) Gaussian derivatives up to 4th order. (b) Complex filters up to 6th order. Note that the displayed filters are not weighted by a Gaussian, for figure clarity.

Steerable filters and differential invariants use derivatives computed by convolution with Gaussian derivatives of a fixed scale for an image patch of size 41. Changing the orientation of derivatives as proposed in [12] gives equivalent results to computing the local jet on rotated 
image patches. We use the second approach. The derivatives are computed up to 4th order, that is the descriptor has dimension 14. Figure 2(a) shows 8 of 14 derivatives; the remaining derivatives are obtained by rotation by 90°. The differential invariants are computed up to 3rd order (dimension 8). We compare steerable filters and differential invariants computed up to the same order (cf. section IV-A.3).

Complex filters are derived from the following equation: $K_{mn}(x, y) = (x + iy)^{m}(x - iy)^{n}\,G(x, y)$. The original implementation [37] has been used for generating the kernels. The kernels are computed for a unit disk of radius 1 and sampled at 41x41 locations. We use 15 filters defined by $m + n \le 6$ (swapping $m$ and $n$ just gives complex conjugate filters); the response of the filters with $m = n = 0$ is the average intensity of the region. Figure 2(b) shows 8 of 15 filters. Rotation changes the phase but not the magnitude of the response, therefore we use the modulus of each complex filter response.

Moment invariants are computed up to 2nd order and 2nd degree. The moments are computed for derivatives of an image patch with [...]

[...] change (c) & (d); viewpoint change (e) & (f); image blur (g) & (h); JPEG compression (i); and illumination (j). In the case of rotation, scale change, viewpoint change and blur, we use two different scene types. One scene type contains structured scenes, that is homogeneous regions with distinctive edge boundaries (e.g. graffiti, buildings), and the other contains repeated textures of different forms. This allows us to analyze the influence of image transformation and scene type separately.

Image rotations are obtained by rotating the camera around its optical axis in the range of 30 and 45 degrees. Scale change and blur sequences are acquired by varying the camera zoom and focus respectively. The scale changes are in the range of 2-2.5. In the case of the viewpoint change sequences the camera position varies from a fronto-parallel view to one with significant foreshortening at approximately 50-60 degrees. The light changes are introduced by varying the camera aperture. The JPEG sequence is generated with a standard 
xv image browser with the image quality parameter set to 5%. The images are either of planar scenes or the camera position was fixed during acquisition. The images are therefore always related by a homography (plane projective transformation). The ground truth homographies are computed in two steps. First, an approximation of the homography is computed using manually selected correspondences. The transformed image is warped with this homography so that it is roughly aligned with the reference image. Second, a robust small baseline homography estimation algorithm is used to compute an accurate residual homography between the reference image and the warped image, with automatically detected and matched interest points [16]. The composition of the approximate and residual homography results in an accurate homography between the images.

In section IV we display the results for image pairs from figure 3. The transformation between these images is significant enough to introduce some noise in the detected regions. Yet, many correspondences are found and the matching results are stable. Typically, the descriptor performance is higher for small image transformations but the ranking remains the same. There are few corresponding regions for large transformations and the recall-precision curves are not smooth.

A data set different from the test data was used to estimate the covariance matrices for PCA and descriptor normalization. In both cases we have used 21 image sequences of different planar scenes which are viewed under all the transformations for which we evaluate the descriptors (footnote 2).

Fig. 3. Data set. Examples of images used for the evaluation: (a)(b) Rotation, (c)(d) Zoom+rotation, (e)(f) Viewpoint change, (g)(h) Image blur, (i) JPEG compression, (j) Light change.

2) Evaluation criterion: We use a criterion similar to the one proposed in [19]. It is based on the number of correct matches and the number of false matches obtained for an image pair. 
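As a rough illustration of how counts of correct and false matches become a point on a recall versus 1-precision curve, the sketch below assumes the usual definitions: recall is the number of correct matches divided by the number of ground-truth correspondences, and 1-precision is the number of false matches divided by the total number of matches. The function names and the example counts are chosen for the example only.

```python
def recall_and_one_minus_precision(n_correct, n_false, n_correspondences):
    """Turn match counts for one distance threshold into a curve point."""
    if n_correspondences <= 0:
        raise ValueError("need at least one ground-truth correspondence")
    n_matches = n_correct + n_false
    recall = n_correct / n_correspondences
    # With no matches at all there are no false matches either.
    one_minus_precision = n_false / n_matches if n_matches else 0.0
    return recall, one_minus_precision


def counts_from_curve_point(recall, one_minus_precision, n_correspondences):
    """Invert the definitions: recover (fractional) match counts from a curve point."""
    if not 0.0 <= one_minus_precision < 1.0:
        raise ValueError("1-precision must lie in [0, 1)")
    n_correct = recall * n_correspondences
    n_false = n_correct * one_minus_precision / (1.0 - one_minus_precision)
    return n_correct, n_false


if __name__ == "__main__":
    # Example counts: 1112 correct and 1668 false matches, 3708 correspondences.
    print(recall_and_one_minus_precision(1112, 1668, 3708))
    print(counts_from_curve_point(0.3, 0.6, 3708))
```

Sweeping the distance threshold and recomputing the two counts at each value traces out the whole curve; the two quantities use different denominators, which is why they can vary independently.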
Two regions $A$ and $B$ are matched if the distance between their descriptors $D_A$ and $D_B$ is below a threshold $t$. Each descriptor from the reference image is compared with each descriptor from the transformed one and we count the number of correct matches as well as the number of false matches. The value of $t$ is varied to obtain the curves. The results are presented with recall versus 1-precision. Recall is the number of correctly matched regions with respect to the number of corresponding regions between two images of the same scene:

$$\text{recall} = \frac{\#\text{correct matches}}{\#\text{correspondences}}$$

The number of false matches relative to the total number of matches is represented by 1-precision:

$$1-\text{precision} = \frac{\#\text{false matches}}{\#\text{correct matches} + \#\text{false matches}}$$

Given recall, 1-precision and the number of corresponding regions, the number of correct matches can be determined by $\#\text{correspondences} \cdot \text{recall}$ and the number of false matches by $\#\text{correspondences} \cdot \text{recall} \cdot (1-\text{precision}) / (1 - (1-\text{precision}))$. For example, there are 3708 corresponding regions between the images used to generate figure 4(a). For a point on the GLOH curve with recall of 0.3 and 1-precision of 0.6, the number of correct matches is $3708 \cdot 0.3 \approx 1112$, and the number of false matches is $1112 \cdot 0.6 / (1 - 0.6) = 1668$. Note that recall and 1-precision are independent terms. Recall is computed with respect to the number of corresponding regions and 1-precision with respect to the total number of matches.

[Footnote 2: The data set is available at /~vgg/research/affine]

Before we start the evaluation we discuss the interpretation of figures and possible curve shapes. A perfect descriptor would give a recall equal to 1 for any precision. In practice, recall increases for an increasing distance threshold, as noise which is introduced by image transformations and region detection increases the distance between similar descriptors. Horizontal curves indicate that the recall is attained with a high precision and is limited by the specificity of the scene, i.e. the detected structures are very similar to each other and the descriptor cannot distinguish them. Another possible reason for non-increasing recall is that the remaining corresponding regions are very different from each other (partial overlap close to 50%) and therefore the descriptors are different. A slowly increasing curve shows that the descriptor is affected by the image 
degradation (viewpoint change, blur, noise etc.). If curves corresponding to different descriptors are far apart and have different slopes, then the distinctiveness and robustness of the descriptors is different for a given image transformation or scene type.

IV. EXPERIMENTAL RESULTS

In this section we present and discuss the experimental results of the evaluation. The performance is compared for affine transformations, scale changes, rotation, blur, JPEG compression and illumination changes. In the case of affine transformations we also examine different matching strategies, the influence of the overlap error and the dimension of the descriptor.

A. Affine transformations

In this section we evaluate the performance for viewpoint changes of approximately [...] degrees. This introduces a perspective transformation which can locally be approximated by an affine transformation. This is the most challenging transformation of the ones evaluated in this paper. Note that there are also some scale and brightness changes in the test images, see figure 3(e)(f). In the following we first examine different matching approaches. Second, we investigate the [...]

Pose and Illumination Invariant Face Recognition Based on 3D Face Reconstruction

ISSN 1000-9825, CODEN RUXUEW. E-mail: jos@
Journal of Software, Vol.17, No.3, March 2006, pp.525−534. DOI: 10.1360/jos170525. Tel/Fax: +86-10-62562563
© 2006 by Journal of Software. All rights reserved.

Pose and Illumination Invariant Face Recognition Based on 3D Face Reconstruction*

CHAI Xiu-Juan 1+, SHAN Shi-Guang 2, QING Lai-Yun 2, CHEN Xi-Lin 2, GAO Wen 1,2

1 (Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)
2 (ICT-ISVISION Joint R&D Laboratory for Face Recognition, Institute of Computer Technology, The Chinese Academy of Sciences, Beijing 100080, China)

+ Corresponding author: Phn: +86-10-58858300 ext 314, Fax: +86-10-58858301, E-mail: xjchai@, /

Chai XJ, Shan SG, Qing LY, Chen XL, Gao W. Pose and illumination invariant face recognition based on 3D face reconstruction. Journal of Software, 2006,17(3):525−534. /1000-9825/17/525.htm

Abstract: Changes in pose and illumination from picture to picture are two main barriers to fully automatic face recognition. In this paper, a novel method that handles pose and lighting conditions simultaneously is proposed. It calibrates the pose and lighting to a predefined reference condition through an illumination invariant 3D face reconstruction. First, some located facial landmarks and an a priori statistical deformable 3D model are used to recover an elaborate 3D shape. Based on the recovered 3D shape, the “texture image” calibrated to a standard illumination is generated by the spherical harmonics ratio image, and finally the illumination independent 3D face is reconstructed completely. The proposed method combines the strength of the statistical deformable model in describing shape information with the compact representation of illumination in the spherical frequency space, and handles pose and illumination variation simultaneously. 
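This algorithm can be used to synthesize virtual views of a given face image and to enhance the performance of face recognition. Experimental results on the CMU PIE database show that this method can significantly improve the accuracy of existing face recognition methods when pose and illumination are inconsistent between gallery and probe sets.

Key words: face recognition; 3D face reconstruction; statistical deformable model; spherical harmonic; ratio image

Abstract (translated from the Chinese version): Differences in pose and illumination between the face image to be matched and the stored prototype images are the two main bottlenecks of automatic face recognition. Existing solutions usually handle only one of the two and cannot deal with illumination and pose at the same time. This paper proposes a method that corrects pose and illumination changes in a face image simultaneously: through an illumination-invariant 3D face reconstruction, both pose and illumination are calibrated to a predefined standard condition. First, a prior statistical deformable model is combined with a number of key points on the face image to recover a fairly fine 3D face shape. Based on this reconstructed 3D shape, the spherical harmonic ratio image method is used to estimate the illumination of the input image and to extract its illumination-independent texture. The illumination-independent 3D face is thus fully reconstructed, and a virtual view of the input face under the standard pose and illumination is generated for the final classification, so that illumination and pose are handled together. Experimental results on the CMU PIE database show that this method can considerably improve the accuracy of existing face recognition methods when pose and illumination are inconsistent between the gallery and the probe set.

CLC number: TP393. Document code: A

* Supported by the National Natural Science Foundation of China under Grant No.60332010; the “100 Talents Program” of the CAS; the Shanghai Municipal Sciences and Technology Committee of China under Grant No.03DZ15013; ISVISION Technologies Co., Ltd.
Received 2005-05-16; Accepted 2005-07-11

Face recognition has broad application prospects in security, finance, law, human-computer interaction and other fields, and has therefore received wide attention from researchers. After nearly 40 years of development, the recognition rate for frontal face images with neutral expression under uniform illumination is already high [1]. In more complex situations, however, the performance of most existing systems is strongly affected by changes in pose and illumination. When the pose or illumination of a face changes, the appearance of the face image changes greatly as well, so the commonly used methods based on 2D appearance fail. Although some 2D methods (such as the multi-view approach [2]) can handle pose or illumination changes to some extent, we believe that using 3D information to improve the appearance of the 2D image is the most fundamental way to solve the pose and illumination problem.

In the early stage of research, methods based on low-dimensional subspace descriptions of face appearance were the main approach to both the pose problem and the illumination problem. Eigenfaces [2] and Fisherfaces [3] obtain empirical low-dimensional pose or illumination spaces of faces by statistical learning. Such methods are easy to implement and fairly accurate, but their performance drops severely when the imaging conditions of the test images are not similar to those of the training set. The Fisher Light-Fields algorithm proposed by Gross et al. [4] estimates the eigen light field of the head corresponding to a gallery or test image and uses the eigen light field coefficients as the feature set for the final recognition. Extending this work, Zhou proposed the Illuminating Light Field method [5], in which a Lambertian reflectance model is used to control illumination changes; this method generalizes to new illumination conditions better than Fisher Light 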
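Field. However, the algorithm needs multiple training images under multiple poses and multiple illumination conditions for modeling, which is hard to satisfy in most practical applications.

The effect of pose and illumination changes on a face image is ultimately tied to the 3D structure of the face. If the 3D shape and surface reflectance of the face are known or can be estimated, the pose and illumination problems become much easier to solve. Some model-based methods therefore try to separate the intrinsic attributes of the face (3D shape and surface reflectance) from the extrinsic imaging conditions (illumination, rendering conditions, camera parameters, etc.), so that the influence of the extrinsic conditions can be removed and only the intrinsic attributes are used for accurate classification. The best-known methods include the illumination cone, symmetric SFS, and the 3D morphable model.

Georghiades proposed the illumination cone method [6] for face recognition under varying illumination and pose. Given several (no fewer than 3) input images with the same pose and different illumination conditions, it can estimate the 3D information of the input face. In essence it is a variant of the classical photometric stereo method: illumination, 3D face shape, and surface albedo are estimated iteratively by SVD, and prior knowledge about the distribution of 3D face shapes (such as symmetry and the nose being the closest point) is finally used as a constraint to solve for the 3D shape. This method usually needs at least 7 images of each subject under different illumination conditions, which is too demanding for most applications and therefore hard to use in practice. Unlike the illumination cone, which relies on photometric stereo, Zhao proposed reconstructing the 3D face shape by shape from shading (SFS). Building on the traditional SFS method and using the prior knowledge that faces are symmetric, the symmetric SFS (SSFS) method was proposed [7]. It needs only one input face image, but the illumination of the input image and accurate symmetry axis information must be estimated by other methods, which makes it harder to use in practice.

So far, the most successful approach to pose and illumination invariant face recognition is the 3D morphable model [8]. It models the 3D shape and texture (surface reflectance) of faces statistically and separately by principal component analysis, and on this basis builds a complex imaging model that includes statistical shape and texture parameters, Phong model parameters, illumination parameters, intrinsic and extrinsic camera parameters, rendering parameters, and so on. An analysis-by-synthesis technique then estimates these parameters with an optimization algorithm, and the resulting 3D shape and texture parameters of the input face are used for the final classification. This morphable model method was used in FRVT2002 [1]. Unfortunately, it requires solving a complex continuous optimization problem involving several hundred parameters, and the iterative optimization consumes a great deal of computation time: the whole fitting process takes about 4.5 minutes on a workstation with a 2 GHz P4 processor, which is intolerable for most practical systems.

Based on the analysis above, we consider the statistical modeling used in the 3D morphable model to be the best way to exploit prior 3D information about faces, so this paper adopts a similar modeling approach. The difference is that, to avoid the overly complex parameter optimization of the 3D morphable model, we do not use a dense statistical model but only a sparse statistical model of key feature points; and instead of a complex imaging model we use the more convenient and practical spherical harmonic ratio image method for illumination estimation and for removing the effect of illumination. These measures greatly reduce the computational complexity of the algorithm, so that the whole process can be completed within 1 second (on a machine with a P4 3.2 GHz CPU). Compared with the 3D morphable model, the 3D face information reconstructed by our algorithm is of course considerably less accurate. Note, however, that our goal is face recognition that is insensitive to pose and illumination changes, not accurate 3D reconstruction, and most near-frontal face recognition systems tolerate small illumination and pose variations. The trade-off between accuracy and speed made in this paper is therefore reasonable, as our experiments also show.

Section 1 describes the proposed pose and illumination invariant face recognition method in detail: Section 1.1 introduces the 3D shape reconstruction algorithm based on the 3D sparse deformable model, and Section 1.2 describes illumination-invariant texture generation based on the spherical harmonic ratio image. Some synthesis results and the experimental results on pose and illumination invariant face recognition are given in Section 2. Finally, the conclusions of the paper are given.

1 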
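Pose and Illumination Invariant Face Recognition

The overall framework of the proposed illumination and pose invariant face recognition system is shown in Fig.1. First, face detection and eye-center localization are performed on any given face image, and a coarse pose class is obtained with a view-based subspace method. The sparse key feature points of the given face image are then labeled with the ASM or ESL algorithm [9]. Based on the 3D sparse deformable model and the 2D shape information of the face image, the 3D shape of the face in the input image is reconstructed. With the reconstructed person-specific 3D shape, the input face image is relit to a predefined reference illumination by the spherical harmonic ratio image method, which produces an illumination-invariant texture image; this completes the 3D face reconstruction under the reference illumination. The face image corrected for pose and illumination can then serve as input to an ordinary frontal face recognition system and be matched against the stored face images to obtain the recognition result. Our algorithm can therefore also be regarded as a preprocessing step for any face recognition system.

Fig.1 The framework of pose and illumination calibration for face recognition

Below we describe the two key components of this framework: 3D face shape reconstruction based on the 3D sparse deformable model, and illumination-invariant texture generation based on the spherical harmonic ratio image.

1.1 3D face shape reconstruction from a single image

Recovering the 3D structure of a specific person from a single face image is the most direct and most fundamental way to solve the pose problem. Without any assumptions, however, recovering the 3D shape from a single face image is an ill-posed problem. Ref.[10] states explicitly that the minimum number of images needed to reconstruct a 3D face is 3. To overcome the need for multiple images, we apply principal component analysis (PCA) to the shape vectors of a 3D face data set and build a statistical deformable model. This model contains the prior knowledge of the 3D shape of the face class.

1.1.1 Building the sparse statistical model

We use 100 laser-scanned 3D faces from the USF Human ID 3-D database as the training set for the statistical deformable model [11]. All these faces are normalized to a common standard orientation and position. The shape of a face is represented by 75 972 vertices; to simplify computation it is uniformly downsampled to 8 955 vertices. In our algorithm, shape and texture are processed separately with different strategies. We consider the pose of a face to be related only to the positions of some key feature points and not to the image intensity. Processing shape and texture separately avoids a complex optimization and saves computation time. We first describe how the sparse statistical model is built.

The X, Y, Z coordinates of the $n$ vertices of a face are concatenated into a shape vector that describes its 3D shape: $S = (X_1, Y_1, Z_1, \ldots, X_n, Y_n, Z_n)^T \in \Re^{3n}$. Let $m$ be the number of 3D faces used for training; each 3D face is described by a vector $S_i$, $i = 1, \ldots, m$. These shape vectors have been registered in scale, so any new 3D face shape vector $S$ can be written as a linear combination of the training shape vectors, $S = \sum_{i=1}^{m} w_i S_i$.

All face shapes are similar overall and differ only in small details. Since PCA is suited to capturing the variation along the principal directions of vectors of the same class and can filter out measurement noise, we model the 3D shapes of the training set with PCA. The eigenvalues of the covariance matrix are sorted in descending order, and the first $d \le (m-1)$ shape eigenvectors with the largest eigenvalues form the projection matrix $P$, which gives the statistical deformable model:

$$S = \bar{S} + P\alpha \qquad (1)$$

where $\bar{S}$ is the mean shape vector and $\alpha$ is the coefficient vector for the $d$-dimensional projection matrix $P$.

When the face pose is rotated, the equation above is extended. For any vertex $D = (x, y, z)^T$ of the 3D shape, the properties of PCA and Eq.(1) give $RD = R\bar{D} + RP_D\alpha$, where $R$ is the 3×3 rotation matrix characterized by 3 rotation angles, $\bar{D}$ is the coordinate of this vertex in $\bar{S}$, and $P_D$ is the 3×$d$ matrix extracted from $P$ for vertex $D$. The whole 3D shape vector therefore satisfies the same rotation. We write $V^R$ for the operation of rotating a 3D shape vector $V$ by the rotation matrix $R$; likewise $P^R$ denotes applying the same rotation to every column (3D eigenvector) of $P$. Extending Eq.(1) to multiple poses gives:

$$S^R = \bar{S}^R + P^R\alpha \qquad (2)$$

With this multi-pose statistical deformable model, a 3D face shape in any pose can be represented by the PCA model of the sample shapes transformed to the same pose. Given only one face image, however, how to use the prior knowledge of face shape for 3D shape reconstruction is still a problem. We therefore further propose the 3D sparse deformable model (3D 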
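SDM). Its purpose is to establish the correspondence between the key feature point vector of the input image and the 3D sparse deformable model, so that the coefficient vector of the 3D sparse deformable model can be obtained by optimization. We assume that this coefficient vector is also the coefficient vector of the dense statistical deformable model, which makes it easy to reconstruct the dense 3D shape of the given face.

As in the 3D shape representation, the X, Y coordinates of the $k$ key points on the 2D face image are concatenated and denoted $S_I$. Each 2D key point corresponds to exactly one point with a fixed index in the 3D shape vector. Following this correspondence, the 3D points that correspond to the 2D key points are extracted from the 3D shape and concatenated in the same way, denoted $S'$. The mean shape and the projection matrix are processed in the same way, giving the sparse $\bar{S}'$ and $P'$. This yields the sparse version of the 3D deformable model: $S' = \bar{S}' + P'\alpha$. Extending it to multiple poses gives the 3D sparse deformable model (3D SDM): $(S')^R = (\bar{S}')^R + (P')^R\alpha$.

1.1.2 3D reconstruction based on the SDM

Even with statistical prior knowledge of the 3D shape, recovering the 3D shape from a single 2D image is still difficult. Our strategy rests on the property that PCA can reconstruct the whole from a part. From the key point information of the input image and the corresponding 3D sparse deformable model we obtain the reconstruction coefficients of the PCA model, and the optimized coefficient vector is then projected onto the complete PCA model to obtain the dense 3D shape vector of the face in the input image.

We next extract the X, Y coordinates of the sparse 3D shape $S'$ to form a 2D shape component, denoted $S'_f$. Since $S'_f$ can be regarded as a part of the 3D shape $S$, the following still holds approximately: $S'_f = \bar{S}'_f + P'_f\alpha$. Here $V_f$ denotes the 2D shape vector formed by extracting, in order, the X, Y coordinates of a 3D shape $V$; $\bar{S}'_f$ and $P'_f$ are the corresponding 2D key point vectors extracted from the sparse 3D mean shape $\bar{S}'$ and the projection matrix $P'$; and $(S'_f)^R$ is the 2D shape vector formed by the X, Y coordinates of the sparse 3D shape vector $S'$ after rotation by $R$. For a 3D shape vector at any pose angle, the vector formed by the X, Y components of the sparse points corresponding to the 2D key points can therefore be written as

$$(S'_f)^R = (\bar{S}'_f)^R + (P'_f)^R\alpha \qquad (3)$$

Our goal is to reconstruct the complete 3D shape information through the shape coefficient vector $\alpha$. For a specific person it is obtained as

$$\alpha = [(P'_f)^R]^{+}\,[(S'_f)^R - (\bar{S}'_f)^R] \qquad (4)$$

where $[(P'_f)^R]^{+}$ is the pseudo-inverse, computed as $[(P'_f)^R]^{+} = \big([(P'_f)^R]^T (P'_f)^R\big)^{-1}[(P'_f)^R]^T$. This coefficient vector is computed from a partial PCA vector, and we assume that it can also serve as the coefficients of the complete PCA components. The key is therefore to compute the vector $(S'_f)^R$ from the image key feature point vector $S_I$. The relation between $S_I$ and $(S'_f)^R$ can be written as

$$S_I = c\,(S'_f)^R + T \qquad (5)$$

The rotation matrix $R$ is described by the rotation angles about the x, y, z axes. Using 5 key feature points on the face image and the 5 corresponding key 3D points on the 3D model, whose vectors are denoted $S_{I-5}$ and $S'_5$ respectively, the 3 rotation angles are obtained by a projection computation. The 5 feature points used to compute the pose parameters are the centers of the left and right eyes, the nose tip, and the left and right mouth corners. The 3D shape reconstruction algorithm is described in detail below. The iterative optimization proceeds as follows:

A. For the 5 selected key feature points, a projection relation is set up between the points in the input face image and the corresponding points on the 3D shape, $S_{I-5} = sQRS'_5 + (t_x, t_y)^T$, and the resulting system of equations is solved for the parameters of the rotation matrix $R$. Here $Q$ is the 2×3 projection matrix $Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$, which keeps only the information in the x and y directions; $s$ is the scale factor, and $t_x$, $t_y$ are the translation components in the x and y directions.

B. 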
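The translation vector $T$ and the scale factor $c$ that relate the two vectors $S_I$ and $(S'_f)^R$ are computed as follows. The scale factor is

$$c = \frac{\sum_{i=1}^{k}\big[(X_i^I - \nabla X_0)\,X_i + (Y_i^I - \nabla Y_0)\,Y_i\big]}{\sum_{i=1}^{k}\big(X_i^2 + Y_i^2\big)}$$

The translation components are then updated:

$$\nabla X = \frac{1}{k}\sum_{i=1}^{k}\big(X_i^I - cX_i\big), \qquad \nabla Y = \frac{1}{k}\sum_{i=1}^{k}\big(Y_i^I - cY_i\big)$$

and the translation vector is $T = (\nabla X, \nabla Y, \ldots, \nabla X, \nabla Y)^T$. Here $\nabla X_0$ and $\nabla Y_0$ are the values of $\nabla X$ and $\nabla Y$ from the previous iteration; in the first iteration they are set to the $t_x$ and $t_y$ obtained in step A. $X_i^I$ and $Y_i^I$ are the coordinates of the feature points in the input face image, and $X_i$ and $Y_i$ are the x and y components of the corresponding points in $(S'_f)^R$. Repeating step B 2~3 times is enough to obtain a suitable translation and scale factor.

C. Using the $T$ and $c$ obtained in step B, $(S'_f)^R$ is updated through Eq.(5).

D. With the new $(S'_f)^R$, the shape coefficient vector $\alpha$ is easily computed from Eq.(4).

E. From the equation $S' = \bar{S}' + P'\alpha$, the person-specific sparse 3D face shape $S'$ is reconstructed.

F. Steps A~E are repeated until the shape coefficient vector converges.

Finally, the dense statistical deformable model (Eq.(1)) gives the 3D shape of the input face. To obtain a finer 3D shape, the 3D shape vertices are further adjusted according to the coordinates of the feature points on the given 2D image. Once the fine 3D shape of the face is available, together with the texture image of this person, the pose can be corrected simply by rotating the 3D face model.

1.2 Illumination-invariant texture generation based on the spherical harmonic ratio image

Given the 3D shape and pose parameters recovered in the previous section, the face region can be extracted directly from the given 2D face image. The face image obtained in this way is not the true texture image, however; it changes with the illumination. Although generating the intrinsic texture of the specific person is the intuitive idea, this paper does not recover the texture directly. It instead removes the effect of lighting by calibrating the illumination of the extracted face region to a reference illumination, which forms the illumination-invariant texture image [12]. This texture under the standard reference illumination can finally be rendered onto the 3D face shape to reconstruct a complete illumination-independent 3D face.

Since the reflection equation can be regarded as a convolution, it is natural to analyze it in the frequency domain. For spherical harmonics, Basri et al. [13] showed that most of the energy of the irradiance is confined to the 3 lowest orders; the frequency-domain formula is

$$E(\alpha, \beta) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} E_{lm}Y_{lm}(\alpha, \beta) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} A_l L_{lm} Y_{lm}(\alpha, \beta) \approx \sum_{l=0}^{2}\sum_{m=-l}^{l} A_l L_{lm} Y_{lm}(\alpha, \beta) \qquad (6)$$

where $A_l$ ($A_0 = \pi$, $A_1 = 2\pi/3$, $A_2 = \pi/4$) [13] are the spherical harmonic coefficients of Lambertian reflection, $L_{lm}$ are the coefficients of the incident light, and $Y_{lm}$ are the spherical harmonic functions. Given a texture image $I$, the following holds for almost every pixel $(x, y)$: $I(x, y) = \rho(x, y)\,E(\alpha(x, y), \beta(x, y))$. Here $\alpha(x, y)$ and $\beta(x, y)$ can be computed from the normal vectors of the 3D face shape. Assume that the albedo $\rho$ of the face is constant, let $E_{lm} = A_l Y_{lm}$ denote the harmonic images, and let $E$ be the 9×$n$ matrix of the $E_{lm}$, where $n$ is the total number of pixels of the texture image. The coefficients $L$ of the illumination condition are then obtained by least squares:

$$\hat{L} = \arg\min_{L}\,\|\rho\,(EL) - I\| \qquad (7)$$

Once the illumination of the given image has been estimated, the image can easily be relit to the standard illumination. For a given point $P$ at $(x, y)$ in the image with normal $(\alpha, \beta)$ and albedo $\rho(x, y)$, the intensities at $P$ in the original image and after illumination correction are

$$I_{org}(x, y) = \rho(x, y)\sum_{l=0}^{2}\sum_{m=-l}^{l} A_l \hat{L}_{lm} Y_{lm}(\alpha, \beta), \qquad I_{can}(x, y) = \rho(x, y)\sum_{l=0}^{2}\sum_{m=-l}^{l} A_l L^{can}_{lm} Y_{lm}(\alpha, \beta) \qquad (8)$$

The ratio image between the two illuminations is defined as

$$R(x, y) = \frac{I_{can}(x, y)}{I_{org}(x, y)} = \frac{\rho(x, y)\sum_{l=0}^{2}\sum_{m=-l}^{l} A_l L^{can}_{lm} Y_{lm}(\alpha, \beta)}{\rho(x, y)\sum_{l=0}^{2}\sum_{m=-l}^{l} A_l \hat{L}_{lm} Y_{lm}(\alpha, \beta)} = \frac{\sum_{l=0}^{2}\sum_{m=-l}^{l} A_l L^{can}_{lm} Y_{lm}(\alpha, \beta)}{\sum_{l=0}^{2}\sum_{m=-l}^{l} A_l \hat{L}_{lm} Y_{lm}(\alpha, \beta)}$$
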
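(9)

Since the illumination of the given image and the reference illumination are both determined, the ratio image of the reference illumination relative to the original illumination is determined for the given person. From the original image and the ratio image, the illumination-corrected texture image is then computed as

$$I_{can}(x, y) = R(x, y) \times I_{org}(x, y) \qquad (10)$$

With the fine 3D shape and the texture from which the effect of illumination has been removed, we have reconstructed, from a single non-frontal input image under arbitrary illumination, the person-specific 3D face under the standard illumination. Points that are not visible in the texture image are filled in by interpolation.

2 Experiments and Analysis

In this section we evaluate the proposed pose and illumination calibration algorithm through pose and illumination invariant face recognition. For a non-frontal face image under arbitrary illumination, its illumination-independent 3D face is reconstructed; that is, whatever the illumination of the input image, the texture of the reconstructed 3D face is calibrated to the predefined reference illumination. The pose is corrected by rotating the 3D face to the predefined standard pose. The corrected face image is then used as input to a general face recognition system.

2.1 Pose invariant face recognition results

We first test the case where only the pose changes, on 4 pose subsets of the CMU PIE database [14]: pose set 05 (turned right by 22.5°), pose set 29 (turned left by 22.5°), pose set 37 (turned right by 45°), and pose set 11 (turned left by 45°). The gallery images come from pose set 27, which contains only frontal images. This section tests pose variation only and assumes that the illumination of the given images is balanced. No special illumination processing is therefore needed: the face region extracted from the image is used to approximate the texture image under the standard illumination, and 3D face reconstruction is then performed to correct the pose.

The face recognition method we use is Gabor PCA plus LDA, which is similar in spirit to GFC [15]. The training images are selected from the CAS-PEAL database [16]: 300 persons, with on average 6 pose images and 10 frontal images per person. Since the acquisition conditions of CAS-PEAL are very different from those of CMU PIE, the influence of similar conditions between training and test images on the recognition results is avoided. In this experiment the feature points are labeled manually.

For face recognition with pose correction, the most important point is that the corrected view must look very similar to the prototype view in the gallery. Fig.2 shows some pose-corrected results based on 3D face reconstruction for visual evaluation. The first row of Fig.2 shows the original masked images in 4 non-frontal poses, and the second row shows the corresponding pose-corrected images. The column to the right of the corrected views shows the gallery images from pose set 27, which serve as a reference for the corrected views. The corrected faces for all 4 poses are very similar to the original frontal face images in the gallery. Moreover, on a computer with a 3.2 GHz P4 processor, one complete face reconstruction with our method takes only 1 second. Compared with the 3D morphable model method of Ref.[8], the computational complexity is greatly reduced: one fitting of the 3D morphable model takes about 4.5 minutes on a workstation with a 2 GHz P4 processor [8].

We correct the pose of the images in these 4 pose subsets and then perform face recognition; the results are listed in Table 1. After correction, the cumulative recognition rates on the 4 pose sets are much higher than when the original face images are used, and the rank-1 recognition rate reaches 94.85% on average. We also compare these results with those of the eigen light field method, which can be used to handle pose and illumination in face recognition [17]. That method usually uses two different normalization strategies, 3-point normalization and multi-point normalization. Compared with the results of both strategies, our recognition rates are higher (see Table 1). The experimental results show that pose correction based on 3D face reconstruction performs well. We therefore conclude that, with reduced computation time, pose invariant face recognition based on our method still achieves satisfactory recognition rates.

Fig.2 The pose normalized images (rows: original, pose normalization; columns: P11, P29, P05, P37; gallery: P27)

The feature points used for 3D face reconstruction here are labeled manually, whereas every multi-pose feature point localization method inevitably makes errors when locating the facial key points. To measure the robustness of our algorithm to feature point localization errors, we therefore artificially add Gaussian noise to the manually labeled feature points. Based on the noisy localization results, we redo the 3D face reconstruction, the pose correction, and the subsequent face recognition, in order to verify how localization errors affect pose invariant face recognition with our strategy. To reduce the influence of other factors as far as possible and to analyze the effect of localization errors precisely, we use a simple correlation measure for recognition. We tried adding 5 groups of noise to the manually labeled facial key feature points, with mean 0 and variances of 1.0, 1.5, 2.0, 2.5 and 3.0. The pose invariant face recognition results with Gaussian noise added to the feature point localization are shown in Table 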
2.我们发现,对特征定位结果增加不同程度的高斯扰动,并没有造成识别结果发生较大的变化.。
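The alternating update of the scale factor and translation in step B can be sketched in a few lines. This is a minimal illustration only, assuming the feature points are stored as NumPy arrays of shape (k, 2); the function name `estimate_scale_translation` and its parameters are hypothetical and do not come from the paper.

```python
import numpy as np

def estimate_scale_translation(target, source, t_init=(0.0, 0.0), iterations=3):
    """Alternately estimate scale c and translation t so that target ~ c * source + t.

    target: (k, 2) feature points in the input image (X_i^I, Y_i^I).
    source: (k, 2) corresponding model points (X_i, Y_i).
    t_init: starting translation (t_x, t_y from step A in the text).
    """
    target = np.asarray(target, dtype=float)
    source = np.asarray(source, dtype=float)
    t = np.asarray(t_init, dtype=float)
    c = 1.0
    for _ in range(iterations):
        # Scale factor, using the translation from the previous iteration.
        c = np.sum((target - t) * source) / np.sum(source ** 2)
        # Translation: mean residual after applying the new scale.
        t = np.mean(target - c * source, axis=0)
    return c, t

# Usage: four zero-mean model points, scaled by 2 and shifted by (3, -1).
model = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
image = 2.0 * model + np.array([3.0, -1.0])
scale, shift = estimate_scale_translation(image, model)
print(scale, shift)
```

When the model points have zero mean, as in the usage example, the two updates decouple and a single pass is already exact; otherwise the estimates improve over a few passes, which is in line with the text's remark that two or three repetitions suffice.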

Research on Video Object Detection Based on Temporal Characteristics

Journal of CAEIT, Vol. 16, No. 2, Feb. 2021. doi: 10.3969/j.issn.1673-5692.2021.02.009

ZHOU Nian (1,2)
(1. Xidian University, Xi'an 710071, China; 2. China Academy of Electronics and Information Technology, Beijing 100041, China)

Abstract: Object detection is both a research hotspot and a difficult problem in computer vision, and is the basis of image understanding and video analysis technology. Deep learning, with its strong feature extraction and generalization ability, is the current mainstream research direction. This paper introduces typical current object detection models and, exploiting the temporal characteristics of video, proposes three strategies: feature fusion, result feedback and dual-model detection. These strategies use information from preceding frames to correct the detection results of the current frame. To some extent they mitigate the discontinuous, low-quality frame-by-frame detections caused by defocus, motion blur, occlusion, unusual viewing angles and similar problems.

Key words: deep learning; video detection; feature fusion; result feedback; dual-model detection

CLC number: TP391  Document code: A  Article ID: 1673-5692(2021)02-157-08

0 Introduction

With the continued development of smart cities and of surveillance and security systems in China, and with the explosive growth of video data, intelligent video analysis technology [1] is playing an increasingly important role.

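Specialized English Translations for Mechanical Design, Manufacturing and Automation: A Comprehensive Collection

Unit 1 Metals
Unit 2 Selection of Construction Materials

Hardenability: the property that, under specified conditions, determines the depth of hardening and the hardness distribution of a steel.

In other words, it is the ability of a steel to form a hardened layer of a given depth on quenching, and it expresses how readily the steel accepts quenching.

Whether the hardenability of a steel is good or poor is usually expressed by the depth of the hardened layer.

The greater the depth of the hardened layer, the better the hardenability of the steel.

Hardenability is an intrinsic property of the steel itself; it depends only on internal factors and not on external ones.

It is determined mainly by chemical composition, in particular by the content of alloying elements that increase hardenability, and is also related to grain size, heating temperature and holding time.

A steel with good hardenability gives uniform mechanical properties over the whole cross-section of the part, and allows a quenching medium that produces low quenching stresses to be used, which reduces distortion and cracking.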

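Hardenability depends mainly on the critical cooling rate, which in turn depends mainly on the stability of the undercooled austenite. The main factors affecting the stability of austenite are as follows.

1. Chemical composition. Carbon has the main influence. When the carbon content is below 1.2%, raising the carbon concentration in the austenite markedly lowers the critical cooling rate, shifts the C-curve to the right and increases hardenability. When the carbon content is above 1.2%, the critical cooling rate rises instead, the C-curve shifts to the left and hardenability falls.

Next in importance are the alloying elements. Except for cobalt, the great majority of alloying elements, once dissolved in austenite, shift the C-curve to the right and lower the critical cooling rate, thereby increasing hardenability.

2. Austenite grain size. The actual austenite grain size has a considerable influence on hardenability: coarse austenite grains shift the C-curve to the right and lower the critical cooling rate.

Coarse grains, however, increase the tendency of the steel to distort and crack and reduce its toughness.

3. Homogeneity of the austenite. At the same degree of undercooling, the more homogeneous the austenite composition, the lower the nucleation rate of pearlite and the longer the incubation period of the transformation; the C-curve shifts to the right, the critical cooling rate decreases and hardenability increases.

4. Initial microstructure. The fineness and distribution of the initial microstructure of the steel have a major influence on the composition of the austenite.

5. Certain elements, such as Mn and Si, help to increase hardenability but also have other adverse effects on the steel.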

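Forgeability: metals are plastic when hot. In the heated state (the required temperature differs from metal to metal) they can be worked under pressure, and are then said to be forgeable.

Forgeability: the ability of a metallic material to change shape during pressure working without cracking.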

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik (1), slazebni@
Cordelia Schmid (2), Cordelia.Schmid@inrialpes.fr
Jean Ponce (1,3), ponce@
(1) Beckman Institute, University of Illinois
(2) INRIA Rhône-Alpes, Montbonnot, France
(3) Ecole Normale Supérieure, Paris, France

Abstract

This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting “spatial pyramid” is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s “gist” and Lowe’s SIFT descriptors.

1. Introduction

In this paper, we consider the problem of recognizing the semantic category of an image. For example, we may want to classify a photograph as depicting a scene (forest, street, office, etc.) or as containing a certain object of interest. For such whole-image categorization tasks, bag-of-features methods, which represent an image as an orderless collection of local features, have recently demonstrated impressive levels of performance [7, 22, 23, 25]. However, because these methods disregard all information about the spatial layout of the features, they have severely limited descriptive ability. In particular, they are incapable of capturing shape or of segmenting an object from its background. Unfortunately, overcoming these limitations to build effective structural object descriptions has proven to be quite challenging, especially when the recognition system must be made to work in the
presence of heavy clutter, occlusion, or large viewpoint changes. Approaches based on generative part models [3, 5] and geometric correspondence search [1, 11] achieve robustness at significant computational expense. A more efficient approach is to augment a basic bag-of-features representation with pairwise relations between neighboring local features, but existing implementations of this idea [11, 17] have yielded inconclusive results. One other strategy for increasing robustness to geometric deformations is to increase the level of invariance of local features (e.g., by using affine-invariant detectors), but a recent large-scale evaluation [25] suggests that this strategy usually does not pay off.

Though we remain sympathetic to the goal of developing robust and geometrically invariant structural object representations, we propose in this paper to revisit “global” non-invariant representations based on aggregating statistics of local features over fixed subregions. We introduce a kernel-based recognition method that works by computing rough geometric correspondence on a global scale using an efficient approximation technique adapted from the pyramid matching scheme of Grauman and Darrell [7]. Our method involves repeatedly subdividing the image and computing histograms of local features at increasingly fine resolutions. As shown by experiments in Section 5, this simple operation suffices to significantly improve performance over a basic bag-of-features representation, and even over methods based on detailed geometric correspondence.

Previous research has shown that statistical properties of the scene considered in a holistic fashion, without any analysis of its constituent objects, yield a rich set of cues to its semantic category [13]. Our own experiments confirm that global representations can be surprisingly effective not only for identifying the overall scene, but also for categorizing images as containing specific objects, even when these objects are embedded in heavy clutter and vary significantly in pose and appearance. This said, we do not advocate the direct use of a global method for object recognition (except for very restricted sorts of imagery). Instead, we envision a subordinate role for this method. It may be used to capture the “gist” of an image [21] and to inform the subsequent search for specific objects (e.g., if the image, based on its global description, is likely to be a highway, we have a high probability of finding a car, but not a toaster). In addition, the simplicity and efficiency of our method, in combination with its tendency to yield unexpectedly high recognition rates on challenging data, could make it a good baseline for “calibrating” new datasets and for evaluating more sophisticated recognition approaches.

2. Previous Work

In computer vision, histograms have a long history as a method for image description (see, e.g., [16, 19]). Koenderink and Van Doorn [10] have generalized histograms to locally orderless images, or histogram-valued scale spaces (i.e., for each Gaussian aperture at a given location and scale, the locally orderless image returns the histogram of image features aggregated over that aperture). Our spatial pyramid approach can be thought of as an alternative formulation of a locally orderless image, where instead of a Gaussian scale space of apertures, we define a fixed hierarchy of rectangular windows. Koenderink and Van Doorn have argued persuasively that locally orderless images play an important role in visual perception. Our retrieval experiments (Fig. 4) confirm that spatial pyramids can capture perceptually salient features and suggest that “locally orderless matching” may be a powerful mechanism for estimating overall perceptual similarity between images.

It is important to contrast our proposed approach with multiresolution histograms [8], which involve repeatedly subsampling an image and computing a global histogram of pixel values at each new level. In other words, a multiresolution histogram varies the resolution at which the features (intensity values) are computed, but the histogram resolution (intensity scale) stays fixed. We take the opposite approach of fixing the resolution at which the features are computed, but varying the spatial resolution at which they are aggregated. This results in a higher-dimensional representation that preserves more information (e.g., an image consisting of thin black and white stripes would retain two modes at every level of a spatial pyramid, whereas it would become indistinguishable from a uniformly gray image at all but the finest levels of a multiresolution histogram). Finally, unlike a multiresolution histogram, a spatial pyramid, when equipped with an appropriate kernel, can be used for approximate geometric matching.

The operation of “subdivide and disorder” (i.e., partition the image into subblocks and compute histograms, or histogram statistics such as means, of local features in these subblocks) has been practiced numerous times in computer vision, both for global image description [6, 18, 20, 21] and for local description of interest regions [12].
Thus, though the operation itself seems fundamental, previous methods leave open the question of what is the right subdivision scheme (although a regular 4×4 grid seems to be the most popular implementation choice), and what is the right balance between “subdividing” and “disordering.” The spatial pyramid framework suggests a possible way to address this issue: namely, the best results may be achieved when multiple resolutions are combined in a principled way. It also suggests that the reason for the empirical success of “subdivide and disorder” techniques is the fact that they actually perform approximate geometric matching.

3. Spatial Pyramid Matching

We first describe the original formulation of pyramid matching [7], and then introduce our application of this framework to create a spatial pyramid image representation.

3.1. Pyramid Match Kernels

Let $X$ and $Y$ be two sets of vectors in a $d$-dimensional feature space. Grauman and Darrell [7] propose pyramid matching to find an approximate correspondence between these two sets. Informally, pyramid matching works by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches that occur at each level of resolution. At any fixed resolution, two points are said to match if they fall into the same cell of the grid; matches found at finer resolutions are weighted more highly than matches found at coarser resolutions. More specifically, let us construct a sequence of grids at resolutions $0, \ldots, L$, such that the grid at level $\ell$ has $2^{\ell}$ cells along each dimension, for a total of $D = 2^{d\ell}$ cells. Let $H_X^{\ell}$ and $H_Y^{\ell}$ denote the histograms of $X$ and $Y$ at this resolution, so that $H_X^{\ell}(i)$ and $H_Y^{\ell}(i)$ are the numbers of points from $X$ and $Y$ that fall into the $i$th cell of the grid. Then the number of matches at level $\ell$ is given by the histogram intersection function [19]:

$$\mathcal{I}(H_X^{\ell}, H_Y^{\ell}) = \sum_{i=1}^{D} \min\big(H_X^{\ell}(i), H_Y^{\ell}(i)\big). \qquad (1)$$

In the following, we will abbreviate $\mathcal{I}(H_X^{\ell}, H_Y^{\ell})$ to $\mathcal{I}^{\ell}$.

Note that the number of matches found at level $\ell$ also includes all the matches found at the finer level $\ell+1$. Therefore, the number of new matches found at level $\ell$ is given by $\mathcal{I}^{\ell} - \mathcal{I}^{\ell+1}$ for $\ell = 0, \ldots, L-1$. The weight associated with level $\ell$ is set to $\frac{1}{2^{L-\ell}}$, which is inversely proportional to cell width at that level. Intuitively, we want to penalize matches found in larger cells because they involve increasingly dissimilar features. Putting all the pieces together, we get the following definition of a pyramid match kernel:

$$\kappa^{L}(X, Y) = \mathcal{I}^{L} + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}}\big(\mathcal{I}^{\ell} - \mathcal{I}^{\ell+1}\big) \qquad (2)$$

$$\phantom{\kappa^{L}(X, Y)} = \frac{1}{2^{L}}\,\mathcal{I}^{0} + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}}\,\mathcal{I}^{\ell}. \qquad (3)$$

Both the histogram intersection and the pyramid match kernel are Mercer kernels [7].

3.2. Spatial Matching Scheme

As introduced in [7], a pyramid match kernel works with an orderless image representation. It allows for precise matching of two collections of features in a high-dimensional appearance space, but discards all spatial information. This paper advocates an “orthogonal” approach: perform pyramid matching in the two-dimensional image space, and use traditional clustering techniques in feature space (footnote 1). Specifically, we quantize all feature vectors into $M$ discrete types, and make the simplifying assumption that only features of the same type can be matched to one another. Each channel $m$ gives us two sets of two-dimensional vectors, $X_m$ and $Y_m$, representing the coordinates of features of type $m$ found in the respective images. The final kernel is then the sum of the separate channel kernels:

$$K^{L}(X, Y) = \sum_{m=1}^{M} \kappa^{L}(X_m, Y_m). \qquad (4)$$

This approach has the advantage of maintaining continuity with the popular “visual vocabulary” paradigm; in fact, it reduces to a standard bag of features when $L = 0$.

Because the pyramid match kernel (3) is simply a weighted sum of histogram intersections, and because $c\,\min(a, b) = \min(ca, cb)$ for positive numbers, we can implement $K^{L}$ as a single histogram intersection of “long” vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions (Fig. 1). For $L$ levels and $M$ channels, the resulting vector has dimensionality $M \sum_{\ell=0}^{L} 4^{\ell} = M\,\frac{1}{3}\big(4^{L+1} - 1\big)$. Several experiments reported in Section 5 use the settings of $M = 400$ and $L = 3$, resulting in 34000-dimensional histogram intersections. However, these operations are efficient because the histogram vectors are extremely sparse (in fact, just as in [7], the computational complexity of the kernel is linear in the number of features). It must also be noted that we did not observe any significant increase in performance beyond $M = 200$ and $L = 2$, where the concatenated histograms are only 4200-dimensional.

Footnote 1: In principle, it is possible to integrate geometric information directly into the original pyramid matching framework by treating image coordinates as two extra dimensions in the feature space.

[Figure 1. Toy example of constructing a three-level pyramid. The image has three feature types, indicated by circles, diamonds, and crosses. At the top, we subdivide the image at three different levels of resolution. Next, for each level of resolution and each channel, we count the features that fall in each spatial bin. Finally, we weight each spatial histogram according to eq. (3).]

The final implementation issue is that of normalization. For maximum computational efficiency, we normalize all histograms by the total weight of all features in the image, in effect forcing the total number of features in all images to be the same. Because we use a dense feature representation (see Section 4), and thus do not need to worry about spurious feature detections resulting from clutter, this practice is sufficient to deal with the effects of variable image size.

4. Feature Extraction

This section briefly describes the two kinds of features used in the experiments of Section 5. First, we have so-called “weak features,” which are oriented edge points, i.e., points whose gradient magnitude in a given direction exceeds a minimum threshold. We extract edge points at two scales and eight orientations, for a total of $M = 16$ channels. We designed these features to obtain a representation similar to the “gist” [21] or to a global SIFT descriptor [12] of the image.

For better
discriminative power, we also utilize higher-dimensional “strong features,” which are SIFT descriptors of 16×16 pixel patches computed over a grid with spacing of 8 pixels. Our decision to use a dense regular grid instead of interest points was based on the comparative evaluation of Fei-Fei and Perona [4], who have shown that dense features work better for scene classification. Intuitively, a dense image description is necessary to capture uniform regions such as sky, calm water, or road surface (to deal with low-contrast regions, we skip the usual SIFT normalization procedure when the overall gradient magnitude of the patch is too weak). We perform k-means clustering of a random subset of patches from the training set to form a visual vocabulary. Typical vocabulary sizes for our experiments are M = 200 and M = 400.

[Figure 2. Example images from the scene category database. Categories shown: office, kitchen, living room, bedroom, store, industrial, tall building*, inside city*, street*, highway*, coast*, open country*, mountain*, forest*, suburb. The starred categories originate from Oliva and Torralba [13].]

Table 1. [...] in bold.

L | Weak features (M = 16), single-level | Weak features (M = 16), pyramid | Strong features (M = 200), single-level | Strong features (M = 200), pyramid | Strong features (M = 400), single-level | Strong features (M = 400), pyramid
0 (1×1) | 45.3 ±0.5 | | 72.2 ±0.6 | | 74.8 ±0.3 |
1 (2×2) | 53.6 ±0.3 | 56.2 ±0.6 | 77.9 ±0.6 | 79.0 ±0.5 | 78.8 ±0.4 | 80.1 ±0.5
2 (4×4) | 61.7 ±0.6 | 64.7 ±0.7 | 79.4 ±0.3 | 81.1 ±0.3 | 79.7 ±0.5 | 81.4 ±0.5
3 (8×8) | 63.3 ±0.8 | 66.8 ±0.6 | 77.2 ±0.4 | 80.7 ±0.3 | 77.2 ±0.5 | 81.1 ±0.6

5. Experiments

In this section, we report results on three diverse datasets: fifteen scene categories [4], Caltech-101 [3], and Graz [14]. We perform all processing in grayscale, even when color images are available. All experiments are repeated ten times with different randomly selected training and test images, and the average of per-class recognition rates (footnote 2) is recorded for each run. The final result is reported as the mean and standard deviation of the results from the individual runs. Multi-class classification is done with a support vector machine (SVM) trained using the one-versus-all rule: a classifier is learned to separate each class from the rest, and a test image is assigned the label of the classifier with the highest response.

Footnote 2: The alternative performance measure, the percentage of all test images classified correctly, can be biased if test set sizes for different classes vary significantly. This is especially true of the Caltech-101 dataset, where some of the “easiest” classes are disproportionately large.

5.1. Scene Category Recognition

Our first dataset (Fig. 2) is composed of fifteen scene categories: thirteen were provided by Fei-Fei and Perona [4] (eight of these were originally collected by Oliva and Torralba [13]), and two (industrial and store) were collected by ourselves. Each category has 200 to 400 images, and average image size is 300×250 pixels. The major sources of the pictures in the dataset include the COREL collection, personal photographs, and Google image search. This is one of the most complete scene category datasets used in the literature thus far.

[Figure 3. Confusion table for the scene category database; the surviving caption fragment reads “images from class i that were misidentified as class j.” Per-class values recoverable from the figure (%): office 92.7, kitchen 68.5, living room 60.4, bedroom 68.3, store 76.2, industrial 65.4, tall building 91.1, inside city 80.5, street 90.2, highway 86.6, coast 82.4, open country 70.5, mountain 88.8, forest 94.7, suburb 99.4.]

Table 1 shows detailed results of classification experiments using 100 images per class for training and the rest for testing (the same setup as [4]). First, let us examine the performance of strong features for L = 0 and M = 200, corresponding to a standard bag of features. Our classification rate is 72.2% (74.7% for the 13 classes inherited from Fei-Fei and Perona), which is much higher than their best results of 65.2%, achieved with an orderless method and a feature set comparable to ours. We conjecture that Fei-Fei and Perona’s approach is disadvantaged by its reliance on latent Dirichlet allocation (LDA) [2], which is essentially an unsupervised dimensionality reduction technique and as such, is not necessarily conducive to achieving the highest classification accuracy. To verify this, we have experimented
with probabilistic latent semantic analysis (pLSA) [9], which attempts to explain the distribution of features in the image as a mixture of a few “scene topics” or “aspects” and performs very similarly to LDA in practice [17]. Following the scheme of Quelhas et al. [15], we run pLSA in an unsupervised setting to learn a 60-aspect model of half the training images. Next, we apply this model to the other half to obtain probabilities of topics given each image (thus reducing the dimensionality of the feature space from 200 to 60). Finally, we train the SVM on these reduced features and use them to classify the test set. In this setup, our average classification rate drops to 63.3% from the original 72.2%. For the 13 classes inherited from Fei-Fei and Perona, it drops to 65.9% from 74.7%, which is now very similar to their results. Thus, we can see that latent factor analysis techniques can adversely affect classification performance, which is also consistent with the results of Quelhas et al. [15].

Next, let us examine the behavior of spatial pyramid matching. For completeness, Table 1 lists the performance achieved using just the highest level of the pyramid (the “single-level” columns), as well as the performance of the complete matching scheme using multiple levels (the “pyramid” columns). For all three kinds of features, results improve dramatically as we go from L = 0 to a multi-level setup. Though matching at the highest pyramid level seems to account for most of the improvement, using all the levels [...] inated at higher pyramid levels. Thus, we can conclude that the coarse-grained geometric cues provided by the pyramid have more discriminative power than an enlarged visual vocabulary. Of course, the optimal way to exploit structure both in the image and in the feature space may be to combine them in a unified multiresolution framework; this is subject for future research.

Fig. 3 shows a confusion table between the fifteen scene categories. Not surprisingly, confusion occurs between the indoor classes (kitchen, bedroom, living room), and also between some natural classes, such as coast and open country. Fig. 4 shows examples of image retrieval using the spatial pyramid kernel and strong features with M = 200. These examples give a sense of the kind of visual information captured by our approach. In particular, spatial pyramids seem successful at capturing the organization of major pictorial elements or “blobs,” and the directionality of dominant lines and edges. Because the pyramid is based on features computed at the original image resolution, even high-frequency details can be preserved. For example, query image (b) shows white kitchen cabinet doors with dark borders. Three of the retrieved “kitchen” images contain similar cabinets, the “office” image shows a wall plastered with white documents in dark frames, and the “inside city” image shows a white building with darker window frames.

[Figure 4. Retrieval from the scene category database. The query images are on the left, and the eight images giving the highest values of the spatial pyramid kernel (for L = 2, M = 200) are on the right. The actual class of incorrectly retrieved images is listed below them. Query classes: (a) kitchen, (b) kitchen, (c) store, (d) tall bldg, (e) tall bldg, (f) inside city, (g) street.]

5.2. Caltech-101

Our second set of experiments is on the Caltech-101 database [3] (Fig. 5). This database contains from 31 to 800 images per category. Most images are medium resolution, i.e., about 300×300 pixels. Caltech-101 is probably the most diverse object database available today, though it is not without shortcomings. Namely, most images feature relatively little clutter, and the objects are centered and occupy most of the image. In addition, a number of categories, such as minaret (see Fig. 5), are affected by “corner” artifacts resulting from artificial image rotation. Though these artifacts are semantically irrelevant, they can provide
stable cues resulting in misleadingly high recognition rates.

We follow the experimental setup of Grauman and Darrell [7] and J. Zhang et al. [25], namely, we train on 30 images per class and test on the rest. For efficiency, we limit the number of test images to 50 per class. Note that, because some categories are very small, we may end up with just a single test image per class. Table 2 gives a breakdown of classification rates for different pyramid levels for weak features and strong features with M = 200. The results for M = 400 are not shown, because just as for the scene category database, they do not bring any significant improvement. For L = 0, strong features give 41.2%, which is slightly below the 43% reported by Grauman and Darrell. Our best result is 64.6%, achieved with strong features at L = 2. This exceeds the highest classification rate previously published (footnote 3), that of 53.9% reported by J. Zhang et al. [25]. Berg et al. [1] report 48% accuracy using 15 training images per class. Our average recognition rate with this setup is 56.4%. The behavior of weak features on this database is also noteworthy: for L = 0, they give a classification rate of 15.5%, which is consistent with a naive graylevel correlation baseline [1], but in conjunction with a four-level spatial pyramid, their performance rises to 54%, on par with the best results in the literature.

Fig. 5 shows a few of the “easiest” and “hardest” object classes for our method. The successful classes are either dominated by rotation artifacts (like minaret), have very little clutter (like windsor chair), or represent coherent natural “scenes” (like joshua tree and okapi). The least successful classes are either textureless animals (like beaver and cougar), animals that camouflage well in their environment (like crocodile), or “thin” objects (like ant). Table 3 shows the top five of our method’s confusions, all of which are between closely related classes.

To summarize, our method has outperformed both state-of-the-art orderless methods [7, 25] and methods based on precise geometric correspondence [1]. Significantly, all these methods rely on sparse features (interest points or sparsely sampled edge points). However, because of the geometric stability and lack of clutter of Caltech-101, dense features combined with global spatial relations seem to capture more discriminative information about the objects.

Footnote 3: See, however, H. Zhang et al. [24] in these proceedings, for an algorithm that yields a classification rate of 66.2 ±0.5% for 30 training examples, and 59.1 ±0.6% for 15 examples.

[Figure 5. Caltech-101 results. Top: some classes on which our method (L = 2, M = 200) achieved high performance: minaret (97.6%), windsor chair (94.6%), joshua tree (87.9%), okapi (87.8%). Bottom: some classes on which our method performed poorly: cougar body (27.6%), beaver (27.5%), crocodile (25.0%), ant (25.0%).]

Table 2. Classification results for the Caltech-101 database.

L | Weak features, single-level | Weak features, pyramid | Strong features (200), single-level | Strong features (200), pyramid
0 | 15.5 ±0.9 | | 41.2 ±1.2 |
1 | 31.4 ±1.2 | 32.8 ±1.3 | 55.9 ±0.9 | 57.0 ±0.8
2 | 47.2 ±1.1 | 49.3 ±1.4 | 63.6 ±0.9 | 64.6 ±0.8
3 | 52.2 ±0.8 | 54.0 ±1.1 | 60.3 ±0.9 | 64.6 ±0.7

Table 3. [...] on the Caltech-101 database.

class 1 / class 2 | class 1 misclassified as class 2 | class 2 misclassified as class 1
ketch / schooner | 21.6 | 14.8
lotus / water lily | 15.3 | 20.0
crocodile / crocodile head | 10.5 | 10.0
crayfish / lobster | 11.3 | 9.1
flamingo / ibis | 9.5 | 10.4

Table 4. [...] and comparison with two existing methods.

Class | L = 0 | L = 2 | Opelt [14] | Zhang [25]
Bikes | 82.4 ±2.0 | 86.3 ±2.5 | 86.5 | 92.0
People | 79.5 ±2.3 | 82.3 ±3.1 | 80.8 | 88.0

5.3. The Graz Dataset

As seen from Sections 5.1 and 5.2, our proposed approach does very well on global scene classification tasks, or on object recognition tasks in the absence of clutter with most of the objects assuming “canonical” poses. However, it was not designed to cope with heavy clutter and pose changes. It is interesting to see how well our algorithm can do by exploiting the global scene cues that still remain under these conditions. Accordingly, our final set of experiments is on the Graz dataset [14] (Fig. 6), which is characterized by high intra-class variation. This
dataset has two object classes, bikes (373 images) and persons (460 images), and a background class (270 images). The image resolution is 640×480, and the range of scales and poses at which exemplars are presented is very diverse, e.g., a “person” image may show a pedestrian in the distance, a side view of a complete body, or just a closeup of a head. For this database, we perform two-class detection (object vs. background) using an experimental setup consistent with that of Opelt et al. [14]. Namely, we train detectors for persons and bikes on 100 positive and 100 negative images (of which 50 are drawn from the other object class and 50 from the background), and test on a similarly distributed set. We generate ROC curves by thresholding raw SVM output, and report the ROC equal error rate averaged over ten runs.

Table 4 summarizes our results for strong features with M = 200. Note that the standard deviation is quite high because the images in the database vary greatly in their level of difficulty, so the performance for any single run is dependent on the composition of the training set (in particular, for L = 2, the performance for bikes ranges from 81% to 91%). For this database, the improvement from L = 0 to L = 2 is relatively small. This makes intuitive sense: when a class is characterized by high geometric variability, it is difficult to find useful global features. Despite this disadvantage of our method, we still achieve results very close to those of Opelt et al. [14], who use a sparse, locally invariant feature representation. In the future, we plan to combine spatial pyramids with invariant features for improved robustness against geometric changes.

6. Discussion

This paper has presented a “holistic” approach for image categorization based on a modification of pyramid match kernels [7]. Our method, which works by repeatedly subdividing an image and computing histograms of image features over the resulting subregions, has shown promising re-
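The spatial pyramid kernel of Section 3.2 (a single histogram intersection over weighted, concatenated histograms) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: feature coordinates are assumed to be normalized to [0, 1), each feature is assumed to carry an integer visual-word label, the histogram normalization described in Section 3.2 is omitted, and all function names are hypothetical.

```python
import numpy as np

def spatial_histograms(points, labels, num_types, level):
    """Count features of each type in each cell of a 2^level x 2^level grid.

    points: (N, 2) coordinates normalized to [0, 1) (assumption of this sketch).
    labels: (N,) integer visual-word indices in [0, num_types).
    Returns an array of shape (num_types, 4**level).
    """
    cells = 2 ** level
    ij = np.minimum((points * cells).astype(int), cells - 1)
    cell_index = ij[:, 1] * cells + ij[:, 0]
    hist = np.zeros((num_types, cells * cells))
    np.add.at(hist, (labels, cell_index), 1.0)
    return hist

def pyramid_vector(points, labels, num_types, L):
    """Concatenate level histograms weighted as in eq. (3)."""
    parts = []
    for level in range(L + 1):
        # Level 0 gets 1/2^L; level l >= 1 gets 1/2^(L - l + 1).
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        hist = spatial_histograms(points, labels, num_types, level)
        parts.append(weight * hist.ravel())
    return np.concatenate(parts)

def spatial_pyramid_kernel(x_vec, y_vec):
    """Histogram intersection of two weighted pyramid vectors."""
    return float(np.minimum(x_vec, y_vec).sum())

def pyramid_dimension(num_types, L):
    """Length of the concatenated vector: M * (4^(L+1) - 1) / 3."""
    return num_types * (4 ** (L + 1) - 1) // 3

# Usage: two toy images with one feature type and a two-level pyramid (L = 1).
labels = np.array([0, 0])
x = pyramid_vector(np.array([[0.1, 0.1], [0.6, 0.6]]), labels, num_types=1, L=1)
y = pyramid_vector(np.array([[0.2, 0.2], [0.9, 0.1]]), labels, num_types=1, L=1)
print(spatial_pyramid_kernel(x, y))
print(pyramid_dimension(200, 2), pyramid_dimension(400, 3))
```

In the toy example both features match at level 0 but only one matches at level 1, so eq. (2) gives 1 + (1/2)(2 - 1) = 1.5, and the dimension helper reproduces the 4200 and 34000 figures quoted in Section 3.2. Dense arrays are used here for clarity; the text notes that the real histograms are extremely sparse, so a sparse representation would be the natural choice at scale.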

Common English Adjectives

Adjectives are an essential part of the English language, allowing us to describe the world around us with greater precision and nuance. From the simple to the complex, adjectives provide a wealth of expressive possibilities, enabling us to convey our thoughts, feelings, and observations with clarity and impact. In this essay, we will explore some of the most common and versatile English adjectives, examining their usage, meaning, and the ways in which they enrich our communication.

One of the most fundamental and widely used adjectives is the word "big." This simple yet powerful descriptor can be applied to a vast array of subjects, from physical size to abstract concepts. A "big house," a "big idea," or a "big problem" all convey a sense of scale, magnitude, or importance that would be difficult to express without this adjective. The word "small," on the other hand, offers a counterpoint, allowing us to describe things that are diminutive, compact, or insignificant in comparison.

Another highly useful adjective is "good." This term can be used to express approval, quality, or desirability, as in a "good book," a "good meal," or a "good friend." Its opposite, "bad," is equally versatile, enabling us to communicate disapproval, poor quality, or undesirable characteristics. These adjectives are so ubiquitous in our language that they have become ingrained in our daily speech, shaping the way we perceive and evaluate the world around us.

Adjectives can also be used to convey emotional states or personal qualities. Words like "happy," "sad," "angry," and "kind" allow us to articulate our inner experiences and the attributes of others with precision. These emotive adjectives play a crucial role in interpersonal communication, helping us to express empathy, understand one another's perspectives, and navigate the complex landscape of human relationships.

Beyond the realm of emotions, adjectives can be used to describe physical attributes, such as "tall," "short," "heavy," and "light." These descriptors enable us to paint vivid mental pictures, allowing the listener or reader to visualize the subject in question. Similarly, adjectives like "smooth," "rough," "soft," and "hard" allow us to convey sensory information, enriching our descriptions and enhancing our ability to communicate effectively.

Adjectives can also be used to express more abstract qualities, such as "intelligent," "creative," "honest," and "reliable." These adjectives speak to the character, abilities, and values of a person or thing, offering insight into their nature and the way they are perceived. Such descriptors are essential in our efforts to understand and evaluate the world around us, as they allow us to make nuanced judgments and form meaningful connections.

It is important to note that many adjectives can be modified and intensified through the use of adverbs, such as "very," "extremely," "incredibly," and "remarkably." These adverbial modifiers add depth and emphasis to our descriptions, allowing us to convey subtle shades of meaning and nuance. For example, the difference between "happy" and "extremely happy" can be significant, conveying a more profound or intense state of emotion.

Adjectives can also be used comparatively, with words like "better," "worse," "larger," and "smaller" allowing us to draw direct comparisons between two or more subjects. This comparative usage enables us to articulate relative differences, hierarchies, and gradations, further enhancing our ability to describe and evaluate the world around us.

In addition to their primary function as descriptors, adjectives can also be used in more creative and figurative ways. Metaphorical expressions, such as "a heart of stone," "a mind like a steel trap," or "a voice of velvet," employ adjectives to draw vivid comparisons and evoke powerful imagery. These linguistic devices not only add color and richness to our language but also challenge us to think in more abstract and imaginative ways.

The versatility of adjectives is further demonstrated in their ability to be combined and layered, creating complex and nuanced descriptions. For instance, a "small, round, red apple" conveys far more information than a simple "apple," allowing the reader or listener to visualize the subject with greater precision and detail.

In conclusion, the humble adjective is a linguistic powerhouse, enabling us to communicate with clarity, nuance, and emotional depth. From the basic to the complex, these descriptors are the building blocks of our language, shaping the way we perceive, understand, and express the world around us. By mastering the use of adjectives, we can unlock new levels of expressiveness, cultivate deeper connections with others, and ultimately enhance the richness and effectiveness of our communication.

Dupin Cyclide Blends Between Quadric Surfaces for Shape Modeling

1. Introduction

Dupin cyclides are non-spherical algebraic surfaces discovered by the French mathematician Pierre-Charles Dupin at the beginning of the 19th century [Dup22]. These surfaces have an essential property: all their curvature lines are circular. R. Martin, in his 1982 PhD thesis, was the first author to consider using these surfaces in CAD/CAM and geometric modeling [Mar82], using cyclides in the formulation of his principal patches. Afterwards, Dupin cyclides gained a lot of attention, and their algebraic and geometric properties have been studied in depth by many authors [BG92, Ber78, Heb81, Pin85, Ban70, DMP93, Pra97], which has established both the suitability of these surfaces for geometric modeling and their contribution to it.
† Currently guest researcher at the National Institute of Standards and Technology, Gaithersburg, MD 20899, USA. sfoufou@

The Vast Range of English Adjectives

The Vast and Versatile World of English Adjectives

The English language is renowned for its richness and diversity, and this is particularly evident in its vast array of adjectives. These descriptive words play a crucial role in adding depth, nuance, and vivacity to our communication, allowing us to paint vivid pictures with our words. From the diminutive to the grandiose, the whimsical to the profound, the English language offers a seemingly limitless palette of adjectives to choose from, each one capable of evoking a unique emotional response or conjuring a specific mental image.

At the heart of this linguistic abundance lies the inherent flexibility and adaptability of English adjectives. These modifiers can be used to describe a wide range of entities, from the tangible to the abstract, the physical to the metaphysical. They can convey size, shape, color, texture, age, origin, and a myriad of other attributes, enabling us to capture the essence of the world around us with precision and clarity.

One of the most striking features of English adjectives is their ability to convey a sense of scale and magnitude. From the diminutive "tiny" to the colossal "mammoth," the language offers a rich tapestry of words that can transport the reader or listener to vastly different realms of experience. The adjective "enormous," for instance, can conjure visions of towering mountains, sprawling landscapes, or immense celestial bodies, while "minuscule" might evoke the delicate intricacies of a snowflake or the fine structures of a single-celled organism.

But the power of English adjectives extends far beyond the realm of physical dimensions. These versatile words can also convey a wide range of emotional states, from the joyful "delightful" to the melancholic "somber." They can capture the nuances of personality, as in the case of "quirky" or "stoic," or the complexities of human experience, as in the case of "bittersweet" or "poignant."

Moreover, English adjectives are not limited to static descriptions; they can also be used to convey a sense of movement, change, or transformation. Words like "dynamic," "volatile," or "metamorphic" can evoke a sense of flux and evolution, while "ethereal" or "transcendent" can suggest a departure from the mundane and the ordinary.

One of the most remarkable aspects of English adjectives is their ability to be combined and layered, creating rich and evocative compound descriptions. The phrase "a breathtakingly magnificent sunset," for instance, pairs an adjective with an intensifying adverb to paint a vivid and emotive picture, evoking a sense of awe and wonder. Similarly, the term "a deliciously decadent dessert" blends descriptors to conjure a sensory experience that tantalizes the taste buds and the imagination.

The sheer breadth and depth of the English adjective arsenal is a testament to the language's capacity for nuance and expression. From the simple and straightforward to the complex and multifaceted, these linguistic tools allow us to navigate the vast and varied landscape of human experience, conveying our thoughts, feelings, and perceptions with precision and artistry.

As we continue to explore and harness the power of English adjectives, we discover new ways to communicate, to inspire, and to connect with one another. Whether we are crafting a poetic description, building a persuasive argument, or simply engaging in everyday conversation, the judicious use of these versatile modifiers can elevate our language, enrich our understanding, and expand the horizons of our shared experience.

Research on Visual Inspection Algorithms for Defects in Textured Objects (Outstanding Graduation Thesis)

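Abstract
In highly competitive automated industrial production, machine vision plays a decisive role in product quality control, and its use in defect detection is becoming increasingly common. Compared with conventional inspection techniques, automated visual inspection systems are more economical, faster, more efficient, and safer. Textured objects are widespread in industrial production: substrates and light-emitting diodes used in semiconductor assembly and packaging, printed circuit boards in modern electronic systems, and cloth and fabrics in the textile industry can all be regarded as objects with texture features. This thesis focuses on defect detection for textured objects, aiming to provide efficient and reliable algorithms for their automated inspection.

Texture is an important feature for describing image content, and texture analysis has been applied successfully to texture segmentation and texture classification. This study proposes a defect detection algorithm based on texture analysis and comparison against a reference. The algorithm tolerates image registration errors caused by object distortion and is robust to the influence of texture. It aims to provide rich and physically meaningful information about the detected defect regions, such as their size, shape, brightness contrast, and spatial distribution. When a reference image is available, the algorithm can be used to inspect both homogeneously and non-homogeneously textured objects, and it also performs well on non-textured objects.

Throughout the inspection process we use steerable-pyramid texture analysis and reconstruction. Unlike conventional wavelet texture analysis, we add a tolerance-control algorithm in the wavelet domain that handles object distortion and texture influence, so that the method tolerates object distortion and remains robust to texture effects. Finally, steerable-pyramid reconstruction ensures that the physical properties of the defect regions are recovered accurately.

In the experiments we inspected a series of images of practical relevance. The results show that the proposed defect detection algorithm for textured objects is efficient and easy to implement.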
Keywords: defect detection, texture, object distortion, steerable pyramid, reconstruction

A List of English Adjectives

The English language is renowned for its vast and diverse vocabulary, and adjectives play a pivotal role in enriching the language's expressive capabilities. Adjectives are words that describe or modify nouns, adding qualities, characteristics, or attributes to them. They are the vibrant hues that paint the canvas of language, bringing clarity, detail, and emotion to our communication.

The beauty of adjectives lies in their ability to evoke images, emotions, and experiences. They allow us to describe things with precision and creativity, converting simple nouns into vivid, multidimensional entities. For instance, the word "tree" can become "majestic oak," "graceful willow," or "mysterious banyan," depending on the adjective used. Each adjective conjures up a unique image and feeling, painting a different picture in the listener's mind.

Adjectives in English are diverse, ranging from simple descriptors like "big" and "small" to more complex ones like "surreal," "eccentric," and "ethereal." They can be used to describe everything from the smallest particle to the vastest cosmos, from the most abstract concept to the most tangible object. The range and depth of adjectives in English reflect the vastness and complexity of human thought and experience.

Moreover, adjectives can be combined and layered to create even more nuanced and descriptive phrases. For instance, the phrase "a beautifully crafted wooden sculpture" combines the adverb "beautifully" with the adjectives "crafted" and "wooden" to describe the noun "sculpture." This combination not only adds depth to the description but also creates a stronger emotional impact on the reader.

The use of adjectives in English also varies depending on the context and the desired effect. In literary works, adjectives are often used to create vivid images and evoke strong emotions. In scientific writing, adjectives are used to provide precision and clarity. In everyday conversation, adjectives add color and life to our speech, making it more engaging and expressive.

In conclusion, adjectives are the essential building blocks of descriptive language in English. They enrich our vocabulary, expand our expressive capabilities, and add depth and detail to our communication. The rich vocabulary of adjectives in English is a testament to the language's versatility and expressiveness, making it a powerful tool for communication and expression.

**A List of English Adjectives: The Rich Colors of Language**

English is renowned for its broad and varied vocabulary, and adjectives play an important role in enriching the language's expressive power.

Image Fusion Technology: Foreign-Language Translation (Chinese-English Parallel Text)

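*** University Graduation Project (English Translation)
Original title: Automatic Panoramic Image Stitching using Invariant Features
Translated title: 使用不变特征的全景图像自动拼接
School: School of Electronic and Information Engineering
Major: ********  Name: ******  Student ID: **********

Automatic Panoramic Image Stitching using Invariant Features
Matthew Brown and David Lowe, {mbrown|lowe}@cs.ubc.ca
Department of Computer Science, University of British Columbia, Vancouver, Canada

Abstract

This paper addresses the problem of fully automatic panoramic image stitching. Although the 1D problem (a single axis of rotation) is well studied, 2D or multi-row stitching is more difficult.

Previous approaches have used human input or restrictions on the image sequence to establish matching images. In this paper, we formulate stitching as a multi-image matching problem and use invariant local features to find matches between all of the images.

As a result, the method is insensitive to the ordering, orientation, scale, and illumination of the input images. It is also insensitive to noise images that are not part of a panorama, and it can recognize multiple panoramas in an unordered image data set.

In addition to providing more detail, this paper extends our previous work in the area by introducing gain compensation and automatic straightening steps.

1. Introduction

Panoramic image stitching has an extensive research literature and several commercial applications.

The basic geometry of the problem is well understood and consists of estimating a 3×3 camera matrix or homography for each image.

The estimation process is usually initialized either by user input that approximately aligns the images or by a fixed image ordering. For example, the image stitching software bundled with Canon digital cameras requires a horizontal or vertical sweep, or a square matrix of images.

Version 4 of the REALVIZ Stitcher software has a user interface for roughly positioning the images with a mouse before automatic registration proceeds. Our work is novel in that no such initialization needs to be provided.

In the research literature, methods for automatic image alignment and stitching fall broadly into two categories: direct and feature-based.

Direct methods have the advantage that they use all of the available image data and can therefore provide very accurate registration, but they require a close initialization.

Feature-based registration does not require initialization, but traditional feature-matching methods (for example, correlation of image patches around Harris corners) lack the invariance needed for reliable matching of arbitrary panoramic image sequences.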

Mechanical Design, Manufacturing and Automation: Comprehensive Collection of Specialized English Translations

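Unit 1 Metals
Unit 2 Selection of Construction Materials

Hardenability: the property that, under specified conditions, determines the depth of hardening and the hardness distribution of a steel. In other words, it is the ability of a steel to develop a hardened layer of a given depth on quenching, and it expresses how well the steel responds to quenching. Hardenability is usually expressed by the depth of the hardened layer: the greater the depth, the better the hardenability.

Hardenability is an inherent property of the steel itself; it depends only on internal factors and not on external ones.

It depends mainly on chemical composition, especially the content of alloying elements that increase hardenability, and is also related to grain size, heating temperature, and holding time.

A steel with good hardenability gives uniform mechanical properties across the whole section of a part and allows a quenching medium that produces low quenching stress to be used, which reduces distortion and cracking. Hardenability depends mainly on the critical cooling rate, which in turn depends mainly on the stability of the undercooled austenite. The main factors affecting austenite stability are:

1. Influence of chemical composition. Carbon has the main effect. When the carbon content is below 1.2%, increasing the carbon concentration in the austenite markedly lowers the critical cooling rate, shifts the C-curve to the right, and increases hardenability; at higher carbon contents the critical cooling rate rises instead, the C-curve shifts to the left, and hardenability decreases.

Alloying elements come next: apart from cobalt, the great majority of alloying elements, once dissolved in austenite, shift the C-curve to the right and lower the critical cooling rate, thereby increasing hardenability.

2. Influence of austenite grain size. The actual austenite grain size has a considerable effect on hardenability. Coarse austenite grains shift the C-curve to the right and lower the critical cooling rate, but coarse grains also increase the tendency of the steel to distort and crack and reduce its toughness.

3. Influence of austenite homogeneity. Under the same degree of undercooling, the more uniform the austenite composition, the lower the pearlite nucleation rate and the longer the incubation period of the transformation; the C-curve shifts to the right, the critical cooling rate falls, and hardenability increases.

4. Influence of the initial microstructure. The fineness and distribution of the steel's initial microstructure strongly affect the composition of the austenite.

5. Some elements, such as Mn and Si, help to increase hardenability to some extent but also have other adverse effects on the steel.

Forgeability: metals are plastic when hot; in the heated state (the required temperature differs from metal to metal) they can be worked under pressure, and this property is called forgeability.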

A Compendium of English Adjectives

Adjectives are the linguistic tools that allow us to describe the world around us with greater precision and nuance. They are the colorful brushstrokes that breathe life into our language, transforming the mundane into the remarkable. In this comprehensive compendium, we will explore the vast and versatile realm of English adjectives, uncovering their power to captivate, enlighten, and inspire.

At the heart of any adjective lies the ability to modify a noun, providing essential details that shape our understanding of the subject. From the diminutive "tiny" to the colossal "gigantic," adjectives grant us the means to convey size and scale with remarkable clarity. Similarly, the spectrum of color adjectives, ranging from the vibrant "scarlet" to the muted "beige," allows us to paint vivid mental pictures with our words.

But adjectives transcend the purely physical realm, delving into emotion, personality, and character. The "joyful" child radiates a sense of boundless enthusiasm, while the "somber" mood evokes a palpable air of solemnity. Adjectives such as "courageous," "resilient," and "compassionate" serve as powerful descriptors of the human spirit, capturing the essence of our most admirable qualities.

Delving deeper, we discover that adjectives can also convey more abstract concepts, such as time and quality. The "ancient" artifact whispers of bygone eras, while the "exquisite" craftsmanship captivates the senses. Adjectives like "reliable," "efficient," and "innovative" allow us to assess the merits of products, services, and ideas, guiding our decision-making and shaping our perceptions.

One of the remarkable aspects of English adjectives is their ability to be combined and layered, creating intricate and nuanced descriptions. The "serene, azure-blue lake" evokes a sense of tranquility and visual splendor, while the "feisty, diminutive terrier" paints a vivid picture of a spirited, small canine companion. These complex constructions demonstrate the richness and flexibility of our language, allowing us to convey even the most intricate of ideas with precision and eloquence.

Beyond their descriptive power, adjectives also play a crucial role in the structure and flow of our language. They serve as the essential building blocks of clear and compelling communication, helping us to organize our thoughts, emphasize important details, and create a sense of rhythm and cadence in our writing and speech.

Consider, for example, the difference between the simple statement "The car is red" and the more evocative "The sleek, crimson-hued sports car gleamed in the afternoon sun." The addition of adjectives transforms the mundane into the captivating, drawing the reader or listener into a more vivid and engaging experience.

Moreover, the strategic placement of adjectives can have a profound impact on the overall meaning and tone of a sentence. The "intelligent, kind-hearted" person conveys a very different impression than the "kind-hearted, intelligent" person, with the former emphasizing intellect and the latter prioritizing compassion.

The versatility of English adjectives extends beyond their traditional roles as modifiers of nouns. They can also serve as the foundations of adverbs, adding depth and nuance to our descriptions of actions and processes: "quick" yields "quickly," and "elegant" yields "elegantly." The runner dashes quickly, with urgency, while the dancer moves elegantly, with grace and poise.

In the realm of comparative and superlative forms, adjectives truly shine, allowing us to make precise and compelling comparisons. The "larger" sibling towers over the "smaller" one, while the "most beautiful" sunset paints the sky in unparalleled splendor. These linguistic tools enable us to navigate the complexities of the world, highlighting the unique qualities and relative merits of the subjects we describe.

Lastly, it is important to recognize the power of adjectives in shaping our perceptions and biases. The choice of adjectives can have a profound impact on the way we interpret and respond to the world around us. A "dangerous" neighborhood may evoke feelings of fear and trepidation, while a "lively" neighborhood might inspire a sense of excitement and adventure. It is crucial that we wield this linguistic power with care and mindfulness, ensuring that our adjectives reflect fairness, empathy, and a commitment to understanding the full complexity of the human experience.

In conclusion, the compendium of English adjectives is a rich and multifaceted tapestry, woven with the threads of precision, emotion, and nuance. From the tangible to the abstract, these linguistic tools grant us the ability to paint vivid portraits of the world, to convey our thoughts and feelings with clarity and eloquence, and to navigate the intricate web of human experience. As we continue to explore and expand the boundaries of this remarkable linguistic resource, we unlock new avenues for self-expression, communication, and the profound exploration of the human condition.

English Adjectives of Mobility

The Evolution of Mobility: An Insight into Mobile Adjectives

In today's fast-paced world, the concept of mobility has transformed significantly, encompassing various adjectives that describe the different facets of movement. These adjectives are not merely descriptors of physical locomotion but also extensions of technology, culture, and society's evolving perspectives.

The most basic adjective associated with mobility is "mobile," which refers to something that can move or be transported easily. This adjective has found its way into various contexts, from describing physical objects like mobile phones and laptops to abstract concepts like mobile computing and mobile workforces. The word "mobile" has become synonymous with convenience, portability, and adaptability.

Another adjective that has gained popularity in recent years is "wireless." This adjective describes devices or connections that are not tethered to a physical cable or wire, allowing for greater freedom of movement. The wireless revolution has transformed the way we communicate, work, and entertain ourselves, making it possible to stay connected without being physically restrained.

A third important adjective in the realm of mobility is "remote." Remote work, remote learning, and remote healthcare have become commonplace, thanks to advancements in technology. The adjective "remote" represents the ability to access information, services, or resources from a distance, breaking geographical barriers and enabling a more distributed way of life.

A fourth significant adjective is "dynamic," which describes something that is constantly changing or evolving. In the context of mobility, this adjective captures the fluidity and adaptability of modern life, where constant motion and transformation are the norm. Dynamic mobility refers to the ability to adjust to changing environments and situations, whether it's a changing work schedule, traffic patterns, or the need to adapt to new technologies.

These adjectives are not just descriptors of physical movement but also reflect the cultural and technological shifts that have occurred over time. The increasing popularity of mobile devices and wireless connections, for example, has given rise to a new era of connectivity and accessibility. Similarly, the shift towards remote work and learning has highlighted the importance of flexibility and adaptability in today's workforce.

In conclusion, the evolution of mobility has been marked by the emergence of new adjectives that capture the diverse aspects of movement in today's world. These adjectives reflect not only the physical aspects of locomotion but also the cultural, technological, and societal shifts that have shaped our understanding of mobility. As we continue to move forward, it will be fascinating to see how these adjectives evolve and how they shape our future perspectives on mobility.

**The Evolution of English Adjectives of Mobility: An In-Depth Look**

In today's fast-paced world, the concept of mobility has changed significantly, encompassing various adjectives that describe different aspects of movement.

Adaptive Weighted Multiple Sparse Representation Classification Approach_DUAN Ganglong_WEI Long_LI Ni

Online publication time: 2012-08-16 10:45
Online publication address: /kcms/detail/11.2127.TP.20120816.1045.019.html
Computer Engineering and Applications

Adaptive weighted multiple sparse representation classification approach
DUAN Ganglong, WEI Long, LI Ni
Department of Information Management, Xi’an University of Technology, Xi’an 710048, China

Abstract: An adaptive weighted multiple sparse representation classification method is proposed in this paper. To address the weak discriminative power of the conventional SRC (sparse representation classifier) method, which uses a single feature representation, we propose using multiple features to represent each sample and constructing multiple feature sub-dictionaries for classification. To reflect the different importance and discriminative power of each feature, we present an adaptive weighted method that linearly combines the different feature representations for classification. Experimental results demonstrate the effectiveness of the proposed method, which achieves better classification accuracy than the conventional SRC method.

Key words: adaptive weight; multiple sparse representation; SRC

Abstract (translated from the Chinese): A sparse representation algorithm based on multi-feature dictionaries is proposed.

A Survey of 3D CAD Model Retrieval

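[...] retrieval has long been one of the research hotspots [1]. In recent years, with the continuing development of 3D acquisition devices and related software, the demand for 3D data descriptions in all kinds of applications has grown rapidly. Model libraries have gradually been built in many fields, including biomolecules, mechanical engineering, 3D games, clothing design, virtual reality scene modeling, architectural design, and interior decoration, and the number of 3D models they hold keeps growing. Accordingly, content-based 3D model retrieval has also become a research hotspot [2,3], and many different types of retrieval methods have been proposed in recent years. Depending on the model representation, the application, and the retrieval requirements, 3D model retrieval can be broadly divided into two classes: general 3D model retrieval and 3D CAD model retrieval. General model retrieval usually involves no domain knowledge. It analyzes models from the viewpoint of geometric shape description and, using methods such as geometric comparison and statistical analysis, performs comparison and retrieval on the basis of extracted shape descriptors [4-8]. Institutions that began work in this area early include Princeton University in the United States [9], the University of Konstanz in Germany [10], and the University of Tokyo in Japan [11].

CAD model retrieval can be further divided into two levels: 3D CAD model retrieval based on visual similarity, and 3D CAD model retrieval oriented toward semantic and functional description. Visual-similarity methods are generally independent of domain knowledge. They use function projection, statistical analysis, topological structure comparison, and similar methods to extract global geometric descriptions and shape features from CAD models, generate feature descriptors in a multidimensional space on that basis, and finally perform retrieval by comparing descriptors in that feature space. For example, Funkhouser et al. [22] use spherical harmonic analysis to obtain a set of rotation-invariant frequency functions as the basis for comparison; Novotni et al. [23] extend the 2D Zernike moment method to 3D for invariant model retrieval; and Hilaga et al. [24] generate multiresolution Reeb graphs from a geodesic distance function to extract model topology for comparison. Retrieval oriented toward semantic and functional description [...]

[...] compared with other CAD data specifications (such as IGES), the STEP standard offers considerably improved data representation and exchange, and therefore in retrieval systems oriented toward semantic and functional description [...]

[...] features. On this basis, benchmark libraries and evaluation methods for 3D CAD model retrieval are summarized. Finally, the retrieval capability and cost of the current classes of 3D CAD model retrieval methods are compared quantitatively, the main difficulties facing existing methods are identified, and directions for further research are outlined.

Extracting a global geometric structure description is one of the common approaches to CAD model retrieval. The basic idea is to extract the various geometric parameters or structural descriptions from a given CAD model representation in order to simplify the model's complex data description and enable fast retrieval. Methods based on global geometric structure extraction fall into two levels: geometric parameter extraction and structural feature simplification. Geometric parameter extraction focuses on extracting and computing geometric description parameters of the CAD model. For example, Corney et al. [4?] use convex hull features [...]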

A Study of Quantitative Structure-Water Solubility Relationships for Organic Compounds

Master's Thesis, Guilin Institute of Technology
[Figures: Fig. 4.1 Minoxidil; Fig. 4.2 Cyhexatin; Fig. 4.4 Cephaloridine]
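In this study, the water solubility of an organic compound is expressed as the logarithm of its solubility, log S, where S is the molar solubility of the compound in pure water at 20 to 25 °C, in mol/L.

For the 1290 compounds, log S ranges from a minimum of -11.62 to a maximum of +1.58; the distribution of log S values is shown in Fig. 4.5.

Slightly soluble substances make up more than half of the samples, while the remaining poorly soluble and readily soluble organic compounds occur with similar frequency.

The water solubility data in this data set thus cover the range found for most typical organic compounds, so the data set used in this study is reasonably typical and representative.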
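To give a feel for the span of the reported log S range, the short sketch below converts log S values back to molar solubilities. It assumes log S is a base-10 logarithm, which the excerpt does not state explicitly; the function name `molar_solubility` is hypothetical.

```python
def molar_solubility(log_s: float) -> float:
    """Return the molar solubility S in mol/L for a given log S.

    Assumes log S is the base-10 logarithm of S (an assumption; the
    excerpt only says "the logarithm of the solubility").
    """
    return 10.0 ** log_s


if __name__ == "__main__":
    # Endpoints of the range reported for the 1290 compounds.
    for log_s in (-11.62, 1.58):
        print(f"log S = {log_s:+.2f}  ->  S = {molar_solubility(log_s):.3e} mol/L")
```

Under that assumption the data set spans roughly thirteen orders of magnitude in S.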

[Figure: Fig. 4.5 Distribution of log S values for the 1290 compounds; horizontal axis: log S value]

A Survey of Content Based 3D Shape Retrieval Methods

A Survey of Content Based 3D Shape Retrieval Methods

Johan W.H. Tangelder and Remco C. Veltkamp
Institute of Information and Computing Sciences, Utrecht University
hanst@cs.uu.nl, Remco.Veltkamp@cs.uu.nl

Abstract

Recent developments in techniques for modeling, digitizing and visualizing 3D shapes have led to an explosion in the number of available 3D models on the Internet and in domain-specific databases. This has led to the development of 3D shape retrieval systems that, given a query object, retrieve similar 3D objects. For visualization, 3D shapes are often represented as a surface, in particular polygonal meshes, for example in VRML format. Often these models contain holes and intersecting polygons, are not manifold, and do not enclose a volume unambiguously. By contrast, 3D volume models, such as solid models produced by CAD systems, or voxel models, enclose a volume properly. This paper surveys the literature on methods for content based 3D retrieval, taking into account the applicability to surface models as well as to volume models. The methods are evaluated with respect to several requirements of content based 3D shape retrieval, such as: (1) shape representation requirements, (2) properties of dissimilarity measures, (3) efficiency, (4) discrimination abilities, (5) ability to perform partial matching, (6) robustness, and (7) necessity of pose normalization. Finally, the advantages and limits of the several approaches in content based 3D shape retrieval are discussed.

1. Introduction

The advancement of modeling, digitizing and visualizing techniques for 3D shapes has led to an increasing number of 3D models, both on the Internet and in domain-specific databases. This has led to the development of the first experimental search engines for 3D shapes, such as the 3D model search engine at Princeton University [2,57], the 3D model retrieval system at the National Taiwan University [1,17], the Ogden IV system at the National Institute of Multimedia Education, Japan [62,77], the 3D retrieval engine at Utrecht
University [4,78], and the 3D model similarity search engine at the University of Konstanz [3,84].

Laser scanning has been applied to obtain archives recording cultural heritage, like the Digital Michelangelo Project [25,48] and the Stanford Digital Formae Urbis Romae Project [75]. Furthermore, archives containing domain-specific shape models are now accessible via the Internet. Examples are the National Design Repository, an online repository of CAD models [59,68], and the Protein Data Bank, an online archive of structural data of biological macromolecules [10,80].

Unlike text documents, 3D models are not easily retrieved. Attempting to find a 3D model using textual annotation and a conventional text-based search engine would not work in many cases. The annotations added by human beings depend on language, culture, age, sex, and other factors. They may be too limited or ambiguous. In contrast, content based 3D shape retrieval methods, which use shape properties of the 3D models to search for similar models, work better than text based methods [58].

Matching is the process of determining how similar two shapes are. This is often done by computing a distance. A complementary process is indexing. In this paper, indexing is understood as the process of building a data structure to speed up the search. Note that the term indexing is also often used for the identification of features in models, or multimedia documents in general. Retrieval is the process of searching and delivering the query results. Matching and indexing are often part of the retrieval process.

Recently, a lot of researchers have investigated the specific problem of content based 3D shape retrieval. Also, an extensive amount of literature can be found in the related fields of computer vision, object recognition and geometric modelling. Surveys of this literature have been provided by Besl and Jain [11], Loncaric [50] and Campbell and Flynn [16]. For an overview of 2D shape matching methods we refer the reader to the paper by Veltkamp
[82]. Unfortunately, most 2D methods do not generalize directly to 3D model matching. Work in progress by Iyer et al. [40] provides an extensive overview of 3D shape searching techniques. Atmosukarto and Naval [6] describe a number of 3D model retrieval systems and methods, but do not provide a categorization and evaluation.

In contrast, this paper evaluates 3D shape retrieval methods with respect to several requirements on content based 3D shape retrieval, such as: (1) shape representation requirements, (2) properties of dissimilarity measures, (3) efficiency, (4) discrimination abilities, (5) ability to perform partial matching, (6) robustness, and (7) necessity of pose normalization. In section 2 we discuss several aspects of 3D shape retrieval. The literature on 3D shape matching methods is discussed in section 3 and evaluated in section 4.

2. 3D shape retrieval aspects

In this section we discuss several issues related to 3D shape retrieval.

2.1. 3D shape retrieval framework

At a conceptual level, a typical 3D shape retrieval framework as illustrated by fig. 1 consists of a database with an index structure created offline and an online query engine. Each 3D model has to be identified with a shape descriptor, providing a compact overall description of the shape. To efficiently search a large collection online, an indexing data structure and searching algorithm should be available.
The online query engine computes the query descriptor, and models similar to the query model are retrieved by matching descriptors to the query descriptor from the index structure of the database. The similarity between two descriptors is quantified by a dissimilarity measure. Three approaches can be distinguished to provide a query object: (1) browsing to select a new query object from the obtained results, (2) a direct query by providing a query descriptor, (3) query by example by providing an existing 3D model or by creating a 3D shape query from scratch using a 3D tool or sketching 2D projections of the 3D model. Finally, the retrieved models can be visualized.

2.2. Shape representations

An important issue is the type of shape representation(s) that a shape retrieval system accepts. Most of the 3D models found on the World Wide Web are meshes defined in a file format supporting visual appearance. Currently, the most common format used for this purpose is the Virtual Reality Modeling Language (VRML) format. Since these models have been designed for visualization, they often contain only geometry and appearance attributes. In particular, they are represented by “polygon soups”, consisting of unorganized sets of polygons. Also, in general these models are not “watertight” meshes, i.e. they do not enclose a volume.
By contrast, for volume models, retrieval methods depending on a properly defined volume can be applied.

2.3. Measuring similarity

In order to measure how similar two objects are, it is necessary to compute distances between pairs of descriptors using a dissimilarity measure. Although the term similarity is often used, dissimilarity corresponds to the notion of distance: small distances mean small dissimilarity, and large similarity.

A dissimilarity measure can be formalized by a function defined on pairs of descriptors indicating the degree of their resemblance. Formally speaking, a dissimilarity measure d on a set S is a non-negative valued function d: S × S → R+ ∪ {0}. Function d may have some of the following properties:

i. Identity: For all x ∈ S, d(x,x) = 0.
ii. Positivity: For all x ≠ y in S, d(x,y) > 0.
iii. Symmetry: For all x, y ∈ S, d(x,y) = d(y,x).
iv. Triangle inequality: For all x, y, z ∈ S, d(x,z) ≤ d(x,y) + d(y,z).
v. Transformation invariance: For a chosen transformation group G, for all x, y ∈ S, g ∈ G, d(g(x),g(y)) = d(x,y).

The identity property says that a shape is completely similar to itself, while the positivity property claims that different shapes are never completely similar. This property is very strong for a high-level shape descriptor, and is often not satisfied. However, this is not a severe drawback, if the loss of uniqueness depends on negligible details.

Symmetry is not always wanted. Indeed, human perception does not always find that shape x is equally similar to shape y, as y is to x. In particular, a variant x of prototype y is often found more similar to y than vice versa [81].

Dissimilarity measures for partial matching, giving a small distance d(x,y) if a part of x matches a part of y, do not obey the triangle inequality.

Transformation invariance has to be satisfied, if the comparison and the extraction process of shape descriptors have to be independent of the place, orientation and scale of the object in its Cartesian coordinate system. If we want a dissimilarity measure to be unaffected by any
transformation on x, then we may use as alternative formulation for (v): Transformation invariance: For a chosen transformation group G, for all x, y ∈ S, g ∈ G, d(g(x),y) = d(x,y).

When all the properties (i)-(iv) hold, the dissimilarity measure is called a metric. Other combinations are possible: a pseudo-metric is a dissimilarity measure that obeys (i), (iii) and (iv), while a semi-metric obeys only (i), (ii) and (iii). If a dissimilarity measure is a pseudo-metric, the triangle inequality can be applied to make retrieval more efficient [7,83].

2.4. Efficiency

For large shape collections, it is inefficient to sequentially match all objects in the database with the query object. Because retrieval should be fast, efficient indexing search structures are needed to support efficient retrieval. Since for query by example the shape descriptor is computed online, it is reasonable to require that the shape descriptor computation is fast enough for interactive querying.

2.5. Discriminative power

A shape descriptor should capture properties that discriminate objects well. However, the judgement of the similarity of the shapes of two 3D objects is somewhat subjective, depending on the user preference or the application at hand. E.g. for solid modeling applications, topology properties such as the number of holes in a model are often more important than minor differences in shape. By contrast, if a user searches for models that look visually similar, the existence of a small hole in the model may be of no importance to the user.

2.6. Partial matching

In contrast to global shape matching, partial matching finds a shape of which a part is similar to a part of another shape. Partial matching can be applied if 3D shape models are not complete, e.g. for objects obtained by laser scanning from one or two directions only. Another application is the search for “3D scenes” containing an instance of the query object. Also, this feature can potentially give the user flexibility towards the matching problem, if parts of interest of an
object can be selected or weighted by the user.

2.7. Robustness

It is often desirable that a shape descriptor is insensitive to noise and small extra features, and robust against arbitrary topological degeneracies, e.g. if it is obtained by laser scanning. Also, if a model is given in multiple levels-of-detail, representations of different levels should not differ significantly from the original model.

2.8. Pose normalization

In the absence of prior knowledge, 3D models have arbitrary scale, orientation and position in the 3D space. Because not all dissimilarity measures are invariant under rotation and translation, it may be necessary to place the 3D models into a canonical coordinate system. This should be the same for a translated, rotated or scaled copy of the model.

A natural choice is to first translate the center to the origin. For volume models it is natural to translate the center of mass to the origin. But for meshes this is in general not possible, because they need not enclose a volume. For meshes an alternative is to translate the center of mass of all the faces to the origin. For example, the Principal Component Analysis (PCA) method computes for each model the principal axes of inertia e1, e2 and e3 and their eigenvalues λ1, λ2 and λ3, and imposes the conditions necessary to get right-handed coordinate systems. These principal axes define an orthogonal coordinate system (e1,e2,e3), with λ1 ≥ λ2 ≥ λ3. Next, the polyhedral model is rotated around the origin such that the coordinate system (ex,ey,ez) coincides with the coordinate system (e1,e2,e3).

The PCA algorithm for pose estimation is fairly simple and efficient. However, if the eigenvalues are equal, principal axes may switch, without affecting the eigenvalues. Similar eigenvalues may imply an almost symmetrical mass distribution around an axis (e.g. nearly cylindrical shapes) or around the center of mass (e.g. nearly spherical shapes).
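The translation and rotation steps of PCA-based pose normalization can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any system cited here: it assumes the model is given as an N×3 array of sample points (for example mesh vertices), uses the unweighted covariance of those points in place of the axes of inertia, and leaves out scale normalization. The function name `pca_normalize` is hypothetical.

```python
import numpy as np


def pca_normalize(points: np.ndarray) -> np.ndarray:
    """Center an (N, 3) point set and rotate it onto its principal axes.

    Assumption: unweighted sample points stand in for the mass
    distribution of the model; scale is left unchanged.
    """
    # Translate the centroid of the samples to the origin.
    centered = points - points.mean(axis=0)

    # Covariance matrix of the centered samples (3 x 3, symmetric).
    cov = centered.T @ centered / len(points)

    # eigh returns eigenvalues in ascending order; reorder so that
    # lambda1 >= lambda2 >= lambda3, matching the text.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    axes = eigvecs[:, order]  # columns are e1, e2, e3

    # Enforce a right-handed coordinate system.
    if np.linalg.det(axes) < 0:
        axes[:, 2] = -axes[:, 2]

    # Express every point in the (e1, e2, e3) frame.
    return centered @ axes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Elongated cloud, shifted away from the origin.
    cloud = rng.normal(size=(500, 3)) * [5.0, 2.0, 0.5] + [10.0, -3.0, 7.0]
    normalized = pca_normalize(cloud)
    print(normalized.var(axis=0))  # variances in decreasing order
```

The sign of each individual axis is still ambiguous in this sketch (only right-handedness is enforced), and nearly equal eigenvalues make the axis order unstable, which is the weakness described in the paragraph above.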
Fig. 2 illustrates the problem.

3. Shape matching methods

In this section we discuss 3D shape matching methods. We divide shape matching methods into three broad categories: (1) feature based methods, (2) graph based methods and (3) other methods. Fig. 3 illustrates a more detailed categorization of shape matching methods. Note that the classes of these methods are not completely disjoint. For instance, a graph-based shape descriptor, in some way, also describes the global feature distribution. From this point of view the taxonomy should be a graph.

3.1. Feature based methods

In the context of 3D shape matching, features denote geometric and topological properties of 3D shapes. So 3D shapes can be discriminated by measuring and comparing their features. Feature based methods can be divided into four categories according to the type of shape features used: (1) global features, (2) global feature distributions, (3) spatial maps, and (4) local features. Feature based methods from the first three categories represent features of a shape using a single descriptor consisting of a d-dimensional vector of values, where the dimension d is fixed for all shapes. The value of d can easily be a few hundred. The descriptor of a shape is a point in a high dimensional space, and two shapes are considered to be similar if they are close in this space. Retrieving the k best matches for a 3D query model is equivalent to solving the k nearest neighbors problem. Using the Euclidean distance, matching feature descriptors can be done efficiently in practice by searching in multiple 1D spaces to solve the approximate k nearest neighbor problem, as shown by Indyk and Motwani [36]. In contrast with the feature based methods from the first three categories, local feature based methods describe the 3D shape around each of a number of surface points. For this purpose, a descriptor is used for each surface point instead of a single descriptor for the whole shape.

3.1.1. Global feature based similarity

Global features characterize the global shape of a 3D model.
Examples of these features are the statistical moments of the boundary or the volume of the model, volume-to-surface ratio, or the Fourier transform of the volume or the boundary of the shape. Zhang and Chen [88] describe methods to compute global features such as volume, area, statistical moments, and Fourier transform coefficients efficiently.

Paquet et al. [67] apply bounding boxes, cords-based, moments-based and wavelets-based descriptors for 3D shape matching.

Corney et al. [21] introduce convex-hull based indices like hull crumpliness (the ratio of the object surface area and the surface area of its convex hull), hull packing (the percentage of the convex hull volume not occupied by the object), and hull compactness (the ratio of the cubed surface area of the hull and the squared volume of the convex hull).

Kazhdan et al. [42] describe a reflective symmetry descriptor as a 2D function associating a measure of reflective symmetry to every plane (specified by 2 parameters) through the model's centroid. Every function value provides a measure of global shape, where peaks correspond to the planes of near reflective symmetry, and valleys correspond to the planes of near anti-symmetry. Their experimental results show that the combination of the reflective symmetry descriptor with existing methods provides better results.

Since only global features are used to characterize the overall shape of the objects, these methods are not very discriminative about object details, but their implementation is straightforward. Therefore, these methods can be used as an active filter, after which more detailed comparisons can be made, or they can be used in combination with other methods to improve results.

Global feature methods are able to support user feedback, as illustrated by the following research. Zhang and Chen [89] applied features such as volume-surface ratio, moment invariants and Fourier transform coefficients for 3D shape retrieval. They improve the retrieval performance by an active learning phase in which a
human annotator assigns attributes such as airplane, car, body, and so on to a number of sample models. Elad et al. [28] use a moments-based classifier and a weighted Euclidean distance measure. Their method supports iterative and interactive database searching where the user can improve the weights of the distance measure by marking relevant search results.

3.1.2. Global feature distribution based similarity

The concept of global feature based similarity has been refined recently by comparing distributions of global features instead of the global features directly.

Osada et al. [66] introduce and compare shape distributions, which measure properties based on distance, angle, area and volume measurements between random surface points. They evaluate the similarity between the objects using a pseudo-metric that measures distances between distributions. In their experiments the D2 shape distribution, measuring distances between random surface points, is most effective.

Ohbuchi et al. [64] investigate shape histograms that are discretely parameterized along the principal axes of inertia of the model. The shape descriptor consists of three shape histograms: (1) the moment of inertia about the axis, (2) the average distance from the surface to the axis, and (3) the variance of the distance from the surface to the axis.
Their experiments show that the axis-parameterized shape features work well only for shapes having some form of rotational symmetry.

Ip et al. [37] investigate the application of shape distributions in the context of CAD and solid modeling. They refined Osada's D2 shape distribution function by classifying the distances between 2 random points as (1) IN distances if the line segment connecting the points lies completely inside the model, (2) OUT distances if the line segment connecting the points lies completely outside the model, and (3) MIXED distances if the line segment connecting the points passes both inside and outside the model. Their dissimilarity measure is a weighted distance measure comparing D2, IN, OUT and MIXED distributions. Since their method requires that a line segment can be classified as lying inside or outside the model, the model must define a volume properly. Therefore it can be applied to volume models, but not to polygonal soups. Recently, Ip et al. [38] extended this approach with a technique to automatically categorize a large model database, given a categorization on a number of training examples from the database.

Ohbuchi et al. [63] investigate another extension of the D2 shape distribution function, called the Absolute Angle-Distance histogram, parameterized by a parameter denoting the distance between two random points and by a parameter denoting the angle between the surfaces on which two random points are located. The latter parameter is actually computed as an inner product of the surface normal vectors. In their evaluation experiment this shape distribution function outperformed the D2 distribution function at about 1.5 times higher computational costs. Ohbuchi et al. [65] improved this method further by a multi-resolution approach computing a number of alpha-shapes at different scales, and computing for each alpha-shape their Absolute Angle-Distance descriptor. Their experimental results show that this approach outperforms the Angle-Distance descriptor at the cost of
high processing time needed to compute the alpha-shapes.

Shape distributions distinguish models in broad categories very well: aircraft, boats, people, animals, etc. However, they often perform poorly when having to discriminate between shapes that have similar gross shape properties but vastly different detailed shape properties.

3.1.3. Spatial map based similarity

Spatial maps are representations that capture the spatial location of an object. The map entries correspond to physical locations or sections of the object, and are arranged in a manner that preserves the relative positions of the features in an object. Spatial maps are in general not invariant to rotations, except for specially designed maps. Therefore, typically a pose normalization is done first.

Ankerst et al. [5] use shape histograms as a means of analyzing the similarity of 3D molecular surfaces. The histograms are not built from volume elements but from uniformly distributed surface points taken from the molecular surfaces. The shape histograms are defined on concentric shells and sectors around a model's centroid. Shapes are compared using a quadratic form distance measure on the histograms, which takes into account the distances between the shape histogram bins.

Vranić et al. [85] describe a surface by associating to each ray from the origin the value equal to the distance to the last point of intersection of the model with the ray, and compute spherical harmonics for this spherical extent function. Spherical harmonics form a Fourier basis on a sphere much like the familiar sine and cosine do on a line or a circle. Their method requires pose normalization to provide rotational invariance. Also, Yu et al. [86] propose a descriptor similar to a spherical extent function and a descriptor counting the number of intersections of a ray from the origin with the model. In both cases the dissimilarity between two shapes is computed by the Euclidean distance of the Fourier transforms of the descriptors of the shapes. Their
method requires pose normalization to provide rotational invariance.

Kazhdan et al. [43] present a general approach based on spherical harmonics to transform rotation dependent shape descriptors into rotation independent ones. Their method is applicable to a shape descriptor which is defined as either a collection of spherical functions or as a function on a voxel grid. In the latter case a collection of spherical functions is obtained from the function on the voxel grid by restricting the grid to concentric spheres. From the collection of spherical functions they compute a rotation invariant descriptor by (1) decomposing the function into its spherical harmonics, (2) summing the harmonics within each frequency, and (3) computing the L2-norm for each frequency component. The resulting shape descriptor is a 2D histogram indexed by radius and frequency, which is invariant to rotations about the center of the mass. This approach offers an alternative to pose normalization, because their method obtains rotation invariant shape descriptors. Their experimental results show indeed that in general the performance of the obtained rotation independent shape descriptors is better than the corresponding normalized descriptors. Their experiments include the ray-based spherical harmonic descriptor proposed by Vranić et al. [85]. Finally, note that their approach generalizes the method to compute the voxel-based spherical harmonics shape descriptor, described by Funkhouser et al. [30], which is defined as a binary function on the voxel grid, where the value at each voxel is given by the negatively exponentiated Euclidean Distance Transform of the surface of a 3D model.

Novotni and Klein [61] present a method to compute 3D Zernike descriptors from voxelized models as natural extensions of spherical harmonics based descriptors. 3D Zernike descriptors capture object coherence in the radial direction as well as in the direction along a sphere. Both 3D Zernike descriptors and spherical harmonics based
descriptors achieve rotation invariance. However, by sampling the space only in radial direction the latter descriptors do not capture object coherence in the radial direction, as illustrated by Fig. 4. The limited experiments comparing spherical harmonics and 3D Zernike moments performed by Novotni and Klein show similar results for a class of planes, but better results for the 3D Zernike descriptor for a class of chairs.

Vranić [84] expects that voxelization is not a good idea, because many fine details are lost in the voxel grid. Therefore, he compares his ray-based spherical harmonic method [85] and a variation of it using functions defined on concentric shells with the voxel-based spherical harmonics shape descriptor proposed by Funkhouser et al. [30]. Also, Vranić et al. [85] accomplish pose normalization using the so-called continuous PCA algorithm. In the paper it is claimed that the continuous PCA is better than the conventional PCA and better than the weighted PCA, which takes into account the differing sizes of the triangles of a mesh. In contrast with Kazhdan's experiments [43], the experiments by Vranić show that for ray-based spherical harmonics using the continuous PCA without voxelization is better than using rotation invariant shape descriptors obtained using voxelization.
Perhaps these results are opposite to Kazhdan's results because of the use of different methods to compute the PCA, or the use of different databases, or both.

Kriegel et al. [46,47] investigate similarity for voxelized models. They obtain a spatial map by partitioning a voxel grid into disjoint cells which correspond to the histogram bins. They investigate three different spatial features associated with the grid cells: (1) volume features recording the fraction of voxels from the volume in each cell, (2) solid-angle features measuring the convexity of the volume boundary in each cell, and (3) eigenvalue features estimating the eigenvalues obtained by the PCA applied to the voxels of the model in each cell [47]. A fourth method uses, instead of grid cells, a more flexible partition of the voxels by cover sequence features, which approximate the model by unions and differences of cuboids, each containing a number of voxels [46]. Their experimental results show that the eigenvalue method and the cover sequence method outperform the volume and solid-angle feature methods. Their method requires pose normalization to provide rotational invariance. Instead of representing a cover sequence with a single feature vector, Kriegel et al. [46] represent a cover sequence by a set of feature vectors. This approach allows an efficient comparison of two cover sequences, by comparing the two sets of feature vectors using a minimal matching distance.

The spatial map based approaches show good retrieval results. But a drawback of these methods is that partial matching is not supported, because they do not encode the relation between the features and parts of an object. Further, these methods provide no feedback to the user about why shapes match.

3.1.4. Local feature based similarity

Local feature based methods provide various approaches to take into account the surface shape in the neighbourhood of points on the boundary of the shape. Shum et al. [74] use a spherical coordinate system to map the surface curvature
of 3D objects to the unit sphere. By searching over a spherical rotation space a distance between two curvature distributions is computed and used as a measure for the similarity of two objects. Unfortunately, the method is limited to objects which contain no holes, i.e. have genus zero. Zaharia and Prêteux [87] describe the 3D Shape Spectrum Descriptor, which is defined as the histogram of shape index values, calculated over an entire mesh. The shape index, first introduced by Koenderink [44], is defined as a function of the two principal curvatures on continuous surfaces. They present a method to compute these shape indices for meshes, by fitting a quadric surface through the centroids of the faces of a mesh. Unfortunately, their method requires a non-trivial preprocessing phase for meshes that are not topologically correct or not orientable.

Chua and Jarvis [18] compute point signatures that accumulate surface information along a 3D curve in the neighbourhood of a point. Johnson and Hebert [41] apply spin images that are 2D histograms of the surface locations around a point. They apply spin images to recognize models in a cluttered 3D scene. Due to the complexity of their representation [18,41] these methods are very difficult to apply to 3D shape matching. Also, it is not clear how to define a dissimilarity function that satisfies the triangle inequality.

Körtgen et al. [45] apply 3D shape contexts for 3D shape retrieval and matching. 3D shape contexts are semi-local descriptions of object shape centered at points on the surface of the object, and are a natural extension of 2D shape contexts introduced by Belongie et al. [9] for recognition in 2D images. The shape context of a point p is defined as a coarse histogram of the relative coordinates of the remaining surface points. The bins of the histogram are de-
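As a concrete illustration of the feature distribution methods of Section 3.1.2, the D2 shape distribution (a histogram of distances between random surface points) can be sketched as below. This is a minimal sketch under assumptions that do not come from the source: the model is a triangle mesh given as vertex and face arrays, distances are normalized by their mean to remove scale, histograms are compared with a plain L1 distance, and the names `sample_surface`, `d2_histogram` and `l1_dissimilarity` are hypothetical.

```python
import numpy as np

def sample_surface(vertices, faces, n, rng):
    """Draw n points uniformly from the surface of a triangle mesh."""
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)
    a, b, c = v[f[:, 0]], v[f[:, 1]], v[f[:, 2]]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    tri = rng.choice(len(f), size=n, p=areas / areas.sum())  # area-weighted triangle choice
    s = np.sqrt(rng.random(n))[:, None]                       # sqrt gives uniform density in the triangle
    t = rng.random(n)[:, None]
    return (1 - s) * a[tri] + s * (1 - t) * b[tri] + s * t * c[tri]

def d2_histogram(vertices, faces, n_pairs=10000, bins=64, seed=0):
    """Histogram of distances between random surface point pairs (D2)."""
    rng = np.random.default_rng(seed)
    p = sample_surface(vertices, faces, n_pairs, rng)
    q = sample_surface(vertices, faces, n_pairs, rng)
    d = np.linalg.norm(p - q, axis=1)
    # Assumed normalization: divide by the mean distance; values above 3x the mean are ignored.
    hist, _ = np.histogram(d / d.mean(), bins=bins, range=(0.0, 3.0))
    return hist / hist.sum()

def l1_dissimilarity(h1, h2):
    """One simple way to compare two distributions (an assumption, not the only choice)."""
    return float(np.abs(h1 - h2).sum())

# Usage: a tetrahedron and a scaled, translated copy should give nearly identical histograms.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
h1 = d2_histogram(verts, faces)
h2 = d2_histogram(2.0 * verts + 5.0, faces)
```

Because only pairwise distances enter the histogram, the descriptor is unaffected by rotation and translation without any pose normalization; the scale normalization is the only step that has to be chosen explicitly.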

【25】Robust recovery of subspace structures by low-rank representation

Robust Recovery of Subspace Structures by Low-Rank Representation

Guangcan Liu, Member, IEEE, Zhouchen Lin, Senior Member, IEEE, Shuicheng Yan, Senior Member, IEEE, Ju Sun, Student Member, IEEE, Yong Yu, and Yi Ma, Senior Member, IEEE

Abstract: In this paper, we address the subspace clustering problem. Given a set of data samples (vectors) approximately drawn from a union of multiple subspaces, our goal is to cluster the samples into their respective subspaces and remove possible outliers as well. To this end, we propose a novel objective function named Low-Rank Representation (LRR), which seeks the lowest rank representation among all the candidates that can represent the data samples as linear combinations of the bases in a given dictionary. It is shown that the convex program associated with LRR solves the subspace clustering problem in the following sense: When the data is clean, we prove that LRR exactly recovers the true subspace structures; when the data are contaminated by outliers, we prove that under certain conditions LRR can exactly recover the row space of the original data and detect the outlier as well; for data corrupted by arbitrary sparse errors, LRR can also approximately recover the row space with theoretical guarantees. Since the subspace membership is provably determined by the row space, these further imply that LRR can perform robust subspace clustering and error correction in an efficient and effective way.

Index Terms: Low-rank representation, subspace clustering, segmentation, outlier detection

1 INTRODUCTION

In pattern analysis and signal processing, an underlying tenet is that the data often contains some type of structure that enables intelligent representation and processing. So one usually needs a parametric model to characterize a given set of data. To this end, the well-known (linear) subspaces are possibly the most common choice, mainly because they are easy to compute and often effective in real applications. Several types of visual data, such as motion [1], [2], [3], face [4], and
texture [5], have been known to be well characterized by subspaces. Moreover, by applying the concept of reproducing kernel Hilbert space, one can easily extend the linear models to handle nonlinear data. So the subspace methods have been gaining much attention in recent years. For example, the widely used Principal Component Analysis (PCA) method and the recently established matrix completion [6] and recovery [7] methods are essentially based on the hypothesis that the data is approximately drawn from a low-rank subspace. However, a given dataset can seldom be well described by a single subspace. A more reasonable model is to consider data as lying near several subspaces, namely, the data is considered as samples approximately drawn from a mixture of several low-rank subspaces, as shown in Fig. 1.

The generality and importance of subspaces naturally lead to a challenging problem of subspace segmentation (or clustering), whose goal is to segment (cluster or group) data into clusters with each cluster corresponding to a subspace. Subspace segmentation is an important data clustering problem and arises in numerous research areas, including computer vision [3], [8], [9], image processing [5], [10], and system identification [11]. When the data is clean, i.e., the samples are strictly drawn from the subspaces, several existing methods (e.g., [12], [13], [14]) are able to exactly solve the subspace segmentation problem. So, as pointed out by Rao et al. [3] and Liu et al. [14], the main challenge of subspace segmentation is to handle the errors (e.g., noise and corruptions) that possibly exist in data, i.e., to handle the data that may not strictly follow subspace structures. With this viewpoint, in this paper we therefore study the following robust subspace clustering [15] problem.
Problem 1.1 (Robust Subspace Clustering). Given a set of data samples approximately (i.e., the data may contain errors) drawn from a union of linear subspaces, correct the possible errors and segment all samples into their respective subspaces simultaneously.

[Author footnote]
G. Liu is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China, the Coordinated Science Laboratory, University of Illinois, 1308 West Main Street, Urbana-Champaign, Urbana, IL 61801, and the Department of Electrical and Computer Engineering, National University of Singapore. E-mail: gutty.liu@.
Z. Lin is with the Key Laboratory of Machine Perception (MOE), School of Electronic Engineering and Computer Science, Peking University, No. 5 Yiheyuan Road, Haidian District, Beijing 100871, China. E-mail: zlin@.
S. Yan is with the Department of Electrical and Computer Engineering, National University of Singapore, Block E4, #08-27, Engineering Drive 3, Singapore 117576. E-mail: eleyans@.sg.
J. Sun is with the Department of Electrical Engineering, Columbia University, 1300 S.W. Mudd, 500 West 120th Street, New York, NY 10027. E-mail: jusun@.
Y. Yu is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, No. 800 Dongchuan Road, Minhang District, Shanghai 200240, China. E-mail: yyu@.
Y. Ma is with the Visual Computing Group, Microsoft Research Asia, China, and with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Room 145, 1308 West Main Street, Urbana, IL 61801.
Manuscript received 14 Oct. 2010; revised 8 Sept. 2011; accepted 24 Mar. 2012; published online 4 Apr. 2012. Recommended for acceptance by T. Jebara. For information on obtaining reprints of this article, please send e-mail to: tpami@, and reference IEEECS Log Number TPAMI-2010-10-0786. Digital Object Identifier no. 10.1109/TPAMI.2012.88. 0162-8828/13/$31.00 © 2013 IEEE. Published by the IEEE Computer Society.

Notice that the word "error" generally refers to the deviation between model assumption (i.e., subspaces) and data. It could exhibit as noise [6], missed entries [6], outliers
[16], and corruptions [7] in reality. Fig. 2 illustrates three typical types of errors under the context of subspace modeling. In this paper, we shall focus on the sample-specific corruptions (and outliers) shown in Fig. 2c, with mild concerns to the cases of Figs. 2a and 2b. Notice that an outlier is from a different model other than subspaces and is essentially different from a corrupted sample that belongs to the subspaces. We put them into the same category just because they can be handled in the same way, as will be shown in Section 5.2.

To recover the subspace structures from the data containing errors, we propose a novel method termed Low-Rank Representation (LRR) [14]. Given a set of data samples, each of which can be represented as a linear combination of the bases in a dictionary, LRR aims at finding the lowest rank representation of all data jointly. The computational procedure of LRR is to solve a nuclear norm [17] regularized optimization problem, which is convex and can be solved in polynomial time. By choosing a specific dictionary, it is shown that LRR can well solve the subspace clustering problem: When the data is clean, we prove that LRR exactly recovers the row space of the data; for the data contaminated by outliers, we prove that under certain conditions LRR can exactly recover the row space of the original data and detect the outlier as well; for the data corrupted by arbitrary errors, LRR can also approximately recover the row space with theoretical guarantees. Since the subspace membership is provably determined by the row space (we will discuss this in Section 3.2), these further imply that LRR can perform robust subspace clustering and error correction in an efficient way. In summary, the contributions of this work include:

• We develop a simple yet effective method, termed LRR, which has been used to achieve state-of-the-art performance in several applications such as motion segmentation [4], image segmentation [18], saliency detection [19], and face recognition [4].
• Our work extends the
recovery of corrupted data from a single subspace [7] to multiple subspaces. Compared to [20], which requires the bases of subspaces to be known for handling the corrupted data from multiple subspaces, our method is autonomous, i.e., no extra clean data is required.
• Theoretical results for robust recovery are provided. While our analysis shares similar features with previous work in matrix completion [6] and Robust PCA (RPCA) [7], [16], it is considerably more challenging due to the fact that there is a dictionary matrix in LRR.

2 RELATED WORK

In this section, we discuss some existing subspace segmentation methods. In general, existing works can be roughly divided into four main categories: mixture of Gaussian, factorization, algebraic, and spectral-type methods.

In statistical learning, mixed data is typically modeled as a set of independent samples drawn from a mixture of probabilistic distributions. As a single subspace can be well modeled by a (degenerate) Gaussian distribution, it is straightforward to assume that each probabilistic distribution is Gaussian, i.e., adopting a mixture of Gaussian models. Then the problem of segmenting the data is converted to a model estimation problem. The estimation can be performed either by using the Expectation Maximization (EM) algorithm to find a maximum likelihood estimate, as done in [21], or by iteratively finding a min-max estimate, as adopted by K-subspaces [8] and Random Sample Consensus (RANSAC) [10]. These methods are sensitive to errors. So several efforts have been made for improving their robustness, e.g., the Median K-flats [22] for K-subspaces, the work [23] for RANSAC, and the use of a coding length to characterize a mixture of Gaussian in [5]. These refinements may introduce some robustness. Nevertheless, the problem is still not well solved due to the optimization difficulty, which is a bottleneck for these methods.

Factorization-based methods [12] seek to approximate the given data matrix as a product of two matrices such that the support pattern for one of the factors reveals
the segmentation of the samples. In order to achieve robustness to noise, these methods modify the formulations by adding extra regularization terms. Nevertheless, such modifications usually lead to non-convex optimization problems which need heuristic algorithms (often based on alternating minimization or EM-style algorithms) to solve. Getting stuck at local minima may undermine their performances, especially when the data is grossly corrupted. It will be shown that LRR can be regarded as a robust generalization of the method in [12] (which is referred to as PCA in this paper). The formulation of LRR is convex and can be solved in polynomial time.

Fig. 1. A mixture of subspaces consisting of a 2D plane and two 1D lines. (a) The samples are strictly drawn from the underlying subspaces. (b) The samples are approximately drawn from the underlying subspaces.

Fig. 2. Illustrating three typical types of errors: (a) noise [6], which indicates the phenomena that the data is slightly perturbed around the subspaces (what we show is a perturbed data matrix whose columns are samples drawn from the subspaces), (b) random corruptions [7], which indicate that a fraction of random entries are grossly corrupted, (c) sample-specific corruptions (and outliers), which indicate the phenomena that a fraction of the data samples (i.e., columns of the data matrix) are far away from the subspaces.

Generalized Principal Component Analysis (GPCA) [24] presents an algebraic way to model the data drawn from a union of multiple subspaces. This method describes a subspace containing a data point by using the gradient of a polynomial at that point. Then subspace segmentation is made equivalent to fitting the data with polynomials. GPCA can guarantee the success of the segmentation under certain conditions, and it does not impose any restriction on the subspaces. However, this method is sensitive to noise due to the difficulty of estimating the polynomials from real data, which also causes the high computation cost of GPCA.
Recently, Robust Algebraic Segmentation (RAS) [25] has been proposed to resolve the robustness issue of GPCA. However, the computation difficulty for fitting polynomials is unfathomably large. So RAS can make sense only when the data dimension is low and the number of subspaces is small.

As a data clustering problem, subspace segmentation can be done by first learning an affinity matrix from the given data and then obtaining the final segmentation results by Spectral Clustering (SC) algorithms such as Normalized Cuts (NCut) [26]. Many existing methods, such as Sparse Subspace Clustering (SSC) [13], Spectral Curvature Clustering (SCC) [27], [28], Spectral Local Best-fit Flats (SLBF) [29], [30], the proposed LRR method, and [2], [31], possess such a spectral nature, so-called spectral-type methods. The main difference among various spectral-type methods is the approach for learning the affinity matrix. Under the assumption that the data is clean and the subspaces are independent, Elhamifar and Vidal [13] show that a solution produced by Sparse Representation (SR) [32] could achieve the so-called $\ell_1$ Subspace Detection Property ($\ell_1$-SDP): The within-class affinities are sparse and the between-class affinities are all zeros. In the presence of outliers, it is shown in [15] that the SR method can still obey $\ell_1$-SDP. However, $\ell_1$-SDP may not be sufficient to ensure the success of subspace segmentation [33]. Recently, Lerman and Zhang [34] proved that under certain conditions the multiple subspace structures can be exactly recovered via $\ell_p$ ($p \le 1$) minimization. Unfortunately, since the formulation is not convex, it is still unknown how to efficiently obtain the globally optimal solution. In contrast, the formulation of LRR is convex and the corresponding optimization problem can be solved in polynomial time. What is more, even if the data is contaminated by outliers, the proposed LRR method is proven to exactly recover the right row space, which provably determines the subspace segmentation results (we shall discuss this in Section 3.2). In
the presence of arbitrary errors (e.g., corruptions, outliers, and noise), LRR is also guaranteed to produce near recovery.

3 PRELIMINARIES AND PROBLEM STATEMENT

3.1 Summary of Main Notations

In this paper, matrices are represented with capital symbols. In particular, $I$ is used to denote the identity matrix, and the entries of matrices are denoted by using $[\cdot]$ with subscripts. For instance, $M$ is a matrix, $[M]_{ij}$ is its $(i,j)$th entry, $[M]_{i,:}$ is its $i$th row, and $[M]_{:,j}$ is its $j$th column. For ease of presentation, the horizontal (respectively, vertical) concatenation of a collection of matrices along row (respectively, column) is denoted by $[M_1, M_2, \ldots, M_k]$ (respectively, $[M_1; M_2; \ldots; M_k]$). The block-diagonal matrix formed by a collection of matrices $M_1, M_2, \ldots, M_k$ is denoted by

$$\mathrm{diag}(M_1, M_2, \ldots, M_k) = \begin{bmatrix} M_1 & 0 & 0 & 0 \\ 0 & M_2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & M_k \end{bmatrix}. \qquad (1)$$

The only used vector norm is the $\ell_2$ norm, denoted by $\|\cdot\|_2$. A variety of norms on matrices will be used. The matrix $\ell_0$, $\ell_{2,0}$, $\ell_1$, $\ell_{2,1}$ norms are defined by $\|M\|_0 = \#\{(i,j) : [M]_{ij} \neq 0\}$, $\|M\|_{2,0} = \#\{i : \|[M]_{:,i}\|_2 \neq 0\}$, $\|M\|_1 = \sum_{i,j} |[M]_{ij}|$, and $\|M\|_{2,1} = \sum_i \|[M]_{:,i}\|_2$, respectively. The matrix $\ell_\infty$ norm is defined as $\|M\|_\infty = \max_{i,j} |[M]_{ij}|$. The spectral norm of a matrix $M$ is denoted by $\|M\|$, i.e., $\|M\|$ is the largest singular value of $M$. The Frobenius norm and the nuclear norm (the sum of singular values of a matrix) are denoted by $\|M\|_F$ and $\|M\|_*$, respectively. The Euclidean inner product between two matrices is $\langle M, N \rangle = \mathrm{tr}(M^T N)$, where $M^T$ is the transpose of a matrix and $\mathrm{tr}(\cdot)$ is the trace of a matrix.

The supports of a matrix $M$ are the indices of its nonzero entries, i.e., $\{(i,j) : [M]_{ij} \neq 0\}$. Similarly, its column supports are the indices of its nonzero columns. The symbol $\mathcal{I}$ (superscripts, subscripts, etc.) is used to denote the column supports of a matrix, i.e., $\mathcal{I} = \{(i) : \|[M]_{:,i}\|_2 \neq 0\}$. The corresponding complement set (i.e., zero columns) is $\mathcal{I}^c$.
There are two projection operators associated with $\mathcal{I}$ and $\mathcal{I}^c$: $P_{\mathcal{I}}$ and $P_{\mathcal{I}^c}$. While applying them to a matrix $M$, the matrix $P_{\mathcal{I}}(M)$ (respectively, $P_{\mathcal{I}^c}(M)$) is obtained from $M$ by setting $[M]_{:,i}$ to zero for all $i \notin \mathcal{I}$ (respectively, $i \notin \mathcal{I}^c$).

We also adopt the conventions of using $\mathrm{span}(M)$ to denote the linear space spanned by the columns of a matrix $M$, using $y \in \mathrm{span}(M)$ to denote that a vector $y$ belongs to the space $\mathrm{span}(M)$, and using $Y \in \mathrm{span}(M)$ to denote that all column vectors of $Y$ belong to $\mathrm{span}(M)$.

Finally, in this paper we use several terminologies, including "block-diagonal matrix," "union and sum of subspaces," "independent (and disjoint) subspaces," "full SVD and skinny SVD," "pseudo-inverse," "column space and row space," and "affinity degree." These terminologies are defined in the Appendix, which can be found in the Computer Society Digital Library at http://doi. /10.1109/TPAMI.2012.88.

3.2 Relations between Segmentation and Row Space

Let $X_0$ with skinny SVD $U_0 \Sigma_0 V_0^T$ be a collection of data samples strictly drawn from a union of multiple subspaces (i.e., $X_0$ is clean); the subspace membership of the samples is determined by the row space of $X_0$. Indeed, as shown in [12], when subspaces are independent, $V_0 V_0^T$ forms a block-diagonal matrix: The $(i,j)$th entry of $V_0 V_0^T$ can be nonzero only if the $i$th and $j$th samples are from the same subspace. Hence, this matrix, termed Shape Interaction Matrix (SIM) [12], has been widely used for subspace segmentation.
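The block-diagonal structure of the SIM for clean data from independent subspaces can be checked numerically. The following is a small sketch on synthetic data, not code from the paper: the dimensions, the random bases `B1` and `B2`, the rank tolerance, and the helper name `shape_interaction_matrix` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_per = 30, 3, 20          # ambient dimension, subspace dimension, samples per subspace

# Two random r-dimensional subspaces of R^d; with d >> 2r they are independent almost surely.
B1 = rng.standard_normal((d, r))
B2 = rng.standard_normal((d, r))

# Clean data X0: columns 0..n_per-1 lie in span(B1), the remaining columns in span(B2).
X0 = np.hstack([B1 @ rng.standard_normal((r, n_per)),
                B2 @ rng.standard_normal((r, n_per))])

def shape_interaction_matrix(X, tol=1e-10):
    """Return V V^T, where X = U S V^T is the skinny SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))   # keep only the numerically nonzero singular values
    V = Vt[:rank].T
    return V @ V.T

sim = shape_interaction_matrix(X0)

# Largest absolute entry linking samples from different subspaces vs. the same subspace.
cross = np.abs(sim[:n_per, n_per:]).max()
within = np.abs(sim[:n_per, :n_per]).max()
print(f"max |SIM| across subspaces: {cross:.2e}, within a subspace: {within:.2e}")
```

On such clean data the cross-subspace block vanishes up to rounding error, so thresholding the absolute values of the SIM already separates the two groups of samples.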
Previous approaches simply compute the SVD of the data matrix $X = U_X \Sigma_X V_X^T$ and then use $|V_X V_X^T|$¹ for subspace segmentation. However, in the presence of outliers and corruptions, $V_X$ can be far away from $V_0$, and thus the segmentation using such approaches is inaccurate. In contrast, we show that LRR can recover $V_0 V_0^T$ even when the data matrix $X$ is contaminated by outliers.

1. For a matrix $M$, $|M|$ denotes the matrix with the $(i,j)$th entry being the absolute value of $[M]_{ij}$.

LIU ET AL.: ROBUST RECOVERY OF SUBSPACE STRUCTURES BY LOW-RANK REPRESENTATION 173

If the subspaces are not independent, $V_0 V_0^T$ may not be strictly block-diagonal. This is to be expected, since when the subspaces have nonzero (nonempty) intersections, some samples may belong to multiple subspaces simultaneously. When the subspaces are pairwise disjoint (but not independent), our extensive numerical experiments show that $V_0 V_0^T$ may still be close to block-diagonal, as exemplified in Fig. 3. Hence, recovering $V_0 V_0^T$ is still of interest to subspace segmentation.

3.3 Problem Statement

Problem 1.1 only roughly describes what we want to study. More precisely, this paper addresses the following problem.

Problem 3.1 (Subspace Recovery). Let $X_0 \in \mathbb{R}^{d \times n}$ with skinny SVD $U_0 \Sigma_0 V_0^T$ store a set of $n$ $d$-dimensional samples (vectors) strictly drawn from a union of $k$ subspaces $\{S_i\}_{i=1}^{k}$ of unknown dimensions ($k$ is also unknown). Given a set of observation vectors $X$ generated by

$$X = X_0 + E_0,$$

the goal is to recover the row space of $X_0$ or, equivalently, to recover the true SIM $V_0 V_0^T$.

The recovery of the row space can guarantee high segmentation accuracy, as analyzed in Section 3.2. Also, the recovery of the row space naturally implies success in error correction. So it is sufficient to set the goal of subspace clustering as the recovery of the row space identified by $V_0 V_0^T$. For ease of exploration, we consider the problem under three assumptions of increasing practicality and difficulty.
Assumption 1. The data is clean, i.e., $E_0 = 0$.

Assumption 2. A fraction of the data samples are grossly corrupted and the others are clean, i.e., $E_0$ has sparse column supports as shown in Fig. 2c.

Assumption 3. A fraction of the data samples are grossly corrupted and the others are contaminated by small Gaussian noise, i.e., $E_0$ is characterized by a combination of the models shown in Figs. 2a and 2c.

Unlike [14], the independence assumption on the subspaces is not highlighted in this paper because the analysis in this work focuses on recovering $V_0 V_0^T$ rather than on pursuing a block-diagonal matrix.

4 LOW-RANK REPRESENTATION FOR MATRIX RECOVERY

In this section, we abstractly present the LRR method for recovering a matrix from corrupted observations. The basic theorems and optimization algorithms will be presented. The specific methods and theories for handling the subspace clustering problem are deferred until Section 5.

4.1 Low-Rank Representation

In order to recover the low-rank matrix $X_0$ from the given observation matrix $X$ corrupted by errors $E_0$ ($X = X_0 + E_0$), it is straightforward to consider the following regularized rank minimization problem:

$$\min_{D,E} \ \mathrm{rank}(D) + \lambda \|E\|_{\ell}, \quad \text{s.t.} \quad X = D + E, \tag{2}$$

where $\lambda > 0$ is a parameter and $\|\cdot\|_{\ell}$ indicates a certain regularization strategy, such as the squared Frobenius norm (i.e., $\|\cdot\|_F^2$) used for modeling the noise as shown in Fig. 2a [6], the $\ell_0$ norm adopted by Candès et al. [7] for characterizing the random corruptions as shown in Fig. 2b, and the $\ell_{2,0}$ norm adopted by Liu et al. [14] and Xu et al. [16] for dealing with sample-specific corruptions and outliers.
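To make the three error models concrete, the sketch below generates one synthetic error matrix of each kind and evaluates the regularizer that the text pairs with it. The matrix sizes, noise scale, corruption magnitudes, and corruption fractions are assumptions chosen for illustration only; they are not taken from the paper or from Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 50   # assumed ambient dimension and number of samples

# (a) Small dense Gaussian noise: every entry is perturbed slightly.
E_noise = 0.01 * rng.standard_normal((d, n))

# (b) Random corruptions: roughly 5% of the entries are grossly corrupted.
mask = rng.random((d, n)) < 0.05
E_sparse = np.where(mask, rng.uniform(-5, 5, (d, n)), 0.0)

# (c) Sample-specific corruptions: 5 whole columns (samples) are corrupted.
bad_cols = rng.choice(n, size=5, replace=False)
E_cols = np.zeros((d, n))
E_cols[:, bad_cols] = rng.uniform(-5, 5, (d, 5))

def fro_sq(E):
    """Squared Frobenius norm, the choice for small dense noise."""
    return float((E ** 2).sum())

def l0(E):
    """Number of nonzero entries, the choice for random corruptions."""
    return int(np.count_nonzero(E))

def l20(E):
    """Number of nonzero columns, the choice for sample-specific corruptions."""
    return int(np.count_nonzero(np.linalg.norm(E, axis=0)))

for name, E in [("noise", E_noise), ("sparse", E_sparse), ("columns", E_cols)]:
    print(name, "fro^2 =", round(fro_sq(E), 3), "l0 =", l0(E), "l2,0 =", l20(E))
```

The column-corrupted matrix has an $\ell_{2,0}$ norm equal to the number of corrupted samples even though each corrupted column is dense, whereas the dense noise matrix is only small in the Frobenius sense.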
Suppose $D^*$ is a minimizer with respect to the variable $D$; then it gives a low-rank recovery to the original data $X_0$.

The above formulation is adopted by the recently established Robust PCA (RPCA) method [7], which has been used to achieve state-of-the-art performance in several applications (e.g., [35]). However, this formulation implicitly assumes that the underlying data structure is a single low-rank subspace. When the data is drawn from a union of multiple subspaces, denoted as $S_1, S_2, \ldots, S_k$, it actually treats the data as being sampled from a single subspace defined by $S = \sum_{i=1}^{k} S_i$. Since the sum $\sum_{i=1}^{k} S_i$ can be much larger than the union $\cup_{i=1}^{k} S_i$, the specifics of the individual subspaces are not well considered and so the recovery may be inaccurate.

To better handle the mixed data, here we suggest a more general rank minimization problem defined as follows:

$$\min_{Z,E} \ \mathrm{rank}(Z) + \lambda \|E\|_{\ell}, \quad \text{s.t.} \quad X = AZ + E, \tag{3}$$

where $A$ is a “dictionary” that linearly spans the data space. We call the minimizer $Z^*$ (with regard to the variable $Z$) the “lowest rank representation” of data $X$ with respect to a dictionary $A$. After obtaining an optimal solution $(Z^*, E^*)$, we could recover the original data by using $AZ^*$ (or $X - E^*$). Since $\mathrm{rank}(AZ^*) \leq \mathrm{rank}(Z^*)$, $AZ^*$ is also a low-rank recovery to the original data $X_0$. By setting $A = I$, the formulation (3) falls back to (2). So LRR could be regarded as a generalization of RPCA that essentially uses the standard bases as the dictionary. By choosing an appropriate dictionary $A$, as we will see, the lowest rank representation can recover the underlying row space so as to reveal the true segmentation of data. So, LRR could handle data drawn from a union of multiple subspaces well.

174 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 35, NO. 1, JANUARY 2013

Fig.
3. An example of the matrix $V_0 V_0^T$ computed from dependent subspaces. In this example, we create 11 pairwise disjoint subspaces, each of which is of dimension 20, and draw 20 samples from each subspace without errors. The ambient dimension is 200, which is smaller than the sum of the dimensions of the subspaces. So the subspaces are dependent and $V_0 V_0^T$ is not strictly block-diagonal. Nevertheless, it is simple to see that high segmentation accuracy can be achieved by using the above affinity matrix to do spectral clustering.

4.2 Analysis on the LRR Problem

The optimization problem (3) is difficult to solve due to the discrete nature of the rank function. For ease of exploration, we begin with the “ideal” case that the data is clean. That is, we consider the following rank minimization problem:

$$\min_{Z} \ \mathrm{rank}(Z), \quad \text{s.t.} \quad X = AZ. \tag{4}$$

It is easy to see that the solution to (4) may not be unique. As a common practice in rank minimization problems, we replace the rank function with the nuclear norm, resulting in the following convex optimization problem:

$$\min_{Z} \ \|Z\|_*, \quad \text{s.t.} \quad X = AZ. \tag{5}$$

We will show that the solution to (5) is also a solution to (4) and this special solution is useful for subspace segmentation.

In the following, we shall show some general properties of the minimizer to problem (5). These general conclusions form the foundations of LRR (the proofs can be found in the Appendix, which is available in the online supplemental material).

4.2.1 Uniqueness of the Minimizer

The nuclear norm is convex, but not strongly convex. So it is possible that (5) has multiple optimal solutions. Fortunately, it can be proven that the minimizer to (5) is always uniquely defined by a closed form. This is summarized in the following theorem.

Theorem 4.1. Assume $A \neq 0$ and $X = AZ$ have feasible solution(s), i.e., $X \in \mathrm{span}(A)$. Then,

$$Z^* = A^{\dagger} X \tag{6}$$

is the unique minimizer to (5), where $A^{\dagger}$ is the pseudo-inverse of $A$.

From the above theorem, we have the following corollary which shows that (5) is a good surrogate of (4).

Corollary 4.1. Assume $A \neq 0$ and $X = AZ$ have feasible solutions. Let
$Z^*$ be the minimizer to (5), then $\mathrm{rank}(Z^*) = \mathrm{rank}(X)$ and $Z^*$ is also a minimal rank solution to (4).

4.2.2 Block-Diagonal Property of the Minimizer

By choosing an appropriate dictionary, the lowest rank representation can reveal the true segmentation results. Namely, when the columns of $A$ and $X$ are exactly sampled from independent subspaces, the minimizer to (5) can reveal the subspace membership among the samples. Let $\{S_1, S_2, \ldots, S_k\}$ be a collection of $k$ subspaces, each of which has a rank (dimension) of $r_i > 0$. Also, let $A = [A_1, A_2, \ldots, A_k]$ and $X = [X_1, X_2, \ldots, X_k]$. Then we have the following theorem.

Theorem 4.2. Without loss of generality, assume that $A_i$ is a collection of $m_i$ samples of the $i$th subspace $S_i$, $X_i$ is a collection of $n_i$ samples from $S_i$, and the sampling of each $A_i$ is sufficient such that $\mathrm{rank}(A_i) = r_i$ (i.e., $A_i$ can be regarded as the bases that span the subspace). If the subspaces are independent, then the minimizer to (5) is block-diagonal:

$$Z^* = \begin{bmatrix} Z_1^* & 0 & 0 & 0 \\ 0 & Z_2^* & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & Z_k^* \end{bmatrix},$$

where $Z_i^*$ is an $m_i \times n_i$ coefficient matrix with $\mathrm{rank}(Z_i^*) = \mathrm{rank}(X_i), \ \forall i$.

Note that the claim of $\mathrm{rank}(Z_i^*) = \mathrm{rank}(X_i)$ guarantees the high within-class homogeneity of $Z_i^*$ since the low-rank properties generally require $Z_i^*$ to be dense. This is different from SR, which is prone to produce a “trivial” solution if $A = X$ because the sparsest representation is an identity matrix in this case. It is also worth noting that the above block-diagonal property does not require the data samples to have been grouped together according to their subspace memberships. There is no loss of generality in assuming that the indices of the samples have been rearranged to satisfy the true subspace memberships, because the solution produced by LRR is globally optimal and does not depend on the arrangement of the data samples.

4.3 Recovering Low-Rank Matrices by Convex Optimization

Corollary 4.1 suggests that it is appropriate to use the nuclear norm as a surrogate to replace the rank function in (3). Also, the matrix $\ell_1$ and $\ell_{2,1}$ norms are good relaxations of
the $\ell_0$ and $\ell_{2,0}$ norms, respectively. So we could obtain a low-rank recovery to $X_0$ by solving the following convex optimization problem:

$$\min_{Z,E} \ \|Z\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = AZ + E. \tag{7}$$

Here, the $\ell_{2,1}$ norm is adopted to characterize the error term $E$ since we want to model the sample-specific corruptions (and outliers) as shown in Fig. 2c. For the small Gaussian noise as shown in Fig. 2a, $\|E\|_F^2$ should be chosen; for the random corruptions as shown in Fig. 2b, $\|E\|_1$ is an appropriate choice. After obtaining the minimizer $(Z^*, E^*)$, we could use $AZ^*$ (or $X - E^*$) to obtain a low-rank recovery to the original data $X_0$.

The optimization problem (7) is convex and can be solved by various methods. For efficiency, we adopt in this paper the Augmented Lagrange Multiplier (ALM) [36], [37] method. We first convert (7) to the following equivalent problem:

$$\min_{Z,E,J} \ \|J\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = AZ + E, \ Z = J.$$

This problem can be solved by the ALM method, which minimizes the following augmented Lagrangian function:

$$\mathcal{L} = \|J\|_* + \lambda \|E\|_{2,1} + \mathrm{tr}\left(Y_1^T (X - AZ - E)\right) + \mathrm{tr}\left(Y_2^T (Z - J)\right) + \frac{\mu}{2}\left(\|X - AZ - E\|_F^2 + \|Z - J\|_F^2\right).$$

The above problem is unconstrained. So it can be minimized with respect to $J$, $Z$, and $E$, respectively, by fixing the other variables and then updating the Lagrange multipliers $Y_1$ and $Y_2$, where $\mu > 0$ is a penalty parameter. The inexact ALM method, also called the alternating direction method, is