
Minimally Supervised Acquisition of 3D Recognition Models from Cluttered Images

Andrea Selinger and Randal C. Nelson
Department of Computer Science
University of Rochester
Rochester, NY 14627
selinger,nelson@

Abstract

Appearance-based object recognition systems rely on training from imagery, which allows the recognition of objects without requiring a 3D geometric model. It has been little explored whether such systems can be trained from imagery that is unlabeled, and whether they can be trained from imagery that is not trivially segmentable.

In this paper we present a method for minimally supervised training of a previously developed recognition system from unlabeled and unsegmented imagery. We show that the system can successfully extend an object representation extracted from one black background image to contain object features extracted from unlabeled cluttered images, and can use the extended representation to improve recognition performance on a test set.

1. Introduction

Appearance-based systems have proven quite successful in recognizing 3D objects. They typically rely on training from labeled imagery, which allows the recognition of objects without the requirement of constructing a 3D geometric model. It has been little explored whether such systems can be trained from imagery that is unlabeled, and whether they can be trained from imagery that is not trivially segmentable. A recognition system that could be trained from either unlabeled or unsegmented imagery would be valuable for reducing the effort required to obtain a training set. Of greater practical impact, a 3D recognition system that could be trained from cluttered imagery would be useful for automatic, object-level labeling of image databases, which is an important outstanding problem.

1.1. Appearance-Based Object Recognition Systems

One of the first appearance-based systems was the one developed by Poggio that recognized wire objects [14]. Rao and Ballard [15] describe an approach based on the memorization of the responses of a
set of steerable filters. Mel's SEEMORE system [10] uses a database of stored feature channels representing multiple low-level cues describing contour shape, color and texture. Schiele and Crowley [16] used histograms of the responses of a vector of local linear neighborhood operators. Murase and Nayar [11] find the major principal components of an image dataset, and use the projections of unknown images onto these as indices into a recognition memory. This approach was extended by Huang and Camps [7] to appearance-based parts and relationships among them. Wang and Ben-Arie [19] were able to do generic object detection using vectorial eigenspaces derived from a small set of model shapes which are affine transformed over a wide parameter range. The approach taken by Schmid and Mohr [17] is based on the combination of differential invariants computed at keypoints with a robust voting algorithm and semilocal constraints.

1.2. Previous Work on Unsupervised Training of 3D Recognition Systems

A system that is trained from unlabeled images has to be able to perform unsupervised clustering of multiple views into multiple object classes. In their approach, Ando et al.
[2] observed that although the input dimension of such an image set is very high, the view data of an object often resides in a low-dimensional subspace. Their strategy is to identify multiple non-linear subspaces, each of which contains the views of one object class. A similar approach was taken by Basri et al. [3]. Their method examines the space of all images and partitions the images into sets that form smooth and parallel surfaces in this space. Nearby images are grouped into surface patches that form the nodes of a graph. Further grouping becomes a standard graph clustering problem.

In both cases good results are obtained only if a very large number of clean, segmented images, or even sequences of images, are considered. This is not surprising. If the number of images is high, clustering is affected by a phase transition phenomenon: when the parameters of the image set reach a certain value, the topology of the network suddenly changes from small isolated clusters to a giant one containing very many nodes [4, 6].

The unsupervised learning methods discussed above were based on computing distances between images. Basri et al. [3] use a similarity measure based on the distortion of salient features between images. Gdalyahu and Weinshall [5] use a curve dissimilarity measure. The disadvantage of such similarity measures is that they generally require full object segmentation and cannot deal with scale changes.

Weber et al. [20] developed a system that learns object class models from unsegmented cluttered scenes. Their method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator, and then learns the statistical shape model using expectation maximization. However, the method requires images to be labeled as to the object that they represent.

1.3. Current Work on Training from Unlabeled Cluttered Images

While there has been work on unsupervised training of recognition systems from clean, segmented images, and
on supervised training from cluttered images, there has been no work to our knowledge on unsupervised training of object recognition systems from cluttered images.

In this paper we present a method for minimally supervised training of a recognition system from unlabeled and unsegmented imagery. We use an object recognition system developed previously [?], and train it on one labeled black background image of an object and a set of unlabeled cluttered images. We show that the system can successfully classify the majority of cluttered images containing the seed object and can extend the object's representation using features from these images. Using this representation, recognition performance becomes significantly better than the performance obtained by training the system only on the black background seed image.

2. The Underlying Object Recognition System

The recognition system we adapt is based on a hierarchy of perceptual grouping processes [18]. A 3D object is a fourth-level group (see Figure 1) consisting of a topologically structured set of flexible 2D views, each derived from a training image. In these views, which represent third-level perceptual groups, the visual appearance of an object is represented as a geometrically consistent cluster of several overlapping local context regions. These local context regions represent high-information second-level perceptual groups, and are essentially windows centered on and normalized by key first-level features that contain a representation of all first-level features that intersect the window. The first-level features are the result of first-level grouping processes run on the image, typically representing connected contour fragments.

Figure 1: The perceptual grouping hierarchy

In more detail, distinctive local features called keys, selected from the first-level groups in our hierarchy, seed and normalize keyed context regions, the second-level groups.
In the current system, the keys are contours automatically extracted from the image. The second level of grouping into keyed context patches amplifies the power of the key features by providing a means of verifying whether the key is likely to be part of a particular object.

Even these high-information local context regions are generally consistent with several object/pose hypotheses; hence we use the third-level grouping process to organize the context patches into globally consistent clusters that represent hypotheses about object identity and pose. This is done through a hypothesis database that maintains a probabilistic estimate of the likelihood of each third-level group (cluster) based on statistics about the frequency of matching context patches in the primary database. The idea is similar to a multi-dimensional Hough transform without the space problems that arise in an explicit decomposition of the parameter space. In our case, since 3D objects are represented by a set of views, the clusters represent two-dimensional rigid transforms of specific views. The use of keyed contexts rather than first-level groups gives the voting features sufficient power to substantially ameliorate well-known problems with false positives in Hough-like voting schemes.

The system obtains a recognition rate of 97% when trained on images of 24 objects taken against a clean black background over the whole viewing sphere, and tested on images taken between the training views, under the same good conditions. The test objects range from sports cars and fighter planes to snakes and lizards. Some of them can be seen in Figure 2. Performance remains relatively good in the case of clutter and partial occlusion [?].

Figure 2: Some of the objects used in testing the system.

The feature-based nature of the algorithm provides some immunity to the presence of clutter and occlusion in the scene; this, in fact, was one of the primary design goals. This is in contrast to appearance-based schemes that use the
structure of the full object, and require good prior segmentation.

3. Minimally Supervised Training

The basic idea behind our current work is that if the recognition system is trained on one view of an object, it will be able to recognize views that are topologically close to the original view. After adding these views to the object representation, the system will be able to recognize additional views in an iterative process. This will lead to the development of clusters of views characterizing each object, clusters that ideally will be able to cover the entire viewing sphere.

3.1. Training from Clean Images

As a validation experiment, we used a corpus of clean, black background images of objects (some of them seen in Figure 2). The images were taken about 20 degrees apart over the viewing sphere.

We seeded the recognition system with a single black background image of an object and then iteratively found the best match to the current representation over the entire corpus and rebuilt the object representation incorporating the new image. We stopped the procedure the first time an incorrect image was attracted. The overall procedure is essentially a minimum spanning tree algorithm, which is a standard clustering technique. In practice, the complexity of this algorithm, which would be a problem with large databases, can be avoided by modifying the representation as soon as a “good enough” match is found.

Using this algorithm we attracted around 50% of the images of each object to the representation before making the first incorrect classification. Figure 3 shows the seed image for the sports-car and some of the sports-car images attracted to the representation through this method. It also shows the non-sports-car image attracted at the last step that stopped the growth process. The image is actually an odd view of the toy-rabbit, one that, interestingly, looks a bit like the sports-car.

Figure 3: Top: Seed image of sports-car for propagation experiment and terminating non-car image. Bottom: Car
images attracted to the representation during the experiment.

To improve the performance of the system we experimented with a denser image set. Such a set would increase the percentage of images attracted to object representations by reducing the number of isolated views that are very different from the other views. Many of our objects, including the car and the aircraft, have a locus of relatively “pathological” views around the equator, where appearance changes very rapidly and recognition is more difficult. This is due to a “flattened” axis in the 3D shape, which is a common general property. To investigate the effect, we acquired an additional image set for the car and aircraft, with the distance between views decreasing in an adaptive fashion, reaching 5 degrees at the equator. Learning performance improved significantly in this case, and more than 90% of the images were attracted to object representations before an incorrect classification was made (94.6% of the sports-car images and 91.9% of the fighter images were attracted).

Figure 4 shows the tree by which the sports-car views attracted each other to the growing representation. The 371 sports-car images in the corpus cover one hemisphere, and are represented by squares on the polar coordinate system in the figure. Dark squares represent images attracted to the representation prior to the first false match. Arrows show the topology of the growth process. The attraction process generally operated between close geometric neighbors, with the exception of some views separated by 180 degrees that were matched due to the symmetrical shape of the car. The images not attracted to the representation are pathological views that could not be matched to any other views of the object.

Figure 4: Tree by which training images were attracted to representation during growth process.

Further results on training from black background images can be found in [13].

3.2. Training from Cluttered Images

As mentioned in the
introduction, a system trained from cluttered imagery would be an important step towards object-level labeling of image databases. A method similar to the one used for clean, black background images can be used for minimally supervised training from cluttered images as long as we have a way to segment the object features from the features coming from clutter. This is very important since features arising from clutter could be matched to features from other objects, giving us false positives.

We start by seeding the system with a clean, black background image of the object. The recognition system will be able to recognize topologically close views of the same object even if they are taken against cluttered backgrounds.

The difficulty is to extract the features belonging to the object from the image and add only those to the object representation. Adding clutter features would corrupt the representation and the object model would no longer be useful. An obvious way of extracting the object features would be to find the object's occluding contour and extract all the features inside the contour. The difficulty, of course, is finding this contour in our dataset.

The occluding contour of the object in the seed image is easy to obtain. Since the seed image is taken against a clean, black background, we can simply threshold the image and extract the contour of the white blob. Subsequent images, however, are cluttered, and thus more difficult. The position of the object is known from the output of the object recognition system. But as the object changes its appearance in these images, the shape of the contour also changes. The transformation that morphs the model view into the new view can be used to find the contour of the new view of the object. We can find this transformation using a deformable templates algorithm.

To do this, we use a relatively generic algorithm adapted from the method of Jain et al. [8]. In this method the prior shape of the object of interest is specified as a template
containing edge/boundary information in the form of a bitmap. Deformed templates are obtained by applying parametric transforms to the prototype, and the variability in the shape is achieved by imposing a probability distribution on the admissible mappings. The goal is to find, among all such admissible transformations, the one that minimizes the Bayesian objective function

[Equation (1)]

where the first term is the potential energy linking the edge positions and gradient directions in the input image to the object boundary specified by the deformable template, and the second term penalizes the various deformations of the template.

The deformation is described by two displacement functions from the space spanned by the following orthogonal bases:

[Equations (2) and (3)]

Figure 5 illustrates the deformations of a grid obtained from a restricted set of bases of the displacement functions, for different values of m and n. The original grid can be seen on the left.

Figure 5: Original grid (top left) and deformed grids for different values of m and n

In our implementation of this algorithm, we started with the position and shape of the template given by the recognition process as initial estimates and we performed gradient descent on the scale, rotation, translation and deformation parameters, trying to minimize the objective function.

An example can be seen in Figure 6. The image on top is a cluttered image of a toy-bear. The bottom left image shows the toy-bear template the image was matched to, in the size and position given by the object recognition process. The bottom right image shows the template after the deformable templates algorithm was performed.

Figure 6: Top: cluttered bear image. Bottom left: original template in green. Bottom right: deformed template in red

Assuming that the object recognition procedure gives us a reasonable starting hypothesis, this deformable templates algorithm will find a transformation between the model object and the object in the cluttered image that matches image to model features much more closely than is specified by our
relatively generic recognition procedure. We will use this transformation to find the object contour in the image.

Most of the object features will be inside this deformed contour. But the object can have additional features that are not inside this contour since they were not present in the original model (for example, an emerging handle on a cup). These need to be added to the object representation. To address this problem we also add the features that are close enough to the deformed contour and are longer than a specified threshold. An example of the feature extraction process can be seen in Figure 7, which shows a view of the sports-car on a cluttered background, the curves extracted from the image and the curves judged as belonging to the object.

Figure 7: Top: Cluttered image of sports-car. Bottom left: Curves extracted from image. Bottom right: Curves added to sports-car representation.

Since the new view can miss features that were present in the model, or can have additional features that were not in the model, the contour modified through deformable templates has to be updated to reflect the object features extracted from the image. We do this by running a variant of the snakes algorithm on the (already deformed) occluding contour [21] using the newly extracted object features. This pulls the contour much more tightly than the deformable template did, but it cannot be used without the initialization and feature filtering provided by that step, due to the density of clutter in the images.

This approach to snakes is more stable than the original version [9], it is flexible, allows hard constraints, and runs much faster than the dynamic programming method [1]. The underlying method is a greedy algorithm that minimizes the energy expressed similarly to Kass' method:

$$E = \int \left( \alpha(s)\,E_{\mathrm{cont}} + \beta(s)\,E_{\mathrm{curv}} + \gamma(s)\,E_{\mathrm{image}} \right) ds \qquad (4)$$

Using $|v_i - v_{i-1}|^2$ in the continuity term (where $v_i$ are the points on the curve) causes the curve to shrink, as this is actually minimizing the distance between points. It also contributes to the problem of points bunching up on strong portions
of the contour. To avoid this, the method uses the difference between the average distance between points, $\bar{d}$, and the distance between the two points under consideration: $\bar{d} - |v_i - v_{i-1}|$. The image energy is the gradient magnitude normalized by $(min - mag)/(max - min)$, where $mag$ is the gradient magnitude at a point, and $max$ and $min$ are the maximum and minimum gradients in the point's neighborhood.

Figure 8 shows an example of how the snakes algorithm modifies the contour of the cup given a set of curves. The handle in the model has no support in the image, so the contour shrinks closer to the object features.

Figure 8: Results of the snakes algorithm. Top: cup curves. Bottom: original contour (left, in red) and deformed contour (right, in red).

With the deformable templates and snakes algorithms implemented, the training algorithm, given a recognizer match, is as follows:

1. Run deformable templates to find the transformation between the matched object model and the object in the image.
2. Use the transformation from step 1 to position the previous object contour in the image.
3. Extract the features (curves) that are inside or very close to the contour, and add these to the representation as a new model view.
4. Find the new occluding contour for this model view by running the snakes algorithm to further adjust the previous contour using the curves extracted in step 3.

3.3. Experimental Results

For training we used a corpus of 264 cluttered images, with 48 views per object, taken around the whole viewing sphere (except for the sports-car, which had only the 24 views of the top hemisphere). These views are not evenly distributed around the viewing sphere. Instead, they are canonical views, creating a situation more similar to an image database, where it is unlikely that we would find odd views that are difficult to recognize even by people (the so-called pathological views). We took these pictures by placing the objects on a colorful poster and moving them around to make sure that the clutter features did not repeat in the images.

We seeded the system with a
clean, black background view of some object (e.g. the sports-car) and then iteratively found the best match to the current representation over the entire corpus and rebuilt the representation incorporating the model features from the new image. As in the case of the clean, black background images, we stopped the procedure the first time a non-sports-car image was attracted. This gives us an idea of the best possible performance. To make the process unsupervised, we found that it is generally possible to set thresholds on the goodness of the match that define when we should stop the growth process.

In the case of the sports-car all the correct views were attracted to the representation before any misclassification was made. The tree by which the views were attracted can be seen in Figure 9. As in Figure 4, the views are represented by circles on a polar coordinate system. Again the attraction process involves close geometric neighbors, as well as some symmetric images due to the front/rear symmetry of our car.

Figure 9: Tree by which cluttered training images were attracted to representation during growth process. The black background seed image is marked with a circle.

Figure 10 shows two of the views that were used in extending the sports-car representation.

Figure 10: Cluttered images attracted to the sports car representation

The method is useful if a system trained this way can obtain better recognition results than a system trained only on the original, black background seed image. We trained the system on cluttered images of objects and we tested recognition performance against the system trained only on the seed images. For testing we used images of the objects taken in the same clutter conditions. These images were taken at about every 20 degrees around the viewing sphere, with 106 images per object (only 53 for the sports-car), for a total of 583 images.

Performance improved significantly for all the objects. ROC curves for the sports-car and plane can be seen in Figures
11 and 12, where the dashed line is the performance when the system is trained only on the single black background seed image, while the continuous line is the performance achieved after training on cluttered images. The dotted line is the performance when the system is trained on the complete set of black background images. When training on cluttered images, recognition performance always improves, and in some cases approaches the best performance we can achieve with a clean, dense training set.

Figure 11: ROC comparison for sports-car. Recognition performance improves after training on cluttered images

Figure 12: ROC comparison for plane. Recognition performance improves after training on cluttered images

To test the system's behavior when dealing with several objects at the same time, we trained the system on three of our objects (sports-car, plane and fighter) simultaneously. We seeded the system on one black background view of each object, then we iteratively found the best match over the entire corpus of sparse, cluttered background images and rebuilt the object representations incorporating the model features from the new image. We stopped when the first incorrect match was attracted to the database. The number of images attracted to the representation of each object was approximately the same as in the case of independent training.

We tested the recognition performance of the system trained this way by performing forced-choice recognition on the images of the three objects from the dense cluttered background set. Out of the 265 test images, 146 were correctly classified when the system was trained only on the single black background seed images, and 204 were correctly classified when the system was trained on the cluttered background images. Thus performance increased from 55.1% to 77.0%. Tables 1 and 2 show the results for each object before and after training on cluttered images. Table 3 shows the recognition results after training the system on labeled black background images of the
objects in the same poses as in the unlabeled cluttered images. Overall performance is 86.4%, somewhat better than the result we obtained by training from clutter.

[Tables 1 and 2: forced choice recognition results per class (sports-car, 53 samples; fighter, 106 samples; plane, 106 samples) before and after training on cluttered images; table bodies not recoverable]

Table 3: Forced choice recognition results after training on clean images

[...] imagery. We tested the system by training on a single labeled black background image of an object accompanied by a set of unlabeled cluttered images of several objects. We showed that recognition performance can be significantly improved when the object representation is augmented by features coming from the cluttered images.

One problem that we encountered was the existence of pathological views of objects, where object appearance changes very rapidly and recognition is more difficult. This can be overcome by increasing the sampling density over these regions or by allowing some user interaction, as we saw in the case of clean, black background images. Our current sampling is quite sparse, with up to 30 degrees between some test cases and the closest training view. User interaction can also solve a different issue: the existence of accidentally similar views, situations where one view of an object looks like another object. The main issue is to avoid too much interaction.

References

[1] A. Amini, S. Tehrani, and T. Weymouth. Using dynamic programming for minimizing the energy of active contours in the presence of hard constraints. In International Conference on Computer Vision (ICCV88), pages 95–99, 1988.
[2] H. Ando, S. Suzuki, and T. Fujita. Unsupervised visual learning of three-dimensional objects using a modular network architecture. Neural Networks, 12:1037–1051, 1999.
[3] R. Basri, D. Roth, and D. Jacobs. Clustering appearances of 3d objects. In IEEE Conference on Computer Vision and Pattern Recognition, pages 414–420, Santa Barbara, CA, June 1998.
[4] P. Erdős and A. Rényi. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of
Science, 5:17–61, 1960.
[5] Y. Gdalyahu and D. Weinshall. Automatic hierarchical classification of silhouettes of 3d objects. In IEEE Conference on Computer Vision and Pattern Recognition, pages 787–793, Santa Barbara, CA, June 1998.
[6] T. Hogg and J. O. Kephart. Phase transitions in high-dimensional pattern classification. Computer Systems Science and Engineering, 5(4):223–232, October 1990.
[7] C. Huang, O. Camps, and T. Kanungo. Object recognition using appearance-based parts and relations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 878–884, San Juan, Puerto Rico, June 1997.
[8] A. K. Jain, Y. Zhong, and S. Lakshmanan. Object matching using deformable templates. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 267–277, 1996.
[9] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, pages 321–331, 1988.
[10] B. Mel. Seemore: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9:777–804, 1997.
[11] H. Murase and K. Nayar. Visual learning and recognition of 3-d objects from appearance. Int. Journal of Computer Vision, 14(1):5–24, 1995.
[12] R. Nelson and A. Selinger. Large-scale tests of a keyed, appearance-based 3-d object recognition system. Vision Research, 38(15-16):2469–88, August 1998.
[13] R. Nelson and A. Selinger. Learning 3d recognition models for general objects from unlabeled imagery: An experiment in intelligent brute force. In International Conference on Pattern Recognition (ICPR2000), pages 1–8, Barcelona, Spain, September 2000.
[14] T. Poggio and S. Edelman. A network that learns to recognize three-dimensional objects. Nature, 343(1):263–266, 1990.
[15] R. Rao and D. Ballard. An active vision architecture based on iconic representations. Artificial Intelligence, 78:461–505, 1995.
[16] B. Schiele and J. Crowley. Object recognition using multidimensional receptive field histograms. In Proc. Fourth European Conference on Computer Vision, pages 610–619, 1996.
[17] C. Schmid and R. Mohr. Combining greyvalue invariants with local constraints for object recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 872–877, San Francisco, CA, June 1996.
[18] A. Selinger and R. Nelson. A perceptual grouping hierarchy for appearance-based 3d object recognition. Computer Vision and Image Understanding, 76(1):83–92, October 1999.
[19] Z. Wang and J. Ben-Arie. Generic object detection using model based segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 428–433, Fort Collins, CO, June 1999.
[20] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proc. 6th European Conference on Computer Vision, Dublin, Ireland, June 2000.
[21] D. Williams and M. Shah. A fast algorithm for active contours and curvature estimation. CVGIP: Image Understanding, pages 14–26, 1992.
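
The iterative growth procedure of Section 3.1 (seed with one view, repeatedly attract the best-matching image, rebuild the representation, stop on a match threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the names `grow_representation`, `match_score` and `min_score` are hypothetical, matching against the representation is approximated by the best score over the views already attracted, and the "views" in the toy run are viewing angles in degrees rather than images.

```python
def grow_representation(seed, corpus, match_score, min_score):
    """Greedy, Prim-style growth of a set of views from a single seed.

    seed        -- the initial (labeled) view
    corpus      -- candidate (unlabeled) views
    match_score -- hypothetical similarity function; higher is better
    min_score   -- hypothetical threshold on match quality; growth stops
                   once the best available match falls below it
    Returns the attracted views in order, and the (parent, child) tree edges.
    """
    model = [seed]
    remaining = [v for v in corpus if v != seed]
    edges = []
    while remaining:
        # Best match between any view already in the model and any
        # view still outside it (assumed stand-in for matching an
        # image against the whole current representation).
        score, parent, view = max(
            ((match_score(m, c), m, c) for m in model for c in remaining),
            key=lambda t: t[0],
        )
        if score < min_score:
            break  # no sufficiently good match: stop the growth process
        model.append(view)
        remaining.remove(view)
        edges.append((parent, view))
    return model, edges


# Toy run: views are angles; similarity is negative angular difference.
angles = [0, 20, 40, 60, 140, 160]
model, edges = grow_representation(0, angles, lambda a, b: -abs(a - b), -30)
print(model)  # [0, 20, 40, 60]
print(edges)  # [(0, 20), (20, 40), (40, 60)]
```

In the toy run the views at 140 and 160 degrees are never attracted because no view in the model lies within 30 degrees of them, which mirrors how isolated views remain outside the representation when sampling is sparse.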
