Algorithms for Large Scale Markov Blanket Discovery

Scalable Nearest Neighbor Algorithms for High Dimensional Data

Scalable Nearest Neighbor Algorithms for High Dimensional Data
Marius Muja, Member, IEEE, and David G. Lowe, Member, IEEE

Abstract—For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.

Index Terms—Nearest neighbor search, big data, approximate search, algorithm configuration

1 Introduction

The most computationally expensive part of many computer vision algorithms consists of searching for the most similar matches to high-dimensional vectors, also referred to as nearest neighbor matching. Having an efficient algorithm for performing fast nearest neighbor matching in large data sets can bring speed improvements of several orders of magnitude to many applications.

Examples of such problems include finding the best matches for local image features in large data sets [1], [2], clustering local features into visual words using the k-means or similar algorithms [3], global image feature matching for scene recognition [4], human pose estimation [5], matching deformable shapes for object recognition [6] or performing normalized cross-correlation (NCC) to compare image patches in large data sets [7]. The nearest neighbor search problem is also of major importance in many other applications, including machine learning, document retrieval, data compression, bio-informatics, and data analysis. It has been shown that using large training sets is key to obtaining good real-life performance from many computer vision methods [2], [4], [7]. Today the Internet is a vast resource for such training data [8], but for large data sets the performance of the algorithms employed quickly becomes a key issue.

When working with high dimensional features, as with most of those encountered in computer vision applications (image patches, local descriptors, global image descriptors), there is often no known nearest-neighbor search algorithm that is exact and has acceptable performance. To obtain a speed improvement, many practical applications are forced to settle for an approximate search, in which not all the neighbors returned are exact, meaning some are approximate but typically still close to the exact neighbors. In practice it is common for approximate nearest neighbor search algorithms to provide more than 95 percent of the correct neighbors and still be two or more orders of magnitude faster than linear search. In many cases the nearest neighbor search is just a part of a larger application containing other approximations and there is very little loss in performance from using approximate rather than exact neighbors.

In this paper we evaluate the most promising nearest-neighbor search algorithms in the literature, propose new algorithms and improvements to existing ones, present a method for performing automatic algorithm selection and parameter optimization, and discuss the problem of scaling to very large data sets using compute clusters. We have released all this work as an open source library named fast library for approximate nearest neighbors (FLANN).

1.1 Definitions and Notation

In this paper we are concerned with the problem of efficient nearest neighbor search in metric spaces. The nearest neighbor search in a metric space can be defined as follows: given a set of points P = {p_1, p_2, ..., p_n} in a metric space M and a query point q ∈ M, find the element NN(q, P) ∈ P that is the closest to q with respect to a metric distance d : M × M → R:

NN(q, P) = argmin_{x ∈ P} d(q, x).

The nearest neighbor problem consists of finding a method to pre-process the set P such that the operation NN(q, P) can be performed efficiently.

We are often interested in finding not just the first closest neighbor, but several closest neighbors. In this case, the search can be performed in several ways, depending on the number of neighbors returned and their distance to the query point: K-nearest neighbor (KNN) search, where the goal is to find the closest K points to the query point, and radius nearest neighbor (RNN) search, where the goal is to find all the points located closer than some distance R from the query point.

We define the K-nearest neighbor search more formally in the following manner:

KNN(q, P, K) = A,

where A is a set that satisfies the following conditions:

|A| = K, A ⊆ P, ∀x ∈ A, y ∈ P − A : d(q, x) ≤ d(q, y).

The K-nearest neighbor search has the property that it will always return exactly K neighbors (if there are at least K points in P).

The radius nearest neighbor search can be defined as follows:

RNN(q, P, R) = {p ∈ P | d(q, p) < R}.

Depending on how the value R is chosen, the radius search can return any number of points between zero and the whole data set. In practice, passing a large value R to radius search and having the search return a large number of points is often very inefficient. Radius K-nearest neighbor (RKNN) search is a combination of K-nearest neighbor search and radius search, where a limit can be placed on the number of points that the radius search should return:

RKNN(q, P, K, R) = A, such that |A| ≤ K, A ⊆ P, ∀x ∈ A, y ∈ P − A : d(q, x) < R and d(q, x) ≤ d(q, y).

2 Background

Nearest-neighbor search is a fundamental part of many computer vision algorithms and of significant importance in many other fields, so it has been widely studied. This section presents a review of previous work in this area.

2.1 Nearest Neighbor Matching Algorithms

We review the most widely used nearest neighbor techniques, classified in three categories: partitioning trees, hashing techniques and neighboring graph techniques.

2.1.1 Partitioning Trees

The kd-tree [9], [10] is one of the best known nearest neighbor algorithms. While very effective in low dimensionality spaces, its performance quickly decreases for high dimensional data. Arya et al. [11] propose a variation of the k-d tree to be used for approximate search by considering (1 + ε)-approximate nearest neighbors, points for which dist(p, q) ≤ (1 + ε) dist(p*, q), where p* is the true nearest neighbor. The authors also propose the use of a priority queue to speed up the search. This method of approximating the nearest neighbor search is also referred to as "error bound" approximate search.

Another way of approximating the nearest neighbor search is by limiting the time spent during the search, or "time bound" approximate search. This method is proposed in [12], where the k-d tree search is stopped early after examining a fixed number of leaf nodes. In practice the time-constrained approximation criterion has been found to give better results than the error-constrained approximate search.

Multiple randomized k-d trees are proposed in [13] as a means to speed up approximate nearest-neighbor search.
In [14] we perform a wide range of comparisons showing that the multiple randomized trees are one of the most effective methods for matching high dimensional data.

Variations of the k-d tree using non-axis-aligned partitioning hyperplanes have been proposed: the PCA-tree [15], the RP-tree [16], and the trinary projection tree [17]. We have not found such algorithms to be more efficient than a randomized k-d tree decomposition, as the overhead of evaluating multiple dimensions during search outweighed the benefit of the better space decomposition.

Another class of partitioning trees decompose the space using various clustering algorithms instead of using hyperplanes as in the case of the k-d tree and its variants. Examples of such decompositions include the hierarchical k-means tree [18], the GNAT [19], the anchors hierarchy [20], the vp-tree [21], the cover tree [22] and the spill-tree [23]. Nister and Stewenius [24] propose the vocabulary tree, which is searched by accessing a single leaf of a hierarchical k-means tree. Leibe et al. [25] propose a ball-tree data structure constructed using a mixed partitional-agglomerative clustering algorithm. Schindler et al. [26] propose a new way of searching the hierarchical k-means tree. Philbin et al. [2] conducted experiments showing that an approximate flat vocabulary outperforms a vocabulary tree in a recognition task. In this paper we describe a modified k-means tree algorithm that we have found to give the best results for some data sets, while randomized k-d trees are best for others.

Jégou et al. [27] propose the product quantization approach in which they decompose the space into low dimensional subspaces and represent the data set points by compact codes computed as quantization indices in these subspaces. The compact codes are efficiently compared to the query points using an asymmetric approximate distance. Babenko and Lempitsky [28] propose the inverted multi-index, obtained by replacing the standard quantization in an inverted index with product quantization, obtaining a denser subdivision of the search space. Both these methods are shown to be efficient at searching large data sets and they should be considered for further evaluation and possible incorporation into FLANN.

2.1.2 Hashing Based Nearest Neighbor Techniques

Perhaps the best known hashing based nearest neighbor technique is locality sensitive hashing (LSH) [29], which uses a large number of hash functions with the property that the hashes of elements that are close to each other are also likely to be close. Variants of LSH such as multi-probe LSH [30] improve the high storage costs by reducing the number of hash tables, and LSH Forest [31] adapts better to the data without requiring hand tuning of parameters.

The performance of hashing methods is highly dependent on the quality of the hashing functions they use, and a large body of research has been targeted at improving hashing methods by using data-dependent hashing functions computed using various learning techniques: parameter sensitive hashing [5], spectral hashing [32], randomized LSH hashing from learned metrics [33], kernelized LSH [34], learnt binary embeddings [35], shift-invariant kernel hashing [36], semi-supervised hashing [37], optimized kernel hashing [38] and complementary hashing [39].

The different LSH algorithms provide theoretical guarantees on the search quality and have been successfully used in a number of projects; however, our experiments reported in Section 4 show that in practice they are usually outperformed by algorithms using space partitioning structures such as the randomized k-d trees and the priority search k-means tree.

2.1.3 Nearest Neighbor Graph Techniques

Nearest neighbor graph methods build a graph structure in which points are vertices and edges connect each point to its nearest neighbors. The query points are used to explore this graph using various strategies in order to get closer to their nearest neighbors. In [40] the authors select a few well separated elements in the graph as "seeds" and start the graph exploration from those seeds in a best-first fashion. Similarly, the authors of [41] perform a best-first exploration of the k-NN graph, but use a hill-climbing strategy and pick the starting points at random. They present recent experiments that compare favourably to randomized KD-trees, so the proposed algorithm should be considered for future evaluation and possible incorporation into FLANN. The nearest neighbor graph methods suffer from a quite expensive construction of the k-NN graph structure. Wang et al. [42] improve the construction cost by building an approximate nearest neighbor graph.

2.2 Automatic Configuration of NN Algorithms

There have been hundreds of papers published on nearest neighbor search algorithms, but there has been little systematic comparison to guide the choice among algorithms and set their internal parameters. In practice, and in most of the nearest neighbor literature, setting the algorithm parameters is a manual process carried out using various heuristics that rarely makes use of more systematic approaches.

Bawa et al. [31] show that the performance of the standard LSH algorithm is critically dependent on the length of the hashing key and propose the LSH Forest, a self-tuning algorithm that eliminates this data dependent parameter.

In a previous paper [14] we have proposed an automatic nearest neighbor algorithm configuration method by combining grid search with a finer grained Nelder-Mead downhill simplex optimization process [43]. There has been extensive research on algorithm configuration methods [44], [45]; however we are not aware of papers that apply such techniques to finding optimum parameters for nearest neighbor algorithms. Bergstra and Bengio [46] show that, except for small parameter spaces, random search can be a more efficient strategy for parameter optimization than grid search.

3 Fast Approximate NN Matching

Exact search is too costly for many applications, so this has generated interest in approximate nearest-neighbor search algorithms which return non-optimal neighbors in some cases, but can be orders of magnitude faster than exact search.

After evaluating many different algorithms for approximate nearest neighbor search on data sets with a wide range of dimensionality [14], [47], we have found that one of two algorithms gave the best performance: the priority search k-means tree or the multiple randomized k-d trees. These algorithms are described in the remainder of this section.

3.1 The Randomized k-d Tree Algorithm

The randomized k-d tree algorithm [13] is an approximate nearest neighbor search algorithm that builds multiple randomized k-d trees which are searched in parallel. The trees are built in a similar manner to the classic k-d tree [9], [10], with the difference that where the classic kd-tree algorithm splits data on the dimension with the highest variance, for the randomized k-d trees the split dimension is chosen randomly from the top N_D dimensions with the highest variance. We used the fixed value N_D = 5 in our implementation, as this performs well across all our data sets and does not benefit significantly from further tuning.

When searching the randomized k-d forest, a single priority queue is maintained across all the randomized trees. The priority queue is ordered by increasing distance to the decision boundary of each branch in the queue, so the search will explore first the closest leaves from all the trees. Once a data point has been examined (compared to the query point) inside a tree, it is marked in order to not be re-examined in another tree. The degree of approximation is determined by the maximum number of leaves to be visited (across all trees), returning the best nearest neighbor candidates found up to that point.

Fig. 1 shows the value of searching in many randomized kd-trees at the same time. It can be seen that the performance improves with the number of randomized trees up to a certain point (about 20 random trees in this case) and that increasing the number of random trees further leads to static or decreasing performance. The memory overhead of using multiple random trees increases linearly with the number of trees, so at some point the speedup may not justify the additional memory used.

Fig. 2 gives an intuition behind why exploring multiple randomized kd-trees improves the search performance. When the query point is close to one of the splitting hyperplanes, its nearest neighbor lies with almost equal probability on either side of the hyperplane, and if it lies on the opposite side of the splitting hyperplane, further exploration of the tree is required before the cell containing it will be visited. Using multiple random decompositions increases the probability that in one of them the query point and its nearest neighbor will be in the same cell.

[Fig. 1: Speedup obtained by using multiple randomized kd-trees (100K SIFT features data set).]
[Fig. 2: Example of randomized kd-trees. The nearest neighbor is across a decision boundary from the query point in the first decomposition, however is in the same cell in the second decomposition.]
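The randomized k-d forest described above is exposed through FLANN's OpenCV integration. A minimal usage sketch follows; the random arrays stand in for real descriptors, and the `trees` and `checks` values are illustrative choices, not the tuned parameters from Section 3.4:

```python
import cv2
import numpy as np

# Two sets of SIFT-like descriptors (rows are 128-D feature vectors).
train = np.random.rand(1000, 128).astype(np.float32)
query = np.random.rand(5, 128).astype(np.float32)

FLANN_INDEX_KDTREE = 1  # selects the randomized k-d forest index
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=4)  # number of randomized trees
search_params = dict(checks=64)  # max leaves to visit: the approximation/speed knob

matcher = cv2.FlannBasedMatcher(index_params, search_params)
matches = matcher.knnMatch(query, train, k=2)  # 2-NN for each query descriptor
for m, n in matches:
    print(m.trainIdx, m.distance, n.distance)
```

Raising `checks` visits more leaves across the forest, trading speed for precision, which is exactly the time-bound approximation discussed in Section 2.1.1.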
3.2 The Priority Search K-Means Tree Algorithm

We have found the randomized k-d forest to be very effective in many situations; however on other data sets a different algorithm, the priority search k-means tree, has been more effective at finding approximate nearest neighbors, especially when a high precision is required. The priority search k-means tree tries to better exploit the natural structure existing in the data, by clustering the data points using the full distance across all dimensions, in contrast to the (randomized) k-d tree algorithm which only partitions the data based on one dimension at a time.

Nearest-neighbor algorithms that use hierarchical partitioning schemes based on clustering the data points have been previously proposed in the literature [18], [19], [24]. These algorithms differ in the way they construct the partitioning tree (whether using k-means, agglomerative or some other form of clustering) and especially in the strategies used for exploring the hierarchical tree. We have developed an improved version that explores the k-means tree using a best-bin-first strategy, by analogy to what has been found to significantly improve the performance of the approximate kd-tree searches.

3.2.1 Algorithm Description

The priority search k-means tree is constructed by partitioning the data points at each level into K distinct regions using k-means clustering, and then applying the same method recursively to the points in each region. The recursion is stopped when the number of points in a region is smaller than K (see Algorithm 1).

The tree is searched by initially traversing the tree from the root to the closest leaf, following at each inner node the branch with the closest cluster centre to the query point, and adding all unexplored branches along the path to a priority queue (see Algorithm 2). The priority queue is sorted in increasing distance from the query point to the boundary of the branch being added to the queue. After the initial tree traversal, the algorithm resumes traversing the tree, always starting with the top branch in the queue.

The number of clusters K to use when partitioning the data at each node is a parameter of the algorithm, called the branching factor, and choosing K is important for obtaining good search performance. In Section 3.4 we propose an algorithm for finding the optimum algorithm parameters, including the optimum branching factor. Fig. 3 contains a visualisation of several hierarchical k-means decompositions with different branching factors.

Another parameter of the priority search k-means tree is I_max, the maximum number of iterations to perform in the k-means clustering loop. Performing fewer iterations can substantially reduce the tree build time and results in a slightly less than optimal clustering (if we consider the sum of squared errors from the points to the cluster centres as the measure of optimality). However, we have observed that even when using a small number of iterations, the nearest neighbor search performance is similar to that of the tree constructed by running the clustering until convergence, as illustrated by Fig. 4. It can be seen that using as few as seven iterations we get more than 90 percent of the nearest-neighbor performance of the tree constructed using full convergence, but requiring less than 10 percent of the build time.

The algorithm to use when picking the initial centres in the k-means clustering can be controlled by the C_alg parameter. In our experiments (and in the FLANN library) we have used the following algorithms: random selection, Gonzales' algorithm (selecting the centres to be spaced apart from each other) and the KMeans++ algorithm [48]. We have found that the initial cluster selection made only a small difference in terms of the overall search efficiency in most cases and that the random initial cluster selection is usually a good choice for the priority search k-means tree.

[Fig. 3: Projections of priority search k-means trees constructed using different branching factors: 4, 32, 128. The projections are constructed using the same technique as in [26], gray values indicating the ratio between the distances to the nearest and the second-nearest cluster centre at each tree level, so that the darkest values (ratio ≈ 1) fall near the boundaries between k-means regions.]
[Fig. 4: The influence that the number of k-means iterations has on the search speed of the k-means tree. The figure shows the relative search time compared to the case of using full convergence.]

3.2.2 Analysis

When analysing the complexity of the priority search k-means tree, we consider the tree construction time, search time and the memory requirements for storing the tree.

Construction time complexity. During the construction of the k-means tree, a k-means clustering operation has to be performed for each inner node. Considering a node v with n_v associated data points, and assuming a maximum number of iterations I in the k-means clustering loop, the complexity of the clustering operation is O(n_v d K I), where d represents the data dimensionality. Taking into account all the inner nodes on a level, we have Σ n_v = n, so the complexity of constructing a level in the tree is O(n d K I). Assuming a balanced tree, the height of the tree will be log n / log K, resulting in a total tree construction cost of O(n d K I (log n / log K)).

Search time complexity. In the case of time-constrained approximate nearest neighbor search, the algorithm stops after examining L data points. Considering a complete priority search k-means tree with branching factor K, the number of top down tree traversals required is L/K (each leaf node contains K points in a complete k-means tree). During each top-down traversal, the algorithm needs to check O(log n / log K) inner nodes and one leaf node. For each internal node, the algorithm has to find the branch closest to the query point, so it needs to compute the distances to all the cluster centres of the child nodes, an O(Kd) operation. The unexplored branches are added to a priority queue, which can be accomplished in O(K) amortized cost when using binomial heaps. For the leaf node the distance between the query and all the points in the leaf needs to be computed, which takes O(Kd) time. In summary, the overall search cost is O(L d (log n / log K)).
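A minimal sketch of the best-bin-first traversal from Section 3.2.1, using a binary heap. The node layout is an assumption of the sketch, and the queue is keyed on the distance to each cluster centre, a common simplification of the boundary distance the paper specifies:

```python
import heapq
import numpy as np

# Assumed node layout: inner nodes carry `centers` (K x d) and `children`;
# leaves carry `points` (indices into the data set) and `is_leaf = True`.
def priority_search(root, data, q, max_checks):
    heap, best, checks = [], (np.inf, -1), 0
    node = root
    while checks < max_checks:
        # Descend to the closest leaf, queueing unexplored siblings on the way.
        while not node.is_leaf:
            dists = np.linalg.norm(node.centers - q, axis=1)
            closest = int(np.argmin(dists))
            for i, child in enumerate(node.children):
                if i != closest:
                    heapq.heappush(heap, (dists[i], id(child), child))
            node = node.children[closest]
        for idx in node.points:  # examine the points stored in the leaf
            d = np.linalg.norm(data[idx] - q)
            checks += 1
            if d < best[0]:
                best = (d, idx)
        if not heap:
            break
        _, _, node = heapq.heappop(heap)  # resume from the most promising branch
    return best  # (distance, index) of best candidate within the check budget
```

The `max_checks` budget plays the role of L in the analysis above: the search degrades gracefully from exact (unlimited checks) to very fast approximate matching.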
3.3 The Hierarchical Clustering Tree

Matching binary features is of increasing interest in the computer vision community, with many binary visual descriptors being recently proposed: BRIEF [49], ORB [50], BRISK [51]. Many algorithms suitable for matching vector based features, such as the randomized kd-tree and priority search k-means tree, are either not efficient or not suitable for matching binary features (for example, the priority search k-means tree requires the points to be in a vector space where their dimensions can be independently averaged).

Binary descriptors are typically compared using the Hamming distance, which for binary data can be computed as a bitwise XOR operation followed by a bit count on the result (very efficient on computers with hardware support for counting the number of bits set in a word, such as the POPCNT instruction for modern x86_64 architectures).

This section briefly presents a new data structure and algorithm, called the hierarchical clustering tree, which we found to be very effective at matching binary features. For a more detailed description of this algorithm the reader is encouraged to consult [47] and [52].

The hierarchical clustering tree performs a decomposition of the search space by recursively clustering the input data set using random data points as the cluster centers of the non-leaf nodes (see Algorithm 3). In contrast to the priority search k-means tree presented above, for which using more than one tree did not bring significant improvements, we have found that building multiple hierarchical clustering trees and searching them in parallel using a common priority queue (the same approach that has been found to work well for randomized kd-trees [13]) resulted in significant improvements in the search performance.

3.4 Automatic Selection of the Optimal Algorithm

Our experiments have revealed that the optimal algorithm for approximate nearest neighbor search is highly dependent on several factors such as the data dimensionality, size and structure of the data set (whether there is any correlation between the features in the data set) and the desired search precision. Additionally, each algorithm has a set of parameters that have significant influence on the search performance (e.g., number of randomized trees, branching factor, number of k-means iterations).

As already mentioned in Section 2.2, the optimum parameters for a nearest neighbor algorithm are typically chosen manually, using various heuristics. In this section we propose a method for automatic selection of the best nearest neighbor algorithm to use for a particular data set and for choosing its optimum parameters.

By considering the nearest neighbor algorithm itself as a parameter of a generic nearest neighbor search routine A, the problem is reduced to determining the parameters θ ∈ Θ that give the best solution, where Θ is also known as the parameter configuration space. This can be formulated as an optimization problem in the parameter configuration space:

min_{θ ∈ Θ} c(θ)

with c : Θ → R being a cost function indicating how well the search algorithm A, configured with the parameters θ, performs on the given input data.
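As a sketch of what this optimization step could look like, the grid-search-plus-downhill-simplex refinement mentioned in Section 2.2 can be driven by SciPy's Nelder-Mead routine. The cost function below (weighting search time against build time) and the `benchmark_index` helper are assumptions of the sketch, not FLANN's actual internals:

```python
from scipy.optimize import minimize

def cost(theta):
    trees, checks = max(1, round(theta[0])), max(1, round(theta[1]))
    # Assumed helper: builds the index with these parameters and times a
    # batch of queries at the precision implied by `checks` (placeholder).
    build_time, search_time = benchmark_index(trees=trees, checks=checks)
    return search_time + 0.01 * build_time  # trade off search vs. build cost

# Downhill simplex refinement around a grid-search starting point.
result = minimize(cost, x0=[4, 32], method='Nelder-Mead')
print(result.x)
```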

Principles of Harris Keypoint Extraction for Point Clouds

Harris keypoint extraction is a widely used computer vision algorithm for finding salient and distinctive keypoints in point cloud data. Its principle is based on local intensity variation: keypoints are identified by measuring the structural difference of the local neighborhood around each point.

The algorithm proceeds as follows:

1. Choose a suitable detection window size, usually a small pixel neighborhood. The window size strongly affects the result: too small a window makes keypoint detection inaccurate, while too large a window increases the number of detected keypoints and with it the computational cost.

2. For each pixel, compute the intensity differences within its neighborhood. The Sobel operator is typically used here: gradients are computed over the image, giving each point a gradient magnitude and direction. The magnitude measures how strongly the intensity changes, and the direction gives the orientation of that change.

3. To detect keypoints, a quantity describing the structure of the local region is needed. The Harris operator evaluates local structure through the covariance matrix of the image gradients, which captures the intensity variation in the neighborhood and its relationship to the surrounding region.

4. For each pixel, compute the eigenvalues of this covariance matrix. The eigenvalues indicate whether the point is a keypoint: if both eigenvalues are small, the point lies in a flat region and is not distinctive; if one eigenvalue is much larger than the other, the point lies on an edge; if both eigenvalues are large, the point is a corner and is a keypoint candidate.

5. Classify the pixels according to the eigenvalue response and select keypoints by thresholding. The threshold controls the number of keypoints: a larger threshold selects fewer keypoints, a smaller one selects more.

However, Harris keypoint extraction also has limitations. It is somewhat sensitive to illumination changes and noise, which can produce false and missed detections, and it may fail in regions with little texture, where no useful keypoints can be extracted.
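For the image-domain version of this detector, OpenCV ships an implementation. A minimal sketch follows; the window size, Sobel aperture, the Harris constant k, and the threshold fraction are the usual illustrative defaults, not prescribed values:

```python
import cv2
import numpy as np

img = cv2.imread('scene.png')
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Args: neighborhood size, Sobel aperture, Harris free parameter k.
response = cv2.cornerHarris(gray, 2, 3, 0.04)

# Threshold the response map; a higher fraction keeps fewer, stronger corners.
corners = response > 0.01 * response.max()
img[corners] = (0, 0, 255)  # mark detected corners in red
cv2.imwrite('corners.png', img)
```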

The Algebraic Multigrid Library AMGCL

The algebraic multigrid library AMGCL is an open source software library for solving large systems of linear equations. It uses the algebraic multigrid (AMG) method to accelerate the solution process, and through adaptive coarsening and hierarchical solves it can handle large-scale linear systems effectively.

AMGCL provides a flexible and easy-to-use interface that can be combined with a variety of linear solvers and iterative methods. It supports several sparse matrix formats, including compressed and coordinate formats. By combining the AMG hierarchy with existing iterative solvers, AMGCL can solve large problems in parallel and improve solution efficiency.

The main idea behind AMGCL is to approximate the solution step by step through a hierarchy of levels. In this hierarchy, each level corresponds to a coarser grid; solving on the different levels reduces the number of degrees of freedom and thereby speeds up the computation. AMGCL uses adaptive coarsening, automatically choosing a partitioning strategy suited to the characteristics of the problem, which improves solution efficiency.

In AMGCL, grid coarsening and the hierarchical solve are the two key steps. Coarsening splits the original problem into a hierarchy of levels, each corresponding to a coarser grid. The hierarchical solve works on these levels in turn, using interpolation between levels to approximate the solution of the original problem. AMGCL also uses precomputed interpolation and restriction operators to transfer information between levels.

Beyond coarsening and hierarchical solving, AMGCL provides further features that improve efficiency. For example, it supports parallel solution, exploiting multi-core processors and distributed computing environments to accelerate the solve. It also offers extensions for nonlinear systems, such as nonlinear preconditioners and nonlinear multilevel solves.

AMGCL has been applied widely in science and engineering, with good results in computational fluid dynamics, structural mechanics, electromagnetics and other fields. Because it is highly scalable and flexible, it can be adapted to a broad range of problems and can handle very large linear systems.

In summary, AMGCL solves large systems of linear equations efficiently through adaptive coarsening and hierarchical solves. It has a flexible, easy-to-use interface, supports various sparse matrix formats and iterative solvers, and scales well in practice.
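AMGCL itself is a C++ library; to keep this note's examples in one language, the following Python sketch illustrates the two-level restrict/solve/interpolate cycle described above on a 1-D Poisson problem. The matrix, the transfer operators, and the weighted-Jacobi smoother are textbook choices made for the sketch, not AMGCL's defaults:

```python
import numpy as np

def two_grid_solve(A, b, P, n_cycles=20, n_smooth=3, omega=2/3):
    """Two-level correction: smooth, restrict the residual to the coarse
    grid, solve there, interpolate the correction back, smooth again."""
    R = P.T / 2.0            # restriction as scaled transpose of prolongation
    A_c = R @ A @ P          # Galerkin coarse-grid operator
    x = np.zeros_like(b)
    D_inv = 1.0 / np.diag(A)
    for _ in range(n_cycles):
        for _ in range(n_smooth):               # pre-smoothing (weighted Jacobi)
            x += omega * D_inv * (b - A @ x)
        r_c = R @ (b - A @ x)                   # restrict the residual
        x += P @ np.linalg.solve(A_c, r_c)      # coarse correction, interpolated back
        for _ in range(n_smooth):               # post-smoothing
            x += omega * D_inv * (b - A @ x)
    return x

# 1-D Poisson test problem with linear interpolation between grids.
n = 31
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
nc = (n - 1) // 2
P = np.zeros((n, nc))
for j in range(nc):
    i = 2 * j + 1            # coarse point j sits at fine index 2j+1
    P[i - 1, j], P[i, j], P[i + 1, j] = 0.5, 1.0, 0.5
b = np.ones(n)
x = two_grid_solve(A, b, P)
print(np.linalg.norm(b - A @ x))  # residual shrinks rapidly with each cycle
```

A full AMG hierarchy applies this idea recursively: the coarse solve is itself replaced by another smooth/restrict/correct cycle, which is what lets the method scale to very large systems.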

Large-scale machine learning with stochastic gradient descent

Large-Scale Machine Learning with Stochastic Gradient Descent
Léon Bottou
NEC Labs America, Princeton NJ 08542, USA (leon@)

Abstract. During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods is limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.

Keywords: Stochastic gradient descent, Online learning, Efficiency

1 Introduction

The computational complexity of learning algorithms becomes the critical limiting factor when one envisions very large datasets. This contribution advocates stochastic gradient algorithms for large scale machine learning problems. The first section describes the stochastic gradient algorithm. The second section presents an analysis that explains why stochastic gradient algorithms are attractive when the data is abundant. The third section discusses the asymptotical efficiency of estimates obtained after a single pass over the training set. The last section presents empirical evidence.

2 Learning with gradient descent

Let us first consider a simple supervised learning setup. Each example z is a pair (x, y) composed of an arbitrary input x and a scalar output y. We consider a loss function ℓ(ŷ, y) that measures the cost of predicting ŷ when the actual answer is y, and we choose a family F of functions f_w(x) parametrized by a weight vector w. We seek the function f ∈ F that minimizes the loss Q(z, w) = ℓ(f_w(x), y) averaged on the examples. Although we would like to average over the unknown distribution dP(z) that embodies the Laws of Nature, we must often settle for computing the average on a sample z_1 ... z_n:

E(f) = ∫ ℓ(f(x), y) dP(z),   E_n(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i).   (1)

The empirical risk E_n(f) measures the training set performance. The expected risk E(f) measures the generalization performance, that is, the expected performance on future examples. The statistical learning theory (Vapnik and Chervonenkis, 1971) justifies minimizing the empirical risk instead of the expected risk when the chosen family F is sufficiently restrictive.

2.1 Gradient descent

It has often been proposed (e.g., Rumelhart et al., 1986) to minimize the empirical risk E_n(f_w) using gradient descent (GD). Each iteration updates the weights w on the basis of the gradient of E_n(f_w),

w_{t+1} = w_t − γ (1/n) Σ_{i=1}^n ∇_w Q(z_i, w_t),   (2)

where γ is an adequately chosen gain. Under sufficient regularity assumptions, when the initial estimate w_0 is close enough to the optimum, and when the gain γ is sufficiently small, this algorithm achieves linear convergence (Dennis and Schnabel, 1983), that is, −log ρ ∼ t, where ρ represents the residual error.

Much better optimization algorithms can be designed by replacing the scalar gain γ by a positive definite matrix Γ_t that approaches the inverse of the Hessian of the cost at the optimum:

w_{t+1} = w_t − Γ_t (1/n) Σ_{i=1}^n ∇_w Q(z_i, w_t).   (3)

This second order gradient descent (2GD) is a variant of the well known Newton algorithm. Under sufficiently optimistic regularity assumptions, and provided that w_0 is sufficiently close to the optimum, second order gradient descent achieves quadratic convergence. When the cost is quadratic and the scaling matrix Γ is exact, the algorithm reaches the optimum after a single iteration. Otherwise, assuming sufficient smoothness, we have −log log ρ ∼ t.

2.2 Stochastic gradient descent

The stochastic gradient descent (SGD) algorithm is a drastic simplification. Instead of computing the gradient of E_n(f_w) exactly, each iteration estimates this gradient on the basis of a single randomly picked example z_t:

w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t).   (4)

The stochastic process {w_t, t = 1, ...} depends on the examples randomly picked at each iteration. It is hoped that (4) behaves like its expectation (2) despite the noise introduced by this simplified procedure. Since the stochastic algorithm does not need to remember which examples were visited during the previous iterations, it can process examples on the fly in a deployed system. In such a situation, the stochastic gradient descent directly optimizes the expected risk, since the examples are randomly drawn from the ground truth distribution.

The convergence of stochastic gradient descent has been studied extensively in the stochastic approximation literature. Convergence results usually require decreasing gains satisfying the conditions Σ_t γ_t² < ∞ and Σ_t γ_t = ∞. The Robbins-Siegmund theorem (Robbins and Siegmund, 1971) provides the means to establish almost sure convergence under mild conditions (Bottou, 1998), including cases where the loss function is not everywhere differentiable.

The convergence speed of stochastic gradient descent is in fact limited by the noisy approximation of the true gradient. When the gains decrease too slowly, the variance of the parameter estimate w_t decreases equally slowly. When the gains decrease too quickly, the expectation of the parameter estimate w_t takes a very long time to approach the optimum. Under sufficient regularity conditions (e.g. Murata, 1998), the best convergence speed is achieved using gains γ_t ∼ t⁻¹. The expectation of the residual error then decreases with similar speed, that is, E ρ ∼ t⁻¹.

The second order stochastic gradient descent (2SGD) multiplies the gradients by a positive definite matrix Γ_t approaching the inverse of the Hessian:

w_{t+1} = w_t − γ_t Γ_t ∇_w Q(z_t, w_t).   (5)

Unfortunately, this modification does not reduce the stochastic noise and therefore does not significantly improve the variance of w_t. Although constants are improved, the expectation of the residual error still decreases like t⁻¹, that is, E ρ ∼ t⁻¹ (e.g. Bordes et al., 2009, appendix).

2.3 Stochastic gradient examples

Table 1 illustrates stochastic gradient descent algorithms for a number of classic machine learning schemes. The stochastic gradient descent for the Perceptron, for the Adaline, and for k-Means match the algorithms proposed in the original papers. The SVM and the Lasso were first described with traditional optimization techniques. Both Q_svm and Q_lasso include a regularization term controlled by the hyperparameter λ. The K-means algorithm converges to a local minimum because Q_kmeans is nonconvex. On the other hand, the proposed update rule uses second order gains that ensure a fast convergence. The proposed Lasso algorithm represents each weight as the difference of two positive variables. Applying the stochastic gradient rule to these variables and enforcing their positivity leads to sparser solutions.

Table 1. Stochastic gradient algorithms for various learning systems, with the notation [x]_+ = max{0, x}.

- Adaline (Widrow and Hoff, 1960). Loss: Q_adaline = ½(y − wᵀΦ(x))², with Φ(x) ∈ R^d, y = ±1. Update: w ← w + γ_t (y_t − wᵀΦ(x_t)) Φ(x_t).
- Perceptron (Rosenblatt, 1957). Loss: Q_perceptron = max{0, −y wᵀΦ(x)}, with Φ(x) ∈ R^d, y = ±1. Update: w ← w + γ_t y_t Φ(x_t) if y_t wᵀΦ(x_t) ≤ 0, and w unchanged otherwise.
- K-Means (MacQueen, 1967). Loss: Q_kmeans = min_k ½(z − w_k)², with z ∈ R^d, w_1 ... w_k ∈ R^d, counts n_1 ... n_k ∈ N initially 0. Update: k* = argmin_k (z_t − w_k)², n_{k*} ← n_{k*} + 1, w_{k*} ← w_{k*} + (1/n_{k*})(z_t − w_{k*}).
- SVM (Cortes and Vapnik, 1995). Loss: Q_svm = λw² + max{0, 1 − y wᵀΦ(x)}, with Φ(x) ∈ R^d, y = ±1, λ > 0. Update: w ← w − γ_t λw if y_t wᵀΦ(x_t) > 1, and w ← w − γ_t (λw − y_t Φ(x_t)) otherwise.
- Lasso (Tibshirani, 1996). Loss: Q_lasso = λ|w|₁ + ½(y − wᵀΦ(x))², with w = (u_1 − v_1, ..., u_d − v_d), Φ(x) ∈ R^d, y ∈ R, λ > 0. Update: u_i ← [u_i − γ_t (λ − (y_t − wᵀΦ(x_t)) Φ_i(x_t))]_+ and v_i ← [v_i − γ_t (λ + (y_t − wᵀΦ(x_t)) Φ_i(x_t))]_+.
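A minimal sketch of the SVM row of Table 1 on synthetic data; the gain schedule borrows the γ_t = γ_0(1 + λγ_0 t)⁻¹ form used in the experiments of Section 5, and the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, gamma0 = 1000, 20, 1e-4, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)          # linearly separable labels in {-1, +1}

w = np.zeros(d)
for t, i in enumerate(rng.permutation(n), start=1):
    gamma = gamma0 / (1 + lam * gamma0 * t)   # decreasing gain
    if y[i] * (w @ X[i]) > 1:                 # margin satisfied: only regularize
        w -= gamma * lam * w
    else:                                     # hinge active: regularize + correct
        w -= gamma * (lam * w - y[i] * X[i])

print(np.mean(np.sign(X @ w) == y))           # training accuracy after one pass
```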
3 Learning with large training sets

Let f* = argmin_f E(f) be the best possible prediction function. Since we seek the prediction function from a parametrized family of functions F, let f*_F = argmin_{f∈F} E(f) be the best function in this family. Since we optimize the empirical risk instead of the expected risk, let f_n = argmin_{f∈F} E_n(f) be the empirical optimum. Since this optimization can be costly, let us stop the algorithm when it reaches a solution f̃_n that minimizes the objective function with a predefined accuracy E_n(f̃_n) < E_n(f_n) + ρ.

3.1 The tradeoffs of large scale learning

The excess error E = E[E(f̃_n) − E(f*)] can be decomposed in three terms (Bottou and Bousquet, 2008):

E = E[E(f*_F) − E(f*)] + E[E(f_n) − E(f*_F)] + E[E(f̃_n) − E(f_n)].   (6)

- The approximation error E_app = E[E(f*_F) − E(f*)] measures how closely functions in F can approximate the optimal solution f*. The approximation error can be reduced by choosing a larger family of functions.
- The estimation error E_est = E[E(f_n) − E(f*_F)] measures the effect of minimizing the empirical risk E_n(f) instead of the expected risk E(f). The estimation error can be reduced by choosing a smaller family of functions or by increasing the size of the training set.
- The optimization error E_opt = E[E(f̃_n) − E(f_n)] measures the impact of the approximate optimization on the expected risk. The optimization error can be reduced by running the optimizer longer. The additional computing time depends of course on the family of functions and on the size of the training set.

Given constraints on the maximal computation time T_max and the maximal training set size n_max, this decomposition outlines a tradeoff involving the size of the family of functions F, the optimization accuracy ρ, and the number of examples n effectively processed by the optimization algorithm:

min_{F, ρ, n} E = E_app + E_est + E_opt, subject to n ≤ n_max and T(F, ρ, n) ≤ T_max.   (7)

Two cases should be distinguished:

- Small-scale learning problems are first constrained by the maximal number of examples. Since the computing time is not an issue, we can reduce the optimization error E_opt to insignificant levels by choosing ρ arbitrarily small, and we can minimize the estimation error by choosing n = n_max. We then recover the approximation-estimation tradeoff that has been widely studied in statistics and in learning theory.
- Large-scale learning problems are first constrained by the maximal computing time. Approximate optimization can achieve better expected risk because more training examples can be processed during the allowed time. The specifics depend on the computational properties of the chosen optimization algorithm.

3.2 Asymptotic analysis

Solving (7) in the asymptotic regime amounts to ensuring that the terms of the decomposition (6) decrease at similar rates. Since the asymptotic convergence rate of the excess error (6) is the convergence rate of its slowest term, the computational effort required to make a term decrease faster would be wasted.

For simplicity, we assume in this section that the Vapnik-Chervonenkis dimensions of the families of functions F are bounded by a common constant. We also assume that the optimization algorithms satisfy all the assumptions required to achieve the convergence rates discussed in section 2. Similar analyses can be carried out for specific algorithms under weaker assumptions (e.g. Shalev-Shwartz and Srebro, 2008).

A simple application of the uniform convergence results of (Vapnik and Chervonenkis, 1971) gives then the upper bound

E = E_app + E_est + E_opt = E_app + O(√(log n / n)) + ρ.

Unfortunately the convergence rate of this bound is too pessimistic. Faster convergence occurs when the loss function has strong convexity properties (Lee et al., 2006) or when the data distribution satisfies certain assumptions (Tsybakov, 2004). The equivalence

E = E_app + E_est + E_opt ∼ E_app + (log n / n)^α + ρ, for some α ∈ [1/2, 1],   (8)

provides a more realistic view of the asymptotic behavior of the excess error (e.g. Massart, 2000; Bousquet, 2002). Since the three components of the excess error should decrease at the same rate, the solution of the tradeoff problem (7) must then obey the multiple asymptotic equivalences

E ∼ E_app ∼ E_est ∼ E_opt ∼ (log n / n)^α ∼ ρ.   (9)

Table 2 summarizes the asymptotic behavior of the four gradient algorithms described in section 2. The first three rows list the computational cost of each iteration, the number of iterations required to reach an optimization accuracy ρ, and the corresponding computational cost. The last row provides a more interesting measure for large scale machine learning purposes. Assuming we operate at the optimum of the approximation-estimation-optimization tradeoff (7), this line indicates the computational cost necessary to reach a predefined value of the excess error, and therefore of the expected risk. This is computed by applying the equivalences (9) to eliminate n and ρ from the third row results.

Table 2. Asymptotic equivalents for gradient descent (GD, eq. 2), second order gradient descent (2GD, eq. 3), stochastic gradient descent (SGD, eq. 4), and second order stochastic gradient descent (2SGD, eq. 5). Although they are the worst optimization algorithms, SGD and 2SGD achieve the fastest convergence speed on the expected risk. They differ only by constant factors not shown in this table, such as condition numbers and weight vector dimension.

|                          | GD                    | 2GD                              | SGD  | 2SGD |
| Time per iteration       | n                     | n                                | 1    | 1    |
| Iterations to accuracy ρ | log(1/ρ)              | log log(1/ρ)                     | 1/ρ  | 1/ρ  |
| Time to accuracy ρ       | n log(1/ρ)            | n log log(1/ρ)                   | 1/ρ  | 1/ρ  |
| Time to excess error E   | (1/E^{1/α}) log²(1/E) | (1/E^{1/α}) log(1/E) log log(1/E) | 1/E | 1/E  |

Although the stochastic gradient algorithms, SGD and 2SGD, are clearly the worst optimization algorithms (third row), they need less time than the other algorithms to reach a predefined expected risk (fourth row). Therefore, in the large scale setup, that is, when the limiting factor is the computing time rather than the number of examples, the stochastic learning algorithms perform asymptotically better!

4 Efficient learning

Let us add an additional example z_t to a training set z_1 ... z_{t−1}. Since the new empirical risk E_t(f) remains close to E_{t−1}(f), the empirical minimum w*_{t+1} = argmin_w E_t(f_w) remains close to w*_t = argmin_w E_{t−1}(f_w). With sufficient regularity assumptions, a first order calculation gives the result

w*_{t+1} = w*_t − t⁻¹ Ψ_t ∇_w Q(z_t, w*_t) + O(t⁻²),   (10)

where Ψ_t is the inverse of the Hessian of E_t(f_w) in w*_t. The similarity between this expression and the second order stochastic gradient descent rule (5) has deep consequences. Let w_t be the sequence of weights obtained by performing a single second order stochastic gradient pass on the randomly shuffled training set. With adequate regularity and convexity assumptions, we can prove (e.g. Bottou and LeCun, 2004)

lim_{t→∞} t (E(f_{w_t}) − E(f*_F)) = lim_{t→∞} t (E(f_{w*_t}) − E(f*_F)) = I > 0.   (11)

Therefore, a single pass of second order stochastic gradient provides a prediction function f_{w_t} that approaches the optimum f*_F as efficiently as the empirical optimum f_{w*_t}. In particular, when the loss function is the log likelihood, the empirical optimum is the asymptotically efficient maximum likelihood estimate, and the second order stochastic gradient estimate is also asymptotically efficient.

Unfortunately, second order stochastic gradient descent is computationally costly because each iteration (5) performs a computation that involves the large dense matrix Γ_t. Two approaches can work around this problem.

- Computationally efficient approximations of the inverse Hessian trade asymptotic optimality for computation speed. For instance, the SGDQN algorithm (Bordes et al., 2009) achieves interesting speeds using a diagonal approximation.
- The averaged stochastic gradient descent (ASGD) algorithm (Polyak and Juditsky, 1992) performs the normal stochastic gradient update (4) and recursively computes the average w̄_t = (1/t) Σ_{i=1}^t w_i:

w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t),   w̄_{t+1} = (t/(t+1)) w̄_t + (1/(t+1)) w_{t+1}.   (12)

When the gains γ_t decrease slower than t⁻¹, the w̄_t converges with the optimal asymptotic speed (11). Reaching this asymptotic regime can take a very long time in practice. A smart selection of the gains γ_t helps achieving the promised performance (Xu, 2010).
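A sketch of the ASGD recursion (12); the gain exponent follows the 0.75 schedule quoted in the experiments below, and the `grad` callback is a placeholder for ∇_w Q(z_t, ·) on the current example:

```python
import numpy as np

def asgd_step(w, w_bar, t, grad, gamma0=0.1, lam=1e-4):
    """One averaged-SGD update: a plain SGD step plus a running average."""
    gamma = gamma0 * (1 + lam * gamma0 * t) ** -0.75   # slower-than-1/t decay
    w = w - gamma * grad(w)                            # usual stochastic step (4)
    w_bar = (t / (t + 1)) * w_bar + (1 / (t + 1)) * w  # recursive average (12)
    return w, w_bar
```

The averaged iterate w̄_t, not w_t itself, is what attains the optimal asymptotic rate, so predictions should be made with w̄_t.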
5 Experiments

This section briefly reports experimental results illustrating the actual performance of stochastic gradient algorithms on a variety of linear systems. We use gains γ_t = γ_0 (1 + λγ_0 t)⁻¹ for SGD and, following (Xu, 2010), γ_t = γ_0 (1 + λγ_0 t)^{−0.75} for ASGD. The initial gains γ_0 were set manually by observing the performance of each algorithm running on a subset of the training examples.

Figure 1 reports results achieved using SGD for a linear SVM trained for the recognition of the CCAT category in the RCV1 dataset (Lewis et al., 2004) using both the hinge loss (Q_svm in Table 1) and the log loss (Q_logsvm = λw² + log(1 + exp(−y wᵀΦ(x)))). The training set contains 781,265 documents represented by 47,152 relatively sparse TF/IDF features. SGD runs considerably faster than either the standard SVM solvers SVMLight and SVMPerf (Joachims, 2006) or the superlinear optimization algorithm TRON (Lin et al., 2007).

[Fig. 1: Results achieved with a linear SVM on the RCV1 task. Hinge loss SVM, λ = 10⁻⁴: SVMLight 23,642 s, 6.02% test error; SVMPerf 66 s, 6.03%; SGD 1.4 s, 6.02%. Log loss SVM, λ = 10⁻⁵: TRON (-e0.01) 30 s, 5.68%; TRON (-e0.001) 44 s, 5.70%; SGD 2.3 s, 5.66%. The lower half of the plot shows the time required by SGD and TRON to reach a predefined accuracy ρ on the log loss task; the upper half shows that the expected risk stops improving long before the superlinear TRON algorithm overcomes SGD.]

Figure 2 reports results achieved using SGD, SGDQN, and ASGD for a linear SVM trained on the ALPHA task of the 2008 Pascal Large Scale Learning Challenge (see Bordes et al., 2009) using the squared hinge loss (Q_sqsvm = λw² + max{0, 1 − y wᵀΦ(x)}²). The training set contains 100,000 patterns represented by 500 centered and normalized variables. Performances measured on a separate testing set are plotted against the number of passes over the training set. ASGD achieves near optimal results after one pass.

[Fig. 2: Comparison of the test set performance of SGD, SGDQN, and ASGD for a linear squared hinge SVM trained on the ALPHA task of the 2008 Pascal Large Scale Learning Challenge. ASGD nearly reaches the optimal expected risk after a single pass.]

Figure 3 reports results achieved using SGD, SGDQN, and ASGD for a CRF (Lafferty et al., 2001) trained on the CoNLL 2000 Chunking task (Tjong Kim Sang and Buchholz, 2000). The training set contains 8,936 sentences for a 1.68 × 10⁶ dimensional parameter space. Performances measured on a separate testing set are plotted against the number of passes over the training set. SGDQN appears more attractive because ASGD does not reach its asymptotic performance. All three algorithms reach the best test set performance in a couple minutes. The standard CRF L-BFGS optimizer takes 72 minutes to compute an equivalent solution.

[Fig. 3: Comparison of the test set performance of SGD, SGDQN, and ASGD on a CRF trained on the CoNLL Chunking task. On this task, SGDQN appears more attractive because ASGD does not reach its asymptotic performance.]
References

BORDES, A., BOTTOU, L., and GALLINARI, P. (2009): SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737-1754. With Erratum (to appear).
BOTTOU, L. and BOUSQUET, O. (2008): The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems, vol. 20, 161-168.
BOTTOU, L. and LECUN, Y. (2004): On-line Learning for Very Large Datasets. Applied Stochastic Models in Business and Industry, 21(2):137-151.
BOUSQUET, O. (2002): Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. Thèse de doctorat, Ecole Polytechnique, Palaiseau, France.
CORTES, C. and VAPNIK, V. N. (1995): Support Vector Networks. Machine Learning, 20:273-297.
DENNIS, J. E., Jr., and SCHNABEL, R. B. (1983): Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall.
JOACHIMS, T. (2006): Training Linear SVMs in Linear Time. In Proceedings of the 12th ACM SIGKDD, ACM Press.
LAFFERTY, J. D., MCCALLUM, A., and PEREIRA, F. (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML 2001, 282-289, Morgan Kaufman.
LEE, W. S., BARTLETT, P. L., and WILLIAMSON, R. C. (1998): The Importance of Convexity in Learning with Squared Loss. IEEE Transactions on Information Theory, 44(5):1974-1980.
LEWIS, D. D., YANG, Y., ROSE, T. G., and LI, F. (2004): RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397.
LIN, C. J., WENG, R. C., and KEERTHI, S. S. (2007): Trust Region Newton Methods for Large-Scale Logistic Regression. In Proceedings of ICML 2007, 561-568, ACM Press.
MACQUEEN, J. (1967): Some Methods for Classification and Analysis of Multivariate Observations. In Fifth Berkeley Symposium on Mathematics, Statistics, and Probabilities, vol. 1, 281-297, University of California Press.
MASSART, P. (2000): Some Applications of Concentration Inequalities to Statistics. Annales de la Faculté des Sciences de Toulouse, series 6, 9(2):245-303.
MURATA, N. (1998): A Statistical Study of On-line Learning. In Online Learning and Neural Networks, Cambridge University Press.
POLYAK, B. T. and JUDITSKY, A. B. (1992): Acceleration of Stochastic Approximation by Averaging. SIAM J. Control and Optimization, 30(4):838-855.
ROSENBLATT, F. (1957): The Perceptron: A Perceiving and Recognizing Automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.
RUMELHART, D. E., HINTON, G. E., and WILLIAMS, R. J. (1986): Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, 318-362, Bradford Books.
SHALEV-SHWARTZ, S. and SREBRO, N. (2008): SVM Optimization: Inverse Dependence on Training Set Size. In Proceedings of ICML 2008, 928-935, ACM.
TIBSHIRANI, R. (1996): Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288.
TJONG KIM SANG, E. F., and BUCHHOLZ, S. (2000): Introduction to the CoNLL-2000 Shared Task: Chunking. In Proceedings of CoNLL-2000, 127-132.
TSYBAKOV, A. B. (2004): Optimal Aggregation of Classifiers in Statistical Learning. Annals of Statistics, 32(1).
VAPNIK, V. N. and CHERVONENKIS, A. YA. (1971): On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and its Applications, 16(2):264-280.
WIDROW, B. and HOFF, M. E. (1960): Adaptive Switching Circuits. IRE WESCON Conv. Record, Part 4, 96-104.
XU, W. (2010): Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. Journal of Machine Learning Research (to appear).

Vision Master: A Worked Algorithm Example

Suppose you are developing a computer vision system that must recognize objects in images. One way to implement this is with a convolutional neural network (CNN) from deep learning. A simple CNN model has the following structure:

1. Input layer: receives the image data, usually RGB color images.
2. Convolutional layer: convolves the input image to extract image features.
3. Activation layer: e.g. ReLU, adds non-linearity.
4. Pooling layer: e.g. max pooling, reduces the spatial size of the feature maps.
5. Fully connected layer: maps the extracted features into a vector space for the classification task.
6. Output layer: produces the prediction, e.g. a probability distribution over object classes.

Below is a simple CNN implemented with Python and TensorFlow:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Build the CNN model
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model (dataset loading and preprocessing omitted here)
# model.fit(x_train, y_train, batch_size=32, epochs=10, validation_data=(x_test, y_test))
```

In this example we build a simple CNN that recognizes objects from 10 different classes in 32×32-pixel color images.

A Comprehensive Survey of Multiagent Reinforcement Learning

A multiagent system [1] can be defined as a group of autonomous, interacting entities sharing a common environment, which they perceive with sensors and upon which they act with actuators [2]. Multiagent systems are finding applications in a wide variety of domains including robotic teams, distributed control, resource management, collaborative decision support systems, data mining, etc. [3], [4]. They may arise as the most natural way of looking at the system, or may provide an alternative perspective on systems that are originally regarded as centralized. For instance, in robotic teams, the control authority is naturally distributed among the robots [4]. In resource management, while resources can be managed by a central authority, identifying each resource with an agent may provide a helpful, distributed perspective on the system [5].

Translated Paper: A Parallel Image Encryption Algorithm Based on Discrete Chaotic Maps

Appendix C: Translated Text

A Parallel Image Encryption Algorithm Based on Discrete Chaotic Maps

Abstract: Recently, a variety of chaos-based algorithms have been proposed for image encryption. However, none of them works efficiently in a parallel computing environment. In this paper, we propose a framework for parallel image encryption. Within this framework, a new algorithm based on the discrete Kolmogorov flow map is proposed. It satisfies all the requirements of a parallel image encryption algorithm, and it is both secure and fast. These properties make it a good choice for image encryption on parallel computing platforms.

This framework can implement diffusion across the whole image very efficiently. However, it is not suitable for running in a parallel computing environment, because the processing of the current pixel cannot start until the previous pixel has been encrypted. Even with multiple processing elements (PEs), the computation still works in a serial mode. This limits its application platforms, since many devices based on FPGA/CPLD or digital circuits can support parallel processing. With parallel computing techniques, the encryption speed can be greatly increased.

When computing in parallel mode, many PEs may read from or write to the same memory area (i.e., a critical section) at the same time, which often leads to unexpected program behavior. It is therefore necessary to apply parallel programming techniques to manage the critical sections.

2.2. A parallel image encryption framework

To satisfy the requirements above, we propose a framework for parallel image encryption, which is a four-step process:

Step 1: The whole image is divided into a number of blocks.
Step 2: Each PE is assigned a certain number of blocks. The pixels within a block can be fully encrypted with effective confusion and diffusion operations.

3. Transformations

Within this parallel image encryption framework, we propose a new algorithm based on four basic transformations. We therefore introduce these transformations before describing our algorithm.

3.1. The A-transformation

In the A-transformation, A stands for addition, which can be formally defined as:

a + b = c    (1)

where the addition is defined as a bitwise XOR operation.

The A-transformation has three basic properties:

(2.1) a + a = 0
(2.2) a + b = b + a    (2)
(2.3) (a + b) + c = a + (b + c)
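A quick numeric check of these properties, using bitwise XOR on 8-bit pixel values (a sketch, not the paper's code):

```python
import numpy as np

a = np.uint8(0b10110100)
b = np.uint8(0b01101101)
c = np.uint8(0b11001010)

assert a ^ a == 0                      # (2.1) self-inverse
assert a ^ b == b ^ a                  # (2.2) commutative
assert (a ^ b) ^ c == a ^ (b ^ c)      # (2.3) associative

# Self-inverse is what makes XOR a reversible diffusion primitive:
cipher = a ^ b          # combine pixel a with key byte b
assert cipher ^ b == a  # applying the same key byte again decrypts
```

Property (2.1) is the key one for encryption: applying the same transformation twice restores the plaintext, so encryption and decryption share the same operation.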

CL Reconstruction Algorithm Code

I. Algorithm principle

The CL reconstruction algorithm is a 3-D scene reconstruction algorithm based on stereo vision. Its principle is to match images taken from multiple viewpoints, find corresponding feature points, and compute the 3-D coordinates of scene points from the positions of these correspondences. The algorithm consists of the following steps:

1. Feature extraction: extract feature points from the image of each viewpoint; common choices are SIFT, SURF, etc.
2. Feature matching: match the extracted feature points to find correspondences across the viewpoints.
3. Triangulation: compute the 3-D coordinates of the corresponding points by triangulation.
4. Depth map generation: generate a depth map of the scene from the 3-D coordinates, for use in the subsequent reconstruction.
5. Point cloud generation: convert the 3-D coordinates of each pixel into point cloud data according to the depth map, yielding a 3-D model of the scene.

II. Algorithm steps

Based on the above principle, the concrete steps of the CL reconstruction algorithm can be summarized as follows:

1. Preprocessing: preprocess the input multi-view images, including denoising and edge enhancement.
2. Feature extraction: use SIFT to extract feature points from each image and compute their descriptors.
3. Feature matching: match feature points between each pair of views to find correspondences.
4. Triangulation: compute the 3-D coordinates of corresponding points by triangulation.
5. Depth map generation: generate the scene depth map from the 3-D coordinates; missing depth values can be estimated by interpolation.
6. Point cloud generation: convert each pixel's 3-D coordinates into point cloud data according to the depth map.
7. 3-D reconstruction: filter and reconstruct the generated point cloud; methods such as voxel-grid filtering can produce a finer reconstruction.

III. Implementation

When implementing the CL reconstruction algorithm in practice, the open source computer vision library OpenCV can be used for image processing, feature matching, and related operations. Pseudocode for the algorithm follows:

```
// Image preprocessing
for each view in views do
    preprocess(view)
end

// Feature extraction and matching
for each view in views do
    keypoints, descriptors = extract_features(view)
    keypoints_list.append(keypoints)
    descriptors_list.append(descriptors)
end
matches = match_features(keypoints_list, descriptors_list)

// Triangulation
for each match in matches do
    point_3d = triangulate(match)
    point_cloud.append(point_3d)
end

// Depth map generation
depth_map = generate_depth_map(matches)

// Point cloud generation
point_cloud = generate_point_cloud(depth_map)

// 3-D reconstruction
reconstruct(point_cloud)
```

The pseudocode above implements the basic steps of the CL reconstruction algorithm; the concrete functions can be written according to the actual requirements and the computer vision library being used.
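A hedged two-view version of the extract/match/triangulate core, using OpenCV in Python. The camera intrinsics and relative pose below are placeholders standing in for calibration results, and the 0.75 ratio-test threshold is the usual illustrative value:

```python
import cv2
import numpy as np

img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep unambiguous matches (Lowe's ratio test).
bf = cv2.BFMatcher()
good = [m for m, n in bf.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).T  # 2 x N
pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).T

# Placeholder intrinsics and relative pose; in practice these come from calibration.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
R, t = np.eye(3), np.array([[0.1], [0.0], [0.0]])
P2 = K @ np.hstack([R, t])

pts4d = cv2.triangulatePoints(P1, P2, pts1, pts2)   # homogeneous 4 x N
point_cloud = (pts4d[:3] / pts4d[3]).T              # Euclidean, N x 3
```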

超像素分割算法(SLIC算法)

超像素分割算法(SLIC算法)

超像素分割算法(SLIC算法)
The core idea of SLIC is to combine the spatial domain and the color domain of the image: pixels are clustered into superpixels, thereby segmenting the image.

The algorithm proceeds as follows:

1. Initialization: choose the number of superpixels K and the initial cluster positions.

A common initialization is to divide the image evenly into K grid cells and take the center of each cell as an initial position.

2. Iterative optimization: for each superpixel center, use k-means-style clustering to assign the surrounding pixels to that superpixel.

The distance measure here is not only the Euclidean distance; it also weighs color similarity against spatial distance.

The distance from each pixel to its nearest superpixel center is also recorded, for use in the subsequent superpixel merging step.

3. Superpixel merging: merge superpixels based on the distance from pixels to their nearest superpixel centers and the similarity between adjacent superpixels.

This merges smaller superpixels into larger ones, making the segmentation more coherent.

4. Iterate: repeat steps 2 and 3 until a preset number of iterations is reached or the result converges.

The SLIC algorithm has the following characteristics:

1. Fast and effective: the k-means-style iterative clustering keeps the algorithm efficient.

At the same time, using both color and spatial information yields better segmentation quality.

2. Few parameters: SLIC needs only one parameter, the number of superpixels K; in addition, the number of clustering iterations can be set if needed.

3. Boundary adherence: because color similarity and spatial distance are weighted together, the superpixel merging step preserves image boundaries well.

4. Extensibility: SLIC extends easily to multi-channel images and can also be used for video superpixel segmentation.

Overall, SLIC is a fast and effective superpixel segmentation algorithm with good segmentation quality.

With suitable initialization and iteration counts, it can segment an image quickly while preserving detail.
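As a usage-level illustration, the sketch below runs the SLIC implementation shipped in OpenCV's ximgproc module (available via the opencv-contrib-python package); the file name and parameter values are placeholders.

```python
import cv2

# Requires opencv-contrib-python, which provides cv2.ximgproc.
img = cv2.imread("input.jpg")                 # placeholder file name
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)    # SLIC clusters in CIELAB space

# region_size controls the initial grid spacing (roughly image_area / K);
# ruler weighs spatial distance against color distance.
slic = cv2.ximgproc.createSuperpixelSLIC(
    lab, algorithm=cv2.ximgproc.SLIC, region_size=20, ruler=10.0)
slic.iterate(10)                              # k-means-style iterations

labels = slic.getLabels()                     # per-pixel superpixel index
mask = slic.getLabelContourMask()             # boundary mask for visualization
print(slic.getNumberOfSuperpixels())
```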

SUMMA: Scalable Universal Matrix Multiplication Algorithm

will consider the formation of the matrix products

C = A B + C        (1)
C = A B^T + C      (2)
C = A^T B + C      (3)
C = A^T B^T + C    (4)
These are the special cases implemented as part of the widely used sequential Basic Linear Algebra Subprograms [11]. We will assume that each matrix X is of dimension m_X × n_X, X ∈ {A, B, C}. Naturally, there are constraints on these dimensions for the multiplications to be well defined: we will assume that the dimensions of C are m × n, while the "other" dimension is k.
This work is partially supported by the NASA High Performance Computing and Communications Program's Earth and Space Sciences Project under NRA Grant NAG5-2497. Additional support came from the Intel Research Council. Jerrell Watts is being supported by an NSF Graduate Research Fellowship.
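To make the blocked outer-product structure that SUMMA parallelizes concrete, here is a serial Python/NumPy sketch; the function name and panel width are illustrative assumptions, and in the actual algorithm each panel is broadcast along the rows and columns of a 2D process grid rather than sliced locally.

```python
import numpy as np

def summa_serial(A, B, C, panel=64):
    """Serial simulation of SUMMA's outer-product formulation: C += A @ B.

    In the parallel algorithm, each iteration broadcasts a column panel of A
    along process rows and a row panel of B along process columns; here the
    'broadcast' is just a slice. `panel` plays the role of the blocking
    parameter k_b."""
    m, k = A.shape
    for p in range(0, k, panel):
        q = min(p + panel, k)
        # rank-(q-p) update that every process applies to its local block of C
        C += A[:, p:q] @ B[p:q, :]
    return C

# Quick check against a direct multiply (dimensions are illustrative).
rng = np.random.default_rng(0)
A, B = rng.standard_normal((200, 300)), rng.standard_normal((300, 150))
C = np.zeros((200, 150))
assert np.allclose(summa_serial(A, B, C), A @ B)
```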

The Levenberg-Marquardt Algorithm

The Levenberg-Marquardt Algorithm

Ananth Ranganathan, 8th June 2004

1 Introduction

The Levenberg-Marquardt (LM) algorithm is the most widely used optimization algorithm. It outperforms simple gradient descent and other conjugate gradient methods in a wide variety of problems. This document aims to provide an intuitive explanation for this algorithm. The LM algorithm is first shown to be a blend of vanilla gradient descent and Gauss-Newton iteration. Subsequently, another perspective on the algorithm is provided by considering it as a trust-region method.

2 The Problem

The problem for which the LM algorithm provides a solution is called Nonlinear Least Squares Minimization. This implies that the function to be minimized is of the following special form:

f(x) = (1/2) ||r(x)||^2

where r(x) = (r_1(x), ..., r_m(x)) is the vector of residuals. The derivatives of f can be written using the Jacobian matrix J of r with respect to x, defined as J(x) = ∂r_j/∂x_i. In the linear case, r(x) = Jx + r(0), so f(x) = (1/2) ||Jx + r(0)||^2. We also get ∇f(x) = J^T (Jx + r) and ∇²f(x) = J^T J. Solving for the minimum by setting ∇f(x) = 0, we obtain x_min = −(J^T J)^{-1} J^T r, which is the solution to the set of normal equations. Returning to the general, non-linear case, we have

∇f(x) = Σ_{j=1}^{m} r_j(x) ∇r_j(x) = J(x)^T r(x)    (1)

∇²f(x) = J(x)^T J(x) + Σ_{j=1}^{m} r_j(x) ∇²r_j(x)    (2)

The distinctive property of least-squares problems is that, given the Jacobian matrix J, we can essentially get the Hessian ∇²f(x) for free if it is possible to approximate the r_j's by linear functions (the ∇²r_j(x) are small) or if the residuals r_j(x) themselves are small. The Hessian in this case simply becomes

∇²f(x) = J(x)^T J(x)    (3)

which is the same as for the linear case. The common approximation used here is one of near-linearity of the r_j's near the solution, so that the ∇²r_j(x) are small. It is also important to note that (3) is only valid if the residuals are small. Large residual problems cannot be solved using the quadratic approximation, and consequently the performance of the algorithms presented in this document is poor in such cases.

3 LM as a Blend of Gradient Descent and Gauss-Newton Iteration

Vanilla gradient descent is the simplest, most intuitive technique to find minima in a function. The parameter update is performed by adding the negative of the scaled gradient at each step, i.e.,

x_{i+1} = x_i − λ ∇f    (4)

Simple gradient descent suffers from various convergence problems. Logically, we would like to take large steps down the gradient at locations where the gradient is small (the slope is gentle) and, conversely, take small steps when the gradient is large, so as not to rattle out of the minima. With the above update rule, we do just the opposite of this. Another issue is that the curvature of the error surface may not be the same in all directions. For example, if there is a long and narrow valley in the error surface, the component of the gradient in the direction that points along the base of the valley is very small, while the component along the valley walls is quite large. This results in motion more in the direction of the walls, even though we have to move a long distance along the base and a small distance along the walls. This situation can be improved upon by using curvature as well as gradient information, namely second derivatives. One way to do this is to use Newton's method to solve the equation ∇f(x) = 0.
Expanding the gradient of f using a Taylor series around the current state x_0, we get

∇f(x) = ∇f(x_0) + (x − x_0)^T ∇²f(x_0) + higher-order terms of (x − x_0)    (5)

If we neglect the higher-order terms (assuming f to be quadratic around x_0) and solve for the minimum x by setting the left-hand side of (5) to 0, we get the update rule for Newton's method:

x_{i+1} = x_i − (∇²f(x_i))^{-1} ∇f(x_i)    (6)

where x_0 has been replaced by x_i and x by x_{i+1}. Since Newton's method implicitly uses a quadratic assumption on f (arising from the neglect of higher-order terms in a Taylor series expansion of f), the Hessian need not be evaluated exactly. Rather, the approximation of (3) can be used. The main advantage of this technique is rapid convergence. However, the rate of convergence is sensitive to the starting location (or more precisely, the linearity around the starting location).

It can be seen that simple gradient descent and Gauss-Newton iteration are complementary in the advantages they provide. Levenberg proposed an algorithm based on this observation, whose update rule is a blend of the above-mentioned algorithms and is given as

x_{i+1} = x_i − (H + λI)^{-1} ∇f(x_i)    (7)

where H is the Hessian matrix evaluated at x_i. This update rule is used as follows. If the error goes down following an update, it implies that our quadratic assumption on f(x) is working, and we reduce λ (usually by a factor of 10) to reduce the influence of gradient descent. On the other hand, if the error goes up, we would like to follow the gradient more, and so λ is increased by the same factor. The Levenberg algorithm is thus:

1. Do an update as directed by the rule above.
2. Evaluate the error at the new parameter vector.
3. If the error has increased as a result of the update, then retract the step (i.e., reset the weights to their previous values) and increase λ by a factor of 10 or some such significant factor. Then go to (1) and try an update again.
4. If the error has decreased as a result of the update, then accept the step (i.e., keep the weights at their new values) and decrease λ by a factor of 10 or so.

The above algorithm has the disadvantage that if the value of λ is large, the calculated Hessian matrix is not used at all. We can derive some advantage out of the second derivative even in such cases by scaling each component of the gradient according to the curvature. This should result in larger movement along the directions where the gradient is smaller, so that the classic "error valley" problem does not occur any more. This crucial insight was provided by Marquardt. He replaced the identity matrix in (7) with the diagonal of the Hessian, resulting in the Levenberg-Marquardt update rule:

x_{i+1} = x_i − (H + λ diag[H])^{-1} ∇f(x_i)    (8)

Since the Hessian is proportional to the curvature of f, (8) implies a large step in the direction with low curvature (i.e., an almost flat terrain) and a small step in the direction with high curvature (i.e., a steep incline). It is to be noted that while the LM method is in no way optimal, but is just a heuristic, it works extremely well in practice. The only flaw is its need for matrix inversion as part of the update. Even though the inverse is usually implemented using clever pseudo-inverse methods such as singular value decomposition, the cost of the update becomes prohibitive after the model size increases to a few thousand parameters. For moderately sized models (of a few hundred parameters), however, this method is much faster than, say, vanilla gradient descent.

4 LM as a Trust-Region Algorithm

Historically, the LM algorithm was presented by Marquardt as given in the previous section, where the parameter λ was manipulated directly to find the minimum. Subsequently, a trust-region approach to the algorithm has gained ground. Trust-region algorithms work in a fundamentally different manner than those presented in the previous section, which are called line-search methods. In a line-search method, we decide on a direction in which to descend the gradient and are then concerned about the step size, i.e., if p^(k) is the direction of descent and α_k the step size, then our step is given by x^(k+1) = x^(k) + α_k p^(k), and the step size is obtained by solving the sub-problem

min_{α_k > 0} f(x^(k) + α_k p^(k))

By contrast, in a trust-region algorithm we build a model m^(k) that approximates the function f in a finite region near x^(k). This region ∆, where the model is a good approximation of f, is called the trust region. Trust-region algorithms maintain ∆ and update it at each iteration using heuristics. The model m^(k) is most often a quadratic obtained by a Taylor series expansion of f around x^(k), i.e.,

m^(k)(p) = f(x^(k)) + ∇f(x^(k))^T p + (1/2) p^T H p    (10)

and the iteration step itself is x^(k+1) = x^(k) + p. A trust-region algorithm can thus be conceived of as a sequence of iterations, in each of which we model the function f by a quadratic and then jump to the minimum of that quadratic. The solution of (10) is characterized by a theorem stating that p* is a global solution of min_{||p|| ≤ ∆} m^(k)(p) if and only if there exists a scalar λ ≥ 0 such that (H + λI) p* = −∇f(x^(k)), with λ(∆ − ||p*||) = 0 and (H + λI) positive semidefinite, which is exactly the form of the Levenberg update (7). The trust region is maintained using the ratio of actual to predicted reduction,

ρ_k = (f(x^(k)) − f(x^(k) + p*)) / (m^(k)(0) − m^(k)(p*))

with ∆ enlarged when ρ_k is close to 1 and shrunk when it is small.
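To ground update rule (8), the following is a minimal, self-contained Python/NumPy sketch. The λ schedule (multiply or divide by 10) follows the description above; the test problem, function names, and constants are illustrative assumptions, not part of the original note.

```python
import numpy as np

def levenberg_marquardt(r, J, x0, lam=1e-3, iters=50):
    """Sketch of the LM update (8): x <- x - (H + lam*diag(H))^-1 J^T r,
    with H approximated by J^T J as in (3) and lam adapted as described."""
    x = x0.copy()
    for _ in range(iters):
        res, Jx = r(x), J(x)
        H, g = Jx.T @ Jx, Jx.T @ res
        step = np.linalg.solve(H + lam * np.diag(np.diag(H)), g)
        x_new = x - step
        if np.sum(r(x_new)**2) < np.sum(res**2):
            x, lam = x_new, lam / 10.0   # quadratic model is working
        else:
            lam *= 10.0                  # fall back toward gradient descent
    return x

# Illustrative use: fit y = a*exp(b*t) to noisy data.
t = np.linspace(0, 1, 30)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * np.random.default_rng(1).standard_normal(30)
r = lambda p: p[0] * np.exp(p[1] * t) - y                     # residuals
J = lambda p: np.column_stack([np.exp(p[1] * t),
                               p[0] * t * np.exp(p[1] * t)])  # Jacobian
print(levenberg_marquardt(r, J, np.array([1.0, -1.0])))
```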

A compact algorithm for rectification of stereo pairs (Chinese translation)

Machine Vision and Applications (2000) 12: 16-22. © Springer-Verlag 2000

Andrea Fusiello (1), Emanuele Trucco (2), Alessandro Verri (3)
(1) Dipartimento Scientifico e Tecnologico, Università di Verona, Ca' Vignal 2, Strada Le Grazie, 37134 Verona, Italy; e-mail: fusiello@sci.univr.it
(2) Heriot-Watt University, Department of Computing and Electrical Engineering, Edinburgh, UK
(3) INFM, Dipartimento di Informatica e Scienze dell'Informazione, Università di Genova, Genova, Italy

Received: 25 February 1999 / Accepted: 2 March 2000

Abstract. We present a linear rectification algorithm for general, unconstrained stereo rigs. The algorithm takes the two perspective projection matrices of the original cameras, and computes a pair of rectifying projection matrices. It is compact (22-line MATLAB code) and easily reproducible. We report tests proving the correct behavior of our method, as well as the negligible decrease of the accuracy of 3D reconstruction performed from the rectified images directly.

FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem

FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem

Michael Montemerlo and Sebastian Thrun, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Daphne Koller and Ben Wegbreit, Computer Science Department, Stanford University, Stanford, CA 94305-9010

Abstract

The ability to simultaneously localize a robot and accurately map its surroundings is considered by many to be a key prerequisite of truly autonomous robots. However, few approaches to this problem scale up to handle the very large number of landmarks present in real environments. Kalman-filter-based algorithms, for example, require time quadratic in the number of landmarks to incorporate each sensor observation. This paper presents FastSLAM, an algorithm that recursively estimates the full posterior distribution over robot pose and landmark locations, yet scales logarithmically with the number of landmarks in the map. This algorithm is based on an exact factorization of the posterior into a product of conditional landmark distributions and a distribution over robot paths. The algorithm has been run successfully on as many as 50,000 landmarks, environments far beyond the reach of previous approaches. Experimental results demonstrate the advantages and limitations of the FastSLAM algorithm on both simulated and real-world data.

Introduction

The problem of simultaneous localization and mapping, also known as SLAM, has attracted immense attention in the mobile robotics literature. SLAM addresses the problem of building a map of an environment from a sequence of landmark measurements obtained from a moving robot. Since robot motion is subject to error, the mapping problem necessarily induces a robot localization problem, hence the name SLAM. The ability to simultaneously localize a robot and accurately map its environment is considered by many to be a key prerequisite of truly autonomous robots [3, 7, 16].

The dominant approach to the SLAM problem was introduced in a seminal paper by Smith, Self, and Cheeseman [15]. This paper proposed the use of the extended Kalman filter (EKF) for incrementally estimating the posterior distribution over robot pose along with the positions of the landmarks. In the last decade, this approach has found widespread acceptance in field robotics, as a recent tutorial paper [2] documents. Recent research has focused on scaling this approach to larger environments with more than a few hundred landmarks.

Figure 1: The SLAM problem. The robot moves through a sequence of poses under a sequence of controls u_1, ..., u_t. As it moves, it observes nearby landmarks. At time t = 1, it observes landmark θ_1 out of two landmarks {θ_1, θ_2}; the measurement is denoted z_1 (range and bearing). At time t = 2, it observes the other landmark, θ_2, and at time t = 3, it observes θ_1 again. The SLAM problem is concerned with estimating the locations of the landmarks and the robot's path from the controls and the measurements. The gray shading illustrates a conditional independence relation.

A naive implementation of this idea leads to an algorithm whose running time is linear in the number of landmarks for each particle. We develop a tree-based data structure that reduces the running time of FastSLAM to O(M log K), where M is the number of particles in the particle filter and K is the number of landmarks, making it significantly faster than existing EKF-based SLAM algorithms. We also extend the FastSLAM algorithm to situations with unknown data association and an unknown number of landmarks, showing that our approach can be extended to the full range of SLAM problems discussed in the literature.
Experimental results using a physical robot and a robot simulator illustrate that the FastSLAM algorithm can handle orders of magnitude more landmarks than present-day approaches. We also find that in certain situations an increased number of landmarks leads to a mild reduction of the number of particles needed to generate accurate maps, whereas in others the number of particles required for accurate mapping may be prohibitively large.

SLAM Problem Definition

The SLAM problem, as defined in the rich body of literature on SLAM, is best described as a probabilistic Markov chain. The robot's pose at time t will be denoted s_t. For robots operating in the plane, which is the case in all of our experiments, poses are comprised of a robot's x-y coordinate in the plane and its heading direction. Poses evolve according to a probabilistic law, often referred to as the motion model:

p(s_t | u_t, s_{t-1})    (1)

Thus, s_t is a probabilistic function of the robot control u_t and the previous pose s_{t-1}. In mobile robotics, the motion model is usually a time-invariant probabilistic generalization of robot kinematics [1].

The robot's environment possesses K immobile landmarks. Each landmark is characterized by its location in space, denoted θ_k for k = 1, ..., K. Without loss of generality, we will think of landmarks as points in the plane, so that locations are specified by two numerical values.

To map its environment, the robot can sense landmarks. For example, it may be able to measure range and bearing to a landmark, relative to its local coordinate frame. The measurement at time t will be denoted z_t. While robots can often sense more than one landmark at a time, we follow commonplace notation by assuming that sensor measurements correspond to exactly one landmark [2]. This convention is adopted solely for mathematical convenience. It poses no restriction, as multiple landmark sightings at a single time step can be processed sequentially.

Sensor measurements are governed by a probabilistic law, often referred to as the measurement model:

p(z_t | s_t, θ, n_t)    (2)

Here θ = {θ_1, ..., θ_K} is the set of all landmarks, and n_t is the index of the landmark perceived at time t. For example, in Figure 1, we have n_1 = 1, n_2 = 2, and n_3 = 1, since the robot first observes landmark θ_1, then landmark θ_2, and finally landmark θ_1 for a second time. Many measurement models in the literature assume that the robot can measure range and bearing to landmarks, confounded by measurement noise. The variable n_t is often referred to as correspondence. Most theoretical work in the literature assumes knowledge of the correspondence or, put differently, that landmarks are uniquely identifiable. Practical implementations use maximum likelihood estimators for estimating the correspondence on the fly, which work well if landmarks are spaced sufficiently far apart. In large parts of this paper we will simply assume that landmarks are identifiable, but we will also discuss an extension that estimates the correspondences from data.

We are now ready to formulate the SLAM problem. Most generally, SLAM is the problem of determining the location of all landmarks θ and robot poses s_t from measurements z^t = z_1, ..., z_t and controls u^t = u_1, ..., u_t. In probabilistic terms, this is expressed by the posterior p(s^t, θ | z^t, u^t), where we use the superscript t to refer to a set of variables from time 1 to time t. If the correspondences are known, the SLAM problem is simpler:

p(s^t, θ | z^t, u^t, n^t)    (3)

As discussed in the introduction, all individual landmark estimation problems are independent if one knew the robot's path s^t and the correspondence variables n^t. This conditional independence is the basis of the FastSLAM algorithm described in the next section.

FastSLAM with Known Correspondences

We begin our consideration with the important case where the correspondences are known, and so is the number of landmarks K observed thus far.

Factored Representation

The conditional independence property of the SLAM problem implies that the posterior (3) can be factored as follows:

p(s^t, θ | z^t, u^t, n^t) = p(s^t | z^t, u^t, n^t) ∏_k p(θ_k | s^t, z^t, u^t, n^t)    (4)

Put verbally, the problem can be decomposed into K+1 estimation problems: one problem of estimating a posterior over robot paths, and K problems of estimating the locations of the K landmarks conditioned on the path estimate. This factorization is exact and always applicable in the SLAM problem, as previously argued in [12].

The FastSLAM algorithm implements the path estimator p(s^t | z^t, u^t, n^t) using a modified particle filter [4]. As we argue further below, this filter can sample efficiently from this space, providing a good approximation of the posterior even under non-linear motion kinematics. The landmark pose estimators p(θ_k | s^t, z^t, u^t, n^t) are realized by Kalman filters, using separate filters for different landmarks. Because the landmark estimates are conditioned on the path estimate, each particle in the particle filter has its own, local landmark estimates. Thus, for M particles and K landmarks, there will be a total of KM Kalman filters, each of dimension 2 (for the two landmark coordinates). This representation will now be discussed in detail.

Particle Filter Path Estimation

FastSLAM employs a particle filter for estimating the path posterior in (4), using a filter that is similar (but not identical) to the Monte Carlo localization (MCL) algorithm [1]. MCL is an application of particle filters to the problem of robot pose estimation (localization). At each point in time, both algorithms maintain a set of particles representing the posterior, denoted S_t. Each particle in S_t represents a "guess" of the robot's path:

s^{t,[m]} = s_1^{[m]}, s_2^{[m]}, ..., s_t^{[m]}    (5)

We use the superscript notation [m] to refer to the m-th particle in the set. The particle set S_t is calculated incrementally, from the set S_{t-1} at time t-1, a robot control u_t, and a measurement z_t. First, each particle in S_{t-1} is used to generate a probabilistic guess of the robot's pose at time t:

s_t^{[m]} ~ p(s_t | u_t, s_{t-1}^{[m]})    (6)

obtained by sampling from the probabilistic motion model. This estimate is then added to a temporary set of particles, along with the path s^{t-1,[m]}. Under the assumption that the set of particles in S_{t-1} is distributed according to the posterior at time t-1 (which is an asymptotically correct approximation), the new particle is distributed according to p(s^t | z^{t-1}, u^t, n^{t-1}). This distribution is commonly referred to as the proposal distribution of particle filtering.

After generating M particles in this way, the new set S_t is obtained by sampling from the temporary particle set. Each particle is drawn (with replacement) with a probability proportional to a so-called importance factor, which is calculated as follows [10]:

w_t^{[m]} = target distribution / proposal distribution    (7)

The exact calculation of (7) will be discussed further below. The resulting sample set is distributed according to an approximation to the desired pose posterior, an approximation which is correct as the number of particles M goes to infinity. We also notice that only the most recent robot pose estimate s_{t-1}^{[m]} is used when generating the particle set S_t. This allows us to silently "forget" all other pose estimates, rendering the size of each particle independent of the time index.

Landmark Location Estimation

FastSLAM represents the conditional landmark estimates p(θ_k | s^t, z^t, u^t, n^t) in (4) by Kalman filters. Since this estimate is conditioned on the robot pose, the Kalman filters are attached to individual pose particles in S_t. More specifically, the full posterior over paths and landmark positions in the FastSLAM algorithm is represented by the sample set

S_t = { s^{t,[m]}, μ_1^{[m]}, Σ_1^{[m]}, ..., μ_K^{[m]}, Σ_K^{[m]} }_m    (8)

Here μ_k^{[m]} and Σ_k^{[m]} are the mean and covariance of the Gaussian representing the k-th landmark, attached to the m-th particle. In the planar robot navigation scenario, each mean is a two-element vector, and each covariance is a 2 by 2 matrix.

The posterior over the k-th landmark pose is easily obtained. Its computation depends on whether or not n_t = k, that is, whether or not θ_k was observed at time t. For n_t = k, the posterior is updated by Bayes rule using the new measurement (9); for n_t ≠ k, we simply leave the Gaussian unchanged (10). The FastSLAM algorithm implements the update equation (9) using the extended Kalman filter (EKF). As in existing EKF approaches to SLAM, this filter uses a linearized version of the perceptual model [2]. Thus, FastSLAM's EKF is similar to the traditional EKF for SLAM [15] in that it approximates the measurement model using a linear Gaussian function. We note that, with a linear Gaussian observation model, the resulting distribution is exactly a Gaussian, even if the motion model is not linear. This is a consequence of the use of sampling to approximate the distribution over the robot's pose.

One significant difference between the FastSLAM algorithm's use of Kalman filters and that of the traditional SLAM algorithm is that the updates in the FastSLAM algorithm involve only a Gaussian of dimension two (for the two landmark location parameters), whereas in the EKF-based SLAM approach a Gaussian of size 2K+3 has to be updated (with K landmarks and 3 robot pose parameters). This calculation can be done in constant time in FastSLAM, whereas it requires time quadratic in K in standard SLAM.
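To make the 2-dimensional update concrete, here is a minimal sketch of the per-particle landmark EKF for a range-bearing sensor, written in Python/NumPy. It follows the standard linearization of such a measurement model; the function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def landmark_ekf_update(mu, Sigma, z, robot_pose, R):
    """Per-particle landmark update: a 2x2 EKF for one landmark,
    conditioned on the particle's pose. z = (range, bearing); R is the
    2x2 measurement noise covariance; names are illustrative."""
    rx, ry, rtheta = robot_pose
    dx, dy = mu[0] - rx, mu[1] - ry
    q = dx**2 + dy**2
    z_hat = np.array([np.sqrt(q),
                      np.arctan2(dy, dx) - rtheta])      # predicted measurement
    # Jacobian of the measurement w.r.t. the landmark position only (2x2).
    H = np.array([[dx / np.sqrt(q),  dy / np.sqrt(q)],
                  [-dy / q,          dx / q]])
    S = H @ Sigma @ H.T + R                              # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)                   # Kalman gain
    innov = z - z_hat
    innov[1] = (innov[1] + np.pi) % (2 * np.pi) - np.pi  # wrap bearing
    mu_new = mu + K @ innov
    Sigma_new = (np.eye(2) - K @ H) @ Sigma
    return mu_new, Sigma_new
```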
Calculating the Importance Weights

Let us now return to the problem of calculating the importance weights needed for particle filter resampling, as defined in (7).

Figure 2: A tree representing K = 8 landmark estimates μ_k, Σ_k within a single particle.

Figure 4: (a) Physical robot mapping rocks, in a testbed developed for Mars Rover research. (b) Raw range and path data. (c) Map generated using FastSLAM (dots), and locations of rocks determined manually (circles).

FastSLAM with Unknown Data Association

With unknown data association, the algorithm has to estimate which landmark each measurement corresponds to in the map. It also has to determine if a measurement corresponds to a new, previously unseen landmark, in which case the map should be augmented accordingly. In most existing SLAM solutions based on EKFs, these problems are solved via maximum likelihood. More specifically, the probability of a data association n_t is given by equation (12). The step labeled "PF" uses the particle filter approximation to the posterior. The final step assumes a uniform prior p(n_t), which is commonly used [2]. The maximum likelihood data association is simply the index n_t that maximizes (12). If the maximum value (with careful consideration of all constants in (12)) is below a threshold, the landmark is considered previously unseen and the map is augmented accordingly.

In FastSLAM, the data association is estimated on a per-particle basis: n_t^{[m]}. As a result, different particles may rely on different values of n_t. They might even possess different numbers of landmarks in their respective maps. This constitutes a primary difference to EKF approaches, which determine the data association only once for each sensor measurement. It has been observed frequently that false data association will make the conventional EKF approach fail catastrophically [2]. FastSLAM is more likely to recover, thanks to its ability to pursue multiple data associations simultaneously. Particles with wrong data association are (in expectation) more likely to disappear in the resampling process than those that guess the data association correctly.

We believe that, under mild assumptions (e.g., minimum spacing between landmarks and bounded sensor error), the data association search can be implemented in time logarithmic in K. One possibility is the use of kd-trees as an indexing scheme in the tree structures above, instead of the landmark number, as proposed in [11].

Experimental Results

The FastSLAM algorithm was tested extensively under various conditions. Real-world experiments were complemented by systematic simulation experiments, to investigate the scaling abilities of the approach. Overall, the results indicate favorable scaling to large numbers of landmarks and small particle sets. A fixed number of particles (e.g., M = 100) appears to work well across a large number of situations.

Figure 4a shows the physical robot testbed, which consists of a small arena set up under NASA funding for Mars Rover research. A Pioneer robot equipped with a SICK laser range finder was driven along an approximately straight line, generating the raw data shown in Figure 4b. The resulting map generated with M = 100 samples is depicted in Figure 4c, with manually determined landmark locations marked by circles. The robot's estimates are indicated by x's, illustrating the high accuracy of the resulting maps. FastSLAM resulted in an average residual map error of 8.3 centimeters, when compared to the manually generated map.

Unfortunately, the physical testbed does not allow for systematic experiments regarding the scaling properties of the approach. In extensive simulations, the number of landmarks was increased up to a total of 50,000, which FastSLAM successfully mapped with as few as 100 particles. Here, the number of parameters in FastSLAM is approximately 0.3% of that in the conventional EKF. Maps with 50,000 landmarks are entirely out of range for conventional SLAM techniques, due to their enormous computational complexity. Figure 5 shows example maps with smaller numbers of landmarks, for different maximum sensor ranges as indicated. The ellipses in Figure 5 visualize the residual uncertainty when integrated over all particles and Gaussians.

In a set of experiments specifically aimed to elucidate the scaling properties of the approach, we evaluated the map and robot pose errors as a function of the number of landmarks K, and the number of particles M, respectively. The results are graphically depicted in Figure 6. Figure 6a illustrates that an increase in the number of landmarks K mildly reduces the error in the map and the robot pose. This is because the larger the number of landmarks, the smaller the robot pose error at any point in time. Increasing the number of particles M also bears a positive effect on the map and pose errors, as illustrated in Figure 6b. In both diagrams, the bars correspond to 95% confidence intervals.

Figure 5: Maps and estimated robot path, generated using sensors with (a) large and (b) small perceptual fields. The correct landmark locations are shown as dots, and the estimates as ellipses, whose sizes correspond to the residual uncertainty.

Conclusion

We presented the FastSLAM algorithm, an efficient new solution to the concurrent mapping and localization problem. This algorithm utilizes a Rao-Blackwellized representation of the posterior, integrating particle filter and Kalman filter representations. Similar to Murphy's work [12], FastSLAM is based on an inherent conditional independence property of the SLAM problem. However, Murphy's approach maintains maps using grid positions with discrete values, and therefore scales poorly with the size of the map. His approach also did not deal with the data association problem, which does not arise in the grid-based setting.

In FastSLAM, landmark estimates are efficiently represented using tree structures. Updating the posterior requires O(M log K) time, where M is the number of particles and K the number of landmarks. This is in contrast to the quadratic complexity of the common Kalman-filter-based approach to SLAM. Experimental results illustrate that FastSLAM can build maps with orders of magnitude more landmarks than previous methods. They also demonstrate that under certain conditions, a small number of particles works well regardless of the number of landmarks.

Acknowledgments

We thank Kevin Murphy and Nando de Freitas for insightful discussions on this topic. This research was sponsored by DARPA's MARS Program (contract number N66001-01-C-6018) and the National Science Foundation (CAREER grant number IIS-9876136 and regular grant number IIS-9877033). We thank the Hertz Foundation for their support of Michael Montemerlo's graduate research. Daphne Koller was supported by the Office of Naval Research, Young Investigator (PECASE) grant N00014-99-1-0464. This work was done while Sebastian Thrun was visiting Stanford University.

Figure 6: Accuracy of the FastSLAM algorithm as a function of (a) the number of landmarks K, and (b) the number of particles M. Large numbers of landmarks reduce the robot localization error, with little effect on the map error. Good results can be achieved with as few as 100 particles.

References

[1] F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte Carlo localization for mobile robots. ICRA-99.
[2] G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba. An experimental and theoretical investigation into simultaneous localisation and map building (SLAM). Lecture Notes in Control and Information Sciences: Experimental Robotics VI, Springer, 2000.
[3] G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba. A solution to the simultaneous localisation and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 2001.
[4] A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[5] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. UAI-2000.
[6] J. Guivant and E. Nebot. Optimization of the simultaneous localization and map building algorithm for real time implementation. IEEE Transactions on Robotics and Automation, May 2001.
[7] D. Kortenkamp, R. P. Bonasso, and R. Murphy, editors. AI-based Mobile Robots: Case Studies of Successful Robot Systems. MIT Press, 1998.
[8] J. J. Leonard and H. J. S. Feder. A computationally efficient method for large-scale concurrent mapping and localization. ISRR-99.
[9] F. Lu and E. Milios. Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4, 1997.
[10] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1953.
[11] A. W. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. NIPS-98.
[12] K. Murphy. Bayesian map learning in dynamic environments. NIPS-99.
[13] K. Murphy and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Sequential Monte Carlo Methods in Practice, Springer, 2001.
[14] P. Newman. On the Structure and Solution of the Simultaneous Localisation and Map Building Problem. PhD thesis, University of Sydney, 2000.
[15] R. Smith, M. Self, and P. Cheeseman. Estimating uncertain spatial relationships in robotics. In Autonomous Robot Vehicles, Springer, 1990.
[16] C. Thorpe and H. Durrant-Whyte. Field robots. ISRR-2001.
[17] S. Thrun, D. Fox, and W. Burgard. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning, 31, 1998.

A survey of superpixel segmentation algorithms

Superpixel segmentation is an important research direction in computer vision. Its goal is to partition an image into a set of tightly connected regions, each with similar color and texture characteristics.

Superpixel segmentation plays an important role in many computer vision tasks, such as image segmentation, object detection, and semantic segmentation.

This survey introduces some common superpixel segmentation algorithms and their applications.

1. SLIC (Simple Linear Iterative Clustering): SLIC is a superpixel segmentation algorithm based on k-means clustering.

It first samples a set of initial superpixel centers uniformly over the image and then iteratively assigns each pixel to the nearest superpixel center.

The SLIC algorithm combines color and spatial information; it is simple and efficient, and suitable for real-time applications.

2. QuickShift: QuickShift is a superpixel segmentation algorithm based on density peaks.

It computes a similarity for each pixel from the color similarity and spatial similarity of the image, and forms superpixels by shifting the boundaries between pixels.

The QuickShift algorithm does not depend on a predefined number of superpixels and is suitable for images of different sizes and shapes.

3. CPMC (Constrained Parametric Min-Cuts): CPMC is a superpixel segmentation algorithm based on graph cuts.

The algorithm obtains boundary-connected superpixel segmentations by solving minimum-cut problems.

The CPMC algorithm can generate regularly shaped superpixels and is suitable for applications with high demands on shape accuracy.

4. LSC (Linear Spectral Clustering): LSC is a superpixel segmentation algorithm based on linear spectral clustering.

It builds a color and spatial adjacency graph of the image and performs a spectral decomposition on it to obtain the superpixel segmentation.

The LSC algorithm delivers good segmentation results with high computational efficiency and is suitable for processing large-scale image data.

5. SEEDS (Superpixels Extracted via Energy-Driven Sampling): SEEDS is a superpixel segmentation algorithm based on random sampling.

It iteratively encodes pixel similarity in an energy function and generates superpixels by minimizing that energy function.

The SEEDS algorithm can quickly generate boundary-connected superpixels and is suitable for real-time applications.

OpenCV and the Gauss-Newton method

Contents: 1. Introduction to OpenCV; 2. The principle of the Gauss-Newton method; 3. Applications of the Gauss-Newton method in OpenCV; 4. Advantages and disadvantages of the Gauss-Newton method; 5. Summary.

1. Introduction to OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source computer vision library that contains a large number of algorithms for image processing and computer vision.

The goal of OpenCV is to provide a common and efficient platform for researchers and developers in artificial intelligence, machine vision, image processing, and related fields.

It supports multiple programming languages, such as C++ and Python, so developers can conveniently develop and test on different operating systems.

2. The principle of the Gauss-Newton method

The Gauss-Newton method is an optimization algorithm for solving nonlinear least squares problems.

Its basic idea is to fit the data points by weighted least squares in order to obtain the optimal parameter values.

In image processing, the Gauss-Newton method is often used to estimate image-related parameters, for example a camera's intrinsic and extrinsic parameters.

3. Applications of the Gauss-Newton method in OpenCV

In the OpenCV library, Gauss-Newton-style optimization is widely used in the following areas:

(1) Camera calibration: camera calibration is an important step in computer vision. Its purpose is to estimate the camera's intrinsic parameters (such as focal length and principal point) and extrinsic parameters (such as the rotation and translation matrices) by imaging scenes containing known geometry.

OpenCV provides a series of functions for camera calibration, among them functions that rely on this kind of least squares optimization.

(2) Image fitting: in image processing, one often needs a parametric fit to an image in order to recover key points, texture, and similar information.

OpenCV provides image fitting functionality based on such least squares optimization, which can fit image feature points precisely.

(3) Solving optimization problems: OpenCV also offers solvers for optimization problems of this kind, such as linear and nonlinear least squares problems.

4. Advantages and disadvantages of the Gauss-Newton method

The Gauss-Newton method has the following characteristics:

(1) Advantages: the Gauss-Newton method is iterative, its update formula is numerically well behaved, and it can converge quickly to the least squares solution.

It also has a degree of robustness and can achieve a good fit even when the data points are noisy.
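To make the iteration concrete, here is a minimal Gauss-Newton sketch in Python/NumPy. It is a from-scratch illustration, not an OpenCV API: OpenCV exposes this family of solvers only indirectly, through routines such as calibrateCamera and solvePnPRefineLM. The model, Jacobian, and starting point below are illustrative assumptions.

```python
import numpy as np

def gauss_newton(r, J, x0, iters=20):
    """At each step, solve the normal equations (J^T J) dx = J^T r
    and update x <- x - dx."""
    x = x0.copy()
    for _ in range(iters):
        res, Jx = r(x), J(x)
        dx = np.linalg.solve(Jx.T @ Jx, Jx.T @ res)
        x = x - dx
    return x

# Illustrative use: fit y = a / (1 + b*t).
t = np.linspace(0, 5, 40)
y = 3.0 / (1 + 0.7 * t)
r = lambda p: p[0] / (1 + p[1] * t) - y                  # residuals
J = lambda p: np.column_stack([1 / (1 + p[1] * t),
                               -p[0] * t / (1 + p[1] * t)**2])
print(gauss_newton(r, J, np.array([1.0, 1.0])))          # expect ~(3.0, 0.7)
```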

A heuristic coloring algorithm for finding the maximum complete subgraph

Journal of Chuzhou University, Vol. 12, No. 2, April 2010

A Heuristic Coloring Algorithm for Finding the Maximum Complete Subgraph

Coloring algorithms are a classic family of algorithms in graph theory. By coloring the vertices or edges of a graph, they are widely used in scheme selection, cartography, and many other fields. A literature search, however, turns up no mature algorithm that uses coloring to find the maximum complete subgraph. Based on our research, this paper derives a coloring algorithm that finds the maximum complete subgraph of a loop-free, simple, undirected connected graph.

Automatic test equipment (ATE) transfers large volumes of test data to the chip under test. To shorten test time and reduce test cost, the test data must be compressed. Since test data contain a large number of don't-care bits, the compatibility between test vectors is good, so the test vectors can be compressed by compatibility-based methods. Two test vectors are compatible if their corresponding bits are equal or at least one of them is a don't-care bit.

Because the problem is hard to solve exactly, some researchers have proposed heuristic algorithms for the maximum complete subgraph problem. Current heuristics fall into the following classes: sequential greedy heuristics, genetic algorithms, neural networks, simulated annealing, tabu search, and heuristics based on continuous optimization, all of which have achieved satisfactory results.

The method partitions the maximal compatible classes by classifying vertices through coloring; further study shows that the method can be evolved into a general algorithm for finding maximal complete subgraphs, from which an algorithm for finding the maximum complete subgraph is obtained.

2. Related Concepts

Definition 1: An ordered pair (V, E) is usually called a graph, denoted G(V, E).

The principle of the Levenberg-Marquardt algorithm

The Levenberg-Marquardt Algorithm: Analysis and Applications. In computer vision and machine learning, the Levenberg-Marquardt (LM) algorithm is an indispensable tool.

It is an optimization technique for nonlinear least squares problems, widely applied in image processing, robot control, signal analysis, and many other areas.

This article examines the algorithm's principle, its workflow, and its application to practical problems.

First, a note on the algorithm's name.

Kenneth Levenberg first proposed the method in 1944, and Donald Marquardt later rediscovered and refined it independently in 1963; the algorithm carries both their names.

It combines the strengths of gradient descent and the Gauss-Newton method, and is aimed at parameter estimation in nonlinear systems.

The core principle of the Levenberg-Marquardt algorithm is the minimization of an objective function.

In many problems, we need to find a set of parameters that makes a function's output as close as possible to known data or measured values.

The objective function is usually expressed as a sum of squared errors, F(x) = Σ_i (y_i − f_i(x))², where x is the parameter vector to be estimated, the y_i are the measured values, and the f_i(x) are the model predictions.

The algorithm's goal is to adjust x iteratively until the objective function reaches a minimum.

The key to the algorithm is a dynamic damping factor λ, which is adjusted automatically at each iteration according to the size of the residual.

When the residual is small, the model fits well and λ is decreased, so that the algorithm behaves more like the classical Gauss-Newton least squares step; when the residual is large, λ is increased, tilting the algorithm toward gradient descent, which helps the iteration recover when the quadratic model is poor.

This strategy effectively balances exploration and local convergence and improves the algorithm's performance.

The workflow of the Levenberg-Marquardt algorithm can be divided into the following steps (a library-based sketch follows the list):

1. Initialization: choose an initial parameter vector x_0.

2. Residual computation: compute the objective value at the current parameters.

3. Gradient computation: compute the gradient of the objective with respect to the parameters (in practice via the Jacobian J of the residual vector r, with gradient Jᵀr).

4. Update: use the gradient together with λ to update the parameters, by solving the damped normal equations (JᵀJ + λ·diag(JᵀJ)) Δx = −Jᵀr and setting x_new = x_old + Δx.

5. Convergence check: test whether the new parameters satisfy a stopping condition, e.g., the change in the residual falls below a preset threshold or the maximum number of iterations is reached.

6. Update λ: adjust the value of λ according to the change in the residual.

7. Repeat steps 2-6 until the stopping condition is satisfied.
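As a usage-level illustration of these steps, the sketch below leans on SciPy's least_squares with method='lm', which wraps MINPACK's classic Levenberg-Marquardt implementation; the model and data are made up for the example.

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative data: fit y = a*exp(b*x).
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + 0.05 * np.random.default_rng(0).standard_normal(50)

residuals = lambda p: p[0] * np.exp(p[1] * x) - y
fit = least_squares(residuals, x0=[1.0, 1.0], method='lm')
print(fit.x)   # estimated (a, b)
```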

The blue-stein algorithm

The blue-stein algorithm described here is a commonly used algorithm for cluster analysis.

It evolved from the original k-means algorithm, but when processing large data sets its advantages over k-means become more pronounced.

The algorithm is introduced in detail below.

Its basic idea is to complete the cluster analysis through a sequence of steps.

The basic procedure is as follows:

1. Randomly choose one data point and take it as the first cluster center.

2. For the remaining data points, compute their distances to the current cluster centers, and choose the point farthest from the current centers as the next cluster center.

3. Repeat the step above until the specified number of cluster centers has been chosen.

4. For each data point, compute its distance to every cluster center and assign it to the category of the nearest center.

5. Compute the centroid of each category and take it as the new cluster center.

6. Repeat the steps above until the clustering result converges.

On top of this basic algorithm, several refinements exist:

1. The MiniBatchKMeans algorithm.

It randomly selects a subset of the data and performs the clustering only on the points in that subset, which effectively reduces the computational cost.

2. The KMeans++ algorithm.

It chooses better-spread initial cluster centers, which improves the final clustering result.

3. The Birch algorithm.

It splits a large number of data points into many sub-clusters and then merges the sub-clusters into one large cluster, improving both computational efficiency and clustering quality.

In summary, this is a capable clustering approach that can perform efficient cluster analysis on large data sets.

Moreover, with the refinements above, the clustering quality can be improved further.
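The refinements above map directly onto scikit-learn classes; the following sketch assumes the scikit-learn package, uses synthetic blob data, and its parameter values are illustrative. Note that scikit-learn's 'k-means++' seeding is a randomized relative of the farthest-point selection described in step 2 above.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# k-means++ seeding spreads the initial centers; MiniBatchKMeans trades
# a little accuracy for much lower cost on large data sets.
km = KMeans(n_clusters=5, init='k-means++', n_init=10).fit(X)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024).fit(X)
print(km.inertia_, mbk.inertia_)
```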

Pseudocode for Kruskal's algorithm

Kruskal's algorithm solves the minimum spanning tree problem.

A minimum spanning tree (MST) of an undirected graph G is a tree T within G whose total edge weight is minimal.

The basic idea of Kruskal's algorithm is to sort all edges by weight in ascending order and then take the edges in that order; if a selected edge does not form a cycle with the edges chosen so far, it is added to the minimum spanning tree.

This continues until the minimum spanning tree contains n−1 edges, at which point the minimum spanning tree is complete.

1. Sort the edges by weight in ascending order.

2. Initialize a union-find (disjoint-set) structure in which every vertex starts as its own subset.

3. Traverse the sorted edges, adding them to the minimum spanning tree in turn.

4. For each edge, check whether its two endpoints lie in the same subset; if not, merge the two subsets and add the edge to the minimum spanning tree.

5. Stop when the minimum spanning tree has n−1 edges.

The following Java code implements Kruskal's algorithm:

```java
import java.util.*;

// Edge class
class Edge implements Comparable<Edge> {
    int u; // start vertex of the edge
    int v; // end vertex of the edge
    int w; // weight of the edge

    public Edge(int u, int v, int w) {
        this.u = u;
        this.v = v;
        this.w = w;
    }

    @Override
    public int compareTo(Edge other) {
        return Integer.compare(this.w, other.w); // sort by weight, ascending
    }
}

// Union-find (disjoint-set) class
class UnionFind {
    int[] parent;
    int[] rank;

    public UnionFind(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    // Find the root of p with path compression
    public int find(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // path compression
            p = parent[p];
        }
        return p;
    }

    // Check whether two vertices are in the same subset
    public boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    // Merge the subsets containing p and q (union by rank)
    public void union(int p, int q) {
        int rp = find(p), rq = find(q);
        if (rp == rq) return;
        if (rank[rp] < rank[rq]) parent[rp] = rq;
        else if (rank[rp] > rank[rq]) parent[rq] = rp;
        else { parent[rq] = rp; rank[rp]++; }
    }
}

public class Kruskal {
    public static void main(String[] args) {
        int V = 6;
        List<Edge> edges = new ArrayList<>(Arrays.asList(
            new Edge(0, 2, 1), new Edge(3, 5, 2), new Edge(1, 4, 3),
            new Edge(0, 1, 6), new Edge(2, 3, 5)));

        Collections.sort(edges);              // sort edges by weight, ascending
        UnionFind uf = new UnionFind(V);      // initialize union-find
        List<Edge> mst = new ArrayList<>();

        for (Edge e : edges) {                // take edges in ascending order
            if (!uf.connected(e.u, e.v)) {    // skip edges that would form a cycle
                uf.union(e.u, e.v);
                mst.add(e);
            }
        }

        System.out.println("Edges of the minimum spanning tree:");
        for (Edge e : mst) {
            System.out.println(e.u + " - " + e.v + " : " + e.w);
        }
    }
}
```

Output:

Edges of the minimum spanning tree:
0 - 2 : 1
3 - 5 : 2
1 - 4 : 3
2 - 3 : 5
0 - 1 : 6
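For comparison, the same graph can be run through a library implementation; this sketch assumes the networkx package, and the edge list mirrors the example above.

```python
import networkx as nx

# Same graph as the Java example; minimum_spanning_tree supports
# Kruskal's algorithm directly.
G = nx.Graph()
G.add_weighted_edges_from([(0, 2, 1), (3, 5, 2), (1, 4, 3),
                           (0, 1, 6), (2, 3, 5)])
mst = nx.minimum_spanning_tree(G, algorithm="kruskal")
print(sorted(mst.edges(data="weight")))
```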


Algorithms for Large Scale Markov Blanket Discovery

Ioannis Tsamardinos, Constantin F. Aliferis, Alexander Statnikov
Department of Biomedical Informatics, Vanderbilt University, 2209 Garland Ave, Nashville, TN 37232-8340
{ioannis.tsamardinos, constantin.aliferis, alexander.statnikov@}

Abstract

This paper presents a number of new algorithms for discovering the Markov Blanket of a target variable T from training data. The Markov Blanket can be used for variable selection for classification, for causal discovery, and for Bayesian Network learning. We introduce a low-order polynomial algorithm and several variants that soundly induce the Markov Blanket under certain broad conditions in datasets with thousands of variables, and compare them to other state-of-the-art local and global methods with excellent results.

Introduction

The Markov Blanket of a variable of interest T, denoted as MB(T), is a minimal set of variables conditioned on which all other variables are probabilistically independent of the target T. Given this property, knowledge of only the values of the MB(T) is enough to determine the probability distribution of T, and the values of all other variables become superfluous. Therefore, the variables in the MB(T) are adequate for optimal classification. The strong connection between MB(T) and optimal, principled variable selection has been explored in (Tsamardinos and Aliferis 2003).

In addition, under certain conditions (faithfulness to a Bayesian Network; see next section) the MB(T) is identical to the direct causes, direct effects, and direct effects of direct causes of T, and thus it can be used for causal discovery, e.g., to reduce the number of variables an experimentalist has to consider in order to discover the direct causes of T.

Finally, Markov Blanket discovery algorithms can be used to guide Bayesian Network learning algorithms: the MB(T) for all T are identified as a first step and then used to guide the construction of the Bayesian Network of the domain; this is the approach taken in (Margaritis and Thrun 1999). Indeed, given the potential uses and significance of the concept of the Markov Blanket, "It is surprising … how little attention it has attracted in the context of Bayesian net structure learning for all its being a fundamental property of a Bayesian net" (Margaritis and Thrun 1999).

In this paper we present novel algorithms that soundly induce the MB(T) from data and scale up to thousands of variables. We compare the new algorithms with other state-of-the-art methods for inducing the MB(T), with excellent results. The novel algorithms are particularly suited for the cases where the available sample size is enough to perform conditional independence tests conditioned on the full MB(T).

Background

Bayesian Networks (BNs) (Neapolitan 1990) are mathematical objects that compactly represent a joint probability distribution J using a graph G annotated with conditional probabilities; J and G are connected by the Markov Condition property: a node is conditionally independent of its non-descendants, given its parents. The MB(T) probabilistically shields T from the rest of the variables, and graphically it corresponds to a neighborhood of T in the BN graph. We will denote the conditional independence of X and T given Z as I(X ; T | Z) ≡ P(T | X, Z) = P(T | Z).
Definitions: The Markov Blanket of a variable T, MB(T), is a minimal set for which I(X ; T | MB(T)), for all X ∈ V − {T} − MB(T) (Margaritis and Thrun 1999). A BN C is faithful to a joint probability distribution J over the variable set V if and only if every dependence entailed by the graph of C is also present in J (Spirtes et al. 2000). A BN C is faithful if it is faithful to its corresponding distribution J.

The Markov Condition ensures that every conditional independence entailed by the graph G is also present in the probability distribution J. Thus, together Faithfulness and the Markov Condition establish a close relationship between the graph G and some empirical or theoretical probability distribution J. In practical terms, what faithfulness implies is that we can associate statistical properties of the probability distribution J with properties of the graph G of the corresponding BN. It turns out that in faithful BNs, the set of parents, children, and spouses of T (i.e., parents of children of T) is the unique MB(T). An example of the Markov Blanket concept is displayed in Figure 1: the MB(T) is the set of gray-filled nodes.

The IAMB Algorithm and Variants

In this section several novel algorithms for discovering the MB(T) are presented that are sound under the following assumptions: (i) the data are generated by processes that can be faithfully represented by BNs, and (ii) there exist reliable statistical tests of conditional independence and measures of association for the given variable distribution, sample size, and sampling of the data. We discuss the rationale and justification of the assumptions in the Discussion section.

IAMB Description: Incremental Association Markov Blanket (IAMB) (Figure 2) consists of two phases, a forward and a backward one. An estimate of the MB(T) is kept in the set CMB. In the forward phase all variables that belong in MB(T), and possibly more (false positives), enter CMB, while in the backward phase the false positives are identified and removed so that CMB = MB(T) in the end. The heuristic used in IAMB to identify potential Markov Blanket members in phase I is the following: start with an empty candidate set CMB and admit into it (in the next iteration) the variable that maximizes a heuristic function f(X ; T | CMB). Function f should return a non-zero value for every variable that is a member of the Markov Blanket for the algorithm to be sound, and typically it is a measure of association between X and T given CMB. In our experiments we used as f the Mutual Information, similar to what is suggested in (Margaritis and Thrun 1999, Cheng et al. 1998): f(X ; T | CMB) is the Mutual Information between X and T given CMB. It is important that f is an informative and effective heuristic so that the set of candidate variables after phase I is as small as possible, for two reasons: one is time efficiency (i.e., do not spend time considering irrelevant variables) and the other is sample efficiency (do not require a sample larger than what is absolutely necessary to perform conditional tests of independence). In backward conditioning (Phase II) we remove one-by-one the features that do not belong to the MB(T), by testing whether a feature X from CMB is independent of T given the remaining CMB.

IAMB Proof of Correctness (sketch): If a feature belongs to MB(T), then it will be admitted in the first phase because it will be dependent on T given any subset of the feature set, because of faithfulness and because the MB(T) is the minimal set with that property. If a feature is not a member of MB(T), then conditioned on MB(T), or any superset of MB(T), it will be independent of T and thus will be removed from CMB in the second phase. Using this argument inductively, we see that we will end up with the unique MB(T).

InterIAMBnPC Description: The smaller the conditioning set, given a finite sample of fixed size, the more accurate the statistical tests of independence and the measures of association. The InterIAMBnPC algorithm uses two methods to reduce the size of the conditioning sets: (a) it interleaves the admission phase of IAMB (phase I) with the backward conditioning (phase II), attempting to keep the size of CMB as small as possible during all steps of the algorithm's execution; (b) it substitutes the backward conditioning phase as implemented in IAMB with the PC algorithm (Spirtes et al. 2000), a Bayesian Network learning algorithm that determines direct edges between variables in a more sample-efficient manner, and that is sound given the stated assumptions (see next section); thus, InterIAMBnPC is expected to be more sample-efficient than IAMB. In addition, InterIAMBnPC is still practical because PC is run only on small sets of variables, not the full set of variables.

InterIAMBnPC Proof of Correctness (sketch): All parents and children of T will enter CMB by the property of f mentioned above. Since PC is sound, it will never remove these variables. Since all effects enter CMB, conditioned on them, all the spouses (parents of children) of T will be dependent with T given CMB and enter CMB at some point. Again, because PC is sound, it will not permanently remove them (they may be removed temporarily but will enter CMB at a subsequent iteration; we do not elaborate due to space limitations), and they will be included in the final output.

Two other IAMB variants we experimented with are InterIAMB and IAMBnPC, which are similar to InterIAMBnPC but employ only either interleaving the first two phases or using PC for the backward phase, respectively. Even though IAMB provides theoretical guarantees only in the sample limit, the quality of the output and the approximation of the true MB(T) degrade gracefully in practical settings with finite sample (see experimental section). IAMB and its variants are expected to perform best in problems where the MB(T) is small relative to the available data samples, but the domain may contain hundreds of thousands of variables.
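As a concrete illustration of the two-phase scheme, here is a compact, self-contained Python sketch of vanilla IAMB for fully discrete data. It uses a plug-in estimate of conditional mutual information and a fixed threshold eps in place of a properly calibrated statistical test, so it is a didactic approximation of the algorithm described above, not the authors' implementation.

```python
import numpy as np
from itertools import product

def cond_mutual_info(x, t, z):
    """Plug-in estimate of I(X; T | Z) for discrete arrays; z is a 2D array
    of conditioning columns (possibly with zero columns)."""
    zc = np.zeros(len(x), dtype=int) if z.shape[1] == 0 else \
         np.unique(z, axis=0, return_inverse=True)[1]
    mi = 0.0
    for zv in np.unique(zc):
        idx = zc == zv
        pz = idx.mean()
        xs, ts = x[idx], t[idx]
        for xv, tv in product(np.unique(xs), np.unique(ts)):
            pxt = np.mean((xs == xv) & (ts == tv))
            px, pt = np.mean(xs == xv), np.mean(ts == tv)
            if pxt > 0:
                mi += pz * pxt * np.log(pxt / (px * pt))
    return mi

def iamb(data, t_col, eps=0.02):
    """Sketch of IAMB: forward phase greedily admits the variable maximizing
    I(X; T | CMB); backward phase removes X if I(X; T | CMB \\ {X}) is small."""
    n_vars = data.shape[1]
    T = data[:, t_col]
    cmb = []
    changed = True
    while changed:                                   # Phase I (forward)
        changed = False
        candidates = [i for i in range(n_vars) if i != t_col and i not in cmb]
        if not candidates:
            break
        scores = [cond_mutual_info(data[:, i], T, data[:, cmb])
                  for i in candidates]
        best = int(np.argmax(scores))
        if scores[best] > eps:
            cmb.append(candidates[best])
            changed = True
    for x in list(cmb):                              # Phase II (backward)
        rest = [v for v in cmb if v != x]
        if cond_mutual_info(data[:, x], T, data[:, rest]) <= eps:
            cmb.remove(x)
    return cmb
```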
Nevertheless, since in our experiments we observed that the size of the CMB is relative small to the total number of variables, the additional time overhead of the variants versus the vanilla IAMB was minimal.Other Markov Blanket algorithmsTo our knowledge, the only other algorithm developed explicitly for discovering the MB(T) and that scales-up is the Grow-Shrink (GS) algorithm (Margaritis and Thrun 1999). It is theoretically sound but uses a static and poten-tially inefficient heuristic in the first phase. IAMB en-hances GS by employing a dynamic heuristic. The Koller-Sahami algorithm (KS) (Koller and Sahami 1996) is the first algorithm for feature selection to employ the conceptof the Markov Blanket. KS is a heuristic algorithm and provides no theoretical guarantees.The GS algorithm is structurally similar to IAMB and follows the same two-phase structure. However, there is one important difference: GS statically orders the variables when they are considered for inclusion in phase I, accord-ing to their strength of association with T given the empty set. It then admits into CMB the next variable in that order-ing that is not conditionally independent from T given CMB. One problem with this heuristic is that when the MB(T) contains spouses of T. In that case, the spouses are typically associated with T very weakly given the empty set and are considered for inclusion in the MB(T) late in the first phase (associations between spouses and T are only through confounding /common ancestors variables, thus they are weaker than those ancestors’ associations with T). In turn, this implies that more false positives will enter CMB at phase I and the conditional tests of inde-pendence will become unreliable much sooner than when using IAMB’s heuristic. In contrast, conditioned on the common children, spouses may have strong association with T and, when using IAMB’s heuristic, enter the CMB early. An analogous situation is in constraint satisfaction where dynamic heuristics typically outperform static ones. We provide evidence to support this hypothesis in the ex-periment section. We would also like to note that the proof of correctness of GS is indeed correct only if one assumes faithfulness, and not just the existence of a unique MB(T) as is stated in the paper: a non-faithful counter example is when T is the exclusive or of X and Y on which the GS will fail to discover the MB(T), even though it is unique.The KS algorithm (Koller, Sahami 1996) is the first one that employed the concept of the Markov Blanket for fea-ture selection. The algorithm accepts two parameters: (i) the number v of variables to retain and (ii) a parameter k which is the maximum number of variables the algorithm is allowed to condition on. For k=0 KS is equivalent to univariately ordering the variables and selecting the first v. The Koller-Sahami paper does not explicitly claim to iden-tify the MB(T); however, if one could guess the size of the MB(T) and set the parameter v to this number then ideally the algorithm should output MB(T). Viewed this way we treated the KS algorithm as an algorithm for approximating the MB(T) using only v variables. Unlike IAMB, the IAMB variants, and GS, the KS algorithm does not pro-vide any theoretical guarantees of discovering the MB(T). PC (Spirtes et al. 2000) is a prototypical BN learning algorithm that is sound given the stated set of assumptions. 
PC learns the whole network (and so it does not scale-up well) from which the MB(T) can be easily extracted as the set of parents, children, and spouses of T. The PC algo-rithm starts with a fully connected unoriented Bayesian Network graph and has three phases. In phase I the algo-rithm finds undirected edges by using the criterion that variable A has an edge to variable B iff for all subsets of features there is no subset S, s.t. I(A ; B | S). In phases II and III the algorithm orients the edges by performing global constraint propagation. IAMBnPC could be thought of as improving GS by employing a more effi-cient, but still sound, way (i.e., PC) for the backward phase and a dynamic heuristic for the forward phase, or as improving PC by providing an admissible first phase heu-ristic that focuses PC on a local neighborhood.We now provide a hypothetical trace of IAMB on the BN of Figure 1. We assume the reader’s familiarity with the d-separation criterion (Spirtest et al. 2000) which is a graph-theory criterion that implies probabilistic conditional independence. In the beginning CMB is empty and the variable mostly associated with T given the empty set will enter CMB, e.g. W. In general, we expect the variables closer to T to exhibit the highest univariate association. Conditioned on W, the associations of all variables with T are calculated. It is possible that O will be the next variable to enter, since conditioned on W, O and T are dependent. After both W and O are in CMB, Q is independent of T and cannot enter CMB. Let us suppose that R enters next (a false positive). It is guaranteed that both U and V will also enter the CMB because they are dependent with T given any subset of the variables. In the backwards phase, R will be removed since it is independent of T given both U and V. Notice that in GS, O and Q are the last variables to be considered for inclusion, since they have no association with T given the empty set. This increases the probability that a number of false positives will have already entered CMB before O is considered, making the conditional inde-pendence tests unreliable.Experimental ResultsIn order to measure the performance of each algorithm, we need to know the real MB(T) to use it as a gold standard, which in practice is possible only in simulated data. Experiment Set 1: BNs from real diagnostic systems (Ta-ble 1). We tested the algorithm on the ALARM Network (Beinlich et al. 1989), which captures the structure of a medical domain having 37 variables, and on Hailfinder, a BN used for modeling and predicting the weather, pub-lished in (Abramson et al. 1996), with 56 variables. We randomly sampled 10000 training instances from the joint probability that each network specifies. The task was to find the Markov Blanket of certain target variables. For ALARM the target variables were all variables, on which we report the average performance, while for Hailfinder there were four natural target nodes corresponding to weather forecasting, on which we report the performance individually. The performance measure used is the area under the ROC curve (Metz 1978). The ROCs were cre-ated by examining various different thresholds for the sta-tistical tests of independence. For the PC algorithm thresholds correspond to the significance levels of the G2 statistical test employed by the algorithm, whereas for the GS and the IAMB variants we consider I(X ; T | CMB) iff Mutual-Info(X ; T | CMB) < threshold. 
For the KS we tried all the possible values of the parameter v of the variables to retain to create a very detailed ROC curve, and all values k that have been suggested in the original paper.three random BNs with 50, 200, and 1000 nodes each, such that the number of the parents of each node was ran-domly and uniformly chosen between 0 and 10 and the free parameters in the conditional probability tables were drawn uniformly from (0, 1). The Markov Blanket of an arbitrarily chosen target variable T contained 6 variables (three parents, two children, and one spouse) and was held fixed across the networks so that consistent comparisons could be achieved among different-sized networks. Each network adds more variables to the previous one without altering the MB(T). We ran the algorithms for sample sizes in the set {1000, 10000, 20000} and report the average areas under the ROCs curve in Table 2. We remind the reader that the released version of the PC algorithm does not accept more than 100 variables, hence the missing cells in the figure. We see that the IAMB variants scale very well to large number of variables both in performance, and in computation time (IAMB variants took less than 20 minutes on the largest datasets, except interIAMBnPC which took 12 hours; the other methods took between one and five hours; all experiments on an Intel Xeon 1.8 and using the same approach as before, but this time the MB(T) contained four spouse nodes (instead of one), one parent, and two children nodes (for a total of seven nodes). Interpretation: The results are shown in Tables 1 and 2. The best performance in its column is shown in bold (PC is excluded since it does not scale-up). We did not test whether the faithfulness assumption holds for any of the above networks, thus the results are indicative of the per-formance of the algorithms on arbitrary BNs. Whenever applicable, we see that PC is one of the best algorithms. Experiment Set 1 (Figure 3(a)): IAMBnPC and interI-AMBnPC were the best algorithms on average. All IAMB variants are better than GS, implying that a dynamic heu-ristic for selecting variables is important. KS for k=0 is equivalent to ordering the variables according to univariate association with the target, a standard and common tech-nique used in statistical analysis. This algorithm performs well in this set; however, the behavior of KS is quite un-stable and non-monotonic for different values of k which is consistent with the results in the original paper (Koller, Sahami 1996). Experiment Set 2 (Figure 3(b)): We expectthe simple static heuristic of GS, and KS for k=0, to per-form well in cases were most members of MB(T) have strong univariate association with T, which is typically the case when there are no spouses of T in MB(T). Indeed, in the first random BN, where there is only one spouse, both of these algorithms perform well (Figure 3(b)). However, in the second random BN there are four spouses of T, which seriously degrades the performance of KS for k=0 and GS (Figure 3(b)). KS for k=1,2 has unpredictable be-havior, but it always performs worse than the IAMB vari-ants. The IAMB variants and the PC algorithm still per-form well even in this trickier case.Other Results: Due to space limitations it is impossible to report all of our experiments. 
Other Results: Due to space limitations it is impossible to report all of our experiments. Other experiments we ran provide evidence for another important hypothesis: IAMB's dynamic heuristic is expensive in terms of data sample, so it is possible that for small sample sizes the simpler heuristics of KS for k=0 and of GS will perform better, especially when there are few spouses in MB(T). Other experiments suggest that the performance of PC degrades significantly for small (fewer than 100 instances) data samples. This is explained by the fact that PC has a bias towards sensitivity: it removes an edge only if it can prove the edge should be removed, and retains it otherwise. Below a certain sample size PC is unable to remove most edges and thus reports unnecessarily large Markov Blankets.

Given the above empirical results, we suggest that practitioners apply the algorithms most appropriate for the available sample and variable sizes: the PC algorithm for more than 300 training instances and fewer than 100 variables, GS and KS for k=0 (i.e., univariate association ordering) for fewer than 300 instances, and the IAMB variants for everything else.

Discussion and Conclusion

Discussion: The MB(T) discovery algorithms can also be used for causal discovery. If there exists at least one faithful BN that captures the data-generating process, then the MB(T) of any such BN has to contain the direct causes of T. It thus significantly prunes the search space for an experimentalist who wants to identify such direct causes. In fact, other algorithms can post-process MB(T) to direct the edges and identify the direct causes of T without any experiments, e.g., the PC or the FCI algorithm; the first assumes causal sufficiency while the second does not (Spirtes et al. 2000). In (Spirtes et al. 2000) specific conditions are discussed under which faithfulness is violated. These situations are relatively rare in the sample limit, as supported by the work of (Meek 1995). Most BN learning or MB(T) identification algorithms explicitly or implicitly assume faithfulness, e.g., PC and GS, but also (implicitly) BN score-and-search under most scoring metrics (see (Heckerman et al. 1997)).

Conclusions: In this paper we took a first step towards developing and comparing Markov Blanket identification algorithms. The concept of the Markov Blanket has strong connections with principled and optimal variable selection (Tsamardinos and Aliferis 2003), has been used as part of Bayesian network learning (Margaritis and Thrun 1999), and can be used for causal discovery. We presented novel algorithms that are sound under broad conditions, scale up to thousands of variables, and compare favorably with the other state-of-the-art algorithms we tried. We followed a principled approach that allowed us to interpret the empirical results and to identify the appropriate usage cases for each algorithm. There is much room for improvement in these algorithms, and hopefully the present work will inspire other researchers to address this important class of algorithms.

References

Abramson, B., Brown, J., Edwards, W., Murphy, A., and Winkler, R. L. "Hailfinder: A Bayesian system for forecasting severe weather". International Journal of Forecasting, 12 (1996), 57-71.

Beinlich, I. A., et al. "The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks". In Proc. of the Second European Conference on Artificial Intelligence in Medicine, London, England, 1989.
Cheng, J., Bell, D., and Liu, W. "Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory". Technical Report, 1998. URL: http://www.cs.ualberta.ca/~jcheng/Doc/report98.pdf

Heckerman, D., Meek, C., and Cooper, G. "A Bayesian Approach to Causal Discovery". Technical Report MSR-TR-97-05, Microsoft Research, 1997.

Koller, D., and Sahami, M. "Toward Optimal Feature Selection". In Proc. of the Thirteenth International Conference on Machine Learning, 1996.

Margaritis, D., and Thrun, S. "Bayesian Network Induction via Local Neighborhoods". Technical Report CMU-CS-99-134, Carnegie Mellon University, August 1999.

Meek, C. "Strong Completeness and Faithfulness in Bayesian Networks". In Proc. of Uncertainty in Artificial Intelligence (UAI), 1995, 411-418.

Metz, C. E. "Basic principles of ROC analysis". Seminars in Nuclear Medicine, 8 (1978), 283-298.

Neapolitan, R. E. Probabilistic Reasoning in Expert Systems: Theory and Algorithms. John Wiley and Sons, 1990.

Spirtes, P., Glymour, C., and Scheines, R. "Constructing Bayesian network models of gene expression networks from microarray data". In Proc. of the Atlantic Symposium on Computational Biology, Genome Information Systems & Technology, 2000.

Tsamardinos, I., and Aliferis, C. F. "Towards Principled Feature Selection: Relevancy, Filters, and Wrappers". In Proc. of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
