Scalable Nearest Neighbor Algorithms for High Dimensional Data
Marius Muja, Member, IEEE, and David G. Lowe, Member, IEEE

Abstract—For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.

Index Terms—Nearest neighbor search, big data, approximate search, algorithm configuration

1 INTRODUCTION

The most computationally expensive part of many computer vision algorithms consists of searching for the most similar matches to high-dimensional vectors, also referred to as nearest neighbor matching. Having an efficient algorithm for performing fast nearest neighbor matching in large data sets can bring speed improvements of several orders of magnitude to many applications.
Examples of such problems include finding the best matches for local image features in large data sets [1], [2], clustering local features into visual words using the k-means or similar algorithms [3], global image feature matching for scene recognition [4], human pose estimation [5], matching deformable shapes for object recognition [6] or performing normalized cross-correlation (NCC) to compare image patches in large data sets [7]. The nearest neighbor search problem is also of major importance in many other applications, including machine learning, document retrieval, data compression, bio-informatics, and data analysis.

It has been shown that using large training sets is key to obtaining good real-life performance from many computer vision methods [2], [4], [7]. Today the Internet is a vast resource for such training data [8], but for large data sets the performance of the algorithms employed quickly becomes a key issue.

When working with high dimensional features, as with most of those encountered in computer vision applications (image patches, local descriptors, global image descriptors), there is often no known nearest-neighbor search algorithm that is exact and has acceptable performance. To obtain a speed improvement, many practical applications are forced to settle for an approximate search, in which not all the neighbors returned are exact; some are approximate but typically still close to the exact neighbors. In practice it is common for approximate nearest neighbor search algorithms to provide more than 95 percent of the correct neighbors and still be two or more orders of magnitude faster than linear search. In many cases the nearest neighbor search is just a part of a larger application containing other approximations, and there is very little loss in performance from using approximate rather than exact neighbors.

In this paper we evaluate the most promising nearest-neighbor search algorithms in the literature, propose new algorithms and improvements to existing ones, present a method for performing automatic algorithm selection and parameter optimization, and discuss the problem of scaling to very large data sets using compute clusters. We have released all this work as an open source library named fast library for approximate nearest neighbors (FLANN).

1.1 Definitions and Notation

In this paper we are concerned with the problem of efficient nearest neighbor search in metric spaces. The nearest neighbor search in a metric space can be defined as follows: given a set of points P = {p_1, p_2, ..., p_n} in a metric space M and a query point q ∈ M, find the element NN(q, P) ∈ P that is the closest to q with respect to a metric distance d : M × M → R:

    NN(q, P) = argmin_{x ∈ P} d(q, x).
The nearest neighbor problem consists of finding a method to pre-process the set P such that the operation NN(q, P) can be performed efficiently.

We are often interested in finding not just the first closest neighbor, but several closest neighbors. In this case, the search can be performed in several ways, depending on the number of neighbors returned and their distance to the query point: K-nearest neighbor (KNN) search, where the goal is to find the closest K points from the query point, and radius nearest neighbor search (RNN), where the goal is to find all the points located closer than some distance R from the query point.

We define the K-nearest neighbor search more formally in the following manner:

    KNN(q, P, K) = A,

where A is a set that satisfies the following conditions:

    |A| = K,  A ⊆ P,
    ∀x ∈ A, y ∈ P − A: d(q, x) ≤ d(q, y).

The K-nearest neighbor search has the property that it will always return exactly K neighbors (if there are at least K points in P).

The radius nearest neighbor search can be defined as follows:

    RNN(q, P, R) = {p ∈ P | d(q, p) < R}.

Depending on how the value R is chosen, the radius search can return any number of points between zero and the whole data set. In practice, passing a large value R to radius search and having the search return a large number of points is often very inefficient. Radius K-nearest neighbor (RKNN) search is a combination of K-nearest neighbor search and radius search, where a limit can be placed on the number of points that the radius search should return:

    RKNN(q, P, K, R) = A,

such that

    |A| ≤ K,  A ⊆ P,
    ∀x ∈ A, y ∈ P − A: d(q, x) < R and d(q, x) ≤ d(q, y).
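As a concrete baseline for these definitions, the sketch below implements exact KNN(q, P, K) by linear scan; this brute-force search is what approximate algorithms are measured against. It is our own minimal illustration (not FLANN code), and it assumes points are real-valued vectors compared with the Euclidean distance.

#include <algorithm>
#include <cstddef>
#include <vector>

// A point in d-dimensional Euclidean space (assumed representation).
using Point = std::vector<double>;

// Squared Euclidean distance; monotone in the true distance, so it
// preserves the argmin in the NN(q, P) definition above.
double dist2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// Exact KNN(q, P, K) by linear scan: O(n*d) distance work per comparison,
// plus an O(n log n) sort. Returns indices into P of the K closest points.
// (Distances are recomputed inside the comparator; fine for a sketch.)
std::vector<std::size_t> knn_linear(const std::vector<Point>& P,
                                    const Point& q, std::size_t K) {
    std::vector<std::size_t> idx(P.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        return dist2(P[a], q) < dist2(P[b], q);
    });
    idx.resize(std::min(K, idx.size()));
    return idx;
}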
2 BACKGROUND

Nearest-neighbor search is a fundamental part of many computer vision algorithms and of significant importance in many other fields, so it has been widely studied. This section presents a review of previous work in this area.

2.1 Nearest Neighbor Matching Algorithms

We review the most widely used nearest neighbor techniques, classified in three categories: partitioning trees, hashing techniques and neighboring graph techniques.

2.1.1 Partitioning Trees

The kd-tree [9], [10] is one of the best known nearest neighbor algorithms. While very effective in low dimensionality spaces, its performance quickly decreases for high dimensional data. Arya et al. [11] propose a variation of the k-d tree to be used for approximate search by considering (1 + ε)-approximate nearest neighbors: points p for which dist(p, q) ≤ (1 + ε) dist(p*, q), where p* is the true nearest neighbor. The authors also propose the use of a priority queue to speed up the search. This method of approximating the nearest neighbor search is also referred to as "error bound" approximate search.

Another way of approximating the nearest neighbor search is by limiting the time spent during the search, or "time bound" approximate search. This method is proposed in [12], where the k-d tree search is stopped early after examining a fixed number of leaf nodes. In practice the time-constrained approximation criterion has been found to give better results than the error-constrained approximate search.

Multiple randomized k-d trees are proposed in [13] as a means to speed up approximate nearest-neighbor search. In [14] we perform a wide range of comparisons showing that the multiple randomized trees are one of the most effective methods for matching high dimensional data.

Variations of the k-d tree using non-axis-aligned partitioning hyperplanes have been proposed: the PCA-tree [15], the RP-tree [16], and the trinary projection tree [17]. We have not found such algorithms to be more efficient than a randomized k-d tree decomposition, as the overhead of evaluating multiple dimensions during search outweighed the benefit of the better space decomposition.

Another class of partitioning trees decompose the space using various clustering algorithms instead of using hyperplanes as in the case of the k-d tree and its variants. Examples of such decompositions include the hierarchical k-means tree [18], the GNAT [19], the anchors hierarchy [20], the vp-tree [21], the cover tree [22] and the spill-tree [23]. Nister and Stewenius [24] propose the vocabulary tree, which is searched by accessing a single leaf of a hierarchical k-means tree. Leibe et al. [25] propose a ball-tree data structure constructed using a mixed partitional-agglomerative clustering algorithm. Schindler et al. [26] propose a new way of searching the hierarchical k-means tree. Philbin et al. [2] conducted experiments showing that an approximate flat vocabulary outperforms a vocabulary tree in a recognition task. In this paper we describe a modified k-means tree algorithm that we have found to give the best results for some data sets, while randomized k-d trees are best for others.

Jégou et al. [27] propose the product quantization approach in which they decompose the space into low dimensional subspaces and represent the data set points by compact codes computed as quantization indices in these subspaces. The compact codes are efficiently compared to the query points using an asymmetric approximate distance. Babenko and Lempitsky [28] propose the inverted multi-index, obtained by replacing the standard quantization in an inverted index with product quantization, obtaining a denser subdivision of the search space. Both these methods are shown to be efficient at searching large data sets and they should be considered for further evaluation and possible incorporation into FLANN.

2.1.2 Hashing Based Nearest Neighbor Techniques

Perhaps the best known hashing based nearest neighbor technique is locality sensitive hashing (LSH) [29], which uses a large number of hash functions with the property that the hashes of elements that are close to each other are also likely to be close. Variants of LSH such as multi-probe LSH [30] improve the high storage costs by reducing the number of hash tables, and LSH Forest [31] adapts better to the data without requiring hand tuning of parameters.

The performance of hashing methods is highly dependent on the quality of the hashing functions they use, and a large body of research has been targeted at improving hashing methods by using data-dependent hashing functions computed using various learning techniques: parameter sensitive hashing [5], spectral hashing [32], randomized LSH hashing from learned metrics [33], kernelized LSH [34], learnt binary embeddings [35], shift-invariant kernel hashing [36], semi-supervised hashing [37], optimized kernel hashing [38] and complementary hashing [39].

The different LSH algorithms provide theoretical guarantees on the search quality and have been successfully used in a number of projects; however, our experiments reported in Section 4 show that in practice they are usually outperformed by algorithms using space partitioning structures such
as the randomized k-d trees and the priority search k-means tree.

2.1.3 Nearest Neighbor Graph Techniques

Nearest neighbor graph methods build a graph structure in which points are vertices and edges connect each point to its nearest neighbors. The query points are used to explore this graph using various strategies in order to get closer to their nearest neighbors. In [40] the authors select a few well separated elements in the graph as "seeds" and start the graph exploration from those seeds in a best-first fashion. Similarly, the authors of [41] perform a best-first exploration of the k-NN graph, but use a hill-climbing strategy and pick the starting points at random. They present recent experiments that compare favourably to randomized KD-trees, so the proposed algorithm should be considered for future evaluation and possible incorporation into FLANN.

The nearest neighbor graph methods suffer from a quite expensive construction of the k-NN graph structure. Wang et al. [42] improve the construction cost by building an approximate nearest neighbor graph.

2.2 Automatic Configuration of NN Algorithms

There have been hundreds of papers published on nearest neighbor search algorithms, but there has been little systematic comparison to guide the choice among algorithms and set their internal parameters. In practice, and in most of the nearest neighbor literature, setting the algorithm parameters is a manual process carried out using various heuristics that rarely makes use of more systematic approaches.

Bawa et al. [31] show that the performance of the standard LSH algorithm is critically dependent on the length of the hashing key and propose the LSH Forest, a self-tuning algorithm that eliminates this data dependent parameter. In a previous paper [14] we have proposed an automatic nearest neighbor algorithm configuration method by combining grid search with a finer grained Nelder-Mead downhill simplex optimization process [43]. There has been extensive research on algorithm configuration methods [44], [45]; however, we are not aware of papers that apply such techniques to finding optimum parameters for nearest neighbor algorithms. Bergstra and Bengio [46] show that, except for small parameter spaces, random search can be a more efficient strategy for parameter optimization than grid search.

3 FAST APPROXIMATE NN MATCHING

Exact search is too costly for many applications, so this has generated interest in approximate nearest-neighbor search algorithms which return non-optimal neighbors in some cases, but can be orders of magnitude faster than exact search.

After evaluating many different algorithms for approximate nearest neighbor search on data sets with a wide range of dimensionality [14], [47], we have found that one of two algorithms gave the best performance: the priority search k-means tree or the multiple randomized k-d trees. These algorithms are described in the remainder of this section.

3.1 The Randomized k-d Tree Algorithm

The randomized k-d tree algorithm [13] is an approximate nearest neighbor search algorithm that builds multiple randomized k-d trees which are searched in parallel. The trees are built in a similar manner to the classic k-d tree [9], [10], with the difference that where the classic kd-tree algorithm splits data on the dimension with the highest variance, for the randomized k-d trees the split dimension is chosen randomly from the top N_D dimensions with the highest variance. We used the fixed value N_D = 5 in our implementation, as this performs well across all our data sets and does not benefit significantly from further tuning.
When searching the randomized k-d forest, a single priority queue is maintained across all the randomized trees. The priority queue is ordered by increasing distance to the decision boundary of each branch in the queue, so the search will explore first the closest leaves from all the trees. Once a data point has been examined (compared to the query point) inside a tree, it is marked in order to not be re-examined in another tree. The degree of approximation is determined by the maximum number of leaves to be visited (across all trees), returning the best nearest neighbor candidates found up to that point.

Fig. 1 shows the value of searching in many randomized kd-trees at the same time. It can be seen that the performance improves with the number of randomized trees up to a certain point (about 20 random trees in this case) and that increasing the number of random trees further leads to static or decreasing performance. The memory overhead of using multiple random trees increases linearly with the number of trees, so at some point the speedup may not justify the additional memory used.

Fig. 2 gives an intuition behind why exploring multiple randomized kd-trees improves the search performance. When the query point is close to one of the splitting hyperplanes, its nearest neighbor lies with almost equal probability on either side of the hyperplane, and if it lies on the opposite side of the splitting hyperplane, further exploration of the tree is required before the cell containing it will be explored. Using multiple random decompositions increases the probability that in one of them the query point and its nearest neighbor will be in the same cell.

3.2 The Priority Search K-Means Tree Algorithm

We have found the randomized k-d forest to be very effective in many situations; however on other data sets a different algorithm, the priority search k-means tree, has been more effective at finding approximate nearest neighbors, especially when a high precision is required. The priority search k-means tree tries to better exploit the natural structure existing in the data, by clustering the data points using the full distance across all dimensions, in contrast to the (randomized) k-d tree algorithm which only partitions the data based on one dimension at a time.

Nearest-neighbor algorithms that use hierarchical partitioning schemes based on clustering the data points have been previously proposed in the literature [18], [19], [24]. These algorithms differ in the way they construct the partitioning tree (whether using k-means, agglomerative or some other form of clustering) and especially in the strategies used for exploring the hierarchical tree. We have developed an improved version that explores the k-means tree using a best-bin-first strategy, by analogy to what has been found to significantly improve the performance of the approximate kd-tree searches.

3.2.1 Algorithm Description

The priority search k-means tree is constructed by partitioning the data points at each level into K distinct regions using k-means clustering, and then applying the same method recursively to the points in each region. The recursion is stopped when the number of points in a region is smaller than K (see Algorithm 1).

[Fig. 2 caption: Example of randomized kd-trees. The nearest neighbor is across a decision boundary from the query point in the first decomposition, however it is in the same cell in the second decomposition.]

[Fig. 1 caption: Speedup obtained by using multiple randomized kd-trees (100K SIFT features data set).]
The tree is searched by initially traversing the tree from the root to the closest leaf, following at each inner node the branch with the closest cluster centre to the query point, and adding all unexplored branches along the path to a priority queue (see Algorithm 2). The priority queue is sorted in increasing distance from the query point to the boundary of the branch being added to the queue. After the initial tree traversal, the algorithm resumes traversing the tree, always starting with the top branch in the queue.

The number of clusters K to use when partitioning the data at each node is a parameter of the algorithm, called the branching factor, and choosing K is important for obtaining good search performance. In Section 3.4 we propose an algorithm for finding the optimum algorithm parameters, including the optimum branching factor. Fig. 3 contains a visualisation of several hierarchical k-means decompositions with different branching factors.

Another parameter of the priority search k-means tree is I_max, the maximum number of iterations to perform in the k-means clustering loop. Performing fewer iterations can substantially reduce the tree build time and results in a slightly less than optimal clustering (if we consider the sum of squared errors from the points to the cluster centres as the measure of optimality). However, we have observed that even when using a small number of iterations, the nearest neighbor search performance is similar to that of the tree constructed by running the clustering until convergence, as illustrated by Fig. 4. It can be seen that using as few as seven iterations we get more than 90 percent of the nearest-neighbor performance of the tree constructed using full convergence, but requiring less than 10 percent of the build time.

The algorithm to use when picking the initial centres in the k-means clustering can be controlled by the C_alg parameter. In our experiments (and in the FLANN library) we have used the following algorithms: random selection, Gonzales' algorithm (selecting the centres to be spaced apart from each other) and the KMeans++ algorithm [48]. We have found that the initial cluster selection made only a small difference in terms of the overall search efficiency in most cases and that the random initial cluster selection is usually a good choice for the priority search k-means tree.

[Fig. 3 caption: Projections of priority search k-means trees constructed using different branching factors: 4, 32, 128. The projections are constructed using the same technique as in [26], gray values indicating the ratio between the distances to the nearest and the second-nearest cluster centre at each tree level, so that the darkest values (ratio ≈ 1) fall near the boundaries between k-means regions.]

[Fig. 4 caption: The influence that the number of k-means iterations has on the search speed of the k-means tree. The figure shows the relative search time compared to the case of using full convergence.]

3.2.2 Analysis

When analysing the complexity of the priority search k-means tree, we consider the tree construction time, the search time and the memory requirements for storing the tree.

Construction time complexity. During the construction of the k-means tree, a k-means clustering operation has to be performed for each inner node. Considering a node v with n_v associated data points, and assuming a maximum number of iterations I in the k-means clustering
loop, the complexity of the clustering operation is O(n_v · d · K · I), where d represents the data dimensionality. Taking into account all the inner nodes on a level, we have Σ n_v = n, so the complexity of constructing a level in the tree is O(n d K I). Assuming a balanced tree, the height of the tree will be O(log n / log K), resulting in a total tree construction cost of O(n d K I (log n / log K)).

Search time complexity. In case of the time constrained approximate nearest neighbor search, the algorithm stops after examining L data points. Considering a complete priority search k-means tree with branching factor K, the number of top down tree traversals required is L/K (each leaf node contains K points in a complete k-means tree). During each top-down traversal, the algorithm needs to check O(log n / log K) inner nodes and one leaf node. For each internal node, the algorithm has to find the branch closest to the query point, so it needs to compute the distances to all the cluster centres of the child nodes, an O(K d) operation. The unexplored branches are added to a priority queue, which can be accomplished in O(K) amortized cost when using binomial heaps. For the leaf node the distance between the query and all the points in the leaf needs to be computed, which takes O(K d) time. In summary the overall search cost is O(L d (log n / log K)).

3.3 The Hierarchical Clustering Tree

Matching binary features is of increasing interest in the computer vision community with many binary visual descriptors being recently proposed: BRIEF [49], ORB [50], BRISK [51]. Many algorithms suitable for matching vector based features, such as the randomized kd-tree and priority search k-means tree, are either not efficient or not suitable for matching binary features (for example, the priority search k-means tree requires the points to be in a vector space where their dimensions can be independently averaged).

Binary descriptors are typically compared using the Hamming distance, which for binary data can be computed as a bitwise XOR operation followed by a bit count on the result (very efficient on computers with hardware support for counting the number of bits set in a word, such as the POPCNT instruction for modern x86_64 architectures).

This section briefly presents a new data structure and algorithm, called the hierarchical clustering tree, which we found to be very effective at matching binary features. For a more detailed description of this algorithm the reader is encouraged to consult [47] and [52].

The hierarchical clustering tree performs a decomposition of the search space by recursively clustering the input data set using random data points as the cluster centers of the non-leaf nodes (see Algorithm 3). In contrast to the priority search k-means tree presented above, for which using more than one tree did not bring significant improvements, we have found that building multiple hierarchical clustering trees and searching them in parallel using a common priority queue (the same approach that has been found to work well for randomized kd-trees [13]) resulted in significant improvements in the search performance.

3.4 Automatic Selection of the Optimal Algorithm

Our experiments have revealed that the optimal algorithm for approximate nearest neighbor search is highly dependent on several factors such as the data dimensionality, the size and structure of the data set (whether there is any correlation between the features in the data set) and the desired search precision. Additionally, each algorithm has a set of parameters that have significant influence on the search performance (e.g., number of randomized
trees, branching factor, number of k-means iterations).

As we already mentioned in Section 2.2, the optimum parameters for a nearest neighbor algorithm are typically chosen manually, using various heuristics. In this section we propose a method for automatic selection of the best nearest neighbor algorithm to use for a particular data set and for choosing its optimum parameters.

By considering the nearest neighbor algorithm itself as a parameter of a generic nearest neighbor search routine A, the problem is reduced to determining the parameters θ ∈ Θ that give the best solution, where Θ is also known as the parameter configuration space. This can be formulated as an optimization problem in the parameter configuration space:

    min_{θ ∈ Θ} c(θ)

with c : Θ → R being a cost function indicating how well the search algorithm A, configured with the parameters θ, performs on the given input data.
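To make the automatic configuration idea concrete, here is a short usage sketch based on FLANN's public C++ interface as we understand it (flann::Index with AutotunedIndexParams). It is our illustration rather than code from the paper, so treat the exact parameter values and overloads as assumptions to check against the FLANN manual.

#include <flann/flann.hpp>
#include <vector>

int main() {
    // Toy data: 1000 points in 128 dimensions, stored row-major.
    const int n = 1000, d = 128;
    std::vector<float> data(n * d, 0.0f);   // fill with real descriptors in practice
    std::vector<float> query(d, 0.0f);

    flann::Matrix<float> dataset(data.data(), n, d);
    flann::Matrix<float> q(query.data(), 1, d);

    // Let FLANN pick the algorithm and its parameters automatically:
    // target precision 90%, with small weights trading build time and
    // memory use against search time.
    flann::Index<flann::L2<float>> index(
        dataset, flann::AutotunedIndexParams(0.9f, 0.01f, 0.0f, 0.1f));
    index.buildIndex();

    // Search for the 5 nearest neighbors of the query.
    std::vector<std::vector<size_t>> indices;
    std::vector<std::vector<float>> dists;
    index.knnSearch(q, indices, dists, 5, flann::SearchParams());
    return 0;
}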
k-th Order Exponential Golomb Coding

The k-th order exponential Golomb code (k-order exponential Golomb coding algorithm) is a lossless compression code used mainly to encode sequences of non-negative integers. This article introduces and analyses the algorithm in detail, covering its principle, the encoding process, and an evaluation of its coding efficiency.
I. Algorithm Principle

The core idea of the k-th order exponential Golomb code is to partition the non-negative integers into groups, encode each group, and use flag bits to distinguish the different groups. The algorithm groups integers by the order k and represents each integer in two parts: a prefix and a suffix. Specifically, let k be the order; an integer n is decomposed into a prefix x and a suffix y of k−1 bits, where y ranges from 0 to 2^(k−1) − 1. The prefix x identifies the group the integer belongs to, and the suffix y gives its position within that group.
For a non-negative integer n, the encoding proceeds in the following steps:

Step 1: Compute the prefix x and the suffix y, where x = n / 2^(k−1) (integer division) and y = n mod 2^(k−1).
Step 2: From the number of bits of the prefix x and the value of y, determine the group codeword; its binary representation is g.
Step 3: The binary representation of g together with the binary representation of the prefix x forms the final codeword.

For example, with k = 3, encoding the integer 15:

Step 1: Compute x and y: x = 15 / 2² = 3 and y = 15 mod 2² = 3.
Step 2: Since x = 3 is "011" in binary and y = 3 is "011" in binary, the group codeword is "011011".
Step 3: Concatenating the group codeword "011011" with the binary representation "011" of the prefix x yields the final codeword "011011011".
II. Encoding Process

The encoding process of the k-th order exponential Golomb code can be summarized in the following steps:

Step 1: Choose the order k and the sequence of non-negative integers to encode.
Step 2: Encode each non-negative integer n as above.
Step 3: Concatenate the individual codewords to obtain the final compressed result.
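For comparison, here is a minimal C++ sketch of the standard k-th order Exp-Golomb encoder used in video coding. Note that it differs slightly from the grouping described in this article: the standard code keeps k suffix bits (2^k positions per group) and marks the group with a unary-style zero prefix. The function name is our own.

#include <string>

// Standard k-th order Exp-Golomb encoding of a non-negative integer n:
// 1. q = n / 2^k (group index), r = n mod 2^k (the k suffix bits).
// 2. Encode q+1 with an order-0 code: if q+1 needs b bits, emit b-1
//    zero bits followed by the b bits of q+1.
// 3. Append the k bits of r.
// Example: exp_golomb(15, 3) == "010111".
std::string exp_golomb(unsigned n, unsigned k) {
    unsigned q = n >> k;                 // group index (prefix value)
    unsigned r = n & ((1u << k) - 1);    // position within the group (suffix)
    std::string out;
    unsigned v = q + 1;
    int b = 0;
    for (unsigned t = v; t; t >>= 1) ++b;    // bit length of q+1
    out.append(b - 1, '0');                  // unary-style zero prefix
    for (int i = b - 1; i >= 0; --i)         // binary of q+1, MSB first
        out.push_back(((v >> i) & 1) ? '1' : '0');
    for (int i = (int)k - 1; i >= 0; --i)    // the k suffix bits of r
        out.push_back(((r >> i) & 1) ? '1' : '0');
    return out;
}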
Minimal Marker Sets to Discriminate Among Seedlines

Thomas C. Hudson, Ann E. Stapleton, Amy M. Curley
University of North Carolina at Wilmington, Departments of Computer Science and Biological Sciences
{hudsont, stapletona}@

Abstract

Raising seeds for biological experiments is prone to error; a careful experimenter will test in the lab to verify that plants are of the intended strain. Choosing a minimal set of tests that will discriminate between all known seedlines is an instance of Minimal Test Set, an NP-complete problem. Similar biological problems, such as minimizing the number of haplotype tag SNPs, require complex nondeterministic heuristics to solve in reasonable timeframes over modest datasets. However, selecting the minimal marker set to discriminate among seedlines is less complicated than other problems considered in the literature; we show that a simple heuristic approach works well in practice. Finding all minimal sets of tests to identify 91 Zea mays recombinant inbred lines would require months of CPU time; our heuristic gives a result less than twice the minimal possible size in under five seconds, with similar performance on Arabidopsis thaliana recombinant inbred lines.

1. Introduction

When a plant geneticist wants to conduct an experiment, she needs samples of a plant. Frequently, she will grow the plant herself from seeds kept in her laboratory. However, raising these plants is a labor-intensive, error-prone procedure: seeds can be wrongly sown, fields wrongly marked, natural pollination can occur unintentionally, and collected seeds can be mislabelled in the field or stored incorrectly in the lab. A cautious scientist will perform tests on the plants she takes her experimental samples from to confirm that they are from the intended seedline.

To verify the genotype of the sample, the scientist selects markers, extracts DNA from the sample plants, and amplifies each test region; these regions have known detectable differences in length. In the case of recombinant inbred lines, there are only two possibilities for each marker, conventionally referred to as size "A" and size "B".

Our poster reports on heuristic algorithms developed to help minimize the expense of testing. Finding the optimum set of markers to use is a problem that can take months or years of CPU time; this software produces near-optimum answers in under a minute. The algorithms discussed in our poster have been implemented in Java and are available under an open-source license at /csc/bioinformatics/.

2. Heuristic Solution

A randomized greedy algorithm gives a reasonable first answer for the problem of finding minimal marker sets to distinguish among the seedlines (see the sketch below):

1. Shuffle the markers into a random order.
2. Examine each marker in order:
   (a) Remove it from the set of markers if the resultant set is still able to discriminate among all the seedlines.
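The paper's implementation is in Java; as an illustration only, the following C++ sketch shows the randomized greedy pass under the simplifying assumption that each marker's results are a string with one 'A'/'B' character per seedline (real data also contains inconclusive results, which this sketch ignores).

#include <algorithm>
#include <random>
#include <string>
#include <vector>

// markers[m][s] is the 'A'/'B' result of marker m on seedline s (assumed encoding).
using Markers = std::vector<std::string>;

// True if the chosen markers give every pair of seedlines distinct profiles.
bool discriminates(const Markers& markers, const std::vector<int>& chosen,
                   int numSeedlines) {
    for (int s = 0; s < numSeedlines; ++s)
        for (int t = s + 1; t < numSeedlines; ++t) {
            bool differ = false;
            for (int m : chosen)
                if (markers[m][s] != markers[m][t]) { differ = true; break; }
            if (!differ) return false;
        }
    return true;
}

// Randomized greedy pass: shuffle, then drop each marker whose removal
// still leaves all seedlines distinguishable.
std::vector<int> greedyMarkerSet(const Markers& markers, int numSeedlines,
                                 std::mt19937& rng) {
    std::vector<int> chosen(markers.size());
    for (size_t i = 0; i < chosen.size(); ++i) chosen[i] = (int)i;
    std::shuffle(chosen.begin(), chosen.end(), rng);
    for (size_t i = 0; i < chosen.size();) {
        std::vector<int> without = chosen;
        without.erase(without.begin() + i);
        if (discriminates(markers, without, numSeedlines))
            chosen = without;   // marker was redundant: drop it
        else
            ++i;                // marker is needed: keep it
    }
    return chosen;
}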
In our experiments on Zea mays (134 markers, 91 seedlines) and Arabidopsis thaliana (99 markers, 162 seedlines), this random greedy approach produces answers no more than twice the size of the theoretical optimum; repeated trials show that the results are roughly normally distributed (see our poster). If there are N seedlines and M markers, the theoretical complexity is O(M²N²); the algorithm runs in seconds on those datasets.

These distributions imply that random sampling of the search space could yield reasonable results. The quality of the result of random sampling is very sensitive to the input: some subsets of the full data have many minimal-length answers, making random discovery likely, while others have only one. However, in practice they seem to have a large number of solutions requiring one marker more than minimal, which are reasonably likely to be found by random search. As problems grow larger (more seedlines are developed and more markers are identified), larger and larger samples of the search space will be necessary to have a reasonable likelihood of finding a good solution.

Sorting according to simple metrics does not yield any improvement on random ordering, but provides consistency. Assigning a large negative value to a marker for every seedline about which the marker returns an inconclusive result gives a coarse ordering. If A and B appear with dissimilar frequency, adding a small positive value to the marker's rating for every seedline on which it returns the less-common result gives a finer ordering. Neither of these metrics outperforms random ordering; both typically give a result comparable to the median result returned in one thousand trials of random ordering. However, they do so in a single trial (under five seconds for both Zea mays and Arabidopsis thaliana), which gives us good input for the second stage of our algorithm.

We then filter the data. If the initial greedy heuristic returns a solution S containing K markers, we run the greedy algorithm K additional times. Let S_i be the i-th marker in S; on the i-th additional execution in this filtering pass, we remove S_i from the set of possible markers.

Whether we start with a random or sorted list of markers, running the basic greedy algorithm and then one pass of filtering gives us an answer of the same size as the best answers ever returned by the randomized algorithm. Additional passes of the filtering algorithm do not yield further improvement. For both Zea mays and Arabidopsis thaliana, this is roughly one point five times the length of the smallest possible answer.

In essence, this algorithm performs a heuristic search of the M-dimensional space of possible answers to find a candidate answer, and then exhaustively explores its K-dimensional immediate neighborhood looking for a local minimum. We find that the solution initially reported by the greedy heuristic is rarely a local minimum, but that it consistently has an adjacent local minimum. Over the data currently available, the two-stage approach gives reliably good results in about a minute.

3. Exact Solution

A heuristic solution to the problem is not strictly necessary. The minimal discriminating set of markers can be found by examining all potentially discriminating sets. However, this requires an exhaustive search over a large search space. For N seedlines and M markers, there are (M choose J) subsets of markers of size J. For each subset that we examine, a straightforward determination of whether the subset distinguishes between each pair of seedlines takes O(J·N²) time. The total predicted time is O(M^K · N² · K), where K is the size of the minimal discriminating marker set; K ≥ log₂(N).

To verify this O() characterization, we implemented an exact solver for the minimal discriminating marker set problem and ran it over subsets of the Zea mays data. Graphs of the time performance of the exact solver can be found on the poster. A trial run of the exact solver on a dedicated 2.4 GHz Xeon CPU examined only 1.33% of the possible size-7 solutions for Zea mays in 17.5 CPU hours; if there is a size-7 answer, it would take us 54 days to find.

4. Theory and Context

Finding the minimal discriminating set of markers is an instance of a well-known NP-complete problem, Minimal Test Set [1]. In Garey and Johnson's formulation, the associated decision problem is:
INSTANCE: A collection C of subsets of a finite set S, a positive integer K ≤ |C|.
QUESTION: Is there a subcollection C′ ⊆ C with |C′| < K such that for each pair of distinct elements u, v ∈ S, there exists some set c ∈ C′ that contains exactly one of u and v?

[3] is a comprehensive survey of approaches to the Minimal Test Set problem.

This problem looks similar to another question intensely studied in bioinformatics, Haplotype Tag Selection. Although the decision problems are only subtly different, this difference significantly increases the complexity of algorithms that solve Haplotype Tag Selection. Approaches like ours to Minimal Test Set are not sufficient to solve Haplotype Tag Selection. [2] is a survey of current work on Haplotype Tag Selection. The authors fit 21 published Haplotype Tag Selection algorithms into a three-stage framework: evaluating each SNP for how well it describes other nearby SNPs, evaluating a candidate set of SNPs for how well they classify the entire set of data, and constructing a final minimal set of SNPs. Our algorithm performs three analogous activities, albeit in a different order: filtering to minimize the set of results, sorting metrics, and a greedy minimization phase.

References

[1] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, San Francisco, 1979.
[2] B. V. Halldorsson, S. Istrail, and F. M. De La Vega. Optimal selection of SNP markers for disease association studies. Human Heredity, 58:190-202, 2004.
[3] B. Moret and H. Shapiro. On minimizing a set of tests. SIAM Journal of Scientific Computing, 6(4):983-1003, 1985.
Natural Language Processing Algorithms: k-means

The k-means algorithm is a commonly used clustering algorithm that partitions a data set into k distinct clusters. This article introduces its basic principle, its steps, and its applications.
I. Algorithm Principle

The principle of k-means is simple: the data set is iteratively partitioned into k clusters so that samples within a cluster are as similar as possible, while samples in different clusters are as dissimilar as possible. The specific steps are as follows:

1. Initialize k center points, chosen either at random or based on experience.
2. Assign each sample point in the data set to the cluster of its nearest center.
3. Update each center's position based on the sample points in its cluster.
4. Repeat steps 2 and 3 until the centers stop moving or a maximum number of iterations is reached.
II. Algorithm Steps

The k-means algorithm can be described in the following stages (a code sketch follows the list):

1. Initialization: randomly select k center points as initial values.
2. Assignment: assign each sample point in the data set to the cluster whose center is closest to it.
3. Update: recompute each center's position from the sample points in its cluster.
4. Termination: stop when the centers no longer move or the maximum number of iterations is reached.
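The following C++ sketch illustrates these four stages. It is a minimal illustration: it initializes the centers from the first k points, uses squared Euclidean distance, and caps the number of iterations, all choices the text above leaves open.

#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double dist2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Returns the cluster assignment of each point after at most maxIter rounds.
// Assumes data.size() >= k.
std::vector<int> kmeans(const std::vector<Point>& data, int k, int maxIter) {
    std::vector<Point> centers(data.begin(), data.begin() + k);  // stage 1: init
    std::vector<int> assign(data.size(), 0);
    for (int iter = 0; iter < maxIter; ++iter) {
        bool changed = false;
        // Stage 2: assign every point to its nearest center.
        for (std::size_t i = 0; i < data.size(); ++i) {
            int best = 0;
            for (int c = 1; c < k; ++c)
                if (dist2(data[i], centers[c]) < dist2(data[i], centers[best]))
                    best = c;
            if (assign[i] != best) { assign[i] = best; changed = true; }
        }
        if (!changed) break;  // stage 4: converged, centers would not move
        // Stage 3: move each center to the mean of its cluster.
        std::vector<Point> sum(k, Point(data[0].size(), 0.0));
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            for (std::size_t d = 0; d < data[i].size(); ++d)
                sum[assign[i]][d] += data[i][d];
            ++count[assign[i]];
        }
        for (int c = 0; c < k; ++c)
            if (count[c] > 0)
                for (std::size_t d = 0; d < sum[c].size(); ++d)
                    centers[c][d] = sum[c][d] / count[c];
    }
    return assign;
}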
III. Applications

k-means is used in many practical scenarios, for example:

1. Image segmentation: dividing an image into regions with similar characteristics, e.g., by clustering an image's pixels into color clusters.
2. Text clustering: partitioning a large collection of text documents into clusters, helping users understand and analyse the data.
3. Recommender systems: grouping users into clusters based on their history and preferences, so that more personalized content can be recommended.
4. Unsupervised learning: k-means is an unsupervised algorithm and can cluster data without labeled examples.
IV. Summary

k-means is a simple and effective clustering algorithm that iteratively partitions a data set into k clusters, maximizing within-cluster similarity and minimizing between-cluster similarity. It is widely applied in image segmentation, text clustering, recommender systems and unsupervised learning. Understanding its principle and steps lets us apply it more effectively to practical problems.
A Worked Example of the k-Nearest Neighbor Algorithm (Zhihu)

The k-nearest neighbor (KNN) algorithm is widely used to solve classification and regression problems. It works by classifying or predicting a new sample based on the K nearest samples in the training data. In this article we walk through a KNN example.

Consider a data set recording the sweetness and acidity of several kinds of fruit. We want to predict the type of a fruit from these attributes. We can analyse this data set with KNN in order to classify a new fruit as an apple or an orange.
Step one: split the data set into a training set and a test set. We can use 70% of the data for training and 30% for testing, which can be done by random sampling from the data set.

Step two: choose the value of K. K is the number of nearest neighbors considered when analysing a new sample. K is usually set to an odd number to avoid tied votes. In this example we set K to 3.
Step three: compute the distance between each training sample and each test sample. For this example we use the Euclidean distance as the distance metric, given by:

    d(x, y) = √(Σᵢ (xᵢ − yᵢ)²)

where x and y are the attribute vectors of two samples and xᵢ, yᵢ are the individual attribute values.

Step four: select the K training samples nearest to the test point in order to determine its class. With the K chosen in this example, we select the 3 nearest data points.

Step five: count how many times each class occurs among these 3 nearest neighbors. In this example, if 2 of the data points are apples and 1 is an orange, we classify the test fruit as an apple.
Finally, we can use the test set to evaluate the performance of our algorithm, using metrics such as accuracy, precision and recall. (A code sketch of the procedure follows.)
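Here is a short C++ sketch of steps two through five for this fruit example; the training values and labels below are invented purely for illustration.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Fruit {
    double sweetness, acidity;
    std::string label;   // "apple" or "orange"
};

// Classify a query point by majority vote among its K nearest training samples.
// (train is taken by value so partial_sort does not disturb the caller's data.)
std::string knnClassify(std::vector<Fruit> train, double sweetness,
                        double acidity, int K) {
    auto d2 = [&](const Fruit& f) {
        double ds = f.sweetness - sweetness, da = f.acidity - acidity;
        return ds * ds + da * da;    // squared Euclidean distance
    };
    // Steps three and four: bring the K closest samples to the front.
    std::partial_sort(train.begin(), train.begin() + K, train.end(),
                      [&](const Fruit& a, const Fruit& b) { return d2(a) < d2(b); });
    // Step five: majority vote among the K neighbors (two classes here).
    int apples = 0;
    for (int i = 0; i < K; ++i)
        if (train[i].label == "apple") ++apples;
    return apples * 2 > K ? "apple" : "orange";
}

int main() {
    std::vector<Fruit> train = {       // invented training data
        {8.0, 2.0, "apple"}, {7.5, 2.5, "apple"}, {6.0, 5.0, "orange"},
        {5.5, 6.0, "orange"}, {8.5, 1.5, "apple"}, {5.0, 5.5, "orange"}};
    std::cout << knnClassify(train, 7.8, 2.2, 3) << "\n";   // prints: apple
}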
Through this example we can clearly see how the k-nearest neighbor algorithm works and where it applies. The algorithm is easy to implement and performs well on many classification and regression problems. For anyone studying machine learning who wants a firmer grasp of how this algorithm works and is applied, this example is a good one.
Randomized Kinodynamic Planning
There is a strong need for a simple, efficient planning technique that determines control inputs to drive a robot from an initial configuration and velocity to a goal configuration and velocity while obeying physically-based dynamic constraints and avoiding obstacles in the robot's environment. Although many interesting approaches exist to specific kinodynamic problems, they fall short of being able to solve many complicated, high degree-of-freedom problems. Randomized techniques have led to efficient, incomplete planners for basic path planning (holonomic and purely kinematic); however, there appears to be no equivalent technique for the broader kinodynamic planning problem (or even nonholonomic planning in the configuration space). We try to account for some of the reasons for this, and argue the need for a sim…
The Change-Making Problem
If you chose to do this problem and obtained a correct answer, please submit your source program to the Ftp://202.117.35.169/C++/zhudj/Algorithm directory before the end of the lab session; name the file with your student ID followed by _M, regardless of class.
Suppose there are five kinds of coins: 5 jiao (50 fen), 1 jiao (10 fen), 5 fen, 2 fen and 1 fen. When giving a customer change in coins, one usually chooses the combination that uses the fewest coins. For example, to give a customer change of 7 jiao 2 fen (72 fen), one would hand over one 5-jiao coin, two 1-jiao coins and one 2-fen coin.
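A minimal C++ sketch of this greedy rule, with the denominations expressed in fen (for this particular coin system the greedy choice is optimal):

#include <iostream>
#include <vector>

int main()
{
    const std::vector<int> coins = {50, 10, 5, 2, 1};  // denominations in fen, largest first
    int amount = 72;                                   // change to give: 7 jiao 2 fen
    for (int c : coins)
    {
        int n = amount / c;   // greedy: take as many of the current (largest) coin as possible
        amount %= c;
        if (n > 0)
            std::cout << n << " coin(s) of " << c << " fen\n";
    }
    return 0;
}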
Quicksort
// Quicksort: sort a[p..q] in place.
void QuickSort(int *a, int p, int q)
{
    if (p < q)
    {
        int r = Partition(a, p, q);   // split around a pivot (see Partition below)
        QuickSort(a, p, r - 1);       // sort the left part
        QuickSort(a, r + 1, q);       // sort the right part
    }
}
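The slide's code assumes a Partition routine that is not shown; a standard Lomuto-style version (our addition, using a[q] as the pivot) would be the following, declared before QuickSort or given a forward declaration:

// Partition a[p..q] around the pivot a[q]; returns the pivot's final index.
int Partition(int *a, int p, int q)
{
    int pivot = a[q];
    int i = p - 1;
    for (int j = p; j < q; j++)
    {
        if (a[j] <= pivot)
        {
            i++;
            int t = a[i]; a[i] = a[j]; a[j] = t;  // swap a[i] and a[j]
        }
    }
    int t = a[i + 1]; a[i + 1] = a[q]; a[q] = t;  // place the pivot
    return i + 1;
}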
• That is:

    π = 4 · B / A

where A is the total number of raindrops and B the number falling inside the quarter circle.
Approximating the Value of π
• When it rains:
  – a raindrop falling anywhere in the unit-square container satisfies: 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  – a raindrop falling inside the quarter-circle region additionally satisfies: x² + y² ≤ 1
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    int a = 0, b = 0;                  // a: total raindrops, b: drops inside the quarter circle
    double pi, x, y;
    srand((unsigned int)time(NULL));   // seed the random number generator
    for (int i = 0; i < 10000; i++)
    {
        a++;
        x = (double)rand() / RAND_MAX; // random point in the unit square
        y = (double)rand() / RAND_MAX;
        if (sqrt(x * x + y * y) <= 1)  // inside the quarter circle of radius 1
            b++;
    }
    pi = 4.0 * (double)b / a;          // the area ratio B/A estimates pi/4
    cout << "pi=" << pi << endl;
    return 0;
}
Research on Visual Inspection Algorithms for Defects in Textured Objects (graduation thesis)

Abstract

In highly competitive industrial automation, machine vision plays a critical role in product quality control, and its application to defect inspection has become increasingly common. Compared with conventional inspection techniques, automated visual inspection systems are more economical, faster, more efficient and safer. Textured objects are ubiquitous in industrial production: substrates used in semiconductor assembly and packaging, light-emitting diodes, printed circuit boards in modern electronic systems, and cloth and fabrics in the textile industry can all be regarded as objects with textured surfaces. This thesis focuses on defect inspection techniques for textured objects, aiming to provide efficient and reliable algorithms for their automated inspection.

Texture is an important feature for describing image content, and texture analysis has been successfully applied to texture segmentation and texture classification. This work proposes a defect inspection algorithm based on texture analysis and reference comparison. The algorithm tolerates image registration errors caused by object deformation and is robust to the influence of texture. It is designed to provide rich and physically meaningful descriptions of the detected defect regions, such as their size, shape, brightness contrast and spatial distribution. Moreover, when a reference image is available, the algorithm can be used to inspect both homogeneously and non-homogeneously textured objects, and it also achieves good results on non-textured objects.

Throughout the inspection process we employ steerable-pyramid texture analysis and reconstruction. Unlike traditional wavelet texture analysis, we introduce a tolerance-control step in the wavelet domain to handle object deformation and texture influence, making the method tolerant of deformation and robust to texture. Finally, steerable-pyramid reconstruction ensures that the physical properties of the defect regions are recovered accurately. In the experimental stage we inspected a series of images of practical value; the results show that the proposed defect inspection algorithm for textured objects is efficient and easy to implement.
Keywords: defect detection, texture, object distortion, steerable pyramid, reconstruction
PTAS Algorithms
Course: Randomized Algorithms, LNMB
Leen Stougie and René Sitters
Week 7, 24-10-2011

Approximation schemes for geometric problems

1 Introduction

This lecture is about the polynomial time approximation scheme (PTAS) for the traveling salesman problem (TSP) in the Euclidean plane developed by Arora [1]. The survey paper by the same author [2] is a good reference (see /arora/pubs/arorageo.ps). The book by Vazirani [5] and the book by Shmoys and Williamson [4] contain a chapter on this algorithm. This result contains many technical details. When you browse through the versions listed above you will see many differences. In the first version (1996) the running time was n^O(1/ε). This was improved to n (log n)^O(1/ε) in [1]. Here, I will only discuss the first, less efficient result. (If you come across the terms 'patching lemma' and '(m, r)-light tours' in the literature then these refer to the improved algorithm that is not discussed here.) All references listed here are excellent, but I think this note can be helpful since in [5] many details are missing (and are left as exercises) while the other references are long and cover the more advanced algorithm completely. This note covers the easiest form completely.

Karp's algorithm

What is the most obvious approach for designing a polynomial time approximation scheme for Euclidean TSP? Cut the problem into smaller pieces, then solve the small problems and combine everything to get one solution. The nice thing about working in the Euclidean plane is that cutting the metric space into small pieces is simple: draw a square B around all input points and then cut the square into smaller squares. This was exactly what Karp did in his TSP algorithm for the Euclidean plane [3].
only want a1+ approximation and not the true optimum we can afford to change the instance a little bit.The solution that wefind is used for the original instance as well:just maintain the same ordering of points.The advantage of shifting points is that it simplifies the dynamic programming: For the DP we divide the plane into small squares and solve subproblems inside the squares.The effect of rounding(i.e.shifting points)is twofold:(i) there are no input points on the boundary of squares and(ii)the smallest squares(the grid cells)contain at most1input point.The density of the grid is important.A denser grid gives a smaller error but leads to a higher running time.(B)Restricting the set of feasible solutions.This is another way to speed up the dynamic programming.Instead offinding the optimum over all feasible solutions wefind the optimum over a smaller set of feasible solutions that satisfy some condition.This restriction enables us to do the dynamic programming in polynomial time.Of course one should prove that the optimum over the restricted set of solutions problem is not far from the true optimum.In fact,it can be far offbut the randomization(D)ensures that the difference is small in expectation.The basic idea of this restriction is as follows.On each grid line we place a set of points that we call portals and add the restriction that the tour can only cross a grid line at a portal.(C)Dynamic programming.By dynamic programming wefind the true optimum over all restricted solutions(B)for the rounded instance(A).First, draw one square that contains all input points.Call this the enclosing box B.Divide the square into four equal sized squares.Then,divide each of the four squares into four equal sized squares.Keep dividing squares into four equal sized squares until each of the smallest squares contains at most one input point.This division defines a tree:The root is the largest square B and its four children are the four squares in which it is divided.In general, each inner vertex of the tree has degree four and that is why it was called the quad tree in[1].The squares that correspond with the vertices of the tree are called dissection squares.The DP starts at the leaves of the putations for the leaves of3the tree(the smallest squares)become easy since they contain at most input point.In general,each dissection square defines a polynomial number of subproblems in the DP.The optimal values of any subproblem is found by looking up a polynomial number of values for subproblems of its four children in the tree.Starting from the leaves wefill the DP table until we reach the root.(D)Randomization It is a simple but powerful step of the algorithm. Instead of taking an arbitrary enclosing box(which forms the root of the tree)we take a box uniformly at random from a set of boxes.In this way, the restriction made in(B)becomes randomized.The effect is that the restriction only gives a small change in the optimal value in expectation. 
The algorithmIn this section we describe the algorithm in detail.The steps of the algorithm:1.Take a random square B that covers all input points.2.Rounding:move each point to the middle of the grid cell it is in.3.Build the quad tree.4.Define portals.5.Build the dynamic program table.6.Fill the table.Step1:First we take a smallest square that covers all input points.The square may not be unique but any choice isfine.Call this the bounding box.Redefine the distances such that the side length of the box becomes 2k−1with k= log2(n/ ) .Let(0,0)be its lower left corner.Now take a,b∈{0,...,2k−1−1}independently and uniformly at random.Let L=2k and let(−a,−b)be the lower left corner of the L×L enclosing box B. Note that the bounding box has side length L/2and that its covered by the enclosing box B for any outcome of a and b.Also note that L=2k≥n/4Step2:Place an L×L grid over B and move each input point to the middle of the grid cell that it is in.If a point is on a grid line then choose one adjacent cell.From now on we only consider this rounded instance.Step3:The box B is the root of the tree and we say it is of level0.It is divided into four equal sized squares of level1.These are its children in the quad tree.Keep dividing until the smallest squares have size1×1.These are the leaves of the tree and they are of level k(by definition of k).We day that the root has highest level and the leaves have the lowest level.The squares of the tree are called dissection squares.We also define levels for all inner grid lines.Denote the horizontal middle grid line and vertical middle grid line as level1lines.Hence,these are the two lines that divide the box B into its four children.In general,the level i grid lines are those lines that divide the level i−1squares of the tree each into four level i squares.So horizontally we have2i−1level i lines and the same number vertically,i=1,...,k. Step4:Each inner grid line gets a number of equidistant portals.The number depends on the level of the line.A level i lines gets m2i−1portals such that these points divide the line into m2i segments of equal length.The portals on the boundary of a dissection square are then given by the portals on the grid lines that bound the square.For a level i square,two of its sides are level i grid lines and the other two sides are of higher level.That means that two sides have exactly m+1portals and the other sides both have at most m/2+1portals.Then,the number of portals per square is at most4m if m≥4.Let m= k/ .(Arora reduced this to m=O(1/ )in[1].) We say that a solution is portal respecting if it crosses grid lines only at portals.Step5:By dynamic programming wefind the smallest portal respecting tour.We shall prove later that the optimal portal respecting tour crosses each portal at most twice.For the DP it is convenient to think of a portal as two portals:each is crossed at most ones.There is an entry in the DP table for every dissection square and for every possible way that an optimal solution may cross the boundary of this square. 
For each entry of the table we will store one value,which is the length of the fraction inside the square of the optimal solution satisfying the boundary conditions.A property of any optimal TSP tour in the plane is that it does not cross itself.That means that the part inside a square is a set of paths that do not cross and visit all input points inside.5Formally,an entry in the dynamic program table is given by a triple (X,Y,Z).Here,X ranges over all dissection squares,Y is any even subset P of the portals of X,and Z is a valid pairing for this subset P.A valid pairing is a partitioning of P in pairs such that we can connect each pair of points by a path inside the squares without crossings.Step6:For each entry(X,Y,Z)we compute the length of the shortest set of paths visiting all input points inside X and that satisfies the boundary conditions Y and Z.Denote this value by F(X,Y,Z).If X is a leaf of the tree then the value can easily be computed:If there is no input point inside X then the optimal solution is to connect each pair in Z by a straight line and if there is one input in X then one path will visit the point and the other paths are straight lines.If X is not a leaf,then the value F(X,Y,Z)can be computed by looking at all values for the children of X in the tree,say X1,X2,X3,X4.Consider any quadruple(X1,Y1,Z1),(X1,Y2,Z2),(X3,Y3,Z3),(X4,Y4,Z4).We say that it is consistent with(X,Y,Z)if(i)portals are consistent and(ii)the pairing is consistent.With(i)we mean that if two squares(for example X and X1or X1and X2)have a boundary in common then on this part they use the same set of portals.With(ii)we mean that the pairing Z should be consistent with the pairing that is defined by the pairings Z1,Z2,Z3,Z4.For example, assume that p1and p2are portals in Y and assume that in Z1portal p1is paired with some portal q and in Z2portal q is paired with p2.Then p1and p2should be paired in Z.Further,we should check that the pairings do not define a subtour,unless Y is the emptyset in which case the four pairings should add up to a single tour,(which is the case when X is the root). The analysisWe need to check that the algorithm can be implemented in polynomial time and that the approximation guarantee is(1+ )OPT.We srat with the latter. Analysis of the approximation ratioTo prove that the algorithm is a PTAS it is enough to prove that the ratio is1+O( ).We need to check that rounding the instance and restricting to portal respecting tours does not change the optimal value to much.Let OPT be the optimal value of the original instance and let OPT be the optimal value of the rounded instance,say I .6Lemma1|OPT−OPT |≤O( )OPT.Proof.The bounding box has side length L/2and was taken as small as possible.Hence,OPT≥L.Further,L≥n/ which implies OPT≥n/ . 
The maximum distance by which a point is moved by the rounding step is √2/2. Therefore, |OPT − OPT′| ≤ n·√2 ≤ √2·ε·OPT.

Let OPT″ be the optimal value over all portal respecting tours for I′, and let E[OPT″] be its expected value over the random choices of a and b. Let Π′ be an optimal tour for I′.

Lemma 2. The tour Π′ has at most √2·OPT′ crossings with grid lines.

Proof. Tour Π′ is formed by n straight line segments. Consider any such segment of length s. Assume it goes from (x1, y1) to (x2, y2). The number of crossings with vertical grid lines is |x1 − x2|, and it has at most |y1 − y2| crossings with horizontal grid lines. The sum is |x1 − x2| + |y1 − y2| ≤ √2·s. Hence the total number of crossings is at most √2·OPT′.

Lemma 3. OPT′ ≤ E[OPT″] ≤ (1 + O(ε))·OPT′.

Proof. Let δ_i be the distance between two neighboring portals on a level i grid line; call this the inter-portal distance. Then δ_i = L/(m·2^i) = 2^k/(m·2^i). Given the optimal tour Π′ for I′, we make it portal respecting by moving each crossing with a grid line to the nearest portal. The length of such a detour is at most twice the distance to the nearest portal. Consider an arbitrary crossing of Π′ with some grid line l. The inter-portal distance depends on the level of the line, which is determined by the randomization step of the algorithm. The probability that this grid line l becomes a level 1 line is exactly 1/(L/2) = 1/2^{k-1}. Somewhat surprisingly, the probability that this grid line l becomes a level 2 line is also 1/2^{k-1}. For any i ≥ 2 this probability is 1/2^{k-i+1}. Hence, for any i ≥ 1, the probability that a grid line becomes a level i line is at most 1/2^{k-i}. The expected length of the detour for moving the crossing with l to the nearest portal is therefore at most

∑_{i=1}^{k} δ_i · 1/2^{k-i} = ∑_{i=1}^{k} 2^k/(m·2^i) · 1/2^{k-i} = k/m = k/⌈k/ε⌉ ≤ ε.

By Lemma 2, there are at most √2·OPT′ crossings with grid lines. So the total expected detour for making Π′ portal respecting is at most ε·√2·OPT′ = O(ε)·OPT′.

By combining Lemma 1 and Lemma 3 we get the desired result.

Corollary 1. |OPT − E[OPT″]| ≤ O(ε)·OPT.

Analysis of the running time

The running time of the algorithm is determined by the dynamic programming. We need to show that the size of the DP table is polynomially bounded and that each value of each entry can be computed in polynomial time. The number of dissection squares is O(4^k) = O(4^{log₂(n/ε)}) = O(n²/ε²). Each square has no more than 4m portals. Remember that we duplicated each portal for the DP. This gives at most 8m portals per square. The number of possible subsets is at most 2^{8m} = (n/ε)^{O(1/ε)} = n^{O(1/ε)}. Given an even subset Y of the portals, the number of valid pairings is at most 2^{|Y|}. This upper bound is obtained as follows. Assume we are given a valid pairing. Now choose one of the corners of X and walk one round around the boundary of X. Every time a portal of Y is encountered, label it 0 if it comes before its paired portal in this walk and label it 1 otherwise. This gives a 0,1-vector of length |Y| ≤ 8m. If we fix the starting point, then every pairing gives a unique 0,1-vector. This follows from the fact that valid pairings have no crossings. Hence, given Y, the number of valid pairings is at most 2^{8m} = n^{O(1/ε)}. Overall, the number of entries in the DP table is upper bounded by O(n²/ε²) × n^{O(1/ε)} × n^{O(1/ε)} = n^{O(1/ε)}.

Now we compute an upper bound on the computation time of F(X, Y, Z). If X is a leaf then we only need to check |Y|/2 solutions: exactly one of the |Y|/2 pairs is connected with the input point (if any) and the other pairs are connected by straight lines. If X is not a leaf, then the value F(X, Y, Z) can be computed by considering all combinations of entries of its four children.
The number of possible combinations is (n^{O(1/ε)})^4 = n^{O(1/ε)}. We have shown that the size of the DP table is n^{O(1/ε)} and that it takes n^{O(1/ε)} time to compute one value. Therefore the total running time of the algorithm is n^{O(1/ε)} · n^{O(1/ε)} = n^{O(1/ε)}.

Remark: Some constants here are different (read: better) from what was shown in class.

Problem for this week: Adjust the algorithm above to the Euclidean Steiner tree problem. In this problem we are given a set of n points in the plane and we need to find the smallest tree that contains all these points as vertices, and possibly some other vertices as well. The extra vertices of the tree are called Steiner points.

References

[1] S. Arora. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. Journal of the ACM, 45:753–782, 1998.
[2] S. Arora. Approximation schemes for NP-hard geometric optimization problems: A survey. Math. Programming, 97, 2003.
[3] R. Karp. Probabilistic analysis of partitioning algorithms for the traveling salesman problem in the Euclidean plane. Math. of Operations Research, 2, 1977.
[4] D. Shmoys and D. Williamson. The design of approximation algorithms. Cambridge University Press, 2011.
[5] V. Vazirani. Approximation Algorithms. Springer, Berlin, 2001.
Evaluating the Effectiveness of Randomized SVD in 3D Modeling

Randomized singular value decomposition (randomized SVD, rSVD) is an efficient method for matrix factorization that has seen wide use in 3D modeling in recent years. This article evaluates the effectiveness of randomized SVD when applied to 3D modeling.

1. Introduction

3D modeling is an important research direction in computer graphics, with broad applications in film, games, virtual reality, and related fields. 3D modeling often requires processing and analyzing large volumes of 3D point cloud data. Because randomized SVD can factor large matrices efficiently, it has broad application prospects in 3D modeling.

2. The Randomized SVD Algorithm

Randomized SVD is a matrix factorization method based on sampling and iteration. It randomly samples the original matrix to construct a low-rank approximation, and then computes a singular value decomposition of that approximation. Compared with the classical SVD algorithm, randomized SVD has lower computational complexity and faster running time.
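To make the sampling idea concrete, the following Python sketch computes a randomized SVD in the style of Halko, Martinsson, and Tropp. The function name and parameters are our own illustrative assumptions, not code from the work evaluated here.

```python
# A minimal randomized-SVD sketch: sample the range, then do a small SVD.
import numpy as np

def randomized_svd(A, rank, n_oversamples=10, n_iter=2):
    m, n = A.shape
    # Sample the range of A with a random Gaussian test matrix.
    Omega = np.random.randn(n, rank + n_oversamples)
    Y = A @ Omega
    # Optional power iterations sharpen the subspace for slowly
    # decaying spectra.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)            # orthonormal basis for the range
    B = Q.T @ A                       # small (rank + p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], S[:rank], Vt[:rank, :]

# Example: extract the principal directions of a (centered) point cloud.
pts = np.random.rand(10000, 3)
U, S, Vt = randomized_svd(pts - pts.mean(axis=0), rank=2)
```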
3. Applications of Randomized SVD in 3D Modeling

One common data representation in 3D modeling is the 3D point cloud. Randomized SVD can reduce the dimensionality of point cloud data and fit it, enabling fast construction of 3D models. By mapping the point cloud data into a low-dimensional space, randomized SVD extracts the principal features of a 3D model and removes noise and redundant information.

4. Experimental Design and Analysis of Results

To evaluate the effectiveness of randomized SVD in 3D modeling, we designed experiments comparing its performance against the classical SVD algorithm. The experiments used 3D point cloud data sets of different sizes, processing each with both randomized SVD and classical SVD. The results show that randomized SVD outperforms classical SVD in both running time and dimensionality reduction quality, enabling faster modeling and analysis of 3D models.

5. Case Studies

Beyond the experimental evaluation, this article analyzes the practical effectiveness of randomized SVD in 3D modeling through application cases. By processing 3D point cloud data from real scenes, we demonstrate the potential and advantages of randomized SVD for building and analyzing 3D models.
Monotone circuits for the majority function

Shlomo Hoory, Avner Magen†, Toniann Pitassi†

Abstract

We present a simple randomized construction of size O(n³) and depth 5.3 log n + O(1) monotone circuits for the majority function on n variables. This result can be viewed as a reduction in the size and a partial derandomization of Valiant's construction of an O(n^{5.3}) monotone formula [15]. On the other hand, compared with the deterministic monotone circuit obtained from the sorting network of Ajtai, Komlós, and Szemerédi [1], our circuit is much simpler and has depth O(log n) with a small constant. The techniques used in our construction incorporate fairly recent results showing that expansion yields performance guarantees for the belief propagation message passing algorithms for decoding low-density parity-check (LDPC) codes [3]. As part of the construction, we obtain optimal-depth linear-size monotone circuits for the promise version of the problem, where the number of 1's in the input is promised to be either less than one third, or greater than two thirds. We also extend these improvements to general threshold functions. At last, we show that the size can be further reduced at the expense of increased depth, and obtain a circuit for the majority of size about n^{1.3}.

Department of Computer Science, University of British Columbia, Vancouver, Canada.
†Department of Computer Science, University of Toronto, Toronto, Canada.

1 Introduction

The complexity of monotone formulas/circuits for the majority function is a fascinating, albeit perplexing, problem in theoretical computer science. Without the monotonicity restriction, majority can be solved with simple linear-size circuits of depth O(log n), where the best known depth (over binary AND, OR, NOT gates) is 4.95 log n + O(1) [12]. There are two fundamental algorithms for the majority function that achieve logarithmic depth. The first is a beautiful construction obtained by Valiant in 1984 [15] that achieves monotone formulas of depth 5.3 log n + O(1) and size O(n^{5.3}). The second algorithm is obtained from the celebrated sorting network constructed in 1983 by Ajtai, Komlós, and Szemerédi [1]. Restricting to binary inputs and taking the middle output bit (median) reduces this network to a monotone circuit for the majority function of depth K log n and size O(n log n). The advantage of the AKS sorting network for majority is that it is a completely uniform construction of small size. On the negative side, its proof is quite complicated and, more importantly, the constant K is huge: the best known constant K is about 5000 [11], and as observed by Paterson, Pippenger, and Zwick [12], this constant is important. Further converting the circuit to a formula yields a monotone formula of size O(n^K), which is roughly n^{5000}.

In order to argue about the quality of a solution to the problem, one should be precise about the different resources and the tradeoffs between them. We care about the depth, the size, the number of random bits for a randomized construction, and the formula vs. circuit question. Finally, the conceptual simplicity of both the algorithm and the correctness proof is also an important goal. Getting the best depth-size tradeoffs is perhaps the most sought after goal around this classical question, while achieving uniformity comes next.
An interesting aspect of the problem is the natural way it splits into two subproblems, the solution to which gives a solution to the original problem. Problem I takes as input an arbitrary n-bit binary vector, and outputs an m-bit vector. If the input vector has a majority of 1's, then the output vector has at least a 2/3 fraction of 1's, and if the input vector does not have a majority of 1's, then the output vector has at most a 1/3 fraction of 1's. Problem II is a promise problem that takes the m-bit output of Problem I as its input. The output of Problem II is a single bit that is 1 if the input has at least a 2/3 fraction of 1's, and is 0 if the input has at most a 1/3 fraction of 1's. Obviously the composition of these two functions solves the original majority problem.

There are several reasons to consider monotone circuits that are constructed via this two-phase approach. First, Valiant's analysis uses this viewpoint. Boppana's later work [2] actually lower bounds each of these subproblems separately (although failing to provide a lower bound for the entire problem). Finally, the second subproblem is of interest in its own right. Problem II can be viewed as an approximate counting problem, and thus plays an important role in many areas of theoretical computer science. Non-monotone circuits for this promise problem have been widely studied.

Results. The contribution of the current work is primarily in obtaining a new and simple construction of monotone circuits for the majority function of depth 5.3 log n and size O(n³), hence significantly reducing the size of Valiant's formula while not compromising at all on the depth parameter. Further, for subproblem II as defined above, we supply a construction of a circuit that is of linear size, and it too does not compromise the depth compared to Valiant's solution. A very appealing feature of this construction is that it is uniform, conditioned on a reasonable assumption about the existence of good enough expander graphs.
To this end we introduce a connection between this circuit complexity question and another domain, namely message passing algorithms. The depth we achieve for the promise problem nearly matches the 1954 lower bound of Moore and Shannon [10]. We further show how to generalize our solution to general threshold functions, and explore another option in the tradeoffs between the different resources we use; specifically, we show that by allowing a depth of roughly twice that of Valiant's construction, we may get a circuit of size about n^{1.3}.

2 Definitions and amplification

For a monotone boolean function H on k inputs, we define its amplification function A_H : [0,1] → [0,1] as A_H(p) = Pr[H(X₁, ..., X_k) = 1], where the X_i are independent boolean random variables that are one with probability p. Valiant [15] considered the function H on four variables which is the OR of two AND gates, H(x₁, x₂, x₃, x₄) = (x₁ ∧ x₂) ∨ (x₃ ∧ x₄). The amplification function of H is A_H(p) = 1 − (1 − p²)², and has a non-trivial fixed point at β = (√5 − 1)/2. Let H_k be the depth-2k binary tree with alternating layers of AND and OR gates, where the root is labeled OR. Valiant's construction uses the fact that A_{H_k} is the composition of A_H with itself k times. Therefore, H_k probabilistically amplifies (β − ∆, β + ∆) to (β − (γ − ε)^k∆, β + (γ − ε)^k∆), as long as (γ − ε)^k∆ ≤ ∆₀ for a suitable constant ∆₀ > 0, where γ = A_H′(β). This implies that for any constant ε > 0 we can take 2k = 3.3 log n + O(1) to probabilistically amplify (β − Ω(1/n), β + Ω(1/n)) to (ε, 1 − ε), where 3.3 is any constant bigger than α = 2/log₂ γ.

Definition 1. Let F be a boolean function F : {0,1}ⁿ → {0,1}^m, and let S ⊆ {0,1}ⁿ be some subset of the inputs. We say that F deterministically amplifies (p_l, p_h) to (q_l, q_h) with respect to S, if for all inputs x ∈ S, the following promise is satisfied (we denote by |x| the number of ones in the vector x):

|F(x)| ≤ q_l·m if |x| ≤ p_l·n,
|F(x)| ≥ q_h·m if |x| ≥ p_h·n.

Note that unlike probabilistic amplification, deterministic amplification has to work for all inputs or scenarios in the given set S. From here on, whenever we simply say "amplification" we mean deterministic amplification.

For an arbitrarily small constant ε > 0, the construction we give is composed of two independent phases that may be of independent interest.

A circuit C1 : {0,1}ⁿ → {0,1}^m for m = O(n) that deterministically amplifies (β − Ω(1/n), β + Ω(1/n)) to (δ, 1 − δ) for an arbitrarily small constant δ > 0. This circuit has size O(n³) and depth (α + ε) log n + O(1).

A circuit C2 : {0,1}^m → {0,1}, such that C2(x) = 0 if |x| ≤ δm and C2(x) = 1 if |x| ≥ (1 − δ)m, where δ > 0 is a sufficiently small constant. This circuit has size O(m) and depth (2 + ε) log m + O(1).

The first circuit C1 is achieved by a simple probabilistic construction that resembles Valiant's construction. We present two constructions for the second circuit, C2. The first construction is probabilistic; the second construction is a simulation of a logarithmic number of rounds of a certain message passing algorithm on a good bipartite expander graph. The correctness is based on the analysis of a similar algorithm used to decode a low density parity check (LDPC) code on the erasure channel [3]. Combining the two circuits together yields a circuit C : {0,1}ⁿ → {0,1} for the βn-th threshold function. The circuit is of size O(n³) and depth (α + 2 + 2ε) log n + O(1).
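As a quick sanity check on the amplification function, the following Python sketch (ours, not the paper's) estimates A_H(p) = 1 − (1 − p²)² by sampling and illustrates that β = (√5 − 1)/2 is a fixed point:

```python
# Monte Carlo check of Valiant's amplification function for
# H = (x1 AND x2) OR (x3 AND x4).
import random

def A_H(p):
    return 1.0 - (1.0 - p * p) ** 2

def sample_H(p):
    x = [random.random() < p for _ in range(4)]
    return (x[0] and x[1]) or (x[2] and x[3])

beta = (5 ** 0.5 - 1) / 2            # non-trivial fixed point of A_H
for p in (beta - 0.05, beta, beta + 0.05):
    est = sum(sample_H(p) for _ in range(100000)) / 100000
    print(f"p={p:.4f}  A_H(p)={A_H(p):.4f}  estimate={est:.4f}")
```

Values below β are pushed toward 0 and values above β toward 1, which is exactly the behavior the layered constructions exploit.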
3 Monotone circuits for majority

In this section we give a randomized construction of the circuit C : {0,1}ⁿ → {0,1} such that C(x) is one if the number of ones in x is at least βn and zero otherwise. The circuit C has size O(n³) and depth (α + 2 + 2ε) log n + O(1) for an arbitrarily small constant ε > 0. As we described before, we will describe C as the composition of the circuits C1 and C2, whose parameters are given by the following two theorems:

Theorem 2. For every ε, c > 0, there exists a circuit C1 : {0,1}ⁿ → {0,1}^m for m = O(n), of size O(n³) and depth (α + ε) log n + O(1) that deterministically amplifies all inputs from (β − c/n, β + c/n) to (ε, 1 − ε).

Theorem 3. For every ε > 0, there exist ε′ > 0 and a circuit C2 : {0,1}ⁿ → {0,1}, of size O(n) and depth (2 + ε) log n + O(1), that deterministically amplifies all inputs from (ε′, 1 − ε′) to (0, 1).

The two circuits use a generalization of the four-input function H used in Valiant's construction. For any integer d ≥ 2, we define the function H_d on d² inputs as the d-ary OR of d d-ary AND gates, i.e., H_d = ∨_{i=1}^{d} ∧_{j=1}^{d} x_{ij}. Note that Valiant's function H is just H₂. Each of the circuits C1 and C2 is a layered circuit, where layer zero is the input, and each value at the i-th layer is obtained by applying H_d to d² independently chosen inputs from layer i−1. However, the values of d we choose for C1 and C2 are different. For C1 we have d = 2, while for C2 we choose a sufficiently large d = d(ε) to meet the depth requirement of the circuit. We let F(n, m, F₀) denote a random circuit mapping n inputs to m outputs, where F₀ is a fixed monotone boolean circuit with k inputs, and each of the m output bits is calculated by applying F₀ to k independently chosen random inputs. We start with a simple lemma that relates the deterministic amplification properties of F(n, m, F₀) to the probabilistic amplification function A_{F₀}.

Lemma 4. For any ε, δ > 0, the random function F(n, m, F₀) deterministically amplifies (p_l, p_h) to (A_{F₀}(p_l)(1 + δ), A_{F₀}(p_h)(1 − δ)) with respect to S ⊆ {0,1}ⁿ with probability at least 1 − ε, if m = Ω((log|S| + log(1/ε))/δ²).

In the proof of Theorem 2 the lemma is applied layer by layer: at layer i the separation around β has grown to Θ((γ − 2ε)^{i−1}·c/n). That is, we can choose δ as an increasing geometric sequence, starting from Θ(1/n) for i = 1, up to Θ(1) for i = log_{γ−2ε} n. The implied layer size for error probability 2^{−n} (which is much better than we need) is Θ(n/δ²). Therefore, it decreases geometrically from Θ(n³) down to Θ(n). It is not difficult to see that after achieving the desired amplification from (β − c/n, β + c/n) to (β − ∆₀, β + ∆₀), only a constant number of layers is needed to get down to (ε, 1 − ε). The corresponding value of δ in these last steps is a constant (that depends on ε), and therefore the required layer sizes are all Θ(n).

Proof of Theorem 3. The circuit C2 is a composition of F(n, m₁, H_d), F(m₁, m₂, H_d), ..., F(m_{t−1}, m_t, H_d), where d and the layer sizes n = m₀, m₁, ..., m_t are suitably chosen parameters depending on ε. We prove that with high probability such a circuit deterministically amplifies all inputs from (ε′, 1 − ε′) to (0, 1). As before, we restrict our attention to the lower end of the promise problem and prove that C2 outputs zero on all inputs with portion of ones smaller than ε′. As in the circuit C1, the layer sizes must be sufficiently large to allow accurate computation. However, for the circuit C2, accurate computation does not mean that the portion of ones in each layer is close to its expected value. Rather, our aim is to keep the portion of ones bounded by a fixed constant ε′, while making each layer smaller than the preceding one by approximately a factor of d. We continue this process until the layer size is constant, and then use a constant size circuit to finish the computation. Therefore, since the number of layers of such a circuit is about log n / log d, and the depth of the circuit for H_d is 2 log d, the total depth is about 2 log n for large d. By the above
discussion, it suffices to prove the following: For every ε > 0 there exist a real number δ > 0 and two integers d, n₀, such that for all n ≥ n₀ the random circuit F(n, m, H_d) with m = (1 + ε)n/d deterministically amplifies δ to δ with respect to all inputs, with failure probability at most 1/n. Since A_{H_d}(δ) = 1 − (1 − δ^d)^d ≤ dδ^d, the probability of failure for any specific input with portion of ones at most δ is bounded by (m choose δm)·A_{H_d}(δ)^{δm}.

We use a similar expansion-based amplification method to analyze the performance of a belief propagation message passing algorithm for decoding low density parity check (LDPC) codes. Today the use of belief propagation for decoding LDPC codes is one of the hottest topics in error correcting codes [9], [14], [13]. Let G = (V_L ∪ V_R; E) be a d-regular bipartite graph with n vertices on each side, |V_L| = |V_R| = n. Consider the following message passing algorithm, where we think of the left and right as two players. The left player "plays AND" and the right player "plays OR". At time zero the left player starts by sending one boolean message through each left-to-right edge, where the value of the message m_{uv} from u ∈ V_L to v ∈ V_R is the input bit x_u. Subsequently, the messages at time t > 0 are calculated from the messages at time t − 1. At odd times, given the left-to-right messages m_{uv}, the right player calculates the right-to-left messages m_{vw} from v ∈ V_R to w ∈ V_L by the formula m_{vw} = ∨_{u ∈ N(v)∖{w}} m_{uv}. That is, the right player sends a 1 along the edge from v ∈ V_R to w ∈ V_L if and only if at least one of the incoming messages/values (not including the incoming message from w) is 1. Similarly, at even times the algorithm calculates the left-to-right messages m_{vw}, v ∈ V_L, w ∈ V_R, from the right-to-left messages m_{uv}, by the formula m_{vw} = ∧_{u ∈ N(v)∖{w}} m_{uv}. That is, the left player sends a 1 along the edge from v ∈ V_L to w ∈ V_R if and only if all of the incoming messages/values (not including the incoming message from w) are 1.

We further need the following definitions. We call a left vertex bad at even time t if it transmits at least one message of value one at time t. Similarly, a right vertex is bad at odd time t if it transmits at least one message of value zero at time t. We let b_t be the number of bad vertices at time t. These definitions will be instrumental in providing a potential function measuring the progress of the message passing algorithm, which is expressed in Lemma 5. We say that a bipartite graph G = (V_L ∪ V_R; E) is (λ, e)-expanding if for any vertex set S ⊆ V_L (or S ⊆ V_R) of size at most λn, |N(S)| ≥ e|S|. It will be convenient to denote the expansion of a set S by e(S) = |N(S)|/|S|.
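The following Python sketch simulates this AND/OR message passing; the graph representation and helper names are illustrative assumptions of ours, not the authors' code.

```python
# Toy simulation of the AND/OR message-passing algorithm on a bipartite graph.
def run_message_passing(adj_L, adj_R, x, rounds):
    # adj_L[u] / adj_R[v]: neighbor lists; x[u]: input bit of left vertex u.
    msg_LR = {(u, v): x[u] for u in adj_L for v in adj_L[u]}
    for _ in range(rounds):
        # Right player "plays OR": send 1 iff some *other* incoming msg is 1.
        msg_RL = {(v, w): any(msg_LR[(u, v)] for u in adj_R[v] if u != w)
                  for v in adj_R for w in adj_R[v]}
        # Left player "plays AND": send 1 iff all *other* incoming msgs are 1.
        msg_LR = {(u, v): all(msg_RL[(w, u)] for w in adj_L[u] if w != v)
                  for u in adj_L for v in adj_L[u]}
    return msg_LR

# Example: complete bipartite 2x2 graph (a toy, far below the d >= 4 regime).
adj_L = {0: [0, 1], 1: [0, 1]}
adj_R = {0: [0, 1], 1: [0, 1]}
print(run_message_passing(adj_L, adj_R, {0: True, 1: False}, rounds=3))
```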
Lemma 5. Consider the message passing algorithm using a d ≥ 4 regular expander graph with d − 1 ≥ e > (d + 1)/2. If b_t ≤ λn, then b_{t+2} ≤ η·b_t, where η = 2(d − e)/(e − 1) < 1, and so b_{2t+1} = 0 for t = O(log n). The larger the expansion e gets, the smaller η gets, and the better the time guarantee above gets.

How good are the expanders that we may use? One can show the existence of such expanders for sufficiently large d, and e = d − c for an absolute constant c. The best known explicit construction that gets close to what we need is the result of [4]. However, that result does not suffice here for two reasons. The first is that it only achieves expansion (1 − ε)d for any ε > 0 and sufficiently large d depending on ε. The second is that it only guarantees left-to-right expansion, while our construction needs both left-to-right and right-to-left expansion. We refer the reader to the survey [6] for further reading and background. For such expanders η is a small constant, so after about (1 + ε) log n / log(d − 1) iterations all messages contain the right answer, where ε can be made arbitrarily small by choosing a sufficiently large d.

It remains to convert the algorithm into a monotone circuit, which introduces a depth blowup of log(d − 1), owing to the depth of a binary tree simulating a (d − 1)-ary gate. Thus we get a (2 + ε) log n depth circuit for arbitrarily small ε > 0. The size is obviously d·n·depth = O(n log n). To get a linear circuit, further work is needed, which we now describe. The idea is to use a sequence of graphs G₀ = G, G₁, G₂, ..., where each graph is half the size of its preceding graph, but has the same degree and expansion parameters. We start the message passing algorithm using the graph G = G₀, and every t₀ rounds (each round consists of OR and then AND), we switch to the next graph in the sequence. Without the switch, the portion of bad vertices should decrease by a factor of η^{t₀} every t₀ rounds. We argue that each switch can be performed while losing at most a constant factor. To describe the switch from G_i to G_{i+1}, we identify V_L(G_{i+1}) with an arbitrary half of the vertices V_L(G_i), and start the message passing algorithm on G_{i+1} with the left-to-right messages from each vertex in V_L(G_{i+1}) being the same as at the last round of the algorithm on G_i. As the number of bad left vertices cannot increase at a switch, their portion at most doubles. For the right vertices, the exact argument is slightly more involved, but it is clear that the portion of bad right vertices in the first round in G_{i+1} increases by at most a constant factor c′, compared with what it should have been had there been no switch. (Precise calculation yields c′ = 2d/η.) Therefore, to summarize, as the circuit consists of a geometrically decreasing sequence of blocks starting with a linear size block, the total size is linear as well. As for the depth, the amortized reduction in the portion of bad vertices per round is by a factor of η̃ = η·(c′)^{1/t₀}. Therefore, the resulting circuit is only deeper than the one described in the previous paragraph by a factor of log η / log η̃. By choosing a sufficiently large value for t₀, we obtain:

Theorem 6. For any ε > 0, there exists a > 0 such that for any n there exists a monotone circuit of depth (2 + ε) log n + O(1) and size O(n) that solves the a-promise problem.

We note here that O(log n) depth monotone circuits for the a-promise problem can also be obtained from ε-halvers. These are building blocks used in the AKS network. However, our monotone circuits for the a-promise problem have two advantages. First, our algorithm relates this classical problem in circuit complexity to recent popular message passing algorithms. And second, the depth that we obtain is nearly optimal. Namely, Moore and Shannon [10] prove that any monotone formula/circuit for
majority requires depth 2 log n − O(1), and the lower bound holds for the a-promise problem as well.

Proof of Lemma 5. (based on Burshtein and Miller [3]) We consider only the case of bad left vertices. The proof for bad right vertices follows from the same proof, after exchanging ones with zeroes, ANDs with ORs, and lefts with rights. Let B ⊆ V_L be the set of bad left vertices at some even time t, and let B′ be the set of bad left vertices at time t + 2. We bound the size of B′ by considering separately B′ ∩ B and B′ ∖ B. Note that all sets considered in the proof have size at most λn, and therefore expansion at least e.

To bound |B′ ∖ B|, consider the set Q = N(B′ ∖ B) ∖ N(B). Since vertices in Q are not adjacent to B, at time t + 1 they send right-to-left messages valued zero. On the other hand, any vertex in B′ ∖ B can receive at most one such zero message (otherwise all its messages at time t + 2 will be valued zero and it cannot be in B′). Therefore, since each vertex in Q must have at least one neighbour in B′ ∖ B, it follows that |Q| ≤ |B′ ∖ B|. Therefore, we have

e·(|B| + |B′ ∖ B|) ≤ |N(B ∪ (B′ ∖ B))| ≤ |N(B)| + |Q| ≤ d|B| + |B′ ∖ B|,   (1)

which gives |B′ ∖ B| ≤ ((d − e)/(e − 1))·|B|. A similar expansion argument applied to B′ ∩ B gives

|B′ ∩ B| ≤ ((d − e)/(e − 1))·|B|.   (2)

Combining inequalities (1) and (2), and since e > (d + 1)/2 and e(B) ≥ e, this yields the required bound: |B′| ≤ (2(d − e)/(e − 1))·|B| = η|B|.

As noted before in Section 2, replacing the last 2 log n layers of Valiant's tree with 2 log_r n layers of r-ary AND/OR gates results in an arbitrarily small increase in the depth of the corresponding formula for a large value of r. It is interesting to compare the expected behavior of the suggested belief-propagation algorithm to the behavior of the (d − 1)-ary tree. Assume that the graph G is chosen at random (in the configuration model), and that the number of rounds k is sufficiently small, (d − 1)^{2k} ≪ n. Then, almost surely, the computation of all but an o(1) fraction of the k-th round messages is performed by evaluating (d − 1)-ary depth-k trees. Moreover, introducing an additional o(1) error, one may assume that the leaves are independently chosen boolean random variables that are one with probability p, where p is the portion of ones in the input. This observation sheds some light on the performance of the belief propagation algorithm.
However, our analysis proceeds far beyond the number of rounds for which a cycle-free analysis can be applied.

4 Monotone formulas for threshold-k functions

Consider the case of the k-th threshold function, T_k^n, i.e., a function that is one on x ∈ {0,1}ⁿ if |x| ≥ k and zero otherwise. We show that, by essentially the same techniques of Section 3, we can construct monotone circuits for this more general problem. We assume henceforth that k ≤ n/2, since otherwise we construct the circuit for T_{n+1−k}^n and switch AND with OR gates. For k = Θ(n), the construction yields circuits of depth 5.3 log n + O(1) and size O(n³). However, when k = o(n), circuits are shallower and smaller (this not surprising fact is also discussed in [2] in the context of formulas). The construction goes as follows: (i) amplify (k/n, (k+1)/n) to (β − Ω(1/k), β + Ω(1/k)) by randomly applying to the input a sufficiently large number of OR gates of arity Θ(n/k); (ii) amplify (β − Ω(1/k), β + Ω(1/k)) to (O(1), 1 − O(1)) using a variation of phase I; and (iii) amplify (O(1), 1 − O(1)) to (0, 1) using phase II. We now give a detailed description. For the sake of the section to follow, we require the following lemma, which is more general than is needed for the results of this section.

Lemma 7. Let S ⊆ {0,1}ⁿ, and ε > 0. Then, for any k, there is a randomized construction of a monotone circuit that evaluates T_k^n correctly on all inputs from S and has

depth ≤ log(n/k̄) + 3.3 log k̄ + (2 + ε) log log|S| + O(1),
size O(log|S| · k̄ · n),

where k̄ = min(k, n + 1 − k), and the constants of the O depend only on ε.

Proof. Let s = log|S|, and let ∨_i be the OR function of arity i. Then A_{∨_{n/k}}(k/n) = 1 − (1 − k/n)^{n/k}, while A_{∨_{n/k}}((k+1)/n) = 1 − (1 − (k+1)/n)^{n/k}. Hence A_{∨_{n/k}}(k/n) is a constant bounded away from zero and one. We further notice that

A_{∨_{n/k}}((k+1)/n) − A_{∨_{n/k}}(k/n) = Θ(1/k).

It is not hard to see that we can pick a constant ρ so that OR gates of arity ρn/k probabilistically amplify (k/n, (k+1)/n) to (β − Ω(1/k), β + Ω(1/k)). Using Lemma 4 with δ = Θ(1/k) and m = sk², we get that F(n, m, ∨_{ρn/k}) amplifies (k/n, (k+1)/n) to (β − Ω(1/k), β + Ω(1/k)) with arbitrarily high probability. The depth required to implement the above circuit is log(n/k) and the size is O(skn). Next we apply a slight modification of phase I. The analysis there remains the same except that the starting point is a separation guarantee of Ω(1/k) instead of Ω(1/n), and log|S| is s instead of n. This leads to a circuit of depth (α + ε) log k + O(1) and of size O(sk²), for an arbitrarily small constant ε > 0. Also, we note that the output of this phase is of size Θ(s). Finally, we apply phase II, where the number of inputs is Θ(s) instead of Θ(n), to obtain an amplification from (O(1), 1 − O(1)) to (0, 1). This requires depth (2 + ε) log s + O(1) and size O(s), for an arbitrarily small constant ε > 0.

To guarantee the correctness of a monotone circuit for T_k^n, it suffices to check its output on inputs of weight k and k − 1 (as the circuit is monotone). Therefore, |S| = (n choose k) + (n choose k−1), implying that log|S| = O(k̄ log(n/k̄)). Therefore, we have:

Theorem 8. There is a randomized construction of a monotone circuit for T_k^n with

depth ≤ log(n/k̄) + 5.3 log k̄ + O(log log(n/k̄)),
size O(k̄² n log(n/k̄)),

where k̄ = min(k, n + 1 − k), and the constants of the O are absolute.

5 Reducing the circuit size

The result obtained so far for the majority is a monotone circuit of depth 5.3 log n + O(1) and size O(n³). In this section, we would like to obtain a smaller circuit size, at the expense of increasing the depth somewhat.
The crucial observation is that the size of our circuit depends linearly on the logarithm of the number of scenarios it has to handle. Therefore, applying a preprocessing stage to reduce the wealth of scenarios may save up to a factor of n in the circuit size. We propose a recursive construction that reduces the circuit size to about n^{1.3}. We choose α_i = 2σ_{i−1} to equate 1 + α_i + σ_i with 3α_i; this determines σ_i from σ_{i−1}, and δ_i from δ_{i−2} and σ_{i−1}, and the resulting sequence of size exponents δ_i tends to about 1.29896. Therefore, we have:

Theorem 9. There is a randomized construction of a monotone circuit for the majority of size about n^{1.3} and logarithmic depth.

There are two central open problems related to this work. First, is the promise version really simpler than majority? A lower bound greater than 2 log n on the communication complexity of mMaj-Search would settle this question. Boppana [2] and more recent work [5] show lower bounds on a particular method for obtaining monotone formulas for majority. However, we are asking instead for lower bounds on the size/depth of unrestricted monotone formulas/circuits. Secondly, the original question remains unresolved. Namely, we would like to obtain explicit uniform formulas for majority of optimal or near optimal size. A related problem is to come up with a natural (top-down) communication complexity protocol for mMaj-Search that uses O(log n) many bits.

References

[1] M. Ajtai, J. Komlós, and E. Szemerédi. Sorting in c log n parallel steps. Combinatorica, 3(1):1–19, 1983.
[2] R. B. Boppana. Amplification of probabilistic boolean formulas. IEEE Symposium on Foundations of Computer Science (FOCS), pages 20–29, 1985.
[3] D. Burshtein and G. Miller. Expander graph arguments for message-passing algorithms. IEEE Trans. Inform. Theory, 47(2):782–790, 2001.
[4] M. Capalbo, O. Reingold, S. Vadhan, and A. Wigderson. Randomness conductors and constant-degree expansion beyond the degree 2 barrier. In Proceedings 34th Symposium on Theory of Computing, pages 659–668, 2002.
[5] M. Dubiner and U. Zwick. Amplification by read-once formulas. SIAM J. Comput., 26(1):15–38, 1997.
[6] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Survey article to appear in the Bulletin of the AMS.
[7] M. Karchmer and A. Wigderson. Monotone circuits for connectivity require super-logarithmic depth. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pages 539–550, Chicago, IL, May 1988.
[8] M. Luby, M. Mitzenmacher, and A. Shokrollahi. Analysis of random processes via and-or tree evaluation. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 1998.
[9] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. A. Spielman. Analysis of low density codes and improved designs using irregular graphs. ACM Symposium on Theory of Computing (STOC), 1998.
[10] E. F. Moore and C. E. Shannon. Reliable circuits using less reliable relays. I, II. J. Franklin Inst., 262:191–208, 281–297, 1956.
[11] M. S. Paterson. Improved sorting networks with O(log N) depth. Algorithmica, 5(1):75–92, 1990.
[12] M. S. Paterson, N. Pippenger, and U. Zwick. Optimal carry save networks. In Boolean function complexity (Durham, 1990), volume 169 of London Math. Soc. Lecture Note Ser., pages 174–201. Cambridge Univ. Press, Cambridge, 1992.
[13] T. Richardson and R. Urbanke. Modern coding theory. Draft of a book.
[14] T. Richardson and R. Urbanke. The capacity of low-density parity-check codes under message-passing decoding. IEEE Trans. Inform. Theory, 47(2):599–618, 2001.
[15] L. G. Valiant. Short monotone formulae for the majority function. J. Algorithms, 5(3):363–366, 1984.
Randomized Algorithms
Simple special case
Uniform binary tree of height h, with n = 2^h leaves. All values are boolean, so MAX → OR and MIN → AND. Exercise: Every deterministic algorithm can be forced to read n leaves.
Randomized algorithm
[Figure: an example AND/OR tree with boolean values 0 and 1 at the leaves and the induced values at the alternating AND and OR gates.]
Evaluate (recursively) a random child of the current node. If this does not determine the value of the current node, evaluate the other child.
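A minimal Python sketch of this randomized evaluator (ours, for illustration): leaves are booleans, internal nodes alternate between AND and OR, and a sibling is skipped whenever the first child already determines the node's value.

```python
# Randomized AND/OR (MIN/MAX) game-tree evaluation with short-circuiting.
import random

def evaluate(node, is_and):
    if isinstance(node, bool):              # leaf
        return node
    children = list(node)
    random.shuffle(children)                # pick a random child first
    first = evaluate(children[0], not is_and)
    # Short-circuit: 0 determines an AND node, 1 determines an OR node.
    if first == (not is_and):
        return first
    return evaluate(children[1], not is_and)

# Example: a height-2 tree with an AND root over two OR nodes.
tree = [[True, False], [True, True]]
print(evaluate(tree, is_and=True))          # -> True
```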
Randomized Algorithms
Prabhakar Raghavan IBM Almaden Research Center San Jose, CA.
Deterministic Algorithms
[Diagram: INPUT → ALGORITHM → OUTPUT]
Derandomization: First devise a randomized algorithm, then argue that it can be "derandomized" to yield a deterministic algorithm.
Randomized Algorithms
Take a random sample of n^{4/9} points, then solve it with a straightforward method for the n^{4/9} points: O((n^{4/9})²) = O(n^{8/9}). Steps 2–4: O(n). Step 5: O(n) with probability 1 − 2e^{−cn^{1/6}}.
A randomized algorithm to test whether a number is prime.
This problem is very difficult, and for a long time no polynomial algorithm was known for it. Traditional method: use 2, 3, …, √N to test whether N is prime. Input size of N: B = log₂N (binary representation). Then √N = 2^{B/2}, an exponential function of B. Thus √N trial divisions cannot be viewed as polynomial in the input size.
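For contrast with trial division, here is a hedged sketch of a Miller–Rabin style randomized test (a common choice; the slides do not fix a specific variant). Each round uses one random base, and the error probability drops geometrically with the number of rounds:

```python
# Miller-Rabin: a Monte Carlo primality test with one-sided error.
import random

def is_probably_prime(n, trials=20):
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1                       # n - 1 = d * 2^s with d odd
    for _ in range(trials):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False             # base a witnesses compositeness
    return True
```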
An example: 27 points. S1 = {x1, x2, …, x9}, δ = d(x1, x2). [Figure: the points x1, …, x9 plotted on a grid whose axes are marked in multiples of δ, up to 5δ horizontally and 6δ vertically.]
Randomized Algorithms

The origins of randomized algorithms (also called probabilistic algorithms) can be traced back to the mid-1940s. At that time, Monte Carlo methods for numerical computation proposed obtaining approximate solutions to problems through statistical simulation or sampling, with the probability of error decreasing markedly as the number of trials increases; that is, running time can be traded for higher solution correctness. For a long time, however, the Monte Carlo idea was not carried over to non-numerical algorithms.

In 1974, Michael Rabin (winner of the 1976 Turing Award, professor at Harvard, from Israel) pointed out in a lecture in Sweden that for some problems, deterministic algorithms cannot produce the required result within tolerable time, while randomized methods can. For example, for decision procedures in the Presburger arithmetic system (which has only addition), even an input of just 100 symbols would require a trillion machines, each performing a trillion operations per second and running in parallel, to compute for a trillion years. Using the notion of randomness, however, a result can be obtained quickly, with a vanishingly small error rate. In 1974 Rabin's ideas on randomization were not yet mature; by 1976, he had designed a randomized algorithm for testing primality, which remains a classic to this day. Randomized algorithms are now widely applied in distributed computing, communication, information retrieval, computational geometry, cryptography, and many other areas; the best known applications are in public-key cryptosystems and the RSA algorithm.

An example of solving a problem by randomization: suppose we have a function expression f(x₁, x₂, …, x_n) and want to decide whether f is identically 0 over some region D. If f cannot be simplified symbolically by mathematical methods (a common situation in engineering), making this decision is troublesome. If we randomly generate an n-dimensional point (r₁, r₂, …, r_n) ∈ D and find f(r₁, r₂, …, r_n) ≠ 0, we can conclude that f ≢ 0 in D. If f(r₁, r₂, …, r_n) = 0, there are two possibilities: 1. f ≡ 0 in D; 2. f ≢ 0 in D, and obtaining the zero value was merely a coincidence. If we test many randomly generated points and the result is 0 every time, we can assert that the probability that f ≢ 0 is extremely small.

As this example shows, in a randomized algorithm we do not require the algorithm to compute correctly on every possible input; we only require that the probability of error be small enough to ignore. Moreover, we do not require the algorithm to give the same result on every execution with the same input.
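A small Python sketch of the random-evaluation test from the example above; the helper name and the sampling range are our own illustrative choices (integer samples keep the arithmetic exact):

```python
# Randomized identity test: evaluate f at random points of D and declare
# f not identically 0 as soon as a nonzero value appears.
import random

def probably_identically_zero(f, n, trials=50, lo=-1000, hi=1000):
    for _ in range(trials):
        point = [random.randint(lo, hi) for _ in range(n)]
        if f(*point) != 0:
            return False             # certain: f is not identically 0
    return True                      # zero every time: f == 0 very likely

# Example: f(x, y) = (x + y)^2 - x^2 - 2xy - y^2 is identically zero.
print(probably_identically_zero(
    lambda x, y: (x + y)**2 - x**2 - 2*x*y - y**2, n=2))
```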
The Principle of the k-Nearest Neighbors Algorithm

The k-nearest neighbors algorithm (k-NN) is a widely used classification and regression algorithm whose principle is based on sample similarity. The core idea is to find the k training samples most similar to the sample to be classified, and to use them to predict its label or attribute value. For classification, the algorithm tallies the class labels of the k nearest neighbors and takes the most frequent class as the prediction; for regression, it predicts the attribute value as the average or weighted average of the k nearest neighbors.

The basic steps of the k-NN algorithm are as follows:
1. Compute distances: using a given distance metric (e.g., Euclidean or Manhattan distance), compute the distance between the query sample and each training sample.
2. Select the k nearest neighbors: based on the computed distances, choose the k training samples closest to the query sample.
3. Determine the result: for classification, count the occurrences of each class among the k nearest neighbors and choose the most frequent class as the prediction; for regression, take the average or weighted average of the k neighbors' attribute values.
4. Output the result: return the predicted class label or attribute value, as in the sketch below.
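A minimal Python sketch of these four steps (all names are ours; a real implementation would typically use spatial indexing instead of a full distance scan):

```python
# A tiny k-NN classifier following steps 1-4 above.
from collections import Counter
import math

def knn_classify(query, data, labels, k=3):
    # Step 1: Euclidean distances to all training samples.
    dists = [(math.dist(query, x), y) for x, y in zip(data, labels)]
    # Step 2: the k nearest neighbors.
    nearest = sorted(dists)[:k]
    # Steps 3-4: majority vote among their labels.
    return Counter(y for _, y in nearest).most_common(1)[0][0]

print(knn_classify([0.9, 0.1], [[1, 0], [0, 1], [1, 1]], ["a", "b", "a"]))
```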
Strengths and weaknesses of k-NN:

Strengths:
1. Simple and easy to understand: the principle of k-NN is simple, and the algorithm is easy to understand and implement.
2. Broadly applicable: k-NN applies to many data types and problem types, including classification, regression, and density estimation.
3. Insensitive to outliers: k-NN is relatively insensitive to outliers and can, to some extent, tolerate noise and anomalies in the data.

Weaknesses:
1. High computational cost: k-NN must compute the distance between the query sample and every training sample, which is expensive, especially when the training set is large.
2. Large storage requirements: k-NN must store the feature vectors and class labels of all training samples, which demands considerable storage space.
3. The value of k must be chosen: k is set manually, and an unsuitable choice of k can make the classification or regression results inaccurate.

Improvements and applications: to overcome these weaknesses, researchers have proposed many refinements, such as weighted k-NN, locally weighted k-NN, and approximate nearest neighbor methods that reduce the computational cost. In addition, k-NN is widely applied across many fields, including text classification, image recognition, and recommender systems.
An Optimal Implementation of the k-Nearest Neighbors Algorithm

The k-nearest neighbors (KNN) algorithm is a simple and effective method for classification and regression tasks in machine learning. It is a non-parametric technique that makes predictions based on the similarity between the input sample and the labeled data points in the training set. One of the key advantages of KNN is its simplicity and ease of implementation, making it a popular choice for many applications. However, like any algorithm, there are considerations and trade-offs to be made in order to achieve the best possible performance.

One important consideration when implementing the KNN algorithm is the choice of distance metric used to measure the similarity between data points. The Euclidean distance is commonly used due to its simplicity and effectiveness, but other distance metrics such as Manhattan or cosine distances may be more appropriate for certain datasets. Selecting the most appropriate distance metric for a given dataset can significantly impact the performance of the KNN algorithm.
The kfold Parameter

K-fold cross-validation is a commonly used method for evaluating machine learning models. K means the data set is split evenly into K parts: K−1 parts are used to train the model and the remaining part is used to test it. The process is repeated K times, each time with a different test fold, and the average of the K results is taken as the model's evaluation.

In sklearn, K-fold cross-validation is provided by the KFold class. The class has three important parameters: n_splits (the number of folds), shuffle (whether to shuffle the data set), and random_state (the random seed).

n_splits determines how many parts the data set is split into. The default is 5, but in practice it can be adjusted to the data set size and model complexity. shuffle determines whether the data set is shuffled. The default is False, i.e., no shuffling; when the order of samples in the data set affects model training, it can be set to True. random_state is the random seed that controls the consistency of the splits: once the parameter is set, every run produces the same splits; if it is not set, the splits are random each time.

When using KFold for cross-validation, the model must be redefined, trained, and tested in each iteration of the loop; the per-fold results are recorded and accumulated, and their average is taken as the model's evaluation metric, as in the sketch below.
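For example, a typical cross-validation loop with the scikit-learn KFold class looks like this (the model and data below are placeholders):

```python
# K-fold cross-validation with scikit-learn's KFold.
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()          # re-create the model each fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(sum(scores) / len(scores))          # average as the final estimate
```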
2.1 Introduction to Randomized Algorithms
Generate Random Number?
Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.
John von Neumann
Generate Random Numbers
Lagged Fibonacci Generators
Similar to the Fibonacci sequence; increasingly popular. X_{n+1} = (X_{n−l} + X_{n−k}) mod m, with l > k > 0. l seeds are needed; m is usually a power of 2.
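A toy Python sketch of such a generator (the lag and modulus values here are illustrative, not recommended parameters):

```python
# Lagged Fibonacci generator: X_n = (X_{n-l} + X_{n-k}) mod m, seeded
# with l initial values.
from collections import deque

def lagged_fibonacci(seeds, k, l, m=2**32):
    assert l > k > 0 and len(seeds) == l
    state = deque(seeds, maxlen=l)       # keeps only the last l values
    while True:
        x = (state[-l] + state[-k]) % m
        state.append(x)
        yield x

gen = lagged_fibonacci(list(range(1, 18)), k=5, l=17)
print([next(gen) for _ in range(5)])
```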
We really don’t have truly random random number generators.
To generate a pseudorandom sequence, let s ∈ X be a seed. The seed defines a sequence via x₀ = s, x_{i+1} = f(x_i) for all i ≥ 0, and y_i = g(x_i) for all i ≥ 0, where f : X → X and g : X → Y, X is a sufficiently large set, and Y is the domain of pseudorandom values to be generated. The sequence (y_i) is the pseudorandom sequence.
Other generators have longer maximum periods. Bad choices of m result in very bad sequences.
Randomized k-Coverage Algorithms for Dense Sensor Networks

Mohamed Hefeeda, School of Computing Science, Simon Fraser University, Surrey, BC, Canada
Majid Bagheri, School of Computing Science, Simon Fraser University, Surrey, BC, Canada

Abstract: We propose new algorithms to achieve k-coverage in dense sensor networks. In such networks, covering sensor locations approximates covering the whole area. However, it has been shown before that selecting the minimum set of sensors to activate from an already deployed set of sensors is NP-hard. We propose an efficient approximation algorithm which achieves a solution of size within a logarithmic factor of the optimal. We prove that our algorithm is correct and analyze its complexity. We implement our algorithm and compare it against two others in the literature. Our results show that the logarithmic factor is only a worst-case upper bound and the solution size is close to the optimal in most cases. A key feature of our algorithm is that it can be implemented in a distributed manner with local information and low message complexity. We design and implement a fully distributed version of our algorithm. Our distributed algorithm does not require that sensors know their locations. Comparison with two other distributed algorithms in the literature indicates that our algorithm: (i) converges much faster than the others, (ii) activates a near-optimal number of sensors, and (iii) significantly prolongs (almost doubles) the network lifetime because it consumes much less energy than the other algorithms.

I. INTRODUCTION

Mass production of sensor devices with low cost enables the deployment of large-scale sensor networks for real-life applications such as forest fire detection and vehicle traffic monitoring. A fundamental issue in such applications is the quality of monitoring provided by the network. This quality is usually measured by how well deployed sensors cover a target area. In its simplest form, coverage means that every point in the target area is monitored by, i.e., within the sensing range of, at least one sensor. This is called 1-coverage. In this paper, we consider the more general k-coverage (k ≥ 1) problem, where each point should be within the sensing range of k or more sensors. Covering each point by multiple sensors is desired for many applications, because it provides redundancy and fault tolerance. Furthermore, k-coverage is necessary for the proper functioning of many other applications, such as intrusion detection [1], data gathering [2], and object tracking [3]. When deployed sensors are dense, area coverage can be approximated by point coverage. That is, if all sensor locations are covered by the set of activated sensors, the entire area is covered. In this paper, we address the problem of selecting the minimum set of sensors to activate from an already deployed set of sensors such that all locations are k-covered. Achieving a minimal set of sensors is critical, because it reduces interference among active sensors, reduces total energy consumption, and thus prolongs the lifetime of the whole network.

The problem of selecting the minimum number of sensors, however, is NP-hard [4].¹ We propose an efficient approximation algorithm for it, which achieves a solution of size within a logarithmic factor of the optimal and terminates quickly (on the order of seconds in most cases). We show by simulation that although the approximation factor is logarithmic, it is only a worst-case upper bound and the solution size is close to the optimal in most cases. We take a novel approach in solving the k-coverage problem. In particular, we model
the problem as a set system for which an optimal hitting set corresponds to an optimal solution for coverage. Finding the optimal hitting set is NP-hard [6], but there is an efficient approximation algorithm for it [7]. Our k-coverage algorithm is inspired by the approximation algorithm for the optimal hitting set problem. We prove that our algorithm is correct and analyze its complexity. We implement our algorithm and compare it against other centralized algorithms in the literature. Our comparison reveals that our algorithm is about four orders of magnitude faster than the currently-known k-coverage algorithms.

A key feature of our centralized k-coverage algorithm is that it can be implemented in a distributed manner with local information and low message complexity. We design and implement a fully distributed version of our algorithm. Our distributed algorithm does not require sensors to know their locations. Comparison with two other distributed algorithms in the literature indicates that our algorithm: (i) converges much faster than the others, (ii) activates a near-optimal number of sensors, and (iii) significantly prolongs (almost doubles) the network lifetime because it consumes much less energy than the other algorithms.

The rest of the paper is organized as follows. In Section II, we review previous works. In Section III, we present an overview of the k-coverage problem and our solution approach. The details and analysis of our k-coverage algorithms are presented in Sections IV and V. We evaluate our algorithms and compare them against others in Section VI. We conclude the paper in Section VII.

¹Note that this problem is different from the problem of placing sensors in an area to cover it, which can be solved efficiently [5].

II. RELATED WORK

The closest works to ours are [4] and [8]. In [4], the authors address the problem of selecting the minimum number of sensors to activate from a set of already deployed sensors for k-coverage. They prove that the problem is NP-hard since it is an extension of the dominating set problem [6]. They formulate the problem and provide a centralized approximation solution based on integer linear programming. The algorithm works by relaxing the problem to ordinary linear programming, where the variables may take real values.
They also design a distributed algorithm, PKA, which uses pruning to reduce the number of active sensors. The work in [8] presents a centralized algorithm that works by iteratively adding a set of nodes which maximizes a measure called k-benefit to an initially empty set of nodes. The authors also present a distributed algorithm, DPA, that works by pruning unnecessary nodes. We compare our algorithms against the algorithms in [4], [8].

III. THE k-COVERAGE PROBLEM AND OUR SOLUTION APPROACH

Problem 1 (k-Coverage Problem): Given n already deployed sensors in a target area, and a desired coverage degree k ≥ 1, select a minimal subset of sensors to cover all sensor locations such that every location is within the sensing range of at least k different sensors.

It is assumed that the sensing range of each sensor is a disk with radius r, and sensor deployment can follow any distribution. The above k-coverage problem is proved to be NP-hard by reduction to the minimum dominating set problem in [4]. We propose an efficient approximation algorithm for solving the k-coverage problem. We start describing our solution approach with the following definition [7].

Definition 1 (Set System and Hitting Set): A set system (X, R) is composed of a set X and a collection R of subsets of X. We say that H ⊆ X is a hitting set if H has a non-empty intersection with every element of R, that is, ∀R ∈ R we have R ∩ H ≠ ∅.

Our solution does not require a grid deployment, and any node deployment such as uniform or Poisson distribution can be used. We define X to be the set of all sensor locations. Thus, we have |X| = n. We define the collection R as follows. For each point p in X, we draw a circle of radius r centered at p. All points in X that fall within that circle constitute one set in R. Fig. 1(b) shows only three elements of R that correspond to the three highlighted points p1, p2, p3 in Fig. 1(a). Now the minimum hitting set problem on (X, R) is to find the minimum set of points in X that hit (intersect) all elements (disks) of R. Fig. 1(c) shows a possible hitting set for the three disks of R shown in Fig. 1(b). The hitting set has two points c1 and c2. If we consider c1 and c2 to be locations of sensors, we will ensure that points p1, p2 and p3 are 1-covered, because each of them is within the sensing range of at least one of the sensors located at c1 and c2, as shown in Fig. 1(c). For k-coverage, elements in the hitting set are not locations for individual sensors. Rather, each element in the hitting set is a center of what we call a k-flower, which is a set of k sensors that all intersect at that center point. Fig. 1(d) shows one 3-flower centered at point c that 3-covers point p3. Details of constructing k-flowers are discussed in Section IV. The k-coverage problem now reduces to finding a minimum hitting set where elements in that set are the centers of k-flowers.
Since finding the minimum hitting set is NP-hard, we try to find a near optimal hitting set. We propose an approximation algorithm that uses the concept of ε-nets [9], which is defined as follows.

Definition 2 (ε-Net): Let 0 < ε ≤ 1 be a constant. The set N ⊆ X is called an ε-net for the set system (X, R) if N has a non-empty intersection with every element of R of size greater than or equal to ε|X|, that is, ∀R ∈ R such that |R| ≥ ε|X| we have R ∩ N ≠ ∅.

The definition of ε-net is similar to that of the hitting set, except that the ε-net is required to hit only large elements of R (ones that are greater than or equal to ε|X|), while the hitting set must hit every element of R. This similarity is exploited by our approximation algorithm to find a near optimal hitting set by finding ε-nets of increasing sizes (i.e., decreasing ε) till one of them hits all elements of R. For this to work, we clearly need to efficiently: (i) compute ε-nets, and (ii) verify coverage. We use a simple verifier that checks all points in O(n) steps. Computing ε-nets can be done efficiently for set systems with finite VC-dimensions (defined in [9]). Specifically, Haussler and Welzl [9] show that for any set system (X, R) with a finite VC-dimension d, randomly sampling m ≥ max((4/ε) log(4/δ), (8d/ε) log(8d/ε)) points of X constitutes an ε-net with probability at least 1 − δ, where 0 < δ < 1. Notice that m does not depend on the size of X, which allows X to be arbitrarily large with no effect on the size of the ε-net. Brönnimann and Goodrich [7] further extend the concept of ε-net by assigning weights to elements of X. Weights accelerate the process of finding a near optimal hitting set, and help in establishing an upper bound on its size, as we discuss in Section IV. The VC-dimension of our set system is proved to be 3 by the following lemma. Due to space limitation, the proof is given in [10].

Lemma 1: Consider the set system (X, R), where X is the set of points, and R contains a disk of radius r for each point in X. This set system has a VC-dimension of 3.

To summarize, we model the k-coverage problem as a set system (X, R) where X is the set of sensor locations and R is the collection of subsets of X created by intersecting disks of radius r with points of X. This set system has a VC-dimension of 3; therefore, we can efficiently implement a net-finder algorithm to find ε-nets of various sizes. Our approximation algorithm for the k-coverage problem employs the net-finder to compute ε-nets of increasing sizes, and for each ε-net it verifies the coverage until all points are sufficiently covered. We assign weights to points of X to guarantee termination and to bound the approximation factor of the output solution.
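Schematically, the weighted ε-net loop at the heart of this approach can be sketched as follows. This is a Python paraphrase of ours of the Brönnimann–Goodrich strategy; net_finder and verifier stand in for the routines described in Section IV, and the loop bounds are illustrative.

```python
# Outline of the reweighting strategy behind RKC (schematic, not the
# paper's implementation).
import math

def hitting_set_outline(X, net_finder, verifier, n):
    c = 1
    weights = {p: 1 for p in X}
    while c <= n:
        for _ in range(int(4 * c * math.log(max(n / c, 2)))):
            N = net_finder(X, weights, eps=1.0 / c)  # weighted (1/c)-net
            u = verifier(X, N)       # a point in an un-hit disk, or None
            if u is None:
                return N             # every disk is hit: done
            weights[u] *= 2          # double the weight of a missed point
        c *= 2                       # try a larger net size
    return None                      # density too low
```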
Fig. 1. Modeling the k-coverage problem as a set system (X, R). (a) shows the set of points which constitute X. (b) shows only three subsets of R that are associated with the three highlighted points in (a). (c) shows a hitting set {c1, c2} that 1-covers the three subsets in (b). (d) shows one 3-flower that 3-covers only one subset in R.

Finally, each element in the output represents the center of what we call a k-flower, which is a set of k sensors that all intersect at that center point and should be activated for k-coverage.

IV. CENTRALIZED k-COVERAGE ALGORITHM

The pseudo code of the k-coverage algorithm, which we call RKC (Randomized k-Coverage algorithm), is given in Fig. 2. The algorithm takes as input the set of sensor locations X, the sensing range of sensors r, and the required degree of coverage k. If the algorithm succeeds, it will return a subset of nodes to activate in order to ensure k-coverage. The algorithm may only fail if activating all sensors is not enough for k-coverage because of low density. The minimum required density can be calculated as follows. If every point is to be k-covered, it has to be in the sensing range of at least k sensors. Thus, for each node p, there should be at least k other nodes inside a disk of radius r centered at p.

In every single iteration of the while loop, the algorithm tries up to 4c log(n/c) times to find a hitting set. In each try, a (1/c)-net is computed by the net-finder (Section IV-A); by construction, the net hits all disks with weight greater than or equal to a 1/c fraction of the total weight. If the verifier finds a disk that is not hit, the weight of a point in that disk is doubled; points with increased weights will have a higher probability of being included in the new net. The size of each returned (1/c)-net is O(c log c), and the size of the solution is O(N log N).

1. c = 1; // sets the initial size of the ε-net
2. while (net-size(1/c) ≤ n) {
   …
6.   N = net-finder(X, k, ε, r);
7.   u = verifier(X, N, k, r);
8.   if (u == null)
9.     return N;
10.  else
11.    double weight of u;
12.  c = 2 × c; }
13. return ∅;

Fig. 2. The randomized k-coverage algorithm (RKC).

A. The Net-Finder Algorithm

A net-finder returning an ε-net of size O((1/ε) log(1/ε)) is possible to design [12]. However, the constant in this linear bound is quite high. Moreover, the algorithm involves triangulation, which requires sensors to be aware of their locations, and more importantly, it is not clear how the algorithm can be implemented in a distributed manner. Therefore, although the efficient net-finder in [12] would make our RKC algorithm produce a solution that is a constant factor from the optimal, we opt to use the simpler net-finder algorithm because it can be implemented in a distributed manner, and it produces near-optimal results on average, as shown by our simulations in Section VI.

B. Algorithm Correctness and Complexity

The following theorem proves that our algorithm is correct, provides its time complexity, and proves the upper bound on the solution.

Theorem 1: The k-coverage algorithm (RKC) in Fig. 2 ensures that every point in the area is k-covered, terminates in O(n² log² n) steps, and returns a solution of size at most O(N log N), where N is the minimum number of sensors required for k-coverage.

Proof: Suppose that the algorithm terminates by providing a set S of sensor locations. By construction, this set of points is guaranteed to hit every disk of radius r. Since for our set system (X, R) we put a disk in R for each point p ∈ X, there should be at least one element (i.e., a k-flower) in S that hits the disk centered at p. In addition, the center of each sensor in the k-flower is within a distance r from p (see Section IV-A for details on constructing k-flowers). Therefore, p is k-covered by sensors of this k-flower. Hence, all points are k-covered by sensors in S. The time complexity follows from Lemmas 3 and 4, and by
using a simple verifier that checks all n points in O(n) steps. The bound on the solution size follows from Lemma 2.

Fig. 3. Efficiency of our centralized k-coverage algorithm (RKC). The figure compares the number of active sensors produced by our RKC algorithm versus the necessary (Nec cond) conditions proved in [2]. [The plot shows the number of active sensors against the sensing range r, in meters.]

The results are shown in Fig. 3, where Nec cond and Suf cond denote the necessary and sufficient conditions, respectively. The figure shows that our algorithm does not unnecessarily activate too many sensors, because its output is very close to the necessary condition. The results of this experiment show that the worst-case logarithmic factor proved in Theorem 1 is very conservative, and on average our centralized algorithm produces a near-optimal number of active sensors.

Next, we compare our centralized RKC algorithm against two other centralized k-coverage algorithms: CKC [8] and LPA [4]. Simulation results show that our centralized RKC algorithm runs up to four orders of magnitude faster, while producing same or better solution sizes than the other algorithms [10]. Now, we compare our distributed DRKC algorithm against two other distributed k-coverage algorithms, DPA [8] and PKA [4], along various performance metrics. The results indicate that the DRKC algorithm converges much faster than the other two algorithms and always results in much smaller numbers of activated sensors [10].

Finally, we look at the lifetime of the sensor network under different distributed algorithms. We compare the percentage of alive sensors as time progresses for the three algorithms. As Fig. 4 indicates, our algorithm prolongs (almost doubles) the lifetime of the network, because it consumes a much smaller amount of energy than the other two algorithms [10].

Fig. 4. Comparing the network lifetime under different distributed k-coverage algorithms.

VII. CONCLUSIONS

In this paper, we presented a novel approach to solve the k-coverage problem in large-scale sensor networks. We modeled the k-coverage problem as a set system for which an optimal hitting set corresponds to an optimal solution for k-coverage. We proposed an approximation algorithm for computing near-optimal hitting sets efficiently. We proved that our algorithm produces a solution that is at most a logarithmic factor from the optimal. Furthermore, we showed through simulation that the logarithmic factor is only a conservative upper bound, and the solution is typically close to the optimal in most cases. We compared our algorithm against the currently-known k-coverage algorithms and showed that it runs up to four orders of magnitude faster, while producing same or better solution sizes than the other algorithms. We also designed and implemented a fully distributed version of our algorithm that uses only local information. Our distributed algorithm has low message complexity and it does not require sensors to know their locations.

REFERENCES

[1] D. Mehta, M. Lopez, and L. Lin, "Optimal coverage paths in ad-hoc sensor networks," in Proc. of IEEE International Conference on Communications (ICC'03), May 2003.
[2] S. Kumar, T. H. Lai, and J. Balogh, "On k-coverage in a mostly sleeping sensor network," in Proc. of ACM International Conference on Mobile Computing and Networking (MOBICOM'04), Philadelphia, PA, September 2004, pp. 144–158.
[3] D. Hall and J. Llinas, Handbook of Multisensor Data Fusion. CRC Press, 2001.
[4] S. Yang, F. Dai, M. Cardei, and J. Wu, "On connected multiple point coverage in wireless sensor networks," Journal of Wireless Information Networks, May 2006.
[5] R. Iyengar, K. Kar, and
S. Banerjee, "Low-coordination topologies for redundancy in sensor networks," in Proc. of ACM MobiHoc'05, Urbana-Champaign, IL, May 2005.
[6] M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[7] H. Brönnimann and M. Goodrich, "Almost optimal set covers in finite VC-dimension," Discrete and Computational Geometry, vol. 14, no. 4, April 1995.
[8] Z. Zhou, S. Das, and H. Gupta, "Connected k-coverage problem in sensor networks," in Proc. of International Conference on Computer Communications and Networks (ICCCN'04), Chicago, IL, October 2004.
[9] D. Haussler and E. Welzl, "Epsilon-nets and simplex range queries," Discrete and Computational Geometry, vol. 2, no. 1, December 1987.
[10] M. Hefeeda and M. Bagheri, "Efficient k-coverage algorithms for wireless sensor networks," School of Computing Science, Simon Fraser University, Tech. Rep. TR 2006-22, September 2006.
[11] A. So and Y. Ye, "On solving coverage problems in a wireless sensor network using Voronoi diagrams," in Proc. of Workshop on Internet and Network Economics (WINE'05), Hong Kong, December 2005.
[12] J. Matoušek, R. Seidel, and E. Welzl, "How to net a lot with little: Small ε-nets for disks and halfspaces," in Proc. of the 6th Annual ACM Symposium on Computational Geometry (SoCG'90), Berkeley, CA, June 1990.