On Boyer-Moore Preprocessing


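The BM algorithm (in full, the Boyer-Moore algorithm) is an exact string matching algorithm. Its results are exact; it is heuristic only in how it decides how far to shift the pattern.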





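1) Bad character. In the figure above, the BM algorithm compares from right to left and finds a mismatch at the very first character compared. Two cases have to be handled:

a) If the mismatching text character E does not occur in the pattern P, then no alignment of P that covers this E can match, so P can be shifted entirely past E and matching continues with the text that follows.

b) If E does occur in the part of P that has not yet been compared, P is shifted so that this occurrence of E (the rightmost one, so that no match is skipped) is aligned with the text character E.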
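The two cases above can be sketched as a single function. This is a minimal illustration, not code from the source: the name `bad_char_shift` and its parameters are hypothetical, indices are 0-based, and the pattern is scanned directly instead of using a precomputed table. A full BM implementation would also consult the good suffix rule and take the larger of the two shifts.

```c
#include <stddef.h>

/* Hypothetical helper: bad-character shift for a mismatch at pattern
   index i (0-based), where c is the text character aligned with p[i]. */
size_t bad_char_shift(const char *p, size_t i, char c)
{
    /* Search the not yet compared part p[0..i-1] from right to left. */
    for (size_t k = i; k > 0; --k) {
        if (p[k - 1] == c)
            return i - (k - 1);  /* case b): align this occurrence with c */
    }
    /* Case a): c does not occur in p[0..i-1], so every smaller shift would
       put a different character under c; move the pattern past c. */
    return i + 1;
}
```

For the pattern "EXAMPLE" with a mismatch at the last position (i = 6), a text character that does not occur in the pattern gives a shift of 7, the whole pattern length, while the text character 'P' gives a shift of 2.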

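BM algorithm implementation

2) Good suffix. If a mismatch is found after some characters have already matched, one of two cases applies:

a) If the matched part P', which sits at position t in P, also occurs at another position t' in P, and the character preceding position t' differs from the character preceding position t, then P is shifted right so that the occurrence at t' is aligned with the position that t occupied.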

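b) If the matched part P' does not occur again anywhere in P, find the longest prefix x of P that equals a suffix P'' of P', and shift P right so that x is aligned with the position that the suffix P'' occupied.

The following two links explain this well:
/sealyao/archive/2009/09/18/4568167.aspx
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140

    #define ASIZE 256   /* alphabet size: one entry per possible character value */

    /* Bad-character table preprocessing; x is the pattern P from the text above. */
    void preBmBc(char *x, int m, int bmBc[])
    {
        int i;  /* note: bmBc is indexed by character value, not by position */

        /* Start by giving all ASIZE characters the pattern length m, so that a
           character that does not occur in the pattern yields a shift of m. */
        for (i = 0; i < ASIZE; ++i)
            bmBc[i] = m;
        /* The cast keeps the index non-negative on platforms where char is signed. */
        for (i = 0; i < m - 1; ++i)
            bmBc[(unsigned char)x[i]] = m - i - 1;
    }

How the Suffixes[] array is computed.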
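As a hedged illustration of what such a Suffixes[] array holds, the sketch below assumes the definition used on the second linked page: entry i is the length of the longest string that is a suffix of both x[0..i] and the whole pattern x. The function name `suffixes_naive` is hypothetical, and this is deliberately the straightforward quadratic computation, not the linear-time routine that the preprocessing algorithm uses.

```c
/* Quadratic-time sketch (assumed definition, see lead-in):
   suff[i] = length of the longest common suffix of x[0..i] and x[0..m-1]. */
void suffixes_naive(const char *x, int m, int suff[])
{
    for (int i = 0; i < m; ++i) {
        int k = 0;
        /* Extend leftwards while the characters ending at i agree with
           the characters ending at the last pattern position m - 1. */
        while (k <= i && x[i - k] == x[m - 1 - k])
            ++k;
        suff[i] = k;
    }
}
```

For the pattern "GCAGAGAG" this gives 1, 0, 0, 2, 0, 4, 0, 8. The entry 4 at index 5 says that "GAGAG" ends with "AGAG", a reoccurrence of the last four pattern characters, which is the kind of information case a) of the good suffix rule needs.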


Heikki Hyyrö
Department of Computer Sciences, University of Tampere, Finland
Heikki.Hyyro@cs.uta.fi

Abstract

Probably the two best-known exact string matching algorithms are the linear-time algorithm of Knuth, Morris and Pratt (KMP), and the fast-on-average algorithm of Boyer and Moore (BM). The efficiency of these algorithms is based on using a suitable failure function. When a mismatch occurs in the currently inspected text position, the purpose of a failure function is to tell how many positions the pattern can be shifted forwards in the text without skipping over any occurrences. The BM algorithm uses two failure functions: one is based on a bad character rule, and the other on a good suffix rule. The classic linear-time preprocessing algorithm for the good suffix rule has been viewed as somewhat obscure [8]. A formal proof of the correctness of that algorithm was given recently by Stomp [14]. That proof is based on linear time temporal logic, and is fairly technical and a posteriori in nature. In this paper we present a constructive and somewhat simpler discussion about the correctness of the classic preprocessing algorithm for the good suffix rule. We also highlight the close relationship between this preprocessing algorithm and the exact string matching algorithm of Morris and Pratt (a pre-version of KMP). For these reasons we believe that the present paper gives a better understanding of the ideas behind the preprocessing algorithm than the proof by Stomp. This paper is based on [9], and thus the discussion is originally roughly as old as the proof by Stomp.

The need to search for occurrences of some string within some other string arises in countless applications. Exact string matching is a fundamental task in computer science that has been studied extensively. Given a pattern string P and a typically much longer text string T, the task of exact string matching is to find all locations in T where P occurs.

Let |s| denote the length of a string s. Also let the notation x_i refer to the ith character of a string x, counting from the left, and let the notation x_{h..i} denote the substring of x that is formed by the characters of x from its hth position to its ith position. Here we require that h ≤ i. If i > |x| or i < 1, we interpret the character x_i to be a non-existing character that does not match any character. A string y is a prefix of x if y = x_{1..h} for some h > 0. In similar fashion, y is a suffix of x if y = x_{h..|x|} for some h ≤ |x|. It is common to denote the length of the pattern string P by m and the length of the text T by n. With this notation P = P_{1..m} and T = T_{1..n}, and the task of exact string matching can be defined more formally as searching for those indices j for which T_{j-m+1..j} = P.

A naive "Brute-Force" approach for exact string matching is to check each possible text location separately for an occurrence of the pattern. This can be done for example by sliding a window of length m over the text. Let us say that the window is in position w when it overlaps the text substring T_{w-m+1..w}. The position w is checked for a match of P by a sequential comparison between the characters P_i and T_{w-m+i} in the order i = 1 . . . m. The comparison is stopped as soon as P_i ≠ T_{w-m+i} or all m character pairs have matched (in which case T_{w-m+1..w} = P). After the window position w has been checked, the window is shifted one step right to the position w + 1, and a new comparison is started. As there are n − m + 1 possible window positions, and checking each location may involve up to m character comparisons, the worst-case run time of the naive method is O(mn).

Morris and Pratt have presented a linear O(n) algorithm for exact string matching [11]. Let us call this algorithm MP. It improves the above-described naive approach by using a suitable failure function, which utilizes the information gained from previous character comparisons. The failure function makes it possible to move the window forward in a smart way after the window position w has been checked. Later Knuth, Morris and Pratt presented an O(n) algorithm [10] that uses a slightly improved version of the failure function. Boyer and Moore have presented an algorithm that is fast in practice, but O(mn) in the worst case. Let us call it BM.
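The naive sliding-window method described above can be sketched as follows. This is an illustration under stated assumptions, not code from the paper: the name `naive_search` and its output-buffer interface are hypothetical, and C's 0-based indexing is used, so a reported start position s corresponds to the paper's window end position w = s + m.

```c
#include <stddef.h>

/* Brute-force exact matching. Writes up to max_out 0-based start positions
   of occurrences of p[0..m-1] in t[0..n-1] into out and returns the total
   number of occurrences found. */
size_t naive_search(const char *t, size_t n, const char *p, size_t m,
                    size_t *out, size_t max_out)
{
    size_t count = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t s = 0; s + m <= n; ++s) {   /* n - m + 1 window positions */
        size_t i = 0;
        /* Compare left to right; stop at the first mismatch. */
        while (i < m && p[i] == t[s + i])
            ++i;
        if (i == m) {                       /* all m character pairs matched */
            if (count < max_out)
                out[count] = s;
            ++count;
        }
    }
    return count;
}
```

Searching for "aba" in "abababa" reports the start positions 0, 2 and 4. The nested loops make the O(mn) worst case visible: the outer loop visits every window, and the inner loop may run up to m steps in each.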
Subsequently, many variants of BM have also been proposed (e.g. [7, 2, 6]). The main innovation in BM is to check the window in reverse order. That is, when the window is at position w, the characters P_i and T_{w-m+i} are compared from right to left in the order i = m . . . 1. This makes it possible to use a failure function that can often skip over several text characters. BM actually uses two different failure functions, δ1 and δ2. The former is based on the so-called bad character rule, and the latter on the so-called good suffix rule.

The failure function δ1 of BM is very simple to precompute. But the original preprocessing algorithm given in [4] for the δ2 function has been viewed as somewhat mysterious and incomprehensible [8]. Stomp even states that the algorithm is known to be "notoriously difficult" [14]. An example of this is that the algorithms shown in [4, 10] were slightly erroneous, and a corrected version was given without any detailed explanations by Rytter [12].

A formal proof of the correctness of the preprocessing algorithm was given recently by Stomp [14]. He analysed the particular version shown in [3, 1], and also found and corrected a small error that concerns running out of bounds of an array. Stomp's proof is based on linear temporal logic and it is a posteriori in nature: he first shows the algorithm, and then proceeds to prove that the given algorithm computes δ2 correctly. The proof is also fairly technical, and does not shed too much light