But it is also possible to derive Zipf's law from a simple statistical model ([3]). For example, Zipf's Law can be derived for word occurrences in natural language when it is assumed that words are drawn randomly from some distribution. In practice, however, words are thoughtfully selected by the author; yet in the long run this selection process may conform to such a statistical description.

Another experimental law of nature is Heaps' Law ([2]), which describes the growth in the number of unique elements (also referred to as the number of records) when elements are drawn from some distribution. Heaps' Law states that this number will grow according to $\alpha k^{\beta}$ for some application-dependent constants $\alpha$ and $\beta$, where $0 < \beta < 1$. In the case of word occurrences in natural language, Heaps' Law predicts the vocabulary size from the size of a text.

In [1] Heaps' Law and the generalized Zipf's Law are related. It is shown that, under a plausible assumption, if both Heaps' Law and the generalized Zipf's Law hold, then $\beta = 1/\theta$. The argument is as follows. Heaps' Law predicts the vocabulary size in a text of a given size. It is assumed that the frequency of the least frequent word in this text is $\Theta(1)$. As a result, the prediction of this frequency as obtained from the generalized Zipf's Law is also $\Theta(1)$. Working out the details leads to $\beta = 1/\theta$.

In this paper, we take another approach. We assume that text is generated according to the Mandelbrot distribution, and derive Heaps' Law for the average vocabulary size in a text of a given length. This is done by a statistical analysis, leading to a rather intractable recurrence relation. As a consequence, Heaps' Law can also be regarded in a natural way as a complexity estimate. By applying techniques from complexity theory, restricting ourselves to first-order terms, Heaps' Law is obtained. Note that by involving second-order terms, a more advanced formulation of Heaps' Law may be obtained.

In figure 1 we see how nicely the function $\alpha k^{\beta}$ can be fitted against the function describing the
average number of records as a function of the number of drawings in the case of a set of 100 elements, while in figure 2 a set of 1000 elements is taken. This will be explained in more detail in section 2. It will be clear from these figures, however, that the approximation provided by Heaps' Law is not valid everywhere, as obviously the number of records is bounded by the total number of events, while a power function will exceed this number eventually. In order to express this limited validity of Heaps' Law, we also focus on the validity area of the approximations in our analysis. The validity area is described rather defensively; in practice the area will be larger.

[Figure 1: $N_k$ versus Heaps' law fit for $N = 100$. Plot title: Heaps Law: N = 100, theta = 1.75, c = 0.30, a_N = 0.70. Curves N(k) and f(k) plotted against k.]

∗ Technical Report NIII-R0302

In section 2 we present a statistical model for the vocabulary size in a text, i.e. the average number of unique occurrences after a series of drawings. In section 3 we solve the resulting equation, leading to Heaps' Law. We also give bounds for the validity area of the approximations. In section 4 we draw some conclusions and discuss further research.

2 A probabilistic model for Heaps' law

Let $W$ be a set of $N$ words numbered $1, \ldots, N$, and let $p_i$ be the probability that word $i$ is chosen. The underlying text model is that words are successively taken from the set $W$ according to this probability distribution. We will be interested in the asymptotic behavior of the expected resulting number of different words taken.

After taking $k$ words $w_1, \ldots, w_k$ from $W$, let $D_k = \{w_1, \ldots, w_k\}$ be the set of different words, and let $n_k$ be the number of such words: $n_k = \# D_k$. Then:

$$
\begin{aligned}
\mathrm{Prob}(n_k = a) &= \mathrm{Prob}(n_{k-1} = a-1 \wedge w_k \notin D_{k-1}) + \mathrm{Prob}(n_{k-1} = a \wedge w_k \in D_{k-1}) \\
&= \mathrm{Prob}(n_{k-1} = a-1) \cdot \mathrm{Prob}(w_k \notin D_{k-1}) + \mathrm{Prob}(n_{k-1} = a) \cdot \mathrm{Prob}(w_k \in D_{k-1})
\end{aligned}
$$
[Figure 2: $N_k$ versus Heaps' law fit for $N = 1000$. Plot title: Heaps Law: N = 1000, theta = 1.75, c = 0.30, a_N = 0.68. Curves N(k) and f(k) plotted against k.]

Then obviously:

$$\mathrm{Prob}(w_k \notin D_{k-1}) = \sum_{i \in W} \mathrm{Prob}(w_k = i \wedge i \notin D_{k-1}) \qquad (1)$$

$$= \sum_{i \in W} p_i (1-p_i)^{k-1} \qquad (2)$$

Let $S_k = \sum_{i \in W} p_i (1-p_i)^{k-1}$, and $M_n = \sum_{i \in W} (1-p_i)^n$. We will refer to $M_n$ as the $n$-th reverse moment of the probability distribution. Then we obtain $S_k = M_{k-1} - M_k$, as:

$$\sum_{i \in W} p_i (1-p_i)^{k-1} = \sum_{i \in W} \left( -(1-p_i)^{k} + (1-p_i)^{k-1} \right)$$

Let $N(k,a) = \mathrm{Prob}(n_k = a)$, then we have $\sum_{a=1}^{k} N(k,a) = 1$. From the equations above we get the following recurrence relation:

$$
\begin{aligned}
N(1,1) &= 1 \\
N(k,a) &= 0 && \text{if } k < a \\
N(k,a) &= N(k-1,a-1) \cdot S_k + N(k-1,a) \cdot (1 - S_k) && \text{if } k \geq a
\end{aligned}
$$

We will be interested in the expected number of different words $N_k$ after taking $k$ words randomly from the set $W$ of words. Using this recurrence relation, we get for $N_k$ ($k > 1$) the following recurrence relation:

$$
\begin{aligned}
N_k &= \sum_{a=1}^{k} a \cdot N(k,a) \\
&= \sum_{a=1}^{k} a \cdot \bigl( N(k-1,a-1) \cdot S_k + N(k-1,a) \cdot (1-S_k) \bigr) \\
&= S_k \sum_{a=1}^{k} a \cdot N(k-1,a-1) + (1-S_k) \sum_{a=1}^{k} a \cdot N(k-1,a) \\
&= S_k \left( \sum_{a=1}^{k} (a-1) \cdot N(k-1,a-1) + \sum_{a=1}^{k} N(k-1,a-1) \right) + (1-S_k) N_{k-1} \\
&= S_k \cdot N_{k-1} + S_k \cdot \sum_{a=1}^{k-1} N(k-1,a) + (1-S_k) N_{k-1} \\
&= N_{k-1} + S_k
\end{aligned}
$$

As a consequence:

Lemma 1 The expected number of different words in a random selection of $k$ words is $N_k = N - M_k$.

Proof: $N_k = N_0 + \sum_{j=1}^{k} S_j = M_0 - M_k = N - M_k$, using $N_0 = 0$ and $S_j = M_{j-1} - M_j$.

In order to estimate the asymptotic behavior of $N_k$, we will estimate $M_k$ in the next section.

[Figure 3: $N_k$ versus Heaps' law fit for $N = 100$. Plot title: Heaps Law: N = 100, theta = 1.75, c = 0.30, a_N = 0.70. Curves N(k) and f(k) plotted against k, with k running up to 10000.]

Applying curve fitting, we get some idea of the quality of the approximation provided by Heaps' Law. In figure 1 we see how Heaps' Law fits when 5000 elements are drawn from a set of 100 elements, while in figure 3 more drawings are taken (up to 10000). We see how the fit deteriorates. The same can be seen for larger values of $N$: for example, figure 2 shows the fit after 50000 drawings, and figure 4 the worse situation after 500000 drawings.
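The relation $N_k = N - M_k$ of Lemma 1 and the curve fitting described above can be tried out numerically. The following is a minimal Python sketch, not the procedure used for the figures: the function names are illustrative, the parameters $N = 100$, $\theta = 1.75$, $c = 0.30$ are taken from the plot titles, the distribution is the Mandelbrot distribution $p_i = a_N / (c+i)^{\theta}$ of section 3, and the least-squares fit on logarithms is an assumption, since the text does not state which fitting method was used.

```python
import math
import random


def mandelbrot_probs(n_words, theta, c):
    """Return (probs, a_N) with p_i = a_N / (c + i)**theta for i = 1..N."""
    weights = [(c + i) ** -theta for i in range(1, n_words + 1)]
    a_n = 1.0 / sum(weights)  # normalisation constant, so that sum(p_i) == 1
    return [a_n * w for w in weights], a_n


def expected_distinct(probs, k):
    """Lemma 1: N_k = N - M_k, where M_k = sum_i (1 - p_i)**k."""
    m_k = sum((1.0 - p) ** k for p in probs)
    return len(probs) - m_k


def simulated_distinct(probs, k, runs, rng):
    """Monte Carlo estimate: mean number of distinct words over `runs` texts of length k."""
    words = list(range(len(probs)))
    total = 0
    for _ in range(runs):
        total += len(set(rng.choices(words, weights=probs, k=k)))
    return total / runs


def fit_power_law(ks, values):
    """Fit values ~ alpha * k**beta by ordinary least squares on the logarithms.

    Assumption: this fitting method is chosen for illustration only.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta


if __name__ == "__main__":
    probs, a_n = mandelbrot_probs(n_words=100, theta=1.75, c=0.30)
    print(f"a_N = {a_n:.2f}")

    # Compare the closed form of Lemma 1 with a direct simulation of the text model.
    rng = random.Random(0)
    for k in (10, 100, 1000):
        exact = expected_distinct(probs, k)
        approx = simulated_distinct(probs, k, runs=200, rng=rng)
        print(f"k={k:5d}  N - M_k = {exact:7.3f}  simulated = {approx:7.3f}")

    # Fit alpha * k**beta to N_k over the range of drawings shown in figure 1.
    ks = list(range(1, 5001, 50))
    alpha, beta = fit_power_law(ks, [expected_distinct(probs, k) for k in ks])
    print(f"fit: alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Extending `ks` to a larger number of drawings is one way to observe the deterioration of the fit discussed above, since $N_k$ is bounded by $N$ while the fitted power function is not.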
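[Figure 4: $N_k$ versus Heaps' law fit for $N = 1000$. Plot title: Heaps Law: N = 1000, theta = 1.75, c = 0.30, a_N = 0.68. Curves N(k) and f(k) plotted against k, with k running up to 500000.]

3 Approximating reverse moments for Mandelbrot distribution

The Mandelbrot distribution provides a reasonable approximation of the frequency of word usage in natural language. The Mandelbrot distribution assumes words to be ranked according to their frequency of usage. The probability of the word ranked at position $i$ then corresponds to:

$$p_i = \frac{a_N}{(c+i)^{\theta}}$$

for some constants $c \geq 0$ and $\theta$, where $a_N$ is such that $\sum_{i=1}^{N} p_i = 1$. Usually the constant $\theta$ ranges over $[1,2]$. A special case is Zipf's Law, which results from the Mandelbrot distribution by choosing $\theta = 1$ and $c = 0$. For the normalization constant we have:

Lemma 2 $\theta > 1 \implies a_N = \Theta(1)$

Let $t(x) = a_N (c+x)^{-\theta}$, then

Lemma 3 For $k > 0$ the function $\varphi_k(x) = (1 - t(x))^k$ is increasing for $x \geq 1$.
• $\lim_{k \to \infty} \varphi_k(x) = 0$
• $\lim_{x \to \infty} \varphi_k(x) = 1$

The reverse moments may be estimated using the integral criterion:

Lemma 4

$$\sum_{i=1}^{N} \varphi_k(i) = \int_{1}^{N} \varphi_k(x)\,dx + \epsilon$$

with $\varphi_k(1) \leq \epsilon \leq \varphi_k(N) = \left(1 - a_N (c+N)^{-\theta}\right)^k = o(1)$.

The error of replacing summation by integration thus decreases exponentially in both $k$ and $N$. Next we focus on estimating $\int_{1}^{N} \varphi_k(x)\,dx$. By substituting $t = a_N (c+x)^{-\theta}$, leading to $dx = -A t^{\mu}\,dt$ where $\mu = -1-\beta$, $A = \beta a_N^{\beta}$ and $\beta = 1/\theta$, we get:

Lemma 5

$$\int_{1}^{N} \left(1 - \frac{a_N}{(c+x)^{\theta}}\right)^{k} dx = A \int_{t_N}^{t_1} (1-t)^k t^{\mu}\,dt$$

with $t_1 = a_N (c+1)^{-\theta}$, $t_N = a_N (c+N)^{-\theta}$.

We will process this outcome by applying partial integration:

Lemma 6

$$A \int_{t_N}^{t_1} (1-t)^k t^{\mu}\,dt = \left[ A\,\frac{(1-t)^k t^{\mu+1}}{\mu+1} \right]_{t_N}^{t_1} - A\theta k \int_{t_N}^{t_1} (1-t)^{k-1} t^{\mu+1}\,dt$$

The first term of the right-hand side approximates the number $N$ of words in the set $W$:

Lemma 7

$$\left[ A\,\frac{(1-t)^k t^{\mu+1}}{\mu+1} \right]_{t_N}^{t_1} = (c+N)\,\varphi_k(N) - (c+1)\,\varphi_k(1) = N + \Theta\!\left(\frac{k}{(c+N)^{\theta-1}}\right)$$

Proof:

$$
\begin{aligned}
\left[ A\,\frac{(1-t)^k t^{\mu+1}}{\mu+1} \right]_{t_N}^{t_1}
&= A\,\frac{(1-t_1)^k t_1^{\mu+1}}{\mu+1} - A\,\frac{(1-t_N)^k t_N^{\mu+1}}{\mu+1} \\
&= -(c+1)(1-t_1)^k + (c+N)(1-t_N)^k \\
&= (c+N)\,\varphi_k(N) - (c+1)\,\varphi_k(1)
\end{aligned}
$$

Here we used $A/(\mu+1) = -a_N^{\beta}$ and $t_1^{\mu+1} = a_N^{-\beta}(c+1)$, and likewise for $t_N$. The result follows by the observation:

$$\varphi_k(N) = \left(1 - \frac{a_N}{(c+N)^{\theta}}\right)^{k} = 1 + \Theta\!\left(\frac{k}{(c+N)^{\theta}}\right)$$

For small values of $k$, the term $N \varphi_k(N)$ is around $N$, but for larger values, this term decreases exponentially.

Next we focus on $A\theta k \int_{t_N}^{t_1} (1-t)^{k-1} t^{\mu+1}\,dt$.

Lemma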
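8

$$A\theta k \int_{t_N}^{t_1} (1-t)^{k-1} t^{\mu+1}\,dt = A\theta k \int_{0}^{1} (1-t)^{k-1} t^{\mu+1}\,dt - \epsilon_2$$

where

$$\epsilon_2 = \Theta\!\left(\frac{k}{(c+N)^{\theta-1}}\right) + O\!\left(k^2 (1-t_1)^{k-1}\right)$$

The latter term is exponentially decreasing in $k$.

Proof: Obviously

$$\epsilon_2 = A\theta k \int_{0}^{t_N} (1-t)^{k-1} t^{\mu+1}\,dt + A\theta k \int_{t_1}^{1} (1-t)^{k-1} t^{\mu+1}\,dt$$

For the first term we notice that $t_N$ will be almost 0 for large $N$ and thus $1-t = \Theta(1)$. Consequently:

$$A\theta k \int_{0}^{t_N} (1-t)^{k-1} t^{\mu+1}\,dt = \Theta\!\left(A\theta k \int_{0}^{t_N} t^{\mu+1}\,dt\right) = \Theta\!\left(\frac{a_N^{\beta}\, t_N^{1-\beta}}{1-\beta}\, k\right) = \Theta\!\left(\frac{a_N}{1-\beta} \cdot \frac{k}{(c+N)^{\theta-1}}\right)$$

For the second term another application of partial integration is performed:

$$A\theta k \int_{t_1}^{1} (1-t)^{k-1} t^{\mu+1}\,dt = A\theta k \left[ \frac{(1-t)^{k-1} t^{\mu+2}}{\mu+2} \right]_{t_1}^{1} + \frac{A\theta k(k-1)}{\mu+2} \int_{t_1}^{1} (1-t)^{k-2} t^{\mu+2}\,dt$$

Both terms are exponentially decreasing in $k$. For the first part we have:

$$A\theta k \left[ \frac{(1-t)^{k-1} t^{\mu+2}}{\mu+2} \right]_{t_1}^{1} = -\frac{A\theta\, t_1^{\mu+2}}{\mu+2} \cdot k (1-t_1)^{k-1}$$

Note that $k(1-t_1)^{k-1}$ decreases exponentially to zero. This is also the case for the second part:

$$\frac{A\theta k(k-1)}{\mu+2} \int_{t_1}^{1} (1-t)^{k-2} t^{\mu+2}\,dt \leq \frac{A\theta k(k-1)}{\mu+2} \int_{t_1}^{1} (1-t)^{k-2}\,dt \leq \frac{A\theta}{\mu+2} \cdot k(k-1)(1-t_1)^{k-1}$$

We proceed with $A\theta k \int_{0}^{1} (1-t)^{k-1} t^{\mu+1}\,dt$, and recognize this integral as the Beta-function $B(\mu+2, k)$.

Lemma 9

$$A\theta k \int_{0}^{1} (1-t)^{k-1} t^{\mu+1}\,dt = A\theta k\, B(k, \mu+2)$$

The Beta-function can be expressed in terms of the Gamma-function:

$$B(k, \mu+2) = \frac{\Gamma(k)\,\Gamma(\mu+2)}{\Gamma(k+\mu+2)}$$

Note that this expression is only valid for $\mu+2 \neq 0$, which is equivalent to $\theta \neq 1$. In this paper we restrict ourselves to this case. Substituting Stirling's approximation of the $\Gamma$-function, $\Gamma(x+1) = \sqrt{2\pi x}\, x^x e^{-x} (1+o(1))$ (see [5]), on the terms $\Gamma(k)$ and $\Gamma(k+\mu+2)$ yields:

$$
\begin{aligned}
A\theta k\, \frac{\Gamma(k)}{\Gamma(k+\mu+2)}\, \Gamma(\mu+2)
&\sim A\theta k\, \frac{\sqrt{2\pi}\,(k-1)^{k-1} e^{-k+1} (k-1)^{1/2}}{\sqrt{2\pi}\,(k+\mu+1)^{k+\mu+1} e^{-k-\mu-1} (k+\mu+1)^{1/2}}\, \Gamma(\mu+2) \\
&= a_N^{\beta} \cdot \left(\frac{k-1}{k+\mu+1}\right)^{k} \cdot \left(\frac{k-1}{k+\mu+1}\right)^{1/2} \cdot e^{\mu+2} \cdot (k+\mu+1)^{-\mu-1} \cdot \left(1 - \frac{1}{k}\right)^{-1} \cdot \Gamma(\mu+2) \\
&\sim a_N^{\beta}\, e^{\mu+2}\, \Gamma(\mu+2)\, (k+\mu+1)^{-\mu-1} \\
&\sim a_N^{\beta}\, e^{\mu+2}\, \Gamma(\mu+2)\, k^{-\mu-1} \\
&= a_N^{\beta}\, e^{1-\beta}\, \Gamma(1-\beta) \cdot k^{\beta}
\end{aligned}
$$

Summarizing we have

Lemma 10

$$\sum_{i=1}^{N} \varphi_k(i) = N - \alpha \cdot k^{\beta} \cdot (1+o(1)) + \Theta\!\left(\frac{k}{(c+N)^{\theta-1}}\right)$$

where $\alpha = a_N^{\beta}\, e^{1-\beta}\, \Gamma(1-\beta)$.

This leads to the main result:

Lemma 11 The expected number of different words in a random selection of $k$ words is

$$N_k = \alpha \cdot k^{\beta} \cdot (1+o(1)) + \Theta\!\left(\frac{k}{(c+N)^{\theta-1}}\right)$$

And thus we have shown:

Theorem 1 (Heaps' Law) $N_k = \alpha \cdot k^{\beta}$, with validity interval restricted to $k = \Theta((c+N)^{\theta})$.

4 Conclusions and further research

In this paper we have derived Heaps' Law from the Mandelbrot distribution, and provided a validity area for Heaps' Law. As a next step, a second-order approximation may be employed, providing a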
sharper formulation for Heaps' Law, and a larger validity area. Furthermore, other distributions may be examined, leading to Heaps' Criterion as a sufficient condition for a distribution to imply Heaps' Law.

References

[1] Ricardo A. Baeza-Yates and Gonzalo Navarro. Block addressing indices for approximate text retrieval. Journal of the American Society of Information Science, 51(1):69–82, 2000.

[2] H. S. Heaps. Information retrieval: Computational and theoretical aspects. pages 206–208, 1978.

[3] Wentian Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6):1842–1845, 1992.

[4] B. Mandelbrot. The Pareto-Levy law and the distribution of income. International Economic Review, I, pages 79–106, 1960.

[5] Schaum. Schaums Handbook of Formulas and Tables.

[6] G. Zipf. Human Behavior and the Principle of Least Effort. 1949.

[7] Xavier Gabaix. Zipf's law for cities: An explanation.