Multinomial Randomness Models for Retrieval with Document Fields

Vassilis Plachouras (1) and Iadh Ounis (2)
(1) Yahoo! Research, Barcelona, Spain
(2) University of Glasgow, Glasgow, UK
vassilis@, ounis@

Abstract. Document fields, such as the title or the headings of a document, offer a way to consider the structure of documents for retrieval. Most of the proposed approaches in the literature employ either a linear combination of scores assigned to different fields, or a linear combination of frequencies in the term frequency normalisation component. In the context of the Divergence From Randomness framework, we have a sound opportunity to integrate document fields in the probabilistic randomness model. This paper introduces novel probabilistic models for incorporating fields in the retrieval process using a multinomial randomness model and its information theoretic approximation. The evaluation results from experiments conducted with a standard TREC Web test collection show that the proposed models perform as well as a state-of-the-art field-based weighting model, while at the same time, they are theoretically founded and more extensible than current field-based models.

1 Introduction

Document fields provide a way to incorporate the structure of a document in Information Retrieval (IR) models. In the context of HTML documents, the document fields may correspond to the contents of particular HTML tags, such as the title, or the heading tags. The anchor text of the incoming hyperlinks can also be seen as a document field.
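The idea that HTML tags induce retrieval fields can be sketched with a small parser. This is an illustrative sketch only: the tag-to-field mapping below (title, heading, body) is an assumption for the example, not the exact field definition used later in the paper.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Split an HTML document into simple retrieval fields."""
    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "heading": [], "body": []}
        self._open = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)

    def handle_endtag(self, tag):
        # pop the most recent matching open tag, if any
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == tag:
                del self._open[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._open:
            self.fields["title"].append(text)
        elif any(t in self._open for t in ("h1", "h2", "h3")):
            self.fields["heading"].append(text)
        else:
            self.fields["body"].append(text)

doc = ("<html><head><title>DFR models</title></head>"
       "<body><h1>Intro</h1><p>Fields matter.</p></body></html>")
parser = FieldExtractor()
parser.feed(doc)
print(parser.fields)
# -> {'title': ['DFR models'], 'heading': ['Intro'], 'body': ['Fields matter.']}
```

In practice, each occurrence of a term would then be counted once, in exactly one field, which is the assumption the models below rely on.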
In the case of email documents, the fields may correspond to the contents of the email's subject, date, or to the email address of the sender [9]. It has been shown that using document fields for Web retrieval improves the retrieval effectiveness [17,7]. The text and the distribution of terms in a particular field depend on the function of that field. For example, the title field provides a concise and short description for the whole document, and terms are likely to appear once or twice in a given title [6]. The anchor text field also provides a concise description of the document, but the number of terms depends on the number of incoming hyperlinks of the document. In addition, anchor texts are not always written by the author of a document, and hence, they may enrich the document representation with alternative terms.

The combination of evidence from the different fields in a retrieval model requires special attention. Robertson et al. [14] pointed out that the linear combination of scores, which has been the approach mostly used for the combination of fields, is difficult to interpret due to the non-linear relation between the assigned scores and the term frequencies in each of the fields. Hawking et al. [5] showed that the term frequency normalisation applied to each field depends on the nature of the corresponding field. Zaragoza et al. [17] introduced a field-based version of BM25, called BM25F, which applies term frequency normalisation and weighting of the fields independently. Macdonald et al. [7] also introduced normalisation 2F in the Divergence From Randomness (DFR) framework [1] for performing independent term frequency normalisation and weighting of fields. In both cases of BM25F and the DFR models that employ normalisation 2F, there is the assumption that the occurrences of terms in the fields follow the same distribution, because the combination of fields takes place in the term frequency normalisation component, and not in the probabilistic weighting model.

[G. Amati, C. Carpineto, and G. Romano (Eds.): ECIR 2007, LNCS 4425, pp. 28-39, 2007. (c) Springer-Verlag Berlin Heidelberg 2007]

In this work, we introduce weighting models where the combination of evidence from the different fields does not take place in the term frequency normalisation part of the model, but instead constitutes an integral part of the probabilistic randomness model. We propose two DFR weighting models that combine the evidence from the different fields using a multinomial distribution, and its information theoretic approximation. We evaluate the performance of the introduced weighting models using the standard .Gov TREC Web test collection. We show that the models perform as well as the state-of-the-art field-based model PL2F, while at the same time, they employ a theoretically founded and more extensible combination of evidence from fields.

The remainder of this paper is structured as follows. Section 2 provides a description of the DFR framework, as well as the related field-based weighting models. Section 3 introduces the proposed multinomial DFR weighting models. Section 4 presents the evaluation of the proposed weighting models with a standard Web test collection. Sections 5 and 6 close the paper with a discussion related to the proposed models and the obtained results, and some concluding remarks drawn from this work, respectively.

2 Divergence from Randomness Framework and Document Fields

The Divergence From Randomness (DFR) framework [1] generates a family of probabilistic weighting models for IR. It provides a great extent of flexibility in the sense that the generated models are modular, allowing for the evaluation of new assumptions in a principled way. The remainder of this section provides a description of the DFR framework (Section 2.1), as well as a brief description of the combination of evidence from different document fields in the context of the DFR framework (Section 2.2).

2.1 DFR Models

The weighting models of the Divergence From
Randomness framework are based on combinations of three components: a randomness model RM; an information gain model GM; and a term frequency normalisation model. Given a collection D of documents, the randomness model RM estimates the probability P_RM(t ∈ d|D) of having tf occurrences of a term t in a document d, and the importance of t in d corresponds to the informative content −log2(P_RM(t ∈ d|D)). Assuming that the sampling of terms corresponds to a sequence of independent Bernoulli trials, the randomness model RM is the binomial distribution:

  P_B(t ∈ d|D) = \binom{TF}{tf} p^{tf} (1 − p)^{TF − tf}    (1)

where TF is the frequency of t in the collection D, p = 1/N is a uniform prior probability that the term t appears in the document d, and N is the number of documents in the collection D. A limiting form of the binomial distribution is the Poisson distribution P:

  P_B(t ∈ d|D) ≈ P_P(t ∈ d|D) = (λ^{tf} / tf!) · e^{−λ},  where λ = TF · p = TF/N    (2)

The information gain model GM estimates the informative content 1 − P_risk of the probability P_risk that a term t is a good descriptor for a document. When a term t appears many times in a document, then there is very low risk in assuming that t describes the document. The information gain, however, from any future occurrences of t in d is lower. For example, the term 'evaluation' is likely to have a high frequency in a document about the evaluation of IR systems. After the first few occurrences of the term, however, each additional occurrence of the term 'evaluation' provides a diminishing additional amount of information. One model to compute the probability P_risk is the Laplace after-effect model:

  P_risk = tf / (tf + 1)    (3)

P_risk estimates the probability of having one more occurrence of a term in a document, after having seen tf occurrences already.

The third component of the DFR framework is the term frequency normalisation model, which adjusts the frequency tf of the term t in d, given the length l of d and the average document length l̄ in D. Normalisation 2 assumes a decreasing density function of the normalised term frequency with respect to the document length l. The normalised term frequency tfn is given as follows:

  tfn = tf · log2(1 + c · l̄ / l)    (4)

where c is a hyperparameter, i.e. a tunable parameter. Normalisation 2 is employed in the framework by replacing tf in Equations (2) and (3) with tfn.

The relevance score w_{d,q} of a document d for a query q is given by:

  w_{d,q} = Σ_{t ∈ q} qtw · w_{d,t},  where w_{d,t} = (1 − P_risk) · (−log2 P_RM)    (5)

where w_{d,t} is the weight of the term t in document d, qtw = qtf / qtf_max, qtf is the frequency of t in the query q, and qtf_max is the maximum qtf in q. If P_RM is estimated using the Poisson randomness model, P_risk is estimated using the Laplace after-effect model, and tfn is computed according to normalisation 2, then the resulting weighting model is denoted by PL2. The factorial is approximated using Stirling's formula:

  tf! = √(2π) · tf^{tf + 0.5} · e^{−tf}

The DFR framework generates a wide range of weighting models by using different randomness models, information gain models, or term frequency normalisation models. For example, the next section describes how normalisation 2 is extended to handle the normalisation and weighting of term frequencies for different document fields.

2.2 DFR Models for Document Fields

The DFR framework has been extended to handle multiple document fields, and to apply per-field term frequency normalisation and weighting. This is achieved by extending normalisation 2, and introducing normalisation 2F [7], which is explained below.

Suppose that a document has k fields. Each occurrence of a term can be assigned to exactly one field. The frequency tf_i of term t in the i-th field is normalised and weighted independently of the other fields. Then, the normalised and weighted term frequencies are combined into one pseudo-frequency tfn_{2F}:

  tfn_{2F} = Σ_{i=1}^{k} w_i · tf_i · log2(1 + c_i · l̄_i / l_i)    (6)

where w_i is the relative importance or weight of the i-th field, tf_i is the frequency of t in the i-th field of document d, l_i is the length of the i-th field in d, l̄_i is the average length of the i-th field in the collection D, and c_i is a hyperparameter for the i-th field. The above formula corresponds to normalisation 2F. The weighting model PL2F corresponds to PL2 using tfn_{2F} as given in Equation (6). The well-known BM25 weighting model has also been extended in a similar way to BM25F [17].

3 Multinomial Randomness Models

This section introduces DFR models which, instead of extending the term frequency normalisation component, as described in the previous section, use document fields as part of the randomness model. While the weighting model PL2F has been shown to perform particularly well [7,8], the document fields are not an integral part of the randomness weighting model. Indeed, the combination of evidence from the different fields takes place as a linear combination of normalised frequencies in the term frequency normalisation component. This implies that the term frequencies are drawn from the same distribution, even though the nature of each field may be different.

We propose two weighting models which, instead of assuming that term frequencies in fields are drawn from the same distribution, use multinomial distributions to incorporate document fields in a theoretically driven way. The first one is based on the multinomial distribution (Section 3.1), and the second one is based on an information theoretic approximation of the multinomial distribution (Section 3.2).

3.1 Multinomial Distribution

We employ the multinomial distribution to compute the probability that a term appears a given number of times in each of the fields of a document. The formula of the weighting model is derived as follows. Suppose that a document d has k fields. The probability that a term occurs tf_i times in the i-th field f_i is given as follows:

  P_M(t ∈ d|D) = \binom{TF}{tf_1 \, tf_2 \, … \, tf_k \, tf'} · p_1^{tf_1} p_2^{tf_2} … p_k^{tf_k} p'^{tf'}    (7)

In the above equation, TF is the frequency of term t in the
collection, p_i = 1/(k · N) is the prior probability that a term occurs in a particular field of document d, and N is the number of documents in the collection D. The frequency tf' = TF − Σ_{i=1}^{k} tf_i corresponds to the number of occurrences of t in other documents than d. The probability p' = 1 − k · 1/(k · N) = (N − 1)/N corresponds to the probability that t does not appear in any of the fields of d.

The DFR weighting model is generated using the multinomial distribution from Equation (7) as a randomness model, the Laplace after-effect from Equation (3), and replacing tf_i with the normalised term frequency tfn_i, obtained by applying normalisation 2 from Equation (4). The relevance score of a document d for a query q is computed as follows:

  w_{d,q} = Σ_{t ∈ q} qtw · w_{d,t}
          = Σ_{t ∈ q} qtw · (1 − P_risk) · (−log2 P_M(t ∈ d|D))
          = Σ_{t ∈ q} [ qtw / (Σ_{i=1}^{k} tfn_i + 1) ] · [ −log2(TF!) + Σ_{i=1}^{k} ( log2(tfn_i!) − tfn_i · log2(p_i) ) + log2(tfn'!) − tfn' · log2(p') ]    (8)

where qtw is the weight of a term t in query q, tfn' = TF − Σ_{i=1}^{k} tfn_i, tfn_i = tf_i · log2(1 + c_i · l̄_i / l_i) for the i-th field, and c_i is the hyperparameter of normalisation 2 for the i-th field. The weighting model introduced in the above equation is denoted by ML2, where M stands for the multinomial randomness model, L stands for the Laplace after-effect model, and 2 stands for normalisation 2.

Before continuing, it is interesting to note two issues related to the introduced weighting model ML2, namely setting the relative importance, or weight, of fields in the document representation, and the computation of factorials.

Weights of fields. In Equation (8), there are two different ways to incorporate weights for the fields of documents. The first one is to multiply each of the normalised term frequencies tfn_i with a constant w_i, in a similar way to normalisation 2F (see Equation (6)): tfn_i := w_i · tfn_i. The second way is to adjust the prior probabilities p_i of fields, in order to increase the scores assigned to terms occurring in fields with low prior probabilities: p_i := p_i^{w_i}. Indeed, the score assigned to a query term occurring in a field with low probability is high, due to the factor −tfn_i · log2(p_i) in Equation (8).

Computing factorials. As mentioned in Section 2.1, the factorial in the weighting model PL2 is approximated using Stirling's formula. A different method to approximate the factorial is to use the approximation of Lanczos to the Γ function [12, p. 213], which has a lower approximation error than Stirling's formula. Indeed, preliminary experimentation with ML2 has shown that using Stirling's formula affects the performance of the weighting model, due to the accumulation of the approximation error from computing the factorial k + 2 times (k is the number of fields). This is not the case for the Poisson-based weighting models PL2 and PL2F, where there is only one factorial computation for each query term (see Equation (2)). Hence, the computation of factorials in Equation (8) is performed using the approximation of Lanczos to the Γ function.

3.2 Approximation to the Multinomial Distribution

The DFR framework generates different models by replacing the binomial randomness model with its limiting forms, such as the Poisson randomness model. In this section, we introduce a new weighting model by replacing the multinomial randomness model in ML2 with the following information theoretic approximation [13]:

  [TF! / (tf_1! tf_2! ··· tf_k! tf'!)] · p_1^{tf_1} p_2^{tf_2} ··· p_k^{tf_k} p'^{tf'} ≈ [ 1 / √((2π TF)^k) ] · 2^{−TF · D(tf_i/TF, p_i)} / √(p_{t_1} p_{t_2} ··· p_{t_k} p_{t'})    (9)

where p_{t_i} = tf_i/TF and p_{t'} = tf'/TF. D(tf_i/TF, p_i) corresponds to the information theoretic divergence of the probability p_{t_i} = tf_i/TF that a term occurs in a field, from the prior probability p_i of the field:

  D(tf_i/TF, p_i) = Σ_{i=1}^{k} (tf_i/TF) · log2( tf_i / (TF · p_i) ) + (tf'/TF) · log2( tf' / (TF · p') )    (10)

where tf' = TF − Σ_{i=1}^{k} tf_i. Hence, the multinomial randomness model M in the weighting model ML2 can be replaced by its approximation from Equation (9):

  w_{d,q} = Σ_{t ∈ q} [ qtw / (Σ_{i=1}^{k} tfn_i + 1) ] · [ (k/2) · log2(2π TF) + Σ_{i=1}^{k} ( tfn_i · log2( (tfn_i/TF) / p_i ) + (1/2) · log2(tfn_i/TF) ) + tfn' · log2( (tfn'/TF) / p' ) + (1/2) · log2(tfn'/TF) ]    (11)

The above
model is denoted by M_D L2. The definitions of the variables involved in the above equation have been introduced in Section 3.1.

It should be noted that the information theoretic divergence D(tf_i/TF, p_i) is defined only when tf_i > 0 for 1 ≤ i ≤ k. In other words, D(tf_i/TF, p_i) is defined only when there is at least one occurrence of a query term in all the fields. This is not always the case, because a Web document may contain all the query terms in its body, but it may contain only some of the query terms in its title. To overcome this issue, the weight of a query term t in a document is computed by considering only the fields in which the term t appears.

The weights of different fields can be defined in the same way as in the case of the weighting model ML2, as described in Section 3.1. In more detail, the weighting of fields can be achieved by either multiplying the frequency of a term in a field by a constant, or by adjusting the prior probability of the corresponding field.

An advantage of the weighting model M_D L2 is that, because it approximates the multinomial distribution, there is no need to compute factorials. Hence, it is likely to provide a sufficiently accurate approximation to the multinomial distribution, and it may lead to improved retrieval effectiveness compared to ML2, due to the lower accumulated numerical errors. The experimental results in Section 4.2 will indeed confirm this advantage of M_D L2.

4 Experimental Evaluation

In this section, we evaluate the proposed multinomial DFR models ML2 and M_D L2, and compare their performance to that of PL2F, which has been shown to be particularly effective [7,8]. A comparison of the retrieval effectiveness of PL2F and BM25F has shown that the two models perform equally well on various search tasks and test collections [11], including those employed in this work. Hence, we experiment only with the multinomial models and PL2F. Section 4.1 describes the experimental setting, and Section 4.2 presents the evaluation results.

4.1 Experimental Setting

The evaluation of the proposed models is conducted with the .Gov TREC Web test collection, a crawl of approximately 1.25 million documents from the .gov domain. The .Gov collection has been used in the TREC Web tracks between 2002 and 2004 [2,3,4]. In this work, we employ the tasks from the Web tracks of TREC 2003 and 2004, because they include both informational tasks, such as topic distillation (td2003 and td2004, respectively), as well as navigational tasks, such as named page finding (np2003 and np2004, respectively) and home page finding (hp2003 and hp2004, respectively). More specifically, we train and test for each type of task independently, in order to get insight on the performance of the proposed models [15]. We employ each of the tasks from the TREC 2003 Web track for training the hyperparameters of the proposed models. Then, we evaluate the models on the corresponding tasks from the TREC 2004 Web track.

In the reported set of experiments, we employ k = 3 document fields: the contents of the <BODY> tag of Web documents (b), the anchor text associated with incoming hyperlinks (a), and the contents of the <TITLE> tag (t). More fields can be defined, such as the contents of the heading tags <H1> for example.
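Normalisation 2F from Section 2.2, applied per field, can be sketched as follows. This is an illustrative sketch: the field lengths, average lengths, c values, and weights below are made-up numbers for the example, not the trained settings reported in the experiments.

```python
import math

def tfn(tf, field_len, avg_field_len, c):
    """Normalisation 2 (Equation 4): tfn = tf * log2(1 + c * avg_len / len)."""
    return tf * math.log2(1.0 + c * avg_field_len / field_len)

def tfn_2F(tfs, lens, avg_lens, cs, ws):
    """Normalisation 2F (Equation 6): weighted sum of per-field normalised frequencies."""
    return sum(w * tfn(tf, l, al, c)
               for tf, l, al, c, w in zip(tfs, lens, avg_lens, cs, ws))

# body, anchor text, and title frequencies of one term in one document
pseudo_freq = tfn_2F(tfs=[3, 1, 1],
                     lens=[250, 12, 6],       # field lengths (tokens)
                     avg_lens=[300, 10, 7],   # collection averages
                     cs=[1.0, 10.0, 10.0],    # per-field hyperparameters
                     ws=[1.0, 1.0, 1.0])      # per-field weights
print(pseudo_freq)
```

Note that a field at exactly its average length with c = 1 leaves a single occurrence unchanged (log2(2) = 1), which is a convenient sanity check for the implementation.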
It has been shown, however, that the body, title and anchor text fields are particularly effective for the considered search tasks [11]. The collection of documents is indexed after removing stopwords and applying Porter's stemming algorithm. We perform the experiments in this work using the Terrier IR platform [10].

The proposed models ML2 and M_D L2, as well as PL2F, have a range of hyperparameters, the setting of which can affect the retrieval effectiveness. More specifically, all three weighting models have two hyperparameters for each employed document field: one related to the term frequency normalisation, and a second one related to the weight of that field. As described in Sections 3.1 and 3.2, there are two ways to define the weights of fields for the weighting models ML2 and M_D L2: (i) multiplying the normalised frequency of a term in a field; (ii) adjusting the prior probability p_i of the i-th field. The field weights in the case of PL2F are only defined in terms of multiplying the normalised term frequency by a constant w_i, as shown in Equation (6).

In this work, we consider only the term frequency normalisation hyperparameters, and we set all the weights of fields to 1, in order to avoid having one extra parameter in the discussion of the performance of the weighting models. We set the involved hyperparameters c_b, c_a, and c_t, for the body, anchor text, and title fields, respectively, by directly optimising mean average precision (MAP) on the training tasks from the Web track of TREC 2003. We perform a 3-dimensional optimisation to set the values of the hyperparameters. The optimisation process is the following. Initially, we apply a simulated annealing algorithm, and then, we use the resulting hyperparameter values as a starting point for a second optimisation algorithm [16], to increase the likelihood of detecting a global maximum. For each of the three training tasks, we apply the above optimisation process three times, and we select the
hyperparameter values that result in the highest MAP. We employ the above optimisation process to increase the likelihood that the hyperparameter values result in a global maximum for MAP. Figure 1 shows the MAP obtained by ML2 on the TREC 2003 home page finding topics, for each iteration of the optimisation process. Table 1 reports the hyperparameter values that resulted in the highest MAP for each of the training tasks, and that are used for the experiments in this work.

[Fig. 1. The MAP obtained by ML2 on the TREC 2003 home page finding topics, during the optimisation of the term frequency normalisation hyperparameters]

The evaluation results from the Web tracks of TREC 2003 [3] and 2004 [4] have shown that employing evidence from the URLs of Web documents results in important improvements in retrieval effectiveness for the topic distillation and home page finding tasks, where relevant documents are home pages of relevant Web sites. In order to provide a more complete evaluation of the proposed models for these two types of Web search tasks, we also employ the length in characters of the URL path, denoted by URLpathlen, using the following formula to transform it to a relevance score [17]:

  w_{d,q} := w_{d,q} + ω · κ / (κ + URLpathlen)    (12)

where w_{d,q} is the relevance score of a document. The parameters ω and κ are set by performing a 2-dimensional optimisation as described for the case of the hyperparameters c_i. The resulting values for ω and κ are shown in Table 2.

4.2 Evaluation Results

After setting the hyperparameter values of the proposed models, we evaluate the models with the search tasks from the TREC 2004 Web track [4]. We report the official TREC evaluation measures for each search task: mean average precision (MAP) for the topic distillation task (td2004), and mean reciprocal rank (MRR) of the first correct answer for both the named page finding (np2004) and home page finding (hp2004) tasks.

Table 1. The values of the hyperparameters c_b, c_a, and c_t, for the body, anchor text and title fields, respectively, which resulted in the highest MAP on the training tasks of the TREC 2003 Web track

  ML2
  Task      c_b      c_a        c_t
  td2003    0.0738   4.3268     10.8220
  np2003    0.1802   4.7057     8.4074
  hp2003    0.1926   310.3289   624.3673

  M_D L2
  Task      c_b      c_a        c_t
  td2003    0.2562   10.0383    24.6762
  np2003    1.0216   9.2321     21.3330
  hp2003    0.4093   355.2554   966.3637

  PL2F
  Task      c_b      c_a        c_t
  td2003    0.1400   5.0527     4.3749
  np2003    1.0153   11.9652    9.1145
  hp2003    0.2785   406.1059   414.7778

Table 2. The values of the hyperparameters ω and κ, which resulted in the highest MAP on the training topic distillation (td2003) and home page finding (hp2003) tasks of the TREC 2003 Web track

  ML2
  Task      ω         κ
  td2003    8.8095    14.8852
  hp2003    10.6684   9.8822

  M_D L2
  Task      ω         κ
  td2003    7.6974    12.4616
  hp2003    27.0678   67.3153

  PL2F
  Task      ω         κ
  td2003    7.3638    8.2178
  hp2003    13.3476   28.3669

Table 3 presents the evaluation results for the proposed models ML2 and M_D L2, and the weighting model PL2F, as well as their combination with evidence from the URLs of documents (denoted by appending U to the weighting model's name). When only the document fields are employed, the multinomial weighting models have similar performance compared to the weighting model PL2F. The weighting models PL2F and M_D L2 outperform ML2 for both the topic distillation and home page finding tasks. For the named page finding task, ML2 results in higher MRR than M_D L2 and PL2F.

Using the Wilcoxon signed rank test, we tested the significance of the differences in MAP and MRR between the proposed new multinomial models and PL2F. In the case of the topic distillation task td2004, PL2F and M_D L2 were found to perform statistically significantly better than ML2, with p < 0.001 in both cases. There was no statistically significant difference between PL2F and M_D L2. Regarding the named page finding task np2004, there is no statistically significant difference between any of the three proposed models. For the home page finding task hp2004, only the difference between ML2 and PL2F was found to be statistically significant (p = 0.020).

Regarding the combination of the weighting
models with the evidence from the URLs of Web documents, Table 3 shows that PL2FU and M_D L2U outperform ML2U for td2004. The differences in performance are statistically significant, with p = 0.002 and p = 0.012, respectively, but there is no significant difference in the retrieval effectiveness between PL2FU and M_D L2U. When considering hp2004, we can see that PL2FU outperforms the multinomial weighting models. The only statistically significant difference in MRR was found between PL2FU and M_D L2U (p = 0.012).

Table 3. Evaluation results for the weighting models ML2, M_D L2, and PL2F on the TREC 2004 Web track topic distillation (td2004), named page finding (np2004), and home page finding (hp2004) tasks. ML2U, M_D L2U, and PL2FU correspond to the combination of each weighting model with evidence from the URL of documents. The table reports mean average precision (MAP) for the topic distillation task, and mean reciprocal rank (MRR) of the first correct answer for the named page finding and home page finding tasks. ML2U, M_D L2U and PL2FU are evaluated only for td2004 and hp2004, where the relevant documents are home pages (see Section 4.1).

  Task          ML2      M_D L2   PL2F
  MAP td2004    0.1241   0.1391   0.1390
  MRR np2004    0.6986   0.6856   0.6878
  MRR hp2004    0.6075   0.6213   0.6270

  Task          ML2U     M_D L2U  PL2FU
  MAP td2004    0.1916   0.2012   0.2045
  MRR hp2004    0.6364   0.6220   0.6464

A comparison of the evaluation results with the best performing runs submitted to the Web track of TREC 2004 [4] shows that the combination of the proposed models with the evidence from the URLs performs better than the best performing run of the topic distillation task in TREC 2004, which achieved MAP 0.179. The performance of the proposed models is comparable to that of the most effective method for the named page finding task (MRR 0.731). Regarding the home page finding task, the difference is greater between the performance of the proposed models with evidence from the URLs, and the best performing methods in the same track (MRR 0.749). This can
be explained in two ways. First, the over-fitting of the parameters ω and κ on the training task may result in lower performance for the test task. Second, using field weights may be more effective for the home page finding task, which is a high precision task, where the correct answers to the queries are documents of a very specific type.

From the results in Table 3, it can be seen that the model M_D L2, which employs the information theoretic approximation to the multinomial distribution, significantly outperforms the model ML2, which employs the multinomial distribution, for the topic distillation task. As discussed in Section 3.2, this may suggest that approximating the multinomial distribution is more effective than directly computing it, because of the number of computations involved, and the accumulated small approximation errors from the computation of the factorial. The difference in performance may be greater if more document fields are considered.

Overall, the evaluation results show that the proposed multinomial models ML2 and M_D L2 have a very similar performance to that of PL2F for the tested search tasks. None of the models outperforms the others consistently for all three tested tasks, and the weighting models M_D L2 and PL2F achieve similar levels of retrieval effectiveness. The next section discusses some points related to the new multinomial models.
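The single-term weights of ML2 (Equation 8) and its approximation M_D L2 (Equation 11) can be sketched as below. This is an illustrative sketch, not the paper's implementation: the log-gamma function of the C library stands in for the Lanczos Γ approximation, and the normalised frequencies, collection frequency, and collection size are invented numbers. The two scores should nevertheless come out close, since one model approximates the other.

```python
import math

LOG2E = math.log2(math.e)

def log2_fact(x):
    # log2(x!) via the log-gamma function (a Lanczos-style approximation in libm);
    # works for the real-valued normalised frequencies tfn_i.
    return math.lgamma(x + 1.0) * LOG2E

def ml2_term_weight(tfns, TF, N, qtw=1.0):
    """ML2 weight of one query term (Equation 8)."""
    k = len(tfns)
    p_i = 1.0 / (k * N)          # prior probability of each field
    p_rest = (N - 1.0) / N       # term occurs outside this document
    tfn_rest = TF - sum(tfns)
    info = (-log2_fact(TF)
            + sum(log2_fact(t) - t * math.log2(p_i) for t in tfns)
            + log2_fact(tfn_rest) - tfn_rest * math.log2(p_rest))
    return qtw * info / (sum(tfns) + 1.0)   # Laplace after-effect factor

def mdl2_term_weight(tfns, TF, N, qtw=1.0):
    """M_D L2 weight of one query term (Equation 11); no factorials needed."""
    k = len(tfns)
    p_i = 1.0 / (k * N)
    p_rest = (N - 1.0) / N
    tfn_rest = TF - sum(tfns)
    info = 0.5 * k * math.log2(2 * math.pi * TF)
    for t in tfns:
        info += t * math.log2((t / TF) / p_i) + 0.5 * math.log2(t / TF)
    info += (tfn_rest * math.log2((tfn_rest / TF) / p_rest)
             + 0.5 * math.log2(tfn_rest / TF))
    return qtw * info / (sum(tfns) + 1.0)

# normalised frequencies in body/anchor/title, collection frequency, #documents
w_ml2 = ml2_term_weight([3.2, 1.5, 1.1], TF=900.0, N=100000)
w_mdl2 = mdl2_term_weight([3.2, 1.5, 1.1], TF=900.0, N=100000)
print(w_ml2, w_mdl2)
```

Note that M_D L2 avoids the k + 2 factorial evaluations per term that ML2 requires, which is the numerical-accumulation argument made in Section 3.2.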

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

John Lafferty (LAFFERTY@), Andrew McCallum (MCCALLUM@), Fernando Pereira (FPEREIRA@)
WhizBang! Labs - Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA

Abstract

We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

1. Introduction

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).

HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label
sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.

This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
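A toy sketch of this point: a conditional exponential model can freely mix overlapping, non-independent observation features (word identity, suffix, capitalisation), because it never has to model how those features are jointly distributed. All feature names and weights below are invented for illustration and are not from the paper.

```python
import math

def features(word):
    # overlapping, non-independent views of the same observation
    return {("word", word.lower()),
            ("suffix", word[-2:].lower()),
            ("cap", word[0].isupper())}

# hypothetical trained weights for a two-tag toy problem
weights = {
    ("VERB", ("suffix", "ng")): 1.2,
    ("NOUN", ("suffix", "ng")): -0.3,
    ("NOUN", ("cap", True)): 0.8,
    ("VERB", ("cap", True)): -0.4,
}

def p_tag_given_word(word, tags=("NOUN", "VERB")):
    # exponential model: p(tag | word) proportional to exp(sum of active feature weights)
    scores = {t: sum(weights.get((t, f), 0.0) for f in features(word)) for t in tags}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(scores[t]) / z for t in tags}

print(p_tag_given_word("running"))  # the '-ng' suffix pushes mass toward VERB
print(p_tag_given_word("Table"))    # capitalisation pushes mass toward NOUN
```

A generative model would instead have to assign a joint distribution to (word, suffix, capitalisation) given the tag, and since those features overlap deterministically, that forces either intractable modeling or strict independence assumptions.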
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state[1] has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

[1] Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a "conservation of score mass" (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future
decisions in a way that does not match the actual dependencies between consecutive states. This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite-state model with un-normalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,² guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

²In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.

Figure 1. Label bias example, after (Bottou, 1991). For conciseness, we place observation-label pairs on transitions rather than states; a dedicated symbol represents the null output label.

We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of
conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.

2. The Label Bias Problem

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001), are all potential victims of the label bias problem.

For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words "rib" and "rob". Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training; state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next-state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word's state sequence will always win. This behavior is demonstrated experimentally in Section 5.

Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of
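The per-state normalization at the heart of the label bias problem can be seen in a few lines. This is a sketch with made-up transition scores, not code from the paper:

```python
import numpy as np

def next_state_dist(scores):
    """Per-state normalization (the MEMM rule): softmax over the scores of
    this state's outgoing transitions only."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

# State 1 of Figure 1 has a single outgoing transition.  Whatever score the
# current observation produces, normalization forces all mass onto it:
for obs_score in (-5.0, 0.0, 5.0):
    assert next_state_dist(np.array([obs_score]))[0] == 1.0

# A state with two outgoing transitions can reweight them, but the total
# mass passed on is always 1 -- "conservation of score mass".
d = next_state_dist(np.array([2.0, 0.0]))
assert abs(d.sum() - 1.0) < 1e-12
```

However an observation scores a single-successor state, the successor receives everything; only the branching states can use the observation at all.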
determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000).

Proper solutions require models that account for whole state sequences at once by letting some transitions "vote" more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved, but instead individual transitions can "amplify" or "dampen" the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.³ In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.

3. Conditional Random Fields

In what follows, $X$ is a random variable over data sequences to be labeled, and $Y$ is a random variable over corresponding label sequences. All components $Y_i$ of $Y$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $X$ might range over natural language sentences and $Y$ range over part-of-speech taggings of those sentences, with $\mathcal{Y}$ the set of possible part-of-speech tags. The random variables $X$ and $Y$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(Y \mid X)$ from paired observation and label sequences, and do not explicitly model the marginal $p(X)$.

Definition. Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field in case, when conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$. Thus, a CRF is a random field globally
conditioned on the observation $X$. Throughout the paper we tacitly assume that the graph $G$ is fixed. In the simplest and most important example for modeling sequences, $G$ is a simple chain or line: $G = (V = \{1, 2, \ldots, m\},\ E = \{(i, i+1)\})$. $X$ may also have a natural graph structure; yet in general it is not necessary to assume that $X$ and $Y$ have the same graphical structure, or even that $X$ has any graphical structure at all. However, in this paper we will be most concerned with sequences $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$.

³Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.

If the graph $G = (V, E)$ of $Y$ is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence $Y$ given $X$ has the form

$$p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{e \in E,\,k} \lambda_k\, f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\,k} \mu_k\, g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big), \qquad (1)$$

where $\mathbf{x}$ is a data sequence, $\mathbf{y}$ a label sequence, and $\mathbf{y}|_S$ is the set of components of $\mathbf{y}$ associated with the vertices in subgraph $S$. We assume that the features $f_k$ and $g_k$ are given and fixed. For example, a Boolean vertex feature $g_k$ might be true if the word $X_i$ is upper case and the tag $Y_i$ is "proper noun."

The parameter estimation problem is to determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots;\, \mu_1, \mu_2, \ldots)$ from training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^N$ with empirical distribution $\tilde{p}(\mathbf{x}, \mathbf{y})$. In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function:

$$\mathcal{O}(\theta) = \sum_{i=1}^N \log p_\theta(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \;\propto\; \sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x}, \mathbf{y}) \log p_\theta(\mathbf{y} \mid \mathbf{x}).$$

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair $(y', y)$, and one feature for each state-observation pair $(y, x)$:

$$f_{y',y}(\langle u, v\rangle, \mathbf{y}|_{\langle u,v\rangle}, \mathbf{x}) = \delta(y_u, y')\,\delta(y_v, y), \qquad g_{y,x}(v, \mathbf{y}|_v, \mathbf{x}) = \delta(y_v, y)\,\delta(x_v, x).$$
The corresponding parameters $\lambda_{y',y}$ and $\mu_{y,x}$ play a similar role to the (logarithms of the) usual HMM parameters $p(y' \mid y)$ and $p(x \mid y)$. Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization $Z(\mathbf{x})$ for conditional distributions.

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence. In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.

For the remainder of the paper we assume that the dependencies of $Y$, conditioned on $X$, form a chain. To simplify some expressions, we add special start and stop states $Y_0 = \mathtt{start}$ and $Y_{n+1} = \mathtt{stop}$. Thus, we will be using the graphical structure shown in Figure 2.

For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(Y \mid X)$ is a CRF given by (1). For each position $i$ in the observation sequence $\mathbf{x}$, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(\mathbf{x}) = [M_i(y', y \mid \mathbf{x})]$ by

$$M_i(y', y \mid \mathbf{x}) = \exp\Big( \sum_k \lambda_k\, f_k(e_i, Y|_{e_i} = (y', y), \mathbf{x}) + \sum_k \mu_k\, g_k(v_i, Y|_{v_i} = y, \mathbf{x}) \Big),$$

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $\mathbf{x}$ and the parameter vector $\theta$. Then the normalization (partition function) $Z(\mathbf{x})$ is the $(\mathtt{start}, \mathtt{stop})$ entry of the product of these matrices:

$$Z(\mathbf{x}) = \big[\, M_1(\mathbf{x})\, M_2(\mathbf{x}) \cdots M_{n+1}(\mathbf{x})\, \big]_{\mathtt{start},\,\mathtt{stop}}.$$
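The matrix form can be sketched numerically: random positive matrices stand in for the feature-derived scores, the matrix-product partition function is checked against brute-force enumeration, and a forward-backward product yields per-position label marginals. All numbers here are illustrative, not from the paper, and a free endpoint replaces the special stop state for simplicity:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
L, T = 3, 4          # label alphabet size, sequence length
start = 0            # index standing in for the special start label

# Hypothetical score matrices M_t[y', y]: in a real CRF each entry is the
# exponentiated weighted feature sum computed from the observation sequence.
M = rng.random((T, L, L)) + 0.1

# Partition function Z(x) via the forward recursion (matrix product).
alpha = [M[0][start]]
for t in range(1, T):
    alpha.append(alpha[-1] @ M[t])
Z = alpha[-1].sum()

# Brute force over all L**T labelings must give the same Z.
Z_brute = 0.0
for y in itertools.product(range(L), repeat=T):
    s, prev = 1.0, start
    for t, yt in enumerate(y):
        s *= M[t][prev, yt]
        prev = yt
    Z_brute += s
assert np.isclose(Z, Z_brute)

# Backward vectors beta_t = M_{t+1} beta_{t+1}, then per-position marginals
# alpha_t * beta_t / Z; each row is a distribution over labels.
beta = [np.ones(L)]
for t in range(T - 1, 0, -1):
    beta.insert(0, M[t] @ beta[0])
marg = np.array([a * b for a, b in zip(alpha, beta)]) / Z
assert np.allclose(marg.sum(axis=1), 1.0)
```

The dynamic program touches T matrices of size L x L, while the brute force sums over L**T paths; both agree exactly because matrix multiplication distributes the sum over paths.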
Using this notation, the conditional probability of a label sequence $\mathbf{y}$ is written as

$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid \mathbf{x})}{Z(\mathbf{x})},$$

where $y_0 = \mathtt{start}$ and $y_{n+1} = \mathtt{stop}$.

4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of

$$\tilde{E}[f_k] \stackrel{\text{def}}{=} \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x},\mathbf{y}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) = \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\, e^{\delta\lambda_k T(\mathbf{x},\mathbf{y})},$$

where $T(\mathbf{x}, \mathbf{y}) \stackrel{\text{def}}{=} \sum_{i,k} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) + \sum_{i,k} g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x})$ is the total feature count. The equations for vertex feature updates $\delta\mu_k$ have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(\mathbf{x}, \mathbf{y})$ is a global property of $(\mathbf{x}, \mathbf{y})$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial $T$ totals.

For Algorithm S, we define the slack feature by

$$s(\mathbf{x}, \mathbf{y}) \stackrel{\text{def}}{=} S - \sum_i \sum_k f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) - \sum_i \sum_k g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}),$$

where $S$ is a constant chosen so that $s(\mathbf{x}^{(i)}, \mathbf{y}) \ge 0$ for all $\mathbf{y}$ and all observation vectors $\mathbf{x}^{(i)}$ in the training set, thus making $T(\mathbf{x}, \mathbf{y}) = S$. Feature $s$ is "global," that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the forward vectors $\alpha_i(\mathbf{x})$ with base case

$$\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \mathtt{start} \\ 0 & \text{otherwise} \end{cases}$$

and recurrence $\alpha_i(\mathbf{x}) = \alpha_{i-1}(\mathbf{x})\, M_i(\mathbf{x})$. Similarly, the backward vectors $\beta_i(\mathbf{x})$ are defined by

$$\beta_{n+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \mathtt{stop} \\ 0 & \text{otherwise} \end{cases}$$

and $\beta_i(\mathbf{x}) = M_{i+1}(\mathbf{x})\, \beta_{i+1}(\mathbf{x})$. With these definitions, the update equations are

$$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E[f_k]}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E}[g_k]}{E[g_k]},$$

where

$$E[f_k] = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{i=1}^{n+1} \sum_{y', y} f_k(e_i, (y', y), \mathbf{x})\, \frac{\alpha_{i-1}(y' \mid \mathbf{x})\, M_i(y', y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z(\mathbf{x})}.$$

The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example, $\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x}) / Z(\mathbf{x})$ is the marginal probability of label $Y_i = y$ given that the observation sequence is $\mathbf{x}$. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and to MART algorithms used in image reconstruction.

The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small
steps toward the maximum in each iteration. If the length of the observations $\mathbf{x}^{(i)}$ and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let $T(\mathbf{x}) \stackrel{\text{def}}{=} \max_{\mathbf{y}} T(\mathbf{x}, \mathbf{y})$. Algorithm T accumulates feature expectations into counters indexed by $T(\mathbf{x})$. More specifically, we use the forward-backward recurrences just introduced to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(\mathbf{x}) = t$. Then our parameter updates are $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$, where $\beta_k$ and $\gamma_k$ are the unique positive roots of the following polynomial equations

$$\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \tilde{E}[f_k], \qquad \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \tilde{E}[g_k], \qquad (2)$$

which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well-known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features $f_k$ with parameters $\lambda_k$.

Given two parameter settings $\theta$ and $\theta' = \theta + \delta\theta$, we bound from below the change in the objective function with an auxiliary function $A(\theta', \theta)$:

$$\mathcal{O}(\theta') - \mathcal{O}(\theta) \;\ge\; A(\theta', \theta),$$

where the inequalities used to construct $A$ follow from the convexity of $\exp$ and $-\log$. Differentiating $A$ with respect to $\delta\lambda_k$ and setting the result to zero yields equation (2).

5. Experiments

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs.
The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and a second-order model. Competing models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models.

Figure 3. Plots of error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes "more second order," the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with $\alpha < 1/2$, and a solid circle indicates a data set with $\alpha \ge 1/2$; the plot shows that when the data is mostly second order ($\alpha \ge 1/2$), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.

Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments do not use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

5.1 Modeling label bias

We generate data from a simple HMM which encodes a
noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6% while the MEMM error is 42%, showing that the MEMM fails to discriminate between the two branches.

5.2 Modeling mixed-order sources

For these results, we use five labels, $a$–$e$ ($|\mathcal{Y}| = 5$), and 26 observation values, $A$–$Z$ ($|\mathcal{X}| = 26$); however, the results were qualitatively the same over a range of sizes for $|\mathcal{Y}|$ and $|\mathcal{X}|$. We generate data from a mixed-order HMM with state transition probabilities given by

$$p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$$

and, similarly, emission probabilities given by

$$p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha\, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \mid y_i).$$

Thus, for $\alpha = 0$ we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables $p_2$ are constrained to be sparse.
In particular, $p_2(y_i \mid y_{i-1}, y_{i-2})$ can have at most two nonzero entries for each pair $(y_{i-1}, y_{i-2})$, and $p_2(x_i \mid y_i, x_{i-1})$ can have at most three nonzero entries for each pair $(y_i, x_{i-1})$. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.

On each randomly generated training set, a CRF is trained using Algorithm S. (Note that since the length of the sequences and the number of active features are constant, Algorithms S and T are identical.) The algorithm is fairly slow to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC used in our experiments, each iteration takes approximately 0.2 seconds. On the same data an MEMM is trained using iterative scaling, which does not require forward-backward calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately 100 iterations. For each model, the Viterbi algorithm is used to label a test set; the experimental results do not significantly change when using forward-backward decoding to minimize the per-symbol error rate.

The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As $\alpha$ increases, the error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares models with a first parameterization of the features; results for a second parameterization are qualitatively the same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%–20% relative error. (The points for very small error rate, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)

Figure 4. Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million word corpus. The oov rate is 5.45%.

    model    error    oov error
    HMM      5.69%    45.99%
    MEMM     6.37%    54.61%
    CRF      5.55%    48.05%
    MEMM+    4.81%    26.99%
    CRF+     4.27%    23.76%

    (+) using spelling features

5.3 POS tagging
experiments

To confirm our synthetic data results, we also compared HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be labeled with one of 45 syntactic tags.

We carried out two sets of experiments with this natural language data. First, we trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments, introducing parameters $\mu_{y,x}$ for each tag-word pair and $\lambda_{y',y}$ for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM outperforms the MEMM, as a consequence of the label bias problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.

In the second set of experiments, we take advantage of the power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.

One usually starts training from the all-zero parameter vector, corresponding to the uniform distribution. However, for these datasets, CRF training with that initialization is much slower than MEMM training. Fortunately, we can use the optimal MEMM parameter vector as a starting point for training the corresponding CRF. In Figure 4, MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even
after 2,000 iterations.

6. Further Aspects of CRFs

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficiently certain feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.

Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.

7. Related Work and Conclusions

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.
Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions, assuming that neighboring labels are correctly set. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.

Reparameterization tricks in the DDPM and DDIM algorithms


DDPM (Denoising Diffusion Probabilistic Models) and DDIM (Denoising Diffusion Implicit Models) are probabilistic generative models based on diffusion processes; they have produced many breakthrough results in computer vision, image generation, and automatic sample generation.

The reparameterization trick is one of the keys to both algorithms, and this article analyzes and discusses it in depth.

1. Overview of DDPM and DDIM

1.1 Overview of DDPM. DDPM is a probabilistic generative model based on a diffusion process: it models the data distribution by modeling a gradual random walk applied to the data.

DDPM exploits properties of Gaussian distributions, applying a Gaussian diffusion process to data generation, and thereby achieves generation and reparameterization of image data.

The core idea of DDPM is to view data points as particles undergoing diffusion and to generate image data by simulating the trajectories of these particles.

By modeling and estimating the diffusion process, DDPM can effectively capture the distributional characteristics of the data and thus generate image data efficiently.

1.2 Overview of DDIM. DDIM is an implicit generative model based on the diffusion process; it uses properties of the diffusion process to model how the data is generated.

Unlike DDPM, DDIM models and estimates a latent space, and through it achieves generation and reparameterization of image data.

The core idea of DDIM is to generate image data by modeling latent variables and marginalizing over them.

By modeling and estimating the latent space, DDIM can effectively capture the distributional characteristics of the data and thus generate image data efficiently.

2. The reparameterization trick in DDPM and DDIM

2.1 The reparameterization trick in DDPM. The reparameterization trick is one of the keys to DDPM: it introduces differentiable random variables to enable training and inference.

Concretely, DDPM couples the model parameters with a noise variable through reparameterization, which makes end-to-end training and inference possible.

The core of the trick is to decompose the sampling of a random variable into a deterministic transformation plus an independent noise variable, so that the sampling step becomes differentiable.

By introducing such differentiable random variables, DDPM supports end-to-end training and inference, improving training efficiency and inference accuracy.
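As a concrete illustration of the trick described above, here is a minimal sketch of the DDPM forward-process sampling step rewritten as a deterministic transform of x0 plus independent Gaussian noise. The linear beta schedule is a common choice assumed for illustration, not taken from this article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (assumed): linear betas, with alpha_bar the cumulative
# product of (1 - beta) used by the closed-form forward process.
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Reparameterized sample x_t ~ q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)     # the externally drawn noise variable
x_t = q_sample(x0, t=500, eps=eps)

# Because the randomness enters only through eps, gradients can flow
# through x_t back to x0 (and, during training, to network parameters):
# the sampling step is no longer a stochastic barrier.
assert x_t.shape == x0.shape
assert alpha_bar[-1] < alpha_bar[0]   # signal decays as t grows
```

The same decomposition is what lets DDPM's training objective be optimized by ordinary backpropagation through the noising step.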

An Introduction to Probabilistic Graphical Models

PROBABILISTIC GRAPHICAL MODELS
David Madigan, Rutgers University
madigan@
Expert Systems
• Explosion of interest in "expert systems" in the early 1980's
• Many companies (Teknowledge, IntelliCorp, Inference, etc.), many IPOs, much media hype
• Ad-hoc uncertainty handling

Uncertainty in Expert Systems
If A then C (p1)
If B then C (p2)
What if both A and B are true? Then C is true with certainty factor CF = p1 + p2 × (1 − p1)
"Currently fashionable ad-hoc mumbo jumbo" — A. F. M. Smith
Lemma: If P admits a recursive factorization according to an ADG G, then P factorizes according to G^M, the moral graph of G (and according to chordal supergraphs of G^M).
Lemma: If P admits a recursive factorization according to an ADG G, and A is an ancestral set in G, then the marginal P_A admits a recursive factorization according to the subgraph G_A.
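A minimal numeric check of recursive factorization and the ancestral-set lemma above, assuming the chain ADG A → B → C with made-up conditional tables:

```python
import itertools

# Made-up conditional tables for the chain ADG A -> B -> C.
pA = {0: 0.6, 1: 0.4}
pB = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(B | A)
pC = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P(C | B)

def joint(a, b, c):
    """Recursive factorization P(a, b, c) = P(a) P(b|a) P(c|b)."""
    return pA[a] * pB[a][b] * pC[b][c]

# The joint is a proper distribution...
total = sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12

# ...and the marginal over the ancestral set {A, B} factorizes according to
# the subgraph A -> B, as the second lemma states.
pAB = {(a, b): sum(joint(a, b, c) for c in (0, 1))
       for a in (0, 1) for b in (0, 1)}
for (a, b), p in pAB.items():
    assert abs(p - pA[a] * pB[a][b]) < 1e-12
```

Marginalizing out the non-ancestral vertex C leaves exactly the product of the remaining local tables, which is the content of the second lemma for this tiny graph.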

An overview of Stanford's AI courses


List of related AI classes

CS229 covered a broad swath of topics in machine learning, compressed into a single quarter. Machine learning is a hugely interdisciplinary topic, and there are many other sub-communities of AI working on related topics, or working on applying machine learning to different problems. Stanford has one of the best and broadest sets of AI courses of pretty much any university. It offers a wide range of classes, covering most of the scope of AI issues. Here are some classes in which you can learn more about topics related to CS229:

AI Overview
• CS221 (Aut): Artificial Intelligence: Principles and Techniques. Broad overview of AI and applications, including robotics, vision, NLP, search, Bayesian networks, and learning. Taught by Professor Andrew Ng.

Robotics
• CS223A (Win): Robotics from the perspective of building the robot and controlling it; focus on manipulation. Taught by Professor Oussama Khatib (who builds the big robots in the Robotics Lab).
• CS225A (Spr): A lab course from the same perspective, taught by Professor Khatib.
• CS225B (Aut): A lab course where you get to play around with making mobile robots navigate in the real world. Taught by Dr. Kurt Konolige (SRI).
• CS277 (Spr): Experimental Haptics. Teaches haptics programming and touch feedback in virtual reality. Taught by Professor Ken Salisbury, who works on robot design, haptic devices/teleoperation, robotic surgery, and more.
• CS326A (Latombe): Motion planning. An algorithmic robot motion planning course, by Professor Jean-Claude Latombe, who (literally) wrote the book on the topic.

Knowledge Representation & Reasoning
• CS222 (Win): Logical knowledge representation and reasoning. Taught by Professor Yoav Shoham and Professor Johan van Benthem.
• CS227 (Spr): Algorithmic methods such as search, CSP, planning. Taught by Dr. Yorke-Smith (SRI).

Probabilistic Methods
• CS228 (Win): Probabilistic models in AI. Bayesian networks, hidden Markov models, and planning under uncertainty. Taught by Professor Daphne Koller, who works on computational biology, Bayes nets, learning, computational game theory, and more.

Perception & Understanding
• CS223B (Win): Introduction to computer vision. Algorithms for processing and interpreting image or camera information. Taught by Professor Sebastian Thrun, who led the DARPA Grand Challenge / DARPA Urban Challenge teams, or Professor Jana Kosecka, who works on vision and robotics.
• CS224S (Win): Speech recognition and synthesis. Algorithms for large-vocabulary continuous speech recognition, text-to-speech, and conversational dialogue agents. Taught by Professor Dan Jurafsky, who co-authored one of the two most-used textbooks on NLP.
• CS224N (Spr): Natural language processing, including parsing, part-of-speech tagging, information extraction from text, and more. Taught by Professor Chris Manning, who co-authored the other of the two most-used textbooks on NLP.
• CS224U (Win): Natural language understanding, including computational semantics and pragmatics, with application to question answering, summarization, and inference. Taught by Professors Dan Jurafsky and Chris Manning.

Multi-agent Systems
• CS224M (Win): Multi-agent systems, including game-theoretic foundations, designing systems that induce agents to coordinate, and multi-agent learning. Taught by Professor Yoav Shoham, who works on economic models of multi-agent interactions.
• CS227B (Spr): General game playing. Reasoning and learning methods for playing any of a broad class of games. Taught by Professor Michael Genesereth, who works on computational logic, enterprise management and e-commerce.

Convex Optimization
• EE364A (Win): Convex Optimization. Convexity, duality, convex programs, interior point methods, algorithms. Taught by Professor Stephen Boyd, who works on optimization and its application to engineering problems.

AI Project Courses
• CS294B/CS294W (Win): STAIR (STanford AI Robot) project. Project course with no lectures. By drawing from machine learning and all other areas of AI, we'll work on the challenge problem of building a general-purpose robot that can carry out home and office chores, such as tidying up a room, fetching items, and preparing meals. Taught by Professor Andrew Ng.

Modeling probabilistic actions for practical decision-theoretic planning


Any planning model that strives to solve real-world problems must deal with the inherent uncertainty in the domains. Various approaches have been suggested (0; 0; 0; 0), and the generally accepted and traditional solution is to use probability to model domain uncertainty (0; 0; 0). A representative of this approach is the buridan planner (0). In buridan, uncertainty about the true state of the world is modeled with a probability distribution over the state space. Actions have uncertain effects, and each of these effects is also modeled with a probability distribution. Projecting a plan thus does not result in a single final state, but in a probability distribution over the state space. To make the representation computationally tractable, the probability distributions involved take non-zero probabilities on only a finite number of states. The buridan representation, which we will call the single probability distribution (SPD) model, has a well-founded semantics and is the underlying representation
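A sketch of SPD-style plan projection as described above. States, the action, and the probabilities are hypothetical, not taken from buridan:

```python
# Beliefs are finite distributions over states; an action maps each state to
# a distribution over successor states.  All names and numbers are made up.
belief = {"dry": 0.7, "wet": 0.3}
effects = {                       # probabilistic effects of a "paint" action
    "dry": {"painted": 0.9, "dry": 0.1},
    "wet": {"painted": 0.4, "wet": 0.6},
}

def project(belief, effects):
    """Push the belief through the action's effect distributions."""
    out = {}
    for s, p in belief.items():
        for s2, q in effects[s].items():
            out[s2] = out.get(s2, 0.0) + p * q
    return out

new_belief = project(belief, effects)
assert abs(sum(new_belief.values()) - 1.0) < 1e-12          # still a distribution
assert abs(new_belief["painted"] - 0.75) < 1e-12            # 0.7*0.9 + 0.3*0.4
```

Projecting a sequence of actions is just repeated application of `project`, and the support stays finite as long as each effect distribution has finite support, which is exactly the tractability condition mentioned above.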

A bibliography for 29 unsolved problems in AI, with assessments of progress


Commonsense Reasoning. “Commonsense reasoning pronoun disambiguation problems” Online under /disambiguation.html (2016b)
Brown, Noam, and Tuomas Sandholm. “Safe and Nested Endgame Solving for Imperfect-Information Games.” Online under /~noamb/papers/17-AAAI-Refinement.pdf(2017)
Deng, Jia, et al. “Imagenet: A large-scale hierarchical image database.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
Finn, Chelsea, and Sergey Levine. “Deep Visual Foresight for Planning Robot Motion.” arXiv preprint arXiv:1610.00696 (2016).
Fouhey, David F., and C. Lawrence Zitnick. “Predicting object dynamics in scenes.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
de Freitas, Nando. “Learning to Learn and Compositionality with Deep Recurrent Neural Networks: Learning to Learn and Compositionality.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

Stories of mathematicians that will astonish you


The years since high school (continued). This article argues that mathematics is not the only route to understanding the world: even if your math level is not high, you can still do work and research in machine learning.

It is nevertheless undeniable that mathematics is an effective tool for pursuing truth in the natural sciences. A strong mathematical background lets you see problems more deeply, which is why so many leading figures in the field come from mathematics backgrounds.

This article also lists quite a few courses, and trying to exhaust them all at once is unrealistic. There is no need to master every mathematical foundation before studying machine learning. A better approach is to wait until your understanding of machine learning itself hits a bottleneck, then go back and fill in the relevant mathematics; you may then break through faster.

Accordingly, the material below is divided into beginner, intermediate, and advanced stages, so that readers with different backgrounds can pick and choose.

First, an overview of artificial intelligence and machine learning ("Artificial intelligence, data science, machine learning: an overview", a Zhihu column): broadly speaking, the principles and foundations sit on the mathematics side. There are of course many applied, tool-oriented skills, such as "tuning deep learning hyperparameters", which can be picked up in a short training course and have comparatively low technical content; those are out of scope here.

What we discuss here is how to study systematically, so that you can eventually write machine learning or deep learning programs yourself. Only then, I think, can you call yourself a competent machine learning practitioner or data scientist.

1. Foundations for getting started

1.1 Calculus (differentiation, limits, extrema). For example, the training algorithm of the classical BP neural network is based on the chain rule for differentiating composite functions, and most current supervised learning training algorithms are based on maximum likelihood estimation, whose solution usually involves differentiation and finding extrema.

1.2 Linear algebra (matrix representations, matrix operations, eigenvalues, eigenvectors). This is the foundation of foundations: principal component analysis (PCA), singular value decomposition (SVD), eigendecomposition, LU decomposition, QR decomposition, symmetric matrices, orthogonalization and orthonormalization, matrix operations, projections, eigenvalues and eigenvectors, vector spaces and norms — all of these underlie the basic concepts of machine learning.

The core idea of a celebrated image-segmentation paper with more than 10,000 citations is precisely to solve for the eigenvectors of a constructed matrix.

Chinese linear algebra textbooks tend to emphasize computation while neglecting fundamental concepts such as linear spaces and eigenvalues.
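As a small illustration of the linear-algebra toolkit listed above, here is PCA computed two ways, via the SVD and via the eigendecomposition of the covariance matrix, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with three directions of very different variance.
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)                       # center the data

# PCA via the singular value decomposition.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = S**2 / (len(X) - 1)           # variance along each component

# The same answer from the eigendecomposition of the covariance matrix.
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
assert np.allclose(sorted(evals, reverse=True), explained_var)
assert explained_var[0] > explained_var[-1]   # components sorted by variance
```

The two routes agree because the right singular vectors of the centered data matrix are exactly the eigenvectors of its covariance matrix; this connection between SVD and eigendecomposition is the kind of conceptual glue the passage above is recommending.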

What makes a micro-lecture video popular?


Letting students learn from excellent teachers through micro-lectures and MOOCs is certainly a good thing, but student drop-off during learning is a problem that cannot be ignored.

What kinds of micro-lectures and MOOCs attract students' attention and keep them engaged? Two recent papers from the United States analyzed this question.

1. Videos shorter than 6 minutes are the most engaging. Based on edX statistics, no matter how long a video is, the median actual watch time is no more than 6 minutes. Moreover, 6–9 minutes is an inflection point: for longer videos, the median actual watch time actually drops. For example, for videos longer than 12 minutes, the median watch time is only 3 minutes. So the question "how short should a short video be?" has a standard answer: 6 minutes.

2. Fast speech matters. Although the statistics show that speaking rate and video engagement are not strictly proportional, once the rate reaches 185–254 English words per minute (roughly 300 Chinese characters), videos of any length hold more attention than slower ones. The reason is easy to understand: fast speech often goes with enthusiasm, enthusiasm is contagious, and that contagion keeps students focused. The more enthusiastic, even passionate, the teacher, the more engaging the video.

3. The teacher's talking head is by no means optional. For videos longer than 6 minutes, those that show the teacher's face attract more attention than pure slide presentations or software screen recordings, provided the slides leave a corner free so the head does not cover the content students need to see. Khan Academy discourages showing the teacher's face in videos, but Chinese students place more weight on the teacher's role, and in some teaching contexts the teacher's image is itself an important learning resource.

4. Create a one-on-one feeling. Teachers are used to the classroom atmosphere: blackboard, big screen, standing at a podium, walking around, sometimes even arranging a few students as a stand-in audience to recreate the feel of lecturing. But the data show that videos recorded in a classroom or studio with expensive equipment are actually less engaging than cheaper, more personal recordings. The best results come from the teacher sitting down, facing the camera, with an ordinary office as the background, lecturing as if giving individual tutoring. This easily creates a sense of closeness and best matches the environment of a student sitting in front of a computer. The key is to give the student a one-on-one feeling.

5. A pen tablet or stylus is the most worthwhile equipment purchase. Khan Academy's videos are the archetypal stylus application, so the papers simply call this the "Khan style". Statistics show that, compared with plain screen-recording style, students are willing to spend 1.5–2 times as long on Khan-style videos.

人工智能应用领域有哪些方面

人工智能应用领域有哪些方面

人工智能运用领域有哪些方面2023人工智能运用领域有哪些方面这几年来中国在人工智能发展研究上的热度一直高涨。

人工智能也越来越贴近我们的生活。

那么它有哪些方面,来看一下!下面作者为大家带来人工智能运用领域有哪些方面,欢迎大家参考阅读,期望能够帮助到大家!人工智能运用领域有哪些方面1、农业:农业中已经用到很多的AI技术,无人机喷撒农药,除草,农作物状态实时监控,物料采购,数据收集,灌溉,收获,销售等。

通过运用人工智能设备终端等,大大提高了农牧业的产量,大大减少了许多人工本钱和时间本钱。

2、通讯:智能外呼系统,客户数据处理(订单管理系统),通讯故障排除,病毒拦截(360等),扰乱信息拦截等3、医疗:利用最先进的物联网技术,实现患者与医务人员、医疗机构、医疗设备之间的互动,逐渐到达信息化。

例:健康监测(智能穿着设备)、自动提示用药时间、服用禁忌、剩余药量等的智能服药系统。

4、社会治安:安防监控(数据实时联网,公安系统可以实时进行数据调查分析)、电信诈骗数据锁定、犯法分子抓捕、消防抢险领域(灭火、人员救助、特别区域作业)等。

人工智能运用前景怎么样人工智能技术有着广阔运用前景,能够极大地增进社会经济发展。

近年来,人工智能与电子终端和垂直行业加速融会,已经出现出了智能家居、智能汽车、可穿着设备、智能机器人等一批人工智能产品,而且人工智能正在全面重塑家电、机器人、医疗、教育、金融等行业,将带来大量的经济效益。

202X年7月国务院印发了《新一代人工智能发展计划的通知》,提出三步走战略,到2030年我国人工智能核心产业规模到达1万亿元,带动相干产业规模到达10万亿元。

同时,腾讯、阿里和百度均设立了人工智能的研究中心,期望占据技术研发的制高点。

可见,中国有庞大的传统产业基础。

如何让AI这门技术更好地改造更多传统产业,是各个领域的从业者需要摸索的问题。

人工智能专业简介人工智能,即AI(ArTIficial Intelligence),是一门包含运算机、控制论、信息论、神经生理学、心理学、语言学等综合学科。

A Probabilistic Model for Phonocardiograms Segmentation Based on Homomorphic Filtering

A Probabilistic Model for Phonocardiograms Segmentation Based on Homomorphic Filtering
x (t ) = a (t ) ⋅ f (t ) a(t ) > 0
(1)
We denote:
ˆ (t ) = ln x(t ) = ln a(t ) + ln f (t ) . x In cases where x(t ) = 0 we add a small positive value, and then we have ˆ (t ) = ln a(t ) + ln f (t ) . x
BIOSIGNALModel for Phonocardiograms Segmentation Based on Homomorphic Filtering
Gill D 1, Intrator N 2, Gavriely N3 1 Department of Statistics, The Hebrew University of Jerusalem, Jerusalem 91905, Israel, 2 School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel, 3Rappaport Medicine Faculty, Technion IIT, Haifa 31096, Israel gill@mta.ac.il Abstract. This work presents a novel method for automatic detection and identification of heart sounds. Homomorphic filtering is used to obtain a smooth envelogram of the phonocardiogram, which enables a robust detection of events of interest in heart sound signal. Sequences of features extracted from the detected events are used as observations of a hidden Markov model. It is demonstrated that the task of detection and identification of the major heart sounds can be learned from unlabelled phonocardiograms by an unsupervised training process and without the assistance of any additional synchronizing channels.

An introduction to machine learning and graphical

An introduction to machine learning and graphical
T Y N
4
Key issue: generalization
yes
no
?
?
Can’t just memorize the training set (overfitting)
5
Hypothesis spaces
Decision trees Neural networks K-nearest neighbors Naïve Bayes classifier Support vector machines (SVMs) Boosted decision stumps …
2
Supervised learning
yes
no
Color
Shape
Size
Output
Blue
Torus
Big
Y
Blue
Square
Small
Y
Blue
Star
Small
Y
Red
Arrow
Small
N
Learn to approximate function F(x1, x2, x3) -> t
Boosting maximizes the margin
13
Supervised learning success stories
Face detection Steering an autonomous car across the US Detecting credit card fraud Medical diagnosis …
6
Perceptron (neural net with no hidden layers)
Linearly separable data

Learning spatiotemporal models from training examples

Learning spatiotemporal models from training examples
2 0 1
~ C =b I+b
0
1
2
Hence the linear system of equations is decoupled into kn independent 2nd order di erential equations. 2
2.1 State space metric
The object to be modelled is assumed to have a constant (uniform) density , and the mass matrix is calculated in the usual way by :M = R H (u)H (u)du = H
University of Leeds
SCHOOL OF COMPUTER STUDIES RESEARCH REPORT SERIES
Report 95.9
Learning Spatiotemporal Models From Training Examples
by
A M Baumberg & D C Hogg
1 Introduction
The application of physically based constraints allows di cult problems in computer vision to be solved by ensuring the system is overconstrained. These constraints are not necessarily based on real physical properties but merely motivated by the assumed physical nature of the problem. We are interested in accurately tracking a non-rigid deforming object. Pentland and Horowitz 1] describe a method for recovering non-rigid motion and structure by deriving physically based \free vibration" modes using the Finite Element Method (FEM). The method relies on making physical assumptions about the object, such as uniform distribution of mass and constant elasticity. The vibration modes are derived from the governing equation of the FEM nodal parametrisation. The mass and sti ness matrices in the governing equation are either known or derived from the physical assumptions. Physically based \modal analysis" has been used in a wide range of applications (e.g. Nastar and Ayache 2], 3]). The use of training information has been shown to be a powerful tool in computer vision and pattern recognition (e.g. to train neural networks). In the Point Distribution Model (PDM), Cootes and Taylor 4] utilise a set of static training shapes to derive a set of orthogonal \modes of variation". The training shapes can be accurately represented by a basis consisting of a subset of these vectors. The PDM has proven useful in model-based image interpretation (e.g. Cootes et al 5], Hill et al 6]) and in image sequence analysis (e.g. real-time contour tracking 7], robust tracking of deformable models 8]). However, one drawback of this approach is that there is no temporal aspect to the model. Hence it is not possible to extrapolate forward in time to get good estimates of the expected shape of the object. 1

《神经网络与机器学习》第3讲感知机与学习规则

《神经网络与机器学习》第3讲感知机与学习规则

《神经⽹络与机器学习》第3讲感知机与学习规则神经⽹络与机器学习第3章感知机与学习规则§3.1 感知机的学习规则上⼀节中,区分橘⼦和苹果,是我们⼈为地划分⼀个决策边界,即⼀个平⾯,感知器的权矩阵和偏置向量也是事先给定,这⾮常地不"智能"。

我们能否找到⼀种根据输⼊数据⾃动调整权矩阵和偏置向量的学习算法?如何设定学习规则?这样的学习规则肯定能找到⼀个决策边界吗?感知机给我们提供了⼀个数学上可解析的,⾮常易于⼈们理解的⼀类重要神经⽹络模型。

感知机和现在发展和应⽤的很多⽹络相⽐那是⾮常简单,功能有限,但是在历史发展中却不容忽视。

F. Rosenblatt , "The perceptron: A probabilistic model for information storage and organization in the brain,"Psychological Review, 65: 386-408, 1958.Rosenblatt在1958年引⼊了⼀种学习规则,⽤来训练感知机完成模式识别问题,随机地选择权系数初值,将训练样本集合输⼊到感知机,那么⽹络根据⽬标和实际输出的差值⾃动地学习,他证明只要最优权矩阵存在,那么学习规则肯定能够收敛到最优值,学习速度快速可靠。

学习规则:就是更新⽹络权系数和偏置向量的⽅法,也称为训练算法。

学习规则的分类:有监督学习(有教师学习)事先具有⼀个训练集合\{(p_1,t_1),(p_2,t_2),\cdots,(p_N,t_N)\}p_n表⽰的是⽹络输⼊,t_n是正确的⽬标(target),有时候分类⾥称为"标签"。

学习规则不断地调节⽹络权系数和偏置向量,使得⽹络输出和⽬标越来越接近。

感知机的学习是有监督学习。

(2)⽆监督学习没有可参考的⽬标,仅仅依赖⽹络输出调节⽹络权系数和偏置向量。

⽆监督学习的核⼼,往往是希望发现数据内部潜在的结构和规律,为我们进⾏下⼀步决断提供参考。

教育技术学专业英语词汇

教育技术学专业英语词汇

教育技术学专业英语词汇1. curriculum 课程计划2.pilot test 试行3. mechanism 机制4. Communication Theory 传播理论5.programmed instruction 程序教学6.Audiovisual communication 视听传播7.trial 尝试8.Formative evaluation 形成性评价9. Probabilistic 概率性10.Classroom-focus 以"课堂"为中心11.task analysis 任务分析12.verbalism 言语主义13.instructional systems d esign教学系统设计14.instructional technol ogy 教育技术15.performance 绩效16.utilization 利用17.digest 文摘18.syndication 聚合19.peotential 潜能ponent 元素21.stimuli 刺激22.encoding 编码23.Situated learning 情境学习24.advanced organizer 先行组织者25.25.Situated learning 情境学习26.Rand om access learning 随机进入学习27.Anchored learning 锚定式情境学习28.cognitive-d evel opment theory认知发展说29.Learning Psychology 学习理论30.Verbal-linguistic intelligence 语言智能31.Audience对象32.Behavior行为33.Condition条件34.Degree标准35.courses 课程36.36.Perception stage认知阶段rmation processing theory 信息加工理论38.Expressive Objectives 表现性目标puter Supported Collaborative Learning 计算机支持的协作学习40.Evaluation instrument d eveloping 评价工具的编制41.elaboration 细化42.metacognition 元认知43.retrieval 重视44.schema 图式45.channel 信道46.interactional dynamics互动动态47. 47.interpersonal communication 人际传播48. 48.signal 信号49.transmitter 传送者50.mass communication 大众传播51. 51.internal and external learning conditions 学习的内外部条件cational objective 教育目标53.53.electronic support system 电子支持系统54.54.event of instruction 教学事件55.expert instruction 专家教学56. 
56.individualized learning 个性化学习57.57.intellectual skill 智慧技能58.learning theorist 学习理论家59.level of cognitive performance 认知行为水平60.Responsive Mod el 应答模式61.the null curriculum 空无课程62.Collaborative Learning协作学习63.IT in education教育信息化rmation and Communications Technol ogy信息与通信技术rmation Literacy 信息素养puter Literracy 计算机文化素养67.Learning Contract 学习契约68.Problem-Based Learning(PBL)基于问题的学习69.verbal information 言语信息70.spyware 间谍软件71.motion 电影pact Disk 光盘73.MTV 音乐电视74.satellite broad cast 卫星广播75.World Wid e Web 万维网76.76.microprocessor 微处理器77.cabl e television systems 有线电视系统78.fiber-optics transmission 光纤传输79.artificial reality 人工现实80.Artificial Intelligence 人工智能81.fiber 光纤82.keyboard 键盘83.mobile phone 移动电话84.virtual reality 虚拟现实85.wireless personal area network 个人无线局域网puter-Mediated Communication 计算机媒介沟通87.Concept Maps 概念图88.Thinking Maps思维导图89.integration 整合90.Performance Assessment 绩效评估91.91.mainframe 主机92.ved eodisk 视盘93.attribute of media 媒体特性94. 94.correspond ence 函授课程95.E-learning Portfolio 电子学档96.tacit knowledge 隐性知识97.explict knowledge显性知识98.Knowledge management 知识管理99.Knowledge Evolution Theory知识进化理论100. evaluation instrument 评价工具101. anchor point 锚点102. instructional material 教学材料103. learning experience 学习体验104. organizational behavior 组织行为105. performance support 绩效支持106. specialized 专用的107. systematic 系统的108. stated objective 既定的目标109. Blend ed learning 混合学习110. Virtual Learning Companion System 虚拟学伴系统111. Integrated Ware 积件112. Group Ware 群件113. Imagination 构想性114. summative evaluation 总结性评价115. Authentic Assessment真实性评价116. Scaffold Learning “支架式”学习117. knowledge object 知识对象118. AID systems 教学设计自动化系统119. analysis phase 分析阶段120. d elivery d omain 传送领域121. instructional d elivery 教学传递122. knowledge management system 知识管理系统123. Automated Instructional Design 自动化教学设计124. information explosion 信息爆炸125. Information Age 信息时代126. self-managed 自我管理127. well-trained 受过良好培训的128. E-learning 数字化学习129. WebQuest 网络探究学习130. experimental group 实验组131. case study 案例研究132. 
behavioral 行为的133. cognitive 认知的134. subject matter 主题135. postmod ern 后现代的136. hypothesis 假设137. holistic 整体的138. illogical 不合逻辑的139. complexity and interd epend ence 复杂性和相互依赖性140. receiver 接受者141. andragogy 成人教育学142. information-processing theory 信息加工理论143. retrieval 重现144. sensory 感觉器官145. slid e 幻灯片146. taxonomy 分类法147. transfer 迁移148. objectivism 客观主义149. research and d evelopment 研究与开发150. communication 传播.。

(Linyuan)Link prediction in complex networks-A survey

(Linyuan)Link prediction in complex networks-A survey

Author's personal copy
Physica A 390 (2011) 1150–1170
Contents lists available at ScienceDirect
Physica A
journal homepage: /locate/physa
Article history: Received 5 October 2010 Received in revised form 10 November 2010 Available online 2 December 2010 Keywords: Link prediction Complex networks Node similarity Maximum likelihood methods Probabilistic models
article
info
abstract
Link prediction in complex networks has attracted increasing attention from both physical and computer science communities. The algorithms can be used to extract missing information, identify spurious interactions, evaluate network evolving mechanisms, and so on. This article summaries recent progress about link prediction algorithms, emphasizing on the contributions from physical perspectives and approaches, such as the random-walkbased methods and the maximum likelihood methods. We also introduce three typical applications: reconstruction of networks, evaluation of network evolving mechanism and classification of partially labeled networks. Finally, we introduce some applications and outline future challenges of link prediction algorithms. © 2010 Elsevier B.V. All rights reserved.

PRML笔记-Notes on Pattern Recognition and Machine Learning

PRML笔记-Notes on Pattern Recognition and Machine Learning

PRML 说的 Bayesian 主要还是指 Empirical Bayesian。
Optimization/approximation Linear/Quadratic/Convex optimization:求线性/二次/凸函数的最值 Lagrange multiplier:带(等式或不等式)约束的最值 Gradient decent:求最值 Newton iteration:解方程 Laplace approximation:近似 Expectation Maximation (EM):求最值/近似,latent variable model 中几乎无处不在 Variational inference:求泛函最值 Expectation Propagation (EP):求泛函最值 MCMC/Gibbs sampling:采样
prml笔记notespatternrecognitionmachinelearningbishopversion10jianxiao目录checklistprobabilitydistribution10chapterlinearmodels14chapterlinearmodels19chapterneuralnetworks26chapterkernelmethods33chaptersparsekernelmachine39chaptergraphicalmodels47chaptermixturemodels53chapter10approximateinference58chapter11samplingmethod63chapter12continuouslatentvariables68chapter13sequentialdata72chapter14combiningmodelsiamxiaojiangmailcomchecklistfrequentist版本frequentistbayesian对峙构成的主要内容bayesian版本解模型所用的方法linearbasisfunctionregressionbayesianlinearbasisfunctionregression前者和后者皆有closedformsolutionlogisticregressionbayesianlogitsticregression前者牛顿迭代irls后者laplaceapproximationneuralnetworkregressionclassificationbayesianneuralnetworkregressionclassification前者gradientdecent后者laplaceapproximationsvmregressionclassificationrvmregressionclassification前者解二次规划后者迭代laplaceapproximationgaussianmixturemodelbayesiangaussianmixturemodel前者用em后者variationinferencceprobabilisticpcabayesianprobabilisticpca前者closedformsolutionem后者laplaceapproximationh

概率图模型——精选推荐

概率图模型——精选推荐

概率图模型过去的⼀段时间⾥,忙于考试、忙于完成实验室、更忙于过年,很长时间没有以⼀种良好的⼼态来回忆、总结⾃⼰所学的东西了。

这⼏天总在想,我应该怎么做。

后来我才明⽩,应该想想我现在该做什么,所以我开始写这篇博客了。

这将是对概率图模型的⼀个很基础的总结,主要参考了《PATTERN RECOGNITION and MACHINE LEARNING》。

看这部分内容主要是因为中涉及到了相关的知识。

概率图模型本⾝是值得深究的,但我了解得不多,本⽂就纯当是介绍了,如有错误或不当之处还请多多指教。

0. 这是什么?很多事情是具有不确定性的。

⼈们往往希望从不确定的东西⾥尽可能多的得到确定的知识、信息。

为了达到这⼀⽬的,⼈们创建了概率理论来描述事物的不确定性。

在这⼀基础上,⼈们希望能够通过已经知道的知识来推测出未知的事情,⽆论是现在、过去、还是将来。

在这⼀过程中,模型往往是必须的,什么样的模型才是相对正确的?这⼜是我们需要解决的问题。

这些问题出现在很多领域,包括模式识别、差错控制编码等。

概率图模型是解决这些问题的⼯具之⼀。

从名字上可以看出,这是⼀种或是⼀类模型,同时运⽤了概率和图这两种数学⼯具来建⽴的模型。

那么,很⾃然的有下⼀个问题1. 为什么要引⼊概率图模型?对于⼀般的统计推断问题,概率模型能够很好的解决,那么引⼊概率图模型⼜能带来什么好处呢?LDPC码的译码算法中的置信传播算法的提出早于因⼦图,这在⼀定程度上说明概率图模型不是⼀个从不能解决问题到解决问题的突破,⽽是采⽤概率图模型能够更好的解决问题。

《模式识别和机器学习》这本书在图模型的开篇就阐明了在概率模型中运⽤图这⼀⼯具带来的⼀些好的性质,包括1. They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.2. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.3. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.简⽽⾔之,就是图使得概率模型可视化了,这样就使得⼀些变量之间的关系能够很容易的从图中观测出来;同时有⼀些概率上的复杂的计算可以理解为图上的信息传递,这是我们就⽆需关注太多的复杂表达式了。

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

Austin I. Eliazar ELIAZAR@
Ronald Parr PARR@
Department of Computer Science, Duke University, Durham, NC, 27708, USA

Abstract

Machine learning methods are often applied to the problem of learning a map from a robot's sensor data, but they are rarely applied to the problem of learning a robot's motion model. The motion model, which can be influenced by robot idiosyncrasies and terrain properties, is a crucial aspect of current algorithms for Simultaneous Localization and Mapping (SLAM). In this paper we concentrate on generating the correct motion model for a robot by applying EM methods in conjunction with a current SLAM algorithm. In contrast to previous calibration approaches, we not only estimate the mean of the motion, but also the interdependencies between motion terms, and the variances in these terms. This can be used to provide a more focused proposal distribution to a particle filter used in a SLAM algorithm, which can reduce the resources needed for localization while decreasing the chance of losing track of the robot's position. We validate this approach by recovering a good motion model despite initialization with a poor one. Further experiments validate the generality of the learned model in similar circumstances.

1. Introduction

Advances in the areas of robot localization and Simultaneous Localization and Mapping (SLAM) have come a long way towards bringing the prospect of truly autonomous robot operation closer to reality (Thrun, 2002). With these techniques, mobile robots can create maps and position themselves in mapped environments with low risk of getting lost. However, an infrequently discussed but important input into these methods is the set of parameters for the robot's motion model. This model is provided to the robot by a human, based upon a combination of intuitions about

(Footnote 1: In robotics, one typically treats the odometry as the control input, since it is possible to specify control inputs to the robot in terms of target odometry.)

A particle filter is a Monte Carlo method for estimating
and propagating a probability distribution through a Markov model. We briefly review particle filters here, but refer the reader to excellent overviews of this topic (Doucet et al., 2001) and its application to robotics (Thrun, 2000) for a more complete discussion.

A particle filter maintains a weighted (and normalized) set of sampled states S = {(s_1, w_1), ..., (s_m, w_m)}, called particles. At each step, upon observing evidence o, the particle filter:

1. Samples m new states from S, the weighted set of particles, with replacement.
2. Propagates each new state through a Markovian transition (or simulation) model, P(s_{t+1} | s_t). This entails sampling a new state from the conditional distribution over next states given the sampled previous state.
3. Weighs each new state according to a Markovian observation model: w' ∝ P(o | s').
4. Normalizes the weights for the new set of states.

Particle filters are easy to implement and have been used to track multimodal distributions for many practical problems (Doucet et al., 2001). For robot localization, the transition distribution P(s_{t+1} | s_t) comes from the robot's motion model. (In full generality, the distribution from which next states are sampled is referred to as the proposal distribution, because it need not match the transition distribution.) The observation probabilities P(o | s) come from a combination of the robot's sensor model, typically laser or sonar, and a map.

While the SLAM algorithm itself is not the focus of this paper, SLAM involves an extra step beyond that which is performed for localization. For SLAM, the map becomes part of the state that is estimated at each iteration. The particles therefore represent a joint distribution over maps and robot states (Cheeseman et al., 1990; Montemerlo & Thrun, 2002; Eliazar & Parr, 2004). The problem of efficiently maintaining this distribution is the source of much of the subtlety in SLAM algorithm research.

We used the DP-SLAM 2.0 algorithm (Eliazar & Parr, 2004) to construct our maps. DP-SLAM 2.0 is well suited to this problem because it is able to maintain a joint distribution over maps and robot positions by very efficiently maintaining large sets of particles. With
a good motion model, DP-SLAM 2.0 is accurate enough to close large loops without any explicit map correction techniques. For these reasons, we chose to treat the set of particles maintained by DP-SLAM as a good representation of the probability distribution at any time step.

(Footnote 2: This does not strictly hold for poor initial models, and we discuss our unusually good performance in such cases in the conclusion.)

Our approach also captures interdependencies between the motion terms, such as the effect of turns on lateral movement, and vice-versa. Furthermore, the proposed method extends the scope of the calibration beyond the systematic errors dealt with in previous methods. We believe that great gains in performance can be achieved by estimating the non-systematic errors, through variance in the different movement terms. This can be crucial to the motion model of SLAM methods, as different amounts of noise in the movement terms can produce vastly different proposal distributions (Burgard et al., 1999). A properly calibrated set of variance parameters will provide the localization algorithm with a more appropriate proposal distribution, allowing it to better focus its resources on the most likely poses for the robot.

The algorithm for learning the motion model is integrated with a SLAM algorithm, giving increased autonomy to the system. The robot now has the potential to learn the most appropriate model based upon recent experiences, and in direct conjunction with its current task. This is especially useful as the robot's motion model will change over time, both from changes in the terrain and from general wear on the robot. It is also important that this calibration method can be performed in a remote location, without the need for external sensors to measure the robot's true motion. A rover landing on another planet with unknown surface conditions would be an obvious application of this approach.
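The sample-propagate-weigh-normalize update described above, driven by an odometry-conditioned Gaussian motion model, can be sketched in a few lines of Python. This is a minimal illustration under assumed names and a deliberately simplified noise model (gain and standard-deviation parameters scaled by the reported odometry), not DP-SLAM's actual implementation; in the EM setting of this paper, the maximization step would re-fit these gain and variance parameters from the weighted particle trajectories, whereas here they are fixed constants.

```python
import math
import random

class GaussianMotionModel:
    """Illustrative odometry-conditioned motion model (an assumption, not the
    paper's exact parameterization): actual travel and turn are Gaussian, with
    means and standard deviations scaled by the reported odometry."""

    def __init__(self, d_gain=1.0, d_std=0.1, t_gain=1.0, t_std=0.05):
        self.d_gain, self.d_std = d_gain, d_std
        self.t_gain, self.t_std = t_gain, t_std

    def sample(self, pose, d_odo, t_odo):
        x, y, theta = pose
        # Noise magnitude scales with reported motion; the small floor term
        # mirrors the minimum-noise fix for zero-odometry time steps.
        d = random.gauss(self.d_gain * d_odo, self.d_std * abs(d_odo) + 1e-3)
        t = random.gauss(self.t_gain * t_odo, self.t_std * abs(t_odo) + 1e-3)
        heading = theta + t / 2.0  # move along the averaged facing angle
        return (x + d * math.cos(heading), y + d * math.sin(heading), theta + t)

def particle_filter_step(particles, weights, model, d_odo, t_odo, likelihood):
    """One sample-propagate-weigh-normalize update (steps 1-4)."""
    n = len(particles)
    resampled = random.choices(particles, weights=weights, k=n)   # 1. sample
    moved = [model.sample(p, d_odo, t_odo) for p in resampled]    # 2. propagate
    raw = [likelihood(p) for p in moved]                          # 3. weigh
    total = sum(raw)
    return moved, [w / total for w in raw]                        # 4. normalize
```

Note that `likelihood` stands in for the sensor-model-plus-map term P(o | s); in a full SLAM system it would score a laser scan against each particle's own map.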
With this view in mind,we can identify two categories of hidden variables in our problem formulation.We are at-tempting to learn both the map of the environment and the set of motion model parameters that describe stochastic re-lationship between the odometry and the actual movement of the robot.To estimate the parameters of this model,we propose using an EM algorithm:The expectation step is provided by a SLAM algorithm,implemented with some initial motion model parameters.The possible trajectories postulated are then used in the maximization step to create a set of parameters which best describe the motions repre-sented by these trajectories.3.Motion Model DetailsLet the robot’s pose at any given time step be represented as ,where is the facing angle of the robot.The motion model then seeks to determine,where is the robot’s pose one time step in the future,andis the amount of lateral and rotational movement(respec-tively)that odometry has reported over that time interval. Roy and Thrun(1999)propose the following model:Here,is the actual distance traveled by the robot,andis the actual turn performed.This is correct only if the turn and drive commands are performed independently,a simplifying assumption which even their own experiments violate.A simple improvement to account for simultaneous turning and lateral movement would be:This model assumes that the turning velocity of the robot is constant throughout the time step,and that the robot can only move in the direction it is facing.These improved equations do not take into account that even in this case,the distance traveled will actually be an arc,and not a straight line.However,when T is reasonably small,this error is minor and can be absorbed as part of the noise.A better model would take into account the ability of the robot to move in a direction that is not solely determined by the beginning and end facing angle of the robot.Such a model would be able to account for variable speed turns and sideways 
shifts,both of which have been apparent with our robots,even on the best of surfaces:Here is the true movement angle of the robot.In this method,the direction of movement has been expressed sep-arately from and,which permits movement in a direc-tion distinct from the facing angle of the robot.In practice it is often difficult to determine this independently from and,but with some robots,the shaft encoders on each wheel can be read independently,and can give a more di-rect observation of this parameter.Even in the rare cases where it might be possible to ob-serve,it would be very difficult to develop a good noise model.Representing the noise in as a Gaussian would require some choice for a mean.For a robot which can per-form holonomic turns,the lateral shift of the robot could very easily be in any direction,while the lateral movement reported would be negligible.In this case,would more accurately be modeled as a uniform distribution.For these reasons,we prefer a slightly different model that decom-poses the movement into two principle components:We approximate withtion term,which is present to model shift in the orthogonal direction to the major axis,which we call the minor axis. 
This axis is at angle3There is nothing special about the left-hand choice.where is an diagonal weight matrix with diago-nal element asat one degree increments along a semi-circle at a height of 7cm from thefloor.Figure 1.A complete loop of hallway,generated using a naive motion model.The robot starts at the top left and moves coun-terclockwise.Each pixel in this map represents 3cm in the envi-ronment.The total path length is approximately 60meters.White areas are unexplored.Shades between gray and black indicate increasing probability of an obstacle.In these experiments,we noticed a small anomaly where laser readings sometimes changed in a manner implying motion,when no changes were reported in odometry.This could possibly be caused by readings from the laser range finder not being perfectly synchronized with the readings from the odometers,or some other anomaly in our robot or data collection technique.Since the motion model de-scribed is directly dependent on the magnitude of reported motion,the variance in these situations would be zero,and the SLAM algorithm would have no ability to recover the correct motion for that time step.To handle this problem,we found it necessary to set a minimum amount of noise that must be present at each time step.These levels were small (variances less than 2cm along the major axis and lessthanin facing),and the model exceeded these variance levels in all but a few time steps.The first experiment demonstrates the ability of the pro-posed method to calibrate the motion model parameters for a robot with little or no previous knowledge of the environ-ment.The robot is driven around an indoor test environ-ment,eventually completing a loop of hallway,while col-lecting data from its sensors and odometers.Note that the completion of a loop is not necessary for either the SLAM algorithm or the learning method,but merely serves to help illustrate the quality of the map at each EM iteration.The motion model is set initially with no 
systematic biases,but high variances.Figure 1shows the highest probability map produced at the end of the first run of EM.The resulting map has the right general shape,but in the top left area where the robot returns to its starting position there is asignificant error in the map,resulting in double walls.A closeup of this region is shown in Figure 2.After three EM iterations,the model parameters are refined to the point where the SLAM algorithm successfully closes the loop without any blemishes in the map.A closeup of the same area is shown in Figure3.Figure 2.Close up of the area where the loop is closed,using the naive motion model.Double walls reflect an accumulated error of approximately one half meter over the path of therobot.Figure 3.Close up of the same area as Figure 2,using the learned motion model learned by EM.One concern that we had when learning a motion model was the possibility of overfitting the specific trajectory that was supplied to the SLAM algorithm.We would like the learned parameters to be tuned to the properties of the robot and environment,but not the quirks of individual data col-lection runs,since it would be it would be inefficient and contrary to the spirit of SLAM to learn a new motion model with EM every time that the robot is redeployed.To ver-ify this generality of the motion model,we used one run of the robot to learn the parameters in the same indoor envi-ronment as before.Then,using this set of learned motionparameters,we had the robot remap the same environment using data collected from several days later.The resulting map shown in Figure 4is the same high quality as if we had learned the motion model directly from the second tra-jectoryitself.Figure 4.Map created using the motion model learned from one sensor log,applied to a different log generated several days later.Figure 5.Map resulting from the first iteration of EM initialized with an out-of-date motion model.This is a close up of the area where the loop is closed.The 
environment is the same as shown for in other experiments,with a top,center starting position.A strong test of the robustness of our method is its ability to recover from a poor motion model.This is also impor-tant to the applicability of the method since changes in the environment or in the robot itself can cause the appropriate motion model to change.In this experiment,we used a set of data collected in an office environment to learn a good motion model.We then tested the model using a second set of data collected in the same area,but with approximately a year of time separating the two data sets.When attempting to use the same model on the second data set,we quickly notice that the map produced by the SLAM algorithm is obviously flawed where it attempts to complete the loop (Figure 5).A year of use and some rough handling during shipping caused significant wear in the robot and changes in its behavior,resulting in an altered motion model.In the next iteration (Figure 6),the learned motion model can be seen to be improving the quality of the map as a resultof increased accuracy.Figure 7depicts the map from the next and final iteration,where the two ends of the loop are seamlessly aligned.Figure 6.Second iteration of EM.Figure 7.Final iteration of EM.Most of our results are visual or anecdotal,since the ac-tual parameters would be fairly meaningless to all but those very familiar with this model of robot.However,in this ex-periment the difference in variances is particularly telling.Predictably,the variances have all increased significantly over the course of the year,as wear on the robot has caused movements to become more erratic.The most significantof these is theterm,which changes from ,indicating significantly more erratic lateral shiftsduring turns,a result consistent with our observations of the robot in action.A graphical depiction of the change in distributions is shown in (Figure 8).This graph shows the first standard de-viation of the noise in the major 
and minor axes (along the x and y axes respectively)for a single unit of lateral motion.In the progression from the first iteration to the second,the mean is shifted significantly,while the variances decrease.The third iteration shows a negligible shift in the mean,but the variance along the major axis experiences a large in-crease.In comparing the difference of coverage between the first and the third iteration,it is clear that a dramati-cally larger number of particles would be needed for the first distribution in order to cover high probability regions of the third distribution effectively.We also performed an experiment on a smaller segment of sensor data.We wanted to use a section with a signifi-cant amount of both lateral motion and turning within its trajectory,so we chose an area consisting of two corners−0.5−0.4−0.3−0.2−0.100.10.20.30.40.5−0.5−0.4−0.3−0.2−0.100.10.20.30.40.5Initial ModelSecond IterationFinal IterationFigure 8.A plot of the first standard deviation for the third exper-iment.The major axis of motion is plotted along the x-axis,and the minor axis is plotted along the y-axis,with results shown for a single unit of lateral motion.connected by one lateral stretch of hallway,for a total of approximately one quarter of a full sensor log.The initial model provided was the same naive model presented in the first experiment.We performed this experiment to verify the ability to learn a model with less information and to il-lustrate that traversing a loop is not necessary for accurate performance of the learning method.This experiment took ten iterations to converge while those based upon full sen-sor logs typically took less than five.However,the final motion model parameters upon convergence of EM were accurate enough to result in seamless mapping when the algorithm was presented with a full sensor log.The result-ing maps are indistinguishable from those produced with models learned from full sensor logs and are not shown.To determine the effect of 
terrain type on the motion model we learned motion models for three different surfaces:car-pet,tile and concrete.We used the same robot in each ex-periment,and all information was gathered within the pe-riod of a day,to minimize the effects of wear on the mo-tion model.The models were then learned on a trajectory at least 20m in length,and containing at least 90degrees of rotation.Plots showing a single standard deviation for one unit of lateral (x axis)motion for each of these motion models are shown in Figure 9.The results suggest that the greater friction of a concrete surface reduces wheel slip.However,there is also greater drift along the minor axis of movement.This is consistent with the observed behavior of the robot,but we have not diagnosed the exact cause.One possibility could be slants in sections of concrete traversed by the robot.Finally,we considered the possibility of using very large variances and a large number of particles asan alterna-tive to learning a good model.The problem with this ap-Figure 9.Motion models learned for different terrains.proach is that adequately covering the configuration space of the robot when the model parameters have high variance is quite difficult.Even with 25times as many samples as our refined models,our naive model was unable to produce seamless maps.6.ConclusionBy using an EM learning algorithm coupled with a flexible motion model,we have shown how existing SLAM algo-rithms can greatly improve their performance by refining the stochastic motion model.We believe that our approach is the first to capture both systematic errors and variance in odometry.This technique has the potential to significantly increase the autonomy of mobile robots by eliminating the human effort required to produce a motion model.The results presented show that the method is capable of learning accurate motion models with very little user in-put.Beginning with a general,naive set of motion param-eters,we demonstrated the ability to refine the 
model to be significantly more accurate. In addition, this model was shown to be generally applicable in similar environments. Furthermore, when presented with an incorrect model, the proposed method quickly adapted, and was able to successfully learn more appropriate parameters. Finally, we demonstrated the power of this method to learn a good model that is applicable to a large area when presented with data from only a small piece of this area.

This research was motivated by our own practical considerations after investing significant amounts of time and effort into hand tuning appropriate motion models for our different robots and test environments. Beyond saving time and resources, this method was also inspired by practical concerns of remote deployment for robots. The ability to learn a complete motion model using only onboard sensors is crucial for isolated robots in unknown environments, or ones which suffer malfunction in the field.

One surprising aspect of this approach was its ability to learn good motion models despite initially poor models that lead to poor maps. One might expect that little could be learned from SLAM runs that result in poor maps, since poor maps imply that the particle trajectories have failed to capture the true motion of the robot. We speculate that such poor runs still tend to contain useful information that can push the model parameters in the right direction. The reasons for this are twofold. First, although SLAM posterior distributions are not unimodal, they often appear to have strong peaks surrounded by relatively shallow local optima. Thus, on poor runs, the trajectories that are closer to truth will often still tend to have higher probability. Second, poor runs typically assign particles far from the mean higher probability, which has the effect of increasing variance on the next iteration. On runs which are initialized with poor models, we have observed an "expand-contract" pattern in the model parameters, where the variance grows until the mean
is well covered and then contracts to reflect the true variance given the correct mean.

We recognize that the calibration technique described here would not be efficient to run at all times. Instead, it would best be run at intervals when the motion has notably changed from the existent model. In the future, we would like to develop principled methods for automatically determining when a motion model needs to be updated.

This method was developed using an assumption that the true motion of the robot can be described as the sum of two independent normal distributions, arising from rotational movement and lateral movement. We would like to investigate relaxing this assumption, and allow the motion to be described by a single multivariate Gaussian with a full covariance matrix. Through learning the complete set of parameters for this multivariate distribution, we could determine to what degree this assumption of independence is valid.

In experiments, we noticed that the amounts of noise present at certain time steps were significantly lower than at others, due to differing amounts of motion. Thus we were using a number of particles consistent with sufficient coverage for the greater noise, even during time steps where the noise, and thus the necessary number of particles, was much lower. A great practical speedup might be achieved by understanding how many particles are required given the noise predicted by the model. The SLAM algorithm could then use a variable number of particles depending on the amount of noise currently present. This function could be significantly dependent on the type of environment that the robot is sensing, and could require further machine learning techniques, but it is definitely an avenue of further research which could yield useful results.

Our emphasis in this paper has been the development of motion models for SLAM algorithms that use odometry and sensor data to produce densely populated maps. We believe that a similar approach could be applied to landmark-based
SLAM algorithms, or to robots that use GPS data instead of odometry as an initial measure of robot motion.

Acknowledgements

This work was supported in part by the National Science Foundation, the Sloan Foundation, and SAIC.

References

Borenstein, J., & Feng, L. (1994). UMBmark - A method for measuring, comparing, and correcting dead-reckoning errors in mobile robots (Technical Report UM-MEAM-94-22). University of Michigan.

Burgard, W., Cremers, A., Fox, D., Hähnel, D., Lakemeyer, G., Schulz, D., Steiner, W., & Thrun, S. (1999). Experiences with an interactive museum tour-guide robot. Artificial Intelligence, 114, 3-55.

Cheeseman, P., Smith, P., & Self, M. (1990). Estimating uncertain spatial relationships in robotics. In Autonomous robot vehicles, 167-193. Springer-Verlag.

Doucet, A., de Freitas, N., & Gordon, N. (2001). Sequential Monte Carlo methods in practice. Berlin: Springer-Verlag.

Eliazar, A., & Parr, R. (2004). DP-SLAM 2.0. IEEE 2004 International Conference on Robotics and Automation (ICRA-04).

Montemerlo, M., & Thrun, S. (2002). Simultaneous localization and mapping with unknown data association using FastSLAM. IEEE 2002 International Conference on Robotics and Automation (ICRA-02).

Roy, N., & Thrun, S. (1999). Online self-calibration for mobile robots. IEEE 1999 International Conference on Robotics and Automation (ICRA-99).

Thrun, S. (2000). Probabilistic algorithms in robotics. AI Magazine, 21, 93-109.

Thrun, S. (2002). Robotic mapping: A survey. In G. Lakemeyer and B. Nebel (Eds.), Exploring artificial intelligence in the new millennium. Morgan Kaufmann.

Voyles, R., & Khosla, P. (1997). Collaborative calibration. IEEE 1997 International Conference on Robotics and Automation (ICRA-97).
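Appendix-style note: the independence assumption discussed in the conclusion can be illustrated with a short sketch. This is not the authors' implementation; all gains and covariance values below are hypothetical, chosen only to contrast sampling odometry noise as the sum of two independent normals (one scaled by lateral movement, one by rotational movement) with sampling from a single multivariate Gaussian whose full covariance matrix can also capture cross-axis correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_independent(d_lat, d_rot, n, k_lat=0.1, k_rot=0.05):
    # Each axis of the (x, y, theta) noise is the sum of two independent
    # zero-mean normals, one scaled by lateral motion and one by rotation
    # (hypothetical gains). Cross-axis covariance is zero by construction.
    scale = np.sqrt((k_lat * abs(d_lat)) ** 2 + (k_rot * abs(d_rot)) ** 2)
    return rng.normal(0.0, scale, size=(n, 3))

def sample_full_cov(d_lat, d_rot, n):
    # Relaxed model: one multivariate Gaussian over (x, y, theta) with a
    # full covariance matrix; the off-diagonal entries (illustrative
    # values) represent correlated slip the independent model assumes away.
    cov = (abs(d_lat) + abs(d_rot)) * np.array(
        [[0.020, 0.005, 0.001],
         [0.005, 0.030, 0.002],
         [0.001, 0.002, 0.010]])
    return rng.multivariate_normal(np.zeros(3), cov, size=n)

# Draw noise samples for 1 m of lateral motion and 90 degrees of rotation.
ind = sample_independent(1.0, np.pi / 2, 20000)
full = sample_full_cov(1.0, np.pi / 2, 20000)
c_ind = np.cov(ind.T)    # off-diagonal terms near zero
c_full = np.cov(full.T)  # off-diagonal terms near the specified values
```

Fitting the full-covariance model to data and checking how far its learned off-diagonal terms deviate from zero would quantify how well the independence assumption holds in practice.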
