Discriminative training of hidden Markov models for multiple pitch tracking


A Discriminatively Trained, Multiscale, Deformable Part Model


A Discriminatively Trained,Multiscale,Deformable Part ModelPedro Felzenszwalb University of Chicago pff@David McAllesterToyota Technological Institute at Chicagomcallester@Deva RamananUC Irvinedramanan@AbstractThis paper describes a discriminatively trained,multi-scale,deformable part model for object detection.Our sys-tem achieves a two-fold improvement in average precision over the best performance in the2006PASCAL person de-tection challenge.It also outperforms the best results in the 2007challenge in ten out of twenty categories.The system relies heavily on deformable parts.While deformable part models have become quite popular,their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge.Our system also relies heavily on new methods for discriminative training.We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM.A latent SVM,like a hid-den CRF,leads to a non-convex training problem.How-ever,a latent SVM is semi-convex and the training prob-lem becomes convex once latent information is specified for the positive examples.We believe that our training meth-ods will eventually make possible the effective use of more latent information such as hierarchical(grammar)models and models involving latent three dimensional pose.1.IntroductionWe consider the problem of detecting and localizing ob-jects of a generic category,such as people or cars,in static images.We have developed a new multiscale deformable part model for solving this problem.The models are trained using a discriminative procedure that only requires bound-ing box labels for the positive ing these mod-els we implemented a detection system that is both highly efficient and accurate,processing an image in about2sec-onds and achieving recognition rates that are significantly better than previous systems.Our system achieves a two-fold improvement in average precision over the winning system[5]in the2006PASCAL person detection challenge.The system also outperforms the best results in the2007challenge in ten out of twenty This material is based upon work supported by the National Science Foundation under Grant No.0534820and0535174.Figure1.Example detection obtained with the person model.The model is defined by a coarse template,several higher resolution part templates and a spatial model for the location of each part. object categories.Figure1shows an example detection ob-tained with our person model.The notion that objects can be modeled by parts in a de-formable configuration provides an elegant framework for representing object categories[1–3,6,10,12,13,15,16,22]. While these models are appealing from a conceptual point of view,it has been difficult to establish their value in prac-tice.On difficult datasets,deformable models are often out-performed by“conceptually weaker”models such as rigid templates[5]or bag-of-features[23].One of our main goals is to address this performance gap.Our models include both a coarse global template cov-ering an entire object and higher resolution part templates. The templates represent histogram of gradient features[5]. As in[14,19,21],we train models discriminatively.How-ever,our system is semi-supervised,trained with a max-margin framework,and does not rely on feature detection. 
We also describe a simple and effective strategy for learn-ing parts from weakly-labeled data.In contrast to computa-tionally demanding approaches such as[4],we can learn a model in3hours on a single CPU.Another contribution of our work is a new methodology for discriminative training.We generalize SVMs for han-dling latent variables such as part positions,and introduce a new method for data mining“hard negative”examples dur-ing training.We believe that handling partially labeled data is a significant issue in machine learning for computer vi-sion.For example,the PASCAL dataset only specifies abounding box for each positive example of an object.We treat the position of each object part as a latent variable.We also treat the exact location of the object as a latent vari-able,requiring only that our classifier select a window that has large overlap with the labeled bounding box.A latent SVM,like a hidden CRF[19],leads to a non-convex training problem.However,unlike a hidden CRF, a latent SVM is semi-convex and the training problem be-comes convex once latent information is specified for thepositive training examples.This leads to a general coordi-nate descent algorithm for latent SVMs.System Overview Our system uses a scanning window approach.A model for an object consists of a global“root”filter and several part models.Each part model specifies a spatial model and a partfilter.The spatial model defines a set of allowed placements for a part relative to a detection window,and a deformation cost for each placement.The score of a detection window is the score of the root filter on the window plus the sum over parts,of the maxi-mum over placements of that part,of the partfilter score on the resulting subwindow minus the deformation cost.This is similar to classical part-based models[10,13].Both root and partfilters are scored by computing the dot product be-tween a set of weights and histogram of gradient(HOG) features within a window.The rootfilter is equivalent to a Dalal-Triggs model[5].The features for the partfilters are computed at twice the spatial resolution of the rootfilter. Our model is defined at afixed scale,and we detect objects by searching over an image pyramid.In training we are given a set of images annotated with bounding boxes around each instance of an object.We re-duce the detection problem to a binary classification prob-lem.Each example x is scored by a function of the form, fβ(x)=max zβ·Φ(x,z).Hereβis a vector of model pa-rameters and z are latent values(e.g.the part placements). To learn a model we define a generalization of SVMs that we call latent variable SVM(LSVM).An important prop-erty of LSVMs is that the training problem becomes convex if wefix the latent values for positive examples.This can be used in a coordinate descent algorithm.In practice we iteratively apply classical SVM training to triples( x1,z1,y1 ,..., x n,z n,y n )where z i is selected to be the best scoring latent label for x i under the model learned in the previous iteration.An initial rootfilter is generated from the bounding boxes in the PASCAL dataset. The parts are initialized from this rootfilter.2.ModelThe underlying building blocks for our models are the Histogram of Oriented Gradient(HOG)features from[5]. 
We represent HOG features at two different scales.Coarse features are captured by a rigid template covering anentireImage pyramidFigure2.The HOG feature pyramid and an object hypothesis de-fined in terms of a placement of the rootfilter(near the top of the pyramid)and the partfilters(near the bottom of the pyramid). detection window.Finer scale features are captured by part templates that can be moved with respect to the detection window.The spatial model for the part locations is equiv-alent to a star graph or1-fan[3]where the coarse template serves as a reference position.2.1.HOG RepresentationWe follow the construction in[5]to define a dense repre-sentation of an image at a particular resolution.The image isfirst divided into8x8non-overlapping pixel regions,or cells.For each cell we accumulate a1D histogram of gra-dient orientations over pixels in that cell.These histograms capture local shape properties but are also somewhat invari-ant to small deformations.The gradient at each pixel is discretized into one of nine orientation bins,and each pixel“votes”for the orientation of its gradient,with a strength that depends on the gradient magnitude.For color images,we compute the gradient of each color channel and pick the channel with highest gradi-ent magnitude at each pixel.Finally,the histogram of each cell is normalized with respect to the gradient energy in a neighborhood around it.We look at the four2×2blocks of cells that contain a particular cell and normalize the his-togram of the given cell with respect to the total energy in each of these blocks.This leads to a vector of length9×4 representing the local gradient information inside a cell.We define a HOG feature pyramid by computing HOG features of each level of a standard image pyramid(see Fig-ure2).Features at the top of this pyramid capture coarse gradients histogrammed over fairly large areas of the input image while features at the bottom of the pyramid capture finer gradients histogrammed over small areas.2.2.FiltersFilters are rectangular templates specifying weights for subwindows of a HOG pyramid.A w by hfilter F is a vector with w×h×9×4weights.The score of afilter is defined by taking the dot product of the weight vector and the features in a w×h subwindow of a HOG pyramid.The system in[5]uses a singlefilter to define an object model.That system detects objects from a particular class by scoring every w×h subwindow of a HOG pyramid and thresholding the scores.Let H be a HOG pyramid and p=(x,y,l)be a cell in the l-th level of the pyramid.Letφ(H,p,w,h)denote the vector obtained by concatenating the HOG features in the w×h subwindow of H with top-left corner at p.The score of F on this detection window is F·φ(H,p,w,h).Below we useφ(H,p)to denoteφ(H,p,w,h)when the dimensions are clear from context.2.3.Deformable PartsHere we consider models defined by a coarse rootfilter that covers the entire object and higher resolution partfilters covering smaller parts of the object.Figure2illustrates a placement of such a model in a HOG pyramid.The rootfil-ter location defines the detection window(the pixels inside the cells covered by thefilter).The partfilters are placed several levels down in the pyramid,so the HOG cells at that level have half the size of cells in the rootfilter level.We have found that using higher resolution features for defining partfilters is essential for obtaining high recogni-tion performance.With this approach the partfilters repre-sentfiner resolution edges that are localized to greater ac-curacy when 
compared to the edges represented in the root filter.For example,consider building a model for a face. The rootfilter could capture coarse resolution edges such as the face boundary while the partfilters could capture details such as eyes,nose and mouth.The model for an object with n parts is formally defined by a rootfilter F0and a set of part models(P1,...,P n) where P i=(F i,v i,s i,a i,b i).Here F i is afilter for the i-th part,v i is a two-dimensional vector specifying the center for a box of possible positions for part i relative to the root po-sition,s i gives the size of this box,while a i and b i are two-dimensional vectors specifying coefficients of a quadratic function measuring a score for each possible placement of the i-th part.Figure1illustrates a person model.A placement of a model in a HOG pyramid is given by z=(p0,...,p n),where p i=(x i,y i,l i)is the location of the rootfilter when i=0and the location of the i-th part when i>0.We assume the level of each part is such that a HOG cell at that level has half the size of a HOG cell at the root level.The score of a placement is given by the scores of eachfilter(the data term)plus a score of the placement of each part relative to the root(the spatial term), ni=0F i·φ(H,p i)+ni=1a i·(˜x i,˜y i)+b i·(˜x2i,˜y2i),(1)where(˜x i,˜y i)=((x i,y i)−2(x,y)+v i)/s i gives the lo-cation of the i-th part relative to the root location.Both˜x i and˜y i should be between−1and1.There is a large(exponential)number of placements for a model in a HOG pyramid.We use dynamic programming and distance transforms techniques[9,10]to compute the best location for the parts of a model as a function of the root location.This takes O(nk)time,where n is the number of parts in the model and k is the number of cells in the HOG pyramid.To detect objects in an image we score root locations according to the best possible placement of the parts and threshold this score.The score of a placement z can be expressed in terms of the dot product,β·ψ(H,z),between a vector of model parametersβand a vectorψ(H,z),β=(F0,...,F n,a1,b1...,a n,b n).ψ(H,z)=(φ(H,p0),φ(H,p1),...φ(H,p n),˜x1,˜y1,˜x21,˜y21,...,˜x n,˜y n,˜x2n,˜y2n,). We use this representation for learning the model parame-ters as it makes a connection between our deformable mod-els and linear classifiers.On interesting aspect of the spatial models defined here is that we allow for the coefficients(a i,b i)to be negative. This is more general than the quadratic“spring”cost that has been used in previous work.3.LearningThe PASCAL training data consists of a large set of im-ages with bounding boxes around each instance of an ob-ject.We reduce the problem of learning a deformable part model with this data to a binary classification problem.Let D=( x1,y1 ,..., x n,y n )be a set of labeled exam-ples where y i∈{−1,1}and x i specifies a HOG pyramid, H(x i),together with a range,Z(x i),of valid placements for the root and partfilters.We construct a positive exam-ple from each bounding box in the training set.For these ex-amples we define Z(x i)so the rootfilter must be placed to overlap the bounding box by at least50%.Negative exam-ples come from images that do not contain the target object. 
Each placement of the rootfilter in such an image yields a negative training example.Note that for the positive examples we treat both the part locations and the exact location of the rootfilter as latent variables.We have found that allowing uncertainty in the root location during training significantly improves the per-formance of the system(see Section4).tent SVMsA latent SVM is defined as follows.We assume that each example x is scored by a function of the form,fβ(x)=maxz∈Z(x)β·Φ(x,z),(2)whereβis a vector of model parameters and z is a set of latent values.For our deformable models we define Φ(x,z)=ψ(H(x),z)so thatβ·Φ(x,z)is the score of placing the model according to z.In analogy to classical SVMs we would like to trainβfrom labeled examples D=( x1,y1 ,..., x n,y n )by optimizing the following objective function,β∗(D)=argminβλ||β||2+ni=1max(0,1−y i fβ(x i)).(3)By restricting the latent domains Z(x i)to a single choice, fβbecomes linear inβ,and we obtain linear SVMs as a special case of latent tent SVMs are instances of the general class of energy-based models[18].3.2.Semi-ConvexityNote that fβ(x)as defined in(2)is a maximum of func-tions each of which is linear inβ.Hence fβ(x)is convex inβ.This implies that the hinge loss max(0,1−y i fβ(x i)) is convex inβwhen y i=−1.That is,the loss function is convex inβfor negative examples.We call this property of the loss function semi-convexity.Consider an LSVM where the latent domains Z(x i)for the positive examples are restricted to a single choice.The loss due to each positive example is now bined with the semi-convexity property,(3)becomes convex inβ.If the labels for the positive examples are notfixed we can compute a local optimum of(3)using a coordinate de-scent algorithm:1.Holdingβfixed,optimize the latent values for the pos-itive examples z i=argmax z∈Z(xi )β·Φ(x,z).2.Holding{z i}fixed for positive examples,optimizeβby solving the convex problem defined above.It can be shown that both steps always improve or maintain the value of the objective function in(3).If both steps main-tain the value we have a strong local optimum of(3),in the sense that Step1searches over an exponentially large space of latent labels for positive examples while Step2simulta-neously searches over weight vectors and an exponentially large space of latent labels for negative examples.3.3.Data Mining Hard NegativesIn object detection the vast majority of training exam-ples are negative.This makes it infeasible to consider all negative examples at a time.Instead,it is common to con-struct training data consisting of the positive instances and “hard negative”instances,where the hard negatives are data mined from the very large set of possible negative examples.Here we describe a general method for data mining ex-amples for SVMs and latent SVMs.The method iteratively solves subproblems using only hard instances.The innova-tion of our approach is a theoretical guarantee that it leads to the exact solution of the training problem defined using the complete training set.Our results require the use of a margin-sensitive definition of hard examples.The results described here apply both to classical SVMs and to the problem defined by Step2of the coordinate de-scent algorithm for latent SVMs.We omit the proofs of the theorems due to lack of space.These results are related to working set methods[17].We define the hard instances of D relative toβas,M(β,D)={ x,y ∈D|yfβ(x)≤1}.(4)That is,M(β,D)are training examples that are incorrectly classified or near the margin of the 
classifier defined byβ. We can show thatβ∗(D)only depends on hard instances. Theorem1.Let C be a subset of the examples in D.If M(β∗(D),D)⊆C thenβ∗(C)=β∗(D).This implies that in principle we could train a model us-ing a small set of examples.However,this set is defined in terms of the optimal modelβ∗(D).Given afixedβwe can use M(β,D)to approximate M(β∗(D),D).This suggests an iterative algorithm where we repeatedly compute a model from the hard instances de-fined by the model from the last iteration.This is further justified by the followingfixed-point theorem.Theorem2.Ifβ∗(M(β,D))=βthenβ=β∗(D).Let C be an initial“cache”of examples.In practice we can take the positive examples together with random nega-tive examples.Consider the following iterative algorithm: 1.Letβ:=β∗(C).2.Shrink C by letting C:=M(β,C).3.Grow C by adding examples from M(β,D)up to amemory limit L.Theorem3.If|C|<L after each iteration of Step2,the algorithm will converge toβ=β∗(D)infinite time.3.4.Implementation detailsMany of the ideas discussed here are only approximately implemented in our current system.In practice,when train-ing a latent SVM we iteratively apply classical SVM train-ing to triples x1,z1,y1 ,..., x n,z n,y n where z i is se-lected to be the best scoring latent label for x i under themodel trained in the previous iteration.Each of these triples leads to an example Φ(x i,z i),y i for training a linear clas-sifier.This allows us to use a highly optimized SVM pack-age(SVMLight[17]).On a single CPU,the entire training process takes3to4hours per object class in the PASCAL datasets,including initialization of the parts.Root Filter Initialization:For each category,we auto-matically select the dimensions of the rootfilter by looking at statistics of the bounding boxes in the training data.1We train an initial rootfilter F0using an SVM with no latent variables.The positive examples are constructed from the unoccluded training examples(as labeled in the PASCAL data).These examples are anisotropically scaled to the size and aspect ratio of thefilter.We use random subwindows from negative images to generate negative examples.Root Filter Update:Given the initial rootfilter trained as above,for each bounding box in the training set wefind the best-scoring placement for thefilter that significantly overlaps with the bounding box.We do this using the orig-inal,un-scaled images.We retrain F0with the new positive set and the original random negative set,iterating twice.Part Initialization:We employ a simple heuristic to ini-tialize six parts from the rootfilter trained above.First,we select an area a such that6a equals80%of the area of the rootfilter.We greedily select the rectangular region of area a from the rootfilter that has the most positive energy.We zero out the weights in this region and repeat until six parts are selected.The partfilters are initialized from the rootfil-ter values in the subwindow selected for the part,butfilled in to handle the higher spatial resolution of the part.The initial deformation costs measure the squared norm of a dis-placement with a i=(0,0)and b i=−(1,1).Model Update:To update a model we construct new training data triples.For each positive bounding box in the training data,we apply the existing detector at all positions and scales with at least a50%overlap with the given bound-ing box.Among these we select the highest scoring place-ment as the positive example corresponding to this training bounding box(Figure3).Negative examples are selected byfinding high scoring detections in 
images not containing the target object.We add negative examples to a cache un-til we encounterfile size limits.A new model is trained by running SVMLight on the positive and negative examples, each labeled with part placements.We update the model10 times using the cache scheme described above.In each it-eration we keep the hard instances from the previous cache and add as many new hard instances as possible within the memory limit.Toward thefinal iterations,we are able to include all hard instances,M(β,D),in the cache.1We picked a simple heuristic by cross-validating over5object classes. We set the model aspect to be the most common(mode)aspect in the data. We set the model size to be the largest size not larger than80%of thedata.Figure3.The image on the left shows the optimization of the la-tent variables for a positive example.The dotted box is the bound-ing box label provided in the PASCAL training set.The large solid box shows the placement of the detection window while the smaller solid boxes show the placements of the parts.The image on the right shows a hard-negative example.4.ResultsWe evaluated our system using the PASCAL VOC2006 and2007comp3challenge datasets and protocol.We refer to[7,8]for details,but emphasize that both challenges are widely acknowledged as difficult testbeds for object detec-tion.Each dataset contains several thousand images of real-world scenes.The datasets specify ground-truth bounding boxes for several object classes,and a detection is consid-ered correct when it overlaps more than50%with a ground-truth bounding box.One scores a system by the average precision(AP)of its precision-recall curve across a testset.Recent work in pedestrian detection has tended to report detection rates versus false positives per window,measured with cropped positive examples and negative images with-out objects of interest.These scores are tied to the reso-lution of the scanning window search and ignore effects of non-maximum suppression,making it difficult to compare different systems.We believe the PASCAL scoring method gives a more reliable measure of performance.The2007challenge has20object categories.We entered a preliminary version of our system in the official competi-tion,and obtained the best score in6categories.Our current system obtains the highest score in10categories,and the second highest score in6categories.Table1summarizes the results.Our system performs well on rigid objects such as cars and sofas as well as highly deformable objects such as per-sons and horses.We also note that our system is successful when given a large or small amount of training data.There are roughly4700positive training examples in the person category but only250in the sofa category.Figure4shows some of the models we learned.Figure5shows some ex-ample detections.We evaluated different components of our system on the longer-established2006person dataset.The top AP scoreaero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tvOur rank 31211224111422112141Our score .180.411.092.098.249.349.396.110.155.165.110.062.301.337.267.140.141.156.206.336Darmstadt .301INRIA Normal .092.246.012.002.068.197.265.018.097.039.017.016.225.153.121.093.002.102.157.242INRIA Plus.136.287.041.025.077.279.294.132.106.127.067.071.335.249.092.072.011.092.242.275IRISA .281.318.026.097.119.289.227.221.175.253MPI Center .060.110.028.031.000.164.172.208.002.044.049.141.198.170.091.004.091.034.237.051MPI 
ESSOL.152.157.098.016.001.186.120.240.007.061.098.162.034.208.117.002.046.147.110.054Oxford .262.409.393.432.375.334TKK .186.078.043.072.002.116.184.050.028.100.086.126.186.135.061.019.036.058.067.090Table 1.PASCAL VOC 2007results.Average precision scores of our system and other systems that entered the competition [7].Empty boxes indicate that a method was not tested in the corresponding class.The best score in each class is shown in bold.Our current system ranks first in 10out of 20classes.A preliminary version of our system ranked first in 6classes in the official competition.BottleCarBicycleSofaFigure 4.Some models learned from the PASCAL VOC 2007dataset.We show the total energy in each orientation of the HOG cells in the root and part filters,with the part filters placed at the center of the allowable displacements.We also show the spatial model for each part,where bright values represent “cheap”placements,and dark values represent “expensive”placements.in the PASCAL competition was .16,obtained using a rigid template model of HOG features [5].The best previous re-sult of.19adds a segmentation-based verification step [20].Figure 6summarizes the performance of several models we trained.Our root-only model is equivalent to the model from [5]and it scores slightly higher at .18.Performance jumps to .24when the model is trained with a LSVM that selects a latent position and scale for each positive example.This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection win-dow in the training examples.Adding deformable parts in-creases performance to .34AP —a factor of two above the best previous score.Finally,we trained a model with partsbut no root filter and obtained .29AP.This illustrates the advantage of using a multiscale representation.We also investigated the effect of the spatial model and allowable deformations on the 2006person dataset.Recall that s i is the allowable displacement of a part,measured in HOG cells.We trained a rigid model with high-resolution parts by setting s i to 0.This model outperforms the root-only system by .27to .24.If we increase the amount of allowable displacements without using a deformation cost,we start to approach a bag-of-features.Performance peaks at s i =1,suggesting it is useful to constrain the part dis-placements.The optimal strategy allows for larger displace-ments while using an explicit deformation cost.The follow-Figure 5.Some results from the PASCAL 2007dataset.Each row shows detections using a model for a specific class (Person,Bottle,Car,Sofa,Bicycle,Horse).The first three columns show correct detections while the last column shows false positives.Our system is able to detect objects over a wide range of scales (such as the cars)and poses (such as the horses).The system can also detect partially occluded objects such as a person behind a bush.Note how the false detections are often quite reasonable,for example detecting a bus with the car model,a bicycle sign with the bicycle model,or a dog with the horse model.In general the part filters represent meaningful object parts that are well localized in each detection such as the head in the person model.Figure6.Evaluation of our system on the PASCAL VOC2006 person dataset.Root uses only a rootfilter and no latent place-ment of the detection windows on positive examples.Root+Latent uses a rootfilter with latent placement of the detection windows. 
Parts+Latent is a part-based system with latent detection windows but no rootfilter.Root+Parts+Latent includes both root and part filters,and latent placement of the detection windows.ing table shows AP as a function of freely allowable defor-mation in thefirst three columns.The last column gives the performance when using a quadratic deformation cost and an allowable displacement of2HOG cells.s i01232+quadratic costAP.27.33.31.31.345.DiscussionWe introduced a general framework for training SVMs with latent structure.We used it to build a recognition sys-tem based on multiscale,deformable models.Experimental results on difficult benchmark data suggests our system is the current state-of-the-art in object detection.LSVMs allow for exploration of additional latent struc-ture for recognition.One can consider deeper part hierar-chies(parts with parts),mixture models(frontal vs.side cars),and three-dimensional pose.We would like to train and detect multiple classes together using a shared vocab-ulary of parts(perhaps visual words).We also plan to use A*search[11]to efficiently search over latent parameters during detection.References[1]Y.Amit and A.Trouve.POP:Patchwork of parts models forobject recognition.IJCV,75(2):267–282,November2007.[2]M.Burl,M.Weber,and P.Perona.A probabilistic approachto object recognition using local photometry and global ge-ometry.In ECCV,pages II:628–641,1998.[3] D.Crandall,P.Felzenszwalb,and D.Huttenlocher.Spatialpriors for part-based recognition using statistical models.In CVPR,pages10–17,2005.[4] D.Crandall and D.Huttenlocher.Weakly supervised learn-ing of part-based spatial models for visual object recognition.In ECCV,pages I:16–29,2006.[5]N.Dalal and B.Triggs.Histograms of oriented gradients forhuman detection.In CVPR,pages I:886–893,2005.[6] B.Epshtein and S.Ullman.Semantic hierarchies for recog-nizing objects and parts.In CVPR,2007.[7]M.Everingham,L.Van Gool,C.K.I.Williams,J.Winn,and A.Zisserman.The PASCAL Visual Object Classes Challenge2007(VOC2007)Results./challenges/VOC/voc2007/workshop.[8]M.Everingham, A.Zisserman, C.K.I.Williams,andL.Van Gool.The PASCAL Visual Object Classes Challenge2006(VOC2006)Results./challenges/VOC/voc2006/results.pdf.[9]P.Felzenszwalb and D.Huttenlocher.Distance transformsof sampled functions.Cornell Computing and Information Science Technical Report TR2004-1963,September2004.[10]P.Felzenszwalb and D.Huttenlocher.Pictorial structures forobject recognition.IJCV,61(1),2005.[11]P.Felzenszwalb and D.McAllester.The generalized A*ar-chitecture.JAIR,29:153–190,2007.[12]R.Fergus,P.Perona,and A.Zisserman.Object class recog-nition by unsupervised scale-invariant learning.In CVPR, 2003.[13]M.Fischler and R.Elschlager.The representation andmatching of pictorial structures.IEEE Transactions on Com-puter,22(1):67–92,January1973.[14] A.Holub and P.Perona.A discriminative framework formodelling object classes.In CVPR,pages I:664–671,2005.[15]S.Ioffe and D.Forsyth.Probabilistic methods forfindingpeople.IJCV,43(1):45–68,June2001.[16]Y.Jin and S.Geman.Context and hierarchy in a probabilisticimage model.In CVPR,pages II:2145–2152,2006.[17]T.Joachims.Making large-scale svm learning practical.InB.Sch¨o lkopf,C.Burges,and A.Smola,editors,Advances inKernel Methods-Support Vector Learning.MIT Press,1999.[18]Y.LeCun,S.Chopra,R.Hadsell,R.Marc’Aurelio,andF.Huang.A tutorial on energy-based learning.InG.Bakir,T.Hofman,B.Sch¨o lkopf,A.Smola,and B.Taskar,editors, Predicting Structured Data.MIT Press,2006.[19] A.Quattoni,S.Wang,L.Morency,M.Collins,and 
T.Dar-rell.Hidden conditional randomfields.PAMI,29(10):1848–1852,October2007.[20] ing segmentation to verify object hypothe-ses.In CVPR,pages1–8,2007.[21] D.Ramanan and C.Sminchisescu.Training deformablemodels for localization.In CVPR,pages I:206–213,2006.[22]H.Schneiderman and T.Kanade.Object detection using thestatistics of parts.IJCV,56(3):151–177,February2004. [23]J.Zhang,M.Marszalek,zebnik,and C.Schmid.Localfeatures and kernels for classification of texture and object categories:A comprehensive study.IJCV,73(2):213–238, June2007.。

Training Techniques for Hidden Markov Models (6)


The hidden Markov model (HMM) is a statistical model for sequential data that is widely used in speech recognition, natural language processing, bioinformatics, and related fields. In an HMM the system state is not directly observable; it can only be inferred from the observable outputs. This article discusses training techniques for hidden Markov models, covering parameter estimation, initialization, and convergence.

Data preprocessing: before an HMM is trained, the input data must be preprocessed. For speech recognition this means extracting acoustic features, typically Mel-frequency cepstral coefficients (MFCCs), zero-crossing rate, and similar measures. For text tasks it means building representations such as bag-of-words vectors or word embeddings. The quality of this preprocessing directly determines how well the HMM trains, so it deserves particular care.

Parameter estimation: an HMM has three groups of parameters to estimate: the initial state distribution π, the state transition matrix A, and the emission matrix B. Here π gives the probability of starting in each hidden state, A gives the probability of moving from one hidden state to another, and B gives the probability of generating each observable output from a hidden state.

The Baum-Welch algorithm is the classic method for HMM parameter estimation. It is an iterative procedure that repeatedly updates the parameter estimates so that the likelihood of the data keeps increasing. Its core is the forward-backward algorithm, which makes the estimation efficient, but in practice it is sensitive to the initial values and can easily get stuck in a local optimum.
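To make the role of the forward pass concrete, here is a minimal NumPy sketch of computing the data likelihood that Baum-Welch repeatedly needs; the model sizes and probability values are toy numbers chosen for illustration, not anything prescribed above.

    import numpy as np

    def forward_log_likelihood(pi, A, B, obs):
        """Scaled forward algorithm: returns log P(obs | pi, A, B).

        pi : (N,)   initial state probabilities
        A  : (N, N) transition matrix, A[i, j] = P(state j | state i)
        B  : (N, M) emission matrix, B[i, k] = P(symbol k | state i)
        obs: sequence of integer symbols in [0, M)
        """
        alpha = pi * B[:, obs[0]]          # forward variables at t = 0
        scale = alpha.sum()
        alpha /= scale
        log_lik = np.log(scale)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]  # predict, then weight by emission
            scale = alpha.sum()
            alpha /= scale                 # rescale to avoid underflow
            log_lik += np.log(scale)
        return log_lik

    # toy example: 2 hidden states, 3 observable symbols
    pi = np.array([0.6, 0.4])
    A  = np.array([[0.7, 0.3],
                   [0.2, 0.8]])
    B  = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])
    print(forward_log_likelihood(pi, A, B, [0, 1, 2, 2, 1]))

Rescaling the forward variables at every step, as done here, is the standard trick for keeping long sequences numerically stable.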

Model initialization: parameter estimation needs starting values, and these are often chosen at random. Random initialization, however, can make the estimates unstable and hurt the final model. Initializing the HMM carefully therefore matters; dedicated schemes such as K-means clustering or a Gaussian mixture model can be used to seed the parameters.
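As one possible realization of the K-means idea, the sketch below seeds a Gaussian-emission HMM with cluster centres before running EM; it assumes the scikit-learn and hmmlearn packages and uses arbitrary toy data.

    import numpy as np
    from sklearn.cluster import KMeans
    from hmmlearn import hmm   # assumed available; any Gaussian-HMM implementation works

    def init_gaussian_hmm(X, n_states, seed=0):
        """Seed a Gaussian HMM with K-means cluster centres instead of random means."""
        km = KMeans(n_clusters=n_states, n_init=10, random_state=seed).fit(X)
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                init_params="stc",  # keep our means, let fit init the rest
                                random_state=seed)
        model.means_ = km.cluster_centers_
        return model

    X = np.random.RandomState(0).randn(500, 4)   # toy feature sequence, 4-dim frames
    model = init_gaussian_hmm(X, n_states=3)
    model.fit(X)                                 # Baum-Welch then refines all parameters

Seeding the means this way usually gives EM a better starting point than purely random values, which is exactly the instability the paragraph above warns about.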

Convergence: during training one has to decide when to stop. Convergence is usually handled by fixing a maximum number of iterations or by stopping once the improvement falls below a threshold. For large datasets, distributed or parallel computation can be used to speed up training and reach convergence sooner.

Parameter tuning: the training process also involves tuning hyperparameters. This is typically done with cross-validation to find the best-performing configuration.

Using Hidden Markov Models to Detect Fake Information (10)


The hidden Markov model (HMM) is a model for time-series data with wide application in speech recognition, natural language processing, bioinformatics, and other areas. In information security it has also been applied to fake-information detection. This article describes how an HMM can be used for that task and discusses some of the practical issues.

An HMM is a statistical model consisting of a state space, an observation space, state transition probabilities, and observation (emission) probabilities. For fake-information detection, the state space can be taken as two states, genuine and fake, while the observation space consists of text features such as word frequencies or part-of-speech tags. The transition probabilities describe how the system moves between states, and the emission probabilities describe how likely each observation is under each state.

The first step is to build a suitable model. Its parameters, the transition and emission probabilities, are usually estimated from training data consisting of genuine and fake texts, by counting how often each observation occurs under each state. In practice, open-source machine learning libraries such as hmmlearn or pomegranate can be used to implement the model.

Once the model is built, it can be used for detection. For a given text, the Viterbi algorithm computes the most likely hidden state sequence, from which one can judge whether the text is genuine or fake.
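A compact NumPy sketch of the Viterbi step described here; the two states are labelled "genuine" and "fake" only for illustration, and all probabilities are made up.

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Most likely hidden-state path for a discrete-emission HMM (log domain)."""
        n_states, T = A.shape[0], len(obs)
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
        delta = log_pi + log_B[:, obs[0]]          # best log score ending in each state
        backptr = np.zeros((T, n_states), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + log_A          # cand[i, j]: come from i, move to j
            backptr[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + log_B[:, obs[t]]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):              # trace the stored back-pointers
            path.append(int(backptr[t, path[-1]]))
        return path[::-1], float(delta.max())

    # toy model: state 0 ~ "genuine", state 1 ~ "fake" (labels are illustrative only)
    pi = np.array([0.5, 0.5])
    A  = np.array([[0.9, 0.1], [0.2, 0.8]])
    B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
    print(viterbi(pi, A, B, [0, 0, 2, 2, 1]))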

In practice a threshold can be set according to the application: for example, when the probability assigned to the fake hypothesis exceeds the threshold, the text is flagged as fake.

Using HMMs for fake-information detection still faces several challenges. First, fake information takes many forms and may contain a lot of noise, so feature selection and dimensionality reduction are needed during model building to keep the model robust. Second, the generation of fake information is influenced by external factors such as how content propagates on social networks and how users behave, which also needs to be modeled and analyzed. Finally, HMMs have inherent limitations: in particular they are weak at modeling long-range dependencies, yet in text such dependencies are often an important signature of fake information.

Training Techniques for Hidden Markov Models (8)


The hidden Markov model (HMM) is a probabilistic graphical model for sequential data with broad applications in speech recognition, natural language processing, and bioinformatics. Training is critical: it determines the model's accuracy and its ability to generalize. This article discusses HMM training techniques, covering parameter estimation, convergence, and data handling.

Parameter estimation is one of the central problems. The parameters of an HMM are the state transition matrix, the emission matrix, and the initial state distribution. The usual estimation methods are maximum likelihood estimation (MLE) and the expectation-maximization (EM) algorithm. With MLE the parameters are chosen to maximize the joint probability of the observed sequences; with EM the expectations of the hidden variables and the maximization of the likelihood are alternated to update the parameters. In practice the two are typically combined, iteratively refining the parameters to obtain a better fit.

Convergence is the second important issue. Because the HMM likelihood is non-convex, training easily ends up in a local optimum. A standard remedy is to run the training from several random initializations and keep the model with the highest likelihood. Heuristic optimizers such as simulated annealing or genetic algorithms can also be used to increase the chance of escaping local optima; a sketch of the simpler multiple-restart strategy follows.
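A sketch of the multiple-restart strategy, assuming an hmmlearn-style GaussianHMM; the number of restarts, the number of states, and the toy data are arbitrary choices.

    import numpy as np
    from hmmlearn import hmm

    def fit_with_restarts(X, n_states=3, n_restarts=10):
        """Run EM from several random initialisations and keep the best model."""
        best_model, best_ll = None, -np.inf
        for seed in range(n_restarts):
            model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                    n_iter=100, random_state=seed)
            model.fit(X)
            ll = model.score(X)             # log-likelihood of the training data
            if ll > best_ll:
                best_model, best_ll = model, ll
        return best_model, best_ll

    X = np.random.RandomState(1).randn(400, 2)
    model, ll = fit_with_restarts(X)
    print("best log-likelihood:", ll)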

Besides parameter estimation and convergence, data handling also matters. HMMs are demanding with respect to training data: a large amount of labeled data is needed, and data quality strongly affects performance. Before training, the data should therefore be cleaned and preprocessed, which includes removing noise, balancing the class distribution, and correcting mislabeled examples. For long sequences, truncation and segmentation also need to be considered in order to improve training efficiency and reduce memory usage.

In practice the training techniques have to be chosen according to the task and the characteristics of the data. In general, parameter estimation, convergence, and data handling should be considered together and refined through repeated experiments to obtain the best training result. It is also worth noting that, with the development of deep learning, newer sequence models such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) can be used for modeling and training on sequential data, and they often achieve better results.

Knowledge Distillation and the Transfer of Dark Knowledge


Knowledge distillation for transferring dark knowledge is an emerging machine learning technique that lets one model distill the "dark knowledge" missing from another model and thereby achieve effective knowledge transfer. Its goal is to move the knowledge contained in an existing model into another model in order to improve training efficiency and model performance.

The technique works by having a source model "distill" its dark knowledge and applying that knowledge to a target model. The knowledge contained in the source model can be regarded as dark knowledge, and the target model as its beneficiary. The main implementation route is to consolidate the parameters of the source model and bring them into the training of the target model, so that the source model's dark knowledge is transferred.

There is now a large body of work on this topic, covering approaches such as model clustering, joint model learning, self-supervised learning, weakly supervised learning, and multi-task learning. Model clustering, for example, aggregates the source model's parameters into a set of parameters that are then used in training the target model, so that the target model acquires the source model's dark knowledge during training. Joint model learning trains the source and target models together so that the target benefits from the source's dark knowledge. Self-supervised learning turns the source model's parameters into a self-supervised task that is used during target training; weakly supervised learning builds weak supervision tasks from the source parameters; and multi-task learning constructs a multi-task problem from them, in each case so that the target model benefits from the source model's dark knowledge.

The technique plays an important role in many machine learning applications, since it can effectively move dark knowledge from a source model into a target model, improving training efficiency and performance. Open questions remain, however, such as how to choose an appropriate distillation technique and how to transfer the dark knowledge to the target model more effectively.

The Hidden Markov Algorithm


The hidden Markov model (HMM) is a statistical modeling method frequently applied in speech recognition, handwriting recognition, natural language processing, and related areas. Its basic idea is to infer the hidden state sequence from an observed data sequence and then use that state sequence for prediction or decision making.

An HMM consists of three parts: the state sequence, the observation sequence, and the model parameters. The state sequence covers all states the model may be in, for instance characteristics of a speaker such as speaking rate or pitch in speech recognition; the observation sequence is the data actually observed, such as the audio waveform of the speech signal; and the model parameters quantify the state transition probabilities, the observation probabilities, and the initial state probabilities. Together these three parts define a complete HMM.

Training is usually done by maximum likelihood: given the observed sequences, the EM algorithm is used to estimate the optimal parameters. For prediction or decision making, the forward-backward algorithm is typically used to compute the probability of each state given the observations, and the highest-probability state sequence is taken as the result. The accuracy of the predictions and decisions depends on the parameter settings and on the quality of the training data.

The strengths of the HMM are that it can handle incomplete and noisy data and that it takes temporal dependence into account, which is why it has been widely adopted in speech recognition and natural language processing. It also has drawbacks: training can require substantial time and computation, and the model struggles with complex long-range dependencies. The most suitable model therefore has to be chosen for each application. In short, the HMM is a widely used statistical modeling method with broad applications in speech recognition, handwriting recognition, natural language processing, and bioinformatics; despite its shortcomings, its strengths make it an algorithm still worth studying and applying.

Hidden Markov Models in Speech Recognition


The hidden Markov model (HMM) is a statistical model that relates a hidden state sequence to an observable sequence through probabilities, and it is used in many fields including natural language processing and speech recognition. In speech recognition it is widely used for acoustic modeling and remains one of the most common components of recognition systems.

In an HMM-based recognizer the speech signal is split into a sequence of frames, each represented by a feature vector. The acoustic model maps these feature vectors to the phones of the text. The HMM has three ingredients: states, transition probabilities, and emission probabilities. A state represents the current condition of the process, the transition probabilities give the chance of moving from one state to another, and the emission probabilities give the chance of a state generating a particular observation. In speech recognition a state can correspond to a phone, the transition probabilities measure how likely one phone is to follow another, and the emission probabilities give the likelihood of the observed features (e.g. Mel-frequency cepstral coefficients) under each state.

In a recognition task the HMM is used to build the phone-level acoustic model of an automatic speech recognition (ASR) system, which translates the audio stream in terms of its basic units, the phones. The components of an ASR system, which make it a modern solution for both speaker verification (SR) and automatic speech recognition (ASR), are: front-end signal processing, feature extraction, HMM acoustic modeling, and a language model. In the front-end step the speech is recorded, denoised, pre-emphasized, and split into frames. Feature extraction then computes Mel-cepstral coefficients from each frame, providing a reduced and enhanced representation. After these steps the HMM can be used for acoustic modeling; for best results, several representative HMMs are usually trained and their parameters tuned to improve accuracy. The language model is trained alongside the ASR system and supplies the linguistic context that lets the system cope with factors such as accent, speaking rate, and intonation in the speech signal.

HMM-based speech recognition comes in two flavors: offline and online. In offline recognition the ASR system processes a complete audio file, usually segmenting the speech first and then recognizing the input offline; in online recognition the system processes the incoming audio stream directly without prior segmentation.

4feature


Published in the Journal of the College of Chinese Language and Culture of Jinan University, 2006, No. 1 (No. 21 overall). Four Characteristics of Current Natural Language Processing, by Feng Zhiwei (Institute of Applied Linguistics, Ministry of Education). Abstract: this paper analyzes four characteristics of current natural language processing. The rationalist approach based on syntactic-semantic rules has come under question, and with the rise of corpus construction and corpus linguistics, the processing of large-scale real text has become the main strategic goal of the field; machine learning is increasingly used to acquire linguistic knowledge automatically; statistical and mathematical methods receive growing attention; and the role of the lexicon is increasingly emphasized, giving rise to a strong "lexicalist" tendency.

Keywords: natural language processing, corpora, machine learning, statistical methods, lexicalism.

Since the beginning of the twenty-first century, with the spread of the Internet, computer processing of natural language has become an important means of obtaining knowledge from the web. People living in the information age almost all interact with the Internet and rely, to a greater or lesser extent, on the results of natural language processing research to find or mine knowledge and information on it. Countries around the world therefore attach great importance to this research and have invested heavily in people, equipment, and funding.

In my view, current natural language processing research abroad has four notable characteristics. First, the rationalist approach based on syntactic-semantic rules has come under question, and with the rise of corpus construction and corpus linguistics, processing large-scale real text has become the main strategic goal. Over the past forty-odd years, the great majority of researchers building natural language processing systems adopted rule-based, rationalist methods whose philosophical basis is logical positivism: they held that the basic unit of intelligence is the symbol, that cognition is symbol manipulation over representations, and hence that thinking is symbolic computation. The well-known linguist J. A. Fodor wrote in Representations (MIT Press, 1980) that as long as we take mental processes to be computational processes, that is, formal operations defined over representations, it is natural to view the mind as, among other things, a kind of computer: whatever symbolic operations the hypothesized computational processes involve, the mind performs those operations, so mental operations can be regarded as broadly similar to the operations of a Turing machine.

Experimental Paradigms for Prospective Memory


(1) Types of prospective-memory paradigms. Prospective memory is memory for actions or events to be carried out in the future and is an important component of cognitive functioning. Researchers typically study it with the following paradigms.

1. Reminder task paradigm: participants must hold specific information in mind for a period of time and recall it later. Before recall, cues such as sounds or lights are given to help participants retrieve the earlier information. This paradigm is used to study individual prospective-memory ability and the factors that affect encoding and retrieval.

2. Delayed matching task paradigm: participants view a set of stimuli and, after a delay, must match them to the stimuli seen earlier. Successful matching indicates good prospective-memory ability. This paradigm is used to study the relation between prospective memory and the processing of temporal information.

3. Goal-directed task paradigm: participants must complete tasks guided by a goal, such as navigating a maze or finding hidden objects. Successful completion indicates good prospective thinking and planning ability. This paradigm is used to study the relation between prospective memory and decision making.

(2) Distinctions among the paradigms. The main paradigms are the reminder task, the delayed matching task, and the goal-directed task. Further research, however, distinguishes time-based, event-based, and activity-based prospective memory. Time-based prospective memory, for example, requires performing an action at a specific time or after a period of time has elapsed. The laboratory dual-task paradigm is one of the common methods for studying prospective memory: two tasks are performed at once, an ongoing background task and a prospective-memory task that typically requires interrupting the background task. Some studies have also compared performance across types of prospective memory; for instance, one study found that participants performed significantly better on time-based than on event-based prospective memory.

The Effect of Depth of Processing on Word Memory


The Effect of Depth of Processing on Word Memory. Abstract: with university students as participants and English words as materials, the process-dissociation procedure was used to compute the contributions of conscious and of unconscious retrieval under different depths of processing. The results show (1) recall accuracy under deep processing is higher than under shallow processing, and (2) unconscious retrieval is unaffected by depth of processing, whereas conscious retrieval is affected by it.

Keywords: implicit memory; explicit memory; depth of processing; process-dissociation procedure. [Chinese Library Classification: R19; Document code: A; Article ID: 1007-8231(2017)09-0264-02]

1. Background. In foreign-language learning, memorizing vocabulary is the most basic and most important step; without the necessary vocabulary, listening, speaking, reading, and writing cannot proceed. Yet vocabulary is often the hurdle students find hardest to overcome: memorization is inefficient and words must be repeated many times before they stick, so finding scientific methods of memorizing words is important.

2. Design. 2.1 Participants: 48 university students were randomly selected at Sichuan University of Arts and Science and randomly assigned to two groups of 24, each with 12 men and 12 women. None had taken part in similar psychology experiments; all were right-handed with normal or corrected-to-normal vision. 2.2 Materials: 72 English words were chosen from a CET-4 vocabulary book, each 5 to 10 letters long, with a written-text frequency between 0.02 and 0.06. The 72 words were arranged into 36 pairs; in 18 pairs the two words were similar in meaning (e.g. benefit / beneficial) and in the remaining pairs they were not (e.g. bloody / bloom). The first word of each pair served as the cue (e.g. bloody) and the second as the target (e.g. bloom). After random mixing, the first 18 pairs were used in the study phase and the last 18 pairs served as new items in the test phase; in the test phase old and new pairs were randomly mixed and all 36 pairs were presented in random order. The materials were programmed with E-prime 2.0. Testing was conducted individually, and the word pairs in both the study and test phases were presented on the same HP laptop.

Hidden Markov Models for Classification


The hidden Markov model (HMM) is a classic probabilistic model that is widely used for classification, with important applications in speech recognition, natural language processing, and financial forecasting. This article introduces HMM-based classification from three angles: the basic principle, model training, and classification applications.

1. Basic principle. An HMM consists of a hidden state sequence and an observation sequence. The states are hidden and cannot be observed directly; the observations are visible and are what classification is based on. The HMM assumes that the observations are generated by the states, that transitions between states follow transition probabilities, and that observations are linked to states through emission probabilities.

2. Model training. Training involves two main steps: parameter estimation and model optimization. Parameter estimation computes the model parameters, namely the initial state probabilities, the transition probabilities, and the emission probabilities, from known observation sequences; common methods are maximum likelihood estimation and the Baum-Welch algorithm. Model optimization then adjusts the parameters so that the model fits the observed data better, and the subsequent inference commonly relies on the Viterbi algorithm and the forward-backward algorithm.

3. Classification applications. HMMs are used widely for classification. Taking text classification as an example, suppose an article is to be assigned to one of several categories. The article is first converted into an observation sequence, for instance with a bag-of-words or TF-IDF representation. An HMM is then built by defining the state set, the initial state probabilities, the transition probabilities, and the emission probabilities. Finally, the Viterbi algorithm or the forward-backward algorithm is applied to the observation sequence and the model to compute the most likely state sequence, from which the article's class is decided; a sketch of one common realization follows.
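One common way to realize such an HMM text classifier is to train one model per class and pick the class whose model scores the document highest with the forward algorithm; the sketch below assumes hmmlearn's CategoricalHMM (called MultinomialHMM in older releases) and uses random word-id sequences purely as stand-in data.

    import numpy as np
    from hmmlearn import hmm   # CategoricalHMM in recent versions of hmmlearn

    # Illustrative setup: documents are already mapped to sequences of word ids in [0, V).
    V = 50
    rng = np.random.RandomState(0)
    train = {
        "sports":  [rng.randint(0, V, size=30) for _ in range(20)],
        "finance": [rng.randint(0, V, size=30) for _ in range(20)],
    }

    models = {}
    for label, seqs in train.items():
        X = np.concatenate(seqs)[:, None]          # hmmlearn expects a 2-D column array
        lengths = [len(s) for s in seqs]
        m = hmm.CategoricalHMM(n_components=4, n_iter=50, random_state=0)
        m.fit(X, lengths)
        models[label] = m

    def classify(doc_ids):
        """Pick the class whose HMM assigns the highest forward log-likelihood."""
        scores = {label: m.score(np.asarray(doc_ids)[:, None]) for label, m in models.items()}
        return max(scores, key=scores.get), scores

    print(classify(rng.randint(0, V, size=25)))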

HMM-based classification is not limited to text; it is also used in speech recognition and financial forecasting. In speech recognition the speech signal is converted into an observation sequence and the most likely state sequence yields the recognized and understood content; in financial forecasting, historical data are converted into observation sequences and the inferred state sequence is used to predict stock-market movements or exchange-rate changes. In summary, the hidden Markov model is an important classification method with broad practical value.

Using Hidden Markov Models to Detect Fake Information (4)


Fake-information detection has long attracted attention, especially in today's era of information overload. People encounter ever more content on social media, and much of it is false, so technical means of detecting fake information are urgently needed. The hidden Markov model (HMM), a probabilistic graphical model widely used in speech recognition and natural language processing, has particular advantages here and can identify fake information effectively. This article discusses how to use an HMM for that purpose.

An HMM describes a random sequence in which every position has an associated hidden state. For fake-information detection, a text can be viewed as a sequence in which each word or phrase occupies a position, and the hidden state at that position indicates whether the information there is genuine or fake. The text can therefore be modeled with an HMM and the detection problem reduced to inference in that model.

The first step is training. A set of genuine and fake text samples is prepared, and the HMM is trained on them to learn the characteristics of genuine and of fake information. Training determines the model parameters: the transition probabilities, the emission probabilities, and the initial state probabilities. The result is an HMM tailored to fake-information detection.

The trained model can then be applied to new text. For a given document, the Viterbi algorithm computes the most likely hidden state sequence under the model, which determines the document's authenticity; by comparing the probabilities assigned to the genuine and fake hypotheses, the document can be classified as genuine or fake. In this way the HMM serves as a fake-information detector; a small scoring sketch follows.
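A small scoring sketch for the decision step just described, assuming two already-trained hmmlearn-style models (one for genuine text, one for fake text); the threshold is a free parameter to be tuned on validation data, and all names are illustrative.

    import numpy as np

    def label_text(obs_ids, real_hmm, fake_hmm, threshold=0.0):
        """Log-likelihood-ratio test between a 'real' HMM and a 'fake' HMM.

        real_hmm / fake_hmm are assumed to be trained models exposing an
        hmmlearn-style .score() method; threshold is tuned on held-out data.
        """
        x = np.asarray(obs_ids)[:, None]
        llr = fake_hmm.score(x) - real_hmm.score(x)
        return ("fake" if llr > threshold else "real"), llr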

The approach has limitations, however. First, performance depends on the quality and quantity of the training data; insufficient or low-quality data degrades the model. Second, HMMs can struggle with long documents, because the wider context in the text can influence the judgment but is hard for the model to capture. Finally, HMMs are rather sensitive to the length of the input sequence: very long inputs enlarge the effective parameter space and can hurt performance.

Tips for Using Hidden Markov Models in News Event Prediction (9)


The hidden Markov model (HMM) is a statistical model widely applied in speech recognition and natural language processing, and with the progress of artificial intelligence it has also begun to show value in news event prediction. This article describes how to use an HMM for that purpose, covering data preparation, model training, and evaluation of the predictions.

Data preparation: before an HMM can be used for news event prediction, the relevant data must be collected, including the news text and the time and place of the associated events. Typically a set of representative news events is chosen as training data for building the model, and some historical data is kept as a validation set for assessing prediction accuracy. Data quality and completeness matter: heavy noise or missing values can harm both training and prediction, so the data should be cleaned and preprocessed at this stage to ensure quality and accuracy.

Model training: once the data are ready, the HMM can be built and trained. Several points deserve attention:

1. Choice of state space: in news event prediction the states usually represent event categories or trends, and the state space should be chosen according to the prediction target and the data.

2. Modeling the observation sequence: the observations are usually words or phrases from the news text; bag-of-words or tf-idf style techniques can be used to extract and represent the features (see the sketch after this list for one way to turn text into HMM observations).

3. Parameter estimation: the HMM parameters are usually estimated by maximum likelihood or the expectation-maximization algorithm, maximizing the likelihood of the observation sequences.

4. Model evaluation: after training, the validation set is used to assess the model, typically with metrics such as precision and recall.
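The sketch referred to in point 2 above: one simple way to turn tokenized news text into the integer observation sequences a discrete-emission HMM expects. The vocabulary size and the example headlines are invented for illustration.

    from collections import defaultdict

    def build_vocab(tokenized_docs, max_size=5000):
        """Map the most frequent tokens to integer ids; everything else becomes <unk> (id 0)."""
        counts = defaultdict(int)
        for doc in tokenized_docs:
            for tok in doc:
                counts[tok] += 1
        vocab = {"<unk>": 0}
        for tok, _ in sorted(counts.items(), key=lambda kv: -kv[1])[: max_size - 1]:
            vocab[tok] = len(vocab)
        return vocab

    def to_observations(doc, vocab):
        """Turn one tokenized document into the integer observation sequence an HMM consumes."""
        return [vocab.get(tok, 0) for tok in doc]

    docs = [["markets", "rally", "after", "rate", "cut"],
            ["storm", "warning", "issued", "for", "coast"]]
    vocab = build_vocab(docs)
    print(to_observations(docs[0], vocab))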

Evaluating the predictions: once the model has been trained and has passed evaluation on the validation set, it can be used to predict news events. Several aspects need attention here: 1. interpreting the output: the HMM typically returns the most likely state sequence, which represents the predicted event category or trend.

Experimental Methods for Studying Implicit Memory


Implicit memory refers to the memory ability a person displays without awareness or deliberate recollection. It can be explored and measured with a variety of experimental methods; this article introduces the main ones.

1. Implicit association (TAAD) task.

2. Implicit memory tests. These measure implicit memory indirectly and fall into two classes, induced and non-induced. In an induced test, participants first perform a task (for example a word-completion task) and are then asked to recall information that appeared in it. In a non-induced test, participants are unaware that the test is actually measuring implicit memory; in a classic example, participants judge whether each letter in a series is the same as the preceding ones, and they typically show above-chance accuracy without being aware that memory is involved.

3. Procedural learning tasks. These study how participants process complex information and learn implicitly. Participants are asked to complete a task by learning a set of rules, for example finding the exit of a maze without knowing which rule determines the correct path; implicit memory shows up as a gradual improvement in finding the correct path.

4. Physiological measures. Some researchers measure implicit memory indirectly through physiological signals, for example by recording event-related potentials (ERPs), which capture the brain's electrical activity in response to stimuli via electrodes. By analyzing ERPs, researchers can assess how participants process and remember stimuli without awareness.

Summary: the methods above, namely the implicit association task, implicit memory tests, procedural learning tasks, and physiological measures, provide ways of studying implicit memory and help us better understand and explain its nature and mechanisms.

A Summary of Acoustic Modeling Methods for Speech Recognition


Speech recognition converts a speech signal into text and is widely used in voice assistants, smart speakers, automatic call handling, and many other settings. Within a recognition system, acoustic modeling is one of the key components. This article summarizes the main acoustic modeling methods: the Gaussian mixture model (GMM), the hidden Markov model (HMM), and the deep neural network (DNN).

First, the GMM. A GMM is a statistical acoustic model that assumes the speech features are generated by a mixture of several Gaussian distributions. During training, the parameters of the Gaussians, their means and covariance matrices, are estimated by maximum likelihood. During recognition, the input features are compared against each Gaussian and the highest-probability component determines the result. GMMs are common in traditional recognition systems, but their performance is limited to some extent by how well the assumed distribution matches the data.
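A minimal sketch of the GMM scoring idea using scikit-learn; the per-phone data here are synthetic stand-ins for real MFCC frames, and the component count is an arbitrary choice.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Assume MFCC frames have already been extracted and grouped by phone label.
    rng = np.random.RandomState(0)
    frames_by_phone = {"a": rng.randn(300, 13) + 1.0,
                       "s": rng.randn(300, 13) - 1.0}

    gmms = {p: GaussianMixture(n_components=4, covariance_type="diag",
                               random_state=0).fit(X)
            for p, X in frames_by_phone.items()}

    def classify_frame(frame):
        """Return the phone whose GMM gives the frame the highest log-likelihood."""
        scores = {p: g.score_samples(frame[None, :])[0] for p, g in gmms.items()}
        return max(scores, key=scores.get)

    print(classify_frame(rng.randn(13) + 1.0))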

Next, the HMM. An HMM is a sequence model that assumes the speech signal is generated by a sequence of hidden states together with the corresponding observable observations. During training, the parameters, namely the initial state probabilities, the transition probabilities, and the emission probabilities, are estimated by maximum likelihood; during recognition, the Viterbi algorithm finds the most likely state sequence, which yields the final recognition result. HMMs are used throughout speech recognition and are comparatively strong at modeling long sequences.

Both GMMs and HMMs have drawbacks: the GMM needs many parameters and is computationally expensive, while the HMM is relatively weak at modeling complex speech signals. In recent years, deep neural networks have therefore been introduced into speech recognition as a new acoustic modeling method. A DNN is a neural network built from several layers of units. In speech recognition it can be used for both learning and prediction in the acoustic model: spectral features of the speech signal are fed in, several network layers perform feature extraction and model training, and the output layer produces the final recognition result. Compared with the traditional GMM and HMM methods, DNN-based systems achieve better performance in speech recognition, are less constrained by assumptions about the data distribution, and model complex speech signals more effectively.

Training Techniques for Hidden Markov Models (9)


The hidden Markov model (HMM) is a statistical model used widely in speech recognition, natural language processing, bioinformatics, and other fields. It describes a Markov process with hidden states and is used to infer and predict those states. In practice its performance depends on how well the model is trained, so training technique is crucial.

Basic principle: before discussing training techniques, recall briefly how an HMM works. It consists of two random processes: a hidden Markov chain and an observable output process. The hidden chain is defined by a set of states and the transition probabilities between them; the output process is defined by the emission probabilities with which each hidden state generates observable symbols. The goal of the HMM is either to infer the hidden state sequence from an observed symbol sequence or to predict the observable symbols from the hidden state sequence.

Training objective: training estimates the model parameters, the hidden-state transition probabilities and the emission probabilities, from known observation sequences. This is usually done by maximum likelihood, i.e. by finding the parameter values that maximize the probability of the observed data. In practice training involves two stages, initialization and iterative optimization; the following paragraphs discuss some techniques for each.

Parameter initialization: before training, the model parameters must be initialized, including the hidden-state transition matrix and the emission matrix. The transition matrix can be given a rough initial estimate by random initialization, while the emission matrix can be estimated from the observed symbol sequences, typically using empirical distributions or other heuristics. It is worth noting that good initial values are essential for the subsequent optimization.

The Baum-Welch algorithm: Baum-Welch is the iterative optimization algorithm most widely used in HMM training. It is based on the expectation-maximization (EM) idea and updates the model parameters repeatedly until convergence. Concretely, each iteration has two steps, the E-step and the M-step: in the E-step, the forward-backward algorithm computes the posterior probabilities of the hidden states; in the M-step, these posteriors are used to update the model parameters.
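For concreteness, here is an illustrative NumPy sketch of a single Baum-Welch iteration for a discrete-emission HMM and one training sequence; the variable names and the absence of smoothing or convergence checks are simplifications for readability, not recommendations.

    import numpy as np

    def baum_welch_step(pi, A, B, obs):
        """One EM iteration: scaled forward/backward (E-step), then closed-form updates (M-step)."""
        N, T = A.shape[0], len(obs)
        alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)

        alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):                               # forward pass (scaled)
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]

        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):                      # backward pass, same scaling
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

        gamma = alpha * beta                                # P(state_t = i | obs)
        xi = np.zeros((N, N))                               # summed over t: P(i -> j | obs)
        for t in range(T - 1):
            xi += (alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]

        new_pi = gamma[0]
        new_A = xi / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B)
        for t, o in enumerate(obs):
            new_B[:, o] += gamma[t]
        new_B /= gamma.sum(axis=0)[:, None]
        return new_pi, new_A, new_B, np.log(c).sum()        # also return log-likelihood

    # toy usage: repeat this step until the log-likelihood stops improving
    pi = np.array([0.5, 0.5])
    A  = np.array([[0.7, 0.3], [0.4, 0.6]])
    B  = np.array([[0.6, 0.4], [0.2, 0.8]])
    pi, A, B, ll = baum_welch_step(pi, A, B, [0, 1, 1, 0, 1])
    print(ll)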

Notes on Using Hidden Markov Models for Behavior Recognition (7)


Introduction: the hidden Markov model (HMM) is a statistical model for time-series data that is widely used in speech recognition, gesture recognition, and behavior recognition. In behavior recognition it can be used to model and recognize human actions, animal behavior, and similar signals. Using an HMM for this purpose raises several practical issues, which this article discusses under data preprocessing, model construction, and parameter tuning.

Data preprocessing: the first step is to prepare the data, which involves data collection, data cleaning, and feature extraction. During collection, attention must be paid to the accuracy and stability of the acquisition devices and to the sampling rate. Cleaning denoises and smooths the collected data and removes outliers and erroneous samples. Feature extraction converts the raw data into feature vectors suitable for HMM modeling; commonly used features include time-domain and frequency-domain descriptors.
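A small sketch of the kind of time- and frequency-domain features mentioned here, computed over sliding windows of a one-dimensional sensor stream; the window length, hop size, and choice of features are arbitrary examples, not a prescribed recipe.

    import numpy as np

    def window_features(signal, win=128, hop=64):
        """Slide a window over a 1-D sensor stream and emit simple time/frequency features.

        Features per window (illustrative choice): mean, standard deviation,
        zero-crossing rate, and the energy in the lowest FFT bins.
        """
        feats = []
        for start in range(0, len(signal) - win + 1, hop):
            w = signal[start:start + win]
            zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)
            spec = np.abs(np.fft.rfft(w))
            feats.append([w.mean(), w.std(), zcr, spec[:8].sum()])
        return np.asarray(feats)      # shape (n_windows, 4): one HMM observation per row

    x = np.sin(np.linspace(0, 60, 2000)) + 0.1 * np.random.RandomState(0).randn(2000)
    print(window_features(x).shape)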

Throughout preprocessing, data quality and the usefulness of the features must be checked, since they determine the accuracy and stability of the HMM.

Model construction: an HMM consists of a state space, an observation space, transition probabilities, emission probabilities, and initial probabilities. Building the model requires choosing the size of the state space, the dimensionality of the observation space, and the model's topology. The state space should reflect the complexity and variety of the behaviors: too many states make the model unnecessarily complex, too few limit its expressiveness. The observation space must carry enough feature information to distinguish and recognize the different behaviors. The topology, that is, the allowed transitions between states and the links between observations and states, should be designed and tuned for the behaviors at hand.

Parameter tuning: after the model is built, its parameters, the transition matrix, the emission matrix, and the initial probability vector, must be tuned. Tuning involves initialization, learning, and refinement. Initial values should be chosen sensibly using domain knowledge and the characteristics of the data. Learning estimates and updates the parameters from the observed data, commonly with the Baum-Welch (EM) algorithm. Refinement then adjusts and improves the model to raise its recognition accuracy and generalization ability.

An Experiment on Implicit Memory


Name: Liu Tingting; student ID: 222012306032024; major: psychology; year: 2012; course: experimental psychology; experiment date: 2013/11/18; group members: Yu Yao, Huang Cheng; grade: .

An Experiment on Implicit Memory. Liu Tingting (Faculty of Psychology, Southwest University, Chongqing, 400715).

Abstract: this experiment set out to demonstrate the existence of implicit memory and to compare implicit and explicit memory. Data were collected from 54 participants, all from the Faculty of Psychology at Southwest University, who were tested with the implicit-memory test program of the PsyKey experimental psychology teaching system to examine whether implicit memory was present and how implicit and explicit memory differ. The results show that implicit memory was present in this experiment and that explicit and implicit memory were not significantly related, behaving as two independent variables.

Keywords: implicit memory, explicit memory, perceptual identification, recognition.

1. Introduction. Implicit memory is the phenomenon in which past experience automatically influences performance on a current task without the need for conscious or deliberate recollection. Because implicit memory was discovered while studying priming effects in patients, implicit memory and the priming effect are often treated as equivalent concepts. Schacter and Graf first proposed the concept of implicit memory in the 1980s, and it has since become a central topic in cognitive-psychological research on memory. Research has produced several theoretical accounts, including threshold accounts, activation accounts, transfer-appropriate processing, and the multiple memory systems view; the latter two are currently the main explanations, and there is a trend toward combining them. Transfer-appropriate processing holds that there is only one memory system and that implicit and explicit memory are not different systems: the experimental dissociations observed in memory experiments merely reflect the different processing demands of the tests and do not indicate two functionally independent memory systems. The core of the multiple-memory-systems view is, by contrast, that the experimental dissociations reflect distinct subsystems within memory, with implicit and explicit memory representing two such subsystems, thereby replacing the older view of memory as a single system (Yang Qing, 2011). Each account has its merits, but neither can fully explain all of the observed experimental dissociations.

[Chinese Word Segmentation] The Structured Perceptron (SP)


The structured perceptron (SP) was proposed by Collins [1] at EMNLP'02 to solve sequence labeling problems. Chinese word segmentation tools, and the segmentation model discussed here, are based on it.

1. The structured perceptron model. A CRF models the probability P(Y|X) globally under the maximum-entropy principle, where X is the input sequence x_1^n and Y the label sequence y_1^n. Unlike the CRF, which models a probability function, the SP models a score function:

S(Y,X) = \sum_s \alpha_s \Phi_s(Y,X)

where \Phi_s(Y,X) is the globalized form of the local feature function \phi_s(h_i, y_i):

\Phi_s(Y,X) = \sum_i \phi_s(h_i, y_i)

Sequence labeling with the SP then amounts to: given the sequence X, find the label sequence Y that maximizes the score function:

\mathop{\arg\max}_Y S(Y,X)

To avoid overfitting, the weights from every update are kept and then averaged; for this reason the structured perceptron is also called the averaged perceptron.

Decoding. When the SP is applied to Chinese word segmentation, a state-transition feature (y_{t-1}, y_t) is used in addition to the predefined feature templates. Let the maximum score of a path y_1^t that is in state y at time t be

\delta_t(y) = \max S(y_1^{t-1}, X, y_t = y)

Then at time t+1,

\delta_{t+1}(y) = \max_{y'} \{ \delta_t(y') + w_{y',y} + F(y_{t+1}=y, X) \}

where w_{y',y} is the weight of the transition feature (y', y) and F(y_{t+1}=y, X) is the weighted sum of the template features associated with y_{t+1} = y.

2. Open-source implementation. Zhang Kaixu's segmenter (the precursor of THULAC) provides a simple implementation of SP-based Chinese word segmentation.

First, the feature templates:

    def gen_features(self, x):  # enumerate the feature vector of each character
        for i in range(len(x)):
            left2 = x[i - 2] if i - 2 >= 0 else '#'
            left1 = x[i - 1] if i - 1 >= 0 else '#'
            mid = x[i]
            right1 = x[i + 1] if i + 1 < len(x) else '#'
            right2 = x[i + 2] if i + 2 < len(x) else '#'
            features = ['1' + mid, '2' + left1, '3' + right1,
                        '4' + left2 + left1, '5' + left1 + mid,
                        '6' + mid + right1, '7' + right1 + right2]
            yield features

Seven features are defined: x_i y_i, x_{i-1} y_i, x_{i+1} y_i, x_{i-2} x_{i-1} y_i, x_{i-1} x_i y_i, x_i x_{i+1} y_i, x_{i+1} x_{i+2} y_i.

The states B, M, E, S are mapped to the numbers 0, 1, 2, 3:

    def load_example(words):  # from an array of words, build (x, y)
        y = []
        for word in words:
            if len(word) == 1:
                y.append(3)
            else:
                y.extend([0] + [1] * (len(word) - 2) + [2])
        return ''.join(words), y

The training corpus is then used to update the weights:

    for i in range(args.iteration):
        print('iteration %i' % (i + 1), end=' ')
        sys.stdout.flush()
        evaluator = Evaluator()
        for l in open(args.train, 'r', 'utf-8'):   # 'open' here follows the codecs.open signature
            x, y = load_example(l.split())
            z = cws.decode(x)
            evaluator(dump_example(x, y), dump_example(x, z))
            cws.weights._step += 1
            if z != y:
                cws.update(x, y, 1)
                cws.update(x, z, -1)
        evaluator.report()
        cws.weights.update_all()
        cws.weights.average()

The Viterbi algorithm is used for decoding, much as in a hidden Markov model:

    def decode(self, x):  # dynamic-programming decoder, analogous to the HMM case
        # analogue of the HMM transition probabilities
        transitions = [[self.weights.get_value(str(i) + ':' + str(j), 0) for j in range(4)]
                       for i in range(4)]
        # analogue of the HMM emission probabilities
        emissions = [[sum(self.weights.get_value(str(tag) + feature, 0) for feature in features)
                      for tag in range(4)] for features in self.gen_features(x)]
        # analogue of the HMM forward variables
        alphas = [[[e, None] for e in emissions[0]]]
        for i in range(len(x) - 1):
            alphas.append([max([alphas[i][j][0] + transitions[j][k] + emissions[i + 1][k], j]
                               for j in range(4))
                           for k in range(4)])
        # follow the "pointers" stored in alphas to recover the best sequence
        alpha = max([alphas[-1][j], j] for j in range(4))
        i = len(x)
        tags = []
        while i:
            tags.append(alpha[1])
            i -= 1
            alpha = alphas[i][alpha[1]]
        return list(reversed(tags))

3. References

[1] Collins, Michael. "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms." Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, 2002.
[2] Zhang, Yue, and Stephen Clark. "Chinese segmentation with a word-based perceptron algorithm." Annual Meeting of the Association for Computational Linguistics. Vol. 45. No. 1. 2007.
[3] Kai Zhao and Liang Huang.
[4] Michael Collins.

The Three Basic Problems of Hidden Markov Models and Their Algorithms


1. Background. The hidden Markov model (HMM) is a statistical model for describing sequential data with hidden states. It is widely used in speech recognition, natural language processing, machine translation, and other fields. Three basic problems arise when working with an HMM: the evaluation problem, the decoding problem, and the learning problem.

2. The evaluation problem. Given an HMM and an observation sequence, the evaluation problem asks how to compute the probability of the observation sequence under the model. Concretely, given a model \lambda = (A, B, \pi) and an observation sequence O = (o_1, o_2, ..., o_T), we need to compute P(O|\lambda).

Forward algorithm: the forward algorithm is the classic solution to the evaluation problem. Using dynamic programming, it computes the forward probabilities \alpha_t(i), the probability of being in state i at time t having observed o_1, o_2, ..., o_t. They satisfy the recursion

\alpha_{t+1}(j) = \left( \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right) b_j(o_{t+1})

where a_{ij} is the probability of moving from state i to state j and b_j(o_{t+1}) is the probability of observing o_{t+1} in state j. Finally, the probability of the observation sequence is obtained by summing the forward probabilities at the last time step:

P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)

Backward algorithm: the backward algorithm is another common solution. Using dynamic programming, it computes the backward probabilities \beta_t(i), the probability of observing o_{t+1}, o_{t+2}, ..., o_T given that the model is in state i at time t. They satisfy

\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)

with \beta_T(i) = 1. The probability of the observation sequence is obtained by combining the backward probabilities at the first time step with the initial state distribution:

P(O|\lambda) = \sum_{i=1}^{N} \pi_i \, b_i(o_1) \, \beta_1(i)

3. The decoding problem. Given an HMM and an observation sequence, the decoding problem asks for the most likely hidden state sequence. Concretely, given a model \lambda = (A, B, \pi) and an observation sequence O = (o_1, o_2, ..., o_T), we need to find a hidden state sequence I = (i_1, i_2, ..., i_T) that maximizes P(I|O, \lambda).


DISCRIMINATIVE TRAINING OF HIDDEN MARKOV MODELS FOR MULTIPLE PITCH TRACKINGFrancis R.Bach Computer Science DivisionUniversity of California Berkeley,CA94720,USA fbach@Michael I.Jordan Computer Science and Statistics University of CaliforniaBerkeley,CA94720,USA jordan@ABSTRACTWe present a multiple pitch tracking algorithm that is based on direct probabilistic modeling of the spectrogram of the signal.The model is a factorial hidden Markov model whose parameters are learned discriminatively from the Keele pitch database[1].Our algorithm can track several pitches and determines the number of pitches that are active at any given time.We present simulation results on mixtures of several speech signals and noise,showing the robustness of our approach.1.INTRODUCTIONPitch tracking is a fundamental problem in speech and music pro-cessing,and the design of robust algorithms for single or multiple pitch determination has been an active topic of research in acous-tic signal processing[2,3,4,5,6,7].Most pitch extraction algo-rithmsfirst build a set of nonlinear features(e.g.,the correlogram or the cepstrum)that exhibit special behavior when voiced speech is uttered and then model this behavior to track pitch.In the pres-ence of multiple voiced signals that mix additively,it is natural to consider modeling directly the signals or a linear representation thereof(such as the spectrogram)in order to preserve additivity and make it possible to use models for one pitch in order to ex-tract multiple pitches.In this paper,we work with the magnitude of spectrogram.The magnitude is not a linear representation,but because of the sparsity of speech and music signals in the spectro-gram,it can be well approximated as such[8].Working directly with the spectrogram requires a detailed prob-abilistic model for characterizing pitch.In this paper,we consider a variant of a hidden Markov harmonic model and use the frame-work of graphical models to build the model,learn it from data and design efficient inference algorithms[9].In particular,we use recent developments from the machine learning literature to cap-ture the appropriate properties of speech and music;in particular, we make use of nonparametric priors to capture smoothness of the spectral envelope,and we improve extraction performance by using discriminative training of the models[10].We present the graphical model in Section2,the inference algorithm in Section3 and the learning algorithm in Section4.In Section5,we test our algorithm on a variety of challenging pitch extraction tasks.This work was supported by a grant from Intel Corporation,and a graduate fellowship to Francis Bach from Microsoft Research.2.GRAPHICAL MODEL FOR PITCH EXTRACTION In this paper,we assume that the speech signals are sampled at5.5KHz.Given a real one-dimensional signal x t,t=1,...,T, the spectrogram s is defined as the short-time windowed Fouriertransform of x;i.e.,the signal x is cut into N overlapping framesof length M,and the spectrogram s is defined as the N×P matrixwhose n-th column s n∈R P is the P-point FFT of a windowed version of the n-th frame.1In this paper,we model the magnitudeof the spectrogram and refer to the magnitude of the spectrogramsimply as the spectrogram.Since the speech signals are real,theFFT is symmetric and we only need to consider thefirst P/2fre-quencies.2.1.Additive modelThe input to our pitch tracker is the sequence s n∈R P,n= 1,...,N,where N is the number of frames,equal to a constant times the duration T of the signal x.We use an additive modelfor the 
spectrogram,i.e.,if K speakers are potentially present,wemodel the n-th frame as the superposition of K signals u k n∈R P plus noise,i.e.,s n= K k=1u k n+εn.Note that the acoustics are not additive for the magnitude of the spectrogram;however,sincesignals from two different speakers have small overlap[8],the lin-earity is a reasonable approximation.The advantage of using themagnitude is that the modeling of the smoothness of the spectralenvelope is easily achieved using spline smoothing techniques,asdescribed in Section2.2.2.2.Harmonic modelWe use a harmonic model in the frequency domain,which amountsto modeling the spectrogram of voiced speech as an amplitude-modulated comb[3].We model each speaker k at time frame nwith four variables:•Voiced/unvoiced:v k n is a binary variable which is equal to one if the speaker k utters voiced speech at time n,and equal to zero otherwise(either non-voiced speech or no speech ut-tered).•Pitch:ωk n is the frequency of the pitch of speaker k at time n, scaled so that it is equal to the distance between two harmonic peaks in the spectrogram.1In simulations,we use frames of length40ms sampled every10ms,a Hanning window and a512-point FFT.Fig.1.HMM for one speaker for two time frames n and n+1(time subscripts are omitted).•Harmonics:h k n is a set of vectors of harmonic amplitudes if the speech is voiced.There is one harmonic vector h k nωforeach potential pitch valueω.The dimension of h k nωis equalto P/2ω .Given that the signal is non-voiced(i.e.,givenv k n=0),then all sets of harmonic amplitudes for allωare in-dependent from all other variables,while given that the signalis voiced and given the pitchω,the entire set{h k nω ,ω =ω} is independent from other variables.•Constant term:c k n is the constant amplitude of non-voiced por-tions.Given v k n=1,c k n is independent from all other vari-ables.The graphical model describing the model for a single speakeris a simple Hidden Markov model(HMM)and is shown in Fig-ure1.The conditional probability distributions that are needed tofully specify the model reflect the known psychoacoustics and sta-tistical properties of pitch[11,2,3]and are as follows:•p(v k n+1|v k n)is a constant transition matrix T v with four ele-ments.•p(ωk n+1|ωk n):the pitch is discretized on a grid with nω=300 elements,and each logarithm of the row of the transition ma-trix is equal to(up to additive constants)α1(ωk n+1−ωk n)2+α2ωk n+1+α3(ωk n+1)−1.Note that the high number of val-ues for the discretization of the pitch frequency is necessary inorder to obtain good pitch extraction performance.•p(h k nω):for each value of the pitchω,h k nωis modelled as the restriction of a smooth function on[0,P/2]—i.e.,a functionwith bounded second derivative—to all multiples ofω.Thatis,(h k nω)i is equal to g(iω),where g is a function such that |g(2)|2is bounded.g is usually referred to as the spectral envelope[3].Following[12],h k nωcan thus be modelled as a Gaussian pro-cess on the line[0,P/2]observed at multiples of the funda-mental frequencyω;this implies that h k nωcan be written ash k nω=Kωa k nω+Tωb k nω,where Kωis the“kernel matrix”de-fined as(Kω)ij=(1ij min{i,j}−1min{i,j}3)ω3,and Tωis a matrix with two columns,one constant and one linear func-tion of the frequency.The auxiliary variables a k nωand b k nωare normal with mean0and covariance matrices(α4Kω+α5I)−1 andα6I.•The variable c k n is normal with meanα4and varianceα5.•Observation model:givenωk n,h k n,c k n and v k n,the signal u k n isequal to B(ωk n)h knωk n if v k n=1,and equal to c k n e 
Fig. 3. Factorial HMM for two speakers for two time frames n and n+1 (time subscripts are omitted).

2.3. Factorial hidden Markov models

The K models for each speaker can be joined into a single graphical model, a "factorial HMM," where the 2K Markov chains evolve independently (see Figure 3 for the model with two speakers). The parameter λ_n is the variance of the Gaussian noise ε_n at time n. We assume it has a uniform distribution, and it is discretized on a uniform logarithmic grid with n_λ = 10 elements.

2.4. Related models

Our graphical model resembles the models presented previously in [4], [5], [6] and [7]. In [4] and [5], the graphical model is defined on features rather than on the speech signal directly (or its spectrogram), which abandons the additive structure of the mixing and makes it more difficult to estimate several pitch tracks. In [6] and [7], harmonic models are used but most parameters are not learned from data, and the harmonic model does not include a smoothness prior, which is crucial to avoid pitch halving. Also, models that are learned from data such as [4] or [5] use maximum likelihood training, while we use discriminative training, which is more expensive but leads to superior performance (see Section 5).

3. PITCH EXTRACTION

In the following sections, we use the shorthand x to denote the set of variables (x^k_n)_{k,n} for all k and n, while we use the shorthand x^k to denote the set of variables (x^k_n)_n for all n. If we denote z = (ω, v, h, c, λ), then the task of inference is to compute, given some data s, arg max_z p(z|s). Maximization with respect to (h, λ) can be done in closed form, and thus we are left with the task of maximizing with respect to (ω, v).

3.1. One speaker

With one speaker, this is simply inference in an HMM where the hidden state has a number of values proportional to n_ω, and the complexity of inference for a speech signal of duration T is thus O(T n_ω) for computing potentials and O(T n_ω²) for the Viterbi algorithm [13].

3.2. Two or more speakers

With m speakers, we have a factorial HMM with 2m uncoupled Markov chains with n_ω or 2 states each, thus the complexity of exact inference is O(T n_ω^m) for computing potentials and O(T n_ω^{m+1}) for a structured Viterbi algorithm [13]. Given that searching a space of size n_ω² is the most expensive operation we are willing to afford (since n_ω is large), we use the following approximate scheme, which is a simple extension of similar schemes used for single pitch tracking (e.g., [7]):

1. Recursively estimate the m pitches by finding one single pitch track and subtracting the corresponding estimated harmonic signals.

2. Construct a pool of p_ω pitch value candidates for each time step, by storing local minima in the m Viterbi algorithms of step 1.

3. Perform exact inference using only the pool of candidates.

4. Perform m local optimizations of a single pitch track given the other ones.

The algorithm has complexity O(m n_ω² T) for the Viterbi algorithm with single pitch tracks, and O(T p_ω^m) for the structured Viterbi algorithm of step 3. In practice, p_ω is small enough (around 10) that step 3 is not the bottleneck, while being large enough to yield no significant difference from the setting p_ω = n_ω (i.e., no approximation).
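The following sketch illustrates step 3 of the scheme above: a Viterbi search over joint pitch states restricted to the per-frame candidate pools. The emission and transition scoring callables stand in for the model potentials of Section 2 and, like all names here, are assumptions of this Python sketch rather than the authors' implementation; steps 1, 2 and 4 are taken as given.

```python
import itertools
import numpy as np

def pooled_joint_viterbi(pools, log_emit, log_trans, m):
    """Viterbi over joint pitch states restricted to per-frame candidate pools.

    pools     : list of length T; pools[t] is the list of candidate pitches at frame t
    log_emit  : callable (t, omegas) -> log potential of the m-tuple omegas at frame t
    log_trans : callable (omegas_prev, omegas_next) -> log transition score
    m         : number of speakers
    Returns the best sequence of m-tuples of pitch values.
    """
    # Joint states at frame t: all m-tuples of pooled candidates (about p_omega**m of them).
    states = [list(itertools.product(pools[t], repeat=m)) for t in range(len(pools))]

    delta = [log_emit(0, s) for s in states[0]]   # best score ending in each initial state
    back = [None]                                 # backpointers, one list per frame
    for t in range(1, len(states)):
        new_delta, ptrs = [], []
        for s in states[t]:
            scores = [delta[j] + log_trans(r, s) for j, r in enumerate(states[t - 1])]
            j_best = int(np.argmax(scores))
            ptrs.append(j_best)
            new_delta.append(scores[j_best] + log_emit(t, s))
        back.append(ptrs)
        delta = new_delta

    # Backtrack the best joint pitch track.
    idx = int(np.argmax(delta))
    path = [states[-1][idx]]
    for t in range(len(states) - 1, 0, -1):
        idx = back[t][idx]
        path.append(states[t - 1][idx])
    path.reverse()
    return path
```

Note that this naive version scores every pair of consecutive joint states, i.e., O(T p_ω^{2m}) work, whereas the structured Viterbi algorithm referred to in the text exploits the chain structure; with p_ω around 10 and m = 2 the difference is small in practice.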
4. LEARNING OF PARAMETERS

If we denote z = (ω, v, h, c, λ), then we have a model for s which is a latent variable model with latent variable z. In the presence of "labelled data," i.e., datasets for which both s and z are available, there are two different types of training that can be employed, generative or discriminative.

In this paper, we use pitch-labelled data from the Keele pitch database [1]. This database has ten different speakers; the pitch frequency ω and the voicing decision v are available, but neither the harmonic amplitudes h nor the unvoiced constant amplitude c are available.

We can create artificial labelled training data with several speakers by superposing two distinct signals. In this paper, we consider mixtures of two speakers for training and mixtures of two or three speakers for testing (note that since the parameters are shared by all speakers in our framework, learning on only two speakers leads to a pitch extractor that can deal with any number of pitches). We thus have two sets of hidden variables (ω^1, v^1), (ω^2, v^2), one for each speaker.

4.1. Generative training (maximum likelihood)

In this type of estimation, if we have observations for both s and the hidden states z, we simply maximize the joint likelihood p(s, z) of the data (s, z). Since we have a directed graphical model, this readily decouples into independent parameter estimations for each conditional distribution [9]. The data from the Keele pitch database do not include the harmonic amplitudes; the harmonic amplitudes that do not correspond to the pitch value ω (which is observed) do not play any role in the model, thus they can be left unspecified; for the harmonic amplitudes corresponding to the observed pitch, we take h to be the best amplitudes in the least-squares sense, i.e., h_ω = (B(ω)ᵀ B(ω))^{−1} B(ω)ᵀ s.

Although efficient (no inference in an HMM has to be performed for learning), such training, when the final objective of inference is only to estimate the hidden state z and not to also obtain a model of the observations, is usually outperformed by discriminative training, which directly optimizes the conditional likelihood p(z|s) [14, 10].

4.2. Discriminative training

Instead of maximizing p(s, z), we maximize the conditional likelihood p(z|s). Maximizing the conditional likelihood does not decouple in a graphical model, and thus exact maximum conditional likelihood estimation requires performing many runs of the inference algorithm for factorial HMMs, even to simply compute p(z|s). Since exact inference is intractable in those HMMs, we instead maximize a "pseudo log-likelihood," which is defined as the sum of the log-likelihoods of subproblems and exhibits asymptotic properties similar to full maximum likelihood [15]. We define the pseudo log-likelihood as follows: the available data is (ω^1, v^1, ω^2, v^2); we let q(ω, ω′, v, v′) denote

q(ω, ω′, v, v′) = max_{h^1, h^2, c^1, c^2, λ} p(ω, ω′, v, v′, h^1, h^2, c^1, c^2, λ | s).

We maximize with respect to the parameters the log-likelihood defined as:

log [ q(ω^1, ω^2, v^1, v^2) / Σ_{ω,v} q(ω, ω^2, v, v^2) ] + log [ q(ω^1, ω^2, v^1, v^2) / Σ_{ω,v} q(ω^1, ω, v^1, v) ].

The maximization is performed through gradient descent, and requires inference in an HMM with a number of states proportional to n_ω, as opposed to n_ω².
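The structure of this pseudo log-likelihood can be seen in the following toy Python sketch, in which the scores q have been precomputed into a table indexed by each speaker's configuration. In the actual algorithm the sums in the denominators are computed by HMM inference over whole pitch tracks rather than by enumerating such a table, so this is purely illustrative and all names are placeholders.

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log-sum-exp of a 1-D array."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def pseudo_conditional_loglik(log_q, i1, i2):
    """Pseudo log-likelihood of the labelled configurations (i1, i2).

    log_q[i, j] holds log q(z1 = i, z2 = j), where each index enumerates one
    speaker's (pitch, voicing) configuration.  The two terms correspond to
    log p(z1 | z2, s) and log p(z2 | z1, s) under the model.
    """
    term1 = log_q[i1, i2] - logsumexp(log_q[:, i2])
    term2 = log_q[i1, i2] - logsumexp(log_q[i1, :])
    return term1 + term2

# The parameters are then adjusted by gradient steps on the sum of this
# quantity over the labelled training mixtures.
```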
5. SIMULATIONS

In this section, we show that the various features that were included in our graphical model framework lead to robust performance. In all our simulations, training was performed on the first 6 speakers, while testing was performed on the remaining 4 speakers. The metric we use to compare pitch frequencies ω and ω′ is d(ω, ω′) = 1 − e^{−(ω−ω′)²/σ²}, where σ² is the empirical variance of the pitch frequency over the entire training set. This measure is equivalent to the (scaled) squared distance for close values and tends to 1 for distant values. We prefer it to the plain squared distance because, if an estimated pitch is far away from the true pitch, its exact value is not relevant and we prefer to assign a fixed unit penalty to all clearly wrong values of the pitch.

The running time for extracting any number of pitches is linear in the duration of the signals. In our current Matlab implementation on a 2 GHz processor, the running time is 30 times the duration of the signal for extracting one pitch, and 130 times the duration of the signal for extracting two pitches.
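For reference, a direct transcription of this error measure (in Python rather than the authors' Matlab; variable names are illustrative) is:

```python
import numpy as np

def pitch_error(omega_est, omega_true, sigma2):
    """Saturating pitch error d(w, w') = 1 - exp(-(w - w')**2 / sigma2).

    sigma2 is the empirical variance of the pitch frequency over the training
    set; the measure behaves like a scaled squared error for nearby values and
    saturates at 1 for gross errors such as octave mistakes."""
    omega_est = np.asarray(omega_est, dtype=float)
    omega_true = np.asarray(omega_true, dtype=float)
    return 1.0 - np.exp(-((omega_est - omega_true) ** 2) / sigma2)

# Example: mean error over the voiced frames of a test utterance.
# sigma2 = np.var(training_pitches)            # computed once from the training set
# err = pitch_error(estimates, references, sigma2)[voiced_mask].mean()
```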
                  voicing   pitch error
  female-male       22%        0.03
  female-female     32%        0.08
  male-male         31%        0.07

Fig. 4. Double pitch extraction: voicing decision error rates and mean pitch estimation errors.

Fig. 5. Single pitch extraction with noise: voicing decision error rates (left) and mean pitch estimation errors (right); white noise (plain), stationary colored noise (dashed), restaurant background (dotted).

5.1. Effect of smoothing spline prior

For the simple task of pitch determination for independent frames taken from one speaker, we have compared our approach to an approach without a spline smoothing prior on the harmonic amplitudes: with the smoothing spline prior, the average error on the pitch estimate using the measure defined earlier is equal to 0.28, while the error for the estimate without the smoothing spline prior is equal to 0.57, and most of the additional errors are due to pitch halving, which is a well-known problem in pitch determination. In the context of harmonic modeling approaches such as the one presented in this paper, a priori detailed knowledge of the spectral envelope has been shown to remove the pitch halving ambiguity [3]; the current results suggest that a simple spline smoothing prior which does not require knowledge of the envelope is also sufficient to resolve this ambiguity.

5.2. Discriminative vs. generative training

On single pitch tracking experiments, we compared the performance of pitch extractors trained discriminatively or generatively. The pitch extractor trained generatively made an incorrect voicing decision 27.4% of the time and had a pitch estimation error of 0.022, while the pitch extractor trained discriminatively made an incorrect voicing decision only 5% of the time and had a pitch estimation error of 0.016. Discriminative training indeed leads to significantly better performance.

5.3. Two speakers

In this set of experiments, we mixed two signals from different speakers with the same energy. In Figure 4 we report incorrect voicing decision rates and mean pitch estimation errors for the different gender combinations of the two speakers.

5.4. Noisy conditions

We also performed experiments in which we added three different types of noise to the signal: white noise, stationary colored noise and non-stationary restaurant background noise. We plot the results as a function of signal-to-noise ratio in Figure 5, illustrating the robustness of our pitch extractor to noise.

6. CONCLUSION

We have presented an algorithm for multiple pitch extraction based on graphical models. The use of appropriate prior distributions and discriminative training leads to robust extraction performance. Importantly, the computational complexity of our algorithm is linear in the length of the audio segment. Although the running time of our current Matlab implementation is 130 times slower than real time, we do not foresee any major obstacles to the design of a more efficient software implementation that runs in real time.

7. REFERENCES

[1] F. Plante, G. F. Meyer, and W. A. Ainsworth, "A pitch extraction reference database," in Proc. EUROSPEECH, 1995.
[2] B. Gold and N. Morgan, Speech and Audio Signal Processing, Wiley Press, 1999.
[3] R. J. McAulay and T. F. Quatieri, "Pitch estimation and voicing detection based on a sinusoidal speech model," in Proc. ICASSP, 1990.
[4] M. Wu, D. Wang, and G. J. Brown, "A multipitch tracking algorithm for noisy speech," IEEE Trans. Speech Audio Proc., vol. 11, no. 3, 2003.
[5] X. Li, J. Malkin, and J. Bilmes, "Graphical model approach to pitch tracking," in Intl. Conf. Spoken Lang. Proc., 2004.
[6] P. J. Walmsley, S. J. Godsill, and P. J. W. Rayner, "Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters," in Proc. IEEE Work. App. Sig. Proc. Acoust., 1999.
[7] J. Tabrikian, S. Dubnov, and Y. Dickalov, "Maximum a-posteriori probability pitch tracking in noisy environments using harmonic model," IEEE Trans. Speech Audio Proc., vol. 12, no. 1, 2004.
[8] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Sig. Proc., vol. 52, no. 7, pp. 1830–1847, 2004.
[9] M. I. Jordan, "Graphical models," Stat. Sci., vol. 19, no. 1, pp. 140–155, 2004.
[10] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. ICML, 2001.
[11] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, 1990.
[12] G. Wahba, Spline Models for Observational Data, SIAM, 1990.
[13] Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," Machine Learning, vol. 29, pp. 245–273, 1997.
[14] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP, 1986.
[15] G. Liang and B. Yu, "Maximum pseudo likelihood estimation in network tomography," IEEE Trans. Sig. Proc., vol. 51, no. 8, pp. 2043–2053, 2003.
