Stanford University Machine Learning Notes
Machine Learning Personal Notes, Complete Edition v5 (original draft)

Stanford University 2014 Machine Learning Course: Personal Notes (V5.01)

Abstract: These are personal notes on the video lectures of Stanford University's 2014 machine learning course. Huang Haiguang (haiguang2000@), QQ group: 554839127. Last modified: 2017-12-3.

Stanford 2014 Machine Learning Course, Chinese Notes

Course overview: Machine learning is the study of how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental route to making computers intelligent; its applications span every field of AI, and it relies mainly on induction and synthesis rather than deduction.

Over the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress toward human-level AI.

In this course, you will learn about the most effective machine learning techniques and gain practice implementing them and getting them to work for yourself. More importantly, you will learn not only the theoretical underpinnings of learning, but also the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you will learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI.

The course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (1) supervised learning (parametric and non-parametric algorithms, support vector machines, kernels, neural networks); (2) unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning); (3) best practices in machine learning (bias/variance theory; the innovation process in machine learning and AI). The course also draws on numerous case studies and applications, so you will also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, data mining, and other areas.

The course runs for 10 weeks with 18 lectures in total. Compared with the earlier machine learning videos, these are clearer, and every lecture comes with PPT slides; recommended.

I am a 2014 PhD student at the Ocean University of China and first encountered machine learning in 2014; I downloaded all the videos and slides of this course to share. The Chinese and English subtitles come from https:///course/ml, translated mainly by the 教育无边界 subtitle group; I merged the Chinese and English subtitles, translated the remaining ones, repackaged and categorized the videos, translated the course outline, and built course index files. I hope this is of help.
Stanford Machine Learning: Collected Problems and Answers

CS 229 Machine Learning (Problems and Answers), Stanford University

Contents: (1) Assignment 1 (Supervised Learning); (2) Assignment 1 solutions; (3) Assignment 2 (Kernels, SVMs, and Theory); (4) Assignment 2 solutions; (5) Assignment 3 (Learning Theory and Unsupervised Learning); (6) Assignment 3 solutions; (7) Assignment 4 (Unsupervised Learning and Reinforcement Learning); (8) Assignment 4 solutions; (9) Problem Set #1: Supervised Learning; (10) Problem Set #1 Answer; (11) Problem Set #2: Naive Bayes, SVMs, and Theory; (12) Problem Set #2 Answer.

CS229, Public Course. Problem Set #1: Supervised Learning

1. Newton's method for computing least squares
In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to θ*.
(a) Find the Hessian of the cost function J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))².
(b) Show that the first iteration of Newton's method gives us θ* = (X^T X)^{-1} X^T y⃗, the solution to our least squares problem.

2. Locally-weighted logistic regression
In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize

ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^m w^(i) [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ].

The −(λ/2) θ^T θ here is what is known as a regularization parameter, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001. Using this definition, the gradient of ℓ(θ) is given by

∇_θ ℓ(θ) = X^T z − λθ,

where z ∈ R^m is defined by z_i = w^(i) (y^(i) − h_θ(x^(i))), and the Hessian is given by

H = X^T D X − λI,

where D ∈ R^{m×m} is a diagonal matrix with D_ii = −w^(i) h_θ(x^(i)) (1 − h_θ(x^(i))). For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well. Given a query point x, we choose to compute the weights

w^(i) = exp( −||x − x^(i)||² / (2τ²) ).

(a) Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point x, and use this to predict the class of x. The q2/ directory contains data and code for this problem. You should implement the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m file. This function takes as input the training set (the X_train and y_train matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights w^(i) for each training example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and finally 3) output y = 1{h_θ(x) > 0.5} as the prediction. We provide two additional functions that might help. The [X_train, y_train] = load_data; function will load the matrices from files in the data/ folder. The function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increase it to at least 200 to get a better idea of the decision boundary.
(b) Evaluate the system with a variety of different bandwidth parameters τ. In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

3. Multivariate least squares
So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example: {(x^(i), y^(i)), i = 1, ..., m}, x^(i) ∈ R^n, y^(i) ∈ R^p. Thus for each training example, y^(i) is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in y = Θ^T x, where Θ ∈ R^{n×p}.
(a) The cost function for this case is

J(Θ) = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^(i))_j − y_j^(i) )².

Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start with the m×n design matrix X whose rows are (x^(1))^T, ..., (x^(m))^T and the m×p target matrix Y whose rows are (y^(1))^T, ..., (y^(m))^T, and then work out how to express J(Θ) in terms of these matrices.]
(b) Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent to the normal equations for the multivariate case.
(c) Suppose instead of considering the multivariate vectors y^(i) all at once, we instead compute each variable y_j^(i) separately for each j = 1, ..., p. In this case, we have p individual linear models, of the form y_j^(i) = θ_j^T x^(i), j = 1, ..., p (so here, each θ_j ∈ R^n). How do the parameters from these p independent least squares problems compare to the multivariate solution?

4. Naive Bayes
In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features x_j, j = 1, ..., n to our model are discrete, binary-valued variables, so x_j ∈ {0, 1}. We call x = [x_1 x_2 ··· x_n]^T the input vector. For each training example, our output targets are a single binary value y ∈ {0, 1}. Our model is then parameterized by φ_{j|y=0} = p(x_j = 1 | y = 0), φ_{j|y=1} = p(x_j = 1 | y = 1), and φ_y = p(y = 1). We model the joint distribution of (x, y) according to

p(y) = (φ_y)^y (1 − φ_y)^(1−y)
p(x | y = 0) = Π_{j=1}^n p(x_j | y = 0) = Π_{j=1}^n (φ_{j|y=0})^(x_j) (1 − φ_{j|y=0})^(1−x_j)
p(x | y = 1) = Π_{j=1}^n p(x_j | y = 1) = Π_{j=1}^n (φ_{j|y=1})^(x_j) (1 − φ_{j|y=1})^(1−x_j)

(a) Find the joint likelihood function ℓ(φ) = log Π_{i=1}^m p(x^(i), y^(i); φ) in terms of the model parameters given above. Here, φ represents the entire set of parameters {φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, ..., n}.
(b) Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that

φ_{j|y=0} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0}
φ_{j|y=1} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1}
φ_y = ( Σ_{i=1}^m 1{y^(i) = 1} ) / m.

(c) Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier; i.e., if p(y = 0 | x) and p(y = 1 | x) are the class probabilities returned by naive Bayes, show that there exists some θ ∈ R^(n+1) such that p(y = 1 | x) ≥ p(y = 0 | x) if and only if θ^T [1; x] ≥ 0. (Assume θ_0 is an intercept term.)

5. Exponential family and the geometric distribution
(a) Consider the geometric distribution parameterized by φ: p(y; φ) = (1 − φ)^(y−1) φ, y = 1, 2, 3, .... Show that the geometric distribution is in the exponential family, and give b(y), η, T(y), and a(η).
(b) Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by 1/φ.
(c) For a training set {(x^(i), y^(i)); i = 1, ..., m}, let the log-likelihood of an example be log p(y^(i) | x^(i); θ). By taking the derivative of the log-likelihood with respect to θ_j, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.

CS229, Public Course. Problem Set #1 Solutions: Supervised Learning

1. Newton's method for computing least squares
(a) As shown in the class notes,

∂J(θ)/∂θ_j = Σ_{i=1}^m (θ^T x^(i) − y^(i)) x_j^(i),

so

∂²J(θ)/∂θ_j∂θ_k = Σ_{i=1}^m x_j^(i) x_k^(i) = (X^T X)_{jk}.

Therefore, the Hessian of J(θ) is H = X^T X. This can also be derived by simply applying rules from the lecture notes on Linear Algebra.
(b) Given any θ^(0), Newton's method finds θ^(1) according to

θ^(1) = θ^(0) − H^(-1) ∇_θ J(θ^(0))
      = θ^(0) − (X^T X)^(-1) (X^T X θ^(0) − X^T y⃗)
      = θ^(0) − θ^(0) + (X^T X)^(-1) X^T y⃗
      = (X^T X)^(-1) X^T y⃗.

Therefore, no matter what θ^(0) we pick, Newton's method always finds θ* after one iteration.

2. Locally-weighted logistic regression
(a) Our implementation of lwlr.m:

theta = zeros(n,1);
% compute weights
w = exp(-sum((X_train - repmat(x',m,1)).^2, 2) / (2*tau^2));
% perform Newton's method
g = ones(n,1);
while (norm(g) > 1e-6)
    h = 1 ./ (1 + exp(-X_train*theta));
    g = X_train' * (w.*(y_train - h)) - 1e-4*theta;
    H = -X_train' * diag(w.*h.*(1-h)) * X_train - 1e-4*eye(n);
    theta = theta - H\g;
end
% return predicted y
y = double(x'*theta > 0);

(b) (figure: the resulting decision boundaries, for τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0) For smaller τ, the classifier appears to overfit the data set, obtaining zero training error, but outputting a sporadic looking decision boundary. As τ grows, the resulting decision boundary becomes smoother, eventually converging (in the limit as τ → ∞) to the unweighted logistic regression solution.

3. Multivariate least squares
(a) The objective function can be expressed as

J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) ).

To see this, note that

J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) )
     = (1/2) Σ_i ( (XΘ − Y)^T (XΘ − Y) )_{ii}
     = (1/2) Σ_i Σ_j (XΘ − Y)_{ij}²
     = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^(i))_j − y_j^(i) )².

(b) First we take the gradient of J(Θ) with respect to Θ:

∇_Θ J(Θ) = ∇_Θ (1/2) tr( (XΘ − Y)^T (XΘ − Y) )
         = ∇_Θ (1/2) [ tr(Θ^T X^T X Θ) − tr(Θ^T X^T Y) − tr(Y^T X Θ) + tr(Y^T Y) ]
         = (1/2) [ X^T X Θ + X^T X Θ − 2 X^T Y ]
         = X^T X Θ − X^T Y.

Setting this expression to zero we obtain Θ = (X^T X)^(-1) X^T Y. This looks very similar to the closed form solution in the univariate case, except now Y is an m×p matrix, so then Θ is also a matrix, of size n×p.
(c) This time, we construct a set of vectors ~y_j = [y_j^(1); y_j^(2); ...; y_j^(m)], j = 1, ..., p. Then our j-th linear model can be solved by the least squares solution θ_j = (X^T X)^(-1) X^T ~y_j. If we line up our θ_j, we see that we have the following equation:

[θ_1 θ_2 ··· θ_p] = [ (X^T X)^(-1) X^T ~y_1  ···  (X^T X)^(-1) X^T ~y_p ]
                  = (X^T X)^(-1) X^T [~y_1 ~y_2 ··· ~y_p]
                  = (X^T X)^(-1) X^T Y
                  = Θ.

Thus, our p individual least squares problems give the exact same solution as the multivariate least squares.

4. Naive Bayes
(a)

ℓ(φ) = log Π_{i=1}^m p(x^(i), y^(i); φ)
     = log Π_{i=1}^m p(x^(i) | y^(i); φ) p(y^(i); φ)
     = log Π_{i=1}^m [ Π_{j=1}^n p(x_j^(i) | y^(i); φ) ] p(y^(i); φ)
     = Σ_{i=1}^m [ log p(y^(i); φ) + Σ_{j=1}^n log p(x_j^(i) | y^(i); φ) ]
     = Σ_{i=1}^m [ y^(i) log φ_y + (1 − y^(i)) log(1 − φ_y)
                   + Σ_{j=1}^n ( x_j^(i) log φ_{j|y^(i)} + (1 − x_j^(i)) log(1 − φ_{j|y^(i)}) ) ].

(b) The only terms in ℓ(φ) which have non-zero gradient with respect to φ_{j|y=0} are those which include φ_{j|y^(i)} with y^(i) = 0. Therefore,

∇_{φ_{j|y=0}} ℓ(φ) = Σ_{i=1}^m 1{y^(i) = 0} [ x_j^(i)/φ_{j|y=0} − (1 − x_j^(i))/(1 − φ_{j|y=0}) ].

Setting this to zero and solving gives

φ_{j|y=0} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0},

and the derivation for φ_{j|y=1} is identical with the roles of y = 0 and y = 1 exchanged. To solve for φ_y,

∇_{φ_y} ℓ(φ) = Σ_{i=1}^m [ y^(i)/φ_y − (1 − y^(i))/(1 − φ_y) ] = 0

gives φ_y = ( Σ_{i=1}^m 1{y^(i) = 1} ) / m.

(c)

p(y=1|x) ≥ p(y=0|x)
⟺ p(y=1|x)/p(y=0|x) ≥ 1
⟺ ( Π_{j=1}^n p(x_j|y=1) ) p(y=1) / ( ( Π_{j=1}^n p(x_j|y=0) ) p(y=0) ) ≥ 1
⟺ Π_{j=1}^n (φ_{j|y=1})^(x_j) (1−φ_{j|y=1})^(1−x_j) φ_y
   / ( Π_{j=1}^n (φ_{j|y=0})^(x_j) (1−φ_{j|y=0})^(1−x_j) (1−φ_y) ) ≥ 1
⟺ Σ_{j=1}^n [ x_j log(φ_{j|y=1}/φ_{j|y=0}) + (1−x_j) log((1−φ_{j|y=1})/(1−φ_{j|y=0})) ]
   + log(φ_y/(1−φ_y)) ≥ 0,

which is of the required form θ^T [1; x] ≥ 0 with

θ_0 = Σ_{j=1}^n log((1−φ_{j|y=1})/(1−φ_{j|y=0})) + log(φ_y/(1−φ_y)),
θ_j = log(φ_{j|y=1}/φ_{j|y=0}) − log((1−φ_{j|y=1})/(1−φ_{j|y=0})), j = 1, ..., n.

5. Exponential family and the geometric distribution
(a) Writing

p(y; φ) = (1−φ)^(y−1) φ = exp( y log(1−φ) − log((1−φ)/φ) ),

we see the geometric distribution is in the exponential family with

b(y) = 1,  η = log(1−φ),  T(y) = y,  a(η) = log((1−φ)/φ) = η − log(1 − e^η).

(b) The canonical response function is the mean, g(η) = E[y; φ] = 1/φ = 1/(1 − e^η).
(c) For a single example,

ℓ_i(θ) = log p(y^(i) | x^(i); θ) = θ^T x^(i) · y^(i) − θ^T x^(i) + log(1 − e^(θ^T x^(i))),

so

∂ℓ_i(θ)/∂θ_j = ( y^(i) − 1/(1 − e^(θ^T x^(i))) ) x_j^(i),

and the stochastic gradient ascent rule is

θ_j := θ_j + α ( y^(i) − 1/(1 − e^(θ^T x^(i))) ) x_j^(i).

CS229, Public Course. Problem Set #2: Kernels, SVMs, and Theory

1. Kernel ridge regression
In contrast to ordinary least squares, which has a cost function

J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))²,

we can also add a term that penalizes large weights in θ. In ridge regression, our least squares cost is regularized by adding a term (λ/2)||θ||², where λ > 0 is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then

J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))² + (λ/2)||θ||².

(a) Use the vector notation described in class to find a closed-form expression for the value of θ which minimizes the ridge regression cost function.
(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite dimensional) space. Using a feature mapping φ, the ridge regression cost function becomes

J(θ) = (1/2) Σ_{i=1}^m (θ^T φ(x^(i)) − y^(i))² + (λ/2)||θ||².

Making a prediction on a new input x_new would now be done by computing θ^T φ(x_new). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing φ(x_new). You may assume that the parameter vector θ can be expressed as a linear combination of the input feature vectors; i.e., θ = Σ_{i=1}^m α_i φ(x^(i)) for some set of parameters α_i. [Hint: You may find the following identity useful: (λI + BA)^(-1) B = B (λI + AB)^(-1). If you want, you can try to prove this as well, though this is not required for the problem.]

2. ℓ2 norm soft margin SVMs
In class, we saw that if our data is not linearly separable, then we need to modify our support vector machine algorithm by introducing an error margin that must be minimized. Specifically, the formulation we have looked at is known as the ℓ1 norm soft margin SVM. In this problem we will consider an alternative method, known as the ℓ2 norm soft margin SVM. This new algorithm is given by the following optimization problem (notice that the slack penalties are now squared):

min_{w,b,ξ} (1/2)||w||² + (C/2) Σ_{i=1}^m ξ_i²
s.t. y^(i) (w^T x^(i) + b) ≥ 1 − ξ_i, i = 1, ..., m.

(a) Notice that we have dropped the ξ_i ≥ 0 constraint in the ℓ2 problem. Show that these non-negativity constraints can be removed. That is, show that the optimal value of the objective will be the same whether or not these constraints are present.
(b) What is the Lagrangian of the ℓ2 soft margin SVM optimization problem?
(c) Minimize the Lagrangian with respect to w, b, and ξ by taking the following gradients: ∇_w L, ∂L/∂b, and ∇_ξ L, and then setting them equal to 0. Here, ξ = [ξ_1, ξ_2, ..., ξ_m]^T.
(d) What is the dual of the ℓ2 soft margin SVM optimization problem?

3. SVM with Gaussian kernel
Consider the task of training a support vector machine using the Gaussian kernel K(x, z) = exp(−||x − z||²/τ²). We will show that as long as there are no two identical points in the training set, we can always find a value for the bandwidth parameter τ such that the SVM achieves zero training error.
(a) Recall from class that the decision function learned by the support vector machine can be written as

f(x) = Σ_{i=1}^m α_i y^(i) K(x^(i), x) + b.

Assume that the training data {(x^(1), y^(1)), ..., (x^(m), y^(m))} consists of points which are separated by at least a distance of ε; that is, ||x^(j) − x^(i)|| ≥ ε for any i ≠ j. Find values for the set of parameters {α_1, ..., α_m, b} and Gaussian kernel width τ such that x^(i) is correctly classified, for all i = 1, ..., m. [Hint: Let α_i = 1 for all i and b = 0. Now notice that for y ∈ {−1, +1} the prediction on x^(i) will be correct if |f(x^(i)) − y^(i)| < 1, so find a value of τ that satisfies this inequality for all i.]
(b) Suppose we run an SVM with slack variables using the parameter τ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.
(c) Suppose we run the SMO algorithm to train an SVM with slack variables, under the conditions stated above, using the value of τ you picked in the previous part, and using some arbitrary value of C (which you do not know beforehand). Will this necessarily result in a classifier that achieves zero training error? Why or why not? Again, a short explanation is sufficient.

4. Naive Bayes and SVMs for Spam Classification
In this question you'll look into the Naive Bayes and Support Vector Machine algorithms for a spam classification problem. However, instead of implementing the algorithms yourself, you'll use a freely available machine learning library. There are many such libraries available, with different strengths and weaknesses, but for this problem you'll use the WEKA machine learning package, available at /ml/weka/. WEKA implements many standard machine learning algorithms, is written in Java, and has both a GUI and a command line interface. It is not the best library for very large-scale data sets, but it is very nice for playing around with many different algorithms on medium size problems. You can download and install WEKA by following the instructions given on the website above. To use it from the command line, you first need to install a java runtime environment, then add the weka.jar file to your CLASSPATH environment variable. Finally, you can call WEKA using the command:

java <classifier> -t <training file> -T <test file>

For example, to run the Naive Bayes classifier (using the multinomial event model) on our provided spam data set, run the command:

java weka.classifiers.bayes.NaiveBayesMultinomial -t spam_train1000.arff -T spam_test.arff

The spam classification dataset in the q4/ directory was provided courtesy of Christian Shelton (cshelton@). Each example corresponds to a particular email, and each feature corresponds to a particular word. For privacy reasons we have removed the actual words themselves from the data set, and instead label the features generically as f1, f2, etc. However, the data set is from a real spam classification task, so the results demonstrate the performance of these algorithms on a real-world problem. The q4/ directory actually contains several different training files, named spam_train50.arff, spam_train100.arff, etc (the ".arff" format is the default format of WEKA), each containing the corresponding number of training examples. There is also a single test set spam_test.arff, which is a hold out set used for evaluating the classifier's performance.
(a) Run the weka.classifiers.bayes.NaiveBayesMultinomial classifier on the dataset and report the resulting error rates. Evaluate the performance of the classifier using each of the different training files (but each time using the same test file, spam_test.arff). Plot the error rate of the classifier versus the number of training examples.
(b) Repeat the previous part, but using the weka.classifiers.functions.SMO classifier, which implements the SMO algorithm to train an SVM. How does the performance of the SVM compare to that of Naive Bayes?

5. Uniform convergence
In class we proved that for any finite set of hypotheses H = {h_1, ..., h_k}, if we pick the hypothesis ĥ that minimizes the training error on a set of m examples, then with probability at least (1 − δ),

ε(ĥ) ≤ ( min_i ε(h_i) ) + 2 √( (1/(2m)) log(2k/δ) ),

where ε(h_i) is the generalization error of hypothesis h_i. Now consider a special case (often called the realizable case) where we know, a priori, that there is some hypothesis in our class H that achieves zero error on the distribution from which the data is drawn. Then we could obviously just use the above bound with min_i ε(h_i) = 0; however, we can prove a better bound than this.
(a) Consider a learning algorithm which, after looking at m training examples, chooses some hypothesis ĥ ∈ H that makes zero mistakes on this training data. (By our assumption, there is at least one such hypothesis, possibly more.) Show that with probability 1 − δ,

ε(ĥ) ≤ (1/m) log(k/δ).

Notice that since we do not have a square root here, this bound is much tighter. [Hint: Consider the probability that a hypothesis with generalization error greater than γ makes no mistakes on the training data. Instead of the Hoeffding bound, you might also find the following inequality useful: (1 − γ)^m ≤ e^(−γm).]
(b) Rewrite the above bound as a sample complexity bound, i.e., in the form: for fixed δ and γ, for ε(ĥ) ≤ γ to hold with probability at least (1 − δ), it suffices that m ≥ f(k, γ, δ) (i.e., f(·) is some function of k, γ, and δ).

CS229, Public Course. Problem Set #2 Solutions: Kernels, SVMs, and Theory

1. Kernel ridge regression
(a) Using the design matrix notation, we can rewrite J(θ) as

J(θ) = (1/2)(Xθ − y⃗)^T (Xθ − y⃗) + (λ/2) θ^T θ.

Then the gradient is

∇_θ J(θ) = X^T X θ − X^T y⃗ + λθ.

Setting the gradient to 0 gives us

0 = X^T X θ − X^T y⃗ + λθ
θ = (X^T X + λI)^(-1) X^T y⃗.

(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite dimensional) space. Using a feature mapping φ, the ridge regression cost function becomes

J(θ) = (1/2) Σ_{i=1}^m (θ^T φ(x^(i)) − y^(i))² + (λ/2)||θ||².

Making a prediction on a new input x_new would now be done by computing θ^T φ(x_new). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing φ(x_new). You may assume that the parameter vector θ can be expressed as a linear combination of the input feature vectors; i.e., θ = Σ_{i=1}^m α_i φ(x^(i)) for some set of parameters α_i.
Gaussian Tutorial

Gaussian Tutorial, translated in part from Exploring Chemistry with Electronic Structure Methods, Second Edition, by James B. Foresman and Eleen Frisch, Gaussian, Inc., USA, 1996.

Preface. Gaussian can do many things, specifically: molecular energies and structures; energies and structures of transition states; chemical bonds and reaction energies; molecular orbitals; dipole and multipole moments; atomic charges and electrostatic potentials; vibrational frequencies; IR and Raman spectra; NMR; polarizabilities and hyperpolarizabilities; thermochemical properties; reaction pathways. It can model systems in the gas phase and in solution, in the ground state and in excited states. Gaussian is a powerful tool for studying, e.g., substituent effects, reaction mechanisms, potential energy surfaces, and excited-state energies.

Structure of the book: Preface; Running Gaussian; Part 1, Basic concepts and techniques: Chapter 1 Computational models, Chapter 2 Single point energy calculations, Chapter 3 Geometry optimization, Chapter 4 Frequency analysis; Part 2, Computational chemistry methods: Chapter 5 Basis set effects, Chapter 6 Choosing a theoretical method, Chapter 7 High-accuracy calculations; Part 3, Applications: Chapter 8 Studying reactions and reactivity, Chapter 9 Excited states, Chapter 10 Reactions in solution; Appendix A Theoretical background; Appendix B Brief overview of Gaussian input.

Running Gaussian. Unix/Linux: set up the runtime parameters before running Gaussian; for example, in the C shell you need these two lines:
setenv g94root directory      (directory is the parent directory of the program)
source $g94root/g94/bsd/g94.login
Then simply run the job; for example, with an input file h2o.com, under the C shell the command takes the form
g94 < h2o.com > h2o.log
Windows: the graphical interface needs no further explanation.

Input and output files. On Unix systems the input file has the extension .com and the output .log; on Windows the input is .gjf and the output .out. Here is a sample input file:

#T RHF/6-31G(d) Test

My first Gaussian job: water single point energy

0 1
O -0.464 0.177 0.0
H -0.464 1.137 0.0
H 0.441 -0.143 0.0

Line 1 begins with # and is the route section. #T prints only the important parts of the output; #P prints more. RHF means restricted Hartree-Fock; this is where the theoretical method is given. 6-31G(d) is the basis set, i.e., which combination of functions is used to describe the orbitals. Test means the job is not recorded in the Gaussian job archive; it has no effect on standalone installations. Line 3 is a description of the job; write whatever you like, as long as you can understand it. Line 2 is blank; this blank line, and the one on line 4, are both required. Line 5 gives two numbers: the molecular charge and the spin multiplicity. From line 6 on is the molecular geometry; this example uses Cartesian coordinates. A blank line must follow the molecular structure. In the Windows version the GUI separates these sections clearly; do not add extra blank lines there.

Output files. The output is usually long. For the input above it begins with the copyright notice, then the authors (Pople's name comes last). Then Gaussian echoes the input file and converts the input coordinates to standard internal coordinates; none of this needs attention, although this is where you verify that your molecular geometry is right. The key line is the one containing SCF Done; the energy after it is the important quantity, in atomic units (hartree): 1 hartree = 4.3597482E-18 J = 2625.500 kJ/mol = 27.2116 eV. Then comes the population analysis: the molecular orbitals, the eigenvalue (energy) of each orbital, the charge on each atom, and the dipole moment. Then a summary of the whole calculation, with sections separated by \; essentially everything you need is in it. Then comes a quotation, picked at random by Gaussian from its quote library (stored in l9999.exe; if curious, open that file as text and browse, a good chance to practice English). Then the CPU time; note this is not the true elapsed time but CPU time, and the real time is somewhat longer. If several jobs run at once (probably not possible under Windows, but Unix/Linux can run several simultaneously), the actual elapsed time is much longer. The last line, "Normal termination of Gaussian 94", is crucial: if it is missing, the job failed and an error occurred somewhere, and the error message should appear there. Depending on the route section, the output will contain more, but the above is the core.

Chapter 1 Computational models

1.1 The main approaches of computational chemistry are molecular mechanics and electronic structure theory. They have three things in common: (1) computing the energy of the molecule, from which molecular properties can be obtained by suitable methods; (2) geometry optimization, searching near a starting structure for the structure with the lowest energy, driven by the first derivatives of the energy; (3) computing the frequencies of intramolecular motions, based on the second derivatives of the energy.

1.2 Molecular mechanics treats the molecule with classical physics, as seen in MM3, HyperChem, Quanta, Sybyl, Alchemy, and other software. Depending on the force field used, there are many variants. Molecular mechanics is cheap (quantum chemists habitually describe calculations as expensive or cheap, which really just means long or short computation time: for those who pay for machine time, time is money, and for those who own the machine, computing fast also costs money), and can handle systems of up to several thousand atoms. Its drawbacks: (1) every parameter set is derived for particular atoms; there are no universal parameters covering every state of an atom; (2) the electrons are ignored; only bonds and atoms are considered, so systems with strong electronic effects cannot be treated, e.g., bond breaking cannot be described.

1.3 Electronic structure theory is based on the Schrödinger equation and treats the molecule quantum mechanically. There are two main families: (1) semi-empirical methods, including AM1, MINDO/3, PM3; common packages are MOPAC, AMPAC, HyperChem, and Gaussian. Semi-empirical methods use parameters obtained from experiment to assist in solving the Schrödinger equation. (2) ab initio methods. Ab initio methods use only a few physical constants in solving the Schrödinger equation (the speed of light, the masses of the electron and nuclei, Planck's constant), together with a series of mathematical approximations; different approximations lead to different methods. The classic one is the Hartree-Fock method, abbreviated HF. Ab initio methods give fairly accurate information over a very broad range of problems, at much greater computational cost than the methods above; that is, they are much more expensive.

1.4 Density functional methods are the third class of electronic structure methods, which have risen in recent years. They solve the Schrödinger equation via functionals (functions whose arguments are functions); because density functionals include electron correlation, their results are better than HF, and they are also fast.

1.5 Model chemistries. In Gaussian's view, a theoretical model must be applicable to systems of any kind and size, limited only by the computation. This involves two points: (1) a theoretical model should be uniquely defined for any given set of nuclei and electrons; that is, for solving the Schrödinger equation, the molecular structure itself provides sufficient information; (2) a theoretical model is unbiased, meaning it does not rely on any particular chemical structure or chemical process. Such a theory can be called a theoretical-model chemistry, or model chemistry for short (I am not confident of this translation; I had not heard the term in China).

1.6 Defining a model chemistry. Gaussian includes many model chemistries, e.g. (method keywords): HF, the Hartree-Fock self-consistent field model; B3LYP, Becke's three-parameter density functional model with the Lee-Yang-Parr functional; MP2, second-order Møller-Plesset perturbation theory; MP4, fourth-order Møller-Plesset perturbation theory; QCISD(T), quadratic CI. Details are in Chapter 6. Basis sets are the mathematical representation of the molecular orbitals; see Chapter 5. Open shell vs. closed shell refers to the electron spin state: for closed-shell species use the restricted methods, prefixing the method keyword with R; for open-shell species use the unrestricted methods, prefixing with U; e.g., open-shell HF is UHF. If no prefix is given, the program assumes closed shell. Open-shell treatment is generally needed for: (1) an odd number of electrons, e.g., radicals and some ions; (2) excited states; (3) systems with several unpaired electrons; (4) bond dissociation processes.

Combining models: high-accuracy work often combines several models, e.g., optimizing the structure with a moderate method and then computing the energy with a high-accuracy method.

Chapter 2 Single point energy calculations

2.1 A single point energy calculation computes the energy and properties of a molecule at a given, fixed geometry; since the geometry does not change, it is "one point", hence the name. Single point calculations are used to: obtain basic information about a molecule; check a structure before optimizing it; compute a higher-accuracy energy on a geometry optimized at a lower level; and handle cases where, given the available resources, only a single point is affordable. Single points can be run at any level of theory with any basis set; the examples in this chapter all use HF.

2.2 Setting up the calculation. The setup must give: the level of theory and the job type; the job title; and the molecular structure. Method setup: the route line, starting with #, specifies the theoretical method, the basis set, the kind of calculation, and so on. The default job type is a single point energy, keyword SP, which may be omitted. Keywords appearing here include: the theory, e.g., HF (the default, may be omitted) or B3PW91; the basis set, e.g., 6-31G or Lanl2DZ; the population analysis, e.g., Pop=Reg; and the SCF procedure, e.g., SCF=Tight. Pop=Reg prints only the five highest occupied and five lowest unoccupied orbitals, while Pop=Full prints all molecular orbitals. The SCF setting controls the convergence of the wavefunction and can normally be omitted; SCF=Tight requests stricter convergence than usual. The title is usually one line; if several, there must be no blank line among them. Molecular structure: first the charge and spin multiplicity. The charge is the net charge of the system, 0 if none; the spin multiplicity is 2S+1, where S is the total spin quantum number; in practice just add 1 to the number of unpaired electrons, and with no unpaired electrons the multiplicity is 1. Then comes the geometry, in Cartesian coordinates or as a Z-matrix. Multi-step jobs: Gaussian supports several calculation steps in a single input file.

2.3 Information in the output. Example 2.1, file e2_01: single point energy of formaldehyde. Standard orientation: find the line Standard Orientation in the output; the coordinates below it are the standard-orientation geometry of the input molecule. Energy: find

SCF Done: E(RHF)= -113.863697598 A.U. after 6 cycles

The number here is the energy, in hartree. In some higher-level calculations there is more than one energy value; for example, the next line

E2=-0.3029540001D+00 EUMP2=-0.11416665769315D+03

gives, after EUMP2, the MP2 energy. The MP4 energy output is more complex still. Molecular orbitals and orbital energies: for the orbitals printed according to the route settings, the listing includes the orbital symmetry and occupancy (O occupied, V virtual); the orbital eigenvalues, i.e., the orbital energies, in increasing order; and the contribution of each atomic orbital to each molecular orbital. Pay attention to the orbital coefficients: their relative magnitudes (ignoring sign) show how much each constituent atomic orbital contributes to the molecular orbital. To find the HOMO and LUMO, look at the boundary between occupied and unoccupied orbitals. Charge distribution: Gaussian's default population analysis is the Mulliken method; search for Total atomic charges to find the charge on every atom. Dipole and multipole moments: search for Dipole moment (Debye); the dipole information follows, with the quadrupole two lines further down; dipole moments are in debye. CPU time and the rest:

Job cpu time : 0 days 0 hours 0 minutes 9.1 seconds.

This is the computation time; note again that it is CPU time.

2.4 NMR calculations. Example 2.2, file e2_02: NMR of methane. NMR is another property available from a single point calculation; just add the NMR keyword to the route line, e.g.,

#T RHF/6-31G(d) NMR Test

In the output, look for:

GIAO Magnetic shielding tensor (ppm)
1 C Isotropic = 199.0522 Anisotropy = 0.0000

This is the methane result computed with the settings above, on a geometry optimized with the B3LYP density functional. NMR data are normally referenced to TMS; the same method applied to TMS (tetramethylsilane) gives

1 C Isotropic = 195.1196 Anisotropy = 17.5214

so the computed NMR shift of methane is -3.9 ppm, quite close to the experimental value of -7.0 ppm.

2.5 Exercises.
Exercise 2.1, file 2_01: single point energy of propane. Points: find the standard orientation, the single point energy, the direction and magnitude of the dipole, and the charge distribution.
Exercise 2.2, files 2_02a (RR), 2_02b (SS), 2_02c (RS): energies of 1,2-dichloro-1,2-difluoroethane. Points: compare the energies and dipole moments of the three stereoisomers.
Exercise 2.3, file 2_03: acetone vs. formaldehyde. Points: the effect of replacing a hydrogen atom with a methyl group; this illustrates that energy comparisons must be made between systems with the same types and numbers of atoms.
Exercise 2.4, file 2_04: molecular orbitals of ethylene and formaldehyde. Points: find the HOMO and LUMO levels and analyze their composition.
Exercise 2.5, files 2_05a, 2_05b, 2_05c: NMR comparison of an alkane, an alkene, and an alkyne.
Exercise 2.6, file 2_06: single point energy of C60. Points: analyze the highest occupied orbital of C60. Note that the route must include SCF=Tight, otherwise there are convergence problems.
Exercise 2.7, file 2_07: CPU resources vs. problem size. This exercise compares the effect of basis set size and SCF algorithm on CPU time and resources: conventional SCF (SCF=Conven) vs. direct SCF (the Gaussian default).

Basis functions   int file size (MB)   Conventional SCF CPU time   Direct SCF CPU time
23                2                    8.6                         12.8
42                4                    11.9                        19.8
61                16                   23.2                        38.8
80                42                   48.7                        72.1
99                92                   95.4                        122.5
118               174                  163.4                       186.8
137               290                  354.5                       268.0
156               437                  526.5                       375.0
175               620                  740.2                       488.0
194               832                  1028.4                      622.1

Clearly the number of basis functions strongly affects both resource use and CPU time: the more functions, the larger the resource use and the longer the CPU time. In theory CPU time scales as the fourth power of the number of functions, but in practice the exponent is lower; in this example it is roughly the 2.5th power. Generally direct SCF is more efficient than conventional SCF, which can be seen here once the number of functions is large.
Exercise 2.8, files 2_08a (O2), 2_08b (O3): SCF stability. This exercise uses the SCF method to analyze the stability of the molecular wavefunction. For an unknown system, an SCF stability check is mandatory: when the state is itself unstable, the SCF result, the wavefunction, and related information have no chemical meaning. Stability analysis looks for whether a state of lower energy than the current one exists. Keywords: Stable, which tests the stability while relaxing constraints on the wavefunction, e.g., allowing closed shell to become open shell; Stable=Opt, which, when an instability is found, optimizes the new state. The latter is generally not recommended, because the geometry it produces is too close to the original geometry. The example first computes closed-shell singlet O2, which obviously should not be stable. In the output we find:

The wavefunction has an RHF --> UHF instability.

This means a UHF state exists whose energy is lower than the current state. Possible interpretations: the lowest-energy state is a singlet but not closed shell; a lower-energy triplet exists; or the computed state is not a minimum, possibly a transition state. Recomputing as a triplet, again with a stability check, gives:

The wavefunction is stable under the perturbations considered.

Ozone is a singlet, but with an unusual electronic structure. RHF Stable=Opt finds an RHF --> UHF instability; running a stability check on the resulting UHF state, with UHF Stable=Opt, shows the system is still unstable:

The wavefunction has an internal instability.

Optimizing from there brings the system back to the RHF state. At this point the initial guess of the electronic state used before the SCF must be changed: Guess=Mix mixes the HOMO and LUMO in the initial guess, removing the spatial symmetry, and then running UHF Guess=Mix Stable shows that a stable wavefunction is obtained. Guess=Alter can also be used to pin down the electronic state; see the Gaussian User's Reference.

Chapter 3 Geometry optimization

The previous chapter discussed energies at fixed geometries; as we have seen, changes in molecular geometry have a large effect on the energy. The energy change arising from geometry is called the potential energy surface: a mathematical relation connecting geometry and energy. For a diatomic molecule, the energy varies with the distance between the two atoms, giving a potential energy curve; for larger systems the surface is multidimensional, its dimension determined by the molecule's degrees of freedom.

3.1 The potential energy surface. The surface contains several kinds of important points: the global maximum, local maxima, the global minimum, local minima, and saddle points. A maximum is the highest-energy point in a region: a geometric change in any direction lowers the energy; the largest of all local maxima is the global maximum. Minima are analogous, and the smallest of all minima is the point with the most stable geometry. A saddle point has a maximum along one direction and minima along the others; generally a saddle point represents a transition state connecting two minima.

Finding minima. Geometry optimization searches for a minimum, and that minimum is a stable geometry of the molecule. At all minima and saddle points the first derivative of the energy, the gradient, is zero; such points are called stationary points. Every successful optimization finds a stationary point, though not necessarily the expected one. The optimization starts from the initial structure, computes the energy and gradient, and then decides the direction and step size of the next move; the direction is always the one along which the energy decreases fastest. Most optimizations also estimate the second derivatives of the energy to update the force-constant matrix, which describes the curvature at the current point.

3.2 Convergence criteria. The optimization ends when the first derivatives are zero; in actual computation, the optimized structure is accepted once the changes are smaller than certain thresholds. Gaussian's defaults: the maximum force must be below 0.00045 and its RMS below 0.0003; the predicted displacement for the next step must be below 0.0018 and its RMS below 0.0012. All four conditions must hold simultaneously. For a very floppy system with a flat potential surface, the forces may be below threshold while the optimization still has a long way to go. For such systems there is an extra rule: when the forces are two orders of magnitude below threshold, the optimum is considered found even if the displacement tests still fail; this rule is intended for very large, very floppy systems.

3.3 Input for geometry optimization. The Opt keyword requests an optimization. Example 3.1, file e3_01: optimization of ethylene, with the route

#T RHF/6-31G(d) Opt Test

meaning optimize with the RHF method and the 6-31G(d) basis set.

3.4 Output. The optimization steps are enclosed between pairs of identical lines

GradGradGradGradGradGradGradGradGradGrad...........

These report the step number, the changes of the variables, the convergence tests, and so on. Note that the length unit there is the bohr. After each new geometry the single point energy is computed, and optimization continues from it until all four criteria are satisfied; the final geometry is taken as the optimized structure. Note that the energy of the final structure was obtained just before the last optimization step. Once the optimized structure is found, search for

-- Stationary point found.

The table below it lists the final optimized parameters and the molecular geometry, followed by the molecular properties requested in the route. Example 3.2, file e3_02: optimization of fluoroethylene.

3.5 Locating transition states. Gaussian uses the STQN method to find reaction transition states, keyword Opt=QST2. Example 3.3, file e3_03: transition state optimization for the reaction H3CO --> H2COH. Input format:

#T UHF/6-31G(d) Opt=QST2 Test

H3CO --> H2COH Reactants

0,2
structure for H3CO

0,2
structure for H2COH

Gaussian also provides the QST3 method, which optimizes using the reactant, the product, and a user-supplied guess for the transition state.

3.6 Difficult optimizations. Some systems are hard to optimize: the default method fails, usually because the computed force-constant matrix differs too much from the true one. When the default fails, try another approach; Gaussian offers many choices, detailed in the User's Reference. Some of them: Opt=ReadFC reads the initial force constants from the checkpoint file of a frequency job (often done at a lower level); this option requires a line %Chk=filename before the route to name the file. Opt=CalcFC computes the initial force constants with the same basis set as the optimization itself. Opt=CalcAll computes the force constants at every step; this is very expensive and used only in extreme cases. Sometimes an optimization merely needs more steps to succeed, which is arranged by setting MaxCycle. If the checkpoint file was saved during the optimization, Opt=Restart continues it. When an optimization is not converging, do not blindly increase the number of steps: examine the differences between steps, look for the reason convergence was not reached, and judge whether the system is converging. If the energy keeps trending downward, more steps may well give a result; if the energy changes without pattern, or moves further from the minimum, then change the optimization method. You can also start a new optimization from an intermediate structure in the output: the keyword Geom=(Check,Step=n) takes the geometry of optimization step n from the checkpoint file.

3.7 Exercises.
Exercise 3.1, files 3_01a (180), 3_01b (0): optimization of propene. Optimize from the two conformers of propene, one with a methyl hydrogen at a 180 degree dihedral to CCH, the other at 0. The results differ by 0.003 hartree, with the 0 degree form lower.
Exercise 3.2, files 3_02a (0), 3_02b (180), 3_02c (acetald.): optimization of vinyl alcohol. The dihedral between the hydroxyl hydrogen and the OCC plane can be 0 or 180 degrees; optimization puts the 0 degree form 0.003 hartree below the 180 degree one, but an optimization of acetaldehyde done alongside shows acetaldehyde lower still, by 0.027 hartree relative to the 0 degree isomer.
Exercise 3.3, file 3_03: optimization of vinylamine with all atoms in one plane. Comparing this chapter's examples and exercises shows the influence of different substituents on the ethylene C=C double bond.
Exercise 3.4, file 3_04: optimization of chromium hexacarbonyl, using the STO-3G and 3-21G basis sets. Adding SCF=NoVarAcc to the route helps convergence. The 3-21G results are better than STO-3G.
Exercise 3.5, files 3_05a (C6H6), 3_05b (TMS): NMR of benzene. Optimize the geometry with B3LYP and the 6-31G(d) basis set, then compute the carbon chemical shift with HF and the 6-311+G(2d,p) basis set on the optimized geometry. Note: reliable NMR depends on an accurate geometry and a large basis set. The input file:

%Chk=NMR
#T B3LYP/6-31G(d) Opt Test

Opt

molecule specification
--Link1--
%Chk=NMR
%NoSave
#T RHF/6-311+G(2d,p) NMR Geom=Check Guess=Read Test

NMR

charge & spin

TMS must be computed the same way. Results:

           Absolute shielding   Relative shift   Experiment
TMS        188.7879
Benzene    57.6198              131.2            130.9

Exercise 3.6, files 3_06a (PM3), 3_06b (STO-3G): optimization of C60 oxide. C60 has two kinds of carbon-carbon bonds: the 6-6 bonds joining two six-membered rings, and the 5-6 bonds joining a six-membered and a five-membered ring. C60 oxide therefore has two isomers. This exercise uses PM3 and HF/STO-3G to decide which isomer is stable and how the C-C bond changes upon oxidation. The keyword Opt=AddRedundant prints the requested bond lengths and angles in the output; it requires, after the molecular structure input, extra lines identifying them: a bond length by two atom numbers, an angle by three. The results show that on 6-6 oxidation the carbon-carbon bond survives, close to an epoxide, whereas the 5-6 bond has opened. Different methods give similar geometries but very different energetics: with MNDO, PM3, and HF/3-21G the 5-6 oxidized isomer is lower in energy, but with HF/STO-3G the 6-6 oxidized one is lower. In Raghavachari's study of this system it is argued that kinetic factors are equally important; experiment has not yet established which isomer is lowest in energy; more accurate calculations should be done.
Exercise 3.7, file 3_07: transition state optimization of a 1,1-elimination. For the reaction SiH4 --> SiH2 + H2, use the keyword Opt=(QST2,AddRedundant) to optimize the transition state while paying particular attention to one bond length in the transition structure.
Exercise 3.8, file 3_08: comparison of optimization procedures. Optimize the bicyclo[2.2.2] system in three ways: the default redundant internal coordinates, Opt; Cartesian coordinates, Opt=Cartesian; internal coordinates, Opt=Z-Matrix. The results show that redundant internal coordinates need the fewest optimization steps and internal coordinates the most.

Chapter 4 Frequency analysis

Frequency analysis serves several purposes: predicting the IR and Raman spectra of molecules (frequencies and intensities); providing force constants for geometry optimizations; characterizing a molecule's position on the potential energy surface; and computing zero-point energies and thermodynamic data such as the entropy and enthalpy of the system.

4.1 IR and Raman spectra. Geometry optimizations and single point calculations idealize the atoms as fixed, but in reality atoms are always vibrating; at equilibrium these vibrations are regular and predictable. Frequency analysis requires the second derivatives of the energy with respect to the atomic positions. The HF method, density functional methods (such as B3LYP), second-order Møller-Plesset (MP2), and CASSCF all provide analytic second derivatives; for other methods, numerical second derivatives are available.

4.2 Input for frequency analysis. The Freq keyword requests a frequency analysis. Frequency analysis is valid only at stationary points of the potential energy surface, so it must be performed on an already optimized structure; the most direct way is to request the optimization and the frequency analysis together in the route. Take special note: the frequency calculation must use exactly the same basis set and theoretical method that produced the geometry!

Example 4.1, file e4_01: frequency analysis of formaldehyde, using an already optimized geometry, with the route

# RHF/6-31G(d) Freq Test

4.3 Frequencies and intensities. The frequency job first computes the energy of the input structure and then the frequencies. Gaussian reports the frequency, intensity, and Raman polarizability of each vibrational mode. The first four modes in the output of example 4.1 are (mode number and symmetry): 1 B1, 2 B2, 3 A1, 4 A1.
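The hartree conversion factors quoted in section 2.3 (1 hartree = 4.3597482E-18 J = 2625.500 kJ/mol = 27.2116 eV) come up constantly when comparing SCF energies, e.g., the conformer gaps in exercises 3.1 and 3.2. A small helper like the following sketch (my own, not part of the book) keeps the arithmetic honest:

HARTREE_TO_J = 4.3597482e-18
HARTREE_TO_KJ_PER_MOL = 2625.500
HARTREE_TO_EV = 27.2116

def from_hartree(e_au):
    # Return the same energy in the units quoted in section 2.3 of the tutorial.
    return {"hartree": e_au,
            "J": e_au * HARTREE_TO_J,
            "kJ/mol": e_au * HARTREE_TO_KJ_PER_MOL,
            "eV": e_au * HARTREE_TO_EV}

# e.g. the 0.003 hartree conformer gap from exercise 3.1:
print(from_hartree(0.003)["kJ/mol"])   # about 7.9 kJ/mol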
(13) Factor Analysis

Factor Analysis
JerryLead, csxulijie@
May 11, 2011

1 The problem
In the training data we considered before, the number of examples m was much larger than the number of features n, so regression, clustering, and the like posed no great difficulty. But when the number of training examples m is too small, even m << n, problems appear. Using gradient descent for regression, different initial values give parameter results that differ widely (because the number of equations is smaller than the number of parameters). And fitting the data with a multivariate Gaussian distribution also runs into trouble. Let us work it out and see what goes wrong. The parameter estimates for a multivariate Gaussian are

μ = (1/m) Σ_{i=1}^m x^(i)
Σ = (1/m) Σ_{i=1}^m (x^(i) − μ)(x^(i) − μ)^T

These are the formulas for the mean and the covariance; x^(i) denotes an example, there are m of them, each with n features, so μ is an n-dimensional vector and Σ an n×n covariance matrix. When m << n we find that Σ is singular (|Σ| = 0), i.e., Σ^(-1) does not exist, and the multivariate Gaussian cannot be fitted; more precisely, we cannot estimate Σ. If we still want to model the samples with a multivariate Gaussian, what can be done?

2 Restricting the covariance matrix
When there is not enough data to estimate Σ, we have to make some assumptions on the model parameters. Before, we wanted to estimate the full Σ (every entry of the matrix); now assume Σ is diagonal (the features mutually independent). Then we only need to compute the variance of each feature, and the final Σ is nonzero only on the diagonal:

Σ_jj = (1/m) Σ_{i=1}^m (x_j^(i) − μ_j)²

Recall the geometric character of the two-dimensional multivariate Gaussian we discussed before: its projection onto the plane is an ellipse, with center determined by μ and shape determined by Σ. If Σ becomes diagonal, the two axes of the ellipse become parallel to the coordinate axes. To restrict Σ further, we can assume the diagonal entries are all equal:

Σ = σ² I, where σ² = (1/(mn)) Σ_{j=1}^n Σ_{i=1}^m (x_j^(i) − μ_j)²

i.e., the average of the diagonal entries of the previous step; in the two-dimensional Gaussian picture the ellipse becomes a circle. To estimate the complete Σ, we need m ≥ n + 1 to guarantee that the maximum-likelihood Σ is nonsingular.
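A few lines of NumPy make the singularity concrete; this is my own demonstration (the sizes and names are arbitrary), not part of the original notes:

import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 50                            # far fewer examples than features
X = rng.normal(size=(m, n))

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / m        # full ML covariance estimate
print(np.linalg.matrix_rank(Sigma))      # at most m-1 = 9 < n: Sigma is singular

# The restricted estimates described above stay invertible:
Sigma_diag = np.diag(((X - mu) ** 2).mean(axis=0))   # diagonal Sigma
sigma2 = ((X - mu) ** 2).mean()                      # single shared variance
Sigma_iso = sigma2 * np.eye(n)                       # Sigma = sigma^2 I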
Stanford Machine Learning Course: Personal Notes, Complete Edition

CS 229 Machine Learning (Personal Notes)

Contents: (1) Linear regression, logistic regression, and general regression; (2) Discriminative models, generative models, and naive Bayes; (3) Support vector machines, SVM (part 1); (4) Support vector machines, SVM (part 2); (5) Regularization and model selection; (6) The K-means clustering algorithm; (7) Mixtures of Gaussians and the EM algorithm; (8) The EM algorithm; (9) Online learning; (10) Principal component analysis; (11) Independent component analysis; (12) Linear discriminant analysis; (13) Factor analysis; (14) Reinforcement learning; (15) Canonical correlation analysis; (16) Partial least squares regression.

These are my personal study notes from the first half of 2011 on the Stanford Machine Learning course; the material comes mainly from Professor Andrew Ng's handouts and lecture videos. It also includes some content from other papers and other universities' lecture notes. Each chapter is organized along my own line of thought while studying. Since these are personal notes, errors of wording, errors in formulas, misunderstandings, and plain typos will all be present. More importantly, I am a beginner: by no means assume that the reasoning here is all correct. When something is doubtful, consult Professor Ng's original handouts and videos first, and then ask the experts. Many questions readers raise on the blog I find hard to answer, because my own level is genuinely limited; for the deeper material it is best to consult the experts and study the relevant papers. If any reader wants to add their own notes on top of this version, send me an email and I will provide the original Word docx.

Also, I am currently a graduate student at the Institute of Software in the Chinese Academy of Sciences (科苑软件所), almost three years in; my direction is distributed computing, mainly large-scale distributed data processing. Day to day I mostly work with Hadoop, Pig, Hive, Mahout, NoSQL and the like, and I follow the systems and database conferences. I hope we can exchange ideas often; in the future the blog will carry more of that material and less machine learning. Anyway, I wish everyone progress in study and success in work!

Notes on Regression Methods
JerryLead
February 27, 2011

1 Abstract
This report is a summary of my understanding after studying the first four lectures of the Stanford machine learning course along with the accompanying handouts.
(3) Support Vector Machines, SVM (Part 1)

Support Vector Machines (Part 1)
JerryLead, csxulijie@
Saturday, March 12, 2011

1 Introduction
The support vector machine is just about the best off-the-shelf supervised learning algorithm. I first came into contact with SVMs last summer, when our teacher asked us to hand in a report on Statistical Learning Theory; at the time I downloaded an introductory tutorial from the web which explained things very informally, and I only got a rough grasp of the related concepts. The materials Stanford provides this time let me relearn some SVM material. Most of the orthodox treatments I have seen start from VC-dimension theory and the principle of structural risk minimization and then bring in the SVM, and some texts open directly with separating hyperplanes. This handout instead starts from the logistic regression covered in the earlier lectures and derives the SVM from it, which both reveals the connection between the models and makes the transition feel more natural.

2 Revisiting logistic regression
The aim of logistic regression is to learn a 0/1 classification model from the features, and this model takes a linear combination of the features as its argument; since the argument ranges from negative to positive infinity, the logistic function (also called the sigmoid function) maps it into (0, 1), and the mapped value is interpreted as the probability of belonging to class y = 1. Formally, the hypothesis is

h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x)),

where x is an n-dimensional feature vector and g is the logistic function g(z) = 1/(1 + e^(−z)). (figure: the graph of g(z); as can be seen, it maps (−∞, +∞) into (0, 1).) The hypothesis h_θ(x) is then just the probability that the features belong to class y = 1. When a new feature vector is to be classified, we only need to evaluate h_θ(x): if it exceeds 0.5 the class is y = 1, otherwise y = 0.

Look at h_θ(x) again: it depends only on θ^T x. If θ^T x > 0, then h_θ(x) > 0.5; g(z) is merely doing the mapping, and the real decision power rests with θ^T x. Moreover, when θ^T x >> 0 we have h_θ(x) = 1, and conversely h_θ(x) = 0. If we start from θ^T x alone, the goal we want the model to achieve is nothing more than that the training data with y = 1 have θ^T x >> 0 and those with y = 0 have θ^T x << 0. Logistic regression is about learning θ so that the positive examples' feature combinations are far greater than 0 and the negative examples' far less than 0, and it emphasizes achieving this on all training instances. Graphically: (figure) the middle line is θ^T x = 0, and logistic regression stresses that all points should lie as far as possible from that middle line.
Gaussian03 Notes

1. Running Gaussian03
1.1 Scratch files
A Gaussian run uses several scratch files, including: the checkpoint file, name.chk; the read-write file, name.rwf; the two-electron integral file, name.int; and the two-electron integral derivative file, name.d2e. By default these files are named after the process ID of the Gaussian run, stored in the scratch directory, and deleted when the calculation finishes. Usually we want to keep the checkpoint file; this only requires naming the checkpoint file, or giving it a path, in the input file. Common forms: %Chk=name; %Chk=chem/scratch/name. Likewise:

%RWF=path   (read-write file)
%Int=path   (integral file)
%D2E=path   (integral derivative file)

With these three commands the three files can be stored on different disks.

The %NoSave directive: files declared before %NoSave are deleted at the end of the job, while files declared after it are kept. For example:

%RWF=/chem/scratch2/water     (files up to this point are deleted)
%NoSave
%Chk=water                    (files from this point on are kept)

1.2 Controlling memory use: %Mem=***KB/MB/GB/GW.

2. Gaussian03 input
2.1 Overview of Gaussian03 input files
The basic structure of a Gaussian03 input file consists of the job type, the computational method, and the basis set, plus the title and molecule specification. Input syntax rules: input is free-format and case-insensitive; spaces, tabs, forward slashes (/), and commas can all serve as separators between items on a line. Keyword formats:

keyword=option
keyword(option)
keyword=(option1, option2, ...)
keyword(option1, option2, ...)

The options may carry numeric values. In Gaussian, every keyword and option must be given as a recognizable abbreviation. An external file is referenced with @filename. Comment lines begin with !.

A Gaussian03 route of the form method2/basis2 // method1/basis1 means: perform the geometry optimization at method1/basis1, then compute a single point energy at method2/basis2 on the optimized structure (for example, MP2/6-31G(d) // HF/3-21G optimizes at HF/3-21G and follows with an MP2/6-31G(d) single point).

HCC=121.5
第1行:计算执行路径 以“#”开头,“T”表示打印重要的输出部分,“RHF”表示采用限制性的Hartree-Fock方法,“6-31G(d)” 是采用的基组,“Opt”表示进行几何优化,“Test”表示不计入高斯工作档案。
第2行:空行 这是高斯输入的格式要求
The following legend is applicable only to US Government
contracts under FAR:
RESTRICTED RIGHTS LEGEND
Use, reproduction and disclosure by the US Government is subject
(二)输出文件分析
以下是版权和引用说明部分。
Entering Link 1 = d:\G03W\l1.exe PID= 1016.
Copyright (c) 1003, Gaussian, Inc.
All Rights Reserved.
Gaussian 03: x86-Win32-G03RevB.01 3-Mar-2003
31-Jan-2007
*********************************************
以下是最大硬盘使用量、计算执行路径、标题和分子说明部分。
computational chemistry and represents and warrants to the
licensee that it is not a competitor of Gaussian, Inc. and that
it will not use this program in any manner prohibited above.
Machine Learning / Deep Learning Notes (9)

... minima or global minima of the cost function. Also note that the underlying parameterization for h_θ(x) is different from the case of linear regression, even though the form of the cost function is the same mean-squared loss.

θ := θ − α ∇_θ J^(j)(θ)    (1.4)

Oftentimes computing the gradient of B examples simultaneously for the parameter θ can be faster than computing B gradients separately due to hardware parallelization. Therefore, a mini-batch version of SGD is most commonly used in deep learning, as shown in Algorithm 2; a runnable sketch follows the algorithm. There are also other variants of SGD or mini-batch SGD with slightly different sampling schemes.

Algorithm 2: Mini-batch Stochastic Gradient Descent
1: Hyperparameters: learning rate α, batch size B, # iterations n_iter.
2: Initialize θ randomly.
3: for i = 1 to n_iter do
4:    Sample B examples j_1, ..., j_B (without replacement) uniformly from the training set, and update θ by
         θ := θ − (α/B) Σ_{k=1}^B ∇_θ J^(j_k)(θ)
5: end for
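A minimal NumPy sketch of Algorithm 2, on a toy least-squares problem of my own (not from the notes): with J^(j)(θ) = ½(θ^T x^(j) − y^(j))², the per-example gradient is (θ^T x^(j) − y^(j)) x^(j), and the batch gradient is the average over the sampled examples.

import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=n)
y = X @ theta_true + 0.01 * rng.normal(size=m)   # nearly linear data

alpha, B, n_iter = 0.1, 32, 2000
theta = rng.normal(size=n)                       # step 2: random init
for _ in range(n_iter):                          # step 3
    j = rng.choice(m, size=B, replace=False)     # step 4: sample a batch
    grad = X[j].T @ (X[j] @ theta - y[j]) / B    # averaged mini-batch gradient
    theta -= alpha * grad                        # the batched update (1.4)
print(np.abs(theta - theta_true).max())          # small: near the minimizer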
Gaussian: Background Basics

1. "Zero-point energy" refers to the vibrational energy that a quantum system retains even at absolute zero; the amplitude of this vibration grows as the temperature increases. "Zero-point energy" is the rotational-inertia energy of the atomic nucleus. Apart from the repulsive energy between nucleons, all the energy we use in daily life and production is this nuclear rotational energy, including the energy our bodies need to generate heat, and likewise combustion, light emission, heating, and the "magnetic force" produced by "magnetic" coils. The zero-point correction is a correction to the electronic energy of the molecule, expressing the influence of the molecule's vibrational energy at 0 K on the electronic energy. When comparing thermodynamic properties at 0 K, the zero-point energy must be added to the total energy.

2. The basic principle of density functional theory is: the ground-state energy of a system is uniquely determined by its electron density. Its basic equation is the Kohn-Sham equation, which is formally identical to the HF equation, except that an exchange-correlation functional replaces the exchange part of HF. In principle the calculation can be exact, if the exact functional can be determined. However, because there is no systematic way to approach the exact functional, the functional must be determined empirically; this is the root of the DFT approximation.

3. When the system becomes loose (floppy), ab initio and DFT methods mostly fail to find the most stable structure; you can first optimize with molecular mechanics. Or, if the part you need to fix is too large, and it can be treated as environment, consider QM/MM. In addition, do not use too large a basis set for the initial optimization; improve it step by step.

4. If your frequency calculation has a small imaginary frequency, you can remove it by changing the integration grid. The default grid has 75,302 points; adding this keyword specifies the 99,590-point grid.

5. An L9999 error means the requested work did not finish within the default number of cycles, so the output file could not be written.

6. In general, the nature of an optimized stationary point (minimum or transition state) must be established by a frequency calculation; and for a transition state, determining the reaction path (i.e., which reaction it is the transition state of) requires an IRC calculation, otherwise the result is not trustworthy (a transition state found with QST is often not the one that actually connects the input reactant and product).

7. When we optimize a transition state with QST2 or QST3, the reactant and the product must be input; in fact, the order in which they are entered does not matter. That is, entering reactant then product, or product then reactant, yields the same transition state. This is easy to understand: in QST2 the initial guess for the transition state is generated by the program automatically averaging the variables of the input reactant and product, so the input order is irrelevant.
Stanford University CS229 Machine Learning Notes 12

CS229 Lecture notes
Andrew Ng

Part XIII: Reinforcement Learning and Control

We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make their outputs mimic the labels y given in the training set. In that setting, the labels gave an unambiguous "right answer" for each of the inputs x. In contrast, for many sequential decision making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the "correct" actions to take are to make it walk, and so do not know how to provide explicit supervision for a learning algorithm to try to mimic.

In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm's job to figure out how to choose actions over time so as to obtain large rewards.

Reinforcement learning has been successful in applications as diverse as autonomous helicopter flight, robot legged locomotion, cell-phone network routing, marketing strategy selection, factory control, and efficient web-page indexing. Our study of reinforcement learning will begin with a definition of the Markov decision processes (MDP), which provides the formalism in which RL problems are usually posed.

1 Markov decision processes

A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
- S is a set of states. (For example, in autonomous helicopter flight, S might be the set of all possible positions and orientations of the helicopter.)
- A is a set of actions. (For example, the set of all possible directions in which you can push the helicopter's control sticks.)
- P_sa are the state transition probabilities. For each state s ∈ S and action a ∈ A, P_sa is a distribution over the state space. We'll say more about this later, but briefly, P_sa gives the distribution over what states we will transition to if we take action a in state s.
- γ ∈ [0, 1) is called the discount factor.
- R : S × A → R is the reward function. (Rewards are sometimes also written as a function of a state S only, in which case we would have R : S → R.)

The dynamics of an MDP proceeds as follows: We start in some state s_0, and get to choose some action a_0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s_1, drawn according to s_1 ∼ P_{s_0 a_0}. Then, we get to pick another action a_1. As a result of this action, the state transitions again, now to some s_2 ∼ P_{s_1 a_1}. We then pick a_2, and so on. Pictorially, we can represent this process as follows:

s_0 --a_0--> s_1 --a_1--> s_2 --a_2--> s_3 --a_3--> ...

Upon visiting the sequence of states s_0, s_1, ... with actions a_0, a_1, ..., our total payoff is given by

R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + ···.

Or, when we are writing rewards as a function of the states only, this becomes

R(s_0) + γ R(s_1) + γ² R(s_2) + ···.

For most of our development, we will use the simpler state-rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.

Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:

E[ R(s_0) + γ R(s_1) + γ² R(s_2) + ··· ].

Note that the reward at timestep t is discounted by a factor of γ^t. Thus, to make this expectation large, we would like to accrue positive rewards as soon as possible (and postpone negative rewards as long as possible). In economic applications where R(·) is the amount of money made, γ also has a natural interpretation in terms of the interest rate (where a dollar today is worth more than a dollar tomorrow).

A policy is any function π : S → A mapping from the states to the actions. We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s). We also define the value function for a policy π according to

V^π(s) = E[ R(s_0) + γ R(s_1) + γ² R(s_2) + ··· | s_0 = s, π ].

V^π(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π. (This notation in which we condition on π isn't technically correct because π isn't a random variable, but this is quite standard in the literature.)

Given a fixed policy π, its value function V^π satisfies the Bellman equations:

V^π(s) = R(s) + γ Σ_{s' ∈ S} P_{sπ(s)}(s') V^π(s').

This says that the expected sum of discounted rewards V^π(s) for starting in s consists of two terms: First, the immediate reward R(s) that we get right away simply for starting in state s, and second, the expected sum of future discounted rewards. Examining the second term in more detail, we see that the summation term above can be rewritten E_{s' ∼ P_{sπ(s)}}[ V^π(s') ]. This is the expected sum of discounted rewards for starting in state s', where s' is distributed according to P_{sπ(s)}, which is the distribution over where we will end up after taking the first action π(s) in the MDP from state s. Thus, the second term above gives the expected sum of discounted rewards obtained after the first step in the MDP.

Bellman's equations can be used to efficiently solve for V^π. Specifically, in a finite-state MDP (|S| < ∞), we can write down one such equation for V^π(s) for every state s. This gives us a set of |S| linear equations in |S| variables (the unknown V^π(s)'s, one for each state), which can be efficiently solved for the V^π(s)'s.

We also define the optimal value function according to

V*(s) = max_π V^π(s).    (1)

In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy. There is also a version of Bellman's equations for the optimal value function:

V*(s) = R(s) + max_{a ∈ A} γ Σ_{s' ∈ S} P_sa(s') V*(s').    (2)

The first term above is the immediate reward as before. The second term is the maximum over all actions a of the expected future sum of discounted rewards we'll get upon after action a. You should make sure you understand this equation and see why it makes sense.

We also define a policy π* : S → A as follows:

π*(s) = arg max_{a ∈ A} Σ_{s' ∈ S} P_sa(s') V*(s').    (3)

Note that π*(s) gives the action a that attains the maximum in the "max" in Equation (2). It is a fact that for every state s and every policy π, we have

V*(s) = V^{π*}(s) ≥ V^π(s).

The first equality says that V^{π*}, the value function for π*, is equal to the optimal value function V* for every state s. Further, the inequality above says that π*'s value is at least as large as the value of any other policy. In other words, π* as defined in Equation (3) is the optimal policy.

Note that π* has the interesting property that it is the optimal policy for all states s. Specifically, it is not the case that if we were starting in some state s then there'd be some optimal policy for that state, and if we were starting in some other state s' then there'd be some other policy that's optimal for s'. The same policy π* attains the maximum in Equation (1) for all states s. This means that we can use the same policy π* no matter what the initial state of our MDP is.

2 Value iteration and policy iteration

We now describe two efficient algorithms for solving finite-state MDPs. For now, we will consider only MDPs with finite state and action spaces (|S| < ∞, |A| < ∞). The first algorithm, value iteration, is as follows:

1. For each state s, initialize V(s) := 0.
2. Repeat until convergence {
      For every state, update V(s) := R(s) + max_{a ∈ A} γ Σ_{s'} P_sa(s') V(s').
   }

This algorithm can be thought of as repeatedly trying to update the estimated value function using Bellman Equations (2). There are two possible ways of performing the updates in the inner loop of the algorithm. In the first, we can first compute the new values for V(s) for every state s, and then overwrite all the old values with the new values. This is called a synchronous update. In this case, the algorithm can be viewed as implementing a "Bellman backup operator" that takes a current estimate of the value function, and maps it to a new estimate. (See homework problem for details.) Alternatively, we can also perform asynchronous updates. Here, we would loop over the states (in some order), updating the values one at a time. Under either synchronous or asynchronous updates, it can be shown that value iteration will cause V to converge to V*. Having found V*, we can then use Equation (3) to find the optimal policy.

Apart from value iteration, there is a second standard algorithm for finding an optimal policy for an MDP. The policy iteration algorithm proceeds as follows:

1. Initialize π randomly.
2. Repeat until convergence {
      (a) Let V := V^π.
      (b) For each state s, let π(s) := arg max_{a ∈ A} Σ_{s'} P_sa(s') V(s').
   }

Thus, the inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function. (The policy π found in step (b) is also called the policy that is greedy with respect to V.) Note that step (a) can be done via solving Bellman's equations as described earlier, which in the case of a fixed policy is just a set of |S| linear equations in |S| variables. After at most a finite number of iterations of this algorithm, V will converge to V*, and π will converge to π*.

Both value iteration and policy iteration are standard algorithms for solving MDPs, and there isn't currently universal agreement over which algorithm is better. For small MDPs, policy iteration is often very fast and converges with very few iterations. However, for MDPs with large state spaces, solving for V^π explicitly would involve solving a large system of linear equations, and could be difficult. In these problems, value iteration may be preferred. For this reason, in practice value iteration seems to be used more often than policy iteration.

3 Learning a model for an MDP

So far, we have discussed MDPs and algorithms for MDPs assuming that the state transition probabilities and rewards are known. In many realistic problems, we are not given state transition probabilities and rewards explicitly, but must instead estimate them from data. (Usually, S, A and γ are known.)

For example, suppose that, for the inverted pendulum problem (see problem set 4), we had a number of trials in the MDP, that proceeded as follows:

s_0^(1) --a_0^(1)--> s_1^(1) --a_1^(1)--> s_2^(1) --a_2^(1)--> s_3^(1) --a_3^(1)--> ...
s_0^(2) --a_0^(2)--> s_1^(2) --a_1^(2)--> s_2^(2) --a_2^(2)--> s_3^(2) --a_3^(2)--> ...
...

Here, s_i^(j) is the state we were at at time i of trial j, and a_i^(j) is the corresponding action that was taken from that state. In practice, each of the trials above might be run until the MDP terminates (such as if the pole falls over in the inverted pendulum problem), or it might be run for some large but finite number of timesteps.

Given this "experience" in the MDP consisting of a number of trials, we can then easily derive the maximum likelihood estimates for the state transition probabilities:

P_sa(s') = (# times we took action a in state s and got to s') / (# times we took action a in state s).    (4)

Or, if the ratio above is "0/0" (corresponding to the case of never having taken action a in state s before) then we might simply estimate P_sa(s') to be 1/|S|. (I.e., estimate P_sa to be the uniform distribution over all states.)

Note that, if we gain more experience (observe more trials) in the MDP, there is an efficient way to update our estimated state transition probabilities using the new experience. Specifically, if we keep around the counts for both the numerator and denominator terms of (4), then as we observe more trials, we can simply keep accumulating those counts. Computing the ratio of these counts then gives our estimate of P_sa.

Using a similar procedure, if R is unknown, we can also pick our estimate of the expected immediate reward R(s) in state s to be the average reward observed in state s.

Having learned a model for the MDP, we can then use either value iteration or policy iteration to solve the MDP using the estimated transition probabilities and rewards. For example, putting together model learning and value iteration, here is one possible algorithm for learning in an MDP with unknown state transition probabilities:

1. Initialize π randomly.
2. Repeat {
      (a) Execute π in the MDP for some number of trials.
      (b) Using the accumulated experience in the MDP, update our estimates for P_sa (and R, if applicable).
      (c) Apply value iteration with the estimated state transition probabilities and rewards to get a new estimated value function V.
      (d) Update π to be the greedy policy with respect to V.
   }

We note that, for this particular algorithm, there is one simple optimization that can make it run much more quickly. Specifically, in the inner loop of the algorithm where we apply value iteration, if instead of initializing value iteration with V = 0 we initialize it with the solution found during the previous iteration of our algorithm, then that will provide value iteration with a much better initial starting point and make it converge more quickly.
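The value iteration loop above is only a few lines of NumPy. The sketch below runs it on a made-up 3-state, 2-action MDP; the transition matrix, rewards, and γ are all invented for illustration:

import numpy as np

S, A, gamma = 3, 2, 0.9
# P[s, a, s'] = P_sa(s'); each row P[s, a, :] sums to 1
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([0.0, 0.0, 1.0])            # state-only rewards R(s)

V = np.zeros(S)                          # initialize V(s) := 0
while True:
    Q = P @ V                            # Q[s, a] = sum_s' P_sa(s') V(s')
    V_new = R + gamma * Q.max(axis=1)    # synchronous Bellman backup
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

pi = (P @ V).argmax(axis=1)              # greedy policy, equation (3)
print(V, pi)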
Gaussian User Guide

Contents (excerpt):

Chapter 1: Introduction to capabilities and computational principles
  2.5 Initial setup of the working environment
Chapter 3: Creating Gaussian input files and running the program
  1. The purpose of creating input files
  2. Common ways to create Gaussian input files
  2.1 Generating an input file from a crystal-structure file
Energy calculations
  1. Brief introduction
  2. The format of energy calculations and interpreting their output
  2.1 The format of an energy calculation
  2.2 Notes on the output
  3. Gaussian's built-in exercises
  2. Gaussian's built-in exercises
Chapter 12: Calculations in solution
  1. Purpose
  2. Gaussian's built-in exercises and an introduction to the theoretical models

On molecular dynamics capabilities: Born-Oppenheimer molecular dynamics (BOMD) computes classical trajectories on a local quadratic approximation to the potential energy surface. The steps are predicted and corrected with a Hessian-based algorithm, improving the usable step size by a factor of more than 10 over earlier calculations. Analytic second derivatives can also be used, and BOMD is available for all theoretical methods that have analytic gradients. The atom-centered density matrix propagation (ADMP) molecular dynamics method is also provided, for Hartree-Fock and DFT; it draws on ideas from the Car-Parrinello approach.
Classic Gaussian Summary

In general, the nature of a stationary point found by optimization (a minimum or a transition state) must be determined from the frequencies. For a transition state, confirming the reaction path (that is, which reaction it is actually the transition state of) requires an IRC calculation; without one, the result is not reliable (a transition state found with QST is often not the one connecting the reactant and product you supplied).
When optimizing a transition state with QST2 or QST3, the reactant and product must both be supplied, but in fact their input order does not matter.
That is, entering the reactant first and the product second yields the same transition state as the reverse order.
This is easy to understand: in QST2 the initial guess for the transition state is generated by the program averaging the variables of the input reactant and product, so the input order is irrelevant.
For QST3 and TS, the initial guess for the transition state must be supplied by the user.
The statement that the input order of reactant and product does not matter has one precondition: the reactant and product must have the same spin multiplicity.
For QST3 there is no issue, since the transition-state guess is specified by the user; for QST2, the spin multiplicity of the transition state defaults to that of the structure entered second.
If the reactant and product have the same spin multiplicity, either order is fine. But what if they differ? If the reactant ground state has multiplicity 1 and the product ground state has multiplicity 3, should the transition state be assigned multiplicity 1 or 3? Or should one try both and take whichever gives the lower energy? Suppose we take 1: in the subsequent IRC verification, the reverse direction is fine, but the forward direction is run at multiplicity 1, so how can it be connected to a product of multiplicity 3? To link them, one of the structures would have to be an excited state, so that the spin multiplicity stays consistent.
But is such a treatment correct? Is there a better way to think about it? I also find IRC very hard to use and hard to control, and I do not know which dead end I have wandered into. I have long been confused by this and would appreciate help. My feeling is that this may require CASSCF; I have seen papers in which CASSCF was used to obtain a reaction path leading from the reactant ground state through an excited state and back down to the ground-state product.
I have not tried it myself; CASSCF gives me a headache, and I know nothing about it in either theory or practice.
What can an IRC calculation tell us? When the IRC finishes, the reaction-path calculation is complete. Under SUMMARY OF REACTION PATH FOLLOWING you can see how the bond lengths, bond angles, and energy change along the reaction coordinate; plotting energy against the reaction coordinate gives the reaction-path curve. Because an IRC calculation optimizes the structure at each point along the reaction path, if the transition state you found is correct and the optimizations succeed, the TS does indeed connect the two minima.
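For readers who have never set one up, a skeleton QST2 input might look like the sketch below; the level of theory and the rough HCN/HNC geometries are placeholders of mine, and a real job should use properly pre-optimized reactant and product structures with identical atom ordering.

%chk=hcn_qst2.chk
# opt=qst2 hf/3-21g freq

HCN --> HNC reactant (rough placeholder geometry)

0 1
C   0.000000   0.000000   0.000000
N   0.000000   0.000000   1.156000
H   0.000000   0.000000  -1.064000

HCN --> HNC product (rough placeholder geometry)

0 1
C   0.000000   0.000000   0.000000
N   0.000000   0.000000   1.169000
H   0.000000   0.000000   2.164000

Once the frequency calculation confirms a single imaginary frequency, the IRC verification discussed above can be run from the same checkpoint with a route along the lines of "# irc=calcfc hf/3-21g geom=check guess=read" (again a sketch; see the IRC keyword documentation for direction and step-count options).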
Gaussian Notes

Exploring Chemistry with Electronic Structure Methods
To do a good job, one must first sharpen one's tools. (工欲善其事, 必先利其器)
Inspect the normal modes to determine which two minima the saddle point connects (in particular, check whether the normal mode of the imaginary frequency leads to deformation or collapse of the transition-state structure), or, more rigorously, perform an IRC calculation.
Part 1: Essential Concepts & Techniques
By X. Y. Wang, ITCC, Nanjing University
The optimization loop: coordinates --> compute the energy --> test for convergence --> ... --> repeat until converged --> output the optimized coordinates and compute the desired properties.
Locating Transition Structures: the STQN method requires the structures of the reactant and product (or, in addition, a guessed transition-state structure, i.e., QST3), with consistent atom numbering between them. This method can generally handle only simple systems.
For harder cases, one needs the second derivatives of the energy with respect to the coordinates, i.e., the force constants. At the start of a structure optimization, an initial guess is made for the force-constant matrix, and this approximate matrix is then refined at every step from the computed forces (first derivatives). The initial guess is very important for transition-state optimizations.
When computing NMR shifts, first carry out a fairly accurate structure optimization, then perform the NMR calculation.
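To make the advice above concrete, here is a hedged sketch of the two-step workflow; the method, basis sets, and checkpoint name are placeholders of mine, not prescriptions from this guide.

Step 1, transition-state search with analytically computed initial force constants, followed by a frequency check:

%chk=mol.chk
# opt=(ts,calcfc,noeigentest) freq b3lyp/6-31g(d)

Step 2, NMR on the converged structure, reading the geometry and wavefunction from the checkpoint:

%chk=mol.chk
# nmr=giao b3lyp/6-311+g(2d,p) geom=check guess=read

Each route section is followed by the usual blank line, title line, charge and multiplicity, and (for Step 1) the geometry; Step 2 takes its geometry from the checkpoint, so no coordinates are entered.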
Stanford Machine Learning Notes and Code (1)

(Notes and Codes of Machine Learning by Andrew Ng from Stanford University) Note: for the sake of coherence, these articles are organized by topic rather than by the original course schedule.
0. What is machine learning? Machine learning is this: given an existing training set D, apply a learning algorithm A to obtain a particular hypothesis h, where h fits D as well as possible.
h is called the final hypothesis; it is in fact selected from a hypothesis set H, the basic selection criterion being that the cost function is minimized.
Different combinations of learning algorithms and hypothesis sets constitute different learning models.
Put simply, the training set D is generated according to some function (the target function), but that function is hard to give explicitly, which makes it hard to implement with a non-learning algorithm.
A learning algorithm exists to find the most appropriate or most reasonable way to fit this function; the result of that fit is the final hypothesis h.
With h in hand, one can predict new cases, perform classification or regression, and thereby make decisions.
1. Linear regression. Linear regression is the most basic model in statistics and also one of the most important: it fits a curve to the given data in order to predict the quantity of interest, using the method of least squares.
Starting machine learning with linear regression is sensible: it ties in closely with familiar material rather than appearing out of nowhere, it dispels some of the mystery around machine learning, and it lays the groundwork for later topics, since many "nonlinear" algorithms are still built on linear ones.
1.1 Hypothesis. The hypothesis of linear regression is simple: univariate linear regression is the equation of a line in the plane, and multivariate linear regression is a plane (or hyperplane) in a higher-dimensional space:
$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
where the coefficient attached to $\theta_0$ is 1, i.e., $x_0 = 1$. In vector form:
$$h_\theta(x^{(i)}) = \theta^T x^{(i)}$$
where $x = [x_0, x_1, x_2, \ldots, x_n]^T$ and $\theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_n]^T$. Stacking the m training examples gives the m×(n+1) design matrix
$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & & x_j^{(i)} & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$$
where each row $x^{(i)}$ is a training example and each column $x_j$ is a feature; the entry $x_j^{(i)}$ is the j-th feature of the i-th training example. The hypothesis over all examples can then be written compactly as
$$X\theta = \begin{bmatrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{bmatrix} = h_\theta(X)$$
1.2 Cost Function. Machine learning is able to converge step by step to the final hypothesis because each iteration over the training examples reduces the value of the cost function, that is, it solves $\min_\theta J(\theta)$. The cost function of linear regression is defined as the sum of squared errors between the predictions and the true values:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Note, however, that the cost function is not always a sum of squares; later topics introduce cost functions of other forms.
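As a short numerical sketch of these formulas (the toy data and names are made up for illustration): build the design matrix with a leading column of ones, evaluate J(theta), and run batch gradient descent.

import numpy as np

def cost(X, y, theta):
    # J(theta) = (1/2m) * sum((X theta - y)^2)
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def gradient_descent(X, y, theta, alpha=0.1, iters=2000):
    # Batch gradient descent on the squared-error cost.
    m = len(y)
    for _ in range(iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Toy data: y is roughly 1 + 2x, with x0 = 1 prepended as the first column.
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * np.random.default_rng(0).standard_normal(50)

theta = gradient_descent(X, y, np.zeros(2))
print(theta, cost(X, y, theta))   # theta should come out close to [1, 2]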
Gaussian: Basic Concepts and Usage

1 Computation flow

The figure above should be read starting from L301, i.e., the "basis set" box: L301 generates the basis-set information.
(Before it come L1, which processes the route section, builds the list of links to execute, and initializes the scratch files; L101, which reads the title and the molecule specification; L103, the Berny optimization to a minimum; and L202, which reorients the coordinates, computes the symmetry, and checks the variables.) After "basis set" comes "molecular structure", i.e., L401, which forms the initial guess for the molecular orbitals.
The first box contains "molecular orbitals" and "energy", which I take to refer to L502, the iterative solution of the SCF equations.
The red-circled box below it shows the concrete iterative procedure.
Each completed iteration yields an energy,
reported as "SCF Done: E(RHF) = -xx.xxxx A.U. after xx cycles". L103 then checks whether the forces and displacements satisfy the convergence criteria.
If they do (all four items YES), population analysis and related steps are performed and the calculation finishes.
If they do not, the molecular structure is adjusted again (L401) and the SCF is solved again, and this process repeats until the force and displacement convergence criteria are met.
If the method is not HF but a post-SCF method such as many-body perturbation theory or CI, then each SCF solution (L502) is followed by an extra correlation-energy step. Taking MP2 as an example, after the SCF finishes there are additionally L801 (initialization of the two-electron integral transformation), L906 (semi-direct MP2), and L1002 (iterative solution of the CPHF equations).
As for the original poster's figure, I think that when the force/displacement test comes out NO, the arrow should point back to "basis set" at the very top, because in practice every geometry adjustment reorients the coordinates and regenerates the basis-set information.

Analysis of common problems

1. Check for errors in the input file: add %kJob L301 or %kJob L302 (a Link 0 command) to the input; if the job passes, the input file is generally fine.
Common beginner errors: a. wrong spin multiplicity; b. a variable assigned an integer value; c. a variable left unassigned or assigned more than once; d. a bond angle less than or equal to 0 degrees, or greater than or equal to 180 degrees; e. no blank line after the molecule specification; f. a misjudged dihedral angle that brings two atoms too close together; g. the same reference atom used twice within one line of the molecule specification, or collinear reference atoms.
2. SCF (self-consistent field) non-convergence is generally an L502 error; by default 64 cycles are run (the G03 default is 128 cycles). Remedies: a. fix the coordinates so that the structure is reasonable; b. change the initial guess, e.g. Guess=Huckel or another option (see the Guess keyword); c. increase the number of iterations with SCFCYC=N (for small molecules it is best not to, since the structure is then very likely unreasonable); d. use IOp(5/13=1) to ignore the non-convergence and continue.
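For instance, a route line combining remedies b and c might look like the sketch below (molecule, method, and basis are placeholders of mine; note also that in recent Gaussian versions the cycle limit is normally written SCF=(MaxCycle=N) rather than the older SCFCYC=N):

# b3lyp/6-31g(d) scf=(maxcycle=200) guess=huckel

followed by the usual title, charge/multiplicity, and geometry sections.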
Stanford Machine Learning: Summary of Gradient Algorithms

Study notes on the gradient descent algorithm from the Stanford machine learning course, together with an introduction to the related concepts.

1 Basic concepts and notation

Linear algebra provides a compact way to express and manipulate systems of linear equations. For example, the system
$$4x_1 - 5x_2 = 13$$
$$-2x_1 + 3x_2 = -9$$
can be written simply as $Ax = b$, with $A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}$ and $b = \begin{bmatrix} 13 \\ -9 \end{bmatrix}$; here $x = (x_1, x_2)^T$ is itself a matrix, which you can of course view as a column vector.
1.1 Basic notation. We write $A \in \mathbb{R}^{m \times n}$ for a matrix A with m rows and n columns whose entries are all real numbers, and $x \in \mathbb{R}^n$ for an n-dimensional vector, usually taken to be a column vector; a row vector is written as the transpose of a column vector (with a trailing T).
1.2 Inner and outer products. Following the course's definitions, an expression of the form $x^T y$ (or $y^T x$) denotes the inner product, a real number equal to $\sum_{i=1}^n x_i y_i$, while $x y^T$ denotes the outer product, a matrix whose (i, j) entry is $x_i y_j$.
1.3 Matrix-vector multiplication. Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^n$, their product is a vector $y = Ax \in \mathbb{R}^m$. If A is written in terms of its rows, then the i-th entry of y is the inner product of the i-th row of A with x. If instead A is written in terms of its columns, then y is a linear combination of the columns of A: each column is multiplied by a coefficient and the results are summed, with the coefficients taken from x. Likewise, $y^T = x^T A$ says that $y^T$ is a linear combination of the rows of A, each row multiplied by a coefficient taken from x.
1.4 Matrix-matrix multiplication. There are again two representations: in the first, A is viewed by rows and B by columns, so each entry of AB is an inner product; in the second, A is viewed by columns and B by rows, so AB is a sum of outer products. The two views are the same in substance and differ only in representation.
1.5 Gradient of a matrix function (the instructor's own definition). Let f be a function mapping m×n matrices to real numbers. The gradient of f at A is defined as
$$\nabla_A f(A) \in \mathbb{R}^{m \times n}, \qquad \left(\nabla_A f(A)\right)_{ij} = \frac{\partial f(A)}{\partial A_{ij}}.$$
My understanding here is that f(A) is a real-valued expression in the entries of A, and its gradient is a matrix of the same shape as A in which each entry is the derivative of f(A) with respect to the corresponding original entry.
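Before moving on, here is a small NumPy sketch of the notation in 1.2-1.4 (all arrays are made-up examples): the inner and outer products, and the row and column views of the matrix-vector product.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y            # x^T y: a scalar, the sum of x_i * y_i
outer = np.outer(x, y)   # x y^T: a 3x3 matrix with entries x_i * y_j

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])

# Row view: entry i of Ax is the inner product of row i of A with x.
Ax_rows = np.array([A[i] @ x for i in range(A.shape[0])])

# Column view: Ax is a linear combination of the columns of A,
# with coefficients taken from x.
Ax_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))

assert np.allclose(Ax_rows, A @ x) and np.allclose(Ax_cols, A @ x)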
1.6 Other concepts. For reasons of space these are not elaborated here; other concepts that will be needed include the identity matrix, diagonal matrices, the matrix transpose, symmetric matrices ($A^T = A$), antisymmetric matrices ($A = -A^T$), the trace of a matrix, the norm of a vector, linear independence, the rank of a matrix, full-rank matrices, the matrix inverse (a matrix is invertible if and only if it has full rank), orthogonal matrices, the column space (range) of a matrix, the determinant, and eigenvectors and eigenvalues.

2 Formulas used

Many formulas are used in the course; they are listed in what follows.
The Multivariate Gaussian Distribution
Chuong B. Do, October 10, 2008

A vector-valued random variable $X = [X_1 \; \cdots \; X_n]^T$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in S_{++}^n$ if its probability density function is given by
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right).$$
We write this as $X \sim \mathcal{N}(\mu, \Sigma)$. Recall from the section notes on linear algebra that $S_{++}^n$ is the space of symmetric positive definite n × n matrices, defined as
$$S_{++}^n = \left\{ A \in \mathbb{R}^{n \times n} : A = A^T \text{ and } x^T A x > 0 \text{ for all } x \in \mathbb{R}^n \text{ such that } x \neq 0 \right\}.$$
In these notes, we describe multivariate Gaussians and some of their basic properties.

1 Relationship to univariate Gaussians

Recall that the density function of a univariate normal (or Gaussian) distribution is given by
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right).$$
Here, the argument of the exponential function, $-\frac{1}{2\sigma^2}(x-\mu)^2$, is a quadratic function of the variable x. Furthermore, the parabola points downwards, as the coefficient of the quadratic term is negative. The coefficient in front, $\frac{1}{\sqrt{2\pi}\,\sigma}$, is a constant that does not depend on x; hence, we can think of it as simply a "normalization factor" used to ensure that
$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx = 1.$$

[Figure 1: The figure on the left shows a univariate Gaussian density for a single variable X. The figure on the right shows a multivariate Gaussian density over two variables X1 and X2.]

In the case of the multivariate Gaussian density, the argument of the exponential function, $-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)$, is a quadratic form in the vector variable x. Since Σ is positive definite, and since the inverse of any positive definite matrix is also positive definite, then for any non-zero vector z, $z^T \Sigma^{-1} z > 0$. This implies that for any vector $x \neq \mu$,
$$(x-\mu)^T \Sigma^{-1}(x-\mu) > 0, \qquad -\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) < 0.$$
Like in the univariate case, you can think of the argument of the exponential function as being a downward-opening quadratic bowl. The coefficient in front (i.e., $\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}$) has an even more complicated form than in the univariate case. However, it still does not depend on x, and hence it is again simply a normalization factor used to ensure that
$$\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right) dx_1 \, dx_2 \cdots dx_n = 1.$$

2 The covariance matrix

The following proposition (whose proof is provided in Appendix A.1) gives an alternative way to characterize the covariance matrix of a random vector X:

Proposition 1. For any random vector X with mean µ and covariance matrix Σ,
$$\Sigma = E[(X-\mu)(X-\mu)^T] = E[XX^T] - \mu\mu^T. \qquad (1)$$

In the definition of multivariate Gaussians, we required that the covariance matrix Σ be symmetric positive definite (i.e., $\Sigma \in S_{++}^n$). Why does this restriction exist? As seen in the following proposition, the covariance matrix of any random vector must always be symmetric positive semidefinite:

Proposition 2. Suppose that Σ is the covariance matrix corresponding to some random vector X. Then Σ is symmetric positive semidefinite.

Proof. The symmetry of Σ follows immediately from its definition. Next, for any vector $z \in \mathbb{R}^n$, observe that
$$z^T \Sigma z = \sum_{i=1}^n \sum_{j=1}^n \Sigma_{ij} z_i z_j \qquad (2)$$
$$= \sum_{i=1}^n \sum_{j=1}^n \mathrm{Cov}[X_i, X_j] \, z_i z_j$$
$$= \sum_{i=1}^n \sum_{j=1}^n E\big[(X_i - E[X_i])(X_j - E[X_j])\big] \, z_i z_j$$
$$= E\left[\sum_{i=1}^n \sum_{j=1}^n (X_i - E[X_i])(X_j - E[X_j]) \, z_i z_j\right]. \qquad (3)$$
Here, (2) follows from the formula for expanding a quadratic form (see section notes on linear algebra), and (3) follows by linearity of expectations (see probability notes). To complete the proof, observe that the quantity inside the brackets is of the form $\sum_i \sum_j x_i x_j z_i z_j = (x^T z)^2 \geq 0$ (see problem set #1). Therefore, the quantity inside the expectation is always nonnegative, and hence the expectation itself must be nonnegative. We conclude that $z^T \Sigma z \geq 0$.

From the above proposition it follows that Σ must be symmetric positive semidefinite in order for it to be a valid covariance matrix. However, in order for $\Sigma^{-1}$ to exist (as required in the definition of the multivariate Gaussian density), Σ must be invertible and hence full rank. Since any full-rank symmetric positive semidefinite matrix is necessarily symmetric positive definite, it follows that Σ must be symmetric positive definite.

3 The diagonal covariance matrix case
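The extract breaks off at the diagonal-covariance section, but the key property of that case (a Gaussian with diagonal Σ factors into a product of independent univariate Gaussians) is easy to verify numerically. The following SciPy sketch is my own illustration, not part of the original notes.

import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0])
sigma2 = np.array([0.5, 2.0])   # the diagonal entries of Sigma
Sigma = np.diag(sigma2)

x = np.array([0.3, -1.1])

# Joint density with a diagonal covariance matrix ...
p_joint = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

# ... equals the product of univariate densities p(x_i; mu_i, sigma_i^2).
p_factored = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

assert np.isclose(p_joint, p_factored)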