Estimation errors for functionals on measure spaces
Instrumental Variables (IV): A Detailed Explanation

IV
The 2SLS name notwithstanding, we don't usually construct 2SLS estimates in two steps. For one thing, the resulting standard errors are wrong, as we discuss later. Here s̃i denotes the residual from a regression of si on Xi; this follows from the multivariate regression anatomy formula and the fact that cov(s̃i, si) = V(s̃i). It is also easy to show that, in a model with a single endogenous variable and a single instrument, the 2SLS estimator is the same as the corresponding ILS (Indirect Least Squares) estimator. (Q3) From 2SLS:
Write the causal model as Yi = α0 + ρ si + ηi with cov(Zi, ηi) = 0. Then

ρ = cov(Yi, Zi) / cov(si, Zi) = [cov(Yi, Zi)/V(Zi)] / [cov(si, Zi)/V(Zi)].   (4.1.3)
• Q1: The second equality in (4.1.3) is useful because it is usually easier to think in terms of regression coefficients than in terms of covariances.
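To see the ILS/2SLS equivalence numerically, here is a minimal simulation sketch (not from the source; the data-generating process and variable names are illustrative): with one instrument and one endogenous regressor, the ratio-of-covariances estimator in (4.1.3) and the explicit two-step construction give the same point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Illustrative DGP: instrument Z, endogenous regressor s, true causal effect rho = 2.
Z = rng.binomial(1, 0.5, n).astype(float)
u = rng.normal(size=n)                    # unobserved confounder
s = 0.7 * Z + u + rng.normal(size=n)      # first stage
Y = 2.0 * s + u + rng.normal(size=n)      # outcome; OLS of Y on s is biased upward

# ILS / IV as a ratio of covariances, as in (4.1.3)
rho_ils = np.cov(Y, Z)[0, 1] / np.cov(s, Z)[0, 1]

# "Manual" 2SLS: regress s on Z, then regress Y on the fitted values
pi1, pi0 = np.polyfit(Z, s, 1)
s_hat = pi0 + pi1 * Z
rho_2sls = np.cov(Y, s_hat)[0, 1] / np.var(s_hat, ddof=1)

print(rho_ils, rho_2sls)   # identical up to floating point, both close to 2
```

The standard errors produced by this manual second step are exactly the ones the passage warns are wrong, because they ignore the estimation error in the first stage.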
A Method to Estimate the True Mahalanobis Distance from Eigenvectors of the Sample Covariance Matrix

Abstract. In statistical pattern recognition, the parameters of distributions are usually estimated from training sample vectors. Estimated parameters, however, contain estimation errors, and these errors degrade recognition performance when the sample size is insufficient. Some existing methods obtain better estimates of the eigenvalues of the true covariance matrix and thereby avoid the harmful effects of eigenvalue estimation errors. However, estimation errors in the eigenvectors of the covariance matrix have received little attention. In this paper, we consider the estimation errors of eigenvectors and show that they can be regarded as estimation errors of eigenvalues. We then present a method to estimate the true Mahalanobis distance from the eigenvectors of the sample covariance matrix. Recognition experiments show that, by applying the proposed method, the true Mahalanobis distance can be estimated even when the sample size is small, and better recognition accuracy is achieved. The proposed method is useful in practical pattern recognition applications because it requires no hyper-parameters.
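The role the eigendecomposition plays here can be made concrete with a short sketch (not from the paper; the data, dimensions, and the `corrected_eigenvalues` placeholder mentioned in the comments are illustrative): the squared Mahalanobis distance is a sum of squared projections onto eigenvectors, each divided by the corresponding eigenvalue, so errors in either the eigenvectors or the eigenvalues of the sample covariance matrix distort the distance.

```python
import numpy as np

def mahalanobis_sq_from_eig(x, mean, cov):
    """Squared Mahalanobis distance via the eigendecomposition of cov:
    d^2 = sum_k ( (phi_k^T (x - mean))^2 / lambda_k )."""
    lam, phi = np.linalg.eigh(cov)        # eigenvalues lam, eigenvectors as columns of phi
    proj = phi.T @ (x - mean)             # coordinates of x - mean in the eigenvector basis
    return float(np.sum(proj ** 2 / lam))

# With a small training set the sample covariance over-spreads its spectrum
# (large eigenvalues too large, small ones too small), which inflates d^2.
# The paper corrects the sample eigenvalues/eigenvectors; here that step is only
# indicated by a hypothetical `corrected_eigenvalues` placeholder, not implemented.
rng = np.random.default_rng(1)
d, n = 10, 30
train = rng.normal(size=(n, d))
mu_hat = train.mean(axis=0)
cov_hat = np.cov(train, rowvar=False)
x = rng.normal(size=d)
print(mahalanobis_sq_from_eig(x, mu_hat, cov_hat))
```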
A consistent estimator in general functional errors-in-variables

More precisely, z = yᵀ ∂²ℓ(θ) means that the ith coordinate of z is

z_i = Σ_{k=1}^{r} y_k [∂²ℓ_k(θ)]_i .
θ0 is the parameter to be estimated. Suppose that the parameter set is p-dimensional: θ0 ∈ R^p; x_i, ξ_i, ε_i are q-dimensional vectors and y_i, η_i are r-dimensional vectors (i = 1, 2, ...); g : V → R^{r×q} is a known function, where V ⊆ R^q and ξ_i ∈ V for each i. We shall suppose the existence of three sequences of auxiliary functions:
(Af) There exist an open set U and functions f_i ∈ C(R^q × U → R^r) such that
Abstract
1. Introduction
In statistical work one often encounters non-standard situations in which the explanatory variables of a regression model (e.g. location, temperature, pressure) can only be observed with errors. Throughout this paper we deal with a model of this kind. It has the following form:
Frontiers of Statistics: Cointegration and Error Correction

III. Applied Examples of Cointegration Testing
Example 1: A cointegration test between China's price level and its level of economic development. In classical econometric analysis one often encounters regression equations relating the price level (i.e., a price index) to income, consumption, or the level of economic development. Theoretical economists disagree about such regressions; they argue that the price level should not be correlated with income or with the level of economic development. Here we apply a cointegration test to China's price level and economic development level since the 1950s, in order to determine whether the relationship between the two is genuine or spurious.

The EG and AEG tests proceed as follows. First run an OLS regression on the variables to be tested for cointegration and obtain the residual series; then apply a unit root test to the residuals, computing the DF or ADF statistic. The DF statistic corresponds to the EG test and the ADF statistic to the AEG test. When the residuals of the EG regression are autocorrelated, we use the Augmented Engle-Granger (AEG) test, which is analogous to the EG test: its regression equation is the same as that of the ADF test and the AEG statistic is the same as the ADF statistic, but the critical values of the AEG cointegration test differ from those of the ADF test.

Thus the EG/AEG statistics and their computation are exactly the same as in the unit root tests; only the critical values differ. The critical values of the EG or AEG test are obtained as follows: the tables give three values φ∞, φ1 and φ2, from which the critical value is computed as

C = φ∞ + φ1/n + φ2/n².

Compare the DF statistic (or the ADF statistic) with this critical value:
if DF (ADF) < critical value, the variables being tested (i.e., the regression variables) are cointegrated;
if DF (ADF) > critical value, they are not cointegrated.
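As a concrete illustration of the procedure just described (OLS regression, residual series, unit root test on the residuals), here is a minimal sketch using statsmodels. The series are simulated placeholders, not the Chinese price and output data discussed above, and the critical values reported by `adfuller` are ordinary ADF ones; for a strict EG/AEG decision they should be replaced by Engle-Granger critical values, e.g. via `statsmodels.tsa.stattools.coint`.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, coint

# Placeholder series: "price level" p_t and "economic development level" y_t.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))        # an I(1) series
p = 0.5 * y + rng.normal(size=200)         # cointegrated with y by construction

# Step 1: OLS regression of one variable on the other.
ols = sm.OLS(p, sm.add_constant(y)).fit()
resid = ols.resid

# Step 2: unit root test on the residuals (ADF statistic -> AEG test).
adf_stat = adfuller(resid, autolag="AIC")[0]
print("AEG/ADF statistic on residuals:", adf_stat)

# Engle-Granger test with the proper cointegration critical values:
eg_stat, pvalue, crit = coint(p, y)
print("Engle-Granger statistic:", eg_stat, "p-value:", pvalue)
```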
Keywords: Co-integration, vector autoregression, unit roots, error correction, multivariate time series, Dickey-Fuller tests
MIT: Statistical Applications in Economics (25)

14.30 Introduction to Statistical Methods in Economics
Spring 2009
Confidence Intervals

In order to construct confidence intervals, we usually proceed as follows:

1. Find a(θ0) and b(θ0) such that Pθ0 (a(θ0) ≤ T(X1, . . . , Xn) ≤ b(θ0)) = 1 − α.

2. Find functions of the data A(X1, . . . , Xn) and B(X1, . . . , Xn) such that

Pθ0 (A(X1, . . . , Xn) ≤ θ0 ≤ B(X1, . . . , Xn)) = 1 − α
• θ̂ unbiased and normally distributed, Var(θ̂) known:
  [A(X1, . . . , Xn), B(X1, . . . , Xn)] = [ θ̂ + Φ^{-1}(α/2) √Var(θ̂),  θ̂ + Φ^{-1}(1 − α/2) √Var(θ̂) ]
• θ̂ unbiased and normally distributed, Var(θ̂) not known, but we have an estimator V̂ar(θ̂):
  [A(X1, . . . , Xn), B(X1, . . . , Xn)] = [ θ̂ + t_{n−1}(α/2) √V̂ar(θ̂),  θ̂ + t_{n−1}(1 − α/2) √V̂ar(θ̂) ]
• θ̂ not normal, n > 30 or so: most estimators we have seen so far turn out to be asymptotically normally distributed, so we use that approximation and calculate confidence intervals as in the previous case. Whether or not we know the variance, we use the t-distribution as a way of penalizing the CI for relying on an approximation.
• θ̂ not normal, n small: if (a) we know the p.d.f. of θ̂, we can construct CIs from first principles; if (b) we don't know the p.d.f., there is nothing we can do.
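A minimal numerical sketch of the two normal-theory cases above (the choice of the sample mean as θ̂ and the "known" σ = 2 are illustrative assumptions; `scipy.stats` supplies the Φ^{-1} and t quantiles):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=40)
alpha = 0.05
theta_hat = x.mean()                      # estimator of the mean

# Case 1: variance known (here we pretend sigma = 2 is known)
se_known = 2.0 / np.sqrt(len(x))
ci_normal = (theta_hat + stats.norm.ppf(alpha / 2) * se_known,
             theta_hat + stats.norm.ppf(1 - alpha / 2) * se_known)

# Case 2: variance unknown, estimated -> t_{n-1} quantiles
se_hat = x.std(ddof=1) / np.sqrt(len(x))
ci_t = (theta_hat + stats.t.ppf(alpha / 2, df=len(x) - 1) * se_hat,
        theta_hat + stats.t.ppf(1 - alpha / 2, df=len(x) - 1) * se_hat)

print(ci_normal, ci_t)    # each covers the true mean 5.0 with probability ~0.95
```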
Central Limit Theorems for Functionals of Gaussian Processes and Their Applications

Author: Sun Lin, School of Applied Mathematics, Guangdong University of Technology, Guangzhou, Guangdong 510090, China. Journal: Economic Mathematics (《经济数学》), 2011, 28(2), 4 pages (pp. 21-24). Keywords: derivative operator; Malliavin calculus of variations; central limit theorem; Gaussian process. Language: Chinese. Classification: O211.

Abstract: Using two operators on Wiener space and a related identity, we propose a new method of proving a central limit theorem for functionals of Gaussian processes, and we give examples of applications of this central limit theorem.

The famous Soviet probabilists Gnedenko and Kolmogorov once remarked that "the epistemological value of probability theory is revealed only through limit theorems; without limit theorems it is impossible to understand the real meaning of the fundamental concepts of probability theory" [1]. Hence, in studying the statistical properties of a statistic or a random variable, the most important task is to study its limit theory. Much of the data arising in practice can be regarded as coming from a population of functionals of a Gaussian process; for example, data from a normal random variable can be viewed as coming from the identity mapping of a Gaussian process. Probability limit theory has therefore been developed extensively since the 1930s. In recent years the statistical properties of functionals of Gaussian processes have become an active research direction, and many authors have studied limit theorems for such functionals, e.g. Nualart and Peccati (2005) [2], Nualart and Ortiz-Latorre (2008) [3], Peccati (2007) [4], Hu and Nualart (2005) [5], Peccati and Taqqu (2008) [6], and Peccati and Taqqu (2007) [7]. Many papers, such as Deheuvels, Peccati and Yor (2006) [8] and Hu and Nualart (2009) [9], have applied these theorems. In this paper we first use the Malliavin calculus of variations, through the derivative operator and the divergence operator together with an identity, to construct a new proof of the central limit theorem for functionals of Gaussian processes; the proof avoids the Dambis-Dubins-Schwarz theorem and the Clark-Ocone formula. We then give applications of the central limit theorem in concrete examples. Although only the one-dimensional central limit theorem is given here, analogous results hold in the multidimensional case. Beyond the almost sure central limit theorem for functionals of Gaussian processes, questions such as almost sure large deviation properties, almost sure local central limit theorems, and rates of convergence in the almost sure central limit theorem require further study.

References:
[1] B. V. Gnedenko, A. M. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, 1954.
[2] D. Nualart, G. Peccati. Central limit theorems for sequences of multiple stochastic integrals. Annals of Probability, 2005, 33(1): 177-193.
[3] D. Nualart, S. Ortiz-Latorre. Central limit theorems for multiple stochastic integrals and Malliavin calculus. Stochastic Processes and their Applications, 2008, 118(4): 614-628.
[4] G. Peccati. Gaussian approximations of multiple integrals. Electronic Communications in Probability, 2007, 34(12): 350-364.
[5] Y. Hu, D. Nualart. Renormalized self-intersection local time for fractional Brownian motion. Annals of Probability, 2005, 33(3): 948-983.
[6] G. Peccati, M. Taqqu. Stable convergence of multiple Wiener-Itô integrals. Journal of Theoretical Probability, 2008, 21(3): 527-570.
[7] G. Peccati, M. S. Taqqu. Stable convergence of generalized L2 stochastic integrals and the principle of conditioning. Electronic Journal of Probability, 2007, 12(15): 447-480.
[8] P. Deheuvels, G. Peccati, M. Yor. On quadratic functionals of the Brownian sheet and related processes. Stochastic Processes and their Applications, 2006, 116(3): 493-538.
[9] Y. Hu, D. Nualart. Parameter estimation for fractional Ornstein-Uhlenbeck processes. Statistics and Probability Letters, 2010, 80(11-12): 1030-1038.
[10] D. Nualart. The Malliavin Calculus and Related Topics. 2nd Edition. Berlin: Springer-Verlag, 2006.
Functional-coefficient regression models for nonlinear time series

This paper adapts the functional-coefficient modeling technique to analyze nonlinear time series data. The approach allows appreciable flexibility in the structure of the fitted model without suffering from the "curse of dimensionality". Let {(Ui, Xi, Yi)}, i = ..., -1, 0, 1, ..., be a jointly strictly stationary process with Ui taking values in R^k and Xi taking values in R^p. Typically, k is small. Let E(Y1²) < ∞. Ui and Xi consist of some lagged values of Yi. The functional-coefficient regression model has the form

m(u, x) = Σ_{j=1}^{p} a_j(u) x_j,    (1.2)

where the a_j(·)'s are measurable functions from R^k to R^1 and x = (x_1, . . . , x_p)^T, with T denoting the transpose of a matrix or vector. The idea of modeling time series in such a form is not new; see, for example, Nicholls and Quinn (1982). In fact, many useful time series models may be viewed as special cases of model (1.2).
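To illustrate how the coefficient functions in (1.2) can be estimated, here is a minimal kernel-weighted least-squares sketch. It is not the authors' estimator (they use a local linear fit), and the kernel, bandwidth, and simulated data are placeholders: at each grid point u0 the coefficients a_j(u0) are estimated by weighted least squares with weights K((Ui − u0)/h).

```python
import numpy as np

def local_fc_fit(U, X, Y, u0, h):
    """Local (Nadaraya-Watson type) estimate of a(u0) in
    Y_i = sum_j a_j(U_i) X_ij + e_i, using Gaussian kernel weights."""
    w = np.exp(-0.5 * ((U - u0) / h) ** 2)      # kernel weights K((U_i - u0)/h)
    Xw = X * w[:, None]
    # Weighted least squares: solve (X' W X) a = X' W Y
    return np.linalg.solve(Xw.T @ X, Xw.T @ Y)

# Placeholder data: Y_t = a1(U_t) X_t1 + a2(U_t) X_t2 + noise, with scalar U_t.
rng = np.random.default_rng(0)
n = 500
U = rng.uniform(-2, 2, n)
X = rng.normal(size=(n, 2))
Y = np.sin(U) * X[:, 0] + 0.5 * U**2 * X[:, 1] + 0.1 * rng.normal(size=n)

for u0 in np.linspace(-1.5, 1.5, 7):
    a_hat = local_fc_fit(U, X, Y, u0, h=0.3)
    print(u0, a_hat)     # roughly [sin(u0), 0.5 * u0**2]
```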
A New Approach for Filtering Nonlinear Systems

computational overhead, as the number of calculations demanded for the generation of the Jacobian and for the predictions of the state estimate and covariance is large. In this paper we describe a new approach to generalising the Kalman filter to systems with nonlinear state transition and observation models. In Section 2 we describe the basic filtering problem and the notation used in this paper. In Section 3 we describe the new filter. The fourth section presents a summary of the theoretical analysis of the performance of the new filter against that of the EKF. In Section 5 we demonstrate the new filter in a highly nonlinear application, and we conclude with a discussion of the implications of this new filter.
E[v(i) v(j)^T] = δij Q(i),   (3)
E[w(i) w(j)^T] = δij R(i),   (4)
E[v(i) w(j)^T] = 0,   ∀ i, j.   (5)
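For reference (not from this paper), here is a minimal sketch of how the covariances Q(i) and R(i) in (3)-(4) enter one predict/update cycle of the ordinary linear Kalman filter; the state-transition matrix F and observation matrix H are assumed known.

```python
import numpy as np

def kf_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of the linear Kalman filter.
    Q and R are the process and observation noise covariances from (3)-(4)."""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

In the EKF, F and H are replaced by Jacobians of the nonlinear state-transition and observation models evaluated at the current estimate; computing those Jacobians is the overhead that motivates the new filter described here.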
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Ron Kohavi
Computer Science Department, Stanford University, Stanford, CA 94305
ronnyk@CS.Stanford.EDU
http://robotics.stanford.edu/~ronnyk

Abstract

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment (over half a million runs of C4.5 and a Naive-Bayes algorithm) to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

1 Introduction

It can not be emphasized enough that no claim whatsoever is being made in this paper that all algorithms are equivalent in practice in the real world. In particular, no claim is being made that one should not use cross-validation in the real world. (Wolpert 1994a)

Estimating the accuracy of a classifier induced by supervised learning algorithms is important not only to predict its future prediction accuracy, but also for choosing a classifier from a given set (model selection), or combining classifiers (Wolpert 1992). For estimating the final accuracy of a classifier, we would like an estimation method with low bias and low variance. To choose a classifier or to combine classifiers, the absolute accuracies are less important and we are willing to trade off bias for low variance, assuming the bias affects all classifiers similarly (e.g., estimates are 5% pessimistic). (A longer version of the paper can be retrieved by anonymous ftp to starry.stanford.edu, pub/ronnyk/accEst-long.ps.)

In this paper we explain some of the assumptions made by the different estimation methods and present concrete examples where each method fails. While it is known that no accuracy estimation can be correct all the time (Wolpert 1994b, Schaffer 1994), we are interested in identifying a method that is well suited for the biases and trends in typical real-world datasets. Recent results, both theoretical and experimental, have shown that it is not
alwa>s the case that increas-ing the computational cost is beneficial especiallhy if the relative accuracies are more important than the exact values For example leave-one-out is almost unbiased,but it has high variance leading to unreliable estimates (Efron 1981) l o r linear models using leave-one-out cross-validation for model selection is asymptotically in consistent in the sense that the probability of selectingthe model with the best predictive power does not con-verge to one as the lolal number of observations ap-proaches infinity (Zhang 1992, Shao 1993)This paper \s organized AS follows Section 2 describesthe common accuracy estimation methods and ways of computing confidence bounds that hold under some as-sumptions Section 3 discusses related work comparing cross-validation variants and bootstrap variants Sec lion 4 discusses methodology underlying our experimentThe results of the experiments are given Section 5 with a discussion of important observations We conelude witha summary in Section 62 Methods for Accuracy E s t i m a t i o nA classifier is a function that maps an unlabelled in-stance to a label using internal data structures An i n-ducer or an induction algorithm builds a classifier froma given dataset CART and C 4 5 (Brennan, Friedman Olshen &. Stone 1984, Quinlan 1993) are decision tree in-ducers that build decision tree classifiers In this paperwe are not interested in the specific method for inducing classifiers, but assume access to a dataset and an inducerof interestLet V be the space of unlabelled instances and y theKOHAVI 1137set of possible labels be the space of labelled instances and ,i n ) be a dataset (possibly a multiset) consisting of n labelled instances, where A classifier C maps an unla-beled instance ' 10 a l a b e l a n d an inducer maps a given dataset D into a classifier CThe notationwill denote the label assigned to an unlabelled in-stance v by the classifier built, by inducer X on dataset D tWe assume that there exists adistribution on the set of labelled instances and that our dataset consists of 1 1 d (independently and identically distributed) instances We consider equal misclassifica-lion costs using a 0/1 loss function, but the accuracy estimation methods can easily be extended to other loss functionsThe accuracy of a classifier C is the probability ofcorrectly clasaifying a randoml\ selected instance, i efor a randomly selected instancewhere the probability distribution over theinstance space 15 the same as the distribution that was used to select instances for the inducers training set Given a finite dataset we would like to custimate the fu-ture performance of a classifier induced by the given in-ducer and dataset A single accuracy estimate is usually meaningless without a confidence interval, thus we will consider how to approximate such an interval when pos-sible In order to identify weaknesses, we also attempt o identify cases where the estimates fail2 1 Holdout The holdout method sometimes called test sample esti-mation partitions the data into two mutually exclusivesubsets called a training set and a test set or holdout setIt is Lommon to designate 2/ 3 of the data as the trainingset and the remaining 1/3 as the test set The trainingset is given to the inducer, and the induced classifier istested on the test set Formally, let , the holdout set,be a subset of D of size h, and let Theholdout estimated accuracy is defined aswhere otherwise Assummg that the inducer s accuracy increases as more instances are seen, the holdout method is a 
pessimistic estimator because only a portion of the data is given to the inducer for training The more instances we leave for the test set, the higher the bias of our estimate however, fewer test set instances means that the confidence interval for the accuracy will be wider as shown belowEach test instance can be viewed as a Bernoulli trialcorrect or incorrect prediction Let S be the numberof correct classifications on the test set, then s is dis-tributed bmomially (sum of Bernoulli trials) For rea-sonably large holdout sets, the distribution of S/h is ap-proximately normal with mean ace (the true accuracy of the classifier) and a variance of ace * (1 — acc)hi Thus, by De Moivre-Laplace limit theorem, we havewhere z is the quanl lie point of the standard normal distribution To get a IOO7 percent confidence interval, one determines 2 and inverts the inequalities Inversion of the inequalities leads to a quadratic equation in ace, the roots of which are the low and high confidence pointsThe above equation is not conditioned on the dataset D , if more information is available about the probability of the given dataset it must be taken into accountThe holdout estimate is a random number that de-pends on the division into a training set and a test set In r a n d o m sub s a m p l i n g the holdout method is re-peated k times and the eslimated accuracy is derived by averaging the runs Th( slandard deviation can be estimated as the standard dewation of the accuracy es-timations from each holdout runThe mam assumption that is violated in random sub-sampling is the independence of instances m the test set from those in the training set If the training and testset are formed by a split of an original dalaset, thenan over-represented class in one subset will be a under represented in the other To demonstrate the issue we simulated a 2/3, 1 /3 split of Fisher's famous ins dataset and used a majority inducer that builds a classifier pre dieting the prevalent class in the training set The iris dataset describes ins plants using four continuous fea-tures, and the task is to classify each instance (an ins) as Ins Setosa Ins Versicolour or Ins Virginica For each class label there are exactly one third of the instances with that label (50 instances of each class from a to-tal of 150 instances) thus we expect 33 3% prediction accuracy However, because the test set will always con-tain less than 1/3 of the instances of the class that wasprevalent in the training set, the accuracy predicted by the holdout method is 21 68% with a standard deviation of 0 13% (estimated by averaging 500 holdouts) In practice, the dataset size is always finite, and usu-ally smaller than we would like it to be The holdout method makes inefficient use of the data a third of dataset is not used for training the inducer 2 2 Cross-Validation, Leave-one-out, and Stratification In fc-fold cross-validation, sometimes called rotation esti-mation, the dataset V is randomly split into k mutuallyexclusive subsets (the folds) , of approx-imately equal size The inducer is trained and tested1138 LEARNINGThe cross-validation estimate is a random number that depends on the division into folds C o m p l e t ec r o s s -v a l id a t i o n is the average of all possibil ities for choosing m/k instances out of m, but it is usually too expensive Exrept for leave-one-one (rc-fold cross-validation), which is always complete, fc-foM cross-validation is estimating complete K-foId cross-validationusing a single split of the data into the folds Repeat-ing 
cross-validation multiple limes using different spillsinto folds provides a better M onte C arlo estimate to 1 hecomplele cross-validation at an added cost In s t r a t i -fied c r o s s -v a l i d a t i o n the folds are stratified so thaitlicy contain approximately the same proportions of la-bels as the original dataset An inducer is stable for a given dataset and a set of perturbal ions if it induces classifiers thai make the same predictions when it is given the perturbed datasets P r o p o s i t i o n 1 (V a r i a n c e in A>fold C V )Given a dataset and an inducer If the inductr isstable under the pei tur bations caused by deleting theinstances f o r thr folds in k fold cross-validatwn thecross validation < stnnate will be unbiastd and the t a i lance of the estimated accuracy will be approximatelyaccrv (1—)/n when n is the number of instancesin the datasi t Proof If we assume that the k classifiers produced makethe same predictions, then the estimated accuracy has a binomial distribution with n trials and probabihly of success equal to (he accuracy of the classifier | For large enough n a confidence interval may be com-puted using Equation 3 with h equal to n, the number of instancesIn reality a complex inducer is unlikely to be stable for large perturbations unless it has reached its maximal learning capacity We expect the perturbations induced by leave-one-out to be small and therefore the classifier should be very stable As we increase the size of the perturbations, stability is less likely to hold we expect stability to hold more in 20-fold cross-validation than in 10-fold cross-validation and both should be more stable than holdout of 1/3 The proposition does not apply to the resubstitution estimate because it requires the in-ducer to be stable when no instances are given in the datasetThe above proposition helps, understand one possible assumption that is made when using cross-validation if an inducer is unstable for a particular dataset under a set of perturbations introduced by cross-validation, the ac-curacy estimate is likely to be unreliable If the inducer is almost stable on a given dataset, we should expect a reliable estimate The next corollary takes the idea slightly further and shows a result that we have observed empirically there is almost no change in the variance of the cross validation estimate when the number of folds is variedC o r o l l a r y 2 (Variance m cross-validation)Given a dataset and an inductr If the inducer is sta-ble undfi the }>tituibuhoris (aused by deleting the test instances foi the folds in k-fold cross-validation for var-ious valuts of k then tht vartanct of the estimates will be the sameProof The variance of A-fold cross-validation in Propo-sition 1 does not depend on k |While some inducers are liktly to be inherently more stable the following example shows that one must also take into account the dalaset and the actual perturba (ions E x a m p l e 1 (Failure of leave-one-out)lusher s ins dataset contains 50 instances of each class leading one to expect that a majority indu<er should have acruraov about j \% However the eombmation ofthis dataset with a majority inducer is unstable for thesmall perturbations performed by leave-one-out Whenan instance is deleted from the dalaset, its label is a mi-nority in the training set, thus the majority inducer pre-dicts one of the other two classes and always errs in clas-sifying the test instance The leave-one-out estimatedaccuracy for a majont> inducer on the ins dataset istherefore 0% M oreover all 
folds have this estimated ac-curacy, thus the standard deviation of the folds is again0 %giving the unjustified assurance that 'he estimate is stable | The example shows an inherent problem with cross-validation th-t applies to more than just a majority in-ducer In a no-infornirition dataset where the label val-ues are completely random, the best an induction algo-rithm can do is predict majority Leave-one-out on such a dataset with 50% of the labels for each class and a majontv ind'-cer (the best, possible inducer) would still predict 0% accuracy 2 3 B o o t s t r a pThe bootstrap family was introduced by Efron and is fully described in Efron &. Tibshirani (1993) Given a dataset of size n a b o o t s t r a p s a m p l e is created by sampling n instances uniformly from the data (with re-placement) Since the dataset is sampled with replace-ment, the probability of any given instance not beingchosen after n samples is theKOHAVI 1139expected number of distinct instances from the original dataset appearing in the teat set is thus 0 632n The eO accuracy estimate is derived by using the bootstrap sam-ple for training and the rest of the instances for testing Given a number b, the number of bootstrap samples, let e0, be the accuracy estimate for bootstrap sample i The632 bootstrap estimate is defined as(5)where ace, is the resubstitution accuracy estimate on the full dataset (i e , the accuracy on the training set) The variance of the estimate can be determined by com puting the variance of the estimates for the samples The assumptions made by bootstrap are basically the same as that of cross-validation, i e , stability of the al-gorithm on the dataset the 'bootstrap world" should closely approximate the real world The b32 bootstrap fails (o give the expected result when the classifier is a perfect memonzer (e g an unpruned decision tree or a one nearest neighbor classifier) and the dataset is com-pletely random, say with two classes The resubstitution accuracy is 100%, and the eO accuracy is about 50% Plugging these into the bootstrap formula, one gets an estimated accuracy of about 68 4%, far from the real ac-curacy of 50% Bootstrap can be shown to fail if we add a memonzer module to any given inducer and adjust its predictions If the memonzer remembers the training set and makes the predictions when the test instance was a training instances, adjusting its predictions can make the resubstitution accuracy change from 0% to 100% and can thus bias the overall estimated accuracy in any direction we want3 Related W o r kSome experimental studies comparing different accuracy estimation methods have been previously done but most of them were on artificial or small datasets We now describe some of these effortsEfron (1983) conducted five sampling experiments and compared leave-one-out cross-validation, several variants of bootstrap, and several other methods The purpose of the experiments was to 'investigate some related es-timators, which seem to offer considerably improved es-timation in small samples ' The results indicate that leave-one-out cross-validation gives nearly unbiased esti-mates of the accuracy, but often with unacceptably high variability, particularly for small samples, and that the 632 bootstrap performed bestBreiman et al (1984) conducted experiments using cross-validation for decision tree pruning They chose ten-fold cross-validation for the CART program and claimed it was satisfactory for choosing the correct tree They claimed that "the difference in the cross-validation 
estimates of the risks of two rules tends to be much more accurate than the two estimates themselves "Jain, Dubes fa Chen (1987) compared the performance of the t0 bootstrap and leave-one-out cross-validation on nearest neighbor classifiers Using artificial data and claimed that the confidence interval of the bootstrap estimator is smaller than that of leave-one-out Weiss (1991) followed similar lines and compared stratified cross-validation and two bootstrap methods with near-est neighbor classifiers His results were that stratified two-fold cross validation is relatively low variance and superior to leave-one-outBreiman fa Spector (1992) conducted a feature sub-set selection experiments for regression, and compared leave-one-out cross-validation, A:-fold cross-validation for various k, stratified K-fold cross-validation, bias-corrected bootstrap, and partial cross-validation (not discussed here) Tests were done on artificial datasets with 60 and 160 instances The behavior observed was (1) the leave-one-out has low bias and RMS (root mean square) error whereas two-fold and five-fold cross-validation have larger bias and RMS error only at models with many features, (2) the pessimistic bias of ten-fold cross-validation at small samples was significantly re-duced for the samples of size 160 (3) for model selection, ten-fold cross-validation is better than leave-one-out Bailey fa E lkan (1993) compared leave-one-out cross-ahdation to 632 bootstrap using the FOIL inducer and four synthetic datasets involving Boolean concepts They observed high variability and little bias in the leave-one-out estimates, and low variability but large bias in the 632 estimatesWeiss and Indurkyha (Weiss fa Indurkhya 1994) con-ducted experiments on real world data Lo determine the applicability of cross-validation to decision tree pruning Their results were that for samples at least of size 200 using stratified ten-fold cross-validation to choose the amount of pruning yields unbiased trees (with respect to their optimal size) 4 M e t h o d o l o g yIn order to conduct a large-scale experiment we decided to use 04 5 and a Naive Bayesian classifier The C4 5 algorithm (Quinlan 1993) is a descendent of ID3 that builds decision trees top-down The Naive-Bayesian clas-sifier (Langley, Iba fa Thompson 1992) used was the one implemented in (Kohavi, John, Long, Manley fa Pfleger 1994) that uses the observed ratios for nominal features and assumes a Gaussian distribution for contin-uous features The exact details are not crucial for this paper because we are interested in the behavior of the accuracy estimation methods more than the internals of the induction algorithms The underlying hypothe-sis spaces—decision trees for C4 5 and summary statis-tics for Naive-Bayes—are different enough that we hope conclusions based on these two induction algorithms will apply to other induction algorithmsBecause the target concept is unknown for real-world1140 LEARNINGconcepts, we used the holdout method to estimate the quality of the cross-validation and bootstrap estimates To choose & set of datasets, we looked at the learning curves for C4 5 and Najve-Bayes for most of the super-vised classification dataaets at the UC Irvine repository (Murphy & Aha 1994) that contained more than 500 instances (about 25 such datasets) We felt that a min-imum of 500 instances were required for testing While the true accuracies of a real dataset cannot be computed because we do not know the target concept, we can esti mate the true accuracies using the 
holdout method The "true' accuracy estimates in Table 1 were computed by taking a random sample of the given size computing the accuracy using the rest of the dataset as a test set, and repeating 500 timesWe chose six datasets from a wide variety of domains, such that the learning curve for both algorithms did not flatten out too early that is, before one hundred instances We also added a no inform a tion d l stt, rand, with 20 Boolean features and a Boolean random label On one dataset vehicle, the generalization accu-racy of the Naive-Bayes algorithm deteriorated hy morethan 4% as more instances were g;iven A similar phenomenon was observed on the shuttle dataset Such a phenomenon was predicted by Srhaffer and Wolpert (Schaffer 1994, Wolpert 1994), but we were surprised that it was observed on two real world datasetsTo see how well an Accuracy estimation method per forms we sampled instances from the dataset (uniformly without replacement) and created a training set of the desired size We then ran the induction algorihm on the training set and tested the classifier on the rest of the instances L E I the dataset This was repeated 50 times at points where the lea rning curve wa s sloping up The same folds in cross-validation and the same samples m bootstrap were used for both algorithms compared5 Results and DiscussionWe now show the experimental results and discuss their significance We begin with a discussion of the bias in the estimation methods and follow with a discussion of the variance Due to lack of space, we omit some graphs for the Naive-Bayes algorithm when the behavior is ap-proximately the same as that of C 4 5 5 1 T h e B i a sThe bias of a method to estimate a parameter 0 is de-fined as the expected value minus the estimated value An unbiased estimation method is a method that has zero bias Figure 1 shows the bias and variance of k-fold cross-validation on several datasets (the breast cancer dataset is not shown)The diagrams clearly show that k-fold cross-validation is pessimistically biased, especially for two and five folds For the learning curves that have a large derivative at the measurement point the pessimism in k-fold cross-Figure ] C'4 5 The bias of cross-validation with varying folds A negative K folds stands for leave k-out E rror bars are 95% confidence intervals for (he mean The gray regions indicate 95 % confidence intervals for the true ac curaries Note the different ranges for the accuracy axis validation for small k s is apparent Most of the esti-mates are reasonably good at 10 folds and at 20 folds they art almost unbiasedStratified cross validation (not shown) had similar be-havior, except for lower pessimism The estimated accu-racy for soybe an at 2 fold was 7% higher and at five-fold, 1 1% higher for vehicle at 2-fold, the accuracy was 2 8% higher and at five-fold 1 9% higher Thus stratification seems to be a less biased estimation methodFigure 2 shows the bias and variance for the b32 boot-strap accuracy estimation method Although the 632 bootstrap is almost unbiased for chess hypothyroid, and mushroom for both inducers it is highly biased for soy-bean with C'A 5, vehicle with both inducers and rand with both inducers The bias with C4 5 and vehicle is 9 8%5 2 The VarianceWhile a given method may have low bias, its perfor-mance (accuracy estimation in our case) may be poor due to high variance In the experiments above, we have formed confidence intervals by using the standard de-viation of the mea n a ccura cy We now switch to the standard deviation of 
the population i e , the expected standard deviation of a single accuracy estimation run In practice, if one dots a single cross-validation run the expected accuracy will be the mean reported above, but the standard deviation will be higher by a factor of V50, the number of runs we averaged in the experimentsKOHAVI 1141Table 1 True accuracy estimates for the datasets using C4 5 and Naive-Bayes classifiers at the chosen sample sizesFigure 2 C4 5 The bias of bootstrap with varying sam-ples Estimates are good for mushroom hypothyroid, and chess, but are extremely biased (optimistically) for vehicle and rand, and somewhat biased for soybeanIn what follows, all figures for standard deviation will be drawn with the same range for the standard devi-ation 0 to 7 5% Figure 3 shows the standard devia-tions for C4 5 and Naive Bayes using varying number of folds for cross-validation The results for stratified cross-validation were similar with slightly lower variance Figure 4 shows the same information for 632 bootstrap Cross-validation has high variance at 2-folds on both C4 5 and Naive-Bayes On C4 5, there is high variance at the high-ends too—at leave-one-out and leave-two-out—for three files out of the seven datasets Stratifica-tion reduces the variance slightly, and thus seems to be uniformly better than cross-validation, both for bias and vananceFigure 3 Cross-validation standard deviation of accu-racy (population) Different, line styles are used to help differentiate between curves6 S u m m a r yWe reviewed common accuracy estimation methods in-cluding holdout, cross-validation, and bootstrap, and showed examples where each one fails to produce a good estimate We have compared the latter two approaches on a variety of real-world datasets with differing charac-teristicsProposition 1 shows that if the induction algorithm is stable for a given dataset, the variance of the cross-validation estimates should be approximately the same, independent of the number of folds Although the induc-tion algorithms are not stable, they are approximately stable it fold cross-validation with moderate k values (10-20) reduces the variance while increasing the bias As k decreases (2-5) and the sample sizes get smaller, there is variance due to the instability of the training1142 LEARNING1 igure 4 632 Bootstrap standard deviation in acc-rat y (population)sets themselves leading to an increase in variance this is most apparent for datasets with many categories, such as soybean In these situations) stratification seems to help, but -epeated runs may be a better approach Our results indicate that stratification is generally a better scheme both in terms of bias and variance whencompared to regular cross-validation Bootstrap has low,variance but extremely large bias on some problems We recommend using stratified Len fold cross-validation for model selection A c k n o w l e d g m e n t s We thank David Wolpert for a thorough reading of this paper and many interesting dis-cussions We thank Tom Bylander Brad E fron Jerry Friedman, Rob Holte George John Pat Langley Hob Tibshiram and Sholom Weiss for their helpful com nients and suggestions Dan Sommcrfield implemented Lhe bootstrap method in WLC++ All experiments were conducted using M L C ++ partly partly funded by ONR grant N00014-94-1-0448 and NSF grants IRI 9116399 and IRI-941306ReferencesBailey, T L & E lkan C (1993) stimating the atcuracy of learned concepts, in Proceedings of In ternational Joint Conference on Artificial Intelli-gence , Morgan Kaufmann Publishers, pp 
895 900 Breiman, L & Spector, P (1992) Submodel selectionand evaluation in regression the x random case Inttrnational St atistic al Review 60(3), 291-319 Breiman, L , Friedman, J H , Olshen, R A & StoneC J (1984), Cl a ssific ation a nd Regression Trets Wadsworth International GroupEfron, B (1983), 'E stimating the error rate of a pre-diction rule improvement on cross-validation",Journal of the Americ an St atistic al Associ ation 78(382), 316-330 Efron, B & Tibshiram, R (1993) An introduction tothe bootstra p, Chapman & HallJam, A K Dubes R C & Chen, C (1987), "Boot-strap techniques lor error estimation", IEEE tra ns-actions on p a ttern a n a lysis a nd m a chine intelli-gence P A M I -9(5), 628-633 Kohavi, R , John, G , Long, R , Manley, D &Pfleger K (1994), M L C ++ A machine learn-ing library in C ++ in 'Tools with Artifi-cial Intelligence I E E EComputer Society Press, pp 740-743 Available by anonymous ftp from s t a r r y Stanford E DU pub/ronnyk/mlc/ toolsmlc psLangley, P Tba, W & Thompson, K (1992), An anal-ysis of bayesian classifiers in Proceedings of the tenth national conference on artificial intelligence",A A A I Press and M I T Press, pp 223-228Murph' P M & Aha D W (1994), V( I repository of machine learning databases, For information con-tact ml-repository (Ui(,s uci edu Quinlan I R (1993) C4 5 Progra ms for Ma chine Learning Morgan Kaufmann Los Altos CaliforniaSchaffcr C (19941 A conservation law for generalization performance, in Maehinc Learning Proceedings of Lhe E leventh International conference Morgan Kaufmann, pp 259-265Shao, J (1993), Linear model seletion via cross-validation Journ a l of the America n sta tistica l As-sociation 88(422) 486-494 Weiss S M (1991), Small sample error rate estimationfor k nearest neighbor classifiers' I E EE Tr a ns ac tions on Pa ttern An alysis a nd Ma chine Intelligence 13(3), 285-289 Weiss, S M & lndurkhya N (1994) Decision Lreepruning Biased or optimal, in Proceedings of the twelfth national conference on artificial intel-ligence A A A I Press and M I T Press pp 626-632 Wolpert D H (1992), Stacked generalization , Neura lNetworks 5 241-259 Wolpert D H (1994a) Off training set error and a pri-ori distinctions between learning algorithms, tech-mcal Report SFI TR 94-12-121, The Sante Fe ln-stituteWolpert D II {1994b), The relationship between PAC, the statistical physics framework the Bayesian framework, and the VC framework Technical re-port, The Santa Fe Institute Santa Fe, NMZhang, P (1992), 'On the distributional properties of model selection criteria' Journ al of the America nStatistical Associa tion 87(419), 732-737 KOHAVI 1143。
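The paper's bottom-line recommendation (stratified ten-fold cross-validation for model selection) is easy to reproduce today. Here is a minimal sketch with scikit-learn, using a decision tree as a stand-in for C4.5 and Gaussian Naive Bayes; the dataset and estimator choices are illustrative, not taken from the paper's experiments.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in [("decision tree (C4.5-like)", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=cv)     # accuracy per fold
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```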
Statistical analysis of multiple optical flow values for estimation of unmanned aerial vehicle height above ground

Statistical analysis of multiple optical flow values for estimation of unmanned aerial vehicle height above groundPaul Merrell, Dah-Jye Lee, and Randal BeardDepartment of Electrical and Computer EngineeringBrigham Young University, 459 CBProvo, Utah 84602ABSTRACTFor a UAV to be capable of autonomous low-level flight and landing, the UAV must be able to calculate its current height above the ground. If the speed of the UAV is approximately known, the height of the UAV can be estimated from the apparent motion of the ground in the images that are taken from an onboard camera. One of the most difficult aspects in estimating the height above ground lies in finding the correspondence between the position of an object in one image frame and its new position in succeeding frames. In some cases, due to the effects of noise and the aperture problem, it may not be possible to find the correct correspondence between an object’s position in one frame and in the next frame. Instead, it may only be possible to find a set of likely correspondences and each of their probabilities. We present a statistical method that takes into account the statistics of the noise, as well as the statistics of the correspondences. This gives a more robust method of calculating the height above ground on a UAV.Keywords: Unmanned Aerial Vehicle, Height above Ground, Optical Flow, Autonomous Low-level Flight, Autonomous Landing1.INTRODUCTIONOne of the fundamental problems of computer vision is how to reconstruct a 3D scene using a sequence of 2D images taken from a moving camera in the scene. A robust and accurate solution to this problem would have many important applications for UAVs and other mobile robots. One key application is height above ground estimation. With an accurate estimate of a UAV’s height, the UAV would be capable of autonomous low-level flight and autonomous landing. It would also be possible to reconstruct an elevation map of the ground with a height above ground measurement. The process of recovering 3D structure from motion is typically accomplished in two separate steps. First, optical flow values are calculated for a set of feature points using image data. From this set of optical flow values, the motion of the camera is estimated, as well as the depth of the objects in the scene.Noise from many sources prevents us from finding a completely accurate optical flow estimate. A better understanding of the noise could provide a better end result. Typically, a single optical flow value is calculated at each feature point, but this is not the best approach because it ignores important information about the noise. A better approach would be to examine each possible optical flow value and calculate the probability that each is the correct optical flow based on the image intensities. The result would be a calculation for not just one optical flow value, but an optical flow distribution. This new approach has several advantages. It allows us to quantify the accuracy of each feature point and then rely more heavily upon the more accurate feature points. The more accurate feature points will have lower variances in their optical flow distributions. A feature point also may have a lower variance in one direction over another, meaning the optical flow estimate is more accurate in that direction. All of this potentially valuable information is lost if only a single optical flow value is calculated at each feature point.Another advantage is that this new method allows us to effectively deal with the aperture problem. 
The aperture problem occurs on points of the image where there is a spatial gradient in only one direction. At such edge points, it is impossible to determine the optical flow in the direction parallel to the gradient. However, often it is possible to obtain a precise estimate of the optical flow in direction perpendicular to the gradient. These edge points can not be used by any method which uses only a single optical flow value because the true optical flow is unknown. However, this problem can easily be avoided if multiple optical flow values are allowed. Even though edge points are P.C. Merrell, D.J. Lee, and R.W. Beard, “Statistical Analysis of Multiple Optical Flow Valuesfor Estimation of Unmanned Air Vehicles Height Above Ground”, SPIE Optics East, Robotics Technologies and Architectures, Intelligent Robots and Computer Vision XXII, vol. 5608, p. 298-305, Philadelphia, PA, USA, October 25-typically ignored, they do contain useful information that can produce more accurate results. In fact, it is possible to reconstruct a scene that contains only edges with no corner points at all. Consequently, this method is more robust because it does not require the presence of any corners in the image.2. RELATED WORKA significant amount of work has been done to try to use vision for a variety of applications on a UAV, such as terrain-following [1], navigation [2], and autonomous landing on a helicopter pad [3]. Vision-based techniques have also been used for obstacle avoidance [4,5] on land robots. We hope to provide a more accurate and robust vision system by using multiple optical flow values.Dellaert et al. explore a similar idea [6] to the one presented here. Their method is also based upon the principle that the exact optical flow or correspondence between feature points in two images is unknown. They attempt to calculate structure from motion without a known optical flow. However, they do assume that each feature point in one image corresponds to one of a number of feature points in a second image. The idea we are proposing is less restrictive because it allows a possible correspondence between any nearby pixels and then calculates the probability of each correspondence.Langer and Mann [7] discuss scenarios in which the exact optical flow is unknown, but the optical flow is known to be one of a 1D or 2D set of optical flow. Unlike the method described here, their method does not compute the probability of each possible optical flow in the set.3. METHODOLOGY3.1. Optical flow probabilityThe image intensity at position x and at time t can be modeled as a signal plus white Gaussian noise.),(),(),(t N t S t I x x x +=(1)where I (x ,t ) represents image intensity, S (x ,t ) represents the signal, and N (x ,t ) represents the noise. Over a sufficiently small range of positions and with a sufficiently small time step, the change in the signal can be expressed as a simple translation),(),(dt t S t S +=+x U x (2)where U is the optical flow vector between the two frames. The shifted difference between the images,),(),(),(),(dt t N t N dt t I t I +−+=+−+x U x x U x , (3)is a Gaussian random process.The probability that a particular optical flow, u , is the correct optical flow value based on the image intensities is proportional to:222)),(),((221)],(),,(|[σπσdt t I t I e dt t I t I P +−+−∝++=x u x x u x u U , (4) for some 2σ. 
Repeating the same analysis over a window of neighboring positions, W ∈x , and assumingthat the noise is white, the optical flow probability can be calculated as∏∈+−+−∝=W dt t I t I e I P x x u x u U 222)),(),((221]|[σπσ. (5)The value of ),(t I u x + may need to be estimated through some kind of interpolation, since the image data usually comes in discrete samples or pixels, but the value of u in general is not an integer value.Simoncelli et al. [8] describe an alternative gradient-based method for calculating the probability distribution. This method calculates optical flow based on the spatial gradient and temporal derivative of the image. Noise comes in the form of error in the temporal derivative, as well as a breakdown of the planarity assumption in equation (2). The optical flow probability distribution ends up always being Gaussian. The probability distribution calculated from equation (5) can be more complex than a Gaussian distribution.3.2 Rotation and translation estimationAfter having established an equation that relates the probability of an optical flow value with the image data, this relationship is used to calculate the probability of a particular camera rotation, , and translation, T , given the image data, I . This can be accomplished by taking the expected value for all possible optical flow values, u , then applying Bayes’ rule .u I u I u T u I u T I T d P P d P P Ω=Ω=Ω)|(),|,()|,,()|,(. (6)Our estimate of and T is only based on the optical flow value, so )|,(),|,(u T I u T Ω=ΩP P . Using Bayes’ rule twice more yields:ΩΩ=Ω=Ωu I u u T u T u I u u T I T d P P P P d P P P )|()(),|(),()|()|,()|,(. (7)The value of u is a continuous variable, so it does not come in discrete quantities. However, since we do not have a closed form solution to the integral in equation (7), the integral is approximated by computing a summation over discrete samples of u . In the following sections, each of the terms in equation (7) will be examined. Once a solution has been found for each term, the final step will be to find the most likely rotation and translation. The optimal rotation and translation can be found using a genetic algorithm.3.3. Calculating P (u | ,T)The optical flow vector, []T y x u u =u, at the image position (x ,y ) for a given rotation,[]T z y xωωω=Ω and translation, []T z y x t t t =T , is approximately equal tox f xy f f y Z yt f t u y f f x f xy Z xt f t u z y x z y y z y x z x xϖϖϖϖϖϖ−− +++−=+ +−++−=22 (8)where f is the focal length of the camera and Z is the depth of the object in the scene [9]. For each possible camera rotation and translation, we can come up with a list of possible optical flow vectors. While the exact optical flow vector is unknown, since the depth, Z , is unknown, we do know from rearranging the terms in equation (8) that the optical flow vector is somewhere along an epipolar line. The epipolar line is given by:+ +−+−− +=+−+−=+=x f f x f xy m x f xy f f y b xt f t xt f t m bmu u z y x z y x z x zy x y ϖϖωϖϖω22 (9)Furthermore, since the depth, Z , is positive (because it is not possible to see objects behind the camera), we also know thaty f f x f xy u f t x t z y x x x z ϖϖϖ+++> >2, and (10)y f f x f xy u f t x t z y x x x z ϖϖϖ+ ++< <2.If an expression for the probability of the depth, P(Z ), is known, then the probability of an optical flow vector for a given rotation and translation is given by:+≠+=+ +−+−==Ωb mu u b mu u y f f x f xy xt f t Z P P x y x y z y x z x ,0,),|(2ϖϖϖT u (11)3.4. 
Calculating P (,T) and P (u)The two remaining terms in equation (7) for which a solution must be found are P(,T ) and P(u ). P(,T ) depends upon the characteristics of the UAV and how it is flown. We assume that the rotation is distributed normally with a variance of 2x σ,2y σ , and 2z σ for each of the three components of rotation, roll, pitch, and yaw. For height above ground estimation, the optimal position for the camera is facing directly at the ground. We assume that the motion of the UAV is usually perpendicular to the motion of the optical center of the camera and is usually along the y -axis of the camera coordinate system.=Ω222000000,000~z y x z y x N σσσϖϖϖ, =222000000,010~tz ty tx z y x N t t t σσσT .(12)Using these distributions, an expression for P (u ) can also be obtained. Equation (8) can be separated in two parts that depend upon the rotation and translation of the camera.Ω+=Ω+=2211R T Q R T Q y x u u (13)In this case and T are both random vectors with the distribution given in equation (12).++++++++ ΩΩ++=++=Ω=Ω222222222221221221221221221221221221212212212212111211,00~])[(])[(0][z z y y x x z z z y y y x x x z z z y y y x x x z z y y x x z z y y x x z z y y x x r r r r r r r r r r r r r r r r r r N r r r r r r E E E σσσσσσσσσσσσσσσωωωR R R R (14)For simplicity, we will assume in the next equation that the variation in the translation, 2tx σ,2ty σ , and 2tz σ, isnegligible. Now an expression for the value of P (u ) is obtained by using the probability density function in (14) and a probability density function for the depth, P (Z ))(21z Z P z f u u P u u P y x z y x = −= ΩΩ= R R . (15)3.5. Depth estimationOnce the rotation and translation of the camera between two frames have been estimated, we are now able to estimate the depth of the objects in the scene. For a given rotation, translation, and depth, an optical flow value can be calculated using equation (8). By using equation (5), we can find a probability for the depth Z xy at image position (x ,y ). The depth at position (x+1,y ) is likely to be close to the depth at position (x ,y ), for the simple reason that neighboring pixels are likely to be part of the same object and that object is likely to be smooth. Using this information, we can obtain a better estimate of the depth at one pixel by examining those pixels around it.∏ ∏ ∏∈∈∈===W j i ijij ij ij xy W j i ij ij ij xyW j i ij xy xy dZ I Z P Z Z P dZ I Z Z P I Z P Z P ,,,)|()|()|,()|()|(I (16) where I ij is the image data at position (i ,j ) and W is a set of positions close to the position (x ,y ). This approach adds smoothness and reduces noise. There is the possibility that this method could smooth over depth discontinuities, but it does not smooth over depth discontinuities if we have a high confidence that one exists. This method has the effect that if we are fairly confident that a pixel has a certain depth, it is left unchanged, but if we have very little depth information at that pixel we change the depth to a value closer to one of its neighbors. An additional way to improve our result is to use image data from more than just two frames to calculate the value of P (Z ij |I ij ).4. RESULTSFigures 1 and 2 show results obtained from synthetic data. The data used in Figure 1 was created by texture mapping an aerial photograph onto a sloped landscape. The advantage of using synthetic data instead of real-camera footage is that the true 3D structure of the scene is known, so the recovered depth can be directly compared with its true value. 
In Figures 1 through 5, the results are displayed in the form of inverse depth or one over the actual depth. A darker color indicates that the objects in the scene at that part of the image are further away from the camera. A lighter color indicates they are closer. In Figures 1 and 2, the recovered depth is fairly close to its true value. In each case, the depth is slightly overestimated.Figure 1: One frame from a sequence of images is shown (left). The recovered depth from this sequence of images (middle) is shown next to the actual depth (right). The camera is moving perpendicular to the direction it is facing.Figure 2: One frame from a sequence of images is shown (left). The recovered depth from this sequence of images (middle) is shown next to the actual depth (right). The camera is moving towards the direction it is facing.Figure 3: Two frames (left, middle) from a sequence of images demonstrating the aperture problem. The recovered depth is shown (right).Figure 4: Two frames (left, middle) from a sequence of images taken directly from a camera onboard a UAV while it is flying towards a tree. The recovered depth is shown (right).Figure 5: Two frames (left, middle) from a sequence of images taken from a camera onboard a UAV while it is flying low to the ground. The recovered depth is shown (right).Figure 3 shows two frames from a sequence of images taken from a rotating camera moving towards a scene that has a black circle in the foreground and a white oval in the background. The scene is a bit contrived, but its purpose is to demonstrate that the aperture problem can be overcome. Every point in the image, when examined closely, is either completely blank or contains a single approximately straight edge. Essentially, every point in the image is affected by the aperture problem. However, with our new method, the aperture problem can be effectively dealt with. The camera was rotated in the z -direction one degree per frame. The camera rotation was estimated to be correct with an error of only 0.034˚. The left image in Figure 3 shows the recovered depth. The gray part of this image is where there was not enough information to detect the depth (since it is impossible to obtain depth information from a blank wall). The white oval was correctly determined to be behind the black circle. This demonstrates that it is possible to extract useful information about the camera movement, as well as the object’s depth from a scene with no corners in it.Figures 4 and 5 show results from real data taken from a camera onboard a UAV. In Figure 4, the camera is moving towards a tree. The results are fairly noisy, but they do show a close object near where the tree should be. In Figure 5, the camera facing the ground and moving perpendicular to it. The results are fairly good with one exception. The UAV is so close to the ground that its shadow appears in the image. The shadow moves forward along with theUAV which violates our assumption that the scene is static. Consequently, the recovered depth at the border of the shadow is overestimated. This error only occurs in a small part of the image and is not a significant problem.The results are fairly accurate and appear to be satisfactory for both autonomous low-level flight and landing. However, this level of accuracy does not come without a price. This method is fairly computationally intense. 
We have not yet been able to run this algorithm in real-time, but we hope to do so in the near future.5.CONCLUSIONSIn future research, we hope to investigate several potential improvements to this method. First, a more sophisticated method of calculating the optical flow distributions may be necessary and could be very valuable. Second, there are many other structure from motion algorithms that perform very well [10, 11, and 12], besides the statistical method described here. Anyone of these methods could be extended to allow multiple optical flow values.We have presented a novel method to compute structure from motion. This method is unique in its ability to quantitatively describe the noise in the optical flow estimate from image data and use that information to its advantage. The resulting algorithm is more robust and, in many cases, more accurate than methods that use only a single optical flow value.REFERENCESter, T. and N. Francheschini. “A robotic aircraft that follows terrain using a neuromorphic eye,” Conf. IntelligentRobots and System, vol. 1, pp. 129-134, 2002.2. B. Sinopoli, M. Micheli, G. Donato, and T. J. Koo, “Vision based navigation for an unmanned aerial vehicle,” Proc.Conf. Robotics and Automation, pp. 1757-1764, 2001.3.S. Saripalli, J. F. Montgomery, and G. S. Sukhatme. “Vision-based autonomous landing of an unmanned aerialvehicle,” Proc. Conf Robotics and Automation. Vol. 3, pp 2799-2804, 2002.4.L. M. Lorigo, R. A. Brooks, and W. E. L. Grimsou. “Visually-guided obstacle avoidance in unstructuredenvironments,” Proc. Conf. Intelligent Robots and Systems. Vol. 1, pp 373-379, 1997.5.M. T. Chao, T. Braunl, and A. Zaknich. “Visually-guided obstacle avoidance,” Proc. Conf. Neural InformationProcessing. Vol. 2, pp. 650-655, 1999.6. F. Dellaert, S.M. Seitz, C.E. Thrope, and S. Thrun, “Structure From Motion without Correspondence,” Proc. Conf.Computer Vision and Pattern Recognition, pp. 557-564, 2000.7.M. S. Langer and R. Mann. “Dimensional analysis of image motion,” Proc. Conf. Computer Vision, pp. 155-162,2001.8. E. P. Simoncelli, E. H. Adelson, and D. J. Heeger, “Probability distributions of optical flow,” Proc. Conf. ComputerVision and Pattern Recognition, pp. 310-315, 1991.9.Z. Duric and A. Rosenfeld. “Shooting a smooth video with a shaky camera,” Machine Vision and Applications, Vol.13, pp. 303-313, 2003.10.S. Soatto and R. Brocket, “Optimal Structure from Motion: Local Ambiguities and Global Estimates,” Proc. Conf.Computer Vision and Pattern Recognition, pp. 282-288, 1998.11.I. Thomas and E. Simoncelli. Linear Structure from Motion. Technical Report IRCS 94-26, University ofPennsylvania, 1994.12.J. Weng, N Ahuja, and T. Huang. “Motion and structure from two perspective views: algorithms, error analysis, anderror estimation.” IEEE Trans. Pattern Anal. Mach. Intell. 11 (5): 451-476, 1989.。
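The windowed, shifted-difference likelihood of equation (5) can be implemented directly. The sketch below is not the authors' code; the window size, σ, and the candidate displacement grid are placeholders, and the pixel (x, y) is assumed to lie far enough from the image border.

```python
import numpy as np

def flow_probability(I0, I1, x, y, sigma=5.0, win=5, max_disp=4):
    """Normalized distribution P[U = u | I] over integer candidate displacements
    u = (ux, uy), per eq. (5): a product over a window of Gaussian factors in the
    shifted image difference I1(x + u) - I0(x)."""
    half = win // 2
    disps = list(range(-max_disp, max_disp + 1))
    logp = np.full((len(disps), len(disps)), -np.inf)
    patch0 = I0[y - half:y + half + 1, x - half:x + half + 1]
    for a, ux in enumerate(disps):
        for b, uy in enumerate(disps):
            patch1 = I1[y + uy - half:y + uy + half + 1,
                        x + ux - half:x + ux + half + 1]
            if patch1.shape == patch0.shape:          # skip shifts that fall off the image
                logp[a, b] = -np.sum((patch1 - patch0) ** 2) / (2 * sigma ** 2)
    p = np.exp(logp - logp.max())
    return p / p.sum()                                # distribution over (ux, uy)
```

At an edge point the resulting distribution is elongated along the edge rather than peaked, which is exactly the extra information the paper exploits instead of discarding such points.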
The Unscented Kalman Filter for nonlinear estimation

The Unscented Kalman Filter for Nonlinear EstimationEric A.Wan and Rudolph van der MerweOregon Graduate Institute of Science&Technology20000NW Walker Rd,Beaverton,Oregon97006ericwan@,rvdmerwe@AbstractThe Extended Kalman Filter(EKF)has become a standardtechnique used in a number of nonlinear estimation and ma-chine learning applications.These include estimating thestate of a nonlinear dynamic system,estimating parame-ters for nonlinear system identification(e.g.,learning theweights of a neural network),and dual estimation(e.g.,theExpectation Maximization(EM)algorithm)where both statesand parameters are estimated simultaneously.This paper points out theflaws in using the EKF,andintroduces an improvement,the Unscented Kalman Filter(UKF),proposed by Julier and Uhlman[5].A central andvital operation performed in the Kalman Filter is the prop-agation of a Gaussian random variable(GRV)through thesystem dynamics.In the EKF,the state distribution is ap-proximated by a GRV,which is then propagated analyti-cally through thefirst-order linearization of the nonlinearsystem.This can introduce large errors in the true posteriormean and covariance of the transformed GRV,which maylead to sub-optimal performance and sometimes divergenceof thefilter.The UKF addresses this problem by using adeterministic sampling approach.The state distribution isagain approximated by a GRV,but is now represented usinga minimal set of carefully chosen sample points.These sam-ple points completely capture the true mean and covarianceof the GRV,and when propagated through the true non-linear system,captures the posterior mean and covarianceaccurately to the3rd order(Taylor series expansion)for anynonlinearity.The EKF,in contrast,only achievesfirst-orderaccuracy.Remarkably,the computational complexity of theUKF is the same order as that of the EKF.Julier and Uhlman demonstrated the substantial perfor-mance gains of the UKF in the context of state-estimationfor nonlinear control.Machine learning problems were notconsidered.We extend the use of the UKF to a broader classof nonlinear estimation problems,including nonlinear sys-tem identification,training of neural networks,and dual es-timation problems.Our preliminary results were presentedin[13].In this paper,the algorithms are further developedand illustrated with a number of additional examples.While a number of optimization approaches exist(e.g., gradient descent using backpropagation),the EKF may be used to estimate the parameters by writing a new state-space representation(4)(5) where the parameters correspond to a stationary pro-cess with identity state transition matrix,driven by process noise(the choice of variance determines tracking per-formance).The output corresponds to a nonlinear obser-vation on.The EKF can then be applied directly as an efficient“second-order”technique for learning the parame-ters.In the linear case,the relationship between the Kalman Filter(KF)and Recursive Least Squares(RLS)is given in [3].The use of the EKF for training neural networks has been developed by Singhal and Wu[9]and Puskorious and Feldkamp[8].Dual EstimationA special case of machine learning arises when the inputis unobserved,and requires coupling both state-estimation and parameter estimation.For these dual estimation prob-lems,we again consider a discrete-time nonlinear dynamic system,(6)(7) where both the system states and the set of model param-eters for the dynamic system must be simultaneously esti-mated from only the observed noisy signal.Approaches to dual-estimation are 
discussed in Section4.2.In the next section we explain the basic assumptions and flaws with the using the EKF.In Section3,we introduce the Unscented Kalman Filter(UKF)as a method to amend the flaws in the EKF.Finally,in Section4,we present results of using the UKF for the different areas of nonlinear estima-tion.2.The EKF and its FlawsConsider the basic state-space estimation framework as in Equations1and2.Given the noisy observation,a re-cursive estimation for can be expressed in the form(see [6]),prediction of prediction of(8) This recursion provides the optimal minimum mean-squared error(MMSE)estimate for assuming the prior estimate and current observation are Gaussian Random Vari-ables(GRV).We need not assume linearity of the model. The optimal terms in this recursion are given by(9)(10)(11) where the optimal prediction of is written as,and corresponds to the expectation of a nonlinear function of the random variables and(similar interpretation for the optimal prediction).The optimal gain termis expressed as a function of posterior covariance matrices (with).Note these terms also require tak-ing expectations of a nonlinear function of the prior state estimates.The Kalmanfilter calculates these quantities exactly in the linear case,and can be viewed as an efficient method for analytically propagating a GRV through linear system dy-namics.For nonlinear models,however,the EKF approxi-mates the optimal terms as:(12)(13)(14) where predictions are approximated as simply the function of the prior mean value for estimates(no expectation taken)1 The covariance are determined by linearizing the dynamic equations(),and then determining the posterior covariance matrices analyt-ically for the linear system.In other words,in the EKF the state distribution is approximated by a GRV which is then propagated analytically through the“first-order”lin-earization of the nonlinear system.The readers are referred to[6]for the explicit equations.As such,the EKF can be viewed as providing“first-order”approximations to the op-timal terms2.These approximations,however,can intro-duce large errors in the true posterior mean and covariance of the transformed(Gaussian)random variable,which may lead to sub-optimal performance and sometimes divergence of thefilter.It is these“flaws”which will be amended in the next section using the UKF.3.The Unscented Kalman FilterThe UKF addresses the approximation issues of the EKF. 
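Before the unscented transform is defined formally, the following sketch previews the difference numerically: a Gaussian in polar coordinates is pushed through a polar-to-Cartesian conversion once by propagating only the mean (the EKF-style first-order approximation of the predicted mean) and once with scaled sigma points, with a brute-force Monte Carlo average as reference. The nonlinearity, noise levels, and parameter values (alpha, beta, kappa) are illustrative choices, not taken from the paper.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1e-3, beta=2.0, kappa=0.0):
    """Scaled sigma points and weights for the unscented transform."""
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)           # matrix square root of (n+lam)*cov
    pts = np.vstack([mean, mean + S.T, mean - S.T])   # 2n+1 sigma points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + 1.0 - alpha**2 + beta
    return pts, wm, wc

def polar_to_cart(p):                                  # the nonlinearity g(x)
    r, th = p[..., 0], p[..., 1]
    return np.stack([r * np.cos(th), r * np.sin(th)], axis=-1)

mean = np.array([1.0, np.pi / 2])                      # range/bearing estimate
cov = np.diag([0.02**2, 0.3**2])                       # large bearing uncertainty

# EKF-style prediction of the mean: push only the prior mean through g
ekf_mean = polar_to_cart(mean)

# Unscented transform: push every sigma point through g and re-average
pts, wm, _ = sigma_points(mean, cov)
ut_mean = wm @ polar_to_cart(pts)

# Monte Carlo reference for the true mean of the transformed variable
samples = np.random.default_rng(0).multivariate_normal(mean, cov, 100_000)
mc_mean = polar_to_cart(samples).mean(axis=0)

print("EKF-style:", ekf_mean, " UT:", ut_mean, " Monte Carlo:", mc_mean)
```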
The state distribution is again represented by a GRV,but is now specified using a minimal set of carefully chosen sample points.These sample points completely capture the true mean and covariance of the GRV,and when propagated through the true non-linear system,captures the posterior mean and covariance accurately to the3rd order(Taylor se-ries expansion)for any nonlinearity.To elaborate on this,we start byfirst explaining the unscented transformation.The unscented transformation(UT)is a method for cal-culating the statistics of a random variable which undergoesa nonlinear transformation[5].Consider propagating a ran-dom variable(dimension)through a nonlinear function, .Assume has mean and covariance.To calculate the statistics of,we form a matrix ofsigma vectors(with corresponding weights),accord-ing to the following:(15)where is a scaling parameter.deter-mines the spread of the sigma points around and is usually set to a small positive value(e.g.,1e-3).is a secondary scaling parameter which is usually set to0,and is used to incorporate prior knowledge of the distribution of(for Gaussian distributions,is optimal).Initialize with:For, Calculate sigma points:rameters from the noisy data(see Equation7).As expressed earlier,a number of algorithmic approaches ex-ist for this problem.We present results for the Dual UKF and Joint UKF.Development of a Unscented Smoother for an EM approach[2]was presented in[13].As in the prior state-estimation example,we utilize a noisy time-series ap-plication modeled with neural networks for illustration of the approaches.In the the dual extended Kalmanfilter[11],a separate state-space representation is used for the signal and the weights. The state-space representation for the state is the same as in Equation20.In the context of a time-series,the state-space representation for the weights is given by(21)(22) where we set the innovations covariance equal to3. 
Two EKFs can now be run simultaneously for signal and weight estimation.At every time-step,the current estimateof the weights is used in the signal-filter,and the current es-timate of the signal-state is used in the weight-filter.In the new dual UKF algorithm,both state-and weight-estimation are done with the UKF.Note that the state-transition is lin-ear in the weightfilter,so the nonlinearity is restricted to the measurement equation.In the joint extended Kalmanfilter[7],the signal-state and weight vectors are concatenated into a single,joint state vector:.Estimation is done recursively by writ-ing the state-space equations for the joint state as:(23)(24)and running an EKF on the joint state-space4to produce simultaneous estimates of the states and.Again,our approach is to use the UKF instead of the EKF.Dual Estimation ExperimentsWe present results on two time-series to provide a clear il-lustration of the use of the UKF over the EKF.Thefirst series is again the Mackey-Glass-30chaotic series with ad-ditive noise(SNR3dB).The second time series(also chaotic)comes from an autoregressive neural network with random weights driven by Gaussian process noise and alsoF parameter estimationAs part of the dual UKF algorithm,we implemented the UKF for weight estimation.This represents a new param-eter estimation technique that can be applied to such prob-lems as training feedforward neural networks for either re-gression or classification problems.Recall that in this case we write a state-space representa-tion for the unknown weight parameters as given in Equa-tion5.Note that in this case both the UKF and EKF are or-der(is the number of weights).The advantage of the UKF over the EKF in this case is also not as obvious,as the state-transition function is linear.However,as pointed out earlier,the observation is nonlinear.Effectively,the EKF builds up an approximation to the expected Hessian by tak-ing outer products of the gradient.The UKF,however,may provide a more accurate estimate through direct approxima-tion of the expectation of the Hessian.Note another distinct advantage of the UKF occurs when either the architecture or error metric is such that differentiation with respect to the parameters is not easily derived as necessary in the EKF. The UKF effectively evaluates both the Jacobian and Hes-sian precisely through its sigma point propagation,without the need to perform any analytic differentiation.We have performed a number of experiments applied to training neural networks on standard benchmark data.Fig-ure4illustrates the differences in learning curves(averaged over100experiments with different initial weights)for the Mackay-Robot-Arm dataset and the Ikeda chaotic time se-ries.Note the slightly faster convergence and lowerfinal MSE performance of the UKF weight training.While these results are clearly encouraging,further study is still neces-sary to fully contrast differences between UKF and EKFFigure4:Comparison of learning curves for the EKF and UKF training.a)Mackay-Robot-Arm,2-12-2MLP,b)Ikeda time series,10-7-1MLP.5.Conclusions and future workThe EKF has been widely accepted as a standard tool in the machine learning community.In this paper we have pre-sented an alternative to the EKF using the unscentedfil-ter.The UKF consistently achieves a better level of ac-curacy than the EKF at a comparable level of complexity. 
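As a rough illustration of the weight-estimation formulation discussed above — a random-walk state model for the weights with a nonlinear observation — the sketch below runs a sigma-point measurement update to fit a toy two-parameter model. It is a simplified stand-in, not the authors' implementation: the observation model g, the noise covariances, and the data are invented for the example.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1e-3, beta=2.0, kappa=0.0):
    """Scaled sigma points and weights for the unscented transform."""
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)
    pts = np.vstack([mean, mean + S.T, mean - S.T])
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + 1.0 - alpha**2 + beta
    return pts, wm, wc

def g(x, w):
    """Toy nonlinear observation ('network' output) for input x and weights w."""
    return w[0] * np.tanh(w[1] * x)

rng = np.random.default_rng(1)
true_w = np.array([1.5, 0.7])
w_est = np.array([1.0, 1.0])           # initial weight guess
P = np.eye(2)                          # weight covariance
Rr, Re = 1e-4 * np.eye(2), 0.05        # random-walk (process) and observation noise

for k in range(300):
    x = rng.uniform(-3.0, 3.0)
    d = g(x, true_w) + rng.normal(0.0, np.sqrt(Re))
    P = P + Rr                                        # time update of the weight state
    Wsig, wm, wc = sigma_points(w_est, P)             # measurement update via sigma points
    D = np.array([g(x, wi) for wi in Wsig])           # propagate each sigma point
    d_hat = wm @ D
    Pdd = wc @ (D - d_hat)**2 + Re                    # innovation variance
    Pwd = (Wsig - w_est).T @ (wc * (D - d_hat))       # weight/observation cross-covariance
    K = Pwd / Pdd                                     # Kalman gain
    w_est = w_est + K * (d - d_hat)
    P = P - np.outer(K, K) * Pdd

print("estimated weights:", w_est, "  true weights:", true_w)
```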
We have demonstrated this performance gain in a number of application domains, including state-estimation, dual estimation, and parameter estimation. Future work includes additional characterization of performance benefits, extensions to batch learning and non-MSE cost functions, as well as application to other neural and non-neural (e.g., parametric) architectures. In addition, we are also exploring the use of the UKF as a method to improve Particle Filters [10], as well as an extension of the UKF itself that avoids the linear update assumption by using a direct Bayesian update [12].

6. References

[1] J. de Freitas, M. Niranjan, A. Gee, and A. Doucet. Sequential Monte Carlo methods for optimisation of neural network models. Technical Report CUED/F-INFENG/TR-328, Dept. of Engineering, University of Cambridge, Nov. 1998.
[2] A. Dempster, N. Laird, and D. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39:1-38, 1977.
[3] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Inc., 3rd edition, 1996.
[4] S. J. Julier. The Scaled Unscented Transformation. To appear in Automatica, February 2000.
[5] S. J. Julier and J. K. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems. In Proc. of AeroSense: The 11th Int. Symp. on Aerospace/Defence Sensing, Simulation and Controls, 1997.
[6] F. L. Lewis. Optimal Estimation. John Wiley & Sons, Inc., New York, 1986.
[7] M. B. Matthews. A state-space approach to adaptive nonlinear filtering using recurrent neural networks. In Proceedings IASTED Internat. Symp. Artificial Intelligence Application and Neural Networks, pages 197-200, 1990.
[8] G. Puskorius and L. Feldkamp. Decoupled Extended Kalman Filter Training of Feedforward Layered Networks. In IJCNN, volume 1, pages 771-777, 1991.
[9] S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman filter. In Advances in Neural Information Processing Systems 1, pages 133-140, San Mateo, CA, 1989. Morgan Kaufmann.
[10] R. van der Merwe, J. F. G. de Freitas, A. Doucet, and E. A. Wan. The Unscented Particle Filter. Technical report, Dept. of Engineering, University of Cambridge, 2000. In preparation.
[11] E. A. Wan and A. T. Nelson. Neural dual extended Kalman filtering: applications in speech enhancement and monaural blind signal separation. In Proc. Neural Networks for Signal Processing Workshop. IEEE, 1997.
[12] E. A. Wan and R. van der Merwe. The Unscented Bayes Filter. Technical report, CSLU, Oregon Graduate Institute of Science and Technology, 2000. In preparation (/nsel).
[13] E. A. Wan, R. van der Merwe, and A. T. Nelson. Dual Estimation and the Unscented Transformation. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 666-672. MIT Press, 2000.
Nonlinear least squares: the Levenberg–Marquardt method

Levenberg–Marquardt Method

Levenberg–Marquardt is a popular alternative to the Gauss–Newton method for finding the minimum of a function $F(x)$ that is a sum of squares of nonlinear functions,

$$F(x) = \frac{1}{2}\sum_{i=1}^{m}[f_i(x)]^2.$$

Let the Jacobian of $f_i(x)$ be denoted $J_i(x)$; then the Levenberg–Marquardt method searches in the direction given by the solution $p_k$ to the equations

$$(J_k^{T}J_k + \lambda_k I)\,p_k = -J_k^{T} f_k,$$

where $\lambda_k$ are nonnegative scalars and $I$ is the identity matrix. The method has the nice property that, for some scalar $\Delta$ related to $\lambda_k$, the vector $p_k$ is the solution of the constrained subproblem of minimizing $\|J_k p + f_k\|_2^2 / 2$ subject to $\|p\|_2 \le \Delta$ (Gill et al. 1981, p. 136). The method is used by the Mathematica command FindMinimum[f, x, x0] when given the Method -> LevenbergMarquardt option.

SEE ALSO: Minimum, Optimization

REFERENCES:
Bates, D. M. and Watts, D. G. Nonlinear Regression and Its Applications. New York: Wiley, 1988.
Gill, P. R.; Murray, W.; and Wright, M. H. "The Levenberg-Marquardt Method." §4.7.3 in Practical Optimization. London: Academic Press, pp. 136-137, 1981.
Levenberg, K. "A Method for the Solution of Certain Problems in Least Squares." Quart. Appl. Math. 2, 164-168, 1944.
Marquardt, D. "An Algorithm for Least-Squares Estimation of Nonlinear Parameters." SIAM J. Appl. Math. 11, 431-441, 1963.

Levenberg–Marquardt algorithm (from Wikipedia, the free encyclopedia)

In mathematics and computing, the Levenberg–Marquardt algorithm (LMA)[1] provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. These minimization problems arise especially in least squares curve fitting and nonlinear programming.

The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. The LMA can also be viewed as Gauss–Newton using a trust region approach.

The LMA is a very popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems. However, the LMA finds only a local minimum, not a global minimum.

Caveat Emptor

One important limitation that is very often overlooked is that the LMA only optimizes for residual errors in the dependent variable (y). It thereby implicitly assumes that any errors in the independent variable are zero, or at least that the ratio of the two is so small as to be negligible. This is not a defect; it is intentional, but it must be taken into account when deciding whether to use this technique for a fit. While this may be suitable in the context of a controlled experiment, there are many situations where this assumption cannot be made. In such situations, either non-least-squares methods should be used, or the least-squares fit should be done in proportion to the relative errors in the two variables, not simply the vertical "y" error. Failing to recognize this can lead to a fit which is significantly incorrect and fundamentally wrong; it will usually underestimate the slope. This may or may not be obvious to the eye. Microsoft Excel's chart offers a trend fit that has this limitation, and the limitation is undocumented. Users often fall into this trap, assuming the fit is correctly calculated for all situations.
The OpenOffice spreadsheet copied this feature and presents the same problem.

The problem

The primary application of the Levenberg–Marquardt algorithm is the least squares curve fitting problem: given a set of $m$ empirical datum pairs of independent and dependent variables, $(x_i, y_i)$, optimize the parameters $\beta$ of the model curve $f(x, \beta)$ so that the sum of the squares of the deviations

$$S(\beta) = \sum_{i=1}^{m}\bigl[y_i - f(x_i, \beta)\bigr]^2$$

becomes minimal.

The solution

Like other numeric minimization algorithms, the Levenberg–Marquardt algorithm is an iterative procedure. To start a minimization, the user has to provide an initial guess for the parameter vector, $\beta$. In many cases, an uninformed standard guess like $\beta^{T} = (1, 1, \dots, 1)$ will work fine; in other cases, the algorithm converges only if the initial guess is already somewhat close to the final solution.

In each iteration step, the parameter vector $\beta$ is replaced by a new estimate, $\beta + \delta$. To determine $\delta$, the functions $f(x_i, \beta + \delta)$ are approximated by their linearizations

$$f(x_i, \beta + \delta) \approx f(x_i, \beta) + J_i\,\delta,$$

where $J_i = \partial f(x_i, \beta)/\partial \beta$ is the gradient (a row vector in this case) of $f$ with respect to $\beta$.

At the minimum of the sum of squares $S(\beta)$, the gradient of $S$ with respect to $\delta$ will be zero. The above first-order approximation of $f(x_i, \beta + \delta)$ gives

$$S(\beta + \delta) \approx \sum_{i=1}^{m}\bigl[y_i - f(x_i, \beta) - J_i\,\delta\bigr]^2,$$

or, in vector notation,

$$S(\beta + \delta) \approx \|y - f(\beta) - J\delta\|^2.$$

Taking the derivative with respect to $\delta$ and setting the result to zero gives

$$(J^{T}J)\,\delta = J^{T}\bigl[y - f(\beta)\bigr],$$

where $J$ is the Jacobian matrix whose $i$th row equals $J_i$, and where $f(\beta)$ and $y$ are vectors with $i$th component $f(x_i, \beta)$ and $y_i$, respectively. This is a set of linear equations which can be solved for $\delta$.

Levenberg's contribution is to replace this equation by a "damped version",

$$(J^{T}J + \lambda I)\,\delta = J^{T}\bigl[y - f(\beta)\bigr],$$

where $I$ is the identity matrix, giving the increment $\delta$ to the estimated parameter vector $\beta$.

The (non-negative) damping factor $\lambda$ is adjusted at each iteration. If reduction of $S$ is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, $\lambda$ can be increased, giving a step closer to the gradient-descent direction. Note that the gradient of $S$ with respect to $\beta$ equals $-2\bigl(J^{T}[y - f(\beta)]\bigr)^{T}$. Therefore, for large values of $\lambda$, the step will be taken approximately in the direction of the gradient. If either the length of the calculated step $\delta$ or the reduction of the sum of squares from the latest parameter vector $\beta + \delta$ falls below predefined limits, iteration stops and the last parameter vector $\beta$ is considered to be the solution.

Levenberg's algorithm has the disadvantage that if the value of the damping factor $\lambda$ is large, the inverted matrix $J^{T}J + \lambda I$ is dominated by the damping term and the curvature information in $J^{T}J$ is not used at all. Marquardt provided the insight that we can scale each component of the gradient according to the curvature so that there is larger movement along the directions where the gradient is smaller. This avoids slow convergence in the direction of small gradient. Therefore, Marquardt replaced the identity matrix $I$ with the diagonal matrix consisting of the diagonal elements of $J^{T}J$, resulting in the Levenberg–Marquardt algorithm:

$$\bigl(J^{T}J + \lambda\,\mathrm{diag}(J^{T}J)\bigr)\,\delta = J^{T}\bigl[y - f(\beta)\bigr].$$

A similar damping factor appears in Tikhonov regularization, which is used to solve linear ill-posed problems, as well as in ridge regression, an estimation technique in statistics.

Choice of damping parameter

Various more or less heuristic arguments have been put forward for the best choice of the damping parameter λ.
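Before turning to how λ is chosen, the core computation of each iteration — solving the damped normal equations for the step δ, with either Levenberg's identity damping or Marquardt's diag(JᵀJ) scaling — can be written in a few lines. This is a generic NumPy sketch, not tied to any particular library.

```python
import numpy as np

def lm_step(J, r, lam, marquardt_scaling=False):
    """One damped step: solve (J^T J + lam * D) delta = J^T r, where r = y - f(beta).
    D is the identity (Levenberg) or diag(J^T J) (Marquardt's curvature scaling)."""
    JTJ = J.T @ J
    D = np.diag(np.diag(JTJ)) if marquardt_scaling else np.eye(JTJ.shape[0])
    return np.linalg.solve(JTJ + lam * D, J.T @ r)

# Usage inside an iteration: beta_new = beta + lm_step(J, y - f(beta), lam)
```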
Theoretical arguments exist showing why some of these choices guaranteed local convergence of the algorithm; however these choices can make the global convergence of the algorithm suffer from the undesirable properties of steepest-descent, in particular very slow convergence close to the optimum.The absolute values of any choice depends on how well-scaled the initial problem is. Marquardt recommended starting with a value λ0 and a factor ν>1. Initially setting λ=λ0and computing the residual sum of squares S(β) after one step from the starting point with the damping factor of λ=λ0 and secondly withλ0/ν. If both of these are worse than the initial point then the damping is increased by successive multiplication by νuntil a better point is found with a new damping factor of λ0νk for some k.If use of the damping factor λ/ν results in a reduction in squared residual then this is taken as the new value of λ (and the new optimum location is taken as that obtained with this damping factor) and the process continues; if using λ/ν resulted in a worse residual, but using λresulted in a better residual then λ is left unchanged and the new optimum is taken as the value obtained with λas damping factor.[edit] ExamplePoor FitBetter FitBest FitIn this example we try to fit the function y = a cos(bX) + b sin(aX) using theLevenberg–Marquardt algorithm implemented in GNU Octave as the leasqr function. The 3 graphs Fig 1,2,3 show progressively better fitting for the parameters a=100, b=102 used in the initial curve. Only when the parameters in Fig 3 are chosen closest to the original, are thecurves fitting exactly. This equation is an example of very sensitive initial conditions for the Levenberg–Marquardt algorithm. One reason for this sensitivity is the existenceof multiple minima —the function cos(βx)has minima at parameter value and[edit] Notes1.^ The algorithm was first published byKenneth Levenberg, while working at theFrankford Army Arsenal. It was rediscoveredby Donald Marquardt who worked as astatistician at DuPont and independently byGirard, Wynn and Morrison.[edit] See also∙Trust region[edit] References∙Kenneth Levenberg(1944). "A Method for the Solution of Certain Non-Linear Problems in Least Squares". The Quarterly of Applied Mathematics2: 164–168.∙ A. Girard (1958). Rev. Opt37: 225, 397. ∙ C.G. Wynne (1959). "Lens Designing by Electronic Digital Computer: I". Proc.Phys. Soc. London73 (5): 777.doi:10.1088/0370-1328/73/5/310.∙Jorje J. Moré and Daniel C. Sorensen (1983)."Computing a Trust-Region Step". SIAM J.Sci. Stat. Comput. (4): 553–572.∙ D.D. Morrison (1960). Jet Propulsion Laboratory Seminar proceedings.∙Donald Marquardt (1963). "An Algorithm for Least-Squares Estimation of NonlinearParameters". SIAM Journal on AppliedMathematics11 (2): 431–441.doi:10.1137/0111030.∙Philip E. Gill and Walter Murray (1978)."Algorithms for the solution of thenonlinear least-squares problem". SIAMJournal on Numerical Analysis15 (5):977–992. doi:10.1137/0715063.∙Nocedal, Jorge; Wright, Stephen J. (2006).Numerical Optimization, 2nd Edition.Springer. ISBN0-387-30303-0.[edit] External links[edit] Descriptions∙Detailed description of the algorithm can be found in Numerical Recipes in C, Chapter15.5: Nonlinear models∙ C. T. Kelley, Iterative Methods for Optimization, SIAM Frontiers in AppliedMathematics, no 18, 1999, ISBN0-89871-433-8. Online copy∙History of the algorithm in SIAM news∙ A tutorial by Ananth Ranganathan∙Methods for Non-Linear Least Squares Problems by K. Madsen, H.B. 
Nielsen, and O. Tingleff is a tutorial discussing non-linear least-squares in general and the Levenberg–Marquardt method in particular
• T. Strutz: Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Vieweg+Teubner, ISBN 978-3-8348-1022-9.

Implementations

• Levenberg–Marquardt is a built-in algorithm in Mathematica.
• Levenberg–Marquardt is a built-in algorithm in MATLAB.
• The oldest implementation still in use is lmdif, from MINPACK, in Fortran, in the public domain. See also:
  • lmfit, a translation of lmdif into C/C++ with an easy-to-use wrapper for curve fitting, public domain.
  • The GNU Scientific Library has a C interface to MINPACK.
  • C/C++ Minpack includes the Levenberg–Marquardt algorithm.
  • Several high-level languages and mathematical packages have wrappers for the MINPACK routines, among them: the Python library scipy, module scipy.optimize.leastsq; IDL, add-on MPFIT; and R (programming language), which has the minpack.lm package.
• levmar is an implementation in C/C++ with support for constraints, distributed under the GNU General Public License.
  • levmar includes a MEX file interface for MATLAB.
  • Perl (PDL), Python, and Haskell interfaces to levmar are available: see PDL::Fit::Levmar, PyLevmar, and HackageDB levmar.
• sparseLM is a C implementation aimed at minimizing functions with large, arbitrarily sparse Jacobians. It includes a MATLAB MEX interface.
• ALGLIB has implementations of an improved LMA in C# / C++ / Delphi / Visual Basic. The improved algorithm takes less time to converge and can use either the Jacobian or the exact Hessian.
• NMath has an implementation for the .NET Framework.
• gnuplot uses its own implementation.
• Java programming language implementations: 1) Javanumerics, 2) LMA-package (a small, user-friendly and well-documented implementation with examples and support), 3) Apache Commons Math.
• OOoConv implements the L-M algorithm as a Calc spreadsheet.
• SAS: there are multiple ways to access SAS's implementation of the Levenberg–Marquardt algorithm. It can be accessed via NLPLM Call in PROC IML, through the LSQ statement in PROC NLP, and via the METHOD=MARQUARDT option in PROC NLIN.
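For completeness, a self-contained NumPy sketch of the full iteration is given below, using the article's example model y = a·cos(bx) + b·sin(ax) and a simplified version of the λ-adjustment strategy described above (try a smaller damping factor first, increase by a factor ν when the residual does not improve). The data, starting values, and iteration count are illustrative; as the article notes, convergence for this model depends strongly on the initial guess.

```python
import numpy as np

def model(beta, x):
    a, b = beta
    return a * np.cos(b * x) + b * np.sin(a * x)

def jacobian(beta, x):
    a, b = beta
    # Analytic partial derivatives of the model with respect to a and b
    da = np.cos(b * x) + b * x * np.cos(a * x)
    db = -a * x * np.sin(b * x) + np.sin(a * x)
    return np.column_stack([da, db])

def levenberg_marquardt(x, y, beta0, lam=1e-2, nu=10.0, n_iter=100):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = y - model(beta, x)
        J = jacobian(beta, x)
        A, g = J.T @ J, J.T @ r
        # Try a reduced damping factor first, then the current one
        for trial in (lam / nu, lam):
            step = np.linalg.solve(A + trial * np.diag(np.diag(A)), g)
            if np.sum((y - model(beta + step, x)) ** 2) < np.sum(r ** 2):
                beta, lam = beta + step, trial
                break
        else:
            lam *= nu          # no improvement: damp harder (toward gradient descent)
    return beta

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 200)
true_beta = np.array([100.0, 102.0])
y = model(true_beta, x) + rng.normal(0.0, 1.0, x.size)
# Very sensitive to the starting point, as noted above: begin close to the truth
print(levenberg_marquardt(x, y, beta0=[100.5, 101.5]))
```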
An Investigation on the Use of Machine Learned Models for Estimating Correction Costs

ABSTRACTIn this paper we present the results of an empirical study in which we have investigated Machine Learning (ML) algorithms with regard to their capabilities to accurately assess the correctability of faulty software components. Three different families algorithms have been analyzed: Top Down Induction Decision Tree, covering, and Inductive Logic Programming (ILP). We have used (1) fault data collected on corrective maintenance activities for the Generalized Support Software reuse asset library located at the Flight Dynamics Division of NASA's GSFC and (2) product measures extracted directly from the faulty components of this library.1.0 INTRODUCTIONSoftware maintenance consumes most of the resources in many software organizations. We must be able to better characterize, assess, and improve the maintainability of software products in order to decrease maintenance costs. Maintenance involves activities such as correcting errors, migrating software to new technologies, and adapting software to deal with new environment requirements. Corrective maintenance is the part of software maintenance devoted to correcting errors. Mostly, when software maintainers have to correct a faulty software component, they rely almost exclusively on their previous experience in order to estimate the effort they will spend to do it. Even though highly experienced software maintainers may make accurate predictions, the estimation process remain informal, error-prone, and poorly documented, making it difficult to replicate and spread throughout the organization. Published in the Proc. of the 21th Int’l Conf. on S/W Eng. Kyoto, Japan, 1997.In general, software maintenance organizations tend to assign corrective maintenance activities to young software engineers who do not know a great deal about software systems they have to maintain.In order to improve corrective maintenance, we must be able to provide models which help software maintainers better assess the maintainability of software products and estimate corrective maintenance effort. The benefits of having such models for software maintenance are numerous. For instance, estimation models can help us optimize the allocation of resources to corrective maintenance activities. Evaluation models can help us made decisions about when to re-structure or re-engineer a software component in order to make it more maintainable. Understanding models can help us know better the underlying reasons about the difficulty of correcting specific kinds of errors.Many different approaches have been proposed to build corrective maintenance estimation/evaluation models. In this paper, we show the results of an empirical study in which we have investigated different ML algorithms with regard to their capabilities to generate accurate and easily interpretable correctability models. We have compared these algorithms with regard to their capabilities to assess the difficulty of correct Ada faulty components. The results show that ML algorithms are able to generate adequate prediction models. The rules produced by the ML algorithms can also be used as coding guidelines. In addition, the rules generated by these algorithms showed to be intuitive to software maintainers.2.0 MACHINE LEARNING ALGORITHMSMost of the work done in machine learning has focused on supervised machine learning algorithms. Starting from the description of classified examples, these algorithms produce definitions for each class. 
In general, they use an attribute-value representation language that allows the use of statistical properties on the learning set. Nevertheless, others use the first order logic language. It has better expressive capabilities than the attribute-value language. ItAn Investigation on the Use of Machine Learned Models for Estimating Correction CostsMauricio A. de Almeida1 and Hakim Lounis Centre de Recherche Informatique de Montréal 1801, McGill College Ave., #800Montréal, H3A 2N4 Qc, Canada{mdealmei, hlounis}@crim.caphone: (1) (514) 840-12341.Guest researcher at CRIM and assistant professor at Facul-dade de Tecnologia de Sao Paulo, Sao Paulo , Brazil.Walcelio L. MeloOracle do Brasil SCN Qd. 02 - Bl. D - Torre A - Salas 501/506 Brasilia, DF Brazil 70710-500wmelo@ phone/fax: (55) (61) 327-3027permits the expression of relations between objects. An important consequence is the diminution of the learning data-set size. Both are helpful for constructing efficient software quality models. The following table summarizes the four ML algorithms we have used.TABLE 1. ML algorithms used in the study3.0 STUDY OVERVIEW3.1 The studied environmentIn this study, we have used data from the maintenance of a library of reusable components. This library, known as the Generalized Support Software (GSS) reuse asset library, is located at the Flight Dynamics Division (FDD) of NASA’s Goddard Space Flight Center (GSFC). Component development began in 1993. Subsequent efforts focused on generating new components to populate the library and on implementing specification changes to satisfy mission requirements. The first application using this library was developed in early 1995. The asset library currently consists of 1K Ada83 components totalling approximately 515 KSLOC.3.2 Data CollectionIn this study, we collected error and fault data about this library. An error is represented by a single software Change Request Form (CRF) [10] filled by developers and configurers to institute and document a change to one or more components. A fault pertains to a single component and is evidenced by the physical change of that component in response to a particular error CRF. In this study, we have only used those components representing Ada 83 files. A faulty component version becomes a fixed component version after it is corrected. We are only interested in the Ada faulty component versions.For each CRF, we have collected data on: (1) error identification and error correction, including the names and version numbers of the Ada source code components that had faults in them, (2) the effort expended to isolate all faults associated with the error, (3) the effort required to correct all of these faults, and (4) source code metrics characterizing these particular components. The ASAP tool [1] was used to extract source code metrics from the Ada faulty component versions.3.3 Dependent and independent variablesIn our study, the dependent variable is the total effort spent to isolate and correct a faulty component. Isolation and correction effort at NASA SEL is measured on a 4-point ordinal scale: 1 hour, from 1 hour to 1 day, from 1 to 3 days, and more than 3 days. To build the classification model, we have dichotomized the corrective maintenance cost into two categories: low and high correct maintenance cost. To do so, we converted the four effort categories into average values following [3]. We assumed an 8 hour day, and took the average value for each of the categories of corrective maintenance effort. 
Therefore, the category of "1 hour" was changed to 0.5 hours, the category of "1 hour to 1 day" was changed to 4.5 hours, the category of "from 1 to 3 days" was changed to 16 hours, and the category of "more than 3 days" was changed to 32 hours. We then summed these values for isolation and correction costs. This gives us an average overall corrective maintenance cost. We used the median of the total corrective maintenance cost as the cutoff point for dichotomization.

In this study, the independent variables are the ASAP product measures extracted from the faulty components.

3.4 Evaluating Prediction Accuracy

In order to evaluate the models, we need formal measures of the classification performance of the estimation models produced by the different ML algorithms. In this paper, we have used five criteria (see Table 2): sensitivity, specificity, predictive value (+), predictive value (-), and accuracy. These are defined below with reference to Table 3. In addition, we have used a measure of prediction validity as presented in [11] and used in [2]: if the statistical significance coefficient (p-value) of the computed value of the X2 test is less than 0.05, then we can say that the generated model has predictive validity.

TABLE 1. ML algorithms used in the study

ML algorithm | Algorithm family | Description language | Induced knowledge
NewID [4] | Divide & conquer family: Top Down Induction Decision Tree (TDIDT) | Attribute-value | Decision tree
CN2 [7] | Covering family | Attribute-value | Rules
C4.5 [13] | Divide & conquer family (TDIDT) | Attribute-value | Decision tree & rules
FOIL [14] | Inductive Logic Programming (ILP) | First-order logic | Clauses

TABLE 2. Formal measures of classification performance [12]

Sensitivity = n11 / (n11 + n21)
Specificity = n22 / (n12 + n22)
Predictive value (+) = n11 / (n11 + n12)
Predictive value (-) = n22 / (n21 + n22)
Accuracy = (n11 + n22) / ((n11 + n21) + (n12 + n22))

All the criteria above are expected to be as high as possible, because when they are low, they will lead to a wrong allocation of resources to maintain the components.

TABLE 3. Two-class classification performance matrix

               | Predicted high cost | Predicted low cost
Real high cost | n11                 | n12
Real low cost  | n21                 | n22

In order to calculate the values of the formal measures of classification performance described in Table 2, we used a V-fold cross-validation procedure [5]. For each observation X in the sample, a model is developed based on the remaining observations (sample - X). This model is then used to predict whether observation X will be classified as either costly or not costly. This validation procedure is commonly used when data sets are small, e.g., [2] and [6].

4.0 RESULTS

4.1 Data preparation

The data used in this study is a set of 164 Ada faulty components classified as having either 'high' or 'low' corrective maintenance cost. The data was structured as a sequence of attribute-value pairs containing 19 attributes corresponding to the independent variables and 1 attribute associated with the dependent variable.

4.2 Quantitative comparison

Table 4 presents the quantitative results of the study.

TABLE 4. Results of the study

Measure | NewID | CN2 | C4.5 | C4.5_rules | FOIL
Sensitivity | 53% | 56% | 70% | 74% | 80%
Specificity | 50% | 53% | 62% | 64% | 68%
Predictive value (+) | 55% | 58% | 59% | 59% | 65%
Predictive value (-) | 48% | 51% | 73% | 77% | 82%
Accuracy | 52% | 54% | 66% | 68% | 73%
X2 | 0.1898 | 1.1289 | 17.34 | 21.91 | 37.08
p-value | = 0.66 | = 0.2854 | <= 0.000 | <= 0.0000 | <= 0.000

From the point of view of the prediction validity measure, the models generated by NewID and CN2 are not statistically significant (X2 test p-value > 0.05). All the other generated models are, from this point of view, statistically significant, since their p-values are less than 0.001 (far from the threshold of 0.05).

As we can see in Table 4, FOIL presents the best results in our experiment. The model generated by FOIL is composed of 6 rules.
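The five criteria of Table 2 are straightforward to compute from the Table 3 layout; the sketch below follows the paper's formulas and cell labels exactly, while the counts in the usage line are made up for illustration.

```python
def performance_measures(n11, n12, n21, n22):
    """Table 2 criteria computed from the Table 3 cells:
    n11 = real high cost predicted high, n12 = real high cost predicted low,
    n21 = real low cost predicted high,  n22 = real low cost predicted low."""
    return {
        "sensitivity": n11 / (n11 + n21),
        "specificity": n22 / (n12 + n22),
        "predictive value (+)": n11 / (n11 + n12),
        "predictive value (-)": n22 / (n21 + n22),
        "accuracy": (n11 + n22) / ((n11 + n21) + (n12 + n22)),
    }

# Hypothetical counts for illustration only
print(performance_measures(n11=60, n12=22, n21=18, n22=64))
```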
Table 5 presents the metrics used in the experiment and their number of occurrences in the rules of the two best learned models. The number of operands (N2) appeared in 5 of the 6 generated rules and the number of operators (N1) in 4. These results demonstrate that some of the Halstead metrics [9] in our data are useful for predicting the cost of corrective maintenance of faulty components. The two new metrics we have introduced, i.e., comments divided by size (comments div size) and blank lines divided by size (blank lines div size), have also been selected by FOIL: comments divided by size appeared twice and blank lines divided by size once. In fact, models built without these two metrics were less accurate than the models we have shown in this section (due to a lack of space, the models built without these two normalized metrics are not shown).

Another important result is that FOIL, an ILP ML algorithm based on a subset of the first-order logic description language, provides the best results for all the measures we have computed. The sensitivity and the predictive value (-) are fairly high (around 80%), and the overall accuracy is 5% higher than that of the second-best algorithm (C4.5 rules). This means that in some cases one can allocate resources to the corrective maintenance of faulty components with 82% confidence.

In a recent study where another set of metrics was used, Basili and his colleagues [2] obtained similar results using C4.5 rules (sensitivity 76% and overall accuracy 73%). Although the results are difficult to compare, since [2] used a different data set and different independent variables (i.e., software metrics), the results of our study demonstrate again that C4.5 rules work better than C4.5 decision trees. In addition, the results obtained with C4.5 rules in both studies are quite close (around 75%).

4.3 FOIL rules

FOIL is able to build predictive software models via rules expressed in first-order logic. The greatest advantage of FOIL rules is that we are able to compare measures instead of simply listing attribute-value rules. Here, we show one of the rules generated by FOIL, taken arbitrarily, to exemplify the rule's interpretation.

TABLE 5.
Metrics used in the experiment and their occurrences in the rules of the two best learned models

Metric | FOIL | C4.5_rules
Number of operands (N2) | 5 | 0
Number of operators (N1) | 4 | 0
declarative | 3 | 1
inline comments | 3 | 0
Number of distinct operators (n1) | 3 | 0
blank lines | 2 | 1
comments div size | 2 | 0
cyclomatic complexity | 2 | 2
lines of comments | 2 | 0
total statement nesting depth | 2 | 0
blank lines div size | 1 | 3
executable | 1 | 0
lines of code | 1 | 0
maximum statement nesting depth | 1 | 0
statements | 1 | 0
total source lines | 0 | 0
Ada language statements | 0 | 0
average statement nesting depth | 0 | 0
Number of distinct operands (n2) | 0 | 0

high(A) :- executable(A,B), maximum_statement_nesting_depth(A,C), lines_of_comments(A,D), commentsdivsize(A,E), N1(A,F), N2(A,G), less_or_equal(E,F), ~less_or_equal(B,G), C<>4, C<>43, less_or_equal(C,D)

This rule can be read as: "a faulty component has a high corrective maintenance cost if the comments density (#comment lines / #source lines of code) is less than or equal to the number of operators, and the number of executable statements is greater than the number of operands, and the maximum statement nesting depth is less than or equal to the number of lines of comments, and the maximum statement nesting depth is different from 4 and 43."

The data used in this study, as well as all decision trees and rules, are available upon request.

5.0 LESSONS LEARNED

In this paper, we have empirically investigated different machine learning techniques with regard to their capabilities to generate accurate correctability models. The results show that the inductive logic programming algorithm is superior to the top-down induction decision tree, top-down induction attribute-value rules, and covering algorithms, i.e., the overall accuracy of the model built using FOIL was higher than that of the other algorithms. The rules provided by the inductive logic programming algorithm we have used, i.e., FOIL, proved to be meaningful.

As far as we know, we are among the first to investigate the use of such ILP ML algorithms in the field of software engineering [8]. Most of the work done so far has exploited algorithms based on propositional logic. The latter are limited, in the sense that they cannot induce models that compare one descriptor (i.e., metric or independent variable) to another.

The work we have presented addresses two different but complementary domains: software engineering and machine learning. It confirms the usefulness of collaboration between the two domains. With regard to software engineering, we intend to do the following work:
• the generation of other quality models, such as reliability, error-proneness, etc.;
• the use of other sets of measures which enrich the measures provided by ASAP;
• replication of the study using other data sets;
• the provision of guidelines which help software managers take preventive action early in the process life-cycle.

ACKNOWLEDGEMENTS

The authors wish to thank Vic Basili from the University of Maryland (SEL) and Steven Condon from CSC for providing the data used in the paper. We are also grateful to R. Tesoriero and P. Mackenzie for their feedback on earlier versions of this paper. During this work, W. Melo was supported in part by the Software Quality Group of Bell Canada and by NSERC operating grant #OGP 0197275.

6.0 REFERENCES

[1] Amadeus Software Research Inc. "Getting Started with Amadeus". Amadeus Measurement System, 1994.
[2] V. Basili, S. Condon, K. El Emam, R. B. Hendrick, W. L. Melo. "Characterizing and Modeling the Cost of Rework in a Library of Reusable Software Components". In Proc. of the IEEE 19th Int'l. Conf. on S/W Eng., Boston, MA, May 1997.
[3] V. Basili and B. Perricone.
"Software Errors and Complexity: An Empirical Investigation". In CACM, 27(1):42-52, January 1984.
[4] R. Boswell. "Manual for NewID". The Turing Institute, January 1990.
[5] L. Breiman, J. Friedman, R. Olshen and C. Stone. "Classification and Regression Trees". Wadsworth, 1984.
[6] L. Briand, V. Basili, C. Hetmanski. "A Pattern Recognition Approach for Software Engineering Data Analysis". In IEEE TSE, 18(1), Nov. 1992.
[7] P. Clark and T. Niblett. "The CN2 induction algorithm". In Machine Learning Journal, 3, pp. 261-283.
[8] W. W. Cohen and P. Devanbu. "A Comparative Study of Inductive Logic Programming Methods for Software Fault Prediction". Technical Report, AT&T Labs-Research, 1996.
[9] M. Halstead. "Elements of Software Science". North-Holland, Amsterdam, 1977.
[10] G. Heller, J. Valett and M. Wild. "Data Collection Procedure for the Software Engineering Laboratory (SEL) Database". Technical Report SEL-92-002, Software Engineering Laboratory, 1992.
[11] F. Lanubile and G. Visaggio. "Evaluating Predictive Quality Models Derived from Software Measures: Lessons Learned". Technical Report ISERN-96-03, International Software Engineering Research Network, 1996.
[12] S. M. Weiss, C. A. Kulikowski. "Computer Systems That Learn". Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1991.
[13] J. R. Quinlan. "C4.5: Programs for Machine Learning". Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[14] J. R. Quinlan. "Learning Logical Definitions from Relations". In Machine Learning Journal, vol. 5, no. 3, pp. 239-266, August 1990.
prediction-error method

prediction-error methodThe prediction-error method, also known as the prediction error method, is a statistical technique used in various fields such as signal processing, machine learning, and econometrics. It is primarily used for model selection, parameter estimation, and prediction.The basic idea behind the prediction-error method is to compare the prediction errors of different models to determine which model performs the best. This is done by training multiple models on the same dataset and then evaluating their performance on a separate test dataset or through cross-validation.Here's a general outline of how the prediction-error method works:1. **Model Training**: Multiple models are trained on the dataset, using various algorithms or techniques.2. **Prediction Error Calculation**: For each model, the prediction error is calculated. This is the difference between the predicted values and the actual values of the output variable.3. **Model Comparison**: The models are compared based on their prediction errors. The model with the smallest prediction error is considered the best performing model.4. **Model Selection**: The model with the lowest prediction error is selected as the final model for prediction or further analysis.The prediction-error method is particularly useful when the goal is to find a model that is not only accurate but also parsimonious, meaning it has a relatively simple structure with fewer parameters. By focusing on minimizing prediction error, this method aims to find a balance between model complexity and performance.It's important to note that the prediction-error method is just one approach to model selection and evaluation. Other factors such as interpretability, computational efficiency, and domain-specific requirements may also influence the choice of model.。
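A minimal sketch of the outline above — train several candidates, score each on held-out data, keep the one with the smallest prediction error — is given below. The two polynomial candidates, the synthetic data, and the 75/25 split are hypothetical choices for illustration.

```python
import numpy as np

def prediction_error(predict, fit, X_train, y_train, X_test, y_test):
    """Mean squared prediction error of one candidate model on held-out data."""
    params = fit(X_train, y_train)
    return float(np.mean((predict(X_test, params) - y_test) ** 2))

# Two hypothetical candidates: a linear and a quadratic polynomial fit
candidates = {
    "linear":    (lambda X, p: np.polyval(p, X), lambda X, y: np.polyfit(X, y, 1)),
    "quadratic": (lambda X, p: np.polyval(p, X), lambda X, y: np.polyfit(X, y, 2)),
}

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, 200)
y = 1.0 + 0.5 * X - 0.8 * X**2 + rng.normal(0.0, 0.3, X.size)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

errors = {name: prediction_error(p, f, X_tr, y_tr, X_te, y_te)
          for name, (p, f) in candidates.items()}
best = min(errors, key=errors.get)
print(errors, "-> selected model:", best)
```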
A Survey of Cross-Validation Methods for Estimating the Generalization Error

the index sets of the training and test sets (that is, the way the data are split).
a) Advantages. The introduction of the holdout estimator broke with the traditional practice of training and testing on the same data, avoiding the overfitting caused by overlap between the training and test data.
b) Disadvantages. The holdout estimate depends too heavily on a single split of the data; how good or bad that split is directly affects the accuracy of the estimate.
2.2 Cross-validation estimates based on averaging multiple holdout estimates
Noting that the holdout estimate depends on a single split of the data and is therefore easily affected by that split,
$$\widehat{\mathrm{Err}} = \frac{1}{B}\sum_{j=1}^{B}\frac{1}{|D_j^{(t)}|}\sum_{z_i \in D_j^{(t)}} L\!\left(A(D_j),\, z_i\right) \qquad (6)$$

where $D_j$ denotes the $j$th training set, $D_j^{(t)}$ the corresponding test set, $A(D_j)$ the model learned on $D_j$, and $L$ the loss function.
a) Advantages. Since it was proposed in 1981, RLT has been widely used in practice, because it has an acceptable computational cost and is simple to carry out.
b) Disadvantages. The choice of B has always been the biggest problem with this method, and different papers reach different conclusions; for example, [10] recommends B = 15. Nor is there a definitive conclusion on the choice of the training/test split ratio; different researchers often use different training and test set sizes.
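The RLT estimator discussed above is easy to sketch: draw B random train/test splits, fit on each training part, and average the held-out loss. In the sketch below, B = 15 (the value the text attributes to [10]) and a 70/30 split are used purely as examples, since, as noted, there is no agreed-upon choice for either; the line-fitting model and squared loss are likewise illustrative.

```python
import numpy as np

def repeated_holdout_error(X, y, fit, loss, B=15, test_frac=0.3, seed=0):
    """RLT-style estimate: average the held-out loss over B random train/test splits."""
    rng = np.random.default_rng(seed)
    n, split_errors = len(y), []
    for _ in range(B):
        idx = rng.permutation(n)
        n_test = int(round(test_frac * n))
        test, train = idx[:n_test], idx[n_test:]
        params = fit(X[train], y[train])
        split_errors.append(np.mean([loss(params, X[i], y[i]) for i in test]))
    return float(np.mean(split_errors))

# Illustrative model: a least-squares line fit scored with squared loss
fit = lambda X, y: np.polyfit(X, y, 1)
loss = lambda p, x, yi: (np.polyval(p, x) - yi) ** 2
rng = np.random.default_rng(1)
X = rng.normal(size=300)
y = 2.0 * X + rng.normal(0.0, 0.5, 300)
print(repeated_holdout_error(X, y, fit, loss, B=15, test_frac=0.3))
```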
2 Cross-validation methods for estimating the generalization error
As early as the 1930s, Larson [18] pointed out that training an algorithm and evaluating its performance on the same data will give overly optimistic results. Cross-validation was proposed precisely to address this problem: it evaluates the algorithm on a new data set.
Among these methods, cross-validation is the simplest and most widely used, and it has therefore attracted the most attention from researchers. Many different forms of cross-validation have been proposed, including the earliest leave-one-out cross-validation, standard K-fold cross-validation, RLT (repeated learning-testing) cross-validation, and Monte Carlo cross-validation.
[3-5]. In recent years, a class of methods that estimate the generalization error directly through sample reuse
Here the expectation is taken over both D and z, where z is also a sample drawn from the distribution P independently of D. The expectation in Equation (1) means that we are interested in the general performance of an algorithm, rather than only in its performance on one particular data set at hand.
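The estimators surveyed here approximate this expectation by reusing the one sample at hand. As a concrete baseline, the sketch below implements standard K-fold cross-validation, in which every observation is held out exactly once; the line-fitting model, squared loss, and K = 10 are illustrative choices, not prescriptions from the survey.

```python
import numpy as np

def kfold_cv_error(X, y, fit, loss, K=10, seed=0):
    """Standard K-fold cross-validation estimate of the generalization error:
    every observation is held out exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        params = fit(X[train], y[train])
        errs.extend(loss(params, X[i], y[i]) for i in test)
    return float(np.mean(errs))

# Illustrative model: a least-squares line fit scored with squared loss
fit = lambda X, y: np.polyfit(X, y, 1)
loss = lambda p, x, yi: (np.polyval(p, x) - yi) ** 2
rng = np.random.default_rng(2)
X = rng.normal(size=200)
y = 1.5 * X + rng.normal(0.0, 0.4, 200)
print(kfold_cv_error(X, y, fit, loss, K=10))
```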
Generalized Structured Component Analysis with Latent Interactions

PSYCHOMETRIKA—VOL.75,NO.2,228–242J UNE2010DOI:10.1007/S11336-010-9157-5GENERALIZED STRUCTURED COMPONENT ANALYSIS WITH LATENTINTERACTIONSH EUNGSUN H WANGMCGILL UNIVERSITYM OON-H O R INGO H ONANYANG TECHNOLOGICAL UNIVERSITYJ ONATHAN L EECALIFORNIA STATE UNIVERSITYGeneralized structured component analysis(GSCA)is a component-based approach to structural equation modeling.In practice,researchers may often be interested in examining the interaction effects oflatent variables.However,GSCA has been geared only for the specification and testing of the main effectsof variables.Thus,an extension of GSCA is proposed to effectively deal with various types of interactionsamong latent variables.In the proposed method,a latent interaction is defined as a product of interactinglatent variables.As a result,this method does not require the construction of additional indicators for latentinteractions.Moreover,it can easily accommodate both exogenous and endogenous latent interactions.Analternating least-squares algorithm is developed to minimize a single optimization criterion for parameterestimation.A Monte Carlo simulation study is conducted to investigate the parameter recovery capabilityof the proposed method.An application is also presented to demonstrate the empirical usefulness of theproposed method.Key words:generalized structured component analysis,latent interactions,alternating least squares.1.IntroductionGeneralized structured component analysis(GSCA)(Hwang&Takane,2004)was proposed for component-based structural equation modeling(SEM)where latent variables are defined as weighted composites or components of observed variables(Tenenhaus,2008).GSCA permits modeling the interrelationships among observed and latent variables in a unified manner,so that it can estimate parameters by minimizing a single least-squares optimization function.GSCA does not require the multivariate normality assumption of observed variables for parameter estimation. It hardly suffers from non-convergence and avoids convergence to improper solutions even in small samples.Moreover,the method results in the unique individual scores of latent variables. Furthermore,GSCA provides overall goodness-of-fit measures which may be of use for model evaluation and comparison.Recently,a simulation study was conducted to compare GSCA to two traditional approaches to SEM(covariance structure analysis and partial least squares)in terms of parameter recovery under various experimental conditions including data distribution,model specification,and sam-ple size(Hwang,Malhotra,Kim,Tomiuk,&Hong,in press).The results of this study provide guidelines with respect to the conditions under which GSCA is to be preferred over the two tradi-tional approaches.Specifically,GSCA is recommended as an alternative to partial least squares for general SEM purposes.In addition,GSCA is favored over covariance structure analysis un-less correct model specification is ensured.Investigating the interaction effects of latent variables has emerged as an issue of theoreti-cal and substantive importance in psychology and variousfields(e.g.,Bagozzi,Baumgartner,&228©2010The Psychometric SocietyH.HW ANG ET AL.229 Yi,1992;Busemeyer&Jones,1983;Kenny&Judd,1984;Schumacker&Marcoulides,1998). 
Researchers may concentrate on the interaction between a continuous latent variable and a cat-egorical observed variable with a relatively small number of categories(Ridgon,Schumacker, &Worthe,1998).GSCA can readily be utilized to examine this interaction through conven-tional multi-group comparison procedures(Hwang&Takane,2004).However,researchers may often be interested in studying the interaction between continuous latent variables—latent in-teractions(Marsh,Wen,&Hau,2004).GSCA has no capability to deal with such latent inter-actions.Thus,in this paper,GSCA is extended to investigate various types of latent interaction effects.In practice,a simple approach to accommodating latent interactions is to obtain product indicators of observed variables for interacting latent variables,and subsequently to use them as indicators for latent interactions(e.g.,Algina&Moulder,2001;Chin,Marcolin,&New-sted,1996).GSCA has no technical difficulty in adopting the product-indicator approach.This is mainly because GSCA estimation does not require the normality assumption of observed vari-ables,which is not ensured for product indicators.Although this approach appears easy to imple-ment,it is difficult to decide which and how many observed variables should be selected to form product indicators for latent interactions(e.g.,Jaccard&Wan,1995;Jöreskog&Yang,1996; Marsh et al.,2004).Further,this problem will likely deteriorate when researchers want to take into account higher-way latent interactions than two-way ones.Instead,the method proposed herein treats a latent interaction as a product of interacting latent variables whose individual scores are uniquely determined as weighted composites of ob-served variables.This specification of a latent interaction is comparable to that of an observed interaction in multiple linear regression(e.g.,Cohen&Cohen,1983).As a result,the proposed method does not require the construction of any new(product)indicators for latent interactions. This would be particularly beneficial in estimating the effects of higher-way latent interactions, where no clear rule seems to be available for creating an appropriate set of product indicators. 
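The construction used by the proposed method — component scores as weighted composites of the observed variables, and a latent interaction as their elementwise (Hadamard) product, re-normalized afterwards — can be sketched in a few lines. The data matrix and weight vectors below are arbitrary illustrations, not estimates produced by the method.

```python
import numpy as np

def component_score(Z, w):
    """Component score as a weighted composite of observed variables, scaled to
    unit sum of squares (cf. the identification constraint diag(W'Z'ZW) = I)."""
    gamma = Z @ w
    return gamma / np.linalg.norm(gamma)

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 6))                   # 100 cases, 6 observed variables
w1 = np.array([0.4, 0.4, 0.4, 0.0, 0.0, 0.0])   # illustrative weights for gamma_1
w2 = np.array([0.0, 0.0, 0.0, 0.4, 0.4, 0.4])   # illustrative weights for gamma_2

g1, g2 = component_score(Z, w1), component_score(Z, w2)
g12 = g1 * g2                                   # latent interaction: elementwise product
g12 = g12 / np.linalg.norm(g12)                 # re-normalize, matching diag(F'F) = I
print(g12[:5])
```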
Yang Jonsson(1998)also suggested using a product of latent variables for a preliminary test of a latent interaction effect.This approach follows two steps sequentially:In thefirst step,the individual scores of latent variables are estimated based on a confirmatory factor analysis,and the latent interaction between latent variables is obtained as a product of their factor scores.In the second step,the effect of the latent interaction is estimated based solely on a given structural model.Accordingly,as Yang Jonsson(1998)pointed out,the two-step approach does not guar-antee that the solutions obtained in thefirst step are optimal for the subsequent analysis because the two steps are carried out separately.Moreover,due to the adoption of a confirmatory factor analysis,the individual scores of latent variables are not uniquely determined in thefirst step,so that resultant latent interaction scores are not likely to be unique either.On the other hand,the proposed method represents a unified,single-step approach that esti-mates all model parameters including the path coefficients of latent interactions,taking into ac-count both measurement and structural models simultaneously.In addition,the proposed method enables the provision of the unique individual scores of latent interactions because the scores of the corresponding latent variables are uniquely determined.The paper is organized as follows.In Section2,the technical underpinnings of the proposed method are discussed in detail.A least-squares optimization function is proposed for parameter estimation.An alternating least-squares algorithm(de Leeuw,Young,&Takane,1976)is de-veloped to minimize the optimization function.In Section3,a Monte Carlo simulation study is conducted to evaluate the performance of the proposed method in terms of parameter recov-ery.In particular,this study considers two-way and three-way latent interactions at the same time.In Section4,the empirical usefulness of the proposed method is illustrated based on a real dataset.Thefinal section briefly summarizes and discusses the implications of the proposed method.230PSYCHOMETRIKA2.The MethodA brief description of GSCA isfirst presented to aid an understanding of this relatively novel method.Subsequently,the technical underpinnings of the proposed extension of GSCA are discussed in detail.2.1.Generalized Structured Component AnalysisLet Z denote an N by J matrix of observed variables,where N is the number of individ-ual observations.GSCA consists of three sub-models to specify the path-analytic relationships among observed and latent variables.The sub-models can be expressed in matrix form as follows:=ZW,(1)Z= C+E1,(2)= B+E2,(3) where is an N by T matrix of latent variables,W is a J by T matrix consisting of component weights assigned to observed variables,C is a T by J matrix consisting of loadings relating latent variables to observed variables,E1is an N by J matrix of residuals for Z,B is a T by T matrix of path coefficients connecting latent variables among themselves,and E2is an N by T matrix of residuals for .As shown in(1),in GSCA,latent variables are determined as weighted composites or components of observed variables.The second and third sub-models represent the measurement and structural models,respectively,which are formulated in a manner similar to the reticular action model(McArdle&McDonald,1984).When all observed variables are reflective indicators,all loadings in C are non-zeros in(2). 
When all observed variable are formative,no loadings are specified(i.e.,C=0),so that GSCA involves(1)and(3)only.This case corresponds with Glang’s(1988)method.Furthermore,when some observed variables are formative while the others reflective,for example,as in a multiple indicators of multiple causes model(Jöreskog&Goldberger,1975),C includes non-zero load-ings only for the reflective observed variables and zeros for the formative ones.GSCA integrates the three sub-models into a single one:[Z, ]= [C,B]+[E1,E2],Z[I,W]=ZW[C,B]+[E1,E2],(4)ZV=ZW A+E,= A+E,where V=[I,W],A=[C,B],E=[E1,E2], =ZV,and I is an identity matrix.This is called the GSCA model(Hwang&Takane,2004;Hwang,DeSarbo,&Takane,2007).As in other component analyses,GSCA is not scale-invariant.It typically pre-processes the data to be standardized.When it is of importance to examine the mean structure of a specified model,GSCA is to analyze the unstandardized data(Hwang&Takane,2004).2.2.Generalized Structured Component Analysis with Latent Interactions2.2.1.The Proposed Model The proposed method extends the original GSCA model to accommodate latent interactions.Let F denote an N by R matrix of latent interactions defined as products of latent variables.For example,letγ12denote the latent interaction between two latentH.HW ANG ET AL.231variables (say,γ1and γ2).It is then given by γ12=γ1◦γ2,where o indicates the elementwise multiplication of two vectors or the Hadamard product,i.e.,γ12i =γ1i ×γ2i (i =1,...,N).There are two possible model specifications with respect to latent interactions.One is to consider that latent interactions only influence observed and latent variables—exogenous latent interactions.The other is to specify endogenous latent interactions affected by other variables.Specifically,the former specification can be expressed asZV =ZW A +FD +E ,(5)where D indicates an R by J +T matrix consisting of path coefficients relating latent interac-tions to other observed or latent variables.The latter specification involving endogenous latent interactions can be written asF =ZWP +FQ +U ,(6)where P denotes a T by R matrix of path coefficients from latent variables to latent interactions,Q is an R by R matrix of path coefficients among latent interactions,and U is an N by R matrix of residuals for F .As shown in (5),all latent interactions are exogenous variables affecting observed and/or latent variables.On the other hand,in (6),latent interactions are influenced by latent variables and/or other latent interactions.The proposed model can be derived by combining (5)and (6)into one:[ZV ,F ]=[ZW ,F ] A P D Q +[E ,U ], ∗= ∗A ∗+E ∗,(7)where ∗=[ZV ,F ],A ∗= AP D Q ,and E ∗=[E ,U ].Accordingly,the proposed model can take into account both exogenous and endogenous latent interactions.To illustrate (7),we contemplate a structural equation model with latent interactions.As dis-played in Figure 1,this model involved twelve observed variables for four latent variables (three observed variables per latent variable).The three latent variables (γ1,γ2,γ3)were hypothesized to influence one latent variable (γ4).Moreover,this model included three two-way latent inter-actions (γ12,γ13,γ23)and a three-way latent interaction (γ123).This set of latent interactions represents all possible interactions among the three exogenous latent variables.These latent in-teractions were specified to affect the endogenous latent variable (γ4).Note that no residual terms associated with endogenous variables are displayed in Figure 1.In addition,although both weights (w 
In addition, although both weights (w's) and loadings (c's) were specified for the twelve observed variables as given in (2) and (3), only loadings are shown to keep the figure concise. In the model, P = 0 and Q = 0 because no endogenous latent interactions are involved. This model can be expressed as (7) by specifying the other elements as follows:

Z = [z1, z2, ..., z12],

W = [w1, w2, w3, w4] =
[ w1  w2  w3  0   0   0   0   0   0   0    0    0
  0   0   0   w4  w5  w6  0   0   0   0    0    0
  0   0   0   0   0   0   w7  w8  w9  0    0    0
  0   0   0   0   0   0   0   0   0   w10  w11  w12 ]′,

Γ = ZW = [γ1, γ2, γ3, γ4],

C =
[ c1  c2  c3  0   0   0   0   0   0   0    0    0
  0   0   0   c4  c5  c6  0   0   0   0    0    0
  0   0   0   0   0   0   c7  c8  c9  0    0    0
  0   0   0   0   0   0   0   0   0   c10  c11  c12 ],

B =
[ 0  0  0  b1
  0  0  0  b2
  0  0  0  b3
  0  0  0  0  ],

F = [γ12, γ13, γ23, γ123] = [γ1∘γ2, γ1∘γ3, γ2∘γ3, γ1∘γ2∘γ3],

and D =
[ 0  ⋯  0  b4
  0  ⋯  0  b5
  0  ⋯  0  b6
  0  ⋯  0  b7 ],

where each row of D consists of fifteen zeros followed by a single path coefficient in the sixteenth column, the column corresponding to γ4.

Figure 1. The specified model for the simulation study.

2.2.2. Parameter Estimation. The unknown parameters of the proposed method (W, A, D, P, and Q) are estimated by minimizing the least-squares optimization criterion

φ = SS([ZV, F] − [ZW, F] A*) = SS(Ψ* − Γ*A*),  (8)

with respect to W, A, D, P, and Q, subject to the identification constraints diag(W′Z′ZW) = I and diag(F′F) = I, where SS(M) = tr(M′M). The second constraint, diag(F′F) = I, is imposed to normalize each product of interacting latent variables (i.e., each latent interaction), because such a product is not normalized even though each interacting latent variable is.

An alternating least-squares algorithm is developed to minimize (8). The algorithm consists of the following two main steps.

Step 1: A* (or equivalently A, D, P, and Q) is updated for fixed W. Minimizing (8) with respect to A* is equivalent to minimizing

φ = SS(Ψ* − Γ*A*) = SS(vec(Ψ*) − (I ⊗ Γ*) vec(A*)),  (9)

where vec(M) denotes the supervector consisting of all columns of M stacked one below another, and ⊗ denotes the Kronecker product. Let a denote the vector of non-zero elements in vec(A*), and let X denote the matrix formed by the columns of I ⊗ Γ* corresponding to those non-zero elements. Then the least-squares estimate of a is obtained by

â = (X′X)⁻¹ X′ vec(Ψ*).  (10)

The updated A* is reconstructed from â.
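A minimal sketch of this Step 1 update is given below, before Step 2 is described. The zero/non-zero pattern of A* is supplied through a user-defined mask, the data are random placeholders rather than the model of Figure 1, and the use of a numerically stable least-squares routine instead of an explicit inverse is a choice of this sketch, not of the original paper.

```python
import numpy as np

def update_A_star(Psi_star, Gamma_star, free_mask):
    """Step 1 of the ALS algorithm: with W (hence Gamma*) fixed, solve
    min || vec(Psi*) - (I kron Gamma*) vec(A*) ||^2 over the elements of A*
    flagged as free in `free_mask` (a boolean array shaped like A*)."""
    n_rows, n_cols = free_mask.shape
    y = Psi_star.reshape(-1, order="F")                  # vec(Psi*): stack columns
    X_full = np.kron(np.eye(n_cols), Gamma_star)         # I kron Gamma*
    free_idx = np.flatnonzero(free_mask.reshape(-1, order="F"))
    X = X_full[:, free_idx]                              # columns of the free elements
    a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares estimate (10)
    a_full = np.zeros(n_rows * n_cols)
    a_full[free_idx] = a_hat
    return a_full.reshape((n_rows, n_cols), order="F")   # rebuild A* from a_hat

# Tiny illustration with random placeholder data
rng = np.random.default_rng(2)
Gamma_star = rng.standard_normal((50, 3))
A_true = np.array([[0.0, 0.5], [0.3, 0.0], [0.0, 0.2]])
Psi_star = Gamma_star @ A_true + 0.01 * rng.standard_normal((50, 2))
print(np.round(update_A_star(Psi_star, Gamma_star, A_true != 0), 2))
```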
Step 2: W is updated for fixed A*. This is equivalent to minimizing (8) with respect to W. In (8), V subsumes each column of W, which contains the weights of the observed variables for a single latent variable (i.e., V = [I, W]). F may also share some columns of W because latent interactions are defined as products of latent variables. Let w_t denote the t-th column of W, which is shared by the p-th column of V and by the r-th column of F, where p = J + t (t = 1, ..., T; r = 1, ..., R). Let G denote Γ* with the columns involving w_t replaced by vectors of zeros, and let H denote Ψ* with the columns containing w_t replaced by vectors of zeros. Let e1 and e2 denote 1 by (J + T + R) vectors whose elements are all zeros except that the p-th and k-th elements, respectively, equal unity, where k = J + T + r. Let a1 denote the t-th row of A*, and a2 the s-th row of A*, where s = T + r. Let R denote an N by N diagonal matrix consisting of the individual scores of a latent variable, or of a product of latent variables, that enters a latent interaction but does not involve w_t. For instance, if γ12 = γ1 ∘ γ2 = Zw1 ∘ Zw2 and t = 1, then R = diag(γ2). To update w_t, (8) can generally be rewritten as

φ = SS(Ψ* − Γ*A*)
  = SS(vec(Zw_t e1 + RZw_t e2 + G) − vec(Zw_t a1 + RZw_t a2 + H))
  = SS((e1′ ⊗ Z)w_t + (e2′ ⊗ RZ)w_t − (a1′ ⊗ Z)w_t − (a2′ ⊗ RZ)w_t − vec(H − G))
  = SS(Ω w_t − vec(Λ)),  (11)

where Ω = (e1′ ⊗ Z) + (e2′ ⊗ RZ) − (a1′ ⊗ Z) − (a2′ ⊗ RZ) and Λ = H − G. Let ρ_t denote the vector formed by eliminating the zero elements from w_t, and let Y denote the matrix formed by eliminating the columns of Ω corresponding to those zero elements. Then the least-squares estimate of ρ_t is obtained by

ρ̂_t = (Y′Y)⁻¹ Y′ vec(Λ).  (12)

The updated w_t is recovered from ρ̂_t and normalized to satisfy the normalization constraints (cf. ten Berge, 1993).

These two steps are alternated until convergence. To safeguard against convergence to local minima, the algorithm can be applied multiple times with different random starts each time; the solution associated with the smallest value of the optimization function may be regarded as the global one.

As in GSCA, the proposed method provides two measures of overall model fit: FIT and Adjusted FIT (AFIT). The FIT indicates the total variance of all endogenous variables explained by a particular model specification. It is given by FIT = 1 − (φ/α), where α = SS(Ψ*). The FIT ranges from 0 to 1; the larger the value, the more variance in the variables is accounted for by the specified model. However, this measure is affected by model complexity: the more parameters, the larger the FIT. The AFIT was therefore developed to take model complexity into account (Hwang et al., 2007). It is given by AFIT = 1 − (1 − FIT)(df0/df1), where df0 = NJ is the degrees of freedom for the null model, in which all parameters are equal to zero, and df1 = NJ − g is the degrees of freedom for the model being tested, with g the number of free parameters. The model that maximizes the AFIT may be chosen over competing models.

The proposed method employs the bootstrap method (Efron, 1982) to estimate the standard errors of parameter estimates. The bootstrap critical ratio (i.e., a parameter estimate divided by its bootstrap standard error) may be used to examine the significance of a parameter estimate. For example, a parameter estimate with a bootstrap critical ratio of two or greater in absolute value may be considered significant, provided the bootstrap distribution of the estimate is approximately normal.

3. A Monte Carlo Simulation

A Monte Carlo simulation study was conducted to investigate the performance of the proposed method in parameter recovery. The structural equation model displayed in Figure 1 was used for the simulation study. This model is equivalent to (5), since it does not involve an endogenous latent interaction. The parameter values were chosen as follows: w's = 0.4167, c's = 0.8, and b's = 0.3. Moreover, the covariance matrix of the residuals E, denoted by Σ_E, was chosen as

Σ_E = diag[0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.36, 0.64, 0.64, 0.64, 0.2368].

Five different sample sizes were considered in the study (N = 50, 100, 200, 400, and 1000). At each sample size, five hundred samples were randomly generated from a multivariate normal distribution with means FD(Π)⁻¹ and covariance matrix Σ = (Π′)⁻¹ Σ_E (Π)⁻¹, where Π = V − WA (refer to Hwang & Takane, 2004).

To evaluate the accuracy of the parameter estimates obtained under the proposed method, we computed their mean square errors. The mean square error is the average squared difference between a parameter and its estimate; it indicates how far an estimator is, on average, from its parameter, so the smaller the mean square error, the closer the estimator is to the parameter.

Figure 2. Average finite-sample properties of the weight estimates of the proposed method across different sample sizes (RBIAS = relative bias, VAR = variance, MSE = mean square error). A dotted line indicates no relative bias.
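The following sketch shows one way the FIT, AFIT, and bootstrap critical ratios described above might be computed. The fitting routine is represented by a placeholder function `fit_gsca` that is assumed to return a scalar parameter estimate; it is not provided here, and the demo substitutes a simple correlation just to exercise the code.

```python
import numpy as np

def fit_indices(phi, Psi_star, N, J, n_free_params):
    """FIT = 1 - phi/alpha with alpha = SS(Psi*); AFIT penalizes model complexity."""
    alpha = np.sum(Psi_star ** 2)             # SS(M) = tr(M'M)
    fit = 1.0 - phi / alpha
    df0, df1 = N * J, N * J - n_free_params   # null-model and tested-model dof
    afit = 1.0 - (1.0 - fit) * df0 / df1
    return fit, afit

def bootstrap_critical_ratio(Z, fit_gsca, n_boot=200, seed=0):
    """Bootstrap standard error of a parameter estimate and its critical ratio."""
    rng = np.random.default_rng(seed)
    theta_hat = fit_gsca(Z)                   # point estimate on the full sample
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, Z.shape[0], Z.shape[0])   # resample rows with replacement
        boots.append(fit_gsca(Z[idx]))
    se = np.std(boots, ddof=1)
    return theta_hat / se                     # |ratio| >= 2 taken as roughly significant

# Placeholder numbers and a stand-in "estimator", just to exercise the functions
print(fit_indices(phi=120.0, Psi_star=np.ones((100, 5)), N=100, J=5, n_free_params=12))
Z_demo = np.random.default_rng(1).standard_normal((100, 2))
Z_demo[:, 1] += 0.5 * Z_demo[:, 0]
print(bootstrap_critical_ratio(Z_demo, lambda Z: np.corrcoef(Z.T)[0, 1]))
```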
Specifically, the mean square error (MSE) is given by

MSE(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E(θ̂))²] + [E(θ̂) − θ]²,  (13)

where θ and θ̂ denote a parameter and its estimate, respectively. In (13), the first and second terms represent the variance and the squared bias of the estimator, respectively. The mean square error is thus regarded as a standard criterion for assessing the accuracy of an estimator, since it carries information on both the bias and the variability of the estimator (Mood, Graybill, & Boes, 1974).

Figures 2, 3, and 4 display the average relative bias, variance, and MSE of the estimates of weights, loadings, and path coefficients, respectively, across the different sample sizes. In the present study, absolute values of relative bias greater than ten percent are considered indicative of an unacceptable degree of bias (Bollen, Kirby, Curran, Paxton, & Chen, 2007; Lei, 2009). As shown in these figures, the proposed method on average yielded negatively biased estimates of weights and path coefficients, whereas it resulted in positively biased estimates of loadings. The average relative bias of each set of estimates tended to remain similar across sample sizes. These patterns of bias in the parameter estimates are expected because GSCA involves components, or weighted composites, of observed variables, as in principal components analysis or canonical correlation analysis (e.g., Fornell & Cha, 1994; Widaman, 1990). A recent simulation study also reported the same patterns of bias in GSCA estimators (Hwang et al., in press). Nonetheless, the degree of bias appears acceptable for all sets of estimates, because it was smaller than 10% in absolute value. Moreover, the parameter estimates of the proposed method were associated with quite small variances, and these variances tended to decrease as sample size increased.

Figure 3. Average finite-sample properties of the loading estimates of the proposed method across different sample sizes (RBIAS = relative bias, VAR = variance, MSE = mean square error). A dotted line indicates no relative bias.

Lastly, the proposed method produced very small average MSE values for all parameter estimates across sample sizes, and these values tended to decrease as sample size increased. Thus, this simulation study showed that the proposed method recovered parameters well. The estimates of the proposed method involved a slight yet tolerable level of bias, while showing a very small amount of variability. In turn, the estimates had very small mean square errors (close to zero), indicating that they were quite close to the parameters on average.

4. An Empirical Application

The present example comes from a longitudinal study investigating the effects of materialistic values on depression and anxiety (Abela, Ho, Webb, & McWhinnie, 2008). In this study, participants were repeatedly measured on their levels of materialism, stress, depression, and anxiety over multiple time points (weeks). The Aspirations Index (ASPQ) (Kasser & Ryan, 1996) was used to measure participants' levels of materialism. The ASPQ consists of 35 items, each of which presents participants with a particular life goal and asks them to rate how important they view each goal on a 7-point scale (1 = "not at all important" to 7 = "very important").
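Returning briefly to the simulation-evaluation criteria defined in (13) above, the sketch below shows how relative bias, variance, and MSE might be tabulated from Monte Carlo replicates. The replicate values used here are synthetic placeholders, not the study's results.

```python
import numpy as np

def finite_sample_summary(estimates, true_value):
    """Relative bias (%), variance, and MSE of a set of Monte Carlo estimates."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    rbias = 100.0 * bias / true_value
    var = estimates.var(ddof=0)
    mse = var + bias ** 2                     # decomposition in (13)
    return rbias, var, mse

# Synthetic example: 500 replicate estimates of a path coefficient with true value 0.3
rng = np.random.default_rng(3)
reps = 0.29 + 0.05 * rng.standard_normal(500)   # slight negative bias, small spread
print(finite_sample_summary(reps, 0.3))
```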
The ASPQ items are related to six different types of goals: financial success, social recognition, attractive appearance, self-acceptance, affiliation, and community feeling. The first three types (financial success, social recognition, and attractive appearance) indicate extrinsic values, whereas the other three (self-acceptance, affiliation, and community feeling) are intrinsic values. To assess the relative importance that a given participant places on materialistic values, the average score of each of the three extrinsic values was calculated, and the average of all intrinsic value scores combined was subtracted from it. The resultant three sets of scores were used as indicators for the latent variable of materialism.

Figure 4. Average finite-sample properties of the path coefficient estimates of the proposed method across different sample sizes (RBIAS = relative bias, VAR = variance, MSE = mean square error). A dotted line indicates no relative bias.

The Center for Epidemiological Studies Depression Scale (CES-D) (Radloff, 1977) was used to assess participants' depressive symptoms. The CES-D is a 20-item self-report measure designed for the general population. For each item, participants were asked to indicate how often they had experienced the particular symptom over the past week, from 1 (rarely or none of the time) to 4 (most or all of the time). The 62-item version of the Mood and Anxiety Symptom Questionnaire (MASQ) (Watson & Clark, 1991) was utilized to assess both specific and non-specific anxious symptoms of participants. For each item, participants were asked to rate, on a scale of 1 to 5, the extent to which they had felt this way during the past 24 hours. A 30-item abbreviated version of the General, Academic, Social Hassles Scale for Students (GASHSS) (Blankstein & Flett, 1993) was adopted to assess participants' levels of stress. The GASHSS comprises items assessing general hassles (8 items), academic hassles (10 items), and social hassles (12 items). For each item, participants were asked to rate how persistent the given hassle was (i.e., its frequency and duration) over the last 7 days (0 = "no hassle; not at all persistent" to 6 = "extremely persistent hassle; high frequency and/or duration"). For our analysis, items from the measures of depression, anxiety, and stress were combined to form item parcels, which subsequently served as indicators for these three latent variables. Three indicators (i.e., item parcels) were used for each of the latent variables.

Figure 5. The structural model specified for the empirical analysis.

Figure 5 displays the structural model specified for our analysis. As shown in the model, the level of materialism measured at the first time point (M) was assumed to influence the levels of depression (D3 and D5) and anxiety (A3 and A5) measured at time points 3 and 5. Materialism may be defined as "a set of centrally held beliefs about the importance of possessions in one's life" (Richins & Dawson, 1992). Although materialism is regarded as a form of modernistic survival value (Egri & Ralston, 2004), it was found to have a negative association with individual well-being over time (Kasser & Ryan, 1993, 1996; Sheldon & Elliot, 1999; Sheldon & Kasser, 1998). In addition, the level of stress was hypothesized to affect the levels of depression and anxiety measured at a later time point. Specifically, the level of stress measured at time point 2 (S2) had an effect on the levels of depression and anxiety measured at time point 3 (D3 and A3), while the level of stress measured at time point 4 (S4) was hypothesized to affect the levels of depression and anxiety measured at time point 5 (D5 and A5).
Furthermore, the level of materialism was specified to moderate the relationships between stress and depression and between stress and anxiety at later time points. Thus, two latent interactions between stress and materialism (M*S2 and M*S4) were assumed to affect the levels of depression and anxiety measured at the two time points. These latent interaction effects were considered because (1) materialistic individuals' self-esteem is likely to be particularly vulnerable in the face of threat or stress; (2) such individuals are likely to lack social support networks in times of stress; and (3) individuals with extrinsic motivations (i.e., materialism) will likely perceive negative events as relatively more stressful. Lastly, the levels of depression and anxiety measured at time point 3 were hypothesized to affect their levels at time point 5. Moreover, the level of stress measured at time point 2 was specified to have an effect on its level measured at time point 4. In addition, the latent interaction between materialism and stress measured at time point 2 (M*S2) was specified to influence the latent interaction between materialism and stress measured at time point 4 (M*S4). Thus, the latter latent interaction is an endogenous variable affected by the former, which implies that Q ≠ 0 in the proposed model (7).

The proposed method was applied to fit the specified model to the data. The model provided FIT = 0.839 and AFIT = 0.835, indicating that it accounted for about 84% of the total variance of the endogenous variables.
Econometrics Lecture Notes 4, Chapter 2: Single-Variable Regression

Chapter 2: Single-Variable Regression

Regression analysis is a statistical method for clarifying the causal relationship between two or more variables, and it is one of the most frequently used tools in econometrics. One might even say that the purpose of econometrics is to improve regression analysis.

Subject of this chapter: the single-variable regression model. Main contents: the assumptions of the classical normal linear regression model; the key results of the ordinary least-squares (OLS) regression model.

1. The Classical Normal Linear Regression Model

1-1 Regression Analysis

(1) We express the relationship between two variables Y and X as a linear function. Specifically,

Y_t = α + βX_t + u_t,  t = 1, 2, ..., n,

and such a model is called a single-variable regression model. Here, X is the variable representing the cause, called the explanatory variable or independent variable; Y is the variable representing the result, called the explained variable or dependent variable; and u is the error term, also called the disturbance term, which represents the part of the variation in Y that cannot be accounted for by variation in X, that is, the difference between the observed Y and the theoretical Y.

Why do we need an error term? Because an exact mathematical model can explain very few phenomena, and current practice prefers to represent the uncertainty of economic variables with random variables.

(2) The purpose of regression analysis. The main purpose is to estimate the parameters α, β, and σ² (yielding α̂, β̂, and σ̂²) and to test the significance of the estimates. The most commonly used method is ordinary least squares (OLS).

1-2 The Five Basic Assumptions

A. The classical normal linear regression model rests on the following five assumptions:

(1) the error term u_t has mean zero, i.e., (1/n) Σ_{t=1}^{n} u_t = 0;
(2) the error terms u_1, u_2, ..., u_n are mutually uncorrelated, i.e., E(u_i u_j) = 0, or cov(u_i, u_j) = 0, for i ≠ j;
(3) the error terms have a common variance σ², where σ² is unknown;
(4) the explanatory variables X_1, X_2, ..., X_n can be specified (fixed), i.e., X_1, X_2, ..., X_n are not random variables;
(5) the error terms follow a normal distribution.
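As a concrete illustration of estimating α, β, and σ² in the model Y_t = α + βX_t + u_t by ordinary least squares, here is a short sketch. The data are simulated placeholders, and the formulas are the standard OLS expressions rather than anything specific to these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, n)
u = rng.normal(0, 1.5, n)
Y = 2.0 + 0.7 * X + u                      # "true" alpha = 2.0, beta = 0.7

beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - alpha_hat - beta_hat * X
sigma2_hat = np.sum(resid ** 2) / (n - 2)  # unbiased estimate of sigma^2

se_beta = np.sqrt(sigma2_hat / np.sum((X - X.mean()) ** 2))
t_beta = beta_hat / se_beta                # significance test for beta
print(alpha_hat, beta_hat, sigma2_hat, t_beta)
```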
Monitoring Shifts in Mean: Asymptotic Normality of Stopping Times

Keywords and Phrases: Asymptotic normality; change in the mean; CUSUM; Sequential detection.
The detector statistic involves partial sums of the form

Σ_{m<i≤m+k} (X_i − X̄_m),  (1.3)

where σ̂²_m is an asymptotically consistent estimator for σ² = lim_{n→∞} Var(Σ_{1≤i≤n} ε_i)/n. The boundary function was chosen as
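A minimal sketch of sequential monitoring with a detector of the form (1.3) follows. Since the excerpt breaks off before the boundary function is stated, the boundary g(k) = c·√m·(1 + k/m) used here is purely an assumption for illustration, as are the simulated data and the simple training-sample estimate of σ.

```python
import numpy as np

def monitor_mean_shift(x_train, x_stream, c=3.0):
    """Stop at the first k with |sum_{m<i<=m+k}(X_i - Xbar_m)| / sigma_hat > g(k).
    The boundary g(k) = c * sqrt(m) * (1 + k/m) is an assumed illustrative choice."""
    m = len(x_train)
    xbar_m = x_train.mean()
    sigma_hat = x_train.std(ddof=1)          # sigma estimated from the training sample
    cusum = 0.0
    for k, x in enumerate(x_stream, start=1):
        cusum += x - xbar_m
        if abs(cusum) / sigma_hat > c * np.sqrt(m) * (1.0 + k / m):
            return k                          # stopping time: change signalled
    return None                               # no change detected

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 200)                       # historical (no-change) data
stream = np.concatenate([rng.normal(0.0, 1.0, 100),     # still in control
                         rng.normal(1.0, 1.0, 200)])    # mean shifts by 1 afterwards
print(monitor_mean_shift(train, stream))
```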
On the Estimation of Error-Correcting Parameters

On the Estimation of Error-Correcting Parameters

Juan-Carlos Amengual, Dpto. de Informática, Universidad Jaume I, Campus de Riu Sec, 12080 Castellón, Spain, jcamen@inf.uji.es
Enrique Vidal, Instituto Tecnológico de Informática (ITI), Camino de Vera s/n, 46071 Valencia, Spain, evidal@iti.upv.es

Abstract

Error-correcting (EC) techniques allow for coping with divergences in pattern strings with regard to their "standard" form as represented by the language accepted by a regular or context-free grammar. There are two main types of EC parsers: minimum-distance and stochastic. The latter apply the maximum-likelihood rule: classification into the classes of the strings in the language that have the greatest probability given the strings representing unknown patterns. Stochastic models are important in pattern recognition if good estimates of their parameters are provided. The problem of parameter estimation has been well studied for stochastic grammars, but this is not the case for EC parameters. This work is aimed at providing solutions to solve it adequately.

1. Introduction

Many syntactic pattern recognition systems rely on the use of error-correcting (EC) techniques to be able to cope with divergences in pattern strings with regard to their "standard" form as represented by regular or context-free grammars [6, 5]. Divergences ("noise") can be due to: i) inaccuracy in the preprocessing and feature-extraction modules; ii) imperfections of the grammar. EC operation can be described as follows: a transmitter sends a string belonging to the reference language, and a possibly corrupted string is finally received. The noise introduced by the channel can be represented by an error model accounting for insertion, substitution (including non-error substitutions of symbols for themselves) and deletion editing operations (I-S-D). The noise only affects individual symbols and makes the reception of strings not belonging to the reference language possible. An EC parser (ECP) makes a reasonable estimate of the transmitted string. Two kinds of ECPs have been traditionally considered, minimum-distance and stochastic; see (1). In both cases, classification is performed according to the class assigned to the string actually received [5]. Superior performance is expected when adequately estimated stochastic parameters are used [8, 1]. However, although there are well-known robust methods for estimating the parameters of stochastic finite-state models [4], this is not the case for the estimation of the parameters of the error model. As far as we know, this issue has been addressed only in [10, 2, 8, 1].

Henceforth, symbols (primitives) are represented by lowercase letters, and there is a designated null symbol; alphabets (collections of symbols) are represented by capital Greek letters; and strings (concatenations of symbols) are represented by lowercase letters with a bar on top, including a null string. Pairs of symbols stand for the substitution of one symbol by another, the insertion of a symbol, and the deletion of a symbol, respectively, and a cost function yields the cost (probability or frequency of use) of the corresponding I-S-D operation. Note that insertion parameters are tied with regard to the conventional definition given in [5]. The error model is consistent (see Figure 1) if condition (2) holds. A parameter is needed to account for the probability of string end in order to keep the model consistent, since one or more insertions could be performed at the end of the string. This parameter is introduced by defining a special end-of-string symbol, with which only one editing operation may be performed.

Figure 1. A Markov model representing the string (dotted lines).

If the frequencies of use, or "counts", of the EC parameters are provided, a possible solution to
fulfil equation (2), while keeping the "weight" of insertions adequately distributed among substitutions and deletions, is given by equations (3) and (4). No crossings can exist in the alignment, i.e., no two aligned pairs may cross, and each alignment position is represented in an I-S-D operation by the symbol at that position. Equation (7), obtained from (5) and (6), leads to equation (8); applied to substitutions, (7) results in equations (9) and (10), and similar expressions hold for insertions and deletions [1].

In practice, (9) (and therefore (10)) are computed only approximately, since only "Viterbi-like" (maximising instead of summing probabilities) algorithms are available for this computation. This is due to the computational problems posed by series of consecutive deletions in recursive sequences of derivations (this amounts to computing sums of infinite series of probabilities). Our current research is mainly focused on developing an efficient Forward-Backward algorithm following ideas in [9]. Therefore, the iterative re-estimation procedure proposed to maximise this function consists of the following steps [1, 2]:

1. The counts of I-S-D operations are set to 0, and adequate initial values are given to the error-model probabilities. If initial values of the counts are provided, the probabilities are computed following equations (3) and (4).
2. For each sentence, the best error-correcting parse is computed and the associated derivation is recovered.
3. The counts of I-S-D operations are updated accordingly (adding 1 to the corresponding count each time an error rule was employed).
4. Steps 2 and 3 are repeated until a given convergence criterion is met, using the values estimated in the previous iteration to compute the parses.

4. Coping with data sparseness

The main problem very often faced when using stochastic EC is the sparseness (or lack) of (specific) training data to estimate EC parameters. This problem can be partially solved by using the technique proposed in Section 2. However, the use of this approach raises another problem: training pairs are needed in order to perform adequate estimations. If no such training pairs are available, we are forced to employ the technique described in Section 3. But "Viterbi" methods are very sensitive to parameter initialisation and (mainly) to data sparseness. Thus, the application of a procedure based on the well-known cross-validation method is proposed to overcome these problems [7]. A short description follows, assuming that only a training corpus is available for learning [1].

1. Different blocks of the training corpus are defined. Each block is partitioned into two disjoint subsets.
2. For each block,
   (a) a grammar is built from one subset by means of grammatical inference techniques [11];
   (b) the other subset is used to estimate the parameters of the error model; the Viterbi re-estimation technique described in Section 3 is employed to this end;
   (c) an estimation of the frequencies of use of the error parameters is obtained as a result.
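As a rough sketch of steps 2 and 3 of the re-estimation procedure above, the following code aligns each observed string with a reference string by a Levenshtein-style dynamic program, traces back one best derivation, and accumulates insertion, substitution, and deletion counts. It is a generic edit-distance illustration under simplifying assumptions (a single reference string, unit costs), not the parser of [1].

```python
from collections import Counter

def isd_counts(reference, observed):
    """Counts of insertion/substitution/deletion ops on a best Levenshtein alignment."""
    n, m = len(reference), len(observed)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): dist[i][0] = i
    for j in range(1, m + 1): dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j] + 1,      # delete reference[i-1]
                             dist[i][j - 1] + 1,      # insert observed[j-1]
                             dist[i - 1][j - 1] + (reference[i - 1] != observed[j - 1]))
    counts = Counter()
    i, j = n, m
    while i > 0 or j > 0:                             # trace back one optimal derivation
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (reference[i - 1] != observed[j - 1]):
            counts[("sub", reference[i - 1], observed[j - 1])] += 1   # includes non-error subs
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            counts[("ins", observed[j - 1])] += 1
            j = j - 1
        else:
            counts[("del", reference[i - 1])] += 1
            i = i - 1
    return counts

total = Counter()
for ref, obs in [("abba", "abca"), ("abba", "abba"), ("abba", "aba")]:
    total += isd_counts(ref, obs)
print(total)
```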
3. The values of the estimated frequencies obtained for the different blocks are added, thus yielding the final counts. The grammars serve only as a "guide" for the estimation of the EC parameters.

4.1. A bootstrap approach to collect training data

A possible solution to the lack of specific data (training pairs) for estimating EC parameters consists in the use of a bootstrapping procedure for both estimating them and collecting suitable training data, as described below:

1. The system is started with the grammar and a non-stochastic error model using, for instance, the well-known Levenshtein distance. The probabilities of the error model are ignored.
2. A small labelled corpus is used to bootstrap the system. If the classification performed by the system matches the class of the corresponding input string, then: i) the string obtained by following the derivation in the grammar is recovered and considered as the noisy-channel input; ii) the string input to the system is considered as the noisy-channel output; and iii) the pair is added to a training corpus. Alternatively, an EC parser that returns the several best hypotheses [1] can be employed, allowing all pairs formed by joining the input string with the strings that match its class to be collected.
3. Baum-Welch estimation of the EC parameters is carried out using the training corpus.
4. Steps 2 and 3 are repeated, now performing stochastic EC parsing and using the same labelled training corpus, until no more pairs are appended to the training corpus.
5. The whole process is repeated with new labelled input strings, appending new training pairs, until the performance of the system is satisfactory.

5. Smoothing parameters

Even if EC techniques are used, it is likely that some pattern strings will be rejected. This can happen if some EC parameters have been under-estimated. To solve this problem, a set of four constants can be used to smooth the relative frequencies of use (or "counts") of the error parameters: one added to the counts of all possible insertion errors; one added to the counts of all possible deletion errors; one added to the counts of all possible substitution errors; and one added to the counts of all possible substitutions of symbols by themselves. A more elaborate solution consists in discounting a given quantity from seen events (estimated EC parameters) and then distributing this quantity equally among unseen events (under-estimated EC parameters). Alternatively, the distribution among unseen events can be proportional to a given "ratio", for instance the relative frequency of primitives (symbols of the alphabet) in a given corpus. A simple sketch of this kind of count smoothing is given after the reference list below.

6. Concluding remarks

Stochastic models are of paramount importance in the development of pattern recognition systems (PRSs). On the other hand, EC techniques are also regarded as a central part of building practical application systems. However, due to the lack of adequate procedures for EC parameter estimation, only a few PRSs have been developed using stochastic EC parsing. This is mainly due to the usually high degree of human expertise required to set the values of these parameters manually. Several methods and procedures intended to solve not only the EC parameter estimation problem, but also common problems frequently faced in practical situations, such as parameter smoothing and training-data sparseness, have been proposed in this paper.

References

[1] J. C. Amengual. Técnicas de Corrección de Errores y su Aplicación en Reconocimiento de Formas, Tratamiento del Lenguaje Natural y Traducción Automática. PhD thesis, Depto. de Sistemas Inf. y Computación, Univ. Politécnica de Valencia, Valencia (Spain), January 1999. In Spanish.
[2] J. C. Amengual, E. Vidal, and J. M. Benedí. Simplifying language through error-correcting decoding. In Proceedings of ICSLP 96, pages 841-844, PA (USA), October 1996.
[3] L. E. Baum and J. Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73:360-363, May 1967.
[4] F. Casacuberta. Growth transformations for probabilistic functions of stochastic grammars. International Journal of Pattern Recognition and Artificial Intelligence, 10(3):183-201, 1996.
[5] K. S. Fu. Syntactic Pattern Recognition and Applications. Prentice Hall, Englewood Cliffs, New Jersey, 1982.
[6] R. C. Gonzalez and M. G. Thomason. Syntactic Pattern Recognition: An Introduction. Addison-Wesley, Reading, Massachusetts, 1978.
[7] S. J. Raudys and A. K. Jain. Small sample effects in statistical pattern recognition: Recommendations for practitioners and open problems. IEEE PAMI, 13(3):252-263, 1991.
[8] E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE PAMI, 20(5):522-532, May 1998.
[9] G. Rote. Path problems in graphs. In G. Tinhofer, E. Mayr, H. Noltemeier, and M. M. Syslo, editors, Computational Graph Theory, Computing Supplementum 7, pages 155-189. Springer-Verlag, 1990.
[10] H. Rulot and E. Vidal. An efficient algorithm for the inference of circuit-free automata. In G. Ferraté et al., editors, Syntactic and Structural Pattern Recognition, volume F45 of NATO ASI, pages 173-183. Springer-Verlag, 1988.
[11] E. Vidal, F. Casacuberta, and P. García. Grammatical inference and automatic speech recognition. In A. Rubio and J. López, editors, Speech Recognition and Coding, New Advances and Trends, NATO Advanced Study Institute, pages 174-191. Springer-Verlag, Berlin, 1995.
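The sketch below illustrates the simple additive smoothing described in Section 5: four constants are added to the insertion, deletion, error-substitution, and identity-substitution counts before the counts are turned into probabilities. The per-reference-symbol normalization is an assumption of this sketch (the paper's consistency condition (2) is not reproduced in the excerpt), and the alphabet and counts are toy values.

```python
from collections import Counter

alphabet = ["a", "b", "c"]
counts = Counter({("sub", "a", "a"): 40, ("sub", "b", "b"): 35,
                  ("sub", "a", "b"): 3, ("del", "b"): 2, ("ins", "c"): 1})

k_ins, k_del, k_sub, k_id = 0.5, 0.5, 0.1, 1.0   # the four smoothing constants

smoothed = Counter(counts)
for s in alphabet:
    smoothed[("ins", s)] += k_ins
    smoothed[("del", s)] += k_del
    for t in alphabet:
        smoothed[("sub", s, t)] += k_id if s == t else k_sub

def probabilities(smoothed_counts, alphabet):
    """Turn smoothed counts into probabilities (an assumed form of consistency)."""
    probs = {}
    for s in alphabet:
        mass = smoothed_counts[("del", s)] + sum(
            smoothed_counts[("sub", s, t)] for t in alphabet)
        probs[("del", s)] = smoothed_counts[("del", s)] / mass
        for t in alphabet:
            probs[("sub", s, t)] = smoothed_counts[("sub", s, t)] / mass
    ins_mass = sum(smoothed_counts[("ins", t)] for t in alphabet)
    for t in alphabet:
        probs[("ins", t)] = smoothed_counts[("ins", t)] / ins_mass
    return probs

print(probabilities(smoothed, alphabet))
```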
Abstract: The article is devoted to investigation of bias and mean-square deviation while estimating smooth functionals on measure spaces. Such problems appear when one uses a Monte-Carlo procedure to solve certain linear or nonlinear equations on measures (for example, nonlinear Boltzmann-like equations) and wants to estimate and diminish the part of the error corresponding to the simulation of an initial distribution. It is shown that by means of a special variant of the stratification technique both the bias and the variance of this part of the error may be reduced to o(1/n) instead of the usual O(1/n). Several simple examples of stratification for linear functionals (i.e., integrals) are presented.

Keywords and phrases: Non-linear functionals, measure-valued random variables, stratification technique.

1 Introduction

We start with an example elucidating the problem under consideration. Let (D, ρ) be some metric space with Borel σ-algebra B. Consider a linear evolution equation on measures, referred to below as (1); its solution μ_t = μ_t(μ), started from a probabilistic initial measure μ, is a probabilistic measure too.

The usual Monte-Carlo procedure for estimating the integral Φ(μ_t) = ∫_D f dμ_t may be described as follows. One simulates n independent copies ξ_i(s) of a jump Markov process with initial distribution L(ξ_i(0)) = μ, jump law T, and independent exponentially distributed (with unit scale parameter) time intervals between jumps, and takes

η_n = (1/n) Σ_{i=1}^{n} f(ξ_i(t))

as the estimator of Φ(μ_t).

Let us consider this procedure from another viewpoint. If we collect the processes ξ_i(s) into one n-particle process ξ^(n)(s) = (ξ_1(s), ..., ξ_n(s)) ∈ D^n, then ξ^(n)(s) is also a jump Markov process. Its exponential scale parameter is equal to n, and its jump law may be expressed as follows. If u = (u_1, ..., u_n) is the phase coordinate of the n-particle process before a jump, we randomly choose one of the particles and, supposing this particle has number i, it jumps to a new position according to the distribution T(·, u_i). The initial distribution of the process is equal to μ^⊗n (here μ_1 ⊗ μ_2 stands for the outer product of the measures μ_1 and μ_2, and μ^⊗n = μ ⊗ ... ⊗ μ, n times). On the other hand, any n-particle process ξ^(n)(s) produces the corresponding measure-valued process

μ_s^(n) = (1/n) Σ_{i=1}^{n} δ_{ξ_i(s)},

where δ_x is the Dirac measure concentrated at the point x. Such processes may be called empirical. Surely, μ_s^(n) is also a jump Markov process. In terms of the empirical process we have η_n = Φ(μ_t^(n)), and the estimation error Φ(μ_t^(n)) − Φ(μ_t) may be represented in the form

Φ(μ_t^(n)) − Φ(μ_t(μ)) = [Φ(μ_t^(n)) − Φ(μ_t(μ_0^(n)))] + [Φ(μ_t(μ_0^(n))) − Φ(μ_t(μ))],   (2)

where μ_t(μ_0^(n)) stands for the solution of (1) with the initial value μ_0^(n). As the start point of the process μ_s^(n) is also μ_0^(n), the first addend on the right-hand side of (2) describes the difference between the evolution equation (1) and the Markov process μ_t^(n), while the second may be called the part of the error produced by the initial value μ_0^(n). Our aim is to investigate the statistical properties of this second addend.

Consider the empirical measure

μ_n = (1/n) Σ_{i=1}^{n} δ_{ξ_i}

with independent ξ_i, L(ξ_i) = μ. The second problem is to understand the role of the independence of the ξ_i. Surely, if the ξ_i are 'almost independent', the result must be the same as if they were independent. Can one choose dependent ξ_i = ξ_i^(n) in such a way that, for any smooth F, both bias and variance would be smaller than in the simplest case of independence? It occurs that generally the answer is positive: if one manages to apply the proper stratification technique, both error characteristics have the form o(1/n), while in the case of independence they are O(1/n). To demonstrate such propositions we must deal with functions defined on the appropriate measure spaces, their derivative mappings, etc. Definitions and the simplest properties of such objects are collected in Section 2. Section 3 contains the main theoretical results of the paper. Theorem 3.1 tells us that for independent ξ_i^(n), and under some smoothness conditions on F, the bias m_0(F) = EF(μ_n) − F(μ) has the form A(F)/n + o(1/n). As for A(F), it is a function of the second derivative d²F(μ)/dμ². The same result takes place if the ξ_i^(n) are 'almost independent'. More precisely, these random variables must be equidistributed with the distribution μ, their mutual distribution must be invariant under permutation, and the condition P_2^(n) = L(ξ_1^(n), ξ_2^(n)) = μ^⊗2 + o(1/n) must hold. Corollary 3.1 deals with s_0²(F) = E(F(μ_n) − F(μ))². It is proved that, under the same conditions, s_0²(F) = B(F)/n + o(1/n), with B(F) depending on the first derivative dF(μ)/dμ.
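To make the discussion concrete, here is a small illustrative sketch in the spirit of the procedure described above: a jump Markov process is simulated from n initial particles, and Φ(μ_t) = ∫ f dμ_t is estimated by the empirical average η_n. The jump law T, the function f, and the particular stratified way of drawing the initial points are all assumptions of this sketch, chosen only to contrast an i.i.d. initial sample with a stratified one.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_particle(x0, t, rng):
    """One jump Markov process on the real line: unit-rate exponential holding times
    and an assumed jump law T(.|x) = Normal(0.5 * x, 1)."""
    s, x = 0.0, x0
    while True:
        s += rng.exponential(1.0)
        if s > t:
            return x
        x = rng.normal(0.5 * x, 1.0)

def estimate(initial_points, t, f, rng):
    values = [f(simulate_particle(x0, t, rng)) for x0 in initial_points]
    return np.mean(values)                    # eta_n = (1/n) sum f(xi_i(t))

n, t = 1000, 2.0
f = lambda x: x ** 2

# (a) i.i.d. initial sample from mu = Uniform(0, 1)
iid_init = rng.uniform(0.0, 1.0, n)
# (b) stratified initial sample: one point drawn uniformly in each of n equal cells
strat_init = (np.arange(n) + rng.uniform(0.0, 1.0, n)) / n

print("iid estimate:       ", estimate(iid_init, t, f, rng))
print("stratified estimate:", estimate(strat_init, t, f, rng))
```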