Constrained stochastic language models


H2O.ai automated machine learning blueprint: documentation for a human-centered, low-risk AutoML framework


Beyond Reason Codes: A Blueprint for Human-Centered, Low-Risk AutoML
H2O.ai Machine Learning Interpretability Team, H2O.ai, March 21, 2019

Contents: Blueprint; EDA; Benchmark; Training; Post-Hoc Analysis; Review; Deployment; Appeal; Iterate; Questions

Blueprint
This mid-level technical document provides a basic blueprint for combining the best of AutoML, regulation-compliant predictive modeling, and machine learning research in the sub-disciplines of fairness, interpretable models, post-hoc explanations, privacy, and security to create a low-risk, human-centered machine learning framework. Look for compliance mode in Driverless AI soon. Blueprint guidance comes from leading researchers and practitioners.

EDA and Data Visualization
Know thy data. Automation implemented in Driverless AI as AutoViz. OSS: H2O-3 Aggregator. References: Visualizing Big Data Outliers through Distributed Aggregation; The Grammar of Graphics.

Establish Benchmarks
Establishing a benchmark from which to gauge improvements in accuracy, fairness, interpretability, or privacy is crucial for good ("data") science and for compliance.

Manual, Private, Sparse or Straightforward Feature Engineering
Automation implemented in Driverless AI as high-interpretability transformers. OSS: Pandas Profiler, Feature Tools. References: Deep Feature Synthesis: Towards Automating Data Science Endeavors; Label, Segment, Featurize: A Cross Domain Framework for Prediction Engineering.

Preprocessing for Fairness, Privacy or Security
OSS: IBM AI360. References: Data Preprocessing Techniques for Classification Without Discrimination; Certifying and Removing Disparate Impact; Optimized Pre-processing for Discrimination Prevention; Privacy-Preserving Data Mining. Roadmap items for H2O.ai MLI.

Constrained, Fair, Interpretable, Private or Simple Models
Automation implemented in Driverless AI as GLM, RuleFit, Monotonic GBM. References: Locally Interpretable Models and Effects Based on Supervised Partitioning (LIME-SUP); Explainable Neural Networks Based on Additive Index Models (XNN); Scalable Bayesian Rule Lists (SBRL). LIME-SUP, SBRL, and XNN are roadmap items for H2O.ai MLI.

Traditional Model Assessment and Diagnostics
Residual analysis, Q-Q plots, AUC and lift curves confirm the model is accurate and meets assumption criteria. Implemented as model diagnostics in Driverless AI.

Post-hoc Explanations
LIME and Tree SHAP implemented in Driverless AI. OSS: lime, shap. References: "Why Should I Trust You?": Explaining the Predictions of Any Classifier; A Unified Approach to Interpreting Model Predictions; Please Stop Explaining Black Box Models for High Stakes Decisions (criticism). Tree SHAP is roadmap for H2O-3; explanations for unstructured data are roadmap for H2O.ai MLI. (A small code sketch follows the Shapley timeline below.)

Interlude: The Time-Tested Shapley Value
1. In the beginning: A Value for N-Person Games, 1953
2. Nobel-worthy contributions: The Shapley Value: Essays in Honor of Lloyd S. Shapley, 1988
3. Shapley regression: Analysis of Regression in Game Theory Approach, 2001
4. First reference in ML? Fair Attribution of Functional Contribution in Artificial and Biological Networks, 2004
5. Into the ML research mainstream, i.e., JMLR: An Efficient Explanation of Individual Classifications Using Game Theory, 2010
6. Into the real-world data mining workflow, finally: Consistent Individualized Feature Attribution for Tree Ensembles, 2017
7. Unification: A Unified Approach to Interpreting Model Predictions, 2017
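A minimal sketch of Tree SHAP post-hoc explanations using the OSS `shap` package named above. The model, synthetic data, and column names are hypothetical stand-ins; Driverless AI's own implementation is not shown here.

```python
# Minimal sketch: Tree SHAP post-hoc explanations with the OSS `shap` package.
# Model, data, and column names are hypothetical.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical tabular data: replace with your own frame and target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["debt_to_income", "utilization", "age", "tenure"])
y = (X["debt_to_income"] + 0.5 * X["utilization"]
     + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer implements the Tree SHAP algorithm (Lundberg et al., 2017).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Per-row attributions, together with the expected value, reconstruct the model's
# margin output, giving consistent, locally accurate "reason codes" per prediction.
print(pd.DataFrame(shap_values, columns=X.columns).head())
```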
Model Debugging for Accuracy, Privacy or Security
Eliminating errors in model predictions by testing: adversarial examples, explanation of residuals, random attacks, and "what-if" analysis. OSS: cleverhans, pdpbox, what-if tool. References: Modeltracker: Redesigning Performance Analysis Tools for Machine Learning; A Marauder's Map of Security and Privacy in Machine Learning. Adversarial examples, explanation of residuals, measures of epistemic uncertainty, and "what-if" analysis are roadmap items in H2O.ai MLI.

Post-hoc Disparate Impact Assessment and Remediation
Disparate impact analysis can be performed manually using Driverless AI or H2O-3. OSS: aequitas, IBM AI360, themis. References: Equality of Opportunity in Supervised Learning; Certifying and Removing Disparate Impact. Disparate impact analysis and remediation are roadmap items for H2O.ai MLI. (A toy adverse impact ratio check is sketched at the end of this section.)

Human Review and Documentation
Automation implemented as AutoDoc in Driverless AI. Various fairness, interpretability, and model debugging roadmap items are to be added to AutoDoc. Documentation of considered alternative approaches is typically necessary for compliance.

Deployment, Management and Monitoring
Monitor models for accuracy, disparate impact, privacy violations, or security vulnerabilities in real time; track model and data lineage. OSS: mlflow, modeldb, awesome-machine-learning-ops metalist. Reference: ModelDB: A System for Machine Learning Model Management. Broader roadmap item for H2O.ai.

Human Appeal
Very important; may require custom implementation for each deployment environment.

Iterate: Use Gained Knowledge to Improve Accuracy, Fairness, Interpretability, Privacy or Security
Improvements and KPIs should not be restricted to accuracy alone.

Open Conceptual Questions
How much automation is appropriate: 100%? How to automate learning by iteration (reinforcement learning)? How to implement human appeals, and is it productizable?
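Returning to the post-hoc disparate impact slide above, here is a plain pandas sketch of the adverse impact ratio (the "four-fifths rule" style check). It is not aequitas or IBM AI360; group names, columns, and the 0.8 threshold are illustrative assumptions.

```python
# Minimal sketch of a post-hoc disparate impact check: the adverse impact ratio
# (rate of favorable outcomes per group divided by the reference group's rate).
# Column names and the 0.8 "four-fifths" threshold are illustrative assumptions.
import pandas as pd

def adverse_impact_ratio(scores: pd.DataFrame,
                         group_col: str = "group",
                         decision_col: str = "approved",
                         reference: str = "reference") -> pd.Series:
    """Favorable-outcome rate per group divided by the reference group's rate."""
    rates = scores.groupby(group_col)[decision_col].mean()
    return rates / rates[reference]

# Hypothetical model decisions.
df = pd.DataFrame({
    "group":    ["reference"] * 50 + ["protected"] * 50,
    "approved": [1] * 35 + [0] * 15 + [1] * 24 + [0] * 26,
})
air = adverse_impact_ratio(df)
print(air)                                        # ratio per group
print((air < 0.8).rename("below_four_fifths"))    # common screening threshold
```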
Discrimination.”In:Knowledge and Information Systems33.1.URL:https:///content/pdf/10.1007/s10115-011-0463-8.pdf,pp.1–33.Kanter,James Max,Owen Gillespie,and Kalyan Veeramachaneni(2016).“Label,Segment,Featurize:A Cross Domain Framework for Prediction Engineering.”In:Data Science and Advanced Analytics(DSAA),2016 IEEE International Conference on.URL:/static/papers/DSAA_LSF_2016.pdf.IEEE,pp.430–439.Kanter,James Max and Kalyan Veeramachaneni(2015).“Deep Feature Synthesis:Towards Automating Data Science Endeavors.”In:Data Science and Advanced Analytics(DSAA),2015.366782015.IEEEInternational Conference on.URL:https:///EVO-DesignOpt/groupWebSite/uploads/Site/DSAA_DSM_2015.pdf.IEEE,pp.1–10.Keinan,Alon et al.(2004).“Fair Attribution of Functional Contribution in Artificial and Biological Networks.”In:Neural Computation16.9.URL:https:///profile/Isaac_Meilijson/publication/2474580_Fair_Attribution_of_Functional_Contribution_in_Artificial_and_Biological_Networks/links/09e415146df8289373000000/Fair-Attribution-of-Functional-Contribution-in-Artificial-and-Biological-Networks.pdf,pp.1887–1915.Kononenko,Igor et al.(2010).“An Efficient Explanation of Individual Classifications Using Game Theory.”In: Journal of Machine Learning Research11.Jan.URL:/papers/volume11/strumbelj10a/strumbelj10a.pdf,pp.1–18.Lipovetsky,Stan and Michael Conklin(2001).“Analysis of Regression in Game Theory Approach.”In:Applied Stochastic Models in Business and Industry17.4,pp.319–330.Lundberg,Scott M.,Gabriel G.Erion,and Su-In Lee(2017).“Consistent Individualized Feature Attribution for Tree Ensembles.”In:Proceedings of the2017ICML Workshop on Human Interpretability in Machine Learning(WHI2017).Ed.by Been Kim et al.URL:https:///pdf?id=ByTKSo-m-.ICML WHI2017,pp.15–21.Lundberg,Scott M and Su-In Lee(2017).“A Unified Approach to Interpreting Model Predictions.”In: Advances in Neural Information Processing Systems30.Ed.by I.Guyon et al.URL:/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.Curran Associates,Inc.,pp.4765–4774.Papernot,Nicolas(2018).“A Marauder’s Map of Security and Privacy in Machine Learning:An overview of current and future research directions for making machine learning secure and private.”In:Proceedings of the11th ACM Workshop on Artificial Intelligence and Security.URL:https:///pdf/1811.01134.pdf.ACM.Ribeiro,Marco Tulio,Sameer Singh,and Carlos Guestrin(2016).“Why Should I Trust You?:Explaining the Predictions of Any Classifier.”In:Proceedings of the22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.URL:/kdd2016/papers/files/rfp0573-ribeiroA.pdf.ACM,pp.1135–1144.Rudin,Cynthia(2018).“Please Stop Explaining Black Box Models for High Stakes Decisions.”In:arXiv preprint arXiv:1811.10154.URL:https:///pdf/1811.10154.pdf.Shapley,Lloyd S(1953).“A Value for N-Person Games.”In:Contributions to the Theory of Games2.28.URL: http://www.library.fa.ru/files/Roth2.pdf#page=39,pp.307–317.Shapley,Lloyd S,Alvin E Roth,et al.(1988).The Shapley Value:Essays in Honor of Lloyd S.Shapley.URL: http://www.library.fa.ru/files/Roth2.pdf.Cambridge University Press.Vartak,Manasi et al.(2016).“Model DB:A System for Machine Learning Model Management.”In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics.URL:https:///~matei/papers/2016/hilda_modeldb.pdf.ACM,p.14.Vaughan,Joel et al.(2018).“Explainable Neural Networks Based on Additive Index Models.”In:arXiv preprint arXiv:1806.01933.URL:https:///pdf/1806.01933.pdf.Wilkinson,Leland(2006).The Grammar of Graphics.—(2018).“Visualizing Big Data 
Outliers through Distributed Aggregation.”In:IEEE Transactions on Visualization&Computer Graphics.URL:https:///~wilkinson/Publications/outliers.pdf.Yang,Hongyu,Cynthia Rudin,and Margo Seltzer(2017).“Scalable Bayesian Rule Lists.”In:Proceedings of the34th International Conference on Machine Learning(ICML).URL:https:///pdf/1602.08610.pdf.。

Multiple suppression methods


1. Common midpoint (CMP) stacking. CMP stacking relies on the difference in residual moveout between primaries and multiples after normal moveout (NMO) correction: signals from different shots, recorded at different receivers but reflected from the same subsurface point, are NMO-corrected and then summed. This can suppress multiples fairly effectively.

When the NMO correction uses the primary velocity, the primary events are flattened while the multiples retain residual moveout, so stacking reinforces the primaries and attenuates the multiples.

To improve multiple suppression, weighted stacking can be used, with the weights tied to offset in such a way that traces on which the multiples have larger residual moveout receive larger weights.

Reference [14] describes an optimal weighted stacking method in which the weights of the stacked traces are solved for by least squares, so that the stack best approximates the primaries while multiples with residual moveout are attenuated as much as possible.
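The least-squares step behind such a method can be illustrated with a short numpy sketch. Everything here is an assumption for illustration: the synthetic gather, the windows used to measure energy, and especially the target trace, which is taken as the known synthetic primary; a real implementation has to estimate its target (which is where the actual method in [14] differs).

```python
# Illustrative sketch: solve for one weight per NMO-corrected trace so the weighted
# stack best matches a target trace in the least-squares sense.
import numpy as np

def optimal_stack_weights(traces: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Solve min_w || traces.T @ w - target ||^2: one least-squares weight per trace."""
    w, *_ = np.linalg.lstsq(traces.T, target, rcond=None)
    return w

rng = np.random.default_rng(1)
n_tr, n_s = 24, 500
t = np.arange(n_s)
primary = np.exp(-((t - 200.0) ** 2) / 50.0)          # flattened primary event
gather = np.empty((n_tr, n_s))
for i in range(n_tr):
    lag = 100 + 3 * i                                  # multiple: base delay plus residual moveout
    gather[i] = primary + 0.7 * np.roll(primary, lag) + 0.05 * rng.normal(size=n_s)

# Target = known synthetic primary, purely to show the linear-algebra step.
w = optimal_stack_weights(gather, primary)
plain_stack = gather.mean(axis=0)
weighted_stack = w @ gather

def multiple_to_primary(stack: np.ndarray) -> float:
    """Rough energy ratio: window around the multiples vs. window around the primary."""
    return float(stack[280:400].std() / stack[150:250].std())

print("plain stack    multiple/primary:", round(multiple_to_primary(plain_stack), 3))
print("weighted stack multiple/primary:", round(multiple_to_primary(weighted_stack), 3))
```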

In 1973, E. Cassano et al. proposed an optimal filtered stacking method, in which a filter operator for each stacked trace is solved for by least squares so that the stack optimally suppresses the multiples and thus best approximates the primaries.

When the residual moveout of the multiples exceeds 50 ms, conventional stacking attenuates them by 10 to 20 dB, and optimal weighted stacking or optimal filtered stacking can attenuate them by a further 20 dB or more.

These figures are theoretical. In practice the amplitude uniformity across the stacked traces is poor (the theory assumes strict uniformity), so when the precisely computed weights or filter operators are applied by multiplication or convolution, accuracy degrades and the theoretical optimum cannot be reached.

2. Two-dimensional filtering. Because primaries and multiples have different moveout in NMO-corrected gathers, two-dimensional filters such as dip filters, velocity filters, and fan filters can be used to remove the multiples while preserving the primaries.

The NMO velocity can be the multiple velocity, as in CGG's FKMUL [15], or a velocity between that of the primaries and that of the multiples, as in Digicon's ZMULT [16][17].

The filtering can be carried out in the f-k, x-t, or x-f domain.

The gathers used can be either CMP gathers or CSP (common shot point) gathers.

B. Zhou et al. analyzed the characteristics of 2-D filtering for multiple suppression in some detail. They argue that the key to designing a 2-D filter is to choose the multiple rejection region appropriately, otherwise the primaries are damaged, and that the boundary between the rejection and pass regions cannot simply be a straight line, since a straight boundary produces Gibbs artifacts; a smoothly tapered, roughly elliptical boundary must be used instead, which makes designing a good 2-D filter difficult. To address this, they proposed a nonlinear f-k filtering method in which a multiple model obtained by wavefield extrapolation is used to determine the multiple rejection (notch) region automatically, with smooth edges.
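A generic f-k fan filter with a tapered boundary (to limit the Gibbs ringing mentioned above) can be sketched as follows. This is not FKMUL, ZMULT, or the method of B. Zhou et al.; the reject-velocity range, taper fraction, and sampling intervals are assumptions.

```python
# Illustrative f-k fan filter for a shot/CMP gather: transform to the f-k domain,
# build a tapered reject mask over a range of apparent velocities, transform back.
import numpy as np

def fk_fan_filter(gather, dt, dx, v_reject=(1200.0, 2200.0), taper_frac=0.15):
    """gather: (n_traces, n_samples). Rejects energy whose apparent velocity |f/k|
    falls inside v_reject; the mask edges are tapered to limit Gibbs ringing."""
    n_x, n_t = gather.shape
    spec = np.fft.fft2(gather)                                  # axes: (k, f)
    k = np.fft.fftfreq(n_x, d=dx)[:, None]                      # cycles per metre
    f = np.fft.fftfreq(n_t, d=dt)[None, :]                      # Hz
    v_app = np.abs(f) / np.maximum(np.abs(k), 1e-12)            # apparent velocity
    lo, hi = v_reject
    # Smooth reject mask: 0 inside the velocity fan, ramping back to 1 near its edges.
    ramp_lo = np.clip((lo * (1 + taper_frac) - v_app) / (lo * taper_frac), 0.0, 1.0)
    ramp_hi = np.clip((v_app - hi * (1 - taper_frac)) / (hi * taper_frac), 0.0, 1.0)
    mask = np.maximum(ramp_lo, ramp_hi)                         # 1 = pass, 0 = reject
    return np.real(np.fft.ifft2(spec * mask))

rng = np.random.default_rng(0)
data = rng.normal(scale=0.01, size=(64, 512))                   # stand-in gather
filtered = fk_fan_filter(data, dt=0.004, dx=25.0)
print(filtered.shape)
```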

An Open Vocabulary Semantic Segmentation Model SAN Integrating Multi-Scale Channel Attention


Authors: WU Ling, ZHANG Hong. Source: Modern Information Technology (现代信息科技), 2024, No. 3. Received: 2023-11-29. Funding: Taiyuan Normal University graduate education and teaching reform research project (SYYJSJG-2154). DOI: 10.19850/j.cnki.2096-4706.2024.03.035.

Abstract: With the development of vision-language models, open-vocabulary methods are widely used for recognizing categories outside the annotated label space.

Compared with weakly supervised and zero-shot methods, open-vocabulary methods have proven more versatile and effective.

The goal of this work is to improve SAN, a lightweight model for open-vocabulary segmentation, by introducing AFF, a feature fusion mechanism based on multi-scale channel attention, and by improving the two-branch feature fusion method in the original SAN architecture.

The improved algorithm is then evaluated on multiple semantic segmentation benchmarks; the results show that model performance improves with almost no change in the number of parameters.

This improvement helps simplify future research on open-vocabulary semantic segmentation.

Keywords: open vocabulary; semantic segmentation; SAN; CLIP; multi-scale channel attention. CLC classification: TP391.4; TP18. Document code: A. Article ID: 2096-4706(2024)03-0164-06.

An Open Vocabulary Semantic Segmentation Model SAN Integrating Multi Scale Channel Attention
WU Ling, ZHANG Hong (Taiyuan Normal University, Jinzhong 030619, China)
Abstract: With the development of visual language models, open vocabulary methods have been widely used in identifying categories outside the annotated label space. Compared with weakly supervised and zero-sample methods, the open vocabulary method has proved to be more versatile and effective. The goal of this study is to improve the lightweight model SAN for open vocabulary segmentation by introducing a feature fusion mechanism, AFF, based on multi-scale channel attention, and to improve the dual-branch feature fusion method in the original SAN structure. The improved algorithm is then evaluated on multiple semantic segmentation benchmarks, and the results show that model performance improves with almost no change in the number of parameters. This improvement will help simplify future research on open vocabulary semantic segmentation.
Keywords: open vocabulary; semantic segmentation; SAN; CLIP; multi scale channel attention

0 Introduction
Recognizing and segmenting visual elements of any category is the goal pursued by image semantic segmentation.
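A minimal PyTorch sketch of an AFF-style fusion block, following the general published Attentional Feature Fusion idea (a channel attention built from a global pooled branch and a local pointwise-convolution branch) rather than the authors' exact module. The channel count and reduction ratio are illustrative assumptions.

```python
# Sketch of AFF-style multi-scale channel attention fusion of two feature maps.
import torch
import torch.nn as nn

class AFFFusion(nn.Module):
    def __init__(self, channels: int = 256, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        def branch(pool: bool) -> nn.Sequential:
            layers = [nn.AdaptiveAvgPool2d(1)] if pool else []
            layers += [
                nn.Conv2d(channels, hidden, kernel_size=1),
                nn.BatchNorm2d(hidden),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, kernel_size=1),
                nn.BatchNorm2d(channels),
            ]
            return nn.Sequential(*layers)
        self.local_att = branch(pool=False)     # per-position channel context
        self.global_att = branch(pool=True)     # image-level channel context

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        s = x + y
        w = torch.sigmoid(self.local_att(s) + self.global_att(s))  # broadcast over H, W
        return w * x + (1.0 - w) * y            # soft selection between the two branches

fuse = AFFFusion(channels=256)
a = torch.randn(2, 256, 32, 32)
b = torch.randn(2, 256, 32, 32)
print(fuse(a, b).shape)   # torch.Size([2, 256, 32, 32])
```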

FTU-related foreign literature translation


ABSTRACT
In modern power system control centers, operation and control of power systems are carried out with the help of real-time computers. Live data is captured by Remote Terminal Units (RTUs) located at various stations and transmitted over suitable communication media to the control center for display, monitoring and control. Telemetering equipment forms a sizable component of the project cost. A methodology is presented in which RTUs are located at different stations to meet certain criteria, such as observability of the system and absence of critical measurements. An additional reliability constraint, loss of information from a single RTU, is also imposed for the above two criteria.
Keywords: energy management systems (EMS), telemetering system, remote terminal units, observability, critical measurements.

INTRODUCTION
In modern power system control centers, also called Energy Control Centers (ECC), operation and control of power systems are carried out with the help of real-time computers. Live data is captured by Remote Terminal Units (RTUs) located at various stations and transmitted over suitable communication media to the control center computer system for display, monitoring and control. The data received at the ECC are prone to errors arising from transducers, communication systems, etc. A state estimator is used to provide a reliable and complete database. In order to estimate the state of the system, the measurement system should be sufficient to estimate the magnitudes and phase angles of the voltages of all the buses in the system. Bad data in the measurements should be detected and eliminated from the measurement set. Further, it is well known that for reliable estimation of the system state, the measurement system should have sufficient redundancy and uniform spread [1]. Hence the location of RTUs plays an important role in an EMS project.

To support the security monitoring function in Energy Control Centers, the following alternatives are available with regard to the placement of RTUs:
a) Placement of RTUs at all the stations to gather the information on network status (positions of switches and breakers) and all the relevant measurements such as MW, MVAR, kV, etc.
b) Placement of RTUs at all the stations as in alternative (a) to gather the information on network status from all stations, while gathering measurements from only some selected stations.
c) Placement of RTUs at only some selected stations to obtain the network status and measurements from those same stations. The remaining information on network status is obtained manually.
Alternative (a), though desirable, is the most expensive and may not be practicable. Option (c) is the cheapest, since the RTUs and measurement systems are located at only a selected number of stations. Though the network status is not obtained in real time, it can be updated whenever there is a change in the configuration. In the proposed method, option (c) is considered.

In this paper a new method for the design of the measurement system is presented wherein RTUs are placed only at some selected substations. The proposed method honors locations of RTUs decided a priori and works from that point.

Handschin and Bongers [2] pointed out that local redundancy and the probability of detecting bad data are the most important factors when planning a measurement system.
Their method consists of moving measurements from the best to the worst part of the network, starting with an initial solution with almost the designed redundancy. Roy and Villard [3] described a method in which different possible telemeasurement configurations are compared by off-line simulations of state estimation. Koglin [4] used a general criterion to systematically eliminate some of the measurements in the system to obtain an optimal set from various measurements. Phua and Dillon [5] developed a method based on an entropy criterion; the problem is posed as a non-linear programming problem which is solved using a sequential linearly constrained minimization method. Mafaakher et al. [6] used the bad-data detection capability of state estimation to design a metering system. Aam, Holten and Gjerde [7] provided a brief survey of the various methods of optimal meter placement, highlighting the advantages and disadvantages of each method; they also presented a method extending Koglin's method to obtain a more robust solution. Nabil Abbasy and Shahidehpour [8] proposed a mathematical programming model to identify redundant measurements from a given set of measurements. Young Moon Park et al. [9] presented an algorithm for optimal meter placement for state estimation which minimizes the total investment subject to a prespecified accuracy of the estimated state. Hiroyuki Mori and Yasuo Tamura [10] compared various approaches to meter placement in power system static state estimation and proposed a method based on a stochastic load flow model.

In all the above methods the emphasis was on the design of the measurement system at the meter level only. An RTU is required to be located at a station whether one or more quantities are to be measured at that station. The current trend is towards building transducers at lower cost; hence it is logical to gather the maximum possible information via the RTU located at a station. Thus the design of the measurement system at the RTU level becomes most relevant.

In this paper a new methodology for the design of the telemetering configuration is presented in which RTUs are located at different stations to meet certain criteria such as observability of the system and absence of critical measurements. An additional reliability constraint, loss of information from a single RTU, is also imposed for the above two considerations. Locations where RTUs are placed for SCADA purposes, as desired by the utilities, are honored. The proposed method is tested on standard IEEE systems and on a practical system, and the results are presented and discussed.

THEORY: OBSERVABILITY AND CRITICAL MEASUREMENTS
System observability and absence of critical measurements are important criteria for the design of the telemetering configuration. The concepts of observability and critical measurements are explained below.

Observability: A power system is said to be 'observable', in the sense of state estimation with respect to a given measurement set M, if the bus voltage magnitudes and angles throughout the system can be determined by processing the measurements in M with a state estimator. Otherwise the power system is said to be 'unobservable' with respect to M. It was proved in [11] that if the measurements in the system form a spanning tree connecting all the buses, then the system is observable.
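A toy sketch of the spanning-tree observability test just described, and of the critical-measurement definition that follows: treat each branch carrying a flow measurement as an edge and check with a union-find whether the measured branches connect all buses. The 5-bus data is made up; real observability analysis also handles injection measurements and the full numerical test.

```python
# Topological observability via a spanning tree of measured branches (toy example).
from typing import Iterable, Tuple

def is_observable(n_bus: int, measured_branches: Iterable[Tuple[int, int]]) -> bool:
    parent = list(range(n_bus))
    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path compression
            i = parent[i]
        return i
    for a, b in measured_branches:             # union the endpoints of each measured branch
        parent[find(a)] = find(b)
    roots = {find(i) for i in range(n_bus)}
    return len(roots) == 1                     # one component => a spanning tree exists

# Hypothetical 5-bus system with flow measurements on four branches.
meas = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(is_observable(5, meas))                              # True
print(is_observable(5, [(0, 1), (1, 2), (3, 4)]))          # False: buses {3, 4} are islanded

# A measured branch is 'critical' if dropping it makes the system unobservable.
critical = [m for m in meas if not is_observable(5, [x for x in meas if x != m])]
print(critical)    # here every measurement is critical: there is no redundancy
```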
Clements [12] has explained various aspects of observability and reviewed some methods of meter placement.

Critical measurements: A measurement is said to have a detectable error residual if an error in the measurement shows up in the measurement residual, the residual being the measured value minus the calculated value [13]. It was shown that there may be some measurements for which an error is not reflected in the residual. The problem of determining which measurements have detectable error residuals is solved by identifying a class of measurements, called 'critical measurements', and showing that only non-critical measurements have detectable error residuals. Since only non-critical measurements have detectable error residuals, bad data in those measurements can be detected; in the case of critical measurements, bad data is not reflected in the error residual and hence cannot be detected. A critical measurement is defined as a measurement which, when not available, makes the system unobservable.

Thus critical measurements imply the following: loss of a critical measurement would make the system unobservable, and an error in a critical measurement cannot be detected. Hence it is important that a system has no critical measurements and is designed accordingly.

DESIGN OF THE TELEMETERING CONFIGURATION
In this section a new design methodology for the telemetering configuration at the RTU level is proposed. The various inputs and outputs of the 'Telemetering Configurator' are shown in Fig. 1. It is assumed that all possible MW and MVAR flows in the lines and transformers are acquired by the RTU at the substation, so that maximum information is gathered from the substation (s/s). It is also assumed that voltages are measured at all the buses in the substations where RTUs are placed.

摘要 (Abstract): In modern power system control centers, operation and control of power systems are carried out with the help of real-time computers.

Algorithms for bigram and trigram word clustering


Speech Communication 24 (1998) 19-37

Algorithms for bigram and trigram word clustering

Sven Martin, Jörg Liermann, Hermann Ney
Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, Ahornstraße 55, D-52056 Aachen, Germany
Received 5 June 1996; revised 15 January 1997; accepted 23 September 1997
Corresponding author: martin@informatik.rwth-aachen.de. This paper is based on a communication presented at the ESCA Conference EUROSPEECH'95.

Abstract
In this paper, we describe an efficient method for obtaining word classes for class language models. The method employs an exchange algorithm using the criterion of perplexity improvement. The novel contributions of this paper are the extension of the class bigram perplexity criterion to the class trigram perplexity criterion, the description of an efficient implementation for speeding up the clustering process, a detailed computational complexity analysis of the clustering algorithm, and, finally, experimental results on large text corpora of about 1, 4, 39 and 241 million words, including examples of word classes, test corpus perplexities in comparison to word language models, and speech recognition results.

Keywords: Stochastic language modeling; Statistical clustering; Word equivalence classes; Wall Street Journal corpus

1. Introduction
The need for a stochastic language model in speech recognition arises from Bayes' decision rule for minimum error rate (Bahl et al., 1983).
The word sequence $w_1 \ldots w_N$ to be recognized from the sequence of acoustic observations $x_1 \ldots x_T$ is determined as that word sequence $w_1 \ldots w_N$ for which the posterior probability $\Pr(w_1 \ldots w_N \mid x_1 \ldots x_T)$ attains its maximum. This rule can be rewritten in the form
$$\arg\max_{w_1 \ldots w_N} \{ \Pr(w_1 \ldots w_N) \cdot \Pr(x_1 \ldots x_T \mid w_1 \ldots w_N) \},$$
where $\Pr(x_1 \ldots x_T \mid w_1 \ldots w_N)$ is the conditional probability of observing the sequence of acoustic measurements $x_1 \ldots x_T$ given the word sequence $w_1 \ldots w_N$, and $\Pr(w_1 \ldots w_N)$ is the prior probability of producing the word sequence $w_1 \ldots w_N$. The task of the stochastic language model is to provide estimates of these prior probabilities $\Pr(w_1 \ldots w_N)$. Using the definition of conditional probabilities, we obtain the decomposition
$$\Pr(w_1 \ldots w_N) = \prod_{n=1}^{N} \Pr(w_n \mid w_1 \ldots w_{n-1}).$$

For large vocabulary speech recognition, these conditional probabilities are typically used in the following way (Bahl et al., 1983). The dependence of the conditional probability of observing a word $w_n$ at position $n$ is assumed to be restricted to its immediate $m-1$ predecessor words $w_{n-m+1} \ldots w_{n-1}$. The resulting model is that of a Markov chain and is referred to as an $m$-gram model. For $m = 2$ and $m = 3$, we obtain the widely used bigram and trigram models, respectively. These bigram and trigram models are estimated from a text corpus during a training phase. But even for these restricted models, most of the possible events, i.e., word pairs and word triples, are never seen in training because there are so many of them. Therefore, in order to allow for events not seen in training, the probability distributions obtained in these $m$-gram approaches are smoothed with more general distributions. Usually, these are also $m$-grams with a smaller value for $m$, or a more sophisticated approach like a singleton distribution (Jelinek, 1991; Ney et al., 1994; Ney et al., 1997).

In this paper, we try a different approach for smoothing by using word equivalence classes, or word classes for short. Here, each word belongs to exactly one word class. If a certain word $m$-gram did not appear in the training corpus, it is still possible that the $m$-gram of the word classes corresponding to these words did occur, and thus a word class based $m$-gram language model, or class $m$-gram model for short, can be estimated. More generally, as the number of word classes is smaller than the number of words, the number of model parameters is reduced so that each parameter can be estimated more reliably. On the other hand, reducing the number of model parameters makes the model coarser and thus the prediction of the next word less precise. So there has to be a tradeoff between these two extremes.

Typically, word classes are based on syntactic-semantic concepts and are defined by linguistic experts; in this case, they are called parts of speech (POS). Generalizing the concept of word similarities, we can also define word classes by using a statistical criterion, which in most cases, but not necessarily, is maximum likelihood or, equivalently, perplexity (Jelinek, 1991; Brown et al., 1992; Kneser and Ney, 1993; Ney et al., 1994). With the latter two approaches, word classes are defined using a clustering algorithm based on minimizing the perplexity of a class bigram language model on the training corpus, which we will call bigram clustering for short.
The contributions of this paper are:
- the extension of the clustering algorithm from the bigram criterion to the trigram criterion;
- a detailed analysis of the computational complexity of both bigram and trigram clustering algorithms;
- the design and discussion of an efficient implementation of both clustering algorithms;
- systematic tests using the 39-million-word Wall Street Journal corpus concerning perplexity and clustering times for various numbers of word classes and initialization methods;
- speech recognition results using the North American Business corpus.

The original exchange algorithm presented in this paper was published in (Kneser and Ney, 1993) with good results on the LOB corpus. There is a different approach described in (Brown et al., 1992) employing a bottom-up algorithm. There are also approaches based on simulated annealing (Jardino and Adda, 1994). Word classes can also be derived from an automated semantic analysis (Bellegarda et al., 1996), or by morphological features (Lafferty and Mercer, 1993).

The organization of this paper is as follows. Section 2 gives a definition of class models, explains the outline of the clustering algorithm, and presents the extension to a trigram based statistical clustering criterion. Section 3 presents an efficient implementation of the clustering algorithm. Section 4 analyses the computational complexity of this efficient implementation.
Section 5 reports on text corpus experiments concerning the performance of the clustering algorithm in terms of CPU time, resulting word classes, and training and test perplexities. Section 6 shows the results of the speech recognition experiments. Section 7 discusses the results and their usefulness for language models. In this paper, we introduce a large number of symbols and quantities; they are summarized in Table 1.

Table 1. List of symbols
$W$: vocabulary size
$u, v, w, x$: words in a running text; usually $w$ is the word under discussion, $x$ its successor, $v$ its predecessor, and $u$ the predecessor to $v$
$w_n$: word at corpus position $n$
$S(w)$ / $P(w)$: set of successor / predecessor words to word $w$ in the training corpus
$S(v,w)$ / $P(v,w)$: set of successor / predecessor words to bigram $(v,w)$ in the training corpus
$G$: number of word classes
$G: w \to g_w$: class mapping function
$g, k$: word classes
$N$: training corpus size
$B$: number of distinct word bigrams in the training corpus
$T$: number of distinct word trigrams in the training corpus
$N(\cdot)$: number of occurrences in the training corpus of the event in parentheses
$F_{bi}(G)$: log-likelihood for a class bigram model
$F_{tri}(G)$: log-likelihood for a class trigram model
$PP$: perplexity
$I$: number of iterations of the clustering algorithm
$G(\cdot,w) = \sum_{g: N(g,w)>0} 1$: number of seen predecessor word classes to word $w$
$G(w,\cdot) = \sum_{g: N(w,g)>0} 1$: number of seen successor word classes to word $w$
$W^{-1}\sum_w G(\cdot,w)$: average number of seen predecessor word classes
$W^{-1}\sum_w G(w,\cdot)$: average number of seen successor word classes
$G(\cdot,\cdot,w) = \sum_{g_1,g_2: N(g_1,g_2,w)>0} 1$: number of seen word class bigrams preceding word $w$
$G(\cdot,w,\cdot) = \sum_{g_1,g_2: N(g_1,w,g_2)>0} 1$: number of seen word class pairs embracing word $w$
$G(w,\cdot,\cdot) = \sum_{g_1,g_2: N(w,g_1,g_2)>0} 1$: number of seen word class bigrams succeeding word $w$
$b$: absolute discounting value for smoothing
$N_r(g)$: number of distinct words appearing $r$ times in word class $g$
$G_r(g_v,\cdot)$: number of distinct word classes seen $r$ times right after word class $g_v$
$G_r(\cdot,g_w)$: number of distinct word classes seen $r$ times right before word class $g_w$
$G_r(\cdot,\cdot)$: number of distinct word class bigrams seen $r$ times
$\beta(g_w)$: generalized distribution for smoothing

2. Class models and clustering algorithm
In this section, we will present our class bigram and trigram models and we will derive their log-likelihood function, which serves as our statistical criterion for obtaining word classes. With our approach, word classes result from a clustering algorithm which exchanges a word between a fixed number of word classes and assigns it to the word class where it optimizes the log-likelihood. We will discuss alternative strategies for finding word classes. We will also describe smoothing methods for the trained class models, which are necessary to avoid zero probabilities on test corpora.

2.1. Class bigram models
We partition the vocabulary of size $W$ into a fixed number $G$ of word classes. The partition is represented by the so-called class or category mapping function $G: w \to g_w$, mapping each word $w$ of the vocabulary to its word class $g_w$. Assigning a word to only one word class is a possible drawback which is justified by the simplicity and efficiency of the clustering process. For the rest of this paper, we will use the letters $g$ and $k$ for arbitrary word classes. For a word bigram $(v,w)$ we use $(g_v, g_w)$ to denote the corresponding class bigram.

For class models, we have two types of probability distributions:
- a transition probability $p_1(g_w \mid g_v)$, which represents the first-order Markov chain probability for predicting word class $g_w$ from its predecessor word class $g_v$;
- a membership probability $p_0(w \mid g)$, estimating word $w$ from word class $g$.

Since a word belongs to exactly one word class, we have $p_0(w \mid g) > 0$ if $g = g_w$ and $p_0(w \mid g) = 0$ if $g \ne g_w$. Therefore, we can use the somewhat sloppy notation $p_0(w \mid g_w)$.

For a class bigram model, we then have:
$$p(w \mid v) = p_0(w \mid g_w) \cdot p_1(g_w \mid g_v). \quad (1)$$
Note that this model is a proper probability function, and that we make an independence assumption between the prediction of a word from its word class and the prediction of a word class from its predecessor word classes. Such a model leads to a drastic reduction in the number of free parameters: $G \cdot (G-1)$ probabilities for the table $p_1(g_w \mid g_v)$, $W - G$ probabilities for the table $p_0(w \mid g_w)$, and $W$ indices for the mapping $G: w \to g_w$.

For maximum likelihood estimation, we construct the log-likelihood function using Eq. (1):
$$F_{bi}(G) = \sum_{n=1}^{N} \log \Pr(w_n \mid w_1 \ldots w_{n-1}) = \sum_{v,w} N(v,w)\,\log p(w \mid v) = \sum_{w} N(w)\,\log p_0(w \mid g_w) + \sum_{g_v,g_w} N(g_v,g_w)\,\log p_1(g_w \mid g_v), \quad (2)$$
with $N(\cdot)$ being the number of occurrences of the event given in parentheses in the training data. To construct a class bigram model, we first hypothesize a mapping function $G$. Then, for this hypothesized mapping function $G$, the probabilities $p_0(w \mid g_w)$ and $p_1(g_w \mid g_v)$ in Eq. (2) can be estimated by adding Lagrange multipliers for the normalization constraints and taking derivatives. This results in relative frequencies (Ney et al., 1994):
$$p_0(w \mid g_w) = \frac{N(w)}{N(g_w)}, \quad (3)$$
$$p_1(g_w \mid g_v) = \frac{N(g_v, g_w)}{N(g_v)}. \quad (4)$$
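A tiny sketch of Eqs. (1), (3) and (4): a class bigram probability evaluated from toy counts via relative frequencies. The corpus and class mapping are made up for illustration.

```python
# Class bigram probability p(w|v) = p0(w|g_w) * p1(g_w|g_v) from toy counts.
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
word2class = {"the": "DET", "cat": "N", "mat": "N", "dog": "N", "rug": "N",
              "sat": "V", "on": "P"}

N_w  = Counter(corpus)                                           # N(w)
N_g  = Counter(word2class[w] for w in corpus)                    # N(g)
N_gg = Counter((word2class[v], word2class[w])                    # N(g_v, g_w)
               for v, w in zip(corpus, corpus[1:]))

def p_class_bigram(w: str, v: str) -> float:
    gw, gv = word2class[w], word2class[v]
    p0 = N_w[w] / N_g[gw]                 # Eq. (3): membership probability
    p1 = N_gg[(gv, gw)] / N_g[gv]         # Eq. (4): class transition probability
    return p0 * p1                        # Eq. (1)

print(round(p_class_bigram("dog", "the"), 4))   # p(dog | the) under the class model
```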
Using the estimates given by Eqs. (3) and (4), we can now express the log-likelihood function $F_{bi}(G)$ for a mapping $G$ in terms of the counts:
$$F_{bi}(G) = \sum_{v,w} N(v,w)\,\log p(w \mid v) = \sum_{w} N(w)\,\log \frac{N(w)}{N(g_w)} + \sum_{g_v,g_w} N(g_v,g_w)\,\log \frac{N(g_v,g_w)}{N(g_v)}$$
$$= \sum_{g_v,g_w} N(g_v,g_w)\,\log N(g_v,g_w) - 2 \sum_{g} N(g)\,\log N(g) + \sum_{w} N(w)\,\log N(w) \quad (5)$$
$$= \sum_{w} N(w)\,\log N(w) + \sum_{g_v,g_w} N(g_v,g_w)\,\log \frac{N(g_v,g_w)}{N(g_v)\,N(g_w)}. \quad (6)$$
In (Brown et al., 1992) the second sum of Eq. (6) is interpreted as the mutual information between the word classes $g_v$ and $g_w$. Note, however, that the derivation given here is based on the maximum likelihood criterion only.

2.2. Class trigram models
Constructing the log-likelihood function for the class trigram model
$$p(w \mid u, v) = p_0(w \mid g_w) \cdot p_2(g_w \mid g_u, g_v) \quad (7)$$
results in
$$F_{tri}(G) = \sum_{w} N(w)\,\log p_0(w \mid g_w) + \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w)\,\log p_2(g_w \mid g_u,g_v). \quad (8)$$
Taking the derivatives of Eq. (8) for maximum likelihood parameter estimation also results in relative frequencies
$$p_2(g_w \mid g_u, g_v) = \frac{N(g_u,g_v,g_w)}{N(g_u,g_v)} \quad (9)$$
and, using Eqs. (3) and (7)-(9):
$$F_{tri}(G) = \sum_{u,v,w} N(u,v,w)\,\log p(w \mid u,v) = \sum_{w} N(w)\,\log \frac{N(w)}{N(g_w)} + \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w)\,\log \frac{N(g_u,g_v,g_w)}{N(g_u,g_v)}$$
$$= \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w)\,\log N(g_u,g_v,g_w) - \sum_{g_u,g_v} N(g_u,g_v)\,\log N(g_u,g_v) - \sum_{g_w} N(g_w)\,\log N(g_w) + \sum_{w} N(w)\,\log N(w)$$
$$= \sum_{w} N(w)\,\log N(w) + \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w)\,\log \frac{N(g_u,g_v,g_w)}{N(g_u,g_v)\,N(g_w)}. \quad (10)$$

2.3. Exchange algorithm
To find the unknown mapping $G: w \to g_w$, we will now show how to apply a clustering algorithm. The goal of this algorithm is to find a class mapping function $G$ such that the perplexity of the class model is minimized over the training corpus. We use an exchange algorithm similar to the exchange algorithms used in conventional clustering (ISODATA; Duda and Hart, 1973, pp. 227-228), where an observation vector is exchanged from one cluster to another cluster in order to improve the criterion. In the case of language modeling, the optimization criterion is the log-likelihood, i.e., Eq. (5) for the class bigram model and Eq. (10) for the class trigram model. The algorithm employs a technique of local optimization by looping through each element of the set, moving it tentatively to each of the $G$ word classes, and assigning it to the word class resulting in the lowest perplexity. The whole procedure is repeated until a stopping criterion is met. The outline of our algorithm is depicted in Fig. 1 (Fig. 1: Outline of the exchange algorithm for word clustering).

We will use the term "to remove" for taking a word out of the word class to which it was assigned in the previous iteration, the term "to move" for inserting a word into a word class, and the term "to exchange" for a combination of a removal followed by a move.

For initialization, we use the following method: we consider the $G-1$ most frequent words, and each of these words defines its own word class. The remaining words are assigned to an additional word class. As a side effect, all the words with a zero unigram count $N(w)$ are assigned to this word class and remain there, because exchanging them has no effect on the training corpus perplexity. The stopping criterion is a prespecified number of iterations. In addition, the algorithm stops if no words are exchanged any more.
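A minimal, unoptimized sketch of this exchange loop for the class bigram criterion of Eq. (5). It recomputes the class counts from scratch for every tentative move, which is far slower than the efficient implementation described in Section 3, but it shows the structure of the algorithm; the toy corpus is made up.

```python
import math
from collections import Counter

def bigram_loglik(corpus, word2class):
    """Class-dependent part of Eq. (5): sum N(g,g') log N(g,g') - 2 sum N(g) log N(g).
    The word-unigram term of Eq. (5) does not depend on the mapping, so it is dropped."""
    N_g = Counter(word2class[w] for w in corpus)
    N_gg = Counter((word2class[v], word2class[w]) for v, w in zip(corpus, corpus[1:]))
    return (sum(n * math.log(n) for n in N_gg.values())
            - 2.0 * sum(n * math.log(n) for n in N_g.values()))

def exchange_clustering(corpus, n_classes, n_iter=10):
    # Initialization as in the text: the G-1 most frequent words get their own class,
    # all remaining words share the last class.
    by_freq = [w for w, _ in Counter(corpus).most_common()]
    word2class = {w: min(i, n_classes - 1) for i, w in enumerate(by_freq)}
    for _ in range(n_iter):
        n_moves = 0
        for w in by_freq:
            old = word2class[w]
            scores = {}
            for g in range(n_classes):          # tentatively move w into every class
                word2class[w] = g
                scores[g] = bigram_loglik(corpus, word2class)
            best = max(scores, key=scores.get)  # keep the class with the best likelihood
            word2class[w] = best
            n_moves += (best != old)
        if n_moves == 0:                        # stopping criterion: no word was exchanged
            break
    return word2class

toy = "the cat sat on the mat the dog sat on the rug a cat saw a dog".split()
print(exchange_clustering(toy, n_classes=3))
```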
Thus, in this method, we exploit the training corpus in two ways: (1) in order to find the optimal partitioning, and (2) in order to evaluate the perplexity. An alternative approach would be to use two different data sets for these two tasks, or to simulate unseen events using leaving-one-out. That would result in an upper bound and possibly in more robust word classes, but at the cost of higher mathematical and computational expense. (Kneser and Ney, 1993) employs leaving-one-out for clustering; however, the improvement was not very significant, and so we will use the simpler original method here. An efficient implementation of this clustering algorithm will be presented in Section 3.

2.4. Comparison with alternative optimization strategies
It is interesting to compare the exchange algorithm for word clustering with two other approaches described in the literature, namely simulated annealing (Jardino and Adda, 1993) and bottom-up clustering (Brown et al., 1992).

In simulated annealing, the baseline optimization strategy is similar to the strategy of the exchange algorithm. The important difference, following the simulated annealing concept, is that temporary degradations of the optimization criterion are accepted. The decision of whether to accept a degradation or not depends on the so-called cooling parameter. This approach is usually referred to as the Metropolis algorithm. Another difference is that the words to be exchanged from one word class to another, and the target word classes, are selected by the so-called Monte Carlo method. Using the correct cooling parameter, simulated annealing converges to the global optimum. In our own experimental tests (unpublished results), we found only a marginal improvement in the perplexity criterion at dramatically increased computational cost. In (Jardino, 1996), simulated annealing is applied to a large training corpus from the Wall Street Journal, but no CPU times are given. In addition, in (Jardino and Adda, 1994), the authors introduce a modification of the clustering model allowing several word classes for each word, at least in principle. This modification, however, concerns the definition of the clustering model more than the optimization strategy. In this paper, we do not consider such types of stochastic class mappings.

The other optimization strategy, bottom-up clustering, as presented in (Brown et al., 1992), is also based on the perplexity criterion given by Eq. (6). However, instead of the exchange algorithm, the authors use the well-known hierarchical bottom-up clustering algorithm as described in (Duda and Hart, 1973, pp. 230 and 235). The typical iteration step here is to reduce the number of word classes by one. This is achieved by merging the pair of word classes for which the perplexity degradation is smallest. The process is repeated until the desired number of word classes has been obtained, and is initialized by defining a separate word class for each word. In (Brown et al., 1992), the authors describe special methods to keep the computational complexity of the algorithm as small as possible. Obviously, like the exchange algorithm, this bottom-up clustering strategy achieves only a local optimum. As reported in (Brown et al., 1992), the exchange algorithm can be used to improve the results obtained by bottom-up clustering. From this result and from our own experimental results for the various initialization methods of the exchange algorithm (see Section 5.4), we may conclude that there is no basic performance difference between bottom-up clustering and exchange clustering.
class bigrams or trigrams in a test corpus may have zero frequencies in the training corpus,resulting in zero probabilities.To avoid this,smoothing must be used on the test corpus.However,for the clustering process on the training corpus,the unsmoothed relative frequencies Ž.Ž.Ž.of Eqs.3,4and 9are still used.To smooth the transition probability,we use the method of absolute interpolation with a singleton Ž.generalized distribution Ney et al.,1995,1997:N g ,g y bŽ.Õw <p g g s max 0,Ž.1w Õž/N g Ž.Õbq G y G g ,P PP b g ,Ž.Ž.Ž.0Õw N g Ž.ÕG P ,P Ž.1b s,G P ,P q 2P G P ,P Ž.Ž.12G P ,g Ž.1w b g s,Ž.w G P ,P Ž.1with b standing for the history-independent discount-Ž.ing value,g g ,P for the number of word classes r ÕŽ.seen r times right after word class g ,g P ,g for Õr w the number of word classes seen r times right before Ž.word class g ,and g P ,P for the number of w r distinct word class bigrams seen r times in the Ž.training corpus.b g is the so-called singleton w Ž.generalized distribution Ney et al.,1995,1997.The same method is used for the class trigram model.To smooth the membership distribution,we use the method of absolute discounting with backing off Ž.Ney et al.,1995,1997:N w y b Ž.°g Õif N w )0,Ž.N g Ž.w ~<p w g sŽ.0w b 1g w N g PPif N w s 0,Ž.Ž.Ýr w ¢N g N g Ž.Ž.w 0w r )0N G Ž.1w b s,g w N g q 2P N g Ž.Ž.1w 2w N g [1,Ž.Ýr w XXŽ.w :g s g ,N w s rw w with b standing for the word class dependent g w Ž.discounting value and N g for the number of r w words appearing r times and belonging to word class g .The reason for a different smoothing w method for the membership distribution is that no singleton generalized distribution can be constructed from unigram counts.Without singletons,backing Ž.off works better than interpolation Ney et al.,1997.However,no smoothing is applied to word classes with no unseen words.With our clustering algo-rithm,there is only one word class containing unseen words.Therefore,the effect of the kind of smoothing used for the membership distribution is negligible.Thus,for the sake of consistency,absolute interpola-tion could be used to smooth both distributions.3.Efficient clustering implementationA straightforward implementation of our cluster-ing algorithm presented in Section 2.3is time con-suming and prohibitive even for a small number of word classes G .In this section,we will present our techniques to improve computational performance in order to obtain word classes for large numbers of word classes.A detailed complexity analysis of the resulting algorithm will be presented in Section 4.3.1.Bigram clusteringŽ.We will use the log-likelihood Eq.5as the criterion for bigram clustering,which is equivalent to the perplexity criterion.The exchange of a word between word classes is entirely described by alter-ing the affected counts of this formula.3.1.1.Efficient method for count generationŽ.All the counts of Eq.5are computed once,stored in tables and updated after a word exchange.As we will see later,we need additional counts N w ,g s N w ,x ,11Ž.Ž.Ž.Ýx :g s gx N g ,w sN Õ,w 12Ž.Ž.Ž.ÝÕ:g s gÕ()S.Martin et al.r Speech Communication 24199819–3726Fig.2.Efficient procedure for count generation.describing how often a word class g appears right after and right before,respectively,a word w .These counts are recounted anew for each word currently under consideration,because updating them,if nec-essary,would require the same effort as recounting,and would require more memory because of the large tables.Ž.Ž.For a fixed word w in Eqs.11and 12,we need to know the 
predecessor and the successor words,which are stored as lists for each word w ,and the corresponding bigram counts.However,we ob-serve that if word Õprecedes w ,then w succeeds Õ.Ž.Consequently,the bigram Õ,w is stored twice,once in the list of successors to Õ,and once in the list of predecessors to w ,thus resulting in high memory consumption.However,dropping one type of list would result in a high search effort.Therefore we keep both lists,but with bigram counts stored only in the list of ing four bytes for the counts and two bytes for the word indexes,we reduce the memory requirements by 1r 3at the cost of a minor Ž.search effort for obtaining the count N Õ,w from the list of successors to Õby binary search.The Ž.Ž.count generation procedure for Eqs.11and 12is depicted in Fig.2.3.1.2.Baseline perplexity recomputationŽ.We will examine how the counts in Eq.5must be updated in a word exchange.We observe that removing a word w from word class g and moving w it to a word class k only affects those counts of Eq.Ž.5that involve g or k ;all the other counts,and,w consequently,their contributions to the perplexity remain unchanged.Thus,to compute the change in Ž.perplexity,we recompute only those terms in Eq.5which involve the affected counts.We consider in detail how to remove a word from word class g .Moving a word to a word class k isw similar.First,we have to reduce the word class unigram count:N g [N g y N w .Ž.Ž.Ž.w w Then,we have to decrement the transition counts from g to a word class g /g and from an w w arbitrary word class g /g by the number of times w w appears right before or right after g ,respectively:;g /g :N g ,g [N g ,g y N g ,w ,13Ž.Ž.Ž.Ž.w w w ;g /g :N g ,g [N g ,g y N w ,g .14Ž.Ž.Ž.Ž.w w w Ž.Changing the self-transition count N g ,g is a bit w w more complicated.We have to reduce this count by the number of times w appears right before or right after another word of g .However,if w follows w Ž.itself in the corpus,N w ,w is considered in both Ž.Ž.Eqs.11and 12.Therefore,it is subtracted twice from the transition count and must be added once for compensation:N g ,g [N g ,g y N g ,w Ž.Ž.Ž.w w w w w y N w ,g q N w ,w .15Ž.Ž.Ž.w Ž.Finally,we have to update the counts N g ,w and w Ž.N w ,g :w N g ,w [N g ,w y N w ,w ,Ž.Ž.Ž.w w N w ,g [N w ,g y N w ,w .Ž.Ž.Ž.w w Ž.We can view Eq.15as an application of the inclusion r exclusion principle from combinatorics Ž.Takacs,1984.If two subsets A and B of a set C ´are to be removed from C ,the intersection of A and B can only be removed once.Fig.3gives an inter-pretation of this principle applied to our problem of count updating.Viewing these updates in terms of the inclusion r exclusion principle will help to under-stand the mathematically more complicated update formulae for trigram clustering.。

Deterministic and Stochastic Models


Irina Overeem, Community Surface Dynamics Modeling System, University of Colorado at Boulder, September 2008

Course outline 1
Lectures by Irina Overeem: Introduction and overview; Deterministic and geometric models; Sedimentary process models I; Sedimentary process models II; Uncertainty in modeling. Lecture by Overeem & Teyukhina: Synthetic migrated data.

Geological modeling: different tracks
Reservoir data (seismic, borehole and wireline logs) feeds two tracks: data-driven modeling, leading to a static reservoir model via stochastic or deterministic models, and process modeling via a sedimentary process model; both feed a flow model through upscaling.

Deterministic and stochastic models
- Deterministic model: a mathematical model which contains no random components; consequently, each component and input is determined exactly.
- Stochastic model: a mathematical model that includes some sort of random forcing. In many cases, stochastic models are used to simulate deterministic systems that include smaller-scale phenomena that cannot be accurately observed or modeled. A good stochastic model manages to represent the average effect of unresolved phenomena on larger-scale phenomena in terms of a random forcing.

Deterministic geometric models
Two classes: faults (planes) and sediment bodies (volumes). Geometric models are conditioned to seismic data, with QC from geological knowledge.

Direct mapping of faults and sedimentary units from seismic data
Good-quality 3D seismic data allows recognition of subtle faults and sedimentary structures directly, even more so if (post-migration) seismic volume attributes are calculated. The geophysics group at DUT worked on a methodology to extract 3-D geometrical signal characteristics directly from the data. Example: seismic volume attribute analysis of the Cenozoic succession in the L08 block, Southern North Sea (Steeghs, Overeem, Tigrek, 2000, Global and Planetary Change 27, 245-262).

Fault modelling
Fault surfaces from retrodeformation (geometries of restored depositional surfaces). Further fault modelling in Petrel: check the plausibility of implied stress and strain fields.

Deterministic sedimentary model from seismic attributes
Example features: fan, feeder channel, delta foresets, gas-filled meandering channel.

Object-based stochastic models
- Point process: spatial distribution of points (object centroids) in space according to some probability law.
- Marked point process: a point process attached to (marked with) random processes defining the type, shape, and size of objects.
Marked point processes are used to supply inter-well object distributions in sedimentary environments with clearly defined objects: sand bodies encased in mud, or shales encased in sand (a toy simulation is sketched after the ingredients list below).

Ingredients of a marked point process
- Spatial distribution (degree of clustering, trends)
- Object properties (size, shape, orientation)
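A minimal numpy sketch of such a marked point process: a Poisson number of sand-body centroids is scattered over the model area, and each point is marked with random object properties. All distributions and parameter values are made-up placeholders.

```python
# Illustrative marked point process for object-based facies modelling.
import numpy as np

rng = np.random.default_rng(42)

def simulate_channel_objects(area_km=(10.0, 10.0), intensity_per_km2=0.5):
    lx, ly = area_km
    n = rng.poisson(intensity_per_km2 * lx * ly)           # point process: number of objects
    objects = []
    for _ in range(n):
        objects.append({
            "x_km": rng.uniform(0.0, lx),                   # centroid location
            "y_km": rng.uniform(0.0, ly),
            "width_m": rng.lognormal(mean=np.log(300.0), sigma=0.5),   # marks: size ...
            "length_km": rng.uniform(2.0, 8.0),
            "azimuth_deg": rng.normal(loc=45.0, scale=10.0),           # ... and orientation
        })
    return objects

realisation = simulate_channel_objects()
print(len(realisation), "channel objects; first:", realisation[0] if realisation else None)
```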
An example: fluvial channel-fill sands
Geometries have become more sophisticated, but the conceptual basis has not changed: attempt to capture geological knowledge of the spatial lithology distribution by probability laws. Examples of shape characterisation: channel dimensions (L, W) and orientation, overbank deposits, crevasse channels, levees.

Exploring uncertainty of object properties (channel width)
Compare realisations with W = 100 m and W = 800 m. A major step forward: an object-based model of a channel belt generated by random avulsion at a fixed point, giving a series of equiprobable realisations conditioned to wells.

Stochastic model constrained by multiple analogue data
Extract as much information as possible from logs and cores (Tilje Fm., Haltenbanken area, offshore Norway). Use outcrop or modern analogue data sets for facies comparison and definition of geometries. Only then does "stochastic modeling" begin. Lithofacies types are defined from core (example: Holocene Holland tidal basin); a modern analogue, the Ganges tidal delta, India, provides a selected study window with tidal channels and channel-width measurements, and aerial photos and detailed maps yield a conceptual model of the tidal basin. The analogue data are then quantified into properties relevant for the reservoir model (e.g., channel width vs. distance to shoreline), which constrain the resulting stochastic model.

Some final remarks on stochastic/deterministic models
- Stochastic modeling should be data-driven modeling.
- Both outcrop and modern systems play an important role in aiding this kind of modeling.
- Deterministic models are driven by seismic data; the better the seismic data acquisition techniques become, the more accurate the resulting model.

References
Steeghs, P., Overeem, I., Tigrek, S., 2000. Seismic Volume Attribute Analysis of the Cenozoic Succession in the L08 Block (Southern North Sea). Global and Planetary Change 27, 245-262.
Geel, C.R., Donselaar, M.E., 2007. Reservoir modelling of heterolithic tidal deposits: sensitivity analysis of an object-based stochastic model. Netherlands Journal of Geosciences 86(4).

Modeling and analysis of wireless powered networks: game theoretic modeling of jamming attack in wireless powered networks

Fig. 1. System model: a wireless energy source, a user (transmitter with energy storage, driven by traffic states), a receiver, and an attacker.
We consider a wireless powered network as shown in Fig. 1. The network is composed of a legitimate user who can request wireless energy transfer (e.g., RF energy) from the wireless energy source. The user's device has an energy storage with a finite capacity (i.e., $E_U$ units of energy). The received energy is stored in the storage.
The amount of wireless energy received by the user's device depends on the channel quality (i.e., a channel state), which is random (e.g., due to fading). The set of channel states is denoted by $\mathcal{C}_U = \{0, 1, \ldots, C\}$, where $C$ is the highest channel state. Without loss of generality, we assume that at channel state $c \in \mathcal{C}_U$, $c$ units of energy can be received by the user's device if the wireless energy source transfers energy (i.e., the user makes a request). Let $\mathbf{C}_U$ denote the transition probability matrix of the channel state; its element $C^U_{c,c'}$ is the probability that the current channel state is $c$ and the next channel state is $c'$. If the user requests wireless energy, it incurs a cost $\mu$; this cost can be the reserved energy needed to make a request. The user generates different types of traffic. The type of traffic depends on the traffic state, whose set is denoted by $\mathcal{D} = \{1, \ldots, D\}$, where $D$ is the highest traffic state. Let $\mathbf{D}$ denote the transition probability matrix of the traffic state; its element $D_{d,d'}$ is the probability that the current traffic state is $d$ and the next traffic state is $d'$. The user's device requires one unit of energy from the storage to transmit data to the receiver in any traffic state. We consider real-time traffic: if data is generated (according to the traffic state) but cannot be transmitted (e.g., due to lack of energy), it is discarded immediately. If the data generated at traffic state $d \in \mathcal{D}$ can be transmitted successfully, the user gains a utility $u^U_d$; otherwise, the user receives zero utility.

In addition to the legitimate user, the network has an attacker that performs a jamming attack on the user's data transmission. The attacker quietly receives wireless energy when the user makes a request. The amount of energy received by the attacker is random and depends on the channel state; the corresponding channel state transition probability matrix is denoted by $\mathbf{C}_A$. The attacker has an energy storage with a capacity of $E_A$ units of energy. If the attacker performs the jamming attack, it consumes one unit of energy from the storage. Note that the energy units of the user and the attacker can be different; for example, one unit of energy of the attacker could be larger than that of the user, since the attacker requires more energy to successfully jam the user's data transmission. If the attacker successfully jams the user's data transmission, it gains a utility $u_A$; otherwise, it receives zero utility.

In the aforementioned wireless powered network, the user is aware that there is an attacker. The user has to decide whether or not to request wireless energy and transmit data, depending on his/her channel state, traffic state, and the energy level of his/her own energy storage. Similarly, the attacker has to decide whether or not to perform the jamming attack based on its channel state and the energy level of its own energy storage. In the following, we develop a game theoretic model for the user and attacker to obtain their equilibrium policies.
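A small sketch of the user-side dynamics just described: Markov channel and traffic states, a finite energy storage of capacity $E_U$, and a naive "always request, transmit whenever data and energy are available" policy. All transition matrices, rewards, and the assumption that every traffic state generates data are invented placeholders; the equilibrium policies of the game itself are not computed here.

```python
# Simulate the user's per-slot dynamics under a naive policy (illustration only).
import numpy as np

rng = np.random.default_rng(0)

E_U = 5                                   # storage capacity (units of energy)
C_U = np.array([[0.6, 0.3, 0.1],          # channel-state transition matrix (states 0..2)
                [0.3, 0.4, 0.3],
                [0.1, 0.3, 0.6]])
D   = np.array([[0.7, 0.3],               # traffic-state transition matrix (states 1..2)
                [0.4, 0.6]])
u_d = {1: 1.0, 2: 2.0}                    # utility of a successful transmission per traffic state
mu  = 0.1                                 # cost of an energy request

def step(energy, c, d):
    """One time slot under the naive policy; returns (energy', c', d', utility)."""
    utility = 0.0
    # Always request: harvest c units (channel state = units received), pay cost mu.
    energy = min(E_U, energy + c)
    utility -= mu
    # Transmit if data is generated and one unit of energy is available.
    if d in u_d and energy >= 1:
        energy -= 1
        utility += u_d[d]                  # jamming by the attacker is ignored in this sketch
    c = rng.choice(len(C_U), p=C_U[c])     # next channel state
    d = rng.choice([1, 2], p=D[d - 1])     # next traffic state
    return energy, c, d, utility

energy, c, d, total = 2, 0, 1, 0.0
for _ in range(10_000):
    energy, c, d, r = step(energy, c, d)
    total += r
print("average utility per slot:", round(total / 10_000, 3))
```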

Profile of Xu Shengyuan: Xu Shengyuan, male, is a professor and doctoral supervisor at the School of Automation, Nanjing University of Science and Technology, where he received his Ph.D. in Control Theory and Control Engineering.

研究⽅向:1、鲁棒控制与滤波2、⼴义系统3、⾮线性系统2017年SCI1.Relaxed conditions for stability of time-varying delay systems ☆TH Lee,HP Ju,S Xu 《Automatica》, 2017, 75:11-15EI1.Relaxed conditions for stability of time-varying delay systems ☆TH Lee,HP Ju,S Xu 《Automatica》, 2017, 75:11-152.Adaptive Tracking Control for Uncertain Switched Stochastic Nonlinear Pure-feedback Systems with Unknown Backlash-like HysteresisG Cui,S Xu,B Zhang,J Lu,Z Li,...《Journal of the Franklin Institute》, 20172016年SCI1..Finite-time output feedback control for a class of stochastic low-order nonlinear systemsL Liu,S Xu,YZhang《International Journal of Control》, 2016:1-162.Universal adaptive control of feedforward nonlinear systems with unknown input and state delaysX Jia,S Xu,Q Ma,Y Li,Y Chu《International Journal ofControl》, 2016, 89(11):1-193.Robust adaptive control of strict-feedback nonlinear systems with unmodeled dynamics and time-varying delaysX Shi,S Xu,Y Li,W Chen,Y Chu《International Journal of Control》, 2016:1-184.Stabilization of hybrid neutral stochastic differential delay equations by delay feedback controlW Chen,S Xu,YZou《Systems & Control Letters》, 2016, 88(1):1-135.Multi-agent zero-sum differential graphical games for disturbance rejection in distributed control ☆Q Jiao,H Modares,S Xu,FL Lewis,KG Vamvoudakis《Automatica》, 2016, 69(C):24-346.Semiactive Inerter and Its Application in Adaptive Tuned Vibration AbsorbersY Hu,MZQ Chen,S Xu,Y Liu《IEEE Transactions on Control Systems Technology》, 2016:1-77.Decentralised adaptive output feedback stabilisation for stochastic time-delay systems via LaSalle-Yoshizawa-type theoremT Jiao,S Xu,J Lu,Y Wei,Y Zou《International Journal of Control》, 2016, 89(1):69-838.Coverage control for heterogeneous mobile sensor networks on a circleC Song,L Liu,G Feng,S Xu《Automatica》, 2016, 63(3):349-358EI1.Finite-time output feedback control for a class of stochastic low-order nonlinear systemsL Liu,S Xu,YZhang《International Journal of Control》, 2016:1-162.Unified filters design for singular Markovian jump systems with time-varying delaysG Zhuang,S Xu,B Zhang,J Xia,Y Chu,...《Journal of the FranklinInstitute》, 2016, 353(15):3739-37683.Improvement on stability conditions for continuous-time T–S fuzzy systemsJ Chen,S Xu,Y Li,Z Qi,Y Chu《Journal of the Franklin Institute》, 2016, 353(10):2218-22364.Universal adaptive control of feedforward nonlinear systems with unknown input and state delaysX Jia,S Xu,Q Ma,Y Li,Y Chu《International Journal ofControl》, 2016, 89(11):1-195.H∞ Control with Transients for Singular Systems Z Feng,J Lam,S Xu,S Zhou 《Asian Journal of Control》, 2016,18(3):817-8272015年SCI1.Pinning control for cluster synchronisation of complex dynamical networks withsemi-Markovian jump topologyTH Lee,Q Ma,S Xu,HP Ju《International Journal of Control》, 2015, 88(6):1223-12352..Anti-disturbance control for nonlinear systems subject to input saturation via disturbance observer ☆Y Wei,WX Zheng,S Xu《Systems & ControlLetters》, 2015, 85:61-693.Exact tracking control of nonlinear systems with time delays and dead-zone inputZ Zhang,S Xu,B Zhang《Automatica》, 2015, 52(52):272-276EI1.Further studies on stability and stabilization conditions for discrete-time T–S systems with the order relation information of membership functionsJ Chen,S Xu,Y Li,Y Chu,Y Zou《Journal of the Franklin Institute》, 2015, 352(12):5796-5809 .2 .Stability analysis of random systems with Markovian switching and its application T Jiao,J Lu,Y Li,Y Chu,SXu《Journal of the Franklin Institute》, 2015, 353(1):200-220 3.Exact tracking control of nonlinear 
systems with time delays and dead-zone input. Z Zhang, S Xu, B Zhang, Automatica, 2015, 52(52):272-276. 4. Event-triggered average consensus for multi-agent systems with nonlinear dynamics and switching topology. D Xie, S Xu, Y Chu, Y Zou, Journal of the Franklin Institute, 2015, 352(3):1080-1098.

Profile of Ge Shuzhi: Ge Shuzhi, male, Han nationality, born on September 20, 1963, in Gejiapengwang Village, Jingzhi, Anqiu County, Shandong Province.

statistical language model

Statistical Language ModelIntroductionStatistical Language Model (SLM) is a technique used in natural language processing (NLP) and computational linguistics to estimate the probability of a sequence of words or phrases. It is a fundamental part of many NLP tasks such as speech recognition, machine translation, and text generation. In this article, we will explore the key concepts, applications, and techniques used in statistical language modeling.Key Concepts1. Language ModelsA language model is a mathematical representation of the probability of a sequence of words in a given language. It assigns a score to each possible word sequence based on the patterns observed in a large corpus of text. Language models capture the syntactic and semanticrelationships between words and help predict the next word in a sequence given the preceding context.2. Probability DistributionIn statistical language modeling, a probability distribution is used to assign probabilities to different word sequences. The probability distribution provides a measure of how likely a particular word sequence is in a given language. It can be represented as a conditional probability, where the probability of a word depends on the context or the preceding words in the sequence.3. N-gramsN-grams are contiguous sequences of N words. They are the basic building blocks of a language model. N-grams can be unigrams (single words), bigrams (pairs of words), trigrams (triplets of words), or more. Thechoice of the value of N depends on the context and the specific language modeling task. N-grams help capture the local dependencies between words and are used to estimate the probability of word sequences.Applications of Statistical Language Models1. Speech RecognitionStatistical language models play a crucial role in automatic speech recognition systems. They help improve the accuracy of speechrecognition by incorporating the language-specific knowledge into the decoding process. By assigning higher probabilities to likely word sequences, SLMs help in choosing the most probable interpretation of an input speech signal.2. Machine TranslationSLMs are used in machine translation systems to generate fluent and grammatically correct translations. By assigning higher probabilities to the translations that are more likely to occur in the target language, SLMs help in choosing the most appropriate translations for input sentences. They also help improve the overall quality of the translated text.3. Information RetrievalStatistical language models are used in information retrieval systems to rank search results based on their relevance to a given query. SLMs help in estimating the probability of a query given a document or acollection of documents. This probability is used to assign a relevance score to each document, and the documents with higher scores are ranked higher in the search results.4. Text GenerationSLMs are used in text generation tasks such as automatic summarization, text completion, and dialogue systems. By modeling the probabilities of word sequences, SLMs can generate coherent and contextually appropriatetext. They help in generating fluent and natural-sounding text by predicting the most probable next word based on the preceding context.Techniques in Statistical Language Modeling1. Maximum Likelihood EstimationMaximum Likelihood Estimation (MLE) is a common technique used to estimate the probabilities of word sequences in SLMs. 
MLE estimates the parameters of the probability distribution by maximizing the likelihood of the observed data (word sequences) given the model. This technique assumes that the observed data is generated by the model, and it finds the parameters that make the observed data most likely.2. Smoothing TechniquesSmoothing techniques are used to handle the problem of unseen or rare word sequences in SLMs. When a word sequence is not present in the training data, its probability is zero, leading to poor model performance. Smoothing techniques assign non-zero probabilities to unseen word sequences by redistributing the probability mass from the seen word sequences. Popular smoothing techniques used in SLMs include Laplace smoothing, add-k smoothing, and Good-Turing smoothing.3. Backoff and InterpolationBackoff and interpolation techniques are used to combine n-gram models of different orders to improve the performance of SLMs. In backoff, when a higher-order n-gram is not observed in the training data, the model backoffs to a lower-order n-gram to estimate the probability. Interpolation combines the probabilities from multiple n-gram models to estimate the probability of a word sequence. These techniques help in capturing both the local and global dependencies between words.4. Neural Language ModelsRecent advancements in deep learning have led to the development of neural language models. These models use neural networks, such asrecurrent neural networks (RNNs) or transformer models, to learn the probability distribution over word sequences. Neural language models have shown significant improvements over traditional SLMs by capturing complex language patterns and long-range dependencies.ConclusionStatistical Language Models are powerful tools in natural language processing and computational linguistics. They help in estimating the probability of word sequences and are used in various NLP tasks such as speech recognition, machine translation, and text generation. By capturing the syntactic and semantic relationships between words, SLMs enable computers to understand and generate human language. With advancements in techniques like neural language models, the accuracy and fluency of SLMs are continually improving, making them essential in numerous applications requiring natural language processing.。
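The MLE and smoothing techniques described above lend themselves to a compact illustration. Below is a minimal sketch of a bigram model with maximum likelihood estimation, add-k smoothing, and a simple backoff to the unigram distribution; the toy corpus, the value of k, the backoff discount, and the function names are illustrative assumptions rather than a reference implementation.

```python
from collections import Counter

# Tiny illustrative corpus (assumption); real models use large text collections.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                      # vocabulary size
N = len(corpus)                        # number of tokens

def p_mle(w, prev):
    """Maximum likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_addk(w, prev, k=0.5):
    """Add-k smoothing: (count(prev, w) + k) / (count(prev) + k*V) assigns a
    non-zero probability to every bigram, including unseen ones."""
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

def p_backoff(w, prev):
    """Simple backoff: use the MLE bigram if it was observed, otherwise fall
    back to a discounted unigram estimate."""
    if bigrams[(prev, w)]:
        return p_mle(w, prev)
    return 0.4 * unigrams[w] / N

print(p_mle("sat", "cat"), p_addk("sat", "dog"), p_backoff("mat", "dog"))
```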

Northeastern University Doctoral Dissertation Format

Northeastern University doctoral dissertation typesetting and printing format

1. Introduction. This format is adapted from the PRC national standard "Presentation of scientific and technical reports, dissertations and scientific papers" and the Northeastern University dissertation format, and is intended for candidates of the university writing and printing master's and doctoral dissertations. It takes effect from its date of issue.

2. Main parts of the dissertation. A dissertation consists of front matter, a main body, and back matter.

2.1 Front matter: (1) cover; (2) title page (Chinese and English); (3) abstract (Chinese and English); (4) declaration of originality; (5) table of contents; (6) list of figures and tables (only if necessary); (7) list of abbreviations, symbols, and units (only if necessary); (8) glossary of terms (only if necessary).

2.2 Main body: (1) introduction (preface/foreword); (2) main text; (3) discussion, conclusions, and recommendations.

2.3 Back matter (only if necessary): (1) references; (2) acknowledgements; (3) academic achievements obtained during doctoral study; (4) the author's résumé of research and study experience; (5) bibliography of further references (only if necessary); (6) index (only if necessary).

3. Layout. Paper size: standard A4 copy paper (210 mm × 297 mm).

Type area (printed area): 160 mm × 247 mm, excluding the header line and the page-number line.

Body font and size: Song typeface, size "small 4", consistent throughout the text.

30–35 lines per page, 35–38 characters per line.

Binding: double-sided printing, bound along the long edge.

Page numbers: consecutive Arabic numerals in the same size as the body font, centered at the bottom of the page, with a dot or a dash on each side, e.g., ·3· or -3-.

Header: starting from the abstract page, add a header with a single or double rule; the header text is set in size-5 Kai typeface, with "Northeastern University master's/doctoral dissertation" at the left and the chapter number and title at the right.

Cover: the standard Northeastern University graduate (doctoral or master's) dissertation cover (double A4).

4. Style
4.1 Headings. The main text is divided into chapters, sections, subsections, and items. The Arabic numerals of different levels are separated by a period "." (half-width solid dot), with no period after the last-level number.

The layout is shown in Table 4.1.

This hierarchical numbering goes down to the fourth level only.

Further subdivision may use (1), (2), …; (a), (b), …, etc.

A Neural Probabilistic Language Model

Yoshua Bengio,R´e jean Ducharme and Pascal VincentD´e partement d’Informatique et Recherche Op´e rationnelleCentre de Recherche Math´e matiquesUniversit´e de Montr´e alMontr´e al,Qu´e bec,Canada,H3C3J7bengioy,ducharme,vincentp@iro.umontreal.caAbstractA goal of statistical language modeling is to learn the joint probabilityfunction of sequences of words.This is intrinsically difficult because ofthe curse of dimensionality:we propose tofight it with its own weapons.In the proposed approach one learns simultaneously(1)a distributed rep-resentation for each word(i.e.a similarity between words)along with(2)the probability function for word sequences,expressed with these repre-sentations.Generalization is obtained because a sequence of words thathas never been seen before gets high probability if it is made of wordsthat are similar to words forming an already seen sentence.We report onexperiments using neural networks for the probability function,showingon two text corpora that the proposed approach very significantly im-proves on a state-of-the-art trigram model.1IntroductionA fundamental problem that makes language modeling and other learning problems diffi-cult is the curse of dimensionality.It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables(such as words in a sentence,or discrete attributes in a data-mining task).For example,if one wants to model the joint distribution of10consecutive words in a natural language with a vocabulary of size100,000,there are potentially free parameters.A statistical model of language can be represented by the conditional probability of the next word given all the previous ones in the sequence,sincewhere is the-th word,and writing subsequence. When building statistical models of natural language,one reduces the difficulty by taking advantage of word order,and the fact that temporally closer words in the word sequence are statistically more dependent.Thus,n-gram models construct tables of conditional proba-bilities for the next word,for each one of a large number of contexts,binations of the last words:Only those combinations of succes-sive words that actually occur in the training corpus(or that occur frequently enough)are considered.What happens when a new combination of words appears that was not seen in the training corpus?A simple answer is to look at the probability predicted using smaller context size,as done in back-off trigram models[7]or in smoothed(or interpolated)trigram models[6].So,in such models,how is generalization basically obtained from sequences ofwords seen in the training corpus to new sequences of words?simply by looking at a short enough context,i.e.,the probability for a long sequence of words is obtained by“gluing”very short pieces of length1,2or3words that have been seen frequently enough in the training data.Obviously there is much more information in the sequence that precedes the word to predict than just the identity of the previous couple of words.There are at least two obviousflaws in this approach(which however has turned out to be very difficult to beat):first it is not taking into account contexts farther than1or2words,second it is not taking account of the“similarity”between words.For example,having seen the sentence The cat is walking in the bedroom in the training corpus should help us general-ize to make the sentence A dog was running in a room almost as likely,simply because“dog”and“cat”(resp.“the”and“a”,“room”and“bedroom”,etc...)have similar 
semantics and grammatical roles.1.1Fighting the Curse of Dimensionality with its Own WeaponsIn a nutshell,the idea of the proposed approach can be summarized as follows:1.associate with each word in the vocabulary a distributed“feature vector”(a real-valued vector in),thereby creating a notion of similarity between words,2.express the joint probability function of word sequences in terms of the featurevectors of these words in the sequence,and3.learn simultaneously the word feature vectors and the parameters of that function. The feature vector represents different aspects of a word:each word is associated with a point in a vector space.The number of features(e.g.or in the experiments) is much smaller than the size of the vocabulary.The probability function is expressed as a product of conditional probabilities of the next word given the previous ones,(ing a multi-layer neural network in the experiment).This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data or a regularized criterion,e.g.by adding a weight decay penalty.The feature vectors associated with each word are learned,but they can be initialized using prior knowledge.Why does it work?In the previous example,if we knew that dog and cat played simi-lar roles(semantically and syntactically),and similarly for(the,a),(bedroom,room), (is,was),(running,walking),we could naturally generalize from The cat is walking in the bedroom to A dog was running in a room and likewise to many other combinations.In the proposed model,it will so generalize because“simi-lar”words should have a similar feature vector,and because the probability function is a smooth function of these feature values,so a small change in the features(to obtain similar words)induces a small change in the probability:seeing only one of the above sentences will increase the probability not only of that sentence but also of its combinatorial number of“neighbors”in sentence space(as represented by sequences of feature vectors).1.2Relation to Previous WorkThe idea of using neural networks to model high-dimensional discrete distributions has already been found useful in[3]where the joint probability of is decomposed as a product of conditional probabilities:where is a function represented by part of a neural network, and it yields parameters for expressing the distribution of.Experiments on four UCI data sets show this approach to work comparatively very well[3,2].The idea of a dis-tributed representation for symbols dates from the early days of connectionism[5].More recently,Hinton’s approach was improved and successfully demonstrated on learning sev-eral symbolic relations[9].The idea of using neural networks for language modeling is not new either,e.g.[8].In contrast,here we push this idea to a large scale,and concentrate on learning a statistical model of the distribution of word sequences,rather than learning the role of words in a sentence.The proposed approach is also related to previous proposalsof character-based text compression using neural networks[11].Learning a clustering of words[10,1]is also a way to discover similarities between words.In the model proposed here,instead of characterizing the similarity with a discrete random or deterministic vari-able(which corresponds to a soft or hard partition of the set of words),we use a continuous real-vector for each word,i.e.a distributed feature vector,to indirectly represent similarity between words.The idea of using a vector-space representation for words 
has been well exploited in the area of information retrieval(for example see[12]),where vectorial fea-ture vectors for words are learned on the basis of their probability of co-occurring in the same documents(Latent Semantic Indexing[4]).An important difference is that here we look for a representation for words that is helpful in representing compactly the probabil-ity distribution of word sequences from natural language text.Experiments indicate that learning jointly the representation(word features)and the model makes a big difference in performance.2The Proposed Model:two ArchitecturesThe training set is a sequence of words,where the vocabulary is a large butfinite set.The objective is to learn a good model, in the sense that it gives high out-of-sample likelihood.In the experiments,we will report the geometric average of,also known as perplexity,which is also the ex-ponential of the average negative log-likelihood.The only constraint on the model is that for any choice of,.By the product of these conditional probabilities,one obtains a model of the joint probability of any sequence of words.The basic form of the model is described here.Refinements to speed it up and extend it will be described in the following sections.We decompose the functionin two parts:1.A mapping from any element of to a real vector.It representsthe“distributed feature vector”associated with each word in the vocabulary.Inpractice,is represented by a matrix(of free parameters).2.The probability function over words,expressed with.We have considered twoalternative formulations:(a)The direct architecture:a function maps a sequence of featurevectors for words in context to a probabil-ity distribution over words in.It is a vector function whose-thelement estimates the probability as infigure1..We used the“soft-max”in the output layer of a neural net:,where is the neural network output score for word.(b)The cycling architecture:a function maps a sequence of featurevectors(i.e.including the context wordsand a candidate next word)to a scalar,and again using a soft-max,..We call this architecture“cycling”be-cause one repeatedly runs(e.g.a neural net),each time putting in input thefeature vector for a candidate next word.The function is a composition of these two mappings(and),with being shared across all the words in the context.To each of these two parts are associated some pa-rameters.The parameters of the mapping are simply the feature vectors themselves (represented by a matrix whose row is the feature vector for word).The function may be implemented by a feed-forward or recurrent neural network or another parameterized function,with parameters.Figure1:“Direct Architecture”:where is the neural network and is the-th word feature vector.Training is achieved by looking for that maximize the training corpus penalized log-likelihood:descent,we have found it useful to break the corpus in paragraphs and to randomly permute them.In this way,some of the non-stationarity in the word stream is eliminated,yielding faster convergence.Capacity control.For the“smaller corpora”like Brown(1.2million examples),we have found early stopping and weight decay useful to avoid over-fitting.For the larger corpora, our networks still under-fit.For the larger corpora,we have found double-precision com-putation to be very important to obtain good results.Mixture of models.We have found improved performance by combining the probability predictions of the neural network with those of the smoothed trigram,with weights that were conditional on the 
frequency of the context(same procedure used to combine trigram, bigram,and unigram in the smoothed trigram).Initialization of word feature vectors.We have tried both random initialization(uniform between-.01and.01)and a“smarter”method based on a Singular Value Decomposition (SVD)of a very large matrix of“context features”.These context features are formed by counting the frequency of occurrence of each word in each one of the most frequent contexts(word sequences)in the corpus.The idea is that“similar”words should occur with similar frequency in the same contexts.We used about9000most frequent contexts, and compressed these to30features with the SVD.Out-of-vocabulary words.For an out-of-vocabulary word we need to come up with a feature vector in order to predict the words that follow,or predict its probability(that is only possible with the cycling architecture).We used as feature vector the weighted average feature vector of all the words in the short list,with the weights being the relative probabilities of those words:.4Experimental ResultsComparative experiments were performed on the Brown and Hansard corpora.The Brown corpus is a stream of1,181,041words(from a large variety of English texts and books). Thefirst800,000words were used for training,the following200,000for validation(model selection,weight decay,early stopping)and the remaining181,041for testing.The number of different words is(including punctuation,distinguishing between upper and lower case,and including the syntactical marks used to separate texts and paragraphs). Rare words with frequency were merged into a single token,reducing the vocabulary size to.The Hansard corpus(Canadian parliament proceedings,French version)is a stream of about34million words,of which32millions(set A)was used for training,1.1million (set B)was used for validation,and1.2million(set C)was used for out-of-sample tests. 
The original data has different words,and those with frequency were merged into a single token,yielding different words.The benchmark against which the neural network was compared is an interpolated or smoothed trigram model[6].Let represent the discretized fre-quency of occurrence of the context(we usedwhere is the frequency of occurrence of the context and is the size of the training cor-pus).A conditional mixture of the trigram,bigram,unigram and zero-gram was learned on the validation set,with mixture weights conditional on discretized frequency.Below are measures of test set perplexity(geometric average of)for dif-ferent models.Apparent convergence of the stochastic gradient descent procedure was obtained after around10epochs for Hansard and after about50epochs for Brown,with a learning rate gradually decreased from approximately to.Weight decay ofor was used in all the experiments(based on a few experiments compared on the validation set).The main result is that the neural network performs much better than the smoothedtrigram.On Brown the best neural network system,according to validation perplexity (among different architectures tried,see below)yielded a perplexity of258,while the smoothed trigram yields a perplexity of348,which is about35%worse.This is obtained using a network with the direct architecture mixed with the trigram(conditional mixture), with30word features initialized with the SVD method,40hidden units,and words of context.On Hansard,the correspondingfigures are44.8for the neural network and54.1 for the smoothed trigram,which is20.7%worse.This is obtained with a network with the direct architecture,100randomly initialized words features,120hidden units,and words of context.More context is useful.Experiments with the cycling architecture on Brown,with30 word features,and30hidden units,varying the number of context words:(like the bigram)yields a test perplexity of302,yields291,yields281,yields 279(N.B.the smoothed trigram yields348).Hidden units help.Experiments with the direct architecture on Brown(with direct input to output connections),with30word features,5words of context,varying the number of hidden units:0yields a test perplexity of275,10yields267,20yields266,40yields265, 80yields265.Learning the word features jointly is important.Experiments with the direct architec-ture on Brown(40hidden units,5words of context),in which the word features initialized with the SVD method are keptfixed during training yield a test perplexity of345.8whereas if the word features are trained jointly with the rest of the parameters,the perplexity is265. 
Initialization not so useful.Experiments on Brown with both architectures reveal that the SVD initialization of the word features does not bring much improvement with respect to random initialization:it speeds up initial convergence(saving about2epochs),and yields a perplexity improvement of less than0.3%.Direct architecture works a bit better.The direct architecture was found about2%better than the cycling architecture.Conditional mixture helps but even without it the neural net is better.On Brown,the best neural net without the mixture yields a test perplexity of265,the smoothed trigram yields348,and their conditional mixture yields258(i.e.,better than both).On Hansard the improvement is less:a neural network yielding46.7perplexity,mixed with the trigram (54.1),yields a mixture with perplexity45.1.5Conclusions and Proposed ExtensionsThe experiments on two corpora,a medium one(1.2million words),and a large one(34 million words)have shown that the proposed approach yields much better perplexity than a state-of-the-art method,the smoothed trigram,with differences on the order of20%to 35%.We believe that the main reason for these improvements is that the proposed approach allows to take advantage of the learned distributed representation tofight the curse of di-mensionality with its own weapons:each training sentence informs the model about a combinatorial number of other sentences.Note that if we had a separate feature vector for each“context”(short sequence of words),the model would have much more capacity (which could grow like that of n-grams)but it would not naturally generalize between the many different ways a word can be used.A more reasonable alternative would be to ex-plore language units other than words(e.g.some short word sequences,or alternatively some sub-word morphemic units).There is probably much more to be done to improve the model,at the level of architecture, computational efficiency,and taking advantage of prior knowledge.An important priority of future research should be to evaluate and improve the speeding-up tricks proposed here, andfind ways to increase capacity without increasing training time too much(to deal withcorpora with hundreds of millions of words).A simple idea to take advantage of temporal structure and extend the size of the input window to include possibly a whole paragraph, without increasing too much the number of parameters,is to use a time-delay and possibly recurrent neural network.In such a multi-layered network the computation that has been performed for small groups of consecutive words does not need to be redone when the network input window is shifted.Similarly,one could use a recurrent network to capture potentially even longer term information about the subject of the text.A very important area in which the proposed model could be improved is in the use of prior linguistic knowledge:semantic(e.g.Word Net),syntactic(e.g.a tagger),and morpho-logical(radix and morphemes).Looking at the word features learned by the model should help understand it and improve it.Finally,future research should establish how useful the proposed approach will be in applications to speech recognition,language translation,and information retrieval.AcknowledgmentsThe authors would like to thank L´e on Bottou and Yann Le Cun for useful discussions.This research was made possible by funding from the NSERC granting agency.References[1] D.Baker and A.McCallum.Distributional clustering of words for text classification.In SI-GIR’98,1998.[2]S.Bengio and Y.Bengio.Taking on 
the curse of dimensionality in joint distributions usingneural networks.IEEE Transactions on Neural Networks,special issue on Data Mining and Knowledge Discovery,11(3):550–557,2000.[3]Yoshua Bengio and Samy Bengio.Modeling high-dimensional discrete data with multi-layerneural networks.In S.A.Solla,T.K.Leen,and K-R.Mller,editors,Advances in Neural Information Processing Systems12,pages400–406.MIT Press,2000.[4]S.Deerwester,S.T.Dumais,G.W.Furnas,ndauer,and R.Harshman.Indexing by latentsemantic analysis.Journal of the American Society for Information Science,41(6):391–407, 1990.[5]G.E.Hinton.Learning distributed representations of concepts.In Proceedings of the Eighth An-nual Conference of the Cognitive Science Society,pages1–12,Amherst1986,wrence Erlbaum,Hillsdale.[6] F.Jelinek and R.L.Mercer.Interpolated estimation of Markov source parameters from sparsedata.In E.S.Gelsema and L.N.Kanal,editors,Pattern Recognition in Practice.North-Holland, Amsterdam,1980.[7]Slava M.Katz.Estimation of probabilities from sparse data for the language model componentof a speech recognizer.IEEE Transactions on Acoustics,Speech,and Signal Processing,ASSP-35(3):400–401,March1987.[8]R.Miikkulainen and M.G.Dyer.Natural language processing with modular neural networksand distributed lexicon.Cognitive Science,15:343–399,1991.[9] A.Paccanaro and G.E.Hinton.Extracting distributed representations of concepts and relationsfrom positive and negative propositions.In Proceedings of the International Joint Conference on Neural Network,IJCNN’2000,Como,Italy,2000.IEEE,New York.[10] F.Pereira,N.Tishby,and L.Lee.Distributional clustering of english words.In30th AnnualMeeting of the Association for Computational Linguistics,pages183–190,Columbus,Ohio, 1993.[11]J¨u rgen Schmidhuber.Sequential neural text compression.IEEE Transactions on Neural Net-works,7(1):142–146,1996.[12]H.Schutze.Word space.In S.J.Hanson,J.D.Cowan,and C.L.Giles,editors,Advances inNeural Information Processing Systems5,pages pp.895–902,San Mateo CA,1993.Morgan Kaufmann.。
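As a concrete companion to the "direct architecture" of Section 2 (word feature vectors concatenated, passed through a hidden layer, and fed to a soft-max over the vocabulary), here is a hedged NumPy sketch of a single forward pass. The dimensions, the tanh hidden layer, and all variable names are illustrative assumptions; the paper additionally trains by stochastic gradient descent on the penalized log-likelihood and mixes the network with a smoothed trigram, both omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): vocabulary V, feature dim m, context n-1, hidden h.
V, m, n_context, h = 1000, 30, 3, 40

C = rng.normal(scale=0.01, size=(V, m))          # word feature matrix, one row per word
W_hidden = rng.normal(scale=0.01, size=(h, n_context * m))
b_hidden = np.zeros(h)
W_out = rng.normal(scale=0.01, size=(V, h))      # output scores for every word in V
b_out = np.zeros(V)

def next_word_distribution(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for the 'direct architecture': concatenate
    the context word feature vectors, apply a tanh hidden layer, then a soft-max
    over the whole vocabulary."""
    x = C[context_ids].reshape(-1)               # concatenated feature vectors
    hidden = np.tanh(W_hidden @ x + b_hidden)
    scores = W_out @ hidden + b_out              # one score per vocabulary word
    scores -= scores.max()                       # numerical stability
    p = np.exp(scores)
    return p / p.sum()

probs = next_word_distribution([12, 7, 42])      # three context words, by index
print(probs.shape, probs.sum())                  # (V,) and ~1.0
```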

LMS Algorithm (Least Mean Squares, 2020)

Commonly used machine learning and data mining knowledge points. Basics: MSE (Mean Square Error), LMS (Least Mean Squares), LSM (Least Squares Method), MLE (Maximum Likelihood Estimation), QP (Quadratic Programming), CP (Conditional Probability), JP (Joint Probability), MP (Marginal Probability), Bayes' Formula, L1/L2 Regularization (and further variants such as the currently popular L2.5 regularization), GD (Gradient Descent), SGD (Stochastic Gradient Descent), Eigenvalue, Eigenvector, QR decomposition, Quantile, Covariance (covariance matrix).

Common distributions. Discrete distributions: Bernoulli/Binomial distribution, Negative Binomial distribution, Multinomial distribution, Geometric distribution, Hypergeometric distribution, Poisson distribution. Continuous distributions: Uniform distribution, Normal (Gaussian) distribution, Exponential distribution, Lognormal distribution, Gamma distribution, Beta distribution, Dirichlet distribution, Rayleigh distribution, Cauchy distribution, Weibull distribution. Three sampling distributions: Chi-square distribution, t-distribution, F-distribution. Data pre-processing: missing value imputation, discretization, mapping, normalization/standardization.
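Since the section is titled after the LMS (least mean squares) algorithm but only lists it as a keyword, here is a minimal, self-contained sketch of LMS adaptive filtering used for system identification. The filter length, step size, and the synthetic "unknown system" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown system to identify (assumed for illustration): a short FIR filter.
w_true = np.array([0.4, -0.3, 0.2, 0.1])
N, L, mu = 5000, len(w_true), 0.01          # samples, filter length, step size

x = rng.normal(size=N)                                        # input signal
d = np.convolve(x, w_true)[:N] + 0.01 * rng.normal(size=N)    # desired signal + noise

w = np.zeros(L)                              # adaptive filter weights
for n in range(L, N):
    x_n = x[n - L + 1:n + 1][::-1]           # most recent L input samples
    y_n = w @ x_n                            # filter output
    e_n = d[n] - y_n                         # instantaneous error
    w += mu * e_n * x_n                      # LMS update: w <- w + mu * e * x

print("estimated weights:", np.round(w, 3))
print("true weights:     ", w_true)
```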

The Variable Competence Model

• The Variable Competence Model advocates a two-way, interactive classroom in which listening and speaking coexist and teachers and students participate together. The teacher acts as a partner or even an observer; students can take on many roles beyond that of listener and can use discourse to perform a variety of speech acts. Genuine communicative language activity, i.e., unplanned discourse, takes place in the classroom.
• The Variable Competence teaching model is divided into five stages, carried out as five passes of processing.
• Pass 1: Building the discourse framework. Start from unplanned discourse activity (Ellis, 1984): communicate in natural language in an open situation and, through background introduction, problem discussion, or selecting and listing key words, activate the chunks, semantic knowledge, and functional knowledge already present in the learners' competence so that they interact with the discourse activity to construct the theme and sub-themes of a topic. This pass focuses on semantic concepts and the relations among them; it mainly builds the logical structure of the topic and a rough outline at the discourse level.
• Pass 2: Regularizing the language. Activate the grammatical and discourse knowledge (rhetoric, connectives) in the learners' competence to monitor and regularize the natural discourse. This is a "planned discourse" process (Ellis, 1984), a conscious process. In practice the teacher models the language, correcting, supplementing, and regularizing the learners' rough output so that it moves from interlanguage toward the target language. Attention is paid to i+1 processing: the teacher reads timely feedback from the students and adjusts the difficulty of vocabulary and sentence patterns accordingly, deliberately exposing students to some new words through paraphrase so that lexical difficulty is shifted in advance and students are not overwhelmed by too many new words at once.
• Pass 3: Reading the text and enriching the content. Activate discourse knowledge (rhetoric, cohesion) for macro-level analysis of the text and grammatical knowledge for micro-level analysis of words and sentences so that, building on Passes 1 and 2, the constructed content becomes fuller and the language richer.
• Pass 4: Composing the discourse. Use cohesion and coherence devices such as thematic progression, synonymous substitution, hypernym/hyponym replacement, reference, and connectives to organize segments into a text.
• Pass 5: Output, in written or oral form (by the students). Possible formats include summary or retelling, debate, and role play.

Several Typical Asset-Liability Management Models

Many ALM methods are in use today; the most common include efficient frontier simulation, duration matching (also called immunization), and cash flow matching. The mathematical tools involved are mainly optimization and stochastic control.

1. The Efficient Frontier model. The efficient frontier was originally proposed by Markowitz [1] and developed as a method of portfolio selection. It represents return by the expectation and risk by the corresponding variance (or standard deviation), and is therefore also called the mean-variance model. The model produces a whole family of efficient frontiers rather than a single recommendation, and these frontiers cover only a small fraction of all possible portfolios [2]. One of the most commonly used ALM techniques is to use simulation to find an efficient frontier strategy based on mean and variance: given two investment strategies, their expectations and variances are easy to compute, and as we randomly add paths and strategies, the upper boundary of the mean-variance scatter plot approaches the so-called efficient frontier line, which means the optimal risk/return investment strategies have been identified.

As an example of an efficient frontier model:

Objective:  min  λ Σ_{i∈U} Σ_{j∈U} q_{ij} x_i x_j − Σ_{i∈U} μ_i x_i
Subject to:  A x = b

where U = {1, 2, 3, …, I} is the set of securities; q_{ij} is the covariance between securities i and j, i, j ∈ U; μ_i is the expected return of security i; x_i, i ∈ U, is the proportion of security i in the portfolio, i.e., it represents the portfolio structure; and λ is the parameter weighting variance against expected return, varied to trace out the efficient frontier.

However, according to recent theoretical work on the capital asset pricing model (CAPM), for a firm with specific liabilities the efficient frontier shrinks to a single point [3]. Sampling techniques allow us to examine a representative set of paths, but in practice it is almost impossible to construct a sufficiently large set of strategies and then examine all feasible paths and strategies.
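A minimal sketch of the simulation idea described above, under assumed data: random portfolios are generated, their expected return and variance are computed from μ_i and q_{ij}, and the non-dominated boundary of the scatter approximates the efficient frontier. The covariance matrix, the expected returns, and the specialization of Ax = b to a single budget constraint Σ x_i = 1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (assumptions): expected returns mu_i and covariances q_ij.
mu = np.array([0.04, 0.06, 0.08, 0.10])
A_half = rng.normal(size=(4, 4))
Q = A_half @ A_half.T / 100.0               # a valid covariance matrix q_{ij}

def random_portfolio():
    """A feasible portfolio x with x_i >= 0 and sum(x_i) = 1 (the constraint
    Ax = b specialized to a single budget equation for this sketch)."""
    x = rng.random(4)
    return x / x.sum()

points = []
for _ in range(20_000):
    x = random_portfolio()
    ret = mu @ x                             # expected return: sum_i mu_i x_i
    var = x @ Q @ x                          # variance: sum_i sum_j q_ij x_i x_j
    points.append((var, ret))

# Approximate the efficient frontier: keep the points not dominated in
# (lower variance, higher expected return).
points.sort()
frontier, best = [], -np.inf
for var, ret in points:
    if ret > best:
        best = ret
        frontier.append((var, ret))

print(f"{len(frontier)} frontier points; lowest-variance point:", frontier[0])
```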

2. Duration Matching [4]. Given a set of cash flows, the duration of a security can be computed; conceptually, duration can be viewed as the time-weighted present value of the cash flows.

The duration matching (or immunization) approach matches the interest-rate risk of assets and liabilities within the portfolio.

Traditional models of this approach assume that the term structure of interest rates is flat and shifts in parallel.
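A short sketch of the duration calculation mentioned above (Macaulay duration as the present-value-weighted average time of the cash flows). The bond cash flows and the flat yield are assumed for illustration.

```python
def macaulay_duration(cash_flows, times, y):
    """Macaulay duration: present-value-weighted average time of the cash flows,
    assuming a flat yield y (consistent with the flat, parallel-shift assumption)."""
    pv = [cf / (1 + y) ** t for cf, t in zip(cash_flows, times)]
    price = sum(pv)
    return sum(t * v for t, v in zip(times, pv)) / price

# Example (assumed): a 3-year bond, 5% annual coupon, face value 100, flat yield 4%.
cfs, ts = [5, 5, 105], [1, 2, 3]
print("Macaulay duration (years):", round(macaulay_duration(cfs, ts, 0.04), 3))
```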

BUILDING COMPACT N-GRAM LANGUAGE MODELS INCREMENTALLY

BUILDING COMPACT N-GRAM LANGUAGE MODELSINCREMENTALLYVesa SiivolaNeural Networks Research Centre,Helsinki University of Technology,FinlandAbstractIn traditional n-gram language modeling,we collect the statistics for all n-grams observed in the training set up to a certain order.The model can then be pruned down to a more compact size with some loss in modeling accuracy.One of the more principled methods for pruning the model is the entropy-based pruning proposed by Stolcke(1998).In this paper,we present an algorithm for incrementally constructing an n-gram model.During the model construction,our method uses less memory than the pruning-based algorithms,since we never have to handle the full unpruned model. When carefully implemented,the algorithm achieves a reasonable speed.We compare our models to the entropy-pruned models in both cross-entropy and speech recognition experiments in Finnish. The entropy experiments show that neither of the methods is optimal and that the entropy-based pruning is quite sensitive to the choice of the initial model.The proposed method seems better suitable for creating complex models.Nevertheless,even the small models created by our method perform along with the best of the small entropy-pruned models in speech recognition experiments. The more complex models created by the proposed method outperform the corresponding entropy-pruned models in our experiments.Keywords:variable length n-grams,speech recognition,sub-word units,language model pruning1.IntroductionThe most common way of modeling language for speech recognition is to build an n-gram model.Traditionally,all n-gram counts up to a certain order n are collected and smoothed probability estimates for words are based on these counts.There exist several heuristic methods for pruning the n-gram model to a smaller size.One can for example set cut-off values,so that the n-grams that have occurred less than m times are not used for constructing the model.A more principled approach is presented by Stolcke(1998), where the n-grams,which reduce the training set likelihood the least are pruned from the model.The algorithm seems to be effective in compressing the models with reasonable reductions in the modeling accuracy.In this paper,an incremental method for building n-gram models is presented.We start adding new n-grams to the model until we reach the desired complexity.When de-ciding if a new n-gram should be added,we weight the training set likelihood increase against the resulting growth in model complexity.The approach is based on the Mini-mum Description Length principle(Rissanen1989).The algorithm presented here hassome nice properties:we do not need to decide the highest possible order of an n-gram. 
The construction of the model takes less memory than with the entropy based pruning algorithm,since we are not pruning an existing large model to a smaller size,but extend-ing an existing small model to a bigger size.On the downside,the algorithm has to be carefully implemented to make it reasonably fast.All experiments are conducted on Finnish data.We have found that using morphs, that is statistically learned morpheme like units(Creutz and Lagus2002)as a basis for an n-gram model is more effective,than using a word-based model.Thefirst experiments (Siivola et al.2003)were confirmed by later experiments with a wider variety of models and the morphs were found to consistently outperform other units.Consequently,we will be using the morph-based n-gram models also in the experiments of this paper.We com-pare the proposed model to an entropy-pruned model in both cross-entropy and speech recognition experiments.2.Description of the methodThe algorithm is formulated loosely based on the Minimum Description Length criterion (Rissanen1989),where the object is to send given data with as few bits as possible.The more structure is contained in the data,the more useful it is to send a detailed model of the data,since the actual data can be then described with fewer bits.The coding length of the data is thus the sum of the model code length and the data log likelihood.2.1.Data likelihoodAssume that we have an existing model M o and we are trying to add n-grams of order n into the model.We start by drawing a prefix gram,that is an(n−1)-gram g n−1from some distribution.Next,we try adding all observed n-grams g n starting with the prefix g n−1to the model to create a new model M n.The change of the log likelihood L M of the training data T between the models isΛ(M n,M o)=L M n(T)−L M o(T)(1) Adding the n-grams g n increases the complexity of the model.We want to weight the gain in likelihood against the increase in the model complexity.2.2.Model coding lengthWe are actually only interested in the change of the model complexity.Thus,if we assume our vocabulary to be constant,we need not to think about coding it.For each n-gram g n, we need to store the probability of the n-gram.The interpolation(or back-off)coefficient is common to all n-grams g n starting with the same prefix g n−1.As n-gram models tend to be sparse,they can be efficiently stored in a tree structure(Whittaker and Raj2001). We can claim,that adding n-gram of any order into the tree demans an equal increase in model size,if we make the approximation that all n-grams are prefixes to other n-grams. This means that all n-grams need to store an interpolation coefficient correspondig to the n-grams they are the prefix to.Also,all n-grams also need to store what Whittaker and Raj call child node index,that is the range of child nodes of a particular n-gram prefix. 
Accordingly,if the n-gram prefix needed for storing interpolation coefficient or child node index is not in the model,we need to add the corresponding n-gram.The approximated costΩfor updating the model isΩ(M n,M o)=n·(2log2(W)+2θ)=nC,(2)where W is the size of the lexicon,n is the number of new n-grams in model M n,the cost of2log2(W)comes from storing the word and child node indices.The cost2θcomes from storing the log probability and the interpolation coefficient with the precision ofθbits.2.3.N-gram model constructionThe n-gram model is constructed by sampling the prefixes g n−1and adding all n-grams g n starting with the prefix,if the change∆in data coding lengthΨis negative.∆=Ψ(M n)−Ψ(M o)=Ω(M n,M o)−αΛ(M n,M o)(3) We have added the coefficientαto scale the relative importance of the training set data. We are not trying to encode a certain data set,but we are trying to build an optimal n-gram model of certain complexity.Withα,we can control the size of the resulting model.There is also afixed threshold,which the improvement of the data log likelihood Λ(M n,M o)has to exceed,before the new n-grams are even considered for inclusion to the model.Originally this was to speed up the model construction,but it seems that the resulting models are also somewhat better.For sampling the prefixes we used a simple greedy search.We go through the ex-isting model,order by order,n-gram by n-gram and use these n-grams as the prefix grams. For the n-gram probability estimate,we have used modified Kneser-Ney smoothing(Chen and Goodman1999).Instead of using estimates for optimal discounts,we decided use Powell search(Press et al.1997)tofind the optimal parameter values,since the n-gram distribution of the model was quite different from a model,where all n-grams of a given order from the training set are present.The discount parameters are re-estimated each time when10000new prefixes have been added to a new n-gram order.2.4.MorphsFor splitting words into morpheme-like units,we use slightly modified version of the algorithm presented by Creutz and Lagus(2002).The word list given to the algorithm wasfiltered so,that all words with frequency less than3were removed from the list. 
Word counts were ignored,all words were assumed to have occurred once.This resulted in a lexicon of26000morphs.2.5.Details of the implementationIt is important to consider the implementation of the algorithm carefully.A naive imple-mentation will be too slow for any practical use.In all places of the algorithm,where there is calculation of differences,we only modify and recalculate the parameters,which affect the difference.When we have sampled a prefix,we have tofind the corresponding n-gram counts from the training data.For effective search,we have a word table,where each entry contains an ordered list of locations,where the word has been seen in the training set.We use a slightly modified binary search,starting from the rarest word of the n-gram tofind all the occurrences of the n-gram.We initialized our model to unigram model.It would be possible to start the model construction from0-grams instead of unigrams.This is maybe a theoretically nicer solu-tion,but in practice we suspect that all words will have at least their unigram probabilities estimated anyway.a)Figure1:Experimental results.The model sizes are expressed on a logarithmic scale.a) Cross-entropies against the number of the n-grams in the model.The measured points on each curve correspond to different pruning or growing parameter values.b)Phoneme errors and model sizes.Corresponding word error rates range from25.5%to39.6%. 3.Experiments3.1.DataWe used some data from the Finnish Language Bank(CSC2004)augmented by an almost equal amount of short newswires,resulting in corpus of36M words(100M morphs).50k words were set aside as a test set.The audio data was5hours of short news stories read by one female reader.3.5 hours were used for training,the LM scaling factor was set based on a development set of 33minutes andfinally49minutes of the material were left as the test set.3.2.Cross-entropyWe trained an unpruned baseline3-gram and5-gram model from the data to serve as ref-erence models.We used the SRILM–toolkit(Stolcke2002)to train the entropy-pruned models and compared these against our models.Both the proposed and entropy based pruning method were run with different parameter values for pruning or growing the model.For testing the models,we calculated the cross-entropy of the model M and the test set text T:1H M(T)=−6N u m b e r o f g r a m s N−gram order Figure 2:N-gram distributions of pruned SRILM models and the proposed models.The plot shows the number of n-grams of each order in a model.The points belonging to the same model are connected with a line.3.3.Speech recognition systemOur acoustic features were 12Mel-Cepstral coefficients and power.The feature vector was concatenated with corresponding first order delta features.The acoustic models were monophone HMMs with Gaussian Mixture Models.The acoustic models had explicit du-ration modeling,the post-processor approach presented by Pylkkönen and Kurimo (2004).Our decoder is a so-called stack decoder (Hirsimäki and Kurimo 2004).3.4.Speech recognition experimentsThe speech recognition experiments were run on the same models as the cross-entropy experiments.The phoneme error rate of the models has been shown in Figure 1b.The recognition speeds ranged from 1.5to 3times real time on an AMD Opteron 248machine.Tightening the pruning to faster than real time recognition leads to a very similar figure,with phoneme error rates ranging from 6.2%to 8.4%.The proposed model seems to do relatively better in the speech recognition exper-iments than in the cross-entropy experiments.This 
is probably because the n-gram dis-tribution of the proposed model is more weighted towards the lower order n-grams.This way,the speech recognition errors affect a smaller number of utilized language model histories.It seems likely,that the decoder prunings also play some role.4.Discussion and conclusionsWe presented an incremental method for building n-gram language models.The method seems well suitable for building all but the smallest models.The method does not use a fixed n for building n-gram statistics,instead it incrementally expands the model.The model uses less memory when creating the model than the comparable pruning methods.The experiments show,that the proposed method robustly gets similar results as the ex-isting entropy-based pruning method (Stolcke 1998),where a good choice of the initial n-gram order is required.It seems that both the proposed and entropy based pruning method are suboptimal.In theory,an optimal pruning started from a 5-gram model should always be better than or equal to an optimal pruning started from a trigram model.When creating small models,the entropy based pruning from trigrams gives better results than either the proposed method or entropy based pruning from5-grams.One possible reason for the suboptimal behavior is that both methods use greedy search forfinding the best model.The search is not guaranteed tofind the optimal model. Also,neither of the models takes into account that the lower order n-grams will probably be proportionally more used in new data than the higher order n-grams.In our model we made some crude approximations when estimating the cost of adding new n-grams to the model.More accurate modeling of the cost of inserting an n-gram to the model would penalize the higher order n-grams somewhat and possibly lead to improved models.The models should be further tested with a wide range of different training set sizes and word error rates to get a more accurate view how the models perform compared to each other in more varying circumstances.We chose to use morphs as our base modeling units,but the presented method should also work on word-based models.Experiments should be run on languages where word-based models work better,such as English.5.AcknowledgementsThis work was funded by Finnish National Technology Agency(TEKES).The author thanks Mathias Creutz for the discussion leading to the development of this model and our speech group for helping with the speech recognition experiments.The Finnish news agency(STT)and the Finnish IT center for science(CSC)are thanked for the text data. 
Inger Ekman from the University of Tampere,Department of Information Studies,is thanked for providing the audio data.ReferencesCSC2004.Collection of Finnish text documents from years1990–2000.Finnish IT center for science(CSC).Chen,Stanley F.;Goodman,Joshua1999.An empirical study of smoothing techniques for language modeling.In:Computer Speech and Language13(4),359–393Creutz,Mathias;Lagus,Krista2002.Unsupervised discovery of morphemes.In:Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02.21–30Hirsimäki,Teemu;Kurimo,Mikko2004.Decoder issues in unlimited Finnish speech recognition.In:Proceedings of the6th Nordic Signal Processing Symposium(Norsig).320–323Press,William;Teukolsky,Saul;Vetterling,William;Flannery,Brian(eds.)1997.Numerical recipes in C.Cambridge University PressPylkkönen,Janne;Kurimo,ing phone durations in Finnish large vocabulary con-tinuous speech recognition.In:Proc.Norsig2004.324–326Rissanen,Jorma1989.Stochastic complexity in statistical inquiry theory.World Scientific Pub-lishing Co.,Inc.Siivola,Vesa;Hirsimäki,Teemu;Creutz,Mathias;Kurimo,Mikko2003.Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner.In:Proc.Eurospeech 2003.2293–2296Stolcke,Andreas1998.Entropy-based pruning of backoff language models.In:Proc.DARPA Broadcast News Transcription and Understanding Workshop.270–274Stolcke,Andreas2002.SRILM–an extensible language modeling toolkit.In:Proc.ICSLP2002.901–904./projects/srilm/Whittaker,E.W.D.;Raj,B.2001.Quantization-based language model compression.In:Proc.Eurospeech2001.33–36VESA SIIVOLA is a graduate student(M.Sc.)working as a researcher in Neural Networks Re-search Centre,Helsinki University of Technology.E-mail:Vesa.Siivola@hut.fi.。
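To make the growing criterion of Section 2 concrete, here is a hedged sketch of the acceptance test in equations (1)-(3): a candidate set of n-grams is accepted if the model-cost increase Ω = nC, with C = 2·log2(W) + 2θ, is outweighed by α times the training-set log-likelihood gain Λ, and the gain also exceeds a fixed threshold. The function names, the value θ = 16 bits, and the toy numbers in the usage example are illustrative assumptions, not the author's implementation.

```python
import math

def model_cost_increase(num_new_ngrams, vocab_size, theta_bits=16):
    """Omega(M_n, M_o) = n * (2*log2(W) + 2*theta): each stored n-gram pays for a
    word index, a child-node index, a log probability, and an interpolation
    coefficient (eq. 2)."""
    return num_new_ngrams * (2 * math.log2(vocab_size) + 2 * theta_bits)

def accept_ngrams(loglik_new, loglik_old, num_new_ngrams,
                  vocab_size, alpha=3.0, min_gain=1e-3):
    """Accept the candidate n-grams if the coding-length change
    Delta = Omega - alpha * Lambda is negative (eq. 3), where
    Lambda = L(M_n) - L(M_o) is the likelihood gain (eq. 1) and alpha scales
    the relative importance of the training data, controlling model size."""
    gain = loglik_new - loglik_old            # Lambda(M_n, M_o)
    if gain <= min_gain:                      # fixed threshold, as in the paper
        return False
    delta = model_cost_increase(num_new_ngrams, vocab_size) - alpha * gain
    return delta < 0.0

# Toy usage with assumed numbers: 120 new trigrams, a 26k-morph lexicon, and a
# training-set log-likelihood improvement of 3500.
print(accept_ngrams(loglik_new=-1_200_000.0, loglik_old=-1_203_500.0,
                    num_new_ngrams=120, vocab_size=26_000))
```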

Industrial Engineering Terminology — English Terms for Industrial Engineering

工业工程专业术语-工业工程英语术语工业工程专业术语Asystematiccontext系统环境;Accidentproneness事故倾向;Activeworksampling活动工作采样;Alternativetaskallocatio;Anthropometry人体测量学;Approximationmethod近似方法;Arrivalrate到达率;Auditorysense听觉;BilloflaborA systematic context 系统环境Accident proneness 事故倾向Active work sampling 活动工作采样Alternative task allocation 可选择任务分配Anthropometry 人体测量学Approximation method 近似方法Arrival rate 到达率Auditory sense 听觉Bill of labor system 人力系统清单Bill of material (BOM) system 物料系统清单Business planning and development 业务规划和开发Business process reengineering 业务处理再工程Cellular layout 单元……Communication network 通讯网络Computer-aided design 计算机辅助设计Computer-aided manufacturing 计算机辅助制造Computer-aided process planning 计算机辅助工艺编制Concurrent engineering 并行工程Constrained (unconstrained) optimization 基于约束的优化Constrained model 约束模型Decision-making 决策Delivery time to market 上市时间Design for assembly 面向装配的设计Design for Reusability 面向重用性设计Directed graphic 有向图Discrete 离散Discrete event stochastic system (DESS) 离散事件随机系统Distributed layout 分布式布局Economy of scale 经济规模Ergonomics 人因工程Experimental psychology 实验心理学Factory layout 工厂布局Feedback loops 反馈环Financial management 财务管理Finished goods (parts) 成品Flexible factory 柔性工厂Flexible manufacturing system (FMS) 柔性制造系统Flexible, modular, and easy to reconfigure 柔性,模块化,易于重新部署Functional layout 功能……Fundamental research 基础研究Generalized network 一般化网络Global problem 全局问题Hand held data collector 手持式数据采集器Hand-held computer 手提电脑Human centered design 以人为本的设计Human factors engineering 人因工程Human perception 人的感知Human resource management 人力资源管理Idiosyncratic variable/feature variable 特性变量Independent and dependent variables 独立和非独立变量Industrial Revolution 工业革命Industrially developing country (IDC) 工业发展中国家Industry engineering, 工业工程Information processing 信息处理Information technology 信息技术Innovation management 创新管理Input parameter 输入参数Integer planning 整数规划Integrated manufacturing cell 集成制造单元Integrative OR system 集成化OR(运筹学)系统Interactive expert system 交互式专家系统Interactive simulation language 交互式仿真语言Inventory control 库存控制Joint project 联合项目Lagrangian relaxation 拉格朗日松弛Laptop computer 笔记本Linear program 线性规划Logistical cost 物流成本Logistics 物流Machine utilization 设备利用率Manual response 手动反应Manufacturing and logistics 制造与物流Manufacturing line balancing 制造生产线平衡Markov decision processes 马尔科夫决策过程Material handling 物料搬运Mathematical programming model 数学编程模型Modeling and simulation 建模与仿真Motion analysis 动作分析Multichannel manufacturing 多渠道制造Multi-criteria 多准则Multi-disciplinary 多学科Multi-objective 多目标Network flow problem 网络流动问题Network queuing model 网络排队模型Nonlinear 非线性Objective function 目标函数Occupational hazards 职业危害Operation research 运筹学Operational and modeling science 运作与建模科学Optimization and stochastic process 优化与随机过程OR/AI interface (operation research, artificial intelligence) OR 接口(人工智能) Original Equipment Manufacturer (OEM) 原始设备制造商Outsourcing 外包Parallel processing 并行处理Performance improvement engineering 性能改进工程Portable machine 便携式机器Primal simplex algorithm 单纯型法Probabilistic phenomena 可能现象Problem solving 求解Process layout 工艺…… P61Processing time 处理时间Product layout 产品……Production plant 生产工厂Production system 生产系统Production volume 产量Project management 项目管理Qualitative optimization 数量优化Quality improvement/ Total quality management (TQM) 质量改进/全面质量管理 Quantitative management 量化管理Rapid prototyping 快速原型Raw material 原材料Resource allocation 资源分配Response 反应Routing and dispatching 路由和分发Scalable machine 可伸缩机器Short-term shortages and surpluses 短期短缺和过剩Spare part (warehouse) 备件Spreadsheet software 表格软件Statistical error figure and graphical histogram 统计误差图和柱状图Stochastic network analysis 
随即网络分析Stochastic service system 随即服务系统Sub-system 子系统System engineering 系统工程System life cycle 系统生命周期Team building 团队建设Technology innovation 技术创新Test case 测试用例The hypercube queuing model 超立方排队模型Threshold value 阈值Throughput 产量Time consuming 费时Time study 时间研究Trucking operation 货运操作 Uncertainty 不确定性 Verbal response 语音反应 Visual sense 视觉Work-in-process 在制品Work-measured labor standard (WMLS) 基于工作测量的劳动标准工业工程英语术语1.工业工程的定义:工业工程是对人员,物料,设备,能源和信息所组成的集成系统,进行设计,改善和设置的一门学科。

Introducing Second Language Acquisition — Reading Notes, Chapters 1-2

一.概论Chapter 1. Introducing SLA1.Second language acquisition (SLA)2.Second language (L2)(也可能是第三四五外语) also commonly called a target language (TL)3.Basic questions:1). What exactly does the L2 learner come to know?2). How does the learner acquire this knowledge?3). Why are some learners more successful than others?4.linguistic; psychological; social.Only one (x) Combine (√)Chapter 2. Foundations of SLAⅠ. The world of second languages1.Multi-; bi-; mono- lingualism1)Multilingualism: the ability to use 2 or more languages.(bilingualism: 2 languages; multilingualism: >2)2)Monolingualism: the ability to use only one language.3)Multilingual competence (Vivian Cook, Multicompetence)Refers to: the compound state of a mind with 2 or more grammars.4)Monolingual competence (Vivian Cook, Monocompetence)Refers to: knowledge of only one language.2.People with multicompetence (a unique combination) ≠ 2 monolingualsWorld demographic shows:3.Acquisition4.The number of L1 and L2 speakers of different languages can only beestimated.1)Linguistic information is often not officially collected.2)Answers to questions seeking linguistic information may not bereliable.3) A lack of agreement on definition of terms and on criteria foridentification.Ⅱ. The nature of language learning1.L1 acquisition1). L1 acquisition was completed before you came to school and thedevelopment normally takes place without any conscious effort.2). Complex grammatical patterns continue to develop through the1) Refers to: Humans are born with an innate capacity to learnlanguage.2) Reasons:♦Children began to learn L1 at the same age and in much the same way.♦…master the basic phonological and grammatical operations in L1 at 5/ 6.♦…can understand and create novel utterances; and are not limited to repeating what they have heard; the utterances they produce are often systematically different from those of the adults around them.♦There is a cut-off age for L1 acquisition.♦L1 acquisition is not simply a facet of general intelligence.3)The natural ability, in terms of innate capacity, is that part oflanguage structure is genetically “given” to every human child.3. The role of social experience1) A necessary condition for acquisition: appropriate socialexperience (including L1 input and interaction) is2) Intentional L1 teaching to children is not necessary and may havelittle effect.3) Sources of L1 input and interaction vary for cultural and socialfactors.4) Children get adequate L1 input and interaction→sources has littleeffect on the rate and sequence of phonological and grammatical development.The regional and social varieties (sources) of the input→pronunciationⅢ. L1 vs. L2 learningⅣ. The logical problem of language learning1.Noam Chomsky:1)innate linguistic knowledge must underlie language acquisition2)Universal Grammar2.The theory of Universal Grammar:Reasons:1)Children’s knowledge of language > what could be learned from theinput.2)Constraints and principles cannot be learned.3)Universal patterns of development cannot be explained bylanguage-specific input.Children often say things that adults do not.♦Children use language in accordance with general universal rules of language though they have not developed the cognitive ability to understand these rules. Not learned from deduction or imitation.♦Patterns of children’s language development are not directly determined by the input they receive.。

资本投资论文参考文献范例

资本投资论文参考文献范例

资本投资论文参考文献一、资本投资论文期刊参考文献[1].经济绿色转型视域下的生态资本效率研究.《中国人口·资源与环境》.被中信所《中国科技期刊引证报告》收录ISTIC.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2013年4期.严立冬.屈志光.黄鹂.[2].金融危机、货币政策与企业资本投资兼论经济新常态下货币调控何去何从.《上海财经大学学报(哲学社会科学版)》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2015年5期.连军.马宇.[3].货币政策传导、制度背景与企业资本投资.《重庆大学学报(社会科学版)》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2015年4期.张超.陈名芹.刘星.[4].基于个体语言技能资本投资特性的语言传播规律分析.《社会科学辑刊》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2014年3期.王海兰.宁继鸣.[5].民营企业政治联系的背后:扶持之手与掠夺之手——基于资本投资视角的经验研究.《财经研究》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2011年6期.连军.刘星.连翠珍.[6].政府治理、产权偏好与资本投资.《南开管理评论》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2012年1期.陈德球.李思飞.[7].工商资本投资现代农业的问题与对策分析.《安徽农业科学》.被中信所《中国科技期刊引证报告》收录ISTIC.被北京大学《中文核心期刊要目总览》收录PKU.2014年4期.严瑾.[8].论道德资本投资.《学术论坛》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2013年2期.杨晓锋.[9].工商资本投资农业的指导目录生成及其实现研究.《现代经济探讨》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2014年5期.蒋永穆.张尊帅.[10].大股东控制下的资本投资与利益攫取研究.《南开管理评论》.被北京大学《中文核心期刊要目总览》收录PKU.被南京大学《核心期刊目录》收录CSSCI.2009年2期.郝颖.刘星.林朝南.二、资本投资论文参考文献学位论文类[1].个体语言技能资本投资研究.被引次数:3作者:王海兰.语言经济学山东大学2012(学位年度)[2].中国资本投资动态效率及其影响因素的实证研究.作者:江娅.产业经济学重庆大学2014(学位年度)[3].我国资本投资宏观效率及其影响因素研究.作者:戴小红.产业经济学重庆大学2014(学位年度)[4].IPO超募融资与资本投资——基于我国中小板与创业板上市公司的研究.作者:范高燕.财务学厦门大学2012(学位年度)[5].控制权私有收益对公司价值的影响研究——基于资本投资的视角. 作者:李乐.会计学重庆大学2011(学位年度)[6].中国上市公司控制权私有收益与资本投资的关系研究.被引次数:1 作者:杨江涛.企业管理重庆大学2010(学位年度)[7].制度演进、集团内部资本市场运作与成员企业资本投资.作者:计方.会计学重庆大学2013(学位年度)[8].企业的技术资本投资分析——基于我国制造业现状的分析.作者:唐文文.会计学中国海洋大学2011(学位年度)[9].研发联盟的次优技术产权——基于不完全合约的产权分析.作者:程金伟.产业经济学上海交通大学2013(学位年度)[10].我国城市公共资本投资效益研究.作者:徐娟娟.财政学苏州大学2013(学位年度)三、相关资本投资论文外文参考文献[1]FLOODPROTECTION:HIGHLIGHTINGANINVESTMENTTRAPBETWEENBUILTANDNAT URALCAPITAL. MarjanvandenBeltThomasBowenKimberleySleeVickyForgie 《JournaloftheAmericanWaterResourcesAssociation》,被EI收录EI.被SCI 收录SCI.20133[2]Capitalinvestmentandemploymentintheinformationsector.T.RandolphBeardGeorgeS.FordHyeongwooKim《Telecommunicationspolicy》,被EI收录EI.被SCI收录SCI.20144[3]Numericalmethodsforoptimaldividendpaymentandinvestmentstrategi esofregimeswitchingjumpdiffusionmodelswithcapitalinjections. ZhuoJinHailiangYangG.GeorgeYin《Automatica》,被EI收录EI.被SCI收录SCI.20138[4]OrganizationalLearningandCapitalProductivityinSemiconductorMan ufacturing.Weber,C.M.Yang,J.《IEEETransactionsonSemiconductorManufacturing:APublicationoftheIEEEC omponents,Hybrids,andManufacturingTechnologySociety,theIEEEElectronDe vicesSociety,theIEEEReliabilitySociety,theIEEESolidStateCircuitsCounc il》,被EI收录EI.被SCI收录SCI.20143[5]Multistagecapitalbudgetingforsharedinvestments.Johnson,N.B.Pfeiffer,T.Schneider,G.《Managementscience:JournaloftheInstituteofManagementSciences》,被EI 收录EI.被SCI收录SCI.20135[6]Optimalinvestmentwithdeferredcapitalgainstaxes.Seifried,FT《Mathematicalmethodsofoperationsresearch》,被EI收录EI.被SCI收录SCI.20101[7]Acardinalityconstrainedstochasticgoalprogrammingmodelwithsatis factionfunctionsforventurecapitalinvestmentdecisionmaking. BelaidAouniCinziaColapintoDavideLaTorre《Annalsofoperationsresearch》,被EI收录EI.被SCI收录SCI.2013May[8]Impactsofcollaborativeinvestmentandinspectionpoliciesontheinte gratedinventorymodelwithdefectiveitems. LiangYuhOuyangLiYuanChenChihTeYang 《Internationaljournalofproductionresearch》,被EI收录EI.被SCI收录SCI.201319/20[9]MarketinformationfeedbackforthehightechdominatedIPOcompanies. 
ShihKHLinCWHuangHTChiangWC 《InternationalJournalofTechnologyManagement:TheJournaloftheTechnolog yManagementofTechnology,EngineeringManagement,TechnologyPolicyandStra tegy》,被EI收录EI.被SCI收录SCI.20081/3[10]Theinfluenceofcapitalmarketlawsandinitialpublicofferingproces sonventurecapital.WonglimpiyaratJ《EuropeanJournalofOperationalResearch》,被EI收录EI.被SCI收录SCI.20091四、资本投资论文专著参考文献[1]股权激励、资本投资偏好与股利政策来自股利保护性股权激励证据. 张俊民.胡国强,2011第十七届中国财务年会[2]CEO任期与企业资本投资.李培功.肖珉,2011第十届中国实证会计国际研讨会[3]融资约束与公司投资:来自营运资本投资的新证据.刘康兵,2008第八届中国青年经济学者论坛[4]语言传播规律:基于个体语言技能资本投资特性的分析.王海兰.宁继鸣,2012第三届中国语言经济学论坛[5]私有资本投资基础设施建设的政府监管分析.王威.李冲,20112011年中国工程管理论坛[6]工业反哺农业的积极探索——无锡市工商资本投资发展现代高效农业调查.黄继鹏,20062006全国都市农业与新农村建设高层论坛[7]中国PPP项目政治风险的变化.柯永建.王守清.陈炳泉.李湛湛,2008第六届全国土木工程研究生学术论坛[8]上市公司的银行信贷融资可获性与恶性增资行为研究.唐洋.刘志远.高佳旭,2011第十届中国实证会计国际研讨会[9]市场环境、会计信息质量与资本投资.袁建国.蒋瑜峰,2008中国会计学会高等工科院校分会2008年学术年会(第十五届)暨在鄂集团企业财务管理研讨会[10]藝術品價格指數述論.胡志祥.歐陽麗莎,2012第三届世界华人家大会。

CONSTRAINED STOCHASTIC LANGUAGE MODELS
KEVIN E. MARK, MICHAEL I. MILLER, AND ULF GRENANDER†
Abstract. Stochastic language models incorporating both n-grams and context-free grammars are proposed. A constrained context-free model specified by a stochastic context-free prior distribution with superimposed n-gram frequency constraints is derived, and the resulting maximum-entropy distribution is shown to induce a Markov random field with neighborhood structure at the leaves determined by the relative n-gram frequencies. A computationally efficient version, the mixed tree/chain graph model, is derived with identical neighborhood structure. In this model, a word-tree derivation is given by a stochastic context-free prior on trees down to the preterminal (part-of-speech) level, and word attachment is made by a nonstationary Markov chain. Using the Penn TreeBank, a comparison of the mixed tree/chain graph model to both the n-gram and context-free models is performed using entropy measures. The model entropy of the mixed tree/chain graph model is shown to be lower than that of both the bigram and context-free models.
Fig. 1.1. Three graph structures for natural language: (a) chain graph, (b) tree graph, and (c) tree/chain graph.
Dept. of Electrical Engineering, Washington University, St. Louis, Missouri 63130.
† Division of Applied Mathematics, Brown University, Providence, Rhode Island.
Fig. 1.2. Two components of a derivation tree $T$: the tree $t$ deriving the preterminal string "Art N V" and the word string $W_n$ = "The dog ran".
1. Introduction. An important component of speech and text recognition systems is the language model, which assigns probability to word sequences. In most speech recognition tasks, the language model is used to help distinguish ambiguous phonemes by placing higher probability on more likely possibilities. These models are typically based on some Markov process in the word strings such as a Markov chain [1, 5]. Alternatively, more sophisticated language models have been developed which provide syntactic information to perform higher-level tasks such as machine translation and message understanding. The underlying graph structures of existing models are known to have inherent weaknesses. While Markov chain models are efficient at encoding local word interaction, they do not provide the underlying hierarchical structure of language. On the other hand, grammars provide this information but fail to include the lexical information inherent in Markov chain models. In this paper, we establish an alternative structure which combines both a tree structure and a chain structure. The resulting mixed tree/chain graph model has the same neighborhood structure as the maximum-entropy model [8] and has the advantage of computational efficiency. Closed-form expressions for the entropy of this model are derived. Results using the Penn TreeBank are shown which demonstrate the power of these alternative graph-based probabilistic structures.

1.1. Markov Chain Models. The most widely used language model is the Markov chain or n-gram model as depicted in figure 1.1a. Given a vocabulary $V_T$ of words, a word sequence $W_N = w_1, w_2, \ldots, w_N$ with $w_i \in V_T$ for $i = 1, 2, \ldots, N$ is modeled as a realization of an $n$th-order Markov chain with the property that

$p(w_k \mid w_{k-1}, \ldots, w_1) = p(w_k \mid w_{k-1}, \ldots, w_{k-n}).$

The probability for the word string $W_N$ is then

(1.1)    $p(W_N) = p(w_1, w_2, \ldots, w_n) \prod_{j=n+1}^{N} p(w_j \mid w_{j-1}, \ldots, w_{j-n}).$

In particular, the trigram ($n = 2$) model is widely used in speech recognition systems with a great deal of success. However, for higher-level processing relying on syntactics, the trigram model does not provide any hierarchical information.

1.2. Random Branching Process. In order to model the hierarchical structure of language as shown in figure 1.1b, a stochastic context-free grammar may be used. A stochastic context-free grammar is defined as the quintuple

$G = \langle V_N, V_T, R, S, P \rangle.$

The finite set of nonterminal symbols $V_N = \{\sigma_1, \sigma_2, \ldots, \sigma_{|V_N|}\}$ contains syntactic variables denoting parts-of-speech, phrase structures, and other syntactic structures. It will be convenient in the following discussion to denote the set of parts-of-speech (or preterminals) as $V_P = \{\tau_1, \ldots, \tau_{|V_P|}\} \subseteq V_N$. The start symbol $S$ is an element of the set of nonterminals. The set $V_T$ of terminal symbols is a set of words as in the Markov chain models above. We restrict the set of rewrite rules $R$ to the following two forms: (1) $\sigma_i \rightarrow B_1 B_2 \cdots B_k$, where $\sigma_i \in V_N - V_P$ and $B_i \in V_N$, and (2) $\tau_i \rightarrow w_i$, where $w_i \in V_T$.
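To make equation (1.1) concrete, the following is a minimal Python sketch (not part of the paper) that estimates the conditional probabilities by relative frequency and scores a word string; the toy corpus, the function names, and the treatment of the first $n$ words as given are all illustrative assumptions.

```python
import math
from collections import Counter

def train_ngram(corpus, n):
    """Relative-frequency estimates of p(w_j | w_{j-1}, ..., w_{j-n}).
    In the paper's convention the trigram model corresponds to n = 2."""
    context_counts, joint_counts = Counter(), Counter()
    for sentence in corpus:
        for j in range(n, len(sentence)):
            context = tuple(sentence[j - n:j])
            context_counts[context] += 1
            joint_counts[context + (sentence[j],)] += 1
    return {k: v / context_counts[k[:-1]] for k, v in joint_counts.items()}

def log_prob(words, probs, n, initial_log_prob=0.0):
    """Log of equation (1.1); p(w_1, ..., w_n) is supplied by the caller
    (defaulted to 1 here) and unseen n-grams simply raise a KeyError."""
    logp = initial_log_prob
    for j in range(n, len(words)):
        logp += math.log(probs[tuple(words[j - n:j]) + (words[j],)])
    return logp

corpus = [["the", "dog", "ran"], ["the", "dog", "sat"], ["a", "dog", "ran"]]
bigram = train_ngram(corpus, n=1)
# p(dog | the) * p(ran | dog) = 1.0 * 2/3
print(math.exp(log_prob(["the", "dog", "ran"], bigram, n=1)))
```

In a real recognition system the initial block $p(w_1, \ldots, w_n)$ and smoothing of unseen n-grams would of course have to be handled explicitly.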
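The quintuple $G = \langle V_N, V_T, R, S, P \rangle$ with the two restricted rule forms can be sketched the same way; the grammar, the rule probabilities, and the symbol names below are invented for illustration and are not taken from the paper.

```python
import random

# Toy stochastic context-free grammar obeying the two rule forms:
#   form (1): nonterminal in V_N - V_P  ->  string of nonterminals
#   form (2): preterminal in V_P        ->  word in V_T
RULES = {
    "S":   [(["NP", "VP"], 1.0)],
    "NP":  [(["Art", "N"], 1.0)],
    "VP":  [(["V"], 0.7), (["V", "NP"], 0.3)],
    "Art": [(["the"], 0.6), (["a"], 0.4)],   # form (2) rules
    "N":   [(["dog"], 0.5), (["cat"], 0.5)],
    "V":   [(["ran"], 0.5), (["sat"], 0.5)],
}
TERMINALS = {"the", "a", "dog", "cat", "ran", "sat"}   # V_T

def sample_yield(symbol="S"):
    """Expand `symbol` top-down by the rule probabilities P and return
    the terminal yield (the generated word string)."""
    if symbol in TERMINALS:
        return [symbol]
    rhs_options, weights = zip(*RULES[symbol])
    rhs = random.choices(rhs_options, weights=weights)[0]
    words = []
    for child in rhs:
        words.extend(sample_yield(child))
    return words

print(" ".join(sample_yield()))   # e.g. "the dog ran"
```

A draw such as "the dog ran" corresponds to the derivation tree shown in figure 1.2.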
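Finally, the mixed tree/chain graph model described in the abstract separates generation into two stages: the stochastic context-free prior produces a tree only down to the preterminal level, and words are then attached along the leaves by a Markov chain. The sketch below assumes, purely for illustration, that the attachment chain conditions on the current preterminal and the previous word; this conditioning, and all probabilities shown, are assumptions rather than the paper's specification of the nonstationary chain.

```python
import random

SYNTAX = {  # context-free prior, expanded only down to the preterminals
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["Art", "N"], 1.0)],
    "VP": [(["V"], 1.0)],
}
PRETERMINALS = {"Art", "N", "V"}   # V_P: expansion stops here

# Word attachment: p(word | preterminal, previous word); numbers invented.
WORD_CHAIN = {
    ("Art", "<s>"): {"the": 0.7, "a": 0.3},
    ("N", "the"):   {"dog": 0.6, "cat": 0.4},
    ("N", "a"):     {"dog": 0.5, "cat": 0.5},
    ("V", "dog"):   {"ran": 0.8, "sat": 0.2},
    ("V", "cat"):   {"ran": 0.3, "sat": 0.7},
}

def sample_preterminals(symbol="S"):
    """Stage 1: sample the preterminal string (the yield of the tree t in figure 1.2)."""
    if symbol in PRETERMINALS:
        return [symbol]
    rhs_options, weights = zip(*SYNTAX[symbol])
    rhs = random.choices(rhs_options, weights=weights)[0]
    tags = []
    for child in rhs:
        tags.extend(sample_preterminals(child))
    return tags

def attach_words(tags):
    """Stage 2: attach the word string with a tag-dependent Markov chain."""
    words, previous = [], "<s>"
    for tag in tags:
        candidates, weights = zip(*WORD_CHAIN[(tag, previous)].items())
        previous = random.choices(candidates, weights=weights)[0]
        words.append(previous)
    return words

tags = sample_preterminals()        # always ["Art", "N", "V"] for this toy grammar
print(tags, attach_words(tags))     # e.g. ['Art', 'N', 'V'] ['the', 'dog', 'ran']
```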