Chinese Part-of-Speech Tagging
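Hunan University of Arts and Science Course Design Report
Course: Fundamentals of Computer Software Technology
Department: Electronics and Information (电信系)
Major and class: Communication Engineering, Class T09103
Student: Liu Chengcheng (刘程程)
Supervisor:
Date completed: 2011.12.28
Grade: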
Contents
Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Background and significance
1.2 Definition and difficulties of part-of-speech tagging
1.2.1 Definition of part of speech
1.2.2 Difficulties of part-of-speech tagging
Chapter 2 Theoretical foundations
2.1 Hidden Markov Model (HMM)
2.2 Using the HMM for part-of-speech tagging
Chapter 3 The improved HMM tagging model and parameter estimation
3.1 Part-of-speech tagging with the improved HMM
3.2 Parameter estimation
3.2.1 Training corpus
3.2.2 Database used (当用数据库)
Chapter 4 Tagging with the improved Viterbi algorithm
4.1 Tagging procedure
4.2 Detailed description of the improved Viterbi algorithm
Chapter 5 Experimental results and analysis
5.1 Evaluation criteria
5.2 Experimental results
5.3 Error analysis
References
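Chinese Abstract
Chinese part-of-speech tagging is a foundational problem in Chinese information processing. On the one hand, its results can be built directly into practical systems for information extraction, information retrieval, machine translation, and many other applications. On the other hand, automatic part-of-speech tagging is an indispensable front-end tool for Chinese chunk recognizers, syntactic parsers, and semantic analyzers. Studying and implementing a Chinese part-of-speech tagger is therefore of both theoretical importance and practical value.
Part-of-speech tagging methods fall into two main classes: rule-based and statistical. Because statistical methods do not require linguistic rules to be compiled by hand and achieve high accuracy, they have gradually become the focus of research. Among statistical methods, the hidden Markov model is one of the principal models.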
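This paper takes automatic Chinese part-of-speech tagging as its object of study and proposes a tagging method based on an improved hidden Markov model (HMM). The method adds more contextual information to the original HMM and applies it to automatic Chinese part-of-speech tagging, with good results. The main research contents are as follows:
1. Although the HMM tags well, its estimate of the probability of the current word depends only on that word's own part of speech. The improved model addresses this by adding contextual information.
2. The amount of contextual information captured and the degree of data smoothing are two important factors in evaluating the performance of a statistical part-of-speech tagging model. This paper describes several current smoothing algorithms in detail and, to counter data sparseness in the model, smooths the HMM probability parameters with an exponential linear interpolation method, chosen for its stable performance.
3. Modifying the HMM parameter estimation is only the first step in improving the model. To use the trained parameters more effectively, the Viterbi algorithm must also be modified. Because the traditional Viterbi algorithm does not fit this model, it is extended here.
4. No complete, computable inventory of part-of-speech information exists for natural language, so determining the part of speech of out-of-vocabulary words is, alongside the problem of words with multiple possible categories, the other key problem in part-of-speech tagging. This paper proposes a concrete method for handling out-of-vocabulary words.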
Keywords: Chinese information processing; Chinese part-of-speech tagging; hidden Markov model; smoothing algorithm
Abstract
Chinese part-of-speech tagging is fundamental to many Chinese information processing tasks. The task is to design software that automatically identifies the part of speech of each word in a sentence. On the one hand, the performance of many practical applications, such as information extraction, information retrieval, and machine translation, improves when correct part-of-speech tags are available. On the other hand, tagging is an indispensable processing component of Chinese lexical analysis systems, Chinese syntactic analysis systems, and similar tools. Research on it is therefore of both theoretical importance and practical value.
Approaches to part-of-speech tagging are either rule-based or statistical. Because statistical techniques require no hand-written rules of natural language and achieve high accuracy, statistical language models have gradually become a hot research topic. Owing to its good performance, the Hidden Markov Model (HMM), one of the statistical models, has become a prevailing choice for part-of-speech tagging.
We propose a method of Chinese part-of-speech tagging based on an improved Hidden Markov Model that brings more contextual information into the model to describe language phenomena. The results of the improved model are satisfactory. The main work of this paper has four parts:
1. Although the HMM performs well, the probability it assigns to a word depends only on that word's own tag.
2. Two key factors in evaluating a statistical part-of-speech tagging model are the amount of contextual information it captures and the degree of data smoothing.
3. To make effective use of the parameters trained for the improved Hidden Markov Model, we adapt the Viterbi algorithm to the new parameters.
4. Because the computable part-of-speech information available for words is incomplete, handling new (out-of-vocabulary) words is another key problem for statistical language models. We propose a concrete method for handling new words.
Key words: Chinese Information Processing; Chinese Part-of-Speech Tagging; Hidden Markov Model; Smoothing Algorithm
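To make the baseline named in the abstract concrete, the following is a minimal sketch of a standard bigram HMM tagger, not the improved model this report proposes. It assumes a tiny hand-made corpus, plain linear interpolation with a fixed weight `LAMBDA` in place of the exponential linear interpolation mentioned above, and a constant floor `UNK_PROB` as a crude stand-in for the report's out-of-vocabulary handling. All names, data, and constants are hypothetical and chosen only for illustration. Note that the emission term here depends only on the word's own tag, which is exactly the limitation listed as item 1.

```python
import math
from collections import Counter

# Toy tagged corpus (hypothetical data). Tags: n = noun, v = verb, r = pronoun.
CORPUS = [
    [("我", "r"), ("爱", "v"), ("北京", "n")],
    [("他", "r"), ("研究", "v"), ("语言", "n")],
    [("研究", "n"), ("需要", "v"), ("语言", "n")],
]
START = "<s>"     # sentence-start pseudo-tag
LAMBDA = 0.8      # interpolation weight (assumed value)
UNK_PROB = 1e-6   # floor for unseen (word, tag) pairs (assumed value)

tag_count = Counter()     # c(t)
bigram_count = Counter()  # c(t_prev, t)
prev_count = Counter()    # c(t_prev) counted as a history
emit_count = Counter()    # c(w, t)
for sentence in CORPUS:
    prev = START
    for word, tag in sentence:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        prev_count[prev] += 1
        emit_count[(word, tag)] += 1
        prev = tag

TAGS = sorted(tag_count)
TOTAL = sum(tag_count.values())

def transition(prev, tag):
    """P(tag | prev): bigram estimate interpolated with the unigram estimate."""
    bigram = bigram_count[(prev, tag)] / prev_count[prev] if prev_count[prev] else 0.0
    unigram = tag_count[tag] / TOTAL
    return LAMBDA * bigram + (1 - LAMBDA) * unigram

def emission(word, tag):
    """P(word | tag), floored so unseen pairs keep a small nonzero score."""
    return max(emit_count[(word, tag)] / tag_count[tag], UNK_PROB)

def viterbi(words):
    """Return the most probable tag sequence for words (log-space Viterbi)."""
    if not words:
        return []
    # best[i][t] = (log score of the best path ending in tag t at position i, backpointer)
    best = [{t: (math.log(transition(START, t)) + math.log(emission(words[0], t)), None)
             for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for t in TAGS:
            e = math.log(emission(words[i], t))
            column[t] = max((best[i - 1][p][0] + math.log(transition(p, t)) + e, p)
                            for p in TAGS)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi(["他", "研究", "语言"]))    # ['r', 'v', 'n']
    print(viterbi(["研究", "需要", "语言"]))  # ['n', 'v', 'n']
```

In this toy setting the ambiguous word 研究 is tagged as a verb after a pronoun and as a noun at the start of a sentence, because the transition term supplies the context that the emission term lacks. The interpolation keeps every transition probability nonzero, so the logarithms are always defined even for tag bigrams never seen in training.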