Automatic Text Categorization


Abstract (translated from the Chinese)

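Since the 1990s, the Internet has grown at an astonishing pace and now holds a massive amount of raw information of many types, including text, audio, and images. Extracting the most useful information from this vast and disordered body of text has always been a major goal of information processing. Text categorization systems based on artificial intelligence techniques can automatically sort large numbers of texts into categories according to their semantics, helping people make better use of textual information. In recent years, text categorization has gradually been combined with information processing technologies such as search engines, information push, and information filtering, effectively improving the quality of information services.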

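Automatic text categorization is the automatic classification of large numbers of natural-language texts into given topic categories, and it is an important problem in natural language processing. It is mainly applied to tasks such as information retrieval, machine translation, automatic summarization, information filtering, and e-mail classification. A key problem in text categorization is selecting the feature terms and assigning their weights.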

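In this project, we implemented a web page classifier based on the support vector machine (SVM). It uses LTC weights to represent the feature terms and an SVM to perform the classification, and it incorporates the Unigram model for feature extraction. Experiments show that this method improves classification accuracy.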

Keywords: Natural Language Understanding, Vector Space Model, Support Vector Machine, Text Categorization, Unigram Model


Abstract

Since the 1990s, the Internet has developed rapidly. It holds large amounts of information from every field, including text, sound, and image information. Finding the most useful information in this plentiful and disordered text has long been a goal of the information processing field. A text categorization system based on AI techniques can automatically classify texts according to their meaning, helping people manage that information. In recent years, text categorization has gradually been combined with other information processing techniques such as search engines, information push, and information filtering, which has effectively improved the quality of information services.

Automatic text categorization is the task of assigning natural-language texts to given topics, and it is an important problem in natural language processing. It can be applied to information retrieval, machine translation, automatic summarization, information filtering, e-mail filtering, and similar tasks. The central problem in text categorization is how to select the features (words) and assign their weights.

In this work, I implemented a Chinese web page classifier based on the support vector machine (SVM). The classifier uses LTC weighting to represent the features and the SVM algorithm to categorize. I also incorporated the Unigram model for feature selection; experimental results show that this method improves categorization accuracy.

Keywords: Natural Language Processing, Vector Space Model, Support Vector Machine, Text Categorization, Text Classification, Unigram Model
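
As a rough illustration of the LTC weighting named in the abstract, the sketch below computes one common variant: a log-dampened term frequency multiplied by the inverse document frequency, followed by cosine normalization. The exact formula used in the thesis is not given in this section, so the `log(tf + 1)` form, the function name `ltc_weights`, and the toy documents are assumptions made only for this example.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed LTC variant (not confirmed by the thesis text):
        raw(t, d) = log(tf(t, d) + 1) * log(N / df(t))
        w(t, d)   = raw(t, d) / sqrt(sum over t' of raw(t', d)^2)
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Dampened term frequency times inverse document frequency.
        raw = {
            term: math.log(freq + 1.0) * math.log(n_docs / df[term])
            for term, freq in tf.items()
        }
        # Cosine normalization; a zero norm leaves every weight at 0.0.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append(
            {term: (w / norm if norm > 0 else 0.0) for term, w in raw.items()}
        )
    return vectors


# Hypothetical toy corpus, already split into terms.
toy_docs = [["svm", "text", "text"], ["svm", "web"]]
toy_vectors = ltc_weights(toy_docs)
```

In this toy corpus the term "svm" occurs in every document, so its inverse document frequency, and therefore its weight, is zero. The resulting vectors are the kind of input an SVM classifier would be trained on.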


Graduation Project (Thesis) Evaluation

Graduation Project (Thesis) Assignment

Abstract (Chinese)

Abstract (English)

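Chapter 1 Introduction

1.1 Background

1.2 Related Work

1.3 Research Content and Experimental Conclusions

1.4 Structure of This Thesis

Chapter 2 Text Categorization and the Vector Space Model

2.1 Text Categorization

2.1.1 System Task

2.1.2 Text Representation

2.1.3 Feature Extraction

2.1.4 Categorization Workflow

2.1.5 Evaluation Methods

2.2 Vector Space Model

2.2.1 Minimum-Distance Classifier

2.2.2 K-Nearest-Neighbor Classifier

2.2.3 Basic Bayes Classifier

2.2.4 Support Vector Machine Classifier

2.3 Chapter Summary

Chapter 3 Understanding the Problem

3.1 Task of the Web Page Classifier

3.2 Structural Features of Web Pages

3.3 Representation of Web Pages in the System

3.4 Chapter Summary

Chapter 4 Implementation of the Chinese Web Page Classifier

4.1 Web Page Preprocessing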
