Text Similarity Computation

Text Similarity Computation System

Abstract

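In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, text mining, and related fields. It is a basic yet critical problem that has long been both a research focus and a research challenge. The goal of this graduation project is to implement text similarity computation using two methods.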

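This thesis takes a conventional design approach. The first method is the cosine algorithm, which is easy to understand and yields results that are easy to interpret. It computes the similarity between texts quickly, and its result (a value between 0 and 1) indicates how similar they are. Because cosine similarity is defined on the vector space model, the texts must first be converted into vectors, and that conversion requires term weighting. Before the vector space model is built, the texts must undergo stop-word removal and feature selection. The second method is the BM25 algorithm, which is implemented here with the most basic loops, in order to observe whether, and by how much, using an inverted index in the cosine algorithm improves efficiency.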
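The cosine pipeline described above (stop-word removal, weighting, vector space model, cosine) can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: the stop-word list, the whitespace tokenizer (Chinese text would first need word segmentation), the smoothed TF-IDF weighting formula, and all function names are assumptions introduced here.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a full list from a file.
STOP_WORDS = {"the", "a", "of", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words.

    Assumption: the input is already segmented into space-separated words.
    """
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} vector.

    Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1), one common
    smoothed TF-IDF variant; the thesis does not state its formula here.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "text similarity is a basic problem",
    "text similarity of the documents",
    "inverted index",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares "text" and "similarity": between 0 and 1
print(cosine(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Because all weights are non-negative, the result always falls in the range 0 to 1, which matches the interpretation given above.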

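The main work of this text similarity system consists of stop-word removal, text feature selection, and weighting. After weighting, the cosine algorithm computes text similarity; BM25 computes similarity directly after feature selection. To improve the system's efficiency, the program makes extensive use of containers, as well as inner-product and inverted-index algorithms.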
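The two efficiency-related pieces mentioned above can be sketched as follows: an inverted index used to accumulate inner products (the numerator of the cosine), and BM25 scored with plain loops over every document. This is a hedged illustration only. The function names, the dictionary-based containers, the BM25 parameters `k1=1.5` and `b=0.75`, and the non-negative IDF variant `ln((N - df + 0.5) / (df + 0.5) + 1)` are assumptions not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(vectors):
    """Map each term to its postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def inner_products(query_vec, index, n_docs):
    """Accumulate query-document inner products.

    Only the postings of terms that occur in the query are visited, so
    documents sharing no term with the query are never touched.
    """
    scores = [0.0] * n_docs
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return scores


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query with plain loops."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical pre-weighted vectors and pre-tokenized documents.
vectors = [{"text": 1.0, "similarity": 2.0}, {"text": 3.0}, {"index": 1.0}]
index = build_inverted_index(vectors)
print(inner_products({"text": 1.0, "similarity": 1.0}, index, len(vectors)))

docs = [["text", "similarity"], ["text", "index"], ["inverted", "index"]]
print(bm25_scores(["similarity"], docs))
```

Dividing each accumulated inner product by the product of the two vector norms gives the cosine. The loop-based BM25 visits every document for every query, which is what makes it a useful baseline against the inverted-index approach.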

Keywords: text similarity; cosine; BM25; containers

Text Similarity Algorithm Research

Abstract

In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, text mining, and related areas. It is an essential and important problem that has long been both a research hotspot and a difficulty. Currently, most text similarity algorithms are based on the vector space model (VSM). However, these methods suffer from high dimensionality and sparseness. Moreover, they do not effectively address the natural-language problems present in text data, namely synonymy and polysemy. These problems reduce the efficiency and accuracy of text similarity algorithms and degrade the performance of text similarity computation.

This paper adopts a new idea: introducing semantic similarity computation into traditional text similarity computation to improve the performance of text similarity algorithms. It discusses existing text similarity algorithms and semantic text computation in depth and presents a Chinese text similarity algorithm based on semantic similarity. An online information management system used to manage students' graduation design papers supplies the test data: the similarity of those papers is computed with the algorithm in order to validate it.

The main work of this text similarity computing system is stop-word removal, text feature selection, and weighting. After weighting, the cosine algorithm is used to calculate text similarity; after feature selection, similarity is calculated with BM25. To improve the system's efficiency, the program makes extensive use of containers, as well as the inner product and inverted-index algorithms.

KEY WORDS: text similarity; cosine; BM25; container

1 Introduction

1.1 Development Background

1.2 Significance of the Research

1.3 Problems Addressed by This Project

2 Research Methods

2.1 Research Methods Relevant to the Focus of This Work

2.2 History and Current State of Research

3 Key Problems and Analysis (I): Cosine

3.1 Key Problems in the Research Design

3.2 Key Techniques Used in the Implementation

3.2.1 Containers

3.2.2 Inverted Index

3.2.3 Inner Product

3.2.4 Algorithm

3.3 Chapter Summary

4 Key Problems and Analysis (II): BM25

4.1 Key Problems in the Research Design