Research on Large-Scale Chinese People Information Extraction Based on Web

Chinese Library Classification: TP301.6

Security Level: Public

Universal Decimal Classification (U.D.C): 681.14

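Southwest Jiaotong University

Graduate Degree Thesis

Research on Large-Scale Chinese People Information Extraction Based on Web

Grade: 2010

Name: Wanting Hu (胡万亭)

Degree Applied for: Master

Major: Computer Architecture

Supervisor: Prof. Yan Yang (杨燕)

May 2013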

Classified Index: TP301.6

U.D.C: 681.14

Southwest Jiaotong University

Master's Degree Thesis

RESEARCH ON LARGE-SCALE CHINESE PEOPLE INFORMATION EXTRACTION BASED ON WEB

Grade: 2010

Candidate: Wanting Hu

Academic Degree Applied for: Master

Speciality: Computer Architecture

Supervisor: Prof. Yan Yang

May 2013

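Southwest Jiaotong University

Authorization for Use of Thesis Copyright

The author of this thesis fully understands the university's regulations on retaining and using degree theses, and agrees that the university may retain copies and electronic versions of the thesis and submit them to the relevant national departments or institutions, and that the thesis may be consulted and borrowed. I authorize Southwest Jiaotong University to include all or part of this thesis in relevant databases for retrieval, and to preserve and compile it by photocopying, reduced-size printing, scanning, or other means of reproduction.

This thesis is:

1. Confidential □; this authorization applies after declassification in ____ years.

2. Not confidential □; this authorization applies.

(Please tick "√" in the appropriate box above.)

Signature of thesis author:          Signature of supervisor:

Date:          Date: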

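Southwest Jiaotong University Master's Thesis: Statement of Main Work (Contributions)

The main work and contributions I made in this thesis are as follows:

1. Collected person-related web page data, mainly by writing programs to download several million web pages from sites including 好大夫在线 (Haodf Online), 评师网, and 百度百科 (Baidu Baike), as well as more than 30 million paper records from the CNKI website.

2. Made some improvements to a statistics-based web page main-text extraction algorithm and, combining it with a DOM parsing tool, implemented a main-text extraction program. This program was used to extract the main text of the web pages.

3. Completed the organization name recognition module of the word segmentation system. The main work included counting and ranking word frequencies, compiling a dictionary of organization suffix words, building an organization name dictionary, counting the frequencies of the words that make up organization names, constructing a mathematical model, and implementing an organization name recognition algorithm based on word frequency statistics. This segmentation system was used to segment the main text of the web pages.

4. Implemented programs to extract semi-structured and unstructured person information. Unstructured person information is extracted with a rule-based algorithm; the rule base was built manually, and the rules rely on the segmentation tags that the laboratory's word segmentation system assigns to the main text.

I solemnly declare that the thesis submitted is the result of research carried out independently under the guidance of my supervisor. Except for content already cited in the text, this thesis contains no research results published or written by any other individual or group. Individuals and groups who contributed to the research in this thesis are clearly acknowledged in the text. I fully understand that I will bear all legal liability arising from any violation of the above declaration.

Signature of thesis author:

Date: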

Abstract

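People increasingly rely on the Internet to retrieve information, and information about people is an important area of search interest. This thesis aims to extract as much information about notable people as possible and to build a person-information knowledge base, which can serve either as the knowledge base of a people search engine or as the person-related part of the knowledge base of a semantic search engine. The Web holds a massive amount of person information, but it comes in diverse formats with disorganized content and is mixed with a large amount of junk information, so automatically and efficiently extracting accurate information from the Internet is relatively complex and raises many problems. This thesis studies a complete pipeline from web page data collection and main-text extraction through Chinese word segmentation to the structuring of person information; each part corresponds to one chapter of the thesis.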

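The first part is web page data collection. The thesis describes in detail how the source websites for person information were selected and how the pages were downloaded. Downloading pages is becoming harder: websites restrict crawlers more and more strictly and even adopt anti-crawler measures such as limiting the access frequency of a single IP address. The author wrote custom download programs and, depending on the website, used one of three download modes: ordinary download, download through proxies, and download of dynamic web page data.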
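The proxy download mode can be illustrated with a minimal sketch that spreads requests over several proxies so that no single IP exceeds a minimum interval between requests. This is not the thesis's program: the class name `ProxyRotator`, the `min_interval` parameter, the helper `fetch`, and the proxy addresses are all hypothetical, and the scheduling policy (least recently used proxy first) is an assumption.

```python
import urllib.request


class ProxyRotator:
    """Pick the least recently used proxy and report how long to wait (assumed policy)."""

    def __init__(self, proxies, min_interval):
        self.min_interval = min_interval              # seconds between uses of one IP
        self.last_used = {p: None for p in proxies}   # proxy -> time of its last request

    def acquire(self, now):
        """Return (proxy, wait_seconds) for a request issued at time `now`."""
        # Never-used proxies sort first, then the one used longest ago.
        proxy = min(
            self.last_used,
            key=lambda p: (self.last_used[p] is not None, self.last_used[p] or 0.0),
        )
        last = self.last_used[proxy]
        wait = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        self.last_used[proxy] = now + wait            # the request actually goes out then
        return proxy, wait


def fetch(url, proxy, timeout=10):
    """Download one page through an HTTP proxy (hypothetical helper, not run here)."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()
```

A caller would sleep for the returned `wait` before calling `fetch(url, proxy)`; passing the clock value in as `now` keeps the scheduling logic testable without network access.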

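The second part is main-text extraction. The thesis surveys related work on web page main-text extraction and adopts a method based on statistics and the DOM. The statistics used are the length of the body text, the number of hyperlinks, and the number of sentence-ending punctuation marks. For each container tag, these three values are computed, and the ratios among them are used to judge whether the tag is a main-text tag, from which the main text is then extracted.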
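The idea of scoring container tags by these three statistics can be sketched as follows. The abstract does not give the actual ratios or thresholds, so the rule used here (sentence-ending marks per hyperlink, then longest qualifying block) and the names `BlockStats`, `extract_main_text`, and `min_punct_per_link` are assumptions for illustration only. The sketch also assumes well-formed HTML and does not filter script or style content.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "td", "section", "article"}   # assumed set of container tags
END_PUNCT = set("。！？.!?")                         # sentence-ending punctuation marks


class BlockStats(HTMLParser):
    """Collect text, link count and end-punctuation count for each container tag."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []   # stack of stats for containers that are still open
        self.closed = []        # finished containers, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.open_blocks.append({"tag": tag, "text": [], "links": 0, "punct": 0})
        elif tag == "a":
            for block in self.open_blocks:      # a link counts for every ancestor
                block["links"] += 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        punct = sum(ch in END_PUNCT for ch in text)
        for block in self.open_blocks:          # nested text counts for ancestors too
            block["text"].append(text)
            block["punct"] += punct

    def handle_endtag(self, tag):
        # Assumes well-formed nesting: a closing container tag closes the top block.
        if tag in CONTAINERS and self.open_blocks:
            self.closed.append(self.open_blocks.pop())


def extract_main_text(html, min_punct_per_link=1.0):
    """Return the text of the longest container passing the ratio test, else ''."""
    parser = BlockStats()
    parser.feed(html)
    best, best_len = "", 0
    for block in parser.closed:
        text = "".join(block["text"])
        # Assumed ratio test: body text has many sentence endings per hyperlink.
        ratio = block["punct"] / (block["links"] + 1)
        if ratio >= min_punct_per_link and len(text) > best_len:
            best, best_len = text, len(text)
    return best
```

On a page whose outer `div` wraps a link-heavy navigation block and a text-heavy body block, the navigation links pull the outer block's ratio below the threshold, so only the body block qualifies.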

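The third part is word segmentation of the main text. Common segmentation systems are weak at entity recognition and are therefore poorly suited to knowledge extraction, natural language processing, and similar tasks. This thesis uses the segmentation system developed by the Institute of Thinking and Wisdom (思维与智慧研究所) at Southwest Jiaotong University, which is significantly better than other segmentation systems at entity recognition. The organization name recognition algorithm, implemented by the author, is based on word frequency statistics. The training data for the experiments was compiled mainly from Baidu Baike entries. During training, the author used counts of how often Baidu Baike entry names occur in the entry texts to compute the frequencies of the words that make up organization names. On this basis, a mathematical model was built and the organization name recognition algorithm was implemented.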
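A frequency-based recognizer of this kind might look like the sketch below. The thesis's mathematical model is not given in the abstract, so the scoring used here is an assumption: each word is scored by the fraction of its occurrences that fall inside organization names, and a suffix word is extended leftwards while that fraction stays above a threshold. The dictionaries and counts are toy data, and the names `org_likelihood` and `find_org_names` are hypothetical.

```python
ORG_SUFFIXES = {"大学", "公司", "研究所"}   # toy organization suffix dictionary

# Toy counts: occurrences inside organization names vs. occurrences overall.
IN_ORG_FREQ = {"西南": 40, "交通": 60, "北京": 80, "毕业": 1}
TOTAL_FREQ = {"西南": 50, "交通": 100, "北京": 200, "毕业": 300, "于": 900}


def org_likelihood(word):
    """Fraction of a word's occurrences that fall inside organization names."""
    total = TOTAL_FREQ.get(word, 0)
    return IN_ORG_FREQ.get(word, 0) / total if total else 0.0


def find_org_names(tokens, threshold=0.3, max_len=6):
    """Scan segmented tokens and grow each suffix word leftwards into a name."""
    names = []
    for end, token in enumerate(tokens):
        if token not in ORG_SUFFIXES:
            continue
        start = end
        # Extend left while the previous word looks like part of an organization name.
        while (start > 0 and end - start + 1 < max_len
               and org_likelihood(tokens[start - 1]) >= threshold):
            start -= 1
        if start < end:                     # require at least one modifier word
            names.append("".join(tokens[start:end + 1]))
    return names
```

For the token sequence 他 / 毕业 / 于 / 西南 / 交通 / 大学, the suffix 大学 is extended over 交通 and 西南 and stops at 于, whose score is zero in the toy counts.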

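The final part is the structuring of person information from web pages. Person information on the Web generally appears in semi-structured or unstructured form, so the last stage of extraction is to pull out both kinds and store them as structured person information. For semi-structured information, the main text is matched against a dictionary of person attributes and, combined with simple rules, the attribute values are extracted directly; the method is simple and effective. For unstructured information, a rule-based extraction method is used, for which a trigger-word lexicon and a rule base are built: the trigger-word lexicon contains the basic person attributes and their corresponding trigger words, and the rule base contains manually defined rules for extracting attribute values.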
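Trigger-word extraction over tagged tokens can be sketched in a few lines. The attributes, trigger words, tag names (`ns` for place, `nt` for organization), and the single rule form used here (take the first token with the expected tag within a short window after the trigger) are illustrative assumptions, not the thesis's rule base.

```python
# Trigger lexicon: attribute -> trigger words (toy entries).
TRIGGERS = {
    "birthplace": {"出生于", "生于"},
    "alma_mater": {"毕业于"},
}
# Rule base: attribute -> tag its value must carry ("ns" place, "nt" organization).
RULES = {"birthplace": "ns", "alma_mater": "nt"}


def extract_attributes(tagged_tokens, window=3):
    """Extract attribute values from (word, tag) pairs produced by a segmenter."""
    result = {}
    for i, (word, _tag) in enumerate(tagged_tokens):
        for attribute, triggers in TRIGGERS.items():
            if word not in triggers or attribute in result:
                continue
            # Take the first token after the trigger whose tag matches the rule.
            for value, tag in tagged_tokens[i + 1:i + 1 + window]:
                if tag == RULES[attribute]:
                    result[attribute] = value
                    break
    return result
```

Because the rules match on tags rather than on surface strings, the quality of the result depends directly on how well the segmenter recognizes entities such as place and organization names.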

Keywords: information extraction; structuring; word segmentation; word frequency statistics; main-text extraction

Abstract

People increasingly rely on the Internet to retrieve information, and information about people is an important part of what they search for. The aim of this thesis is to extract as much information about notable people as possible. The result can serve as the knowledge base of a people search engine or as part of the knowledge base of a semantic search engine. There is a vast amount of personal information on the Web, but its formats are varied and complex, and the Web is also full of spam. Extracting accurate information from the Web automatically and efficiently therefore faces many difficulties. This thesis presents a complete process of personal information extraction, consisting of page downloading, web page content extraction, word segmentation, and structured personal information extraction.

First, the thesis introduces data collection. It describes how the Web data sources were selected and how pages were downloaded. Downloading pages is more difficult than in the past: some websites take a variety of measures against crawlers, such as limiting the access frequency of a single IP address. The author wrote the download program and used three download modes: ordinary download, proxy download, and dynamic Web data download.

Next, the content of each page is extracted. The thesis reviews related research on content extraction and uses an extraction method based on statistics and the DOM. For each container tag, it obtains the content length, the number of links, and the number of sentence-ending punctuation marks, and computes their ratios. From these ratios it judges whether the tag contains main content.

The next step is word segmentation. Common segmentation systems are less effective at entity recognition, so they are not well suited to knowledge extraction and natural language processing. The segmentation system of Southwest Jiaotong University performs better than other systems at entity recognition. The organization name recognition algorithm is implemented in this thesis and is based on word frequency statistics. The training data comes mainly from Baidu encyclopedia entries. During training, organization names are split into words and the frequency of each word is computed. On the basis of these word frequencies, the thesis establishes a mathematical model and implements the organization name recognition algorithm.

Finally, the most critical step is extracting structured personal information. Personal information is commonly semi-structured or unstructured. In this part,