Hot Characters and Events Oriented to Public Sentiment Monitoring


Candidate: Sun Zhenlong
Supervisor: Prof. Li Sheng
Academic Degree Applied for: Master of Engineering
Speciality: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June, 2012
Degree-Conferring-Institution: Harbin Institute of Technology


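Abstract (translated from Chinese)
With the widespread adoption of the Internet around the world, it has become an amplifier of public opinion and a distribution hub for ideological and cultural information. Public opinion information reflects the state of public thinking, and given the powerful reach of Web 2.0, studying it is extremely important. Faced with massive amounts of information updated every day, mining hot news and opinion trends efficiently and accurately has become an urgent problem. In general, the occurrence and development of events are tied to people, and many hot events revolve around people. Against this background, this thesis takes the analysis of hot characters as its entry point, finds and analyzes the events that happen to them, and thereby grasps online public opinion. Centering on hot character and event analysis techniques, the research covers the following aspects:
(1) A method that fuses several lexical analysis tools to recognize person names, and a name disambiguation method based on the Lingo clustering strategy, are proposed. Names are first recognized with the person-name tagging functions of existing word segmentation and tagging tools, and the results of the several recognition methods are fused according to the maximum-length principle. Several methods for removing noisy names are also tried, and name disambiguation is performed with the Lingo clustering algorithm. Experiments show that the fusion strategy raises the recall of name recognition without lowering its precision, and that the noise removal and disambiguation methods meet application needs well.
(2) Supervised character classification is studied, and an SVM-based character classification method is proposed. Text fragments of a certain length that describe a character are first extracted from texts containing that character; information gain is then used to select useful attribute features that represent the character; finally, the SVM algorithm classifies the character. Experiments show that this approach effectively predicts the domain a character belongs to.
(3) A feature extraction technique that combines information entropy with a sentiment dictionary is studied and used for sentiment orientation analysis of hot character events. Information entropy measures the discriminative power of features, while the sentiment dictionary addresses coverage. The features extracted in this thesis fall into those taken from the training set and those taken from the sentiment dictionary. Features from the training set depend on the corpus, or in other words on the domain. The sentiment dictionary is general-purpose and contains features that are absent from the training-set feature set. Experimental results show that fusing the two kinds of features effectively improves the performance of event sentiment orientation analysis. The thesis also tries aggregating the candidate feature set with the Tongyici Cilin (同义词词林) thesaurus, that is, mapping two synonymous features to a single feature. This reduces the dimensionality of the vector space without losing semantic information, and it also improves the precision of semantic similarity computation. Synonyms of features are added during feature clustering as well, which expands the important features and improves feature recognition in event sentiment orientation analysis.
(4) A hot character ranking model oriented to public opinion monitoring is proposed. The model computes a score by jointly considering a character's exposure, the change in the popularity trend, and the weight of the character's domain, and then, according to the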
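University Code: 10213    Confidentiality Level: Public
Dissertation for the Master's Degree in Engineering
Hot Character and Event Analysis Techniques Oriented to Public Sentiment Monitoring
Master's Candidate: Sun Zhenlong    Supervisor: Prof. Li Sheng
Degree Applied for: Master of Engineering    Discipline: Computer Science and Technology
Affiliation: School of Computer Science and Technology    Date of Defence: June 2012    Degree-Conferring Institution: Harbin Institute of Technology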
Abstract
With the worldwide spread of the Internet, it has become an amplifier of public opinion and a hub for ideological and cultural information. Public opinion information reflects the state of the public mind, and studying it is extremely important given the powerful reach of Web 2.0. Faced with a flood of information updated daily, mining hot news and public opinion trends efficiently and accurately has become an urgent problem. Generally speaking, the occurrence and development of events are tied to people, and many hot events revolve around them. In this context, we take hot character analysis as a starting point, and find and analyze the events that happen to hot characters in order to grasp online public opinion. Centering on hot character and event analysis techniques, our study involves the following aspects:
(1) This paper presents a person name recognition method that combines the results of several lexical analysis tools, and a name disambiguation method based on the Lingo clustering strategy. We first use existing lexical analysis tools to mark names and merge their results according to the maximum-length principle. We also try several methods for removing noisy names, and perform name disambiguation with the Lingo clustering algorithm. Experiments show that the integration strategy improves the recall of name recognition without reducing its precision, and that the noise removal and disambiguation methods meet the application requirements.
(2) This paper studies supervised character classification and proposes an SVM-based character classification method. We first extract fixed-length text fragments that describe a character from the texts containing that character, then use information gain to select useful attributes that represent the character, and finally use the SVM algorithm to classify the characters.
Experiments show that this method can effectively predict the category of a character.
(3) We study a feature extraction technique that combines information entropy with a sentiment dictionary, and use it to analyze the sentiment of hot characters and events. Information entropy measures the discriminative power of features, and the sentiment dictionary solves the coverage problem. Features in this paper are extracted separately from the training set and from the sentiment dictionary. Features from the training set are tied to the corpus, that is, to a particular domain. The sentiment dictionary is general-purpose and contains features that the training set does not. Experimental results show that integrating the two kinds of features can effectively improve the


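scores, generates a ranking list of hot characters. Here, a character's exposure is the number of times the character appears in news and comments within one day; the degree of change in the popularity trend is measured by a variant of the KL distance; the weight of the character's domain is set according to how important information about characters in that domain is to public opinion monitoring, and the domain itself is determined by the automatic character classification technique. Experimental results show that the ranking model places the characters that matter for public opinion monitoring near the top of the list.
Keywords: public sentiment monitoring; person name recognition; character classification; sentiment orientation analysis; hot character ranking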
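The abstract names the three ingredients of the ranking score (daily exposure, a KL-distance variant for the trend, and a domain weight) but does not give the formula. The sketch below is one possible reading, not the thesis's actual model: the pointwise KL term `p * log(p / q)`, the additive combination, the parameter `alpha`, and every function and argument name are assumptions made for illustration.

```python
import math


def kl_trend(today_share: float, past_share: float, eps: float = 1e-9) -> float:
    """Pointwise KL term p * log(p / q) (an assumed 'variant' of KL distance).

    Positive when a character's share of mentions rose, negative when it fell.
    """
    p = today_share + eps
    q = past_share + eps
    return p * math.log(p / q)


def rank_characters(today_counts, past_counts, domain_of, domain_weight, alpha=1.0):
    """Return character names sorted by a hypothetical hotness score.

    today_counts / past_counts: name -> number of mentions (exposure).
    domain_of: name -> domain label (would come from the character classifier).
    domain_weight: domain label -> importance weight chosen by the analyst.
    """
    total_today = sum(today_counts.values()) or 1
    total_past = sum(past_counts.values()) or 1
    scores = {}
    for name, count in today_counts.items():
        p = count / total_today                      # today's exposure share
        q = past_counts.get(name, 0) / total_past    # earlier exposure share
        weight = domain_weight.get(domain_of.get(name, "other"), 1.0)
        # Assumed combination: domain weight scales exposure plus trend.
        scores[name] = weight * (p + alpha * kl_trend(p, q))
    return sorted(scores, key=scores.get, reverse=True)
```

With this reading, a character whose mentions jump sharply in a heavily weighted domain outranks one with the same exposure but a flat or falling trend; a different combination rule (for example a weighted sum of normalized terms) would fit the abstract equally well.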
Master's Dissertation
Hot Character and Event Analysis Techniques Oriented to Public Sentiment Monitoring
RESEARCH ON HOT CHARACTER AND EVENT ANALYSIS TECHNIQUES ORIENTED TO PUBLIC SENTIMENT MONITORING
Sun Zhenlong
Harbin Institute of Technology, June 2012
Domestic Library Classification Number: TP391.2    International Library Classification Number: 681.37
Classified Index: TP391.2 U.D.C: 681.37
Dissertation for the Master Degree in Engineering
RESEARCH ON HOT CHARACTER AND EVENT ANALYSIS TECHNIQUES ORIENTED TO PUBLIC SENTIMENT MONITORING
performance of event sentiment orientation analysis. In addition, this paper attempts to aggregate the candidate feature set using a synonym dictionary. Synonyms are mapped to a single feature, which reduces the dimensionality of the vector space without losing semantic information and improves the accuracy of semantic similarity calculation. Adding the synonyms of features during feature clustering expands the important features and improves feature recognition in event sentiment orientation analysis.
(4) This paper proposes a hot character ranking model oriented to public opinion monitoring. The model combines a character's exposure rate, popularity trend, and domain weight to calculate a score, and then generates the hot character ranking list. The exposure rate is the number of news items and comments that contain the character in a single day; the popularity trend is measured by a variant of the KL distance; the domain weight is set according to how important the character's domain is to public opinion monitoring, and the domain is predicted by automatic character classification. Experimental results show that the ranking model places the characters that matter for public opinion monitoring near the top of the ranking list.
Keywords: public sentiment monitoring, person name recognition, character classification, sentiment orientation analysis, hot character ranking
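The synonym-based feature aggregation in item (3) can be illustrated with a small sketch. It assumes a thesaurus that assigns each word a synonym-group code; the toy table `SYNONYM_GROUP`, its codes, and both function names are hypothetical stand-ins, not the dictionary or code used in the thesis.

```python
from collections import Counter

# Hypothetical toy thesaurus: word -> synonym-group code.
# A real synonym dictionary would supply this mapping.
SYNONYM_GROUP = {
    "good": "G01", "great": "G01", "fine": "G01",
    "bad": "G02", "awful": "G02",
}


def merge_synonyms(tokens, groups=SYNONYM_GROUP):
    """Count features after mapping each token to its synonym-group code.

    Tokens missing from the thesaurus are kept as their own feature, so
    synonyms collapse into one dimension and nothing else is dropped.
    """
    return Counter(groups.get(tok, tok) for tok in tokens)


def expand_features(features, groups=SYNONYM_GROUP):
    """Add every thesaurus synonym of the selected features to the set."""
    wanted = {groups[f] for f in features if f in groups}
    expanded = set(features)
    expanded.update(word for word, code in groups.items() if code in wanted)
    return expanded
```

`merge_synonyms` shows the dimensionality reduction (several surface words share one feature), while `expand_features` shows the expansion step, where a feature selected from the training set also matches its synonyms in unseen text.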