[Teddy Cup] General-Purpose Forum Body Text Extraction
Selected Problem: C
General-Purpose Forum Body Text Extraction
Abstract:
In today's era of big data, with the rapid development of the internet and the mobile internet, the volume of data generated online keeps climbing, and the wealth of information it contains has become an important source for data analysis across many industries. Forum (BBS) pages in particular are growing in both number and information content, and fully mining this information has real practical value for public-opinion and sentiment analysis, business decision-making, and policy formulation. However, forum sites differ widely in page style, so extracting valuable information from a huge number of heterogeneous forum pages is a pressing problem in internet data processing.
This data mining task aims to propose a new, general-purpose method for extracting the body text of forum pages based on their structural characteristics. The process consists of five steps:
Step 1: data cleaning. Some URLs in the supplied data are invalid, falling into three categories: page not found, post deleted, and HTTP 404 errors. No body text extraction is attempted for these pages.
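As a minimal sketch of this cleaning step: scan each fetched page for the three error markers. The exact marker strings below mirror the failure types described in the text but are assumptions, not strings taken from the competition data.

```python
import re

# Hypothetical error markers for the three failure types described above;
# the exact strings on real pages may differ.
ERROR_PATTERNS = [
    r"网页找不到",   # "page not found"
    r"本帖被删除",   # "this post has been deleted"
    r"404",          # 404 error code in the page body
]

def is_invalid_page(html_text: str) -> bool:
    """Return True if the fetched page matches any known error marker."""
    return any(re.search(p, html_text) for p in ERROR_PATTERNS)

pages = ["<h1>404 Not Found</h1>", "<div>本帖被删除了</div>", "<div>正文内容</div>"]
print(sum(not is_invalid_page(p) for p in pages))  # 1 valid page
```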
Step 2: removal of useless tags. An HTML page contains regions that are not body text, mostly JavaScript, CSS styles, and HTML comments. The content inside these tags has no value, so removing them up front narrows the search space for the body region.
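A minimal sketch of the tag-cleaning step. The paper itself uses BeautifulSoup; the regex-based stripping below is a stdlib simplification that can fail on pathological HTML but illustrates the idea.

```python
import re

def strip_useless_tags(html: str) -> str:
    """Remove script blocks, style blocks, and HTML comments."""
    html = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    html = re.sub(r"<style\b.*?</style>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    return html

page = "<style>p{}</style><script>var a;</script><!-- ad --><p>正文 09:30</p>"
print(strip_useless_tags(page))  # <p>正文 09:30</p>
```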
Step 3: keyword location and noise filtering. The keywords are timestamp strings. We first use BeautifulSoup to find all text matching a time format, then split these matches into target times and body times. Body times occur inside the text of the original post or a reply; they cannot help locate the target region, so they are treated as noise and filtered out. The remaining time strings become the target times and form the keyword vector.
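The time-format matching can be sketched as below. The exact formats accepted by the paper's method are not given, so this pattern (ISO-like and Chinese-style dates with an optional clock time) is an assumption; distinguishing target times from body ("noise") times would need further context checks.

```python
import re

# Assumed time formats: 2016-04-12 09:35:20, 2016/4/12, 2016年5月1日, etc.
TIME_RE = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?\s*(?:\d{1,2}:\d{2}(?::\d{2})?)?"
)

def find_time_texts(text: str) -> list:
    """Return all substrings of `text` that look like timestamps."""
    return [m.group().strip() for m in TIME_RE.finditer(text)]

print(find_time_texts("发表于 2016-04-12 09:35:20"))  # a post timestamp
print(find_time_texts("会议将于2016年5月1日召开"))      # a date inside the body
```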
Step 4: locating the target content region. First, each forum page is parsed into a DOM tree; we analyze the path features of the nodes containing keywords, find the longest common subsequence of those path features, and thereby locate the nearest common ancestor of all keyword nodes. Then we search recursively beneath that ancestor: a subtree containing exactly one time keyword is a target region, while a subtree containing no time keyword is not.
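The ancestor-location part of this step can be sketched as follows. Since every path runs from the root down to a keyword node, the longest common subsequence of the paths reduces here to their longest common prefix; the tag paths below are illustrative, not taken from a real page.

```python
def nearest_common_ancestor(paths):
    """Each path is a list of tag names from the root to a keyword node;
    the longest common prefix identifies their nearest common ancestor."""
    prefix = paths[0]
    for path in paths[1:]:
        i = 0
        while i < min(len(prefix), len(path)) and prefix[i] == path[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

paths = [
    ["html", "body", "div", "table", "tr", "td", "span"],   # time node, post 1
    ["html", "body", "div", "table", "tr", "td", "em"],     # time node, post 2
    ["html", "body", "div", "table", "tr", "div", "span"],  # time node, post 3
]
print(nearest_common_ancestor(paths))  # ['html', 'body', 'div', 'table', 'tr']
```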
Step 5: target content extraction. The target content comprises the author, title, body text, and posting time. The posting time is taken from the keyword vector above; the title information is obtained by locating
To verify the robustness and generality of the algorithm, we designed three experiments: comparing extraction quality on pages containing only an original post versus pages containing both the original post and replies; comparing extraction quality across pages of the same type; and comparing extraction quality across pages of different types. The results show that the algorithm is simple and efficient, achieves high extraction accuracy, and generalizes well.
Keywords: web page body text extraction; data mining; BeautifulSoup; DOM tree; text structure similarity; generality
Text Extraction for General BBS Forums
Abstract:
In the era of big data, with the rapid development of the internet and the mobile internet, the amount of data generated online has increased dramatically, and the wealth of information it contains has become an important data resource for most industries. The number of BBS pages and the data they carry are rising rapidly, and fully mining this kind of information has important practical significance for public-opinion and sentiment analysis, enterprise decision-making, and policy-making. However, BBS pages differ widely in style, and extracting valuable information from large numbers of heterogeneous BBS pages is an urgent problem in internet data analysis.
The goal of this data mining task is to propose a new, general method of content extraction for BBS, based on the characteristics of BBS pages. The whole process involves five steps. First step: data cleaning. Some URLs in the given data are invalid, returning responses such as "page not found", "the post has been deleted", or a 404 error. Pages with these responses are excluded from extraction.
Second step: removal of useless tags. An HTML page can be divided into object and non-object regions; non-object regions are mostly JavaScript scripts, CSS styles, and HTML comments whose content is worthless. We therefore delete these items in advance to reduce the search scope of the object region.
Third step: keyword location and noise filtering. Timestamps are taken as the keywords. Using BeautifulSoup, we first find all text matching a time format and divide the matches into object times and noise times. Noise times generally occur inside the body text and cannot help locate the object region, so they are filtered out. After filtering, the remaining times are regarded as keywords and form the keyword vector.
Fourth step: object region location. First, each BBS page is parsed into a DOM tree; we analyze the path characteristics of the nodes containing keywords, find the longest common subsequence of those paths, and locate the nearest common ancestor of all keyword nodes. Second, target regions are searched recursively: a region containing exactly one time keyword is a target region, and a region containing no time keyword is not.
Fifth step: object content extraction. The object content includes the author, title, body text, and publication time. The publication time is chosen from the keyword vector above. We obtain the title information by positioning