

Selected problem: C











Step 5: target content extraction. The target content comprises the author, title, main text and publication time. The publication time is taken from the keywords above. The title is obtained by locating the tag and removing the accompanying website information. For the author and the main text, the text is first divided into link text and non-link text, each represented as a structured text feature vector. Then, within the link-text set, the author is extracted based on text structure similarity; within the non-link-text set, the main text is located by combining text structure similarity, the spatial continuity of the text fragments, and text density.</p>
<p>To verify the robustness and generality of the algorithm, we designed three experiments: comparing extraction results on pages that contain only the original post against pages that contain both the original post and replies; comparing extraction results across pages of the same type; and comparing extraction results across pages of different types. The results show that the algorithm is simple and efficient, extracts content with high accuracy, and generalizes well.</p>
<p>Keywords: web page text extraction; data mining; BeautifulSoup; DOM tree; text structure similarity; generality</p>
<p>Text extraction of general BBS</p>
<p>Abstract:</p>
<p>In the era of big data, the rapid growth of the Internet and the mobile Internet has dramatically increased the amount of data generated online. This data contains a great deal of information and has become an important resource for most industries. The number of BBS forums and the volume of their data are rising rapidly, and mining this information thoroughly has practical significance for public opinion and sentiment analysis, enterprise decision-making and policy-making. However, BBS pages differ widely in style, and extracting valuable information from large numbers of heterogeneous BBS pages is an urgent problem in Internet data analysis.</p>
<p>The goal of this data mining work is to propose a new, general method for extracting content from BBS pages based on their characteristics. The whole process involves five steps. First step: data cleaning. The given data contains some invalid URLs whose pages return responses such as “can’t find the webpage”, “The webpage has been deleted” or “404 error”. We do not process these pages.</p>
<p>Second step: removing useless tags. An HTML page can be divided into target and non-target regions. Non-target regions include JavaScript code, CSS styles and comments, whose content is worthless for extraction. We therefore delete these items in advance to narrow the search for the target region.</p>
<p>Third step: keyword location and noise filtering. Time is used as the keyword. We first find all text that matches a time format using BeautifulSoup, and divide the times found into two groups: target times and noise times. 
The noise times generally appear in the main text and do not help locate the target region, so they are filtered out. After noise filtering, the remaining times are taken as keywords and form the keyword vector.</p>
<p>Fourth step: target region location. First, we parse each BBS page into a DOM tree, analyze the paths of the tree nodes that contain keywords, find the longest common subsequence of these paths, and locate the nearest common parent node of all keywords. Second, we search for target regions recursively: a region that contains a time keyword is a target region, and a region that contains none is not.</p>
<p>Fifth step: target content extraction. The target content includes the author, title, main text and publication time. The publication time is taken from the keyword vector above. We get the title information by positioning</p> </div> </div> </div> </div> </div> </div> <script> var did = "a13459602"; var ext = 'pdf'; var docId = '1k092fbopvrmmduftd6xdbgaqnutvy11'; var totalPage = 35; const pageNum = '35'; </script> <div class="clearfloat"></div> </body> </html>