Chapter 4 Unstructured Data Mining (Part 01)

Outline

•4.1 Unstructured data

•4.2 Text mining and its applications

•4.3 Text mining tools and their implementation

•4.4 Mining other unstructured data

4.1 Unstructured Data

Unstructured Data

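•Structured data:

–Easy to identify and organized according to a defined structure: database/table data with rows and columns

–Easy for computers to understand and easy to search

•Unstructured data:

–Has no predefined data model, or does not fit a data-table format

–Not easy for computers to recognize or process

•Semi-structured data:

–A form of structured data that does not conform to the relational-database or data-table format

–Contains tags or other markers that separate the different parts of the data or describe its internal relationships

–Increasingly common: the web, XML, e-mail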
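
To make the three categories concrete, the following minimal sketch holds the same fact (a principal of 2000 owed to First Bank) in each form. The column names, the XML tag names, and the regular expression are illustrative assumptions and do not come from the slides.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: a table with named columns; a field is addressed by column name.
table = io.StringIO("lender,principal\nFirst Bank,2000\n")
row = next(csv.DictReader(table))
structured_principal = int(row["principal"])

# Semi-structured: tags label the content, but there is no fixed table schema.
xml_text = "<loan><lender>First Bank</lender><principal>2000</principal></loan>"
loan = ET.fromstring(xml_text)
semi_structured_principal = int(loan.findtext("principal"))

# Unstructured: free text; the program needs a heuristic to locate the amount.
free_text = "I promise to pay $2,000, plus interest, to First Bank."
match = re.search(r"\$(\d{1,3}(?:,\d{3})*)", free_text)
unstructured_principal = int(match.group(1).replace(",", ""))

print(structured_principal, semi_structured_principal, unstructured_principal)
```

The first two lookups rely on structure that travels with the data. The third works only as long as the pattern guessed by the programmer happens to match the wording of the text.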

Unstructured Data

•Text data

•Web data, logs, clickstream data

•Multimedia data: audio, images, video

•Convert unstructured data into structured data, and then mine it
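
One common way to turn text into structured data is a document-term matrix: one row per document and one column per word, with each cell holding a count. The sketch below is a minimal illustration using only the Python standard library. The helper names `tokenize` and `document_term_matrix` are hypothetical, and the tokenization rule (lowercase alphabetic runs) is a simplifying assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, rows), with one row of term counts per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Two short sample sentences used purely as illustration.
docs = [
    "Big data is the future of data!",
    "It is due on Friday.",
]
vocabulary, rows = document_term_matrix(docs)
for row in rows:
    print(dict(zip(vocabulary, row)))
```

Once the text is in this tabular form, ordinary data mining methods for structured data (clustering, classification, association analysis) can be applied to the rows.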

4.2 Text Mining and Its Applications

Examples of Unstructured Text Data

•Documents:

In return for a loan that I have received, I promise to pay $2,000 (this amount is called principal), plus interest, to the order of the lender. The lender is First Bank. I will make all payments under this note in the form of cash, check, or money order.

I understand that the lender may transfer this note. The lender or anyone who takes this note by transfer and who is entitled . . .

•E-mails:

Hi Sam. How are you coming with the chapter on big data for the For Dummies book? It is due on Friday.

Joanne

•Log files:

222.222.222.222 - - [08/Oct/2012:11:11:54 -0400] "GET / HTTP/1.1" 200 10801 "/search?q=log+analyzer=ie=…. . .

•Tweets:

#Big data is the future of data!

•Facebook posts:

LOL. What are you doing later? BFF
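
Of the examples above, the log line is the closest to structured data, because its fields follow a regular layout. The sketch below parses such a line into named fields with a regular expression. It assumes the line follows the Common Log Format; the sample line is cut off after the response size, since the rest of the line is truncated in the slide. The names `LOG_PATTERN` and `parse_log_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

line = '222.222.222.222 - - [08/Oct/2012:11:11:54 -0400] "GET / HTTP/1.1" 200 10801'
record = parse_log_line(line)
print(record["host"], record["method"], record["path"], record["status"], record["size"])
```

The document, e-mail, tweet, and post examples have no such regular layout, which is why they need the text mining techniques discussed next rather than a single pattern.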

Definition of Text Mining

•Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.

•Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.

•Typical text mining tasks

–text categorization, text clustering, concept/entity extraction,

–production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
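
The process described in the definition (structure the input text, insert it into a database, then derive patterns from the structured data) can be sketched in a few lines. This is a minimal illustration, not a full text mining system. The stop-word list, the table layout, the helper names (`structure`, `build_term_table`, `top_terms`), and the second sample sentence are all assumptions made for the example, and "pattern" here is reduced to the most frequent terms.

```python
import re
import sqlite3
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "of", "a", "to", "and", "in", "i"}

def structure(text):
    """Parse text into lowercase tokens and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_term_table(docs):
    """Insert per-document term counts into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (doc_id INTEGER, term TEXT, n INTEGER)")
    for doc_id, text in enumerate(docs):
        counts = Counter(structure(text))
        conn.executemany(
            "INSERT INTO terms VALUES (?, ?, ?)",
            [(doc_id, term, n) for term, n in counts.items()],
        )
    return conn

def top_terms(conn, k):
    """Derive a simple pattern: the k most frequent terms over all documents."""
    return conn.execute(
        "SELECT term, SUM(n) AS total FROM terms "
        "GROUP BY term ORDER BY total DESC, term ASC LIMIT ?",
        (k,),
    ).fetchall()

docs = [
    "Big data is the future of data!",
    "How are you coming with the chapter on big data?",
]
conn = build_term_table(docs)
result = top_terms(conn, 2)
print(result)
```

The final step named in the definition, evaluation and interpretation, is left to the analyst: the query returns counts, and deciding whether they are relevant, novel, or interesting is a judgment about the output.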

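Text Mining: Overview (I)

•An interdisciplinary field: information retrieval, data mining, machine learning, statistics, computational linguistics, NLP

•A large amount of information exists in text form: news articles, scientific papers, books, digital libraries, e-mail, blogs, web pages

•Goal: derive high-quality information from text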