Chapter 4 Unstructured Data Mining
Outline
•4.1 Unstructured data
•4.2 Text mining and its applications
•4.3 Text mining tools and their implementation
•4.4 Mining other unstructured data
4.1 Unstructured Data
Unstructured data
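•Structured data:
–Easy to identify; organized according to a defined structure, such as database tables with rows and columns
–Easy for computers to understand and easy to search
•Unstructured data:
–Has no predefined data model and does not fit a data-table format
–Hard for computers to recognize and process
•Semi-structured data:
–A form of structured data that does not conform to the relational-database or data-table format
–Contains tags or other markers that separate the different parts of the data or describe its internal relationships
–Increasingly common: the web, XML, e-mail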
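The three categories can be contrasted with a minimal Python sketch. The record values (`Sam`, `2000`, `First Bank`) and the `<loan>` tag names are illustrative assumptions, not part of the slides; the point is only how each form is accessed.

```python
import xml.etree.ElementTree as ET

# Structured: fixed columns, like one row of a database table.
columns = ("name", "amount", "lender")
row = ("Sam", 2000, "First Bank")
structured = dict(zip(columns, row))

# Semi-structured: tags label each piece of content, but there is no table schema.
# The tag names here are hypothetical.
xml_text = "<loan><name>Sam</name><amount>2000</amount><lender>First Bank</lender></loan>"
root = ET.fromstring(xml_text)
semi_structured = {child.tag: child.text for child in root}

# Unstructured: free text with no predefined data model.
unstructured = "Sam promises to pay $2,000 to the lender, First Bank."

print(structured["amount"])       # direct lookup by column name, typed value
print(semi_structured["amount"])  # lookup by tag; the value arrives as a string
print("2,000" in unstructured)    # without further processing, only substring search
```

The structured and semi-structured forms can be queried by field name, while the free text needs extra processing before any field can be retrieved.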
Unstructured data
•Text data
•Web data, logs, clickstream data
•Multimedia data: audio, images, video
•Convert unstructured data into structured data, then mine it
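One common way to turn text into structured data is a bag-of-words document-term matrix. The sketch below is a minimal illustration under that assumption; the helper names `tokenize` and `document_term_matrix` are hypothetical, and the slides do not prescribe this particular representation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

docs = [
    "Big data is the future of data",
    "The lender may transfer this note",
]
vocabulary, rows = document_term_matrix(docs)

print(vocabulary)
for r in rows:
    print(r)
```

Each document becomes one row of counts over a shared vocabulary, so ordinary table-based mining methods can then be applied.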
4.2 Text Mining and Its Applications
Examples of unstructured text data
•Documents:
In return for a loan that I have received, I promise to pay $2,000 (this amount is called principal), plus interest, to the order of the lender. The lender is First Bank. I will make all payments under this note in the form of cash, check, or money order.
I understand that the lender may transfer this note. The lender or anyone who takes this note by transfer and who is entitled . . .
•E‐mails:
Hi Sam. How are you coming with the chapter on big data for the For Dummies book? It is due on Friday.
Joanne
•Log files:
222.222.222.222 - - [08/Oct/2012:11:11:54 -0400] "GET / HTTP/1.1" 200 10801 "/search?q=log+analyzer=ie=…. . .
•Tweets:
#Big data is the future of data!
•Facebook posts:
LOL. What are you doing later? BFF
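Some of these examples carry enough regularity to be structured with simple pattern matching. The sketch below parses a cleaned-up copy of the log line (ASCII hyphens and quotes, and the truncated referrer omitted, are assumptions) and pulls hashtags from the tweet. The field layout in `LOG_PATTERN` is inferred from that single sample, and real server logs may differ.

```python
import re

# Assumed layout: client address, two placeholder fields, [timestamp],
# "METHOD path protocol", status code, response size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

def extract_hashtags(text):
    """Return the words that directly follow a '#' sign."""
    return re.findall(r"#(\w+)", text)

log_line = '222.222.222.222 - - [08/Oct/2012:11:11:54 -0400] "GET / HTTP/1.1" 200 10801'
record = parse_log_line(log_line)
print(record)
print(extract_hashtags("#Big data is the future of data!"))
```

Documents, e-mails, and posts have no such fixed layout, which is why they need the linguistic processing described in the rest of this section.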
Definition of text mining
•Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
•Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.
•Typical text mining tasks
–text categorization, text clustering, concept/entity extraction,
–production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
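As a rough illustration of one of these tasks, text categorization, the sketch below follows the structure, pattern, and interpretation steps from the definition. It structures each text as a term-frequency vector and assigns the label of the most similar labelled example by cosine similarity. The labelled examples, the category names, and the nearest-neighbour approach are all assumptions chosen for brevity, not a method given in the slides.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Structure the input: turn raw text into a term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def categorize(text, labelled_examples):
    """Assign the label of the most similar labelled example (1-nearest neighbour)."""
    query = vectorize(text)
    best_label, _ = max(
        ((label, cosine(query, vectorize(example))) for example, label in labelled_examples),
        key=lambda pair: pair[1],
    )
    return best_label

# Hypothetical labelled examples, invented for illustration.
labelled_examples = [
    ("I promise to pay the principal plus interest to the lender", "finance"),
    ("the lender may transfer this note", "finance"),
    ("how are you coming with the chapter for the book", "publishing"),
    ("the book chapter is due on Friday", "publishing"),
]

print(categorize("The lender receives interest on the loan", labelled_examples))
print(categorize("Is the chapter due soon?", labelled_examples))
```

With raw counts, frequent words such as "the" weigh heavily on the similarity; practical systems usually remove stop words or apply a weighting such as TF-IDF before comparing documents.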
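Text mining: overview (I)
•Interdisciplinary field: information retrieval, data mining, machine learning, statistics, computational linguistics, NLP
•A large amount of information exists as text: news articles, scientific papers, books, digital libraries, e-mail, blogs, web pages
•Goal: derive high-quality information from text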