文本大数据分析与挖掘介绍
合集下载
相关主题
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
2. Mining content of text data
Real World
Perceive (Perspective)
Observed World Express (English)
Text Data + Context
3. Mining knowledge about the observer
Data Information Knowledge Text Data
Opportunities of Text Mining Applications
4. Infer other real-world variables (predictive analytics)
+ Non-Text Data
“Human Sensor”
3
文本数据的特殊应用价值 Unique Value of Text Data
• 对所有大数据应用都有应用价值: Useful to all big data applications • 特别有助于挖掘,利用有关人的行为,心态,观点的知识: Especially useful for mining knowledge about people’s behavior, attitude, and opinions • 直接表达知识;高质量数据( Directly express knowledge about our world ) 小文本数据应用 (Small text data are also useful!)
People Events Products Services, …
…
Sources:
Blogs Microblogs Forums Reviews ,…
45M reviews
53M blogs 65M msgs/day 1307M posts
115M users 10M groups
…
人= 主观智能“传感器” Humans as Subjective & Intelligent “Sensors”
Real World
Weather Locations Networks Perceive Sense
Sensor
Thermometer
Geo Sensor Network Sensor
Report
Data
3C , 15F, … 41°N and 120°W …. 01000100011100
Express
文本大数据分析与挖掘:机遇,挑战,及应用 前景
Analysis and Mining of Big Text Data: Opportunities, Challenges,and Applications
技术创新,变革未来
Text data cover all kinds of topics
Topics:
1. Mining knowledge about language
Challenges in Understanding Text Data (NLP)
A dog is
Det Noun Aux
chasing a boy on the playground
Verb
Det Noun Prep Det Noun Phrase Noun
Prep Phrase
Syntactic analysis (Parsing) A person saying this may be reminding another person to get the dog back. Pragmatic analysis (speech act)
+
Scared(x) if Chasing(_,x,_). Scared(b1) Inference
Noun Phrase
wk.baidu.com
Semantic analysis
Noun Phrase Complex Verb
Lexical analysis (part-of-speech tagging)
Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Verb Phrase Verb Phrase Sentence
• This makes EVERY step in NLP hard
– Ambiguity is a killer! – Common sense reasoning is pre-required.
Examples of Challenges
• Word-level ambiguity: – “root” has multiple meanings (ambiguous sense) • Syntactic ambiguity: – “natural language processing” (modification) – “A man saw a boy with a telescope.” (PP Attachment) • Anaphora resolution: “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) • Presupposition: “He has quit smoking” implies that he smoked before.
The State of the Art: Mostly Relying on Machine Learning
A dog is chasing a boy on the playground
NLP is hard!
• Natural language is designed to make human communication efficient. As a result,
– we omit a lot of common sense knowledge, which we assume the hearer/reader possesses. – we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve.
Real World
Perceive (Perspective)
Observed World Express (English)
Text Data + Context
3. Mining knowledge about the observer
Data Information Knowledge Text Data
Opportunities of Text Mining Applications
4. Infer other real-world variables (predictive analytics)
+ Non-Text Data
“Human Sensor”
3
文本数据的特殊应用价值 Unique Value of Text Data
• 对所有大数据应用都有应用价值: Useful to all big data applications • 特别有助于挖掘,利用有关人的行为,心态,观点的知识: Especially useful for mining knowledge about people’s behavior, attitude, and opinions • 直接表达知识;高质量数据( Directly express knowledge about our world ) 小文本数据应用 (Small text data are also useful!)
People Events Products Services, …
…
Sources:
Blogs Microblogs Forums Reviews ,…
45M reviews
53M blogs 65M msgs/day 1307M posts
115M users 10M groups
…
人= 主观智能“传感器” Humans as Subjective & Intelligent “Sensors”
Real World
Weather Locations Networks Perceive Sense
Sensor
Thermometer
Geo Sensor Network Sensor
Report
Data
3C , 15F, … 41°N and 120°W …. 01000100011100
Express
文本大数据分析与挖掘:机遇,挑战,及应用 前景
Analysis and Mining of Big Text Data: Opportunities, Challenges,and Applications
技术创新,变革未来
Text data cover all kinds of topics
Topics:
1. Mining knowledge about language
Challenges in Understanding Text Data (NLP)
A dog is
Det Noun Aux
chasing a boy on the playground
Verb
Det Noun Prep Det Noun Phrase Noun
Prep Phrase
Syntactic analysis (Parsing) A person saying this may be reminding another person to get the dog back. Pragmatic analysis (speech act)
+
Scared(x) if Chasing(_,x,_). Scared(b1) Inference
Noun Phrase
wk.baidu.com
Semantic analysis
Noun Phrase Complex Verb
Lexical analysis (part-of-speech tagging)
Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Verb Phrase Verb Phrase Sentence
• This makes EVERY step in NLP hard
– Ambiguity is a killer! – Common sense reasoning is pre-required.
Examples of Challenges
• Word-level ambiguity: – “root” has multiple meanings (ambiguous sense) • Syntactic ambiguity: – “natural language processing” (modification) – “A man saw a boy with a telescope.” (PP Attachment) • Anaphora resolution: “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) • Presupposition: “He has quit smoking” implies that he smoked before.
The State of the Art: Mostly Relying on Machine Learning
A dog is chasing a boy on the playground
NLP is hard!
• Natural language is designed to make human communication efficient. As a result,
– we omit a lot of common sense knowledge, which we assume the hearer/reader possesses. – we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve.