"Artificial Intelligence and Data Mining Courseware": lect.ppt
"Artificial Intelligence and Data Mining Courseware": l(3)
Chapter 3 Basic Data Mining Techniques
3.1 Decision Trees
(For classification)
2020/11/2
Introduction: Classification—A Two-Step Process
• 1. Model construction: build a model that can describe a set of predetermined classes
– This set of examples is used for model construction: training set
– The model can be represented as classification rules, decision trees, or mathematical formulae
buys_computer no no yes yes yes no yes no yes yes yes yes yes no
1 Example (2): Output: A Decision Tree for “buys_computer”
age?
<=30 overcast (30..40)
student?
no
George Professor 5 yes
Joseph Assistant Prof 7 yes
(Jeff, Professor, 4)
Tenured?
1 Example (1): Training Dataset
An example from Quinlan’s ID3 (1986)
student credit_rating no fair no excellent no fair no fair yes fair yes excellent yes excellent no fair yes fair yes fair yes excellent no excellent yes fair no excellent
"Artificial Intelligence and Data Mining Courseware": lect-3-12
– There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
– There are no samples left
– Reach the pre-set accuracy
4/22/2020
age <=30 <=30 30…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40
income high high high medium low low low medium low medium medium medium high medium
3.2 Rules simplification and elimination
A Rule for the Tree in Figure 3.4
IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No (accuracy = 75%, Figure 3.4)
Attribute Selection by Information Gain
Computation
Class P: buys_computer = “yes”
Class N:
$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \dots$$
"Artificial Intelligence and Data Mining Courseware": courseintro-13.ppt
[Figure: chart by region (East Coast, South, Midwest, West Coast) for Q1 to Q3]
Insight
Based on:
Michael J. A. Berry, Data Miners,
2021/3/11
AI & DM: BUPT
Months as Customer
✓ Approach – give away gifts: target customers, what gift, what time
Issues to consider: 1. Targeting customers
• Every customer
• High expenditure customers
• Most profitable customers (who are)
• Customers likely to churn (concentrate on the ones
– There are different types of information systems that can support the operation of business: word processor, spread sheets, databases, accounting systems, ERP, decision support systems, expert systems, business intelligence…
Example: why CRM needs DM
✓ CRM for mobile phone company – customer retention (churn)
"Artificial Intelligence and Data Mining Courseware": lect-5-13
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
– d(i,i) = 0
– d(i,j) = d(j,i)
– d(i,j) ≤ d(i,k) + d(k,j)
2019/1/28 AI&DM BUPT 16
– Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
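The standardization step can be sketched as follows. The slide does not say how s_f is computed; this sketch assumes the mean absolute deviation (a common choice for this measure), and the function name `standardize` and the sample values are illustrative only.

```python
def standardize(values):
    """Return z-scores z_if = (x_if - m_f) / s_f for one variable f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    # Assumption: s_f is the mean absolute deviation, not the standard deviation.
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]


prices = [7, 20, 22, 50, 51, 53]   # sample values, for illustration only
z_scores = standardize(prices)
print([round(z, 3) for z in z_scores])
```

With this choice of s_f the standardized values have mean 0 and mean absolute deviation 1, which makes variables measured in different units comparable before a distance is computed.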
4.2 Binary Variables
• A contingency table for binary data
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer
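The formula this "where" clause belongs to is not preserved in the text. Assuming it is the standard Minkowski distance, d(i, j) = (|x_i1 - x_j1|^q + ... + |x_ip - x_jp|^q)^(1/q), a minimal sketch looks like this; the function name `minkowski` and the sample points are illustrative.

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional objects; q is a positive integer."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


x = (1.0, 2.0)
y = (4.0, 6.0)
manhattan = minkowski(x, y, 1)   # q = 1: sum of absolute differences
euclidean = minkowski(x, y, 2)   # q = 2: straight-line distance
print(manhattan, euclidean)
```

Setting q = 1 gives the Manhattan distance and q = 2 the Euclidean distance, so one function covers both cases.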
4.1 Interval-valued variables (cont. 2)
             Object j
             1      0      sum
Object i  1  a      b      a+b
          0  c      d      c+d
        sum  a+c    b+d    p
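• Simple matching coefficient (if the binary variable is symmetric):
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
• Jaccard coefficient (if the binary variable is asymmetric):
$$d(i, j) = \frac{b + c}{a + b + c}$$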
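A small sketch of how the contingency counts a, b, c, d and the two dissimilarities could be computed for a pair of 0/1 vectors. The function names and the two sample objects are made up for illustration; the second measure is the usual Jaccard-style variant that ignores 0/0 matches.

```python
def contingency(i, j):
    """Count the four cells of the 2x2 contingency table for two binary vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching(i, j):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(i, j):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c).

    Undefined (division by zero) when both objects are all zeros.
    """
    a, b, c, _ = contingency(i, j)
    return (b + c) / (a + b + c)


obj_i = [1, 0, 1, 0, 0, 0]   # hypothetical objects
obj_j = [1, 0, 1, 0, 1, 0]
print(contingency(obj_i, obj_j))
print(simple_matching(obj_i, obj_j), jaccard_dissimilarity(obj_i, obj_j))
```

Here the three shared zeros lower the simple matching dissimilarity but have no effect on the asymmetric one, which is why the asymmetric measure is preferred when a 1 is much more informative than a 0.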
Example
Price($)
7 20 22 50 51 53
"Artificial Intelligence and Data Mining Courseware": 2.datawarehouse
Data Warehouse environment
the source systems from which data is extracted
the tools used to extract data for loading the data warehouse
the data warehouse database itself where the data is stored
the desktop query and reporting tools used for decision support
Data Warehousing Process Overview
Operational Vs. Multidimensional View Of Sales
Creating A Data Warehouse
The Data Warehouse
The Data Warehouse is an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.
The Data Warehouse
Integrated
The Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization. The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.
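Artificial Intelligence (6): Knowledge Discovery and Data Mining (PPT courseware)
Artificial Intelligence
School of Computer Science, Beijing Information Science and Technology University, Li Bao'an (李宝安)
Knowledge Discovery and Data Mining
Database technology and computer networks have become the two most important foundational areas of computer applications today, touching every aspect of human life. The total volume of data held in databases and on the Internet worldwide is growing at an extremely rapid pace. Simple data queries and statistics can meet some low-level needs, but what people need more is general knowledge, mined from large data resources, that can guide decisions of all kinds. The rapid growth, timeliness, and complexity of data far exceed what people can process by hand, so there is an urgent need for high-performance automated data analysis tools that process data quickly, comprehensively, deeply, and effectively.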
       D      P       D/P    D²/P    D³/P²
B      8.67   3.571   2.427  21.038  51.06
C      14.00  7.155   1.957  7.395   53.61
D      24.67  16.889  1.418  36.459  53.89
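BACON4 invokes the heuristics above and finds a monotonic trend relating D and P: P increases as D increases, but the corresponding slope term is not constant; it decreases as D increases. This leads BACON4 to define D²/P, whose value is also not constant but increases as D/P decreases. As a result the system considers the term D³/P², whose value is nearly constant (the system allows a tolerance, for example 7.5%). From this result BACON4 induces the law.
Once an inferred term has been defined, it is treated no differently from a directly observed variable. In the ideal gas law example, for instance, the trend detector first establishes an inferred term such as PV and then one such as PV/T. Relations among the values taken by these inferred terms can also be found, and new inferred terms derived from them in turn, leading to more complex descriptions of the directly observed variables such as PV/nT. BACON4 applies the same heuristics recursively to generate progressively more complex, higher-level descriptions, and this capability gives the system considerable power to search for empirical laws.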
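The constancy check at the heart of this search can be sketched in a few lines. This is not BACON4 itself: the three candidate terms are hard-coded here rather than generated from detected trends, the helper name `nearly_constant` is made up, and "nearly constant" is assumed to mean that every value lies within 7.5% of the mean. The (D, P) pairs are rows B, C, and D of the table.

```python
def nearly_constant(values, tolerance=0.075):
    """True if every value lies within `tolerance` (relative) of the mean."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tolerance * abs(mean) for v in values)


# (D, P) observations for rows B, C, D of the table.
observations = [(8.67, 3.571), (14.00, 7.155), (24.67, 16.889)]

# Candidate terms in the order the text says they are considered.
candidate_terms = {
    "D/P": lambda d, p: d / p,
    "D^2/P": lambda d, p: d ** 2 / p,
    "D^3/P^2": lambda d, p: d ** 3 / p ** 2,
}

constant_terms = [
    name
    for name, term in candidate_terms.items()
    if nearly_constant([term(d, p) for d, p in observations])
]
print(constant_terms)
```

Only the last term passes the tolerance test, which is the point at which the text says the law is induced.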
Artificial Intelligence and Data Mining Courseware: 2.datawarehouse
Subject-Oriented
The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.
What is Data Warehouse
The idea of a data warehouse is to put a wide range of operational data from internal and external sources into one place so it can be better utilized by executives, line of business managers and other business analysts.
The Data Warehouse
Time Variant
The Warehouse data represent the flow of data through time. It can even contain projected data.
Non-Volatile
Once data enter the Data Warehouse, they are never removed.
The Data Warehouse
The Data Warehouse is an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.
"Artificial Intelligence and Data Mining Courseware": 2.datawarehouse.ppt
Once the information is gathered, OLAP (on-line analytical processing) software comes into play by providing the desktop analysis tools for querying, manipulating and reporting the data from the data warehouse.
The Data Warehouse is always growing.
Operational Database vs. Data warehouse
Operational DB | Data Warehouse
Similar data can have different representations | Unified view of all data elements
Data Warehouse
Why Data warehouse
The most common issue companies face when looking at data mining is that the information is not in one place.
The biggest challenge business analysts face in using data mining is how to extract, integrate, cleanse, and prepare data to solve their most pressing business problems.
Data Mart
Data Marts can serve as a test vehicle for companies exploring the potential benefits of Data Warehouses.
"Artificial Intelligence and Data Mining Courseware": 2.datawarehou
The Data Warehouse is always growing.
Operational Database vs. Data warehouse
Operational DB
Similar data can have different representations or meanings
The Data Warehouse
Integrated
The Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization.
the desktop query and reporting tools used for decision support
Data Warehousing Process Overview
Operational Vs. Multidimensional View Of Sales
Creating A Data Warehouse
Data Warehouse environment
the source systems from which data is extracted
the tools used to extract data for loading the data warehouse
the data warehouse database itself where the data is stored
The Data Warehouse
Time Variant
The Warehouse data represent the flow of data through time. It can even contain projected data.
"Artificial Intelligence and Data Mining Courseware": lect-7-13 (31 pages)
W1j    W1i    W2j    W2i    W3j    W3i    Wjk    Wik
0.20   0.10   0.30   –0.10  –0.10  0.20   0.10   0.50
[Figure: feed-forward network. Input Layer: Node 1 (input 1.0), Node 2 (input 0.4), Node 3 (input 0.7), connected by weights W1j, W1i, W2j, W2i, W3j, W3i to the Hidden Layer]
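A forward pass through this small network can be sketched as follows. Several things are assumptions rather than statements from the slides: the weights are paired with the values in the order listed (W1j = 0.20 through Wik = 0.50), the hidden nodes are named j and i and the single output node k (inferred from the weight names), and the sigmoid is used as the activation function.

```python
import math


def sigmoid(x):
    """Logistic activation function (assumed; the slide does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


# Input values for Node 1, Node 2, Node 3.
inputs = {1: 1.0, 2: 0.4, 3: 0.7}

# Input-to-hidden weights, keyed by hidden node then input node.
w_hidden = {
    "j": {1: 0.20, 2: 0.30, 3: -0.10},
    "i": {1: 0.10, 2: -0.10, 3: 0.20},
}

# Hidden-to-output weights Wjk and Wik.
w_output = {"j": 0.10, "i": 0.50}

# Each hidden node applies the activation to its weighted input sum.
hidden_out = {
    h: sigmoid(sum(w[n] * inputs[n] for n in inputs)) for h, w in w_hidden.items()
}

# The output node does the same with the hidden-node outputs.
output_k = sigmoid(sum(w_output[h] * hidden_out[h] for h in hidden_out))

print({h: round(v, 3) for h, v in hidden_out.items()}, round(output_k, 3))
```

Under these assumptions the weighted input is 0.25 for node j and 0.20 for node i, and the network output is about 0.582; backpropagation would then compare this output with the target value and adjust the weights.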
3. Most popular ANN - Backpropagation Network (8.5.1 The Backpropagation Algorithm: An example)
2019/9/28
AI & DM
2
1. What & Why ANN: Artificial Neural Networks (ANN)
1. What & Why ANN (8.1 Feed forward Neural Network)
2. How ANN works - working principle (8.2.1 Supervised Learning)
3. Most popular ANN - Backpropagation Network (8.5.1 The Backpropagation Algorithm: An example)
– Step 6: Deploy developed network application if the test accuracy is acceptable
"Artificial Intelligence and Data Mining Courseware": lect-2-13
• Output – dependent variable
• Input – independent variable
• Conversion between categorical & numerical data
2019/10/20
AI&DM
3
Data Mining Strategies: Estimation
Classification
Estimation
Prediction
Data Mining Strategies: Classification
• Learning is supervised.
• The dependent variable is categorical (discrete).
• Well-defined classes.
Numeric | Maximum heart rate achieved
Induced Angina? | True, False | 1, 0 | Does the patient experience angina as a result of exercise?
Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest
Slope | Up, flat, down | 1–3 | Slope of the peak exercise ST segment
Number Colored Vessels | 0, 1, 2, 3
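The conversion between categorical and numeric forms shown in this table can be sketched as a pair of lookup tables. The True/False to 1/0 coding comes from the table; mapping Up, Flat, Down to 1, 2, 3 in that order is an assumption, and the patient record below is hypothetical.

```python
# Codes from the attribute table; the Up/Flat/Down ordering is an assumption.
ANGINA_CODES = {"True": 1, "False": 0}
SLOPE_CODES = {"Up": 1, "Flat": 2, "Down": 3}


def to_numeric(record):
    """Return a copy of the record with categorical values replaced by numeric codes."""
    encoded = dict(record)
    encoded["Induced Angina?"] = ANGINA_CODES[record["Induced Angina?"]]
    encoded["Slope"] = SLOPE_CODES[record["Slope"]]
    return encoded


def to_categorical(record):
    """Invert to_numeric by looking codes up in the reversed tables."""
    angina_names = {code: name for name, code in ANGINA_CODES.items()}
    slope_names = {code: name for name, code in SLOPE_CODES.items()}
    decoded = dict(record)
    decoded["Induced Angina?"] = angina_names[record["Induced Angina?"]]
    decoded["Slope"] = slope_names[record["Slope"]]
    return decoded


patient = {"Induced Angina?": "True", "Old Peak": 1.4, "Slope": "Flat"}  # hypothetical
numeric_patient = to_numeric(patient)
print(numeric_patient)
```

Attributes that are already numeric, such as Old Peak, pass through unchanged in both directions.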
"Artificial Intelligence and Data Mining Courseware": lect-1-13
2019/4/24
BUPT AI&DM
8
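Induction-based Learning
– This is a time that one must speak with data.
– The future belongs to the number crunchers (Super Crunchers, Ian Ayres, 2009): everyday decisions will become increasingly automated, and the role of human judgment will be limited to supplying data for the computation
• Predicting the taste and aroma of wine: Orley Ashenfelter, an economist at Princeton University, knows nothing about winemaking, yet he can predict the price of Bordeaux wines from the weather (hot, dry years produce very good wine) more accurately than wine experts
• The book was originally to be called "The End of Theory"; the title was later changed by testing with Google rather than by discussion with the publisher's editor, because the new title was found to have a 63% higher click-through rate
• Loan officers were once well paid and carried the greatest responsibility; now they are merely call-center operators who repeat the questions the computer prompts, for very low pay
• (2) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. (generally accepted)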
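• In the past, Shanghai GM's warranty problem analysis relied mainly on simple, purely manual calculation, and each round produced only a handful of problem reports. Although vehicle output was far lower than it is today, this time-consuming and laborious analysis cycle was a root cause of persistently high warranty costs. In this non-automated environment it took 6 to 12 months on average from the appearance of a warranty claim to identifying the cause of the problem, often with support from GM's global organization along the way, and the whole problem-solving process rested mainly on experience-based analysis. In addition, inaccurate data made it hard for Shanghai GM to forecast warranty costs accurately and so to prepare a sensible warranty budget for the next period, which tied up a large amount of operating capital and reduced cash flow.
• After adopting SAS's warranty analysis solution, Shanghai GM shortened its warranty analysis cycle by 70% within the first 6 months, effectively reducing warranty costs and meeting the goals set for the system. These marked improvements also helped Shanghai GM recover its entire software and hardware investment in the warranty analysis system within just half a year, saving the company a total of 18 million RMB.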
"Artificial Intelligence and Data Mining Teaching": l(4)
• Learning is supervised.
• The dependent variable is numeric.
• Well-defined classes.
• Conversion between categorical & numerical data
2021/6/28
Data Mining Strategies: Prediction
as a result of exercise?
Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest
Slope
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.
Classification, Estimation or Prediction? (I)
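The discussion from here on therefore focuses on whether a problem is categorical or numerical, and ignores whether it concerns a current relationship or the prediction of a future one. This is also common practice in industry, namely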
• Supervised learning
– Classification
– Prediction
Unsupervised Clustering can be used:
[Figure: decision tree for “buys_computer”]
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
2020/4/24
AI&DM
6
2 Algorithm for Decision Tree Building
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
Attribute Selection by Information Gain
Computation
Class P: buys_computer = “yes”
Class N:
$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$$
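As a worked check of E(age), using the two-class information measure I(p, n) and the class counts in the training table (9 "yes" and 5 "no", counted from the buys_computer column), the arithmetic runs roughly as follows. Values are rounded to three decimals, and I(3,2) = I(2,3) by symmetry.

```latex
\begin{aligned}
I(2,3) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971, \qquad I(4,0) = 0 \\
E(age) &= \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) \approx 0.694 \\
I(9,5) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
Gain(age) &= I(9,5) - E(age) \approx 0.940 - 0.694 = 0.246
\end{aligned}
```

This gain is larger than the values listed for income, student, and credit_rating, which is why age is the attribute tested at the root of the tree.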
– There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
– There are no samples left
– Reach the pre-set accuracy
predetermined classes
– Preparation: Each tuple/sample is assumed to belong to a predefined class, labeled by the output attribute or class label attribute
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
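The greedy, top-down construction and its stopping conditions can be sketched as below. This is a minimal ID3-style illustration, not the course's implementation: it handles categorical attributes only, uses information gain as the selection measure, omits the pre-set accuracy condition, and the four training rows are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Information needed to classify a sample drawn from `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


def build_tree(rows, attributes, target):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:        # all samples belong to the same class
        return labels[0]
    if not attributes:               # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Branch only on values that occur, so "no samples left" cannot arise here.
    return {
        best: {
            value: build_tree([r for r in rows if r[best] == value], remaining, target)
            for value in sorted({r[best] for r in rows})
        }
    }


# Hypothetical toy data, for illustration only.
rows = [
    {"student": "no", "credit_rating": "fair", "buys": "no"},
    {"student": "no", "credit_rating": "excellent", "buys": "no"},
    {"student": "yes", "credit_rating": "fair", "buys": "yes"},
    {"student": "yes", "credit_rating": "excellent", "buys": "yes"},
]
tree = build_tree(rows, ["student", "credit_rating"], "buys")
print(tree)
```

On this toy data student separates the classes perfectly, so it is chosen at the root and both branches become leaves immediately.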
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
3.2 Rules simplification and elimination
A Rule for the Tree in Figure 3.4
IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No (accuracy = 75%, Figure 3.4)
A Simplified Rule Obtained by Removing Attribute Age
IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No (accuracy = 83.3% (5/6), Figure 3.5)
classify objects in all subsets S_i is
$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
• The encoding information that would be gained by branching on A
$$Gain(A) = I(p, n) - E(A)$$
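E(A) and Gain(A) can be computed directly from the per-branch class counts (p_i, n_i). The sketch below applies them to the student attribute; the counts (6 "yes" and 1 "no" for students, 3 "yes" and 4 "no" for non-students) were tallied by hand from the 14-row training table, and the function names are illustrative.

```python
import math


def info(p, n):
    """I(p, n): information needed to decide between classes P and N."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0)


def expected_info(partitions):
    """E(A): weighted information over the subsets produced by attribute A."""
    total = sum(pi + ni for pi, ni in partitions)
    return sum((pi + ni) / total * info(pi, ni) for pi, ni in partitions)


def gain(partitions):
    """Gain(A) = I(p, n) - E(A)."""
    p = sum(pi for pi, _ in partitions)
    n = sum(ni for _, ni in partitions)
    return info(p, n) - expected_info(partitions)


# (p_i, n_i) for student = yes and student = no, tallied from the training table.
student_partitions = [(6, 1), (3, 4)]
gain_student = gain(student_partitions)
print(round(gain_student, 3))
```

The result agrees with the Gain(student) value quoted on the slides up to rounding in the third decimal.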
Classification Process (2): Use the Model in Prediction
Classifier
Testing Data
Unseen Data
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 |
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
age      p_i   n_i   I(p_i, n_i)
30…40    4     0     0
>40      3     2     0.971
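The gain values can be reproduced from the 14-row training table itself. In this sketch the table is typed in from the columns given on the slides; the row written "30…40" is treated as the same interval as "31…40" (an assumption), and the helper names are illustrative.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
TABLE = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def gain(rows, column, target=-1):
    """Information gain of splitting `rows` on the attribute at index `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[target])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - expected


gains = {name: gain(TABLE, idx) for idx, name in enumerate(COLUMNS[:-1])}
root = max(gains, key=gains.get)   # attribute with the highest gain
print({name: round(g, 3) for name, g in gains.items()}, root)
```

The attribute with the highest gain is age, so it becomes the test at the root, and the same procedure is then repeated inside each age branch.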
3. Decision Tree Rules
• Automate rule creation
• Rules simplification and elimination
• A default rule is chosen
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of testing set samples that are correctly classified by the model
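Estimating accuracy on a test set can be sketched with the tenure example from these slides. The classifier is the rule "IF rank = 'professor' OR years > 6 THEN tenured = 'yes'"; the test rows are the ones whose labels survive in the text (Tom, George, Joseph), and Merlisa is left out because her label is cut off. The function name `classify` is illustrative.

```python
def classify(rank, years):
    """Apply the learned rule: professor OR more than 6 years means tenured."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# (name, rank, years, known tenured label) from the testing data on the slides.
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's result.
correct = sum(
    1 for _, rank, years, tenured in test_set if classify(rank, years) == tenured
)
accuracy = correct / len(test_set)

# Model usage: classify the unseen sample (Jeff, Professor, 4).
jeff_tenured = classify("Professor", 4)
print(accuracy, jeff_tenured)
```

Because these rows were not used to build the rule, the resulting rate is an estimate of how the model behaves on unseen data rather than a measure of how well it memorized the training set.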
Classification Process (1): Model Construction
Training Data
Classification Algorithms
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 |
3.1 Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
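Creating one rule per root-to-leaf path can be sketched with a recursive walk over a tree stored as nested dictionaries. The representation and the function name `extract_rules` are illustrative choices, and the leaf labels in the sample tree were read off the 14-row training table rather than copied from a figure.

```python
def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):   # leaf: emit the rule for the path walked so far
        antecedent = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {antecedent} THEN {target} = "{tree}"']
    rules = []
    for attribute, branches in tree.items():   # an internal node tests one attribute
        for value, subtree in branches.items():
            rules.extend(
                extract_rules(subtree, target, conditions + ((attribute, value),))
            )
    return rules


# Hand-coded tree for buys_computer (leaf labels assumed from the training table).
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}
rules = extract_rules(tree, "buys_computer")
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced, and each attribute-value pair along a path becomes one conjunct of the rule's antecedent.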
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classifier (Model)
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
– The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
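The measure I(p, n) is short enough to check numerically. In this sketch the function name `info` is illustrative, and a class with zero members is treated as contributing nothing (the usual convention that 0 · log2 0 = 0).

```python
import math


def info(p, n):
    """I(p, n) in bits for a set with p examples of class P and n of class N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:                     # convention: 0 * log2(0) contributes 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result


whole_set = info(9, 5)   # 9 "yes" and 5 "no" in the buys_computer column
print(round(whole_set, 3), info(7, 7), info(14, 0))
```

The value is 1 bit when the two classes are equally frequent and 0 when the set is pure, so a lower value after a split means the subsets are closer to being single-class.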
• Note: Test set is independent of training set, otherwise over-fitting will occur
• 2. Model usage: use the model to classify future or unknown