Data Mining: Assignment 2
Question 1:
1.
a) Compute the Information Gain for Gender, Car Type and Shirt Size.
b) Construct a decision tree with Information Gain.
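Answer:
a) The class attribute takes two values, C0 and C1, with 10 tuples each. The expected information (entropy) needed to classify a tuple in D is therefore
$\mathrm{Info}(D) = -\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20} = 1$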
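1. Splitting on Gender:
$\mathrm{Info}_{\mathrm{Gender}}(D) = \sum_j \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) = 0.971$
$\mathrm{Gain}(\mathrm{Gender}) = 1 - 0.971 = 0.029$
2. Splitting on Car Type:
$\mathrm{Info}_{\mathrm{CarType}}(D) = \sum_j \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) = 0.314$
$\mathrm{Gain}(\mathrm{Car\ Type}) = 1 - 0.314 = 0.686$
3. Splitting on Shirt Size:
$\mathrm{Info}_{\mathrm{ShirtSize}}(D) = \sum_j \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) = 0.988$
$\mathrm{Gain}(\mathrm{Shirt\ Size}) = 1 - 0.988 = 0.012$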
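The entropy and information-gain calculations in part (a) can be sketched in Python as follows. This is a minimal illustration, not the original worksheet: the data table for Q1 is not reproduced in this document, so the per-branch class counts for Gender below are hypothetical, chosen only because they reproduce the reported value 0.971. The function names are likewise illustrative.

```python
from math import log2


def entropy(counts):
    """Info(D): entropy in bits of a class-count vector such as [n_C0, n_C1]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_after_split(partitions):
    """Info_A(D): size-weighted entropy over the partitions induced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)


def gain(class_counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy(class_counts) - info_after_split(partitions)


# From the answer text: 10 tuples of C0 and 10 tuples of C1.
class_counts = [10, 10]

# HYPOTHETICAL per-branch [C0, C1] counts for Gender (assumption, see lead-in).
gender_partitions = [[6, 4], [4, 6]]

print(round(entropy(class_counts), 3))                # 1.0
print(round(info_after_split(gender_partitions), 3))  # 0.971
print(round(gain(class_counts, gender_partitions), 3))  # 0.029
```

Calling `gain` once per attribute and taking the attribute with the largest result gives the root split of the tree.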
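b) The results in (a) show that splitting on Car Type yields the largest information gain (0.686), so Car Type is chosen as the root of the decision tree.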
Question 2:
2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.
(b) Using the neural network obtained above, show the weight values after one iteration of the back propagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.
a)
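[Figure, decision tree for Question 1(b): root node "Car Type?" with branches Family, Sport, and Luxury; internal node "Shirt Size?" with branches "Small" and "Medium, Large, Extra Large"; leaf nodes C0 and C1.]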
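[Figure, neural network for Question 2(a): input layer, hidden layer, output layer. Input node labels recovered from the figure: x12, x21, x22, x23, x31, x32, x33, x34.]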
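b) Following (a), each input unit is assigned the attribute value it represents and its initial value.
Because the initial weights and biases are generated randomly, the initial values are defined here as:
Net input and output:
Error at each node: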
Unit j    Err j
10        0.0089
11        0.0030
12        -0.12
Updated weights and biases:
Input unit i    W i,10    W i,11
1               0.201     0.198
2               -0.211    -0.099
3               0.4       0.308
4               -0.202    -0.098
5               0.101     -0.100
6               0.092     -0.211
7               -0.400    0.198
8               0.201     0.190
9               -0.110    0.300

Hidden unit j   W j,12
10              -0.304
11              -0.099

Unit j          θ j
10              -0.287
11              0.179
12              0.344
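One backpropagation iteration for a network of this shape (9 input units, hidden units 10 and 11, output unit 12) can be sketched as below. This is a minimal sketch under several assumptions that the document does not confirm: sigmoid activations, the standard delta-rule error terms, a one-hot encoding of (M, Family, Small), a target output of 1, a learning rate of 0.9, and hypothetical initial weights and biases. It is not expected to reproduce the tabulated values, because the original initial values are not preserved here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_step(x, target, w_ih, b_h, w_ho, b_o, lr):
    """One online backpropagation update for a one-hidden-layer, single-output net.

    w_ih[i][j] is the weight from input i to hidden unit j;
    w_ho[j] is the weight from hidden unit j to the output unit.
    """
    n_in, n_hid = len(x), len(b_h)

    # Forward pass: net input plus bias, then sigmoid.
    hidden = [
        sigmoid(sum(x[i] * w_ih[i][j] for i in range(n_in)) + b_h[j])
        for j in range(n_hid)
    ]
    out = sigmoid(sum(hidden[j] * w_ho[j] for j in range(n_hid)) + b_o)

    # Error terms: output unit first, then propagate back to hidden units.
    err_o = out * (1 - out) * (target - out)
    err_h = [hidden[j] * (1 - hidden[j]) * err_o * w_ho[j] for j in range(n_hid)]

    # Updates: w += lr * Err_j * O_i and theta += lr * Err_j.
    new_w_ho = [w_ho[j] + lr * err_o * hidden[j] for j in range(n_hid)]
    new_b_o = b_o + lr * err_o
    new_w_ih = [
        [w_ih[i][j] + lr * err_h[j] * x[i] for j in range(n_hid)]
        for i in range(n_in)
    ]
    new_b_h = [b_h[j] + lr * err_h[j] for j in range(n_hid)]
    return out, err_o, err_h, new_w_ih, new_b_h, new_w_ho, new_b_o


# ASSUMED one-hot order: M, F, Family, Sport, Luxury, Small, Medium, Large, XL.
x = [1, 0, 1, 0, 0, 1, 0, 0, 0]

# HYPOTHETICAL initial weights and biases (not the values used in the answer).
w_ih = [[0.2, -0.1] for _ in range(9)]
b_h = [-0.3, 0.2]
w_ho = [-0.3, -0.1]
b_o = 0.4

out, err_o, err_h, w_ih2, b_h2, w_ho2, b_o2 = backprop_step(
    x, target=1, w_ih=w_ih, b_h=b_h, w_ho=w_ho, b_o=b_o, lr=0.9
)
print(out, err_o, err_h)
```

Weights leaving an input unit whose value is 0 are left unchanged by the update, since the update term is proportional to that unit's output.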
Question 3:
3.
a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.
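Answer:
a) Let A = {A_1, A_2}, where A_1 denotes undergraduate students and A_2 denotes graduate students, and let B denote the event that a student smokes.
From the problem statement:
$P(B|A_1) = 0.15$, $P(B|A_2) = 0.23$, $P(A_1) = 4/5 = 0.8$, $P(A_2) = 1/5 = 0.2$
The quantity to find is $P(A_2|B)$.
By the law of total probability,
$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) = 0.15 \times 0.8 + 0.23 \times 0.2 = 0.166$
and by Bayes' theorem,
$P(A_2|B) = \frac{P(B|A_2)\,P(A_2)}{P(B)} = \frac{0.23 \times 0.2}{0.166} = 0.277$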
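The calculation in part (a) can be checked numerically with a short Python sketch. The function name `posterior` and the dictionary keys are illustrative choices, not part of the original answer; the probabilities are taken from the problem statement.

```python
def posterior(priors, likelihoods):
    """Return (evidence, posteriors) via Bayes' theorem over a set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # law of total probability
    return evidence, {h: j / evidence for h, j in joint.items()}


# P(A_1) = 0.8, P(A_2) = 0.2, P(B|A_1) = 0.15, P(B|A_2) = 0.23
priors = {"undergraduate": 0.8, "graduate": 0.2}
p_smoke_given = {"undergraduate": 0.15, "graduate": 0.23}

p_b, post = posterior(priors, p_smoke_given)
print(round(p_b, 3))                    # 0.166
print(round(post["graduate"], 3))       # 0.277
print(round(post["undergraduate"], 3))  # 0.723
```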
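b) A randomly chosen student is more likely to be an undergraduate, since $P(A_1) = 0.8 > P(A_2) = 0.2$. The same holds for a randomly chosen student who smokes: by (a), such a student is a graduate student with probability 0.277 and an undergraduate with probability $1 - 0.277 = 0.723$.
c) Let C denote the event that a student lives in a dorm. Then
$P(C|A_2) = 0.30$, $P(C|A_1) = 0.10$
$P(C) = P(C|A_1)P(A_1) + P(C|A_2)P(A_2) = 0.10 \times 0.8 + 0.30 \times 0.2 = 0.14$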