Big Data Mining: Assignment 2

Question 1:

1.

a) Compute the Information Gain for Gender, Car Type and Shirt Size.

b) Construct a decision tree with Information Gain.

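Answer:

a) The class attribute takes two values, C0 and C1, and each occurs 10 times, so the expected information (entropy) needed to classify a tuple in D is

$$\mathrm{Info}(D) = -\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20} = 1$$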

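1. Splitting on Gender:

$\mathrm{Info}_{\mathrm{Gender}}(D) = 0.971$, so $\mathrm{Gain}(\mathrm{Gender}) = 1 - 0.971 = 0.029$

2. Splitting on Car Type:

$\mathrm{Info}_{\mathrm{CarType}}(D) = 0.314$

$\mathrm{Gain}(\mathrm{Car\ Type}) = 1 - 0.314 = 0.686$

3. Splitting on Shirt Size:

$\mathrm{Info}_{\mathrm{ShirtSize}}(D) = 0.988$

$\mathrm{Gain}(\mathrm{Shirt\ Size}) = 1 - 0.988 = 0.012$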
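The gain computations above all follow the same pattern: the entropy of the whole set minus the weighted entropy of the partitions. The following is a minimal sketch of that pattern. The function names `entropy` and `info_gain` are illustrative, and the per-value class counts used for Gender are an assumption (the data table is not reproduced in this answer); they were chosen only because they reproduce the 0.971 reported above.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Information gain of a split.

    parent_counts: class counts before the split, e.g. [10, 10]
    partitions:    one list of class counts per attribute value
    """
    total = sum(parent_counts)
    expected = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - expected

# ASSUMED class counts [C0, C1] for the two Gender values; not taken from the source
gender_partitions = [[6, 4], [4, 6]]
gain_gender = info_gain([10, 10], gender_partitions)
print(round(entropy([10, 10]), 3), round(gain_gender, 3))
```

With the real table, the same call can be repeated for each attribute and the attribute with the largest gain chosen as the split.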

b) The results in (a) show that splitting on Car Type gives the largest information gain, so the decision tree uses Car Type as its root split.

Question 2:

2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.

(b) Using the neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.

a)

[Network diagram: input layer, hidden layer, output layer; input nodes labeled x12, x21, x22, x23, x31, x32, x33, x34]

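b) Based on (a), the attribute value represented by each input unit and its initial value are set as follows:

Because the initial weights and biases are generated randomly, the initial values are defined here as:

Net input and output of each unit:

Error at each node:

Weight and bias updates: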
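The three steps listed above (net input and output, node errors, weight and bias updates) can be sketched as one backpropagation update. Everything numeric here is an assumption made for illustration, since the original tables are not reproduced: a one-hot encoding with nine inputs (two for Gender, three for Car Type, four for Shirt Size), two hidden units, one sigmoid output, a target of 1 for the instance, a learning rate of 0.9, and hand-picked initial weights. The function name `backprop_step` is likewise hypothetical.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def backprop_step(x, target, w_ih, b_h, w_ho, b_o, lr):
    """One backpropagation update: one hidden layer, one sigmoid output unit.

    w_ih[i][j] is the weight from input i to hidden unit j;
    w_ho[j] is the weight from hidden unit j to the output unit.
    Returns updated copies of the parameters and the pre-update output.
    """
    n_in, n_hid = len(x), len(b_h)
    # Step 1: net input and output of each hidden unit, then of the output unit
    hidden = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(n_in)) + b_h[j])
              for j in range(n_hid)]
    out = sigmoid(sum(hidden[j] * w_ho[j] for j in range(n_hid)) + b_o)
    # Step 2: error of the output unit, then of each hidden unit (using old weights)
    err_o = out * (1 - out) * (target - out)
    err_h = [hidden[j] * (1 - hidden[j]) * err_o * w_ho[j] for j in range(n_hid)]
    # Step 3: weight and bias updates
    new_w_ho = [w_ho[j] + lr * err_o * hidden[j] for j in range(n_hid)]
    new_b_o = b_o + lr * err_o
    new_w_ih = [[w_ih[i][j] + lr * err_h[j] * x[i] for j in range(n_hid)]
                for i in range(n_in)]
    new_b_h = [b_h[j] + lr * err_h[j] for j in range(n_hid)]
    return new_w_ih, new_b_h, new_w_ho, new_b_o, out

# ASSUMED one-hot encoding of (M, Family, Small):
# [M, F | Family, Sports, Luxury | Small, Medium, Large, Extra Large]
x = [1, 0, 1, 0, 0, 1, 0, 0, 0]
target = 1                        # ASSUMED class label encoding for this instance
w_ih = [[0.1, -0.1] for _ in x]   # ASSUMED initial input-to-hidden weights
b_h = [0.0, 0.0]                  # ASSUMED initial hidden biases
w_ho = [0.2, -0.2]                # ASSUMED initial hidden-to-output weights
b_o = 0.0                         # ASSUMED initial output bias
lr = 0.9                          # ASSUMED learning rate

w_ih1, b_h1, w_ho1, b_o1, out0 = backprop_step(x, target, w_ih, b_h, w_ho, b_o, lr)
# A second forward pass shows whether the output moved toward the target
_, _, _, _, out1 = backprop_step(x, target, w_ih1, b_h1, w_ho1, b_o1, lr)
print(out0, out1)
```

Only weights attached to active inputs (those with value 1) change in this update, because each weight change is proportional to the input value.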

Question 3:

3.

a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

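Answer:

a) Let $A = \{A_1, A_2\}$, where $A_1$ denotes undergraduate students and $A_2$ denotes graduate students, and let $B$ denote the event that a student smokes.

From the problem statement:

$P(B|A_1) = 15\%$, $P(B|A_2) = 23\%$, $P(A_1) = 0.8$, $P(A_2) = 0.2$. The quantity to find is $P(A_2|B)$.

By the law of total probability,

$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) = 0.15 \times 0.8 + 0.23 \times 0.2 = 0.166$$

and by Bayes' theorem,

$$P(A_2|B) = \frac{P(B|A_2)P(A_2)}{P(B)} = \frac{0.23 \times 0.2}{0.166} = 0.277$$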

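b) A randomly chosen college student is an undergraduate with probability $P(A_1) = 0.8$ and a graduate student with probability $P(A_2) = 0.2$, so the student is more likely to be an undergraduate. The same holds among smokers: from (a), a randomly chosen smoker is a graduate student with probability 0.277 and an undergraduate with probability 0.723.

c) Let $C$ denote the event that a student lives in a dorm.

Then $P(C|A_2) = 30\%$ and $P(C|A_1) = 10\%$.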

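Treating smoking and dorm residence as independent within each student group, the probability that a student both smokes and lives in a dorm is

$$P(BC) = P(B|A_1)P(C|A_1)P(A_1) + P(B|A_2)P(C|A_2)P(A_2) = 0.15 \times 0.1 \times 0.8 + 0.23 \times 0.3 \times 0.2 = 0.0258$$

$$P(A_2|BC) = \frac{P(B|A_2)P(C|A_2)P(A_2)}{P(BC)} = \frac{0.23 \times 0.3 \times 0.2}{0.0258} = 0.535$$

$$P(A_1|BC) = 1 - P(A_2|BC) = 0.465$$

So a student who smokes and lives in a dorm is more likely to be a graduate student.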
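The arithmetic for this question can be checked with a short script. This is a minimal sketch: the dictionary names are illustrative, and the probabilities are the ones given in the problem statement. For part (c) it compares only the numerators $P(B|A_i)P(C|A_i)P(A_i)$, which is enough to decide which posterior is larger because both share the same denominator $P(BC)$.

```python
# Priors and conditional probabilities from the problem statement
p_type = {"undergrad": 0.8, "grad": 0.2}     # P(A1), P(A2)
p_smoke = {"undergrad": 0.15, "grad": 0.23}  # P(B | Ai)
p_dorm = {"undergrad": 0.10, "grad": 0.30}   # P(C | Ai)

# (a) Law of total probability, then Bayes' theorem
p_b = sum(p_smoke[t] * p_type[t] for t in p_type)
p_grad_given_smoke = p_smoke["grad"] * p_type["grad"] / p_b

# (c) Unnormalized posteriors P(B|Ai) * P(C|Ai) * P(Ai); the larger one wins
score = {t: p_smoke[t] * p_dorm[t] * p_type[t] for t in p_type}
more_likely = max(score, key=score.get)

print(round(p_b, 3), round(p_grad_given_smoke, 3), more_likely)
# prints: 0.166 0.277 grad
```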

Question 4:

4. Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:

A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)

The distance function is Euclidean distance. Suppose initially we assign A1, B1, C1 as the center of each cluster, respectively. Use the K-Means algorithm to show only

(a) The three cluster centers after the first round of execution

(b) The final three clusters

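Answer:

a) Compute the Euclidean distance from each point to each of the three centers.

Round 1:

Assigning each point to its nearest center gives the three clusters:

{A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}

so the new cluster centers are (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2).

Round 2:

The new cluster means are (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2).

b) The means are unchanged from round 1, so the algorithm has converged and the final three clusters are {A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}.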
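The two rounds above can be reproduced with a short K-Means loop. This is a minimal sketch rather than a general implementation: the function names are illustrative, the points and initial centers come from the problem statement, and it assumes no cluster ever becomes empty (true for this data).

```python
from math import dist  # Euclidean distance, Python 3.8+

points = {
    "A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7),
}

def assign(points, centers):
    """Group point names by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda k: dist(p, centers[k]))
        clusters[nearest].append(name)
    return clusters

def mean(points, names):
    """Coordinate-wise mean of the named points (assumes names is non-empty)."""
    coords = [points[n] for n in names]
    return tuple(sum(c[d] for c in coords) / len(coords) for d in range(3))

def kmeans(points, centers):
    """Repeat assign/update until the centers stop changing."""
    rounds = 0
    while True:
        clusters = assign(points, centers)
        new_centers = [mean(points, c) for c in clusters]
        rounds += 1
        if new_centers == centers:
            return clusters, centers, rounds
        centers = new_centers

initial = [points["A1"], points["B1"], points["C1"]]
first_round_centers = [mean(points, c) for c in assign(points, initial)]
clusters, centers, rounds = kmeans(points, initial)
print([tuple(round(v, 2) for v in c) for c in first_round_centers])
print(clusters, rounds)
```

The loop stops in the round where recomputing the means leaves the centers unchanged, which for this data is the second round.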
