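Machine Learning Fundamentals: Final Exam

I. Multiple choice (2 points each, 20 points total)

1. Which of the following algorithms is a supervised learning algorithm?
A. Decision tree
B. K-means
C. Genetic algorithm
D. Random forest

2. Which of the following is an assumption of linear regression?
A. The features are mutually independent
B. The relationship between the features and the target variable is nonlinear
C. The error term of the target variable is normally distributed
D. All features are categorical variables

3. What is the main goal of a support vector machine (SVM)?
A. Find the maximum margin between data points
B. Reduce model complexity
C. Increase the model's generalization ability
D. All of the above

4. In deep learning, what type of data are convolutional neural networks (CNNs) typically used for?
A. Audio data
B. Image data
C. Text data
D. Time series data

5. The main purpose of cross-validation is to:
A. Reduce overfitting
B. Increase model complexity
C. Reduce the size of the training set
D. Increase the model's running time

II. Short answer (10 points each, 30 points total)

6. Explain what overfitting is, and give one strategy for avoiding it.

7. Describe the basic principle of the random forest algorithm, and briefly state its advantages over a single decision tree.

8. Explain how the gradient descent algorithm works, and why it is so important in optimization problems.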

III. Calculation (25 points each, 50 points total)

9. Suppose you have a linear regression model with objective function \( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \), where \( h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \).

Given the following data points:
\[\begin{align*}x_1 & : [1, 2, 3] \\x_2 & : [1, 3, 4] \\y & : [2, 4, 5]\end{align*}\]
compute the model's loss function \( J(\theta) \).
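The cost in question 9 can be evaluated mechanically once a parameter vector is chosen. The question does not state a value for θ, so the two parameter vectors below are hypothetical choices for illustration, and `cost` is an illustrative name rather than part of the exam.

```python
# Data points from question 9.
x1 = [1, 2, 3]
x2 = [1, 3, 4]
y = [2, 4, 5]


def cost(theta):
    """Return J(theta) = 1/(2m) * sum((h_theta(x) - y)^2) for the data above."""
    theta0, theta1, theta2 = theta
    m = len(y)
    squared_errors = [
        (theta0 + theta1 * a + theta2 * b - target) ** 2
        for a, b, target in zip(x1, x2, y)
    ]
    return sum(squared_errors) / (2 * m)


# Hypothetical parameter vectors (not given in the question).
print(cost((0, 0, 0)))  # all-zero parameters: (4 + 16 + 25) / 6 = 7.5
print(cost((1, 0, 1)))  # h(x) = 1 + x2 fits every point, so the cost is 0.0
```

The second call shows that these three points happen to lie exactly on y = 1 + x2, which is why a zero cost is reachable here.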

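10. Given a dataset for a binary classification problem, classify it using a logistic regression model.

Suppose the model's decision boundary is \( w_1 x_1 + w_2 x_2 - \theta = 0 \), where \( w_1 = 0.5 \), \( w_2 = -1 \), \( \theta = 0.5 \).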
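A minimal sketch of how this decision boundary classifies a point. It assumes the usual logistic-regression convention that the positive class lies on the side where the score \( w_1 x_1 + w_2 x_2 - \theta \) is non-negative. The query points are hypothetical, since the question as preserved lists none.

```python
import math

# Parameters from question 10.
w1, w2, theta = 0.5, -1.0, 0.5


def probability(x1, x2):
    """Sigmoid of the signed score w1*x1 + w2*x2 - theta."""
    z = w1 * x1 + w2 * x2 - theta
    return 1.0 / (1.0 + math.exp(-z))


def predict(x1, x2):
    """Assumed convention: class 1 when the probability is at least 0.5."""
    return 1 if probability(x1, x2) >= 0.5 else 0


# Hypothetical query points (the question does not list any).
print(predict(3.0, 0.5))  # score = 1.5 - 0.5 - 0.5 = 0.5 > 0, so class 1
print(predict(1.0, 1.0))  # score = 0.5 - 1.0 - 0.5 = -1.0 < 0, so class 0
```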



1.What is Machine Learning
A computer program can improve its performance automatically with experience.
2.Learning Definition
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Learning: improving with experience at some task
Improve over task T
With respect to performance measure P
Based on experience E
Example:
A checkers learning problem:
T: play checkers
P: percentage of games won in a tournament
E: opportunity to play against itself
Handwriting recognition learning problem
T: recognizing and classifying handwritten words within images
P: percent of words correctly classified
E: a database of handwritten words with given classifications
A robot driving learning problem
T: driving on public four-lane highways using vision sensors
P: average distance traveled before an error (as judged by human overseer)
E: a sequence of images and steering commands recorded while observing a human driver
3.Candidate-Elimination Learning Algorithm
If d is a negative example (process S first)
  Remove from S any hypothesis that is inconsistent with d
  For each hypothesis g in G that is not consistent with d
    Remove g from G
    Add to G all minimal specializations h of g such that
      h is consistent with d
      some member of S is more specific than h
    Remove from G any hypothesis that is less general than another hypothesis in G
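The negative-example step above can be sketched for conjunctive hypotheses written as tuples, with "?" meaning any value. This is an illustrative sketch under simplifying assumptions: the function names are hypothetical, the empty hypothesis is not modeled, and the attribute domains are the EnjoySport values listed in these notes.

```python
# Attribute domains as listed for the EnjoySport task.
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),  # Sky
    ("Warm", "Cold"),              # AirTemp
    ("Normal", "High"),            # Humidity
    ("Strong", "Weak"),            # Wind
    ("Warm", "Cold"),              # Water
    ("Same", "Change"),            # Forecast
]


def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(a == "?" or a == v for a, v in zip(h, x))


def more_general_or_equal(h1, h2):
    """True if every constraint of h1 is '?' or equal to the one in h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))


def minimal_specializations(g, x):
    """Replace one '?' in g by a value that excludes the negative instance x."""
    return [
        g[:i] + (value,) + g[i + 1:]
        for i, a in enumerate(g) if a == "?"
        for value in DOMAINS[i] if value != x[i]
    ]


def update_on_negative(S, G, x):
    # A hypothesis is inconsistent with a negative example if it covers it.
    S = [s for s in S if not covers(s, x)]
    new_G = []
    for g in G:
        if not covers(g, x):
            new_G.append(g)  # already consistent, keep unchanged
            continue
        for h in minimal_specializations(g, x):
            # Keep h only if some member of S is more specific than h.
            if any(more_general_or_equal(h, s) for s in S):
                new_G.append(h)
    # Drop any member that is less general than another member.
    new_G = [g for g in new_G
             if not any(o != g and more_general_or_equal(o, g) for o in new_G)]
    return S, new_G


S2 = [("Sunny", "Warm", "?", "Strong", "Warm", "Same")]
G2 = [("?",) * 6]
negative = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")
S3, G3 = update_on_negative(S2, G2, negative)
for g in G3:
    print(g)
```

With these inputs G is specialized to three hypotheses, constraining Sky to Sunny, AirTemp to Warm, or Forecast to Same, while S is unchanged.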
Inductive bias of candidate-elimination algorithm
{c ∈ H}: the target concept c is contained in the hypothesis space H
Sky: Sunny, Cloudy, Rainy
AirTemp: Warm, Cold
Humidity: Normal, High
Wind: Strong, Weak
Water: Warm, Cold
Forecast: Same, Change
#distinct instances : 3*2*2*2*2*2 = 96
#distinct concepts : 2^96
#syntactically distinct hypotheses : 5*4*4*4*4*4 = 5120
#semantically distinct hypotheses : 1+4*3*3*3*3*3 = 973
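The four counts above follow directly from the number of values per attribute. A short sketch that recomputes them (the variable names are illustrative):

```python
from math import prod

# Number of values per attribute, as listed above.
values = {"Sky": 3, "AirTemp": 2, "Humidity": 2,
          "Wind": 2, "Water": 2, "Forecast": 2}

instances = prod(values.values())
concepts = 2 ** instances  # every subset of the instance space is a concept

# Syntactically, each attribute may also be "?" or the empty constraint.
syntactic = prod(n + 2 for n in values.values())

# Every hypothesis containing an empty constraint classifies all instances
# as negative, so those collapse into one hypothesis; only "?" is then an
# extra option per attribute.
semantic = 1 + prod(n + 1 for n in values.values())

print(instances, syntactic, semantic)  # 96 5120 973
```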
Example Candidate Elimination
4. Information Gain is the expected reduction in entropy caused by partitioning the examples according to this attribute
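As a sketch of this definition, the functions below compute entropy and information gain from positive and negative counts. The function names are illustrative. The example collection [7+, 3-], split into [3+, 2-] and [4+, 1-], uses the same counts as the Humidity exercise that appears later in these notes, where rounded logarithm values give 0.04.

```python
from math import log2


def entropy(positives, negatives):
    """Entropy of a boolean-labeled collection, in bits."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:  # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * log2(p)
    return result


def information_gain(parent, subsets):
    """parent and each subset are (positives, negatives) pairs."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in subsets)
    return entropy(*parent) - remainder


# 10 examples [7+, 3-], partitioned into [3+, 2-] and [4+, 1-].
gain = information_gain((7, 3), [(3, 2), (4, 1)])
print(round(gain, 3))  # 0.035
```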
5.Gradient Descent
6.Bayesian Learning
Two hypotheses: patient has cancer, patient does not
Laboratory test with two possible outcomes: + and -
Prior knowledge: only 0.008 of the population has this disease.
The test returns a correct positive result in only 98% of the cases in which the disease is present
The test returns a correct negative result in only 97% of the cases in which the disease is not present
In summary
P(cancer)=0.008, P(¬cancer)=0.992
P(+|cancer)=0.98, P(-|cancer)=0.02
P(+|¬cancer)=0.03, P(-|¬cancer)=0.97
Problem:
A patient's test returns a positive result. Does the patient have cancer or not?
The MAP hypothesis can be found using Equation (6.2)
P(+|cancer)P(cancer)=0.0078
P(+|¬cancer)P(¬cancer)=0.0298
hMAP=¬cancer
P(cancer|+)=0.0078/(0.0078+0.0298)=0.21
P(¬cancer|+)=0.79
The result of Bayesian inference depends strongly on the prior probabilities. In this example the hypotheses are not completely accepted or rejected, but rather become more or less probable as more data is observed.
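The arithmetic of this example can be checked with a few lines. This is only a sketch of the calculation above; the variable names are illustrative.

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_healthy = 0.03

# Unnormalized posteriors P(+|h) * P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer
score_healthy = p_pos_given_healthy * (1 - p_cancer)

# The MAP hypothesis is the one with the larger unnormalized posterior.
h_map = "cancer" if score_cancer > score_healthy else "not cancer"

# Normalizing gives the actual posterior probability of cancer.
posterior_cancer = score_cancer / (score_cancer + score_healthy)

print(h_map)                       # not cancer
print(round(posterior_cancer, 2))  # 0.21
```

Even after a positive test, the low prior keeps the posterior probability of cancer at about 0.21.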
7.EM
8.Genetic Algorithm
GA(Fitness, Fitness_threshold, p, r, m)
Initialize population: P = generate p hypotheses at random
Evaluate: for each h in P, compute Fitness(h)
While [max Fitness(h)] < Fitness_threshold do
  Create a new generation, PS:
  1. Select: probabilistically select (1-r)p members of P to add to PS. The probability Pr(hi) of selecting hypothesis hi from P is given by
     Pr(hi) = Fitness(hi) / ∑j Fitness(hj)
  2. Crossover: probabilistically select r*p/2 pairs of hypotheses from P, according to Pr(hi) given above. For each pair <h1,h2>, produce two offspring by applying the crossover operator. Add all offspring to PS.
  3. Mutate: choose m percent of the members of PS with uniform probability. For each, invert one randomly selected bit in its representation.
  4. Update: P = PS
  5. Evaluate: for each h in P, compute Fitness(h)
Return the hypothesis from P that has the highest fitness
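A minimal sketch of the GA loop above on bit-string hypotheses. Several details are assumptions made for illustration and do not come from the notes: the fitness function (number of 1 bits), single-point crossover, m given as a fraction rather than a percentage, and a `max_generations` cap so the loop always terminates.

```python
import random


def ga(fitness, fitness_threshold, p, r, m, length, rng, max_generations=200):
    """Evolve bit-string hypotheses; p and r should make (1 - r) * p and r * p even."""
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(p)]
    for _ in range(max_generations):
        scores = [fitness(h) for h in population]
        if max(scores) >= fitness_threshold:
            break
        # Tiny offset keeps the weights valid even if every score is zero.
        weights = [s + 1e-9 for s in scores]
        n_keep = round((1 - r) * p)
        # 1. Select: fitness-proportionate choice of (1 - r) * p members.
        next_gen = [list(h) for h in
                    rng.choices(population, weights=weights, k=n_keep)]
        # 2. Crossover: r * p / 2 pairs, two offspring per pair.
        for _ in range((p - n_keep) // 2):
            h1, h2 = rng.choices(population, weights=weights, k=2)
            cut = rng.randint(1, length - 1)
            next_gen.append(h1[:cut] + h2[cut:])
            next_gen.append(h2[:cut] + h1[cut:])
        # 3. Mutate: flip one random bit in a fraction m of the new members.
        for h in rng.sample(next_gen, round(m * len(next_gen))):
            i = rng.randrange(length)
            h[i] = 1 - h[i]
        # 4. Update; fitness is re-evaluated at the top of the next iteration.
        population = next_gen
    return max(population, key=fitness)


rng = random.Random(0)
best = ga(fitness=sum, fitness_threshold=16, p=20, r=0.6, m=0.05,
          length=16, rng=rng)
print(best, sum(best))
```

Because selection is probabilistic and nothing forces the best member to survive, a run is not guaranteed to reach the threshold before the generation cap.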
9. Fitness Function and Selection
The fitness function defines the criterion for ranking potential hypotheses.
If the task is to learn classification rules, then the fitness function typically has a component that scores the classification accuracy of the rule over a set of provided training examples; other criteria, such as the complexity or generality of the rule, may also be included.
Probabilistic methods for selecting a hypothesis:
Fitness proportionate selection (roulette wheel selection): the probability of selecting a hypothesis h is given by the ratio of its fitness to the total fitness of all members of the current population
Tournament selection
Rank selection
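The three selection methods can be sketched side by side. The function names, the `p_win` parameter, and the toy population are hypothetical; the tournament and rank variants follow the common textbook descriptions (pick two at random and keep the fitter with a fixed probability; select with probability proportional to rank).

```python
import random


def roulette_wheel(population, fitness, rng):
    """Select with probability proportional to fitness."""
    weights = [fitness(h) for h in population]
    return rng.choices(population, weights=weights, k=1)[0]


def tournament(population, fitness, rng, p_win=0.8):
    """Pick two members at random; return the fitter one with probability p_win."""
    h1, h2 = rng.sample(population, 2)
    better, worse = (h1, h2) if fitness(h1) >= fitness(h2) else (h2, h1)
    return better if rng.random() < p_win else worse


def rank_selection(population, fitness, rng):
    """Select with probability proportional to rank (rank 1 = least fit)."""
    ranked = sorted(population, key=fitness)
    ranks = list(range(1, len(ranked) + 1))
    return rng.choices(ranked, weights=ranks, k=1)[0]


# Toy population where each hypothesis is its own fitness value.
rng = random.Random(0)
population = [1, 2, 3, 10]
identity = lambda h: h
picks = [select(population, identity, rng)
         for select in (roulette_wheel, tournament, rank_selection)]
print(picks)
```

Rank selection depends only on the ordering of fitness values, so the outlier 10 dominates less than it does under roulette wheel selection.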



1 Give the definitions or your comprehensions of the following terms. (12’)
1.1 The inductive learning hypothesis (p. 17)
1.2 Overfitting (p. 49)
1.4 Consistent learner (p. 148)

2 Give brief answers to the following questions. (15’)
2.2 If the size of a version space is |VS|, what is in general the smallest number of queries that may be required by a concept learner using an optimal query strategy to perfectly learn the target concept? (p. 27)
2.3 In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. What expression does the following decision tree correspond to?
[decision tree figure missing]

3 Explain inductive bias, and list the inductive bias of the CANDIDATE-ELIMINATION algorithm, decision tree learning (ID3), and the BACKPROPAGATION algorithm. (10’)

4 How can overfitting be addressed in decision trees and neural networks? (10’)
Solution:
● Decision tree:
◆ Stop growing the tree earlier
◆ Post-pruning
● Neural network:
◆ Weight decay
◆ Validation set

5 Prove that the LMS weight update rule \( \omega_i \leftarrow \omega_i + \eta (V_{train}(b) - \hat{V}(b)) x_i \) performs a gradient descent to minimize the squared error. In particular, define the squared error E as in the text. Now calculate the derivative of E with respect to the weight \( \omega_i \), assuming that \( \hat{V}(b) \) is a linear function as defined in the text. Gradient descent is achieved by updating each weight in proportion to \( -\frac{\partial E}{\partial \omega_i} \). Therefore, you must show that the LMS training rule alters weights in this proportion for each training example it encounters.
( \( E \equiv \sum_{\langle b, V_{train}(b) \rangle \in training\ examples} (V_{train}(b) - \hat{V}(b))^2 \) ) (8’)
Solution:
Since \( V_{train}(b) \leftarrow \hat{V}(Successor(b)) \), we have \( E = \sum (V_{train}(b) - \hat{V}(b))^2 \). Because \( \hat{V}(b) \) is linear in the weights, each training example contributes
\( \frac{\partial E}{\partial \omega_i} = -2 (V_{train}(b) - \hat{V}(b)) x_i \).
The LMS rule is \( \omega_i \leftarrow \omega_i + \eta (V_{train}(b) - \hat{V}(b)) x_i \).
Absorbing the constant factor 2 into \( \eta \), this is \( \omega_i \leftarrow \omega_i + \eta (-\frac{\partial E}{\partial \omega_i}) \).
Therefore, gradient descent is achieved by updating each weight in proportion to \( -\frac{\partial E}{\partial \omega_i} \); the LMS rule alters weights in this proportion for each training example it encounters.

6 True or false: if decision tree D2 is an elaboration of tree D1, then D1 is more-general-than D2. Assume D1 and D2 are decision trees representing arbitrary boolean functions, and that D2 is an elaboration of D1 if ID3 could extend D1 to D2. If true, give a proof; if false, a counterexample.
(Definition: Let \( h_j \) and \( h_k \) be boolean-valued functions defined over X. Then \( h_j \) is more_general_than_or_equal_to \( h_k \) (written \( h_j \geq_g h_k \)) if and only if \( (\forall x \in X)[(h_k(x) = 1) \rightarrow (h_j(x) = 1)] \). Then \( h_j >_g h_k \Leftrightarrow (h_j \geq_g h_k) \wedge \neg(h_k \geq_g h_j) \).) (10’)
The hypothesis is false. One counterexample is A XOR B: if the training examples are all positive when A != B and all negative when A == B, then, using ID3 to extend D1, the new tree D2 will be equivalent to D1, i.e., D2 is equal to D1.

7 Design a two-input perceptron that implements the boolean function \( A \wedge \neg B \). Design a two-layer network of perceptrons that implements A XOR B. (10’)

8 Suppose that a hypothesis space contains three hypotheses, \( h_1 \), \( h_2 \), \( h_3 \), and the posterior probabilities of these hypotheses given the training data are 0.4, 0.3 and 0.3 respectively. If a new instance x is encountered which is classified positive by \( h_1 \) but negative by \( h_2 \) and \( h_3 \), give the result and the detailed classification procedure of the Bayes optimal classifier. (10’) (p. 125)

9 Suppose S is a collection of training-example days described by attributes including Humidity, which can have the values High or Normal. Assume S is a collection containing 10 examples, [7+,3-]. Of these 10 examples, suppose 3 of the positive and 2 of the negative examples have Humidity = High, and the remainder have Humidity = Normal. Please calculate the information gain due to sorting the original 10 examples by the attribute Humidity.
(log2 1 = 0, log2 2 = 1, log2 3 = 1.58, log2 4 = 2, log2 5 = 2.32, log2 6 = 2.58, log2 7 = 2.8, log2 8 = 3, log2 9 = 3.16, log2 10 = 3.32) (5’)
Solution:
(a) Denote S = [7+,3-]; then \( Entropy([7+,3-]) = -\frac{7}{10}\log_2\frac{7}{10} - \frac{3}{10}\log_2\frac{3}{10} = 0.886 \).
(b) \( Gain(S, Humidity) = Entropy(S) - \sum_{v \in Values(Humidity)} \frac{|S_v|}{|S|} Entropy(S_v) \)
Values(Humidity) = {High, Normal}
\( S_{High} = \{ s \in S \mid Humidity(s) = High \} \)
\( Entropy(S_{High}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.972 \), \( |S_{High}| = 5 \)
\( Entropy(S_{Normal}) = -\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5} = 0.72 \), \( |S_{Normal}| = 5 \)
Thus \( Gain(S, Humidity) = 0.886 - (\frac{5}{10} \times 0.972 + \frac{5}{10} \times 0.72) = 0.04 \)

10 Finish the following algorithm. (10’)
(1) GRADIENT-DESCENT(training examples, η)
Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., 0.05).
● Initialize each \( \omega_i \) to some small random value
● Until the termination condition is met, Do
  ● Initialize each \( \Delta\omega_i \) to zero.
  ● For each ⟨x, t⟩ in training_examples, Do
    ● Input the instance x to the unit and compute the output o
    ● For each linear unit weight \( \omega_i \), Do
      ________________________
  ● For each linear unit weight \( \omega_i \), Do
      ________________________
(2) FIND-S Algorithm
● Initialize h to the most specific hypothesis in H
● For each positive training instance x
  ● For each attribute constraint \( a_i \) in h
    If ________________________
    Then do nothing
    Else replace \( a_i \) in h by the next more general constraint that is satisfied by x
● Output hypothesis h

1. What is the definition of a learning problem? (5) Use "a checkers learning problem" as an example to state how to design a learning system. (15)
Answer:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience.
(5)
Example: A checkers learning problem:
T: play checkers (1)
P: percentage of games won in a tournament (1)
E: opportunity to play against itself (1)
To design a learning system:
Step 1: Choosing the Training Experience (4)
A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won in the world tournament
Training experience E: games played against itself
In order to complete the design of the learning system, we must now choose
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism
Step 2: Choosing the Target Function (4)
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well).
Step 3: Choosing a Representation for the Target Function (4)
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
x6: the number of red pieces threatened by black.
Thus, our learning program will represent V(b) as a linear function of the form
V(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm. Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board, whereas the weight w0 will provide an additive constant to the board value.

2. Answer:
Find-S & Find-G:
Step 1: Initialize S to the most specific hypothesis in H. (1)
S0: {φ, φ, φ, φ, φ, φ}
Initialize G to the most general hypothesis in H.
G0: {?, ?, ?, ?, ?, ?}
Step 2: The first example is {<Sunny, Warm, Normal, Strong, Warm, Same, +>} (3)
S1: {Sunny, Warm, Normal, Strong, Warm, Same}
G1: {?, ?, ?, ?, ?, ?}
Step 3: The second example is {<Sunny, Warm, High, Strong, Warm, Same, +>} (3)
S2: {Sunny, Warm, ?, Strong, Warm, Same}
G2: {?, ?, ?, ?, ?, ?}
Step 4: The third example is {<Rainy, Cold, High, Strong, Warm, Change, ->} (3)
S3: {Sunny, Warm, ?, Strong, Warm, Same}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
Step 5: The fourth example is {<Sunny, Warm, High, Strong, Cool, Change, +>} (3)
S4: {Sunny, Warm, ?, Strong, ?, ?}
G4: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Finally, all the hypotheses are: (2)
{<Sunny, Warm, ?, Strong, ?, ?>, <Sunny, ?, ?, Strong, ?, ?>, <Sunny, Warm, ?, ?, ?, ?>, <?, Warm, ?, Strong, ?, ?>, <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}

3. Answer:
flog(X) = -X*log(X) - (1-X)*log(1-X), with logarithms to base 2 and 0*log(0) taken as 0.
Step 1: choose the root node:
entropy_all = flog(4/10) = 0.971; (2)
gain_outlook = entropy_all - 0.3*flog(1/3) - 0.3*flog(1) - 0.4*flog(1/2) = 0.296; (1)
gain_temperature = entropy_all - 0.3*flog(1/3) - 0.3*flog(1/3) - 0.4*flog(1/2) = 0.02; (1)
gain_humidity = entropy_all - 0.5*flog(2/5) - 0.5*flog(1/5) = 0.125; (1)
gain_wind = entropy_all - 0.6*flog(5/6) - 0.4*flog(1/4) = 0.256; (1)
Root node is "outlook". (2)
Step 2: choose the second node:
For sunny (humidity or temperature):
entropy_sunny = flog(1/3) = 0.918; (1)
sunny_gain_wind = entropy_sunny - (2/3)*flog(0.5) - (1/3)*flog(1) = 0.252; (1)
sunny_gain_humidity = entropy_sunny - (2/3)*flog(1) - (1/3)*flog(1) = 0.918; (1)
sunny_gain_temperature = entropy_sunny - (2/3)*flog(1) - (1/3)*flog(1) = 0.918; (1)
Choose humidity or temperature. (1)
For rain (wind):
entropy_rain = flog(1/2) = 1; (1)
rain_gain_wind = entropy_rain - (1/2)*flog(1) - (1/2)*flog(1) = 1; (1)
rain_gain_humidity = entropy_rain - (1/2)*flog(1/2) - (1/2)*flog(1/2) = 0; (1)
rain_gain_temperature = entropy_rain - (1/4)*flog(1) - (3/4)*flog(1/3) = 0.311; (1)
Choose wind. (1)
(2) [decision tree figures missing]

4. Answer:
A: The primitive neural units are: perceptron, linear unit and sigmoid unit. (3)
Perceptron: (2)
A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise. More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is
\( o(x_1, \ldots, x_n) = 1 \) if \( w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \), and \( -1 \) otherwise.
The perceptron function is sometimes written as
\( o(\vec{x}) = sgn(\vec{w} \cdot \vec{x}) \)
Linear units: (2)
A linear unit is a unit for which the output o is given by
\( o = \vec{w} \cdot \vec{x} \)
Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
Sigmoid units: (2)
Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of the sigmoid unit, however, the threshold output is a continuous function of its input. More precisely, the sigmoid unit computes its output o as
\( o = \sigma(\vec{w} \cdot \vec{x}) \), where \( \sigma(y) = \frac{1}{1 + e^{-y}} \)
B: (Because the question contains a printing error, either the perceptron rule or the delta rule is acceptable; the delta rule is the one given.)
Derivation process is: (6)
[derivation not present in the source]
Perceptron learning rule:
[formula not present in the source]

5. Answer:
P(no) = 5/14, P(yes) = 9/14 (1)
P(sunny|no) = 3/5 (1)
P(cool|no) = 1/5 (1)
P(high|no) = 4/5 (1)
P(strong|no) = 3/5 (1)
P(no|new instance) ∝ P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = 5/14*3/5*1/5*4/5*3/5 = 0.02057 = 2.057*10^-2 (2)
P(sunny|yes) = 2/9 (1)
P(cool|yes) = 3/9 (1)
P(high|yes) = 3/9 (1)
P(strong|yes) = 3/9 (1)
P(yes|new instance) ∝ P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes) = 9/14*2/9*3/9*3/9*3/9 = 0.005291 = 5.291*10^-3 (2)
ANSWER: NO (2)

6. Answer:
INDUCTIVE BIAS: (8)
Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let \( D_c = \{\langle x, c(x) \rangle\} \) be an arbitrary set of training examples of c. Let \( L(x_i, D_c) \) denote the classification assigned to the instance \( x_i \) by L after training on the data \( D_c \). The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples \( D_c \):
\( (\forall x_i \in X)[(B \wedge D_c \wedge x_i) \vdash L(x_i, D_c)] \)
The futility of bias-free learning: (7)
A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances. In fact, the only reason that the learner was able to generalize beyond the observed training examples is that it was biased by the inductive bias. Unfortunately, the only instances that will produce a unanimous vote are the previously observed training examples. For all the other instances, taking a vote will be futile: each unobserved instance will be classified positive by precisely half the hypotheses in the version space and will be classified negative by the other half.

1 In the EnjoySport learning task, every example day is represented by 6 attributes.
Given that the attribute Sky has three possible values, and that AirTemp, Humidity, Wind, Water and Forecast each have two possible values, explain why the size of the hypothesis space is 973. How would the number of possible instances and possible hypotheses increase with the addition of one attribute A that takes on K possible values?

2 Write the Candidate_Elimination algorithm using version spaces. Assume G is the set of maximally general hypotheses in hypothesis space H, and S is the set of maximally specific hypotheses.

3 Consider the following set of training examples for EnjoySport:

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        sunny  warm     normal    strong  warm   same      yes
2        sunny  warm     high      strong  warm   same      yes
3        rainy  cold     high      strong  warm   change    no
4        sunny  warm     high      strong  cool   change    yes
5        sunny  warm     normal    weak    warm   same      no

(a) What is the entropy of the collection of training examples with respect to the target function classification?
(b) According to the 5 training examples, compute the decision tree that would be learned by ID3, and show the decision tree. (log2 3 = 1.585, log2 5 = 2.322)

4 Give several approaches to avoid overfitting in decision tree learning. How can the correct final tree size be determined?

5 Write the BackPropagation algorithm for a feedforward network containing two layers of sigmoid units.

6 Explain the maximum a posteriori (MAP) hypothesis.

7 Use the Naive Bayes classifier to classify the new instance:
<Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance.

8 Question Eight: The definition of three types of fitness functions in genetic algorithm

Question one: (Give an example, such as a navigation device or checkers.)

Question two:
Initialize: G = {?, ?, ?, ?, ?, ?}, S = {φ, φ, φ, φ, φ, φ}
Step 1: G = {?, ?, ?, ?, ?, ?}, S = {sunny, warm, normal, strong, warm, same}
Step 2 (positive instance 2): G = {?, ?, ?, ?, ?, ?}, S = {sunny, warm, ?, strong, warm, same}
Step 3 (negative instance 3): G = <sunny, ?, ?, ?, ?, ?> <?, warm, ?, ?, ?, ?> <?, ?, ?, ?, ?, same>, S = {sunny, warm, ?, strong, warm, same}
Step 4 (positive instance 4): S = {sunny, warm, ?, strong, ?, ?}, G = <sunny, ?, ?, ?, ?, ?> <?, warm, ?, ?, ?, ?>

Question three:
(a) Entropy(S) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
(b) Gain(S, Sky) = Entropy(S) - [(4/5) Entropy(S_sunny) + (1/5) Entropy(S_rainy)] = 0.322
Gain(S, AirTemp) = Gain(S, Wind) = Gain(S, Sky) = 0.322
Gain(S, Humidity) = Gain(S, Forecast) = 0.02
Gain(S, Water) = 0.171
Choose any one of AirTemp, Wind and Sky as the top node.
The decision tree is as follows (if Sky is chosen as the top node):
[decision tree figure missing]

Question Four:
Answer: Inductive bias: some prior assumption about the target concept made by the learner in order to have a basis for classifying unseen instances.
Suppose L is a machine learning algorithm and Dc is a set of training examples. L(xi, Dc) denotes the classification assigned to xi by L after training on Dc. Then the inductive bias is a minimal set of assertions B such that, given an arbitrary target concept c and set of training examples Dc:
\( (\forall x_i \in X)[(B \wedge D_c \wedge x_i) \vdash L(x_i, D_c)] \)
C_E: the target concept is contained in the given hypothesis space H.
ID3: (a) small trees are preferred over larger trees; (b) trees that place high information gain attributes close to the root are preferred over those that do not.
BP: smooth interpolation between data points.

Question Five:
Answer: In naive Bayes classification, we assume that all attributes are conditionally independent given the target value, while a Bayesian belief network specifies a set of conditional independence assumptions along with a set of probability distributions.

Question Six: stochastic gradient descent algorithm

Question Seven: naive Bayes example

Question Eight: The definition of three types of fitness functions in genetic algorithm
Answer: To select a hypothesis according to the fitness function, three methods are commonly used: roulette wheel selection, tournament selection and rank selection.

Question nine:
Single-point crossover:
Two-point crossover:
Offspring: ()
Uniform crossover:
Point mutation:
Any mutation is acceptable.

1 Solution:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Example: (point out the T, P, E of the example)
A checkers learning problem.
A handwriting recognition learning problem.
A robot driving learning problem.

2 Solution:
S0: {φ, φ, φ, φ, φ, φ}
S1: {Sunny, Warm, Normal, Strong, Warm, Same}
S2: {Sunny, Warm, ?, Strong, Warm, Same}
G0, G1, G2: {?, ?, ?, ?, ?, ?}
S3: {Sunny, Warm, ?, Strong, Warm, Same}
G3: {Sunny, ?, ?, ?, ?, ?} {?, Warm, ?, ?, ?, ?} {?, ?, ?, ?, ?, Same}
S4: {Sunny, Warm, ?, Strong, ?, ?}
G4: {Sunny, ?, ?, ?, ?, ?} {?, Warm, ?, ?, ?, ?}

3 Solution:
(a) Denote S = [7+,3-]; then \( Entropy([7+,3-]) = -\frac{7}{10}\log_2\frac{7}{10} - \frac{3}{10}\log_2\frac{3}{10} = 0.886 \).
(b) \( Gain(S, Humidity) = Entropy(S) - \sum_{v \in Values(Humidity)} \frac{|S_v|}{|S|} Entropy(S_v) \)
Values(Humidity) = {High, Normal}
\( S_{High} = \{ s \in S \mid Humidity(s) = High \} \)
\( Entropy(S_{High}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.972 \), \( |S_{High}| = 5 \)
\( Entropy(S_{Normal}) = -\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5} = 0.72 \), \( |S_{Normal}| = 5 \)
Thus \( Gain(S, Humidity) = 0.886 - (\frac{5}{10} \times 0.972 + \frac{5}{10} \times 0.72) = 0.04 \)

4 In general, inductive bias is some form of prior assumption regarding the identity of the target concept, made by a learner in order to have a rational basis for classifying unseen instances.
CANDIDATE-ELIMINATION: The target concept c is contained in the given hypothesis space H.
Decision tree learning (ID3): Shorter trees are preferred over larger trees. Trees that place high information gain attributes close to the root are preferred over those that do not.
BACKPROPAGATION algorithm: smooth interpolation between data points.

5 Solution: (1) (2)

6 (3) GRADIENT-DESCENT(training examples, η)
Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., 0.05).
● Initialize each \( \omega_i \) to some small random value
● Until the termination condition is met, Do
  ● Initialize each \( \Delta\omega_i \) to zero.
  ● For each ⟨x, t⟩ in training_examples, Do
    ● Input the instance x to the unit and compute the output o
    ● For each linear unit weight \( \omega_i \), Do
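The GRADIENT-DESCENT outline for a linear unit can be sketched as follows, using the standard batch delta rule: accumulate η(t - o)x_i over all examples, then add the accumulated change to each weight. The fixed number of epochs as termination condition, the function signature, and the synthetic data are assumptions made for illustration.

```python
import random


def gradient_descent(training_examples, eta, epochs, rng):
    """Batch gradient descent for a single linear unit o = w . x."""
    n = len(training_examples[0][0])
    # Initialize each weight to some small random value.
    w = [rng.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):  # assumed termination condition
        delta = [0.0] * n    # initialize each weight change to zero
        for x, t in training_examples:
            # Compute the linear unit's output for this instance.
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        for i in range(n):
            w[i] += delta[i]
    return w


# Hypothetical data generated from t = 1 + 2*x; x[0] = 1 is the bias input.
examples = [((1.0, k / 4), 1.0 + 2.0 * (k / 4)) for k in range(5)]
weights = gradient_descent(examples, eta=0.05, epochs=2000,
                           rng=random.Random(0))
print([round(w, 3) for w in weights])  # [1.0, 2.0]
```

Because the data are noise-free and the learning rate is small enough for this dataset, the weights converge to the generating values 1 and 2.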



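Machine Learning Exam

I. Multiple choice

1. What is machine learning?
a) An artificial intelligence technique
b) An automatic programming method
c) A human-computer interaction interface
d) A traditional data processing method

2. Which of the following is not a main task of machine learning?
a) Classification
b) Regression
c) Clustering
d) Ranking

3. What is the goal of a machine learning algorithm?
a) Maximize accuracy
b) Minimize computation time
c) Minimize learning error
d) Maximize the size of the training data

II. True or false

1. Supervised learning is a learning method that uses labeled data.

2. Unsupervised learning can learn from data automatically without labels.

3. A decision tree is an unsupervised learning algorithm.

III. Short answer

1. Briefly explain the difference between supervised and unsupervised learning.

2. What is overfitting? How can it be addressed?

3. Give an example of an application scenario for clustering algorithms.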


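Hints:
1. You may use a third-party machine learning library (such as scikit-learn) to implement the linear regression model.

2. The dataset needs to be split into a training set and a test set, used for training and evaluating the model.

3. You may use mean squared error (MSE) as the model evaluation metric.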
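The workflow in these hints (split, fit, evaluate) can be sketched without third-party dependencies. The block below is a plain-Python stand-in, not scikit-learn itself: with scikit-learn, the same steps would typically use `train_test_split`, `LinearRegression` and `mean_squared_error`. The dataset is synthetic and hypothetical, since the original task's data is not included here.

```python
import random

rng = random.Random(0)

# Hypothetical synthetic dataset: y = 3x + 2 plus Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(50)]
data = [(x, 3 * x + 2 + rng.gauss(0, 0.5)) for x in xs]

# Split into 80% training data and 20% test data.
rng.shuffle(data)
train, test = data[:40], data[40:]


def fit(points):
    """Least-squares slope and intercept for a single feature."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared difference between predictions and targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


slope, intercept = fit(train)
mse = mean_squared_error(test, slope, intercept)
print(slope, intercept, mse)
```

Evaluating on the held-out test split, rather than on the training data, is what makes the MSE an estimate of generalization error.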






In particular, define the squared error E as in the text. NowAcalculate the derivative of E with respect to the weight i, assuming that V (b) is a linear function as defi ned in the text. Gradie nt desce nt is achieved by updat ing each weight i n proport ion Eto --------- . Therefore, you must show that the LMS trai ning rule alters weights in this proporti on iA2for each training example it encounters. ( E (V train (b) V(b)) ) (8'b ,V t r ain (b) training exampleSolution :As Vtrai n(b) \? (Successor(b))we can get E= (V train (b) V(b))2=2(V train(b)伽)中As mentioned in LMS: i i (V train (b) \?(b))X iWe can get i i ( E / w i)Therefore, gradient descent is achievement by updating each weight in proportion to E / w i;LMS rules alters weights in this proportion for each training example it encounters.6 True or false: if decisi on tree D2 is an elaborati on of tree D1, the n D1 is more-ge neral-tha n D2. Assume D1 and D2 are decision trees representing arbitrary boolean funcions, and that D2 is an elaboratio n of D1 if ID3 could exte nd D1 to D2. If true give a proof; if false, a coun ter example. (Definition: Let h j and h k be boolean-valued functions defined over X .then h j ismore_ge neral_tha n_o r_equal_to h k (writte n h j g h k ) If and only if (x X)[(h k(x) 1) (h j(x) 1)] then h j h k (h j g h k )(h k g h j)) (10 'The hypothesis is false.One cou nter example is A XOR B while if A!=B, trai ning examples are all positive, while if A==B, trai ning examples are all n egative, then, usi ng ID3 to exte nd D1, the new tree D2 will be equivale nt to D1, i.e., D2 is equal to D1.7 Design a two-input perceptron that implements the boolean function A B .Design atwo-layer network of perceptrons that implements A XOR B . (10 '8 Suppose that a hypothesis space containing three hypotheses, h!, h2,h3, and the posteriorprobabilities of these typotheses given the training data are 0.4, 0.3 and 0.3 respectively. 
And if anew instanee x is encountered, which is classified positive by g, but negative by h2andh3,then give the result and detail classification course of Bayes optimal classifier.(10 'P1259 Suppose S is a collection of training-example days described by attributes including Humidity, which can have the values High or Normal. Assume S is a collection containing 10 examples, [7+,3_]. Of these 10 examples, suppose 3 of the positive and 2 of the negative examples have Humidity = High, and the rema in der have Humidity = Normal. Please calculate the in formati on gain due to sorting the original 10 examples by the attribute Humidity.( log 2l=0, log 22=1, Iog 23=1.58, Iog 24=2, Iog 25=2.32, Iog 26=2.58, Iog 27=2.8, Iog 28=3, Iog 29=3.16, Iog 2l0=3.32,) (5' Solution :(a)Here we denote S=[7+,3-],then Entropy([7+,3-])=丄 l^ 上? I^ ? =0.886;10 10 10 10(b) Gai n(S,Humidity)=E ntropy(S)-v values(Humidity JQ Values(Humidity )={High, Normal}S High {s S|Humidity (s) High}Each trai ning example is a pair of the form ;. x,t ;:, where x is the vector of in put values,Initialize eachi to some small random valueUn til the term in atio n con diti on is met, DoInitialize each i to zero.For each ( x, n in training_examples, DoIn put the in sta nee x to the un it and compute the output o For each linear unit weight i , Do For each linear unit weight i , Do(2) FIND-S AlgorithmIn itialize h to the most specific hypothesis in H For each positive trai ning in sta nee xFor each attribute constraint a i in h—Entropy(Sz) Gain(S,a2)3 3 2 2Entropy(S High )=-Jog2[-匸log ?匚 0.972, 0 5 5 5 54 4 En tropy(S Normal )=-:Iog 2 匚5 55 Thus Gain (S,Humidity)=0.886- ( 0.972 10 Fin ish the followi ng algorithm. (10 '(1) GRADIENT-DESCENT(training examples,)igh 5 =44 V 1 log ? 
0.72 , S N5 55*0.72) =0.0410ormal=5and t is the target output value.is the lear ning rate (e.g., 0.05).If ________________________The ndo nothingElsereplace a i in h by the n ext more gen eral con stra int that is satisfied by x Output hypothesis h1. What is the defi niti on of lear ning problem(5)Use a checkers learning problem ” as an example to state how to design a learning system.(15)An swer:A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experie nce.(5)Example:A checkers lear ning problem:T: play checkers(1)P: perce ntage of games won in a tour name nt (1) E: opport un ity to play aga inst itself (1)To desig n a lear ning system:Step 1: Choos ing the Trai ning Experie nce(4)A checkers lear ning problem:Task T: play ing checkersPerforma nce measure P: perce nt of games won in the world tourn ame ntTraining experie nce E: games played aga in st itselfIn order to complete the design of the learning system, we must now choose1. the exact type of kno wledge to be lear ned2. a represe ntati on for this target kno wledge3. a lear ning mecha nismStep 2: Choos ing the Target Function ⑷1. if b is a final board state that is won, then V(b)=1002. if b is a final board state that is lost, then V (b)=-1003. if b is a final board state that is draw n, the n V (b)=04. 
if b is a not a final state in the game, then V(b)=V (b'), where b' is the best final board statethat can be achieved starting from b and playing optimally un til the end of the game (assuming the opp onent plays optimally, as well).Step 3: Choos ing a Represe ntati on for the Target Function (4) x1: the nu mber of black pieces on the boardx2: the nu mber of red pieces on the boardx3: the nu mber of black kings on the boardx4: the number of red kings on the boardx5: the number of black pieces threatened by red (i.e., which can be captured on red's ext turn)x6: the number of red pieces threatened by black.Thus, our learning program will represent V (b) a's a linear function of the formV (b)=w o+w l x l+w2x2+w 3x3+w 4x4+w5x5+w6x6 where w o through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm. Learned values for the weights w 1 through w 6 will determine the relative importance of the various board features in determining the value of the board, whereas the weight wo will provide an additive constant to the board value.2. Answer: Find-S & Find-G: Step 1: Initialize S to the most specific hypothesis in H. 
(1)S0:{ , , , , , }Initialize G to the most general hypothesis in H.G0:{, , , , , }.Step 2: The first example is {<Sunny, Warm, Normal, Strong, Warm, Same, +>} (3)S1:{Sunny, Warm, Normal, Strong, Warm, Same} G1:{, , , , , }.Step 3: The second example is {<Sunny, Warm, High, Strong, Warm, Same, +>} (3) S2:{Sunny, Warm, , Strong, Warm, Same} G2:{, , , , , }.Step 4: The third example is {<Rainy, Cold, High, Strong, Warm, Change, ->} (3)S3:{ Sunny, Warm, , Strong, Warm, Same } G3:{<Sunny, , , , , >, <, Warm, , , , >, <, , , , , Same>} Step 5: The fourth example is {<Sunny, Warm, High, Strong, Cool, Change, +>} (3)S4:{Sunny, Warm, , Strong, , } G4:{<Sunny, , , , , >, <, Warm, , , , > }Finally, all the hypotheses are: (2) {<Sunny, Warm, , Strong, , >, <Sunny, , , Strong, , >, <Sunny, Warm, , , , >,<, Warm, , Strong, , >, <Sunny, , , , , >, <, Warm, , , , > }3. Answer: Flog(X) = -X*log(X)-(1-X)*log(1-X); STEP1 choose the root node: entropy_all =flog(4/10)=0.971; (2) gain_outlook = entropy_all - 0.3*flog(1/3) - 0.3*flog(1) - 0.4*flog(1/2)=0.296; (1) gain_templture = entropy_all - 0.3*flog(1/3) - 0.3*flog(1/3) - 0.4*flog(1/2)=0.02; (1)step 2 choose the sec ond NODE:for sunny (humidity OR temperature):en tropy_su nny = flog(1/3)=0.918; (1) sunn y_gain_wi nd = en tropy_su nny - (2/3)*flog(0.5) - (1/3)*flog(1)=0.252; (1) sunn y_gain_humidity = en tropy_su nny - (2/3)*flog(1) - (1/3)*flog(1)=0.918;(1)sunn y_gain_temperature = en tropy_su nny - (2/3)*flog(1) - (1/3)*flog(1)=0.918; (1) choose humidity or temperature. (1)for rain (win d):en tropy_rain = flog(1/2)=1; (1)rain_gain_wi nd = en tropy_rain - (1/2)*flog(1) - (1/2)*flog(1)=1; (1)rain_gain_humidity = en tropy_rain - (1/2)*flog(1/2)-(1/2)*flog(1/2)=0; (1)rain_gain_temperature = en tropy_rain - (1/4)*flog(1)- (3/4)*flog(1/3)=0.311; (1) gain_wind = en tropy_all - 0.6*flog(5/6)(1)Root Node is(2)⑴0.4*flog(1/4)=0.256;outlook ”: orgain_humidity = entropy_all - 0.5*flog(2/5) - 0.5*flog(1/5)=0.125;4. 
Answer:
A: The primitive neural units are: perceptron, linear unit and sigmoid unit. (3)
Perceptron: (2)
A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise. More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is
o(x1, ..., xn) = 1 if w0 + w1x1 + ... + wnxn > 0, and -1 otherwise.
Sometimes we write the perceptron function as
o(x) = sgn(w · x).
Linear units: (2)
A linear unit is a unit for which the output o is given by
o(x) = w · x.
Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
Sigmoid units: (2)
The sigmoid unit is illustrated in the picture. Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of the sigmoid unit, however, the threshold output is a continuous function of its input. More precisely, the sigmoid unit computes its output o as
o = σ(w · x), where σ(y) = 1 / (1 + e^(-y)).
B: (Because the question contains a printing error, either the perceptron rule or the delta rule is acceptable; the delta rule is given here.)
Derivation process is: (6)
Perceptron rule (perceptron learning rule)
5. Answer:
P(no) = 5/14, P(yes) = 9/14 (1)
P(sunny|no) = 3/5 (1)
P(cool|no) = 1/5 (1)
P(high|no) = 4/5 (1)
P(strong|no) = 3/5 (1)
P(no|new instance) ∝ P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = 5/14*3/5*1/5*4/5*3/5 = 0.02057 = 2.057*10^-2 (2)
P(sunny|yes) = 2/9 (1)
P(cool|yes) = 3/9 (1)
P(high|yes) = 3/9 (1)
P(strong|yes) = 3/9 (1)
P(yes|new instance) ∝ P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes) = 9/14*2/9*3/9*3/9*3/9 = 0.005291 = 5.291*10^-3 (2)
ANSWER: NO, because the score for "no" is the larger of the two. (2)
6. Answer:
INDUCTIVE BIAS: (8)
Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c.
Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the data Dc. The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc:
(∀xi ∈ X)[(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
The futility of bias-free learning: (7)
A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances. In fact, the only reason that the learner was able to generalize beyond the observed training examples is that it was biased by the inductive bias.
Unfortunately, the only instances that will produce a unanimous vote are the previously observed training examples. For all the other instances, taking a vote will be futile: each unobserved instance will be classified positive by precisely half the hypotheses in the version space and will be classified negative by the other half.

1 In the EnjoySport learning task, every example day is represented by 6 attributes. The attribute Sky has three possible values, and AirTemp, Humidity, Wind, Water and Forecast each have two possible values. Explain why the size of the hypothesis space is 973. How would the number of possible instances and possible hypotheses increase with the addition of one attribute A that takes on K possible values?
2 Write the algorithm of Candidate_Elimination using version space. Assume G is the set of maximally general hypotheses in hypothesis space H, and S is the set of maximally specific hypotheses.
3 (a) What is the entropy of the collection of training examples with respect to the target function classification?
(b) According to the 5 training examples, compute the decision tree that would be learned by ID3, and show the decision tree. (log2 3 = 1.585, log2 5 = 2.322)
4 Give several approaches to avoid overfitting in decision tree learning.
How to determine the correct final tree size?
5 Write the BackPropagation algorithm for a feedforward network containing two layers of sigmoid units.
6 Explain the Maximum a posteriori (MAP) hypothesis.
7 Use the Naive Bayes Classifier to classify the new instance:
<Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance.
8 The definition of three types of fitness functions in genetic algorithm

Question one:
(Give an example, such as a navigation device or checkers.)
Question two:
Initialize: G = {<?, ?, ?, ?, ?, ?>}, S = {<∅, ∅, ∅, ∅, ∅, ∅>}
Step 1: G = {<?, ?, ?, ?, ?, ?>}, S = {<sunny, warm, normal, strong, warm, same>}
Step 2: positive instance 2 arrives.
G = {<?, ?, ?, ?, ?, ?>}, S = {<sunny, warm, ?, strong, warm, same>}
Step 3: negative instance 3 arrives.
G = {<sunny, ?, ?, ?, ?, ?>, <?, warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, same>}
S = {<sunny, warm, ?, strong, warm, same>}
Step 4: positive instance 4 arrives.
S = {<sunny, warm, ?, strong, ?, ?>}
G = {<sunny, ?, ?, ?, ?, ?>, <?, warm, ?, ?, ?, ?>}
Question three:
(a) Entropy(S) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
(b) Gain(S, sky) = Entropy(S) - [(4/5)Entropy(S_sunny) + (1/5)Entropy(S_rainy)] = 0.322
Gain(S, AirTemp) = Gain(S, wind) = Gain(S, sky) = 0.322
Gain(S, Humidity) = Gain(S, Forecast) = 0.02
Gain(S, water) = 0.171
Choose any one of AirTemp, wind and sky as the top node.
The decision tree is as follows (if sky is chosen as the top node):
Question Four:
Answer:
Inductive bias: some prior assumption about the target concept, made by the learner in order to have a basis for classifying unseen instances.
Suppose L is a machine learning algorithm and X is the set of instances. L(xi, Dc) denotes the classification assigned to xi by L after training on the examples Dc.
Then the inductive bias is a minimal set of assertions B such that, given an arbitrary target concept c and set of training examples Dc:
(∀xi ∈ X)[(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
C_E: the target concept is contained in the given hypothesis space H.
ID3: (a) small trees are preferred over larger trees; (b) trees that place high information gain attributes close to the root are preferred over those that do not.
BP: smooth interpolation between data points.
Question Five:
Answer: In naive Bayes classification, we assume that all attributes are independent given the target value, while a Bayes belief net specifies a set of conditional independence assumptions along with a set of probability distributions.
Question Six:
Stochastic gradient descent algorithm
Question Seven:
Naive Bayes example
Question Eight: The definition of three types of fitness functions in genetic algorithm
Answer:
In order to select a hypothesis according to the fitness function, there are three common methods: roulette wheel selection, tournament selection and rank selection.
Question nine:
Single-point crossover:
Two-point crossover:
Offspring: ()
Uniform crossover:
Point mutation:
Any mutation is ok!

1 Solution:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Example: (point out the T, P, E of the example)
A checkers learning problem.
A handwriting recognition learning problem.
A robot driving learning problem.
2 Solution:
S0: {<∅, ∅, ∅, ∅, ∅, ∅>}
S1: {<Sunny, Warm, Normal, Strong, Warm, Same>}
S2: {<Sunny, Warm, ?, Strong, Warm, Same>}
G0, G1, G2: {<?, ?, ?, ?, ?, ?>}
S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
G3: {<Sunny, ?, ?, ?, ?, ?>} ∪ {<?, Warm, ?, ?, ?, ?>} ∪ {<?, ?, ?, ?, ?, Same>}
S4: {<Sunny, Warm, ?, Strong, ?, ?>}
G4: {<Sunny, ?, ?, ?, ?, ?>} ∪ {<?, Warm, ?, ?, ?, ?>}
3 Solution:
4 In general, inductive inference: Some form of prior assumptions regarding the identity of the target concept made by a
learner to have a rational basis for classifying unseen instances.
Formally: (∀xi ∈ X)[(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
CANDIDATE-ELIMINATION: The target concept c is contained in the given hypothesis space H.
Decision tree learning (ID3): Shorter trees are preferred over larger trees. Trees that place high information gain attributes close to the root are preferred over those that do not.
BACKPROPAGATION algorithm: smooth interpolation between data points.

(a) Here we denote S = [7+, 3-], then
Entropy([7+, 3-]) = -(7/10)log2(7/10) - (3/10)log2(3/10) = 0.881;
(b) Gain(S, Humidity) = Entropy(S) - Σ_{v ∈ Values(Humidity)} (|Sv|/|S|) Entropy(Sv)
Values(Humidity) = {High, Normal}
S_High = {s ∈ S | Humidity(s) = High}, |S_High| = 5
Entropy(S_High) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
Entropy(S_Normal) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.722
Thus Gain(S, Humidity) = 0.881 - ((5/10)*0.971 + (5/10)*0.722) = 0.035

5 Solution: (1) (2)
6 (3) GRADIENT-DESCENT(training_examples, η)
Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
Initialize each wi to some small random value.
Until the termination condition is met, Do
  Initialize each Δwi to zero.
  For each <x, t> in training_examples, Do
    Input the instance x to the unit and compute the output o.
    For each linear unit weight wi, Do
      Δwi ← Δwi + η(t - o)xi
  For each linear unit weight wi, Do
    wi ← wi + Δwi
(a) n+1
8 Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
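
The GRADIENT-DESCENT procedure for a linear unit given above can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the function name gradient_descent, the fixed number of epochs standing in for the termination condition, the seeded random initialisation, the explicit bias input x0 = 1, and the toy data (generated from t = 1 + 2x) are illustrative choices that do not come from the original answer.

```python
import random


def gradient_descent(training_examples, eta=0.05, epochs=2000, seed=0):
    """Batch gradient descent for one linear unit o = w0 + w1*x1 + ... + wn*xn.

    training_examples is a list of (x, t) pairs, where x is a tuple of inputs
    and t is the target output. A fixed epoch count is used as the
    termination condition (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    # Initialize each weight to a small random value (w[0] is the bias weight).
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        # Initialize each accumulated weight change to zero.
        delta = [0.0] * (n + 1)
        for x, t in training_examples:
            xs = [1.0] + list(x)  # prepend the constant input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i in range(n + 1):
                delta[i] += eta * (t - o) * xs[i]
        # Apply the accumulated changes once per pass over the data.
        for i in range(n + 1):
            w[i] += delta[i]
    return w


# Hypothetical toy data generated from t = 1 + 2*x.
examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
weights = gradient_descent(examples)
print(weights)
```

On this toy data the learned weights should approach w0 = 1 and w1 = 2. Updating the weights inside the inner loop instead of accumulating into delta would give the stochastic (incremental) variant.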

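
The entropy and information gain calculations used in the decision tree answers above can be checked with a short Python sketch. The helper names entropy and information_gain are hypothetical, and the class counts are the ones quoted in the solutions ([7+, 3-] split by Humidity into [3+, 2-] and [4+, 1-], and [3+, 2-] split by sky into [3+, 1-] and [0+, 1-]); treat it as an illustration rather than a full ID3 implementation.

```python
import math


def entropy(pos, neg):
    """Entropy (base 2) of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # by convention 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent, subsets):
    """Gain = Entropy(parent) - sum over subsets of (|Sv|/|S|) * Entropy(Sv).

    parent is a (pos, neg) pair; subsets is a list of (pos, neg) pairs that
    partition the parent collection.
    """
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in subsets)
    return entropy(*parent) - remainder


# S = [7+, 3-], split by Humidity into High = [3+, 2-] and Normal = [4+, 1-].
gain_humidity = information_gain((7, 3), [(3, 2), (4, 1)])

# S = [3+, 2-], split by sky into sunny = [3+, 1-] and rainy = [0+, 1-].
gain_sky = information_gain((3, 2), [(3, 1), (0, 1)])

print(round(entropy(7, 3), 3), round(gain_humidity, 3), round(gain_sky, 3))
```

ID3 would compute such a gain for every candidate attribute and place the attribute with the largest gain at the current node.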