Bayesian Decision Theory
Test Data
[Figure: the test data are fed to the prediction model to estimate accuracy; the model is then applied to unseen data to predict Tenured?]
Name     Rank            Years  Tenured
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Bayes theorem
P(H|X) = P(X|H) P(H) / P(X)
where P(H|X) is the posterior probability of the hypothesis H conditioned on the data sample X, P(H) is the prior probability of H, P(X|H) is the posterior probability of X conditioned on H, P(X) is the prior probability of X
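A quick numeric check of the theorem, using the training table from these slides (6 instances, 3 tenured): take H as "tenured = yes" and X as "rank = Assistant Prof" (Mike, Mary, Dave). The variable names below are ours, purely for illustration.

```python
# Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h = 3 / 6          # prior P(H): 3 of the 6 training instances are tenured
p_x = 3 / 6          # prior P(X): 3 of the 6 are Assistant Profs
p_x_given_h = 1 / 3  # of the 3 tenured instances, only Mary is an Assistant Prof

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.333... : P(tenured | Assistant Prof) = 1/3
```

The result agrees with counting directly: of the three Assistant Profs in the table, exactly one (Mary) is tenured.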
Training Data
Name  Rank            Years  Tenured
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
P(xk|Ci) = Sik / Si for a categorical attribute Ak, and
P(xk|Ci) = g(xk, μ_Ci, σ_Ci) = (1 / (√(2π) σ_Ci)) exp(−(xk − μ_Ci)² / (2σ_Ci²)) for a continuous attribute Ak
class Ci (i = 1, …, m) attribute Ak (k = 1, …, n) feature vector X = (x1, x2, …, xn), where xk is the value of X on Ak
the naive Bayesian classifier returns the maximum a posteriori hypothesis, i.e. the class Ci such that P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
usually 2/3 of the data set is used for training and the remaining 1/3 for testing
• hold-out with random subsampling: repeat hold-out test for k times
• Speed
the computational cost involved in generating and using the model
training time cost vs. test time cost
usually, larger training time cost but smaller test time cost
Classification vs. Regression
Classification predicts categorical class labels; Regression models continuous-valued functions, i.e. predicts numerical values
• Robustness
the ability of the model to deal with noise or missing values
• Scalability
the ability of the model to deal with huge volume of data
• Comprehensibility
• k-fold cross-validation
partition the data set into k mutually exclusive subsets of approximately equal size; perform training and testing k times, where in the i-th round the i-th subset is used for testing and the remaining subsets are collectively used for training; 10-fold cross-validation is often used
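The procedure can be sketched in a few lines of plain Python (the names `k_fold_cv` and `train_and_test` are ours, not from the slides; in practice a library helper such as scikit-learn's `KFold` would do the partitioning):

```python
import random

def k_fold_cv(data, k, train_and_test):
    """Split `data` into k mutually exclusive, roughly equal-sized folds.
    In round i, the i-th fold is the test set and the remaining folds are
    collectively the training set. `train_and_test(train, test)` must return
    an accuracy in [0, 1]; the mean accuracy over the k rounds is returned."""
    data = data[:]                           # shuffle a copy, not the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # round-robin split -> near-equal sizes
    accs = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accs.append(train_and_test(train, test))
    return sum(accs) / k
```

Every instance appears in exactly one test fold, so each instance is tested exactly once over the k rounds.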
according to Bayes theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
since P(X) is a constant for all classes, only P(X|Ci)P(Ci) need be maximized
Naive Bayesian classifier (II)
e.g., IF rank = professor OR years > 6 THEN tenured = yes
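The example rule is trivial to express as code (a hedged sketch; the function name `predict_tenured` is ours):

```python
def predict_tenured(rank, years):
    """The slides' example rule model:
    IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Professor", 2))       # yes
print(predict_tenured("Assistant Prof", 3))  # no
```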
Two step process of prediction (II)
Step 2: Use the model to predict unseen instances
Unsupervised learning
• the labels of training data are unknown
• given a set of observations, discover the inherent properties of the data, such as the existence of classes or clusters
• usually: clustering, density estimation
Thomas Bayes (1701?-1761)
Naive Bayesian classifier (I)
also called simple Bayesian classifier
class conditional independence: assume that the effect of an attribute value on a given class is independent of the values of other attributes
How to evaluate prediction algorithms?
• Generalization
the ability of the model to correctly predict unseen instances, usually measured by predictive accuracy
• stratified cross-validation: the class distribution of the subsets is approximately the same as that in the initial data set
• leave-one-out: k equals the number of instances in the initial data set
Two step process of prediction (I)
Step 1: Construct a model to describe a training set
• the set of tuples used for model construction is called the training set
• the set of tuples can also be called a sample (an individual tuple can be called a sample as well)
• a tuple is usually called an example (usually with the label) or an instance (usually without the label)
• the attribute to be predicted is called the label
[Figure: Training Data → Training algorithm → Prediction model]
What is Bayesian classification?
Bayesian classification is based on Bayes theorem
Bayesian classifiers have exhibited high accuracy and fast speed when applied to large databases
e.g., an unseen instance: (Jeff, Professor, 7)
Supervised vs. Unsupervised learning
Supervised learning
• the training data are accompanied by labels indicating the desired outputs of the observations
• the concerned property of unseen data is predicted
• usually: classification, regression
the level of interpretability of the model
How to estimate accuracy?
two popular methods:
• hold-out
partition the data set into two independent subsets, i.e. a training set and a test set
• before using the model, we can estimate its accuracy on a test set
• the test set is different from the training set
• the desired output of a test instance is compared with the actual output from the model
• for classification, accuracy is usually measured by the percentage of test instances correctly classified by the model
• for regression, accuracy is usually measured by the mean squared error
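A minimal sketch of the hold-out method (the helper names `hold_out_split` and `classification_accuracy` are ours; scikit-learn's `train_test_split` is the usual library equivalent):

```python
import random

def hold_out_split(data, train_fraction=2/3, seed=None):
    """Partition `data` into two independent subsets: a training set
    (by default 2/3 of the instances, as the slides suggest) and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def classification_accuracy(model, test_set):
    """Percentage (as a fraction) of (instance, label) pairs in the test set
    for which the model's output matches the desired output."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)
```

Repeating `hold_out_split` with different seeds and averaging the resulting accuracies gives the "hold-out with random subsampling" variant mentioned above.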
to maximize P(X|Ci)P(Ci):
P(Ci) can be estimated by

P(Ci) = Si / S

where Si is the number of training instances of class Ci, and S is the total number of training instances. Since the naive Bayesian classifier assumes class conditional independence, P(X|Ci) can be estimated by

P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci)
• if Ak is a categorical attribute, then P(xk|Ci) = Sik / Si, where Sik is the number of training instances of class Ci having the value xk for Ak, and Si is the number of training instances of class Ci
• if Ak is a continuous attribute, then P(xk|Ci) = g(xk, μ_Ci, σ_Ci), the Gaussian (normal) density with mean μ_Ci and standard deviation σ_Ci computed from the Ak values of the class-Ci training instances
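The estimation steps above can be run end-to-end on the slides' training table. This is a toy sketch, not a production implementation: note that a count of zero (e.g. no "Professor" among the non-tenured instances) zeroes the whole product for that class; Laplace smoothing, not shown here, is the usual remedy.

```python
from collections import Counter
from math import exp, pi, sqrt

# The slides' training table: (Rank, Years, Tenured)
train = [
    ("Assistant Prof", 3, "no"),
    ("Assistant Prof", 7, "yes"),
    ("Professor",      2, "yes"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"),
    ("Associate Prof", 3, "no"),
]

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): the normal density used for continuous attributes."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def predict(rank, years):
    class_counts = Counter(label for _, _, label in train)   # Si for each class Ci
    best_class, best_score = None, -1.0
    for ci, s_i in class_counts.items():
        rows = [(r, y) for r, y, label in train if label == ci]
        prior = s_i / len(train)                             # P(Ci) = Si / S
        s_ik = sum(1 for r, _ in rows if r == rank)
        p_rank = s_ik / s_i                                  # categorical: Sik / Si
        mu = sum(y for _, y in rows) / s_i
        sigma = sqrt(sum((y - mu) ** 2 for _, y in rows) / s_i)
        p_years = gaussian(years, mu, sigma)                 # continuous: g(xk, mu, sigma)
        score = prior * p_rank * p_years                     # P(X|Ci) P(Ci)
        if score > best_score:                               # maximum a posteriori class
            best_class, best_score = ci, score
    return best_class

print(predict("Professor", 7))  # "yes" -- the slides' unseen instance (Jeff, Professor, 7)
```

P(X) is never computed: as noted above, it is the same for every class, so comparing P(X|Ci)P(Ci) suffices.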