Cost-Sensitive Face Recognition
Yin Zhang and Zhi-Hua Zhou, Senior Member, IEEE
Abstract—Most traditional face recognition systems attempt to achieve a low recognition error rate, implicitly assuming that the losses of all misclassifications are the same. In this paper, we argue that this is far from a reasonable setting because, in almost all application scenarios of face recognition, different kinds of mistakes will lead to different losses. For example, it would be troublesome if a door locker based on a face recognition system misclassified a family member as a stranger such that she/he was not allowed to enter the house, but it would be a much more serious disaster if a stranger was misclassified as a family member and allowed to enter the house.
We propose a framework which formulates the face recognition problem as a multiclass cost-sensitive learning task, and develop two theoretically sound methods for this task. Experimental results demonstrate the effectiveness and efficiency of the proposed methods.
Index Terms—Face recognition, cost-sensitive face recognition, cost-sensitive learning, multiclass cost-sensitive learning.
1 INTRODUCTION

Face recognition has attracted much research effort for many years. Many successful face recognition systems have been developed, such as [2], [3], [19], [23], and Zhao et al. [25] provided a good survey. The operation of face recognition systems can be divided into two modes, i.e., the verification mode and the identification mode. In the verification mode, the system validates whether the individual is the identity she/he claims to be; in the identification mode, the system recognizes the individual by searching the database to find who the individual is, or determines that she/he does not belong to the database [11]. In this paper, we are concerned with identification. To the best of our knowledge, most of these face recognition systems attempt to achieve a low recognition error rate.
Pursuing a minimum error rate always implies that the system assumes that any misclassification will cause the same amount of loss, since it simply tries to minimize the number of mistakes. Although this assumption is widely taken, we argue that it is not really reasonable because, for most real-world applications, different kinds of mistakes generally lead to different amounts of losses. For example, consider a door locker based on a face recognition system for a certain group (e.g., family members or employees of a company); the possible mistakes in predicting a probe face image include:
1. false acceptance, i.e., misrecognizing an impostor as a gallery subject;
2. false rejection, i.e., misrecognizing a gallery subject as an impostor;
3. false identification, i.e., misrecognizing one gallery subject as another.
In traditional face recognition systems, these errors are treated equally. It is evident, however, that these errors will cause different amounts of losses. When the second error occurs, a gallery subject is mistakenly rejected, which is troublesome; but, compared with the first error, the second one is less serious, since it would be a disaster if an impostor were mistakenly allowed to enter the house. The third error also causes some trouble, since members of the house might have different private rooms, or the company may want to record staff attendance, yet such an error is obviously much less serious than the first and the second ones.

There are many other applications where the severity of different kinds of errors varies. For example, a salesman at a shop may want to increase her/his chances of identifying old customers, even at the cost of misclassifying some new customers as old ones. Thus, the false acceptance (misrecognizing a new customer as an old one) may not be as serious as the false rejection (missing an old customer) or even the false identification (misrecognizing two old customers).

It is clear that these three kinds of errors are quite different, and simply taking the error rate as the performance measure may not be a good choice. In the following, without loss of generality, we assume that false rejection is more serious than false identification, and that false acceptance is the most serious error. Other situations will be considered in our experiments later.
In the machine learning and data mining communities, a kind of classification problem called cost-sensitive learning has been studied for years [1], [5], [7], [13], [26]. In such settings, "cost" information is introduced to measure the severity of misclassification, and different costs reflect different amounts of losses. The purpose of cost-sensitive learning is to minimize the total cost rather than the total error. There are two kinds of cost-sensitive problems, i.e., problems with class-dependent costs [5], [7], [13], [26] and problems with example-dependent costs [1]. In the former, the cost is determined by the error type; that is, misclassifying any example of the $i$th class into the $j$th class always incurs the same cost, while misclassifying an example into different classes may lead to different costs. In the latter, the cost is determined by the example, and different examples
will have different costs even when the error types are the same.
Inspired by cost-sensitive learning, in this paper, we formulate the face recognition problem as a multiclass cost-sensitive learning task. For convenience of discussion, we consider a situation where accepting any impostor will result in the same loss, and misclassifying a gallery subject as another gallery subject or as an impostor will result in different amounts of losses. So, our problem is a class-dependent cost-sensitive problem. In contrast to conventional face recognition systems, which try to minimize the total error rate, we try to minimize the total cost, aiming to prevent disasters caused by mistakes with high costs. Note that in the setting with which we are concerned, the costs of different kinds of misclassifications are given by the user according to their requirements. For example, if the user believes that letting an impostor in will cause a disaster 10 times more severe than blocking a legal person, she/he would set the cost of the former to 10 times that of the latter.
Some previous biometric studies considered the difference between false acceptance and false rejection, e.g., [11], [14], and used the receiver operating characteristic (ROC) curve to select a proper threshold for classification. These methods can be viewed as implicitly using cost information since the threshold depends on the cost. To the best of our knowledge, our work is the first attempt to explicitly formulate the face recognition problem as a cost-sensitive learning problem and to try to minimize the total cost directly. This formulation is more natural and solid. Moreover, those ROC-based methods focused on binary classification problems, while face recognition is inherently a multiclass problem, and extending ROC methods to multiclass settings is nontrivial [12].
In this paper, we propose two new cost-sensitive methods, mcKLR and mckNN, for face identification problems. mcKLR is an inductive learning method derived from Bayes decision theory, while mckNN is a cost-sensitive version of the k-nearest neighbor classifier. Both methods handle multiclass cost-sensitive classification problems. mckNN is particularly suitable for situations where group members vary frequently, while mcKLR is more suitable for situations with stable group members. Experimental results on the AR and FERET databases validate the effectiveness and efficiency of our methods.
The rest of this paper is organized as follows: In Section 2, we formulate the cost-sensitive face recognition problem. In Section 3, we briefly introduce some existing multiclass cost-sensitive learning methods. Then, we propose the mcKLR and mckNN methods in Section 4, and report on our experiments in Section 5. In Section 6, we discuss how to construct the cost matrix effectively from interaction with users. Finally, we conclude the paper in Section 7.
2 PROBLEM FORMULATION
Denote a face image by $x$ and its label by $y$. Consider that there are $M$ gallery subjects, denoted by $y = G_1, \ldots, G_M$, and many impostors, denoted by a metaclass $y = I$. Traditional face recognition systems try to generate a hypothesis $\phi(x)$ minimizing the expected error rate $\mathrm{Err} = E_{x,y}(\mathbb{I}(\phi(x) \neq y))$, where $\mathbb{I}$ is the indicator function which takes 1 when $\phi(x) \neq y$ and 0 otherwise. Thus, these systems implicitly assume that the costs of all kinds of mistakes are the same. As mentioned before, however, such an assumption is often not reasonable, and different mistakes are generally associated with different costs. As analyzed above, our problem is class-dependent cost-sensitive, and we can categorize the costs into three types:
1. cost of false acceptance, $C_{IG}$;
2. cost of false rejection, $C_{GI}$;
3. cost of false identification, $C_{GG}$.
According to our discussion above about the severity of different types of errors, it is evident that $C_{IG}$, $C_{GI}$, and $C_{GG}$ are unequal. Given a cost setting according to the user's intention, following [7], we can reassign $C_{IG} \leftarrow C_{IG}/C_{GG}$, $C_{GI} \leftarrow C_{GI}/C_{GG}$, and $C_{GG} \leftarrow 1$ with the optimal solution unchanged. Here, for ease of understanding, we still preserve the original formulation. We can construct a cost matrix $C$, as shown in Table 1, where $C_{ij}$ indicates the cost of misclassifying a face image of the $i$th person as the $j$th person. The diagonal elements of $C$ are all zero since there is no loss for correct recognition. Actually, it is easy for users to specify which kind of mistake has a higher cost and which has a lower cost. Thus, we assume that the cost matrix is given by users, and we will focus on how to make the face recognition system behave well given a cost matrix. At the end of the paper, we will try to determine the cost matrix by interacting with users.
It is clear that the problem with which we are concerned is an $(M+1)$-class cost-sensitive learning problem, and the hypothesis $\phi(x)$ should minimize the expected cost $\mathrm{Cost} = E_{x,y}(C_{y\phi(x)})$. Since $E_{x,y}(C_{y\phi(x)}) = E_x(E_{y|x}(C_{y\phi(x)} \mid x))$, minimizing $E_{x,y}(C_{y\phi(x)})$ is equivalent to minimizing $E_{y|x}(C_{y\phi(x)} \mid x)$ on every $x$. Hence, we can define the expected loss of predicting $x$ by $\phi(x)$ as $\mathrm{loss}(x, \phi(x)) = E_{y|x}(C_{y\phi(x)} \mid x)$. For our problem, we have
$$
\mathrm{loss}(x, \phi(x)) =
\begin{cases}
\sum_{m=1, m \neq \ell}^{M} P(G_m \mid x)\, C_{GG} + P(I \mid x)\, C_{IG}, & \text{if } \phi(x) = G_\ell, \\[4pt]
\sum_{m=1}^{M} P(G_m \mid x)\, C_{GI}, & \text{if } \phi(x) = I,
\end{cases} \tag{1}
$$
where we denote $P(y = G_m \mid x)$ and $P(y = I \mid x)$ as $P(G_m \mid x)$ and $P(I \mid x)$ for simplicity. Therefore, in order to minimize the total cost, the optimal prediction of $x$ should be
$$
\phi^*(x) = \arg\min_{\phi(x) \in \{G_1, \ldots, G_M, I\}} \mathrm{loss}(x, \phi(x)). \tag{2}
$$
TABLE 1
The Cost Matrix (rows: true class; columns: predicted class)

             G_1     G_2    ...    G_M     I
    G_1       0      C_GG   ...    C_GG   C_GI
    G_2      C_GG     0     ...    C_GG   C_GI
    ...      ...     ...    ...    ...    ...
    G_M      C_GG    C_GG   ...     0     C_GI
    I        C_IG    C_IG   ...    C_IG    0
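To make the decision rule concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that builds the cost matrix of Table 1 and applies (1) and (2) to given class posteriors. The function names, the example cost values, and the toy posterior are assumptions chosen for the example.

```python
import numpy as np

def build_cost_matrix(M, c_gg=1.0, c_gi=2.0, c_ig=10.0):
    """Cost matrix of Table 1. Rows are true classes, columns predicted ones;
    indices 0..M-1 are gallery subjects G_1..G_M, index M is the impostor I."""
    C = np.full((M + 1, M + 1), c_gg)   # gallery misidentified as gallery: C_GG
    C[:M, M] = c_gi                     # gallery rejected as impostor: C_GI
    C[M, :M] = c_ig                     # impostor accepted as gallery: C_IG
    np.fill_diagonal(C, 0.0)            # correct recognition costs nothing
    return C

def bayes_optimal_label(posterior, C):
    """Eq. (2): the label minimizing the expected cost of Eq. (1).
    posterior[i] = P(class i | x); C[i, j] = cost of predicting j when i is true."""
    expected_cost = posterior @ C       # loss(x, j) = sum_i P(i | x) * C[i, j]
    return int(np.argmin(expected_cost))

C = build_cost_matrix(M=2)
posterior = np.array([0.5, 0.2, 0.3])     # assumed P(G_1|x), P(G_2|x), P(I|x)
print(bayes_optimal_label(posterior, C))  # prints 2 (reject as impostor): the high
                                          # C_IG outweighs G_1's larger posterior
```

Note how the high false-acceptance cost makes the rule reject even though $G_1$ has the largest posterior; a cost-blind classifier would accept.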
Multiclass cost-sensitive learning algorithms try to solve this minimization problem. In the next section, we will briefly introduce some existing multiclass cost-sensitive methods.
3 MULTICLASS COST-SENSITIVE LEARNING
3.1 Rescaling

Rescaling [7], [26] is a general approach that can be used to make cost-blind learning algorithms cost-sensitive. The principle is to enable the influences of the higher-cost classes to be larger than those of the lower-cost classes. The rescaling approach can be realized in many ways, such as assigning training examples of different classes different weights [7], [22], sampling the classes according to their costs [6], [7], [17], or moving the decision threshold [5], [7]. The rescaling approach is effective in dealing with binary-class problems.
Zhou and Liu [26] analyzed the reason why the traditional rescaling approach is often not effective on multiclass problems and revealed that it is helpful only when all of the classes can be consistently rescaled simultaneously. Based on the analysis, a new approach was proposed, which should be the choice if the user wants to use rescaling for multiclass cost-sensitive learning. For an $(M+1)$-class problem, if each class can be assigned an optimal weight $w_m$ ($1 \leq m \leq M+1$, $w_m > 0$), it is desired that the weights satisfy $w_i / w_j = C_{ij} / C_{ji}$ for every two classes $i$ and $j$. This implies that the following $\binom{M+1}{2}$ constraints must be satisfied, which is called the consistency condition [26]:
$$
\begin{aligned}
&\frac{w_1}{w_2} = \frac{C_{12}}{C_{21}}, \quad \frac{w_1}{w_3} = \frac{C_{13}}{C_{31}}, \quad \cdots, \quad \frac{w_1}{w_{M+1}} = \frac{C_{1(M+1)}}{C_{(M+1)1}}, \\
&\frac{w_2}{w_3} = \frac{C_{23}}{C_{32}}, \quad \cdots, \quad \frac{w_2}{w_{M+1}} = \frac{C_{2(M+1)}}{C_{(M+1)2}}, \\
&\cdots \\
&\frac{w_M}{w_{M+1}} = \frac{C_{M(M+1)}}{C_{(M+1)M}}.
\end{aligned}
$$
In order to perform rescaling simultaneously, $w = (w_1, w_2, \ldots, w_{M+1})^T$ must be a nontrivial solution of the linear equation system with the coefficient matrix:
$$
\begin{bmatrix}
C_{21} & -C_{12} & 0 & \cdots & 0 & 0 \\
C_{31} & 0 & -C_{13} & \cdots & 0 & 0 \\
\vdots & & & & & \vdots \\
C_{(M+1)1} & 0 & 0 & \cdots & 0 & -C_{1(M+1)} \\
0 & C_{32} & -C_{23} & \cdots & 0 & 0 \\
\vdots & & & & & \vdots \\
0 & C_{(M+1)2} & 0 & \cdots & 0 & -C_{2(M+1)} \\
\vdots & & & & & \vdots \\
0 & 0 & 0 & \cdots & C_{(M+1)M} & -C_{M(M+1)}
\end{bmatrix} \tag{3}
$$

The existence of a nontrivial solution is equivalent to requiring the coefficient matrix (3) to have a rank smaller than $(M+1)$; otherwise, rescaling may not be effective on multiclass problems. In our cost-sensitive face recognition task, the coefficient matrix's rank is $M$. Therefore, theoretically, we can use rescaling to solve this problem (the sketch at the end of this subsection illustrates the check).

MetaCost [5] is popularly used to make cost-blind learning algorithms cost-sensitive. Actually, this is also a rescaling method, which relabels training examples to minimize the Bayesian risk by threshold moving; in other words, this method works by rescaling data based on the posterior probability and the cost setting. As MetaCost directly modifies the labels rather than assigning weights to examples, the consistency problem mentioned above does not exist; however, from an empirical study, Ting [21] indicated that the internal cost-sensitive classifier employed by MetaCost to relabel the training examples may outperform the final model without additional computation, and therefore, he did not recommend MetaCost.

Our experimental results reveal that the rescaling methods do not work well on our cost-sensitive face recognition task, and the reason will be discussed in Section 5.
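The consistency condition and the rank test on (3) can be checked mechanically. The sketch below is our own illustration (the function name and the example cost values are assumptions): it builds the coefficient matrix from the constraints $w_i / w_j = C_{ij} / C_{ji}$ and, when a nontrivial solution exists, extracts the rescaling weights from the null space.

```python
import numpy as np
from itertools import combinations

def rescaling_weights(C):
    """Consistency check of [26]: build the coefficient matrix (3) and, if its
    rank is below the number of classes, return a nontrivial weight vector."""
    n = C.shape[0]                            # n = M + 1 classes
    rows = []
    for i, j in combinations(range(n), 2):
        row = np.zeros(n)
        row[i], row[j] = C[j, i], -C[i, j]    # C_ji * w_i - C_ij * w_j = 0
        rows.append(row)
    A = np.array(rows)
    if np.linalg.matrix_rank(A) >= n:         # no nontrivial solution:
        return None                           # classes cannot be consistently rescaled
    w = np.linalg.svd(A)[2][-1]               # null-space direction of A
    return np.abs(w / w[np.argmax(np.abs(w))])  # fix sign and scale

# example: 2 gallery subjects + impostor, with C_GG = 1, C_GI = 2, C_IG = 10
C = np.array([[0., 1., 2.],
              [1., 0., 2.],
              [10., 10., 0.]])
print(rescaling_weights(C))   # ~[0.2, 0.2, 1.0]: both gallery classes share one
                              # weight, and the impostor class weighs 5x more
```

For the cost matrix of Table 1, the gallery-gallery constraints force all gallery weights to be equal, so the system is consistent and the rank is $M$, as stated above.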
3.2 Multiclass Cost-Sensitive SVM (mcSVM)
Support vector machine (SVM) has been successfully applied to face recognition [10], [15]. Since SVM was originally designed for binary classification and our cost-sensitive face recognition task is a multiclass problem, multiclass extensions are needed. One-versus-one and one-versus-all are two popular strategies to handle the gap between multiclass problems and binary classifiers by decomposing a multiclass problem into a series of binary-class problems. These approaches, however, may fail under various circumstances [13], [15]. Lee et al. [13] derived a multiclass cost-sensitive SVM, i.e., mcSVM. In this method, for an $(M+1)$-class classification problem, the example $x$'s label $y$ is extended to an $(M+1)$-dimensional label vector, denoted by $\mathbf{y}$, where $\mathbf{y}$ takes 1 for the $y$th element and $-1/M$ for the others. For instance, if the $i$th example falls into the first class, then $\mathbf{y}_i = (1, -1/M, \ldots, -1/M)^T$. Accordingly, an $(M+1)$-tuple of separating functions $f(x) = (f_1(x), \ldots, f_{M+1}(x))^T$ is defined, where $f_m(x) = h_m(x) + b_m$, $h_m \in H_K$, and $b_m \in \mathbb{R}$. $H_K$ is a reproducing kernel Hilbert space (RKHS) with reproducing kernel function $K(\cdot,\cdot)$, and $f(x)$ satisfies the sum-to-zero constraint $\sum_{m=1}^{M+1} f_m(x) = 0$ for any $x$.
Define the loss function for mcSVM as $L(x, f(x), y) = C_{y\cdot} \cdot (f(x) - \mathbf{y})_+$, where $C_{y\cdot}$ is the $y$th row of the cost matrix $C$ and $(f(x) - \mathbf{y})_+$ is the generalized hinge loss $((f_1(x) - y_1)_+, \ldots, (f_{M+1}(x) - y_{M+1})_+)^T$. Lee et al. [13] proved that the minimizer of the expected risk $E_{x,\mathbf{y}}(L(x, f(x), \mathbf{y}))$ under the sum-to-zero constraint is $f^*(x) = (f^*_1(x), \ldots, f^*_{M+1}(x))^T$, with

$$
f^*_\ell(x) =
\begin{cases}
1, & \text{if } \ell = \arg\min_{m=1,\ldots,M+1} \mathrm{loss}(x, m), \\
-1/M, & \text{otherwise}.
\end{cases} \tag{4}
$$

Here, $\mathrm{loss}(x, m) = \sum_{m'=1}^{M+1} P(m' \mid x)\, C_{m'm}$, as defined in Section 2. It means that the best predicted label of a new example $x$ under the Bayes decision rule is the subscript of the maximum of the separating functions, i.e.,

$$
\phi(x) = \arg\max_m f_m(x) = \arg\min_{m=1,\ldots,M+1} \mathrm{loss}(x, m). \tag{5}
$$
For the finite case $D = \{(x_i, y_i)\}_{i=1}^{|D|}$, the expected risk is replaced by the empirical risk. Considering the structural risk, the optimization objective can be written as

$$
\frac{1}{|D|} \sum_{i=1}^{|D|} C_{y_i\cdot} \cdot \bigl(f(x_i) - \mathbf{y}_i\bigr)_+ + \frac{\lambda}{2} \sum_{m=1}^{M+1} \|f_m\|^2_{H_K}. \tag{6}
$$

Actually, when $M = 1$, the generalized hinge loss reduces to the binary hinge loss, and if all misclassification costs are 1, mcSVM reduces to the traditional binary cost-blind SVM.
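To illustrate the label coding and the generalized hinge loss of Section 3.2, a minimal sketch follows. It only evaluates the loss for one example and does not implement the full RKHS optimization of (6); the function names and toy values are our own assumptions.

```python
import numpy as np

def mcsvm_label_vector(y, n_classes):
    """Label coding of [13]: 1 in the y-th slot, -1/M elsewhere (M = n_classes - 1)."""
    v = np.full(n_classes, -1.0 / (n_classes - 1))
    v[y] = 1.0
    return v

def mcsvm_loss(f_x, y, C):
    """Generalized hinge loss L(x, f(x), y) = C_y. * (f(x) - y)_+ of Section 3.2."""
    y_vec = mcsvm_label_vector(y, len(f_x))
    return float(C[y] @ np.maximum(f_x - y_vec, 0.0))   # (.)_+ taken elementwise

# toy check with M + 1 = 3 classes; f_x nearly matches the ideal code (1, -1/2, -1/2)
C = np.array([[0., 1., 2.],
              [1., 0., 2.],
              [10., 10., 0.]])
f_x = np.array([0.9, -0.4, -0.5])
print(mcsvm_loss(f_x, 0, C))   # 0.1: only the small overshoot on the second
                               # coordinate is charged, weighted by C[0, 1]
```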
4 OUR METHODS
4.1 Multiclass Cost-Sensitive KLR (mcKLR)

4.1.1 Derivation

We can define a metaclass $G$ as $y = G \Leftrightarrow \exists m:\, y = G_m$, with $P(G \mid x) = P(\bigcup_{m=1}^{M} G_m \mid x) = \sum_{m=1}^{M} P(G_m \mid x)$. Then, from (1), we have

$$
\begin{aligned}
\mathrm{loss}(x, G_\ell) &= \sum_{m=1, m \neq \ell}^{M} P(G_m \mid x)\, C_{GG} + P(I \mid x)\, C_{IG} \\
&= \bigl(P(G \mid x) - P(G_\ell \mid x)\bigr) C_{GG} + P(I \mid x)\, C_{IG} \\
&= P(G \mid x)\, C_{GG} + P(I \mid x)\, C_{IG} - P(G_\ell \mid x)\, C_{GG}.
\end{aligned} \tag{7}
$$
As $x$ can be labeled as either $G$ or $I$, we have $P(G \mid x) + P(I \mid x) = 1$. So, (7) becomes

$$
\begin{aligned}
\mathrm{loss}(x, G_\ell) &= \bigl(1 - P(I \mid x)\bigr) C_{GG} + P(I \mid x)\, C_{IG} - P(G_\ell \mid x)\, C_{GG} \\
&= C_{GG} + P(I \mid x)(C_{IG} - C_{GG}) - P(G_\ell \mid x)\, C_{GG},
\end{aligned} \tag{8}
$$

and
$$
\mathrm{loss}(x, I) = \sum_{m=1}^{M} P(G_m \mid x)\, C_{GI} = P(G \mid x)\, C_{GI}. \tag{9}
$$
To minimize the loss, we should choose the minimum among the $M+1$ items below:

$$
\begin{cases}
C_{GG} + P(I \mid x)(C_{IG} - C_{GG}) - P(G_1 \mid x)\, C_{GG}, \\
\quad\cdots \\
C_{GG} + P(I \mid x)(C_{IG} - C_{GG}) - P(G_M \mid x)\, C_{GG}, \\
P(G \mid x)\, C_{GI}.
\end{cases} \tag{10}
$$

Subtracting $C_{GG} + P(I \mid x)(C_{IG} - C_{GG})$ from every item, the last item becomes

$$
\begin{aligned}
&P(G \mid x)\, C_{GI} - C_{GG} - P(I \mid x)(C_{IG} - C_{GG}) \\
&\quad= \bigl(1 - P(I \mid x)\bigr) C_{GI} - C_{GG} - P(I \mid x)(C_{IG} - C_{GG}) \\
&\quad= -P(I \mid x)(C_{GI} + C_{IG} - C_{GG}) + (C_{GI} - C_{GG}).
\end{aligned} \tag{11}
$$
So, we have the equivalent problem of choosing the minimum from

$$
\begin{cases}
-P(G_1 \mid x)\, C_{GG}, \\
\quad\cdots \\
-P(G_M \mid x)\, C_{GG}, \\
-P(I \mid x)(C_{GI} + C_{IG} - C_{GG}) + (C_{GI} - C_{GG}).
\end{cases} \tag{12}
$$

Dividing every item by $-C_{GG}$ and denoting

$$
\theta = \frac{C_{GI} + C_{IG} - C_{GG}}{C_{GG}} \tag{13}
$$

and

$$
\Delta = \frac{C_{GI} - C_{GG}}{C_{GG}}, \tag{14}
$$

the problem becomes choosing the maximum from

$$
\begin{cases}
P(G_1 \mid x), \\
\quad\cdots \\
P(G_M \mid x), \\
\theta\, P(I \mid x) - \Delta.
\end{cases} \tag{15}
$$
4.1.2 Optimization
By using logistic regression [27], we define an $M$-tuple of separating functions $f(x) = (f_1(x), \ldots, f_M(x))^T$ and the loss function $L(x, f(x), y)$ as

$$
L(x, f(x), y) = \sum_{\ell=1}^{M} \left[ -\ln \frac{e^{f_\ell(x)}}{1 + \sum_{m=1}^{M} e^{f_m(x)}} \right] \mathbb{I}(y = G_\ell) + \left[ -\ln \frac{1}{1 + \sum_{m=1}^{M} e^{f_m(x)}} \right] \mathbb{I}(y = I). \tag{16}
$$
On the $(x, y)$ space with pdf $p(x, y)$, the optimal separating function $f^*(x)$ is the minimizer of the expectation of $L$. Since $E_{x,y}(L) = E_x(E_{y|x}(L \mid x))$, to minimize $E_{x,y}(L)$, we can minimize $E_{y|x}(L \mid x)$ on every $x$, where

$$
E_{y|x}(L \mid x) = \sum_{\ell=1}^{M} \left[ -\ln \frac{e^{f_\ell(x)}}{1 + \sum_{m=1}^{M} e^{f_m(x)}} \right] P(G_\ell \mid x) + \left[ -\ln \frac{1}{1 + \sum_{m=1}^{M} e^{f_m(x)}} \right] P(I \mid x). \tag{17}
$$
Setting the partial derivative with respect to every $f_\ell$ to zero, we get the minimizer

$$
f^*_\ell(x) = \ln \frac{P(G_\ell \mid x)}{P(I \mid x)}, \tag{18}
$$

for $\ell = 1, \ldots, M$.
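As a quick numerical sanity check of (18) (our own illustration, not from the paper), the sketch below minimizes the conditional risk (17) for an assumed toy posterior and compares the result with $\ln(P(G_\ell \mid x)/P(I \mid x))$; the posterior values and the function name are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

p_g = np.array([0.5, 0.3])   # assumed P(G_1|x), P(G_2|x) for a single x
p_i = 0.2                    # assumed P(I|x)

def conditional_risk(f):
    """E_{y|x}(L | x) of Eq. (17), with softmax denominator 1 + sum_m exp(f_m)."""
    z = 1.0 + np.exp(f).sum()
    return (p_g * (np.log(z) - f)).sum() + p_i * np.log(z)

res = minimize(conditional_risk, x0=np.zeros(2))
print(res.x)                 # -> approx [0.916, 0.405]
print(np.log(p_g / p_i))     # Eq. (18): ln(P(G_l|x)/P(I|x)) = [0.916, 0.405]
```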
Through $f^*_1, \ldots, f^*_M$, we construct a new function $f^*_I$:

$$
f^*_I(x) = \ln \frac{\theta\, P(I \mid x) - \Delta}{P(I \mid x)} = \ln \left( \theta - \Delta \left( 1 + \sum_{\ell=1}^{M} e^{f^*_\ell(x)} \right) \right). \tag{19}
$$
Thus, choosing the maximum from $\{f^*_1(x), \ldots, f^*_M(x), f^*_I(x)\}$ is equivalent to choosing the maximum from (15). That is, the optimal predicted label of $x$ under the Bayes decision rule is

$$
\phi(x) =
\begin{cases}
G_\ell, & \text{if } f^*_\ell \text{ is the maximum}, \\
I, & \text{if } f^*_I \text{ is the maximum}.
\end{cases} \tag{20}
$$
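A minimal sketch of this prediction step follows (our own illustration; the function name and the handling of a nonpositive log argument are assumptions): given trained scores $f^*_1(x), \ldots, f^*_M(x)$, it forms $f^*_I(x)$ via (13), (14), and (19), and applies rule (20).

```python
import numpy as np

def mcklr_predict(f_vals, c_gg, c_gi, c_ig):
    """Rule (20): extend the trained scores f*_1(x)..f*_M(x) with the
    cost-dependent f*_I(x) of Eq. (19) and take the argmax.
    Returns m in {0..M-1} for gallery subject G_{m+1}, or -1 for impostor I."""
    theta = (c_gi + c_ig - c_gg) / c_gg              # Eq. (13)
    delta = (c_gi - c_gg) / c_gg                     # Eq. (14)
    arg = theta - delta * (1.0 + np.exp(f_vals).sum())
    # if theta * P(I|x) - delta <= 0, the impostor item in (15) cannot win,
    # so we guard the log with -inf (our own handling of this edge case)
    f_i = np.log(arg) if arg > 0 else -np.inf
    scores = np.append(f_vals, f_i)
    best = int(np.argmax(scores))
    return best if best < len(f_vals) else -1
```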
For the finite case $D = \{(x_i, y_i)\}_{i=1}^{|D|}$, the expectation of the loss is replaced by the empirical risk

$$
L(D) = \sum_{i=1}^{|D|} \left( \sum_{\ell=1}^{M} \left[ -\ln \frac{e^{f_\ell(x_i)}}{1 + \sum_{m=1}^{M} e^{f_m(x_i)}} \right] \mathbb{I}(y_i = G_\ell) + \left[ -\ln \frac{1}{1 + \sum_{m=1}^{M} e^{f_m(x_i)}} \right] \mathbb{I}(y_i = I) \right). \tag{21}
$$
As in mcSVM, assume that $f_m(x) = h_m(x) + b_m$, $h_m \in H_K$, and $b_m \in \mathbb{R}$. The optimization objective can be expressed as

$$
L(D) + \frac{\lambda}{2} \sum_{m=1}^{M} \|f_m\|^2_{H_K}. \tag{22}
$$

Note that this optimization problem is similar to the optimization form of multiclass kernel logistic regression (KLR) [8], [27]. Therefore, we can use an optimization technique similar to that of KLR to handle our problem, and we call our method mcKLR.
We can see that all of the information about the misclassification costs, i.e., $C_{GG}$, $C_{GI}$, and $C_{IG}$, is embedded in $\theta$ and $\Delta$. In the training process, we only need to find the optimal $f^*_\ell(x)$ ($1 \leq \ell \leq M$), since $f^*_I(x)$ depends on them. Therefore, no cost information is needed, which means that the training process is independent of the cost setting. Cost sensitivity is achieved by $f^*_I(x)$ alone, which is constructed from the $f^*_\ell(x)$, $\theta$, and $\Delta$ in the test process, as shown in (19). Therefore, if the cost setting (i.e., $C_{IG} : C_{GI} : C_{GG}$) changes, we only need to adjust the prediction process rather than retraining the whole system. This is very efficient, since the predictions of mcKLR can be updated online according to the changing cost matrix. Moreover, mcKLR can output all possible predictions corresponding to different $\theta$ and $\Delta$ settings almost all at once with negligible time consumption. We will discuss this further in Section 6.
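As a usage note continuing the mcklr_predict sketch above (the trained scores and cost values are assumed for illustration), changing the cost setting only changes the prediction step, with no retraining:

```python
f_vals = np.array([0.2, -0.1])  # assumed trained scores f*_1(x), f*_2(x); fixed once
print(mcklr_predict(f_vals, c_gg=1, c_gi=2, c_ig=2))    # mild C_IG: accepts G_1 (0)
print(mcklr_predict(f_vals, c_gg=1, c_gi=2, c_ig=50))   # harsh C_IG: rejects as I (-1)
```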
4.2 Multiclass Cost-Sensitive k-Nearest Neighbor (mckNN)

k-nearest neighbor (kNN) is possibly one of the simplest machine learning algorithms. For any new instance, its label is decided by majority voting among the labels of its k nearest neighbors in the training set. There is no explicit training process, and all computation is deferred to the prediction step; therefore, kNN is a lazy learning method. Different distance measures can be used to determine the k nearest neighbors of the unseen instance, such as the Euclidean distance and the Mahalanobis distance.
Similarly to mcKLR, here we propose a cost-sensitive kNN, denoted by mckNN. According to the statistical information gained from the neighboring instances, Bayes decision theory is utilized to determine the label of the test instance.
First, for a test instance $x$, its $k$ nearest neighbors in the training set, $x_1, \ldots, x_k$, are identified. Then, the labels of the $k$ nearest neighbors are regarded as new features for $x$. Denote the new features for $x$ as $z = \{y_1, y_2, \ldots, y_k\}$. Thus, the posterior probability of $x$ having the label $y$ can be expressed as $P(y \mid z) = P(y \mid y_1, y_2, \ldots, y_k)$. Assume that the $k$ nearest neighbors of $x$ are conditionally independent; in other words, the $i$th nearest neighbor of $x$ is independent of the previous $(i-1)$ nearest neighbors. Then, we have

$$
P(y_1, y_2, \ldots, y_k \mid y) = P(y_1 \mid y)\, P(y_2 \mid y) \cdots P(y_k \mid y), \tag{23}
$$

where the likelihood $P(y_i \mid y)$ ($i = 1, \ldots, k$) is the probability that an instance with label $y_i$ appears among the $k$ nearest neighbors of an instance with label $y$. This assumption is made mainly for computational simplicity; otherwise, we would suffer from combinatorial explosion. In our experiments, it can be found that the conditional independence assumption is reasonable under most circumstances. Based on the assumption of (23), the posterior probability $P(y \mid z)$ can be expressed by the Bayes rule as

$$
P(y \mid z) = \frac{P(y)\, P(z \mid y)}{P(z)} = \frac{P(y)\, P(y_1 \mid y) \cdots P(y_k \mid y)}{P(z)}. \tag{24}
$$
Here, $P(y_i \mid y)$ ($i = 1, \ldots, k$) and $P(y)$ can be estimated from the training set. Assume that there are $s$ instances in the training set with label $y$, denoted as $x^y_1, \ldots, x^y_s$, and that among the $k$ nearest neighbors of each $x^y_t$ ($1 \leq t \leq s$), there are $k_t$ instances with label $y_i$. We have

$$
P(y_i \mid y) = \frac{\sum_{t=1}^{s} k_t}{k \times s}, \tag{25}
$$

and the prior probability $P(y)$ can be estimated as the proportion of the training samples with label $y$ among all of the training samples:

$$
P(y) = \frac{|D_y|}{|D|}, \tag{26}
$$

where $D_y$ is the subset of training samples with label $y$.
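A minimal sketch of this estimation step follows (our own illustration; the function name and the use of Euclidean neighbors are assumptions): it computes the likelihood table of (25) and the priors of (26) from a labeled training set.

```python
import numpy as np

def estimate_knn_likelihoods(X, y, k, n_classes):
    """Estimate the likelihoods P(y_i | y) of Eq. (25) and the priors P(y)
    of Eq. (26) from a training set, using Euclidean k-nearest neighbors."""
    n = len(X)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d, np.inf)                         # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]                   # k nearest neighbors of each point
    like = np.zeros((n_classes, n_classes))             # like[c, c'] ~ P(y_i = c' | y = c)
    for c in range(n_classes):
        members = np.where(y == c)[0]                   # the s instances with label c
        counts = np.bincount(y[nn[members]].ravel(), minlength=n_classes)
        like[c] = counts / (k * len(members))           # Eq. (25): sum_t k_t / (k * s)
    prior = np.bincount(y, minlength=n_classes) / n     # Eq. (26): |D_y| / |D|
    return like, prior
```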
In contrast to traditional kNN, which uses only the information of the training set at prediction time, we can compute $P(y_i \mid y_j)$ for each label pair $(y_i, y_j)$ and $P(y_i)$ for each label $y_i$ before prediction. With this information, almost no additional computation is needed in the test process compared with traditional kNN.

Having $P(y_i \mid y)$ beforehand, we can get the posterior probability $P(y \mid z)$ for a new instance as in (24). In the cost-sensitive formulation, similarly to (1), we have the cost of predicting $x$ as $\phi(x)$:
$$
\mathrm{loss}(x, \phi(x)) =
\begin{cases}
\sum_{m=1, m \neq \ell}^{M} P(G_m \mid z)\, C_{GG} + P(I \mid z)\, C_{IG}, & \text{if } \phi(x) = G_\ell, \\[4pt]
\sum_{m=1}^{M} P(G_m \mid z)\, C_{GI}, & \text{if } \phi(x) = I.
\end{cases} \tag{27}
$$
Replacing $P(G_m \mid z)$ and $P(I \mid z)$ with the expression in (24), we have

$$
\mathrm{loss}(x, \phi(x)) = \frac{1}{P(z)} \times
\begin{cases}
\sum_{m=1, m \neq \ell}^{M} P(G_m) \prod_{i=1}^{k} P(y_i \mid G_m)\, C_{GG} + P(I) \prod_{i=1}^{k} P(y_i \mid I)\, C_{IG}, & \text{if } \phi(x) = G_\ell, \\[4pt]
\sum_{m=1}^{M} P(G_m) \prod_{i=1}^{k} P(y_i \mid G_m)\, C_{GI}, & \text{if } \phi(x) = I.
\end{cases} \tag{28}
$$

For each $x$, $1/P(z)$ is a constant and can be omitted. $P(G_m)$, $P(I)$, $P(y_i \mid G_m)$, and $P(y_i \mid I)$ are all estimated from the training set by (25) and (26) beforehand. Then, we can get the optimal prediction of $x$ as
$$
\phi^*(x) = \arg\min_{\phi(x) \in \{G_1, \ldots, G_M, I\}} \mathrm{loss}(x, \phi(x)). \tag{29}
$$
