Some Classes of Associative Binary Operations in Fuzzy Set Theory

An Infinity-Sample Theory for Multi-Category Large Margin Classification

Tong Zhang
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
tzhang@

Abstract

The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions of binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification.

1 Motivation

Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X_1, Y_1), ..., (X_n, Y_n), a large margin classification algorithm produces a decision function f̂_n by empirically minimizing a loss function that is often a convex upper bound of the binary classification error function. Given f̂_n, the binary decision rule is to predict y = 1 if f̂_n(x) ≥ 0, and to predict y = −1 otherwise (the decision rule at f̂_n(x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class C_n:

    \hat{f}_n = \arg\min_{f \in C_n} \frac{1}{n} \sum_{i=1}^{n} \phi(f(X_i) Y_i).    (1)

Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates; see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate; see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit.

Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multi-category problems which generalize the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).¹ However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation generalizing (1) can be used to yield consistent classifiers in the large sample limit.
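Before turning to the multi-category case, a minimal sketch of the binary formulation (1) may help fix ideas. It assumes a linear function class f(x) = w·x and the logistic surrogate φ(v) = ln(1 + e^{−v}); the synthetic data, step size, and iteration count are invented for illustration and are not taken from the paper. For this particular φ, the minimizer of the conditional risk satisfies P(Y = 1 | x) = (1 + e^{−f(x)})^{−1}, which is the sense in which such a method also yields conditional probability estimates.

    import numpy as np

    def phi(v):
        # logistic surrogate: a convex upper bound (up to scaling) of the 0-1 loss
        return np.logaddexp(0.0, -v)

    def empirical_risk(w, X, y):
        # (1/n) * sum_i phi( f(X_i) * Y_i )  with the linear class f(x) = w . x
        return np.mean(phi((X @ w) * y))

    def fit(X, y, lr=0.1, steps=2000):
        # plain gradient descent on the empirical phi-risk
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(steps):
            margins = (X @ w) * y
            # d/dv log(1 + exp(-v)) = -1 / (1 + exp(v))
            grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
            w -= lr * grad
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 2))
        y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
        w = fit(X, y)
        scores = X @ w
        print("empirical phi-risk:", empirical_risk(w, X, y))
        print("training error:", np.mean(np.sign(scores) != y))
        # conditional probability estimates implied by the logistic surrogate
        print("P(Y=1|x) estimates:", np.round(1.0 / (1.0 + np.exp(-scores[:5])), 3))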
An important step toward this direction has recently been taken in[5],where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent).The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classi-fication formulations.Multi-category large margin methods have started to draw more at-tention recently.For example,in[2],learning bounds for some multi-category convex risk minimization methods were obtained,although the authors did not study possible choices of Bayes consistent formulations.2Multi-category classificationWe consider the following K-class classification problem:we would like to predict the label y∈{1,...,K}of an input vector x.In this paper,we only consider the simplest scenario with0−1classification loss:we have a loss of0for correct prediction,and loss of1for incorrect prediction.In binary classification,the class label can be determined using the sign of a decision func-tion.This can be generalized to K class classification problem as follows:we consider K decision functions f c(x)where c=1,...,K and we predict the label y of x as:f c(x),(2)T(f(x))=arg maxc∈{1,...,K}where we denote by f(x)the vector function f(x)=[f1(x),...,f K(x)].Note that if two or more components of f achieve the same maximum value,then we may choose any of them as T(f).In this framework,f c(x)is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c(compared with the remaining k−1categories).The classification error is given by:ℓ(f)=1−E X P(Y=T(X)|X).Note that only the relative strength of f c compared with the alternatives is important.In particular,the decision rule given in(2)does not change when we add the same numerical quantity to each component of f(x).This allows us to impose one constraint on the vector f(x)which decreases the degree of freedom K of the K-component vector f(x)to K−1.1This approach is often called one-versus-all or ranking in machine learning.Another main ap-proach is to encode a multi-category classification problem into binary classification sub-problems. 
The consistency of such encoding schemes can be difficult to analyze,and we shall not discuss them.For example,in the binary classification case,we can enforce f1(x)+f2(x)=0,and hence f(x)can be represented as[f1(x),−f1(x)].The decision rule in(2),which compares f1(x)≥f2(x),is equivalent to f1(x)≥0.This leads to the binary classification rule mentioned in the introduction.In the multi-category case,one may also interpret the possible constraint on the vector function f,which reduces its degree of freedom from K to K−1based on the following reasoning.In many cases,we seek f c(x)as a function of p(Y=c|x).Since we have a constraint K c=1p(Y=c|x)=1(implying that the degree of freedom for p(Y=c|x)is K−1),the degree of freedom for f is also K−1(instead of K).However,we shall point out that in the algorithms we formulate below,we may either enforce such a constraint that reduces the degree of freedom of f,or we do not impose any constraint,which keeps the degree of freedom of f to be K.The advantage of the latter is that it allows the computation of each f c to be decoupled.It is thus much simpler both conceptually and numerically.Moreover,it directly handles multiple-label problems where we may assign each x to multiple labels of y∈{1,...,K}.In this scenario,we do not have a constraint. In this paper,we consider an empirical risk minimization method to solve a multi-category problem,which is of the following general form:ˆf n =arg minf∈C n1nni=1ΨY i(f(X i)).(3)As we shall see later,this method is a natural generalization of the binary classification method(1).Note that one may consider an even more general form withΨY(f(X))re-placed byΨY(f(X),X),which we don’t study in this paper.From the standard learning theory,one can expect that with appropriately chosen C n,the solutionˆf n of(3)approximately minimizes the true risk R(ˆf)with respect to the unknown underlying distribution within the function class C n,R(f)=E X,YΨY(f(X))=E X L(P(·|X),f(X)),(4) where P(·|X)=[P(Y=1|X),...,P(Y=K|X)]is the conditional probability,andL(q,f)=Kc=1q cΨc(f).(5)In order to understand the large sample behavior of the algorithm based on solving(3),we first need to understand the behavior of a function f that approximately minimizes R(f). 
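Before the formal treatment, it may help to see one concrete instance of (3). The sketch below uses the decoupled unconstrained choice examined later in Section 3.2, Ψ_c(f) = −f_c + Σ_k ln(1 + e^{f_k}), for which the optimality condition gives the probability model q_c = (1 + e^{−f_c})^{−1}. The linear function class, synthetic data, and optimizer settings are illustrative assumptions and are not part of the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def empirical_risk(W, X, y):
        # (1/n) sum_i Psi_{Y_i}(f(X_i)) with Psi_c(f) = -f_c + sum_k log(1 + exp(f_k))
        F = X @ W.T                                   # n x K matrix of scores f_k(X_i)
        return np.mean(-F[np.arange(len(y)), y] + np.logaddexp(0.0, F).sum(axis=1))

    def fit(X, y, K, lr=0.1, steps=3000):
        n, d = X.shape
        W = np.zeros((K, d))                          # one linear score per class, no coupling constraint
        Y = np.eye(K)[y]                              # one-hot labels, n x K
        for _ in range(steps):
            F = X @ W.T
            G = (sigmoid(F) - Y).T @ X / n            # gradient of the empirical risk with respect to W
            W -= lr * G
        return W

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        K, n = 3, 300
        centers = rng.normal(scale=3.0, size=(K, 2))
        y = rng.integers(0, K, size=n)
        X = centers[y] + rng.normal(size=(n, 2))
        W = fit(X, y, K)
        F = X @ W.T
        print("empirical risk:", empirical_risk(W, X, y))
        print("training error:", np.mean(F.argmax(axis=1) != y))
        # unnormalized conditional probability estimates q_c(x) = sigmoid(f_c(x))
        print("q(x) for the first example:", np.round(sigmoid(F[0]), 3))

Because the second term does not couple the components through a constraint, each f_c could equally well be fit separately; this is the computational advantage of the decoupled formulations discussed in Section 3.2.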
We introduce the following definition(also referred to as classification calibrated in[1]): Definition2.1ConsiderΨc(f)in(4).We say that the formulation is admissible(clas-sification calibrated)on a closed setΩ⊆[−∞,∞]K if the following conditions hold:∀c,Ψc(·):Ω→(−∞,∞]is bounded below and continuous;∩c{f:Ψc(f)<∞}is non-empty and dense inΩ;∀q,if L(q,f∗)=inf f L(q,f),then f∗c=sup k f∗k implies q c=sup k q k.Since we allowΨc(f)=∞,we use the convention that q cΨc(f)=0when q c=0and Ψc(f)=∞.The following result relates the approximate minimization of theΨrisk to the approximate minimization of classification error:Theorem2.1Let B be the set of all Borel measurable functions.For a closed setΩ⊂[−∞,∞]K,let BΩ={f∈B:∀x,f(x)∈Ω}.IfΨc(·)is admissible onΩ,then for a Borel measurable distribution,R(f)→inf g∈BΩR(g)impliesℓ(f)→inf g∈Bℓ(g).Proof Sketch.First we show that the admissibility implies that∀ǫ>0,∃δ>0such that∀q and x:infq c≤sup k q k−ǫ{L(q,f):f c=supkf k}≥infg∈ΩL(q,g)+δ.(6)If(6)does not hold,then∃ǫ>0,and a sequence of(c m,f m,q m)with f m∈Ωsuch that f m c m=sup k f m k,q m c m≤sup k q m k−ǫ,and L(q m,f m)−inf g∈ΩL(q m,g)→0.Taking a limit point of(c m,f m,q m),and using the continuity ofΨc(·),we obtain a contradiction (technical details handling the infinity case are skipped).Therefore(6)must be valid. Now we consider a vector function f(x)∈ΩB.Let q(x)=P(·|x).Given X,if P(Y= T(f(X))|X)≥P(Y=T(q(X))|X)+ǫ,then equation(6)implies that L(q(X),f(X))≥inf g∈ΩL(q(X),g)+δ.Thereforeℓ(f)−infg∈Bℓ(g)=E X[P(Y=T(q(X))|X)−P(Y=T(f(X))|X)]≤ǫ+E X I(P(Y=T(q(X))|X)−P(Y=T(f(X))|X)>ǫ)≤ǫ+E X L X(q(X),f(X))−inf g∈BΩL X(q(X),g)δ=ǫ+R(f)−inf g∈BΩR(g)δ.In the above derivation we use I to denote the indicator function.Sinceǫandδare arbitrary, we obtain the theorem by lettingǫ→0.2Clearly,based on the above theorem,an admissible risk minimization formulation is suit-able for multi-category classification problems.The classifier obtained from minimiz-ing(3)can approach the Bayes error rate if we can show that with appropriately chosen function class C n,approximate minimization of(3)implies approximate minimization of(4).Learning bounds of this forms have been very well-studied in statistics and ma-chine learning.For example,for large margin binary classification,such bounds can be found in[4,7,8,10,11,1],where they were used to prove the consistency of various large margin methods.In order to achieve consistency,it is also necessary to take a se-quence of function classes C n(C1⊂C2⊂···)such that∪n C n is dense in the set of Borel measurable functions.The set C n has the effect of regularization,which ensures that R(ˆf n)≈inf f∈CnR(f).It follows that as n→∞,R(ˆf n)P→inf f∈B R(f).Theorem2.1 then implies thatℓ(ˆf n)P→inf f∈Bℓ(f).The purpose of this paper is not to study similar learning bounds that relate approximate minimization of(3)to the approximate minimization of(4).See[2]for a recent investi-gation.We shall focus on the choices ofΨthat lead to admissible formulations.We pay special attention to the case that eachΨc(f)is a convex function of f,so that the resulting formulation becomes computational more tractable.Instead of working with the general form ofΨc in(4),we focus on two specific choices listed in the next two sections.3Unconstrained formulationsWe consider unconstrained formulation with the following choice ofΨ:Ψc(f)=φ(f c)+s K k=1t(f k) ,(7) whereφ,s and t are appropriately chosen functions that are continuously differentiable.Thefirst term,which has a relatively simple form,depends on the label c.The second term is 
independent of the label,and can be regarded as a normalization term.Note thatthis function is symmetric with respect to components of f.This choice treats all potential classes equally.It is also possible to treat different classes differently(e.g.replacingφ(f c) byφc(f c)),which can be useful if we associate different classification loss to different kinds of errors.3.1Optimality equation and probability modelUsing(7),the conditional true risk(5)can be written as:L(q,f)=Kc=1q cφ(f c)+s K c=1t(f c) .In the following,we study the property of the optimal vector f∗that minimizes L(q,f) for afixed q.Given q,the optimal solution f∗of L(q,f)satisfies the followingfirst order condition:q cφ′(f∗c)+µf∗t′(f∗c)=0(c=1,...,K).(8) where quantityµf∗=s′( K k=1t(f∗k))is independent of k.Clearly this equation relates q c to f∗c for each component c.The relationship of q and f∗defined by(8)can be regarded as the(infinite sample-size)probability model associated with the learning method(3)withΨgiven by(7).The following result presents a simple criterion to check admissibility.We skip the proof for simplicity.Most of our examples satisfy the condition.Proposition3.1Consider(7).AssumeΦc(f)is continuous on[−∞,∞]K and bounded below.If s′(u)≥0and∀p>0,pφ′(f)+t′(f)=0has a unique solution f p that is an increasing function of p,then the formulation is admissible.If s(u)=u,the condition∀p>0in Proposition3.1can be replaced by∀p∈(0,1).3.2Decoupled formulationsWe let s(u)=u in(7).The optimality condition(8)becomesq cφ′(f∗c)+t′(f∗c)=0(c=1,...,K).(9) This means that we have K decoupled equalities,one for each f c.This is the simplest and in the author’s opinion,the most interesting formulation.Since the estimation problem in (3)is also decoupled into K separate equations,one for each component ofˆf n,this class of methods are computationally relatively simple and easy to parallelize.Although this method seems to be preferable for multi-category problems,it is not the most efficient way for two-class problem(if we want to treat the two classes in a symmetric manner)since we have to solve two separate equations.We only need to deal with one equation in(1)due to the fact that an effective constraint f1+f2=0can be used to reduce the number of equations.This variable elimination has little impact if there are many categories.In the following,we list some examples of multi-category risk minimization formulations. 
They all satisfy the admissibility condition in Proposition3.1.We focus on the relationship of the optimal optimizer function f∗(q)and the conditional probability q.For simplicity, we focus on the choiceφ(u)=−u.3.2.1φ(u)=−u and t(u)=e uWe obtain the following probability model:q c=e f∗c.This formulation is closely related to the maximum-likelihood estimate with conditional model q c=e f c/ K k=1e f k(logisticregression).In particular,if we choose a function class such that the normalization condi-tion K k=1e f k=1holds,then the two formulations are identical.However,they become different when we do not impose such a normalization condition.Another very important and closely related formulation is the choice ofφ(u)=−ln u and t(u)=u.This is an extension of maximum-likelihood estimate with probability model q c=f c.The resulting method is identical to maximum-likelihood if we choose our function class such that k f k=1.However,the formulation also allows us to use function classes that do not satisfy the normalization constraint k f k=1.Therefore this method is moreflexible.3.2.2φ(u)=−u and t(u)=ln(1+e u)This version uses binary logistic regression loss,and we have the following probability model:q c=(1+e−f∗c)−1.Again this is an unnormalized model.3.2.3φ(u)=−u and t(u)=1p|u|p(p>1)We obtain the following probability model:q c=sign(f∗c)|f∗c|p−1.This means that at the solution,f∗c≥0.One may modify it such that we allow f∗c≤0to model the condition probability q c=0.3.2.4φ(u)=−u and t(u)=1pmax(u,0)p(p>1)In this probability model,we have the following relationship:q c=max(f∗c,0)p−1.The equation implies that we allow f∗c≤0to model the conditional probability q c=0.There-fore,with afixed function class,this model is more powerful than the previous one.How-ever,at the optimal solution,f∗c≤1.This requirement can be further alleviated with the following modification.3.2.5φ(u)=−u and t(u)=1pmin(max(u,0)p,p(u−1)+1)(p>1)In this probability model,we have the following relationship at the exact solution:q c= min(max(f c∗,0),1)p−1.Clearly this model is more powerful than the previous model since the function value f∗c≥1can be used to model q c=1.3.3Coupled formulationsIn the coupled formulation with s(u)=u,the probability model can be normalized in a certain way.We list a few examples.3.3.1φ(u)=−u,and t(u)=e u,and s(u)=ln(u)This is the standard logistic regression model.The probability model is:q c(x)=exp(f∗c(x))(Kc=1exp(f∗c(x)))−1.The right hand side is always normalized(sum up to1).Note that the model is not contin-uous at infinities,and thus not admissible in our definition.However,we may consider the regionΩ={f:sup k f k=0},and it is easy to check that this model is admissible inΩ. Let fΩc=f c−sup k f k∈Ω,then fΩhas the same decision rule as f and R(f)=R(fΩ). 
Therefore Theorem2.1implies that R(f)→inf g∈B R(g)impliesℓ(f)→inf g∈Bℓ(g).3.3.2φ(u)=−u,and t(u)=|u|p′,and s(u)=1p|u|p/p′(p,p′>1) The probability model is:q c(x)=(Kk=1|f∗k(x)|p′)(p−p′)/p′sign(f∗c(x))|f∗c(x)|p′−1.We may replace t(u)by t(u)=max(0,u)p,and the probability model becomes:q c(x)=(Kk=1max(f∗k(x),0)p′)(p−p′)/p′max(f∗c(x),0)p′−1.These formulations do not seem to have advantages over the decoupled counterparts.Notethat if we let p→1,then the sum of the p′p′−1-th power of the right hand side→1.In away,this means that the model is normalized in the limit of p→1.4Constrained formulationsAs pointed out,one may impose constraints on possible choices of f.We may impose such a condition when we specify the function class C n.However,for clarity,we shall directly impose a condition into our formulation.If we impose a constraint into(7),then its effect is rather similar to that of the second term in(7).In this section,we consider a direct extension of binary large-margin method(1)to multi-category case.The choice given below is motivated by[5],where an extension of SVM was proposed.We use a risk formulation that is different from(7),and for simplicity,we will consider linear equality constraint only:Ψc(f)=Kk=1,k=cφ(−f k),s.t.f∈Ω,(10)where we defineΩas:Ω={f:Kk=1f k=0}∪{f:sup k f k=∞}.We may interpret the added constraint as a restriction on the function class C n in(3)such that every f∈C n satisfies the constraint.Note that with K=2,this leads to the usually binary large margin ing(10),the conditional true risk(5)can be written as:L(q,f)=Kc=1(1−q c)φ(−f c),s.t.f∈Ω.(11)The following result provides a simple way to check the admissibility of(10). Proposition4.1Ifφis a convex function which is bounded below andφ′(0)<0,then(10) is admissible onΩ.Proof Sketch.The continuity condition is straight-forward to verify.We may also assume thatφ(·)≥0without loss of generality.Now let f achieves the minimum of L(q,·).If f c=∞,then it is clear that q c=1and thus q k=0for k=c.This implies that for k=c,φ(−f k)=inf fφ(−f),and thus f k<0.If f c=sup k f k<∞,then the constraint implies f c≥0.It is easy to see that∀k,q c≥q k since otherwise,we must have φ(−f k)>φ(−f c),and thusφ′(−f k)>0andφ′(−f c)<0,implying that with sufficient smallδ>0,φ(−(f k+δ))<φ(−f k)andφ(−(f c−δ))<φ(−f c).A contradiction.2Using the above criterion,we can convert any admissible convexφfor the binary formula-tion(1)into an admissible multi-category classification formulation(10).In[5]the special case of SVM(with loss functionφ(u)=max(0,1−u))was studied.The authors demonstrated the admissibility by direct calculation,although no results similar to Theorem2.1were established.Such a result is needed to prove consistency.The treatment presented here generalizes their study.Note that for the constrained formulation,it is more difficult to relate f c at the optimal solution to a probability model,since such a model will have a much more complicated form compared with the unconstrained counterpart.5ConclusionIn this paper we proposed a family of risk minimization methods for multi-category classi-fication problems,which are natural extensions of binary large margin classification meth-ods.We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit.Two specific forms of risk minimization were pro-posed and examples were given to study the induced probability models.As an implication of this work,we see that it is possible to obtain consistent(conditional)density estimation using various 
non-maximum-likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near-zero density may cause serious robustness problems, at least in theory.

References

[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003.
[2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003.
[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion.
[4] W. Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32, 2004. With discussion.
[5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 2002. Accepted.
[6] Yi Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002.
[7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. With discussion.
[8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification: consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003.
[9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[10] Ingo Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18:768–791, 2002.
[11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32, 2004. With discussion.

Classes = Objects + Data Abstraction

In formal studies of object systems, such as [AC94, Bru93, FHM94, PT94] and the earlier papers appearing in [GM94], types are viewed as interfaces to objects. This means that the type of an object lists the operations on the object, generally as method names and return types, but does not restrict its implementation. As a result, objects of the same type may have arbitrarily different internal representations. In contrast, the type of an object in common practical object-oriented languages such as Eiffel [Mey92] and C++ [Str86, ES90] may impose some implementation constraints. In particular, although the "private" internal data of an object is not accessible outside the member functions of the class, all objects of the same class must have all of the private internal data listed in the class declaration. In this paper, we present a type-theoretic framework that incorporates both forms of type. We first explain the basic principles by extending a core object calculus (developed for this purpose but previously described in [FM95]) with a standard higher-order data abstraction mechanism as in [MP88, CW85]. Then we devise a special-purpose syntax
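As a loose illustration of the distinction, outside the paper's object calculus: an interface-style type constrains only the operations an object offers, so implementations with different private representations inhabit the same type, whereas a class used as a type also fixes the representation. The Python sketch below, with invented names, is only an analogy to the two forms of type discussed above.

    from typing import Protocol

    class Counter(Protocol):
        # an interface-style type: it lists operations (names and result types)
        # but says nothing about the internal representation of implementing objects
        def get(self) -> int: ...
        def inc(self) -> None: ...

    class IntCounter:
        # one implementation: stores the count directly
        def __init__(self) -> None:
            self._n = 0
        def get(self) -> int:
            return self._n
        def inc(self) -> None:
            self._n += 1

    class LogCounter:
        # a different implementation with a different private representation
        def __init__(self) -> None:
            self._events = []          # list of recorded events
        def get(self) -> int:
            return len(self._events)
        def inc(self) -> None:
            self._events.append("tick")

    def bump_twice(c: Counter) -> int:
        # client code written against the interface accepts either representation
        c.inc()
        c.inc()
        return c.get()

    # a class used as a type, by contrast, fixes the private data: every IntCounter
    # carries exactly the fields created in its class definition
    print(bump_twice(IntCounter()), bump_twice(LogCounter()))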

CONNECTIVITY–PRESERVING TRANSFORMATIONS OF BINARY IMAGES

We say that the interchange p, q is 4-local (respectively, 8-local) if p and q are adjacent in G4 (respectively, G8). In this paper we are primarily concerned with 8-local interchanges, and we are interested in whether two images with the same number of black pixels differ by a sequence of connectivity-preserving interchanges. More precisely, we say that two Ba, Wb-connected images I and J are (a, b)-IP-equivalent [8] if there exists a sequence of images I0 = I, I1, ..., Ir = J such that each Ii is Ba, Wb-connected and Ii can be converted into Ii+1 by a single (8-local) interchange.

[Figure 1: (a) a binary image I, (b) the graphs B4(I) and W8(I), (c) the graphs B8(I) and W4(I), (d) the graphs B4(I) and W4(I), and (e) the graphs B8(I) and W8(I).]
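A small computational sketch of these notions, assuming a binary image is stored as a 0/1 array with 1 for black: it tests whether a single 8-local interchange of a black pixel p and a white pixel q leaves the image Ba, Wb-connected by recounting connected components before and after the swap. The labeling routine and the example image are illustrative choices, not taken from the paper.

    import numpy as np
    from scipy import ndimage

    FOUR = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])   # structuring element for 4-adjacency
    EIGHT = np.ones((3, 3), dtype=int)                    # structuring element for 8-adjacency

    def num_components(mask, adjacency):
        _, num = ndimage.label(mask, structure=adjacency)
        return num

    def is_connected(image, a=4, b=8):
        # Ba, Wb-connectedness: the black pixels form one a-connected set
        # and the white pixels form one b-connected set
        black_ok = num_components(image == 1, FOUR if a == 4 else EIGHT) <= 1
        white_ok = num_components(image == 0, FOUR if b == 4 else EIGHT) <= 1
        return black_ok and white_ok

    def interchange_preserves(image, p, q, a=4, b=8):
        # interchange a black pixel p with a white pixel q; 8-local means p, q are 8-adjacent
        assert image[p] == 1 and image[q] == 0
        assert max(abs(p[0] - q[0]), abs(p[1] - q[1])) == 1
        swapped = image.copy()
        swapped[p], swapped[q] = 0, 1
        return is_connected(image, a, b) and is_connected(swapped, a, b)

    if __name__ == "__main__":
        I = np.array([[0, 0, 0, 0],
                      [0, 1, 1, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 0]])
        # slide the lower black pixel one step to the right
        print(interchange_preserves(I, p=(2, 1), q=(2, 2)))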

A Dichotomy Theorem for Constraints on a Three-Element Set

A Dichotomy Theorem for Constraints on a Three-Element SetAndrei A.BulatovComputing Laboratory,University of Oxford,Oxford,UKE-mail:Andrei.Bulatov@AbstractThe Constraint Satisfaction Problem(CSP)provides a common framework for many combinatorial problems.The general CSP is known to be NP-complete;however,certain restrictions on the possible form of constraints may affect the complexity,and lead to tractable problem classes.There is,therefore,a fundamental research direction,aiming to separate those subclasses of the CSP which are tractable, from those which remain NP-complete.In1978Schaefer gave an exhaustive solution of this problem for the CSP on a2-element domain.In this paper we generalise this result to a classification of the complex-ity of CSPs on a3-element domain.The main result states that every subclass of the CSP defined by a set of allowed constraints is either tractable or NP-complete,and the cri-terion separating them is that conjectured in[6,8].We also exhibit a polynomial time algorithm which,for a given set of allowed constraints,determines whether if this set gives rise to a tractable problem class.To obtain the main result and the algorithm we extensively use the algebraic technique for the CSP developed in[17]and[6,8].1.IntroductionIn the Constraint Satisfaction Problem(CSP)[24]we aim tofind an assignment to a set of variables subject to specified constraints.Many combinatorial problems ap-pearing in computer science and artificial intelligence can be expressed as particular subclasses of the CSP.The stan-dard examples include the propositional satisfiability prob-lem,in which the variables must be assigned Boolean val-ues,graph colorability,scheduling problems,linear systems and many others.One advantage of considering a com-mon framework for all of these diverse problems is that it makes it possible to obtain generic structural results con-cerning the computational complexity of constraint satis-faction problems that can be applied in many different areas such as database theory[21,33],temporal and spatial rea-soning[30],machine vision[24],belief maintenance[11],technical design[26],natural language comprehension[1], programming language analysis[25],etc.The general CSP is NP-complete;however,certain re-strictions on the allowed form of the constraints involved may ensure tractability.Therefore,one of the main ap-proaches in the study of the CSP is identifying tractable subclasses of the general CSP obtained in this way[14,15, 9,17,29].Developments in this direction provide an ef-ficient algorithm solving a particular problem,if the prob-lem falls in one of the known tractable subclasses,or as-sist in speeding up of general superpolynomial algorithms [12,13,22].To formalize the idea of restricting the allowed constraints,we make use of the notion of a constraint lan-guage[16],which is simply a set of possible relations that can be used to specify constraints in a problem.We say that a constraint language is tractable[intractable]if the corresponding problem class is tractable[intractable.The ultimate goal of this research direction is tofind the pre-cise border between tractable and intractable constraint lan-guages.This goal was achieved by Schaefer[29]in the important case of Boolean constraints;he has characterised tractable constraint languages on a2-element set,and proved that the rest are NP-complete.This Schaefer’s result is known as Shaefer’s Dichotomy Theorem.Dichotomy theorems are of particular interests in study of the CSP,because,on the one hand,they 
determine the precise complexity of constraint languages,and on the other hand,the a priori existence of a dichotomy result cannot be taken for granted.For a short survey of dichotomy results the reader is referred to[9].The analogous problem,which is referred to as the clas-sification problem,for the CSP in which the variables can be assigned more than2values,remains open since1978, in spite of intensive efforts.For instance,Feder and Vardi, in[14],used database technique and group theory to iden-tify some large tractable families of constraints;Jeavons and coauthors have characterised many tractable and NP-complete constraint languages using invariance properties of constraints[17,18,19];in[8],a possible form of a di-chotomy result for the CSP onfinite domains was conjec-tured;in[7],a dichotomy result was proved for a certaintype of constraint languages on a3-element domain.In this paper we generalise the results of[29]and[7],and provethe dichotomy conjecture from[8]for the constraint sat-isfaction problem on a3-element domain.In particular, we completely characterise tractable constraint languagesin this case,and prove that the rest are NP-complete.The main result will be precisely stated at the end of Section2.The classification problem for constraint languages on aset containing more than2elements,even on a3-element set,turns out to be much harder than that for the2-elementcase.Besides the obvious reason that Boolean CSPs closely relate to various problems from propositional logic,and therefore,are much better investigated,there is another deepreason.As is showed in[19,20,17],when studying the complexity of constraint languages we may restrict our-selves with a certain class of languages,so called relationalclones.There are only countably many relational clones on a2-element set,and all of them are known[28].However,the class of relational clones on a3-element set already con-tains continuum many elements,and any its explicit charac-terization is believed to be unreachable.Another problem tackled here is referred to,in[9],as themeta-problem:given a constraint language determine if this language gives rise to a tractable problem class.Making useof the dichotomy theorem obtained we exhibit an effective algorithm solving the meta-problem for the CSP on a3-element domain.The technique used in this paper relies upon the idea,that was developed in[8,6,17](and also mentioned in[14]as a possible direction for future research),that algebraic in-variance properties of constraints can be used for studying the complexity of the corresponding constraint satisfaction problems.The main advantage of this technique is that itallows us to employ structural results from universal alge-bra.The algebraic approach has proved to be very fruitfulin identifying tractable classes of the CSP[2,4,18].We strongly believe that the synthesis between complexity the-ory and universal algebra which we describe here is likelyto lead to new results in bothfields.2.Algebraic structure of CSP classes2.1.The Constraint Satisfaction ProblemThe set of all-tuples with components from a set isdenoted.The th component of a tuple will be denoted .Any subset of is called an-ary relation on;and a constraint language on is an arbitrary set offinitaryrelations on.Definition1The constraint satisfaction problem()over a constraint language,denoted,is defined to be the decision problem with instance,where Instance:is a set of variables;is a set of values(some-times called a domain);and is a set of constraints,,in which the constraint is 
a pair with is a tuple of variables of length, called the constraint scope,and an-ary re-lation on,called the constraint relation. Question:is whether there exists a solution to, that is,a function from to,such that,for each constraint in,the image of the constraint scope is a member of the constraint relation.We shall be concerned with distinguishing between those constraint languages which give rise to tractable problems (i.e.,problems for which there exists a polynomial-time so-lution algorithm),and those which do not.Definition2A constraint language,is said to be tractable,if is tractable for eachfinite subset .It is said to be NP-complete,if is NP-complete for somefinite subset.By a Boolean constraint language we mean a constraint language on a2-element set,usually.In[29],Schae-fer has classified Boolean constraint languages with re-spect to complexity.This result is known as Schaefer’s Di-chotomy theorem.Theorem1(Schaefer,[29])A constraint language,,on is tractable if one of the following conditions holds:(1)every in contains;(2)every in contains;(3)every in is definable by a CNF formula in whicheach clause has at most one negated variable;(4)every in is definable by a CNF formula in whicheach clause has at most one unnegated variable; (5)every in is definable by a CNF formula in whicheach clause has at most two literals;(6)every in is the solution space of a linear systemover.Otherwise is NP-complete.More examples of both tractable and NP-complete con-straint languages will appear later in this paper and can also be found in[8,10,14,19].It follows from Theorem1 that every Boolean constraint language is either tractable or NP-complete;and so,there is no language of intermedi-ate complexity.Some dichotomy results have been obtained for other variations of Boolean CSP[9].The classification problem for larger domains is still open and seems to be very interesting and hard[14].Problem1(classification problem)Characterise all trac-table constraint languages onfinite domains.2.2.Algebraic structure of problem classesSchaefer’s technique heavily uses the natural representa-tion of Boolean relations by propositional formulas.Such a representation does not exist for larger domains.Instead, we shall use algebraic properties of relations.In our alge-braic definitions we mainly follow[23].Definition3An algebra is an ordered pairsuch that is a nonempty set and is a family offinitary operations on.The set is called the universe of,the operations from are called basic.An algebra with afinite universe is referred to as afinite algebra.Every constraint language on a set can be assigned an algebra with the universe.Definition4An-ary operation preserves an-ary re-lation(or is a polymorphism of,or is invariant un-der)if,for any, the tuple belongsto as well.The set of all polymorphisms of a constraint language is denoted;and the set of all relations invariant under all operations from a set is denoted.Given a constraint language,,on,the algebra is called the algebra associated with,and is denoted.Conversely,for anyfinite algebra,there is a constraint language associated with,the language, and the associated problem class. 
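Definition 4 can be checked mechanically on a small domain: apply the operation coordinate-wise to every choice of tuples from the relation and test whether the result stays in the relation. The sketch below uses invented example relations on the 3-element domain {0, 1, 2} and is only meant to make the definition concrete.

    from itertools import product

    def preserves(op, arity, relation):
        # 'relation' is a set of h-tuples over the domain; 'op' is an n-ary operation.
        # op preserves the relation iff applying it coordinate-wise to any n tuples
        # of the relation yields a tuple that is again in the relation.
        h = len(next(iter(relation)))
        for rows in product(relation, repeat=arity):
            image = tuple(op(*(rows[i][j] for i in range(arity))) for j in range(h))
            if image not in relation:
                return False
        return True

    def minimum(x, y):
        # a semilattice operation on the ordered domain {0, 1, 2}
        return min(x, y)

    disequality = {(a, b) for a in range(3) for b in range(3) if a != b}
    order = {(a, b) for a in range(3) for b in range(3) if a <= b}

    # min applied to (0,1) and (1,0) gives (0,0), which is not in the disequality relation
    print(preserves(minimum, 2, disequality))   # False
    # min is monotone, so the order relation is invariant under it
    print(preserves(minimum, 2, order))         # True

Relations invariant under such a semilattice operation fall into the tractable cases discussed below (Proposition 1).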
Notice that is the largest constraint language such that,see,e.g.[27,31].A connection between the complexity of a constraint language and the associated algebra is provided by the following theorem.Theorem2([17])A constraint language on afinite set is tractable[NP-complete]if and only if is tractable[NP-complete].Informally speaking,Theorem2says that the complexity of is determined by the algebra.We,therefore,make the following definition:for a constraint language,the alge-bra is said to be tractable[NP-complete]if is tractable [NP-complete].In[18,19],Jeavons and coauthors have identified cer-tain types of algebras which give rise to tractable problem classes.Definition5Let be afinite set.An operation on is calleda projection if there is such thatfor any;essentially unary if,for some unary operation,and any;a constant operation if there is such that,for any;idempotent if for any.a semilattice operation1,if it is binary idempotent andfor any satisfies the following two condi-tions:(a)(Associativity),(b)(Commutativity);a majority operation if it is ternary,and,for any;affine if where are the operations of an Abelian group.For afinite algebra,an operation from is said to be a term operation2of.If is a constraint language, the term operations of are the polymorphisms of. Proposition1([18,19])If afinite algebra has a term operation which is constant,semilattice,affine,or major-ity,then is tractable.The2-element algebras associated with Schaefer’s six types of constraint languages have the constant term operation or in cases(1),(2);the semilattice term operation or in cases(3),(4);the majority term operationin case(5);and the affine term operation in case(6).An algebra is said to be a-set if every term operation is essentially unary and the corresponding unary operation is a permutation.Proposition2([18,19])Afinite-set is NP-complete. By combining those two results,and the classical result of E.Post[28],the algebraic version of Schaefer’s theorem can be derived[8].Theorem3(Schaefer)A constraint language on a2-element set is tractable if is not a-set.Otherwise is NP-complete.2.3.Algebraic constructions and the complexity ofconstraint languagesCertain transformations of constraint languages preserve the complexity,and may lead to languages with certain de-sirable properties.Let be a constraint language on,anda unary polymorphism of such that. By we denote the set where;and by the set.If is a unary operation whose range is minimal among ranges of unary operations from with the property, then the constraint language will be denoted. Proposition3([8])Let be a constraint language on, and a unary operation on with a minimal range and such that.Then is tractable [NP-complete]if and only if is tractable[NP-complete]. 
If and satisfy the conditions of Proposition3then the algebra is idempotent,that is,all its basic operations are idempotent.The complexity of the constraint language does not depend on the choice of,and we shall denote every such language by.Due to Theorem2and Proposition3the study of the complexity of constraint languages is completely reduced to the study of properties of idempotent algebras.Definition6Let be an algebra,and a sub-set of such that,for any(-ary),and for any,we have.Then the al-gebra consists of restrictions of operations from onto,is called a subalgebra of.The universe of a subalgebra of is called a subuniverse of.An equivalence relation is said to be a con-gruence of.The-class containing is denoted, the set is said to be the factor-set, and the algebra,where,is said to be thefactor-algebra.Proposition4([8])Let be a tractable constraint lan-guage on,a subuniverse of,and an equiva-lence relation invariant under.Then(1)the subalgebra,for a natural number,we will denote the set.Let be an-ary relation,and;Definition7The algebrasatisfies the partial zero property if there exist a set of its subuniverses,,and for each,such that(a);(b)for any relation,and any,there is withifotherwisesatisfies the splitting property if any(-ary)relationcan be represented in the form, contains the tuple withifif andif andsatisfies the-semisplitting property if,for any ir-reducible(-ary)relation,we have(i)procedure until the instance stays unchanged:solve all re-stricted problems involving variables,and then remove from each constraint all tuples such thatis a part of no partial solution for a certain-element set of variables.This procedure is called‘establishing-minimality’,and is said to be the-minimal instance associated with.Definition9A class of constraint satisfaction problems is said to be of width3if any problem instance from has a solution if and only if the-minimal problem associ-ated with contains no empty constraint.Every class offinite width is tractable,because,assuming fixed,establishing-minimality takes polynomial time. 
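For the case k = 1, establishing k-minimality reduces to a simple propagation loop: repeatedly discard constraint tuples that use an eliminated value and domain values that no longer occur in some constraint on that variable. The brute-force sketch below illustrates this on an invented instance over a 3-element domain; for larger k one would instead solve all restricted subproblems on k variables, as described above.

    def establish_1_minimality(domains, constraints):
        # domains: dict mapping each variable to its set of values
        # constraints: list of (scope, tuples) pairs, where scope is a tuple of
        # variables and tuples is the set of allowed value tuples for that scope
        changed = True
        while changed:
            changed = False
            for scope, tuples in constraints:
                # drop tuples that use a value no longer in some variable's domain
                alive = {t for t in tuples if all(t[i] in domains[v] for i, v in enumerate(scope))}
                if alive != tuples:
                    tuples.intersection_update(alive)
                    changed = True
                # drop domain values with no supporting tuple in this constraint
                for i, v in enumerate(scope):
                    supported = {t[i] for t in tuples}
                    if not domains[v] <= supported:
                        domains[v] &= supported
                        changed = True
        return domains, constraints

    if __name__ == "__main__":
        domains = {v: {0, 1, 2} for v in "xyz"}
        disequality = {(a, b) for a in range(3) for b in range(3) if a != b}
        constraints = [(("x", "y"), set(disequality)),
                       (("y", "z"), set(disequality)),
                       (("x", "z"), {(0, 1), (0, 2)})]
        doms, _ = establish_1_minimality(domains, constraints)
        # the third constraint forces x = 0, and the pruning propagates to y and z
        print(doms)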
4.2.Multi-sorted constraint satisfaction problemsIn[5],an algebraic approach to a generalised version of the constraint satisfaction problem was developed.In this generalised version every variable is allowed to have its own domain.In this paper we need the notion of multi-sorted constraint satisfaction problem,and some results from[5] as an auxiliary tool.Definition10For any collection of sets, and any list of indices,a subset of,together with the list, will be called an(-ary)relation over with signature.For any such relation,the th compo-nent of the signature of will be denoted.Definition11The multi-sorted constraint satisfaction problem is the combinatorial decision problem with Instance:a quadruple where is a set of vari-ables;is a collection of sets of values[domains];is a mapping from to, called the domain function;is a set of constraints.Each constraint is a pair,whereis a tuple of variables of length, called the constraint scope;is an-ary relation over with signature,called the constraint relation.Question:does there exist a solution,i.e.a function, from to,such that,for each variable,,and for each constraint, with,the tuplebelongs to?It is possible to introduce the algebraic structure of the multi-sorted CSP in a very similar way to the usual one.Corollary1If a3-element idempotent algebra satisfies (N O-G-S ET)then any2-valued problem instance from can be solved in polynomial time.Most of the‘good’properties of relations allow us,first,to reduce an arbitrary problem instance to a2-valued problem instance,and,second,to solve the obtained instance as amulti-sorted problem instance by making use of the algo-rithms from[5].4.3.Why‘good’properties are good4.3.1.Relations invariant with respect to a special oper-ation.In condition(8)of Theorem5,the tractability offollows from Proposition1.In(9),is of width3,as is proved in[4].The result of[2]states that anyfinite alge-bra with a Mal’tsev operation is tractable,and the solutionalgorithm is very similar to algorithms from linear algebra.4.3.2.The partial zero property.In this case any prob-lem instance can be reduced to a2-valued one.Indeed,if satisfies the partial zero property,and a1-minimal probleminstance has a solution,then also has the solution such that if, and otherwise.Thus,to solve we assignthe value to each variable with. Since,the obtained problem instance is2-valued.4.3.3.The replacement property.In this case any prob-lem instance can also be reduced to a2-valued one.If satisfies the-replacement property,and a1-minimal problem instance has a solu-tion,then the mapping such thatif,,and otherwise, is a solution to.We therefore,may reduce to a2-valued problem instance where for eachthere is such that if and only if and whenever.4.3.4.The extendibility property.We prove that in this case is of width3.Suppose that satisfies the -extendibility property,and take a3-minimal problem in-stance.An easy proof of the following lemma is left to the reader.Lemma1Let.There is such that,for any.Finally,the-extendibility property of implies that the mapping whereifif oris a solution to.4.3.5.Rectangularity and semirectangularity.Suppose that is-rectangular or-semirectangular,and.We show that any problem instance in this case can be reduced to a2-valued one.Take a problem instance.Without loss of generality we may assume that is3-minimal.Let denote the set of all variables with.Let be the equivalence relation on generated by. 
Notice that,since is3-minimal,for any,any such that,and any,ei-ther,or.Repeat the follow-ing procedure until the obtained problem instance coincides with the previous one.For each class of,solve the problemwhere,for each,we make the constraint.If,for a class of,the problem instancehas no solution then,for each constraint, remove from all the tuples such that for some.Replace the obtained problem instance with the asso-ciated3-minimal problem instance.Remove from those variables for which no longer equals or.Calculate the relation for the obtained problem instance and the set.Obviously,the obtained problem instance has a solution if and only if the original problem instance has.Supposefirst that is-rectangular.Then,if has no empty constraint,then there is a solution to such that whenever.Indeed,letbe the classes of,and a solution to,It is not hard to see that if has a solution,then has a solution(see also[4]).Let be a solution to,andsolutions to.The mapping whereifif,,andif,,andis a solution to.Indeed,take a constraint. Since for eachis the majority opera-tion,and satisfies the-semisplitting property.A prob-lem instance is said to be irreducible if every constraint relation is irreducible.Every3-minimal problem instancecan be reduced to an equivalent irreducible problem instance in polynomial time.Indeed,denote by the binary relation on such that if and only if is the graph of a bijec-tive mapping.Since is3-minimal,,for any,where denotes the composition of binary relations;hence,is an equivalence relation.Choose a representative from each class of,and let be the set of the representatives.Then, for no pair of variables,is the graph of a bijective mapping,and for any,there issuch that is the graph of a bijective mapping.We transform in three steps.For each constraint and any,replacewith where,andis the representative of the-class containing.For each constraint and each,re-place with.Replace every constraint with.Now,let be a3-minimal ir-reducible problem instance,and consider the instancewhere,for each,there iswith for all such that .The problem instance is2-valued,there-fore,we just have to show that and are equivalent.Clearly,if has a solution then has a solution.Con-versely,let have a solution,and set,and.By condition(i)of the definition of the semisplitting property,has a solution if and only if both and have.Since,the instance has a solution.By condition(ii),for any,where denotes the set of partial solutions to for.Moreover,since is3-minimal,for any every such partial solution can be extended to a solution from.(The last property is called strong2-consistency[18].)Recall that any relation such that is invariant with respect to a majority operation,in particular,all the constraint rela-tions of satisfy this condition.By Theorem3.5of[18], if for any then strong2-consistency ensures existence of a solution to.5.Recognising tractable casesFrom a practical perspective,we need a method that al-lows us to recognise if a given constraint language is tractable.The following problem is,therefore,very tempt-ing.T RACTABLE-L ANGUAGE.Is a givenfinite constraint lan-guage on afinite set tractable?Schaefer’s Dichotomy Theorem[29]does not solve this problem satisfactorily.Indeed,it can be easily verified if a relation satisfies conditions(1)or(2)of Theorem1,how-ever,the way of recognising if one of conditions(3)–(6) holds is not obvious(see also[21]).Theorem3,the alge-braic version of Schaefer’s result,fills this gap:to check the tractability of a Boolean constraint language one just has to check 
whether all relations from the language are invari-ant under one of the6Boolean operations corresponding to conditions(1)–(6).In the general case,such a method can hopefully be de-rived from a description of tractable algebras.For example, in[6],a polynomial time algorithm has been exhibited that checks if afinite algebra,whose basic operations are given explicitely by their operation tables,satisfies(N O-G-S ET). Therefore,if Conjecture1holds then the tractability of an algebra can be tested in polynomial time.In particular,this algorithm is valid in the case of3-element algebras.However,this algorithm does not solve T RACTABLE-L ANGUAGE even under the assumption of Conjecture1, because in this problem we are given a constraint language, not an algebra.Actually,we need to solve the problemN O-G-S ET-L ANGUAGE.Given afinite constraint language on afinite set,does the algebra satisfy(N O-G-S ET)?By the results of[6],this problem is NP-complete.How-ever,its restricted version remains tractable.N O-G-SET-L ANGUAGE().Given afinite constraint lan-guage on afinite set,,does the algebra satisfy(N O-G-S ET)?This means that the tractability of a constraint language on a3-element set can be tested in polynomial time.Theorem7There is a polynomial time algorithm that given a constraint language on a3-element set determines if is tractable.An example of such an algorithm is provided by the gen-eral algorithm from[6].That algorithm employs some deep algebraic results and sophisticated constructions.In the par-ticular case of a3-element domain,we may avoid using hard algebra,and apply a simpler and easier algorithm.To this end,notice that if a3-element idempotent algebra has a2-element subuniverse or a nontrivial congruence, and there is a term operation which is not a projection on the subalgebra or the factor-algebra,then witnesses that the algebra itself is also not a-set.We,therefore,have two cases to consider.C ASE1.has no2-element subuniverse,and no proper congruence.Such an algebra is said to be strictly simple.There is a com-plete description offinite strictly simple algebras[32].In particular,if a strictly simple algebra satisfies(N O-G-S ET) then one of the following operations is its term operation: a majority operation,the affine operation of an Abelian group,or the operationifotherwisefor some element.C ASE2.has either a2-element subalgebra,or a proper congruence.In this case,satisfies(N O-G-SET)if and only if every2-element subalgebra and every proper factor-algebra(which is also2-element)is not a-set.In its turn,the latter con-dition holds if and only if,for any2-element subuniverse of,and any congruence,there is a polymorphism of such that()is one of the Boolean operations,,,;–if not then output“NO”.Output“Yes”.This algorithm is polynomial time,because the hardest step,finding the set,requires inspecting of all ternary operations on a3-element set;and,since their number does not depend on,takes cubic time.Recognising which of the10properties a tractable al-gebra satisfies also can be completed in polynomial time. 
However,to establish this requires a more detailed study of the set of ternary polymorphisms,see[3].6ConclusionIn fact,Theorem5implies a stronger result than that claimed in Theorem4.The difference appears when consid-ering infinite constraint languages satisfying the conditions of Conjecture1.Theorem4claims that,for anyfinite sub-set of such a language,there is its own polynomial time algorithm solving,and for different sub-sets the corresponding algorithms might be quite different.Theorem5yields a uniform polynomial time algorithm that solves any problem from the class associated with the con-straint language.Moreover,from the proof of Theorem5a general algorithm can be derived,which solves any problem instance on a3-element set provided that, for some tractable.Note that Theorem4is proved by a‘brute force’method, that is,by analysing a large number of operations which provide the condition(N O-G-S ET).We believe that devel-opment of algebraic tools and more subtle usage of results from universal algebra will make it possible to obtain di-chotomy results for larger domains,and eventually,for an arbitraryfinite domain.References[1]J.Allen.Natural Language Understanding.Benjamin Cum-mihgs,1994.[2] A.Bulatov.Mal’tsev constraints are tractable.TechnicalReport PRG-RR-02-05,Computing Laboratory,University of Oxford,Oxford,UK,2002.[3] A.Bulatov.Tractable constraints on a three-element set.Technical Report PRG-RR-02-06,Computing Laboratory, University of Oxford,Oxford,UK,2002.[4] A.Bulatov and P.Jeavons.Tractable constraints closed un-der a binary operation.Technical Report PRG-TR-12-00, Computing Laboratory,University of Oxford,Oxford,UK, 2000.[5] A.Bulatov and P.Jeavons.Algebraic approach to multi-sorted constraints.Technical Report PRG-RR-01-18,Com-puting Laboratory,University of Oxford,Oxford,UK,2001.[6] A.Bulatov and P.Jeavons.Algebraic structures in combina-torial problems.Technical Report MATH-AL-4-2001,Tech-nische universit¨a t Dresden,Dresden,Germany,2001. 
[7] A. Bulatov, P. Jeavons, and A. Krokhin. The complexity of maximal constraint languages. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pages 667–674, Hersonissos, Crete, Greece, July 2001. ACM Press.
[8] A. Bulatov, A. Krokhin, and P. Jeavons. Constraint satisfaction problems and finite algebras. In Proceedings of the 27th International Colloquium on Automata, Languages and Programming (ICALP'00), volume 1853 of Lecture Notes in Computer Science, pages 272–282. Springer-Verlag, 2000.
[9] N. Creignou, S. Khanna, and M. Sudan. Complexity Classifications of Boolean Constraint Satisfaction Problems, volume 7 of SIAM Monographs on Discrete Mathematics and Applications. SIAM, 2001.
[10] V. Dalmau. Computational Complexity of Problems over Generalised Formulas. PhD thesis, Department LSI of the Universitat Politècnica de Catalunya (UPC), Barcelona, March 2000.
[11] R. Dechter and A. Dechter. Structure-driven algorithms for truth maintenance. Artificial Intelligence, 82(1-2):1–20, 1996.
[12] R. Dechter and I. Meiri. Experimental evaluation of preprocessing algorithms for constraint satisfaction problems. Artificial Intelligence, 68:211–241, 1994.
[13] R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. Artificial Intelligence, 34(1):1–38, 1988.
[14] T. Feder and M. Vardi. The computational structure of monotone monadic SNP and constraint satisfaction: A study through datalog and group theory. SIAM Journal on Computing, 28:57–104, 1998.
[15] G. Gottlob, L. Leone, and F. Scarcello. A comparison of structural CSP decomposition methods. Artificial Intelligence, 124(2):243–282, 2000.
[16] P. Jeavons. Constructing constraints. In Proceedings of the 4th International Conference on Constraint Programming (CP'98), Pisa, October 1998, volume 1520 of Lecture Notes in Computer Science, pages 2–16. Springer-Verlag, 1998.
[17] P. Jeavons. On the algebraic structure of combinatorial problems. Theoretical Computer Science, 200:185–204, 1998.
[18] P. Jeavons, D. Cohen, and M. Cooper. Constraints, consistency and closure. Artificial Intelligence, 101(1-2):251–265, 1998.
[19] P. Jeavons, D. Cohen, and M. Gyssens. Closure properties of constraints. Journal of the ACM, 44:527–548, 1997.
[20] P. Jeavons, D. Cohen, and J. Pearson. Constraints and universal algebra. Annals of Mathematics and Artificial Intelligence, 24:51–67, 1998.
[21] P. Kolaitis and M. Vardi. Conjunctive-query containment and constraint satisfaction. Journal of Computer and System Sciences, 61:302–332, 2000.
[22] V. Kumar. Algorithms for constraint satisfaction problems: A survey. AI Magazine, 13(1):32–44, 1992.
[23] R. McKenzie, G. McNulty, and W. Taylor. Algebras, Lattices and Varieties, volume I. Wadsworth and Brooks, California, 1987.
[24] U. Montanari. Networks of constraints: Fundamental properties and applications to picture processing. Information Sciences, 7:95–132, 1974.
[25] B. Nadel. Constraint satisfaction in Prolog: Complexity and theory-based heuristics. Information Sciences, 83(3-4):113–131, 1995.
[26] B. Nadel and J. Lin. Automobile transmission design as a constraint satisfaction problem: Modeling the kinematic level. Artificial Intelligence for Engineering Design, Analysis and Manufacturing (AI EDAM), 5(3):137–171, 1991.
[27] R. Pöschel and L. Kalužnin. Funktionen- und Relationenalgebren. DVW, Berlin, 1979.
[28] E. Post. The two-valued iterative systems of mathematical logic, volume 5 of Annals of Mathematics Studies. Princeton University Press, 1941.
[29] T. Schaefer. The complexity of satisfiability problems. In Proceedings of the 10th ACM Symposium on Theory of Computing (STOC'78), pages 216–226, 1978.
[30] E. Schwalb and L. Vila. Temporal constraints: a survey. Constraints, 3(2-3):129–149, 1998.
[31] A. Szendrei. Clones in Universal Algebra, volume 99 of Séminaires de Mathématiques Supérieures. Université de Montréal, 1986.
[32] A. Szendrei. Simple surjective algebras having no proper subalgebras. Journal of the Australian Mathematical Society (Series A), 48:434–454, 1990.
[33] M. Vardi. Constraint satisfaction and database theory: a tutorial. In Proceedings of the 19th ACM Symposium on Principles of Database Systems (PODS'00), 2000.

learning multi-label scene classification

A short version of this paper was published in the Proceedings of the SPIE 2004 Electronic Imaging Conference.
In this work, we consider the following problem:
Many current digital library systems allow a user to specify a query image and search for images “similar” to it, where
However, in some classification tasks, it is likely that some data belongs to multiple classes, causing the actual classes to overlap by definition. In text or music categorization, documents may belong to multiple genres, such as government and health, or rock and blues [2,3]. Architecture may belong to multiple genres as well. In medical diagnosis, a disease may belong to multiple categories, and genes may have multiple functions, yielding multiple labels [4].
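A common baseline for such overlapping classes is to train one independent binary classifier per label and allow several labels to fire for the same example; the training strategies and scene features studied in the paper are not reproduced here. A minimal sketch on invented data, assuming scikit-learn is available:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    # invented overlapping labels given as a binary indicator matrix: label 0 fires on
    # the right half-plane, label 1 on the upper half-plane, so an example may carry both
    Y = np.column_stack([(X[:, 0] > 0).astype(int), (X[:, 1] > 0).astype(int)])

    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)   # one binary model per label
    pred = clf.predict(X)                                       # each row is a label subset, possibly both or neither

    print("exact-match accuracy:", np.mean((pred == Y).all(axis=1)))
    print("examples carrying both labels:", int((Y.sum(axis=1) == 2).sum()))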

On the learnability and design of output codes for multiclass problems

Koby Crammer and Yoram Singer School of Computer Science&Engineering The Hebrew University,Jerusalem91904,Israel kobics,singer@cs.huji.ac.ilAbstractOutput coding is a general framework for solvingmulticlass categorization problems.Previous re-search on output codes has focused on buildingmulticlass machines given predefined output codes.In this paper we discuss for thefirst time the prob-lem of designing output codes for multiclass prob-lems.For the design problem of discrete codes,which have been used extensively in previous works,we present mostly negative results.We then in-troduce the notion of continuous codes and castthe design problem of continuous codes as a con-strained optimization problem.We describe threeoptimization problems corresponding to three dif-ferent norms of the code matrix.Interestingly,forthe norm our formalism results in a quadraticprogram whose dual does not depend on the lengthof the code.A special case of our formalism pro-vides a multiclass scheme for building support vec-tor machines which can be solved efficiently.Wegive a time and space efficient algorithm for solv-ing the quadratic program.Preliminary experimentswe have performed with synthetic data show thatour algorithm is often two orders of magnitude fasterthan standard quadratic programming packages.1IntroductionMany applied machine learning problems require assigning labels to instances where the labels are drawn from afinite set of labels.This problem is often referred to as multiclass categorization or classification.Examples for machine learn-ing applications that include a multiclass categorization com-ponent include optical character recognition,text classifica-tion,phoneme classification for speech synthesis,medical analysis,and more.Some of the well known binary classi-fication learning algorithms can be extended to handle mul-ticlass problem(see for instance[5,19,20]).A general ap-proach is to reduce a multiclass problem to a multiple binary classification problem.Dietterich and Bakiri[9]described a general approach based on error-correcting codes which they termed error-correcting output coding(ECOC),or in short output cod-ing.Output coding for multiclass problems is composed of two stages.In the training stage we need to construct multiple(supposedly)independent binary classifiers each of which is based on a different partition of the set of the labels into two disjoint sets.In the second stage,the classification part,the predictions of the binary classifiers are combined to extend a prediction on the original label of a test instance. Experimental work has shown that output coding can often greatly improve over standard reductions to binary problems [9,10,16,1,21,8,4,2].The performance of output coding was also analyzed in statistics and learning theoretic con-texts[12,15,22,2].Most of the previous work on output coding has concen-trated on the problem of solving multiclass problems using predefined output codes,independently of the specific ap-plication and the class of hypotheses used to construct the binary classifiers.Therefore,by predefining the output code we ignore the complexity of the induced binary problems. 
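As a rough, self-contained illustration of these two stages, the Python sketch below trains one linear binary classifier per column of a small hand-picked code matrix and classifies by Hamming-distance decoding. The synthetic data, the code matrix, and the least-squares binary learner are assumptions made only for the example; they are not the constructions studied in this paper:

# Sketch of output coding: train one binary classifier per column of a predefined
# code matrix M, then predict the row of M closest in Hamming distance to the
# vector of binary predictions.  Toy data and a least-squares learner (assumptions).
import numpy as np

rng = np.random.default_rng(0)
k, n, dim = 4, 200, 5                        # classes, samples, features
X = rng.normal(size=(n, dim))
y = rng.integers(0, k, size=n)
X[np.arange(n), y] += 3.0                    # make the classes roughly separable

M = np.array([[+1, +1, +1],                  # one row per class, one column per
              [+1, -1, -1],                  # binary problem (a hand-picked code)
              [-1, +1, -1],
              [-1, -1, +1]])

# Training stage: each column s relabels the data as M[y, s] and gets its own
# linear separator (here fitted by least squares).
W = np.column_stack([
    np.linalg.lstsq(X, M[y, s].astype(float), rcond=None)[0]
    for s in range(M.shape[1])
])

def predict(X_new):
    H = np.sign(X_new @ W)                               # binary predictions h(x)
    dists = (H[:, None, :] != M[None, :, :]).sum(axis=2) # Hamming distance to each row
    return dists.argmin(axis=1)                          # closest row = predicted class

print("training accuracy:", (predict(X) == y).mean())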
The output codes used in experiments were typically con-fined to a specific family of codes.Several family of codes have been suggested and tested so far,such as,comparing each class against the rest,comparing all pairs of classes[12, 2],random codes[9,21,2],exhaustive codes[9,2],and lin-ear error correcting codes[9].A few heuristics attempting to modify the code so as to improve the multiclass prediction accuracy were suggested(e.g.,[1]).However,they did not yield significant improvements and,furthermore,they lack any formal justification.In this paper we concentrate on the problem of designing a good code for a given multiclass problem.In Sec.3we study the problem offinding thefirst column of a discrete code matrix.Given a binary classifier,we show thatfinding a goodfirst column can be done in polynomial time.In con-trast,when we restrict the hypotheses class from which we choose the binary classifiers,the problem offinding a good first column becomes difficult.This result underscores the difficulty of the code design problem.Furthermore,in Sec.4 we discuss the general design problem and show that given a set of binary classifiers the problem offinding a good code matrix is NP-complete.Motivated by the intractability results we introduce in Sec.5the notion of continuous codes and cast the design problem of continuous codes as a constrained optimization problem.As in discrete codes,each column of the code ma-trix divides the set of labels into two subsets which are la-beled positive()and negative().The sign of each entry in the code matrix determines the subset association(or)and the magnitude corresponds to the confidence in this association.Given this formalism,we seek an output code with small empirical loss whose matrix norm is small.We describe three optimization problems corresponding to three different norms of the code matrix:and.For and we show that the code design problem can be solved by linear programming(LP).Interestingly,for the norm our formalism results in a quadratic program(QP)whose dual does not depend on the length of the code.Similar to sup-port vector machines,the dual program can be expressed in terms of inner-products between input instances,hence we can employ kernel-based binary classifiers.Our framework yields,as a special case,a direct and efficient method for constructing multiclass support vector machine.The number of variables in the dual quadratic problem is the product of the number of samples by the number of classes.This value becomes very large even for small datasets. For instance,an English letter recognition problem with training examples would require variables.In this case,the standard matrix representation of dual quadratic problem would require more than5Giga bytes of mem-ory.We therefore describe in Sec.6.1a memory efficient algorithm for solving the quadratic program for code design. 
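With a continuous code, developed later in the paper, the entries of the code matrix are real-valued confidences and the predicted label is the row maximizing the inner product with the binary classifiers' outputs. A minimal numeric sketch of that decision rule (all values made up):

# Prediction with a continuous code: choose the row r of M maximizing the
# similarity K(M_r, h(x)); here K is the ordinary inner product.
import numpy as np

M = np.array([[ 0.9, -0.2,  0.4],   # one row per class; sign = subset, magnitude = confidence
              [-0.5,  0.8,  0.1],
              [ 0.2,  0.3, -0.9]])
h_x = np.array([1.0, -1.0, 1.0])    # outputs of the binary classifiers on some instance x

scores = M @ h_x                    # K(M_r, h(x)) for every class r
print("scores:", scores, "predicted class:", int(scores.argmax()))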
Our algorithm is reminiscent of Platt’s sequential minimal optimization(SMO)[17].However,unlike SMO,our algo-rithm optimize on each round a reduced subset of the vari-ables that corresponds to a single rmally,our algorithm reduces the optimization problem to a sequence of small problems,where the size of each reduced problem is equal to the number of classes of the original multiclass problem.Each reduced problem can again be solved us-ing a standard QP technique.However,standard approaches would still require large amount of memory when the num-ber of classes is large and a straightforward solution is also time consuming.We therefore further develop the algorithm and provide an analytic solution for the reduced problems and an efficient algorithm for calculating the solution.The run time of the algorithm is polynomial and the memory re-quirements are linear in the number of classes.We conclude with simulations results showing that our algorithm is at least two orders of magnitude faster than a standard QP technique, even for small number of classes.2Discrete codesLet be a set of training examples where each instance belongs to a domain. We assume without loss of generality that each label is an integer from the set.A multiclass clas-sifier is a function that maps an instance into an element of.In this work we focus on a frame-work that uses output codes to build multiclass classifiers from binary classifiers.A discrete output code is a matrix of size over where each row of corre-spond to a class.Each column of defines a parti-tion of into two disjoint sets.Binary learning algorithms are used to construct classifiers,one for each column of .That is,the set of examples induced by column of is.This set is fed as training data to a learning algorithm thatfinds a hypothesis.This reduction yields different binary classifiers.We denote the vector of predictions of these clas-sifiers on an instance as.We denote the th row of by.Given an example we predict the label for which the row is the“closest”to.We will use a general notion for closeness and define it through an inner-product function.The higher the value ofis the more confident we are that is the correct label of according to the classifiers.An example for a closeness function is.It is easy to verify that this choice of is equivalent to picking the row of which attains the minimal Hamming distance to.Given a classifier and an example,we say that misclassified the example if.Let be 1if the predicate holds and0otherwise.Our goal is there-fore tofind a classifier such thatlabel set and the sample size.First,note that inthis case(2)Second,note that the sample can be divided into equiv-alence classes according to their labels and the classifica-tion of.For and,de-fine,which isequivalent to random guessing).Hence,the size of is:(4)Using Eqs.(2)and(4),we rewrite Eq.(3),.Thus,the minimum of isachieved if and only if the formula is satisfiable.There-fore,a learning algorithm for and can also be usedas an oracle for the satisfiability of.While the setting discussed in this section is somewhatsuperficial,these results underscore the difficulty of the prob-lem.We next show that the problem offinding a good out-put code given a relatively large set of classifiers is in-tractable.We would like to note in passing that efficient al-gorithm forfinding a single column might be useful in othersettings.For instance in building trees or directed acyclicgraphs for multiclass problems(cf.[18]).We leave this forfuture research.4Finding a general discrete output 
codeIn this section we prove that given a set of binary classi-fiers,finding a code matrix which minimizes the em-pirical loss is NP-complete.Given a sampleand a set of classifiers,let us de-note by the evaluation ofon the sample,where is the predictions vector for the th sample.We now show that even when andthe problem is NP-complete.(Clearly,the problem remains NPC for).Following the notation of previous sections,the output code matrix is composed of two rows and and the predicted class for instance is.For the simplicity of the presentation of the proof,we assume that both the code and the hypotheses’values are over the set(instead of).This assumption does not change the problem since there is a linear transform between the two sets. Theorem2The following decision problem is NP-complete. Input:A natural number,a labeled sample,where,and.Question:Does there exist a matrix such that the classifier based on an output code makes at most mistakes on.Proof:Our proof is based on a reduction technique intro-duced by H¨o ffgen and Simon[14].Since we can check in polynomial time whether the number of classification errors for a given a code matrix exceeds the bound,the prob-lem is clearly in NP.We show a reduction to Vertex Cover in order to prove that the problem is NP-hard.Given an undirected graph ,we will code the structure of the graph as follows. The sample will be composed of two subsets,andof size and respectively.We set.Each edge is encoded by two examples in. We set for thefirst vector to,,and elsewhere.We set the second vector to,, and elsewhere.We set the label of each example in to.Each example in encodes a node where,,and elsewhere.We set the label of each example in to(second class). We now show that there exists a vertex cover with at most nodes if and only if there exists a coding matrixthat induces at most classification errors on the sample.:Let be a vertex cover such that. We show that there exists a code which has at most mis-takes on.Let be the characteristic function of,that is,if and otherwise. Define the output code matrix to be and.Here,denotes the component-wise logical not operator.Since is a cover,for each we getandTherefore,for all the examples in the predicted label equals the true label and we suffer errors on these exam-ples.For each example that corresponds to a node we haveTherefore,these examples are misclassified(Recall that the label of each example in is).Analogously,for each example in which corresponds to we getand these examples are correctly classified.We thus have shown that the total number of mistakes according to is .:Let be a code which achieves at most mis-takes on.We construct a subset as follows.We scan and add to all vertices corresponding to misclas-sified examples from.Similarly,for each misclassified example from corresponding to an edge,we pick either or at random and add it to.Since we have at most misclassified examples in the size of is at most .We claim that the set is a vertex cover of the graph. 
Assume by contradiction that there is an edge for which neither nor belong to the set.Therefore,by construction,the examples corresponding to the vertices and are classified correctly and we get,Summing the above equations yields that,(7) In addition,the two examples corresponding to the edge are classified correctly,implying thatwhich again by summing the above equations yields,(8) Comparing Eqs.(7)and(8)we get a contradiction.that exactly one class attains the maximum value according to the function.We will concentrate on the problem of finding a good continuous code given a set of binary classi-fiers.The approach we will take is to cast the code design prob-lem as constrained optimization problem.Borrowing the idea of soft margin[7]we replace the discrete0-1multiclass loss with the linear bound(9) This formulation is also motivated by the generalization anal-ysis of Schapire et al.[2].The analysis they give is based on the margin of examples where the margin is closely related to the definition of the loss as given by Eq.(9).Put another way,the correct label should have a confi-dence value which is larger by at least one than any of the confidences for the rest of the labels.Otherwise,we suffer loss which is linearly proportional to the difference between the confidence of the correct label and the maximum among the confidences of the other labels.The bound on the empir-ical loss is then,where equals if and otherwise.We say that a sample is classified correctly using a set of binary classi-fiers if there exists a matrix such that the above loss is equal to zero,(10) Denote by(11) Thus,a matrix that satisfies Eq.(10)would also satisfy the following constraints,(12)We view a code as a collection of vectors and define the norm of to be the norm of the concatenation of the vectors constituting.Motivated by[24,2]we seek a ma-trix with a small norm which satisfies Eq.(12).Thus, when the entire sample can be labeled correctly,the prob-lem offinding a good matrix can be stated as the follow-ing optimization problem,subject to:Here is an integer.Note that of the constraints forare automatically satisfied.This is changed in the following derivation for the non-separable case.In the general case a matrix which classifies all the examples correctly might not exist.We therefore introduce slack variables and modify Eq.(10)to be,(13)The corresponding optimization problem is,(14) subject to:for some constant.This is an optimization problem with“soft”constraints.Analogously,we can define an opti-mization problem with“hard”constraints,subject to:,for someThe relation between the“hard”and“soft”constraints and their formal properties is beyond the scope of this paper. For further discussion on the relation between the problems see[24].5.1Design of continuous codes using LinearProgrammingWe now further develop Eq.(14)for the cases. We dealfirst with the cases and which result in linear programs.For the simplicity of presentation we will assume that.For the case the objective function of Eq.(14)be-comes.We introduce a set of auxiliary variables to get a standard linear programming setting,subject to:To obtain its dual program(see also App.B)we define one variable for each constraint of the primal problem.We use for thefirst set of constraints,and for the second set. 
The dual program is,subject to:The case of is similar.The objective function of Eq.(14)becomes.We introduce a single new variable to obtain the primal problem,subject to:Following the technique for,we get that the dual pro-gram is,subject to:Both programs(and)can be now solved using standard linear program packages.5.2Design of continuous codes using QuadricProgrammingWe now discuss in detail Eq.(14)for the case.For convenience we use the square of the norm of the matrix (instead the norm itself).Therefore,the primal program be-comes,subject to:(16) The saddle point we are seeking is a minimum for the primal variables(),and the maximum for the dual ones().To find the minimum over the primal variables we require,(18)(19)Eq.(19)implies that when the optimum of the objective function is achieved,each row of the matrix is a linear combination of.We say that an example is a support pattern for class if the coefficient ofin Eq.(19)is not zero.There are two settings for which an example can be a support pattern for class.Thefirst case is when the label of an example is equal to,then the th example is a support pattern if.The second case is when the label of the example is different from,then the th pattern is a support pattern if.Loosely speaking,since for all and we have and,the variable can be viewed as a distri-bution over the labels for each example.An example affects the solution for(Eq.(19))if and only if in not a point distribution concentrating on the correct label.Thus,only the questionable patterns contribute to the learning process.We develop the Lagrangian using only the dual variables. Substituting Eqs.(17)and(19)into Eq.(16)and using vari-ous algebraic manipulations,we obtain that the target func-tion of the dual program is,It is easy to verify that is strictly convex in.Since the constraints are linear the above problem has a single opti-mal solution and therefore QP methods can be used to solve it.In Sec.6we describe a memory efficient algorithm for solving this special QP problem.To simplify the equations we denote bythe difference between the correct point distribution and the distribution obtained by the optimization problem,Eq.(19) becomes,(21)Since we look for the value of the variables which maximize the objective function(and not the optimum of itself), we can omit constants and write the dual problem given by Eq.(20)as,subject to:and(22)whereand the classification rule becomes,(25)The general framework for designing output codes using the QP program described above,also provides,as a special case,a new algorithm for building multiclass Support Vec-tors Machines.Assume that the instance space is the vector space and define(thus),then the primal program in Eq.(15)becomesVariablesConstraints VariablesConstraints(31)For brevity,we will omit the index and drop constants (that do not affect the solution).The reduced optimization has variables and constraints,Since the program from Eq.(32)becomes,subject to:andwhere,(33) In Sec.6.1we discuss an analytic solution to Eq.(33)and in Sec.6.2we describe a time efficient algorithm for computing the analytic solution.6.1An analytic solutionWhile the algorithmic solution we describe in this section is simple to implement and efficient,its derivation is quite complex.Before describing the analytic solution to Eq.(33), we would like to give some intuition on our method.Let us fix some vector and denote.First note that is not a feasible point since the constraintis not satisfied.Hence for any feasible point some of the constraints are not 
tight.Second,note that the differences between the bounds and the variables sum to one.Let us induce a uniform distribution over the components of.Then,the variance of isSince the expectation is constrained to a given value,the optimal solution is the vector achieving the smallest vari-ance.That is,the components of of should attain similar values,as much as possible,under the inequality constraints .In Fig.1we illustrate this motivation.We pickedand show plots for two different feasible values for.The x-axis is the index of the point and the y-axis designates the values of the components of .The norm of on the plot on the right hand side plot is smaller than the norm of the plot on the left hand side.The right hand side plot is the optimal solution for.The sum of the lengths of the arrows in both plots is. Since both sets of points are feasible,they satisfy the con-straint.Thus,the sum of the lengths of the “arrows”in both plots is one.We exploit this observation in the algorithm we describe in the sequel.We therefore seek a feasible vector whose most of its components are equal to some threshold.Given we de-fine a vector whose its th component equal to the mini-mum between and,hence the inequality constraints are satisfied.We define(34)35duced optimization problem with .The x-axis is the index of the point,and the y-axis denotes the values .The bottom plot has a smaller variance hence it achieves a better value for .We denote byUsing,the equality constraint from Eq.(33)becomes.Let us assume without loss of generality that the compo-nents of the vector are given in a descending order,(this can be done in time).Letand .To prove the main theorem of this section we need the following lemma.Lemma 3is piecewise linear with a slope in eachrangefor .Proof:Let us develop.35Figure 2:An illustration of the solution of the QP problem using the inverse of for .The optimal value is the solution for the equation which is .Note that if thenfor all .Also,the equalityholds for eachin the range.Thus,for ,the functionhas the form,(35)This completes the proof.We now can prove the main theorem of this section.Theorem 5Let be the unique solution of.Then is the optimum value of the optimization problem stated in Eq.(33).The theorem tells us that the optimum value of Eq.(33)isof the form defined by Eq.(34)and that there is exactly one value of for which the equality constraintholds.A plot of and the solution for fromFig.1are shown in Fig.2.Proof:Corollary 4implies that a solution exists and isunique.Note also that from definition ofwe have that the vector is a feasible point of Eq.(33).We now prove that is the optimum of Eq.(33)by showing thatfor all feasible points .Assume,by contradiction,that there is a vector such that .Let ,and define.Since both and satisfy the equality constraint of Eq.(33),we have,(36)Since is a feasible point we have.Also, by the definition of the set we have that for all .Combining the two properties we get,for all(37) We start with the simpler case of for all.In this case,differs from only on a subset of the coordi-nates.However,for these coordinates the components of are equal to,thus we obtain a zero variance from the constant vector whose components are all.Therefore, no other feasible vector can achieve a better variance.For-mally,since for all,then the terms for cancel each other,From the definition of in Eq.(34)we get thatfor all,We use now the assumption that for all and the equality(Eq.(36))to obtain,and we get a contradiction since.We now turn to prove the complementary case 
in which .Since,then there exists such that.We use again Eq.(36)and conclude that there exists also such that.Let us assume without loss of generality that(The case follows analogously by switching the roles of and ).Define as follows,otherwiseThe vector satisfies the constraints of Eq.(33)sinceand.Since and are equal except for their and components we get,Input:.Initialize.Define.Sort the components of,such that. Define;.While..Eq.(39) ComputeFigure3:The algorithm forfinding the optimal solution of the reduced quadratic program(Eq.(33)).Substituting the values for and from the definition of we obtain,Using the definition of and forand for we obtain,Thefirst term of the bottom equation is negative since and.Also,hence and the second term is also negative.We thus get,which is a contradiction.Input:.Choose-a feasible point for Eq.(24).Iterate.Choose an exampleCompute and Eqs.(29)and(30)Eq.(33) Output thefinal hypothesis:Eq.(25)(40)The complete algorithm is described in Fig.3.Since it takes time to sort the vector and anothertime for the loop search,the total run time is.We arefinally ready to give the algorithm for solving learning problem described by Eq.(24).Since the output code is constructed of the supporting patterns we term our algorithm SPOC for Support Pattern Output Coding.The SPOC algorithm is described in Fig.4.We have also devel-oped methods for choosing an example to modify on each round and a stopping criterion for the entire optimization al-gorithm.Due to lack of space we omit the details which will appear in a full paper.We have performed preliminary experiments with syn-thetic data in order to check the actual performance of our algorithm.We tested the special case corresponding to mul-ticlass SVM by setting.The code matrices we test35Figure5:Run time comparison of two algorithms for code design using quadratic programming:Matlab’s standard QP package and the proposed algorithm(denoted SPOC).Note that we used a logarithmic scale for the run-time()axis. are of rows(classes)and columns.We varied the size of the training set size from to. The examples were generated using the uniform distribution over.The domain was partitioned into four quarters of equal size:,,,and.Each quarter was associated with a different label.For each sample size we tested,we ran the algorithm three times,each run used a different randomly generated training set.We compared the standard quadratic optimization routine available from Mat-lab with our algorithm which was also implemented in Mat-lab.The average running time results are shown in Fig.5. 
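To make the reduced-problem solver of Secs. 6.1–6.2 (Fig. 3) concrete, the Python sketch below finds the threshold θ by sorting the bounds and scanning the pieces of the piecewise-linear function F, then caps every component at θ. It is only an illustrative re-implementation of the stated procedure (components bounded above by B_r and summing to Σ_r B_r − 1), not the authors' code; the example bounds are made up:

# Analytic solution of the reduced QP: find theta with sum_r min(B_r, theta) = sum_r B_r - 1,
# then set nu_r = min(B_r, theta).  Sorting gives an O(k log k) procedure.
import numpy as np

def solve_reduced(B):
    target = B.sum() - 1.0                    # the equality constraint
    Bs = np.sort(B)[::-1]                     # bounds in descending order
    k = len(Bs)
    for i in range(k):
        rest = Bs[i + 1:].sum()
        # On the piece where the i+1 largest bounds are capped at theta,
        # F(theta) = (i + 1) * theta + rest; solve F(theta) = target.
        theta = (target - rest) / (i + 1)
        if i + 1 == k or theta >= Bs[i + 1]:  # theta falls inside this piece
            return np.minimum(B, theta), theta
    raise RuntimeError("no feasible theta found")

B = np.array([0.7, 0.1, 0.1, 0.1])            # made-up upper bounds
nu, theta = solve_reduced(B)
print("theta =", theta, "nu =", nu, "constraint:", nu.sum(), "==", B.sum() - 1.0)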
Note that Fig. 5 uses a log scale for the run-time axis. The results show that the efficient algorithm can be two orders of magnitude faster than the standard QP package.

7 Conclusions and future research

In this paper we investigated the problem of designing output codes for solving multiclass problems. We first discussed discrete codes and showed that, while the problem is intractable in general, we can find the first column of a code matrix in polynomial time. The question of whether the algorithm can be generalized to find additional columns with comparable running time remains open. Another closely related question is whether we can efficiently find the next column given previous columns. Also left open for future research is further usage of the algorithm for finding the first column as a subroutine in constructing codes based on trees or directed acyclic graphs [18], and as a tool for incremental (column-by-column) construction of output codes. Motivated by the intractability results for discrete codes, we introduced the notion of continuous output codes. We described three optimization problems for finding good continuous codes for a given set of binary classifiers. We discussed in detail an efficient algorithm for one of the three problems, which is based on quadratic programming. As a special case, our framework also provides a new efficient algorithm for multiclass Support Vector Machines. The importance of this efficient algorithm might prove to be crucial in large classification problems with many classes, such as Kanji character recognition. We also devised an efficient implementation of the algorithm. The implementation details of the algorithm, its convergence, generalization properties, and more experimental results were omitted due to lack of space and will be presented elsewhere. Finally, an important question which we have barely tackled in this paper is the problem of interleaving the code design problem with the learning of the binary classifiers. A viable direction in this domain is combining our algorithm for continuous codes with the support vector machine algorithm.
Acknowledgement. We would like to thank Rob Schapire for numerous helpful discussions, Vladimir Vapnik for his encouragement and support of this line of research, and Nir Friedman and Ran Bachrach for useful comments and suggestions.

References
[1] D. W. Aha and R. L. Bankert. Cloud classification using error-correcting output codes. In Artificial Intelligence Applications: Natural Science, Agriculture, and Environmental Science, volume 11, pages 13–28, 1997.
[2] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Machine Learning: Proceedings of the Seventeenth International Conference, 2000.
[3] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, March 1998.
[4] A. Berger. Error-correcting output coding for text classification. In IJCAI'99: Workshop on Machine Learning for Information Filtering, 1999.
[5] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.
[6] V. Chvatal. Linear Programming. Freeman, 1980.
[7] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.
[8] Ghulum Bakiri and Thomas G. Dietterich. Achieving high-accuracy text-to-speech with machine learning. In Data Mining in Speech Synthesis, 1999.
[9] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, January 1995.
[10] Tom Dietterich and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Oregon State University, 1995.
[11] R. Fletcher. Practical Methods of Optimization. John Wiley, second edition, 1987.
[12] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451–471, 1998.
[13] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
[14] Klaus-U. Höffgen and Hans-U. Simon. Robust trainability of single neurons. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 428–439, Pittsburgh, Pennsylvania, July 1992.
[15] G. James and T. Hastie. The error coding method and PiCT. Journal of Computational and Graphical Statistics, 7(3):377–387, 1998.
[16] E. B. Kong and T. G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313–321, 1995.
[17] J. C. Platt. Fast training of Support Vector Machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[18] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12. MIT Press, 2000. (To appear.)
[19] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing – Explorations in the Microstructure of Cognition, chapter 8, pages 318–362. MIT Press, 1986.
[21] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313–321, 1997.
[22] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.
[23] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
[24] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[25] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium On Artificial Neural Networks, April 1999.

The Binary Component Adaptation User Guide


The Binary Component Adaptation User GuideJuly 16, 1998 Ralph KellerDISCLAIMER OF WARRANTY. Free of charge Software is provided on an “AS IS” basis, without warranty of any kind, including without limitation the warranties that the Software is free of defects, merchantable, fit for a particular purpose or non-infringing. The entire risk as to the quality and performance of the Software is borne by you.Copyright (c) 1998 Ralph Kellerralph@All rights reserved.1.IntroductionBinary component adaptation (BCA)[KH98a] is a mechanism that modifies existing components (such as Java class files) to the specific needs of a programmer. Binary component adaptation allows components to be adapted and evolved in binary form and on the fly (during program loading). That is, a programmer can add methods and fields to classes and interfaces, rename and reimplement methods, and alter the type hierarchy even without access to the source code.This guide focuses on how to install and use the Java implementation of BCA. It does neither explain how BCA works nor how it is implemented. Furthermore, we expect that the reader is familiar with Java and the Java Virtual Machine (JVM).1.1Related ReadingIf you want to know more about the motivation and design rationales of BCA, you should read Binary Component Adaptation [KH98a]. Also an interesting paper is Integrating Independently-Developed Components in Object-Oriented Languages [Höl93]. If you are interested in the inner workings of BCA, you should have a look at Implementing Binary Component Adap-tation [KH98b] which describes the implementation for Java.For further information about BCA, check out the following URL:/oocsb2.Restrictions2.1BCA RestrictionsThe current implementation of BCA has some (minor) restrictions:•No type checker.The delta file compiler does not check adaptations for correctness. For example, it is possible to add an interface to a class even though the class does not implement all the required methods. In this case, the adaptation will com-pile, but at load-time the VM will cause a linkage error.However, new code added to classes is legal if it compiles correctly and can only reference class members that really exist.•No name conflict resolution. BCA does not (yet) transparently resolve method or field name clashes. For example, if a change introduced by evolution clashes with changes made by adaptation, then a linkage error occurs.•Fully qualified type declarations. The delta file compiler requires all types in methods and field declarations to be fully qualified. For example, it is fine to use the class type ng.String in a method declaration, but the unqualified String would cause a delta file compilation error.2.2JDK RestrictionsBCA needs to use the class files in javasrc-1.1.5/classes. It won’t work with the classes provided in the binary distribu-tion of the JDK1.1.5. If the class path is set incorrectly, bcajava fails to initialize and aborts silenlty. Simply set your class path accordingly (e.g.,setenv1 CLASSPATH<bca-dir>/javasrc-1.1.5/classes).1The syntax to set environment variables depends on the shell. In the following we assume tcsh.3.The Package StructureThe BCA package consists of the following subdirectories:4.Installing BCA 1.Untar the release in a directory, so type:tar xvf bca-releasenumber .tarThis extracts the package and generates several subdirectories. 
In the following <bca-dir > is the absolute path of the generated bca directory (such as /users/ralph/bca )2.For convenience, you might want to include the directory <bca-dir >/bin in your PATH .setenv PATH $PATH:<bca-dir >/bin3.Set the class search path to include the classes in <bca-dir>/src/java for the delta file compiler such as:setenv CLASSPATH .:<bca-dir>/src/javapile the delta file compiler. Change directory to <bca-dir>/src/java/dfc/main and invoke bcajavac :javac Main.java5.Adaptation SpecificationAn adaptation specification describes changes to the classes that are retrieved from the file system or network. In general, an adaptation specification includes Java source code fragments for added or reimplemented methods.The complete grammer for the adaptation specification is described in Appendix A. The following subsections describe the class files specifier and modifications.Note: Currently, each NameDecl in an adaptation specification must be fully qualified . For example, the type ng.String is fine but the unqualified type String would cause an error.5.1Class Files SpecifierThe class files specifier defines the set of classes to which the adaptation is applied:DirectoryContent Description ./binMiscellaneous shell scripts and binaries ./libLibraries for BCA ./examplesSimple example uses of BCA ./src Sources for delta file compiler ./javasrc-1.1.5 A stripped version of Sun’s JDK1.1.5 source code distribution for SPARC/Solaris that includes theJava VM and javac executables, compiled classes, and libraries. This directory does not contain anysource code.Class Files SpecifierDescription class NameDeclclass file that represents specific class interface NameDeclclass file that represents specific interface implements NameDecl all class files (classes and interfaces) that implement theinterfaceTable 1.Class Files Specifier5.2ModificationsCurrently, the modifications in Table 2 are supported.ing BCAThe BCA distribution includes three convenient scripts to invoke the interpreter, Java compiler, and delta file compiler.6.1BCA InterpreterYou invoke the interpreter in the following way:bcajava [-options] classbcajava is a complete replacement for the standard Java interpreter. In addition to the standard options,bcajava recognizes the -deltas flag which contains a colon-separated list of delta files that are used for modifications at run-time. Each file name must also include the .df extension. For example, a typical invocation of the interpreter looks as following:bcajava -deltas StringEncryption.df:Enumeration.df:ImplementsEnumeration.df Main6.2BCA Java CompilerTo compile against adapted classes, you invoke the Java compiler as following:bcajavac [-options]file .java ...The BCA java compiler is a complete replacement for the JDK’s javac . In addition to the standard flags,bcajavac recognizes the -deltas flag which contains a colon-separated list of delta files used for compilation. Classes that bcajavac loads to look up type information are then modified according to the delta files.For example, if you have a delta file that adds encryption to ng.String (such as the StringEncryption delta), in order to compile classes that use methods encrypt or decrypt you have to invoke bcajavac as following:bcajavac -deltas StringEncryption.df UsesAdaptedStringClass.java6.3Delta File CompilerThe delta file compiler translates an adaptation specification into a binary delta file. 
You invoke delta file compiler as following:bcadfc [-v]file .deltaProductionsfie l d s add field AccessFlagDecls opt TypeDecl Identifierrename field Identifier to Identifierrename reference field NameDecl Identifier to Identifierm e t h o d s add method AccessFlagDecls opt TypeDecl Identifier ( FormalParameterListDecl opt )ThrowsDecl optMethodBodyDecl opt rename method TypeDecl Identifier ( FormalParameterListDecl )to Identifierrename reference method NameDecl TypeDecl Identifier ( FormalParameterListDecl )to Identifieri n t e r f a c e sadd implements NameDeclTable 2.ModificationsThe delta file compiler reads in file.delta and, if compilation succeeds, produces a binary delta file. The delta file has the name as specified in the adaptation specification. The-v option causes bcadfc to print debugging information and the gener-ated delta file.For example, if you want to compile StringEncryption.delta, you invoke the delta file compiler as following: bcadfc StringEncryption.deltaThis creates the StringEncryption.df delta file.6.4Class File ToolsThe BCA distribution also contains three tools that can be helpful when dealing with class files.6.4.1mainjvmThe tool mainjvm is a class file viewer. You invoke it as following:mainjvm classfile.classmainjvm shows the complete class file structure in an easy to read form. For example, it displays the constant pool, access modi-fiers, super class, interfaces, fields, methods, and attributes. Byte codes are shown as hex values.6.4.2mainjhlThe tool mainjhl is a high-level class file viewer. You invoke it as following:mainjhl classfile.classmainjhl shows the class in a high-level representation. For example, all references to classes, interfaces, methods, and fields are resolved.mainjhl also disassembles byte codes as JVM instructions and resolves all references to the constant pool.6.4.3mainjdlmainjdl displays a class file after performing modifications. You invoke mainjdl as following:mainjdl classfile.class deltafilesmainjdl loads the classfile and then makes all modifications described in deltafiles, a colon-separated list of delta files such as StringEncryption.df:Rename.df.You can use mainjdl to test delta files for their correct behavior. For example, if you run:mainjdl String.class StringEncryption.dfThen the printed class file must also include the new methods encrypt and decrypt:...61: public encrypt ()Ljava/lang/String;0: CodeAttribute: max stack: 4 max locals: 3 codelength: 400x0000: 2a b6 01 12 4c 03 3d a7 00 10 2b 1c 2b 1c 34 100x0010: 56 82 92 55 84 02 01 1c 2a b6 01 13 a1 ff ee bb0x0020: 00 11 59 2b b7 00 22 b062: public decrypt ()Ljava/lang/String;0: CodeAttribute: max stack: 4 max locals: 3 codelength: 400x0000: 2a b6 01 12 4c 03 3d a7 00 10 2b 1c 2b 1c 34 100x0010: 56 82 92 55 84 02 01 1c 2a b6 01 13 a1 ff ee bb0x0020: 00 11 59 2b b7 00 22 b0...7.First Experiences with BCAIn this section, we illustrate a simple example of using BCA step by step.Suppose, we want to extend the standard class ng.String with functionality for encryption. 
Using BCA, we write the following adaptation and save it in a file StringEncryption.delta:delta StringEncryption adapts class ng.String {add method public ng.String encrypt() {char[] buf = this.toCharArray();for (int i=0; i<this.length(); i++) {buf[i] = (char) (buf[i] ^ 0x56);}return new String(buf);};add method public ng.String decrypt() {char[] buf = this.toCharArray();for (int i=0; i<this.length(); i++) {buf[i] = (char) (buf[i] ^ 0x56);}return new String(buf);};}This adaptation adds two methods encrypt and decrypt to ng.String. Then we compile the adaptation into the delta file by invoking:bcadfc StringEncryption.deltaThis generates StringEncryption.df. Now, we write a class Main.java that uses the encryption functionality provided by the delta.public class Main {public static void main(String[] args) {String s = “***secret***”;String encrypted = s.encrypt();System.out.println(“encrypted: “ + encrypted);String decrypted = encrypted.decrypt();System.out.println(“decrypted: “ + decrypted);}}We compile Main.java by invoking:bcajavac -deltas StringEncryption.df Main.javabcajavac requires the StringEncryption delta file since we use the methods encrypt and decrypt in Main.java. Now we can run Main by invoking:bcajava -deltas StringEncryption.df MainMain produces the following output:encrypted: |||%35$3”|||decrypted: ***secret***References[GJS96]James Gosling, Bill Joy and Guy Steele.The Java Language Specification. Addison-Wesley, 1996.[Höl93]Urs Hölzle. Integrating Independently-Developed Components in Object-Oriented Languages.Proceedings of ECOOP’93, Springer Verlag LNCS 512, 1993.[KH98a]Ralph Keller and Urs Hölzle. Binary Component Adaptation.Proceedings of ECOOP’98,Springer Verlag, July 1998.[KH98b]Ralph Keller and Urs Hölzle. Implementing Binary Component Adaptation.Technical Report, Department of Computer Science, University of California, Santa Barbara, July 1998.Appendix A Adaptation Specification GrammarThe following subsection presents a grammar for the adaptation specification.A.1Grammar NotationFor the grammar, we use the same notation as in chapter 19 of the Java Language Specification [GJS96]. Terminal symbols of the grammar are shown in fixed-width font. Nonterminal symbols are shown in italic type. The definition of a nonterminal is introduced by the name of the nonterminal, followed by a colon. One or more alternative right-hand sides for the nonterminal then follow on succeeding lines. Nonterminals that are not specified are identical to the Java Language Specification.A.2Productions from Adaptation SpecificationCompilationUnit:ImportDecl opt delta SimpleName adapts SpecifierDecl UsesDecl opt{ ModificationDecls opt}A.2.1Import DeclarationImportDecl:SingleTypeImportDeclTypeImportOnDemandDeclSingleTypeImportDecl:import NameDeclTypeImportOnDemandDecl:import NameDecl. 
*A.2.2Class files Specifier DeclarationSpecifierDecl:ClassSpecifierDeclInterfaceSpecifierDeclImplementsSpecifierDeclClassSpecifierDeclclass NameDeclInterfaceSpecifierDeclinterface NameDeclImplementsSpecifierDeclimplements NameDeclA.2.3Uses DeclarationUsesDecl:uses ClassTypeListDeclA.3Productions from Modification DeclarationsModificationDecls:ModificationDecls ModificationDecl;ModificationDecl:AddFieldDeclAddMethodDeclRenameFieldDeclRenameMethodDeclRenameFieldRefDeclRenameMethodRefDeclAddImplementsDeclAddFieldDecl:add field AccessFlagDecls opt TypeDecl IdentifierAddMethodDecl:add method AccessFlagDecls opt TypeDecl Identifier( FormalParameterListDecl opt)ThrowsDecl opt MethodBodyDecl opt RenameFieldDecl:rename field Identifier to IdentifierRenameMethodDecl:rename method TypeDecl Identifier( FormalParameterListDecl)to IdentifierRenameFieldRefDecl:rename reference field NameDecl Identifier to IdentifierRenameMethodRefDecl:rename reference method NameDecl TypeDecl Identifier( FormalParameterListDecl)to Identifier AddImplementsDecl:add implements NameDeclA.3.1Productions from AddMethodDeclFormalParameterListDecl:ProperFormalParameterListDeclProperFormalParameterListDecl:ProperFormalParameterListDecl, FormalParameterDeclFormalParameterDeclFormalParameterDecl:TypeDecl IdentifierThrowsDecl:throws ClassTypeListDeclClassTypeListDecl:ClassTypeListDecl, ClassTypeDeclClassTypeDeclClassTypeDecl:NameDeclMethodBodyDecl:{ BlockDecl}BlockDecl: any legal Java method bodyA.4Productions from Access ModifersAccessFlagDecls:AccessFlagDecls AccessFlagDeclAccessFlagDecl: one ofpublic protected privatestaticabstract final native synchronized transient volatile A.5Productions from Type DeclarationsTypeDecl:BaseTypeDeclReferenceTypeDecBaseTypeDecl: one ofbyte char double float int long short boolean void ReferenceTypeDecl:ArrayTypeDeclClassOrInterfaceTypeDeclClassOrInterfaceTypeDecl:NameDeclArrayTypeDeclBaseTypeDecl[]NameDecl[]ArrayTypeDecl[]A.6Productions from NamesNameDecl:SimpleNameDeclQualifiedNameDeclSimpleNameDecl:IdentifierQualifiedNameDecl:NameDecl. Identifier。

Some Classes of Invertible Matrices in GF(2)


Some Classes of Invertible Matrices in GF(2)James S.Plank∗Adam L.Buchsbaum‡Technical Report UT-CS-07-599Department of Electrical Engineering and Computer ScienceUniversity of TennesseeAugust16,2007The home for this paper is /∼plank/plank/papers/CS-07-599.html.Please visit that link for up-to-date information about the publication status of this and related papers.AbstractInvertible matrices in GF(2)are important for constructing MDS erasure codes.This paper proves that certain classes of matrices in GF(2)are invertible,plus some additional properties about invertible matrices.1IntroductionWe are concerned with the question of whether certain matrices in GF(2)are invertible.This question is important when designing erasure codes for storage applications.If an erasure code is composed solely of exclusive-or operations [1,2,3,4,5,8,9],then it may be represented as a matrix-vector product in GF(2).The act of decoding transforms an original distribution matrix into a square decoding matrix that must be inverted.The process is described for general GF(2w)by Plank[7]and isfirst used in GF(2)by Blomer et al.[2].As such,a fundamental part of defining MDS erasure codes is to construct distribution matrices that result in invertible decoding matrices.This paper does not delve into erasures codes,but instead proves that certain classes of matrices in GF(2)are invertible.It also proves some properties of invertible matrices.2NomenclatureIn GF(2),each element is either0or1;addition is the binary exclusive-or operator(denoted⊕),and multiplication is the binary and operator.When we refer to a matrix M w,that means that M w is a square matrix in GF(2)with w rows and columns.Other information about the matrix is included in the subscripts.We refer to the element in row r and column c of M w as M w[r,c].These are zero-indexed,so the top-left element of M w is M w[0,0],and the bottom-right element of M w is M w[w−1,w−1].We perform arithmetic of row and column indices in M w over the commutative ring Z/w Z.We denote the quantity x modulo w by x w.In particular,because x+w w=x w,we have−1w=w−1w.When context disambiguates,we drop the extra notation;e.g.,−1w=w−1.∗Department of Electrical Engineering and Computer Science,University of Tennessee,Knoxville,TN37996,plank@.‡AT&T Labs-Research,Shannon Laboratory,180Park Avenue,Florham Park,NJ07932,alb@.12.1InvertibilityOne way to test whether a square matrix M is invertible is to perform Gaussian Elimination on it until it is in upper triangular form.Then M is invertible if and only if the result is unit upper triangular.(Basic facts about invertibilityof matrices under simple operations are available in many textbooks,e.g.,Lancaster and Tismenetsky[6].)We define steps of Gaussian Elimination as follows.Let c be the leftmost column with at least two1’s in some M w;let r be the topmost row such that M w[r,c]=1and M w[r,c ]=0for0≤c <c.Then one step of Gaussian Elimination or Elimination Step replaces every row r =r such that M w[r ,c]=1with the sum of rows r and r .An example is in Figure1.Thefirst step of Gaussian Elimination for the matrix in Figure1(a)replaces row2with thesum of rows0and2,and row3with the sum of rows0and3.The resulting matrix is in Figure1(b).(a)(b)(c)Figure1:One step of Gaussian Elimination,and deleting rows and columns that are upper-triangular.When the leftmost columns of a matrix M w have zeros below the main diagonal—i.e.,M[i,j]=0for0≤i< and i<j<w—we say the leftmost columns are in upper triangular form or are upper triangular;if inaddition M 
w[i,i]=1for0≤i< ,we say the leftmost columns are in unit upper triangular form or are unit upper triangular.Assume the leftmost columns of M w are unit upper triangular,and construct matrix M w ,where w =w− ,by deleting the leftmost columns and top rows of M w.Then M w is invertible iff M w is invertible.For example,since the leftmost two columns of the matrix in Figure1(b)are in unit upper triangular form,we may delete the leftmost two columns and the top two rows to produce the matrix in Figure1(c).This matrix is not invertible; therefore,the matrices in Figures1(a)and1(b)are also not invertible.There are other simple operations that preserve invertibility.Thefirst are what we call row shifting and columnshifting.There are four variants.Each takes an original matrix M w and constructs a new matrix M w∗as follows:•Shifting up by r rows:M w∗[i,j]=M w[i+r w,j],for0≤i,j<w.•Shifting down by r rows:M w∗[i,j]=M w[i−r w,j],for0≤i,j<w.•Shifting left by c columns:M w∗[i,j]=M w[i,j+c w],for0≤i,j<w.•Shifting right by c columns:M w∗[i,j]=M w[i,j−c w],for0≤i,j<w.Obviously,shifting M w up by r rows is equivalent to shifting it down by w−r rows,and shifting M w left by rcolumns is equivalent to shifting it right by w−r columns.Swapping rows and columns preserves invertibility,andsubstituting any row with the sum of it and another row also preserves invertibility.Examples are in Figure2.We denote by I w(rsp.,I w→c)the w×w identity matrix(rsp.,shifted c columns to the right)and by0w the w×w matrix of all zeros.Finally,we say a matrix class M is invertible iff all matrices in M are invertible.3The Matrix Classes D w d,s and S w d,sWe now define two classes of matrices:D w d,s and S w d,s.In both:w>2,0<d<w,and0<s<w.The letters are short for“different”and“same”.We define D w d,s,0to be the base element of D w d,s.We construct D w d,s,0as follows:•Start with D w d,s,0=I w+I w→d.2(a)(b)(c)(d)(e)Figure2:Operations that preserve invertibility.(a)is the original matrix.(b)shifts(a)up by three rows,or down by four rows.(c)shifts(a)left by four rows,or right by three rows.(d)swaps rows3and6.(e)replaces row6with the sum of rows3and6.•Set D w d,s,0[0,w−1]=D w d,s,0[0,w−1]⊕1.•Set D w d,s,0[s,d+s−1w]=D w d,s,0[s,d+s−1w]⊕1.There are w elements of D w d,s,denoted D w d,s,0,...,D w d,s,w−1.D w d,s,i is equal to D w d,s,0shifted i rows down and i columns to the right.Therefore,all elements of D w d,s have the same invertibility.Figure3gives various examples.The intuition is that elements of D w d,s are composed of two diagonals that differ by d columns.There are two extra bits flipped in the matrix,which are s rows apart and adjacent to different diagonals.D73,2,0D73,2,2D73,2,6D71,3,0D71,3,3Figure3:Various examples of matrices in D w d,s.The definition of S w d,s is similar,except the two extra bits that areflipped are adjacent to the same diagonal.As with D w d,s,we define a base element S w d,s,0as follows:•Start with S w d,s,0=I w+I w→d.•Set S w d,s,0[0,w−1]=S w d,s,0[0,w−1]⊕1.•Set S w d,s,0[s,s−1w]=S w d,s,0[s,s−1w]⊕1.As with D w d,s,there are w elements of S w d,s,denoted S w d,s,0,...,S w d,s,w−1.S w d,s,i is equal to S w d,s,0shifted i rows down and i columns to the right.Note that when w is even,there are only w/2distinct elements of S w d,s,because S w d,s,i is .We give examples of S w d,s in Figure4.equal to S wd,s,i+w2w4Simple Relationships on D w d,s and S w d,s that Preserve InvertibilityWe use the following relationships on D w d,s and S w d,s.Lemma1D w d,s is invertible iff D w w−d,w−s is 
invertible.3S73,2,0S73,2,2S71,3,0S76,3,0S62,3,0=S62,3,3Figure4:Various examples of matrices in S w d,s.D113,4,0D118,7,0Figure5:D118,7,0is constructed from D113,4,0by shifting it four rows up and(4+3)rows to the left.Proof:D w w−d,w−s,0can be derived by shifting D w d,s,0s rows up and s+d columns left.2 Figure5demonstrates Lemma1.Lemma2S w d,s is invertible iff S w d,w−s is invertible.Proof:S w d,s,w−s is identical to S w d,w−s,0.2 Lemma3For s>1,D w d,s is invertible iff S w d,s is invertible.Proof:S w d,s,0can be constructed from D w d,s,0by substituting row s with row s plus row(s−1).2 Figure6demonstrates Lemma3.D113,4,0S113,4,0Figure6:S113,4,0is constructed from D113,4,0by substituting row4with row4plus row3.Lemma4For0<s<w−1,S w d,s is invertible iff S w w−d,s is invertible.Proof:S w w−d,s,0can be constructed from S w d,s,0byfirst substituting row s with row s plus row(s−1)and row0with row0plus row w−1,and then shifting the result d columns to the left.24(a)(b)(c)(d)S113,4,0S118,4,0Figure7:(b)is created by substituting row4with row4plus row3.(c)is created by substituting row0with row0 plus row10.(d)is created by shifting left three columns.Lemma4is demonstrated by Figure7,where each step of converting S113,4,0to S118,4,0is shown.The constraints on s in Lemmas3and4are due to the following.When s=1,adding rows0and1does not have the desired effect of moving row s’s one from one diagonal to the other,because row0has three ones.When s=w−1, adding rows0and w−1has the same problem.For convenience in the sequel,we consider invertibility to be an equivalence relation,so two matrices or matrix classes are equivalent iff they are both invertible or both not invertible.5Our Target Class of Matrices,L,and the Grand Liberation TheoremWe define the class L to be the union of all D w d,s such that:•w>1is odd.•GCD(d,w)=1.•If d is even,s=w−d2.•If d is odd,s=w−d2.Theorem5(The Grand Liberation Theorem)All matrices in L are invertible.The rest of this paper proves the theorem.After demonstrating a few special cases,which include D31,1,the proof proceeds as follows:1.We prove by induction that D w2,w−1is invertible for all odd w.2.For d>2,wefirst show that for any odd d there exists some even d such that D wd,w−d2is equivalent to D wd ,w−d 2.Hence we restrict our attention to even d>2.3.We show that for any even d>2,D wd,w−d2is equivalent to S wd,d2.4.We show that the derived S wd,d2is equivalent to some S wd ,w 2with2<w <w,w even,and GCD(w ,d )=1.5.We show that any S w d,w2with even w>2and GCD(w,d)=1is equivalent to some D w d ,s ∈L such thatw <w.6.A second inductive argument completes the proof,as we can iterate Steps2–5until w =3or d =2in Step5.55.1Step 1:Base Cases for the Global InductionFirst,there are only two D 3d,s ∈L :D 31,1and D 32,2.Their base elements are shown in Figure 8(a)and (b).It is easyto verify that they are invertible.Additionally,Figure 8(c)shows D 52,4,4,which will be used below.It is also easy to verify that it is invertible.(a)(b)(c)(d)(e)D 31,1,0D 32,2,0D 52,4,4D 112,10,10D 112,10,10after two stepsof Gaussian Elimination.Figure 8:Base cases for the inductive proof.We now prove that D w2,w −1is invertible for all odd w .We have already shown in Figure 8that this is truefor w =3and w =5.Let w >5be odd,and assume by induction that D w 2,w −1is invertible for odd 1<w<w .Consider D w 2,w −1,w −1.An example is D 112,10,10,depicted in Figure 8(d).This matrix has a very speci fic format:Allelements of I and I →2are set to one,as are D w 2,w −1,w −1[w −1][w −2]and D w2,w −1,w −1[w −2][w 
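To accompany Steps 1–6, the following Python sketch (not part of the paper) builds the base element D^w_{d,s,0} exactly as defined in Section 3 and checks invertibility over GF(2) with the Gaussian-elimination test of Section 2.1:

# Build D^w_{d,s,0} = I_w + I_w->d with the two extra bits flipped, and test
# invertibility in GF(2) by Gaussian elimination (arithmetic is XOR).
import numpy as np

def D_base(w, d, s):
    M = (np.eye(w, dtype=int) + np.roll(np.eye(w, dtype=int), d, axis=1)) % 2
    M[0, w - 1] ^= 1                          # first extra bit
    M[s, (d + s - 1) % w] ^= 1                # second extra bit, s rows below the first
    return M

def invertible_gf2(M):
    A = M.copy() % 2
    w = A.shape[0]
    for col in range(w):
        pivot = next((r for r in range(col, w) if A[r, col]), None)
        if pivot is None:
            return False                      # no pivot in this column: singular
        A[[col, pivot]] = A[[pivot, col]]     # row swap preserves invertibility
        for r in range(w):
            if r != col and A[r, col]:
                A[r] ^= A[col]                # row addition over GF(2)
    return True

# A member of L (w odd, d even, GCD(d, w) = 1, s = w - d/2):
w, d = 7, 4
print(invertible_gf2(D_base(w, d, w - d // 2)))   # True, as the theorem asserts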
−1].Now,perform two steps of Gaussian Elimination.This will set D w 2,w −1,w −1[w −2,0]and D w2,w −1,w −1[w −1,1]to zero,and D w 2,w −1,w −1[w −2,2]and D w2,w −1,w −1[w −1,3]to one.Figure 8(e)demonstrates for w =11.The resulting matrix’s first two columns are unit upper triangular,so the first two rows and columns may be deleted.Thisyields D w −22,w −3,w −3,which is of the form D w 2,w −1,w −1for some odd 1<w <w .By induction,D w2,w −1,w −1isinvertible.Therefore,D w2,w −1is invertible for all odd w >1.5.2Steps 2–4:Reducing the Problem to S wd,w 2for w Even,GCD (w,d )=1Now consider any D w d,w −d 2∈L such that d is odd.By Lemma 1,this is equivalent to D w w −d,w −w −d 2.Since w −d is an even number,D w w−d,w −w −d 2∈L .Therefore,every element D wd,s ∈L for which d is odd has a corresponding element D w d ,s ∈L for which d is even.Thus we need only prove that the elements D wd,w −d 2∈L with even d are invertible.We proved above that D w 2,w −1is invertible,so we now prove that D wd,w −d 2is invertible for even d >2.Therefore,consider D wd,w −d2such that d >2is even,w >3is odd,and GCD (w,d )=1.Since d >2,it follows that w −w −12=w −12≤w −d 2≤w −2.Since w >3,the smallest value that w −d 2may be is 5−12=2.Therefore,by Lemma 3,D w d,w −d 2is equivalent to S w d,w −d 2,which by Lemma 2is equivalent to S w d,w −(w −d 2)=S wd,d 2.So now consider S w d,d 2,w −d2−1.An example is S 176,3,13,depicted in Figure 9(a).Suppose w >2d .(w will notequal 2d ,because GCD (w,d )=1.)Perform d steps of Gaussian Elimination on S wd,d2,w −d 2−1.This moves the ones in rows w −d through w −1from columns 0through d −1to columns d through 2d −1.In our example of S 176,3,13,six steps of Gaussian Elimination are shown in Figure 9(b).Therefore,when we delete the first d rows and columnsof the resulting matrix,we are left with S w −dd,d2,w −d −d 2−1.Note:w −d is odd;w −d >d ;and since GCD (w,d )=1,GCD (w −d,d )=1.Our example continues in Figure 9(c),where we delete the first six rows and columns ofFigure 9(b)to get S 116,3,7.Iterate this process until it yields S wd,d 2,w −d2−1for d <w <2d .We now perform (w −d )steps of Gaussian Elimination.This moves the leftmost ones in rows (w −d )through (2(w −d )−1)over d columns to the right.Whenwe delete the first w −d rows and columns,we are left with S d d−(w −d ),d 2,d 2−1=S d2d −w,d 2,d 2−1.Since GCD (w,d )=1,GCD (d,2d −w )=1as well.6(a)(b)(c)(d)(e)S 176,3,13Six Elimination StepsS 116,3,7Five Elimination StepsS 61,3,2Figure 9:An example of converting S w d,d 2to S dx,d 2for w =17and d =6.Figure 9(d)shows 11−6=5steps of Gaussian Elimination of S 116,3,7,and Figure 9(e)shows that S 61,3,2results when we delete the first five rows and columns from Figure 9(d).We have thus reduced the original problem to the following:Given S wd ,w2with even w >2and GCD (w ,d )=1,determine whetherS wd ,w2invertible.We address this in the next section.5.3Steps 5–6:Proving that S wd,w 2is Invertible for w Even,GCD (w,d )=1Since w >2,it follows that 1<w 2<w −1.Therefore,by Lemma 4,S w d,w 2is equivalent to S w w −d,w 2,so we may assume that d >w 2.We’re going to break this proof into two cases.The first is when d >w 2+1.Consider S wd,w 2,w2−1.An example of this is S 1611,8,7displayed in Figure 10(a).We perform w −d steps of Gaussian Elimination on S wd,w2,w 2−1.Since d >w 2+1,we know that w −d <w2−1,so the w −d steps of Gaussian Elimination simply move the leftmost ones in rows (w −d )through (2(w −d )−1)over d columns to the right.Deleting the first w −d rows and columnsfrom the matrix,we are left 
with S d 2d −w,w 2,d −w 2−1.These steps are shown in Figures 10(b)and (c),as S 1611,8,7is converted into into S 116,8,2.(a)(b)(c)S 1611,8,75Elimination StepsS 116,8,2Figure 10:An example of converting S w d,w 2to S d2d −w,w 2for w =16and d =11.As before,since GCD (w,d )=1,we know that GCD (2d −w,d )=1.That w is even implies that 2d −w isalso even.Moreover,since d >w 2+1,we know that 2d −w >1.Therefore,by Lemma 3,S d2d −w,w 2is equivalent to D d 2d −w,w 2.Finally:d −2d −w2=2d −2d +w 2=w 2.7Therefore,D d2d−w,w2=D d2d−w,d−2d−w2,which is an element of L.By induction,D d2d−w,d−2d−w2is invertible,imply-ing that S w d,w2is invertible.The second case is for S w d,w2when d=w2+1and GCD(w,w2+1)=1.An example is S169,8,7shown inFigure11(a).Again,we will perform w−d elimination steps.We will do this in two parts,however.In thefirst part,we perform w−d−1elimination steps.This moves the leftmost ones in rows(w−d)through(2(w−d)−2) over d columns to the right.This is pictured in Figure11(b).The last elimination step replaces two rows of the matrix, because row(w−d)has an extra one adjacent to the diagonal.Therefore,both rows(w−d)and(2(w−d)−1) move their leftmost ones into the column(w−1).This is pictured in Figure11(c).(a)(b)(c)(d)S169,8,76Elimination Steps1More S92,8,0Figure11:An example of converting S w d,w2to S d2,d−1when d=w2+1for w=16.When thefirst w−d rows and columns are deleted from the matrix,we are left with S d2d−w,w2,0.Since d=w2+1,this is equal to S d2,d−1,0(shown as S92,8,0in Figure11(d)).That w>2is even implies that d=w2+1>2isodd,so d−1>1;thus by Lemma3,S d2,d−1is equivalent to D d2,d−1,which we proved invertible in Section5.1. Therefore S w d,w2is invertible.Q.E.D.6The Class O and the Little Liberation TheoremWe now define a third class of matrices,O w d such that w>d≥1.We define O w d,0to be the base element of O w d and construct O w d,0as follows:•Start with O w d=I w+I w→d.•Set O w d,0[0,w−1]=O w d,0[0,w−1]⊕1.Thus,O w d,0is similar to D w d,s,0and S w d,s,0,except it only has one extra one in it,in the top-right corner.There are w elements of O w d,denoted O w d,0,...,O w d,w−1.O w d,i is equal to O w d,0shifted i rows down and i columns to the right. 
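As a brief aside, the base-element constructions used throughout this paper are easy to experiment with numerically. The following sketch is our own illustration (the function names and the list-of-lists bit-matrix representation are our choices, not the paper's): it builds O^w_{d,0} directly from the definition above and decides invertibility over GF(2) by Gaussian elimination.

```python
# Illustrative sketch (not the authors' code): build O^w_{d,0} over GF(2)
# and decide invertibility by Gaussian elimination.

def identity_shifted(w, c):
    """w x w identity matrix shifted c columns to the right: ones at [i, (i+c) mod w]."""
    return [[1 if j == (i + c) % w else 0 for j in range(w)] for i in range(w)]

def o_base(w, d):
    """Base element O^w_{d,0}: I_w + I_{w->d} with the bit in the top-right corner flipped."""
    a, b = identity_shifted(w, 0), identity_shifted(w, d)
    m = [[a[i][j] ^ b[i][j] for j in range(w)] for i in range(w)]
    m[0][w - 1] ^= 1
    return m

def invertible_gf2(matrix):
    """True iff the square 0/1 matrix is invertible over GF(2) (full rank after elimination)."""
    a = [row[:] for row in matrix]
    n = len(a)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col]), None)
        if pivot is None:
            return False
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col and a[r][col]:
                a[r] = [x ^ y for x, y in zip(a[r], a[col])]
    return True

# GCD(7, 3) = 1, so the Little Liberation Theorem below predicts invertibility.
print(invertible_gf2(o_base(7, 3)))  # expected: True
```

The base elements D^w_{d,s,0} and S^w_{d,s,0} of Section 3 can be generated the same way by flipping the one additional bit their definitions call for.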
Therefore,all elements of O w d are equivalent.We show some examples of matrices in O w d in Figure12.(a)(b)(c)(d)O73,0O73,6O74,6O21,1Figure12:Examples of matrices in O w d.We start with a simple lemma:8Lemma 6O w d is invertible iff O ww −d is invertible.Proof:O w w −d,w −1can be derived by replacing row w −1of O w d,w −1with the sum of rows w −1and w −2,and shiftingthe resulting matrix d columns to the left.An example is in Figure 12,where O 74,6may be obtained by replacing row 6of O 73,6with the sum of rows 5and 6,and shifting the result three columns to the left.2De fine O to be the union of all O wd such that GCD (w,d )=1.Theorem 7(The Little Liberation Theorem)All matrices in O are invertible.Proof:This proof is far simpler than that of the Grand Liberation Theorem.It,too,is inductive.We start with the base case O 21,an element of which is pictured in Figure 12(d).This matrix is already in unit upper triangular form and is therefore invertible.Now,consider O w d ∈O and suppose by induction that O wd∈O is invertible for all 1≤d <w <w .By Lemma 6and the hypothesis that GCD (w,d )=1,we may assume that d <w 2,or else we consider O ww −d in lieu of O w d .Performing d elimination steps on O w d,w −1moves the leftmost ones in rows w −d through w −1over d columnsto the right.Since d <w 2,these ones will not be moved to the diagonal,nor will the one at O wd,w −1[w −1,w −2]be affected.Therefore deleting the leftmost d columns,which are now unit upper triangular,and top d rows leaves O w −d d,w −d −1.Since GCD (w,d )=1,GCD (w −d,d )=1,and therefore O w −d d,w −d −1∈O .By induction,O w −d dis invertible;therefore O wd is invertible.2An example is depicted in Figure 13where O 125,11is converted to O 75,6by five elimination steps.(a)(b)(c)O 125,11Five elimination stepsO 75,6Figure 13:An example of converting O wd,w −1to O w −d d,w −d −1.7Some Trivial Properties of Matrices in GF (2)The following two lemmas are likely folklore,but we include them for completeness.Lemma 8If some matrix M w has precisely w ones,then M w is invertible iff it is a permutation matrix.Proof:Any permutation matrix is invertible.Conversely,if M w has precisely w ones but is not a permutation matrix,then some row or column contains all zeros,in which case M w is not invertible.2Lemma 9Let M w 1and M w 2be permutation matrices.The sum M w 1+M w2is not invertible.Proof:Let M w =M w 1+M w 2.Suppose there exist r and c such that M w 1[r,c ]=M w2[r,c ]=1.Then row r of M w contains all zeros,so M w is not invertible.Thus,we assume there are no such r and c ;in this case,M has precisely two ones in each row and column.We prove by induction that such matrices are not invertible.The base case is shown in Figure 14(a),which depicts the only M 2with two ones in each row and column.This matrix is clearly not invertible.9Now,let matrix M w for some w>2have exactly two ones in each row and column.Let rows r1and r2be the two rows that have ones in column zero,and let c1,c2>0be such that M w[r1,c1]=M w[r2,c2]=1.Swap row r1with row0,and perform one elimination step.This will set M w[r2,0]=0and M w[r2,c1]=M w[r2,c1]⊕1.If c1=c2, then all of row r2’s elements become zero,so M w is not invertible.If c1=c2,then deleting thefirst row and column leaves a matrix M w−1with exactly two ones in each row and column.By induction,this new matrix in not invertible; therefore M w is also not invertible.2(a)(b)(c)(d)(e)(f)M2.M7,After one elimination M7,After one M6.c1=c2=3.step,Row4is all zeros.c1=3,c2=2.elimination step.Figure14:Matrices with two ones in 
each row and column.We show examples of the elimination step in Figure14(b)-(f).In Figure14(b),the elimination step,depicted in Figure14(c)turns row4into all zeros.In Figure14(d),the elimination step of the matrix M7results in Figure14(e), which is equivalent to a matrix M6(Figure14(f)).8A Final Theorem on the Invertibility of a Type of(k+2)w×kw MatrixLet row r of some matrix M w contain precisely one one—we call such a row an identity row—and let the one be in column c.By cofactor expansion,deleting row r and column c yields an equivalent matrix M w−1.The remaining theorem concerns a(k+2)×k block matrix A,structured as follows and pictured in Figure15:•Each block is w×w.•Block A[i,i]=I w for0≤i<k.•Blocks A[i,j]=A[j,i]=0w for0≤i<j<k.•Block A[k,j]=I w for0≤j<k.•Block A[k+1,j]=X j for0≤j<k and some given X j.Consider the class A∗of k+22 block matrices induced by deleting any two rows of blocks from A.Theorem10All matrices in A∗are invertible iff(1)every X i is invertible,and(2)for0≤i<j<k,X i+X j is invertible.Proof:Let A∈A∗.There are four cases.Case1:A is composed of thefirst k rows of blocks of A,which form I kw.Case2:A is composed of row k and any k−1of thefirst k rows of blocks of A.Now,A has(k−1)w identity rows.Deleting these rows and their associated columns yields I w.Case3:A is composed of row k+1and any k−1of thefirst k rows of blocks of A;let i be the omitted row of blocks from thefirst k.Again,A has(k−1)w identity rows,which we can delete with their associated columns to yield X i,so A is equivalent to X i.10Figure15:The(k+2)×k block matrix A of w×w matrices over GF(2).Figure16:The2w×2w matrix that results when w(k−2)identity rows are deleted from A in Case4.Case4:A is composed of rows k,k+1,and any k−2of thefirst k rows of blocks of A;let i and j be the omitted rows of blocks from thefirst k.Now A has(k−2)w identity rows,which we can delete with their associated columns to yield the matrix pictured in Figure16.Now,perform w elimination steps on this matrix.For each r and c such that X i[r,c]=1,the elimination step for column c will replace row w+r with the sum of rows w+r and c.This will set X i[r,c]=0and X j[r,c]=X j[r,c]⊕1. After the elimination steps,the leftmost w columns will be upper triangular,and deleting them leaves X i+X j. 
Therefore, A is equivalent to X_i + X_j. □

9 Acknowledgements

This material is based upon work supported by the National Science Foundation under grants CNS-0437508 and CNS-0615221.

Notes on quasiminimality and excellence


Notes on Quasiminimality and ExcellenceJohn T.BaldwinDepartment of Mathematics,Statistics and Computer ScienceUniversity of Illinois at Chicago∗April7,2004AbstractThis paper ties together much of the model theory of the last50years.Shelah’s attempts to generalize the Morley theorem beyondfirst order logic led to the notion of excellence,which is a key to the structure theory of uncountable models.The notion of Abstract Elementary Class arose naturally in attempting toprove the categoricity theorem for Lω1,ω(Q).More recently,Zilber has attempted to identify canonicalmathematical structures as those whose theory(in an appropriate logic)is categorical in all powers.Zilber’strichotomy conjecture forfirst order categorical structures was refuted by Hrushovski,by the introducion of aspecial kind of Abstract Elementary Class.Zilber uses a powerful and essentailly infinitary variant on thesetechniques to investigate complex exponentiation.This not only demonstrates the relevance of Shelah’smodel theoretic investigations to mainstream mathematics but produces new results and conjectures inalgebraic geometry.Zilber proposes[63]to prove‘canonicity results for pseudo-analytic’rmally,‘canonical’means ‘the theory of the structure in a suitable possibly infinitary language(see Section2)has one model in each uncountable power’while‘pseudoanalytic’means‘the model of power2ℵ0can be taken as a reduct of an expansion of the complex numbers by analytic functions’.This program interacts with two other lines of research.First is the general study of categoricity theorems in infinitary languages.After initial results by Keisler,reported in[31],this line was taken up in a long series of works by Shelah.We place Zilber’s work in this context.The second direction stems from Hrushovski’s construction of a counterexample to Zilber’s conjecture that every strongly minimal set is‘trivial’,‘vector space-like’,or‘field-like’.This construction turns out to be a very concrete example of an Abstract Elementary Class,a concept that arose in Shelah’s analysis. And the construction is a crucial tool for Zilber’s investigations.This paper examines the intertwining of these three themes.For simplicity,we work in a countable vocabulary.The study of(C,+,·,exp)leads one immediately to some extension offirst order logic;the integers with all their arithmetic arefirst order definable in(C,+,·,exp).Thus,thefirst order theory of complex exponentiation is horribly complicated;it is certainly unstable and so itsfirst order theory cannot be categorical in power.That is,thefirst order theory of complex exponentiation cannot have exactly one model in each uncountable cardinal. 
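To spell out the definability claim made above (the display is our gloss, not a quotation from the text): the kernel of exp is 2πiℤ, and an x ∈ ℂ multiplies this kernel into itself exactly when x is an integer, so

ℤ = { x ∈ ℂ : ∀y ( exp(y) = 1 → exp(xy) = 1 ) },

which is a first order definition of the integers in (C, +, ·, exp).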
One solution is to use infinitary logic to pin down the pathology.Insist that the kernel of the exponential map isfixed as a single copy of the integers while allowing the rest of the structure to grow.We describe in Section5 Zilber’s theorem that,modulo certain(very serious)algebraic hypotheses,(C,+,·,exp)can be axiomatized bya categorical Lω1,ω(Q)-sentence.The notion of amalgamation is fundamental to model theory.Even in thefirst order case,the notion is subtle because there are several variants depending on the choice of a class of models K and a notion≺of substructure.∗Partially supported by NSF grant DMS-0100594and CDRF grant KM2-2246.1The pair(K,≺)has the amalgamation property if whenever M∈K is embedded by f0,f1into N0,N1so that the image of the embeddings f0M,f1M≺N0,N1respectively,there is an N∗and embeddings g0,g1of N0,N1 into N∗with g0f0and g1f1agreeing on M.If K is the class of models of a completefirst order theory then the amalgamation property holds with≺as elementary embeddings of models.If K is the class of substructures of models of a complete quantifier eliminablefirst order theory then the amalgamation property holds for≺as arbitrary embeddings.Morley[39]observed that by adding names for each definable relation,we can assume, for studying the spectrum problm,that anyfirst order theory has elimination of quantifiers.Shelah[47],noted that this amalgamation hypothesis allows us to assume the existence of a‘monster model’which serves as a universal domain.In this domain the notion of type of an element a over a set A can be thought of either semantically as the orbit of a under automorphisms thatfix A or syntactially as the collection of formulas with parameters from A that are satisfied by a.Of course,the extension fromfirst order logic causes the failure of the compactness theorem.For example,it is easy to write a sentence in Lω1,ωwhose only model is the natural numbers with successor.But thereare some more subtle losses.The duality between the syntactic and semantic concept of type depends on the amalgamation property.Here is a simple example showing that amalgamation fails in models of a sentence ofLω1,ω.Consider the theory T of a dense linear order without endpoints,a unary predicate P(x)which is denseand codense,and an infinite set of constants arranged in order typeω+ω∗.Let K be class of all models of T which omit the type of a pair of points,which are both in the cut determined by the constants.Now consider the types p and q which are satisfied by a point in the cut,which is in P or in¬P respectively.Now p and q are each satisfiable in a member of K but they are not simultaneously satisfiable.So the amalgamation property has failed for K and elementary embeddings.This shows that a more subtle notion than consistency is needed to describe types in this wider context.We took‘canonical’above as meaning‘categorical in uncountable cardinalities’.That is,the class has exactly one model in each uncountable cardinality.The analysis offirst order theories categorical in power is based on first studying strongly minimal sets.A set is strongly minimal if every definable subset of it isfinite or cofinite.A natural generalization of this,particularly since it holds of simply defined subsets of(C,+,·,exp),is to consider structures where every definable set is countable or cocountable.As we will see,the useful formulation of this notion requires some auxiliary homogeneity conditions.The role of homogeneity in studying categoricity in infinitary languages has been known for a long 
time.There is a rough translation between‘homogeneity’hypotheses on a model and and corresponding‘amalgamation’hypotheses on the class of substructures of the model(Section2).A structure isℵ1-homogeneous if for any two countable sequences a,b,which realize the same type,and any c,there is a d such that a c and b d realize the same type.Thus,ℵ1-homogeneity corresponds to amalgamation over arbitrary countable subsets.Keisler[31]proved the natural generalization of Morley’stheorem for a sentenceψin Lω1,ωmodulo two assumptions:1.Every model ofψhas arbitrarily large elementary extensions.2.Every model ofψisℵ1-homogeneous.Keisler asked whether everyℵ1-categorical sentence in Lω1,ωsatisfies assumption2.The answer is no.Marcus[37]gave an example of a minimal prime model with infinitely many indiscernibles and a modification by Shelahprovides an example of a totally categorical(categorical in each uncountable cardinality)sentence in Lω1,ωwhich has noℵ1-homogeneous models.Shelah’s notion of an excellent class(extremely roughly:‘amalgamation over(independent)n-dimensional cubes for all n’and‘ℵ0-stability’)provides a middle ground.An excellent class(See paragraph2.0.9.)is a strengthening of Keisler’sfirst assumption(provides not only arbitrarily large models but a certain control over their construction)while weakening the second to assert amalgamation only over certain configurations.Recall that the logic L(Q)adds tofirst order logic the expression(Qx)φ(x)which holds if there are uncountably many solutions ofφ.I had asked whether a sentence in L(Q)could have have exactly one model and that model2have cardinalityℵ1.Shelah proved in[45]using that anℵ1-categorical sentence in Lω1,ω(Q)must have a modelof powerℵ2.There is a beautiful proof of this result in ZFC in[53].Shelah has moved this kind of argument from(ℵ1,ℵ2)to(λ,λ+)in a number of contexts.But,getting arbitrarily large models just from categoricity in a single cardinal has remained intractable,although Shelah reported substantial but not yet written progress in the summer of2003.Shelah proved an analogue to Morley’s theorem in[48,49]for‘excellent’classes defined in Lω1,ω.Assuming2ℵn<2ℵn+1,for all n<ω,he also proved the following kind of converse:every sentence in Lω1,ωthat iscategorical inℵn for all n<ωis excellent and categorical in all cardinals.The assumption of categoricityall the way up toℵωis shown to be essential in[18]by constructing for each n a sentenceψn of Lω1,ωwhichis categorical up toℵn but has the maximal number of models in all sufficiently large cardinalities.He alsoasserted that these results‘should be reproved’for Lω1,ω(Q).This‘reproving’has continued for20years andthefinale is supposed to appear in the forthcoming Shelah[50,51].Zilber’s approach to categoricity theorems is more analogous to the Baldwin-Lachlan approach than to Morley’s. 
Baldwin-Lachlan[8]provide a structural analysis;they show each model of anℵ1-categorical theory is prime over a strongly minimal set.This allows one to transfer the‘geometric’proof of categoricity in power for a strongly minimal theory to show categoricity inℵ1implies categoricity in all cardinalities.In fact,Zilber considers only the quasiminimal case.But a‘Baldwin-Lachlan’style proof was obtained by Lessmann for homogeneous model theory in[35]and for excellent classes in[34].That is,he proves every model is prime and minimal over a quasiminimal set.We begin in Section1by recalling the basic notions of the Fra¨ıss´e construction and the notion of homogeneity.In Section2,we sketch some results on the general theory of categoricity in non-elementary logics.In particular,we discuss both reductions to the‘first order logic with omitting types’and the‘syntax-free’approach of Abstract Elementary Classes.We turn to the development of the special case of quasiminimal theories in Section3. This culminates in Zilber’sfirst approximation of a quasiminimal axiomatization of complex exponentiation.In Section4we formulate the generalized Fra¨ıss´e construction and place it in the setting of Abstract Elementary Classes.We analyze this method for constructingfirst order categorical theories;we then see a variant to get examples in homogeneous model theory.Then we discuss the results and limitations of the program to obtain analytic representations of models obtained by this construction.Finally in Section5we return to Zilber’s use of these techniques to study complex exponentiation.We describe the major algebraic innovations of his approach and the innovations to the Hrushovski construction which result in structures that are excellent but definitely notfirst order axiomatizable.Many thanks to Rami Grossberg and Olivier Lessmann,who were invaluable in putting together this survey, but are not responsible for any ments by Assaf Hasson,David Kueker,David Marker,Charles Steinhorn,Saharon Shelah,and Boris Zilber improved both the accuracy and the exposition.We particularly thank the referee and editor for further clarifying the expositition.1The Fra¨ıss´e ConstructionIn the early1950’s Fra¨ıss´e[13]generalized Hausdorff’s back and forth argument for the uniqueness of the rationals as a countable dense linear order(without end points).He showed that any countable class K offinite relational structures closed under substructure and satisfying the joint embedding and amalgamation properties (see Definition4.1.6)has a unique countable(ultra)-homogeneous member(denoted G):any isomorphism betweenfinite subsets of G extends to an automorphism.There are easy variants of this notion for locally finite classes in a language with function symbols.The existence of such structures is proved by iterating the amalgamation property and taking unions of chains.(See[21]for a full account.)J´o nsson[28]extended the notion to arbitrary cardinals and Morley-Vaught[40]created an analogous notion for the class of models offirst3order theories with elementary embeddings as the morphisms.They characterized the homogeneous universal models in this situation as the saturated models.In general the existence of saturated models in powerκrequires thatκ=κ<κandκ>2|L|;alternatively,one may assume the theory is stable.In particular,κ-saturated models areκ-homogeneous.Morley proved every uncountable model of a theory categorical in an uncountable power is saturated.Abstract versions of the Fra¨ıss´e construction undergird the next 
section;concrete versions dominate the last two sections of the paper.2Syntax,Stability,AmalgamationThis section is devoted to investigations of categoricity for non-elementary classes.We barely touch the immense literature in this area;see[15].Rather we just describe some of the basic concepts and show how they arisefrom concrete questions of categoricity in Lω1,ωand Lω1,ω(Q).In particular,we show how different frameworksfor studying nonelementary classes arise and some relations among them.Any serious study of this topic begins with[30,31].In its strongest form Morley’s theorem asserts:Let T be afirst order theory having only infinite models.If T is categorical in some uncountable cardinal then T is complete and categorical in every uncountable cardinal.This strong form does not generalize to Lω1,ω;take the disjunction of a sentence which is categorical in allcardinalities with one that has models only up to,say, ing both the upward and downward L¨o wenheim-Skolem theorem, L os[36]proved that afirst order theory that is categorical in some cardinality is complete.Since the upwards L¨o wenheim-Skolem theorem fails for Lω1,ω,the completeness cannot be deduced for this logic.However,if the Lω1,ω-sentenceψis categorical inκ,then,applying the downwards L¨o wenheim-Skolem theorem,for every sentenceφeitherψ→φor all models ofφhave cardinality less thanκ.So ifφandψareκ-categoricalsentences with a common model of powerκthey are equivalent.We say a sentence of Lω1,ωis complete if iteither implies or contradicts every other Lω1,ω-sentence.Such a sentence is necessarilyℵ0-categorical(usingdownward L¨o wenheim-Skolem).Moreover,every countable structure is characterized by a complete sentence, which is called its Scott sentence.So if a model satisfies a complete sentence,it is L∞,ω-equivalent to a countablemodel.In particular,any model M ofψ∈Lω1,ωis small.That is,for every n it realizes only countably manyLω1,ω-n-types(over the empty set).Moreover,ifφhas a small model thenφis implied by a complete sentencesatisfied in that model.In thefirst order case it is trivial to reduce the study of categoricity to complete(for Lω,ω)theories.Moreover,first order theories share the fundamental properties of sentences–in particular,L¨o wenheim-Skolem down toℵ0.But an Lω1,ω-theory need not have a countable model.The difficulty is that an Lω1,ω-theory need not beequivalent to a countable conjunction of sentences,even in a countable language.So while we want to reducethe categoricity problem to that for complete Lω1,ω-sentences,we cannot make the reduction trivially.Wefirstshow that ifψ∈Lω1,ωhas arbitrarily large models and is uncountably categorical thenψextends to a completesentence.A key observation is that ifψhas arbitrarily large models thenψhas models that realize few types.Lemma2.0.1Supposeψ∈Lω1,ωhas arbitrarily large models.1.In every infinite cardinalityψhas a model that realizes only countably many Lω1,ω-types over the emptyset.2.Thus,if N is the unique model ofψin some cardinal,ψis implied by a consistent complete sentenceψwhich holds of N.Proof.Sinceψhas arbitrarily large models we can construct a model with indiscernibles(Chapters13-15of [31]).Now take an Ehrenfeucht-Mostowski model M forψover a set of indiscernibles ordered by a k-transitive4dense linear order.(A ordering is k-transitive if any two properly ordered k-tuples are in the same orbit under the automorphism group.These orders exist in every cardinal;take the order type of an orderedfield.)Then for every n,M has only countably many orbits of 
n-tuples and so realizes only countably many types in anylogic where truth is preserved by automorphism–in particular in Lω1,ω.Ifψisκ-categorical,letψ be theScott sentence of this Ehrenfeucht-Mostowski model with cardinalityκ. 2.0.1 If we do not assumeψhas arbitrarily large models the reduction to complete sentences,sketched below,is more convoluted and uses hypotheses(slightly)beyond ZFC.In particular,the complete sentenceψ does nothold,a priori of the categoricity model.The natural examples of Lω1,ω-sentences which have models of boundedcardinality(e.g.a linear order with a countable dense subset,or coding up an initial segment of the Vαhierarchy of all sets)have the maximal number of models in the largest cardinality where they have a model.Shelah discovers a dichotomy(Theorem2.0.2)between such sentences and‘excellent’sentences.We expand on the notion of excellence at2.0.9and later in the paper.For the moment just think of the assertion that a completeLω1,ω-sentence(equivalently,its class of models)is excellent as a step into paradise.For any class K of models,I(λ,K)denotes the number of isomorphism types of members of K,with cardinality λ.We may writeψinstead of K if K is the class of models ofψ.We say that a class K has many models of cardinalityℵn if I(ℵn,K)≥µ(n)(and few if not;there may not be any).We use as a black box the functionµ(n)(defined precisely in[49]).Either GCH or¬O#implyµ(n)=2ℵn but it is open whether it might be(consistently)smaller.The difficult heart of the argument is the following theorem of Shelah[48,49];we don’t discuss the proof of this result but just show how this solution for complete sentences gives the result forarbitrary sentences of Lω1,ω.Theorem2.0.2 1.(For n<ω,2ℵn<2ℵn+1)A complete Lω1,ω-sentence which has few models inℵn foreach n<ωis excellent(see2.0.9).2.(ZFC)An excellent class has models in every cardinality.3.(ZFC)Suppose thatφis an excellent Lω1,ω-sentence.Ifφis categorical in one uncountable cardinalκthen it is categorical in all uncountable cardinals.So a nonexcellent class defined by a complete Lω1,ω-sentenceψmay not have arbitrarily large models but,ifnot,it must have many models in some cardinal less thanℵω.Combining several results of Keisler,Shelah[48] shows:Lemma2.0.3Assume2ℵ0<2ℵ1.Letψbe a sentence of Lω1,ωthat has at least one but less than2ℵ1modelsof cardinalityℵ1.Thenψhas a small model of cardinalityℵ1.Proof.By Theorem45of[31],for any countable fragment L∗containingψand any N|=ψof cardinalityℵ1, N realizes only countably many L∗types over the empty set.Theorem2.2of[45]says that ifψhas a model M of cardinalityℵ1which realizes only countably many types in each fragment thenψhas a small model of cardinalityℵ1.We sketch a proof of that theorem.Add to the language a linear order<,interpreted as a linearorder of M with order typeωing that M realizes only countably many types in any fragment,write Lω1,ωas a continuous increasing chain of fragments Lαsuch that each type in Lαrealized in M is a formula in Lα+1. 
Add new2n+1-ary predicates and n+1-ary functions f n.Let M satisfy E n(α,a,b)if and only if a and b realize the same Lα-type and let f n map M n+1into the initialωelements of the order,so that E n(α,a,b) implies f n(α,a)=f n(α,b).Note:i)E n(β,y,z)refines E n(α,y,z)ifβ>α;ii)E n(0,a,b)implies a and b satisfy the same quantifier free formulas;iii)ifβ>α,E n(β,a,b)implies(∀x)(∃y)E n+1(α,x a,y b).Thus,iv) for any a∈M each equivalence relation E n(a,y,z)has only countably many classes.All these assertions canbe expressed by an Lω1,ωsentenceφ.Now add a unary predicate symbol P and a sentenceχwhich assertsthat M is an end extension of P(M).For everyα<ω1there is a model Mαofφ∧ψ∧χwith order type of5(P(M),<)greater thanα.(Start with P asαand alternately take an elementary submodel for the smallest fragment L∗containingφ∧ψ∧χand close down under<.Afterωsteps we have the P for Mα.)Now by Theorem12of[31]there is countable structure(N0,P(N0))such that P(N0)contains a copy of(Q,<)and N0 is an end extension of P(N0).By Theorem28of[31],N0has an L∗elementary extension of cardinalityℵ1.Fix an infinite decreasing sequence d0>d1>...in N0.For each n,define E+n(x,y)if for some i,E n(d i,x,y).Now using i),ii)and iii)prove by induction on the quantifier rank ofφthat N1|=E+n(a,b)implies N1|=φ(a)if andonly if N1|=φ(b)for every Lω1,ω-formulaφ.For each n,E n(d0,x,y)refines E+n(x,y)and by iv)E n(d0,x,y)has only countably many classes;so N is small. 2.0.3Using these two results,we easily derive a version of Morley’s theorem for an Lω1,ω-sentence.Theorem2.0.4Assume2ℵn<2ℵn+1for n<ω.If an Lω1,ω-sentenceψhas an uncountable model,then either1.ψhas many models inℵn for some n<ωor2.ψhas arbitrarily large models and ifψis categorical in one uncountable cardinalκthen it is categoricalall uncountable cardinals.Proof.Supposeψhas few models inℵn for each n<ω.By Lemma2.0.3,choose a small model ofψ,say with Scott sentenceψ .Assuming2ℵn<2ℵn+1for each n,Theorem2.0.21)impliesψ is excellent.By Theorem2.0.2 2)ψ and thusψhave arbitrarily large models.Now supposeψis categorical inκ>ℵ0.Then so isψ whence, by Theorem2.0.23),ψ is categorical in all uncountable powers.To showψis categorical aboveκnote that by downward L¨o wenheim-Skolem all models ofψwith cardinality at leastκsatisfyψ ;the result follows by the categoricity ofψ .Ifψis not categorical in some cardinalityµ<κ, there must be a sentenceθwhich is inconsistent withψ but consistent withψ.Applying the entire analysis to ψ∧θ,wefind a complete sentenceψ which has arbitrarily large models,is consistent withψand contradicts ψ .But this is forbidden by categoricity inκ. 
2.0.4 One corollary of this result isCorollary2.0.5Assume2ℵ0<2ℵ1.If an Lω1,ω-sentence is categorical inℵn for n<ω,then it is categoricalin all cardinalities.Hart and Shelah[18]have shown the necessity of the hypothesis of categoricity up toℵω.A key tool in the study of complete Lω1,ω-sentences is the reduction of the class of models of such sentences toclasses which are‘closer’to beingfirst order.We now give a full account of this easy reduction.Chang proved in[12]that the class of models of any sentence in Lκ+,ωcould be viewed as the class of reducts to L of models of afirst order theory in an expansion L of L which omitted a family of types.Chang(Lopez-Escobar[12])used this observation to prove that the Hanf number for Lκ+,ωis same as the Hanf number for omitting a family ofκtypes.Shelah[45]took this reduction a step further and showed that the class of models of a complete sentencein Lω1,ωare in1-1correspondence(mapping L∞,ω-submodel to elementary submodel)with the class of atomicmodels of an appropriatefirst order theory in an expanded language.That is,to study the generalization ofMorley’s theorem to complete Lω1,ω-sentences it suffices to study classes of structures defined by a special typeoffinite diagram.By afinite diagram we mean an EC(T,Γ)class:those models offirst order theory that omit all types from a specified collectionΓof types infinitely many variables over the empty set.Abusing the EC(T,Γ)notation,EC(T,Atomic)denotes the class of atomic models of T(i.e.to conform to the notationwe should write nonatomic).Most detailed study of the spectrum of Lω1,ω-sentences[45,48,49,34,16,27]just work withfinite diagrams or more restrictively atomic models(and usually under stronger homogeneity conditions).In general,an atomic class might be defined by omitting uncountably many types;in the case of interest only countably many types have to be omitted.6Theorem2.0.6Letψbe a complete sentence in Lω1,ω.Then there is a countable language L extending Land afirst order L -theory T such that the reduct map is1-1from the atomic models of T onto the models ofψ.Proof.Let L∗be a countable fragment of Lω1,ωwhich contains all subformulas ofψand the conjunction ofeach Lω1,ω-type that is realized in a model ofψ.(This set is countable since complete sentences are small.)Expand L to L by inductively adding a predicate Pφ(x)for each L∗-formulaφ.Fix a model ofψand expand it to an L -structure by interpreting the new predicates so that the new predicates represent eachfinite Boolean connective and quantification faithfully:E.g.P¬φ(x)↔¬Pφ(x),andP(∀x)φ(x)↔(∀x)Pφ(x),and that,as far asfirst order logic can,the Pφpreserve the infinitary operations:for each i,P Vi φi(x)→Pφi(x).Let T be thefirst order theory of any such model and consider the countable setΓof typesp Vi φi(x)={¬P Viφi(x)}∪{Pφi(x):i<ω}.Note that if q is an Lω1,ω-type realized in a model of T,P V q generates a principal type in T.Now if M is amodel of T which omits all the types inΓ(in particular,if M is an atomic model of T),M|L|=ψand each model ofψhas a unique expansion to a model of T which omits the types inΓ(since this is an expansion bydefinitions in Lω1,ω). 
2.0.6So in particular,any complete sentence of Lω1,ωcan be replaced(for spectrum purposes)by considering theatomic models of afirst order theory.Since all the new predicates are Lω1,ω-definable this is the naturalextension of Morley’s procedure of replacing eachfirst order formulaφby a predicate symbol Pφ.Morley’s procedure resulted in a theory with elimination of quantifiers thus guaranteeing amalgamation over sets for first order categorical T.A similar amalgamation result does not follow in this case.Nor,In general,dofinite diagrams satisfy the upwards L¨o wenheim-Skolem theorem.Remark2.0.7(Lω1,ω(Q))The situation for Lω1,ω(Q)is more complicated.The example[18]of a sentenceof Lω1,ωthat isℵ1-categorical and not categorial in all uncountable powers is quite complicated.But theL(Q)theory of two disjoint infinite sets illustrates this phenomena trivially.Some of the analysis of[48,49] goes over directly.But many problems intervene and Shelah has devoted several articles(notably[52,50,51]tocompleting the analysis;a definitive version has not appeared.The difficulty in extending from Lω1,ωto Lω1,ω(Q)is in constructing models with the proper interpretation of the Q-quantifier.Following Keisler’s analysis of this problem in[30]the technique is to consider various notions of strong submodel.Two notions are relevant:in the first,the relation of M≺K N holds when definable sets which are intended to be countable(M|=¬(Qx)φ(x)) do not increase from M to N.The seconds adds that definable sets intended to be uncountable(M|=(Qx)φ(x)) increase from M to N.Thefirst notion gives an AEC(Definition2.0.8);the second does not.The reduction [53,50]is actually to an AEC along with the second relation as an auxiliary that guarantees the existence of standard models.When J´o nsson generalized the Fra¨ısse construction to uncountable cardinalities[28,29],he did so by describ-ing a collection of axioms,which might be satisfied by a class of models,that guaranteed the existence of a7。

The outer derivation of a complex Poisson manifold

The outer derivation of a complex Poisson manifold
by J-L. Brylinski* and G. Zuckerman
Revised in February, 1997
arXiv:math/9802014v1 [math.DG] 3 Feb 1998
* This research was supported in part by NSF grant DMS-9504522.
1. Hamiltonian and outer vector fields on a Poisson manifold
We will work either with a C∞ or a complex Poisson manifold M. In the first case there is a Poisson tensor π ∈ ∧2 TM, where TM is the tangent bundle. In the second case the Poisson tensor π is a holomorphic section of ∧2 ΘM, where ΘM is the holomorphic tangent bundle. There are three interesting classes of vector fields, which we enumerate starting with the largest class:
(1) the Poisson vector fields: a vector field ξ is Poisson if it preserves the Poisson structure, that is L_ξ π = {ξ, π} = 0, where { , } is the Schouten bracket.
(2) the locally hamiltonian vector fields: locally ξ is of the form X_H = i(dH)π, where i denotes interior product, and H is a smooth (resp. holomorphic) function defined locally.
(3) the hamiltonian vector fields: ξ = X_H for some global H.
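For orientation, the containments implied by the phrase "starting with the largest class" can be written out explicitly; the following display is our own summary of the three definitions above, not a formula from the paper:

{ X_H = i(dH)π : H globally defined } ⊆ { ξ : ξ = X_H locally } ⊆ { ξ : L_ξ π = {ξ, π} = 0 }.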

Discrete Mathematics Bilingual Glossary (离散数学双语专业词汇表)


《离散数学》双语专业词汇表Abelian group:交换(阿贝尔)群absorption property:吸收律acyclic:无(简单)回路的adjacent vertices:邻接结点adjacent vertices:邻接结点adjacent vertices:邻接结点algorithm verification:算法证明algorithm:算法alphabet:字母表alternating group:交替群analogous:类似的analysis of algorithm:算法分析antisymmetric:反对称的approach:方法,方式argument:自变量associative:可结合的associative:可结合的asymmetric:非对称的backtracking:回溯base 2 exponential function:以2为底的指数函数basic step:基础步biconditional, equivalence:双条件式,等价bijection, one-to-one correspondence:双射,一一对应binary operation on a set A:集合A上的二元运算binary operation:二元运算binary relation:二元关系(complete) binary tree:(完全)二元(叉)树bland meats:未加调料的肉block, cell:划分块,单元Boolean algebra:布尔代数Boolean function:布尔函数Boolean matrix:布尔矩阵Boolean polynomial, Boolean expression:布尔多项式(表达式)Boolean product:布尔乘积bounded lattice:有界格brace:花括号bridge:桥,割边by convention:按常规,按惯例cancellation property:消去律capacity:容量cardinality:基数,势category:类别,分类catenation:合并,拼接ceiling function:上取整函数certain event:必然事件characteristic equation:特征方程characteristic function:特征函数chromatic number of G:G的色数chromatic polynomial:着色多项式circuit design:线路设计circuit:回路closed under the operation:运算对…是封闭的closed with respect to:对…是封闭的closure:闭包collision:冲突coloring graphs:图的着色column:列combination:组合common divisor:公因子commutative:可交换的commutative:可交换的commuter:经常往来于两地的人comparable:可比较的compatible with:与…相容compatible:相容的complement of B with respect to A:A与B的差集complement:补元complementary relation:补关系complete graph:完全图complete match:完全匹配complete n-tree:完全n-元树component sentence:分句component:分图composition:复合composition:关系的复合compound statement:复合命题conditional statement, implication:条件式,蕴涵式congruence relation:同余关系congruent to:与…同余conjecture:猜想conjunction:合取connected:连通的connected:连通的connection:连接connectivity relation:连通性关系consecutively:相继地consequent, conclusion:结论,后件constructive proof:构造性证明contain(in):包含(于)contingency:可满足式contradiction, absurdity:永假(矛盾)式contrapositive:逆否命题conversation of flow:流的守恒converse:逆命题conversely:相反地coordinate:坐标coset:陪集countable(uncountable):可数(不可数)counterexample;反例counting:计数criteria:标准,准则custom:惯例cut:割cycle:回路cyclic permutation:循环置换,轮换de Morgan’s laws:德摩根律declarative sentence:陈述句degree of a vertex:结点的度depot:货站,仓库descendant:后代diagonal matrix:对角阵die:骰子digraph:有向图dimension:维(数)direct flight:直飞航班discipline:学科disconnected:不连通的discrete graph(null graph):零图disjoint sets:不相交集disjunction:析取distance:距离distinguish:区分distributive lattice:分配格distributive:可分配的distributive:可分配的division:除法dodecahedron:正十二面体domain:定义域doubly linked list:双向链表dual:对偶edge:边edge:边element,member:成员,元素empty relation:空关系empty sequence(string):空串empty set:空集end point:端点entry(element):元素equally likely:等可能的,等概率的equivalence class:等价类equivalent relation:等价关系Euclidian algorithm:欧几里得算法,辗转相除法Euler path(circuit):欧拉路径(回路)event:事件everywhere defined:处处有定义的excess capacity:增值容量existence proof:存在性证明existential quantification:存在量词化expected value:期望值explicit:显式的extensively:广泛地,全面地extremal element:极值元素factor:因子factorial:阶乘finite (infinite) set:有限(无限)集finite group:有限(阶)群floor function:下取整函数free semigroup generated by A:由A生成的自由半群frequency of occurrence:出现次数(频率) function, mapping, transformation:函数,映射,变换GCD(greatest common divisor):最大公因子gender:性别generalize:推广generic element:任一元素graduate school:研究生院graph:(无向)图graph:无向图greatest(least) element:最大(小)元greedy algorithm:贪婪算法group:群growth of function:函数增长Hamiltonian path(circuit):哈密尔顿路径(回路) hashing function:杂凑函数Hasse diagram:哈斯图height:树高homomorphic image:同态像homomorphism:同态hypothesis:假设,前提,前件idempotent:等幂的idempotent:幂等的identity function on A:A上的恒等函数identity(element):么(单位)元identity:么元,单位元impossible 
event:不可能事件inclusion-exclusion principle:容斥原理in-degree:入度indirect method:间接证明法induction step:归纳步informal brand:不严格的那种inorder search:中序遍历intersection:交intuitively:直觉地inverse:逆关系inverse:逆元inverse:逆元inverter:反向器invertible function:可逆函数involution property:对合律irreflexive:反自反的isolated vertex:孤立结点isomorphism:同构isomorphism:同构join:,保联,并join:并Karnaugh map:卡诺图Kernel:同态核key:键Klein 4 group:Klein四元群Konisberg Bridge problem:哥尼斯堡七桥问题Kruskal’s algorithm:Kruskal算法labeled digraph:标记有向图lattice:格LCM(least common multiple):最小公倍数leaf(leave):叶结点least upper(greatest lower) bound:上(下)确界level:层,lexicographic order:字典序likelihood:可能性linear array(list):线性表linear graph:线性图linear homogeneous relation of degree k:k阶线性齐次关系linear order(total order):线序,全序linearly ordered set, chain:线(全)序集,链linked list:链表linked-list representation:链表表示logarithm function to the base n:以n为底的对数logical connective:命题联结词logically equivalent:(逻辑)等价的logically follow:是…的逻辑结论logician:逻辑学家loop:自回路lower order:低阶main diagonal:主对角线map-coloring problem:地图着色问题matching function:匹配函数matching problems:匹配问题mathematical structure(system):数学结构(系统)matrix:矩阵maximal match:最大匹配maximal(minimal) element:极大(小)元maximum flow:最大流meet:保交,交meet:交minimal spanning tree:最小生成树minterm:极小项modular lattice:模格modulus:模modus ponens:肯定律m odus tollens:否定律monoid:含么半群,独异点multigraph:多重图multiple:倍数multiplication table:运算表multi-valued function:多值函数mutually exclusive:互斥的,不相交的natural homomorphism:自然同态nearest neighbor:最邻近结点negation:否定(式)normal subgroup:正规(不变)子群notation:标记notion:概念n-tree:n-元树n-tuple:n-元组odd(even) permutation:奇(偶)置换offspring:子女结点one to one:单射,一对一函数onto:到上函数,满射operation on sets:集合运算optimal solution:最佳方法or(and, not) gate:或(与,非)门order of a group:群的阶order relation:序关系ordered pair:有序对,序偶ordered tree:有序树ordered triple:有序三元组ordinance:法规out-degree:出度parent:父结点partial order:偏序关系partially ordered set, poset:偏序集partition, quotient set:划分,商集path:路径path:通路,路径permutation:置换,排列pictorially:以图形方式pigeonhole principle:鸽巢原理planar graph:(可)平面图plausible:似乎可能的pointer:指针Polish form:(表达式的)波兰表示polynomial:多项式positional binary tree:位置二元(叉)树positional tree:位置树postorder search:后序遍历power set:幂集predicate:谓词preorder search:前序遍历prerequisite:预备知识prescribe:命令,规定Prim’s algorithm:Prim算法prime:素(数)principle of mathematical induction:(第一)数学归纳法probabilistic:概率性的probability(theory):概率(论)product partial order:积偏序product set, Caretesian set:叉积,笛product:积proof by contradiction:反证法proper coloring:正规着色propositional function:命题公式propositional variable:命题变元pseudocode:伪码(拟码)pumping station:抽水站quantifier:量词quotient group:商群random access:随机访问random selection(choose an object at random):随机选择range:值域rational number:有理数reachability relation:可达性关系reasoning:推理recreational area:游乐场所recursive:递归recycle:回收,再循环reflexive closure:自反闭包reflexive:自反的regular expression:正则表达式regular graph:正规图,正则图relation:关系relationship:关系relay station:转送站remainder:余数representation:表示restriction:限制reverse Polish form:(表达式的)逆波兰表示(left) right coset:(左)右陪集root:根,根结点rooted tree:(有)根树row:行R-relative set:R相关集rules of reference:推理规则running time:运行时间same order:同阶sample space:样本空间semigroup:半群sensible:有意义的sensible:有意义的sequence:序列sequential access:顺序访问set corresponding to a sequence:对应于序列的集合set inclusion(containment):集合包含set:集合siblings:兄弟结点simple cycle:简单回路simple path(circuit):基本路径(回路)simple path:简单路径(通路)sink:汇sophisticated:复杂的source:源spanning tree:生成树,支撑树square matrix:方阵statement, proposition:命题storage cell:存储单元string:串,字符串strong 
induction:第二数学归纳法subgraph:子图subgroup:子群sublattice:子格submonoid:子含么半群subscript:下标subsemigroup:子半群subset:子集substitution:替换subtree:子树summarize:总结,概括symmetric closure:对称闭包symmetric difference:对称差symmetric group:对称群symmetric:对称的tacitly:默认tautology:永真(重言)式tedious:冗长乏味的terminology:术语the capacity of a cut:割的容量topological sorting:拓扑排序transitive closure:传递闭包transitive:传递的transport network:运输网络transposition:对换traverse:遍历,周游tree searching:树的搜索(遍历)tree:树truth table:真值表TSP(traveling salesperson problem):货郎担问题unary operation:一元运算undirected edge:无向边undirected edge:无向边undirected tree:无向树union:并unit element:么(单位)元universal quantification:全称量词化universal set:全集upper(lower) bound:上(下)界value of a flow:流的值value, image:值,像,应变量Venn diagram:文氏图verbally:用言语vertex(vertices):结点vertex(vertices):结点,顶点virtually:几乎Warshal’s algorithm:Warshall算法weight:权weight:树weighted graph:(赋)权图well-defined:良定,完全确定word:词zero element:零元。

A monothetic clustering method


A monothetic clustering method∗

Marie Chavent (*)(**)
(*) INRIA Rocquencourt, Action SODAS, Domaine de Voluceau, B.P. 105, 78153 Le Chesnay cedex, France
(**) Université de Paris IX Dauphine, Lise Ceremade, Place du Maréchal De Lattre de Tassigny, 75775 Paris cedex 16, France
e-mail: Marie.Chavent@inria.fr
∗ Pattern Recognition Letters 19 (1998) 989-996

Abstract: The proposed divisive clustering method performs simultaneously a hierarchy of a set of objects and a monothetic characterization of each cluster of the hierarchy. A division is performed according to the within-cluster inertia criterion, which is minimized among the bipartitions induced by a set of binary questions. In order to improve the clustering, the algorithm revises at each step the division which has induced the cluster chosen for division.

Key Words: Hierarchical clustering methods, Monothetic cluster, Inertia criterion

1. Introduction

The objective of cluster analysis is to group a set Ω of N objects into clusters having the property that objects in the same cluster are similar to one another and different from objects of other clusters. In the pattern recognition literature (Duda and Hart, 1973) this type of problem is referred to as unsupervised pattern recognition. The most common clustering methods are partitioning, hierarchical agglomerative and hierarchical divisive ones.

A partition of Ω is a list (C_1, ..., C_K) of clusters verifying C_1 ∪ ... ∪ C_K = Ω and C_k ∩ C_k' = ∅ for all k ≠ k'. The essence of partitioning is the optimization of an objective function measuring the homogeneity within the clusters and/or the separation between the clusters. Algorithms of the exchange type are frequently used to find a local optimum of the objective function, because of the complexity of the exact algorithms. Well-known partitioning procedures are Forgy's k-means and the isodata methods, described in Anderberg (1973), and the dynamical clustering method (Diday, 1974).

Agglomerative and divisive hierarchical clustering methods are different, in the type of structure they are searching, from partitioning. Indeed, a hierarchy of Ω is a family H of clusters satisfying Ω ∈ H, {ω} ∈ H for all ω ∈ Ω, and A ∩ B ∈ {∅, A, B} for all A, B ∈ H. A hierarchy can be represented in the form of a tree or dendrogram, that shows how the clusters are hierarchically organized. The general algorithm for agglomerative clustering starts with N clusters, each consisting of one element of Ω, and merges successively two clusters on the basis of a similarity measure. Well-known agglomerative hierarchical methods are described in Everitt (1974). Divisive hierarchical clustering reverses the process of agglomerative hierarchical clustering, by starting with all objects in one cluster, and dividing successively each cluster into smaller ones.
Those methods are usually iterative and determine at each iteration the cluster to be divided and the subdivision of this cluster.This process is continued until suitable stopping rule arrests further division.There is a variety of divisive clustering methods(Kaufman and Rousseeuw,1990).A natural approach of dividing a cluster C of n objects into two non-empty subsets would be to consider all the possible bipartitions.In this,Edward and Cavalli-Sforza(1965)choose among the2n−1−1 possible bipartitions of C,the one having the smallest within-cluster sum of squares.It is clear that such complete enumeration procedure provides a global optimum but is computationally prohibitive.Neverless,it is possible to construct divisive clustering methods that does not consider all bipar-titions.MacNaughton-Smith(1964)proposed an iterative divisive procedure using an average dissimilarity between an object and a group of objects.Chidananda Gowda and Krishna(1978) proposed a disaggregative clustering method based on the concept of mutual nearest neighbor-hood.Other methods taking as input a dissimilarity matrix are based on the optimization of criterions like the split or the diameter of the bipartition(Gu´e noche,Hansen and Jaumard,1991; Wang,Yan and Sriskandarajah,1996).Probabilistic validation approach for divisive clustering has also been proposed(Har-even and Brailovsky,1995).Another family of divisive clustering methods is monothetic.A cluster is called monothetic if a conjunction of logical properties is both necessary and sufficient for membership in the cluster (Sneath and Sokal,1973).Indeed,each division is carried out using a single variable and by separating objects possessing some specified values of this variable from those lacking them. Monothetic divisive clustering methods havefirst been proposed in the particular case of binary data(Williams and Lambert,1959;Lance and Williams,1968).Since then,monothetic cluster-ing methods have mostly been developed in thefield of unsupervised learning and are known as descendant conceptual clustering methods(Michalski,Diday and Stepp,1981;Michalski and Stepp,1983).In thefield of discriminant analysis,monothetic divisive methods have also been widely devel-oped.However,those methods are different from clustering in which the clusters are inferred from data.Indeed,a partition ofΩis pre-defined and the problem concerns the construction of a systematic way of predicting the class membership of a new object.In the pattern recognition literature,this type of classification is referred to as supervised pattern recognition.Divisive methods of this type are usually known as tree structured classifier like cart(Breiman,Fried-man,Olshen and Stone,1984)or id3(Quinlan,1986).Recently,Ciampi(1994)insisted on the idea that trees offer a natural approach for both class formation(clustering)and development of classification rules(discrimination).The clustering method proposed in this paper was developed in the framework of symbolic data analysis(Diday,1995),which aims at bringing together data analysis and machine learning. 
More precisely, we propose a monothetic hierarchical clustering method performed in the spirit of cart from an unsupervised point of view. We have restricted the presentation of this method to the particular case of quantitative data. At each stage, the division of a cluster is performed according to the within-cluster inertia criterion (section ??). This criterion is minimized among bipartitions induced by a set of binary questions (section ??). Moreover, clusters are not systematically divided but one of them is chosen according to a specific criterion (section ??). The divisions are stopped after a number of iterations given as input by the user, usually interested in few clusters partitions. The output of this divisive clustering method is an indexed hierarchy. It is also a decision tree (section ??). The Ruspini data are given as a first illustration of this method (section ??). We propose a modification of the algorithm in order to soften the property, shared by both agglomerative and divisive hierarchical methods, that an efficient early partition cannot be corrected at a later stage. It consists in revising, after the division of a cluster, the previous division which has induced the cluster itself (section ??). Before the conclusion (section ??), the method is performed on Fisher's iris dataset (section ??).

2. The inertia criterion

Let N be the number of objects in Ω. Each object is described on p real variables Y_1, ..., Y_p by a vector x_i ∈ R^p and weighted by a real value p_i (i = 1, ..., N). Indeed, the analyst will prefer sometimes to weight the objects differently. For instance, countries could be weighted according to the size of their population. But usually, the weights are equal to 1 or equal to 1/n.

The inertia I of a cluster C_k is a homogeneity measure equal to:

I(C_k) = Σ_{x_i ∈ C_k} p_i d²_M(x_i, x̄_k)    (1)

where d_M is the Euclidean distance (M is a symmetric, positive definite matrix):

∀ x, y ∈ R^p,  d²_M(x, y) = (x − y)^t M (x − y)    (2)

and x̄_k is the center of gravity of the cluster C_k:

x̄_k = (1/µ_k) Σ_{x_i ∈ C_k} p_i x_i    (3)

µ_k = Σ_{x_i ∈ C_k} p_i    (4)

The within-cluster inertia W of a K-clusters-partition P_K = (C_1, ..., C_K) is equal to:

W(P_K) = Σ_{k=1}^{K} I(C_k)    (5)

According to the Huygens Theorem, minimizing the within-cluster inertia of a partition (e.g.
According to the Huygens theorem, minimizing the within-cluster inertia of a partition (i.e. the homogeneity within the clusters) is equivalent to maximizing the between-cluster inertia (i.e. the separation between the clusters), which equals:

B(P_K) = Σ_{k=1}^{K} µ_k d²_M(x̄_k, x̄)    (6)

3. Bipartitioning a cluster

Let C be a set of n objects. We want to find a bipartition (C_1, C_2) of C such that the within-cluster inertia is minimum. In the Edwards and Cavalli-Sforza method (1965) one chooses the optimal bipartition (C_1, C_2) among the 2^(n-1) - 1 possible bipartitions. It is clear that the amount of calculation needed when n is large will be prohibitive.
In our approach, to reduce the complexity, we divide C according to a binary question (Breiman, Friedman, Olshen and Stone, 1984) of the form "Y_i ≤ c ?", where Y_i : Ω → R is a real variable and c ∈ R is called the cut point. The bipartition (C_1, C_2) induced by the binary question is defined as follows. Let ω be an object in C. If Y_i(ω) ≤ c then ω ∈ C_1, else ω ∈ C_2. Those objects in C answering "yes" go to the left descendant cluster and those answering "no" to the right descendant cluster (Fig. 1).

Figure 1: "Is height ≤ 172?"

For each variable Y_i, there will be at most n − 1 different bipartitions (C_1, C_2) induced by the above procedure. Indeed, whatever the cut point c between two consecutive observations Y_i(ω) may be, the bipartition induced is the same. In order to ask only n − 1 questions to generate all these bipartitions, we decide to use the n − 1 cut points c chosen as the middle of two consecutive observations Y_i(ω) ∈ R. Indeed, if the n observations Y_i(ω) are different, there are n − 1 cut points on Y_i. If there are p variables, we choose among the p(n − 1) corresponding bipartitions (C_1, C_2) the bipartition having the smallest within-cluster inertia.

4. Choice of the cluster

Let P_K = (C_1, ..., C_K) be a K-clusters partition of Ω. At each stage, a new (K+1)-clusters partition is obtained by dividing a cluster C_k ∈ P_K into two new clusters C¹_k and C²_k. The purpose is to choose the cluster C_k ∈ P_K so that the new partition P_{K+1} = P_K ∪ {C¹_k, C²_k} − {C_k} has minimum within-cluster inertia. We know that:

W(P_{K+1}) = W(P_K) − I(C_k) + I(C¹_k) + I(C²_k)

Hence, minimizing W(P_{K+1}) is equivalent to choosing the cluster C_k ∈ P_K so that the difference between the inertia of C_k and the within-cluster inertia of its bipartition (C¹_k, C²_k) is maximum. The criterion used to determine the cluster that will be divided is then equal to:

∆(C_k) = I(C_k) − I(C¹_k) − I(C²_k)    (7)

Of course, this means that the bipartitions of all the clusters of the partition P_K have been defined previously. At each stage, the bipartitions of the two new clusters C¹_k and C²_k are defined and used in the next stage.
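As an illustration of sections 3 and 4, the sketch below (Python/NumPy; `best_binary_question`, `split_gain` and `divisive_step` are hypothetical names introduced here, and the plain Euclidean metric M = I is assumed) scans the p(n − 1) binary questions of a cluster, keeps the bipartition of smallest within-cluster inertia, and performs one division of the partition by choosing the cluster with the largest criterion ∆(C_k).

```python
import numpy as np

def _inertia(X, w):
    """I(C) with M = I (plain Euclidean distance)."""
    center = (w[:, None] * X).sum(axis=0) / w.sum()
    d = X - center
    return float((w * (d * d).sum(axis=1)).sum())

def best_binary_question(X, w):
    """Scan the p(n-1) binary questions "Y_j <= c ?" (c = midpoints of
    consecutive sorted values of Y_j) and return the question whose induced
    bipartition has the smallest within-cluster inertia I(C1) + I(C2)."""
    best = (np.inf, None, None)                  # (W, variable index, cut point)
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for c in (vals[:-1] + vals[1:]) / 2.0:   # at most n-1 cut points
            left = X[:, j] <= c
            W = _inertia(X[left], w[left]) + _inertia(X[~left], w[~left])
            if W < best[0]:
                best = (W, j, c)
    return best

def split_gain(X, w):
    """Delta(C) = I(C) - I(C1) - I(C2) for the best binary question of C."""
    W, j, c = best_binary_question(X, w)
    return _inertia(X, w) - W, j, c

def divisive_step(clusters):
    """One division: split the cluster with the largest Delta (criterion (7)).
    `clusters` is a list of (X, weights) pairs describing the current partition."""
    gains = [split_gain(X, w) for X, w in clusters]
    k = int(np.argmax([g[0] for g in gains]))
    _, j, c = gains[k]
    X, w = clusters[k]
    left = X[:, j] <= c
    return (clusters[:k]
            + [(X[left], w[left]), (X[~left], w[~left])]
            + clusters[k + 1:])
```

Calling `divisive_step` L times starting from the one-cluster partition yields the (L+1)-clusters partition described in the next section.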
5. The stopping rule and the output

The divisions are stopped after a number L of iterations, and L is given as input by the user, usually interested in partitions with few clusters. Indeed, the partition obtained in the last iteration is an (L+1)-clusters partition. The point of stopping the divisions before obtaining the total hierarchy (L = N) is to ensure that the partitions of smallest within-cluster inertia of the total hierarchy are still in the hierarchy obtained after L iterations. This property is verified because the clusters are not systematically divided, but one cluster is chosen according to the criterion ∆ given in (7), which ensures that the partition induced by this division has minimum within-cluster inertia. However, this stopping rule doesn't solve the issue of determining the number of clusters in the dataset (Milligan and Cooper, 1985).
The output of this divisive clustering method is a hierarchy H whose singletons are the L+1 clusters of the partition obtained in the last iteration of the algorithm. Each cluster C_k ∈ H is indexed by ∆(C_k). Because ∆ is a non-decreasing mapping,

C_k ⊂ C_k' ⇒ ∆(C_k) ≤ ∆(C_k')    (8)

there will be no inversions in the dendrogram of the hierarchy. This hierarchy is also a decision tree. The L clusters are the leaves and the nodes are the binary questions selected by the algorithm. Each cluster is characterized by a rule defined according to the binary questions leading from the root to the corresponding leaves.

6. A simple example

The dataset is 75 points of R² (Ruspini, 1970). We find successively a partition in 2, 3 and 4 clusters (L = 3).
At the first stage, the method induces 2(75 − 1) = 148 bipartitions. We choose among the 148 bipartitions (C_1, C_2) the one of smallest within-cluster inertia. It has been induced by the binary question "Is Y_1 ≤ 75.5 ?". Notice that the number of subdivisions has been reduced from 2^(75−1) − 1 ≈ 3.77 × 10^22 to 148.
At the second stage, we have to choose whether we divide C_1 or C_2. Here, we choose the cluster C_1 and its bipartition (C¹_1, C²_1) because ∆(C_1) > ∆(C_2). The binary question is "Is Y_2 ≤ 54 ?". At the third stage, we choose the cluster C_2 and its bipartition (C¹_2, C²_2). The binary question is "Is Y_2 ≤ 75.5 ?".
Finally, the divisive algorithm gives the 4 clusters represented in Fig. 2.

Figure 2: The 4-clusters partition

According to the dendrogram of the hierarchy given in Figure 3, the four clusters are characterized by four rules. For instance cluster C¹_1 is characterized by the following rule:

If [Y_1(ω) ≤ 75.5] and [Y_2(ω) ≤ 54] then ω ∈ C¹_1.

Figure 3: The dendrogram of the indexed hierarchy

This dendrogram can be read as a decision tree and the rules can be read as classification rules of new objects to one of the four clusters.

7. Revising a binary question

The purpose is to enable the analyst to revise, at each division of a cluster, the binary question which has induced the cluster itself. Let C be a cluster which has been divided in two clusters C_1 and C_2 according to the binary question "Is Y_1 ≤ c_1 ?". Then C_2 is chosen to be divided in two clusters C_21 and C_22 according to the binary question "Is Y_2 ≤ c_2 ?".

Figure 4: Revising a binary question

At this stage, the binary question "Is Y_1 ≤ c_1 ?" is revised by modifying the cut point c_1. We choose a new cut point c' among all possible cut points on Y_1, such that the 3-clusters partition induced by "Is Y_1 ≤ c' ?" and "Is Y_2 ≤ c_2 ?" has minimum within-cluster inertia (Figure 4).
For instance, Figure 5 gives the 3-clusters partition of 320 points of R² simulated from four 2-dimensional Gaussian distributions. The points have been divided first according to the binary question "Is Y_2 ≤ 10.9 ?" and then according to the binary question "Is Y_1 ≤ 8 ?". The first cut point 10.9 is then modified in order to find, with the second binary question "Is Y_1 ≤ 8 ?", the 3-clusters partition of minimum within-cluster inertia. The new cut point is 12.1 (Figure 5).

8. The Fisher's iris dataset

The above clustering method has been examined with the well-known Fisher's iris dataset. The length and breadth of both petals and sepals were measured on 150 flowers. There are three varieties of iris: Setosa, Versicolor and Virginica. There are 50 iris of each variety.

Figure 5: The two 3-clusters-partitions

Of course, the knowledge of this pre-defined 3-clusters partition is not used in our unsupervised clustering procedure, which is performed only with four quantitative variables: the petal width (PeWi), the petal length (PeLe), the sepal width (SeWi) and the sepal length (SeLe).
First, we have used the Euclidean distance d_M, with M = I, the identity matrix.
Figure 6 gives the dendrogram of the hierarchy and the 3-clusters partition (C_1, C_2, C_3) obtained after two divisions of the dataset. The first cluster is composed of 53 iris including 50 Setosa, 3 Versicolor and no Virginica. In total, the 3-clusters partition contains 19 misclassified iris.
The first binary question "Is PeLe ≤ 3.4 ?" is then revised in order to improve the within-cluster inertia of the 3-clusters partition. Figure 7 gives the dendrogram of the hierarchy obtained with the revised binary question "Is PeLe ≤ 2.45 ?". We can notice that the misclassifications have been reduced to 16. Indeed, the 50 Versicolor are all in C_2.

Figure 6: Before the revision    Figure 7: After the revision

Dynamical clustering and Ward agglomerative hierarchical clustering methods have also been performed on the same dataset. The same distance was used. The partitions obtained with the two clustering methods contained the same number of misclassifications, since 16 iris were misclassified.
Secondly, we have used the normalized Euclidean distance d_M, with M = D_{1/U²}, where U_i is the length between the maximum and the minimum value for the variable Y_i. Figure 8 gives the dendrogram of the hierarchy obtained with this distance and we notice a reduction of the number of misclassifications from 16 to 10 iris. It confirms the influence of the choice of the distance on the result of a clustering. Then, before the second division, we have normalized the Euclidean distance according to the four lengths U_i computed locally in the cluster which was divided.

Figure 8: Global normalization    Figure 9: Local normalization

Figure 9 gives the dendrogram of the hierarchy obtained with the locally normalized Euclidean distance. We can notice that the number of misclassifications is now reduced to 6. It corresponds to an error rate of 0.04.
In their comparative study of the performance of different classifiers with Fisher's iris dataset, Weiss & Kulikowski (1991) give for the cart decision tree an error rate equal to 0.04. Hence, we obtain with the Fisher's iris dataset comparable results with both unsupervised and supervised approaches. However, the goals of the proposed clustering method and of the cart algorithm are different, since we aim at inferring clusters from the data while the cart algorithm aims at discovering classification rules.

9. Conclusion

The proposed clustering method has the advantages of being simple and of giving simultaneously a hierarchy and a simple interpretation of its clusters. Moreover, it deals easily with very large datasets. Indeed, it is possible to construct the hierarchy on a sample of the dataset, and to use the classification rules to assign the rest of the objects. This method has also given good results on the Fisher's iris dataset and on other real applications where it has been compared with the dynamical clustering method and the Ward agglomerative hierarchical method (Chavent, 1997).
However, dividing a cluster according to a single variable can also be a deficiency in some situations. As for the cart algorithm, in situations where the cluster structure depends on combinations of variables, the divisive method will do poorly at discovering the structure.
A perspective would be, on the one hand, to use a local stopping rule (Milligan and Cooper, 1985; Har-even and Brailovsky, 1995) for deciding if a cluster should be divided into two subclusters and, on the other hand, to divide a cluster according to a metric locally defined in the cluster itself.

References

Anderberg, M.R. (1973). Cluster analysis for applications. Academic Press, New York.
Breiman, L., J.H. Friedman, R.A. Olshen and C.J. Stone (1984). Classification and Regression Trees. Wadsworth, CA.
Chavent, M. (1997). Analyse des Données Symboliques. Une méthode divisive de classification. PhD Thesis, Université Paris-IX Dauphine, France.
Chidananda Gowda, K. and G. Krishna (1978). Disaggregative Clustering Using the Concept of Mutual Nearest Neighborhood. IEEE Transactions on Systems, Man, and Cybernetics 8, 888-895.
Ciampi, A. (1994). Classification and Discrimination: the recpam Approach. In Proc. of COMPSTAT'94, 129-147.
Diday, E. (1974). Optimization in non-hierarchical clustering. Pattern Recognition 6, 17-33.
Diday, E. (1995). Probabilist, possibilist and belief objects for knowledge analysis. Annals of Operations Research 55, 227-276.
Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Edwards, A.W.F. and L.L. Cavalli-Sforza (1965). A method for cluster analysis. Biometrics 21, 362-375.
Everitt, B. (1974). Cluster Analysis. Social Sciences Research Council, Heinemann Educational Books.
Guénoche, A., P. Hansen and B. Jaumard (1991). Efficient algorithms for divisive hierarchical clustering. Journal of Classification 8, 5-30.
Har-even, M. and V.L. Brailovsky (1995). Probabilistic validation approach for clustering. Pattern Recognition 16, 1189-1196.
Kaufman, L. and P.J. Rousseeuw (1990). Finding groups in data. Wiley, New York.
Lance, G.N. and W.T. Williams (1968). Note on a new information statistic classification program. The Computer Journal 11, 195-197.
MacNaughton-Smith, P. (1964). Dissimilarity analysis: A new technique of hierarchical subdivision. Nature 202, 1034-1035.
Michalski, R.S., E. Diday and R. Stepp (1981). A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. Progress in Pattern Recognition, L.N. Kanal and A. Rosenfeld (eds), North Holland, 33-56.
Michalski, R.S. and R. Stepp (1983). Learning from observations: Conceptual clustering. Machine Learning: An Artificial Intelligence Approach, R.S. Michalski, J. Carbonell and T. Mitchell (eds), 163-190.
Milligan, G.W. and M.C. Cooper (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159-179.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning 1, 81-106.
Ruspini, E.M. (1970). Numerical methods for fuzzy clustering. Information Sciences 2, 319-350.
Sneath, P.H. and R.R. Sokal (1973). Numerical Taxonomy. Freeman and Company, San Francisco.
Weiss, S.M. and C.A. Kulikowski (1991). Computer systems that learn: Classification and prediction methods from statistics, neural networks, machine learning, and expert systems. San Mateo, Calif: Morgan Kaufmann.
Williams, W.T. and J.M. Lambert (1959). Multivariate methods in plant ecology. Journal of Ecology 47, 83-101.
Wang, Y., H. Yan and C. Sriskandarajah (1996). The Weighted Sum of Split and Diameter Clustering. Journal of Classification 13, 231-248.

binary coding SVMs for the multiclass problem of blast furnace system


period, a great deal of heat energy is produced, which can heat the blast furnace up to temperatures approaching 2000 °C. The end products, consisting of slag and hot metal, sink to the bottom and are tapped periodically for the subsequent refining. It takes about 6–8 h for a cycle of ironmaking. For many countries, the metallurgical industry plays an important role in the national economy, and there are thus extensive interests in blast furnace modeling and control for saving energy and reducing cost. Generally speaking, to control a blast furnace system often means to control the hot metal temperature and components, such as the silicon content, sulfur content and carbon content in hot metal, within acceptable bounds, among which the silicon content is the most important one. The silicon in the hot metal is originally present in the form of silica in the blast furnace burden, which reacts with carbon monoxide to produce volatile silicon monoxide, SiO(g); SiO(g) is then further reduced by coke to generate silicon dissolved in the hot metal or slag. Due to the hostile environment, which prevents directly measuring the furnace temperature, the silicon content in hot metal can act as the chief indicator of the in-furnace thermal state and is also a key factor to evaluate the hot metal quality and the energy utilization rate. Therefore, building silicon prediction models has been an active research issue in recent decades, including mechanism-based models [1]–[3], data-based models [4]–[13], and rule-based models [14].

Auxiliary Problem Principle and Stochastic Optimization

EDF R&D, Département OSIRIS, Groupe R31, 1, avenue du Général de Gaulle, F-92141 Clamart Cedex, ++33 1 47 65 55 73, cyrille.strugarek@edf.fr — also with the École Nationale Supérieure de Techniques Avancées (ENSTA) and the École Nationale des Ponts et Chaussées (ENPC)
min_u E(j(u, ξ)),  s.t. u ∈ U^f.    (1)
where U^f is a closed convex subset of some Hilbert space U, ξ is a random variable with values in a metric space Ξ, and j : U × Ξ → R is a normal integrand such that j(·, ξ) is convex, l.s.c., and differentiable. Let us denote by Π_{U^f} the projection on U^f. A classical projected gradient algorithm for the problem (1) would be:

Algorithm 1.1 (Projected Gradient)
Step 0: Let u^0 ∈ U^f.
Step k: u^{k+1} = Π_{U^f}(u^k − ρ^k E(j_u(u^k, ξ))). If ‖u^{k+1} − u^k‖ is small enough, stop, else loop,
with ρ^k > 0 for all k ∈ N.

Such an algorithm requires the computation of an expectation at each iteration, and can sometimes be too demanding. The stochastic gradient algorithm tries exactly to overcome this difficulty:

Algorithm 1.2 (Stochastic Gradient)
Step 0: Let u^0 ∈ U^f.
Step k: draw ξ^{k+1}, and compute u^{k+1} = Π_{U^f}(u^k − ρ^k j_u(u^k, ξ^{k+1})). If the maximal number of iterations is reached, stop, else loop,
with ρ^k > 0 for all k ∈ N, Σ_{k∈N} ρ^k = +∞ and Σ_{k∈N} (ρ^k)² < +∞; these ideas can be found in [3].
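The following toy sketch (Python/NumPy) illustrates Algorithm 1.2 under simplifying assumptions that are not in the text: U^f is taken to be a box so that the projection is a clipping, j(u, ξ) = ½‖u − ξ‖², and the step sizes are ρ^k = 1/k.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_box(u, lo=-1.0, hi=1.0):
    """Projection onto U^f when U^f is a box (hypothetical feasible set)."""
    return np.clip(u, lo, hi)

def grad_j(u, xi):
    """Gradient in u of the toy integrand j(u, xi) = 0.5 * ||u - xi||^2."""
    return u - xi

u = np.zeros(2)                                           # u^0
for k in range(1, 5001):
    xi = rng.normal(loc=[0.3, 2.0], scale=1.0, size=2)    # draw xi^{k+1}
    rho = 1.0 / k                                         # rho^k > 0, sum = +inf, sum of squares < +inf
    u = project_box(u - rho * grad_j(u, xi))              # one stochastic gradient step
print(u)   # approaches the projection of E[xi] = (0.3, 2.0) onto the box, i.e. (0.3, 1.0)
```

Note that, unlike Algorithm 1.1, no expectation is computed: each iteration uses a single draw of ξ.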

Enhanced 1-D Chaotic Key-Based Algorithm for Image Encryption


I. I NTRODUCTION The security of digital images has become increasingly more important in today’s highly computerized and interconnected world. The media content must be protected in applications such as pay-per-view TV, confidential video conferencing, medical imaging, and in industrial or military imaging systems. With the rise of wireless portable devices, many users seek to protect the private multimedia messages that are exchanged over the wireless or wired networks. Unfortunately, in many applications, conventional encryption algorithms (such as AES) are not suitable for image and video encryption [1–3]. In order to overcome this problem, many fast encryption algorithms specifically designed for digital images have been proposed [4–7]. Some video encryption algorithms are applicable to still images as well as videos [4], [7]. However, a number of these algorithms have been shown to be insecure [8–10]. The image encryption methods based on chaotic maps attract considerable attention recently due to their potential for digital multimedia encryption [2]. In [6], Yen and Guo proposed a chaotic key-based algorithm (CKBA) for image encryption. Subsequently, Li and Zheng [10] showed that the security claims for CKBA have been vastly overestimated.

Round Robin Rule Learning


Johannes F¨urnkranz JUFFI@OEFAI.AT Austrian Research Institute for Artificial Intelligence,Schottengasse3,A-1010Wien,AustriaAbstractIn this paper,we discuss a technique for han-dling multi-class problems with binary classi-fiers,namely to learn one classifier for each pairof classes.Although this idea is known in theliterature,it has not yet been thoroughly inves-tigated in the context of inductive rule learning.We present an empirical evaluation of the methodas a wrapper around the Ripper rule learning al-gorithm on20multi-class datasets from the UCIdatabase repository.Our results show that themethod is very likely to improve Ripper’s classi-fication performance without having a high riskof decreasing it.In addition,we give a theoret-ical analysis of the complexity of the approachand show that its training time is within a smallconstant factor of the training time of the sequen-tial class binarization technique that is currentlyused in Ripper.1.IntroductionPairwise classification is a technique for reducing a multi-class problem to multiple2-class problems by learning a classifier for each pair of classes.It has been previously applied to various problems,mostly in the support vector machines community(see Section7),but we are not aware of an extensive experimental study,in particular not in the context of inductive rule learning algorithms.In this paper,we will show that this technique in fact gives significant improvements in predictive accuracy and pro-vide a detailed analysis of the complexity of the approach, which shows that it is not much slower,but may in fact be considerably faster than conventional approaches that train each class against all other classes.In particular,we prove that the performance ratio of pairwise classification over conventional approaches goes to zero with increasing asymptotic complexity of the base algorithm.Our exper-imental results underline the efficiency of the approach, even though we only experimented with a proof-of-concept implementation in the form of a wrapper around the Ripper rule learning algorithm.2.Class BinarizationMany machine learning algorithms are inherently designed for binary(two-class)decision problems.Prominent exam-ples are neural networks with single output nodes,support vector machines and separate-and-conquer rule learning. In addition,all regression algorithms can,in principle,be used for binary decision problems,but not for multi-class problems(unless,maybe,if the class values are ordered). However,real-world problems often have multiple classes. 
Fortunately, there exist several simple techniques for turning multi-class problems into a set of binary problems. We will call such techniques class binarization techniques.

Definition 2.1 (class binarization, decoding) A class binarization is a mapping of a multi-class learning problem to several 2-class learning problems in a way that allows a sensible decoding of the prediction, i.e., allows to derive a prediction for the multi-class problem from the predictions of the set of 2-class classifiers. The learning algorithm used for solving the 2-class problems is called the base classifier.

The most popular class binarization rule is the unordered or one-against-all class binarization, where one takes each class in turn and learns binary concepts that discriminate this class from all other classes. It has been independently proposed for rule learning (Clark & Boswell, 1991), neural networks (Anand et al., 1995), and support vector machines (Cortes & Vapnik, 1995).

Definition 2.2 (unordered class binarization) The unordered class binarization transforms a c-class problem into c 2-class problems. These are constructed by using the examples of class i as the positive examples and the examples of classes j, j = 1...c, j ≠ i, as the negative examples.

The name "unordered" originates from Clark and Boswell (1991), who proposed this approach as an alternative to the decision-list learning approach that was originally used in CN2 (Clark & Niblett, 1989). In other fields, the strategy has different names, but as our main concern is rule learning, we stick with the terminology used there.
The rule learning algorithm Ripper (Cohen, 1995), which will be our main test object, also provides an option for unordered class binarization.

Figure 1: (a) Multi-class learning: one classifier separates all classes. (b) Unordered learning: c classifiers, each separates one class from all other classes (here: + against all other classes). (c) Round robin learning: c(c−1)/2 2-class problems, one for each unique pair of classes <i,j> (i = 1...c−1, j = i+1...c). The binary classifier is trained on the training examples of classes i and j; the examples of classes ≠ i, j are ignored for the binary classifier <i,j>.

When classifying a new example, each of the learned base classifiers determines to which of its two classes the example is more likely to belong to. The winner is assigned a point, and in the end, the algorithm will predict the class that has accumulated the most points. In our version, ties are broken by preferring the class that is more frequent in the training set (or flipping a coin if the frequencies are equal), but more elaborate techniques are possible (see Section 6).
Round robin binarization is illustrated in Figure 1. For the 6-class problem shown in Figure 1(a), the round robin procedure learns 15 classifiers, one for each pair of classes. Figure 1(c) shows the classifier for the class pair <+,∼>. In comparison, Figure 1(b) shows one of the classifiers for the unordered class binarization, namely the one that pairs class + against all other classes. It is obvious that the round robin base classifier uses fewer examples and thus has more freedom for fitting a decision boundary between the two classes. In fact, in this problem, all binary classification
problems of the round robin binarization could be solved with a simple linear discriminant,while neither the multi-class problem nor its unordered binarization have a linear solution.Note that some examples will be forced to be classified er-roneously by some of the binary base classifiers because each classifier must label all examples as belonging to one of the two classes it was trained on.Consider the classifiershown in Figure1(c):it will arbitrarily assign all exam-ples of class o to either+or∼(depending on which side of the decision boundary they are).In principle,such“un-qualified”votes may lead to an incorrectfinal classification. However,the votes of thefive classifiers that contain exam-ples of class o should be able to over-rule the votes of the other ten classifiers,which pick one of their two constituent classes for each o example.If the class values are indepen-dent,it is unlikely that all classifiers would unanimously vote for a wrong class.However,the likelihood of such a situation could increase if there is some similarity between the correct class and some other class value(e.g.,in prob-lems with a hierarchical class structure).1In any case,if the five o classifiers unanimously vote for o,no other class can accumulate four votes(because they lost their direct match against o).In the above definition,we assume that the problem of dis-criminating class i from class j is identical to the problem of discriminating class j from class i.This is the case if the base learner is class-symmetric.Rule learning algo-rithms,however,need not be class-symmetric.Many of them choose one of the two classes as the default class, and learn only rules to cover the other class.In such a case, <i,j>and<j,i>may be two different classification prob-lems,if j is used as the default class in the former and i in the latter.Ripper is such an algorithm.Unless specified otherwise,it will treat the larger class of a2-class problem as the default class and learn rules for the smaller class.Although this procedure is class-symmetric(problem<i,j>is converted to<j,i>if i>j),we felt that it would not be fair.For ex-ample,the largest class in the multi-class problem would be used as the default class in all round robin problems.This may be an unfair advantage(or disadvantage)to this class. 
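A minimal sketch of this pairwise voting scheme is given below (Python; it uses scikit-learn's LogisticRegression as a stand-in base classifier instead of Ripper, and the class name `RoundRobin` is ours, not part of the paper). Ties are broken in favour of the more frequent class, as described above.

```python
import numpy as np
from itertools import combinations
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression   # stand-in base learner (not Ripper)

class RoundRobin:
    """Single round robin (pairwise) class binarization with simple voting."""
    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.members_ = {}
        for i, j in combinations(self.classes_, 2):      # c(c-1)/2 problems
            mask = (y == i) | (y == j)                   # examples of other classes are ignored
            self.members_[(i, j)] = clone(self.base).fit(X[mask], y[mask])
        # class frequencies, used to break ties in favour of the more frequent class
        self.priors_ = {c: int(np.sum(y == c)) for c in self.classes_}
        return self

    def predict(self, X):
        votes = {c: np.zeros(len(X)) for c in self.classes_}
        for (i, j), clf in self.members_.items():
            pred = clf.predict(X)
            votes[i] += (pred == i)                      # winner of each pairwise match gets a point
            votes[j] += (pred == j)
        out = []
        for k in range(len(X)):
            best = max(self.classes_, key=lambda c: (votes[c][k], self.priors_[c]))
            out.append(best)
        return np.array(out)
```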
One solution for avoiding this default-class asymmetry is a double round robin, in which separate classifiers are learned for both problems, <i,j> and <j,i>.

Definition 3.2 (double round robin) The double round robin class binarization transforms a c-class problem into c(c−1) 2-class problems, one for each pair of classes <i,j> (i, j = 1...c, j ≠ i). The examples of class i are used as the positive examples and the examples of class j as the negative examples.

In our experiments, we used Ripper with the option -a given as the base classifier, which uses the classes as given in the specification. Hence, <i,j> and <j,i> are two different problems, and each class is the default class in exactly half of its binary classification problems.

dataset | Ripper (unordered) | Ripper (ordered) | R3 | ratio | sig.
abalone | 35.37 | 38.50 | 33.20 | 0.862 | ++
letter | 14.25 | 17.05 | 11.15 | 0.654 | ++
shuttle | 64.94 | 53.25 | 53.46 | 1.004 | =
— | 5.79 | 12.15 | 2.26 | 0.186 | ++
glass | 4.15 | 4.29 | 3.46 | 0.808 | +
lr spectrometer | 7.79 | 9.48 | 3.74 | 0.394 | ++
page-blocks | 15.91 | 15.91 | 15.77 | 0.991 | =
solarflares(m) | 8.79 | 8.79 | 6.30 | 0.717 | ++
thyroid(hyper) | 0.64 | 0.56 | 0.53 | 0.955 | =
thyroid(repl.) | 28.25 | 30.38 | 29.08 | 0.957 | =
yeast | 21.80 | 21.90 | 18.52 | 0.770 |

Table 1. Error rates: The first three columns show the results of Ripper (in unordered and in default, ordered mode) and R3. The last two columns show the improvement rate of R3 over Ripper (default), and whether R3 is significantly better (++ if p > 0.99, + if p > 0.95) than Ripper, measured with a McNemar test. The last line shows the average of the columns above.

4. Accuracy

In this section, we will briefly present an experimental evaluation of round robin binarization in a rule learning context. We chose Ripper (Cohen, 1995) as the base classifier, which—in our view—is the most advanced member of the family of separate-and-conquer (or covering) rule learning algorithms (Fürnkranz 1997; 1999).
The unordered and ordered binarization procedures were used as implemented within Ripper. The round robin binarization was implemented as a wrapper around Ripper, which provided it with the appropriate training sets. The wrapper was implemented in perl and had to communicate with Ripper by writing the training sets to and reading Ripper's results from the disk. This implementation is referred to as R3 (Round Robin Ripper).
For evaluation, we arbitrarily chose 20 datasets with ≥ 4 classes from the UCI repository (Blake & Merz, 1998). The implementation of the algorithm was developed independently and not tuned on these datasets. Six datasets had a dedicated test set. On the other 14 sets, we estimated the error rate using paired, stratified 10-fold cross-validations.
Table 1 shows the accuracies of Ripper (unordered and ordered) and R3 on the selected datasets. On half of the 20 sets, R3 is significantly better (p > 0.99 on a McNemar test (Feelders & Verkooijen, 1995)) than Ripper's default mode (ordered binarization). There are only two sets (thyroid(repl.) and vowel), where R3 is worse than Ripper, both differences being insignificant. The comparison to unordered Ripper is similar (the significance levels for this case are not shown).
We can safely conclude that round robin binarization may result in significant improvements over ordered or unordered binarization without having a high risk of decreasing performance.

5. Efficiency

At first sight, it appears to be a questionable idea to replace O(c) binary learning tasks (unordered binarization) with O(c²) binary learning tasks (round robin binarization) because the quadratic complexity seems to be prohibitive for tasks with more than a few classes. This section will illustrate that this is not the case.

5.1. Theoretical Considerations

In this section, we will see that although round robin classification turns a
single c-class learning problem into c(c−1)/2 2-class problems, the total learning effort is only linear in the number of classes, and may in some circumstances even be smaller than the effort needed for an unordered binarization. The analysis is independent of the base learning algorithm used. Some of the ideas have already been sketched in a short paragraph by Friedman (1996), but we go into considerably more detail here, and, in particular, focus on the comparison to conventional class binarization techniques.

Definition 5.1 (class penalty) If the base learner has a complexity growth function f(n) (i.e., the time needed for an n-example training set is f(n)), and the total time needed for the class binarized problem is π(c)f(n), we call the function π(c) the class penalty.

Intuitively, the class penalty π(c) measures the performance of an algorithm on a (class binarized) c-class problem relative to its performance on a single 2-class problem of equal size. In the following, we will compare the class penalty π_r of a single round robin class binarization to the class penalty π_u of an unordered binarization. Note that the class penalty for a double round robin is twice as high as the class penalty of a single round robin. Also, unordered binarization is more expensive than ordered binarization, but it is simpler to analyze because it does not depend on the class distribution. We would estimate that the ordered approach will take about half of the training time of the unordered approach. So, unless noted otherwise, the ratio of the penalty functions has to be adjusted by a factor of 4 to get the results of ordered binarization versus double round robin binarization.
First, we look at the class penalty π_u of unordered class binarization:

Theorem 5.2 π_u(c) = c

Proof: There are c learning tasks, each using the entire training set of n examples. Hence the total complexity is cf(n), and π_u(c) = c. □

Lemma 5.3 The sum of examples in all training sets of a single round robin class binarization is (c−1)n.

Proof: Each example of the original training set will occur once in each of the c−1 binary tasks where its class is paired against one of the other c−1 classes. As there are n examples in the original training set, the total number of examples is (c−1)n. □

Note that this number is less than the total number of training examples in the unordered class binarization (i.e., cn).
In the following analysis of the class penalty π_r of a single round robin binarization, we assume that our learner has a complexity growth function f(n) = n^o with o ≥ 1.

Theorem 5.4 For f(n) = n^o, o ≥ 1: π_r(c) ≤ c−1. The inequality is strict for o > 1.

Proof: We have c classes with n_i examples, Σ_{i=1}^{c} n_i = n. Without loss of generality, let us assume c is even (if c is odd, we add a dummy class with n_{c+1} = 0 examples). Then we can arrange the learning tasks in the form of c−1 rounds, each consisting of c/2 tasks in which every class is paired with exactly one other class. Consider a round where successive classes are paired together; its total complexity is Σ_{i=1}^{c/2} f(n_{2i} + n_{2i−1}). As for o ≥ 1 and a, b ≥ 0 it holds that a^o + b^o ≤ (a + b)^o (with equality only in the case of o = 1), we have Σ_{i=1}^{c/2} f(n_{2i} + n_{2i−1}) ≤ f(Σ_{i=1}^{c/2} (n_{2i} + n_{2i−1})) = f(Σ_{i=1}^{c} n_i) = f(n).
Analogously, we can derive the same upper bound for each of the c−1 rounds. Thus the total complexity of the round robin binarization is ≤ (c−1)f(n), where the inequality is strict for o > 1. □

Corollary 5.5 For algorithms with at least linear complexity, the class penalty ratio satisfies π_r(c,o)/π_u(c,o) ≤ (c−1)/c < 1, and lim_{o→∞} π_r(c,o)/π_u(c,o) = 0.

Proof: In the proof of Theorem 5.4 we derived an upper bound of (c−1)f(n) for the total complexity of the round robin task. Let us denote the error we made by this approximation with ε(c,o). We then have

π_r(c,o)/π_u(c,o) = ((c−1)f(n) − ε(c,o)) / (cf(n)) = (c−1)/c − ε(c,o)/(cf(n)).

The error ε(c,o) is the sum of the errors ε_i(c,o) that were made at each of the c−1 rounds of the tournament described in the proof of Theorem 5.4. Let us again, without loss of generality, look at the error ε_i(c,o) of a round where successive classes are paired together. Then, since for c > 2 each pair contains fewer than n examples,

lim_{o→∞} ε_i(c,o)/f(n) = 1 − lim_{o→∞} Σ_{i=1}^{c/2} ((n_{2i} + n_{2i−1})/n)^o = 1,

and therefore

lim_{o→∞} ε(c,o)/(cf(n)) = Σ_{i=1}^{c−1} lim_{o→∞} ε_i(c,o)/(cf(n)) = (c−1)/c,

so that lim_{o→∞} π_r(c,o)/π_u(c,o) = (c−1)/c − (c−1)/c = 0. □

dataset | R3 (run-time) | vs. unordered | vs. ordered
— | 193.0 | 4.51 | 5.73
covertype | — | — | —
sat | 1250.0 | 0.51 | 1.14
vowel | 277.0 | 2.10 | 3.16
car | 2.03 | 2.26 | 3.80
image | 489.67 | 4.40 | 6.93
optical | 36.66 | 1.43 | 1.93
solarflares(c) | 3.98 | 5.62 | 7.49
soybean | 19.71 | 2.68 | 3.46
thyroid(hypo) | 15.35 | 2.26 | 3.33
vehicle | 16.90 | 1.77 | 3.12
average | | 2.59 | 4.09

Table 2. Runtime results: The first column shows the run-times (in CPU user time) of R3. The following columns show the ratio of R3 over unordered and ordered Ripper. The first five lines are total run-times, i.e., training and test time, while the cross-validated results report training time only. We failed to measure the run-times for the covertype data set, where the situation was complicated because of the large test set, which had to be split into several pieces.

as slow, and assuming that the ordered approach is twice as fast as the unordered approach (see the following section for empirical values on that), a double round robin would be faster than the ordered approach. In the following section, this scenario will be empirically evaluated.

5.2. Empirical Evaluation

Contrary to the theoretical analysis in the previous section, where we focussed on the lenient case of pairing unordered binarization vs. single round robin, our empirical results show the performance of ordered binarization (Ripper's default mode) vs. double round robin binarization. This, as we have noted above, is the worst case. In the case of a linear algorithm complexity, the round robin should be about four times slower than the ordered binarization.
Table 2 shows the training times (in CPU user time, measured on a Sparc Ultra-2 under Sun Solaris) of R3 and its performance ratios against Ripper in unordered and ordered mode. On average, it is about 2.6 times slower than Ripper in unordered mode, and about 4 times slower than Ripper in default, ordered mode. Coincidentally, these numbers also show that the ordered mode is less than twice as fast as the unordered mode, which confirms the assumptions we made at the beginning of Section 5.1.
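A small numerical illustration of Theorems 5.2 and 5.4 and of Corollary 5.5 (Python; the function name and the balanced 10-class example are ours): for f(n) = n^o, the unordered penalty stays at c, while the round robin penalty drops quickly as o grows.

```python
from itertools import combinations

def class_penalties(sizes, o):
    """Class penalties for a base learner with complexity f(n) = n**o.
    `sizes` are the class sizes n_1..n_c of an n-example training set."""
    n, c = sum(sizes), len(sizes)
    pi_u = c                                              # Theorem 5.2: c tasks on all n examples
    pi_r = sum((a + b) ** o for a, b in combinations(sizes, 2)) / n ** o
    return pi_u, pi_r

for o in (1, 2, 3):
    pi_u, pi_r = class_penalties([100] * 10, o)           # 10 balanced classes, n = 1000
    print(o, pi_u, round(pi_r, 3), round(pi_r / pi_u, 3))
# o = 1: pi_r = c - 1 = 9 (the bound of Theorem 5.4 is attained);
# o = 2, 3: the ratio pi_r / pi_u shrinks towards 0, as in Corollary 5.5.
```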
Moreover,there are several cases where R3is even faster than Ripper in unordered mode and comes close to Ripper in ordered mode.This is despite the fact that R3is imple-mented as a perl-script that communicates to Ripper by writing the training and test sets of the new tasks to the disk.Although this is somewhat compensated by the fact that we only report user time(which ignores time for disk access and system time),6a tight integration of round robin binarization into Ripper’s C-code would certainly be more efficient.The good performance of R3does not come entirely sur-prising,if we consider the super-linear run-time complexity of Ripper.7For more expensive base learning algorithms (like support vector machines),the analysis in the previous section lets us expect even bigger savings.6.Other IssuesR3as an Ensemble Method:In(F¨u rnkranz,2001)we compared round robin binarization to boosting.In partic-ular,we compared the performance improvement obtained by C5-boost over C5to the improvement obtained by R3 over Ripper.It turned out that both algorithms seemed to be apt for similar types of problems.Round robin binarization seemed to work well whenever boosting worked well,and vice versa.Figure2plots the error ratios of C5-boost/C5 and R3/Ripper on the20datasets.The correlation between the improvement ratios was0.618,which is in the same range as correlation coefficients for bagging and boosting (Opitz&Maclin,1999).However,in terms of efficiency, C5-boost was about8.75times slower than C5,which is not surprising as it basically calls C510times on different samples of the original dataset.Parallel Implementations:It should be noted that—contrary to boosting,where the individual runs depend on each other and have to be performed in succession—pairwise classification can be entirely parallelized(as al-ready noted by Friedman(1996)and Lu and Ito(1999)).Our straight-forward approach of using the a priori more likely class,can certainly be improved upon.In addition to techniques known from the literature(see,e.g.,(Hastie& Tibshirani,1998;Allwein et al.,2000)),one could think of exploring techniques that are commonly used for breaking ties in tournament cross tables in games and sports(such as the Sonneborn-Berger ranking in chess tournaments). Imbalanced Class Distributions:Although we have not yet evaluated this issue,we also believe that round robin binarization provides a better focus on minority classes,in particular in problems where several large classes appear next to a few small classes.The fact that separate classifiers are trained to discriminate the small classes from each other (and not only from all remaining examples as would be the case for unordered binarization or for treating the multi-class problem as a whole)may help to improve the focus in the case of imbalanced class distributions. 
Comprehensibility:While boosting also provides similar gains in accuracy,the price to pay is that the learned ensem-ble of classifiers is no longer easy to comprehend.8While round robin rule learning also learns an ensemble of clas-sifiers,we think that it has the advantage that each element of the ensemble has a well-defined semantics(separating two classes from each other).In fact,Pyle(1999)proposes a very similar technique called pairwise ranking in order to facilitate human decision-making in ranking problems.He claims that it is easier for a human to determine an order between n items if s/he makes pairwise comparisons be-tween the individual items and then adds up the wins for each item,instead of trying to order the items right away.9 7.Related WorkPairwise classification was introduced to the machine learning literature by Friedman(1996)but the idea seems to have been known for some time.Hastie and Tibshi-rani(1998)picked up the technique and introduced pair-wise coupling,a tie-breaking technique which combines the class probability estimates from the binary classifiers into a joint probability distribution for the multi-class prob-lem.Pairwise classification was soon applied in the sup-port vector(Schmidt,1996;Kreßel,1999)and neural net-works communities(Lu&Ito,1999).In comparison toTo this end,we plan to evaluate round robin binarization using C5and C5-boost as base learners.Alternatively,a direct comparison of R3to Slipper(Cohen&Singer,1999) could provide evidence to answer this question.The main disadvantage of the approach,of course,is its de-pendency on the number of classes.More classes result in a bigger ensemble which should produce better predictions. In particular in the limiting case,where only two or three classes are available,the benefits should be rather small. This trade-off also needs to be investigated in more detail. AcknowledgementsI wish to thank the anonymous reviewers,William Cohen,Eibe Frank,Stefan Kramer,Johann Petrak,Lupˇc o Todorovski,Ger-hard Widmer,and the maintainers of and contributors to the UCI collection of databases for discussions,programs,databases,and pointers to relevant literature.The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Education,Science and Culture.The author was sup-ported by the ESPRIT long-term project METAL(26.357). ReferencesAllwein,E.L.,Schapire,R.E.,&Singer,Y.(2000).Reduc-ing multiclass to binary:A unifying approach for mar-gin classifiers.Journal of Machine Learning Research, 1,113-141.Anand,R.,Mehrotra,K.G.,Mohan,C.K.,&Ranka,S. (1995).Efficient classification for multiclass problems using modular neural networks.IEEE Transactions on Neural Networks,6,117–124.Blake, C.L.,&Merz, C.J.(1998).UCI repository of machine learning databases.http://www.ics. /˜mlearn/MLRepository.html.De-partment of Information and Computer Science,Univer-sity of California at Irvine.Irvine,CA.Clark,P.,&Boswell,R.(1991).Rule induction with CN2: Some recent improvements.Proceedings of the5th Eu-ropean Working Session on Learning(EWSL-91)(pp. 
151–163).Porto,Portugal:Springer-Verlag.Clark,P.,&Niblett,T.(1989).The CN2induction algo-rithm.Machine Learning,3,261–283.Cohen,W.W.(1995).Fast effective rule induction.Pro-ceedings of the12th International Conference on Ma-chine Learning(ML-95)(pp.115–123).Lake Tahoe, CA:Morgan Kaufmann.Cohen,W.W.,&Singer,Y.(1999).A simple,fast, and effective rule learner.Proceedings of the16th Na-tional Conference on Artificial Intelligence(AAAI-99) (pp.335–342).Menlo Park,CA:AAAI/MIT Press.Cortes,C.,&Vapnik,V.(1995).Support-vector networks. Machine Learning,20,273–297.Dietterich,T.G.,&Bakiri,G.(1995).Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research,2,263–286.Feelders,A.,&Verkooijen,W.(1995).Which method learns most from the data?Methodological issues in the analysis of comparative studies.Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics.Fort Lauderdale,Florida.Friedman,J.H.(1996).Another approach to polychoto-mous classification(Technical Report).Department of Statistics,Stanford University,Stanford,CA.F¨u rnkranz,J.(1997).Pruning algorithms for rule learning. Machine Learning,27,139–171.F¨u rnkranz,J.(1999).Separate-and-conquer rule learning. Artificial Intelligence Review,13,3–54.F¨u rnkranz,J.(2001).Round robin rule learning(Technical Report OEFAI-TR-2001-02).Austrian Research Insti-tute for Artificial Intelligence,Wien,Austria.Hastie,T.,&Tibshirani,R.(1998).Classification by pair-wise coupling.Advances in Neural Information Process-ing Systems10(NIPS-97)(pp.507–513).MIT Press.Kreßel,U.H.-G.(1999).Pairwise classification and sup-port vector machines.In B.Sch¨o lkopf,C.Burges and A.Smola(Eds.),Advances in kernel methods:Support vector learning,255–268.Cambridge,MA:MIT Press. Lu,B.-L.,&Ito,M.(1999).Task decomposition and mod-ule combination based on class relations:A modular neural network for pattern classification.IEEE Trans-actions on Neural Networks,10,1244–1256.Opitz,D.,&Maclin,R.(1999).Popular ensemble meth-ods:An empirical study.Journal of Artificial Intelli-gence Research,11,169–198.Platt,J.C.,Cristianini,N.,&Shawe-Taylor,J.(2000). Large margin DAGs for multiclass classification.Ad-vances in Neural Information Processing Systems12 (NIPS-99)(pp.547–553).MIT Press.Pyle,D.(1999).Data preparation for data mining.San Francisco,CA:Morgan Kaufmann.Schmidt,M.S.(1996).Identifying speakers with support vector networks.Proceedings of the28th Symposium on the Interface(INTERF ACE-96).Sydney,Australia.。

Supervised Dictionary Learning

min_{α∈R^k} ‖x − Dα‖²₂ + λ₁‖α‖₁

It is well known in the statistics, optimization, and compressed sensing communities that the ℓ1 penalty yields a sparse solution, very few non-zero coefficients in α, although there is no explicit analytic link between the value of λ1 and the effective sparsity that this model yields. Other sparsity penalties using the ℓ0 regularization2 can be used as well. Since it uses a proper norm, the ℓ1 formulation of sparse coding is a convex problem, which makes the optimization tractable with algorithms such as those introduced in [16, 17], and has proven in practice to be more stable than its ℓ0 counterpart, in the sense that the resulting decompositions are less sensitive to small perturbations of the input signal x. Note that sparse coding with an ℓ0 penalty is an NP-hard problem and is often approximated using greedy algorithms. In this paper, we consider a setting, where the signal may belong to any of p different classes. We first consider the case of p = 2 classes and later discuss the multiclass extension. We consider a n m training set of m labeled signals (xi )m i=1 in R , associated with binary labels (yi ∈ {−1, +1})i=1 . Our goal is to learn jointly a single dictionary D adapted to the classification task and a function f which should be positive for any signal in class +1 and negative otherwise. We consider in this paper two different models to use the sparse code α for the classification task: (i) linear in α: f (x, α, θ ) = wT α + b, where θ = {w ∈ Rk , b ∈ R} parametrizes the model. (ii) bilinear in x and α: f (x, α, θ ) = xT Wα + b, where θ = {W ∈ Rn×k , b ∈ R}. In this case, the model is bilinear and f acts on both x and its sparse code α. The number of parameters in (ii) is greater than in (i), which allows for richer models. Note that one can interpret W as a linear filter encoding the input signal x into a model for the coefficients α, which has a role similar to the encoder in [18] but for a discriminative task. A classical approach to obtain α for (i) or (ii) is to first adapt D to the data, solving
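The passage breaks off here. As an illustration of the sparse coding step and of model (i), the sketch below (Python/NumPy; the ISTA solver, the random dictionary and the classifier parameters are our own toy choices, not the algorithm of this paper) computes an ℓ1-regularized code for a fixed dictionary and evaluates a linear decision function on it.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam=0.1, n_iter=200):
    """ISTA for min_alpha 0.5 * ||x - D @ alpha||^2 + lam * ||alpha||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

# Model (i): a linear classifier on the sparse code, f(x, alpha) = w^T alpha + b.
rng = np.random.default_rng(0)
n, k = 20, 50
D = rng.normal(size=(n, k))
D /= np.linalg.norm(D, axis=0)             # unit-norm dictionary atoms
x = rng.normal(size=n)
alpha = sparse_code(x, D)
w, b = rng.normal(size=k), 0.0             # classifier parameters theta = {w, b} (toy values)
label = np.sign(w @ alpha + b)             # predicted class in {-1, +1}
print(np.count_nonzero(alpha), label)      # the l1 penalty leaves only a few non-zero coefficients
```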

On the gauge orbit space stratification (a review)


Skobeltsyn Institute of Nuclear Physics, Moscow State University, 119992 Moscow, Russia
Abstract. First, we review the basic mathematical structures and results concerning the gauge orbit space stratification. This includes general properties of the gauge group action, fibre bundle structures induced by this action, basic properties of the stratification and the natural Riemannian structures of the strata. In the second part, we study the stratification for theories with gauge group SU(n) in space time dimension 4. We develop a general method for determining the orbit types and their partial ordering, based on the 1-1 correspondence between orbit types and holonomy-induced Howe subbundles of the underlying principal SU(n)-bundle. We show that the orbit types are classified by certain cohomology elements of space time satisfying two relations and that the partial ordering is characterized by a system of algebraic equations. Moreover, operations for generating direct successors and direct predecessors are formulated, which allow one to construct the set of orbit types, starting from the principal one. Finally, we discuss an application to nodal configurations in Yang-Mills-Chern-Simons theory.

is a t-conorm. The structure of a uninorm with neutral element e ∈ ]0, 1[ (these are called proper uninorms) on the squares [0, e]² and [e, 1]² is therefore closely related to t-norms and t-conorms. For e ∈ ]0, 1[, we denote by φ_e and ψ_e the linear transformations defined by φ_e(x) = x/e and ψ_e(x) = (x − e)/(1 − e). To any uninorm U with neutral element e ∈ ]0, 1[, there corresponds a t-norm T and a t-conorm S such that:
Theorem 1 ([6]). Consider a uninorm U with neutral element e ∈ ]0, 1[, then there exists a strictly increasing continuous [0, 1] → [−∞, +∞] mapping h with h(0) = −∞, h(e) = 0 and h(1) = +∞ such that U(x, y) = h⁻¹(h(x) + h(y)) for any (x, y) ∈ [0, 1]² \ {(0, 1), (1, 0)} if and only if
(i) U is strictly increasing and continuous on ]0, 1[²;
(ii) there exists a strong negation N with fixpoint e such that U(x, y) = N(U(N(x), N(y))) for any (x, y) ∈ [0, 1]² \ {(0, 1), (1, 0)}.
The strong negation corresponding to a representable uninorm U with additive generator h, as mentioned in condition (ii) above, is denoted N_U and is given by N_U(x) = h⁻¹(−h(x)). Clearly, any representable uninorm comes in a conjunctive and a disjunctive version, i.e., there always exist two representable uninorms that only differ in the points (0, 1) and (1, 0). Representable uninorms are almost continuous, i.e., continuous except in (0, 1) and (1, 0).
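As a concrete instance (our example, not taken from the text): the additive generator h(x) = ln(x/(1−x)) satisfies the conditions of Theorem 1 with e = 1/2, and the sketch below (Python) implements the corresponding representable uninorm in its conjunctive version, U(0,1) = U(1,0) = 0; the associated strong negation is N_U(x) = h⁻¹(−h(x)) = 1 − x.

```python
import math

def h(x):
    """Additive generator h(x) = ln(x/(1-x)): h(0) = -inf, h(1/2) = 0, h(1) = +inf."""
    if x == 0.0:
        return -math.inf
    if x == 1.0:
        return math.inf
    return math.log(x / (1.0 - x))

def h_inv(t):
    if t == -math.inf:
        return 0.0
    if t == math.inf:
        return 1.0
    return 1.0 / (1.0 + math.exp(-t))

def U(x, y):
    """Representable uninorm U(x,y) = h^{-1}(h(x) + h(y)), conjunctive convention."""
    if {x, y} == {0.0, 1.0}:
        return 0.0                 # conjunctive version: U(0,1) = U(1,0) = 0
    return h_inv(h(x) + h(y))

print(U(0.5, 0.8))   # e = 1/2 is the neutral element, so this returns 0.8
print(U(0.2, 0.8))   # values on opposite sides of e compensate: 0.5
print(U(0.9, 0.9))   # two high values reinforce each other: about 0.988
```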
This research has been supported in part by the Bilateral Scientific and Technological Cooperation Flanders–Hungary BIL00/51 (B-08/2000).
It is interesting to notice that uninorms U with a neutral element in ]0, 1[ are just those binary operators which make the structures ([0, 1], sup, U) and ([0, 1], inf, U) distributive semirings in the sense of Golan [7]. It is also known (see e.g. [10]) that in MYCIN-like expert systems combining functions are used to calculate the global degrees of suggested diagnoses. A careful study reveals that such combining functions are representable uninorms [3]. T-norms do not allow low values to be compensated by high values, while t-conorms do not allow high values to be compensated by low values. Uninorms may allow values separated by their neutral element to be aggregated in a compensating way. The structure of uninorms was studied by Fodor et al. [6]. For a uninorm U with neutral element e ∈ ]0, 1], the binary operator T_U defined by

T_U(x, y) = U(ex, ey)/e

is a t-norm; for a uninorm U with neutral element e ∈ [0, 1[, the binary operator S_U defined by

S_U(x, y) = (U(e + (1 − e)x, e + (1 − e)y) − e)/(1 − e)
1. Introduction
Nowadays it is needless to define t-norms and t-conorms in papers related to theoretical or practical aspects of fuzzy sets and logic: researchers have learned the basics and these notions have became part of their everyday scientific vocabulary [13]. Nevertheless, from time to time it is necessary to summarize recent developments even in such a fundamental subject. This is the main aim of the present paper. Somewhat subjectively, we have selected topics where, on one hand, essential contributions have been made, and on the other hand, both theoreticians and practitioners may find these subjects interesting and useful. In this paper we concentrate on associative operations that are more general than t-norms and t-conorms. These extensions are based on a flexible choice of the neutral (unit) element, or the absorbing element of an associative operation. The resulted classes are known as uninorms and nullnorms (in other terminology: t-operators ), respectively.
(i) for any (x, y) ∈ [0, e]²: U(x, y) = φ_e⁻¹(T(φ_e(x), φ_e(y)));
(ii) for any (x, y) ∈ [e, 1]²: U(x, y) = ψ_e⁻¹(S(ψ_e(x), ψ_e(y))).
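The following sketch (Python; the construction choices are ours) builds a uninorm from a given t-norm and t-conorm by the rescalings (i) and (ii), taking the minimum on the remaining part of the unit square, which is one of the standard ways to complete the construction. With T the product, S the probabilistic sum and e = 0.4, the neutral element can be checked numerically.

```python
def make_uninorm(T, S, e, mixed=min):
    """Uninorm with neutral element e built from a t-norm T and a t-conorm S:
    T rescaled to [0,e]^2, S rescaled to [e,1]^2, `mixed` (here: min) elsewhere."""
    def U(x, y):
        if x <= e and y <= e:
            return e * T(x / e, y / e)                             # phi_e^{-1}(T(phi_e(x), phi_e(y)))
        if x >= e and y >= e:
            return e + (1 - e) * S((x - e) / (1 - e), (y - e) / (1 - e))
        return mixed(x, y)                                         # one argument below e, one above
    return U

product = lambda a, b: a * b                 # t-norm
prob_sum = lambda a, b: a + b - a * b        # t-conorm
U = make_uninorm(product, prob_sum, e=0.4)

print(U(0.2, 0.4))   # 0.4 is the neutral element: returns 0.2
print(U(0.2, 0.3))   # both below e: 0.4 * (0.5 * 0.75) = 0.15
print(U(0.7, 0.9))   # both above e: 0.4 + 0.6 * prob_sum(0.5, 5/6) = 0.95
```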
In analogy to the representation of continuous Archimedean t-norms and t-conorms in terms of additive generators, Fodor et al. [6] have investigated the existence of uninorms with a similar representation in terms of a single-variable function. This search leads back to Dombi's class of aggregative operators [4]. This work is also closely related to that of Klement et al. on associative compensatory operators [12]. The result is as follows.
2. Uninorms
Uninorms were introduced in [20] as a generalization of t-norms and t-conorms. For uninorms, the neutral element is not forced to be either 0 or 1, but can be any value in the unit interval. Definition 1 ([20]). A uninorm U is a commutative, associative and increasing binary operator with a neutral element e ∈ [0, 1 ] , i.e., for all x ∈ [0, 1 ] we have U (x, e) = x.