Collecting Luggage Algorithm Training (2020)
I. Alphabet
a b c d e f g h i j k l m n o p q r s t u v w x y z
Notes: (1) The letters r and v are used only for transcribing Putonghua (Mandarin) and loanwords; they are not used when transcribing Cantonese.
(2) The Cantonese romanization uses three letters with diacritics: ê, é, and ü. Of these, é is used differently from the corresponding Hanyu Pinyin letter.
These letters are variants of e and u, so they are not listed in the table above.
II. Initials
b 波  p 婆  m 摸  f 科  d 多  t 拖  n 挪  l 罗
g 哥  k 卡  ng 我  h 何  gu 姑  ku 箍
z 左  c 初  s 梳  j 知  q 雌  x 思  y 也  w 华
Notes: (1) The two groups of initials z, c, s and j, q, x are pronounced identically in Cantonese; they differ only in how they combine with finals. z, c, s are written before finals beginning with a, o, e, é, ê, or u, e.g. za 渣, ca 茶, sa 沙.
j, q, x are written before i, ü and finals beginning with i or ü, e.g. ji 知, qi 次, xi 思.
(2) gu 姑 and ku 箍 are rounded (labialized) velar sounds used as initials; they cannot transcribe a sound on their own, for when written alone they are syllables, not initials.
(3) y 也 and w 华 are used as initials in spelling. The syllables they spell correspond to the compound finals of the Hanyu Pinyin scheme, but because in Cantonese these finals are never preceded by an initial, they are used only as syllables.

III. Finals
a 呀  o 柯  u 乌  i 衣  ü 于  ê (靴)  é 诶  m 唔  ng 五
ai 挨  ei 矮  oi 哀  ui 会  éi (非)
ao 拗  eo 欧  ou 奥  iu 妖  êu (去)
am (监)  em 庵  im 淹
an 晏  en (恩)  on 安  un 碗  in 烟  ün 冤  ên (春)
ang (横)  eng 莺  ong (康)  ung 瓮  ing 英  êng (香)  éng (镜)
ab 鸭  eb (急)  ib 叶
ad 押  ed (不)  od (渴)  ud 活  id 热  üd 月  êd (律)
ag (客)  eg (德)  og 恶  ug 屋  ig 益  êg (约)  ég (尺)
Notes: (1) For example characters given in parentheses, only the final is taken.
(2) When a final from the i row has no initial before it, it is written with y: yi 衣, yiu 妖, yim 淹, yin 烟, ying 英, yib 叶, yid 热, yig 益.
Hazardous Products (Pacifiers) Regulations, C.R.C., c. 930
Current to June 28, 2010. Published by the Minister of Justice at the following address: http://laws-lois.justice.gc.ca

CANADA
CONSOLIDATION
Hazardous Products (Pacifiers) Regulations, C.R.C., c. 930

OFFICIAL STATUS OF CONSOLIDATIONS
Subsections 31(1) and (3) of the Legislation Revision and Consolidation Act, in force on June 1, 2009, provide as follows:

Published consolidation is evidence
31. (1) Every copy of a consolidated statute or consolidated regulation published by the Minister under this Act in either print or electronic form is evidence of that statute or regulation and of its contents and every copy purporting to be published by the Minister is deemed to be so published, unless the contrary is shown.
[...]
Inconsistencies in regulations
(3) In the event of an inconsistency between a consolidated regulation published by the Minister under this Act and the original regulation or a subsequent amendment as registered by the Clerk of the Privy Council under the Statutory Instruments Act, the original regulation or amendment prevails to the extent of the inconsistency.

CHAPTER 930
HAZARDOUS PRODUCTS ACT
Hazardous Products (Pacifiers) Regulations
REGULATIONS RESPECTING THE ADVERTISING, SALE AND IMPORTATION OF HAZARDOUS PRODUCTS (PACIFIERS)

SHORT TITLE
1. These Regulations may be cited as the Hazardous Products (Pacifiers) Regulations.

INTERPRETATION
2. In these Regulations,
"Act" means the Hazardous Products Act; (Loi)
"product" means a pacifier or similar product included in item 27 of Part II of Schedule I to the Act. (produit)
SOR/91-265, s. 2.

GENERAL
3. A person may advertise, sell or import into Canada a product only if it meets the requirements of these Regulations.
SOR/91-265, s. 3(F).

ADVERTISING AND LABELLING
4. (1) No reference, direct or indirect, to the Act or to these Regulations shall be made in any written material applied to or accompanying a product or in any advertisement thereof.
(2) No representation in respect of the use of or modification to a product shall be made in any written material applied to or accompanying the product or in any advertisement thereof, which use or modification would result in the failure of the product to meet a requirement of these Regulations.
SOR/91-265, s. 4(F).

TOXICITY
[SOR/92-586, s. 2]
5. (1) [Revoked, SOR/92-586, s. 2]
(2) Every product, including all its parts and components, shall meet the requirements of section 10 of the Hazardous Products (Toys) Regulations.
(3) No product or any part or component of the product shall contain more than 10 micrograms per kilogram total volatile N-nitrosamines, as determined by dichloromethane extraction.
SOR/84-272, s. 1; SOR/85-478, s. 1; SOR/92-586, s. 2.

DESIGN AND CONSTRUCTION
6. Every product shall
(a) be designed and constructed in such a manner as to protect the user, under reasonably foreseeable conditions of use, from
(i) obstruction of the pharyngeal orifice,
(ii) strangulation,
(iii) ingestion or aspiration of the product or any part or component thereof, and
(iv) wounding;
(b) be designed and constructed so that
(i) the nipple is attached to a guard or shield of such dimensions that it cannot pass through the opening in the template illustrated in Schedule I when the nipple is centered on the opening and a load of 2.2 pounds is applied axially to the nipple in such a way as to induce the guard or shield to pull through the opening in the template,
(ii) any loop of cord or other material attached to the product is not more than 14 inches in circumference,
(iii) when tested in accordance with the procedure described in Schedule II,
(A) the nipple remains attached to the guard or shield described in subparagraph (i), and
(B) no part or component is separated or broken free from the product that will fit, in a non-compressed state, into the small parts cylinder illustrated in Schedule III, and
(iv) any ring or handle is hinged, collapsible or flexible.
SOR/2004-65, s. 1.

SCHEDULE I (s. 6)
GUARD TEMPLATE

SCHEDULE II (s. 6)
TESTING PROCEDURE
1. Hold the nipple of the pacifier in a fixed position. Apply a load of 10 ± 0.25 pounds in the plane of the axis of the nipple to the handle of the pacifier at a rate of 1 ± 0.25 pounds per second and maintain the final load for 10 ± 0.5 seconds.
2. Hold the guard or shield of the pacifier in a fixed position. Apply a load of 10 ± 0.25 pounds in the plane normal to the axis of the nipple to the handle of the pacifier at a rate of 1 ± 0.25 pounds per second and maintain the final load for 10 ± 0.5 seconds.
3. Repeat the procedure described in section 2 with the load applied to the nipple of the pacifier.
4. Immerse the pacifier in boiling water for 10 ± 0.5 minutes. Remove the pacifier from the boiling water and allow to cool in air at 70 ± 5 degrees Fahrenheit for 15 ± 0.5 minutes. Repeat the tests described in sections 1, 2 and 3.
5. Repeat the entire procedure described in section 4 nine times.

SCHEDULE III (Clause 6(b)(iii)(B))
SMALL PARTS CYLINDER
Notes: Not to scale. All dimensions in mm.
SOR/2004-65, s. 2.
Example-based metonymy recognition for proper nouns
Yves Peirsman
Quantitative Lexicology and Variational Linguistics, University of Leuven, Belgium
yves.peirsman@arts.kuleuven.be

Abstract
Metonymy recognition is generally approached with complex algorithms that rely heavily on the manual annotation of training and test data. This paper will relieve this complexity in two ways. First, it will show that the results of the current learning algorithms can be replicated by the 'lazy' algorithm of Memory-Based Learning. This approach simply stores all training instances in its memory and classifies a test instance by comparing it to all training examples. Second, this paper will argue that the number of labelled training examples that is currently used in the literature can be reduced drastically. This finding can help relieve the knowledge acquisition bottleneck in metonymy recognition, and allow the algorithms to be applied on a wider scale.

1 Introduction
Metonymy is a figure of speech that uses "one entity to refer to another that is related to it" (Lakoff and Johnson, 1980, p. 35). In example (1), for instance, China and Taiwan stand for the governments of the respective countries:

(1) China has always threatened to use force if Taiwan declared independence. (BNC)

Metonymy resolution is the task of automatically recognizing these words and determining their referent. It is therefore generally split up into two phases: metonymy recognition and metonymy interpretation (Fass, 1997).

The earliest approaches to metonymy recognition identify a word as metonymical when it violates selectional restrictions (Pustejovsky, 1995). Indeed, in example (1), China and Taiwan both violate the restriction that threaten and declare require an animate subject, and thus have to be interpreted metonymically. However, it is clear that many metonymies escape this characterization. Nixon in example (2) does not violate the selectional restrictions of the verb to bomb, and yet it metonymically refers to the army under Nixon's command.

(2) Nixon bombed Hanoi.

This example shows that metonymy recognition should not be based on rigid rules, but rather on statistical information about the semantic and grammatical context in which the target word occurs.

This statistical dependency between the reading of a word and its grammatical and semantic context was investigated by Markert and Nissim (2002a) and Nissim and Markert (2003; 2005). The key to their approach was the insight that metonymy recognition is basically a subproblem of Word Sense Disambiguation (WSD).
Possibly metonymical words are polysemous, and they generally belong to one of a number of predefined metonymical categories. Hence, like WSD, metonymy recognition boils down to the automatic assignment of a sense label to a polysemous word. This insight thus implied that all machine learning approaches to WSD can also be applied to metonymy recognition.

There are, however, two differences between metonymy recognition and WSD. First, theoretically speaking, the set of possible readings of a metonymical word is open-ended (Nunberg, 1978). In practice, however, metonymies tend to stick to a small number of patterns, and their labels can thus be defined a priori. Second, classic WSD algorithms take training instances of one particular word as their input and then disambiguate test instances of the same word. By contrast, since all words of the same semantic class may undergo the same metonymical shifts, metonymy recognition systems can be built for an entire semantic class instead of one particular word (Markert and Nissim, 2002a).

To this goal, Markert and Nissim extracted from the BNC a corpus of possibly metonymical words from two categories: country names (Markert and Nissim, 2002b) and organization names (Nissim and Markert, 2005). All these words were annotated with a semantic label: either literal or the metonymical category they belonged to. For the country names, Markert and Nissim distinguished between place-for-people, place-for-event and place-for-product. For the organization names, the most frequent metonymies are organization-for-members and organization-for-product. In addition, Markert and Nissim used a label mixed for examples that had two readings, and othermet for examples that did not belong to any of the predefined metonymical patterns.

For both categories, the results were promising. The best algorithms returned an accuracy of 87% for the countries and of 76% for the organizations. Grammatical features, which gave the function of a possibly metonymical word and its head, proved indispensable for the accurate recognition of metonymies, but led to extremely low recall values, due to data sparseness. Therefore Nissim and Markert (2003) developed an algorithm that also relied on semantic information, and tested it on the mixed country data. This algorithm used Dekang Lin's (1998) thesaurus of semantically similar words in order to search the training data for instances whose head was similar, and not just identical, to the test instances. Nissim and Markert (2003) showed that a combination of semantic and grammatical information gave the most promising results (87%).

However, Nissim and Markert's (2003) approach has two major disadvantages. The first of these is its complexity: the best-performing algorithm requires smoothing, backing-off to grammatical roles, iterative searches through clusters of semantically similar words, etc. In section 2, I will therefore investigate if a metonymy recognition algorithm needs to be that computationally demanding. In particular, I will try and replicate Nissim and Markert's results with the 'lazy' algorithm of Memory-Based Learning. The second disadvantage of Nissim and Markert's (2003) algorithms is their supervised nature.
Because they rely so heavily on the manual annotation of training and test data, an extension of the classifiers to more metonymical patterns is extremely problematic. Yet such an extension is essential for many tasks throughout the field of Natural Language Processing, particularly Machine Translation. This knowledge acquisition bottleneck is a well-known problem in NLP, and many approaches have been developed to address it. One of these is active learning, or sample selection, a strategy that makes it possible to selectively annotate those examples that are most helpful to the classifier. It has previously been applied to NLP tasks such as parsing (Hwa, 2002; Osborne and Baldridge, 2004) and Word Sense Disambiguation (Fujii et al., 1998). In section 3, I will introduce active learning into the field of metonymy recognition.

2 Example-based metonymy recognition
As I have argued, Nissim and Markert's (2003) approach to metonymy recognition is quite complex. I therefore wanted to see if this complexity can be dispensed with, and if it can be replaced with the much simpler algorithm of Memory-Based Learning. The advantages of Memory-Based Learning (MBL), which is implemented in the TiMBL classifier (Daelemans et al., 2004), are twofold. First, it is based on a plausible psychological hypothesis of human learning. It holds that people interpret new examples of a phenomenon by comparing them to "stored representations of earlier experiences" (Daelemans et al., 2004, p. 19). This contrasts with many other classification algorithms, such as Naive Bayes, whose psychological validity is an object of heavy debate. Second, as a result of this learning hypothesis, an MBL classifier such as TiMBL eschews the formulation of complex rules or the computation of probabilities during its training phase. Instead it stores all training vectors in its memory, together with their labels. In the test phase, it computes the distance between the test vector and all these training vectors, and simply returns the most frequent label of the most similar training examples.

One of the most important challenges in Memory-Based Learning is adapting the algorithm to one's data. This includes finding a representative seed set as well as determining the right distance measures. For my purposes, however, TiMBL's default settings proved more than satisfactory. TiMBL implements the IB1 and IB2 algorithms that were presented in Aha et al. (1991), but adds a broad choice of distance measures. Its default implementation of the IB1 algorithm, which is called IB1-IG in full (Daelemans and Van den Bosch, 1992), proved most successful in my experiments. It computes the distance between two vectors X and Y by adding up the weighted distances δ between their corresponding feature values x_i and y_i:

    Δ(X, Y) = Σ_{i=1}^{n} w_i δ(x_i, y_i)    (3)

The most important element in this equation is the weight that is given to each feature. In IB1-IG, features are weighted by their Gain Ratio (equation 4), the division of the feature's Information Gain by its split info. Information Gain, the numerator in equation (4), "measures how much information it [feature i] contributes to our knowledge of the correct class label [...] by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of that feature" (Daelemans et al., 2004, p. 20). In order not "to overestimate the relevance of features with large numbers of values" (Daelemans et al., 2004, p. 21), this Information Gain is then divided by the split info, the entropy of the feature values (equation 5). In the following equations, C is the set of class labels, H(C) is the entropy of that set, and V_i is the set of values for feature i:

    w_i = (H(C) − Σ_{v∈V_i} P(v) × H(C|v)) / si(i)    (4)

    si(i) = −Σ_{v∈V_i} P(v) log₂ P(v)    (5)
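To make equations (3) to (5) concrete, here is a minimal C++ sketch of the IB1-IG classification step. It is not TiMBL itself: the function names (overlapDelta, classify) are hypothetical, the symbolic feature values and the gain-ratio weights w_i are assumed to be given, and TiMBL would estimate the weights from the training data.

#include <string>
#include <vector>
#include <map>
#include <limits>

using Instance = std::vector<std::string>;   // symbolic feature values

// delta(x_i, y_i) in equation (3): the overlap metric for symbolic features
double overlapDelta(const std::string &x, const std::string &y) {
    return x == y ? 0.0 : 1.0;
}

// Delta(X, Y) = sum_i w_i * delta(x_i, y_i), with w_i the gain-ratio weights
double distance(const Instance &a, const Instance &b,
                const std::vector<double> &w) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); i++)
        d += w[i] * overlapDelta(a[i], b[i]);
    return d;
}

// IB1-style prediction: the most frequent label among the training
// instances that lie at the minimal distance from the test instance.
std::string classify(const Instance &test,
                     const std::vector<Instance> &train,
                     const std::vector<std::string> &labels,
                     const std::vector<double> &w) {
    double best = std::numeric_limits<double>::infinity();
    std::map<std::string, int> votes;
    for (size_t i = 0; i < train.size(); i++) {
        double d = distance(test, train[i], w);
        if (d < best - 1e-12) { best = d; votes.clear(); }  // new strict minimum
        if (d < best + 1e-12) votes[labels[i]]++;           // tie within tolerance
    }
    std::string arg; int n = -1;
    for (auto &kv : votes) if (kv.second > n) { n = kv.second; arg = kv.first; }
    return arg;
}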
Table 1: Results for the mixed country data.
        P        F
TiMBL   86.6%    49.5%
N&M     81.4%    62.7%
TiMBL: my TiMBL results; N&M: Nissim and Markert's (2003) results.

[Footnote: This data is publicly available and can be downloaded from /mnissim/mascara.]

With this simple learning phase, TiMBL is able to replicate the results from Nissim and Markert (2003; 2005). As table 1 shows, accuracy for the mixed country data is almost identical to Nissim and Markert's figure, and precision, recall and F-score for the metonymical class lie only slightly lower. TiMBL's results for the Hungary data were similar, and equally comparable to Markert and Nissim's (Katja Markert, personal communication). Note, moreover, that these results were reached with grammatical information only, whereas Nissim and Markert's (2003) algorithm relied on semantics as well.

Next, table 2 indicates that TiMBL's accuracy for the mixed organization data lies about 1.5% below Nissim and Markert's (2005) figure. This result should be treated with caution, however. First, Nissim and Markert's available organization data had not yet been annotated for grammatical features, and my annotation may slightly differ from theirs. Second, Nissim and Markert used several feature vectors for instances with more than one grammatical role and filtered all mixed instances from the training set. A test instance was treated as mixed only when its several feature vectors were classified differently. My experiments, in contrast, were similar to those for the location data, in that each instance corresponded to one vector. Hence, the slightly lower performance of TiMBL is probably due to differences between the two experiments.

These first experiments thus demonstrate that Memory-Based Learning can give state-of-the-art performance in metonymy recognition. In this respect, it is important to stress that the results for the country data were reached without any semantic information, whereas Nissim and Markert's (2003) algorithm used Dekang Lin's (1998) clusters of semantically similar words in order to deal with data sparseness. This fact, together [...]

Table 2: Results for the mixed organization data (recovered fragment).
TiMBL: Acc 78.65%, 65.10%, 76.0%, —

[Figure 1: Accuracy learning curves for the mixed country data with and without semantic information.]

[...] in more detail. As figure 1 indicates, with respect to overall accuracy, semantic features have a negative influence: the learning curve with both features climbs much more slowly than that with only grammatical features. Hence, contrary to my expectations, grammatical features seem to allow a better generalization from a limited number of training instances. With respect to the F-score on the metonymical category in figure 2, the differences are much less outspoken. Both features give similar learning curves, but semantic features lead to a higher final F-score. In particular, the use of semantic features results in a lower precision figure, but a higher recall score. Semantic features thus cause the classifier to slightly overgeneralize from the metonymic training examples.

There are two possible reasons for this inability of semantic information to improve the classifier's performance. First, WordNet's synsets do not always map well to one of our semantic labels: many are rather broad and allow for several readings of the target word, while others are too specific to make generalization possible. Second, there is the predominance of prepositional phrases in our data. With their closed set of heads, the number of examples that benefits from semantic information about its head is actually rather small.
Nevertheless, my first round of experiments has indicated that Memory-Based Learning is a simple but robust approach to metonymy recognition. It is able to replace current approaches that need smoothing or iterative searches through a thesaurus with a simple, distance-based algorithm.

[Figure 3: Accuracy learning curves for the country data with random and maximum-distance selection of training examples.]

[...] over all possible labels. The algorithm then picks those instances with the lowest confidence, since these will contain valuable information about the training set (and hopefully also the test set) that is still unknown to the system.

One problem with Memory-Based Learning algorithms is that they do not directly output probabilities. Since they are example-based, they can only give the distances between the unlabelled instance and all labelled training instances. Nevertheless, these distances can be used as a measure of certainty, too: we can assume that the system is most certain about the classification of test instances that lie very close to one or more of its training instances, and less certain about those that are further away. Therefore the selection function that minimizes the probability of the most likely label can intuitively be replaced by one that maximizes the distance from the labelled training instances.

However, figure 3 shows that for the mixed country instances, this function is not an option. Both learning curves give the results of an algorithm that starts with fifty random instances, and then iteratively adds ten new training instances to this initial seed set. The algorithm behind the solid curve chooses these instances randomly, whereas the one behind the dotted line selects those that are most distant from the labelled training examples. In the first half of the learning process, both functions are equally successful; in the second the distance-based function performs better, but only slightly so.
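The selection function behind the dotted curve can be sketched as follows. This is an illustrative reconstruction (not the paper's code): each unlabelled instance is scored by its distance to the nearest labelled instance, and the most distant ones are picked; Instance, distance() and the weights w are reused from the earlier sketch.

#include <vector>
#include <algorithm>
#include <limits>

// Distance from a candidate to its nearest labelled instance: a large value
// means the candidate's feature values are relatively unknown to the system.
double distToLabelled(const Instance &x,
                      const std::vector<Instance> &labelled,
                      const std::vector<double> &w) {
    double best = std::numeric_limits<double>::infinity();
    for (const Instance &l : labelled)
        best = std::min(best, distance(x, l, w));
    return best;
}

// Indices of the k unlabelled instances most distant from the labelled set.
std::vector<size_t> selectMostDistant(const std::vector<Instance> &unlabelled,
                                      const std::vector<Instance> &labelled,
                                      const std::vector<double> &w, size_t k) {
    std::vector<double> score(unlabelled.size());
    for (size_t i = 0; i < unlabelled.size(); i++)
        score[i] = distToLabelled(unlabelled[i], labelled, w);
    std::vector<size_t> idx(unlabelled.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
    std::stable_sort(idx.begin(), idx.end(),
                     [&](size_t a, size_t b) { return score[a] > score[b]; });
    if (idx.size() > k) idx.resize(k);
    return idx;
}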
There are two reasons for this bad initial performance of the active learning function. First, it is not able to distinguish between informative and unusual training instances. This is because a large distance from the seed set simply means that the particular instance's feature values are relatively unknown. This does not necessarily imply that the instance is informative to the classifier, however. After all, it may be so unusual and so badly representative of the training (and test) set that the algorithm had better exclude it, something that is impossible on the basis of distances only. This bias towards outliers is a well-known disadvantage of many simple active learning algorithms. A second type of bias is due to the fact that the data has been annotated with a few features only. More particularly, the present algorithm will keep adding instances whose head is not yet represented in the training set. This entails that it will put off adding instances whose function is pp, simply because other functions (subj, gen, ...) have a wider variety in heads. Again, the result is a labelled set that is not very representative of the entire training set.

[Figure 4: Accuracy learning curves for the country data with random and maximum/minimum-distance selection of training examples.]

There are, however, a few easy ways to increase the number of prototypical examples in the training set. In a second run of experiments, I used an active learning function that added not only those instances that were most distant from the labelled training set, but also those that were closest to it. After a few test runs, I decided to add six distant and four close instances on each iteration. Figure 4 shows that such a function is indeed fairly successful. Because it builds a labelled training set that is more representative of the test set, this algorithm clearly reduces the number of annotated instances that is needed to reach a given performance.

[Figure 5: Accuracy learning curves for the organization data with random and distance-based (AL) selection of training examples with a random seed set.]

Despite its success, this function is obviously not yet a sophisticated way of selecting good training examples. The selection of the initial seed set in particular can be improved upon: ideally, this seed set should take into account the overall distribution of the training examples. Currently, the seeds are chosen randomly. This flaw in the algorithm becomes clear if it is applied to another data set: figure 5 shows that it does not outperform random selection on the organization data, for instance.

As I suggested, the selection of prototypical or representative instances as seeds can be used to make the present algorithm more robust. Again, it is possible to use distance measures to do this: before the selection of seed instances, the algorithm can calculate for each unlabelled instance its distance from each of the other unlabelled instances.
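Both refinements can be sketched in the same style as before; the next paragraph describes how the second one is used. Again, this is an illustrative reconstruction, not the paper's code, and it reuses Instance, distance() and distToLabelled() from the sketches above.

// Mixed selection: six most distant plus four closest candidates per iteration
// (the mix that worked best in the experiments described above).
std::vector<size_t> selectMixed(const std::vector<Instance> &unlabelled,
                                const std::vector<Instance> &labelled,
                                const std::vector<double> &w) {
    std::vector<double> score(unlabelled.size());
    for (size_t i = 0; i < unlabelled.size(); i++)
        score[i] = distToLabelled(unlabelled[i], labelled, w);
    std::vector<size_t> idx(unlabelled.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
    std::stable_sort(idx.begin(), idx.end(),
                     [&](size_t a, size_t b) { return score[a] > score[b]; });
    std::vector<size_t> picked;
    size_t m = idx.size();
    for (size_t i = 0; i < 6 && i < m; i++)                 // six most distant
        picked.push_back(idx[i]);
    for (size_t i = 0; i < 4 && 6 + i < m; i++)             // four closest
        picked.push_back(idx[m - 1 - i]);
    return picked;
}

// Prototypical seeds: the half of the pool with the smallest average
// distance to all other unlabelled instances (equal feature weights).
std::vector<size_t> prototypicalHalf(const std::vector<Instance> &pool,
                                     const std::vector<double> &w) {
    std::vector<size_t> idx(pool.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
    std::vector<double> avg(pool.size(), 0.0);
    for (size_t i = 0; i < pool.size(); i++)
        for (size_t j = 0; j < pool.size(); j++)
            if (i != j) avg[i] += distance(pool[i], pool[j], w) / (pool.size() - 1);
    std::stable_sort(idx.begin(), idx.end(),
                     [&](size_t a, size_t b) { return avg[a] < avg[b]; });
    idx.resize(pool.size() / 2);
    return idx;
}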
In this way, it can build a prototypical seed set by selecting those instances with the smallest distance on average. Figure 6 indicates that such an algorithm indeed outperforms random sample selection on the mixed organization data. For the calculation of the initial distances, each feature received the same weight. The algorithm then selected 50 random samples from the 'most prototypical' half of the training set. The other settings were the same as above.

With the present small number of features, however, such a prototypical seed set is not yet always as advantageous as it could be. A few experiments indicated that it did not lead to better performance on the mixed country data, for instance. However, as soon as a wider variety of features is taken into account (as with the organization data), the advantages [...]

[...] sampling can help choose those instances that are most helpful to the classifier. A few distance-based algorithms were able to drastically reduce the number of training instances that is needed for a given accuracy, both for the country and the organization names.

If current metonymy recognition algorithms are to be used in a system that can recognize all possible metonymical patterns across a broad variety of semantic classes, it is crucial that the required number of labelled training examples be reduced. This paper has taken the first steps along this path and has set out some interesting questions for future research. This research should include the investigation of new features that can make classifiers more robust and allow us to measure their confidence more reliably. This confidence measurement can then also be used in semi-supervised learning algorithms, for instance, where the classifier itself labels the majority of training examples. Only with techniques such as selective sampling and semi-supervised learning can the knowledge acquisition bottleneck in metonymy recognition be addressed.

Acknowledgements
I would like to thank Mirella Lapata, Dirk Geeraerts and Dirk Speelman for their feedback on this project. I am also very grateful to Katja Markert and Malvina Nissim for their helpful information about their research.

References
D. W. Aha, D. Kibler, and M. K. Albert. 1991. Instance-based learning algorithms. Machine Learning, 6:37–66.
W. Daelemans and A. Van den Bosch. 1992. Generalisation performance of backpropagation learning on a syllabification task. In M. F. J. Drossaers and A. Nijholt, editors, Proceedings of TWLT3: Connectionism and Natural Language Processing, pages 27–37, Enschede, The Netherlands.
W. Daelemans, J. Zavrel, K. Van der Sloot, and A. Van den Bosch. 2004. TiMBL: Tilburg Memory-Based Learner. Technical report, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University.
D. Fass. 1997. Processing Metaphor and Metonymy. Stanford, CA: Ablex.
A. Fujii, K. Inui, T. Tokunaga, and H. Tanaka. 1998. Selective sampling for example-based word sense disambiguation. Computational Linguistics, 24(4):573–597.
R. Hwa. 2002. Sample selection for statistical parsing. Computational Linguistics, 30(3):253–276.
G. Lakoff and M. Johnson. 1980. Metaphors We Live By. London: The University of Chicago Press.
D. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, USA.
K. Markert and M. Nissim. 2002a. Metonymy resolution as a classification task. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), Philadelphia, USA.
K. Markert and M. Nissim. 2002b. Towards a corpus annotated for metonymies: the case of location names. In Proceedings of the Third
International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Spain.
M. Nissim and K. Markert. 2003. Syntactic features and word similarity for supervised metonymy resolution. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), Sapporo, Japan.
M. Nissim and K. Markert. 2005. Learning to buy a Renault and talk to BMW: A supervised approach to conventional metonymy. In H. Bunt, editor, Proceedings of the 6th International Workshop on Computational Semantics, Tilburg, The Netherlands.
G. Nunberg. 1978. The Pragmatics of Reference. Ph.D. thesis, City University of New York.
M. Osborne and J. Baldridge. 2004. Ensemble-based active learning for parse selection. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Boston, USA.
J. Pustejovsky. 1995. The Generative Lexicon. Cambridge, MA: MIT Press.
a o e i u: How to Write the 26 Letters (Kindergarten)
The correct way for young children to write the 26 pinyin letters.

I. The Hanyu Pinyin alphabet (26 letters):
A a (a)  B b (bo)  C c (ci)  D d (de)  E e (e)  F f (fo)  G g (ge)  H h (he)  I i (yi)  J j (ji)  K k (ke)  L l (le)  M m (mo)  N n (ne)  O o (o)  P p (po)  Q q (qi)  R r (ri)  S s (si)  T t (te)  U u (wu)  V v (wei)  W w (wu)  X x (xi)  Y y (yi)  Z z (zi)

II. Rules for writing the 26 Hanyu Pinyin letters. Requirements: write neatly and attractively, and mind each letter's position in the four-line, three-space grid.
1. Letters that occupy the middle space should fill it completely.
For example: a c e m n o r s u v w x z. 2. Letters that occupy the upper and middle spaces should leave some room at the top.
For example: b d f h i k l t. 3. Letters that occupy the middle and lower spaces should leave some room at the bottom.
For example: g p q y. 4. Letters that occupy all three spaces should leave some room at both the top and the bottom.
For example: j.

The '26 letters' you are asking about is the count of the English alphabet. Initials: Bb Pp Mm Ff Dd Tt Nn Ll Gg Kk Hh Jj Qq Xx Zz Cc Ss Rr Yy Ww. Retroflex initials: zh ch sh. Simple finals: a o e i u ü (ü, written as u with two dots, is read 'yu'). Compound finals: ai ei ui ao ou iu ie üe er an en in un ang eng ing ong. Whole-syllable reads: zhi chi shi yi wu yu yin yun ye yue.

Alphabetical order: Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz

Initials table: b p m f / d t n l / g k h / j q x / zh ch sh r / z c s / y w
Finals table: a o e i u ü / ai ei ui / ao ou iu / ie üe er / an en in un / ang eng ing ong

How should the 26 pinyin letters sit in the four-line, three-space grid? [Figure: examples of letters placed in the four-line, three-space grid.]

Writing format for the Hanyu Pinyin initials: the table above does not include the letter V; when V is written, it also occupies the middle space of the four-line grid.
Collecting Luggage Algorithm Training
WF 2007 (UVaLive 2397) - Collecting Luggage
Key ideas: binary search, shortest paths, testing segments against a polygon.
Note: you may assume that the passenger is located outside the polygon.

// WF 2007 (UVaLive 2397) - Collecting Luggage
// Key ideas: binary search on the answer, shortest paths (SPFA),
// segment-polygon intersection tests.
#include <cstdio>
#include <cstring>
#include <algorithm>
#include <cmath>
#include <vector>
#include <queue>
using namespace std;

#define rep(i,n) for (int i = 0; i < (n); i++)
#define pb push_back

const double eps = 1e-6;
const double inf = 1e30;
const int nMax = 1010;

int dcmp(double x) {                        // sign of x with tolerance eps
    if (fabs(x) < eps) return 0;
    return x > 0 ? 1 : -1;
}

// ------------------------------ points and lines ------------------------------
struct point {
    double x, y;
    point(double x = 0, double y = 0) : x(x), y(y) {}
    void read() { scanf("%lf%lf", &x, &y); }
    double len() const { return sqrt(x * x + y * y); }
    point operator-(const point &v) const { return point(x - v.x, y - v.y); }
    point operator+(const point &v) const { return point(x + v.x, y + v.y); }
    point operator*(double k) const { return point(x * k, y * k); }
    point operator/(double k) const { return point(x / k, y / k); }
    bool operator<(const point &v) const {
        if (dcmp(v.x - x) == 0) return dcmp(y - v.y) < 0;
        return dcmp(x - v.x) < 0;
    }
    point rotate(double s) const {          // rotate clockwise by angle s
        return point(x * cos(s) + y * sin(s), -x * sin(s) + y * cos(s));
    }
};

double det(point u, point v) { return u.x * v.y - u.y * v.x; }  // cross product
double dot(point u, point v) { return u.x * v.x + u.y * v.y; }  // dot product

struct line {
    point a, b;
    line() {}
    line(point a, point b) : a(a), b(b) {}
};
typedef line segment;

// p lies on the closed segment [u, v]
int dot_on_seg(point p, point u, point v) {
    return dcmp(det(p - u, v - p)) == 0 && dcmp(dot(p - u, p - v)) <= 0;
}
// p lies strictly inside the open segment (u, v)
int dot_on_seg_ex(point p, point u, point v) {
    return dcmp(det(p - u, v - p)) == 0 && dcmp(dot(p - u, v - p)) > 0;
}

int parallel(line u, line v) { return dcmp(det(u.a - u.b, v.a - v.b)) == 0; }

point intersection(line u, line v) {        // intersection of two non-parallel lines
    double t = det(u.a - v.a, v.a - v.b) / det(u.a - u.b, v.a - v.b);
    return u.a + (u.b - u.a) * t;
}

// ---------- further helpers kept from the original geometry template ----------
double PToLine(point p, line l) {           // distance from point p to line l
    return fabs(det(p - l.a, l.a - l.b)) / (l.a - l.b).len();
}
point PointProjectLine(point p, line l) {   // projection of p onto line l
    double t = dot(l.b - l.a, p - l.a) / dot(l.b - l.a, l.b - l.a);
    return l.a + (l.b - l.a) * t;
}
struct circle {
    point O;
    double r;
    circle() {}
    circle(point O, double r) : O(O), r(r) {}
};
int ConvexHull(vector<point> p) {           // Andrew's monotone chain
    int n = p.size(), m = 0;
    sort(p.begin(), p.end());               // requires lexicographically sorted input
    vector<point> q(2 * n + 5);
    rep(i, n) {
        while (m > 1 && dcmp(det(q[m - 1] - q[m - 2], p[i] - q[m - 2])) <= 0) m--;
        q[m++] = p[i];
    }
    int k = m;
    for (int i = n - 2; i >= 0; i--) {
        while (m > k && dcmp(det(q[m - 1] - q[m - 2], p[i] - q[m - 2])) <= 0) m--;
        q[m++] = p[i];
    }
    q.resize(m);
    if (m > 1) q.resize(m - 1);
    // p = q;                               // whether to overwrite the original polygon
    return q.size();
}
point Center(point a, point b, point c) {   // centroid of a triangle
    return (a + b + c) / 3.0;
}
point Centroid(vector<point> p) {           // centroid of a polygon
    point O(0, 0), ret(0, 0);
    int n = p.size();
    p.pb(p[0]);
    double area = 0.0;
    rep(i, n) {                             // signed areas need cross products
        ret = ret + Center(O, p[i], p[i + 1]) * det(p[i] - O, p[i + 1] - O);
        area += det(p[i] - O, p[i + 1] - O);
    }
    if (dcmp(area) == 0) { printf("There may be something wrong\n"); return p[0]; }
    return ret / area;
}
struct Polygan {
    vector<point> g;
    Polygan() {}
    Polygan(vector<point> g) : g(g) {}
    Polygan(point pts[], int n) { g.clear(); rep(i, n) g.pb(pts[i]); }
    int convex() { return ConvexHull(g); }
    point center() { return Centroid(g); }  // centroid of the polygon
};

// ------------------------------ problem solution ------------------------------
int n;                       // number of polygon vertices
point p[nMax], s;            // polygon vertices and the luggage position s
double vp, vs;               // speed of the passenger, speed of the collector
segment l[nMax];             // polygon edges
double T, d[nMax];           // perimeter and edge lengths
double g[nMax][nMax], h[nMax];
int vis[nMax];
queue<int> que;

// SPFA shortest path from src to t on the visibility graph g.
// Node ids: 0..n-1 polygon vertices, n = luggage s, n+1 = passenger position.
double MinDist(int t, int src) {
    while (!que.empty()) que.pop();
    rep(i, n + 2) { vis[i] = 0; h[i] = inf; }
    h[src] = 0; vis[src] = 1; que.push(src);
    while (!que.empty()) {
        int u = que.front(); que.pop();
        vis[u] = 0;
        rep(i, n + 1) if (dcmp(g[u][i] - inf) < 0) {  // n+1 is the source, never relaxed into
            if (dcmp(h[i] - h[u] - g[u][i]) > 0) {
                h[i] = h[u] + g[u][i];
                if (!vis[i]) { vis[i] = 1; que.push(i); }
            }
        }
    }
    return h[t];
}

// Intersection of segment l with segment (u, v):
// 1 = proper crossing, -1 = touching/overlapping (touch point in tmp), 0 = disjoint.
int LLJ(line l, point u, point v, point &tmp) {
    if (parallel(l, line(u, v))) {
        if (dot_on_seg(l.a, u, v)) { tmp = l.a; return -1; }
        if (dot_on_seg(l.b, u, v)) { tmp = l.b; return -1; }
        if (dot_on_seg(u, l.a, l.b)) { tmp = u; return -1; }
        if (dot_on_seg(v, l.a, l.b)) { tmp = v; return -1; }
        return 0;
    }
    tmp = intersection(l, line(u, v));
    if (dot_on_seg_ex(tmp, l.a, l.b) && dot_on_seg_ex(tmp, u, v)) return 1;
    if (dot_on_seg(tmp, l.a, l.b) && dot_on_seg(tmp, u, v)) return -1;
    return 0;
}

int Point_In(point t) {                     // is t strictly inside the polygon?
    int num = 0;
    p[n] = p[0];
    for (int i = 0; i < n; i++) {
        if (dot_on_seg(t, p[i], p[i + 1])) return 0;  // boundary does not count
        int k = dcmp(det(p[i + 1] - p[i], t - p[i]));
        int d1 = dcmp(p[i].y - t.y);
        int d2 = dcmp(p[i + 1].y - t.y);
        if (k > 0 && d1 <= 0 && d2 > 0) num++;
        if (k < 0 && d2 <= 0 && d1 > 0) num--;
    }
    return num != 0;
}

// Do a and b see each other, i.e. does segment ab avoid the polygon's interior?
int chk(point a, point b) {
    point tmp;
    vector<point> res;
    res.pb(a); res.pb(b);
    rep(i, n) {
        int c = LLJ(l[i], a, b, tmp);
        if (c == 1) return 0;               // proper crossing with an edge: blocked
        else if (c == -1) res.pb(tmp);      // touch point: split ab there
    }
    sort(res.begin(), res.end());
    rep(i, (int)res.size() - 1)
        if (Point_In((res[i] + res[i + 1]) / 2.0)) return 0;  // midpoint inside: blocked
    return 1;
}

// Position of the passenger after walking for time cur along the boundary.
point find(double cur) {
    cur *= vp;                              // distance walked
    cur = fmod(cur, T);                     // the walk is periodic with period T
    int i = 0;
    while (cur >= d[i]) { cur -= d[i]; i++; }
    point dir = (l[i].b - l[i].a) / (l[i].b - l[i].a).len();
    return l[i].a + dir * cur;
}

// Can the collector reach the passenger's position within time cur?
int deal(double cur) {
    point q = find(cur);
    rep(i, n) {
        if (chk(q, p[i])) g[i][n + 1] = g[n + 1][i] = (q - p[i]).len();
        else g[i][n + 1] = g[n + 1][i] = inf;
    }
    if (chk(q, s)) g[n][n + 1] = g[n + 1][n] = (s - q).len();
    else g[n][n + 1] = g[n + 1][n] = inf;
    double ret = MinDist(n, n + 1);
    return dcmp(ret - cur * vs) <= 0;
}

int cas = 1;
void work() {
    p[n] = p[0];
    rep(i, n) l[i] = line(p[i], p[i + 1]);
    T = 0;
    rep(i, n) { d[i] = (l[i].a - l[i].b).len(); T += d[i]; }
    rep(i, n) for (int j = i + 1; j < n; j++) {          // visibility between vertices
        if (chk(p[i], p[j])) g[i][j] = g[j][i] = (p[i] - p[j]).len();
        else g[i][j] = g[j][i] = inf;
    }
    rep(i, n) {                                           // visibility from the luggage
        if (chk(s, p[i])) g[i][n] = g[n][i] = (s - p[i]).len();
        else g[i][n] = g[n][i] = inf;
    }
    double la = 0, ra = 1e8, mid;
    while (ra - la > 1e-8) {                              // binary search on the meeting time
        mid = (la + ra) / 2.0;
        if (deal(mid)) ra = mid;
        else la = mid;
    }
    int Base = (int)((la + ra) * 30 + 0.5 + eps);         // (la+ra)/2 minutes, in rounded seconds
    printf("Case %d: Time = %d:%02d\n", cas++, Base / 60, Base % 60);
}

int main() {
#ifndef ONLINE_JUDGE
    freopen("in.txt", "r", stdin);
    freopen("out.txt", "w", stdout);
#endif
    while (~scanf("%d", &n) && n) {
        rep(i, n) p[i].read();
        s.read();
        scanf("%lf%lf", &vp, &vs);
        work();
    }
    return 0;
}
Implicit Self-Esteem and Social Identity
Anthony G. Greenwald and Mahzarin R. Banaji
William James (1890) defined self-esteem as a self-feeling that is determined by a comparison between the actual self and the ideal self. Following James's definition of self-esteem, standard self-report measures of self-esteem ask respondents either to rate themselves on a variety of specific traits (Marsh, 1986; Pelham and Swann, 1989; Wells and Marwell, 1976), or to indicate how they feel about themselves globally (Rosenberg, 1979). However, research has not supported James's formulation, because self-esteem does not appear to be the product of honest appraisal of one's traits and abilities (Rosenberg, 1979) or one's social identity (Crocker and Major, 1989). Rather, research indicates that the higher one's self-esteem, the greater the self-enhancing bias (see Brown, 1991, for review). Consequently, psychologists have debated extensively whether self-esteem causes self-appraisals or vice versa (Brown, 1993; Pelham and Swann, 1989), whether self-esteem leads to discriminatory behavior or vice versa (Abrams and Hogg, 1988), whether people are motivated towards accuracy or positivity in their self-concepts (Brown, 1991; Shrauger, 1975; Swann, 1990), and why, if having high self-esteem is not based on accurate self-appraisals, anyone would have low self-esteem (Baumeister, 1993).

What psychologists have only recently considered is that the correspondence between self-esteem measures and self-enhancing behaviors suggests that self-esteem measures may be capturing the wrong construct (Baumeister, Tice, and Hutton, 1989): the motive to present a positive attitude toward the self rather than genuine self-esteem. A positivity bias provides no threat to the construct validity of self-esteem measures (i.e., their ability to measure the self-esteem construct). Whether such biases arise from positive feelings toward the self (Brown, 1993) or cognitive beliefs about the self (Markus and Wurf, 1986), they are a reflection of the level of positive self-regard. Such an automatic positivity bias can be interpreted as a manifestation of implicit self-esteem. Greenwald and Banaji defined implicit self-esteem as "the introspectively unidentified (or inaccurately identified) effect of the self-attitude on evaluation of self-associated and self-dissociated objects" (1995, p. 11). This tendency to overestimate one's traits and abilities is understood as a spillover of positive affect from the self to objects associated with the self. Because most people have positive self-affect (Banaji and Prentice, 1994; Greenwald, 1980; Taylor and Brown, 1988), implicit self-esteem effects usually [...]
Smooth Projective Hashing and Two-Message Oblivious Transfer
Yael Tauman Kalai
Massachusetts Institute of Technology
tauman@, /~tauman

Abstract. We present a general framework for constructing two-message oblivious transfer protocols using a modification of Cramer and Shoup's notion of smooth projective hashing (2002). Our framework is actually an abstraction of the two-message oblivious transfer protocols of Naor and Pinkas (2001) and Aiello et al. (2001), whose security is based on the Decisional Diffie-Hellman Assumption. In particular, this framework gives rise to two new oblivious transfer protocols. The security of one is based on the N'th-Residuosity Assumption, and the security of the other is based on both the Quadratic Residuosity Assumption and the Extended Riemann Hypothesis.
When using smooth projective hashing in this context, we must deal with maliciously chosen smooth projective hash families. This raises new technical difficulties that did not arise in previous applications, and in particular it is here that the Extended Riemann Hypothesis comes into play.
Similar to the previous two-message protocols for oblivious transfer, our constructions give a security guarantee which is weaker than the traditional, simulation based, definition of security. Nevertheless, the security notion that we consider is nontrivial and seems to be meaningful for some applications in which oblivious transfer is used in the presence of malicious adversaries.

[Supported in part by NSF CyberTrust grant CNS-0430450.]

1 Introduction
In [CS98], Cramer and Shoup introduced the first CCA2 secure encryption scheme, whose security is based on the Decisional Diffie-Hellman (DDH) Assumption. They later presented an abstraction of this scheme based on a new notion which they called "smooth projective hashing" [CS02]. This abstraction yielded new CCA2 secure encryption schemes whose security is based on the Quadratic Residuosity Assumption or on the N'th Residuosity Assumption [Pa99]. (The N'th Residuosity Assumption is also referred to in the literature as the Decisional Composite Residuosity Assumption and as Paillier's Assumption.) This notion of smooth projective hashing was then used by Gennaro and Lindell [GL03] in the context of key generation from humanly memorizable passwords. Analogously, their work generalizes an earlier protocol for this problem [KOY01], whose security is also based on the DDH Assumption.

In this paper, we use smooth projective hashing to construct efficient two-message oblivious transfer protocols. Our work follows the above pattern, in that it generalizes earlier protocols for this problem [NP01, AIR01] whose security is based on the DDH assumption. Interestingly, using smooth projective hashing in this context raises a new issue. Specifically, we must deal with maliciously chosen smooth projective hash families. This issue did not arise in the previous two applications because these were either in the public key model or in the common reference string model.

1.1 Oblivious Transfer
Oblivious transfer is a protocol between a sender, holding two strings γ_0 and γ_1, and a receiver holding a choice bit b. At the end of the protocol the receiver should learn the string of his choice (i.e., γ_b) but learn nothing about the other string. The sender, on the other hand, should learn nothing about the receiver's choice b.

Oblivious transfer, first introduced by Rabin [Rab81], is a central primitive in modern cryptography. It serves as the basis of a wide range of cryptographic tasks. Most notably, any secure multi-party computation can be based on a secure oblivious transfer
protocol [Y86, GMW87, Kil88]. Oblivious transfer has been studied in several variants, all of which have been shown to be equivalent. The variant considered in this paper is the one by Even, Goldreich and Lempel [EGL85] (a.k.a. 1-out-of-2 oblivious transfer), shown to be equivalent to Rabin's original definition by Crépeau [Cre87].

The study of oblivious transfer has been motivated by both theoretical and practical considerations. On the theoretical side, much work has been devoted to the understanding of the hardness assumptions required to guarantee oblivious transfer. In this context, it is important to note that known constructions for oblivious transfer are based on relatively strong computational assumptions: either specific assumptions such as factoring or Diffie-Hellman (cf. [Rab81, BM89, NP01, AIR01]) or generic assumptions such as the existence of enhanced trapdoor permutations (cf. [EGL85, Gol04, Hai04]). Unfortunately, oblivious transfer cannot be reduced in a black box manner to presumably weaker primitives such as one-way functions [IR89]. On the practical side, research has been motivated by the fact that oblivious transfer is considered to be the main bottleneck with respect to the amount of computation required by secure multiparty protocols. This makes the construction of efficient protocols for oblivious transfer a well-motivated task. In particular, constructing round-efficient oblivious transfer protocols is an important task. Indeed, [NP01] (in Protocol 4.1) and [AIR01] independently constructed a two-message (1-round) oblivious transfer protocol based on the DDH Assumption (with weaker security guarantees than the simulation based security). Their work was the starting point of our work.

1.2 Smooth Projective Hashing
Smooth projective hashing is a beautiful notion introduced by Cramer and Shoup [CS02]. To define this notion they rely on the existence of a set X (actually a distribution on sets), and an underlying NP-language L ⊆ X (with an associated NP-relation R). The basic hardness assumption is that it is infeasible to distinguish between a random element in L and a random element in X \ L. This is called a hard subset membership problem.

A smooth projective hash family is a family of hash functions that operate on the set X. Each function in the family has two keys associated with it: a hash key k, and a projection key α(k). The first requirement (which is the standard requirement of a hash family) is that given a hash key k and an element x in the domain X, one can compute H_k(x). There are two additional requirements: the "projection requirement" and the "smoothness requirement." The "projection requirement" is that given a projection key α(k) and an element x ∈ L, the value of H_k(x) is uniquely determined. Moreover, computing H_k(x) can be done efficiently, given the projection key α(k) and a pair (x, w) ∈ R. The "smoothness requirement," on the other hand, is that given a random projection key s = α(k) and any element x ∈ X \ L, the value H_k(x) is statistically indistinguishable from random.

1.3 Our results
We present a methodology for constructing a two-message oblivious transfer protocol from any (modification of a) smooth projective hash family. In particular, we show how the previously known (DDH based) protocols of [NP01, AIR01] can be viewed as a special case of this methodology. Moreover, we show that this methodology gives rise to two new oblivious transfer protocols; one based on the N'th Residuosity Assumption, and the other based on the Quadratic Residuosity Assumption along with the Extended Riemann Hypothesis.

Our protocols, similarly to the
protocols of [NP01, AIR01], are not known to be secure according to the traditional simulation based definition. Yet, they have the advantage of providing a certain level of security even against malicious adversaries without having to compromise on efficiency (see Section 3 for further discussion on the guaranteed level of security).

The basic idea. Given a smooth projective hash family for a hard subset membership problem (which generates pairs X, L according to some distribution), consider the following two-message protocol for semi-honest oblivious transfer. Recall that the sender's input is a pair of strings γ_0, γ_1 and the receiver's input is a choice bit b.

R → S: Choose a pair X, L (with an associated NP-relation R_L) according to the specified distribution. Randomly generate a triplet (x_0, x_1, w_b) where x_b ∈_R L, (x_b, w_b) ∈ R_L, and x_{1−b} ∈_R X \ L. Send (X, x_0, x_1).

S → R: Choose independently two random keys k_0, k_1 for H and send α(k_0) and α(k_1) along with y_0 = γ_0 ⊕ H_{k_0}(x_0) and y_1 = γ_1 ⊕ H_{k_1}(x_1).

R: Retrieve γ_b by computing y_b ⊕ H_{k_b}(x_b), using the witness w_b and the projection key α(k_b).

The security of the receiver is implied by the hardness of the subset membership problem on X. Specifically, guessing the value of b is equivalent to distinguishing between a random element in L and a random element in X \ L. The security of the sender is implied by the smoothness property of the hash family H. Specifically, given a random projection key α(k) and any element x ∈ X \ L, the value H_k(x) is statistically indistinguishable from random. Thus, the message y_{1−b} gives no information about γ_{1−b} (since x_{1−b} ∈ X \ L). Note that the functionality of the protocol is implied by the projection property.

Technical difficulty. Notice that when considering malicious receivers, the security of the sender is no longer guaranteed. The reason is that there is no guarantee that the receiver will choose x_{1−b} ∈ X \ L. A malicious receiver might choose x_0, x_1 ∈ L and learn both values. To overcome this problem, we extend the notion of a hard subset membership problem so that it is possible to verify that at least one of x_0, x_1 belongs to X \ L. This should work even if the set X is maliciously chosen by the receiver.

It turns out that implementing this extended notion in the context of the DDH assumption is straightforward [NP01, AIR01]. Loosely speaking, in this case X is generated by choosing a random prime p, and choosing two random elements g_0, g_1 in Z*_p of some prime order q. The resulting set X is defined by X = {(g_0^{r_0}, g_1^{r_1}) : r_0, r_1 ∈ Z_q}, the corresponding language L is defined by L = {(g_0^r, g_1^r) : r ∈ Z_q}, and the witness of each element (g_0^r, g_1^r) ∈ L is its discrete logarithm r. In order to enable the sender to verify that two elements x_0, x_1 are not both in L, we instruct the receiver to generate x_0, x_1 by choosing at random two distinct elements r_0, r_1 ∈ Z_q, setting x_b = (g_0^{r_0}, g_1^{r_0}), w_b = r_0, and x_{1−b} = (g_0^{r_0}, g_1^{r_1}). Notice that x_b is uniformly distributed in L, x_{1−b} is uniformly distributed in X \ L, and the sender can easily check that it is not the case that both x_0 and x_1 are in L by merely checking that they agree on their first coordinate and differ on their second coordinate.

Implementing this verifiability property in the context of the N'th Residuosity Assumption and the Quadratic Residuosity Assumption is not as easy. This part contains the bulk of technical difficulties of this work. In particular, this is where the Extended Riemann Hypothesis comes into play in the context of Quadratic Residuosity.
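The following toy program is my illustration of the DDH instantiation above, not code from the paper. It uses the standard DDH-based projective hash, H_k(x) = x_1^{k_1} · x_2^{k_2} with projection key α(k) = g_0^{k_1} · g_1^{k_2}, so that H_k(x) = α(k)^r for x = (g_0^r, g_1^r) ∈ L. The group is a deliberately tiny toy (64-bit arithmetic, group order not prime), all "random" values are hard-coded, and XOR with the raw group element stands in for a proper encoding, so this only traces the message flow and the sender's verifiability check; it is in no way a secure implementation.

#include <cstdint>
#include <iostream>

const uint64_t P = 1000000007ULL;   // toy prime modulus (illustration only)
uint64_t mulmod(uint64_t a, uint64_t b) { return (__uint128_t)a * b % P; }  // GCC/Clang 128-bit
uint64_t powmod(uint64_t b, uint64_t e) {
    uint64_t r = 1;
    for (; e; e >>= 1, b = mulmod(b, b)) if (e & 1) r = mulmod(r, b);
    return r;
}

int main() {
    uint64_t g0 = 2, g1 = 3;                      // public generators
    // Receiver (choice bit b): x_b in L with witness r0, x_{1-b} in X \ L.
    int b = 1;
    uint64_t r0 = 123456789, r1 = 987654321;      // distinct exponents, r0 != r1
    uint64_t xb[2]  = { powmod(g0, r0), powmod(g1, r0) };  // (g0^r0, g1^r0) in L
    uint64_t xnb[2] = { powmod(g0, r0), powmod(g1, r1) };  // (g0^r0, g1^r1) not in L
    uint64_t x0[2], x1[2];
    if (b == 0) { x0[0]=xb[0];  x0[1]=xb[1];  x1[0]=xnb[0]; x1[1]=xnb[1]; }
    else        { x1[0]=xb[0];  x1[1]=xb[1];  x0[0]=xnb[0]; x0[1]=xnb[1]; }

    // Sender's check that x0, x1 are not both in L: same first coordinate,
    // different second coordinate (section 1.3).
    if (!(x0[0] == x1[0] && x0[1] != x1[1])) return 1;

    // Sender: keys k = (k1, k2); alpha(k) = g0^k1 * g1^k2; H_k(x) = x[0]^k1 * x[1]^k2.
    uint64_t gamma0 = 41, gamma1 = 59;            // the sender's two strings
    uint64_t k10 = 111, k20 = 222, k11 = 333, k21 = 444;  // fixed for the demo
    uint64_t a0 = mulmod(powmod(g0, k10), powmod(g1, k20));
    uint64_t a1 = mulmod(powmod(g0, k11), powmod(g1, k21));
    uint64_t y0 = gamma0 ^ mulmod(powmod(x0[0], k10), powmod(x0[1], k20));
    uint64_t y1 = gamma1 ^ mulmod(powmod(x1[0], k11), powmod(x1[1], k21));

    // Receiver: recompute H_k(x_b) = alpha(k_b)^{r0} from the projection key
    // and the witness r0, then unmask y_b.
    uint64_t hb = powmod(b == 0 ? a0 : a1, r0);
    uint64_t gb = (b == 0 ? y0 : y1) ^ hb;
    std::cout << "recovered gamma_b = " << gb << "\n";   // prints 59 for b = 1
    (void)a0; (void)y0;                                  // sent but unused for b = 1
    return 0;
}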
2 Smooth Projective Hash Functions
Our definition of smooth projective hashing differs from its original definition in [CS02]. The main difference (from both [CS02] and [GL03]) is in the definition of the smoothness requirement, which we relax to Y-smoothness, and in the definition of a subset membership problem, where we incorporate an additional requirement called Y-verifiability.

Notation. The security parameter is denoted by n. For a distribution D, x ← D denotes the action of choosing x according to D, and x ∈ support(D) means that the distribution D samples the value x with positive probability. We denote by x ∈_R S the action of uniformly choosing an element from the set S. For any two random variables X, Y, we say that X and Y are ε-close if Dist(X, Y) ≤ ε, where Dist(X, Y) denotes the statistical difference between X and Y. (Recall that Dist(X, Y) = (1/2) Σ_{s∈S} |Pr[X = s] − Pr[Y = s]|, or equivalently, Dist(X, Y) = max_{S′⊆S} |Pr[X ∈ S′] − Pr[Y ∈ S′]|, where S is any set that contains the support of both X and Y.) We say that the ensembles {X_n}_{n∈N} and {Y_n}_{n∈N} are statistically indistinguishable if there exists a negligible function ε(·) such that for every n ∈ N, the random variables X_n and Y_n are ε(n)-close. (For simplicity, throughout this paper we say that two random variables X_n and Y_n are statistically indistinguishable, meaning that the corresponding distribution ensembles are statistically indistinguishable.) Recall that a function ν : N → [0, 1] is said to be negligible if for every polynomial p(·) and for every large enough n, ν(n) < 1/p(n).

Hard subset membership problems. A subset membership problem M specifies a collection {I_n}_{n∈N} of distributions, where for every n, I_n is a probability distribution over instance descriptions. Each instance description Λ specifies two finite non-empty sets X, W ⊆ {0,1}^{poly(n)}, and an NP-relation R ⊂ X × W, such that the corresponding language L = {x : ∃w s.t. (x, w) ∈ R} is non-empty. For every x ∈ X and w ∈ W, if (x, w) ∈ R, we say that w is a witness for x. We use the following notation throughout the paper: for any instance description Λ we let X(Λ), W(Λ), R(Λ) and L(Λ) denote the sets specified by Λ. Loosely speaking, a subset membership problem M = {I_n}_{n∈N} is said to be hard if for a random instance description Λ ← I_n, it is hard to distinguish random members of L(Λ) from random non-members.

Definition 1 (Hard subset membership problem). Let M = {I_n}_{n∈N} be a subset membership problem as above. We say that M is hard if the ensembles {Λ_n, x_n^0}_{n∈N} and {Λ_n, x_n^1}_{n∈N} are computationally indistinguishable, where Λ_n ← I_n, x_n^0 ∈_R L(Λ_n), and x_n^1 ∈_R X(Λ_n) \ L(Λ_n). (Note that this hardness requirement also implies that it is hard to distinguish between a random element x ∈_R L(Λ) and a random element x ∈_R X(Λ). We will use this fact in the proof of Theorem 1.)

Projective hash family. We next present the notion of a projective hash family with respect to a hard subset membership problem M = {I_n}_{n∈N}. Let H = {H_k}_{k∈K} be a collection of hash functions. K, referred to as the key space, consists of a set of keys such that for each instance description Λ ∈ M, there is a subset of keys K(Λ) ⊆ K corresponding to Λ. (We abuse notation and let Λ ∈ M denote the fact that Λ ∈ support(I_n) for some n.) For every Λ and for every k ∈ K(Λ), H_k is a hash function from X(Λ) to G(Λ), where G(Λ) is some finite non-empty set. We denote by G = ∪_{Λ∈M} G(Λ). We define a projection key function α : K → S, where S is the space of projection keys. Informally,
Hard subset membership problems. A subset membership problem $M$ specifies a collection $\{I_n\}_{n \in \mathbb{N}}$ of distributions, where for every $n$, $I_n$ is a probability distribution over instance descriptions. Each instance description $\Lambda$ specifies two finite non-empty sets $X, W \subseteq \{0,1\}^{poly(n)}$, and an NP-relation $R \subset X \times W$, such that the corresponding language $L \triangleq \{x : \exists w \text{ s.t. } (x, w) \in R\}$ is non-empty. For every $x \in X$ and $w \in W$, if $(x, w) \in R$, we say that $w$ is a witness for $x$. We use the following notation throughout the paper: for any instance description $\Lambda$ we let $X(\Lambda), W(\Lambda), R(\Lambda)$ and $L(\Lambda)$ denote the sets specified by $\Lambda$. Loosely speaking, a subset membership problem $M = \{I_n\}_{n \in \mathbb{N}}$ is said to be hard if for a random instance description $\Lambda \leftarrow I_n$, it is hard to distinguish random members of $L(\Lambda)$ from random non-members.

Definition 1 (Hard subset membership problem). Let $M = \{I_n\}_{n \in \mathbb{N}}$ be a subset membership problem as above. We say that $M$ is hard if the ensembles $\{\Lambda_n, x_n^0\}_{n \in \mathbb{N}}$ and $\{\Lambda_n, x_n^1\}_{n \in \mathbb{N}}$ are computationally indistinguishable, where $\Lambda_n \leftarrow I_n$, $x_n^0 \in_R L(\Lambda_n)$, and $x_n^1 \in_R X(\Lambda_n) \setminus L(\Lambda_n)$.⁴

Projective hash family. We next present the notion of a projective hash family with respect to a hard subset membership problem $M = \{I_n\}_{n \in \mathbb{N}}$. Let $H = \{H_k\}_{k \in K}$ be a collection of hash functions. $K$, referred to as the key space, consists of a set of keys such that for each instance description $\Lambda \in M$,⁵ there is a subset of keys $K(\Lambda) \subseteq K$ corresponding to $\Lambda$. For every $\Lambda$ and for every $k \in K(\Lambda)$, $H_k$ is a hash function from $X(\Lambda)$ to $G(\Lambda)$, where $G(\Lambda)$ is some finite non-empty set. We denote by $G = \bigcup_{\Lambda \in M} G(\Lambda)$. We define a projection key function $\alpha : K \to S$, where $S$ is the space of projection keys. Informally, a family $(H, K, S, \alpha, G)$ is a projective hash family for $M$ if for every instance description $\Lambda \in M$ and for every $x \in L(\Lambda)$, the projection key $s = \alpha(k)$ uniquely determines $H_k(x)$. (We stress that the projection key $s = \alpha(k)$ is only guaranteed to determine $H_k(x)$ for $x \in L(\Lambda)$, and nothing is guaranteed for $x \in X(\Lambda) \setminus L(\Lambda)$.)

Definition 2 (Projective hash family). $(H, K, S, \alpha, G)$ is a projective hash family for a subset membership problem $M$ if for every instance description $\Lambda \in M$ there is a well defined (not necessarily efficient) function $f$ such that for every $x \in L(\Lambda)$ and every $k \in K(\Lambda)$, $f(x, \alpha(k)) = H_k(x)$.

Efficient projective hash family. We say that a projective hash family is efficient if there exist polynomial time algorithms for: (1) sampling a key $k \in_R K(\Lambda)$ given $\Lambda$; (2) computing a projection $\alpha(k)$ from $\Lambda$ and $k \in K(\Lambda)$; (3) computing $H_k(x)$ from $\Lambda$, $k \in K(\Lambda)$ and $x \in X(\Lambda)$; and (4) computing $H_k(x)$ from $\Lambda$, $(x, w) \in R(\Lambda)$ and $\alpha(k)$, where $k \in K(\Lambda)$. Notice that this gives two ways to compute $H_k(x)$: either by knowing the hash key $k$, or by knowing the projection key $\alpha(k)$ and a witness $w$ for $x$.

⁴ Note that this hardness requirement also implies that it is hard to distinguish between a random element $x \in_R L(\Lambda)$ and a random element $x \in_R X(\Lambda)$. We will use this fact in the proof of Theorem 1.

⁵ We abuse notation and let $\Lambda \in M$ denote the fact that $\Lambda \in \mathrm{support}(I_n)$ for some $n$.
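Before turning to smoothness, it may help to see the classic DDH-based example of an efficient projective hash family (in the style of the [CS02] construction for the language $L = \{(g_0^r, g_1^r)\}$) in code. This is a toy sketch with tiny parameters, and all names are ours; it exhibits exactly the two evaluation paths listed above.

```python
import random

# Toy group as before: p = 2q + 1, with g0, g1 of prime order q.
q, p = 1019, 2039
g0, g1 = 4, 9

def keygen():
    """A hash key k = (k0, k1), uniform in Z_q x Z_q."""
    return (random.randrange(q), random.randrange(q))

def alpha(k):
    """Projection key alpha(k) = g0^k0 * g1^k1."""
    return (pow(g0, k[0], p) * pow(g1, k[1], p)) % p

def hash_full_key(k, x):
    """H_k(x) = x[0]^k0 * x[1]^k1, computable for any x in X."""
    return (pow(x[0], k[0], p) * pow(x[1], k[1], p)) % p

def hash_projection(s, r):
    """For x = (g0^r, g1^r) in L with witness r: H_k(x) = alpha(k)^r."""
    return pow(s, r, p)

# The projection property: both evaluation paths agree on members of L.
r = random.randrange(1, q)
x = (pow(g0, r, p), pow(g1, r, p))   # x in L with witness r
k = keygen()
assert hash_full_key(k, x) == hash_projection(alpha(k), r)
```

For $x = (g_0^{r_0}, g_1^{r_1})$ with $r_0 \neq r_1$, the value $H_k(x) = g_0^{r_0 k_0} g_1^{r_1 k_1}$ remains uniformly distributed even given $\alpha(k)$, which is the smoothness property that the next definition formalizes.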
Y-smooth projective hash family. Let $Y$ be any function from instance descriptions $\Lambda \in M$ to subsets $Y(\Lambda) \subseteq X(\Lambda) \setminus L(\Lambda)$. Loosely speaking, a projective hash family for $M$ is Y-smooth if for every instance description $\Lambda = (X, W, R)$, for every $x \in Y(\Lambda)$, and for a random $k \in_R K(\Lambda)$, the projection key $\alpha(k)$ reveals (almost) nothing about $H_k(x)$.

Definition 3 (Y-smooth projective hash family). A projective hash family $(H, K, S, \alpha, G)$ for a subset membership problem $M$ is said to be Y-smooth if for every (even maliciously chosen) instance description $\Lambda = (X, W, R)$ and every $x \in Y(\Lambda)$, the random variables $(\alpha(k), H_k(x))$ and $(\alpha(k), g)$ are statistically indistinguishable, where $k \in_R K(\Lambda)$ and $g \in_R G(\Lambda)$.⁶

A Y-smooth projective hash family thus has the property that a projection of a (random) key enables the computation of $H_k(x)$ for $x \in L$, but gives almost no information about the value of $H_k(x)$ for $x \in Y(\Lambda)$.

Remark. This definition of a Y-smooth projective hash family differs from the original definition proposed in [CS02] in two ways. First, it requires the smoothness property to hold against maliciously chosen instance descriptions $\Lambda$, whereas in [CS02] the smoothness is only with respect to $\Lambda \in M$. Second, it requires the smoothness property to hold with respect to every $x \in Y$, whereas in [CS02] the smoothness condition is required to hold for randomly chosen $x \in_R X \setminus L$.

The main reason for our divergence from the original definition in [CS02] is that we need to cope with maliciously chosen $\Lambda$. We would like to set $Y = X \setminus L$ (as in [CS02]), and construct an $(X \setminus L)$-smooth projective hash family. However, we do not know how to construct such a family for which the smoothness condition holds for every (even maliciously chosen) $\Lambda$.⁷ Therefore, we relax our smoothness requirement and require only Y-smoothness, for some $Y \subseteq X \setminus L$. In both our constructions of Y-smooth projective hash families, $Y(\Lambda) \subset X(\Lambda) \setminus L(\Lambda)$ for maliciously chosen $\Lambda$, and $Y(\Lambda) = X(\Lambda) \setminus L(\Lambda)$ for every honestly chosen $\Lambda \in M$. Jumping ahead, the latter will enable the (honest) receiver to choose $x_b \in_R L(\Lambda)$ and $x_{1-b} \in_R X(\Lambda) \setminus L(\Lambda)$ such that $x_{1-b}$ is also in $Y(\Lambda)$. This will enable the (honest) sender to be convinced of its security by checking that either $x_0$ or $x_1$ is in $Y(\Lambda)$, and it will enable the (honest) receiver to be convinced that a (dishonest) sender cannot guess the bit $b$, assuming the underlying subset membership problem is hard. (From now on the reader should think of $Y(\Lambda)$ as equal to $X(\Lambda) \setminus L(\Lambda)$ for every $\Lambda \in M$.)

Thus, we need a subset membership problem $M$ such that for every honestly chosen $\Lambda \in M$ it is easy to sample uniformly from both $L(\Lambda)$ and $X(\Lambda) \setminus L(\Lambda)$. On the other hand, for every (even maliciously chosen) $(\Lambda, x_0, x_1)$ it should be easy to verify that either $x_0 \in Y(\Lambda)$ or $x_1 \in Y(\Lambda)$. To this end we define the notion of a "Y-verifiably samplable" subset membership problem.

Definition 4 (Y-verifiably samplable subset membership problem). A subset membership problem $M = \{I_n\}_{n \in \mathbb{N}}$ is said to be Y-verifiably samplable if the following conditions hold.

1. Problem samplability: There exists a probabilistic polynomial-time algorithm that on input $1^n$, samples an instance $\Lambda = (X, W, R)$ according to $I_n$.
2. Member samplability: There exists a probabilistic polynomial-time algorithm that on input an instance description $\Lambda = (X, W, R) \in M$, outputs an element $x \in L$ together with its witness $w \in W$, such that the distribution of $x$ is statistically close to uniform on $L$.
3. Non-member samplability: There exists a probabilistic polynomial-time algorithm $A$ that given an instance description $\Lambda = (X, W, R) \in M$ and an element $x_0 \in X$, outputs an element $x_1 = A(\Lambda, x_0)$, such that if $x_0 \in_R L$ then the distribution of $x_1$ is statistically close to uniform on $X \setminus L$, and if $x_0 \in_R X$ then the distribution of $x_1$ is statistically close to uniform on $X$.
4. Y-verifiability: There exists a probabilistic polynomial-time algorithm $B$ that, given any triplet $(\Lambda, x_0, x_1)$, verifies that there exists a bit $b$ such that $x_b \in Y(\Lambda)$. This should hold even if $\Lambda$ is maliciously chosen. Specifically:
– For every $\Lambda$ and every $x_0, x_1$, if both $x_0 \notin Y(\Lambda)$ and $x_1 \notin Y(\Lambda)$, then $B(\Lambda, x_0, x_1) = 0$.
– For every honestly chosen $\Lambda \in M$ and every $x_0, x_1$, if there exists $b$ such that $x_b \in L(\Lambda)$ and $x_{1-b} \in \mathrm{support}(A(\Lambda, x_b))$, then $B(\Lambda, x_0, x_1) = 1$.

For simplicity, throughout the paper we do not distinguish between uniform and statistically close to uniform distributions. This is inconsequential.

⁶ We assume throughout this paper, without loss of generality, that a (maliciously chosen) $\Lambda$ has the same structure as an honestly chosen $\Lambda$.

⁷ We note that [CS02,GL03] did not deal with maliciously chosen $\Lambda$'s, and indeed the smoothness property of their constructions does not hold for maliciously chosen $\Lambda$'s.
3 Security of Oblivious Transfer

Our definition of oblivious transfer is similar to the ones considered in previous works on oblivious transfer in the Bounded Storage Model [DHRS04,CCM98]. A similar (somewhat weaker) definition was also used in [NP01] in the context of their DDH-based two-message oblivious transfer protocol.

In what follows we let $view_{\hat S}(\hat S(z), R(b))$ denote the view of a cheating sender $\hat S(z)$ after interacting with $R(b)$. This view consists of its input $z$, its random coin tosses, and the messages that it received from $R(b)$ during the interaction. Similarly, we let $view_{\hat R}(S(\gamma_0, \gamma_1), \hat R(z))$ denote the view of a cheating receiver $\hat R(z)$ after interacting with $S(\gamma_0, \gamma_1)$.

Definition 5 (Secure implementation of Oblivious Transfer). A two-party protocol $(S, R)$ is said to securely implement oblivious transfer if it is a protocol in which both the sender and the receiver are probabilistic polynomial-time machines that get as input a security parameter $n$ in unary representation. Moreover, the sender gets as input two strings $\gamma_0, \gamma_1 \in \{0,1\}^{\ell(n)}$, the receiver gets as input a choice bit $b \in \{0,1\}$, and the following conditions are satisfied:

– Functionality: If the sender and the receiver follow the protocol then for any security parameter $n$, any two input strings $\gamma_0, \gamma_1 \in \{0,1\}^{\ell(n)}$, and any bit $b$, the receiver outputs $\gamma_b$ whereas the sender outputs nothing.⁸

– Receiver's security: For any probabilistic polynomial-time adversary $\hat S$ executing the sender's part, for any security parameter $n$, and for any auxiliary input $z$ of size polynomial in $n$, the view that $\hat S(z)$ sees when the receiver tries to obtain the first message is computationally indistinguishable from the view it sees when the receiver tries to obtain the second message. That is,
$\{view_{\hat S}(\hat S(z), R(1^n, 0))\}_{n,z} \stackrel{c}{\equiv} \{view_{\hat S}(\hat S(z), R(1^n, 1))\}_{n,z}$

– Sender's security: For any deterministic (not necessarily polynomial-time) adversary $\hat R$ executing the receiver's part, for any security parameter $n$, for any auxiliary input $z$ of size polynomial in $n$, and for any $\gamma_0, \gamma_1 \in \{0,1\}^{\ell(n)}$, there exists a bit $b$ such that for every $\psi \in \{0,1\}^{\ell(n)}$, the view of $\hat R(z)$ when interacting with $S(1^n, \gamma_b, \psi)$, and the view of $\hat R(z)$ when interacting with $S(1^n, \gamma_0, \gamma_1)$, are statistically indistinguishable.⁹ That is,
$\{view_{\hat R}(S(1^n, \gamma_0, \gamma_1), \hat R(z))\}_{n,\gamma_0,\gamma_1,z} \stackrel{s}{\equiv} \{view_{\hat R}(S(1^n, \gamma_b, \psi), \hat R(z))\}_{n,\gamma_b,\psi,z}$

⁸ This condition is also referred to as the completeness condition.

⁹ We abuse notation by letting $S(1^n, \gamma_b, \psi)$ denote $S(1^n, \gamma_0, \psi)$ if $b = 0$, and letting it denote $S(1^n, \psi, \gamma_1)$ if $b = 1$.

Note that Definition 5 (similarly to the definitions in [DHRS04,NP01]) departs from the traditional, simulation-based, definition in that it handles the security of the sender and of the receiver separately. This results in a somewhat weaker security guarantee, with the main drawback being that neither the sender nor the receiver is actually guaranteed to "know" its own input. (This is unavoidable in two-message protocols using "standard" techniques.)

It is easy to show that Definition 5 implies simulatability for semi-honest adversaries (the proof is omitted due to lack of space). More importantly, Definition 5 also gives meaningful security guarantees in the face of malicious participants. In the case of a malicious sender, the guarantee is that the damage incurred by malicious participation is limited to "replacing" the input strings $\gamma_0, \gamma_1$ with a pair of strings that are somewhat "related" to the receiver's first message (without actually learning anything about the receiver's choice). In the case of a malicious receiver, Definition 5 can be shown to provide exponential-time simulation of the receiver's view of the interaction (similarly to the definition of [NP01]). In particular, the interaction gives no information to an unbounded receiver beyond the value of $\gamma_b$. (Again, the proof is omitted due to lack of space.)

4 Constructing 2-Round OT Protocols

Let $M = \{I_n\}_{n \in \mathbb{N}}$ be a hard subset membership problem which is Y-verifiably samplable, and let $(H, K, S, \alpha, G)$ be an efficient Y-smooth projective hash family for $M$. Recall that the Y-verifiably samplable condition of $M$ implies the existence of algorithms $A$ and $B$ as described in Section 2. We assume for simplicity that for any $n$ and for any $\Lambda \in I_n$, $G(\Lambda) = \{0,1\}^{\ell(n)}$, and that the two messages $\gamma_0, \gamma_1$, to be transferred in the OT protocol, are binary strings of length at most $\ell(n)$.

Let $n$ be the security parameter. Let $(\gamma_0, \gamma_1)$ be the input of the sender and let $b \in \{0,1\}$ be the input of the receiver.

R → S: The receiver chooses a random instance description $\Lambda = (X, W, R) \leftarrow I_n$. It then samples a random element $x_b \in_R L$ together with its corresponding witness $w_b$, using the member samplability algorithm, and invokes algorithm $A$ on input $(\Lambda, x_b)$ to obtain a random element $x_{1-b} \in X \setminus L$. It sends $(\Lambda, x_0, x_1)$.

S → R: The sender invokes algorithm $B$ on input $(\Lambda, x_0, x_1)$ to verify that there exists a bit $b$ such that $x_{1-b} \in Y(\Lambda)$. If $B$ outputs 0 then it aborts, and if $B$ outputs 1 then it chooses independently at random $k_0, k_1 \in_R K(\Lambda)$, and sends $\alpha(k_0)$ and $\alpha(k_1)$ along with $y_0 = \gamma_0 \oplus H_{k_0}(x_0)$ and $y_1 = \gamma_1 \oplus H_{k_1}(x_1)$.

R: The receiver retrieves $\gamma_b$ by computing $y_b \oplus H_{k_b}(x_b)$ using the projection key $\alpha(k_b)$ and the pair $(x_b, w_b)$.
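Putting the pieces together, the following self-contained Python sketch runs the whole 2-round protocol for the DDH instantiation, with the member/non-member samplers and the coordinate check playing the roles of the algorithms A and B. Two liberties, both ours: the parameters are toy-sized, and since $H_k(x)$ here is a group element rather than a bit string, each $\gamma$ is masked by XORing with a SHA-256 digest of $H_k(x)$ instead of with $H_k(x)$ itself.

```python
import hashlib, random

q, p = 1019, 2039          # toy parameters: p = 2q + 1, q prime
g0, g1 = 4, 9              # order-q generators

def mask(elem, n=16):
    """Derive an n-byte pad from a group element (instantiation detail)."""
    return hashlib.sha256(str(elem).encode()).digest()[:n]

def xor(a, b):
    return bytes(u ^ v for u, v in zip(a, b))

def receiver_round1(b):
    """R -> S: x_b in L with witness r0; x_{1-b} in X \\ L (algorithm A)."""
    r0 = random.randrange(1, q)
    r1 = (r0 + random.randrange(1, q)) % q
    xs = [None, None]
    xs[b]     = (pow(g0, r0, p), pow(g1, r0, p))
    xs[1 - b] = (pow(g0, r0, p), pow(g1, r1, p))
    return xs, r0

def sender_round2(xs, gamma0, gamma1):
    """S -> R: verify with B, then send (alpha(k_i), gamma_i XOR mask(H_{k_i}(x_i)))."""
    x0, x1 = xs
    if not (x0[0] == x1[0] and x0[1] != x1[1]):   # algorithm B's check
        raise ValueError("x0 and x1 could both be in L; aborting")
    reply = []
    for x, gamma in ((x0, gamma0), (x1, gamma1)):
        k0, k1 = random.randrange(q), random.randrange(q)
        s = (pow(g0, k0, p) * pow(g1, k1, p)) % p        # projection alpha(k)
        h = (pow(x[0], k0, p) * pow(x[1], k1, p)) % p    # H_k(x)
        reply.append((s, xor(gamma, mask(h))))
    return reply

def receiver_round3(reply, b, w):
    """R recovers gamma_b using only the projection key and its witness."""
    s, y = reply[b]
    return xor(y, mask(pow(s, w, p)))                    # H_k(x_b) = alpha(k)^w

xs, w = receiver_round1(b=1)
reply = sender_round2(xs, b"secret-message-0", b"secret-message-1")
assert receiver_round3(reply, b=1, w=w) == b"secret-message-1"
```

Note that the receiver never learns $k_0$ or $k_1$, and without a witness for $x_{1-b}$ the pad derived from $H_k(x_{1-b})$ is, by smoothness, essentially uniform from its point of view.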
We next prove that the above protocol is secure according to Definition 5. Intuitively, the receiver's security follows from the fact that $x_b$ is uniformly distributed in $L$, that $x_{1-b}$ is uniformly distributed in $X \setminus L$, and from the assumption that it is hard to distinguish random $L$ elements from random $X \setminus L$ elements. The sender's security follows from the assumption that $(H, K, S, \alpha, G)$ is a Y-smooth projective hash family for $M$, and from the assumption that one of $x_0$ or $x_1$ is in $Y(\Lambda)$ (otherwise, this will be detected by $B$ and the sender will abort).

Theorem 1. The above 2-round OT protocol is secure according to Definition 5, assuming $M$ is a Y-verifiably samplable hard subset membership problem, and assuming $(H, K, S, \alpha, G)$ is a Y-smooth projective hash family for $M$.

Proof. We start by proving the receiver's security. Assume for the sake of contradiction that there exists a (malicious) probabilistic polynomial-time sender $\hat S$ such that for infinitely many $n$'s there exists a polynomial-size auxiliary input $z_n$ such that $\hat S(z_n)$ can predict (with non-negligible advantage) the choice bit $b$ when interacting with $R(1^n, b)$. In what follows, we use $\hat S(z_n)$ to break the hardness of $M$, by distinguishing between $x \in_R L$ and $x \in_R X$. Given an instance description $\Lambda = (X, W, R) \leftarrow I_n$ and an element $x \in X$:

1. Choose at random a bit $b$ and let $x_b = x$.
2. Apply algorithm $A$ on input $(\Lambda, x_b)$ to obtain an element $x_{1-b}$.
3. Feed $\hat S(z_n)$ the message $(\Lambda, x_0, x_1)$, and obtain its prediction bit $b'$.
4. If $b' = b$ then predict "$x \in_R L$" and if $b' \neq b$ then predict "$x \in_R X$."

Notice that if $x_b \in_R L$ then $\hat S(z_n)$ will predict the bit $b$ with non-negligible advantage (this follows from our contradiction assumption). On the other hand, if $x_b \in_R X$ then $x_{1-b}$ is also uniformly distributed in $X$. In this case it is impossible (information theoretically) to predict $b$.

We now turn to prove the sender's security. Let $\hat R$ be any (not necessarily polynomial-time) malicious receiver, and for any $n \in \mathbb{N}$, let $z_n$ be any polynomial-size auxiliary information given to $\hat R$. Let $(\Lambda_n, x_0, x_1)$ be the first message sent by $\hat R(z_n)$. Our goal is to show that for every $n \in \mathbb{N}$ and for every $\gamma_0, \gamma_1 \in \{0,1\}^{\ell(n)}$, there exists $b \in \{0,1\}$ such that the random variables $view_{\hat R}(S(1^n, \gamma_0, \gamma_1), \hat R(z_n))$ and $view_{\hat R}(S(1^n, \gamma_b, \psi), \hat R(z_n))$ are statistically indistinguishable.

We assume without loss of generality that either $x_0 \in Y(\Lambda_n)$ or $x_1 \in Y(\Lambda_n)$. If this is not the case, the sender aborts the execution and $b$ can be set to either 0 or 1. Let $b$ be the bit satisfying $x_{1-b} \in Y(\Lambda_n)$. By the Y-smoothness property of the hash family, the random variables $(\alpha(k), H_k(x_{1-b}))$ and $(\alpha(k), g)$ are statistically indistinguishable, for a random $k \in_R K(\Lambda_n)$ and a random $g \in_R G(\Lambda_n)$. This implies that the random variables $(\alpha(k), \gamma_{1-b} \oplus H_k(x_{1-b}))$ and $(\alpha(k), g)$ are statistically indistinguishable, which implies that $view_{\hat R}(S(1^n, \gamma_0, \gamma_1), \hat R(z))$ and $view_{\hat R}(S(1^n, \gamma_b, \psi), \hat R(z))$ are statistically indistinguishable.

5 Constructing Smooth Projective Hash Families

We next present two constructions of Y-smooth projective hash families for hard subset membership problems which are Y-verifiably samplable: one based on the N'th Residuosity Assumption, and the other based on the Quadratic Residuosity Assumption together with the Extended Riemann Hypothesis. A key vehicle in both constructions is the notion of an $(\varepsilon, Y)$-universal projective hash family.

Definition 6 (Universal projective hash families). Let $M = \{I_n\}_{n \in \mathbb{N}}$ be any hard subset membership problem. A projective hash family $(H, K, S, \alpha, G)$
Science 330, 70 (2010); DOI: 10.1126/science.1191652

Isoprenoid Pathway Optimization for Taxol Precursor Overproduction in Escherichia coli

Parayil Kumaran Ajikumar,1,2 Wen-Hai Xiao,1 Keith E. J. Tyo,1 Yong Wang,3 Fritz Simeon,1 Effendi Leonard,1 Oliver Mucha,1 Too Heng Phon,2 Blaine Pfeifer,3* Gregory Stephanopoulos1,2*

Taxol (paclitaxel) is a potent anticancer drug first isolated from the Taxus brevifolia Pacific yew tree. Currently, cost-efficient production of Taxol and its analogs remains limited. Here, we report a multivariate-modular approach to metabolic-pathway engineering that succeeded in increasing titers of taxadiene (the first committed Taxol intermediate) to approximately 1 gram per liter (~15,000-fold) in an engineered Escherichia coli strain. Our approach partitioned the taxadiene metabolic pathway into two modules: a native upstream methylerythritol-phosphate (MEP) pathway forming isopentenyl pyrophosphate and a heterologous downstream terpenoid-forming pathway. Systematic multivariate search identified conditions that optimally balance the two pathway modules so as to maximize the taxadiene production with minimal accumulation of indole, which is an inhibitory compound found here.
We also engineered the next step in Taxol biosynthesis, a P450-mediated 5α-oxidation of taxadiene to taxadien-5α-ol. More broadly, the modular pathway engineering approach helped to unlock the potential of the MEP pathway for the engineered production of terpenoid natural products.

Taxol (paclitaxel) and its structural analogs are among the most potent and commercially successful anticancer drugs (1). Taxol was first isolated from the bark of the Pacific yew tree (2), and early-stage production methods required sacrificing two to four fully grown trees to secure sufficient dosage for one patient (3). Taxol's structural complexity limited its chemical synthesis to elaborate routes that required 35 to 51 steps, with a highest yield of 0.4% (4-6). A semisynthetic route was later devised in which the biosynthetic intermediate baccatin III, isolated from plant sources, was chemically converted to Taxol (7). Although this approach and subsequent plant cell culture-based production efforts have decreased the need for harvesting the yew tree, production still depends on plant-based processes (8), with accompanying limitations on productivity and scalability. These methods of production also constrain the number of Taxol derivatives that can be synthesized in the search for more efficacious drugs (9, 10).

Recent developments in metabolic engineering and synthetic biology offer new possibilities for the overproduction of complex natural products by optimizing more technically amenable microbial hosts (11, 12). The metabolic pathway for Taxol consists of an upstream isoprenoid pathway that is native to Escherichia coli and a heterologous downstream terpenoid pathway (fig. S1). The upstream methylerythritol-phosphate (MEP) or heterologous mevalonic acid (MVA) pathways can produce the two common building blocks, isopentenyl pyrophosphate (IPP) and dimethylallyl pyrophosphate (DMAPP), from which Taxol and other isoprenoid compounds are formed (12). Recent studies have highlighted the engineering of the above upstream pathways to support the biosynthesis of heterologous isoprenoids such as lycopene (13, 14), artemisinic acid (15, 16), and abietadiene (17, 18). The downstream taxadiene pathway has been reconstructed in E. coli and Saccharomyces cerevisiae together with the overexpression of upstream pathway enzymes, but to date titers have been limited to less than 10 mg/liter (19, 20).

The above rational metabolic engineering approaches examined separately either the upstream or the downstream terpenoid pathway, implicitly assuming that modifications are additive (a linear behavior) (13, 17, 21). Although this approach can yield moderate increases in flux, it generally ignores nonspecific effects, such as toxicity of intermediate metabolites, adverse cellular effects of the vectors used for expression, and hidden pathways and metabolites that may compete with the main pathway and inhibit the production of the desired product. Combinatorial approaches can overcome such problems because they offer the opportunity to broadly sample the parameter space and bypass these complex nonlinear interactions (21-23). However, combinatorial approaches require high-throughput screens, which are often not available for many desirable natural products (24). Considering the lack of a high-throughput screen for taxadiene (or other Taxol pathway intermediates), we resorted to a focused combinatorial approach, which we term "multivariate-modular pathway engineering."

1 Department of Chemical Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. 2 Chemical and Pharmaceutical Engineering Program, Singapore-MIT Alliance, 117546 Singapore. 3 Department of Chemical and Biological Engineering, Tufts
University, 4 Colby Street, Medford, MA 02155, USA.
*To whom correspondence should be addressed. E-mail: gregstep@ (G.S.); blaine.pfeifer@ (B.P.)

Fig. 1. Taxadiene pathway engineering: the native upstream MEP isoprenoid pathway, whose flux we targeted by increasing the expression of the reported rate-limiting genes (dxs, idi, ispD, and ispF), and, to channel the universal isoprenoid precursors toward Taxol, a synthetic operon of the downstream genes GGPP synthase (G) and taxadiene synthase (T) (37). Both pathways were placed under the control of inducible promoters in order to control their relative gene expression. In the E. coli metabolic network, the MEP isoprenoid pathway is initiated by the condensation of the precursors glyceraldehyde-3-phosphate (G3P) and pyruvate (PYR) from glycolysis. The Taxol pathway bifurcation starts from the universal isoprenoid precursors IPP and DMAPP to form geranylgeranyl diphosphate, and then taxadiene. The cyclic olefin taxadiene undergoes multiple rounds of stereospecific oxidations, acylations, and benzoylation to form the late intermediate Baccatin III and, after side chain assembly, ultimately Taxol.

In this approach, the overall pathway is partitioned into smaller modules, and the modules' expression levels are varied simultaneously (a multivariate search). This approach can identify an optimally balanced pathway while searching a small combinatorial space. Specifically, we partition the taxadiene-forming pathway into two modules separated at IPP, which is the key intermediate in terpenoid biosynthesis. The first module comprises an eight-gene, upstream, native (MEP) pathway, of which the expression of only four genes deemed to be rate-limiting was modulated, and the second module comprises a two-gene, downstream, heterologous pathway to taxadiene (Fig. 1). This modular approach allowed us to efficiently sample the main parameters affecting pathway flux without the need for a high-throughput screen and to unveil the role of the metabolite indole as an inhibitor of isoprenoid pathway activity. Additionally, the multivariate search revealed a highly nonlinear taxadiene flux landscape with a global maximum exhibiting a 15,000-fold increase in taxadiene production over the control, yielding 1.02 ± 0.08 g/liter (SD) taxadiene in fed-batch bioreactor fermentations.

We have further engineered the P450-based oxidation chemistry in Taxol biosynthesis in E. coli to convert taxadiene to taxadien-5α-ol and provide the basis for the synthesis of subsequent metabolites in the pathway by means of similar cytochrome P450 (CYP450) oxidation chemistry. Our engineered strain improved taxadien-5α-ol production by 2400-fold over the state of the art with yeast (25). These advances unlock the potential of microbial processes for the large-scale production of Taxol or its derivatives and thousands of other valuable terpenoids.

The multivariate-modular approach in which various promoters and gene copy numbers are combined to modulate diverse expression levels of upstream and downstream pathways of taxadiene synthesis is schematically described in fig. S2. A total of 16 strains were constructed in order to widen the bottleneck of the MEP pathway as well as optimally balance it with the downstream taxadiene pathway (26). The dependence of taxadiene accumulation on the upstream pathway for constant values of the downstream pathway is shown in Fig. 2A, and the dependence on the downstream pathway for constant upstream pathway strength is shown in Fig. 2B (table S1, calculation of the upstream and
downstream pathway strength from gene copy number and promoter strength). As the upstream pathway expression increases in Fig. 2A from very low levels, taxadiene production also rises initially because of the increased supply of precursors to the overall pathway. However, after an intermediate value, further upstream pathway increases cannot be accommodated by the capacity of the downstream pathway. For constant upstream pathway expression (Fig. 2B), a maximum in downstream expression was similarly observed, owing, on the rising edge, to initial limiting of taxadiene production by low expression levels of the downstream pathway. At high (after peak) levels of downstream pathway expression, we were probably observing the negative effect on cell physiology of the high copy number.

These results demonstrate that dramatic changes in taxadiene accumulation can be obtained from changes within a narrow window of expression levels for the upstream and downstream pathways. For example, a strain containing an additional copy of the upstream pathway on its chromosome under Trc promoter control (strain 8) (Fig. 2A) produced 2000-fold more taxadiene than one expressing only the native MEP pathway (strain 1) (Fig. 2A).

Fig. 2. Optimization of taxadiene production through regulating the expression of the upstream and downstream modular pathways. (A) Response in taxadiene accumulation to changes in upstream pathway strengths for constant values of the downstream pathway. (B) Dependence of taxadiene on the downstream pathway for constant levels of upstream pathway strength. (C) Taxadiene response from strains (17 to 24) engineered with high upstream pathway overexpressions (6 to 100 a.u.) at two different downstream expressions (31 a.u. and 61 a.u.). (D) Modulation of a chromosomally integrated upstream pathway by using increasing promoter strength at two different downstream expressions (31 a.u. and 61 a.u.). (E) Genotypes of the 32 strain constructs whose taxadiene phenotype is shown in Fig. 2, A to D. E, E. coli K12 MG1655 ΔrecA ΔendA; EDE3, E. coli K12 MG1655 ΔrecA ΔendA with the DE3 T7 RNA polymerase gene in the chromosome; MEP, dxs-idi-ispDF operon; GT, GPPS-TS operon; TG, TS-GPPS operon; Ch1, 1 copy in chromosome; Trc, Trc promoter; T5, T5 promoter; T7, T7 promoter; p5, pSC101 plasmid; p10, p15A plasmid; and p20, pBR322 plasmid.
Furthermore, changing the order of the genes in the downstream synthetic operon from GT (GGPS-TS) to TG (TS-GGPS) resulted in a two- to threefold increase (strains 1 to 4 as compared with strains 5, 8, 11, and 14). Altogether, the engineered strains established that the MEP pathway flux can be substantial if an appropriate range of expression levels for the endogenous upstream and synthetic downstream pathway are searched simultaneously.

To provide ample downstream pathway strength while minimizing the plasmid-borne metabolic burden (27), two new sets of four strains each were engineered (strains 17 to 20 and 21 to 24), in which the downstream pathway was placed under the control of a strong promoter (T7) while keeping a relatively low number of five and 10 plasmid copies, respectively. The taxadiene maximum was maintained at high downstream strength (strains 21 to 24), whereas a monotonic response was obtained at the low downstream pathway strength (strains 17 to 20) (Fig. 2C). This observation prompted the construction of two additional sets of four strains each that maintained the same level of downstream pathway strength as before but expressed very low levels of the upstream pathway (strains 25 to 28 and 29 to 32) (Fig. 2D). Additionally, the operon of the upstream pathway of the latter strain set was chromosomally integrated (fig. S3). Not only was the taxadiene maximum recovered in these strains, albeit at very low upstream pathway levels, but a much greater taxadiene maximum was attained (~300 mg/liter). We believe that this significant increase can be attributed to a decrease in the cell's metabolic burden.

We next quantified the mRNA levels of 1-deoxy-D-xylulose-5-phosphate synthase (dxs) and taxadiene synthase (TS) (representing the upstream and downstream pathways, respectively) for the high-taxadiene-producing strains (25 to 32 and 17 and 22) that exhibited varying upstream and downstream pathway strengths (fig. S4, A and B), to verify that our predicted expression strengths were consistent with the actual pathway levels. We found that the dxs expression level correlates well with the upstream pathway strength. Similar correlations were found for the other genes of the upstream pathway: idi, ispD, and ispF (fig. S4, C and D). In downstream TS gene expression, an approximately twofold improvement was quantified as the downstream pathway strength increased from 31 to 61 arbitrary units (a.u.) (fig. S4B).

Metabolomic analysis of the previous strains led to the identification of a distinct metabolite by-product that inversely correlated with taxadiene accumulation (figs. S5 and S6). The corresponding peak in the gas chromatography-mass spectrometry (GC-MS) chromatogram was identified as indole through GC-MS, ¹H, and ¹³C nuclear magnetic resonance (NMR) spectroscopy studies (fig. S7). We found that taxadiene synthesis by strain 26 is severely inhibited by exogenous indole at indole levels higher than ~100 mg/liter (fig. S5B). Further increasing the indole concentration also inhibited cell growth, with the level of inhibition being very strain-dependent (fig. S5C). Although the biochemical mechanism of indole interaction with the isoprenoid pathway is presently unclear, the results in fig. S5 suggest a possible synergistic effect between indole and terpenoid compounds of the isoprenoid pathway in inhibiting cell growth. Without knowing the specific mechanism, it appears that strain 26 has mitigated indole's effect, and we carried this strain forward for further study.

In order to explore the
taxadiene-producing potential under controlled conditions for the engineered strains, fed-batch cultivations of the three highest taxadiene-accumulating strains (~60 mg/liter from strain 22; ~125 mg/liter from strain 17; and ~300 mg/liter from strain 26) were carried out in 1-liter bioreactors (Fig. 3). The fed-batch cultivation studies were carried out as liquid-liquid two-phase fermentation using a 20% (v/v) dodecane overlay. The organic solvent was introduced to prevent air stripping of secreted taxadiene from the fermentation medium, as indicated by preliminary findings (fig. S8). In defined media with controlled glycerol feeding, taxadiene productivity increased to 174 ± 5 mg/liter (SD), 210 ± 7 mg/liter (SD), and 1020 ± 80 mg/liter (SD) for strains 22, 17, and 26, respectively (Fig. 3A). Additionally, taxadiene production significantly affected the growth phenotype, acetate accumulation, and glycerol consumption [Fig. 3, B and D, and supporting online material (SOM) text]. Clearly, the high productivity and more robust growth of strain 26 allowed very high taxadiene accumulation. Further improvements should be possible through optimizing conditions in the bioreactor, balancing nutrients in the growth medium, and optimizing carbon delivery.

Having succeeded in engineering the biosynthesis of the "cyclase phase" of Taxol for high taxadiene production, we turned next to engineering the oxidation chemistry of Taxol biosynthesis. In this phase, hydroxyl groups are incorporated by oxygenation at seven positions on the taxane core structure, mediated by CYP450-dependent monooxygenases (28). The first oxygenation is the hydroxylation of the C5 position, followed by seven similar reactions en route to Taxol (fig. S1) (29). Thus, a key step toward engineering Taxol-producing microbes is the development of CYP450-based oxidation chemistry in vivo. The first oxygenation step is catalyzed by a CYP450, taxadiene 5α-hydroxylase, which is an unusual monooxygenase that catalyzes the hydroxylation reaction along with double-bond migration in the diterpene precursor taxadiene (Fig. 1).

In general, functional expression of plant CYP450 in E. coli is challenging (30) because of the inherent limitations of bacterial platforms, such as the absence of electron transfer machinery and CYP450-reductases (CPRs), and translational incompatibility of the membrane signal modules of CYP450 enzymes because of the lack of an endoplasmic reticulum.

Fig. 3. Fed-batch cultivation of engineered strains in a 1-liter bioreactor. Time courses of (A) taxadiene accumulation, (B) cell growth, (C) acetic acid accumulation, and (D) total substrate (glycerol) addition for strains 22, 17, and 26 during 5 days of fed-batch bioreactor cultivation in 1-liter bioreactor vessels under controlled pH and oxygen conditions with minimal media and 0.5% yeast extract. After glycerol depleted to ~0.5 to 1 g/liter in the fermentor, 3 g/liter of glycerol was introduced into the bioreactor during the fermentation. Data are means of two replicate bioreactors.

Recently, through transmembrane (TM) engineering and the generation of chimera enzymes of CYP450 and CPR, some plant CYP450s have been expressed in E. coli for the biosynthesis of functional molecules (15, 31). Still, every plant CYP450 has distinct TM signal
sequences and electron transfer characteristics from its reductase counterpart (32). Our initial studies focused on optimizing the expression of codon-optimized synthetic taxadiene 5α-hydroxylase by N-terminal TM engineering and generating chimera enzymes through translational fusion with the CPR redox partner from the Taxus species, Taxus CYP450 reductase (TCPR) (Fig. 4A) (29, 31, 33). One of the chimera enzymes generated, At24T5αOH-tTCPR, was highly efficient in carrying out the first oxidation step, resulting in more than 98% taxadiene conversion to taxadien-5α-ol and the byproduct 5(12)-oxa-3(11)-cyclotaxane (OCT) (fig. S9A). Compared with the other chimeric CYP450s, At24T5αOH-tTCPR yielded twofold higher (21 mg/liter) production of taxadien-5α-ol (Fig. 4B). Because of the functional plasticity of taxadiene 5α-hydroxylase with its chimeric CYP450 enzymes (At8T5αOH-tTCPR, At24T5αOH-tTCPR, and At42T5αOH-tTCPR), the reaction also yields a complex structural rearrangement of taxadiene into the cyclic ether OCT (fig. S9) (34). The byproduct accumulated in approximately equal amounts (~24 mg/liter from At24T5αOH-tTCPR) to the desired product taxadien-5α-ol.

The productivity of strain 26-At24T5αOH-tTCPR was significantly reduced relative to that of taxadiene production by the parent strain 26 (~300 mg/liter), with a concomitant increase in indole accumulation. No taxadiene accumulation was observed. Apparently, the introduction of an additional medium-copy plasmid (10-copy, p10T7) bearing the At24T5αOH-tTCPR construct disturbed the carefully engineered balance in the upstream and downstream pathway of strain 26 (fig. S10). Small-scale fermentations were carried out in bioreactors so as to quantify the alcohol production by strain 26-At24T5αOH-tTCPR. The time course profile of taxadien-5α-ol accumulation (Fig. 4C) indicates alcohol production of up to 58 ± 3 mg/liter (SD), with an equal amount of the OCT byproduct produced. The observed alcohol production was approximately 2400-fold higher than previous production in S. cerevisiae (25).

The MEP pathway is energetically balanced and thus overall more efficient in converting either glucose or glycerol to isoprenoids (fig. S11). Yet, during the past 10 years many attempts at engineering the MEP pathway in E. coli in order to increase the supply of the key precursors IPP and DMAPP for carotenoid (21, 35), sesquiterpenoid (16), and diterpenoid (17) overproduction met with limited success. This inefficiency was attributed to unknown regulatory effects associated specifically with the expression of the MEP pathway in E. coli (16). Here, we provide evidence that such limitations are correlated with the accumulation of the metabolite indole, owing to the non-optimal expression of the pathway, which inhibits the isoprenoid pathway activity. Taxadiene overproduction (under conditions of indole-formation suppression) establishes the MEP pathway as a very efficient route for biosynthesis of pharmaceutical and chemical products of the isoprenoid family (fig. S11). One simply needs to carefully balance the modular pathways, as suggested by our multivariate-modular pathway engineering approach.

For successful microbial production of Taxol, demonstration of the chemical decoration of the taxadiene core by means of CYP450-based oxidation chemistry is essential (28). Previous efforts to reconstitute partial Taxol pathways in yeast found CYP450 activity limiting (25), making the At24T5αOH-tTCPR activity levels an important step to debottleneck the late Taxol pathway. Additionally, the strategies
used to create At24T5αOH-tTCPR are probably applicable to the remaining monooxygenases that will require expression in E. coli. CYP450 monooxygenases constitute about one half of the 19 distinct enzymatic steps in the Taxol biosynthetic pathway. These genes show unusually high sequence similarity with each other (>70%) but low similarity (<30%) with other plant CYP450s (36), implying that these monooxygenases are amenable to similar engineering.

To complete the synthesis of a suitable Taxol precursor, baccatin III, six more hydroxylation reactions and other steps (including some that have not been identified) need to be effectively engineered. Although this is certainly a daunting task, the current study shows potential by providing the basis for the functional expression of two key steps, cyclization and oxygenation, in Taxol biosynthesis. Most importantly, by unlocking the potential of the MEP pathway, a new, more efficient route to terpenoid biosynthesis is capable of providing potential commercial production of microbially derived terpenoids for use as chemicals and fuels from renewable resources.

Fig. 4. Engineering Taxol P450 oxidation chemistry in E. coli. (A) TM engineering and construction of chimera protein from taxadien-5α-ol hydroxylase (T5αOH) and Taxus cytochrome P450 reductase (TCPR). The labels 1 and 2 represent the full-length proteins of T5αOH and TCPR, identified with 42 and 74 amino acid TM regions, respectively, and 3 represents chimera enzymes generated from three different TM-engineered T5αOH constructs [At8T5αOH, At24T5αOH, and At42T5αOH, constructed by fusing an 8-residue synthetic peptide MALLLAVF (A) to 8, 24, and 42 AA truncated T5αOH] through a translational fusion with 74 AA truncated TCPR (tTCPR) by use of linker peptide GSTGS. (B) Functional activity of At8T5αOH-tTCPR, At24T5αOH-tTCPR, and At42T5αOH-tTCPR constructs transformed into taxadiene-producing strain 26, measured as taxadien-5α-ol production (mg equivalent of taxadiene/liter). Data are mean ± SD for three replicates. (C) Time course profile of taxadien-5α-ol accumulation and growth profile of the strain 26-At24T5αOH-tTCPR fermented in a 1-liter bioreactor. Data are mean of two replicate bioreactors.

References and Notes
1. D. G. Kingston, Phytochemistry 68, 1844 (2007).
2. M. C. Wani, H. L. Taylor, M. E. Wall, P. Coggon, A. T. McPhail, J. Am. Chem. Soc. 93, 2325 (1971).
3. M. Suffness, M. E. Wall, in Taxol: Science and Applications, M. Suffness, Ed. (CRC, Boca Raton, FL, 1995), pp. 3-26.
4. K. C. Nicolaou et al., Nature 367, 630 (1994).
5. R. A. Holton et al., J. Am. Chem. Soc. 116, 1597 (1994).
6. A. M. Walji, D. W. C. MacMillan, Synlett 18, 1477 (2007).
7. R. A. Holton, R. J. Biediger, P. D. Boatman, in Taxol: Science and Applications, M. Suffness, Ed. (CRC, Boca Raton, FL, 1995), pp. 97-119.
8. D. Frense, Appl. Microbiol. Biotechnol. 73, 1233 (2007).
9. S. C. Roberts, Nat. Chem. Biol. 3, 387 (2007).
10. J. Goodman, V. Walsh, The Story of Taxol: Nature and Politics in the Pursuit of an Anti-Cancer Drug (Cambridge Univ. Press, Cambridge, 2001).
11. K. E. Tyo, H. S. Alper, G. N. Stephanopoulos, Trends Biotechnol. 25, 132 (2007).
12. P. K. Ajikumar et al., Mol. Pharm. 5, 167 (2008).
13. W. R. Farmer, J. C. Liao, Nat. Biotechnol. 18, 533 (2000).
14. H. Alper, K. Miyaoku, G. Stephanopoulos, Nat. Biotechnol. 23, 612 (2005).
15. M. C. Chang, J. D. Keasling, Nat. Chem. Biol. 2, 674 (2006).
16. V. J. Martin, D. J. Pitera, S. T. Withers, J. D. Newman, J. D. Keasling, Nat. Biotechnol. 21, 796 (2003).
17. D. Morrone et al., Appl. Microbiol. Biotechnol. 85, 1893 (2010).
18. E. Leonard et al., Proc. Natl. Acad. Sci. U.S.A. 107, 13654 (2010).
19. Q. Huang, C. A. Roessner, R. Croteau, A. I. Scott, Bioorg. Med. Chem. 9, 2237 (2001).
20. B. Engels, P. Dahm, S. Jennewein, Metab. Eng. 10, 201 (2008).
21. L. Z. Yuan, P. E. Rouvière, R. A. LaRossa, W. Suh, Metab. Eng. 8, 79 (2006).
22. Y. S. Jin, G. Stephanopoulos, Metab. Eng. 9, 337 (2007).
23. H. H. Wang et al., Nature 460, 894 (2009).
24. D. Klein-Marcuschamer, P. K. Ajikumar, G. Stephanopoulos, Trends Biotechnol. 25, 417 (2007).
25. J. M. Dejong et al., Biotechnol. Bioeng. 93, 212 (2006).
26. Materials and methods are available as supporting material on Science Online.
27. K. L. Jones, S. W. Kim, J. D. Keasling, Metab. Eng. 2, 328 (2000).
28. R. Kaspera, R. Croteau, Phytochem. Rev. 5, 433 (2006).
29. S. Jennewein, R. M. Long, R. M. Williams, R. Croteau, Chem. Biol. 11, 379 (2004).
30. M. A. Schuler, D. Werck-Reichhart, Annu. Rev. Plant Biol. 54, 629 (2003).
31. E. Leonard, M. A. Koffas, Appl. Environ. Microbiol. 73, 7246 (2007).
32. D. R. Nelson, Arch. Biochem. Biophys. 369, 1 (1999).
33. S. Jennewein et al., Biotechnol. Bioeng. 89, 588 (2005).
34. D. Rontein et al., J. Biol. Chem. 283, 6067 (2008).
35. W. R. Farmer, J. C. Liao, Biotechnol. Prog. 17, 57 (2001).
36. S. Jennewein, M. R. Wildung, M. Chau, K. Walker, R. Croteau, Proc. Natl. Acad. Sci. U.S.A. 101, 9149 (2004).
37. K. Walker, R. Croteau, Phytochemistry 58, 1 (2001).
38. We thank R. Renu for extraction, purification, and characterization of the metabolite indole; C. Santos for providing the pACYCmelA plasmid, constructive suggestions during the experiments, and preparation of the manuscript; D. Dugar, H. Zhou, and X. Huang for helping with experiments and suggestions; and K. Hiller for data analysis and comments on the manuscript. We gratefully acknowledge support by the Singapore-MIT Alliance (SMA-2) and NIH, grant 1-R01-GM085323-01A1. B.P. acknowledges the Milheim Foundation Grant for Cancer Research 2006-17. A patent application that is based on the results presented here has been filed by MIT. P.K.A. designed the experiments and performed the engineering and screening of the strains; W.-H.X. performed screening of the strains, bioreactor experiments, and GC-MS analysis; F.S. carried out the quantitative PCR measurements; O.M. performed the extraction and characterization of the taxadiene standard; E.L., Y.W., and B.P. supported the cloning experiments; P.K.A., K.E.J.T., T.H.P., B.P., and G.S. analyzed the data; P.K.A., K.E.J.T., and G.S. wrote the manuscript; G.S. supervised the research; and all of the authors contributed to discussion of the research and edited and commented on the manuscript.

Supporting Online Material
/cgi/content/full/330/6000/70/DC1
Materials and Methods
SOM Text
Figs. S1 to S11
Tables S1 to S4
References

29 April 2010; accepted 9 August 2010
10.1126/science.1191652

Reactivity of the Gold/Water Interface During Selective Oxidation Catalysis

Bhushan N. Zope, David D. Hibbitts, Matthew Neurock, Robert J. Davis*

The selective oxidation of alcohols in aqueous phase over supported metal catalysts is facilitated by high-pH conditions. We have studied the mechanism of ethanol and glycerol oxidation to acids over various supported gold and platinum catalysts. Labeling experiments with ¹⁸O₂ and H₂¹⁸O demonstrate that oxygen atoms originating from hydroxide ions instead of molecular oxygen are incorporated into the alcohol during the oxidation reaction. Density functional theory calculations suggest that the reaction path involves both
solution-mediated and metal-catalyzed elementary steps. Molecular oxygen is proposed to participate in the catalytic cycle not by dissociation to atomic oxygen but by regenerating hydroxide ions formed via the catalytic decomposition of a peroxide intermediate.

The selective oxidation of alcohols with molecular oxygen over gold (Au) catalysts in liquid water offers a sustainable, environmentally benign alternative to traditional processes that use expensive inorganic oxidants and harmful organic solvents (1, 2). These catalytic transformations are important to the rapidly developing industry based on the conversion of biorenewable feedstocks to higher-valued chemicals (3, 4) as well as the current production of petrochemicals. Although gold is the noblest of metals (5), the water/Au interface provides a reaction environment that enhances its catalytic performance. We provide here direct evidence for the predominant reaction path during alcohol oxidation at high pH that includes the coupling of both solution-mediated and metal-catalyzed elementary steps.

Alcohol oxidation catalyzed by Pt-group metals has been studied extensively, although the precise reaction path and extent of O₂ contribution are still under debate (4, 6-8). The mechanism for the selective oxidation of alcohols in liquid water over Au catalysts remains largely unknown (6, 9), despite a few recent studies with organic solvents (10-12). In general, supported Au nanoparticles are exceptionally good catalysts for the aerobic oxidation of diverse reagents ranging from simple molecules such as CO and H₂ (13) to more complex substrates such as hydrocarbons and alcohols (14). Au catalysts are also substrate-specific, highly selective, stable against metal leaching, and resistant to overoxidation by O₂ (6, 15, 16). The active catalytic species has been suggested to be anionic Au species (17), cationic Au species (18, 19), and neutral Au metal particles (20). Moreover, the size and structure of Au nanoparticles (21, 22) as well as the interface of these particles with the support (23) have also been claimed to be important for catalytic activity. For the well-studied CO oxidation reaction, the presence of water vapor increases the observed rate of the reaction (24-26). Large metallic Au particles and Au metal powder, which are usually considered to be catalytically inert, have considerable oxidation activity under aqueous conditions at high pH (27, 28). We provide insights into the active intermediates and the mechanism for alcohol oxidation in aqueous media derived from experimental kinetic studies on the oxidation of glycerol and ethanol with isotopically labeled O₂ and H₂O over supported Au and Pt catalysts, as well as ab initio density functional theory calculations on ethanol oxidation over metal surfaces.

Previous studies indicate that alcohol oxidation over supported metal catalysts (Au, Pt, and Pd) proceeds by dehydrogenation to an aldehyde or ketone intermediate, followed by oxidation to the acid product (Eq. 1):

RCH₂OH --(O₂, catalyst)--> RCH=O --(O₂, catalyst)--> RCOOH    (1)

Hydroxide ions play an important role during oxidation; the product distribution depends on pH, and little or no activity is seen over Au catalysts without added base. We studied Au particles of various sizes (average diameter ranging from 3.5 to 10 nm) on different supports (TiO₂ and C) as catalysts for alcohol oxidation and compared them to Pt and Pd particles supported on C. The oxidation of glycerol (HOCH₂CHOHCH₂OH) to glyceric (HOCH₂CHOHCOOH) and glycolic (HOCH₂COOH) acids occurred at a
turnover frequency (TOF) of 6.1 and 4.9 s⁻¹ on Au/C and Au/TiO₂, respectively, at high pH (>13), whereas the TOF on supported Pt and Pd (1.6 and 2.2 s⁻¹, respectively) was slightly lower at otherwise identical conditions (Table 1). For these Au catalysts, particle size and support composition had negligible effect on the rate or selectivity. In the absence of base, the glycerol oxidation rate was much lower over the Pt and Pd catalysts, and no conversion was observed over the Au catalysts (Table 1). Moreover, the products detected over Pt and Pd in the absence of base are primarily the intermediate aldehyde and ketone, rather than acids.

Department of Chemical Engineering, University of Virginia, 102 Engineers' Way, Post Office Box 400741, Charlottesville, VA 22904-4741, USA.
*To whom correspondence should be addressed. E-mail: rjd4f@
Conceptual Density Functional Theory
P. Geerlings,*,† F. De Proft,† and W. Langenaeker‡
Eenheid Algemene Chemie, Faculteit Wetenschappen, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium, and Department of Molecular Design and Chemoinformatics, Janssen Pharmaceutica NV, Turnhoutseweg 30, B-2340 Beerse, Belgium. Received April 2, 2002.
* Corresponding author (telephone +32.2.629.33.14; fax +32.2.629.33.17; E-mail pgeerlin@vub.ac.be). † Vrije Universiteit Brussel. ‡ Janssen Pharmaceutica NV.
Contents
I. Introduction: Conceptual vs Fundamental and Computational Aspects of DFT
II. Fundamental and Computational Aspects of DFT
  A. The Basics of DFT: The Hohenberg-Kohn Theorems
  B. DFT as a Tool for Calculating Atomic and Molecular Properties: The Kohn-Sham Equations
  C. Electronic Chemical Potential and Electronegativity: Bridging Computational and Conceptual DFT
III. DFT-Based Concepts and Principles
  A. General Scheme: Nalewajski's Charge Sensitivity Analysis
  B. Concepts and Their Calculation
    1. Electronegativity and the Electronic Chemical Potential
    2. Global Hardness and Softness
    3. The Electronic Fukui Function, Local Softness, and Softness Kernel
    4. Hardness and the Hardness Kernel
    5. The Molecular Shape Function: Similarity
    6. The Nuclear Fukui Function and Its Derivatives
    7. Spin-Polarized Generalizations
    8. Solvent Effects
    9. Time Evolution of Reactivity Indices
  C. Principles
    1. Sanderson's Electronegativity Equalization Principle
    2. Pearson's Hard and Soft Acids and Bases Principle
    3. The Maximum Hardness Principle
IV. Applications
  A. Atoms and Functional Groups
  B. Molecular Properties
    1. Dipole Moment, Hardness, Softness, and Related Properties
    2. Conformation
    3. Aromaticity
  C. Reactivity
    1. Introduction
    2. Comparison of Intramolecular Reactivity Sequences
V. VI. VII. VIII. IX.
Sequence analysis and gene content of PMTV RNA 3
Scottish Crop Research Institute, Invergowrie, Dundee DD2 5DA, United Kingdom; and *Department of Biological Sciences, University of Dundee, Dundee DD1 4HN, United Kingdom. Received July 6, 1994; accepted September 30, 1994.
The complete sequence of the 2315 nucleotides in RNA 3 of potato mop-top furovirus (PMTV) isolate T was obtained by analysis of cDNA clones and by direct RNA sequencing. The sequence contains an open reading frame for the coat protein (20K) terminated by an amber codon, followed by an in-phase coding region for an additional 47K. PMTV therefore resembles soil-borne wheat mosaic (SBWMV) and beet necrotic yellow vein (BNYVV) viruses (two other fungus-transmitted viruses with rod-shaped particles) in having a coat protein-readthrough product. Comparison of the 3' untranslated region
IEEE 802.11n Draft 2.0 Wireless LAN Broadband Router
Quick Installation Guide (Q.I.G.)
Version 1.0 / April 2007

Multi-Language QIG on the CD
===============================
Česky: The Czech quick installation guide can be found on the enclosed driver CD.
Deutsch: The German quick installation guide is enclosed on the driver CD.
Español: The Spanish quick installation guide is included on the CD.
Français: The French quick installation guide is enclosed on the CD.
Italiano: The Italian quick installation guide is included on the CD.
Magyar: The Hungarian installation guide can be found on the enclosed CD.
Nederlands: The Dutch quick installation guide can be found on the enclosed CD.
Polski: The short installation manual in Polish is on the enclosed CD.
Português: The Portuguese quick installation guide is included on the CD.
Русский: The Russian quick installation guide can be found on the enclosed CD.
Türkçe: The Turkish quick installation guide can be found on the CD that comes with the product.
Romana: The CD includes the quick installation guide in Romanian.

Copyright © by Edimax Technology Co., LTD. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual or otherwise, without the prior written permission of this company.

This company makes no representations or warranties, either expressed or implied, with respect to the contents hereof and specifically disclaims any warranties, merchantability, or fitness for any particular purpose. Any software described in this manual is sold or licensed "as is". Should the programs prove defective following their purchase, the buyer (and not this company, its distributor, or its dealer) assumes the entire cost of all necessary servicing, repair, and any incidental or consequential damages resulting from any defect in the software. Further, this company reserves the right to revise this publication and to make changes from time to time in the contents hereof without obligation to notify any person of such revision or changes.

The product you have purchased and the setup screens may appear slightly different from those shown in this QIG. For more detailed information about this product, please refer to the User's Manual on the CD-ROM. The software and specifications are subject to change without notice. Please visit our web site for updates. All rights reserved, including all brand and product names mentioned in this manual, which are trademarks and/or registered trademarks of their respective holders.

Linux Open Source Code
Certain Edimax products include software code developed by third parties, including software code subject to the GNU General Public License ("GPL") or GNU Lesser General Public License ("LGPL"). Please see the GNU () and LGPL () Web sites to view the terms of each license. The GPL Code and LGPL Code used in Edimax products are distributed without any warranty and are subject to the copyrights of their authors. For details, see the GPL Code and LGPL Code licenses.
You can download the firmware files from the Edimax website under the "Download" page.

1 Introduction

1-1 Package Contents

Before you start to use this router, please check if anything is missing from the package, and contact your dealer of purchase to claim the missing items:

□ Broadband router (BR-6504N, 1 pcs)
□ Quick installation guide (1 pcs)
□ User manual CD-ROM (1 pcs)
□ 3dBi RP-SMA detachable antenna (3 pcs)
□ Holding base (1 pcs)
□ 12V switching power adapter (1 pcs)

1-2 Front Panel

LED Name      Light Status   Description
PWR           On             Router is switched on and correctly powered
WLAN          On             Wireless network is switched on or WPS mode is on
WLAN          Off            Wireless network is switched off
WLAN          Flashing       Wireless LAN activity (transferring data)
WAN 10/100M   On             WAN port (Internet) is running at 100Mbps
WAN 10/100M   Off            WAN port (Internet) is running at 10Mbps
WAN 10/100M   Flashing       WAN activity (transferring data)
WAN LNK/ACT   On             WAN port is connected
WAN LNK/ACT   Off            WAN port is not connected
WAN LNK/ACT   Flashing       WAN activity (transferring data)
LAN 10/100M   On             LAN port is running at 100Mbps
LAN 10/100M   Off            LAN port is running at 10Mbps
LAN LNK/ACT   On             LAN port is connected
LAN LNK/ACT   Off            LAN port is not connected
LAN LNK/ACT   Flashing       LAN activity (transferring data)

1-3 Back Panel

Item Name     Description
Antenna       3dBi detachable antenna
Power         Power connector, connects to the power adapter
Reset / WPS   Resets the router to factory default settings (clears all settings) or starts the WPS function. Press this button and hold for 20 seconds to clear all settings; press it for less than 20 seconds to start the WPS function.
1 - 4         Local Area Network (LAN) ports 1 to 4
WAN           Wide Area Network (WAN / Internet) port

The BR-6504N supports WPS. If your wireless adapter supports this function too, then after you finish the settings in Step 1 ~ Step 3, you can read "Step 4: WPS button" to simplify the wireless security setup with your wireless adapters.

2 Network Setup

Step 1: Getting Started

Instructions for using the router to share the Internet with multiple PCs. (Power on the modem and the router.)

Cabling installation:
1. Connect the Ethernet cable from the router's WAN port to the LAN port of the modem.
2. Connect another Ethernet cable from any of the LAN ports (1~4) on the router to the Ethernet socket on the PC.
1. Enter the router's default IP address "192.168.2.1" into your PC's web browser and press "Enter".
2. The login screen will appear. Enter the "User Name" and "Password" and click "OK" to log in. The default user name is "admin" and the password is "1234".
   Note: It is highly recommended to change the router's login settings and keep a record of them in a safe place.
3. The main page will appear; click "Quick Setup". The following example is for the "PPPoE" WAN setting.
4. Select "(GMT) Greenwich Mean Time: (your country or city)", then click the "Next" button.
5. Select "PPPoE xDSL"; the system will move to the next step.
6. Enter the "User Name" and "Password" that your ISP provided and leave the other fields unchanged (the "Service Name" can be blank). Click "OK" to save the settings, then reboot the router.
7. After the reboot, your router is ready for the Internet connection.

Note: Check the manual on the CD for more Internet connection types and other setting details.

Router Configuration – Cable Modem
1. The following example is for the U.K. Click "Quick Setup".
2. Select "(GMT) Greenwich Mean Time: London", then click the "Next" button.
3. Select "Cable Modem"; the system will move to the next step.
4. Enter the "Host Name" and "MAC Address" (the "Host Name" can be blank). The MAC address is provided by your ISP (e.g., NTL); alternatively, click the "Clone Mac Address" button if you are using the computer's MAC address. Confirm with your ISP which MAC address to use, then click the "OK" button to save the settings and reboot the router.
5. After the reboot, your router is ready for the Internet connection.

Step 4: WPS button
This wireless router supports two types of WPS: Push-Button Configuration (PBC) and PIN code. If you want to use PBC, you have to push a specific button on the wireless client to start WPS mode, and switch this wireless router to WPS mode too. You can push the Reset/WPS button of this wireless router, or click the "Start PBC" button in the web configuration interface, to do this. If you want to use a PIN code, you have to know the PIN code of the wireless client and switch it to WPS mode, then provide the PIN code of the wireless client you wish to connect to this wireless router.

3 Advanced Setup

3-1 Change management password
The default password of the BR-6504N is 1234, and it is displayed on the login prompt when accessed from a web browser. There is a security risk if you do not change the default password, since everyone can see it.
This is very important when you have the wireless function enabled.

To change the password, please follow these instructions: Click the 'System' menu on the left of the web management interface, then click 'Password Settings', and the corresponding page will be displayed in your web browser.

Here are descriptions of the setup items:
Current Password: Input the current password here.
New Password: Input the new password here.
Confirmed Password: Input the new password here again.

When you finish, click 'Apply'. If you want to keep the original password unchanged, click 'Cancel'.

3-2 Configuration Backup and Restore
You can back up all configurations of this router to a file, so you can keep several copies of the router configuration for safety.

To back up or restore the router configuration, please follow these instructions: Click 'Tool' located at the upper-right corner of the web management interface, then click 'Configuration Tools' on the left of the web management interface, and the corresponding page will be displayed in your web browser.

Here are descriptions of the buttons:
Backup Settings: Press the 'Save...' button, and you will be prompted to download the configuration as a file. The default filename is 'default.bin'; you can save it under another filename for different versions, and keep it in a safe place.
Restore Settings: Press 'Browse...' to pick a previously saved configuration file from your computer, and then click 'Upload' to transfer the configuration file to the router. After the configuration is uploaded, the router's configuration will be replaced by the file you just uploaded.
Restore to Factory Default: Click this button to remove all settings you made, and restore the configuration of this router back to factory default settings.

3-3 Firmware Upgrade
The system software used by this router is called 'firmware'. Just as replacing an application on your computer with a new one equips the computer with new functions, you can use this firmware upgrade function to add new functions to your router, and even to fix bugs in this router.

To upgrade the firmware, please follow these instructions: Click 'Tool' located at the upper-right corner of the web management interface, then click 'Firmware Upgrade' on the left of the web management interface, and the corresponding page will be displayed in your web browser.

Click 'Next', and the upgrade page will be displayed. Click the 'Browse' button first; you will be prompted to provide the filename of the firmware upgrade file. Please download the latest firmware file from our website, and use it to upgrade your router.

After a firmware upgrade file is selected, click the 'Apply' button, and the router will start the firmware upgrade procedure automatically.
The procedure may take several minutes; please be patient.

3-4 System Reset
If you think the network performance is bad, or you find the behavior of the router strange, you can perform a router reset; sometimes it will solve the problem.

To do so, click 'Tool' located at the upper-right corner of the web management interface, then click 'Reset' on the left of the web management interface, and the reset page will be displayed in your web browser. Click 'Apply' to reset your router; it will be available again after a few minutes, please be patient.

Note: Check the manual on the CD for more Internet connection types and other setting details.

NOTE: Never interrupt the upgrade procedure by closing the web browser or physically disconnecting your computer from the router. If the firmware you uploaded is corrupt, the firmware upgrade will fail, and you may have to return this router to the dealer of purchase to ask for help. (The warranty is void if you interrupt the upgrade procedure.)

Federal Communication Commission Interference Statement
This equipment has been tested and found to comply with the limits for a Class B digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference in a residential installation. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instructions, may cause harmful interference to radio communications. However, there is no guarantee that interference will not occur in a particular installation. If this equipment does cause harmful interference to radio or television reception, which can be determined by turning the equipment off and on, the user is encouraged to try to correct the interference by one or more of the following measures:
1. Reorient or relocate the receiving antenna.
2. Increase the separation between the equipment and receiver.
3. Connect the equipment into an outlet on a circuit different from that to which the receiver is connected.
4. Consult the dealer or an experienced radio technician for help.

FCC Caution
This device and its antenna must not be co-located or operating in conjunction with any other antenna or transmitter.
This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation.
Any changes or modifications not expressly approved by the party responsible for compliance could void the user's authority to operate the equipment.

Federal Communications Commission (FCC) Radiation Exposure Statement
This equipment complies with the FCC radiation exposure limits set forth for an uncontrolled environment. In order to avoid the possibility of exceeding the FCC radio frequency exposure limits, human proximity to the antenna shall not be less than 2.5 cm (1 inch) during normal operation.

Federal Communications Commission (FCC) RF Exposure Requirements
SAR compliance has been established in laptop computer configurations with a PCMCIA slot on the side near the center, as tested in the application for certification, and can be used in laptop computers with substantially similar physical dimensions, construction, and electrical and RF characteristics.
Use in other devices such as PDAs or lap pads is not authorized. This transmitter is restricted for use with the specific antenna(s) tested in the application for certification. The antenna(s) used for this transmitter must not be co-located or operating in conjunction with any other antenna or transmitter.

R&TTE Compliance Statement
This equipment complies with all the requirements of Directive 1999/5/EC of the European Parliament and the Council of March 9, 1999 on radio equipment and telecommunication terminal equipment and the mutual recognition of their conformity (R&TTE). The R&TTE Directive repeals and replaces Directive 98/13/EEC (Telecommunications Terminal Equipment and Satellite Earth Station Equipment) as of April 8, 2000.

Safety
This equipment is designed with the utmost care for the safety of those who install and use it. However, special attention must be paid to the dangers of electric shock and static electricity when working with electrical equipment. All guidelines of this manual and of the computer manufacturer must therefore be followed at all times to ensure the safe use of the equipment.

EU Countries Intended for Use
The ETSI version of this device is intended for home and office use in Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, the Netherlands, Portugal, Spain, Sweden, and the United Kingdom. The ETSI version of this device is also authorized for use in EFTA member states: Iceland, Liechtenstein, Norway, and Switzerland.

EU Countries Not Intended for Use
None
Learning with Local and Global Consistency

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf
Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany
{firstname.secondname}@tuebingen.mpg.de

Abstract
We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data.

1 Introduction
We consider the general problem of learning from labeled and unlabeled data. Given a point set X = {x_1, ..., x_l, x_{l+1}, ..., x_n} and a label set L = {1, ..., c}, the first l points have labels {y_1, ..., y_l} ∈ L and the remaining points are unlabeled. The goal is to predict the labels of the unlabeled points. The performance of an algorithm is measured by the error rate on these unlabeled points only.

Such a learning problem is often called semi-supervised or transductive. Since labeling often requires expensive human labor, whereas unlabeled data is far easier to obtain, semi-supervised learning is very useful in many real-world problems and has recently attracted a considerable amount of research [10]. A typical application is web categorization, in which manually classified web pages are always a very small part of the entire web, and the number of unlabeled examples is large.

The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label. This argument is akin to that in [2, 3, 4, 10, 15] and is often called the cluster assumption [4, 10]. Note that the first assumption is local, whereas the second one is global. Orthodox supervised learning algorithms, such as k-NN, in general depend only on the first assumption of local consistency.

To illustrate the prior assumption of consistency underlying semi-supervised learning, let us consider a toy dataset generated according to a pattern of two intertwining moons in Figure 1(a). Every point should be similar to points in its local neighborhood, and furthermore, points in one moon should be more similar to each other than to points in the other moon.
The classification results given by a Support Vector Machine (SVM) with an RBF kernel and by k-NN are shown in Figures 1(b) and 1(c) respectively. According to the assumption of consistency, however, the two moons should be classified as shown in Figure 1(d).

[Figure 1: Classification on the two moons pattern. (a) toy data set with two labeled points; (b) classification result given by the SVM with an RBF kernel; (c) k-NN with k = 1; (d) ideal classification that we hope to obtain.]

The main differences between the various semi-supervised learning algorithms, such as spectral methods [2, 4, 6], random walks [13, 15], graph mincuts [3] and the transductive SVM [14], lie in their way of realizing the assumption of consistency. A principled approach to formalize the assumption is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure revealed by known labeled and unlabeled points. Here we propose a simple iteration algorithm to construct such a smooth function, inspired by the work on spreading activation networks [1, 11] and diffusion kernels [7, 8, 12], recent work on semi-supervised learning and clustering [2, 4, 9], and more specifically by the work of Zhu et al. [15]. The keynote of our method is to let every point iteratively spread its label information to its neighbors until a global stable state is achieved.

We organize the paper as follows: Section 2 presents the algorithm in detail and also discusses possible variants; Section 3 introduces a regularization framework for the method; Section 4 presents the experimental results for toy data, digit recognition and text classification; and Section 5 concludes this paper and points out future research.

2 Algorithm
Given a point set X = {x_1, ..., x_l, x_{l+1}, ..., x_n} ⊂ R^m and a label set L = {1, ..., c}, the first l points x_i (i ≤ l) are labeled as y_i ∈ L and the remaining points x_u (l+1 ≤ u ≤ n) are unlabeled. The goal is to predict the labels of the unlabeled points.

Let F denote the set of n × c matrices with nonnegative entries. A matrix F = [F_1^T, ..., F_n^T]^T ∈ F corresponds to a classification on the dataset X obtained by labeling each point x_i with the label y_i = arg max_{j≤c} F_ij. We can understand F as a vectorial function F : X → R^c which assigns a vector F_i to each point x_i. Define an n × c matrix Y ∈ F with Y_ij = 1 if x_i is labeled as y_i = j and Y_ij = 0 otherwise. Clearly, Y is consistent with the initial labels according to the decision rule. The algorithm is as follows:

1. Form the affinity matrix W defined by W_ij = exp(−||x_i − x_j||² / 2σ²) if i ≠ j and W_ii = 0.
2. Construct the matrix S = D^{−1/2} W D^{−1/2}, in which D is a diagonal matrix with its (i, i)-element equal to the sum of the i-th row of W.
3. Iterate F(t+1) = αSF(t) + (1 − α)Y until convergence, where α is a parameter in (0, 1).
4. Let F* denote the limit of the sequence {F(t)}. Label each point x_i with the label y_i = arg max_{j≤c} F*_ij.

This algorithm can be understood intuitively in terms of spreading activation networks [1, 11] from experimental psychology. We first define a pairwise relationship W on the dataset X with the diagonal elements being zero. We can think of a graph G = (V, E) defined on X, where the vertex set V is just X and the edges E are weighted by W. In the second step, the weight matrix W of G is normalized symmetrically, which is necessary for the convergence of the following iteration. The first two steps are exactly the same as in spectral clustering [9]. During each iteration of the third step, each point receives information from its neighbors (first term) and also retains its initial information (second term).
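To make the four steps concrete, here is a minimal NumPy sketch of the iteration (not the authors' code): the label convention y[i] = class index or −1 for unlabeled, and the default values of σ and α, are illustrative assumptions.

```python
import numpy as np

def consistency_method(X, y, n_classes, sigma=1.0, alpha=0.99, n_iter=1000):
    """Sketch of the iteration algorithm; y[i] is a class index, or -1 if unlabeled."""
    n = X.shape[0]
    # Step 1: RBF affinity with zero diagonal (no self-reinforcement).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization S = D^{-1/2} W D^{-1/2}.
    # (Assumes every point has nonzero affinity to at least one other point.)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    # Initial label matrix Y: one-hot rows for labeled points, zero rows otherwise.
    Y = np.zeros((n, n_classes))
    for i, label in enumerate(y):
        if label >= 0:
            Y[i, label] = 1.0
    # Step 3: iterate F(t+1) = alpha * S F(t) + (1 - alpha) * Y.
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    # Step 4: label each point by the class from which it received most information.
    return F.argmax(axis=1)
```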
The parameter α specifies the relative amount of the information a point receives from its neighbors and from its initial label information. It is worth mentioning that self-reinforcement is avoided since the diagonal elements of the affinity matrix are set to zero in the first step. Moreover, the information is spread symmetrically since S is a symmetric matrix. Finally, the label of each unlabeled point is set to the class from which it has received the most information during the iteration process.

Let us show that the sequence {F(t)} converges and F* = (1 − α)(I − αS)^{−1}Y. Without loss of generality, suppose F(0) = Y. By the iteration equation F(t+1) = αSF(t) + (1 − α)Y used in the algorithm, we have

$$F(t) = (\alpha S)^{t} Y + (1-\alpha) \sum_{i=0}^{t-1} (\alpha S)^{i} Y. \qquad (1)$$

Since 0 < α < 1 and the eigenvalues of S lie in [−1, 1] (note that S is similar to the stochastic matrix P = D^{−1}W = D^{−1/2} S D^{1/2}),

$$\lim_{t\to\infty} (\alpha S)^{t} = 0, \quad \text{and} \quad \lim_{t\to\infty} \sum_{i=0}^{t-1} (\alpha S)^{i} = (I - \alpha S)^{-1}. \qquad (2)$$

Hence

$$F^{*} = \lim_{t\to\infty} F(t) = (1-\alpha)(I - \alpha S)^{-1} Y,$$

which for classification is clearly equivalent to

$$F^{*} = (I - \alpha S)^{-1} Y. \qquad (3)$$

Now we can compute F* directly, without iterations. This also shows that the iteration result does not depend on the initial value of the iteration. In addition, it is worth noticing that (I − αS)^{−1} is in fact a graph or diffusion kernel [7, 12].

Now we discuss some possible variants of this method. The simplest modification is to repeat the iteration after convergence, i.e. F* = (I − αS)^{−1} ··· (I − αS)^{−1} Y = (I − αS)^{−p} Y, where p is an arbitrary positive integer. In addition, since S is similar to P, we can consider substituting P for S in the third step, and then the corresponding closed form is F* = (I − αP)^{−1}Y. It is also interesting to replace S with P^T, the transpose of P. Then the classifying function is F* = (I − αP^T)^{−1}Y. It is not hard to see that this is equivalent to F* = (D − αW)^{−1}Y. We will compare these variants with the original algorithm in the experiments.

3 Regularization Framework
Here we develop a regularization framework for the above iteration algorithm. The cost function associated with F is defined to be

$$Q(F) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^{2} + \mu \sum_{i=1}^{n} \| F_i - Y_i \|^{2}, \qquad (4)$$

where µ > 0 is the regularization parameter. Then the classifying function is

$$F^{*} = \arg\min_{F \in \mathcal{F}} Q(F). \qquad (5)$$

The first term of the right-hand side in the cost function is the smoothness constraint, which means that a good classifying function should not change too much between nearby points. The second term is the fitting constraint, which means a good classifying function should not change too much from the initial label assignment. The trade-off between these two competing constraints is captured by the positive parameter µ. Note that the fitting constraint contains labeled as well as unlabeled data.

We can understand the smoothness term as the sum of the local variations, i.e. the local changes of the function between nearby points. As we have mentioned, the points with their pairwise relationships can be thought of as an undirected weighted graph, the weights of which represent the pairwise relationships. The local variation is then in fact measured on each edge. We do not simply define the local variation on an edge by the difference of the function values on the two ends of the edge. The smoothness term essentially splits the function value at each point among the edges attached to it before computing the local changes, and the value assigned to each edge is proportional to its weight.

Differentiating Q(F) with respect to F, we have

$$\left.\frac{\partial Q}{\partial F}\right|_{F=F^{*}} = F^{*} - S F^{*} + \mu (F^{*} - Y) = 0,$$

which can be transformed into

$$F^{*} - \frac{1}{1+\mu} S F^{*} - \frac{\mu}{1+\mu} Y = 0.$$

Let us introduce two new variables, α = 1/(1 + µ) and β = µ/(1 + µ). Note that α + β = 1. Then

$$(I - \alpha S) F^{*} = \beta Y.$$

Since I − αS is invertible, we have

$$F^{*} = \beta (I - \alpha S)^{-1} Y, \qquad (6)$$

which recovers the closed form expression of the above iteration algorithm.
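Because (3) and the variants above are plain linear systems, F* can also be computed with a single solve instead of iterating. A small sketch under the same assumptions as the previous one (W is the zero-diagonal affinity matrix, Y the initial label matrix):

```python
import numpy as np

def closed_form(W, Y, alpha=0.99):
    """F* = (I - alpha S)^{-1} Y of eq. (3), plus two variants, via linear solves."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    I = np.eye(W.shape[0])
    F_star = np.linalg.solve(I - alpha * S, Y)          # eq. (3)
    P = np.diag(1.0 / d) @ W                            # row-stochastic P = D^{-1} W
    F_p = np.linalg.solve(I - alpha * P, Y)             # variant (I - alpha P)^{-1} Y
    F_w = np.linalg.solve(np.diag(d) - alpha * W, Y)    # variant (D - alpha W)^{-1} Y
    return F_star, F_p, F_w
```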
Similarly, we can develop the optimization frameworks for the variants F* = (I − αP)^{−1}Y and F* = (D − αW)^{−1}Y. We omit the discussions due to lack of space.

4 Experiments
We used k-NN and one-vs-rest SVMs as baselines, and compared our method to its two variants: (1) F* = (I − αP)^{−1}Y; and (2) F* = (D − αW)^{−1}Y. We also compared to Zhu et al.'s harmonic Gaussian field method coupled with Class Mass Normalization (CMN) [15], which is closely related to ours. To the best of our knowledge, there is no reliable approach for model selection if only very few labeled points are available. Hence we let all algorithms use their respective optimal parameters, except that the parameter α used in our methods and their variants was simply fixed at 0.99.

[Figure 2: Classification on the pattern of two moons. The convergence process of our iteration algorithm with t increasing from 1 to 400 is shown from (a) to (d). Note that the initial label information is diffused along the moons.]

[Figure 3: The real-valued classifying function becomes flatter and flatter with respect to the two moons pattern with increasing t. Note that two clear moons emerge in (d).]

[Figure 4: Smooth classification results given by supervised classifiers with the global consistency: (a) the classification result given by the SVM with an RBF kernel; (b) smoothing the result of the SVM using the consistency method.]

4.1 Toy Problem
In this experiment we considered the toy problem mentioned in Section 1 (Figure 1). The affinity matrix is defined by an RBF kernel, but the diagonal elements are set to zero. The convergence process of our iteration algorithm with t increasing from 1 to 400 is shown in Figures 2(a)-2(d). Note that the initial label information is diffused along the moons. The assumption of consistency essentially means that a good classifying function should change slowly on the coherent structure aggregated by a large amount of data. This can be illustrated by this toy problem very clearly. Let us define a function f(x_i) = (F*_{i1} − F*_{i2})/(F*_{i1} + F*_{i2}); accordingly, the decision function is sign(f(x_i)), which is equivalent to the decision rule described in Section 2. In Figure 3, we show that f(x_i) becomes successively flatter with respect to the two moons pattern from Figures 3(a)-3(d) with increasing t. Note that two clear moons emerge in Figure 3(d).

The basic idea of our method is to construct a smooth function. It is natural to consider using this method to improve a supervised classifier by smoothing its classification result. In other words, we use the classification result given by a supervised classifier as the input of our algorithm. This conjecture is demonstrated by a toy problem in Figure 4. Figure 4(a) is the classification result given by the SVM with an RBF kernel. This result is then assigned to Y in our method. The output of our method is shown in Figure 4(b). Note that the points classified incorrectly by the SVM are successfully smoothed by the consistency method.

4.2 Digit Recognition
In this experiment, we addressed a classification task using the USPS handwritten 16x16 digits dataset. We used digits 1, 2, 3, and 4 in our experiments as the four classes. There are 1269, 929, 824, and 852 examples for each class, for a total of 3874.

The k in k-NN was set to 1. The width of the RBF kernel for the SVM was set to 5, and for the harmonic Gaussian field method it was set to 1.25. In our method and its variants, the affinity matrix was constructed by the RBF kernel with the same width as used in the harmonic Gaussian method, but the diagonal elements were set to 0. The test errors averaged over 100 trials are summarized in the
left panel of Figure 5. Samples were chosen so that they contain at least one labeled point for each class. Our consistency method and one of its variants are clearly superior to the orthodox supervised learning algorithms k-NN and SVM, and also better than the harmonic Gaussian method.

Note that our approach does not require the affinity matrix W to be positive definite. This enables us to incorporate prior knowledge about digit image invariance in an elegant way, e.g., by using a jittered kernel to compute the affinity matrix [5]. Other kernel methods are known to have problems with this method [5]. In our case, jittering by 1 pixel translation leads to an error rate around 0.01 for 30 labeled points.

[Figure 5: Left panel: the error rates (test error) of digit recognition with the USPS handwritten 16x16 digits dataset for a total of 3874 examples (a subset containing digits from 1 to 4). Right panel: the error rates of text classification with 3970 document vectors in an 8014-dimensional space. Samples are chosen so that they contain at least one labeled point for each class.]

4.3 Text Classification
In this experiment, we investigated the task of text classification using the 20-newsgroups dataset. We chose the topic rec, which contains autos, motorcycles, baseball, and hockey, from the version 20-news-18828. The articles were processed by the Rainbow software package with the following options: (1) passing all words through the Porter stemmer before counting them; (2) tossing out any token which is on the stoplist of the SMART system; (3) skipping any headers; and (4) ignoring words that occur in 5 or fewer documents. No further preprocessing was done. Removing the empty documents, we obtained 3970 document vectors in an 8014-dimensional space. Finally the documents were normalized into TFIDF representation. The distance between points x_i and x_j was defined to be d(x_i, x_j) = 1 − ⟨x_i, x_j⟩/(‖x_i‖ ‖x_j‖) [15].

The k in k-NN was set to 1. The width of the RBF kernel for the SVM was set to 1.5, and for the harmonic Gaussian method it was set to 0.15. In our methods, the affinity matrix was constructed by the RBF kernel with the same width as used in the harmonic Gaussian method, but the diagonal elements were set to 0. The test errors averaged over 100 trials are summarized in the right panel of Figure 5. Samples were chosen so that they contain at least one labeled point for each class.

It is interesting to note that the harmonic method is very good when the number of labeled points is 4, i.e. one labeled point for each class. We think this is because there are almost equal proportions of different classes in the dataset, and so with four labeled points, the proportions happen to be estimated exactly. The harmonic method becomes worse, however, if slightly more labeled points are used, for instance 10 labeled points, which leads to pretty poor estimation. As the number of labeled points increases further, the harmonic method works well again and somewhat better than our method, since the proportions of classes are estimated successfully again. However, our decision rule is much simpler, and in fact corresponds to the so-called naive threshold, the baseline of the harmonic method.
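As a sketch of how the text affinities of Section 4.3 might be assembled: the paper does not spell out the exact kernel-over-distance form, so the exp(−d²/2σ²) choice below is one plausible reading, and X is assumed to hold TFIDF row vectors.

```python
import numpy as np

def text_affinity(X, width=0.15):
    """RBF affinity over the cosine-based distance d = 1 - <xi,xj>/(|xi||xj|)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos = (X @ X.T) / (norms * norms.T)        # cosine similarities
    d = 1.0 - cos                               # the distance from [15]
    W = np.exp(-d ** 2 / (2 * width ** 2))      # assumed kernel-over-distance form
    np.fill_diagonal(W, 0.0)                    # zero diagonal, as in the method
    return W
```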
5 Conclusion
The key to semi-supervised learning problems is the consistency assumption, which essentially requires a classifying function to be sufficiently smooth with respect to the intrinsic structure revealed by a huge amount of labeled and unlabeled points. We proposed a simple algorithm to obtain such a solution, which demonstrated effective use of unlabeled data in experiments including toy data, digit recognition and text categorization. In our further research, we will focus on model selection and theoretical analysis.

Acknowledgments
We would like to thank Vladimir Vapnik, Olivier Chapelle, Arthur Gretton, and Andre Elisseeff for their help with this work. We also thank Andrew Ng for helpful discussions about spectral clustering, and the anonymous reviewers for their constructive comments. Special thanks go to Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty, who communicated with us on the important post-processing step of class mass normalization used in their method and also provided us with their detailed experimental data.

References
[1] J. R. Anderson. The architecture of cognition. Harvard Univ. Press, Cambridge, MA, 1983.
[2] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning Journal, to appear.
[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, 2001.
[4] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, 2002.
[5] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161-190, 2002.
[6] T. Joachims. Transductive learning via spectral graph partitioning. In ICML, 2003.
[7] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In NIPS, 2002.
[8] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML, 2002.
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In NIPS, 2001.
[10] M. Seeger. Learning with labeled and unlabeled data. Technical report, The University of Edinburgh, 2000.
[11] J. Shrager, T. Hogg, and B. A. Huberman. Observation of phase transitions in spreading activation networks. Science, 236:1092-1094, 1987.
[12] A. Smola and R. I. Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, Berlin-Heidelberg, Germany, 2003. Springer Verlag.
[13] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, 2001.
[14] V. N. Vapnik. Statistical learning theory. Wiley, NY, 1998.
[15] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
Discrete Mathematics 306 (2006) 1651-1668
www.elsevier.com/locate/disc

Language structure of pattern Sturmian words

Teturo Kamae (a), Hui Rao (b), Bo Tan (c), Yu-Mei Xue (b)
(a) Matsuyama University, 790-8578, Japan
(b) Department of Mathematics, Tsinghua University, Beijing 100084, PR China
(c) Department of Mathematics, Huazhong University of Science and Technology, Wuhan 430074, PR China
E-mail addresses: kamae@apost.plala.or.jp (T. Kamae), hrao@ (H. Rao), bo_tan@ (B. Tan), yxue@ (Y-M. Xue).

Received 28 November 2005; received in revised form 14 March 2006; accepted 28 March 2006
Available online 19 June 2006
0012-365X/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.disc.2006.03.043

Abstract
Pattern Sturmian words introduced by Kamae and Zamboni [Sequence entropy and the maximal pattern complexity of infinite words, Ergodic Theory Dynamical Systems 22 (2002) 1191-1199; Maximal pattern complexity for discrete systems, Ergodic Theory Dynamical Systems 22 (2002) 1201-1214] are an analogy of Sturmian words for the maximal pattern complexity instead of the block complexity. So far, two kinds of recurrent pattern Sturmian words are known, namely, rotation words and Toeplitz words. But neither a structural characterization nor a reasonable classification of the recurrent pattern Sturmian words is known. In this paper, we introduce a new notion, pattern Sturmian sets, which are used to study the language structure of pattern Sturmian words. We prove that there are exactly two primitive structures for pattern Sturmian words. Consequently, we suggest a classification of pattern Sturmian words according to structures of pattern Sturmian sets and prove that there are at most three classes in this classification. Rotation words and Toeplitz words fall into two different classes, but no examples of words from the third class are known.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Uniform complexity; Pattern Sturmian word; Language structure

1. Introduction

1.1. Pattern Sturmian words
Let A denote a nonempty finite set which is called an alphabet. Let α ∈ A^N be an infinite word over A, where N = {0, 1, 2, ...} is the index set. Let k be a positive integer. By a k-window, we mean a subset of N with cardinality k. For a word α ∈ A^N and a k-window τ = {τ_0 < τ_1 < ··· < τ_{k−1}}, we denote

α[n + τ] := α(n + τ_0) α(n + τ_1) ··· α(n + τ_{k−1}) ∈ A^τ,
F_α(τ) := {α[n + τ]; n ∈ N},
p_α(τ) := #F_α(τ),

where α[n + τ] is considered as a word on the index set τ, and #E denotes the cardinality of a finite set E. An element of F_α(τ) is called a τ-factor of α. The maximal pattern complexity p*_α of a word α is introduced by Kamae and Zamboni [10] as

p*_α(k) := sup_τ p_α(τ)  (k = 1, 2, 3, ...),

where the supremum is taken over all k-windows τ. The block complexity p_α is defined as p_α(k) = p_α({0, 1, ..., k − 1}).
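The definitions above are easy to explore computationally. The following sketch (not from the paper) estimates p_α(τ) and a lower bound on p*_α(k) from a finite prefix of a word; the Fibonacci word is used as an illustrative Sturmian example, and the window search is truncated at an assumed maximum offset, so the result is only a lower bound on the true supremum.

```python
from itertools import combinations

def fibonacci_word(length):
    """Prefix of the Fibonacci word, a classical Sturmian word over {0, 1}."""
    w = "0"
    while len(w) < length:
        # Apply the substitution 0 -> 01, 1 -> 0 ('@' shields the new 1s).
        w = w.replace("0", "0@").replace("1", "0").replace("@", "1")
    return w[:length]

def window_complexity(alpha, tau):
    """p_alpha(tau): number of distinct tau-factors seen along the prefix."""
    span = max(tau)
    factors = {tuple(alpha[n + t] for t in tau)
               for n in range(len(alpha) - span)}
    return len(factors)

def max_pattern_complexity(alpha, k, max_offset=12):
    """Lower bound on p*_alpha(k): sup of p_alpha(tau) over windows
    tau = {0 < t_1 < ... < t_{k-1}} with offsets bounded by max_offset."""
    best = 0
    for rest in combinations(range(1, max_offset + 1), k - 1):
        best = max(best, window_complexity(alpha, (0,) + rest))
    return best

alpha = fibonacci_word(5000)
for k in (1, 2, 3, 4):
    print(k, max_pattern_complexity(alpha, k))  # 2k is expected for a Sturmian word
```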
is larger than that of Sturmian words.Till now,three classes of pattern Sturmian words are known:rotation words,Toeplitz words and a class of{0,1}-words with rare1,where thefirst two of them are recurrent,while the last ones are not(see[10,11]). We do not know whether there are pattern Sturmian words other than of these kinds or not.We are also interested in what are the common points of the three known pattern Sturmian words,and what are the differences between them.In this paper,we analyze the language structure of recurrent pattern Sturmian words,and try to answer these questions.1.2.Uniform setLet A be an alphabet and be a countable infinite set.An element w∈A (which is a mapping from to A)is called a word on the index set over A,or a -word over A.For a nonemptyfinite set S⊂ ,define S(w)to be the S-word which is the restriction of w to S.For ⊂A ,put S( ):={ S(w);w∈ }.A subset ⊂A is called a uniform set if# S( )depends only on the size of S.Thus,we introduce the uniform complexity function p :Z+→Z+by p (k)=# S( )with#S=k.Special concern is paid to two classes of uniform sets,namely,Sturmian sets with p (k)=k+1(k∈Z+)and pattern Sturmian sets with p (k)=2k(k∈Z+).Example1.1.Take =N.A word ∈{0,1}N is called an increasing word(a decreasing word)if (i) (j) ( (i) (j),resp.)whenever i<j.A word is monotone if it is increasing or decreasing.A word ∈{0,1}N is called a Dirac word if there exists i0∈N such that (i)=0for any i=i0.Define0:={ ∈{0,1} ; is increasing},0:={ ∈{0,1} ; is Dirac},1:={ ∈{0,1} ; is monotone},1:={ ∈{0,1} ; is either decreasing or Dirac},2:={ ∈{0,1} ; is either increasing or Dirac}.Then,it is easily seen that 0and 0are Sturmian sets,while 1, 1and 2are pattern Sturmian sets.These sets will play an important role in our study and these notations will be used throughout the paper.We will show that a uniform set isfinite if and only if p (k) k holds for some k(Proposition2.3).Hence,a Sturmian set is an infinite uniform set with the minimum uniform complexity.As we will see,the pattern Sturmian sets are closely related to the pattern Sturmian words.T.Kamae et al./Discrete Mathematics306(2006)1651–16681653Fig.1. 0and 0.Fig.2. 1and 1.Fig.3. 
1.3. Classification of recurrent pattern Sturmian words
We study the language structure of the uniform sets on the index set N. We introduce in Section 3 the notion of isomorphism between uniform sets U and V on N, so that U and V are isomorphic to each other if and only if the trees representing the extension schemes of their languages along the indices 0, 1, 2, ... are isomorphic. Then the structure of a uniform set Σ on the index set N is defined to be the class of this isomorphism containing Σ, which is denoted by [Σ].

It holds that in Example 1.1, Λ_0 and Λ'_0 are isomorphic to each other and Λ_1 and Λ'_1 are isomorphic to each other, while Λ_1 and Λ_2 are not isomorphic (see Figs. 1-3).

Let N = {n_0 < n_1 < n_2 < ···} ⊂ N and φ_N : {0,1}^N → {0,1}^N be such that φ_N(ω)(k) = ω(n_k) (k ∈ N). The induced set Σ(N) of a set Σ on N is defined to be the set φ_N(Σ). It is a uniform set on N if Σ is so. A uniform set Σ is called primitive if all induced sets of Σ are isomorphic to Σ itself. The structure [Σ] of a primitive uniform set Σ is called primitive; that is, [Σ] is primitive if there exists a primitive element in [Σ].

We prove that [Λ_0] is the unique primitive structure among the Sturmian sets, while there are exactly two different primitive structures among the pattern Sturmian sets, namely, [Λ_1] and [Λ_2].

The uniform sets are an interesting subject to be studied in general. For example, how to characterize the uniform complexity functions is an interesting problem. Here, we only discuss finite uniform sets, Sturmian sets and pattern Sturmian sets, in Sections 2 and 3. The results there, except Theorem 3.5, are irrelevant to the arguments after Section 3.

1.4. Ultimate structure
Given a recurrent pattern Sturmian word α ∈ {0,1}^N, we prove in Theorem 4.1 that there exists an infinite subset N of N, which is called an optimal window, such that for any nonempty finite set τ ⊂ N, we have p_α(τ) = 2#τ. Then we have a pattern Sturmian set Σ(α)(N), where Σ(α) denotes the orbit closure of α with respect to the shift on {0,1}^N. We denote by US(α) the set of structures [Σ(α)(N)] for all optimal windows N of α such that Σ(α)(N) is primitive. We prove that US(α) = {[Λ_1]} for all rotation words α, while US(α) = {[Λ_2]} for all Toeplitz words α (Theorems 4.3 and 4.8). Thus, we can classify the recurrent pattern Sturmian words in terms of the language structure.

Specially, for Toeplitz words, we give concrete constructions of optimal windows, which yield an alternative proof of the fact that the simple Toeplitz words are pattern Sturmian, a fact presented with a wrong proof in [11]. Also remark that a proof of this fact in a more general setting can be found in [7].

1.5. More references on the complexity in general
To survey the block complexity in general, see Ferenczi [4]. The block complexity of general Toeplitz words is discussed by Cassaigne and Karhumäki [3] and Koskas [12]. Other kinds of complexity are defined and discussed by Allouche et al. [1], Avgustinovich et al. [2], Frid [6], Nakashima et al. [15], and Restivo and Salemi [16]. The notion of pattern Sturmian words is extended to words over ℓ letters in [8], and to two-dimensional words in [9].

1.6. Organization of the paper
This paper is organized as follows. Sections 2 and 3 are devoted to the study of uniform sets. In Section 2, the notion of uniform sets is introduced and some basic properties are investigated. We are specially interested in the pattern Sturmian sets, which have uniform complexity 2k. In Section 3, we study the isomorphism between uniform sets.
The isomorphism classes are called structures. We prove that there exist exactly two primitive structures among the pattern Sturmian sets. Section 4 is devoted to the study of the language structure of pattern Sturmian words. In Section 4.1, we prove that all recurrent pattern Sturmian words admit optimal windows, which define their ultimate structure. In Section 4.2, we study the ultimate structure of the rotation words, while in Section 4.3, we study the ultimate structure of the Toeplitz words.

2. Uniform sets
Let A be an alphabet and Ω be an index set. Let F_k (k ∈ Z+) be the collection of subsets of Ω consisting of k elements, that is, F_k = {S ⊂ Ω; #S = k}. Set F = ∪_{k≥1} F_k. For S ⊂ Ω, an S-word ω over A is called a constant word if there exists a ∈ A such that ω(ξ) = a for any ξ ∈ S.

Let S and S' be two disjoint subsets of Ω, and let w and w' be an S-word and an S'-word, respectively. The concatenation of w and w' is defined to be the S ∪ S'-word ww' with the property ww'(ξ) = w(ξ) if ξ ∈ S and ww'(ξ) = w'(ξ) if ξ ∈ S'.

Given w ∈ π_S(Σ), if w' ∈ π_{S'}(Σ) satisfies ww' ∈ π_{S∪S'}(Σ), then we say that w' is an S'-extension of w in Σ. For ξ ∈ Ω\S, a word w ∈ π_S(Σ) is called ξ-special if there are at least two different {ξ}-extensions of w in Σ. The complexity of Σ is the function p_Σ : F → Z+ defined by p_Σ(S) = #π_S(Σ).

Definition 2.1. A nonempty subset Σ ⊂ A^Ω is called a uniform set if the complexity p_Σ(S) depends only on #S. If Σ is a uniform set, we have a function p_Σ : Z+ → Z+ such that p_Σ(k) = p_Σ(S) for any S ∈ F_k. The function p_Σ is called the uniform complexity function.

From now on, we always take A = {0,1}. We consider the finite uniform sets first.

Proposition 2.2. Let Σ ⊂ {0,1}^Ω be a finite uniform set. Then either (i) p_Σ(k) ≡ 1 and Σ = {w} for some w ∈ {0,1}^Ω, or (ii) p_Σ(k) ≡ 2 and Σ = {w, w̄} for some w ∈ {0,1}^Ω, where w̄ denotes the complementary word defined by w̄(ξ) = the complement of w(ξ) (with 0̄ = 1, 1̄ = 0) for any ξ ∈ Ω.

Proof. Assume that Σ = {w_1, w_2, ..., w_n} is a uniform set. For ξ ∈ Ω, define a vector v(ξ) = (w_1(ξ), w_2(ξ), ..., w_n(ξ)) ∈ {0,1}^n. Since there are only a finite number of vectors in {0,1}^n, there exists a vector v such that v(ξ) = v for infinitely many ξ. Denote Ω' = {ξ; v(ξ) = v}. Then for any S ⊂ Ω', π_S(Σ) consists only of constant words, so that #π_S(Σ) ≤ 2. Since #S as above can be any positive number and Σ is uniform, we have p_Σ(k) ≤ 2 (k = 1, 2, ...). This implies that either p_Σ(k) ≡ 1 or p_Σ(k) ≡ 2, since p_Σ(k) is an increasing function of k and p_Σ(1) = 1 implies p_Σ(k) = 1 (k = 1, 2, ...). On the other hand, if p_Σ(k) ≡ 1, then Σ is a singleton; and if p_Σ(k) ≡ 2, then Σ = {w, w̄}. □

Proposition 2.3. Let Σ be a uniform set. If there exists k such that p_Σ(k) ≤ k, then Σ is a finite set.

Proof. If p_Σ(1) = 1, then Σ is a singleton. If p_Σ(1) = 2 and p_Σ(k) ≤ k, then there exists a k' < k such that p_Σ(k' + 1) = p_Σ(k'). Take S ⊂ Ω with #S = k'. Then for any ξ ∈ Ω\S and w ∈ Σ, since p_Σ(k' + 1) = p_Σ(k'), the value w(ξ) is determined by the S-word π_S(w). Hence w is determined by π_S(w), and Σ is a finite set. □

Definition 2.4. A uniform set Σ ⊂ {0,1}^Ω is called a Sturmian set if its uniform complexity satisfies p_Σ(k) = k + 1 for any k ∈ Z+; Σ is called a pattern Sturmian set if its uniform complexity satisfies p_Σ(k) = 2k for any k ∈ Z+.

Hence, by Proposition 2.3, Sturmian sets have the minimal uniform complexity among all infinite uniform sets. Examples of Sturmian sets and pattern Sturmian sets are given in Example 1.1. We have the following characterizations.
Theorem 2.5. Let Σ be a uniform set. Then Σ is a Sturmian set if and only if p_Σ(2) = 3.

Proof. Obviously, the condition p_Σ(2) = 3 is necessary. To show the sufficiency, suppose that Σ is uniform and p_Σ(2) = 3. Then by Propositions 2.2 and 2.3, Σ is an infinite set and p_Σ(k) ≥ k + 1 for any k ∈ Z+. So we only need to show that p_Σ(k) ≤ k + 1.

Otherwise, suppose that p_Σ(k) ≥ k + 2 for some k ≥ 3. Then there exists a k' < k such that p_Σ(k' + 1) ≥ p_Σ(k') + 2. Take S ⊂ Ω with #S = k' and ξ ∈ Ω\S. Since A = {0,1} and each S-word has at most two {ξ}-extensions, there are two S-words w_1 and w_2 which are ξ-special. Since w_1 and w_2 are distinct, there exists an η ∈ S such that w_1(η) ≠ w_2(η). So π_{{η,ξ}}(Σ) = {00, 01, 10, 11} and p_Σ(2) = 4, a contradiction. □

A uniform set Σ is said to fulfill the (k)-Condition if its uniform complexity satisfies

p_Σ(m) = 2m for m = 1, 2, ..., k, but p_Σ(k + 1) = 2k + 1.

Thus, by Theorem 2.5, Σ is a Sturmian set if and only if it fulfills the (1)-Condition. The following lemma will be used for the characterization of pattern Sturmian sets.

Lemma 2.6. Let Σ be a uniform set.
(1) There exists an infinite subset Ω' ⊂ Ω such that π_{Ω'}(Σ) contains at least one constant word.
(2) If Σ fulfills the (k)-Condition for some k ≥ 2, then for any S ⊂ Ω with #S = k, there exist w_1 ∈ π_S(Σ) and an infinite subset Ω' ⊂ Ω\S such that w_1 is {ξ}-special for any ξ ∈ Ω' and {π_{Ω'}(w); w ∈ Σ and π_S(w) ≠ w_1} consists of the two constant words.

Proof. (1) Take an arbitrary word ω ∈ Σ. Since Ω is an infinite set, there exists an infinite subset Ω' such that π_{Ω'}(ω) is a constant word. Thus, π_{Ω'}(Σ) contains at least one constant word.

(2) Assume that Σ fulfills the (k)-Condition for some k ≥ 2. Take S ⊂ Ω with #S = k and write π_S(Σ) = {w_1, w_2, ..., w_{2k}}. Since p_Σ(k + 1) = 2k + 1, for any ξ ∈ Ω\S, just one element in {w_1, w_2, ..., w_{2k}} is ξ-special. Therefore, without loss of generality, we may assume that there exists an infinite set Ω' ⊂ Ω\S such that w_1 is the only element of π_S(Σ) which is ξ-special for any ξ ∈ Ω', and {π_{Ω'}(w); w ∈ Σ and π_S(w) ≠ w_1} consists only of constant words. We conclude the proof by claiming that both constant words, with 0 and with 1, are included in the above set. Otherwise, taking η ∈ S and ξ ∈ Ω', one can find only three different {η, ξ}-words, which contradicts the fact that p_Σ(2) = 4. □
Theorem 2.7. Let Σ be a uniform set. Then Σ is a pattern Sturmian set if and only if p_Σ(2) = 4 and p_Σ(3) = 6.

Proof. We only need to show that if Σ is uniform with p_Σ(2) = 4 and p_Σ(3) = 6, then p_Σ(k) = 2k for any k.

• p_Σ(k) ≤ 2k for any k ≥ 1: otherwise, there exists a k ≥ 3 such that p_Σ(k + 1) ≥ p_Σ(k) + 3. Take S ⊂ Ω with #S = k and ξ ∈ Ω\S. There are three S-words w_1, w_2 and w_3 which are ξ-special. Since w_1, w_2 and w_3 are distinct, there exist η_1, η_2 ∈ S such that π_{{η_1,η_2}}(w_1), π_{{η_1,η_2}}(w_2) and π_{{η_1,η_2}}(w_3) are different from each other. Also, all of them are ξ-special. Since p_Σ(2) = 4, there is another {η_1, η_2}-word besides π_{{η_1,η_2}}(w_1), π_{{η_1,η_2}}(w_2) and π_{{η_1,η_2}}(w_3) which has at least one {ξ}-extension. Then #π_{{η_1,η_2,ξ}}(Σ) ≥ 7, which contradicts the assumption p_Σ(3) = 6.

• p_Σ(k) ≥ 2k for any k ≥ 1: otherwise, there exists a k ≥ 3 such that p_Σ(k + 1) ≤ p_Σ(k) + 1. If p_Σ(k + 1) = p_Σ(k), then as in the proof of Proposition 2.3, we can show that Σ is finite, which is a contradiction. Only one possibility remains: Σ fulfills the (k)-Condition for some k ≥ 3. In this case, fix S ⊂ Ω with #S = k. Then #π_S(Σ) = 2k, say π_S(Σ) = {w_1, w_2, ..., w_{2k}}. By Lemma 2.6(2), there exist an infinite subset Ω' ⊂ Ω\S and w_1 ∈ π_S(Σ) such that w_1 is ξ-special for any ξ ∈ Ω' and {π_{Ω'}(w); w ∈ Σ and π_S(w) ≠ w_1} consists only of the two constant words. Construct a set Σ' as follows:

Σ' = {π_{Ω'}(w); w ∈ Σ and π_S(w) = w_1}.

We claim that Σ' is a uniform set. To see this, for a fixed finite subset S' ⊂ Ω', consider the S ∪ S'-words of Σ. Since any S-word but w_1 has a unique S'-extension, and w_1 has just p_{Σ'}(S') different S'-extensions, we have

p_{Σ'}(S') = p_Σ(S ∪ S') − (2k − 1) = p_Σ(k + #S') − (2k − 1),

which implies that Σ' is a uniform set with

p_{Σ'}(m) = p_Σ(k + m) − (2k − 1) for any m ≥ 1.

Obviously, p_{Σ'}(1) = 2. We claim that p_{Σ'}(2) = 4. Otherwise p_{Σ'}(2) = 2 or 3.

If p_{Σ'}(2) = 2, then by Propositions 2.2 and 2.3, Σ' = {ω, ω̄} for some ω ∈ {0,1}^{Ω'}. Take S' ⊂ Ω' with #S' = 3. Consider the set π_{S'}(Σ), which is the union of π_{S'}(Σ') and the two constant words. Then we have p_Σ(S') ≤ 2 + 2 = 4, contradicting the fact that p_Σ(3) = 6.

If p_{Σ'}(2) = 3, then Σ' is a Sturmian set. Then by Lemma 2.6(1), there exists a three-element subset S' ⊂ Ω' such that π_{S'}(Σ') consists of four elements, among which at least one is a constant word. Consider the set π_{S'}(Σ), which is the union of π_{S'}(Σ') and the two constant words. Then p_Σ(S') ≤ 4 + 2 − 1 = 5, contradicting the fact that p_Σ(3) = 6.

Moreover, since Σ' is defined on Ω' ⊂ Ω and Σ' ⊂ π_{Ω'}(Σ), we have

p_{Σ'}(m) ≤ p_Σ(m) for any m ≥ 1.

Therefore, Σ' fulfills the (k')-Condition for some 2 ≤ k' ≤ k. Due to Lemma 2.6(2) applied to Σ', there exists an infinite set Ω'' ⊂ Ω' such that Σ'' := π_{Ω''}(Σ') contains the two constant words. Replacing Ω' and Σ' by these Ω'' and Σ'', we may assume that the above Σ' contains the two constant words. Since for any S' ⊂ Ω', π_{S'}(Σ) is the union of π_{S'}(Σ') and the two constant words, we have π_{S'}(Σ) = π_{S'}(Σ') and p_Σ ≡ p_{Σ'}.

On the other hand, fix η ∈ S and S' ⊂ Ω' with #S' = m. Without loss of generality, assume that w_1(η) = 0. Since p_Σ(2) = 4, for any ξ ∈ S', π_{{η,ξ}}(Σ) = {00, 01, 10, 11}. Since π_{S'}(Σ') contains both constant words,

π_{{η}∪S'}(Σ) = {0w; w ∈ π_{S'}(Σ')} ∪ {10^m, 11^m},

where, for example, 10^m denotes the {η}∪S'-word w with w(η) = 1 and w(s) = 0 for s ∈ S'. Since the union is disjoint, p_Σ(m + 1) = p_{Σ'}(m) + 2 for any m ≥ 1. This contradicts the facts p_Σ ≡ p_{Σ'}, p_Σ(k) = 2k and p_Σ(k + 1) = 2k + 1, which completes the proof of p_Σ(k) ≥ 2k. □

3. Isomorphism between uniform sets
In this section, we consider only the uniform sets on the index set Ω := N equipped with its natural total ordering.
Recall that the alphabet A is always {0,1}. The product topology defined on {0,1}^N is consistent with the following metric: for x = x(0)x(1)x(2)..., y = y(0)y(1)y(2)... ∈ {0,1}^N,

d(x, y) = 2^{−inf{k ∈ N; x(k) ≠ y(k)}}.

Thus two points are closer to each other if they share a longer prefix. The cylinder [η], where η = η_1 η_2 ··· η_n ∈ {0,1}^n, is the set of words of the form

[η] = {x ∈ {0,1}^N; x(0) = η_1, x(1) = η_2, ..., x(n−1) = η_n}.

The order of a cylinder [η] is defined to be the length n of η, denoted by |η|. Note that for the empty word ∅, [] := [∅] = {0,1}^N.

Definition 3.1. Two uniform sets Σ, Σ' ⊂ {0,1}^N are said to be isomorphic to each other, written Σ ≈ Σ', if there is an isometry between their closures Σ̄ and Σ̄', that is, a bijection φ : Σ̄ → Σ̄' such that for any x, y ∈ Σ̄, d(φ(x), φ(y)) = d(x, y). An equivalence class of uniform sets with respect to this isomorphism is called a structure. The structure containing Σ is denoted by [Σ].

Note that Σ and its closure Σ̄ always have the same language; that is, the sets of finite words appearing in Σ and in Σ̄ coincide. Also note that two uniform sets which are isomorphic to each other have the same uniform complexity.

For a uniform set Σ ⊂ {0,1}^N, we define the prefix tree G(Σ) as follows: G(Σ) = (V, E) is a directed graph. The set V of vertices is the set of the cylinders which meet Σ, and the set E of (directed) edges is the set of ordered pairs ([u], [v]) of cylinders in V such that v is an immediate extension of u, that is, |v| = |u| + 1 and u_1 = v_1, u_2 = v_2, ..., u_{|u|} = v_{|u|}.

Recall that two directed graphs G = (V, E) and G' = (V', E') are isomorphic, written G ≅ G', if there is a bijection φ : V → V' between their vertices such that there is an edge in E from u to v if and only if there is an edge in E' from φ(u) to φ(v).
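Theorem 3.2 below reduces isomorphism of uniform sets to isomorphism of their prefix trees, which is easy to test for finite approximations. The following sketch is illustrative only (finite words on {0, ..., n−1} stand in for the infinite ones); it builds prefix trees as nested dictionaries and compares them via a canonical form in the style of the classical rooted-tree isomorphism test:

```python
def prefix_tree(words):
    """Prefix tree G(Sigma) of a set of equal-length 0/1 words:
    nodes are cylinders meeting Sigma, edges go to immediate extensions."""
    tree = {}
    for w in words:
        node = tree
        for symbol in w:
            node = node.setdefault(symbol, {})
    return tree

def canonical(tree):
    """Canonical form of the tree shape with edge labels forgotten; two prefix
    trees are isomorphic as directed graphs iff their canonical forms agree."""
    return tuple(sorted(canonical(child) for child in tree.values()))

# Example 1.1 truncated to the index set {0,...,5}.
n = 6
increasing = ["0" * i + "1" * (n - i) for i in range(n + 1)]
decreasing = ["1" * i + "0" * (n - i) for i in range(n + 1)]
dirac = ["0" * i + "1" + "0" * (n - i - 1) for i in range(n)] + ["0" * n]

print(canonical(prefix_tree(increasing)) == canonical(prefix_tree(dirac)))  # True
print(canonical(prefix_tree(set(increasing) | set(decreasing))) ==
      canonical(prefix_tree(set(increasing) | set(dirac))))                 # False
```

The two prints mirror the claims recorded after Definition 3.1: Λ_0 ≈ Λ'_0, while Λ_1 and Λ_2 are not isomorphic.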
Put ={ N1( ); ∈ with (n0)=a}.Then, is again a Sturmian set since for any set S⊂N1with#S=k,p (S)=p (S∪{n0})−1=k+2−1=k+1.Put n1=min N1,and we continue the above process:find N2⊂N1\{n1}such that in one{n1}-word has a unique N2-extension while the N2-extensions of the other{n1}-word form a Sturmian set.Put n2=min N2,and so on.At last,setting N={n0,n1,n2,...},by the construction of N we have (N)≈ 0(see Fig.1).Moreover,if is primitive,then ≈ (N)≈ 0.The next theorem characterizes the primitive pattern Sturmian sets.Theorem3.5.Let be a pattern Sturmian set.Then either (N)≈ 1for some N⊂N or (N)≈ 2for some N⊂N.In particular,if is primitive,then either ≈ 1or ≈ 2.Hence,[ 1]and[ 2]are the only primitive structures among the pattern Sturmian sets.Proof.Let ⊂{0,1}N be a pattern Sturmian set.For any , ∈N with < , { , }( )={00,01,10,11}because p (2)=4.For any > ,since p (3)=6,there are two{ , }-words which are -special.If one of the special words comes from the set{00,01}and the other comes from{10,11},we call a{ , }-balanced place.If all butfinite number of places are{ , }-balanced places,we say that is{ , }-balanced.If is{ , }-balanced for any , ∈N with < ,we say that is balanced.We consider two cases according to the balance property.Case1: is balanced.Take n0=0,n1=1.Since is{n0,n1}-balanced,we can take an infinite subset N1⊂{n1+1,n1+2,...}such that one{n0,n1}-word in{00,01}is -special for any ∈N1,and one{n0,n1}-word in{10,11}is also -special for any ∈N1.Any of the other two words in{00,01,10,11}has a unique N1-extension.Take n2=min N1.For any{n0,n1}-word w which is{n2}-special,w0,w1∈ {n0,n1,n2}( ).Since is{n0,n2}-balanced,for all butfinite number of ’s with >n2,one word in{w0,w1}is not -special,we can take an infinite subset N2⊂N1\{n2}such that,for all{n0,n1}-words w which are{n2}-special,one word in{w0,w1}has a unique N2-extension.T.Kamae et al./Discrete Mathematics 306(2006)1651–16681659Take n 3=min N 2.Since 4words in {n 0,n 1,n 2}( )have a unique {n 3}-extension and p (4)=8,the other 2words in {n 0,n 1,n 2}( )are n 3-special.One of these 2words starts by 0and the other starts by 1.We continue the above process.Finally,we get N ={n 0,n 1,n 2,n 3,...}such that (N )≈ 1(see Fig.2).Case 2: is not balanced.There are places 0, 0∈N with 0< 0such that is not { 0, 0}-balanced,that is,there are infinite number of places > 0which are not { 0, 0}-balanced.Then,we find an infinite subset N ⊂{ 0+1, 0+2,...}such that for any ∈N ,both { 0, 0}-words either in {00,01}or in {10,11}are -special.Without loss of generality,we assume that both { 0, 0}-words in {00,01}are -special for infinitely many ∈N .Collecting all these ,we define an infinite set N ⊂N such that both { 0, 0}-words in {00,01}are -special for any ∈N ,while { 0, 0}-words in {10,11}are not -special for ∈N .Denote by and ∗the N -extensions of {00,01}and {10,11},respectively.More precisely,={ N (w);w ∈ such that { 0, 1}(w)∈{00,01}},∗={ N (w);w ∈ such that { 0, 1}(w)∈{10,11}}.We claim that is again a pattern Sturmian set.To see this,we study the set ∗first.Since any { 0, 1}-word in {10,11}has a unique N -extension,# ∗ 2.Moreover,since { 0, }( )={00,01,10,11}for any ∈N ,the -extensions of 10and 11are different,thus ∗={x,x }for some N -word x .Hence,for any finite subset S ⊂N ,{ 0}∪S ( )={0w ;w ∈ S ( )}∪{1u,1u }for some S -word u .From this,we havep (S)=p ({ 0}∪S)−2=2(#S +1)−2=2#S ,being a pattern Sturmian set.Subcase 2.1:If is balanced,then by Case 1,there is an N such that (N )≈ 1.But for any S ⊂N , S ( (N ))⊂ S ( (N )),and since both (N )and (N )are pattern 
Sturmian,by comparing the cardinality, S ( (N ))= S ( (N )).Therefore,the prefix trees of (N )and (N )are just the same,and (N )≈ (N )≈ 1.Subcase 2.2:If is not balanced,then there exist 1, 1∈N with 1< 1such that is not { 1, 1}-balanced,and we get a new pattern Sturmian set ,and continue the above discussion.In case of the new pattern Sturmian set constructed in some step is balanced,then just as shown in Subcase 2.1,we find N ⊂N such that (N )≈ 1.Otherwise,if all the pattern Sturmian sets constructed in the process are balanced,we get a sequence 0< 1< 2<···.PuttingN ={ 0, 1, 2,...},we have (N )≈ 2(see Fig.3).Example 3.6.Let ˜ ⊂{0,1}Z be the set of words ˜ on the index set Z such that either ˜ is increasing or ˜ is Dirac.Define a word (˜ )∈{0,1}N by(˜ )(k)=˜ (i)if k =2i is even ,˜ (−i)if k =2i −1is odd .Let = (˜ ).Then, is a pattern Sturmian set such that (N )= 2if N ={0,2,4,...}and (N )= 1≈ 1if N ={1,3,5,...}.Hence, is not primitive.Moreover,as shown in Fig.4, is isomorphic neither to 1nor to 2,so that [ ]is not primitive.1660T.Kamae et al./Discrete Mathematics306(2006)1651–1668Fig.4. in Example3.6.4.Structures of pattern Sturmian words4.1.From pattern Sturmian words to pattern Sturmian setsRecall that a word on N is called recurrent if for any L 1,there exists M 1such that(i)= (i+M)for i=0,1,...,L−1.(1) Note that if is recurrent,then there exist infinitely many M’s satisfying(1).In this subsection,we will show how to construct pattern Sturmian sets from a recurrent pattern Sturmian word.Let be a pattern Sturmian word.For afinite or infinite subset N⊂N,consider the following property(we will call it the optimal property):for any nonemptyfinite subset ⊂N,it holds thatp ( )=2# .(2) If an infinite subset N of N has the optimal property,then N is called an optimal window for .We note that an infinite subset of an optimal window is again an optimal window.Theorem4.1.Let be a recurrent pattern Sturmian word.Then,there exists an optimal window for .Proof.Suppose that is a recurrent pattern Sturmian word.We construct an increasing family of sets satisfying the optimal property.Put n0=0.Assume that n0<n1<···<n k−1have been picked out such that the optimal property holds for the set := {n0,n1,...,n k−1}.Take L∈N such thatF ( )={ [n+ ];n=0,1,...,L−1}.Since is recurrent,there exists M with M>n k−1such that(1)holds for this L.Put n k=M.Now we show that(2)holds for any nonempty ⊂{n0,n1,...,n k}.•If n k/∈ ,(2)holds for by the hypothesis of induction.•If n k∈ but n0/∈ .Write ={n0}∪( \{n k}).Then# =# .Note that(2)holds for by the hypothesis of induction,actually#{ [n+ ];n=0,1,...,L−1}=#F ( )=2# .On the other hand,by(1), (n k)= (n0)for n=0,1,...,L−1.Therefore,there is an one-to-one correspondence between the sets{ [n+ ];n=0,1,...,L−1}。
Traffic Classification Using Clustering Algorithms

Jeffrey Erman, Martin Arlitt, Anirban Mahanti
University of Calgary, 2500 University Drive NW, Calgary, AB, Canada
{erman,arlitt,mahanti}@cpsc.ucalgary.ca

ABSTRACT
Classification of network traffic using port-based or payload-based analysis is becoming increasingly difficult with many peer-to-peer (P2P) applications using dynamic port numbers, masquerading techniques, and encryption to avoid detection. An alternative approach is to classify traffic by exploiting the distinctive characteristics of applications when they communicate on a network. We pursue this latter approach and demonstrate how cluster analysis can be used to effectively identify groups of traffic that are similar using only transport layer statistics. Our work considers two unsupervised clustering algorithms, namely K-Means and DBSCAN, that have previously not been used for network traffic classification. We evaluate these two algorithms and compare them to the previously used AutoClass algorithm, using empirical Internet traces. The experimental results show that both K-Means and DBSCAN work very well and much more quickly than AutoClass. Our results indicate that although DBSCAN has lower accuracy compared to K-Means and AutoClass, DBSCAN produces better clusters.

Categories and Subject Descriptors
I.5.4 [Computing Methodologies]: Pattern Recognition—Applications

General Terms
Algorithms, classification

Keywords
machine learning, unsupervised clustering

1. INTRODUCTION
Accurate identification and categorization of network traffic according to application type is an important element of many network management tasks such as flow prioritization, traffic shaping/policing, and diagnostic monitoring. For example, a network operator may want to identify and throttle (or block) traffic from peer-to-peer (P2P) file sharing applications to manage its bandwidth budget and to ensure good performance of business critical applications. Similar to network management tasks, many network engineering problems such as workload characterization and modelling, capacity planning, and route provisioning also benefit from accurate identification of network traffic. In this paper, we present preliminary results from our experience with using a machine learning approach called clustering for the network traffic identification problem. In the remainder of this section, we motivate why clustering is useful, discuss the specific contributions of this paper, and outline our ongoing work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCOMM'06 Workshops, September 11-15, 2006, Pisa, Italy.
Copyright 2006 ACM 1-59593-417-0/06/0009...$5.00.

The classical approach to traffic classification relies on mapping applications to well-known port numbers and has been very successful in the past. To avoid detection by this method, P2P applications began using dynamic port numbers, and also started disguising themselves by using port numbers of commonly used protocols such as HTTP and FTP. Many recent studies confirm that port-based identification of network traffic is ineffective [8, 15].
To address the aforementioned drawbacks of port-based classification, several payload-based analysis techniques have been proposed [3, 6, 9, 11, 15]. In this approach, packet payloads are analyzed to determine whether they contain characteristic signatures of known applications. Studies show that these approaches work very well for current Internet traffic, including P2P traffic. In fact, some commercial packet shaping tools have started using these techniques. However, P2P applications such as BitTorrent are beginning to elude this technique by using obfuscation methods such as plain-text ciphers, variable-length padding, and/or encryption. In addition, there are some other disadvantages. First, these techniques only identify traffic for which signatures are available and are unable to classify any other traffic. Second, these techniques typically require increased processing and storage capacity.

The limitations of port-based and payload-based analysis have motivated the use of transport layer statistics for traffic classification [8, 10, 12, 14, 17]. These classification techniques rely on the fact that different applications typically have distinct behaviour patterns when communicating on a network. For instance, a large file transfer using FTP would have a longer connection duration and larger average packet size than an instant messaging client sending short occasional messages to other clients. Similarly, some P2P applications such as BitTorrent can be distinguished from FTP data transfers because these P2P connections typically are persistent and send data bidirectionally; FTP data transfer connections are non-persistent and send data only unidirectionally. Transport layer statistics such as the total number of packets sent, the ratio of the bytes sent in each direction, the duration of the connection, and the average size of the packets characterize these behaviours.

In this paper, we explore the use of a machine learning approach called clustering for classifying traffic using only transport layer statistics. Cluster analysis is one of the most prominent methods for identifying classes amongst a group of objects, and has been used as a tool in many fields such as biology, finance, and computer science. Recent work by McGregor et al. [10] and Zander et al. [17] shows that cluster analysis has the ability to group Internet traffic using only transport layer characteristics. In this paper, we confirm their observations by evaluating two clustering algorithms, namely K-Means [7] and DBSCAN [5], that to the best of our knowledge have not been previously applied to this problem. In addition, as a baseline, we present results from the previously considered AutoClass [1] algorithm [10, 17].

The algorithms evaluated in this paper use an unsupervised learning mechanism, wherein unlabelled training data is grouped based on similarity. This ability to group unlabelled training data is advantageous and offers some practical benefits over learning approaches that require labelled training data (discussed in Section 2). Although the selected algorithms all use an unsupervised learning mechanism, each is based on different clustering principles. The K-Means clustering algorithm is a partition-based algorithm [7], the DBSCAN algorithm is a density-based algorithm [5], and the AutoClass algorithm is a probabilistic model-based algorithm [1]. One reason in particular why the K-Means and DBSCAN algorithms were chosen is that they are much faster at clustering data than the previously used AutoClass algorithm.
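As a minimal sketch of the approach just described (not the authors' code; scikit-learn is assumed, the feature values are synthetic, and the parameter settings are illustrative only):

```python
# Illustrative sketch: cluster per-connection transport-layer feature
# vectors with the two algorithms evaluated in the paper.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# 200 synthetic connections x 4 features, standing in for e.g.
# log(total packets), log(mean packet size), log(bytes sent), log(duration).
X = rng.normal(size=(200, 4))

kmeans_labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print("K-Means clusters:", len(set(kmeans_labels)))
# DBSCAN labels outliers as -1 ("noise"); K-Means assigns every point.
print("DBSCAN clusters:", len(set(dbscan_labels) - {-1}))
```

Here `min_samples` plays the role of the paper's minPts parameter and `eps` the role of its eps distance; K-Means instead takes the number of partitions K up front.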
We evaluate the algorithms using two empirical traces: a well-known publicly available Internet traffic trace from the University of Auckland, and a recent trace we collected from the University of Calgary's Internet connection. The algorithms are compared based on their ability to generate clusters that have a high predictive power of a single application. We show that clustering works for a variety of different applications, including Web, P2P file-sharing, and file transfer, with the AutoClass and K-Means algorithms' accuracy exceeding 85% in our results and DBSCAN achieving an accuracy of 75%. Furthermore, we analyze the number of clusters and the number of objects in each of the clusters produced by the different algorithms. In general, the ability of an algorithm to group objects into a few "good" clusters is particularly useful in reducing the amount of processing required to label the clusters. We show that while DBSCAN has a lower overall accuracy, the clusters it forms are the most accurate. Additionally, we find that by looking at only a few of DBSCAN's clusters one could identify a significant portion of the connections.

Ours is a work-in-progress. Preliminary results indicate that clustering is indeed a useful technique for traffic identification. Our goal is to build an efficient and accurate classification tool using clustering techniques as the building block. Such a clustering tool would consist of two stages: a model building stage and a classification stage. In the first stage, an unsupervised clustering algorithm clusters training data. This produces a set of clusters that are then labelled to become our classification model. In the second stage, this model is used to develop a classifier that has the ability to label both online and offline network traffic. We note that offline classification is relatively easier compared to online classification, as the flow statistics needed by the clustering algorithm may be easily obtained in the former case; the latter requires the use of estimation techniques for flow statistics. We should also note that this approach is not a "panacea" for the traffic classification problem. While the model building phase does automatically generate clusters, we still need to use other techniques to label the clusters (e.g., payload analysis, manual classification, port-based analysis, or a combination thereof). This task is manageable because the model would typically be built using small data sets.

We believe that in order to build an accurate classifier, a good classification model must be used. In this paper, we focus on the model building step. Specifically, we investigate which clustering algorithm generates the best model. We are currently investigating building efficient classifiers for K-Means and DBSCAN and testing the classification accuracy of the algorithms. We are also investigating how often the models should be retrained (e.g., on a daily, weekly, or monthly basis).

The remainder of this paper is arranged as follows. The different Internet traffic classification methods, including those using cluster analysis, are reviewed in Section 2. Section 3 outlines the theory and methods employed by the clustering algorithms studied in this paper. Section 4 and Section 5 present our methodology and outline our experimental results, respectively. Section 6 discusses the experimental results. Section 7 presents our conclusions.

2. BACKGROUND
Several techniques use transport layer information to address the problems associated with payload-based analysis and the diminishing effectiveness of port-based
identification. McGregor et al. hypothesize that cluster analysis can be used to group flows using transport layer attributes [10]. The authors, however, evaluate neither the accuracy of the classification nor which flow attributes produce the best results. Zander et al. extend this work by using another Expectation Maximization (EM) algorithm [2] called AutoClass [1] and analyze the best set of attributes to use [17]. Both [10] and [17] test only Bayesian clustering techniques implemented by an EM algorithm. The EM algorithm has a slow learning time. This paper evaluates clustering algorithms that are different from, and faster than, the EM algorithm used in previous work.

Some non-clustering techniques also use transport layer statistics to classify traffic [8, 9, 12, 14]. Roughan et al. use nearest neighbor and linear discriminant analysis [14]. The connection durations and average packet size are used for classifying traffic into four distinct classes. This approach has some limitations in that these two statistics alone may not be enough to classify all application classes.

Karagiannis et al. propose a technique that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [8]. Their results show that this approach is comparable with payload-based identification in terms of accuracy. More recently, Karagiannis et al. developed another method that uses social, functional, and application behaviors to identify all types of traffic [9]. These approaches focus on higher-level behaviours, such as the number of concurrent connections to an IP address, and do not use the transport layer characteristics of a single connection that we utilize in this paper.

In [12], Moore et al. use a supervised machine learning algorithm called Naïve Bayes as a classifier. Moore et al. show that the Naïve Bayes approach has a high accuracy classifying traffic. Supervised learning requires the training data to be labelled before the model is built. We believe that an unsupervised clustering approach offers some advantages over supervised learning approaches. One of the main benefits is that new applications can be identified by examining the connections that are grouped to form a new cluster. The supervised approach cannot discover new applications and can only classify traffic for which it has labelled training data. Another advantage occurs when the connections are being labelled.
Due to the high accuracy of our clusters, only a few of the connections need to be identified in order to label the cluster with a high degree of confidence. Also consider the case where the data set being clustered contains encrypted P2P connections or other types of encrypted traffic. These connections would not be labelled using payload-based classification. These connections would, therefore, be excluded from the supervised learning approach, which can only use labelled training data as input. This could reduce the supervised approach's accuracy. However, the unsupervised clustering approach does not have this limitation. It might place the encrypted P2P traffic into a cluster with other unencrypted P2P traffic. By looking at the connections in the cluster, an analyst may be able to see similarities between unencrypted P2P traffic and the encrypted traffic and conclude that it may be P2P traffic.

3. CLUSTERING ALGORITHMS
This section reviews the clustering algorithms, namely K-Means, DBSCAN, and AutoClass, considered in this work. The K-Means algorithm produces clusters that are spherical in shape, whereas the DBSCAN algorithm has the ability to produce clusters that are non-spherical. The different cluster shapes that DBSCAN is capable of finding may allow a better set of clusters to be found that minimizes the amount of analysis required. The AutoClass algorithm uses a Bayesian approach and can automatically determine the number of clusters. Additionally, it performs soft clustering, wherein objects are assigned to multiple clusters fractionally. The Cluster 3.0 [4] software suite is used to obtain the results for K-Means clustering. The DBSCAN results are obtained using the WEKA software suite [16]. The AutoClass results are obtained using an implementation provided by [1].

In order for the clustering of the connections to occur, a similarity (or distance) measurement must be established first. While various similarity measurements exist, Euclidean distance is one of the most commonly used metrics for clustering problems [7, 16]. With Euclidean distance, a small distance between two objects implies a strong similarity, whereas a large distance implies a low similarity. In an n-dimensional space of features, the Euclidean distance between objects x and y is calculated as follows (see the sketch below):

dist(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )
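A minimal sketch of this metric (illustrative, not the authors' code; NumPy is assumed):

```python
# Euclidean distance between two n-dimensional feature vectors,
# matching dist(x, y) = sqrt(sum_i (x_i - y_i)^2).
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.sum((x - y) ** 2)))

x = np.array([2.1, 5.3, 0.7])
y = np.array([1.9, 4.8, 1.0])
print(euclidean_distance(x, y))  # small value => similar connections
```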
4. METHODOLOGY

4.1 Empirical Traces
To analyze the algorithms, we used data from two empirical packet traces. One is a publicly available packet trace called Auckland IV; the other is a full packet trace that we collected ourselves at the University of Calgary.

Auckland IV: The Auckland IV trace contains only TCP/IP headers of the traffic going through the University of Auckland's link to the Internet. We used a subset of the Auckland IV trace from March 16, 2001 at 06:00:00 to March 19, 2001 at 05:59:59. This subset provided sufficient connection samples to build our model (see Section 4.4).

Calgary: This trace was collected from a traffic monitor attached to the University of Calgary's Internet link. We collected this trace on March 10, 2006 from 1 to 2 pm. This trace is a full packet trace with the entire payloads of all the packets captured. Due to the amount of data generated when capturing full payloads, the disk capacity (60 GB) of our traffic monitor was filled after one hour of collection, thus limiting the duration of the trace.

4.2 Connection Identification
To collect the statistical flow information necessary for the clustering evaluations, the flows must be identified within the traces. These flows, also known as connections, are a bidirectional exchange of packets between two nodes.

In the traces, the data is not exclusively from connection-based transport layer protocols such as TCP. While this study focused solely on TCP-based applications, it should be noted that statistical flow information could be calculated for UDP traffic also. We identified the start of a connection using TCP's 3-way handshake and terminated a connection when FIN/RST packets were received. In addition, we assumed that a flow is terminated if the connection was idle for over 90 seconds.

The statistical flow characteristics considered include: total number of packets, mean packet size, mean payload size excluding headers, number of bytes transferred (in each direction and combined), and mean inter-arrival time of packets. Our decision to use these characteristics was based primarily on the previous work done by Zander et al. [17]. Due to the heavy-tailed distribution of many of the characteristics and our use of Euclidean distance as our similarity metric, we found that taking the logarithms of the characteristics gives much better results for all the clustering algorithms [13, 16]. A short sketch of this feature-extraction step follows.
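The sketch below is illustrative, not the authors' code: the `Connection` record and its field names are hypothetical stand-ins for the statistics listed above.

```python
# Turn an identified TCP connection into a log-scaled feature vector.
import math
from dataclasses import dataclass

@dataclass
class Connection:
    total_packets: int
    mean_packet_size: float   # bytes
    mean_payload_size: float  # bytes, headers excluded
    bytes_fwd: int            # bytes transferred in each direction
    bytes_bwd: int
    mean_interarrival: float  # seconds

def feature_vector(c: Connection) -> list[float]:
    raw = [
        c.total_packets,
        c.mean_packet_size,
        c.mean_payload_size,
        c.bytes_fwd,
        c.bytes_bwd,
        c.bytes_fwd + c.bytes_bwd,
        c.mean_interarrival,
    ]
    # log-transform to tame heavy-tailed distributions; +1 guards log(0)
    return [math.log(v + 1) for v in raw]

print(feature_vector(Connection(42, 512.0, 460.0, 15000, 3000, 0.12)))
```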
4.3 Classification of the Data Sets
The publicly available Auckland IV traces include no payload information. Thus, to determine the connections' "true" classifications, port numbers are used. For this trace, we believe that a port-based classification will be largely accurate, as this archived trace predates the widespread use of dynamic port numbers. The classes considered for the Auckland IV data sets are DNS, FTP (control), FTP (data), HTTP, IRC, LIMEWIRE, NNTP, POP3, and SOCKS. LimeWire is a P2P application that uses the Gnutella protocol.

In the Calgary trace, we were able to capture the full payloads of the packets, and therefore were able to use an automated payload-based classification to determine the "true" classes. The payload-based classification algorithm and signatures we used are very similar to those described by Karagiannis et al. [9]. We augmented their signatures to classify some newer P2P applications and instant messaging programs. The traffic classes considered for the Calgary trace are HTTP, P2P, SMTP, and POP3. The application breakdown of the Calgary trace is presented in Table 1.

[Table 1: Application breakdown of the Calgary trace (connections and bytes per traffic class); the table did not survive extraction intact.]

The breakdown of the Auckland IV trace has been omitted due to space limitations. However, HTTP is also the most dominant application there, accounting for over 76% of the bytes and connections.

4.4 Testing Methodology
The majority of the connections in both traces carry HTTP traffic. This unequal distribution does not allow for equal testing of the different classes. To address this problem, the Auckland data sets used for the clustering consist of 1000 random samples of each traffic class, and the Calgary data sets use 2000 random samples of each traffic category. This allows the test results to fairly judge the ability on all traffic and not just HTTP. The size of the data sets was limited to 8000 connections because this was the upper bound that the AutoClass algorithm could cluster within a reasonable amount of time (4-10 hours). In addition, to achieve a greater confidence in the results, we generated 10 different data sets for each trace. Each of these data sets was then, in turn, used to evaluate the clustering algorithms. We report the minimum, maximum, and average results from the data sets of each trace.

In the future, we plan on examining the practical issue of the best way to pick the connections used as samples to build the model. Some ways that we think this could be accomplished are random selection, or a weighted selection using different criteria such as bytes transferred or duration. Also, in order to get a reasonably representative model of the traffic, one would need to select a fairly large yet manageable number of samples. We found that the K-Means and DBSCAN algorithms are able to cluster much larger data sets (greater than 100,000 connections) within 4-10 hours.

5. EXPERIMENTAL RESULTS
In this section, the overall effectiveness of each clustering algorithm is evaluated first. Next, the number of objects in each cluster produced by the algorithms is analyzed.

5.1 Algorithm Effectiveness
The overall effectiveness of the clustering algorithms is calculated using overall accuracy. This overall accuracy measurement determines how well the clustering algorithm is able to create clusters that contain only a single traffic category. The traffic class that makes up the majority of the connections in a cluster is used to label the cluster. The number of correctly classified connections in a cluster is referred to as the True Positives (TP). Any connections that are not correctly classified are considered False Positives (FP). Any connection that has not been assigned to a cluster is labelled as noise. The overall accuracy is thus calculated as follows (a small computational sketch is given below):

overall accuracy = ( Σ TP over all clusters ) / ( total number of connections )

[Figure 1: Accuracy using K-Means]
[Figure 2: Accuracy using DBSCAN]
[Figure 3: Parametrization of DBSCAN]
[Table 2: Accuracy using AutoClass (minimum, maximum, and average overall accuracy per trace); the table did not survive extraction intact.]
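The following sketch mirrors the accuracy definition above; it is illustrative, not the authors' code:

```python
# Label each cluster by its majority traffic class, then compute
# overall accuracy as sum(TP over all clusters) / total connections.
from collections import Counter

def overall_accuracy(cluster_ids, true_classes) -> float:
    # cluster_ids: one assignment per connection (-1 = noise, as in DBSCAN)
    clusters: dict[int, list[str]] = {}
    for cid, cls in zip(cluster_ids, true_classes):
        clusters.setdefault(cid, []).append(cls)
    tp = 0
    for cid, members in clusters.items():
        if cid == -1:
            continue  # noise counts against accuracy
        tp += Counter(members).most_common(1)[0][1]  # majority-class count
    return tp / len(true_classes)

print(overall_accuracy([0, 0, 1, 1, -1], ["HTTP", "HTTP", "P2P", "SMTP", "P2P"]))
# 3 of 5 connections fall under their cluster's majority label -> 0.6
```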
5.1.1 K-Means Clustering
The K-Means algorithm has an input parameter K. This input parameter, as mentioned in Section 3.1, is the number of disjoint partitions used by K-Means. In our data sets, we would expect at least one cluster for each traffic class. In addition, due to the diversity of the traffic in some classes such as HTTP (e.g., browsing, bulk download, streaming), we would expect even more clusters to be formed. Therefore, the K-Means algorithm was evaluated with K initially being 10 and K being incremented by 10 for each subsequent clustering. The minimum, maximum, and average results for the K-Means clustering algorithm are shown in Figure 1.

Initially, when the number of clusters is small, the overall accuracy of K-Means is approximately 49% for the Auckland IV data sets and 67% for the Calgary data sets. The overall accuracy steadily improves as the number of clusters increases. This continues until K is around 100, with the overall accuracy being 79% and 84% on average for the Auckland IV and Calgary data sets, respectively. At this point, the improvement is much more gradual, with the overall accuracy improving by only an additional 1.0% when K is 150 in both data sets. When K is greater than 150, the improvement is further diminished, with the overall accuracy improving to the high 80% range when K is 500. However, large values of K increase the likelihood of over-fitting.

5.1.2 DBSCAN Clustering
The accuracy results for the DBSCAN algorithm are presented in Figure 2. Recall that DBSCAN has two input parameters (minPts, eps). We varied these parameters, and in Figure 2 report results for the combination that produced the best clustering results. The values used for minPts were tested between 3 and 24. The eps distance was tested from 0.005 to 0.040. Figure 3 presents results for different combinations of (minPts, eps) values for the Calgary data sets. As may be expected, a minPts of 3 produced better results than a minPts of 24, because smaller clusters are formed. The additional clusters found using three minPts were typically small clusters containing only 3 to 5 connections.

When using minPts equal to 3 while varying the eps distance between 0.005 and 0.020 (see Figure 2), the DBSCAN algorithm improved its overall accuracy from 59.5% to 75.6% for the Auckland IV data sets. For the Calgary data sets, the DBSCAN algorithm improved its overall accuracy from 32.0% to 72.0% as the eps distance was varied over these same values. The overall accuracy for eps distances greater than 0.020 decreased significantly as the distance increased. Our analysis indicates that this large decrease occurs because the clusters of different traffic classes merge into a single large cluster. We found that this larger cluster was for connections with few packets, few bytes transferred, and short durations. This cluster typically contained equal amounts of P2P, POP3, and SMTP connections. Many of the SMTP connections were for emails with rejected recipient addresses and connections immediately closed after connecting to the SMTP server. For POP3, many of the connections contained instances where no email was in the user's mailbox. Gnutella clients attempting to connect to a remote node and having their "GNUTELLA CONNECT" packets rejected accounted for most of the P2P connections.

[Figure 4: CDF of cluster weights]

5.1.3 AutoClass Clustering
The results for the AutoClass algorithm are shown in Table 2. For this algorithm, the number of clusters and the cluster parameters are automatically determined. Overall, the AutoClass algorithm has the highest accuracy. On average, AutoClass is 92.4% and 88.7% accurate in the Auckland IV and Calgary data sets, respectively. AutoClass produces an average of 167 clusters for the Auckland IV data sets, and 247 clusters for the Calgary data sets.

5.2 Cluster Weights
For the traffic classification problem, the number of clusters produced by a clustering algorithm is an important consideration, because once the clustering is complete, each of the clusters must be labelled. Minimizing the number of clusters is also cost effective during the classification stage.

One way of reducing the number of clusters to label is by evaluating the clusters with many connections in them. For example, if a clustering algorithm with high accuracy places the majority of the connections in a small subset of the clusters, then by analyzing only this subset a majority of the connections can be classified. Figure 4 shows the percentage of connections represented as the percentage of clusters increases, using the Auckland IV data sets. In this evaluation, the K-Means algorithm had K = 100. For the DBSCAN and AutoClass algorithms, the number of clusters cannot be set. DBSCAN uses 0.03 for eps and 3 for minPts, and has, on average, 190 clusters. We selected this point because it gave the best overall accuracy for DBSCAN. AutoClass has, on average, 167 clusters.

[Figure 5: Precision using DBSCAN, K-Means, and AutoClass]
As seen in Figure 4, both K-Means and AutoClass have more evenly distributed clusters than DBSCAN. The 15 largest clusters produced by K-Means contain only 50% of the connections. In contrast, for the DBSCAN algorithm the five largest clusters contain over 50% of the connections in the data sets. These five clusters identified 75.4% of the NNTP, POP3, SOCKS, DNS, and IRC connections with a 97.6% overall accuracy. These results are unexpected when considering that by only looking at five of the 190 clusters, one can identify a significant portion of the traffic. Qualitatively similar results were obtained for the Calgary data sets.

6. DISCUSSION
The DBSCAN algorithm is the only algorithm considered in this paper that can label connections as noise. The K-Means and AutoClass algorithms place every connection into a cluster. The connections that are labelled as noise reduce the overall accuracy of the DBSCAN algorithm because they are regarded as misclassified. We have found some interesting results by excluding the connections labelled as noise and examining only the clusters produced by DBSCAN. Figure 5 shows the precision values for the DBSCAN (eps = 0.02, minPts = 3), K-Means (K = 190), and AutoClass algorithms using the Calgary data sets. Precision is the ratio of TP to the total of TP and FP for a traffic class. Precision measures the accuracy of the clusters in classifying a particular category of traffic.

Figure 5 shows that for the Calgary data sets, the DBSCAN algorithm has the highest precision values for three of the four classes of traffic. While not shown for the Auckland IV data sets, seven of the nine traffic classes have average precision values over 95%. This shows that while DBSCAN's overall accuracy is lower than K-Means and AutoClass, it produces highly accurate clusters.

Another noteworthy difference among the clustering algorithms is the time required to build the models. On average, to build the models, the K-Means algorithm took 1 minute, the DBSCAN algorithm took 3 minutes, and the AutoClass algorithm took 4.5 hours. Clearly, the model building phase of AutoClass is time consuming. We believe this may deter systems developers from using this algorithm even if the frequency of retraining the model is low.

7. CONCLUSIONS
In this paper, we evaluated three different clustering algorithms, namely K-Means, DBSCAN, and AutoClass, for the network traffic classification problem. Our analysis is based on each algorithm's ability to produce clusters that have a high predictive power of a single traffic class, and each algorithm's ability to generate a minimal number of clusters that contain the majority of the connections. The results showed that the AutoClass algorithm produces the best overall accuracy. However, the DBSCAN algorithm has great potential because it places the majority of the connections in a small subset of the clusters. This is very useful because these clusters have a high predictive power of a single category of traffic. The overall accuracy of the K-Means algorithm is only marginally lower than that of the AutoClass algorithm, but K-Means is more suitable for this problem due to its much faster model building time. Ours is a work-in-progress, and we continue to investigate these and other clustering algorithms for use as an efficient classification tool.

8. ACKNOWLEDGMENTS
This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Informatics Circle of Research Excellence (iCORE) of the province of Alberta. We thank Carey Williamson for his comments and suggestions, which helped improve this paper.
9. REFERENCES
[1] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, USA, 1996.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[3] C. Dewes, A. Wichmann, and A. Feldmann. An analysis of Internet chat systems. In IMC'03, Miami Beach, USA, October 27-29, 2003.
[4] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster Analysis and Display of Genome-wide Expression Patterns. Proceedings of the National Academy of Sciences, 95:14863-14868, 1998.
[5] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD 96), Portland, USA, 1996.
[6] P. Haffner, S. Sen, O. Spatscheck, and D. Wang. ACAS: Automated Construction of Application Signatures. In SIGCOMM'05 MineNet Workshop, Philadelphia, USA, August 22-26, 2005.
[7] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, USA, 1988.
[8] T. Karagiannis, A. Broido, M. Faloutsos, and k. claffy. Transport Layer Identification of P2P Traffic. In IMC'04, Taormina, Italy, October 25-27, 2004.
[9] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: Multilevel Traffic Classification in the Dark. In SIGCOMM'05, Philadelphia, USA, August 21-26, 2005.
[10] A. McGregor, M. Hall, P. Lorier, and J. Brunskill. Flow Clustering Using Machine Learning Techniques. In PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004.
[11] A. W. Moore and K. Papagiannaki. Toward the Accurate Identification of Network Applications. In PAM 2005, Boston, USA, March 31-April 1, 2005.
[12] A. W. Moore and D. Zuev. Internet Traffic Classification Using Bayesian Analysis Techniques. In SIGMETRICS'05, Banff, Canada, June 6-10, 2005.
[13] V. Paxson. Empirically-Derived Analytic Models of Wide-Area TCP Connections. IEEE/ACM Transactions on Networking, 2(4):316-336, August 1994.
[14] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification. In IMC'04, Taormina, Italy, October 25-27, 2004.
[15] S. Sen, O. Spatscheck, and D. Wang. Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures. In WWW 2004, New York, USA, May 17-22, 2004.
[16] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
[17] S. Zander, T. Nguyen, and G. Armitage. Automated Traffic Classification and Application Identification using Machine Learning. In LCN'05, Sydney, Australia, November 15-17, 2005.
ORIGINAL PAPER
Sensitive electrochemical sensor for hydrogen peroxide using Fe3O4 magnetic nanoparticles as a mimic for peroxidase
Z. Zhang · H. Zhu · X. Wang · X. Yang (*)
State Key Laboratory of Electroanalytical Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Renmin Street 5625, Changchun, Jilin 130022, China
e-mail: xryang@

Experimental

Chemicals and reagents

3-Aminopropyltriethoxysilane (APTES) and glutaraldehyde (GA, 25% aqueous solution) were both purchased from Acros (http://www.acros.be/). Chitosan was provided by Generay Biotech Co., Ltd. All other chemicals were of analytical grade and used as received. The water used throughout the experiment was purified using a Milli-Q water purification system (Millipore).
The Order and Writing of the Hanyu Pinyin Letters
Hanyu Pinyin uses Latin letters and some additional marks to represent the pronunciation of Chinese.
Corresponding to the division of Chinese syllable structure in Chinese phonology (modern phonology), the formal structure of Hanyu Pinyin is likewise divided into three parts: initials, finals, and tones.
Below is the table of the 26 pinyin letters, carefully compiled by the editor; happy reading.
The 26 letters of the Hanyu Pinyin alphabet: Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz

Initials (shengmu): b [玻] p [坡] m [摸] f [佛] d [得] t [特] n [讷] l [勒] g [哥] k [科] h [喝] j [基] q [欺] x [希] z [资] c [雌] s [思] r [日] zh [知] ch [嗤] sh [诗] y [医] w [巫]

Finals (yunmu):
Simple finals: a [阿] o [喔] e [鹅] i [衣] u [乌] ü [迂]
Compound finals: ai [哀] ei [唉] ui [威] ao [奥] ou [欧] iu [由] ie [耶] üe [椰] er [儿]
Front nasal finals: an [安] en [恩] in [因] un [温]
Back nasal finals: ang [昂] eng [摁] ing [英] ong [雍]

Whole-syllable readings: zi ci si zhi chi shi ri yi wu yu yin ying yun ye yue yuan

Tone marks: yinping (first tone): - ; yangping (second tone): / ; shangsheng (third tone): ∨ ; qusheng (fourth tone): ﹨
According to the Alphabet Table of the Scheme for the Chinese Phonetic Alphabet, Hanyu Pinyin uses the 26 basic modern Latin letters, in both upper and lower case, in the same order as the English alphabet.
Among them, the letter V/v is designated in the scheme for "spelling foreign words, minority languages, and dialects."
Since the practical function of Hanyu Pinyin is limited to spelling Mandarin Chinese, this provision has long since fallen into disuse.
An inducible expression system permitting the efficient purification of a recombinant antigen from Mycobacterium smegmatis

James A. Triccas, Tanya Parish, Warwick J. Britton, Brigitte Gicquel

Unité de Génétique Mycobactérienne, Institut Pasteur, 25 rue du Dr. Roux, 75724 Paris Cedex 15, France
Department of Clinical Sciences, London School of Hygiene and Tropical Medicine, London, UK
Centenary Institute of Cancer Medicine and Cell Biology, Newtown, Australia
Department of Medicine, University of Sydney, Sydney, Australia

Received 7 July 1998; revised 6 August 1998; accepted 21 August 1998

Abstract

A novel expression vector utilising the highly inducible acetamidase promoter of Mycobacterium smegmatis was constructed. High-level induction of a model antigen, the Mycobacterium leprae 35 kDa protein, was demonstrated in recombinant M. smegmatis grown in the presence of the acetamidase inducer acetamide. The recombinant protein could be simply and efficiently purified from the bacterial sonicate by virtue of a C-terminal 6-histidine tag, demonstrating that this purification strategy can be used for the mycobacteria. The histidine tag had no apparent effect on the protein conformation or immunogenicity, suggesting that the vector described may prove useful for the purification of native-like recombinant mycobacterial proteins from fast-growing mycobacterial hosts. © 1998 Federation of European Microbiological Societies. Published by Elsevier Science B.V. All rights reserved.

Keywords: Expression system; Mycobacterium smegmatis; Protein purification; Acetamidase

1. Introduction

Fundamental to the analysis of the biological function and immunological relevance of mycobacterial proteins is their production in a recombinant form that resembles that of their native counterpart. Recent studies analysing both structure [1,2] and immunogenicity [1-3] of recombinant proteins obtained from fast-growing mycobacterial hosts, such as Mycobacterium smegmatis, have demonstrated superiority over the same protein purified from Escherichia coli expression systems. Although such approaches for the production of recombinant mycobacterial proteins appear advantageous, two major obstacles lie in the way of further improvement to these systems. The first is the inability to regulate high-level expression of foreign genes in M. smegmatis, analogous to systems such as induction of the lac promoter in E. coli [4]. Secondly, no simple, efficient and widely adaptable method for the purification of proteins from recombinant mycobacteria has been described. In this report, we attempt to resolve
these two problems. Firstly, we describe the construction of a vector, pJAM2, that utilises the promoter of the inducible acetamidase enzyme of M. smegmatis to drive high-level expression of foreign genes in M. smegmatis. Secondly, we demonstrate the simple and efficient purification of our model antigen by use of a poly-histidine tag and one-step Ni²⁺ affinity chromatography, indicating that this versatile purification system can be adapted for use with the mycobacteria. The addition of the histidine tag did not appear to affect the conformation or immunogenicity of the recombinant protein, suggesting the system described may prove extremely useful for the purification of structurally and immunologically intact recombinant mycobacterial proteins from fast-growing mycobacterial hosts.

2. Materials and methods

2.1. Construction of the acetamidase promoter expression vector pJAM2

The acetamidase promoter region was amplified from plasmid pAMI1, which contains the M. smegmatis NCTC 9449 inducible acetamidase gene and upstream region [5], by use of primers HIS5 (CACGGTACCAAGCTTTCTAGCAGA) and HIS7 (GTCAGTGGTGGTGGTGGTGGTGTCTAGAAGTACTGGATCCGAAAACTACCTCG). The resulting 1.5 kb fragment was cloned into plasmid pJEM12 [6] to give plasmid pJAM2 (Fig. 1). The coding region of the Mycobacterium leprae 35 kDa protein was amplified with primers JN8 (TAGCTGCAGGGATCCATGACGTCGGCT) and 35REV2 (GTGTCTAGACTTGTACTCATG) and cloned into the BamHI/XbaI sites of pJAM2, yielding pJAM4.

2.2. Expression and purification of recombinant histidine-tagged protein from M. smegmatis

Plasmid pJAM4 was introduced into M. smegmatis mc²155, and kanamycin-resistant colonies were grown in M63 medium (7.6 × 10⁻² M (NH₄)₂SO₄, 0.5 M KH₂PO₄, 5.8 × 10⁻⁶ M FeSO₄·7H₂O, pH 7) supplemented with 1 mM MgSO₄, 0.5% Tween-80 and 2% succinate for uninduced cultures, or 2% succinate and 2% acetamide for induced cultures. Bacteria were grown for 3 days, after which cells were harvested and sonicated 4 times for 1 min. Sonicates were analysed for expression of the M. leprae 35 kDa protein by SDS-PAGE and immunoblotting with the anti-35 kDa monoclonal antibody (mAb) CS38 (supplied by Professor Patrick Brennan, Colorado State University, Colorado, USA) using the ECL detection system (Amersham Int., Buckinghamshire, UK). For protein purification, the sonicate was applied to Ni-NTA resin (Qiagen Inc., CA, USA) and bound protein was washed consecutively with 5 mM, 20 mM and 40 mM imidazole in sonication buffer (1× PBS, 5% glycerol, 0.5 M NaCl and 5 mM MgCl₂). Protein was eluted with 200 mM imidazole in sonication buffer and dialysed against PBS. Non-histidine-tagged M. leprae 35 kDa protein derived from M. smegmatis and the E. coli 35 kDa 6-histidine fusion protein were purified as described previously [2].

2.3. Protein capture ELISA

ELISA plates were coated with the murine anti-M.
leprae 35 kDa mAb ML03 (50 μg ml⁻¹; supplied by Professor J. Ivanyi, Hammersmith Hospital, London, UK) and mycobacterial sonicates were added at a concentration range of 0.1 μg ml⁻¹ to 100 μg ml⁻¹. Plates were blocked with 3% BSA, washed, and a rabbit anti-35 kDa protein polyclonal antibody (1:1000) added. Binding was visualised using alkaline phosphatase-conjugated anti-rabbit IgG and p-nitrophenyl phosphate (NPP) (1 mg ml⁻¹). Protein amount was determined by comparison with purified M. leprae 35 kDa protein concentration standards [2].

2.4. Assessment of protein binding by lepromatous leprosy sera

Microtitre plates were coated with purified antigen (100 pg ml⁻¹ to 100 μg ml⁻¹) overnight at room temperature. Plates were washed, blocked with 3% BSA, and pooled lepromatous leprosy sera (diluted 1:100) added for 90 min at 37°C. Plates were washed, and alkaline phosphatase-conjugated anti-human IgG added for 60 min at 37°C. Binding was visualised by the addition of NPP (1 mg ml⁻¹) and absorbance measured at 405 nm.

3. Results

3.1. Construction of the pJAM2 vector and utilisation for over-expression of the gene encoding the 35 kDa antigen of M. leprae in M. smegmatis

The promoter region of the gene encoding the acetamidase of M. smegmatis NCTC 8159 permits the inducible expression of the enzyme in the presence of the substrate acetamide [4,7]. In order to determine if the promoter could regulate the expression of foreign genes placed under its control, the vector pJAM2 was constructed (Fig. 1). This plasmid contains approximately 1.5 kb upstream of the acetamidase coding region, DNA encoding the first six amino acids of the acetamidase gene, three restriction enzyme sites, and codons for 6 histidine residues. Thus this vector should allow for the inducible expression of cloned foreign genes, while also permitting simple purification of the recombinant protein by virtue of the poly-histidine tag. In order to validate the system, the coding region of the M. leprae 35 kDa protein was amplified and cloned into the BamHI/XbaI sites of pJAM2 to give plasmid pJAM4. This protein is a major antigen of M. leprae and represents a promising candidate as a leprosy-specific diagnostic reagent [2]. This cloning step resulted in a fusion protein consisting of the initial six amino acids of the acetamidase protein, the entire 35 kDa protein, and the 6 histidine residues at the C-terminus. Plasmid pJAM4 was introduced into M. smegmatis mc²155, and recombinant colonies grown in minimal media containing 2% succinate in the presence or absence of 2% acetamide. Sonicates were prepared and proteins analysed by SDS-PAGE. As shown in Fig. 2, left, a prominent band was visible at around 37 kDa in cells grown in acetamide plus succinate (lane 2), but absent from cells grown in succinate alone (lane 1). The 37 kDa band reacted in immunoblotting with mAb CS38, which is raised against the native M. leprae 35 kDa protein (Fig. 2, right, lane 2).

3.2. Quantitation of recombinant protein production in M. smegmatis harbouring the pJAM4 expression construct

In order to obtain an approximate measure of the level at which the 35 kDa protein was being produced by virtue of the acetamidase promoter in M. smegmatis/pJAM4, antigen capture ELISA was employed. As shown in Fig. 3, no protein was detected in M. smegmatis/pJAM4 grown in succinate alone. In the same strain grown in the presence of
acetamide, the 35 kDa protein represented approximately 8.6% of the total bacterial sonicate. The level of protein produced was comparable to that in M. smegmatis harbouring plasmid pWL19 [8], where expression of the 35 kDa protein gene is driven by the non-inducible β-lactamase promoter of Mycobacterium fortuitum, one of the strongest mycobacterial promoters characterised to date [6,9].

Fig. 1. Genetic organisation of the pJAM2 expression vector. A: vector map; B: nucleotide sequence of the multi-cloning site and surrounding regions. The Shine-Dalgarno sequence (SD) is shown in bold type.

Fig. 2. Inducible expression of the gene encoding the M. leprae 35 kDa protein in M. smegmatis in the presence or absence of the acetamidase inducer acetamide. Left: SDS-PAGE of bacterial sonicates and purified protein; right: immunoblotting of a corresponding gel with the anti-M. leprae 35 kDa mAb CS38. Lane 1, M. smegmatis harbouring pJAM4 grown in the absence of acetamide; lane 2, M. smegmatis harbouring pJAM4 grown in the presence of acetamide; lane 3, purified M. leprae 35 kDa protein.

3.3. Purification of histidine-tagged protein from recombinant M. smegmatis

We next determined whether the high-level expression driven by the M. smegmatis acetamidase promoter could allow efficient purification of the 35 kDa protein using the 6 histidine residues attached to its C-terminus. This system has been successfully used in a number of eucaryotic and procaryotic expression systems, and is favoured due to its simple and reliable purification procedure, coupled with minimal effects of the histidine tag on the target protein conformation, function and immunogenicity [10]. Although the use of this system in the mycobacteria had not been previously described, it seemed an ideal choice to allow the simple and rapid purification of structurally and immunologically intact recombinant mycobacterial proteins. Sonicates of M. smegmatis/pJAM4 grown in the presence of acetamide were added to Ni-NTA resin, the column washed consecutively with varying amounts of imidazole (5 mM, 20 mM and 40 mM) and bound protein eluted with 200 mM imidazole. This single-step procedure allowed a 35 kDa protein of predominantly a single species to be purified (Fig. 2, left, lane 3). The purified product was reactive with the anti-M. leprae 35 kDa protein mAb CS38 (Fig. 2, right, lane 3). The band directly beneath the 35 kDa protein most likely represents a degradation product, as this band is not detected in samples analysed immediately after protein purification. Therefore the strategy of Ni-NTA affinity chromatography by virtue of a poly-histidine tag can be utilised for the effective purification of recombinant proteins from mycobacteria.

3.4. Analysis of the effect of the histidine tag on recombinant protein conformation and immunogenicity

Previously it was demonstrated that recombinant forms of the M. leprae 35 kDa protein will only react with sera from leprosy patients if the protein is produced in a conformation that resembles that of the native antigen [2]. This property allowed us to test the effect, if any, of the histidine tag on the conformation of the recombinant 35 kDa protein. To assess this, three preparations of recombinant 35 kDa protein were used: the histidine-tagged version purified in this study, together with a non-histidine-tagged
Fig.3.Quantitation of the M.leprae protein produced in re-combinant M.smegmatis in the presence or absence of the acet-amidase inducer acetamide.Results are expressed as the meanvalueþS.E.M.of three experiments.Suc,succinate;Suc/Act,suc-cinate plus acetamide.J.A.Triccas et al./FEMS Microbiology Letters167(1998)151^156154version puri¢ed from M.smegmatis and an E.coli 35kDa6-histidine fusion protein.The two latter proteins were puri¢ed as described previously[2]. The binding of pooled lepromatous leprosy sera to these three forms of the35kDa protein were as-sessed by ELISA.As described previously[2],the sera were not reactive with the E.coli form of the 35kDa protein(Fig.4).By contrast,the35kDa histidine fusion protein puri¢ed from M.smegma-tis/pJAM4was avidly recognised by the sera.Fur-thermore,similar reactivity was exhibited towards the same protein puri¢ed from M.smegmatis con-taining no additional histidine residues,suggesting that the addition of the histidine tag had no appar-ent a¡ect on the conformation and indeed immuno-genicity of the recombinant protein.4.DiscussionThe47kDa acetamidase enzyme of M.smegmatis NCTC8159permits the growth of the organism on simple amides as the sole carbon source and is highly inducible in the presence of acetamide[5,7].This property has been previously used to assess luciferase as a reporter of gene expression in mycobacteria[11] and to develop a mycobacterial-conditional antisense mutagenesis system[12].In this study,we have con-structed a vector that allows for regulated high-level expression of foreign genes in mycobacteria by virtue of the M.smegmatis acetamidase promoter.Re-combinant M.leprae35kDa antigen produced in this system represented a considerable percentage of the total M.smegmatis soluble protein,with the amount of protein produced similar to that when the same gene is placed under the control of the strong mutated L-lactamase promoter of M.fortuitum(Fig.3).Of interest to note is that we achieved high-level induction of our model antigen using the initial 1.5kb of DNA upstream of the acetamidase struc-tural gene(Fig.1).Previous analysis of acetamidase regulation suggested that this initial1.5kb was not su¤cient for expression or induction of the enzyme, but elements contained within the DNA further up-stream were necessary[13].The reason for this dis-crepancy is unclear,but suggests that regulatory mechanisms associated with this enzyme are complex and require further evaluation.Of major importance in the study of microbial antigens is the ability to produce recombinant prod-ucts in a form that closely resembles their native state.In the case of mycobacteria,recent studies have highlighted the superiority of recombinant pro-tein puri¢ed from mycobacterial hosts compared to E.coli derived products,as assessed by structural and immunological analysis[1^3].In previous work,we demonstrated that sera from leprosy pa-tients would only recognise the M.leprae35kDa protein if the antigen was produced in a form that resembles the native protein,based on the binding of conformational dependent mAbs and FPLC size ex-clusion analysis[2].We recon¢rm such a¢nding with the same protein produced using the acetamidase promoter expression system(Fig.4).Furthermore, the addition of6-histidine residues to the C-terminus of the recombinant protein did not appear to e¡ect its conformation,as there was little di¡erence in the recognition of leprosy sera by histidine-tagged and non-histidine-tagged35kDa protein(Fig.4).The e¤cient expression of the6-histidine 
tag in mycobacteria and the simple and effective purification of our model protein by Ni-NTA affinity chromatography (Fig. 2) suggest that this versatile purification system, used successfully in a number of eucaryotic and procaryotic expression systems [10], could be more widely applied to mycobacterial proteins. Furthermore, the histidine purification system overcomes the problems involved with antibody affinity chromatography, used in a number of studies to purify recombinant mycobacterial proteins from mycobacterial hosts [2,3], such as the unavailability of appropriate antibodies or the presence of homologues capable of binding the antibody. Together, these results suggest an application for the pJAM2 expression vector in the production of native-like recombinant mycobacterial proteins that can be exploited to correctly analyse protein structure, function and antigenicity.

Acknowledgments

This work was supported by the National Health and Medical Research Council of Australia, NIH grant AI35207 and the European Community (Grant BMH4CT972167). J.T. was the recipient of an Institut Pasteur Cantarini Fellowship.

References

[1] Garbe, T., Harris, D., Vordermeier, M., Lathigra, R., Ivanyi, J. and Young, D. (1993) Expression of the Mycobacterium tuberculosis 19-kilodalton antigen in Mycobacterium smegmatis: immunological analysis and evidence of glycosylation. Infect. Immun. 61, 260-267.
[2] Triccas, J.A., Roche, P.W., Winter, N., Feng, C.G., Butlin, C.R. and Britton, W.J. (1996) A 35 kDa protein is a major target of the human immune response to Mycobacterium leprae. Infect. Immun. 64, 5171-5177.
[3] Roche, P.W., Winter, N., Triccas, J.A., Feng, C. and Britton, W.J. (1996) Expression of Mycobacterium tuberculosis MPT64 in recombinant M. smegmatis: purification, immunogenicity and application to skin tests for tuberculosis. Clin. Exp. Immunol. 103, 226-232.
[4] de Boer, H.A., Comstock, L.J. and Vasser, M. (1983) The tac promoter: a functional hybrid derived from the trp and lac promoters. Proc. Natl. Acad. Sci. USA 80, 21-25.
[5] Mahenthiralingam, E., Draper, P., Davis, E.O. and Colston, M.J. (1993) Cloning and sequencing of the gene which encodes the highly inducible acetamidase of Mycobacterium smegmatis. J. Gen. Microbiol. 139, 575-583.
[6] Timm, J., Lim, E.M. and Gicquel, B. (1994) Escherichia coli-mycobacteria shuttle vectors for operon and gene fusions to lacZ: the pJEM series. J. Bacteriol. 176, 6749-6753.
[7] Draper, P. (1967) The aliphatic acylamide amidohydrolase of Mycobacterium smegmatis: its inducible nature and relation to acyl-transfer to hydroxylamine. J. Gen. Microbiol. 46, 111-123.
[8] Winter, N., Triccas, J.A., Rivoire, B., Pessolani, M.C.V., Eiglmeier, K., Hunter, S.W., Brennan, P.J. and Britton, W.J. (1995) Characterization of the gene encoding the immunodominant 35 kDa protein of Mycobacterium leprae. Mol. Microbiol. 16, 865-876.
[9] Timm, J., Perilli, M.G., Duez, C., Trias, J., Orefici, G., Fattorini, L., Amicosante, G., Oratore, A., Joris, B., Frere, J.M., Pugsley, A.P. and Gicquel, B. (1994) Transcription and expression analysis, using lacZ and phoA gene fusions, of Mycobacterium fortuitum β-lactamase genes cloned from a natural isolate and a high-level β-lactamase producer. Mol. Microbiol. 12, 491-504.
[10] Crowe, J., Dobeli, H., Gentz, R., Hochuli, E., Stuber, D. and Henco, K. (1994) 6×His-Ni-NTA chromatography as a superior technique in recombinant protein expression/purification. Methods Mol. Biol. 31, 371-387.
[11] Gordon, S., Parish, T., Roberts, I.S. and Andrew, P.W. (1994) The application of luciferase as a reporter of environmental regulation of gene expression in mycobacteria. Lett. Appl. Microbiol. 19, 336-340.
[12] Parish, T. and
Stoker, N.G. (1997) Development and use of a conditional antisense mutagenesis system in mycobacteria. FEMS Microbiol. Lett. 154, 151-157.
[13] Parish, T., Mahenthiralingam, E., Draper, P., Davis, E.O. and Colston, M.J. (1997) Regulation of the inducible acetamidase gene of Mycobacterium smegmatis. Microbiology 143, 2267-2276.
The Ecological Approach to Text Visualization

James A. Wise
Integral Visuals, Inc., 2620 Willowbrook Avenue, Richland, WA 99352. E-mail: JamesAWise@

"Words and rocks contain a language that follows a syntax of splits and ruptures. Look at any word long enough and you will see it open up into ... a terrain of particles, each containing its own void ..."
Robert Smithson (1996)

This article presents both theoretical and technical bases on which to build a "science of text visualization." These conceptually produce "the ecological approach," which is rooted in ecological and evolutionary psychology. The basic idea is that humans are genetically selected from their species history to perceptually interpret certain informational aspects of natural environments. If information from text documents is visually spatialized in a manner conformal with these predilections, its meaningful interpretation to the user of a text visualization system becomes relatively intuitive and accurate. The SPIRE text visualization system, which images information from free text documents as natural terrains, serves as an example of the "ecological approach" in its visual metaphor, its text analysis, and its spatializing procedures.

This article both formalizes Smithson's evocative prose and responds to Steven Eick's recent challenge (Eick, 1997) to proceed to a real "science of information visualization." It describes the theoretical rationale and technical basis of two years of research investigations at Pacific Northwest National Laboratory (operated for the Department of Energy by Battelle, Inc.) on the Spatial Paradigm for Information Retrieval and Exploration (SPIRE) project, which the author co-created and managed.

The SPIRE project is funded by the Department of Energy and the U.S. intelligence agencies to determine if it is possible and practicable to find a means of "visualizing text" in order to reduce information processing load and to improve productivity for intelligence analysis. Most intelligence information is in prose, in the form of cables, reports, and articles, and it is not unreasonable for 30,000 documents to cross the electronic desk of an analyst every week. There is no way that a person could read, retain, and synthesize even one-half of 1% of these. Clearly, another way was needed to both represent the documents and their contents, while permitting their rapid retrieval, categorization, abstraction, and comparison, without the requirement to read them all.

What became known as the SPIRE project began in January 1994, with the somewhat loquacious title "Multidimensional Visualization and Browsing." The first product software, "Galaxies" (described by Crow et al., 1994), was produced in August of that year and delivered to the Army's Pathfinder Program for field tests. The second product software, "ThemeScapes™,"¹ was demonstrated in an alpha version at the Automated Intelligence Processing and Analysis Symposium in March of 1995 (Pennock & Lantrip, 1995; Wise & Thomas, 1995) and in a beta version at Information Visualization '95 in September of that year (Wise et al., 1995). Those short, descriptive publications focus on project and software performance, and hardly convey the breadth of theoretical rationale and depth of technical research that actually underlie the SPIRE project.
This article attempts to redress that shortcoming, placing the SPIRE visualizations in the context of their full "ecological approach" to text visualizations, acknowledging their technical bases in previously published work (e.g., Chalmers & Chitson, 1992; Chalmers, 1993; Pazner, 1994), and responding to Eick's (1997) challenge to provide a scientific foundation for information visualization.

The "Ecology" of Text Visualizations

Ecology may seem like a strange term to use in reference to visual information retrieval interfaces (VIRIs), as it is customarily understood to refer to the "science of the relations of organisms to their environment." Yet it is crucial to understanding the distinctive viewpoint that was taken from the beginning of the SPIRE project, which from its inception sought a coherent and comprehensive approach to the analysis and visualization of text in a manner that best utilized human analysts' native perceptual abilities. It posed the question: How should an analyst view and manipulate a visualization with features directly determined by text characteristics relevant to the analyst's task, and that appear in visual forms which everyone, both genetically and experientially, already knows how to interpret? The project's task was thus seen to be one of creating a "synthetic ecology" for prose that incorporated, analogously or literally, visual features of the natural world within which human visual perception has so successfully operated. This viewpoint on what text visualizations should be thus became the "ecological approach" in four ways:

● Steps from text analysis to visualization are tuned to reflect the needs of ecological vision.
● Visualizations are built as "emergent forms" in the manner that natural visual patterns originate.
● Visualizations access processes of humans' "ecological perception" and are thus intuitively interpreted.
● The visualization and the correspondent analyst's perception are co-determined or "enacted," within an ecological context of guided activity, analogous to processes of color perception (Thompson, Palacios, & Varela, 1992).

According to the "ecological approach," then, one surprising implication is that successful visualization of text requires that text analysis (the process by which words are re-represented mathematically so as to provide the computational basis for visualizations) is best undertaken by working backwards from the visualization itself to see what is computationally optimal for its support. This forgoes the traditional "information retrieval" view of text analysis, and instead sets up the comparative analogy of how the retina of the eye begins to construct visions of the natural world.

¹ ThemeScape™ is now a trademarked term of Cartia, Inc.
When text analysis is approached this way, some traditionally "intractable" information retrieval problems simply vanish.

Another inference of the "ecological approach" is that text visualizations ought to take advantage of the visual appearances of natural forms that humans have learned to interpret visually as part of the biological heritage from their species' history on the Earth. It was not accidental that the "Galaxies" visualization invoked the metaphor of documents as stars in the night sky, or that ThemeScape™s represented themes as sedimentary layers that together create the appearance of a natural landscape. These are fundamental visual experiences of our world that people have incorporated and responded to for eons. They both carry a natural interpretation that does not require instruction or prolonged training to appreciate and use.

The "ecological approach" also redirects attention away from seeing a visualization as an "illustration" of a preexistent semantic form, and correctly characterizes it as the end result of an interactive process between the observer and the informational content of the prose. As Stoner (1990) so tersely expressed it: "Structure represents the product of information interacting with matter." A text visualization's structure should then result from and reflect a process history, being intimately bound up with "how it got that way." In the "ecological approach," text visualization is not a process of "drawing pictures." It is a result of transferring to the spatial realm the results of computational processes that are themselves analogs of the means by which physical forms are produced.

Finally, the "ecological approach" analogously resurrects and adopts Gibson's (1979) view of perception as being mediated and guided by the actions of the observer. This directly addresses the human-computer interaction (HCI) issues of text visualizations and suggests that the electronically mediated means of exploring visualizations should reproduce, as far as possible, the ways we sensually explore objects in the natural world. This became the least well-developed aspect of the SPIRE software, although the ThemeScape™ probe tool (Wise et al., 1995) and other work on "intuitive user interfaces" (Wise, 1996; Lopresti & Harris, 1996) show how gesturing and audition can greatly enhance the usefulness and experience of information visualizations.

The SPIRE Process of Text Visualization

As implemented in the SPIRE project, there was a five-step process for preparing a text visualization (Fig. 1):

1. The system received a corpus of unstructured, digitized text documents. There was no use of keywords, no dictionary, no preestablished topics or themes extracted, and no predefined structure to the text that would have tied the resultant visualizations to any particular text analysis. As prepared for use by the Intelligence Community, the visualizations were meant to optimally handle news stories, resumes, e-mail, letters, abstracts, short articles, communiques, etc. The upper limit on the number of documents that can be processed in such a corpus is set by a number of factors irrelevant to this article. These include the processing power of the computer, the screen space of the display (for the Galaxies visualization), the text analysis method and projection algorithm used, and the time demands of the analyst's need. But visualizations of corpora numbering up to 6K documents were routine in the project's investigations. While this is a relatively small number of documents with respect to other systems, the goal of the research was to develop new approaches to text visualization, and problems of scaling to large document sets were left for later study.
2. The digitized documents were then analyzed (via a resident text engine) to characterize them as high-dimensional vectors. SPIRE used the vector space model (see Salton, 1991) exclusively. In such a model, each text document is represented as a high-dimensional vector, which can be constructed using a variety of techniques. The exact means is not important as long as a statistically "rich," reasonably sized dimensional representation results. The two commercial text engines used in this research represent somewhat polar approaches to the same problem of vector construction.

The first Galaxies visualization in 1994 relied on a commercially available text engine that utilized a dictionary of 200,000 words. If a word in the dictionary appeared in the document, a "1" appeared at that place in the vector; otherwise a "0" was assigned. This resulted in each document being represented by a binary vector 200,000(!) units long, which significantly restricted the kinds of manipulations one can perform (and subsequently the kind of visualization one can produce). The lesson is that text analysis approaches which may be perfectly suitable for information retrieval may be relatively unsuitable for text visualization.

When the ThemeScape™ visualization was prepared (early 1995), the text analysis was performed by the Matchplus™ text engine created by HNC, Inc. This engine used a neural net trained on a document corpus within a given domain (general news or financial articles, etc.). The neural net had 280 output nodes, resulting in a 280-dimensional vector for each document. With their continuous numerical output, the 280 nodes also provided a much improved document representation, with shorter vectors that allowed the ThemeScape™ visualization to be built. Using the experience gained with these text engines, the SPIRE team was later able to develop its own text engine, optimized for visualization purposes (see section: Text Analysis from a Visualization Basis).

Given any of these ways of constructing a vector space over the documents, a metric is then placed on that space to represent similarity in the content of the documents. The most common ones are the Euclidean distance or cosine measures (Salton & McGill, 1983). The first is based on the sum of the squares of the differences between a pair of documents on every dimension. The second is based on the differences in the angles of the documents' vectors from the origin of the space. Finally, the document vectors are normalized for the size of the document, and the next step of clustering the documents can begin.

3. Using the normalized document vectors, the documents were then clustered in the high-dimensional space (Frakes & Baeza-Yates, 1992). This produced topical groups of documents, where each document was assigned to only one cluster. The SPIRE visualizations were not designed to be dependent on any one particular clustering method. The K-means and complete-linkage hierarchical clustering approaches were both studied in depth, and found to be satisfactory for up to ~5K document sets. Their algorithms are widely available.

4. The next step was to project the high-dimensional document vectors and their cluster centroids down onto a two-dimensional plane. This plane provided the groundplan for both the Galaxies and ThemeScape™ visualizations. Again, different projection techniques are available.
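To make the vector-space scheme in step 2 concrete, here is a minimal Python sketch (not SPIRE's actual text engine; the whitespace tokenizer and toy vocabulary are assumptions for illustration) that builds term-frequency vectors, normalizes them for document length, and compares documents with the Euclidean and cosine measures just described.

```python
import math
from collections import Counter

def doc_vector(text, vocabulary):
    """Represent a document as term frequencies over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def euclidean(u, v):
    """Square root of the sum of squared differences on every dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Similarity based on the angle between the two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def normalize(v):
    """Normalize for document size so long documents do not dominate."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v] if length else v

docs = ["the market rallied on strong earnings",
        "earnings reports lifted the stock market",
        "volcanic sediment forms layered terrain"]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [normalize(doc_vector(d, vocab)) for d in docs]
print(cosine(vectors[0], vectors[1]))  # high: similar topics
print(cosine(vectors[0], vectors[2]))  # low: dissimilar topics
```

On this toy corpus, the two earnings stories score far higher cosine similarity with each other than either does with the geology sentence, which is the property the subsequent clustering step relies on.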
For small (up to 1.5K) document sets, multidimensional scaling analysis (MDS) (see Shepard, 1962a, b) is sufficient. Larger document sets required development of the team's own projection algorithm, which we called "Anchored Least Stress" (ALS).

5. The final step was the construction and display of the Galaxies and ThemeScape™ visualizations based on the positions of the documents in the 2-D groundplane. While Galaxies represented the documents directly as blue-green "docustars" in a night sky with orange cluster centroids, ThemeScape™ used the document positions as points from which to build up a landscape representation when thematic terms taken from the documents were successively layered over the groundplane.

The complete sequence of those five steps is schematically represented in Figure 2.

Clustering and Projection of Documents for Visualizations

Since similarity of document content corresponds to proximity of document placement in the high-dimensional space, preserving high-dimensional spatial relations among documents in their clustering and projection to "visible" spaces is essential in this and similar schemes of text visualization. The simple requirement is that proximity of visible Euclidean distances in the plane be proportional to distances and topical similarities among the documents in the high-dimensional representation.

FIG. 1. The Galaxies and ThemeScape™ text visualizations of the SPIRE system (1995).

The clustering approach designed to handle large document sets was developed primarily by Jeremy York of the SPIRE team. It was called "Fast Divisive Clustering." The process begins with the analyst selecting the number of clusters he or she wants to contain all of the documents to be visualized. This number needs to be heuristically determined by the analyst's experience and prior knowledge of the document set, and could itself be the result of a more formal, Bayesian analysis. This number sets the number of cluster "seeds" distributed in the high-dimensional space. The seeds are distributed randomly, and subspaces are then sampled to ensure that seeds have not inadvertently ended up too close to one another. Then non-overlapping hyperspheres are defined around each cluster seed, and all documents in the high-dimensional space whose coordinates fall within a cluster's hypersphere are assigned to that cluster. Through an iterative procedure, the center of mass defining a new centroid is calculated for each cluster, which shifts its corresponding hypersphere. As hyperspheres shift, documents drop out and are assigned to correspondingly new clusters for the hyperspheres that now enfold them. After a few iterations, the cluster centroids stop shifting above an assigned threshold, and documents take their final cluster memberships. This approach remained under refinement and was experimental until the end of government fiscal year 1996, when SPIRE development was transferred to a privately funded company outside the laboratory.
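The published description of Fast Divisive Clustering leaves several details unspecified (the hypersphere radius, the subspace sampling of seeds, the convergence threshold), so the following is only a schematic sketch of the assign-and-recompute loop under assumed values for those parameters; in this stripped-down form it behaves much like K-means restricted to hyperspheres.

```python
import numpy as np

rng = np.random.default_rng(0)

def fast_divisive_clustering(docs, k, radius=8.0, tol=1e-3, max_iter=50):
    """Schematic sketch: random seeds, hypersphere assignment, centroid shifts.

    docs: (n, d) array of document vectors; k: analyst-chosen cluster count.
    radius and tol are assumed parameters not specified in the article.
    """
    # Distribute k cluster seeds randomly in the high-dimensional space.
    seeds = docs[rng.choice(len(docs), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assign each document to the nearest seed whose hypersphere encloses it;
        # documents outside every hypersphere stay unassigned this round.
        dists = np.linalg.norm(docs[:, None, :] - seeds[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        inside = dists[np.arange(len(docs)), nearest] <= radius
        # Recompute each cluster's center of mass; this shifts its hypersphere.
        new_seeds = seeds.copy()
        for j in range(k):
            members = docs[(nearest == j) & inside]
            if len(members):
                new_seeds[j] = members.mean(axis=0)
        # Stop once no centroid shifts above the assigned threshold.
        if np.linalg.norm(new_seeds - seeds, axis=1).max() < tol:
            break
        seeds = new_seeds
    return nearest, seeds

labels, centroids = fast_divisive_clustering(rng.normal(size=(200, 50)), k=5)
```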
While MDS is a "tried and true" technique in the psychometrics of information retrieval (see Rorvig, 1988), it has severe shortcomings for dimensionality reduction as document sets become large. Multidimensional scaling uses pairwise distances (Euclidean or cosine angle) and attempts to minimize a measure of the differences in pairwise distances ("stress") between high- and low-dimensional document positions. The intent is to preserve distance relations between documents in the high-dimensional space as they are projected into the two-dimensional one. As the number of documents, n, grows, the number of pairwise distances to be considered grows quadratically, producing a steep increase in computational complexity. This significantly increases the requisite processing time. The first Galaxies runs in 1994, on Sparc 3 workstations, could take 12 hours for a few hundred documents when using binary vector representations.

York (1995) cleverly found a way around the MDS projection bottleneck through the ALS approach. Beginning with cluster centroids (that are two-dimensional) based on an initial clustering of the documents, a document's iterative projection and placement is based on a vector of its distances to the different cluster centroids, not its pairwise distances to all other documents. The document is ultimately placed in the 2-D plane so that its position reflects its similarity to every cluster, not every other document. This used a computationally simple linear regression solution that constructed a new vector for every document containing the distances of that document to each cluster centroid, then minimized the squared differences between the observed and fitted distances to the centroids. Linear Principal Components Analysis was used to initially project the cluster centroids onto the 2-D plane. Its algorithms are widely available.

Anchored Least Stress also made a qualitative difference in the way it treated distances with respect to traditional MDS. In MDS, fitting all of the pairwise distances among documents means that small deviations among points are placed at high relative importance. Under ALS, it is the large deviations that are considered important. Small-scale differences in the projected placement of points onto the 2-D plane are sacrificed somewhat in order to arrive at a better and faster overall large-scale solution. Since so much information is necessarily lost in compressing high-dimensional spaces down to the 2-D plane anyway, this seems a worthwhile price to pay to overcome a major computational bottleneck to the visualization process.

FIG. 2. The sequence of steps leading to the text visualizations of the SPIRE system.
FIG. 3. The first "Galaxies" visualization software product displaying documents of a technology database.
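The heart of ALS is that each document is fitted against the k projected cluster centroids rather than against all n−1 other documents. The sketch below illustrates that objective for a single document; it substitutes a generic nonlinear least-squares solver for York's linear regression formulation, so it is a sketch of the idea, not the published algorithm, and the example centroid coordinates and target distances are invented.

```python
import numpy as np
from scipy.optimize import least_squares

def als_place_document(centroids_2d, target_dists):
    """Place one document in the plane so that its distances to the 2-D
    cluster centroids approximate its high-dimensional distances to them.

    centroids_2d: (k, 2) centroid positions, e.g., from a PCA projection.
    target_dists: (k,) distances from the document to each centroid in the
    original high-dimensional space.
    """
    def residuals(xy):
        # Differences between fitted (2-D) and observed (high-D) distances.
        return np.linalg.norm(centroids_2d - xy, axis=1) - target_dists

    # Start from the centroid the document is closest to.
    x0 = centroids_2d[np.argmin(target_dists)]
    return least_squares(residuals, x0).x

centroids = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(als_place_document(centroids, np.array([1.0, 3.2, 2.9])))
```

Because the fit involves only k centroid distances, the per-document cost does not grow with corpus size, which is precisely what removes the MDS bottleneck.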
Construction of Visualizations

The original Galaxies visualization was essentially a "starfield" of documents, a type of display seen previously in visualizations like "FilmFinder" (Ahlberg & Shneiderman, 1994) and IVEE (Ahlberg & Wistrand, 1995), and now commercialized in a product called "Spotfire." The ubiquity and usefulness of scatterplot and starfield type displays demonstrate how well even a simple visualization can aid human problem solving, particularly when it is accompanied by an effective user interface for selective interactions. The enthusiastic reception given by the intelligence-community users to even the first-generation Galaxies visualization tool (Hendrickson, 1995) demonstrates both the power of visualization and the value of an aesthetically rendered "ecological" visual metaphor that is intuitively apprehended by the analyst.

The particular value of Galaxies as a first software product for the PNNL research was that it demonstrated the usefulness of document visualization for analysts' tasks, and strengthened the resolve to seek further visualization metaphors derived from the visual and cognitive processes that enable spatial interactions with the natural world. Within four months of the delivery of Galaxies, this effort resulted in a spatialization of unstructured document information derived from GIS techniques that formed a landscape representation we called a ThemeScape™.

Construction of ThemeScape™ Type Text Visualization

A ThemeScape™ is a surface plot, similar to that of Chalmers (1993), built up by successively layering computed contributions of recovered theme terms over underlying document positions (see Pazner, 1994). It is constructed directly from the distribution of documents in the Galaxies two-dimensional plane.

First, the characteristic thematic terms that describe each cluster of documents (or the visualized corpus as a whole) are identified on the basis of their discriminability across regions of the high-dimensional document space. The metric is the common term frequency, inverse document frequency weighting scheme first proposed by Salton (1991):

$$\mathrm{TermValue}_n = f_{n,i} \times \sum_{j \neq i} \frac{1}{f_{n,j}}$$

where $f_{n,i}$ is the frequency of term $n$ in cluster $i$, and the sum runs over the frequencies $f_{n,j}$ of term $n$ in all other clusters $j$. This yields terms which best discriminate the clusters from each other. In repeated term extractions of this kind, over 90% of recovered terms are usually nouns, which meets the goal of building the landscape from topical identities.

FIG. 4. The process of constructing a ThemeScape™ type representation that mimics sedimentary deposition.
FIG. 5. A hypothetical view of a core probe from a point on a thematic landscape.
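Read literally, the weighting reconstructed above scores a term for cluster i by multiplying its frequency in that cluster by the summed reciprocals of its frequencies in every other cluster. A small sketch follows; the epsilon guard for terms absent from a cluster is an assumption, since the article does not say how zero frequencies were handled.

```python
def term_value(freq, term, i, eps=1e-9):
    """Value of `term` for cluster i: frequent here, rare elsewhere.

    freq: dict mapping cluster id -> {term: frequency}.
    eps guards against division by zero for absent terms (an assumption;
    the article does not specify how zero frequencies were treated).
    """
    f_here = freq[i].get(term, 0)
    return f_here * sum(1.0 / (freq[j].get(term, 0) + eps)
                        for j in freq if j != i)

freq = {
    0: {"lava": 9, "market": 1},
    1: {"lava": 1, "market": 8},
}
# "lava" discriminates cluster 0; "market" does not.
print(term_value(freq, "lava", 0), term_value(freq, "market", 0))
```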
The number of terms selected in this process can vary, and ThemeScape™s from as few as 50 terms were constructed in first efforts, but low numbers of terms degrade landscape detail differences considerably, and so 150–300 terms are recommended.

A list of these thematic terms with their corresponding document coordinate pairs provides the basis for depositing each term's contribution to the height of the landscape, as shown in Figure 4. The terms are used like layers of sedimentary strata, wherein each term's layer varies in thickness with the probability of finding that term within a document at each point in the 2-D plane. The term layers are then summed and normalized to produce the composite thematic landscape visualization.

The summary vector at any point in the thematic landscape is equal to the sum of the unitary document vectors within a selected analysis region. If there are, for example, 12 documents that contain a thematic term within the region, then the summary vector is 12 units high and placed at the center of the region. This placement assumes that the probability of finding a thematic term $t_i$ in that region is $12/\sum t_i$ at the center and zero elsewhere in the region.

As a final step, a smoothing filter is passed over the summary vectors of the different heights of the thematic terms to produce a more natural-appearing landscape form. This has the effect of decreasing the height of the central peak vector in each analysis region and distributing the probability according to the smoothing function employed. Different smoothing schemes can emphasize or decrease differentiation of the landscape, and can be adjusted to facilitate an analyst's tasks. A useful starting scheme is a variation of the standard Gaussian function that places it at the center of a document analysis region. Where the standard Gaussian function is

$$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

the adjusted one would be

$$\frac{N}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x}{\mu}\right)^2}.$$

This removes the $\mu$-term that centers the smoothing, while scaling the height to the number of documents found within the analysis region through the $N$ term in the numerator. Placing the mean, $\mu$, in the exponent denominator has the effect of leveling out the landscape in regions where the mean values are large, as when many documents occur around the edges of the analysis region. Overall, documents distributed around the edges of a region will flatten it out (while raising the overall landscape height), and documents located near the center will tend to produce a peak or pinnacle in the landscape.

In a ThemeScape™, a term layer is thickest at the highest density of documents that carry that term, because the probability of finding that term there is correspondingly greater. If the clustering and projections of documents onto the 2-D plane are accurate, documents containing the same thematic terms should be in roughly the same place. As term layers accumulate, the highest elevations occur where the thickest layers overlay each other. Lower regions reflect places where there are fewer documents or where the documents are less thematically focused. When there is a sharp transition from strong thematic term content to low content in the distributed documents, there will be a correspondingly sharp cliff in the ThemeScape™, while a ridgeline connecting two peaks indicates strong themes that are held in common by two different thematic concentrations.

The smoothed height for any point $x$ on the ground plane is given by

$$y_x = \sum_{n=-m}^{+m} d_{x+n}\, f(x+n)$$

where $d_{x+n} = 1$ for a document at coordinate $x+n$ and 0 otherwise, $f(x+n)$ is the value of the smoothing function at $x+n$, and $2m$ is the width of the smoothing function when centered on any $x$.

The final height, $z_{x,y}$, of any point on a thematic landscape is given by the sum of the heights of all of the term layers, each of which corresponds to its own "mini-ThemeScape™" at that point. A final normalization is then usually added:

$$z_{x,y} = \sum_{j=1}^{J} \mathrm{termlayer}_j(x,y)$$

where $J$ is the number of cluster terms. The result is a thematic landscape that has literally embodied the content information of a document corpus, and it may be treated in most respects like a sedimentary form, including taking a probe or "core sample" at any point in the landscape that reveals the corresponding terms and their percentage contribution to the ThemeScape™ at that location (Fig. 5).

However, a thematic landscape constructed in this fashion differs from a sedimentary landscape in one crucial way: the layering of terms has nothing to do with the age of the documents wherein the terms appear. Thus, a "core sample" of terms can be arranged to read from maximum to minimum contribution (the usual display method) or in any other order. Users can sometimes take the "sedimentary" deposition analog a bit too literally, and infer that the arrangement of terms in a probe corresponds to dates in documents. This is one instance where a naturalistic landscape metaphor can potentially mislead a viewer, but it demonstrates again the intuitive power of such a visualization.

A ThemeScape™ thus creates a full spatial representation of terms from documents, as opposed to the documents themselves as in Bead (Chalmers & Chitson, 1992). Such a thematic terrain synthesizes a mimic of the natural physical form-giving process of sedimentation. Other physical processes like accretion, condensation, and growth would seem to offer other useful bases on which to build spatializations of textual information.
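A toy reconstruction of the deposition-and-smoothing scheme described above, on a discretized 1-D groundplane with a Gaussian smoothing kernel, is given below. The grid, the kernel width, and the document coordinates are all invented for illustration; ThemeScape's actual 2-D implementation is not reproduced here.

```python
import numpy as np

def term_layer(doc_positions, grid, sigma=1.0):
    """One term's sedimentary layer over a 1-D groundplane.

    Each document carrying the term deposits a smoothed unit of height at
    its coordinate, i.e., y_x = sum_n d_{x+n} * f(x+n) with a Gaussian f.
    sigma is an assumed smoothing width; the article leaves it adjustable.
    """
    layer = np.zeros_like(grid, dtype=float)
    for p in doc_positions:
        layer += np.exp(-0.5 * ((grid - p) / sigma) ** 2)
    return layer

grid = np.linspace(0, 10, 101)
# Two thematic terms with their document coordinates on the plane.
layers = [term_layer([2.0, 2.5, 3.0], grid), term_layer([2.8, 7.0], grid)]
# The final height is the sum of all term layers, normalized as a last step.
z = sum(layers)
z /= z.max()
print(float(grid[z.argmax()]))  # peak where thick layers overlay each other
```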
Text Analysis From a Visualization Basis

The experience and insights gained from building landscape representations of text revealed fundamental shortcomings in the way that statistically based text analysis was undertaken by current software systems. All of these had their origins in the information retrieval (IR) research tradition, and none of them had been designed explicitly for purposes of visualizing their outputs. For example, the IR tradition uses Precision & Recall measures from a document corpus in response to a query as indicants of a text engine's effectiveness. Ideally, a query should acquire all relevant documents and no spurious ones, implying that a visualization should be equally selective. This is expensive in terms of time and computational resources. Why not simply visualize the entire corpus of meanings, and then select as needed?

Through ecological vision, every creature extracts from its physical setting exactly what it needs for its immediate environmentally directed behaviors. Vision is highly selective (even in humans) from the retina of the eye onwards to more central processing centers. At the earliest stages of perception, assemblages of neurons in the retina are prewired to extract such things as corners, crossings, and edges from the visual field. These first-order extractions are called "textons" (Treisman, 1986). They then become the primitives upon which visual perception is further constructed, without the need for a total internal computational duplication of the external environment.

The process of ecological perception provides clues as to how text analysis should proceed if its express purpose is to support text visualization. Specifically, text analysis should:

● mimic the sequential extraction of information that occurs in ecological vision, rather than require holistic, upfront processing
● create a vector representation of the document that is "interpretable" by being constructed on the contributions of topical or thematic content in the document
● map the topical content of documents directly into a geographical landscape representation
● be scalable as larger document collections are processed
● provide a basis for altered visualizations of the information for different users and purposes (environments appear different to us under different behavioral intentions or "perceptual sets"; similarly, why should we preconceive that there is only one "correct" visualization of the text information in a document corpus?)
● incrementally update the analysis with the addition of new documents

This implies that the vector representation should be built directly on extracted topical terms in the documents, which become the "textons" of the system. It contrasts strongly with the uses of neural net algorithms or dictionaries in other text visualization systems. Neural nets require extensive (and expensive) up-front training on the entire document corpus, do not give interpretable vectors from their output nodes, and tend to weight the output nodes equally in their contributions to the vectorization. They also require retraining as the document corpus is updated. Dictionary-based vectorizations can be extremely limited in application, and require constant updating to remain current in technical fields.
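The contrast drawn in this paragraph can be made concrete: once a small set of topical terms (the system's "textons") has been selected, a document's vector is simply its counts of those terms, so every dimension is directly interpretable and a new document can be vectorized with no retraining. A minimal sketch, with a hypothetical term list:

```python
from collections import Counter

TOPICAL_TERMS = ["earthquake", "magma", "tariff", "election"]  # hypothetical

def topical_vector(text):
    """Interpretable vector: dimension j simply counts topical term j."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in TOPICAL_TERMS]

# Adding a new document requires no retraining, unlike a neural net engine:
print(topical_vector("magma flows triggered an earthquake near the coast"))
```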
Construction of a Text Engine Based on Ecological Vision

This effort was led by Kelly Pennock of the SPIRE team, from October of 1995 to October of 1996, when SPIRE was transferred to a privately funded company outside the laboratory. The successful research completed to that point seemed to fully justify the guiding principles, and had resulted in an intriguingly novel approach that we felt represented a paradigm shift in text analysis. This visually based analysis system centered on finding a limited set of topical terms in the document set for a vector representation: terms frequent enough to span the documents, and yet discriminating enough to capture the documents' distinctive content. It proceeded as follows:

Step 1 compressed the vocabulary for documents in the corpus by performing stop word removal and stemming. This step is similar to that taken in other text analysis approaches, and the algorithms for it are widely available.

Step 2 performed a band-pass frequency filtering on the document corpus, eliminating high- and low-frequency terms. High-frequency terms occur too often to discriminate among document content. Low-frequency terms are also non-discriminating and, in addition, produce unreliable statistics. Term frequencies depend upon the size of the corpus, but we found it useful with general news stories to filter so as to retain terms that occur more than 3–5 times in a document, which reduces the terms (after stemming and stop word removal) to a vocabulary 10–15% of its original size. This is based on our findings that, within the ~15% or so of middle-frequency terms, there are ~15% that show significant topical value.

Step 3 was the important filtering step that extracted significant topical terms (after Bookstein, Klein, & Raita, 1995), based on the condensation clustering value (CCV) of the terms surviving band-pass filtering. The CCV measures the degree of randomness in the appearance of a term in documents. Terms that are highly topical in their content (like nouns) tend to occur in bursts or serial clusters in language use, and are not randomly distributed throughout a document or documents. Terms that are less topical (like modifiers) tend to be more randomly distributed. The calculated CCV of any term (word) occurring in a document thus quantifies its potential topical value.
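The article gives the intuition behind the CCV (topical terms occur in bursts rather than at random) but not its formula, so the sketch below substitutes a simple stand-in burstiness statistic, the variance-to-mean ratio of a term's per-document counts, purely to show where such a filter sits in the pipeline after band-pass filtering. The frequency cutoffs and threshold are assumed values, not those used by SPIRE.

```python
from collections import Counter

def band_pass(doc_term_counts, low=3, high=200):
    """Keep middle-frequency terms; drop very rare and very common ones.
    low/high are assumed cutoffs; the article tuned them per corpus."""
    totals = Counter()
    for counts in doc_term_counts:
        totals.update(counts)
    return {t for t, n in totals.items() if low <= n <= high}

def burstiness(term, doc_term_counts):
    """Variance-to-mean ratio of per-document counts: a stand-in for the
    CCV, whose exact formula is not reproduced in the article."""
    xs = [counts.get(term, 0) for counts in doc_term_counts]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return var / mean if mean else 0.0

def topical_terms(doc_term_counts, threshold=1.0):
    """Terms that survive band-pass filtering and occur in bursts."""
    return {t for t in band_pass(doc_term_counts)
            if burstiness(t, doc_term_counts) > threshold}

docs = [Counter("magma magma vent magma".split()),
        Counter("the market rose the".split()),
        Counter("magma erupted near the vent".split())]
print(topical_terms(docs))  # bursty "magma" survives; uniform "the" does not
```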
The Chinese Room: Just Say "No!"

To appear in the Proceedings of the 22nd Annual Cognitive Science Society Conference (2000), NJ: LEA

Robert M. French
Quantitative Psychology and Cognitive Science
University of Liège
4000 Liège, Belgium
email: rfrench@ulg.ac.be

Abstract

It is time to view John Searle's Chinese Room thought experiment in a new light. The main focus of attention has always been on showing what is wrong (or right) with the argument, with the tacit assumption being that somehow there could be such a Room. In this article I argue that the debate should not focus on the question "If a person in the Room answered all the questions in perfect Chinese, while not understanding a word of Chinese, what would the implications of this be for strong AI?" Rather, the question should be, "Does the very idea of such a Room and a person in the Room who is able to answer questions in perfect Chinese while not understanding any Chinese make any sense at all?" And I believe that the answer, in parallel with recent arguments that claim that it would be impossible for a machine to pass the Turing Test unless it had experienced the world as we humans have, is no.

Introduction

Alan Turing's (1950) classic article on the Imitation Game provided an elegant operational definition of intelligence. His article is now exactly fifty years old and ranks, without question, as one of the most important scientific/philosophical papers of the twentieth century. The essence of the test proposed by Turing was that the ability to perfectly simulate unrestricted human conversation would constitute a sufficient criterion for intelligence. This way of defining intelligence, for better or for worse, was largely adopted as of the mid-1950's, implicitly if not explicitly, as the overarching goal of the nascent field of artificial intelligence (AI).

Thirty years after Turing's article appeared, John Searle (1980) put a new spin on Turing's original arguments. He developed a thought experiment, now called "The Chinese Room," which was a reformulation of Turing's original test and, in so doing, produced what is undoubtedly the second most widely read and hotly discussed paper in artificial intelligence. While Turing was optimistic about the possibility of creating intelligent programs in the foreseeable future, Searle concluded his article on precisely the opposite note: "...no [computer] program, by itself, is sufficient for intentionality." In short, Searle purported to have shown that real (human-like) intelligence was impossible for any program implemented on a computer.

In the present article I will begin by briefly presenting Searle's well-known transformation of the Turing Test. Unlike other critics of the Chinese Room argument, however, I will not take issue with Searle's argument per se. Rather, I will focus on the argument's central premise and will argue that the correct approach to the whole argument is simply to refuse to go beyond this premise, for it is, as I hope to show, untenable.

The Chinese Room

Instead of Turing's Imitation Game, in which a computer in one room and a person in a separate room both attempt to convince an interrogator that they are human, Searle asks us to begin by imagining a closed room in which there is an English speaker who knows no Chinese whatsoever. This room is full of symbolic rules specifying inputs and outputs, but, importantly, there are no translations in English to indicate to the person in the room the meaning of any Chinese symbol or string of symbols.
A native Chinese speaker outside the room writes questions (any questions) in Chinese on a piece of paper and sends them into the room. The English speaker receives each question inside the Room, then matches the symbols in the question with symbols in the rule base. (This does not have to be a direct table matching of the string of symbols in the question with symbols in the rule base, but can include any type of look-up program, regardless of its structural complexity.) The English speaker is blindly led through the maze of rules to a string of symbols that constitutes an answer to the question. He copies this answer on a piece of paper and sends it out of the room. The Chinese person outside the room would see a perfect response, even though the English speaker understood no Chinese whatsoever. The Chinese person would therefore be fooled into believing that the person inside the room understood perfect Chinese.

Searle then compares the person in the room to a computer program and the symbolic rules that fill the room to the knowledge databases used by the computer program. In Searle's thought experiment the person who is answering the questions in perfect written Chinese still has no knowledge of Chinese. Searle then applies the conclusion of his thought experiment to the general question of machine intelligence. He concludes that a computer program, however perfectly it managed to communicate in writing, thereby fooling all human questioners, would still not understand what it was writing, any more than the person in the Chinese Room understood any Chinese. Ergo, computer programs capable of true understanding are impossible.

Searle's Central Premise

But this reasoning is based on a central premise that needs close scrutiny. Let us begin with a simple example. If someone began a line of reasoning thus: "Just for the sake of argument, let's assume that cows are as big as the moon," you would most likely reply, "Stop right there, I'm not interested in hearing the rest of your argument, because cows are demonstrably NOT as big as the moon." You would be justified in not allowing the person to continue to his conclusions because, as logical as any of his subsequent reasoning might be, any conclusion arising from his absurd premise would be unjustified.

Now let us consider the central premise on which Searle's argument hangs: namely, that there could be such a thing as a "Chinese Room" in which an English-only person could actually fool a native-Chinese questioner. I hope to show that this premise is no more plausible than the existence of lunar-sized cows and, as a result, that we have no business allowing ourselves to be drawn into the rest of Searle's argument, any more than when we were asked to accept that all cows were the size of the moon.

Ironically, the arguments in the present paper support Searle's point that symbolic AI is not sufficient to produce human-like intelligence, but they do so not by comparing the person in the Chinese Room to a computer program, but rather by showing that the Chinese Room itself would be an impossibility for a symbol-based AI paradigm.

Subcognitive Questioning and the Turing Test

To understand why such a Room would be impossible, which would mean that the person in the Room could never fool the outside-the-Room questioner, we must look at an argument concerning the Turing Test first put forward by French (1988, 1990, 2000). French's claim is that no machine that had not experienced life as we humans have could ever hope to pass the Turing Test.
His demonstration involves showing just how hard it would be for a computer to consistently reply in a human-like manner to what he called "subcognitive" questions. Since Searle's Chinese Room argument is simply a reformulation of the Turing Test, we would expect to be able to apply these arguments to the Chinese Room as well, something we will do later in this paper.

It is important to spend a moment reviewing the nature and the power of "subcognitive" questions. These are questions that are explicitly designed to provide a window on low-level (i.e., unconscious) cognitive or physical structure. By "low-level cognitive structure", we mean the subconscious associative network in human minds that consists of highly overlapping activatable representations of experience (French, 1990). Creating these questions and, especially, gathering the answers to them require a bit of preparation on the part of the Interrogator who will be administering the Turing Test.

The Interrogator in the Turing Test (or the Questioner in the Chinese Room) begins by preparing a long list of these questions: the Subcognitive Question List. To get answers to these questions, she ventures out into an English-language population and selects a representative sample of individuals from that population. She asks each person surveyed all the questions on her Subcognitive Question List and records their answers. The questions, along with the statistical range of answers to them, will be the basis for her Human Subcognitive Profile. Here are some of the questions on her list (French, 1988, 1990).

Questions using neologisms:
"On a scale of 0 (completely implausible) to 10 (completely plausible):
- Rate Flugblogs as a name Kellogg's would give to a new breakfast cereal.
- Rate Flugblogs as the name of a start-up computer company.
- Rate Flugblogs as the name of big, air-filled bags worn on the feet and used to walk across swamps.
- Rate Flugly as the name a child might give to a favorite teddy bear.
- Rate Flugly as the surname of a bank accountant in a W. C. Fields movie.
- Rate Flugly as the surname of a glamorous female movie star."
"Would you like it if someone called you a trubhead? (0 = not at all, ..., 10 = very much)"
"Which word do you find prettier: blutch or farfaletta?"

Note that the words flugblogs, flugly, trubhead, blutch and farfaletta are made up. They will not be found in any dictionary and, yet, because of the uncountable influences, experiences and associations of a lifetime of hearing and using English, we are able to make judgments about these neologisms. And, most importantly, while these judgments may vary between individuals, their variation is not random. For example, the average rating of Flugly as the surname of a glamorous actress will most certainly fall below the average rating of Flugly as the name for a child's teddy bear. Why? Because English speakers, all of us, have grown up surrounded by roughly the same sea of sounds and associations that have gradually formed our impressions of the prettiness (or ugliness) of particular words or sounds. And while not all of these associations are identical, of course, they are similar enough to be able to make predictions about how, on average, English-speaking people will react to certain words and sounds.
This is precisely why Hollywood movie moguls gave the name "Cary Grant" to a suave and handsome actor born "Archibald Alexander Leach" and why "Henry Deutschendorf, Jr." was re-baptised "John Denver."

Questions using categories:
- Rate banana splits as medicine.
- Rate purses as weapons.
- Rate pens as weapons.
- Rate dry leaves as hiding places.

No dictionary definition of "dry leaves" will include "hiding place," and, yet, everyone who was ever a child where trees shed their leaves in the fall knows that piles of dry leaves make wonderful hiding places. But how could this information, and an infinite amount of information just like it that is based on our having experienced the world in a particular way, ever be explicitly programmed into a computer?

Questions relying on human physical sensations:
- Does holding a gulp of Coca-Cola in your mouth feel more like having pins-and-needles in your foot or having cold water poured on your head?
- Put your palms together, fingers outstretched and pressed together. Fold down your two middle fingers till the middle knuckles touch. Move the other four pairs of fingers. What happens to your other fingers? (Try it!)

We can imagine many more questions that would be designed to test not only for subcognitive associations, but for internal physical structure. These would include questions whose answers would arise, for example, from the spacing of a human's eyes, or would be the results of little self-experiments involving tactile sensations on their bodies or sensations after running in place, and so on.

People's answers to subcognitive questions are the product of a lifetime of experiencing the world with our human bodies, our human behaviors (whether culturally or genetically engendered), our human desires and needs, etc. (See Harnad (1989) for a discussion of the closely related symbol grounding problem.)

I have asked people the question about Coca-Cola and pins-and-needles many times, and they overwhelmingly respond that holding a soft drink in their mouth feels more like having pins and needles in their foot than having cold water poured on them. Answering this question is dead easy for people who have a head and mouth, have drunk soft drinks, have had cold water poured on their head, and have feet that occasionally fall asleep. But think of what it would take for a machine that had none of these to answer this question. How could the answer to this question be explicitly programmed into the machine? Perhaps (after reading this article) a programmer could put the question explicitly into the machine's database, but there are literally infinitely many questions of this sort, and to program them all in would be impossible. A program that could answer questions like these in a human-like enough manner to pass a Turing Test would have had to have experienced the world in a way that was very similar to the way in which we had experienced the world. This would mean, among many other things, that it would have to have a body very much like ours, with hands like ours, with eyes where we had eyes, etc.
For example, if an otherwise perfectly intelligent robot had its eyes on its knees, this would result in detectably non-human associations for such activities as, say, praying in church, falling when riding a bicycle, playing soccer, or wearing pants.

The moral of the story is that it doesn't matter if we humans are confronted with made-up words or conceptual juxtapositions that never normally occur (e.g., dry leaves and hiding places): we can still respond and, moreover, our responses will show statistical regularities over the population. Thus, by surveying the population at large with an extensive set of these questions, we draw up a Human Subcognitive Profile for the population. It is precisely this subcognitive profile that could not be reproduced by a machine that had not experienced the world as the members of the sampled human population had. The Subcognitive Question List that was used to produce the Human Subcognitive Profile gives the well-prepared Interrogator a sure-fire tool for eliminating machines from a Turing Test in which humans are also participating. The Interrogator would come to the Turing Test and ask both candidates the questions on her Subcognitive Question List. The candidate most closely matching the average answer profile from the human population will be the human.
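The Interrogator's screening procedure is statistical, and one plausible way to make it concrete is sketched below: estimate per-question means and spreads from the surveyed population, then score each candidate by how far its answers sit from those means. French specifies no particular scoring rule, so the mean z-score used here, and the toy ratings, are assumptions made only for illustration.

```python
import statistics

def profile(population_answers):
    """Human Subcognitive Profile: per-question mean and spread,
    estimated from a surveyed sample of the population."""
    return [(statistics.mean(col), statistics.stdev(col))
            for col in zip(*population_answers)]

def deviation(candidate, prof):
    """Average z-score of a candidate's answers against the profile."""
    return statistics.mean(abs(a - m) / s for a, (m, s) in zip(candidate, prof))

# Hypothetical 0-10 ratings for, e.g.: "Flugly as a teddy bear's name",
# "Flugly as a glamorous star's surname", "is blutch prettier than farfaletta?"
population = [[8, 2, 3], [7, 1, 4], [9, 2, 2], [8, 3, 3]]
prof = profile(population)
human, machine = [8, 2, 4], [5, 5, 5]  # the machine's answers look arbitrary
print(deviation(human, prof) < deviation(machine, prof))  # True
```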
The English Room

Now let us see how this technique can be gainfully applied to Searle's Chinese Room thought experiment. We will start by modifying Searle's original Gedankenexperiment by switching the languages around. This, of course, has no real bearing on the argument itself, but it will make our argument easier to follow. We will assume that inside the Room there is a Chinese person (let's call him Wu) who understands not a word of written English, and outside the Room is a native speaker/writer of English (Sue). Sue sends into the Room questions written in English, and Wu must produce the answers to these questions in English.

Now, it turns out that Sue is not your average naive questioner: she has read many articles on the Turing Test, knows about subcognitive questions, and is thoroughly familiar with John Searle's argument. She also suspects that the person inside the (English) Room might not actually be able to read English, and she sets out to prove her hunch.

Sue will not only send into the Room questions like "What is the capital of Cambodia?", "Who painted The Mona Lisa?" or "Can fleas fly?" but will also ask a large number of "subcognitive questions." Because the Room, like the computer in the Turing Test, had not experienced the world as we have, and because it would be impossible to explicitly write down all of the rules necessary to answer subcognitive questions in general, the answers to the full range of subcognitive questions could not be contained in the lists of symbolic rules in the Room. Consequently, the person in the Room would be revealed not to speak English for exactly the same reason that the machine in the Turing Test would be revealed not to be a person.

Take the simple example of nonexistent words like blutch or trubhead. These words are neologisms and would certainly be nowhere to be found in the symbolic rules in the English Room. Somehow, the Room would have to contain, in some symbolic form, information not only about all words, but about non-words as well. But the Room, if it is to be compared with a real computer, cannot be infinitely large, nor can we assume infinitely fast search of the rule base (see Hofstadter & Dennett, 1981, for a discussion of this point).

So, we have two closely related problems. First, and most crucially, how could the rules have gotten into the Room in the first place (a point that Searle simply ignores)? And second, the number of explicit symbolic rules would require an essentially infinite amount of space. And while rooms in thought experiments can perhaps be infinitely large, the computers that they are compared to cannot be.

In other words, the moral of the story here, as it was for the machine trying to pass the Turing Test, is that no matter how many symbolic rules were in the English Room, they would not be sufficient for someone who did not understand written English to fool a determined English questioner. And this is where the story should rightfully end. Searle has no business taking his argument any further; ironically, he doesn't need to, since the necessary inadequacy of such a Room, regardless of how many symbolic rules it contains, proves his point about the impossibility of achieving artificial intelligence in a traditional symbol-based framework. So, when Searle asks us to accept that the English-only human in his Chinese Room could reply in perfect written Chinese to questions written in Chinese, we must say, "That's strictly impossible, so stop right there."

Shift in Perception of the Turing Test

Let us once again return to the Turing Test to better understand the present argument. It is easy to forget just how high optimism once ran for the rapid achievement of artificial intelligence. In 1958, when computers were still in their infancy and even high-level programming languages had only just been invented, Simon and Newell, two of the founders of the field of artificial intelligence, wrote, "...there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until – in a visible future – the range of problems they can handle will be coextensive with the range to which the human mind has been applied." (Simon & Newell, 1958). Marvin Minsky, head of the MIT AI Laboratory, wrote in 1967, "Within a generation the problem of creating 'artificial intelligence' will be substantially solved" (Minsky, 1967).

During this period of initial optimism, the vast majority of the authors writing about the Turing Test tacitly accepted Turing's premise that a machine might actually be able to be built that could pass the Test in the foreseeable future. The debate in the early days of AI, therefore, centered almost exclusively around the validity of Turing's operational definition of intelligence: namely, did passing the Turing Test constitute a sufficient condition for intelligence or did it not? But researchers' views on the possibility of achieving artificial intelligence shifted radically between the mid-1960's and the early 1980's. By 1982, for example, Minsky's position regarding achieving artificial intelligence had undergone a radical shift from the unbounded optimism of 15 years earlier to a far more sober assessment of the situation: "The AI problem is one of the hardest ever undertaken by science" (Kolata, 1982). The perception of the Turing Test underwent a parallel shift. At least in part because of the great difficulties being experienced by AI, there was a growing realization of just how hard it would be for a machine to ever pass the Turing Test.
Thus, instead of discussing whether or not a machine that had passed the Turing Test was really intelligent, the discussion shifted to the question of whether it would even be possible for any machine to pass such a test (Dennett, 1985; French, 1988, 1990; Crockett, 1994; Harnad, 1989; for a review, see French, 2000).

The Need for a Corresponding Shift in the Perception of the Chinese Room

A shift in emphasis identical to the one that has occurred for the Turing Test is now needed for Searle's Chinese Room thought experiment. Searle's article was published in pre-connectionist 1980, when traditional symbolic AI was still the dominant paradigm in the field. Many of the major difficulties facing symbolic AI had come to light, but in 1980 there was still little emphasis on the "sub-symbolic" side of things. The growing difficulties that symbolic AI had in dealing with "sub-symbolic cognition" were responsible, at least in part, for the widespread appeal of the connectionist movement of the mid-1980's. While several of the commentaries on Searle's original article (Searle, 1980) briefly touch on the difficulties involved in actually creating a Chinese Room, none of them focus outright on the impossibility of the Chinese Room as described by Searle, and none reject the rest of the argument because of its impossible premise. But this rejection corresponds precisely to rejecting the idea that a machine (that had not experienced the world as we humans have) could ever pass the Turing Test, an idea that many people now accept. We are arguing for a parallel shift in emphasis for the Chinese Room Gedankenexperiment.

Can the "Robot Reply" Help?

It is necessary to explore for a moment the possibility that one could somehow fill the Chinese Room with all of the appropriate rules that would allow the non-Chinese-reading person to fool a no-holds-barred Chinese questioner. Where could rules come from that would allow the person in the Chinese Room to answer all of the incoming questions in Chinese perfectly? One possible reply is a version of the Robot Reply (Searle, 1980). Since the rules couldn't have been symbolic and couldn't have been explicitly programmed in, for the reasons outlined above (also see French, 1988, 1990), perhaps they could have been the product of a Robot that had experienced and interacted with the world as we humans would have, all the while generating rules that would be put in the Chinese Room.

This is much closer to what would be required to have the appropriate "rules," but it still leaves open the question of how you could ever come up with such a Robot. The Robot would have to be able to interact seamlessly with the world, exactly as a Chinese person would, in order to have been able to produce all the "rules" (high-level and subcognitive) that would later allow the person in the Room to fool the Well-Prepared Questioner. But then we are back to square one, for creating such a Robot amounts to creating a robot that would pass the Turing Test.

The Chinese Room: a Simple Refutation

It must be reiterated that when Searle attacks the "strong AI" claim that machines processing strings of symbols are capable of doing what we humans call thinking, he is explicitly talking about programs implemented on computers.
It is important not to ignore the fact, as some authors unfortunately have (e.g., Block, 1981), that computers are real machines of finite size and speed; they have neither infinite storage capacity nor infinite processing speed.

Now consider the standard Chinese Room, i.e., the one in which the person inside the Room has no knowledge of Chinese and the Questioner outside the Room is Chinese. Now assume that the last character of the following question is distorted in an extremely phallic way, but in a way that nonetheless leaves the character completely readable to any reader of Chinese: "Would the last character of this sentence embarrass a very shy young woman?" In order to answer this question correctly (a trivially easy task for anyone who actually reads Chinese), the Chinese Room would have to contain rules that would allow the person to respond perfectly not only to all strings of Chinese characters that form comprehensible questions, but also to the infinitely many possible legible distortions of those strings of characters. Combinatorial explosion brings the house down around the Chinese Room. (Remember, we are talking about real computers that can store a finite amount of information and must retrieve it in a finite amount of time.)

One might be tempted to reply, "The solution is to eliminate all distortions. Only standard fonts of Chinese characters are permitted." But, of course, there are hundreds, probably thousands, of different fonts of characters in Chinese (Hofstadter, 1985), and it is completely unclear what would constitute "standard fonts." In any event, one can sidestep even this problem.

Consider an equivalent situation in English. It makes perfect sense to ask, "Which letter could be most easily distorted to look like a cloud: an 'O' or an 'X'?" An overwhelming majority of people would, of course, reply "O", even though clouds, superficially and theoretically, have virtually nothing in common with the letter "O". But how could the symbolic rules in Searle's Room possibly serve to answer this perfectly legitimate question? A theory of clouds contained in the rules certainly wouldn't be of any help, because that would be about storms, wind, rain and meteorology. A theory or database of cloud forms would be of scant help either, since clouds are anything but two-dimensional, much less round. Perhaps if the machine/Room had grown up scrawling vaguely circular shapes on paper and calling them clouds in kindergarten and elementary school, then maybe it would be able to answer this question. But short of having had that experience, I see little hope for an a priori theory of correspondence between clouds and letters that would be of any help.

Conclusion

The time has come to view John Searle's Chinese Room thought experiment in a new light. Up until now, the main focus of attention has been on showing what is wrong (or right) with the argument, with the tacit assumption being that somehow there could be such a Room. This parallels the first forty years of discussion of the Turing Test, where virtually all discussion centered on the sufficiency of the Test as a criterion for machine intelligence, rather than on whether any machine could ever actually pass it. However, as the overwhelming difficulties of AI gradually became apparent, the debate on the Turing Test shifted to whether or not any machine that had not experienced the world as we have could ever actually pass the Turing Test. It is time for an equivalent shift in attention for Searle's Chinese Room.
The question should not be, "If a person in the Room answered all the questions in perfect Chinese, while not understanding a word of Chinese, what would the implications of this be for strong AI?" Rather, the question should be, "Does the very idea of such a Room, and of a person in the Room able to answer questions in perfect Chinese while not understanding any Chinese, make any sense at all?" And I believe that the answer, in parallel with the impossibility of a machine passing the Turing Test, is no.

Acknowledgments

The present paper was supported in part by research grant IUAP P4/19 from the Belgian government.

References

Block, N. (1981). Psychologism and behaviourism. Philosophical Review, 90, 5-43.
Crockett, L. (1994). The Turing Test and the Frame Problem: AI's Mistaken Understanding of Intelligence. Ablex.
Davidson, D. (1990). Turing's test. In Karim A. Said et al. (eds.), Modelling the Mind. Oxford University Press, 1-11.
Dennett, D. (1985). Can machines think? In M. Shafto (ed.), How We Know. Harper & Row.
French, R. M. (1988). Subcognitive Probing: Hard Questions for the Turing Test. Proceedings of the Tenth Annual Cognitive Science Society Conference, Hillsdale, NJ: LEA. 361-367.
French, R. M. (1990). Subcognition and the Limits of the Turing Test. Mind, 99(393), 53-65. Reprinted in: P. Millican & A. Clark (eds.), Machines and Thought: The Legacy of Alan Turing. Oxford, UK: Clarendon Press, 1996.
French, R. M. (2000). Peeking Behind the Screen: The Unsuspected Power of the Standard Turing Test. Journal of Experimental and Theoretical Artificial Intelligence. (in press).
French, R. M. (2000). The Turing Test: The First Fifty Years. Trends in Cognitive Sciences, 4(3), 115-122.
Harnad, S. (1989). Minds, machines and Searle. Journal of Experimental and Theoretical Artificial Intelligence, 1, 5-25.
Hofstadter, D. (1985). Variations on a Theme as the Crux of Creativity. In Metamagical Themas. New York, NY: Basic Books. p. 244.
Hofstadter, D. & Dennett, D. (1981). The Mind's I. New York, NY: Basic Books.
Kolata, G. (1982). How can computers get common sense? Science, 217, p. 1237.
Minsky, M. (1967). Computation: Finite and Infinite Machines. Prentice-Hall, p. 2.
Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3, 414-424.
Simon, H. and Newell, A. (1958). Heuristic problem solving: The next advance in operations research. Operations Research, 6.
Integrating E-services with a Telecommunication E-commerce using Service-Oriented Architecture

Tung-Hsiang Chou
Department of Information Management, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
Email: sam@.tw

Yu-Min Lee
Department of Management Information Systems, National Chengchi University, Taipei, Taiwan
Email: vivi.lee.mail@

Abstract: In the past, electronic commerce focused only on customer-to-business and business-to-business web interaction. With the emergence of business process management and of service-oriented architecture, the focus has shifted to the development of e-services that integrate business processes and that diversify the functionalities available to customers. The potential of electronic commerce and its information technology has also attracted telecommunication corporations such as Chunghwa Telecom, Singtel Telecom, and AT&T, which have built their own electronic commerce environments on the Internet. Most of these worldwide telecom corporations run many kinds of operations support systems in their backend environments. In the past, enterprises had to integrate their telecom services manually so that they could work together. This integration required considerable time and cost, and it worked only for the specific services that were manually linked; adding further services required even more effort. Enterprise application integration (EAI) then solved these kinds of problems by working via point-to-point interfaces. In this paper, we present a research framework to describe the method. We then use two illustrations to explain the generality of our method, and we focus on how international telecom corporations have become concerned with the agility, the leanness, and the integration underlying the integration of electronic services (e-services) with enterprise application integration technology.

Index Terms: electronic services, telecommunication, enterprise application integration

I. INTRODUCTION

In order to increase customer satisfaction, to manage electronic commerce (e-commerce, EC) transactions, and to deliver services to businesses and their customers rapidly and reliably, many enterprises have started to consider how to make their business process management more agile in enterprise operations. In recent years, e-commerce has increasingly supported business services between enterprises and consumers. The biggest challenge that businesses face today is the need to get their diverse services, often built on different platforms, to work together when necessary. Hence, telecom corporations have adjusted their backend systems (including legacy and heterogeneous systems) so that common interfaces integrate processes and so that the customer can access more telecom services. These corporations have therefore questioned the feasibility of integrating e-services into these business process operations. By the definition proposed in [3], e-services provide value-added services whose delivery rests on the composition of existing functions. Figure 1 depicts the chaotic situation that can arise when e-services must cooperate with backend systems.

Figure 1. The chaotic situation between e-services and backend systems

In the past, if enterprises wanted to develop an e-service, they had to develop it section by section.
Over time, an enterprise would accumulate more and more e-services resting on heterogeneous platforms, a phenomenon that was especially pronounced in the telecommunication industry. The objective of this research is to help integrate heterogeneous platforms and e-services. As a general solution, the research applies both an integration methodology and service-oriented architecture to the development of an enterprise application integration.

Many complexity problems therefore arise in the e-services environment, where e-services are delivered to users by backend systems and legacy systems. We consequently need a framework that manages this chaotic e-services environment and lets backend and legacy systems work together more efficiently. The research framework uses several information and communication technologies to solve these problems.

More specifically, in section 2 of this research, we discuss related work on Next Generation Operations Systems and Software (NGOSS) and Service-Oriented Architecture (SOA). In section 3, we describe an e-services integration design based on EAI technology. In section 4, we illustrate the data flow of the implementation, and in section 5 we present case studies from the telecommunication industry. We then conclude with some comments and suggestions for future research directions.

II. RELATED WORKS

This section reviews the relevant research and practice literature in two areas: Next Generation Operations Systems and Software (NGOSS) and Service-Oriented Architecture (SOA). The objective is to examine the current status of these research results and their relationship with this study. We propose to extend NGOSS and integrate it with e-services technology. We also review SOA research topics to identify their applicability and feasibility in relation to an enterprise's activities.

Next Generation Operations Systems and Software (NGOSS)

The TeleManagement Forum (TM Forum), founded in 1988, is a non-profit global organization that provides leadership, strategic guidance, and practical solutions for improving the management and operation of information and communications services. In recent years, telecom corporations have encountered great difficulties in integrating business-process frameworks with heterogeneous platforms. Because telecoms run many business process workflows, such as ordering, billing, trouble management, resource management, and marketing, these corporations have built several operations support systems to assist in the related work. Hence, the TM Forum proposed Next Generation Operations Systems and Software to support telecommunication business processes.

NGOSS 6.0 is not a system; it is a methodology that the TM Forum released in 2007 [13]. The NGOSS architecture has four core dimensions: eTOM, SID, TNA, and TAM. Figure 2 depicts these core dimensions both separately and in relation to one another. NGOSS provides a framework for building telecom operations support systems from the following views. First, the enhanced Telecom Operations Map (eTOM) describes all the enterprise processes required by a service provider and analyzes them at different levels of detail that reflect the processes' significance and priority for the telecom business.
For such enterprises, eTOM serves as the blueprint for process direction and provides a neutral reference point for internal processes, reengineering needs, partnerships, alliances, and general working agreements with other providers. For suppliers, eTOM not only outlines potential boundaries of software components, so that these components better align with customers' needs, but also highlights the required functions, inputs, and outputs that products must support [13]. The model also provides an overall concept and describes the relationships of business processes with internal and external entities of the enterprise. Hence, the TM Forum and the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) have formally approved the eTOM framework.

Figure 2. The NGOSS framework

Second, the Shared Information Data (SID) model defines the information entities with which developers can map the eTOM model's business processes. The SID model can serve as the common standard that connects diverse systems to one another; it therefore constitutes a common view in the telecommunication industry.

Third, the Technology Neutral Architecture (TNA) comprises key architectural guidelines that ensure high levels of flow-through among various systems. The TM Forum's definition of TNA includes the following points:

- the core architecture is applicable to both legacy and next-generation implementations
- the NGOSS Contract defines distributed interfaces
- the NGOSS Metamodel defines the relationships of core elements
- the Distribution and Transparency Framework details the capabilities necessary to support a distributed NGOSS

Lastly, the Telecom Application Map (TAM) is a guide that helps telecommunication corporations and their suppliers use a common reference map and a common language to navigate the complex systems landscape typically found among mobile communication, fixed communication, data communication, and wire operators [13]. The TAM provides the bridge between the NGOSS framework's building blocks (eTOM and the SID model) and real, deployable, potentially procurable applications. TAM accomplishes this objective by grouping process functions and information data into recognized OSS and BSS applications or services [13]. TAM also gives global telecom software a reference that helps explain the relationships among many operational systems.

The research literature presents many suggestions for developing business process operations derived from eTOM [5][13]. From Parkyn's technology viewpoint, eTOM can be implemented with many different technologies, such as a .NET framework or J2EE solutions. Therefore, many researchers have started to survey the feasibility of NGOSS-based or eTOM-based solutions.

Service-Oriented Architecture (SOA)

Before discussing the concept of SOA, we must first define 'e-services'. According to the relevant literature [3][11][12], e-services create value-added services and are self-describing, open components that support rapid, low-cost composition of distributed applications [9]. Kim [8] defines 'e-services' as Internet-based services. The delivery of these services depends on a combination of existing backend systems, legacy systems, and e-services, a combination that tackles the problem of e-commerce business process integration [11][1][12].
In the telecommunication industry, e-services are vital to telecom corporations because they enable business processes to interact collaboratively with Internet users and with telecom internal operations systems in a new, digital way. Worth noting in this regard is that telecommunication business processes involve the management of many significant information services. These e-services also represent a business model in the corporation [12]. Hence, businesses carry out e-services by invoking several other basic or composite processes. Casati [3] illustrated the integration of service composition into business process management. Stafford [12] listed applications of e-services for product marketing, for the internal revenue service, and for e-commerce services. Song [11] described the use of e-services at FedEx, which has developed both e-services and e-commerce; for example, FedEx provides online shipping support for its customers. Most of this literature, however, does not describe how to implement these e-services. Hence, to realize our research goals, this research surveys feasible technology, first to develop a common interface and second to integrate these e-services with one another. Web services are the optimal solution among integration methodologies [5][6].

The development of service-oriented architecture derives from these e-services and treats software resources as services available on a network [2]. According to Huhns et al. [7][9], SOA satisfies several key features:

- Loose coupling is the most important key to SOA. SOA should create an environment that is more agile, feasible, and user-friendly; whether the users are system developers, customers, or anyone in between, they can reuse each other's e-services.
- Implementation neutrality means that no specific programming language underlies the development of the services; only a general implementation does.
- Flexible configurability is necessary in an SOA environment, because an SOA is configured late and flexibly. Hence, the configuration can change dynamically as needed and without loss of correctness.
- Persistence is required for a service. Although services do not need a long lifetime, users dealing with transactions among heterogeneous platforms must always be able to handle exceptions, so a service must exist long enough to handle such situations.
- Granularity means that SOA participants should be modeled and understood at a coarse granularity, and developers should capture visible information in business contracts among participants.
- Teams of business partners solve problems cooperatively or compete intelligently.

To comprehend the current research on SOA, this research has surveyed the relevant literature. Most of the literature that discusses the principles of SOA does not implement SOA [2][7][9]. Although Chen [4] helps companies construct stronger relationships with their trading partners by integrating business logic with collaborative commerce, that research does not integrate heterogeneous web services into an SOA. Shen [10] proposes an agent-based service-oriented integration architecture that features web services, but that research applies only to internal-enterprise manufacturing resource sharing.
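To make the loose-coupling, implementation-neutrality, and flexible-configurability features above concrete, the following minimal Java sketch shows a service consumer that depends only on a service contract, never on a concrete platform. The interface and class names (BillingService, LegacyBillingAdapter, and so on) are hypothetical illustrations and are not part of NGOSS or of the systems discussed in this paper.

// A minimal sketch of loose coupling in an SOA (hypothetical names).
// The consumer depends only on this contract, not on any platform.
interface BillingService {
    double queryBalance(String customerId);
}

// One possible provider: wraps a legacy backend system.
class LegacyBillingAdapter implements BillingService {
    public double queryBalance(String customerId) {
        // A real deployment would call the legacy system here;
        // this sketch returns a placeholder value.
        return 42.0;
    }
}

// Another provider could wrap a modern web service instead.
class WebServiceBillingAdapter implements BillingService {
    public double queryBalance(String customerId) {
        return 42.0; // placeholder for a SOAP/HTTP call
    }
}

public class ServiceConsumer {
    public static void main(String[] args) {
        // "Flexible configurability": the concrete service is chosen
        // late (here by a simple flag; in practice by a registry).
        boolean useLegacy = args.length > 0 && args[0].equals("legacy");
        BillingService svc = useLegacy
                ? new LegacyBillingAdapter()
                : new WebServiceBillingAdapter();
        System.out.println(svc.queryBalance("C-1001"));
    }
}

Either adapter can be replaced without touching the consumer, which is exactly the reuse property that the loose-coupling feature describes.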
Hence, this research proposes a new framework that encompasses SOA, EAI, and e-services and that applies to the telecommunication industry.

III. E-SERVICES INTEGRATION DESIGN WITH EAI TECHNOLOGY

In this research, we propose a research framework that provides a context for analyzing several layers, so as to better understand our goals, benefits, and limitations. In the past, many enterprises implemented their e-commerce computing environments with web technology, using web pages to expose an application for a specific function. Once customers submitted a request to the enterprise, the web page processed the request and notified the enterprise officers. As these accumulate over a long period, an enterprise ends up with many web pages presented to customers, and each web page hides many business processes and e-services without management. Hence, this research framework divides the web page into a presentation layer and an interaction layer. Because different functions are located in diverse backend systems and legacy systems, the framework also proposes an exchange layer to integrate these heterogeneous platforms and a processing layer to gather the diverse backend and legacy systems. In this framework, a telecommunication corporation's complete computing environment can be divided into several layers: the presentation layer, the interaction layer, the exchange layer, the processing layer, and the data layer.

In general, the presentation layer includes the customer interface, such as web pages and e-services links. We propose an interaction layer in which integrating e-services both receives customers' requests and transfers these requests to several e-services. Sometimes these e-services should be aggregated or integrated with other e-services and placed in the e-services pool. The business process management assembles relevant e-services to attend to customers' requests. Before integrating e-services accomplishes these requests, however, the business process management obtains further information through the exchange layer. Enterprise application integration (EAI) is responsible in particular for integrating the backend systems and the legacy systems into the exchange layer. The backend systems and legacy systems in the processing layer cover a variety of information systems, such as the integrated customer content system, the data warehouse system, the management system, the provision service system, and the customer service system. Each backend system and each legacy system in the processing layer is controlled by the EAI, which is enabled by the integrating e-services. Figure 3 depicts the relationships among the layers in the telecommunication computing environment.

Figure 3. The context layers in the large e-business

Presentation Layer

This layer describes the client environment. The presentation layer is based on web technology such as Java Server Pages, Active Server Pages, or the hypertext markup language. The resulting collection of client-themed information includes, for example, clients' requests, which the web client prepares. The web client transfers the requests to information collecting so that the interaction layer can verify them.

Interaction Layer

Using the eXtensible Markup Language (XML), the presentation layer delivers all of the information to the interaction layer. Several steps characterize this layer.

Step 1: After receiving clients' information, the interaction layer converts the customer information into an XML format. Before starting to execute the business process, the customer information conversion-and-verification component verifies the XML format on the basis of an XML Schema Definition (XSD); a validation sketch follows this step list.

Step 2: If step 1 passes successfully, the system common component starts to process clients' information and sends it to the functional components.

Step 3: The functional components extract the information from the XML and check the functional business process by using the database connection objects.

Step 4: When the e-services gateway receives the functional components' requests, it sends the requests to the exchange layer and waits for the response.

Step 5: Once the exchange layer has sent the request-related results to the e-services, the e-services deliver the results, through the original request route, back to the presentation layer.
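As a concrete illustration of the verification in step 1, the sketch below validates an incoming XML request against an XSD using the standard Java javax.xml.validation API. The file names request.xml and request.xsd are hypothetical placeholders; the paper does not prescribe a specific schema.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Minimal sketch of step 1: verify an XML request against an XSD
// before the business process is allowed to execute.
public class RequestVerifier {
    public static boolean isValid(File xml, File xsd) {
        try {
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(xsd);
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(xml));
            return true;   // step 2 may proceed
        } catch (Exception e) {
            return false;  // reject malformed or non-conforming requests
        }
    }

    public static void main(String[] args) {
        // Hypothetical request and schema files.
        System.out.println(
            isValid(new File("request.xml"), new File("request.xsd")));
    }
}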
Exchange Layer

We borrow the advantages of EAI to communicate with the processing layer and with the interaction layer. This layer also integrates several heterogeneous backend systems and legacy systems. The EAI facilitates the operation of information system communication. We also use the XML standard to transmit the user's request through the web service and wait for the response from the processing layer; a minimal bus sketch appears at the end of this section.

Processing Layer

The systems can be divided into two categories, backend systems and legacy systems. Each of them connects the exchange layer and the data layer to each other. In this layer, each system has a different specific functionality, such as mobile billing, dedicated line provision, and so on.

Data Layer

Finally, this layer collects the raw data and the historical data in the different databases and prepares to provide request-related information to the processing layer.

To strengthen the efficient cooperation of these e-services with one another, this research illustrates these concepts on the basis of a collaboration model. The proposed collaboration model helps identify every stage in its relevant position. Figure 4 depicts the collaboration model that features e-services. The customers and the commercial agents send their requests into the square shape that represents information collection. After collecting the requests, the information collection uses the intranet to send them into e-services integration. The e-services integration, also represented by a square shape, collects several e-services and coordinates the relevant ones. The oval shape is an e-service aggregator, and the business processes are based on e-services. The e-services then seek matched systems so that e-services integration can provide the e-services aggregator with e-services. To accomplish this objective, e-services integration uses an enterprise application integration interface. These e-services include data, business logic, and objects, and they assemble these characteristics according to different situations.

Figure 4. The NGOSS-centric collaboration model in the framework
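As a minimal illustration of how the exchange layer's EAI decouples requesters from backend systems, the sketch below routes XML request messages by topic to registered system adapters. All names are hypothetical; a production EAI product would add message transformation, queuing, and transactions.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of an EAI-style message bus: each backend or
// legacy system registers one adapter, and requesters talk only to
// the bus (m + n adapters instead of m * n point-to-point links).
public class EaiBus {
    private final Map<String, Function<String, String>> adapters =
        new HashMap<>();

    public void register(String topic, Function<String, String> adapter) {
        adapters.put(topic, adapter);
    }

    public String route(String topic, String xmlRequest) {
        Function<String, String> a = adapters.get(topic);
        if (a == null)
            throw new IllegalArgumentException("no adapter for: " + topic);
        return a.apply(xmlRequest);
    }

    public static void main(String[] args) {
        EaiBus bus = new EaiBus();
        // Placeholder adapters standing in for real backend systems.
        bus.register("billing", xml -> "<balance>42.0</balance>");
        bus.register("provision", xml -> "<status>ok</status>");
        System.out.println(bus.route("billing", "<query customer='C-1001'/>"));
    }
}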
IV. THE DATA FLOW OF IMPLEMENTATION

To illustrate the data flow of a user request, Figure 5 depicts the steps that characterize a website's processing of user requests.

Step 1: The web application receives the user request and transfers it into an XML document.
Step 2: The web application sends the XML document through the firewall by using the web service.
Step 3: After receiving the XML information, the web server transforms this information into the data format specified by the remote systems.
Step 4: The remote system processes the result of step 3 and sends the result to the web server.
Step 5: The web application sends the results to the web user.

Figure 5. The process of a user request through the e-services portal

To realize these research proposals and resolve the research issue, we develop a prototype that accomplishes these tasks. The prototype draws on several information technologies:

Web Technologies

Many information technologies make web pages effective. Ajax, JavaScript/VBScript, HTML, Active Server Pages, and Adobe Flash provide the user with a rich and engaging interactive experience without the drawbacks of most older web applications. These technologies aim to display and invoke web services just like software resources, and they can use the XML format to deliver user requests.

XML

XML, an abbreviation of eXtensible Markup Language, was announced by the W3C in 1998. XML elements can be defined by the user, and tags describe the presentation of information. XML serves both as a descriptive language and as an interchange format. In this research, we use XML as the communication message between the layers.

Web Service

A web service is self-contained and self-described, and it can be published, located, and invoked over a network. Web services make the difficulties of distributed processing invisible and treat an invoked application as a software resource. A web service also provides a means of wrapping existing applications so that developers can access them through standard languages and protocols. It is a way of realizing a service-oriented architecture, focusing on integration and enabling machine-to-machine communication. In general, web services use XML for data description; HTTP for message transfer; the Simple Object Access Protocol (SOAP) for message exchange; the Web Services Description Language (WSDL) for service description; and the Universal Description, Discovery, and Integration (UDDI) protocol for discovering and publishing web services.
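To illustrate steps 1 and 2 of the data flow above, the following sketch wraps an XML payload in a SOAP envelope and posts it over HTTP using the standard SAAJ API (javax.xml.soap, bundled with Java SE 8 and available as a separate artifact on newer JDKs). The endpoint URL and element names are hypothetical; a real deployment would take them from the service's WSDL.

import javax.xml.namespace.QName;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPConnection;
import javax.xml.soap.SOAPConnectionFactory;
import javax.xml.soap.SOAPMessage;

// Minimal sketch of data-flow steps 1-2: build a SOAP request carrying
// the user's data as XML and send it to a remote web service.
public class SoapRequestSketch {
    public static void main(String[] args) throws Exception {
        MessageFactory mf = MessageFactory.newInstance();
        SOAPMessage request = mf.createMessage();
        SOAPBody body = request.getSOAPBody();

        // Hypothetical operation element and namespace.
        QName op = new QName("http://example.com/eservices", "QueryTrouble", "es");
        body.addBodyElement(op)
            .addChildElement("customerId", "es")
            .addTextNode("C-1001");
        request.saveChanges();

        // Hypothetical endpoint; in practice discovered via WSDL/UDDI.
        SOAPConnection conn =
            SOAPConnectionFactory.newInstance().createConnection();
        SOAPMessage response =
            conn.call(request, "http://example.com/eservices/trouble");
        conn.close();

        response.writeTo(System.out);  // step 5 would route this back
    }
}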
V. THE CASE STUDIES OF TELECOMMUNICATION CORPORATION

Having finished the system development work, we use two case studies to evaluate the feasibility and validity of this research. For feasibility, we use a real telecommunication environment to validate the functionalities of trouble management in the telecommunication corporation.

A Live Case Study

The company in question is the largest telecommunication corporation in Taiwan and is also the 423rd largest company in the world (according to Forbes' survey). The company's scope of services covers local phone services, long-distance phone services, international calls, mobile communication, data communication, Internet services, broadband networking, satellite communication, intelligent networks, mobile data, and multimedia broadband. The company is the most experienced and largest integrated telecommunication provider in Taiwan, providing these telecommunication services to more than 25 million customers. Table I lists the subscriber numbers of the corporation's major services.

TABLE I. SUBSCRIBER NUMBERS OF THE TELECOM'S MAJOR SERVICES IN THIS LIVE CASE

Services                     Subscriber number (Oct. 2007)
Local phone service          12,985,386
Mobile phone service         8,661,817
Payphone service             102,971
Internet access service      4,071,727

The goal of this research is to establish NGOSS-centric telecommunication e-services that enable operations support systems in an e-commerce environment. Hence, in this section, we use two illustrations to describe the situation of trouble management and QoS management, and we then depict the implementation plan. This research illustrates these topics in the following sections.

Trouble Management Illustration

This section uses trouble management to explain the benefits that NGOSS and e-services bring to integration. In the past, trouble management provided only passive information to customers. Since 2002, the TM Forum has proposed a third version of the NGOSS architecture. There are three eTOM sections: (1) strategy, infrastructure, and product (SIP); (2) operations; and (3) enterprise management. The SIP section supports, at least in theory, customer operations. Enterprise management provides several services that, on behalf of telecommunication service providers, improve the efficiency of telecom business operations. Customer relations management (CRM) is a key feature of eTOM; hence, we discuss CRM operations in eTOM, focusing the CRM process on problem management. The problem management processes help the system receive trouble reports from customers, resolve problems, increase customer satisfaction, maximize Quality of Service (QoS), and repair circuits. According to [13], several processes underlie problem management in the eTOM architecture:

- Isolate Problem and Initiate Resolution: In this initial stage, the system identifies a customer's problems and receives the customer's request stating the problem.
- Report Problem: After receiving the customer's request, the system generates a trouble report from the previous process and matches the report with relevant problems.
- Track and Manage Problem: When a telecommunication business has many problem reports, it needs a tracking mechanism to monitor the progress of the trouble reports. The purpose of this process is to track a customer's trouble report actively.
- Close Problem: When the telecommunication business has finished with a trouble report, the last stage is to ensure that the problem has been resolved and to inquire into the customer's satisfaction.

A minimal sketch of this four-stage pipeline appears after Figure 6, below. Figure 6 depicts the business process of the legacy trouble management operations. In this figure, two major procedures characterize the legacy operations: one is receiving customers' requests from the Internet, and the other is making the subsequent arrangement in the worker database. Although the legacy trouble management operations have a simple business process for responding to customers' requests, they cannot provide customers with satisfactory service. Owing to the competitive telecommunication market, to changes in the environment, and to improved information technology, telecommunication corporations have started to provide more e-services functionality on the Internet.

Figure 6. The business process of the legacy trouble management operations
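The following minimal Java sketch models the four eTOM problem-management stages as a linear state machine, which is one simple way to track a trouble report's progress. The enum and class names are hypothetical and are not part of eTOM itself.

// Hypothetical sketch: a trouble report moving through the four
// eTOM problem-management stages described above.
enum Stage { ISOLATE, REPORT, TRACK, CLOSE, DONE }

public class TroubleReport {
    private final String customerId;
    private Stage stage = Stage.ISOLATE;

    public TroubleReport(String customerId) {
        this.customerId = customerId;
    }

    // Advance to the next stage; a real system would attach
    // stage-specific work (matching, tracking, satisfaction survey).
    public void advance() {
        switch (stage) {
            case ISOLATE: stage = Stage.REPORT; break;
            case REPORT:  stage = Stage.TRACK;  break;
            case TRACK:   stage = Stage.CLOSE;  break;
            case CLOSE:   stage = Stage.DONE;   break;
            default: break;
        }
    }

    public Stage stage() { return stage; }

    public static void main(String[] args) {
        TroubleReport r = new TroubleReport("C-1001");
        while (r.stage() != Stage.DONE) {
            System.out.println(r.customerId + ": " + r.stage());
            r.advance();
        }
    }
}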
Integration between e-services and backend systems

Telecom corporations have many types of backend systems, such as operations support systems, customer relations management systems, and billing management systems. These systems provide several telecommunication business services, such as applications for a new home phone, transfer of a phone service, and reactivation of a home phone. But these backend systems might derive from heterogeneous platforms. Hence, we have started to consider a common technology, such as EAI, that might integrate these platforms.

QoS-Management Illustration

In the past, developers tried to integrate data and resources with each other but found the task difficult, so researchers started to use interfaces. This method grows complex, however, when a business tries to integrate m systems with n systems: the integration generates m*n interfaces so that these systems can interact with each other, and the complexity therefore constitutes an m*n problem. For example, connecting m = 5 frontend systems to n = 6 backend systems point-to-point requires 5 * 6 = 30 interfaces, whereas a shared bus needs only 5 + 6 = 11 adapters. Therefore, enterprise application integration (EAI) emerged, giving all the systems a common interface bus. Once the systems have connected to the EAI bus, all the messages flow through the bus. Although the EAI has solved many integration problems, the EAI can solve…
1060 - Collecting Luggage

Collecting your luggage after a flight can be far from trivial. Suitcases and bags appear on a conveyor belt, and hundreds of passengers fight for a good vantage point from which to find and retrieve their belongings. Recently, the Narita Airport Authority has decided to make this process more efficient. Before redesigning their baggage claim areas, they need a simulation program to determine how average passengers behave when collecting their luggage. This simulation assumes that passengers will always take a path of straight line segments to reach their luggage in the least amount of time.

For this problem, a conveyor belt is modeled as a simple polygon. A luggage piece appears at some point on the conveyor belt and then moves along the belt at a constant speed. A passenger is initially positioned at some point outside the conveyor belt polygon. As soon as the piece of luggage appears, the passenger moves at a constant speed (which is greater than the speed of the luggage) in order to pick it up. The passenger's path, which may not cross over the conveyor belt but may touch it, puts the passenger in the same position as the moving piece of luggage in the least amount of time.

In the following figure, the conveyor belt is depicted as a polygon ABCDEF. The luggage starts at the top-left corner (point A) and moves in counterclockwise direction around the polygon, as shown by the small arrows. The passenger begins at point P and moves along the path that puts him and the luggage in the same place (point M in the figure) in the shortest amount of time. The passenger's path is shown by a red arrow. This figure corresponds to the first sample input.

Input

The input consists of one or more test cases describing luggage pickup scenarios. A scenario description begins with a line containing a single integer N (3 <= N <= 100), the number of vertices of the conveyor belt polygon. This is followed by N lines, each containing a pair of integers xi, yi (|xi|, |yi| <= 10000) giving the coordinates of the vertices of the polygon in counterclockwise order. The polygon is simple; that is, it will not intersect or touch itself. The polygon description is followed by a line containing two integers px, py (|px|, |py| <= 10000), the coordinates of the starting position of the passenger. The last line of the description contains two positive integers VL and VP (0 < VL < VP <= 10000), which are the speeds of the luggage and the passenger, respectively. All coordinates are given in meters, and the speeds are given in meters per minute. You can assume that the passenger is positioned outside the conveyor belt polygon. The luggage moves in counterclockwise direction around the conveyor belt, starting at the first vertex of the polygon. The input is terminated by a line containing a single integer zero.

Output

For each test case, print a line containing the test case number (beginning with 1) followed by the minimum time it takes the passenger to reach the luggage. Use the formatting shown in the sample output (with minutes and seconds separated by a colon), rounded to the nearest second.
The value for seconds should be printed in a field of width two (padded with leading zeroes if required).

Sample Input

Sample Output

Case 1: Time = 1:02
Case 2: Time = 12:36

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>

/* Solution idea: binary search on the meeting time m. At time m the
   luggage is at a known point on the belt; the passenger can reach it
   in time best_path()/vp, where best_path is the shortest obstacle-
   avoiding path, found by Dijkstra on the visibility graph over the
   polygon vertices, the passenger, and the luggage point. */

int n, i, j, x[102], y[102], mark[102], px, py, vl, vp, cases, ans;
double C, d, l, r, m, t, map[102][102], len[102];

double dis(double dx, double dy)
{
    return sqrt(dx * dx + dy * dy);
}

/* 1 if segment ab intersects segment cd, 0 otherwise */
int cross(double xa, double ya, double xb, double yb,
          double xc, double yc, double xd, double yd)
{
    double cra, crb;
    /* bounding-box rejection */
    if (xa < xc && xb < xc && xa < xd && xb < xd) return 0;
    if (ya < yc && yb < yc && ya < yd && yb < yd) return 0;
    if (xa > xc && xb > xc && xa > xd && xb > xd) return 0;
    if (ya > yc && yb > yc && ya > yd && yb > yd) return 0;
    /* straddle tests via cross products */
    cra = (xb - xa) * (yc - ya) - (yb - ya) * (xc - xa);
    crb = (xb - xa) * (yd - ya) - (yb - ya) * (xd - xa);
    if (cra * crb > 0) return 0;
    cra = (xd - xc) * (ya - yc) - (yd - yc) * (xa - xc);
    crb = (xd - xc) * (yb - yc) - (yd - yc) * (xb - xc);
    if (cra * crb > 0) return 0;
    return 1;
}

/* point-in-polygon by ray casting to a far-away point */
int in_shape(double ox, double oy)
{
    int i, c = 0;
    for (i = 0; i < n; i++)
        c += cross(ox, oy, 98765, 43210, x[i], y[i], x[i + 1], y[i + 1]);
    return c & 1;
}

/* 1 if the segment ab does not pass through the polygon */
int no_cross(double xa, double ya, double xb, double yb)
{
    if (in_shape((xa + xb) / 2, (ya + yb) / 2)) return 0;
    /* shrink the segment slightly so that shared endpoints
       are not counted as crossings */
    xa = xa * (1 - 1e-7) + xb * 1e-7;
    ya = ya * (1 - 1e-7) + yb * 1e-7;
    xb = xb * (1 - 1e-7) + xa * 1e-7;
    yb = yb * (1 - 1e-7) + ya * 1e-7;
    for (i = 0; i < n; i++)
        if (cross(xa, ya, xb, yb, x[i], y[i], x[i + 1], y[i + 1]))
            return 0;
    return 1;
}

/* shortest path from the passenger to the luggage position at time m */
double best_path(void)
{
    double s, lx, ly, bl, tl;
    int i, j, bj;
    /* locate the luggage: distance s along the belt, modulo perimeter C */
    s = m * vl;
    s -= (int)(s / C) * C;
    for (j = 0; j < n; j++) {
        if (s <= map[j][(j + 1) % n]) break;
        s -= map[j][(j + 1) % n];
    }
    if (j >= n) j = n - 1; /* numerical guard */
    d = map[j][(j + 1) % n];
    lx = x[j] * (d - s) / d + x[(j + 1) % n] * s / d;
    ly = y[j] * (d - s) / d + y[(j + 1) % n] * s / d;
    if (no_cross(px, py, lx, ly))
        return dis(px - lx, py - ly);
    /* build visibility edges: node n = luggage, node n+1 = passenger */
    for (i = 0; i < n; i++) {
        if (i == j || i == (j + 1) % n || no_cross(lx, ly, x[i], y[i])) {
            map[n][i] = dis(lx - x[i], ly - y[i]);
            map[i][n] = map[n][i];
        } else {
            map[n][i] = -2;
            map[i][n] = -2;
        }
        if (no_cross(px, py, x[i], y[i])) {
            map[n + 1][i] = dis(px - x[i], py - y[i]);
            map[i][n + 1] = map[n + 1][i];
        } else {
            map[n + 1][i] = -2;
            map[i][n + 1] = -2;
        }
    }
    /* Dijkstra from the passenger (node n+1) to the luggage (node n) */
    for (i = 0; i < n; i++) len[i] = 1e9;
    len[n] = 1e9;
    len[n + 1] = 0;
    memset(mark, 0, sizeof(mark));
    for (i = 0; i <= n; i++) {
        bl = 1e9;
        bj = -1;
        for (j = 0; j < n + 2; j++)
            if (!mark[j] && len[j] < bl) { bl = len[j]; bj = j; }
        if (bj < 0) break;
        mark[bj] = 1;
        for (j = 0; j <= n; j++)
            if (!mark[j] && map[bj][j] > -1) {
                tl = bl + map[bj][j];
                if (tl < len[j]) len[j] = tl;
            }
    }
    return len[n];
}

int main(void)
{
    while (scanf("%d", &n) == 1 && n) {
        for (i = 0; i < n + 2; i++)
            for (j = 0; j < n + 2; j++)
                map[i][j] = -2; /* -2 marks "no edge" */
        for (i = 0; i < n; i++) scanf("%d%d", &x[i], &y[i]);
        x[n] = x[0];
        y[n] = y[0];
        C = 0; /* perimeter of the belt */
        for (i = 0; i < n; i++) {
            d = dis(x[i] - x[i + 1], y[i] - y[i + 1]);
            map[i][(i + 1) % n] = d;
            map[(i + 1) % n][i] = d;
            C += d;
        }
        /* visibility between non-adjacent polygon vertices */
        for (i = 0; i < n - 2; i++)
            for (j = i + 2; j < n; j++)
                if (no_cross(x[i], y[i], x[j], y[j])) {
                    map[i][j] = dis(x[i] - x[j], y[i] - y[j]);
                    map[j][i] = map[i][j];
                }
        scanf("%d%d%d%d", &px, &py, &vl, &vp);
        /* binary search on the meeting time (in minutes); the search
           bounds were lost in the source, so a generous upper bound
           is assumed here */
        l = 0;
        r = 1e6;
        while (r - l > 1e-8) {
            m = (l + r) / 2;
            t = best_path() / vp;
            if (t <= m) r = m; else l = m;
        }
        ans = (int)(m * 60 + 0.5);
        printf("Case %d: Time = %d:%02d\n", ++cases, ans / 60, ans % 60);
    }
    return 0;
}