Lecture5_1EvaluationClustering
Cluster assignment strategies for a clustered trace cache processor
Cluster Assignment Strategies for a Clustered Trace Cache Processor
Ravi Bhargava and Lizy K. John
Technical Report TR-033103-01
Laboratory for Computer Architecture
The University of Texas at Austin
Austin, Texas, 78712
{ravib,ljohn}@
March 31, 2003

Abstract
This report examines dynamic cluster assignment for a clustered trace cache processor (CTCP). Previously proposed clustering techniques run into unique problems as issue width and cluster count increase. Realistic design conditions, such as variable data forwarding latencies between clusters and a heavily partitioned instruction window, also increase the degree of difficulty for effective cluster assignment.

In this report, the trace cache and fill unit are used to perform effective dynamic cluster assignment. The retire-time fill unit analysis is aided by a dynamic profiling mechanism embedded within the trace cache. This mechanism provides information on inter-trace dependencies and critical inputs, elements absent in previous retire-time CTCP cluster assignment work. The strategy proposed in this report leads to more intra-cluster data forwarding and shorter data forwarding distances. In addition, performing this strategy at retire-time reduces issue-time complexity and eliminates early pipeline stages. This increases overall performance for the SPEC CPU2000 integer programs by 8.4% over our base CTCP architecture. This speedup is significantly higher than that of a previously proposed retire-time CTCP assignment strategy (1.9%). Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as media benchmarks.

1 Introduction
A clustered microarchitecture design allows for wide instruction execution while reducing the amount of complexity and long-latency communication [2,3,5,7,11,21]. The execution resources and register file are partitioned into smaller and simpler units. Within a cluster, communication is fast while inter-cluster communication is more costly. Therefore, the key to high performance on a clustered microarchitecture
is assigning instructions to clusters in a way that limits inter-cluster data communication.

During cluster assignment, an instruction is designated for execution on a particular cluster. This assignment process can be accomplished statically, dynamically at issue-time, or dynamically at retire-time. Static cluster assignment is traditionally done by a compiler or assembly programmer and may require ISA modification and intimate knowledge of the underlying cluster hardware. Studies that have compared static and dynamic assignment conclude that dynamic assignment results in higher performance [2,15].

Dynamic issue-time cluster assignment occurs after instructions are fetched and decoded. In recent literature, the prevailing philosophy is to assign instructions to a cluster based on data dependencies and workload balance [2,11,15,21]. The precise method varies based on the underlying architecture and execution cluster characteristics.

Typical issue-time cluster assignment strategies do not scale well. Dependency analysis is an inherently serial process that must be performed in parallel on all fetched instructions. Therefore, increasing the width of the microarchitecture further delays and frustrates this dependency analysis (also noted by Zyuban et al. [21]). Accomplishing even a simple steering algorithm requires additional pipeline stages early in the instruction pipeline.

In this report, the clustered execution architecture is combined with an instruction trace cache, resulting in a clustered trace cache processor (CTCP). A CTCP achieves a very wide instruction fetch bandwidth using the trace cache to fetch past multiple branches in a low-latency and high-bandwidth manner [13,14,17].

The CTCP environment enables the use of retire-time cluster assignment, which addresses many of the problems associated with issue-time cluster assignment. In a CTCP, the issue-time dynamic cluster assignment logic and steering network can be removed entirely. Instead, instructions are issued directly to
clusters based on their physical instruction order in a trace cache line or instruction cache block. This eliminates critical latency from the front-end of the pipeline. Instead, cluster assignment is accomplished at retire-time by physically (but not logically) reordering instructions so that they are issued directly to the desired cluster.

Friendly et al. present a retire-time cluster assignment strategy for a CTCP based on intra-trace data dependencies [6]. The trace cache fill unit is capable of performing advanced analysis since the latency at retire-time is more tolerable and less critical to performance [6,8]. The shortcoming of this strategy is the loss of dynamic information. Inter-trace dependencies and workload balance information are not available at instruction retirement and are ignored.

In this report, we increase the performance of a wide-issue CTCP using a feedback-directed, retire-time (FDRT) cluster assignment strategy. Extra fields are added to the trace cache to accumulate inter-trace dependency history, as well as the criticality of instruction inputs. The fill unit combines this information with intra-trace dependency analysis to determine cluster assignments.

This novel strategy increases the amount of critical intra-cluster data forwarding by 44% while decreasing the average data forwarding distance by 35% over our baseline four-cluster, 16-way CTCP. This leads to an 8.4% improvement in performance over our base architecture, compared to a 1.9% improvement for Friendly's method.

2 Clustered Microarchitecture
A clustered microarchitecture is designed to reduce the performance bottlenecks that result from wide-issue complexity [11]. Structures within a cluster are small and data forwarding delays are reduced as long as communication takes place within the cluster.

The target microarchitecture in this report is composed of four, four-way clusters. Four-wide, out-of-order execution engines have proven manageable in the past and are the building blocks of previously proposed
two-cluster microarchitectures. Similarly configured 16-wide CTCPs have been studied [6,21], but not with respect to the performance of dynamic cluster assignment options.

An example of the instruction and data routing for the baseline CTCP is shown in Figure 1. Notice that the cluster assignment for a particular instruction is dependent on its placement in the instruction buffer. The details of a single cluster are explored later in Figure 3.

Figure 1: Overview of a Clustered Trace Cache Processor
C2 and C3 are clusters identical to Cluster 1 and Cluster 4.

2.1 Shared Components
The front-end of the processor (i.e. fetch and decode) is shared by all of the cluster resources. Instructions fetched from the trace cache (or from the instruction cache on a trace cache miss) are decoded and renamed in parallel before finally being distributed to their respective clusters. The memory subsystem components, including the store buffer, load queue, and data cache, are also shared.

Pipeline. The baseline pipeline for our microarchitecture is shown in Figure 2. Three pipeline stages are assigned for instruction fetch (illustrated as one box). After the instructions are fetched, there are additional pipeline stages for decode, rename, issue, dispatch, and execute. Register file accesses are initiated during the rename stage. Memory instructions incur extra stages to access the TLB and data cache. Floating point instructions and complex instructions (not shown) also endure extra pipeline stages for execution.

Trace Cache. The trace cache allows multiple basic blocks to be fetched with just one request [13,14,17]. The retired instruction stream is fed to the fill unit, which constructs the traces. These traces consist of up to three basic blocks of instructions. When the traces are constructed, the intra-trace and intra-block dependencies are analyzed. This allows the fill unit to add bits to the trace cache line which accelerate register renaming and instruction steering [13]. This is the mechanism which is exploited to improve
instruction reordering and cluster assignment.

2.2 Cluster Design
The execution resources modeled in this report are heavily partitioned. As shown in Figure 3, each cluster consists of five reservation stations which feed a total of eight special-purpose functional units. Each reservation station holds eight instructions and permits out-of-order instruction selection. The economical size reduces the complexity of the wake-up and instruction select logic while maintaining a large overall instruction window size [11].

Figure 3: Details of One Cluster
There are eight special-purpose functional units per cluster: two simple integer units, one integer memory unit, one branch unit, one complex integer unit, one basic floating point (FP), one complex FP, one FP memory. There are five 8-entry reservation stations: one for the memory operations (integer and FP), one for branches, one for complex arithmetic, two for the simple operations. FP is not shown.

Intra-cluster communication (i.e. forwarding results from the execution units to the reservation stations within the same cluster) is done in the same cycle as instruction dispatch. However, forwarding data to a neighboring cluster takes two cycles, and each cluster beyond that takes another two cycles. This latency includes all of the communication and routing overhead associated with sharing inter-cluster data [12,21]. The end clusters do not communicate directly. There are no data bandwidth limitations between clusters in our work. Parcerisa et al. show that a point-to-point interconnect network can be built efficiently and is preferable to bus-based interconnects [12].

2.3 Cluster Assignment
The challenge to high performance in clustered microarchitectures is assigning instructions to the proper cluster. This includes identifying which instruction should go to which cluster and then routing the instructions accordingly. With 16 instructions to analyze and four clusters from which to choose, picking the best execution resource is not straightforward.

Accurate dependency analysis is a serial process
and is difficult to accomplish in a timely fashion. For example, approximately half of all result-producing instructions have data consumed by an instruction in the same cache line. Some of this information is preprocessed by the fill unit, but issue-time processing is also required. Properly analyzing the relationships is critical but costly in terms of pipe stages. Any extra pipeline stages hurt performance when the pipeline refills after branch mispredictions and instruction cache misses.

Totally flexible routing is also a high-latency process. So instead, our baseline architecture steers instructions to a cluster based on their physical placement in the instruction buffer. Instructions are sent in groups of four to their corresponding cluster, where they are routed on a smaller crossbar to their proper reservation station. This style of partitioning results in less complexity and fewer potential pipeline stages, but is restrictive in terms of issue-time flexibility and steering power.

A large crossbar will permit instruction movement from any position in the instruction buffer to any of the clusters. In addition to the latency and complexity drawbacks, this option mandates providing enough reservation station write ports to accommodate up to 16 new instructions per cycle. Therefore, we concentrate on simpler, low-latency instruction steering.

Assignment Options. For comparison purposes, we look at the following dynamic cluster assignment options:
• Issue-Time: Instructions are distributed to the cluster where one or more of their input data is known to be generated. Inter-trace and intra-trace dependencies are visible. At most four instructions are assigned to each cluster every cycle. Besides simplifying hardware, this also balances the cluster workloads. This option is examined with zero latency and with four cycles of latency for dependency analysis, instruction steering, and routing.
• Friendly Retire-Time: This is the only previously proposed fill unit cluster assignment policy. Friendly et
al. propose a fill unit reordering and assignment scheme based on intra-trace dependency analysis [6]. Their scheme assumes a front-end scheduler restricted to simple slot-based issue, as in our base model. For each issue slot, each instruction is checked for an intra-trace input dependency for the respective cluster. Based on these data dependencies, instructions are physically reordered within the trace.

3 CTCP Characteristics
The following characterization serves to highlight the cluster assignment optimization opportunities.

3.1 Trace-level Analysis
Table 1 presents some run-time trace line characteristics for our benchmarks. The first metric (% TC Instr) is the percentage of all retired instructions fetched from the trace cache. Benchmarks with a large percentage of trace cache instructions benefit more from fill unit optimizations since instructions from the instruction cache are unoptimized for the CTCP. Trace Size is the average number of instructions per trace line. When the fill unit does the intra-trace dependency analysis for a trace, this is the available optimization scope.

Table 1: Trace Characteristics
Trace Size 99.31 crafty 10.87 72.63 gap 11.75 69.66 gzip 11.79 96.61 parser 9.02 86.43 twolf 10.32 73.77 vpr 11.10 84.31

Figure 4: Source of Critical Input Dependency (y-axis: % Dynamic Instructions With Inputs; x-axis: bzp, crf, eon, gap, gcc, gzp, mcf, psr, prl, twf, vor, vpr)
From RS2: Critical input provided by the producer for input RS2.
From RS1: Critical input provided by the producer for input RS1.
From RF: Critical input provided by the register file.

Table 2: Dynamic Consumers Per Instruction
Inter-Trace 0.90 crafty 0.27 0.82 gap 0.24 1.03 gzip 0.30 0.54 parser 0.50 1.17 twolf 0.33 1.00 vpr 0.33 1.03

Table 3: Critical Data Forwarding Dependencies
% of critical dep.'s that are inter-trace 85.63% crafty 24.32% 86.58% gap 22.72% 87.59% gzip 24.38% 89.62% parser 38.16% 86.11% twolf 23.95% 85.51% vpr 25.84% 84.91%

1 For bzip2, the branch predictor accuracy is sensitive to the rate at which instructions retire and the “better” case with no
data forwarding latency actually leads to an increase in branch mispredictions and worse performance.

the register file read latency is presented and has almost no effect on overall performance. In fact, register file latencies between zero and 10 cycles have no impact on performance. This is due to the abundance and critical nature of in-flight instruction data forwarding.

3.3 Resolving Inter-Trace Dependencies
The fill unit accurately determines intra-trace dependencies. Since a trace is an atomic trace cache unit, the same intra-trace instruction data dependencies will exist when the trace is later fetched. However, incorporating inter-trace dependencies at retire-time is essentially a prediction of issue-time dependencies, some of which may occur thousands or millions of cycles in the future.

This problem presents an opportunity for an execution-history-based mechanism to predict the source clusters or producers for instructions with inter-trace dependencies. Table 4 examines how often an instruction's forwarded data comes from the same producer instruction. For each static instruction, the program counter of the last producer is tracked for each source register (RS1 and RS2). The table shows that an instruction's data forwarding producer is the same for RS1 96.3% of the time and the same for RS2 94.3% of the time.

Table 4: Frequency of Repeated Forwarding Producers
All Critical Inter-trace Input RS1 Input RS2 Input RS1 Input RS2 97.44% 89.30% crafty 97.82% 93.55% 93.83% 85.79% gap 93.65% 77.88% 96.39% 85.36% gzip 99.02% 96.04% 96.61% 92.36% parser 87.66% 78.76% 97.78% 90.83% twolf 90.78% 76.40% 89.67% 70.87% vpr 96.06% 91.67% 96.25% 86.92%

Inter-trace dependencies do not necessarily arrive from the previous trace. They could arrive from any trace in the past. In addition, static instructions are sometimes incorporated into several different dynamic traces. Table 5 analyzes the distance between an instruction and its critical inter-trace producer. The values are the percentage of such instructions that encounter the same distance
in consecutive executions. This percentage correlates very well to the percentages in the last two columns of Table 4. On average, 85.9% of critical inter-trace forwarding is the same distance from a producer as the previous dynamic instance of the instruction.

Table 5: Frequency of Repeated Critical Inter-Trace Forwarding Distances
bzip2 crafty eon gap gcc gzip mcf parser perlbmk twolf vortex vpr Average

decrease. The important aspect is that physical reordering reduces inter-cluster communications while maintaining low-latency, complexity-effective issue logic.

4.1 Pinning Instructions
Physically reordering instructions at retire-time based on execution history can cause more inter-cluster data forwarding than it eliminates. When speculated inter-trace dependencies guide the reordering strategy, the same trace of instructions can be reordered in a different manner each time the trace is constructed. Producers shift from one cluster to another, never allowing the consumers to accurately gauge the cluster from which their input data will be produced. A retire-time instruction reordering heuristic must be chosen carefully to avoid unstable cluster assignments and self-perpetuating performance problems.

To combat this problem, we pin key producers and their subsequent inter-trace consumers to the same cluster to force intra-cluster data forwarding between inter-trace dependencies. This creates a pinned chain of instructions with inter-trace dependencies. The first instruction of the pinned chain is called a leader. The subsequent links of the chain are referred to as followers. The criteria for selecting pin leaders and followers are presented in Table 6.

Table 6: Leader and Follower Criteria
Conditions to become a pin leader:
1. Not already a leader or follower
2. Producer provides last input data
3. Producer is a leader or follower
4. Producer is from different trace

There are two key aspects to these guidelines. The first is that only inter-trace dependencies are considered. Placing instructions with intra-trace
dependencies on the same cluster is easy and reliable. Therefore, these instructions do not require the pinned chain method to establish dependencies. Second, once an instruction is assigned to a pinned chain as a leader or a follower, its status should not change. The idea is to pin an instruction to one cluster and force the other instructions in the inter-trace dependency chain to follow it to that same cluster. If the pinned cluster is allowed to change, then it could lead to the performance-limiting race conditions discussed earlier.

Table 7: FDRT Cluster Assignment Strategy
Dependency type: if... if... if... if... if...
Intra-trace producer: no yes yes no no
Intra-trace consumer: 1. producer 1. pin 1. pin 1. middle 1. skip
Assignment Priority: 3. skip 3. skip 3. neighbor

2 Using Cacti 2.0 [16], an additional byte per instruction in a trace cache line is determined not to change the fetch latency of the trace cache.

4.3 Cluster Assignment Strategy
The fill unit must weigh intra-trace information along with the inter-trace feedback from the trace cache execution histories. Table 7 summarizes our proposed cluster assignment policies. The inputs to the strategy are: 1) the presence of an intra-trace dependency for the instruction's most critical source register (i.e. the input that was satisfied last during execution), 2) the pin chain status, and 3) the presence of intra-trace consumers. The fill unit starts with the oldest instruction and progresses in logical order to the youngest instruction.

For option A in Table 7, the fill unit attempts to place an instruction that has only an intra-trace dependency on the same cluster as its producer. If there are no instruction slots available for the producer's cluster, an attempt is made to assign the instruction to a neighboring cluster. For an instruction with just an inter-trace pin dependency (option B), the fill unit attempts to place the instruction on the pinned cluster (which is found in the Pinned Cluster trace profile field) or a neighboring cluster.

An instruction can have both
an intra-trace dependency and a pinned inter-trace dependency (option C). Recall that pinning an instruction is irreversible. Therefore, if the critical input changes or an instruction is built into a new trace, an intra-trace dependency could exist along with a pinned inter-trace dependency. When an instruction has both a pinned cluster and an intra-trace producer, the pinned cluster takes precedence (although our simulations show that it doesn't matter which gets precedence). If four instructions have already been assigned to this cluster, the intra-trace producer's cluster is the next target. Finally, there is an attempt to assign the instruction to a pinned cluster neighbor.

If an instruction has no dynamically forwarded input data but does have an intra-trace output dependency (option D), it is assigned to a middle cluster (to reduce potential forwarding distances).

Instructions are skipped if they have no input or output dependencies (option E), or if they cannot be assigned to a cluster near their producer (lowest priority assignment for options A-D). These instructions are later assigned to the remaining slots using Friendly's method.

4.4 Example
A simple illustration of the FDRT cluster assignment strategy is shown in Figure 6. There are two traces, T1 and T2. Trace T1 is the older of the two traces. Four instructions (out of at least 10) of each trace are shown. Instruction I7 is older than instruction I10. The arrows indicate dependencies.
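
The priority rules of Section 4.3 (options A-E) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the report's fill unit logic: the names `Instr`, `neighbors`, and `fdrt_assign` are hypothetical, clusters are numbered 0-3 (so the middle clusters are 1 and 2), the middle cluster with the lower index is tried first, and skipped instructions are simply left as `None` instead of being filled in later with Friendly's method.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_CLUSTERS = 4          # four clusters, as in the baseline CTCP
SLOTS_PER_CLUSTER = 4     # four issue slots per cluster
MIDDLE_CLUSTERS = (1, 2)  # assumption: zero-based numbering, linear layout


@dataclass
class Instr:
    # Index (within the trace) of the intra-trace producer of the critical input.
    producer: Optional[int] = None
    # Cluster recorded for a pinned inter-trace dependency, if any.
    pinned_cluster: Optional[int] = None
    # True if some younger instruction in the trace consumes this result.
    has_intra_consumer: bool = False


def neighbors(cluster: int) -> List[int]:
    """Adjacent clusters in a linear arrangement (end clusters have one)."""
    return [n for n in (cluster - 1, cluster + 1) if 0 <= n < NUM_CLUSTERS]


def fdrt_assign(trace: List[Instr]) -> List[Optional[int]]:
    """Return a cluster per instruction, or None if the instruction is skipped."""
    free = [SLOTS_PER_CLUSTER] * NUM_CLUSTERS
    assigned: List[Optional[int]] = [None] * len(trace)
    for i, ins in enumerate(trace):              # oldest to youngest
        prod = assigned[ins.producer] if ins.producer is not None else None
        pin = ins.pinned_cluster
        if pin is not None and prod is not None:   # option C: pin, producer, neighbor
            order = [pin, prod] + neighbors(pin)
        elif pin is not None:                       # option B: pin, neighbor
            order = [pin] + neighbors(pin)
        elif prod is not None:                      # option A: producer, neighbor
            order = [prod] + neighbors(prod)
        elif ins.producer is None and ins.has_intra_consumer:
            order = list(MIDDLE_CLUSTERS)           # option D: middle cluster
        else:
            order = []                              # option E: skip
        for c in order:
            if free[c] > 0:
                assigned[i] = c
                free[c] -= 1
                break
    return assigned


trace = [
    Instr(has_intra_consumer=True),       # D: no inputs, feeds a consumer
    Instr(producer=0),                    # A: follows its producer
    Instr(pinned_cluster=3),              # B: follows its pin
    Instr(producer=1, pinned_cluster=0),  # C: pin takes precedence
    Instr(),                              # E: skipped
]
print(fdrt_assign(trace))  # [1, 1, 3, 0, None]
```

Keeping the candidate clusters as an ordered list makes the "try the next target when four instructions are already assigned" behaviour a plain first-fit loop.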
The arrows originate at the producer and go to the consumer. A solid black arrowhead represents an intra-trace dependence and a white arrowhead represents an inter-trace dependence. A solid arrow line represents a critical input, and a dashed line represents a non-critical input. Arrows with numbers adjacent to them are chain dependencies where the number represents the chain cluster to which the instructions should be pinned.

The upper portion of this example examines four instructions from the middle of two different traces. The instruction numberings are in logical order. The arrows indicate dependencies, traveling from the producer to the consumer. The instructions are assigned to clusters 2 and 3 based on their dependencies. The letterings in parenthesis match those in Table 7.

Instruction T1I8 has one intra-trace producer (Option A), and is assigned to the same cluster as its producer TI dependencies (Option B). The chain cluster value is 3, so these instructions are assigned to that cluster. Note that the intra-trace input to T1I7 follows Option B and is assigned to Cluster 3 based on its chain inter-trace dependency. Instruction T2I9 and T2I7 and T2

Table 8: SPEC CINT2000 Benchmarks
Benchmark Inputs
MinneSPEC crafty crafty.in SPEC test gap -q -m 64M test.in MinneSPEC gzip smred.log 1 MinneSPEC parser 2.1.dict -batch mdred.in MinneSPEC twolf mdred MinneSPEC vpr small.arch.in -nodisp -place t5 -exit t0.9412 -inner

L1 Data Cache: 4-way, 32KB, 2-cycle access
L2 Unified cache: 4-way, 1MB, +8 cycles
Non-blocking: 16 MSHRs and 4 ports
D-TLB: 128-entry, 4-way, 1-cyc hit, 30-cyc miss
Store buffer: 32-entry w/ load forwarding
Load queue: 32-entry, no speculative disambiguation
Main Memory: Infinite, +65 cycles
Fetch Engine
· Functional unit    #    t.    Issue lat.
Simple Integer      2    1 cycle    1 cycle
Simple FP           2    3          1
Memory              1    1          1
Int. Mul/Div        1    3/20       1/19
FP Mul/Div/Sqrt     1    3/12/24    1/12/24
Int Branch          1    1          1
FP Branch           1    1          1
· Inter-Cluster Forwarding Latency: 2 cycles per forward
· Register File Latency: 2 cycles
· 5 Reservation stations
· 8 entries per reservation station
· 2 write ports per reservation station
· 192-entry ROB
· Fetch width: 16
· Decode width: 16
· Issue width: 16
· Execute width: 16
· Retire width: 16

5.2 Performance Analysis
Figure 7 presents the speedups over our base architecture for different dynamic cluster assignment strategies. The proposed feedback-directed, retire-time (FDRT) cluster assignment strategy provides a 7.3% improvement. Friendly's method improves performance by 1.9%. This improvement in performance is due to enhancements in both the intra-trace and inter-trace aspects of cluster assignment. Additional simulations (not shown) show that isolating the intra-trace heuristics from the FDRT strategy results in a 3.4% improvement by itself. The remaining performance improvement generated by FDRT assignment comes from the inter-trace analysis.

Figure 7: Speedup Due to Different Cluster Assignment Strategies (x-axis: bzp, gcc, crf, eon, gap, gzp, mcf, psr, prl, twf, vor, vpr, HM)

The performance boost is due to an increase in intra-cluster forwarding and a reduction in average data forwarding distance. Table 10 presents the changes in intra-cluster forwarding. On average, both CTCP retire-time cluster assignment schemes increase the amount of same-cluster forwarding to above 50%, with FDRT assignment doing better.

The inter-cluster distance is the primary cluster assignment performance-related factor (Table 11). For every benchmark, the retire-time instruction reordering schemes are able to improve upon the average forwarding distance. In addition, the FDRT scheme always provides shorter overall data forwarding distances than the Friendly method. This is a result of funneling producers with no input dependencies to the middle clusters and placing consumers as close as possible to
their producers.

For the program eon, the Friendly strategy provides a higher intra-cluster forwarding percentage than FDRT without resulting in higher performance. The reasons for this are two-fold. Most importantly, the average data forwarding distance is reduced compared to the Friendly method despite the extra inter-cluster forwarding. There are also secondary effects that result from improving overall forwarding latency, such as a change in the update rate for the branch predictor and BTB. In this case, our simulations show that the FDRT scheme led to improved branch prediction as well.

Table 10: Percentage of Intra-Cluster Forwarding For Critical Inputs
Benchmark    Friendly
bzip2        60.84%
crafty       54.29%
eon          52.83%
gap          58.77%
gcc          58.14%
gzip         53.91%
mcf          64.69%
parser       57.67%
perlbmk      58.36%
twolf        56.91%
vortex       54.00%
vpr          58.70%
Average      57.43%

Base    FDRT
0.83    0.24
0.90    0.59
0.96    0.70
0.71    0.49
0.71    0.51
0.94    0.56
0.62    0.44
0.69    0.53
0.78    0.49
0.73    0.56
0.78    0.52
0.92    0.57
0.80    0.52
Distance is the number of clusters traversed by forwarded data.

The two retire-time instruction reordering strategies are also compared to issue-time instruction steering in Figure 7. In one case, instruction steering and routing is modeled with no latency (labeled as No-lat Issue-time) and in the other case, four cycles are modeled (Issue-time). The results show that latency-free issue-time steering is the best, with a 9.9% improvement over the base. However, when applying an aggressive four-cycle latency, issue-time steering is only preferable for three of the 12 benchmarks and the average performance improvement (3.8%) is almost half that of FDRT cluster assignment.

5.3 FDRT Assignment Analysis
Figure 8 is a breakdown of instructions based on their FDRT assignment strategy option. On average, 32% have only an intra-trace dependency, while 16% of the instructions have just an inter-trace pinned dependency. Only 7% of the instructions have both a pinned inter-trace dependency and a critical intra-trace dependency. Therefore, 55% of the instructions are considered to be consumers and are therefore
placed near their producers.

Figure 8: FDRT Critical Input Distribution (x-axis: bzp, crf, eon, gap, gcc, gzp, mcf, psr, prl, twf, vor, vpr, Avg)
The letters A-E correspond to the lettered options in Table 7.

Around 10% of the instructions had no input dependencies but did have an intra-trace consumer. These producer instructions are assigned to a middle cluster, where their consumers will be placed on the same cluster later.

Only a very small percentage (less than 1%) of instructions with identified input dependencies are initially skipped because there is no suitable neighbor cluster for assignment. Finally, a large percentage of instructions (around 34%) are determined not to have a critical intra-trace dependency or pinned inter-trace dependency. Most of these instructions do have data dependencies, but they did not require data forwarding or did not meet the pin criteria.

Table 12 presents pinned chain characteristics, including the average number of leaders per trace and average number of followers per trace. Because pin dependencies are limited to inter-trace dependencies, the combined number of leaders and followers is only 2.90 per trace. This is about 1/4 of the instructions in a trace.

For some of the options in Table 7, the fill unit cluster assignment mechanism attempts to place
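
The forwarding distance and latency figures used throughout the evaluation follow from the cluster layout described in Section 2.2: same-cluster forwarding adds no delay, each cluster traversed costs two cycles, and the end clusters do not communicate directly. A minimal Python sketch of that model is given below. The function names and the zero-based, left-to-right cluster numbering are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

NUM_CLUSTERS = 4     # four clusters arranged in a line (assumed numbering 0..3)
CYCLES_PER_HOP = 2   # "2 cycles per forward" between neighboring clusters


def forwarding_distance(src: int, dst: int) -> int:
    """Number of clusters traversed by data forwarded from src to dst."""
    return abs(src - dst)


def forwarding_latency(src: int, dst: int) -> int:
    """Extra cycles to forward a result; zero within the same cluster."""
    return CYCLES_PER_HOP * forwarding_distance(src, dst)


def average_distance(pairs: Iterable[Tuple[int, int]]) -> float:
    """Mean forwarding distance over (producer cluster, consumer cluster) pairs."""
    pairs = list(pairs)
    return sum(forwarding_distance(s, d) for s, d in pairs) / len(pairs)


# End-to-end forwarding crosses three clusters and costs six cycles.
print(forwarding_latency(0, 3))  # 6
print(average_distance([(0, 0), (0, 1), (1, 3), (2, 2)]))  # 0.75
```

Under this model, placing producers without inputs on a middle cluster bounds the worst-case distance to their consumers at two clusters instead of three.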
Algorithms for bigram and trigram word clustering
Speech Communication 24 (1998) 19–37

Algorithms for bigram and trigram word clustering 1

Sven Martin *, Jörg Liermann, Hermann Ney 2
Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, Ahornstraße 55, D-52056 Aachen, Germany
Received 5 June 1996; revised 15 January 1997; accepted 23 September 1997

Abstract
In this paper, we describe an efficient method for obtaining word classes for class language models. The method employs an exchange algorithm using the criterion of perplexity improvement. The novel contributions of this paper are the extension of the class bigram perplexity criterion to the class trigram perplexity criterion, the description of an efficient implementation for speeding up the clustering process, the detailed computational complexity analysis of the clustering algorithm, and, finally, experimental results on large text corpora of about 1, 4, 39 and 241 million words including examples of word classes, test corpus perplexities in comparison to word language models, and speech recognition results. © 1998 Elsevier Science B.V. All rights reserved.

Zusammenfassung
In this report, we describe an efficient method for generating word classes for class-based language models. The method is based on an exchange algorithm using the criterion of perplexity improvement. The new contributions of this work are the extension of the class bigram perplexity criterion to the class trigram perplexity criterion, the description of an efficient implementation for speeding up the class formation process, the detailed complexity analysis of this implementation, and finally experimental results on large text corpora of about 1, 4, 39 and 241 million words, including examples of generated word classes, test corpus perplexities in comparison to word-based language models, and recognition results on speech data.

Résumé
In this article, we describe an efficient method for obtaining word classes for language models. This method employs an exchange algorithm that uses the criterion of perplexity improvement. The new contributions of this work concern the extension to trigrams of the class bigram perplexity criterion, the description of an efficient implementation for speeding up the clustering process, the detailed analysis of the computational complexity, and, finally, experimental results on large text corpora of 1, 4, 39 and 241 million words, including examples of the word classes produced, test corpus perplexities compared with word language models, and speech recognition results.

Keywords: Stochastic language modeling; Statistical clustering; Word equivalence classes; Wall Street Journal corpus

* Corresponding author. Email: martin@informatik.rwth-aachen.de.
1 This paper is based on a communication presented at the ESCA Conference EUROSPEECH'95 and has been recommended by the EUROSPEECH'95 Scientific Committee.
2 Email: ney@informatik.rwth-aachen.de.
0167-6393/98/$19.00. PII S0167-6393(97)00062-9

1. Introduction
The need for a stochastic language model in speech recognition arises from Bayes' decision rule for minimum error rate (Bahl et al., 1983). The word sequence $w_1 \ldots w_N$ to be recognized from the sequence of acoustic observations $x_1 \ldots x_T$ is determined as that word sequence $w_1 \ldots w_N$ for which the posterior probability $\Pr(w_1 \ldots w_N \mid x_1 \ldots x_T)$ attains its maximum. This rule can be rewritten in the form

\[ \arg\max_{w_1 \ldots w_N} \left\{ \Pr(w_1 \ldots w_N) \cdot \Pr(x_1 \ldots x_T \mid w_1 \ldots w_N) \right\}, \]

where $\Pr(x_1 \ldots x_T \mid w_1 \ldots w_N)$ is the conditional probability of, given the word sequence $w_1 \ldots w_N$, observing the sequence of acoustic measurements $x_1 \ldots x_T$ and where $\Pr(w_1 \ldots w_N)$ is the prior probability of producing the word sequence $w_1 \ldots w_N$.

The task of the stochastic language model is to provide estimates of these prior probabilities $\Pr(w_1 \ldots w_N)$. Using the definition of conditional probabilities, we obtain the decomposition:

\[ \Pr(w_1 \ldots w_N) = \prod_{n=1}^{N} \Pr(w_n \mid w_1 \ldots w_{n-1}). \]

For large vocabulary speech recognition, these conditional probabilities are typically used in the following way (Bahl et al., 1983). The dependence of the conditional probability of observing a word $w_n$ at a position $n$ is assumed to be restricted to its immediate $(m-1)$ predecessor words $w_{n-m+1} \ldots w_{n-1}$. The resulting model is that of a Markov chain and is referred to as an $m$-gram model. For $m = 2$ and $m = 3$, we obtain the widely used bigram and trigram models, respectively. These bigram and trigram models are estimated from a text corpus during a training phase. But even for these restricted models, most of the possible events, i.e., word pairs and word triples, are never seen in training because there are so many of them. Therefore, in order to allow for events not seen in training, the probability distributions obtained in these $m$-gram approaches are smoothed with more general distributions. Usually, these are also $m$-grams with a smaller value for $m$ or a more sophisticated approach like a singleton distribution (Jelinek, 1991; Ney et al., 1994; Ney et al., 1997).

In this paper, we try a different approach for smoothing by using word equivalence classes, or word classes for short. Here, each word belongs to exactly one word class. If a certain word $m$-gram did not appear in the training corpus, it is still possible that the $m$-gram of the word classes corresponding to these words did occur and thus a word class based $m$-gram language model, or class $m$-gram model for short, can be estimated. More generally, as the number of word classes is smaller than the number of words, the number of model parameters is reduced so that each parameter can be estimated more reliably. On the other hand, reducing the number of model parameters makes the model coarser and thus the
prediction of the next word less precise. So there has to be a tradeoff between these two extremes.

Typically, word classes are based on syntactic-semantic concepts and are defined by linguistic experts. In this case, they are called parts of speech (POS). Generalizing the concept of word similarities, we can also define word classes by using a statistical criterion, which in most cases, but not necessarily, is maximum likelihood or, equivalently, perplexity (Jelinek, 1991; Brown et al., 1992; Kneser and Ney, 1993; Ney et al., 1994). With the latter two approaches, word classes are defined using a clustering algorithm based on minimizing the perplexity of a class bigram language model on the training corpus, which we will call bigram clustering for short. The contributions of this paper are:
• the extension of the clustering algorithm from the bigram criterion to the trigram criterion;
• the detailed analysis of the computational complexity of both bigram and trigram clustering algorithms;
• the design and discussion of an efficient implementation of both clustering algorithms;
• systematic tests using the 39-million word Wall Street Journal corpus concerning perplexity and clustering times for various numbers of word classes and initialization methods;
• speech recognition results using the North American Business corpus.

Table 1
List of symbols
$W$: vocabulary size
$u, v, w, x$: words in a running text; usually $w$ is the word under discussion, $x$ its successor, $v$ its predecessor and $u$ the predecessor to $v$
$w_n$: word in text corpus position $n$
$S(w)$: set of successor words to word $w$ in the training corpus
$P(w)$: set of predecessor words to word $w$ in the training corpus
$S(v,w)$: set of successor words to bigram $(v,w)$ in the training corpus
$P(v,w)$: set of predecessor words to bigram $(v,w)$ in the training corpus
$G$: number of word classes
$G: w \to g_w$: class mapping function
$g, k$: word classes
$N$: training corpus size
$B$: number of distinct word bigrams in the training corpus
$T$: number of distinct word trigrams in the training corpus
$N(\cdot)$: number of occurrences in the training corpus of the event in parentheses
$F_{bi}(G)$: log-likelihood for a class bigram model
$F_{tri}(G)$: log-likelihood for a class trigram model
$PP$: perplexity
$I$: number of iterations of the clustering algorithm
$G(\cdot,w) = \sum_{g: N(g,w)>0} 1$: number of seen predecessor word classes to word $w$
$G(w,\cdot) = \sum_{g: N(w,g)>0} 1$: number of seen successor word classes to word $w$
$W^{-1} \cdot \sum_w G(\cdot,w)$: average number of seen predecessor word classes
$W^{-1} \cdot \sum_w G(w,\cdot)$: average number of seen successor word classes
$G(\cdot,\cdot,w) = \sum_{g_1,g_2: N(g_1,g_2,w)>0} 1$: number of seen word class bigrams preceding word $w$
$G(\cdot,w,\cdot) = \sum_{g_1,g_2: N(g_1,w,g_2)>0} 1$: number of seen word class pairs embracing word $w$
$G(w,\cdot,\cdot) = \sum_{g_1,g_2: N(w,g_1,g_2)>0} 1$: number of seen word class bigrams succeeding word $w$
$b$: absolute discounting value for smoothing
$N_r(g)$: number of distinct words appearing $r$ times in word class $g$
$G_r(g_v,\cdot)$: number of distinct word classes seen $r$ times right after word class $g_v$
$G_r(\cdot,g_w)$: number of distinct word classes seen $r$ times right before word class $g_w$
$G_r(\cdot,\cdot)$: number of distinct word class bigrams seen $r$ times
$\beta(g_w)$: generalized distribution for smoothing

The original exchange algorithm presented in this paper was published in (Kneser and Ney, 1993) with good results on the LOB corpus. There is a different
approach described in (Brown et al., 1992) employing a bottom-up algorithm. There are also approaches based on simulated annealing (Jardino and Adda, 1994). Word classes can also be derived from an automated semantic analysis (Bellegarda et al., 1996), or by morphological features (Lafferty and Mercer, 1993).

The organization of this paper is as follows: Section 2 gives a definition of class models, explains the outline of the clustering algorithm and the extension to a trigram based statistical clustering criterion. Section 3 presents an efficient implementation of the clustering algorithm. Section 4 analyses the computational complexity of this efficient implementation. Section 5 reports on text corpus experiments concerning the performance of the clustering algorithm in terms of CPU time, resulting word classes and training and test perplexities. Section 6 shows the results for the speech recognition experiments. Section 7 discusses the results and their usefulness to language models. In this paper, we introduce a large number of symbols and quantities; they are summarized in Table 1.

2. Class models and clustering algorithm

In this section, we will present our class bigram and trigram models and we will derive their log likelihood function, which serves as our statistical criterion for obtaining word classes. With our approach, word classes result from a clustering algorithm, which exchanges a word between a fixed number of word classes and assigns it to the word class where it optimizes the log likelihood. We will discuss alternative strategies for finding word classes. We will also describe smoothing methods for the class models trained, which are necessary to avoid zero probabilities on test corpora.

2.1. Class bigram models

We partition the vocabulary of size $W$ into a fixed number $G$ of word classes. The partition is represented by the so-called class or category mapping function

\[ G: w \to g_w \]

mapping each word $w$ of the vocabulary to its word class $g_w$. Assigning a word to only one word class is a possible drawback which is justified by the simplicity and efficiency of the clustering process. For the rest of this paper, we will use the letters $g$ and $k$ for arbitrary word classes. For a word bigram $(v,w)$ we use $(g_v, g_w)$ to denote the corresponding class bigram.

For class models, we have two types of probability distributions:
• a transition probability function $p_1(g_w \mid g_v)$ which represents the first-order Markov chain probability for predicting the word class $g_w$ from its predecessor word class $g_v$;
• a membership probability function $p_0(w \mid g)$ estimating the word $w$ from word class $g$.

Since a word belongs to exactly one word class, we have

\[ p_0(w \mid g) \begin{cases} > 0 & \text{if } g = g_w, \\ = 0 & \text{if } g \neq g_w. \end{cases} \]

Therefore, we can use the somewhat sloppy notation $p_0(w \mid g_w)$.

For a class bigram model, we have then:

\[ p(w \mid v) = p_0(w \mid g_w) \cdot p_1(g_w \mid g_v). \tag{1} \]

Note that this model is a proper probability function, and that we make an independence assumption between the prediction of a word from its word class and the prediction of a word class from its predecessor word classes. Such a model leads to a drastic reduction in the number of free parameters: $G \cdot (G-1)$ probabilities for the table $p_1(g_w \mid g_v)$, $(W - G)$ probabilities for the table $p_0(w \mid g_w)$, and $W$ indices for the mapping $G: w \to g_w$.

For maximum likelihood estimation, we construct the log likelihood function using Eq. (1):

\[ F_{bi}(G) = \sum_{n=1}^{N} \log \Pr(w_n \mid w_1 \ldots w_{n-1}) = \sum_{v,w} N(v,w) \cdot \log p(w \mid v) \]
\[ = \sum_{w} N(w) \cdot \log p_0(w \mid g_w) + \sum_{g_v,g_w} N(g_v,g_w) \cdot \log p_1(g_w \mid g_v) \tag{2} \]

with $N(\cdot)$ being the number of occurrences of the event given in the parentheses in the training data. To construct a class bigram model, we first hypothesize a mapping function $G$. Then, for this hypothesized mapping function $G$, the probabilities $p_0(w \mid g_w)$ and $p_1(g_w \mid g_v)$ in Eq. (2) can be estimated by adding the Lagrange multipliers for the normalization constraints and taking the derivatives. This results in relative frequencies (Ney et al., 1994):

\[ p_0(w \mid g_w) = \frac{N(w)}{N(g_w)}, \tag{3} \]
\[ p_1(g_w \mid g_v) = \frac{N(g_v,g_w)}{N(g_v)}. \tag{4} \]

Using the estimates given by Eqs. (3) and (4), we can now express the log likelihood function $F_{bi}(G)$ for a mapping $G$ in terms of the counts:

\[ F_{bi}(G) = \sum_{v,w} N(v,w) \cdot \log p(w \mid v) \]
\[ = \sum_{w} N(w) \cdot \log \frac{N(w)}{N(g_w)} + \sum_{g_v,g_w} N(g_v,g_w) \cdot \log \frac{N(g_v,g_w)}{N(g_v)} \]
\[ = \sum_{g_v,g_w} N(g_v,g_w) \cdot \log N(g_v,g_w) - 2 \cdot \sum_{g} N(g) \cdot \log N(g) + \sum_{w} N(w) \cdot \log N(w) \tag{5} \]
\[ = \sum_{w} N(w) \log N(w) + \sum_{g_v,g_w} N(g_v,g_w) \log \frac{N(g_v,g_w)}{N(g_v)\, N(g_w)}. \tag{6} \]

In (Brown et al., 1992) the second sum of Eq. (6) is interpreted as the mutual information between the word classes $g_v$ and $g_w$. Note, however, that the derivation given here is based on the maximum likelihood criterion only.

2.2. Class trigram models

Constructing the log likelihood function for the class trigram model

\[ p(w \mid u,v) = p_0(w \mid g_w) \cdot p_2(g_w \mid g_u,g_v) \tag{7} \]

results in

\[ F_{tri}(G) = \sum_{w} N(w) \cdot \log p_0(w \mid g_w) + \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w) \cdot \log p_2(g_w \mid g_u,g_v). \tag{8} \]

Taking the derivatives of Eq. (8) for maximum likelihood parameter estimation also results in relative frequencies

\[ p_2(g_w \mid g_u,g_v) = \frac{N(g_u,g_v,g_w)}{N(g_u,g_v)} \tag{9} \]

and, using Eqs. (3), (7)–(9):

\[ F_{tri}(G) = \sum_{u,v,w} N(u,v,w) \cdot \log p(w \mid u,v) \]
\[ = \sum_{w} N(w) \cdot \log \frac{N(w)}{N(g_w)} + \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w) \cdot \log \frac{N(g_u,g_v,g_w)}{N(g_u,g_v)} \]
\[ = \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w) \cdot \log N(g_u,g_v,g_w) - \sum_{g_u,g_v} N(g_u,g_v) \cdot \log N(g_u,g_v) - \sum_{g_w} N(g_w) \cdot \log N(g_w) + \sum_{w} N(w) \cdot \log N(w) \]
\[ = \sum_{w} N(w) \log N(w) + \sum_{g_u,g_v,g_w} N(g_u,g_v,g_w) \log \frac{N(g_u,g_v,g_w)}{N(g_u,g_v)\, N(g_w)}. \tag{10} \]

2.3. Exchange algorithm

To find the unknown mapping $G: w \to g_w$, we will show now how to apply a clustering algorithm. The goal of this algorithm is to find a class mapping function $G$ such that the perplexity of the class model is minimized over the training corpus. We use an exchange algorithm similar to the exchange algorithms used in conventional clustering (ISODATA (Duda and Hart, 1973, pp. 227–228)), where an
observation vector is exchanged from one cluster to another cluster in order to improve the criterion. In the case of language modeling, the optimization criterion is the log-likelihood, i.e., Eq. (5) for the class bigram model and Eq. (10) for the class trigram model. The algorithm employs a technique of local optimization by looping through each element of the set, moving it tentatively to each of the $G$ word classes and assigning it to that word class resulting in the lowest perplexity. The whole procedure is repeated until a stopping criterion is met. The outline of our algorithm is depicted in Fig. 1.

We will use the term to remove for taking a word out of the word class to which it has been assigned in the previous iteration, the term to move for inserting a word into a word class, and the term to exchange for a combination of a removal followed by a move.

For initialization, we use the following method: we consider the most frequent $(G-1)$ words, and each of these words defines its own word class. The remaining words are assigned to an additional word class. As a side effect, all the words with a zero unigram count $N(w)$ are assigned to this word class and remain there, because exchanging them has no effect on the training corpus perplexity. The stopping criterion is a prespecified number of iterations. In addition, the algorithm stops if no words are exchanged any more.

Fig. 1. Outline of the exchange algorithm for word clustering.

Thus, in this method, we exploit the training corpus in two ways:
1. in order to find the optimal partitioning;
2. in order to evaluate the perplexity.

An alternative approach would be to use two different data sets for these two tasks, or to simulate unseen events using leaving-one-out. That would result in an upper bound and possibly in more robust word classes, but at the cost of higher mathematical and computational expenses. Kneser and Ney (1993) employ leaving-one-out for clustering. However, the improvement was
not very significant, and so we will use the simpler original method here. An efficient implementation of this clustering algorithm will be presented in Section 3.

2.4. Comparison with alternative optimization strategies

It is interesting to compare the exchange algorithm for word clustering with two other approaches described in the literature, namely simulated annealing (Jardino and Adda, 1993) and bottom-up clustering (Brown et al., 1992).

In simulated annealing, the baseline optimization strategy is similar to the strategy of the exchange algorithm. The important difference is that, according to the simulated annealing concept, we accept temporary degradations of the optimization criterion. The decision of whether to accept a degradation or not is made dependent on the so-called cooling parameter. This approach is usually referred to as the Metropolis algorithm. Another difference is that the words to be exchanged from one word class to another and the target word classes are selected by the so-called Monte Carlo method. Using the correct cooling parameter, simulated annealing converges to the global optimum. In our own experimental tests (unpublished results), we found only a marginal improvement in the perplexity criterion at dramatically increased computational costs. In (Jardino, 1996), simulated annealing is applied to a large training corpus from the Wall Street Journal, but no CPU times are given. In addition, in (Jardino and Adda, 1994), the authors introduce a modification of the clustering model allowing several word classes for each word, at least in principle. This modification, however, is more related to the definition of the clustering model and not that much to the optimization strategy. In this paper, we do not consider such types of stochastic class mappings.

The other optimization strategy, bottom-up clustering, as presented in (Brown et al., 1992), is also based on the perplexity criterion given by Eq. (6). However, instead of the exchange algorithm, the
authors use the well-known hierarchical bottom-up clustering algorithm as described in (Duda and Hart, 1973, pp. 230 and 235). The typical iteration step here is to reduce the number of word classes by one. This is achieved by merging that pair of word classes for which the perplexity degradation is the smallest. This process is repeated until the desired number of word classes has been obtained. The iteration process is initialized by defining a separate word class for each word. In (Brown et al., 1992), the authors describe special methods to keep the computational complexity of the algorithm as small as possible. Obviously, like the exchange algorithm, this bottom-up clustering strategy achieves only a local optimum. As reported in (Brown et al., 1992), the exchange algorithm can be used to improve the results obtained by bottom-up clustering. From this result and our own experimental results for the various initialization methods of the exchange algorithm (see Section 5.4), we may conclude that there is no basic performance difference between bottom-up clustering and exchange clustering.

2.5. Smoothing methods

On the training corpus, Eqs. (3), (4) and (9) are well-defined. However, even though the parameter estimation for class models is more robust than for word models, some of the class bigrams or trigrams in a test corpus may have zero frequencies in the training corpus, resulting in zero probabilities. To avoid this, smoothing must be used on the test corpus. However, for the clustering process on the training corpus, the unsmoothed relative frequencies of Eqs. (3), (4) and (9) are still used.

To smooth the transition probability, we use the method of absolute interpolation with a singleton generalized distribution (Ney et al., 1995, 1997):

\[ p_1(g_w \mid g_v) = \max\left(0, \frac{N(g_v,g_w) - b}{N(g_v)}\right) + \left(G - G_0(g_v,\cdot)\right) \cdot \frac{b}{N(g_v)} \cdot \beta(g_w), \]
\[ b = \frac{G_1(\cdot,\cdot)}{G_1(\cdot,\cdot) + 2 \cdot G_2(\cdot,\cdot)}, \]
\[ \beta(g_w) = \frac{G_1(\cdot,g_w)}{G_1(\cdot,\cdot)}, \]

with $b$ standing for the history-independent
discounting value, $G_r(g_v,\cdot)$ for the number of word classes seen $r$ times right after word class $g_v$, $G_r(\cdot,g_w)$ for the number of word classes seen $r$ times right before word class $g_w$, and $G_r(\cdot,\cdot)$ for the number of distinct word class bigrams seen $r$ times in the training corpus. $\beta(g_w)$ is the so-called singleton generalized distribution (Ney et al., 1995, 1997). The same method is used for the class trigram model.

To smooth the membership distribution, we use the method of absolute discounting with backing off (Ney et al., 1995, 1997):

\[ p_0(w \mid g_w) = \begin{cases} \dfrac{N(w) - b_{g_w}}{N(g_w)} & \text{if } N(w) > 0, \\[2ex] \dfrac{b_{g_w}}{N(g_w)} \cdot \displaystyle\sum_{r>0} N_r(g_w) \cdot \dfrac{1}{N_0(g_w)} & \text{if } N(w) = 0, \end{cases} \]
\[ b_{g_w} = \frac{N_1(g_w)}{N_1(g_w) + 2 \cdot N_2(g_w)}, \]
\[ N_r(g_w) := \sum_{w':\, g_{w'} = g_w,\; N(w') = r} 1, \]

with $b_{g_w}$ standing for the word class dependent discounting value and $N_r(g_w)$ for the number of words appearing $r$ times and belonging to word class $g_w$. The reason for a different smoothing method for the membership distribution is that no singleton generalized distribution can be constructed from unigram counts. Without singletons, backing off works better than interpolation (Ney et al., 1997). However, no smoothing is applied to word classes with no unseen words. With our clustering algorithm, there is only one word class containing unseen words. Therefore, the effect of the kind of smoothing used for the membership distribution is negligible. Thus, for the sake of consistency, absolute interpolation could be used to smooth both distributions.

3. Efficient clustering implementation

A straightforward implementation of our clustering algorithm presented in Section 2.3 is time consuming and prohibitive even for a small number of word classes $G$. In this section, we will present our techniques to improve computational performance in order to obtain word classes for large numbers of word classes. A detailed complexity analysis of the resulting algorithm will be presented in Section 4.

3.1. Bigram clustering

We will use the log-likelihood Eq. (5) as the criterion for bigram
clustering, which is equivalent to the perplexity criterion. The exchange of a word between word classes is entirely described by altering the affected counts of this formula.

3.1.1. Efficient method for count generation

All the counts of Eq. (5) are computed once, stored in tables and updated after a word exchange. As we will see later, we need additional counts

\[ N(w,g) = \sum_{x:\, g_x = g} N(w,x), \tag{11} \]
\[ N(g,w) = \sum_{v:\, g_v = g} N(v,w) \tag{12} \]

describing how often a word class $g$ appears right after and right before, respectively, a word $w$. These counts are recounted anew for each word currently under consideration, because updating them, if necessary, would require the same effort as recounting, and would require more memory because of the large tables.

Fig. 2. Efficient procedure for count generation.

For a fixed word $w$ in Eqs. (11) and (12), we need to know the predecessor and the successor words, which are stored as lists for each word $w$, and the corresponding bigram counts. However, we observe that if word $v$ precedes $w$, then $w$ succeeds $v$. Consequently, the bigram $(v,w)$ is stored twice, once in the list of successors to $v$, and once in the list of predecessors to $w$, thus resulting in high memory consumption. However, dropping one type of list would result in a high search effort. Therefore we keep both lists, but with bigram counts stored only in the list of successors. Using four bytes for the counts and two bytes for the word indexes, we reduce the memory requirements by 1/3 at the cost of a minor search effort for obtaining the count $N(v,w)$ from the list of successors to $v$ by binary search. The count generation procedure for Eqs. (11) and (12) is depicted in Fig. 2.

3.1.2. Baseline perplexity recomputation

We will examine how the counts in Eq. (5) must be updated in a word exchange. We observe that removing a word $w$ from word class $g_w$ and moving it to a word class $k$ only affects those counts of Eq. (5) that involve $g_w$ or $k$; all the other counts, and, consequently, their
contributions to the perplexity remain unchanged. Thus, to compute the change in perplexity, we recompute only those terms in Eq. (5) which involve the affected counts.

We consider in detail how to remove a word from word class $g_w$. Moving a word to a word class $k$ is similar. First, we have to reduce the word class unigram count:

\[ N(g_w) := N(g_w) - N(w). \]

Then, we have to decrement the transition counts from $g_w$ to a word class $g \neq g_w$ and from an arbitrary word class $g \neq g_w$ by the number of times $w$ appears right before or right after $g$, respectively:

\[ \forall g \neq g_w: \quad N(g,g_w) := N(g,g_w) - N(g,w), \tag{13} \]
\[ \forall g \neq g_w: \quad N(g_w,g) := N(g_w,g) - N(w,g). \tag{14} \]

Changing the self-transition count $N(g_w,g_w)$ is a bit more complicated. We have to reduce this count by the number of times $w$ appears right before or right after another word of $g_w$. However, if $w$ follows itself in the corpus, $N(w,w)$ is considered in both Eqs. (11) and (12). Therefore, it is subtracted twice from the transition count and must be added once for compensation:

\[ N(g_w,g_w) := N(g_w,g_w) - N(g_w,w) - N(w,g_w) + N(w,w). \tag{15} \]

Finally, we have to update the counts $N(g_w,w)$ and $N(w,g_w)$:

\[ N(g_w,w) := N(g_w,w) - N(w,w), \]
\[ N(w,g_w) := N(w,g_w) - N(w,w). \]

We can view Eq. (15) as an application of the inclusion/exclusion principle from combinatorics (Takács, 1984). If two subsets $A$ and $B$ of a set $C$ are to be removed from $C$, the intersection of $A$ and $B$ can only be removed once. Fig. 3 gives an interpretation of this principle applied to our problem of count updating. Viewing these updates in terms of the inclusion/exclusion principle will help to understand the mathematically more complicated update formulae for trigram clustering.
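The removal updates of Eqs. (13)–(15) can be checked on a toy corpus. The following Python sketch is only an illustration under stated assumptions: the function names `class_bigram_counts` and `remove_word`, the dictionary-based count tables and the ten-word corpus are all hypothetical and are not the authors' implementation, which uses successor and predecessor lists. The sketch updates only the class transition counts and compares the incremental result with a full recount in which the removed word belongs to no class.

```python
from collections import Counter

def class_bigram_counts(bigrams, cls):
    """N(g, g'): class bigram counts; words without a class are skipped."""
    counts = Counter()
    for (v, x), n in bigrams.items():
        if v in cls and x in cls:
            counts[cls[v], cls[x]] += n
    return counts

def remove_word(bigrams, cls, class_counts, w):
    """Update N(g, g') in place when w is taken out of its class g_w."""
    gw = cls[w]
    succ = Counter()  # N(w, g), Eq. (11), recounted for the current word
    pred = Counter()  # N(g, w), Eq. (12), recounted for the current word
    for (v, x), n in bigrams.items():
        if v == w:
            succ[cls[x]] += n
        if x == w:
            pred[cls[v]] += n
    for g in set(cls.values()):
        if g != gw:
            class_counts[g, gw] -= pred[g]  # Eq. (13)
            class_counts[gw, g] -= succ[g]  # Eq. (14)
    # Eq. (15): N(w, w) was subtracted twice, so it is added back once.
    class_counts[gw, gw] += bigrams.get((w, w), 0) - pred[gw] - succ[gw]

# Toy data (hypothetical): words "a" and "b" share class 0, "c" is class 1.
corpus = "a b a a c b a a b c".split()
bigrams = Counter(zip(corpus, corpus[1:]))
cls = {"a": 0, "b": 0, "c": 1}

updated = class_bigram_counts(bigrams, cls)
remove_word(bigrams, cls, updated, "a")

# Reference: recount from scratch with "a" assigned to no class.
reduced = {word: g for word, g in cls.items() if word != "a"}
expected = class_bigram_counts(bigrams, reduced)
assert all(updated[k] == expected[k] for k in set(updated) | set(expected))
```

Because "a" follows itself twice in the toy corpus, dropping the compensation term in Eq. (15) would leave the self-transition count of class 0 at -2 instead of 0, which is the double subtraction the inclusion/exclusion argument guards against.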
CONCEPTS IN CONCEPTUAL CLUSTERING
Robert E. Stepp
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Urbana, IL 61801

ABSTRACT

Although it has a relatively short history, conceptual clustering is an especially active area of research in machine learning. There are a variety of ways in which conceptual patterns (the AI contribution to clustering) play a role in the clustering process. Two distinct conceptual clustering paradigms (conceptual sorting of exemplars and concept discovery) are described briefly. Then six types of conceptual clustering algorithms are characterized, attempting to cover the present spectrum of mechanisms used to conceptualize the clustering process.

I CONCEPTUAL CLUSTERING: The New Frontier

Ever since Michalski wrote about conceptual clustering as a new branch of machine learning (Michalski 1980) there has been ever increasing attention to that family of machine learning tasks. Several researchers have been involved in conceptual clustering research, though early research (the next two citations in particular) was not conducted in the name of conceptual clustering. Wolff (1980) describes MK10, an agglomerative hierarchical data compression system that is able to generate conjunctive descriptions of clusters based on co-occurrences of feature values. Lebowitz (1982 and 1983) describes UNIMEM and IPP, systems that use what he calls Generalization Based Memory to incrementally clump exemplars into overlapping conceptual categories based on predictive features. Michalski and Stepp (1983) describe CLUSTER/2, a conceptual clustering algorithm for building polythetic clusterings (clusterings whose differences depend on discovered conjunctive concepts rather than variations in the value taken by a single attribute). Langley and Sage (1984) describe DISCON, an ID3-like (Quinlan 1983) optimal classification tree builder that forms monothetic hierarchical clusterings given a list of "interesting" attributes.
Fisher (1984) describes RUMMAGE, a DISCON-like program that does some generalization over attribute values and uses non-exhaustive search. Stepp (1984) describes CLUSTER/S, a conjunctive conceptual clustering algorithm for use on structured exemplars. Langley, Zytkow, Simon, and Bradshaw (1985) describe GLAUBER, a concept discovery system based partly on MK10, that employs conceptual clumping based on most commonly occurring relations in data. Stepp and Michalski (1986) describe algorithms for incorporating background knowledge and classification goals. Mogensen (1987) describes CLUSTER/CA, a program that forms clusters of structured objects in a goal-directed way through the use of Goal Dependency Networks.

Taken together, there is a large diversity of algorithms that now are described by the term conceptual clustering. Fisher and Langley (1985) provide two views of conceptual clustering (as extended numerical taxonomy, and as concept formation) and also give an enlightened characterization of several conceptual clustering algorithms. In the following sections, two somewhat different views of conceptual clustering are described. The first view is that of cluster formation per se, whose goal is the determination of extensionally defined clusters. The conceptual part of the process lies in how the exemplars are agglomerated/divided rather than in how the clusters are described (i.e., the cluster forming mechanism need not maintain any cluster descriptions). The second view is that of concept formation, with exemplars as the catalyst.

This research was supported in part by the National Science Foundation under grant NSF IST 85-11170.
Under this view clusters are formed according to their conceptual descriptions, i.e., the system must constantly maintain conceptual descriptions of clusters, and cluster membership is constrained by the concepts available to describe the results.

Following the terminology of psychology, the first view will here be called conceptual sorting. The second view will be called concept discovery. Each in its own way can be said to involve conceptual clustering.

II CONCEPTUAL CLUSTERING AS CONCEPT SORTING

The process of clustering is to group exemplars in some interesting way (or ways) such as a hierarchy of categories or a tree structure (dendrogram). Numerical taxonomy readily provides such groupings, but the groups have little or no conceptual interpretation.

One view of conceptual clustering proposes to produce interesting groupings and then provide them with a conceptual interpretation. That is, to build extensionally defined categories (by enumerating their members) and then find a conceptual interpretation. Naturally, some subpopulations of exemplars are easier to interpret (i.e., form better conceptual clusters) than others. Fisher (1985) proposes such a view, and states that the two phases (called the aggregation and characterization problems, respectively) are not independent.

That the clustering and characterization phases are not independent (assuming they are separate processes) is precisely one of the facets that distinguishes conceptual clustering from "regular" clustering. Indeed, one can perform statistical clustering, take the extensionally defined resulting clusters and then generate conceptual interpretations for them. There are clustering problems for which this is an acceptable approach; cluster analysis was done exclusively just this way for a long while, with the analyst doing all the interpretation.
But in general, concepts derived from independently rendered clusters have potentially messy conceptual characterizations, involving disjunctive conceptual forms (Michalski and Stepp 1983). But one should note that certain patterns of disjunction can be restated as polymorphic concepts ("n of m properties must be present") and some clustering research is directed at finding polymorphic classifications (e.g., (Hanson and Bauer 1986)).

A major reason independently rendered clusters can have rather unappealing conceptual interpretations is that they practice no concept-related similarity measurement. There are two points to be made here: (1) the similarity metric used defines a gradient over the feature space that possesses none of the conceptual irregularities that underlie the domain (the distance from a purple grape to a red apple is not the same as from a green orange to a red apple)*, and (2) the similarity metric views all attributes with a fixed relevance to the problem without any way to determine attribute relevancy from patterns in the data.**

Some research in conceptual clustering has tackled this problem by focusing on the attributes and correlations among attribute values. The system WITT (Hanson and Bauer 1986) performs a variation of K-means clustering using both within category and outside category cohesions to measure the quality of the categories. The goal is a balance between high within category cohesion and low outside category cohesion.*** The COBWEB system (Fisher 1987) uses category utility (Gluck and Corter 1985) to determine how to partition exemplars.

The statistical approach to clustering (e.g., 
numerical taxonomy) uses a non-Gestalt measure of cluster quality that is some function FS of exemplar pairs such as reciprocal Euclidean distance.

FS(e1, e2)

The attribute-based approach to clustering uses a Gestalt measure of cluster quality that is some function FA of exemplar pairs plus the environment in which they exist.

FA(e1, e2, environment)

The environment consists of exemplars arranged by categories.

III CONCEPTUAL CLUSTERING AS CONCEPT DISCOVERY

Concept discovery systems focus on the determination of concepts (according to some concept representation system) to describe each category that is formed. Indeed, categories are formed such that their descriptions are as desired by the applied biases (including representational constraints) and a concept-based cluster quality measure. Concept discovery systems (such as CLUSTER/2, CLUSTER/S, CLUSTER/CA, and GLAUBER) use attribute value patterns in the exemplars to motivate the generation of conceptual descriptions for the categories. It is the category descriptions that are constantly monitored, generalized, specialized, and evaluated by the concept-based quality measure. These systems incorporate mechanisms to propose multi-relation (polythetic) concepts as category descriptions.

Michalski (1980) describes a conceptual cohesiveness measure of cluster quality that is not the same as the attribute-based quality measure described above. Conceptual cohesiveness is a concept-based cluster quality measure that is some function FC of exemplar pairs, their environment, and concepts available to describe categories.

FC(e1, e2, environment, concepts)

The availability of concepts is governed by the biases of the system and the background knowledge that is applied.

* The grape and apple differ in color and type-of-fruit but are both ripe; the orange and apple differ in color and type and ripeness.
** Much of the time all attributes are assumed equally relevant, and contribute equally to the measure of similarity (such as with reciprocal Euclidean distance). That is, if an exemplar is less similar on "apple-ness" then that deficiency can be made up by being more similar on "orange-ness". Only certain concepts (like "area" or "physical distance") actually work that way. The universal application of distance measures provides an often unwarranted bias to the classification.
*** Cohesion is defined in terms of joint information content, and is therefore sensitive to patterns of attribute values.

Without background knowledge, the concept-based approach reverts to the attribute-based one. It is background knowledge (definitions of attribute ranges and scale, specificity hierarchies over attribute values, implicative rules of constraints of one attribute on others, construction rules for new attributes, suggestions or derivational rules for ranking attributes by potential relevancy) that makes the feature space and concept space rough and irregular so that the fit of the data to the irregularities can be used to help confirm a candidate conceptual interpretation.

IV KNOWLEDGE-BASED CONCEPTUAL CLUSTERING

Discovering concepts by conceptual clustering is not purely an inductive inference process. A portion of the process involves deductive inference to determine from background knowledge latent attributes for exemplars and appropriate concepts to ready as candidate category descriptions.
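To make the idea of a latent attribute concrete, the following Python sketch derives one new attribute from a rule of the kind background knowledge might supply: for each exemplar, count how many attributes in a chosen subset differ from that attribute's most frequent value. This is only an illustration under stated assumptions; the function name `mismatch_count_attribute`, the encoding of exemplars as dictionaries and the sample data are hypothetical and are not taken from any system cited in this paper.

```python
from collections import Counter

def mismatch_count_attribute(exemplars, subset):
    """For each exemplar, count attributes in `subset` that differ
    from the most frequent value of that attribute in the data."""
    modes = {
        attr: Counter(e[attr] for e in exemplars).most_common(1)[0][0]
        for attr in subset
    }
    return [sum(e[attr] != modes[attr] for attr in subset) for e in exemplars]

# Hypothetical exemplars described by three nominal attributes.
exemplars = [
    {"A": "x", "B": "y", "C": "z"},
    {"A": "x", "B": "y", "C": "q"},
    {"A": "x", "B": "p", "C": "z"},
    {"A": "r", "B": "y", "C": "z"},
]
derived = mismatch_count_attribute(exemplars, ["A", "B", "C"])
assert derived == [0, 1, 1, 1]
```

A derived value of at most 1 corresponds to a polymorphic statement of the form "at least 2 of the 3 attributes take their target values", so a clustering system could describe such a category with a single condition on the derived attribute instead of a disjunction.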
The program CLUSTER/CA (Mogensen 1987) uses heuristics (including general and specific Goal Dependency Networks (Stepp and Michalski 1986)) to propose attributes to be derived from those given in the exemplar data.

A system equipped with sizable background knowledge and a deductive mechanism for accessing and applying it can make a wide variety of hypothetically appropriate transformations of exemplars that will greatly aid concept formation. For example, an inference rule could suggest the construction of an attribute whose values report the number of other attributes (from a subset of other attributes) having values that differ from the most frequent attribute values. Such a derived attribute supports polymorphic concepts like "2 of the 3 attributes A, B, and C have target values of x, y, and z, respectively." Since the system knows the definition of the attribute (from background knowledge) it is able to state polymorphic concepts in easily understood terms. The point is that additional knowledge applied during clustering can have a great effect on the types of categories formed.

V A YARDSTICK FOR CONCEPTUAL CLUSTERING ALGORITHMS

The background knowledge that could be applied to concept discovery conceptual clustering systems does not "grow on trees": there may be no such knowledge available, or it may be rather non-specific. It is appropriate in such cases to make heavy use of attribute-based information (attribute-based quality assessment scores based on information theoretic measures, such as inter-cluster coherence and intra-cluster predictability). The choice of approach is problem/domain determined.

Conceptual clustering approaches have previously been classified according to incremental versus batch operation; hierarchical versus flat category structure; and the type of search they do in feature and concept spaces (Fisher and Langley 1985).
Here, the topic is the "conceptuality" of the algorithm: the way in which cluster quality is measured in a concept-oriented way. The various approaches are given a Type number: the higher the number, the more intentional are the categories, and the more search intensive and heuristic intensive are the algorithms. For best performance, problems should be addressed by conceptual clustering approaches of only sufficient type level.

Type-0
Statistic-based quality measure; no conceptual interpretation.
This category contains traditional numerical taxonomy: there is a similarity metric that treats all attributes equally; the output consists of just clusters (or a dendrogram); some other system (e.g., the human analyst) must interpret the results.

Type-1
Statistic-based quality measure; conceptual interpretation after-the-fact.
This category would include a system that performs numerical taxonomy followed by a system that learns concepts from examples (such as AQ (Michalski 1983) or ID3 (Quinlan 1983)).

Type-2
Attribute-based quality measure; no conceptual interpretation.
This category contains systems that measure Gestalt information theoretic patterns over attributes and group exemplars for optimal quality score but without regard for and without reporting the concept the group represents. WITT and utility-based clustering in PLS (Rendell 1986) appear to be of this type.

Type-3
Attribute-based quality measure; conceptual interpretation independent of cluster formation.
This category contains systems that are like Type-2 but that follow cluster formation (exemplar aggregation) with a characterization process that is mostly independent from the aggregation process. COBWEB, UNIMEM, IPP, and GLAUBER appear to be in this category.

Type-4
Concept-based quality measure; no background knowledge.
This category contains systems that have unified aggregation and characterization processes, i.e.,
the concepts derived to describe categories determine the partitioning of the exemplars. No deductive inference is performed; only the most general (built in) clustering goals and heuristics can be used to bias the process. DISCON, RUMMAGE, and CLUSTER/2 appear to be of this type.

Type-5
Concept-based quality measure; background knowledge.
This category contains systems that operate like Type-4 systems, but which can perform deductive inference to derive additional attributes, heuristics, and clustering goals. CLUSTER/S has some deductive capabilities, and thus fits this category and Type-6 below.

Type-6
Concept-based quality measure; background knowledge; structured exemplars.
This category contains systems that have the general mechanisms of Type-5 systems as extended to operate on structured objects. The system CLUSTER/CA (Mogensen 1987) has some of these capabilities, although its deductive and heuristic capabilities are still limited.

The above range of "conceptuality" of conceptual clustering methods is relevant to conceptual clustering research on two accounts: (1) it may provide yet another way to contrast and understand conceptual clustering algorithms, and (2) it indicates the great breadth of conceptual clustering approaches, hopefully dispelling any notion of intrinsic architectures for conceptual clustering algorithms.

REFERENCES

1 Fisher, D., "A Hierarchical Conceptual Clustering Algorithm," Tech. Report, Dept. of Information and Comp. Sci., Univ. of Ca., Irvine, 1984.
2 Fisher, D., "A Proposed Method of Conceptual Clustering for Structured and Decomposable Objects," Proc. of the Third Intern. Machine Learning Workshop, June 24-26, Skytop, Penn., Pp. 38-40, 1985.
3 Fisher, D., "Knowledge Acquisition Via Incremental Conceptual Clustering," unpublished manuscript, 1987.
4 Fisher, D., and Langley, P., "Approaches to Conceptual Clustering," Proc. of the Ninth Intern. Joint Conf. on Artificial Intelligence, Pp. 691-697, 1985.
5 Gluck, M., and Corter, J., "Information,
Uncertainty, and the Utility of Categories," Proc. of the Seventh Annual Conf. of the Cog. Sci. Soc., Pp. 283-287, 1985.
6 Hanson, S.J., and Bauer, M., "Conceptual Clustering, Semantic Organization and Polymorphy," in Uncertainty in Artificial Intelligence, Kanal, L.N., and Lemmer, D. (eds.), North Holland, 1986.
7 Langley, P., Zytkow, J., Simon, H., and Bradshaw, G., "The Search for Regularity: Four Aspects of Scientific Discovery," in Machine Learning, Volume II, Michalski, R.S., Carbonell, J.G., and Mitchell, T. (eds.), Morgan Kaufmann Publishers, Los Altos, Ca., 1986.
8 Langley, P., and Sage, S., "Conceptual Clustering as Discrimination Learning," Proc. of the Fifth Biennial Conf. of the Canadian Soc. for Comp. Studies of Intelligence, 1984.
9 Lebowitz, M., "Correcting Erroneous Generalizations," Cognition and Brain Theory, Vol. 5, Pp. 367-381, 1982.
10 Lebowitz, M., "Generalization from Natural Language Text," Cog. Sci., Vol. 7, Pp. 1-40, 1983.
11 Michalski, R.S., "Knowledge Acquisition Through Conceptual Clustering: A Theoretical Framework and Algorithm for Partitioning Data into Conjunctive Concepts," Intern. Journal of Policy Analysis and Information Systems, Vol. 4, Pp. 219-243, 1980.
12 Michalski, R.S., "A Theory and Methodology of Inductive Learning," in Machine Learning, Michalski, R.S., Carbonell, J.G., and Mitchell, T. (eds.), Tioga Publishing Company, Palo Alto, Ca., 1983.
13 Michalski, R.S., and Stepp, R.E., "Learning from Observation: Conceptual Clustering," in Machine Learning, Michalski, R.S., Carbonell, J.G., and Mitchell, T. (eds.), Tioga Publishing Company, Palo Alto, Ca., 1983.
14 Mogensen, B.N., "Goal-Oriented Conceptual Clustering: The Classification Attribute Approach," M.S. Thesis, Dept. of Elect. and Comp. Engineering, Univ. of Illinois, Urbana, 1987.
15 Quinlan, J., "Learning Efficient Classification Procedures and Their Application to Chess End Games," in Machine Learning, Michalski, R.S., Carbonell, J.G., and Mitchell, T. (eds.),
Tioga Publishing Company, Palo Alto, Ca., 1983.
16 Rendell, L.A., "A General Framework for Induction and a Study of Selective Induction," Dept. of Comp. Sci., Report No. UIUCDCS-R-86-1270, Univ. of Illinois, Urbana, 1986.
17 Stepp, R.E., "Conjunctive Conceptual Clustering: A Methodology and Experimentation," Ph.D. Thesis, Report No. UIUCDCS-R-841189, Dept. of Comp. Sci., Univ. of Illinois, Urbana, 1984.
18 Stepp, R.E., and Michalski, R.S., "Conceptual Clustering: Inventing Goal-Oriented Classifications of Structured Objects," in Machine Learning, Volume II, Michalski, R.S., Carbonell, J.G., and Mitchell, T. (eds.), Morgan Kaufmann Publishers, Los Altos, Ca., 1986.
19 Wolff, J., "Data Compression, Generalization, and Overgeneralization in an Evolving Theory of Language Development," Proc. of the AISB-80 Conf. on Artificial Intelligence, Pp. 1-10, 1980.
Academic English (Science and Engineering) Unit 5
Learning Method
Inquiry-based learning
Students will engage in hands-on activities and experiments to
explore the topic and develop their understanding.
Grammar and Sentence Patterns
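Frequent use of non-finite verbs
Non-finite verbs appear frequently in this unit, including infinitives, present participles, and past participles. Students need to understand how these forms are used and how they differ so that they can use them more flexibly in their writing.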
Reading and Writing
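Difficult reading materials; demanding writing requirements
The reading materials in this unit involve a large amount of specialized knowledge and theory, and the language is relatively difficult. Students need solid reading skills and techniques, such as speed reading and summarizing, to understand the texts effectively.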
Vocabulary and expression
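Many abstract concepts
This unit covers many abstract concepts and theories, such as quantum mechanics and circuit analysis. Students need good logical thinking and reasoning skills to understand what these concepts and theories mean.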
Vocabulary and expression
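Varied forms of expression
To better understand and apply scientific and engineering knowledge, students need to master several forms of expression, such as formulas, charts, and schematic diagrams. In addition, students need to learn how to combine these forms of expression with English in order to convey information clearly.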
Students will work in groups to complete projects and tasks,
enhancing their teamwork and collaborative spirit.
Learning Resources
Textbooks
The official textbook for this unit is "Science and Technology in Society: An Introduction to the Principles and Applications".
Cluster analysis
8 Cluster Analysis: Basic Concepts and Algorithms

Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should capture the natural structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, cluster analysis has long played an important role in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining.

There have been many applications of cluster analysis to practical problems. We provide some specific examples, organized by whether the purpose of the clustering is understanding or utility.

Clustering for Understanding
Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc. In the context of understanding data, clusters are potential classes and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:

• Biology. Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. Thus, it is perhaps not surprising that much of the early work in cluster analysis sought to create a discipline of mathematical taxonomy that could automatically find such classification structures. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering
has been used to find groups of genes that have similar functions.
• Information Retrieval. The World Wide Web consists of billions of Web pages, and the results of a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query. For instance, a query of “movie” might return Web pages grouped into categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (subclusters), producing a hierarchical structure that further assists a user’s exploration of the query results.
• Climate. Understanding the Earth’s climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.
• Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.
• Business. Businesses collect large amounts of information on current and potential customers. Clustering can be used to segment customers into a small number of groups for additional analysis and marketing activities.

Clustering for Utility
Cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. Additionally, some clustering techniques characterize each cluster in terms of a cluster prototype; i.e., a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data processing techniques. Therefore, in the context of utility, cluster analysis is the study of techniques
for finding the most representative cluster prototypes.

• Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m^2) or higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.
• Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data, where (1) many of the data objects are highly similar to one another, (2) some loss of information is acceptable, and (3) a substantial reduction in the data size is desired.
• Efficiently Finding Nearest Neighbors. Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototypes can be found much more efficiently. If objects are relatively close to the prototype of their cluster, then we can use the prototypes to reduce the number of distance computations that are necessary to find the nearest neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects in the corresponding clusters cannot be nearest neighbors of each other. Consequently, to find an object’s nearest neighbors it is only necessary to compute the distance to objects in nearby clusters, where the nearness of two clusters is measured by the distance between their prototypes. This
idea is made more precise in Exercise 25 on page 94.

This chapter provides an introduction to cluster analysis. We begin with a high-level overview of clustering, including a discussion of the various approaches to dividing objects into sets of clusters and the different types of clusters. We then describe three specific clustering techniques that represent broad categories of algorithms and illustrate a variety of concepts: K-means, agglomerative hierarchical clustering, and DBSCAN. The final section of this chapter is devoted to cluster validity: methods for evaluating the goodness of the clusters produced by a clustering algorithm. More advanced clustering concepts and algorithms will be discussed in Chapter 9. Whenever possible, we discuss the strengths and weaknesses of different schemes. In addition, the bibliographic notes provide references to relevant books and papers that explore cluster analysis in greater depth.

8.1 Overview

Before discussing specific clustering techniques, we provide some necessary background. First, we further define cluster analysis, illustrating why it is difficult and explaining its relationship to other techniques that group data.
Then we explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2) types of clusters.

8.1.1 What Is Cluster Analysis?

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.

In many applications, the notion of a cluster is not well defined. To better understand the difficulty of deciding what constitutes a cluster, consider Figure 8.1, which shows twenty points and three different ways of dividing them into clusters. The shapes of the markers indicate cluster membership. Figures 8.1(b) and 8.1(d) divide the data into two and six parts, respectively. However, the apparent division of each of the two larger clusters into three subclusters may simply be an artifact of the human visual system. Also, it may not be unreasonable to say that the points form four clusters, as shown in Figure 8.1(c). This figure illustrates that the definition of a cluster is imprecise and that the best definition depends on the nature of data and the desired results.

Cluster analysis is related to other techniques that are used to divide data objects into groups. For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels.
However, it derives these labels only from the data. In contrast, classification in the sense of Chapter 4 is supervised classification; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.

Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and color, or people can be divided into groups based on their income. Nonetheless, some work in graph partitioning and in image and market segmentation is related to cluster analysis.

[Figure 8.1. Different ways of clustering the same set of points. Panels: (a) Original points. (b) Two clusters. (c) Four clusters. (d) Six clusters.]

8.1.2 Different Types of Clusterings

An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.

Hierarchical versus Partitional
The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional. A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that
each data object is in exactly one subset. Taken individually, each collection of clusters in Figures 8.1(b–d) is a partitional clustering.

If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of individual data objects. If we allow clusters to be nested, then one interpretation of Figure 8.1(a) is that it has two subclusters (Figure 8.1(b)), each of which, in turn, has three subclusters (Figure 8.1(d)). The clusters shown in Figures 8.1(a–d), when taken in that order, also form a hierarchical (nested) clustering with, respectively, 1, 2, 4, and 6 clusters on each level. Finally, note that a hierarchical clustering can be viewed as a sequence of partitional clusterings and a partitional clustering can be obtained by taking any member of that sequence; i.e., by cutting the hierarchical tree at a particular level.

Exclusive versus Overlapping versus Fuzzy
The clusterings shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For instance, a person at a university can be both an enrolled student and an employee of the university. A non-exclusive clustering is also often used when, for example, an object is “between” two or more clusters and could reasonably be assigned to any of these clusters.
Imagine a point halfway between two of the clusters of Figure 8.1. Rather than make a somewhat arbitrary assignment of the object to a single cluster, it is placed in all of the “equally good” clusters.

In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn’t belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object must equal 1.) Similarly, probabilistic clustering techniques compute the probability with which each point belongs to each cluster, and these probabilities must also sum to 1. Because the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic clustering does not address true multiclass situations, such as the case of a student employee, where an object belongs to multiple classes.
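
The sum-to-1 constraint on membership weights can be illustrated with a minimal Python sketch. The function name `fuzzy_memberships`, the sample points and centers, and the inverse-squared-distance weighting are all illustrative assumptions rather than a method given in this chapter; the only property taken from the text is that each object's weights lie between 0 and 1 and sum to 1.

```python
import numpy as np

def fuzzy_memberships(points, centers, eps=1e-12):
    """Return an (n_points, n_clusters) matrix of weights in [0, 1] whose rows sum to 1.

    Weights are proportional to inverse squared Euclidean distance.
    This weighting scheme is an assumption made for illustration only.
    """
    # Squared distance from every point to every center, shape (n_points, n_clusters).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    inv = 1.0 / (d2 + eps)  # eps avoids division by zero when a point sits on a center
    return inv / inv.sum(axis=1, keepdims=True)  # normalize each row to sum to 1

# Hypothetical data: two cluster centers and three points, one of them halfway between.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
W = fuzzy_memberships(points, centers)

print(W.sum(axis=1))  # every row sums to 1
print(W[1])           # the halfway point gets weight 0.5 in each cluster
```

Because each row is normalized, no object can have a weight of 1 in two clusters at once, which is why such a representation cannot express the student-employee case.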
Fuzzy and probabilistic approaches are instead most appropriate for avoiding the arbitrariness of assigning an object to only one cluster when it may be close to several. In practice, a fuzzy or probabilistic clustering is often converted to an exclusive clustering by assigning each object to the cluster in which its membership weight or probability is highest.

Complete versus Partial
A complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups. Many times objects in the data set may represent noise, outliers, or “uninteresting background.” For example, some newspaper stories may share a common theme, such as global warming, while other stories are more generic or one-of-a-kind. Thus, to find the important topics in last month’s stories, we may want to search only for clusters of documents that are tightly related by a common theme. In other cases, a complete clustering of the objects is desired. For example, an application that uses clustering to organize documents for browsing needs to guarantee that all documents can be browsed.

8.1.3 Different Types of Clusters

Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Not surprisingly, there are several different notions of a cluster that prove useful in practice. In order to visually illustrate the differences among these types of clusters, we use two-dimensional points, as shown in Figure 8.2, as our data objects. We stress, however, that the types of clusters described here are equally valid for other kinds of data.

Well-Separated
A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from each other. Figure 8.2(a) gives an example of well-separated clusters that consists of two groups of points in a two-dimensional space. The distance between any two points in different groups is larger than the distance between any two points within a group. Well-separated clusters do not need to be globular, but can have any shape.

Prototype-Based
A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters. Not surprisingly, such clusters tend to be globular. Figure 8.2(b) shows an example of center-based clusters.
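
The difference between the two kinds of prototype described above can be sketched in a few lines of Python. The function names `centroid` and `medoid`, the sample points, and the use of Euclidean distance are illustrative assumptions. The medoid is taken here to be the data point with the smallest total distance to the other points, which is one common way to make "most representative point" concrete and is not necessarily the definition the chapter has in mind.

```python
import numpy as np

def centroid(points):
    """Mean of the points; usually not one of the data points themselves."""
    return points.mean(axis=0)

def medoid(points):
    """Data point with the smallest total Euclidean distance to all other points.

    This is an assumed, commonly used definition of 'most representative point'.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))  # pairwise distance matrix
    return points[dist.sum(axis=1).argmin()]

# Hypothetical cluster with one far-away member that pulls the mean toward it.
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])

c = centroid(cluster)
m = medoid(cluster)
print(c)  # [3.2 0. ]  (not a member of the cluster)
print(m)  # [2. 0.]    (an actual data point)
```

Only a pairwise distance is needed to compute the medoid, so the same idea carries over to data where averaging attribute values is not meaningful.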

Graph-Based
If the data is represented as a graph, where the nodes are objects and the links represent connections among objects (see Section 2.1.2), then a cluster can be defined as a connected component; i.e., a group of objects that are connected to one another, but that have no connection to objects outside the group. An important example of graph-based clusters are contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. Figure 8.2(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but can have trouble when noise is present since, as illustrated by the two spherical clusters of Figure 8.2(c), a small bridge of points can merge two distinct clusters.

Other types of graph-based clusters are also possible. One such approach (Section 8.3.2) defines a cluster as a clique; i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be globular.

Density-Based
A cluster is a dense region of objects that is surrounded by a region of low density. Figure 8.2(d) shows some density-based clusters for data created by adding noise to the data of Figure 8.2(c). The two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the noise. Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a cluster in Figure 8.2(d). A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based
definition of a cluster would not work well for the data of Figure 8.2(d) since the noise would tend to form bridges between clusters.

Shared-Property (Conceptual Clusters)
More generally, we can define a cluster as a set of objects that share some property. This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 8.2(e). A triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering. However, too sophisticated a notion of a cluster would take us into the area of pattern recognition, and thus, we only consider simpler types of clusters in this book.

Road Map

In this chapter, we use the following three simple, but important techniques to introduce many of the concepts involved in cluster analysis.

• K-means. This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
• Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains. Some of these techniques have a natural interpretation in terms of graph-based clustering, while others have an interpretation in terms of a prototype-based approach.
• DBSCAN. This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. Points in
low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.

[Figure 8.2. Different types of clusters as illustrated by sets of two-dimensional points.
(a) Well-separated clusters. Each point is closer to all of the points in its cluster than to any point in another cluster.
(b) Center-based clusters. Each point is closer to the center of its cluster than to the center of any other cluster.
(c) Contiguity-based clusters. Each point is closer to at least one point in its cluster than to any point in another cluster.
(d) Density-based clusters. Clusters are regions of high density separated by regions of low density.
(e) Conceptual clusters. Points in a cluster share some general property that derives from the entire set of points. (Points in the intersection of the circles belong to both.)]

8.2 K-means

Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoid. K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. While a centroid almost never corresponds to an actual data point, a medoid, by its definition, must be an actual data point. In this section, we will focus solely on K-means, which is one of the oldest and most widely used clustering algorithms.

8.2.1 The Basic K-means Algorithm

The K-means clustering technique is simple, and we begin with a description of the basic algorithm. We first choose K initial centroids, where K is a user-specified parameter, namely, the number of clusters desired. Each point is then assigned to the closest
centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.

K-means is formally described by Algorithm 8.1. The operation of K-means is illustrated in Figure 8.3, which shows how, starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying K-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the "+" symbol; all points belonging to the same cluster have the same marker shape.

Algorithm 8.1 Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.

In the first step, shown in Figure 8.3(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is then updated. Again, the figure for each step shows the centroid at the beginning of the step and the assignment of points to those centroids. In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in Figures 8.3(b), (c), and (d), respectively, two of the centroids move to the two small groups of points at the bottom of the figures. When the K-means algorithm terminates in Figure 8.3(d), because no more changes occur, the centroids have identified the natural groupings of points.

(a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.
Figure 8.3. Using the K-means algorithm to find three clusters in sample data.

For some combinations of proximity functions and types of
centroids, K-means always converges to a solution; i.e., K-means reaches a state in which no points are shifting from one cluster to another, and hence, the centroids don't change. Because most of the convergence occurs in the early steps, however, the condition on line 5 of Algorithm 8.1 is often replaced by a weaker condition, e.g., repeat until only 1% of the points change clusters.

We consider each of the steps in the basic K-means algorithm in more detail and then provide an analysis of the algorithm's space and time complexity.

Assigning Points to the Closest Centroid

To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.

Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each centroid. In some cases, however, such as when the data is in low-dimensional Euclidean space, it is possible to avoid computing many of the similarities, thus significantly speeding up the K-means algorithm. Bisecting K-means (described in Section 8.2.3) is another approach that speeds up K-means by reducing the number of similarities computed.

Table 8.1. Table of notation.
Symbol  Description
x       An object.
C_i     The i-th cluster.
c_i     The centroid of cluster C_i.
c       The centroid of all points.
m_i     The number of objects in the i-th cluster.
m       The number of objects in the data set.
K       The number of clusters.

Centroids and Objective Functions

Step 4 of the K-means algorithm was stated rather generally as "recompute the centroid of each cluster," since the centroid can vary, depending on
the proximity measure for the data and the goal of the clustering. The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids; e.g., minimize the squared distance of each point to its closest centroid. We illustrate this with two examples. However, the key point is this: once we have specified a proximity measure and an objective function, the centroid that we should choose can often be determined mathematically. We provide mathematical details in Section 8.2.6, and provide a non-mathematical discussion of this observation here.

Data in Euclidean Space Consider data whose proximity measure is Euclidean distance. For our objective function, which measures the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter. In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid, and then compute the total sum of the squared errors. Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with the smallest squared error since this means that the prototypes (centroids) of this clustering are a better representation of the points in their cluster. Using the notation in Table 8.1, the SSE is formally defined as follows:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$$
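To make the assignment and update steps and the SSE objective concrete, here is a minimal Python sketch of the basic algorithm for points in Euclidean space. SSE is computed as the sum, over all points, of the squared Euclidean distance to the closest centroid. The function names (`kmeans`, `sse`, `mean_point`), the random choice of initial centroids, and the `max_iter` safeguard are assumptions made for illustration; they are not taken from the text.

```python
import math
import random


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(euclidean(p, c) for c in centroids) ** 2 for p in points)


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: select K points as initial centroids (random choice is an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):  # Step 2: repeat
        # Step 3: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    cents, labs = kmeans(data, k=2)
    print(labs, round(sse(data, cents), 4))
```

Running several times with different `seed` values and keeping the result with the smallest SSE mirrors the preference stated above for the clustering with the smallest squared error.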
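Introduction to Artificial Intelligence: Chapter 5 Exercise Answers

I. Fill-in-the-Blank Questions
1. A decision tree is a probability-based decision model that can be used to represent and solve complex decision problems.
2. Building a decision tree involves feature selection, tree generation, and tree pruning.
3. Feature selection means choosing the most effective features from the training data set in order to build the decision tree.
4. Tree generation is the process of constructing the decision tree from the results of feature selection.
5. Tree pruning is the process of reducing the complexity of a generated decision tree in order to improve its ability to generalize.
6. The advantages of decision trees are that they are highly interpretable, easy to implement, and computationally efficient.
7. The disadvantages of decision trees are that they are prone to overfitting, sensitive to missing data, and sensitive to noisy data.

II. Short-Answer Questions
1. Briefly describe the process of building a decision tree.
Building a decision tree involves feature selection, tree generation, and tree pruning. Feature selection means choosing the most effective features from the training data set in order to build the decision tree. Tree generation is the process of constructing the decision tree from the results of feature selection. Tree pruning is the process of reducing the complexity of a generated decision tree in order to improve its ability to generalize.
2. Briefly describe the advantages and disadvantages of decision trees.
The advantages of decision trees are that they are highly interpretable, easy to implement, and computationally efficient. The disadvantages of decision trees are that they are prone to overfitting, sensitive to missing data, and sensitive to noisy data.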
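The answers above do not say how the "most effective" feature is chosen. One common criterion, assumed here purely for illustration, is information gain: the reduction in class-label entropy obtained by splitting on a feature. The sketch below shows the feature-selection step under that assumption; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one feature index."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels):
    """Feature-selection step: index of the feature with the largest gain."""
    return max(range(len(rows[0])),
               key=lambda f: information_gain(rows, labels, f))


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes perfectly, feature 1 does not.
    rows = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "hot"), ("rainy", "cool")]
    labels = ["no", "no", "yes", "yes"]
    print(best_feature(rows, labels))
```

Tree generation would apply `best_feature` recursively to each subset; pruning would then remove branches that do not improve accuracy on held-out data.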
Introduction to Artificial Intelligence, 5th Edition: Commentary on the Chapter 7 Discussion Questions

Introduction
Artificial intelligence, as an interdisciplinary science, is changing the way we live, the way we work, and the way we interact with technology at an astonishing pace.
COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1_week5Clustering_display
4 center-based clusters
© Tan,Steinbach, Kumar
Tuesday, 29 March 2011
Introduction to Data Mining
4/18/2004
Types of Clusters: Contiguity-Based
Contiguous
Conditional Independence
• If A and B are independent, then P(A|B) = P(A)
• P(AB) = P(A|B)P(B)
• Law of Total Probability
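The slide names the law of total probability without stating it. For reference, the standard statement, in the slide's notation and assuming events B_1, B_2, ... that partition the sample space, is:

```latex
P(A) = \sum_{i} P(A \mid B_i)\, P(B_i),
\qquad \text{where } B_1, B_2, \dots \text{ are mutually exclusive and exhaustive.}
```

Each term is an instance of the product rule P(AB) = P(A|B)P(B) from the previous bullet.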
Notion of a Cluster can be Ambiguous
How many clusters?
Six Clusters
Two Clusters
Four Clusters
6 density-based clusters
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters – Finds clusters that share some common property or represent a particular concept.
Traffic Classification Using Clustering Algorithms
Jeffrey Erman, Martin Arlitt, Anirban Mahanti
University of Calgary, 2500 University Drive NW, Calgary, AB, Canada
{erman,arlitt,mahanti}@cpsc.ucalgary.ca

ABSTRACT
Classification of network traffic using port-based or payload-based analysis is becoming increasingly difficult with many peer-to-peer (P2P) applications using dynamic port numbers, masquerading techniques, and encryption to avoid detection. An alternative approach is to classify traffic by exploiting the distinctive characteristics of applications when they communicate on a network. We pursue this latter approach and demonstrate how cluster analysis can be used to effectively identify groups of traffic that are similar using only transport layer statistics. Our work considers two unsupervised clustering algorithms, namely K-Means and DBSCAN, that have previously not been used for network traffic classification. We evaluate these two algorithms and compare them to the previously used AutoClass algorithm, using empirical Internet traces. The experimental results show that both K-Means and DBSCAN work very well and much more quickly than AutoClass. Our results indicate that although DBSCAN has lower accuracy compared to K-Means and AutoClass, DBSCAN produces better clusters.

Categories and Subject Descriptors
I.5.4 [Computing Methodologies]: Pattern Recognition - Applications

General Terms
Algorithms, classification

Keywords
machine learning, unsupervised clustering

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCOMM'06 Workshops, September 11-15, 2006, Pisa, Italy. Copyright 2006 ACM 1-59593-417-0/06/0009...$5.00.

1. INTRODUCTION
Accurate identification and categorization of network traffic according to application type is an important element of many network management tasks such as flow prioritization, traffic shaping/policing, and diagnostic monitoring. For example, a network operator may want to identify and throttle (or block) traffic from peer-to-peer (P2P) file sharing applications to manage its bandwidth budget and to ensure good performance of business critical applications. Similar to network management tasks, many network engineering problems such as workload characterization and modelling, capacity planning, and route provisioning also benefit from accurate identification of network traffic. In this paper, we present preliminary results from our experience with using a machine learning approach called clustering for the network traffic identification problem. In the remainder of this section, we motivate why clustering is useful, discuss the specific contributions of this paper, and outline our ongoing work.

The classical approach to traffic classification relies on mapping applications to well-known port numbers and has been very successful in the past. To avoid detection by this method, P2P applications began using dynamic port numbers, and also started disguising themselves by using port numbers for commonly used protocols such as HTTP and FTP. Many recent studies confirm that port-based identification of network traffic is ineffective [8, 15].
To address the aforementioned drawbacks of port-based classification, several payload-based analysis techniques have been proposed [3, 6, 9, 11, 15]. In this approach, packet payloads are analyzed to determine whether they contain characteristic signatures of known applications. Studies show that these approaches work very well for the current Internet traffic including P2P traffic. In fact, some commercial packet shaping tools have started using these techniques. However, P2P applications such as BitTorrent are beginning to elude this technique by using obfuscation methods such as plain-text ciphers, variable-length padding, and/or encryption. In addition, there are some other disadvantages. First, these techniques only identify traffic for which signatures are available and are unable to classify any other traffic. Second, these techniques typically require increased processing and storage capacity.

The limitations of port-based and payload-based analysis have motivated use of transport layer statistics for traffic classification [8, 10, 12, 14, 17]. These classification techniques rely on the fact that different applications typically have distinct behaviour patterns when communicating on a network. For instance, a large file transfer using FTP would have a longer connection duration and larger average packet size than an instant messaging client sending short occasional messages to other clients. Similarly, some P2P applications such as BitTorrent¹ can be distinguished from FTP data transfers because these P2P connections typically are persistent and send data bidirectionally; FTP data transfer connections are non-persistent and send data only unidirectionally. Transport layer statistics such as the total number of packets sent, the ratio of the bytes sent in each direction, the duration of the connection, and the average size of the packets characterize these behaviours.

In this paper, we explore the use of a machine learning approach called clustering for classifying traffic using only
transport layer statistics. Cluster analysis is one of the most prominent methods for identifying classes amongst a group of objects, and has been used as a tool in many fields such as biology, finance, and computer science. Recent work by McGregor et al. [10] and Zander et al. [17] shows that cluster analysis has the ability to group Internet traffic using only transport layer characteristics. In this paper, we confirm their observations by evaluating two clustering algorithms, namely K-Means [7] and DBSCAN [5], that to the best of our knowledge have not been previously applied to this problem. In addition, as a baseline, we present results from the previously considered AutoClass [1] algorithm [10, 17].

The algorithms evaluated in this paper use an unsupervised learning mechanism, wherein unlabelled training data is grouped based on similarity. This ability to group unlabelled training data is advantageous and offers some practical benefits over learning approaches that require labelled training data (discussed in Section 2). Although the selected algorithms all use an unsupervised learning mechanism, each is based on different clustering principles. The K-Means clustering algorithm is a partition-based algorithm [7], the DBSCAN algorithm is a density-based algorithm [5], and the AutoClass algorithm is a probabilistic model-based algorithm [1]. One reason in particular why the K-Means and DBSCAN algorithms were chosen is that they are much faster at clustering data than the previously used AutoClass algorithm.
We evaluate the algorithms using two empirical traces: a well-known publicly available Internet traffic trace from the University of Auckland, and a recent trace we collected from the University of Calgary's Internet connection. The algorithms are compared based on their ability to generate clusters that have a high predictive power of a single application. We show that clustering works for a variety of different applications, including Web, P2P file-sharing, and file transfer, with the accuracy of the AutoClass and K-Means algorithms exceeding 85% in our results and DBSCAN achieving an accuracy of 75%. Furthermore, we analyze the number of clusters and the number of objects in each of the clusters produced by the different algorithms. In general, the ability of an algorithm to group objects into a few "good" clusters is particularly useful in reducing the amount of processing required to label the clusters. We show that while DBSCAN has a lower overall accuracy, the clusters it forms are the most accurate. Additionally, we find that by looking at only a few of DBSCAN's clusters one could identify a significant portion of the connections.

Ours is a work-in-progress. Preliminary results indicate that clustering is indeed a useful technique for traffic identification. Our goal is to build an efficient and accurate classification tool using clustering techniques as the building block. Such a clustering tool would consist of two stages: a model building stage and a classification stage. In the first stage, an unsupervised clustering algorithm clusters training data. This produces a set of clusters that are then labelled to become our classification model. In the second stage, this model is used to develop a classifier that has the ability to label both online and offline network traffic. We note that offline classification is relatively easier compared to online classification, as flow statistics needed by the clustering algorithm may be easily obtained in the former case; the latter requires use of
estimation techniques for flow statistics. We should also note that this approach is not a "panacea" for the traffic classification problem. While the model building phase does automatically generate clusters, we still need to use other techniques to label the clusters (e.g., payload analysis, manual classification, port-based analysis, or a combination thereof). This task is manageable because the model would typically be built using small data sets.

We believe that in order to build an accurate classifier, a good classification model must be used. In this paper, we focused on the model building step. Specifically, we investigate which clustering algorithm generates the best model. We are currently investigating building efficient classifiers for K-Means and DBSCAN and testing the classification accuracy of the algorithms. We are also investigating how often the models should be retrained (e.g., on a daily, weekly, or monthly basis).

The remainder of this paper is arranged as follows. The different Internet traffic classification methods, including those using cluster analysis, are reviewed in Section 2. Section 3 outlines the theory and methods employed by the clustering algorithms studied in this paper. Section 4 and Section 5 present our methodology and outline our experimental results, respectively. Section 6 discusses the experimental results. Section 7 presents our conclusions.

2. BACKGROUND
Several techniques use transport layer information to address the problems associated with payload-based analysis and the diminishing effectiveness of port-based identification. McGregor et al. hypothesize the ability of using cluster analysis to group flows using transport layer attributes [10]. The authors, however, do not evaluate the accuracy of the classification or which flow attributes produce the best results. Zander et al. extend this work by using another Expectation Maximization (EM) algorithm [2] called AutoClass [1] and analyze the best set of attributes to use [17]. Both [10] and [17] only test Bayesian
clustering techniques implemented by an EM algorithm. The EM algorithm has a slow learning time. This paper evaluates clustering algorithms that are different from and faster than the EM algorithm used in previous work.

Some non-clustering techniques also use transport layer statistics to classify traffic [8, 9, 12, 14]. Roughan et al. use nearest neighbor and linear discriminant analysis [14]. The connection durations and average packet size are used for classifying traffic into four distinct classes. This approach has some limitations in that the analysis from these two statistics may not be enough to classify all application classes.

Karagiannis et al. propose a technique that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [8]. Their results show that this approach is comparable with that of payload-based identification in terms of accuracy. More recently, Karagiannis et al. developed another method that uses the social, functional, and application behaviors to identify all types of traffic [9]. These approaches focus on higher level behaviours such as the number of concurrent connections to an IP address and do not use the transport layer characteristics of a single connection that we utilize in this paper.

In [12], Moore et al. use a supervised machine learning algorithm called Naïve Bayes as a classifier. Moore et al. show that the Naïve Bayes approach has a high accuracy classifying traffic. Supervised learning requires the training data to be labelled before the model is built. We believe that an unsupervised clustering approach offers some advantages over supervised learning approaches. One of the main benefits is that new applications can be identified by examining the connections that are grouped to form a new cluster. The supervised approach cannot discover new applications and can only classify traffic for which it has labelled training data. Another advantage occurs when the connections are being labelled.
Due to the high accuracy of our clusters, only a few of the connections need to be identified in order to label the cluster with a high degree of confidence. Also consider the case where the data set being clustered contains encrypted P2P connections or other types of encrypted traffic. These connections would not be labelled using payload-based classification. These connections would, therefore, be excluded from the supervised learning approach, which can only use labelled training data as input. This could reduce the supervised approach's accuracy. However, the unsupervised clustering approach does not have this limitation. It might place the encrypted P2P traffic into a cluster with other unencrypted P2P traffic. By looking at the connections in the cluster, an analyst may be able to see similarities between unencrypted P2P traffic and the encrypted traffic and conclude that it may be P2P traffic.

3. CLUSTERING ALGORITHMS
This section reviews the clustering algorithms, namely K-Means, DBSCAN, and AutoClass, considered in this work. The K-Means algorithm produces clusters that are spherical in shape whereas the DBSCAN algorithm has the ability to produce clusters that are non-spherical. The different cluster shapes that DBSCAN is capable of finding may allow for a better set of clusters to be found that minimize the amount of analysis required. The AutoClass algorithm uses a Bayesian approach and can automatically determine the number of clusters. Additionally, it performs soft clustering wherein objects are assigned to multiple clusters fractionally. The Cluster 3.0 [4] software suite is used to obtain the results for K-Means clustering. The DBSCAN results are obtained using the WEKA software suite [16]. The AutoClass results are obtained using an implementation provided by [1].

In order for the clustering of the connections to occur, a similarity (or distance) measurement must be established first. While various similarity measurements exist, Euclidean distance is one of the most commonly used
metrics for clustering problems [7, 16]. With Euclidean distance, a small distance between two objects implies a strong similarity whereas a large distance implies a low similarity. In an n-dimensional space of features, Euclidean distance can be calculated between objects x and y as follows:

$$\mathrm{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

4. METHODOLOGY

4.1 Empirical Traces
To analyze the algorithms, we used data from two empirical packet traces. One is a publicly available packet trace called Auckland IV², the other is a full packet trace that we collected ourselves at the University of Calgary.

Auckland IV: The Auckland IV trace contains only TCP/IP headers of the traffic going through the University of Auckland's link to the Internet. We used a subset of the Auckland IV trace from March 16, 2001 at 06:00:00 to March 19, 2001 at 05:59:59. This subset provided sufficient connection samples to build our model (see Section 4.4).

Calgary: This trace was collected from a traffic monitor attached to the University of Calgary's Internet link. We collected this trace on March 10, 2006 from 1 to 2 pm. This trace is a full packet trace with the entire payloads of all the packets captured. Due to the amount of data generated when capturing full payloads, the disk capacity (60 GB) of our traffic monitor was filled after one hour of collection, thus limiting the duration of the trace.

4.2 Connection Identification
To collect the statistical flow information necessary for the clustering evaluations, the flows must be identified within the traces. These flows, also known as connections, are a bidirectional exchange of packets between two nodes.

In the traces, the data is not exclusively from connection-based transport layer protocols such as TCP. While this study focused solely on the TCP-based applications, it should be noted that statistical flow information could be calculated for UDP traffic also. We identified the start of a connection using TCP's 3-way handshake and terminated a connection when FIN/RST packets were received.
In addition, we assumed that a flow is terminated if the connection was idle for over 90 seconds.

The statistical flow characteristics considered include: total number of packets, mean packet size, mean payload size excluding headers, number of bytes transferred (in each direction and combined), and mean inter-arrival time of packets. Our decision to use these characteristics was based primarily on the previous work done by Zander et al. [17]. Due to the heavy-tailed distribution of many of the characteristics and our use of Euclidean distance as our similarity metric, we found that the logarithms of the characteristics give much better results for all the clustering algorithms [13, 16].

4.3 Classification of the Data Sets
The publicly available Auckland IV traces include no payload information. Thus, to determine the connections' "true" classifications, port numbers are used. For this trace, we believe that a port-based classification will be largely accurate, as this archived trace predates the widespread use of dynamic port numbers. The classes considered for the Auckland IV data sets are DNS, FTP (control), FTP (data), HTTP, IRC, LIMEWIRE, NNTP, POP3, and SOCKS.
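To make the feature preparation of Section 4.2 concrete, here is a minimal Python sketch of the two steps it describes: taking the logarithm of each flow statistic and then measuring similarity with Euclidean distance. The paper does not state the logarithm base or how zero values are handled, so the natural logarithm and strictly positive inputs are assumptions here, and the example flow values are made up for illustration.

```python
import math


def log_features(flow):
    """Replace each flow statistic by its natural logarithm.

    Assumes every value is strictly positive (an assumption of this sketch).
    """
    return [math.log(value) for value in flow]


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    # Hypothetical flows: [total packets, mean packet size (bytes), total bytes]
    short_flow = [12.0, 90.0, 1080.0]
    bulk_flow = [40000.0, 1400.0, 56000000.0]

    raw = euclidean(short_flow, bulk_flow)
    logged = euclidean(log_features(short_flow), log_features(bulk_flow))
    # On the raw scale the byte count dominates the distance; after the
    # log transform every feature contributes on a comparable scale.
    print(raw, logged)
```

The design point is that heavy-tailed statistics such as byte counts would otherwise swamp the smaller-valued features in the distance computation.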
LimeWire is a P2P application that uses the Gnutella protocol.

In the Calgary trace, we were able to capture the full payloads of the packets, and therefore, were able to use an automated payload-based classification to determine the "true" classes. The payload-based classification algorithm and signatures we used are very similar to those described by Karagiannis et al. [9]. We augmented their signatures to classify some newer P2P applications and instant messaging programs. The traffic classes considered for the Calgary trace are HTTP, P2P, SMTP, and POP3. The application breakdown of the Calgary trace is presented in Table 1. The breakdown of the Auckland IV trace has been omitted due to space limitations. However, HTTP is also the most dominant application, accounting for over 76% of the bytes and connections.

Table 1: Application breakdown of the Calgary trace (connections and bytes per application, with percentage shares).

4.4 Testing Methodology
The majority of the connections in both traces carry HTTP traffic. This unequal distribution does not allow for equal testing of the different classes. To address this problem, the Auckland data sets used for the clustering consist of 1000 random samples of each traffic class, and the Calgary data sets use 2000 random samples of each traffic category. This allows the test results to fairly judge the ability on all traffic and not just HTTP. The size of the data sets was limited to 8000 connections because this was the upper bound that the AutoClass algorithm could cluster within a reasonable amount of time (4-10 hours). In addition, to achieve a greater confidence in the results we generated 10 different data sets for each trace. Each of these data sets was then, in turn, used to evaluate the clustering algorithms. We report the minimum, maximum, and average results from the data sets of each trace.

In the future, we plan on examining the practical issue of what is the best way to pick the connections used as samples to build the model. Some ways that we
think this could be accomplished are random selection or weighted selection using different criteria such as bytes transferred or duration. Also, in order to get a reasonably representative model of the traffic, one would need to select a fairly large yet manageable number of samples. We found that the K-Means and DBSCAN algorithms are able to cluster much larger data sets (greater than 100,000) within 4-10 hours.

5. EXPERIMENTAL RESULTS
In this section, the overall effectiveness of each clustering algorithm is evaluated first. Next, the number of objects in each cluster produced by the algorithms is analyzed.

5.1 Algorithm Effectiveness
The overall effectiveness of the clustering algorithms is calculated using overall accuracy. This overall accuracy measurement determines how well the clustering algorithm is able to create clusters that contain only a single traffic category.

The traffic class that makes up the majority of the connections in a cluster is used to label the cluster. The number of correctly classified connections in a cluster is referred to as the True Positives (TP). Any connections that are not correctly classified are considered False Positives (FP). Any connection that has not been assigned to a cluster is labelled as noise. The overall accuracy is thus calculated as follows:

$$\text{overall accuracy} = \frac{\sum \text{TP for all clusters}}{\text{total number of connections}}$$

Figure 1: Accuracy using K-Means (overall accuracy versus number of clusters; Calgary and Auckland IV).
Figure 2: Accuracy using DBSCAN (overall accuracy versus epsilon distance; Auckland IV and Calgary, 3 minPts).
Figure 3: Parametrization of DBSCAN (overall accuracy versus epsilon distance; 3, 6, 12, and 24 minPts).
Table 2: Accuracy using AutoClass (per data set; values surviving in the extracted text: 91.5%, 88.7%, 90.0%).

5.1.1 K-Means Clustering
The K-Means algorithm has an input parameter of K. This
input parameter, as mentioned in Section 3.1, is the number of disjoint partitions used by K-Means. In our data sets, we would expect there would be at least one cluster for each traffic class. In addition, due to the diversity of the traffic in some classes such as HTTP (e.g., browsing, bulk download, streaming) we would expect even more clusters to be formed. Therefore, based on this, the K-Means algorithm was evaluated with K initially being 10 and K being incremented by 10 for each subsequent clustering. The minimum, maximum, and average results for the K-Means clustering algorithm are shown in Figure 1.

Initially, when the number of clusters is small, the overall accuracy of K-Means is approximately 49% for the Auckland IV data sets and 67% for the Calgary data sets. The overall accuracy steadily improves as the number of clusters increases. This continues until K is around 100, with the overall accuracy being 79% and 84% on average for the Auckland IV and Calgary data sets, respectively. At this point, the improvement is much more gradual, with the overall accuracy only improving by an additional 1.0% when K is 150 in both data sets. When K is greater than 150, the improvement is further diminished, with the overall accuracy improving to the high 80% range when K is 500. However, large values of K increase the likelihood of over-fitting.

5.1.2 DBSCAN Clustering
The accuracy results for the DBSCAN algorithm are presented in Figure 2. Recall that DBSCAN has two input parameters (minPts, eps). We varied these parameters, and in Figure 2 report results for the combination that produces the best clustering results. The values used for minPts were tested between 3 and 24. The eps distance was tested from 0.005 to 0.040. Figure 3 presents results for different combinations of (minPts, eps) values for the Calgary data sets. As may be expected, when minPts was 3, better results were produced than when minPts was 24 because smaller clusters are formed. The additional clusters found using three minPts were typically
small clusters containing only 3 to 5 connections.

When using minPts equal to 3 while varying the eps distance between 0.005 and 0.020 (see Figure 2), the DBSCAN algorithm improved its overall accuracy from 59.5% to 75.6% for the Auckland IV data sets. For the Calgary data sets, the DBSCAN algorithm improved its overall accuracy from 32.0% to 72.0% as the eps distance was varied with these same values. The overall accuracy for eps distances greater than 0.020 decreased significantly as the distance increased. Our analysis indicates that this large decrease occurs because the clusters of different traffic classes merge into a single large cluster. We found that this larger cluster was for connections with few packets, few bytes transferred, and short durations. This cluster typically contained equal amounts of P2P, POP3, and SMTP connections. Many of the SMTP connections were for emails with rejected recipient addresses and connections immediately closed after connecting to the SMTP server. For POP3, many of the connections contained instances where no email was in the user's mailbox. Gnutella clients attempting to connect to a remote node and having their "GNUTELLA CONNECT" packets rejected accounted for most of the P2P connections.

Figure 4: CDF of cluster weights (% connections versus % clusters; DBSCAN, K-Means, AutoClass).

5.1.3 AutoClass Clustering
The results for the AutoClass algorithm are shown in Table 2. For this algorithm, the number of clusters and the cluster parameters are automatically determined. Overall, the AutoClass algorithm has the highest accuracy. On average, AutoClass is 92.4% and 88.7% accurate in the Auckland IV and Calgary data sets, respectively. AutoClass produces an average of 167 clusters for the Auckland IV data sets, and 247 clusters for the Calgary data sets.

5.2 Cluster Weights
For the traffic classification problem, the number of clusters produced by a clustering algorithm is an important
consideration, because once the clustering is complete, each of the clusters must be labelled. Minimizing the number of clusters is also cost effective during the classification stage.

One way of reducing the number of clusters to label is by evaluating the clusters with many connections in them. For example, if a clustering algorithm with high accuracy places the majority of the connections in a small subset of the clusters, then by analyzing only this subset a majority of the connections can be classified. Figure 4 shows the percentage of connections represented as the percentage of clusters increases, using the Auckland IV data sets. In this evaluation, the K-Means algorithm had 100 for K. For the DBSCAN and AutoClass algorithms, the number of clusters cannot be set. DBSCAN uses 0.03 for eps, 3 for minPts, and has, on average, 190 clusters. We selected this point because it gave the best overall accuracy for DBSCAN. AutoClass has, on average, 167 clusters.

[Figure 5: Precision using DBSCAN, K-Means, and AutoClass (y-axis: Precision)]

As seen in Figure 4, both K-Means and AutoClass have more evenly distributed clusters than DBSCAN. The 15 largest clusters produced by K-Means only contain 50% of the connections. In contrast, for the DBSCAN algorithm the five largest clusters contain over 50% of the connections in the data sets. These five clusters identified 75.4% of the NNTP, POP3, SOCKS, DNS, and IRC connections with a 97.6% overall accuracy. These results are unexpected, considering that by looking at only five of the 190 clusters, one can identify a significant portion of traffic. Qualitatively similar results were obtained for the Calgary data sets.

6. DISCUSSION

The DBSCAN algorithm is the only algorithm considered in this paper that can label connections as noise. The K-Means and AutoClass algorithms place every connection into a cluster. The connections that are labelled as noise reduce the overall accuracy of the DBSCAN algorithm because they are regarded as
misclassified. We have found some interesting results by excluding the connections labelled as noise and just examining the clusters produced by DBSCAN. Figure 5 shows the precision values for the DBSCAN (eps=0.02, minPts=3), the K-Means (K=190), and the AutoClass algorithms using the Calgary data sets. Precision is the ratio of TP to the sum of TP and FP for a traffic class. It measures how accurately the clusters classify a particular category of traffic.

Figure 5 shows that for the Calgary data sets, the DBSCAN algorithm has the highest precision values for three of the four classes of traffic. Although the results for the Auckland IV data sets are not shown, seven of the nine traffic classes there have average precision values over 95%. This shows that while DBSCAN's overall accuracy is lower than that of K-Means and AutoClass, it produces highly accurate clusters.

Another noteworthy difference among the clustering algorithms is the time required to build the models. On average, the K-Means algorithm took 1 minute, the DBSCAN algorithm took 3 minutes, and the AutoClass algorithm took 4.5 hours. Clearly, the model building phase of AutoClass is time consuming. We believe this may deter systems developers from using this algorithm even if the frequency of retraining the model is low.

7. CONCLUSIONS

In this paper, we evaluated three different clustering algorithms, namely K-Means, DBSCAN, and AutoClass, for the network traffic classification problem. Our analysis is based on each algorithm's ability to produce clusters that have a high predictive power of a single traffic class, and each algorithm's ability to generate a minimal number of clusters that contain the majority of the connections. The results showed that the AutoClass algorithm produces the best overall accuracy. However, the DBSCAN algorithm has great potential because it places the majority of the connections in a small subset of the clusters. This is very useful because these clusters have a high predictive power of a single category of traffic. The
overall accuracy of the K-Means algorithm is only marginally lower than that of the AutoClass algorithm, but K-Means is more suitable for this problem due to its much faster model building time. Ours is a work in progress, and we continue to investigate these and other clustering algorithms for use as an efficient classification tool.

8. ACKNOWLEDGMENTS

This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and Informatics Circle of Research Excellence (iCORE) of the province of Alberta. We thank Carey Williamson for his comments and suggestions which helped improve this paper.
Chromatic Correlation Clustering
Gullo, Antti Ukkonen
Spain

a randomized algorithm with approximation guarantee proportional to the maximum degree of the input graph. The algorithm iteratively picks a random edge as pivot, builds a cluster around it, and removes the cluster from the graph. Although fast, easy to implement, and parameter-free, this algorithm tends to produce a relatively large number of clusters. To overcome this issue we introduce a variant algorithm, which modifies how the pivot is chosen and how the cluster is built around the pivot. Finally, to address the case where a fixed number of output clusters is required, we devise a third algorithm that directly optimizes the objective function via a strategy based on the alternating minimization paradigm.

We test our algorithms on synthetic and real data from the domains of protein-interaction networks, social media, and bibliometrics. Experimental evidence shows that our algorithms outperform a baseline algorithm both in the task of reconstructing a ground-truth clustering and in terms of objective function value.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining

Keywords: Clustering, Edge-labeled graphs.

KDD'12, August 12–16, 2012, Beijing, China. Copyright 2012 ACM 978-1-4503-1462-6/12/08 ...$15.00.

1. INTRODUCTION

Clustering is one of the most well-studied problems in data mining. The goal of clustering is to partition a set of objects into different clusters, so that objects in the same cluster are more similar to each other than to objects in other clusters. A common trait underlying most clustering paradigms is the existence of a function sim(x,y) representing the similarity between pairs of objects x and y. The similarity function is either provided explicitly as input, or it can be computed implicitly from the representation of the objects.

In this paper, we consider a different clustering setting where the relationship among objects is represented by a relation type, such as a label ℓ(x,y) from a finite set of possible labels L. In other words, the range of the similarity function sim(x,y) can be viewed as being categorical, instead of numerical. Moreover, we model the case where two objects x and y do not have any relation with a special label l₀ ∉ L.

Our framework has a natural graph interpretation: the input can be viewed as an edge-labeled graph G = (V, E, L, ℓ), where the set of vertices V is the set of objects to be clustered, the set of edges E ⊆ V×V is implicitly defined as E = {(x,y) ∈ V×V | ℓ(x,y) ≠ l₀}, and each edge has a label in L or, as we like to think about it, a color.

[Figure 1: An example of chromatic clustering: (a) input graph, (b) the optimal solution for chromatic-correlation-clustering (Problem 2).]

The key objective in our framework is to find a partition of the vertices of the graph such that the edges in each cluster have, as much as possible, the same color (an example is shown in Figure 1). Intuitively, a red edge (x,y) provides positive evidence that the vertices x and y should be clustered in such a way that the edges in the subgraph induced by that cluster are mostly red. Furthermore, in the case that most edges of a cluster are red, it is reasonable to label the whole cluster with the red color. Note that a clustering algorithm for this problem should also deal with inconsistent evidence, as a red edge (x,y) provides evidence for the vertex x to participate in a cluster with red edges, while a green edge (x,z) provides contradicting evidence for the vertex x to participate in a cluster with green edges.
Such inconsistent information is aggregated by optimizing a properly defined objective function.

Applications. The study of edge-labeled graphs is motivated by many real-world applications and is receiving increasing attention in the data-mining literature [8,10,16]. As an example, biologists study protein-protein interaction networks, where vertices represent proteins and edges represent interactions occurring when two or more proteins bind together to carry out their biological function. Those interactions can be of different types, e.g., physical association, direct interaction, co-localization, etc. In these networks, for instance, a cluster containing mainly edges labeled as co-localization might represent a protein complex, i.e., a group of proteins that interact with each other at the same time and place, forming a single multi-molecular machine [11].

As a further example, social networks are commonly represented as graphs, where the vertices represent individuals and the edges capture relationships among these individuals. Again, these relationships can be of various types, e.g., colleagues, neighbors, schoolmates, football-mates.

In bibliographic data, co-authorship networks represent collaborations among authors: in this case the topic of the collaboration can be seen as an edge label, and a cluster of vertices represents a topic-coherent community of researchers. In our experiments in Section 5 we show how our framework can be applied in all the above domains.
Contributions. In this paper we address the problem of clustering data with categorical similarity, achieving the following contributions:

• We define chromatic-correlation-clustering, a novel clustering problem for objects with categorical similarity, by revisiting the well-studied correlation clustering framework [3]. We show that our problem is a generalization of the traditional correlation-clustering problem, implying that it is NP-hard.
• We introduce a randomized algorithm, named Chromatic Balls, that provides an approximation guarantee proportional to the maximum degree of the graph.
• Though of theoretical interest, Chromatic Balls has some limits when it comes to practice. Trying to overcome these limits, we introduce two alternative algorithms: a more practical lazy version of Chromatic Balls, and an algorithm that directly optimizes the proposed objective function via an iterative process based on the alternating minimization paradigm.
• We empirically assess our algorithms both on synthetic and real datasets. Experiments on synthetic data show that our algorithms outperform a baseline algorithm in the task of reconstructing a ground-truth clustering. Experiments on real-world data confirm that chromatic-correlation-clustering provides meaningful clusters.

The rest of the paper is organized as follows. In the next section we recall the traditional correlation clustering problem and introduce our new formulation. In Section 3 we introduce the Chromatic Balls algorithm and we prove its approximation guarantees. In Section 4 we present the two more practical algorithms, namely Lazy Chromatic Balls and Alternating Minimization. In Section 5 we report our experimental analysis. In Section 6 we discuss related work.
2. PROBLEM DEFINITION

Given a set of objects V, a clustering problem asks to partition the set V into clusters of similar objects. Assuming that cluster identifiers are represented by natural numbers, a clustering C can be seen as a function C: V → N. Typically, the goal is to find a clustering C that optimizes an objective function that measures the quality of the clustering. Numerous formulations and objective functions have been considered in the literature. One of these, considered both in the area of theoretical computer science and data mining, is that at the basis of the correlation-clustering problem [3].

Problem 1 (correlation-clustering) Given a set of objects V and a pairwise similarity function sim: V×V → [0,1], find a clustering C: V → N that minimizes the cost

\[
\mathrm{cost}(C) = \sum_{\substack{(x,y)\in V\times V \\ C(x)=C(y)}} \bigl(1-\mathrm{sim}(x,y)\bigr) \;+\; \sum_{\substack{(x,y)\in V\times V \\ C(x)\neq C(y)}} \mathrm{sim}(x,y). \tag{1}
\]

The intuition underlying the above problem is that the cost of assigning two objects x and y to the same cluster should be equal to the dissimilarity 1 − sim(x,y), while the cost of assigning the objects to different clusters should correspond to their similarity sim(x,y). A common case is when the similarity is binary, that is, sim: V×V → {0,1}. In this case, Equation (1) reduces to counting the number of pairs of objects that have similarity 0 and are put in the same cluster plus the number of pairs of objects that have similarity 1 and belong to different clusters. Or equivalently, in a graph-based terminology, the objective function counts the number of "positive" edges that are cut plus the number of "negative" (i.e., non-existing) edges that are not cut.

In chromatic-correlation-clustering, which we formally define below, we still have negative edges (i.e., l₀-edges), but the positive edges may have different colors, representing different kinds of relations among the objects.
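Before moving to the chromatic variant, Equation (1) can be sketched in code. The snippet below is only an illustration: the function name `correlation_cost`, the dictionary-based representation of sim, and the toy data are assumptions made here, and each unordered pair {x, y} is counted once, in line with the edge-counting reading given above for the binary case.

```python
from itertools import combinations


def correlation_cost(nodes, sim, cluster_of):
    """Cost of Equation (1), counting each unordered pair {x, y} once.

    sim maps frozenset({x, y}) to a similarity in [0, 1]; pairs absent
    from the mapping are treated as having similarity 0 (an assumption
    of this sketch).
    """
    cost = 0.0
    for x, y in combinations(nodes, 2):
        s = sim.get(frozenset((x, y)), 0.0)
        if cluster_of[x] == cluster_of[y]:
            cost += 1.0 - s  # dissimilar objects placed together
        else:
            cost += s        # similar objects split apart
    return cost


# Hypothetical toy instance: a path a - b - c with binary similarity.
nodes = ["a", "b", "c"]
sim = {frozenset(("a", "b")): 1.0, frozenset(("b", "c")): 1.0}

together = {"a": 0, "b": 0, "c": 0}  # pays for the missing pair {a, c}
split = {"a": 0, "b": 0, "c": 1}     # pays for cutting the edge {b, c}
print(correlation_cost(nodes, sim, together), correlation_cost(nodes, sim, split))
```

In the binary case the function simply counts missing edges inside clusters plus existing edges across clusters, which is the quantity described in the paragraph above.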
Problem 2 (chromatic-correlation-clustering) Given a set V of objects, a set L of labels, a special label l₀, and a pairwise labeling function ℓ: V×V → L ∪ {l₀}, find a clustering C: V → N and a cluster labeling function c_ℓ: C[V] → L so as to minimize the cost

\[
\mathrm{cost}(C, c_\ell) = \sum_{\substack{(x,y)\in V\times V \\ C(x)=C(y)}} \bigl(1-\mathbb{I}[\ell(x,y)=c_\ell(C(x))]\bigr) \;+\; \sum_{\substack{(x,y)\in V\times V \\ C(x)\neq C(y)}} \mathbb{I}[\ell(x,y)\neq l_0]. \tag{2}
\]

Equation (2) is composed of two terms, representing intra- and inter-cluster costs, respectively. In particular, according to the intra-cluster cost term, any pair of objects (x,y) assigned to the same cluster should pay a cost if and only if their relation type ℓ(x,y) is other than the predominant relation type of the cluster indicated by the function c_ℓ. For the inter-cluster cost, the objective function does not penalize a pair of objects (x,y) only if they do not have any relation, i.e., ℓ(x,y) = l₀. If ℓ(x,y) ≠ l₀, the objective function incurs a cost, regardless of the label ℓ(x,y).

Example 1 For the problem instance in Figure 1(a), the solution in Figure 1(b) has a cost of 5: there is no intra-cluster cost, because the two clusters are cliques and their edges are monochromatic, while we have an inter-cluster cost of 5, equal to the number of edges that are cut.

It is trivial to observe that, when |L| = 1, the chromatic-correlation-clustering problem corresponds to the binary version of correlation-clustering. Thus, our problem is a generalization of the standard problem. Since correlation-clustering is NP-hard, we can easily conclude that chromatic-correlation-clustering is NP-hard too.

The previous observation motivates us to consider whether applying standard correlation-clustering algorithms, just ignoring the different colors, is a good solution to the problem. As we show in the following example, such an approach is not guaranteed to produce good solutions.
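To make the two terms of Equation (2) concrete before the example, the cost can be evaluated with a few lines of code. This is a minimal sketch under stated assumptions: the name `chromatic_cost`, the use of a missing dictionary entry (`None`) to stand for l₀, and the six-vertex toy graph are all choices made here for illustration; the graph is not the one of Figure 1. Each unordered pair is counted once, as in Example 1.

```python
from itertools import combinations


def chromatic_cost(nodes, label, cluster_of, cluster_label):
    """Cost of Equation (2), counting each unordered pair {x, y} once.

    label maps frozenset({x, y}) to a color; a missing entry plays the
    role of the special label l0 (no relation).
    """
    cost = 0
    for x, y in combinations(nodes, 2):
        color = label.get(frozenset((x, y)))  # None stands in for l0
        if cluster_of[x] == cluster_of[y]:
            # intra-cluster term: pay unless the pair carries the cluster's color
            if color != cluster_label[cluster_of[x]]:
                cost += 1
        elif color is not None:
            # inter-cluster term: pay for every edge that is cut
            cost += 1
    return cost


# Hypothetical toy graph: a red triangle, a green triangle, one blue bridge.
nodes = list("abcdef")
label = {frozenset(p): "red" for p in ("ab", "ac", "bc")}
label.update({frozenset(p): "green" for p in ("de", "df", "ef")})
label[frozenset("cd")] = "blue"

two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {v: 0 for v in nodes}

print(chromatic_cost(nodes, label, two_clusters, {0: "red", 1: "green"}))
print(chromatic_cost(nodes, label, one_cluster, {0: "red"}))
```

On this toy graph the two-cluster solution only pays for the single cut edge, whereas a single red cluster pays for every pair that is not joined by a red edge.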
Example 2 For the problem instance in Figure 1(a), the optimal solution for the standard correlation-clustering, which does not consider the different colors, would consist of a single cluster containing all six vertices, as, according to Equation (1), this solution has a (minimum) cost of 4, corresponding to the number of missing edges within the cluster. Conversely, this solution has a non-optimal cost of 12 when evaluated according to the chromatic-correlation-clustering formulation, i.e., according to Equation (2). Instead, the optimum in this case would correspond to the cost-5 solution depicted in Figure 1(b).

Although the example shows that for the chromatic version of the problem we cannot directly apply algorithms developed for the correlation-clustering problem, we can use such algorithms at least as a starting point, as shown in the next section.

3. THE Chromatic Balls ALGORITHM

We present next a randomized approximation algorithm for the chromatic-correlation-clustering problem. This algorithm, called Chromatic Balls, is motivated by the Balls algorithm [1], which is an approximation algorithm for standard correlation-clustering.

For completeness, we briefly review the Balls algorithm. The algorithm works in iterations. Initially all objects are considered uncovered. In each iteration the algorithm produces a cluster, and the objects participating in the cluster are considered covered. In particular, the algorithm picks as pivot a random object currently uncovered, and forms a cluster consisting of the pivot itself along with all currently uncovered objects that are connected to the pivot.

The outline of our Chromatic Balls is summarized in Algorithm 1. The main difference with the Balls algorithm is that the edge labels are taken into account in order to build clusters around the pivots. To this end, the pivot chosen at each iteration of Chromatic Balls is an edge, thus a pair of objects, rather than a single object. The Chromatic Balls algorithm employs a set V′ to keep all the objects that have not
been assigned to any cluster yet; hence, initially, V′ = V. At each iteration, a random edge (u,v) such that both objects u and v are currently in the set V′ is selected as pivot (line 3). Given the pivot (u,v), a cluster C is formed around it. Beyond the objects u and v, the cluster C additionally contains all other objects x ∈ V′ for which the triangle (u,v,x) is monochromatic, that is, ℓ(u,x) = ℓ(v,x) = ℓ(u,v) (lines 4 and 5). Since the label ℓ(u,v) forms the basis for creating the cluster C, the cluster is labeled with this label (line 6). All objects added to C are removed from V′ (line 7), and the algorithm terminates when V′ does not contain any pair of objects that share an edge, i.e., a pair that is labeled with a label other than l₀ (line 2). All objects remaining in the set V′, if any, are eventually made singleton clusters (lines 8-11).

Algorithm 1 Chromatic Balls
Input: Edge-labeled graph G = (V, E, L, ℓ)
Output: Clustering C: V → N; cluster labeling function c_ℓ: C[V] → L
 1: V′ ← V; i ← 1
 2: while there exist u, v ∈ V′ such that (u,v) ∈ E do
 3:   randomly pick u, v ∈ V′ such that (u,v) ∈ E
 4:   C ← {u,v} ∪ {x ∈ V′ | ℓ(u,x) = ℓ(v,x) = ℓ(u,v)}
 5:   for all x ∈ C do C(x) ← i
 6:   c_ℓ(i) ← ℓ(u,v)
 7:   V′ ← V′ \ C; i ← i + 1
 8: for all x ∈ V′ do
 9:   C(x) ← i
10:   c_ℓ(i) ← a random label from L
11:   i ← i + 1

Computational complexity. The complexity of the Chromatic Balls algorithm is determined by two steps: (i) picking the pivots (line 3), and (ii) building the clusters (line 4). Choosing the pivots requires O(m log n) time, where n = |V| and m = |E|, as selecting random edges can be implemented by building a priority queue of edges with random priorities, and subsequently removing edges; each edge is removed once from the priority queue, whether it is selected as pivot or not. Building a single cluster C, instead, requires accessing all neighbors of the pivot edge (u,v). As the current cluster is removed from the set of uncovered objects at the end of each iteration, the neighbors of any pivot are not considered again in the remainder of the algorithm. Thus, the step of selecting the objects to be included into the current clusters requires visiting each
edge at most once; therefore, the process takes O(m) time. In conclusion, we can state that the computational complexity of the Chromatic Balls algorithm is O(m log n).

3.1 Theoretical analysis

We analyze next the quality of the solutions obtained by Chromatic Balls. Our main result, given in Theorem 1, shows that the approximation guarantee of the algorithm depends on the number of bad triplets incident to a pair of objects in the input dataset. The notion of bad triplet is defined below; however, here we note that this result gives a constant-factor guarantee for bounded-degree graphs.

Even though the Chromatic Balls algorithm is similar to the Balls algorithm, which can be shown to provide a constant-factor approximation guarantee for general graphs too, the theoretical analysis of Chromatic Balls is much more complicated and requires several additional and nontrivial arguments. Due to the limited space of this paper, we report next only an outline of our analysis. Further details, including complete proofs, can be found in an extended technical report.¹

We begin our analysis by defining special types of triplets and quadruples among the vertices of the graph.

Definition 1 (SC-triplet) We say that {x,y,z} is a same-color triplet (SC-triplet) if the induced triangle is monochromatic, i.e., ℓ(x,y) = ℓ(x,z) = ℓ(y,z) ≠ l₀.

Definition 2 (B-triplet) We say that {x,y,z} is a bad triplet (B-triplet) if the induced triangle is non-monochromatic and it has at most one pair labeled with l₀.

Definition 3 (B-quadruple) A Bad-quadruple is a set {x,y,z,w} ⊆ V that contains at least one SC-triplet and at least one B-triplet.

Note that, according to the cost function of our problem as defined in Equation (2), there is no way to partition a B-triplet without paying any cost. Next we define the notions of hitting and d-hitting.

Definition 4 (hitting) Consider a pair of objects (x,y) and a triplet t, which can be either SC-triplet or B-triplet.
We say that t hits (x,y) if x ∈ t and y ∈ t. Additionally, if q is a B-quadruple, we say that q hits (x,y) if x ∈ q, y ∈ q, and there exists z ∈ q such that {x,y,z} is a B-triplet.

¹ /CCC.pdf

Definition 5 (d-hitting) Given any pair of objects (x,y) and any B-quadruple q = {x,y,z,w}, we say that q deeply hits (d-hits) (x,y) if q hits (x,y) and either {x,z,w} or {y,z,w} is an SC-triplet.

In reference to the above notions, we hereinafter denote by S, T, and Q the sets of all SC-triplets, B-triplets, and B-quadruples for an instance of our problem. Moreover, given a pair (x,y) ∈ V×V we define the following sets: $T_{xy} \subseteq T$ denotes the set of all B-triplets in T that hit (x,y); $Q_{xy} \subseteq Q$ denotes the set of all B-quadruples in Q that hit (x,y); $Q^d_{xy} \subseteq Q_{xy} \subseteq Q$ denotes the set of all B-quadruples in Q that d-hit (x,y).

Let us now consider some events that may arise during the execution of the Chromatic Balls algorithm. Given an object x ∈ V, $P^{(i)}_x$ denotes the event "x is chosen as pivot in the i-th iteration." Given a set $\{x_1,\ldots,x_n\} \subseteq V$, with n ≥ 2, $A^{(i)}_{x_1\cdots x_n}$ denotes the event "all objects $x_1,\ldots,x_n$ enter the i-th iteration of the algorithm, while two of them are chosen as pivot in the same iteration."

Additionally, the events $T^{(i)}_{z|xy}$ and $Q^{(i)}_{zw|xy}$ are defined in reference to a pair (x,y). Given a B-triplet $\{x,y,z\} \in T_{xy}$, $T^{(i)}_{z|xy}$ denotes the event "$A^{(i)}_{xyz}$ occurs while x and y are not both chosen as pivots in the i-th iteration." Given a B-quadruple $\{x,y,z,w\} \in Q^d_{xy}$, $Q^{(i)}_{zw|xy}$ denotes the event "$A^{(i)}_{xyzw}$ occurs while neither x nor y is chosen as pivot in the i-th iteration."

For the events $A^{(i)}_{x_1\cdots x_n}$, $T^{(i)}_{z|xy}$, and $Q^{(i)}_{zw|xy}$ defined above, we also consider their counterparts that assert that the events occur at some iteration i. For instance, $A_{x_1\cdots x_n}$ denotes the event "$A^{(i)}_{x_1\cdots x_n}$ happens at some iteration i," while $T_{z|xy}$ and $Q_{zw|xy}$ are defined analogously. Formally:

\[
A_{x_1\cdots x_n} \Leftrightarrow \bigvee_i A^{(i)}_{x_1\cdots x_n}, \tag{3}
\]
\[
T_{z|xy} \Leftrightarrow \bigvee_i T^{(i)}_{z|xy} \Leftrightarrow \bigvee_i A^{(i)}_{xyz} \wedge \neg\bigl(P^{(i)}_x \wedge P^{(i)}_y\bigr), \tag{4}
\]
\[
Q_{zw|xy} \Leftrightarrow \bigvee_i Q^{(i)}_{zw|xy} \Leftrightarrow \bigvee_i A^{(i)}_{xyzw} \wedge \neg P^{(i)}_x \wedge \neg P^{(i)}_y. \tag{5}
\]

As reported in the next two lemmas, the probabilities of the events $T_{z|xy}$
and $Q_{zw|xy}$ can be expressed in terms of the probabilities of the events $A_{xyz}$ and $A_{xyzw}$.

Lemma 1 Given a pair (x,y) ∈ V×V and a B-triplet $\{x,y,z\} \in T_{xy}$, it holds that

\[
\tfrac{1}{2}\Pr[A_{xyz}] \le \Pr[T_{z|xy}] \le \Pr[A_{xyz}].
\]

Lemma 2 Given a pair (x,y) ∈ V×V and a B-quadruple $\{x,y,z,w\} \in Q^d_{xy}$, it holds that

\[
\tfrac{1}{6}\Pr[A_{xyzw}] \le \Pr[Q_{zw|xy}] \le \tfrac{1}{4}\Pr[A_{xyzw}].
\]

Analyzing carefully the probabilities of events $T_{z|xy}$ and $Q_{zw|xy}$ is crucial for deriving the desired approximation factor, as shown next.

We consider an instance G = (V, E, L, ℓ) of our problem and rewrite the cost function in Equation (2) as the sum of the costs paid by any single pair (x,y). To this end, in order to simplify the notation, we hereinafter write the cost by omitting C and c_ℓ while keeping G only:

\[
c(G) = \sum_{(x,y)\in V\times V} c_{xy}(G), \tag{6}
\]

where $c_{xy}(G)$ denotes the aforementioned contribution of the pair (x,y) to the total cost. Moreover, let E[c(G)] denote the expected cost of Chromatic Balls over the random choices made by the algorithm. By the linearity of expectation, the expected cost E[c(G)] can be expressed as

\[
E[c(G)] = \sum_{(x,y)\in V\times V} E[c_{xy}(G)]. \tag{7}
\]

Finally, let $c^*(G)$ be the cost of the optimal solution on G.

To derive an approximation factor r(G) on the performance of the Chromatic Balls algorithm, we look for an upper bound Ub(G) on the expected cost E[c(G)] of the algorithm, and a lower bound Lb(G) on the cost $c^*(G)$ of the optimal solution, so that

\[
\frac{E[c(G)]}{c^*(G)} \le \frac{Ub(G)}{Lb(G)} = r(G). \tag{8}
\]

We next show how to derive such upper and lower bounds.

Deriving the upper bound Ub(G). For a pair (x,y) we define the collection of events

\[
\Omega_{xy} = \{T_{z|xy} \mid \{x,y,z\} \in T_{xy}\} \cup \{Q_{zw|xy} \mid \{x,y,z,w\} \in Q^d_{xy}\}.
\]

As the following two lemmas show, if pair (x,y) contributes to the cost paid by the algorithm, then exactly one of the events in $\Omega_{xy}$ occurs.

Lemma 3 If $c_{xy}(C, c_\ell, G) > 0$ then at least one of the events in $\Omega_{xy}$ occurs.

Lemma 4 The events within the collection $\Omega_{xy}$ are disjoint.

Combining Lemmas 3 and 4 with the expressions of the probabilities of the events $T_{z|xy}$ (Lemma 1) and $Q_{zw|xy}$ (Lemma 2) we can derive an upper bound on the expected contribution $E[c_{xy}(G)]$ of a pair (x,y) to the total
cost.

Lemma 5 For a pair (x,y) ∈ V×V the following bound holds.

\[
E[c_{xy}(G)] \le \sum_{\{x,y,z\}\in T_{xy}} \Pr[A_{xyz}] + \sum_{\{x,y,z,w\}\in Q^d_{xy}} \tfrac{1}{4}\Pr[A_{xyzw}].
\]

The bound in Lemma 5 together with Equation (7) can be used to give the desired (upper) bound on the overall expected cost E[c(G)].

Lemma 6 The expected cost E[c(G)] of the Chromatic Balls algorithm can be bounded as follows

\[
E[c(G)] \le Ub(G) = \sum_{\{x,y,z\}\in T} \Bigl( 3\Pr[A_{xyz}] + \tfrac{3}{4}X_{xyz} + \tfrac{1}{2}Y_{xyz} \Bigr),
\]

where:

\[
X_{xyz} = \sum_{w\in W_{xyz}} \frac{\Pr[A_{xyzw}]}{\tau_{xyzw}}, \qquad
Y_{xyz} = Y^{xy}_{xyz} + Y^{xz}_{xyz} + Y^{yz}_{xyz},
\]
\[
Y^{xy}_{xyz} = \sum_{w\in W^{xy}_{xyz}} \frac{\Pr[A_{xyzw}]}{\tau_{xyzw}}, \qquad
Y^{xz}_{xyz} = \sum_{w\in W^{xz}_{xyz}} \frac{\Pr[A_{xyzw}]}{\tau_{xyzw}}, \qquad
Y^{yz}_{xyz} = \sum_{w\in W^{yz}_{xyz}} \frac{\Pr[A_{xyzw}]}{\tau_{xyzw}}.
\]

Finally, $\tau_{xyzw}$ denotes the number of B-triplets contained in any B-quadruple {x,y,z,w}.

Deriving the lower bound Lb(G). [...] B-triplet incurs [...]

Lemma 7 [...] T satisfying $\sum_{\{x,y,z\}\in T_{x'y'}} \alpha_{xyz} \le 1$ for all (x′,y′) ∈ V×V. It holds that $c^*(G) \ge \sum_{\{x,y,z\}\in T} \alpha_{xyz}$.

We can then obtain a lower bound on the optimal solution by finding a suitable set of weights $\alpha_{xyz}$ that satisfies the conditions of the previous lemma. We derive such a set of weights in the following further lemma.

Lemma 8 For any pair (x,y) ∈ V×V the following condition holds.

\[
\sum_{\{x,y,z\}\in T_{xy}} \frac{1}{1+|T_{xy}|} \Bigl( \tfrac{1}{2}\Pr[A_{xyz}] + \tfrac{1}{6}X_{xyz} + \tfrac{1}{6}Y_{xyz} \Bigr) \le 1.
\]

Thus, combining Lemmas 7 and 8, we can obtain the desired lower bound Lb(G) as follows.

Lemma 9 The cost $c^*(G)$ of the optimal solution on any input instance G is lower bounded as follows

\[
c^*(G) \ge Lb(G) = \sum_{\{x,y,z\}\in T} \frac{1}{1+t_{\max}} \Bigl( \tfrac{1}{2}\Pr[A_{xyz}] + \tfrac{1}{6}X_{xyz} + \tfrac{1}{6}Y_{xyz} \Bigr),
\]

where $t_{\max} = \max_{(x,y)\in V\times V} |T_{xy}|$ is the maximum number of B-triplets that hit a pair of objects.

The approximation ratio r(G). The upper and lower bounds obtained in Lemmas 6 and 9 are at the basis of the final form of the approximation ratio of Chromatic Balls.

Theorem 1 The approximation ratio of the Chromatic Balls algorithm on any input instance G is

\[
r(G) = \frac{E[c(G)]}{c^*(G)} \le 6(1+t_{\max}),
\]

where $t_{\max} = \max_{(x,y)\in V\times V} |T_{xy}|$ is the maximum number of B-triplets that hit a pair of objects.

Theorem 1 shows that the approximation factor of the Chromatic Balls algorithm is bounded by the maximum number $t_{\max}$ of B-triplets that hit a pair of objects. The result is meaningful as it quantifies the quality of the performance of the algorithm
as a property of the input graph. For example, as the following corollary shows, the algorithm provides a constant-factor approximation for bounded-degree graphs.

Corollary 1 The approximation ratio of the Chromatic Balls algorithm on any input instance G is

\[
r(G) \le 6(2D_{\max}-1),
\]

where $D_{\max} = \max_{x\in V} |\{y \mid y\in V \wedge \ell(x,y)\neq l_0\}|$ is the maximum degree in the problem instance.

4. OTHER ALGORITHMS

In this section we present two additional algorithms for the chromatic-correlation-clustering problem. The first one is a variant of the Chromatic Balls algorithm that tries to overcome some weaknesses of Chromatic Balls by employing two heuristics, one for pivot selection and one for cluster selection. The second one is an alternating minimization method that is designed to optimize directly the objective function.

4.1 Lazy Chromatic Balls

The algorithm we present next is motivated by the following example, in which we discuss what may go wrong during the execution of the Chromatic Balls algorithm.

Example 3 Consider the graph in Figure 2: it has a fairly evident green cluster formed by vertices {U,V,R,X,Y,W,Z}.
[Figure 2: An example of an edge-labeled graph on vertices R, S, T, U, V, W, X, Y, Z.]

However, as all the edges have the same probability of being selected as pivots, Chromatic Balls might miss this green cluster, depending on which edge is selected first. For instance, suppose that the first pivot chosen is (Y, S). Chromatic Balls forms the red cluster {Y, S, T} and removes it from the graph. Removing vertex Y also removes the edge (X, Y), which would have been a good pivot to build a green cluster. At this point, even if the second selected pivot edge is a green one, say (X, Z), Chromatic Balls would form only a small green cluster {X, W, Z}.

Motivated by the previous example we introduce the Lazy Chromatic Balls heuristic, which tries to minimize the risk of bad choices. Given a vertex $x \in V$ and a label $l \in L$, let $d(x, l)$ be the number of edges incident to $x$ having label $l$. We also write $\Delta(x) = \max_{l \in L} d(x, l)$ and $\lambda(x) = \arg\max_{l \in L} d(x, l)$. Lazy Chromatic Balls differs from Chromatic Balls in two ways:

Pivot random selection. At each iteration Lazy Chromatic Balls selects a pivot edge in two steps. First, a vertex $u$ is picked with probability directly proportional to $\Delta(u)$. Then, a second vertex $v$ is selected among the neighbors of $u$ with probability proportional to $d(v, \lambda(u))$.

Ball formation. Given the pivot $(u, v)$, Chromatic Balls forms a cluster by adding all vertices $x$ such that $(u, v, x)$ is a monochromatic triangle. Lazy Chromatic Balls instead iteratively adds vertices $w$ to the cluster as long as they form a triangle $(X, Z, w)$ of color $\ell(u, v)$, where $X$ is either $u$ or $v$, and $Z$ can be any other vertex already belonging to the current cluster.

Example 4 Consider again the example in Figure 2. Vertices X and Y have the maximum number of edges of one color: they both have 5 green edges. Hence, one of them is chosen as first pivot vertex $u$ by Lazy Chromatic Balls with higher probability than the remaining vertices. Suppose that X is picked, i.e., $u$ = X. Given this choice, the second pivot $v$ is chosen among the neighbors of X with probability proportional to $d(v, \lambda(u))$, i.e., the higher the number of green edges of the neighbor, the higher the probability for it to be chosen. In this case, hence, Lazy Chromatic Balls would likely choose Y as the second pivot vertex $v$, thus making (X, Y) the selected pivot edge. Afterwards, Lazy Chromatic Balls adds to the cluster being formed the vertices {U, V, Z}, because each of them forms a green triangle with the pivot edge. Then, R enters the cluster too, because it forms a green triangle with Y and V, which is already in the cluster. Similarly, W enters the cluster thanks to Z.

Algorithm 2 Alternating Minimization (AM)
Input: Edge-labeled graph $G = (V, E, L, \ell)$; number $K$ of output clusters
Output: Clustering $C : V \to \mathbb{N}$; cluster labeling function $c\ell : C[V] \to L$
1: initialize $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$ and $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_K]$ at random
2: repeat
3:   for all $x \in V$ compute optimal $\mathbf{a}_x$ according to Proposition 1
4:   for all $k \in [1..K]$ compute optimal $\mathbf{c}_k$ according to Proposition 2
5: until neither $\mathbf{A}$ nor $\mathbf{C}$ changed

Computational complexity. Like Chromatic Balls, the running time of the Lazy Chromatic Balls algorithm is determined by picking the pivots and building the various clusters. Picking the first pivot $u$ can be implemented with a priority queue with priorities $\Delta \times rnd$, where $rnd$ is a random number. This requires computing $\Delta$ for all objects, which takes $O(nh + m)$ (where $h = |L|$). Managing the priority queue itself requires instead $O(n \log n)$, as each object is put into and removed from the queue only once during the execution of the algorithm. Given $u$, the second pivot $v$ is selected by probing all (non-chosen) neighbors of $u$. This takes $O(m)$ time, as for each pivot $u$, its neighbors are accessed only once throughout the execution of the algorithm.
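The pivot-selection and ball-formation rules of Lazy Chromatic Balls can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the edge set `EDGES` is a hypothetical graph chosen to be consistent with Example 4 (the actual edges of Figure 2 are not listed in the text), the function names are invented here, and the simple loops do not achieve the running time discussed in the complexity analysis.

```python
import random
from collections import defaultdict

# Hypothetical edge-labeled graph (an assumption; Figure 2's edge list is not given).
GREEN = ["XY", "XU", "YU", "XV", "YV", "XZ", "YZ", "YR", "VR", "XW", "ZW"]
RED = ["YS", "ST", "YT"]
EDGES = {frozenset(e): "green" for e in GREEN}
EDGES.update({frozenset(e): "red" for e in RED})


def label_degrees(edges):
    """d[x][l] = number of edges incident to x that carry label l."""
    d = defaultdict(lambda: defaultdict(int))
    for e, l in edges.items():
        for x in e:
            d[x][l] += 1
    return d


def delta_and_lambda(d, x):
    """Return (Delta(x), lambda(x)): the largest per-label degree and its label."""
    lam = max(d[x], key=d[x].get)
    return d[x][lam], lam


def pick_pivot(edges, rng):
    """Pick u with probability ~ Delta(u), then a neighbor v with probability ~ d(v, lambda(u))."""
    d = label_degrees(edges)
    nodes = sorted(d)
    u = rng.choices(nodes, weights=[delta_and_lambda(d, x)[0] for x in nodes])[0]
    lam = delta_and_lambda(d, u)[1]
    nbrs = sorted({y for e in edges if u in e for y in e if y != u})
    v = rng.choices(nbrs, weights=[d[y][lam] for y in nbrs])[0]
    return u, v


def lazy_ball(edges, u, v):
    """Grow a cluster from pivot (u, v) by repeatedly adding any vertex w that closes
    a triangle (p, z, w) of the pivot's color, with p in {u, v} and z already inside."""
    color = edges[frozenset((u, v))]

    def has(a, b):
        return edges.get(frozenset((a, b))) == color

    vertices = {x for e in edges for x in e}
    cluster = {u, v}
    grew = True
    while grew:
        grew = False
        for w in sorted(vertices - cluster):
            if any(has(p, w) and has(z, w) and has(p, z)
                   for p in (u, v) for z in cluster if z != p):
                cluster.add(w)
                grew = True
    return cluster, color


if __name__ == "__main__":
    print(pick_pivot(EDGES, random.Random(0)))
    print(lazy_ball(EDGES, "X", "Y"))
```

On this hypothetical graph, starting from the pivot (X, Y) the ball grows exactly as narrated in Example 4: U, V and Z join through the pivot edge, then R through V and W through Z.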
Finally, building the current cluster takes $O(m)$ time, as it requires a visit of the graph, where each edge is accessed $O(1)$ times. In conclusion, the computational complexity of Lazy Chromatic Balls is $O(n(\log n + h) + m)$, which, for small $h$, is better than the complexity of Chromatic Balls.

4.2 An alternating-minimization approach
A nice feature of the previous algorithms is that they are parameter-free: they produce clusterings by using information that is local to the pivot edges, without forcing the number of output clusters in any way. However, in some cases it may be desirable to have an output clustering composed of a pre-specified number $K$ of clusters. To this purpose, we present here an algorithm based on the alternating-minimization paradigm [7], which receives as input the number $K$ of output clusters and attempts to minimize Equation (2) directly. The pseudocode of the proposed algorithm, called Alternating Minimization, is given in Algorithm 2.

In a nutshell, AM tries to produce a solution by alternating between two optimization steps. In the first step the algorithm finds the best cluster assignment for every $x \in V$ given the assignments of every other $y \in V$ and the current cluster labels. In the second step, it finds the best label for every cluster given the current assignment of objects to clusters. Below we show that both steps can be solved optimally.
As a consequence the value of Equation (2) is guaranteed to decrease in every step, until convergence. Finding the global minimum is obviously hard, but the algorithm is guaranteed to converge to a local optimum.

Definitions. For presentation's sake, we adopt matrix notation. We denote matrices by uppercase boldface romans and vectors by lowercase boldface romans. We write $X_{ij}$ for the $(i,j)$ coordinate of matrix $\mathbf{X}$, and $x(i)$ for the $i$-th coordinate of vector $\mathbf{x}$.

The parameter space of Problem 2 consists of a cluster assignment for every object $x \in V$, given by the binary matrix $\mathbf{A}$, and a label assignment for every cluster $k \in \{1, \ldots, K\}$, given by the binary matrix $\mathbf{C}$. We have $A_{kx} = 1$ when object $x$ is assigned to cluster $k$, and $A_{kx} = 0$ otherwise. Similarly, we set $C_{lk} = 1$ when label $l$ is assigned to cluster $k$, and $C_{lk} = 0$ otherwise. Since every object must belong to one and only one cluster, and every cluster must have one and only one label assigned, both $\mathbf{A}$ and $\mathbf{C}$ are constrained to consist of all zeros with a single 1 in every column. Denote by $\mathbf{a}_x$ the column of $\mathbf{A}$ that corresponds to object $x$.

The input is represented by a set of binary matrices, with a matrix $\mathbf{Z}_x$ for every $x \in V$. These matrices encode the labeling function $\ell$ as follows. Let $\mathbf{z}_{xy}$ denote the column of $\mathbf{Z}_x$ that corresponds to the object $y \in V$. We have $z_{xy}(l) = 1$ if and only if $\ell(x,y) = l$, otherwise $z_{xy}(l) = 0$. Every $\mathbf{Z}_x$ thus consists of zeros, with exactly one 1 in every column.
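Finally, denote by $\mathbf{b}$ a special binary vector where $b(l) = 1$ when $l = l_0$ and $b(l) = 0$ otherwise. We then have $\mathbf{z}_{xy}^T \mathbf{b} = 1$ if and only if $\ell(x,y) = l_0$.

The above formulation of the problem assumes that the input is represented by many large matrices. Note however that this representation is only conceptual. In the actual implementation we do not have to materialize these matrices and we can represent the input with the minimal amount of space required, as shown next. The benefit of our formulation is that it allows us to write our objective function and our optimization process using linear-algebra operations, and to argue about the optimality of the local optimization steps.

Optimal cluster assignment. Denote by $N^-_{xk}$ the number of objects $y \in V$ in cluster $k$ that have $\ell(x,y) = l_0$. Since $\ell(x,y) = l_0 \Leftrightarrow \mathbf{z}_{xy}^T \mathbf{b} = 1$, we have $N^-_{xk} = (\mathbf{A} \mathbf{Z}_x^T \mathbf{b})(k)$. Similarly, let $N^+_{xk}$ denote the number of objects $y \in V$ in cluster $k$ that have $\ell(x,y) = c\ell(k)$. Since $y \in k$, we have $\ell(x,y) = c\ell(k) \Leftrightarrow \mathbf{z}_{xy}^T \mathbf{C} \mathbf{a}_y = 1$ and can write $N^+_{xk} = (\mathbf{A} \mathbf{w}_x)(k)$, where $\mathbf{w}_x = [\mathbf{z}_{x1}^T \mathbf{C} \mathbf{a}_1, \ldots, \mathbf{z}_{xn}^T \mathbf{C} \mathbf{a}_n]^T$.

Proposition 1 The optimal cluster assignment for $x \in V$ given $\mathbf{A}$ and $\mathbf{C}$ is $k^* = \arg\min_k \left( N^-_{xk} - N^+_{xk} \right)$.

Proof. We can rewrite Equation (2) as follows:

$$\sum_{x,y} \left[ \mathbf{a}_x^T \mathbf{a}_y \left(1 - \mathbf{z}_{xy}^T \mathbf{C} \mathbf{a}_y\right) + \left(1 - \mathbf{a}_x^T \mathbf{a}_y\right)\left(1 - \mathbf{z}_{xy}^T \mathbf{b}\right) \right] = \sum_{x} \left[ \mathbf{a}_x^T \mathbf{A} \left(\mathbf{1} - \mathbf{w}_x\right) + \left(\mathbf{1}^T - \mathbf{a}_x^T \mathbf{A}\right)\left(\mathbf{1} - \mathbf{Z}_x^T \mathbf{b}\right) \right], \quad (9)$$

where $\mathbf{w}_x$ is defined as above, and $\mathbf{1}$ denotes the $|V|$-dimensional vector of all 1s. Terms that correspond to a fixed $x \in V$ further simplify to

$$\mathbf{a}_x^T \mathbf{A} \mathbf{Z}_x^T \mathbf{b} - \mathbf{a}_x^T \mathbf{A} \mathbf{w}_x + d_x,$$

where the constant $d_x = \mathbf{1}^T \mathbf{1} - \mathbf{1}^T \mathbf{Z}_x^T \mathbf{b}$ is the "degree" of object $x$, the number of objects $y \in V$ where $\ell(x,y) \ne l_0$.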
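The assignment rule of Proposition 1 can be illustrated without materializing any of the matrices, by counting $N^-_{xk}$ and $N^+_{xk}$ directly. The following Python sketch makes several assumptions that are not in the text: labels are stored in a dictionary keyed by unordered vertex pairs, a missing entry stands for the special label $l_0$, the object $x$ itself is skipped when counting, and the function name `best_cluster` is invented for illustration.

```python
L0 = None  # stand-in for the special label l0; an assumption of this sketch


def best_cluster(x, label, assign, cluster_label):
    """Return k* = argmin_k (N^-_{xk} - N^+_{xk}) for object x.

    label:         dict mapping frozenset({x, y}) to an edge label (absent means l0)
    assign:        dict mapping every object to its current cluster index
    cluster_label: list whose k-th entry is the label currently given to cluster k
    """
    K = len(cluster_label)
    n_minus = [0] * K  # objects y in cluster k with label(x, y) == l0
    n_plus = [0] * K   # objects y in cluster k with label(x, y) == label of cluster k
    for y, k in assign.items():
        if y == x:
            continue  # assumption: x is not compared with itself
        l = label.get(frozenset((x, y)), L0)
        if l == L0:
            n_minus[k] += 1
        elif l == cluster_label[k]:
            n_plus[k] += 1
    return min(range(K), key=lambda k: n_minus[k] - n_plus[k])


if __name__ == "__main__":
    # Hypothetical toy instance: a has green edges to b and c, and no edge to d.
    label = {frozenset("ab"): "green", frozenset("ac"): "green"}
    assign = {"a": 1, "b": 0, "c": 0, "d": 1}
    print(best_cluster("a", label, assign, ["green", "red"]))
```

In the toy instance, cluster 0 scores 0 - 2 = -2 for object a and cluster 1 scores 1 - 0 = 1, so the rule moves a to cluster 0.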
2019 New PEP Senior High School English, Selective Compulsory Book 1, Unit 5 Using Language (pp. 55-57): Open-Class Lesson Plan

Evaluate chemical farming and organic farming

Teaching aims:
1. Enable students to know the difference between chemical farming and organic farming.
2. Know the structure of an argumentative essay and the writing skills of an argumentative essay, such as illustrating, giving examples or making contrasts.
3. Write an argumentative essay using what they will learn in this period.

Teaching key points:
1. Analyse the difference between chemical farming and organic farming.
2. Analyse the structure and the writing skills of the argumentative essay.

Teaching difficult points:
1. Analyse the structure and the writing skills of the argumentative essay.
2. Write an argumentative essay using what they will learn in this period.

Teaching procedures:

Step I Lead-in
Activity 1 Making predictions
(1) Look at the title of the passage and try to guess what will be talked about in the passage.
(2) If you were asked to write the passage under this title, what would your structure be?
Suggested answers:
(1) What the text will talk about will include: the definitions of both kinds of farming and the advantages and disadvantages of each.
(2) My structure may be like this:

Activity 2 Picture guessing
Here are 6 pictures. Try to find out which ones are about chemical farming and which are about organic farming according to your background knowledge.
Suggested answers:
Picture 1: organic farming; Picture 2: chemical farming; Picture 3: organic farming; Picture 4: organic farming; Picture 5: organic farming; Picture 6: chemical farming.

Step II Test our predictions
Read the text quickly and silently and try to find the main idea of each paragraph. Then draw a map of the structure.
Para. 1 _____________________________________________
Para. 2 _____________________________________________
Para. 3 _____________________________________________
Para. 4 _____________________________________________
Para. 5 _____________________________________________
Suggested answers:
Para. 1 Chemical farming helps fight crop disease and increase
production. (advantage)
Para. 2 Three disadvantages of chemical farming. (disadvantages)
Para. 3 The definition of organic farming and one of its advantages. (advantage)
Para. 4 Other methods of organic farming and their advantages. (advantages)
Para. 5 The disadvantage of organic farming and the writer's attitude, as shown to the readers. (disadvantage & attitude)
Therefore the structure of the passage is: (so our guess is quite right.)

Step III Read for details
Read the text carefully and complete the table. (Entries in red are filled in by the students.)

Chemical farming
Advantage: A great way to fight crop disease and increase production.
Disadvantages:
1. Kills not only harmful bacteria and insects but also helpful ones.
2. Affects not only the crops but also the animals and humans who digest them.
3. Crops are lacking in nutrition.

Organic farming
Advantages:
1. The soil is rich in minerals.
2. Keeps the air, soil, water and crops free of chemicals.
3. Avoids damage to the environment or to people's health.
Disadvantage: Nowhere near able to meet the need.

The writer's attitude
He/She is for organic farming. But there is still a long way to go to find a suitable solution.

Step IV Read for the organization and language
Activity 1
Read the text carefully again, find the transitional words and think about their functions.
Suggested answers:
The transitional words: for example, not only...but also, in addition, in turn, in fact, as for, but, and, as well.

Activity 2
Discuss the following questions.
(1) What other advantages and disadvantages of organic farming did the author not list?
(2) Which kind of farming do you like? And why?
Suggested answers:
(1) Advantages: Organic food is more nutritious and has more flavour. Farmers can get more money out of organic farming. Disadvantages: higher cost of producing food; greater effort required to farm.
(2) I like organic farming, because organic food is more delicious and healthier.

Step V Writing
Write an argumentative essay giving your opinion on chemical or organic farming.
Activity 1
Complete the outline
below. (Parts in red are filled in by the students.)
Topic sentence: In my opinion, __________ farming is preferable to __________ farming because ___________
Point 1: Chemical farming harms the environment.
(Detail): Chemical pesticides are not found in nature, and normal natural processes cannot get rid of them.
Point 2: Chemical farming harms us people.
(Detail): We end up eating plants, dairy products, eggs, and meat that have high levels of pesticides.
Point 3: Organic food is generally more nutritious than food grown with man-made chemicals.
(Detail): Chemicals are commonly used in farming to grow larger crops more quickly.
Conclusion: I would like to suggest buying organic food.
Suggested answers:
Topic sentence: In my opinion, organic farming is preferable to chemical farming because chemical farming does harm to not only the environment but also people, and because chemically farmed food has less nutrition while organic food is more nutritious.

Activity 2
Use your outline to write a short essay giving your opinion on the topic, paying attention to using the right transitional words.

Activity 3
Exchange your drafts with another student. Use this checklist to give them helpful feedback.
√ Does the writer do a good job of expressing his/her opinion?
√ Does each paragraph have one main idea?
√ Does each paragraph have at least one detail to support its main idea?
√ Does the writer sequence the points in logical order?
√ Does the writer do a good job persuading you to accept his/her opinion?

Activity 4
Polish your draft and share it with your partner.
Suggested answer:
In my opinion, organic farming is preferable to chemical farming because of the benefits to the environment and people, and because organic food is more nutritious.

To begin with, chemical farming harms the environment. Chemical pesticides are not found in nature, and normal natural processes cannot get rid of them. Instead, the chemicals tend to accumulate in the soil and water, and are absorbed in the food chain, where they poison wildlife and domestic animals. This affects us
directly, because we end up eating plants, dairy products, eggs, and meat that have high levels of pesticides. Think about it: the purpose of the pesticides is to kill "pests". Something that can kill an insect probably is not healthy for a person to eat. While it may be true that the food we eat does not have enough pesticide in it to kill us immediately, over the years this pesticide may cause us health problems or give us diseases such as cancer.

Finally, organic food is generally more nutritious than food grown with man-made chemicals. There was the case of watermelons literally exploding in the fields because they had been given chemicals to make them grow faster. The reason why they exploded was that they contained too much water, because the farmers had used too many chemicals and the weather was unusually wet. As it turns out, such chemicals are commonly used in farming to grow larger crops more quickly. Do you really want to eat watered-down fruit or vegetables which have been rushed to market? They cannot possibly taste as good as organic vegetables, nor can they be as nutritious.

In closing, I would like to suggest buying organic food. Yes, the price is a little higher, but the small extra price we pay will bring us huge rewards in the end.

Step VI Homework
1. Put up the poster in your classroom or in a public place.
2. Appreciate your classmates' good works and try to learn from them.
Clustering
Graph Cut
[Figure: Graph Cut example, a weighted graph on six vertices (1 to 6) with edge weights 0.8, 0.6, 0.1, 0.8, 0.8, 0.2, 0.8, 0.7, together with its 6 × 6 similarity matrix W, whose first row is (0, 0.8, 0.6, 0, 0.1, 0).]
Spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. Based on spectral graph theory, spectral clustering is in essence the problem of optimal graph cut.
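As a rough illustration of how the spectrum relates to a graph cut, the sketch below bipartitions a small graph by the sign of the second eigenvector (the Fiedler vector) of the graph Laplacian L = D - W. Everything specific here is an assumption made for the example: the 6-vertex similarity matrix is hypothetical (it is not the matrix from the slide), the two-way split by sign is only the simplest case of spectral clustering, and the eigenvector is approximated by plain power iteration so that the code needs no external libraries.

```python
def fiedler_partition(W, iters=500):
    """Split a weighted graph in two using the sign of an approximate Fiedler vector.

    W is a symmetric similarity matrix given as a list of lists.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    c = 2.0 * max(deg)  # upper bound on the largest eigenvalue of L = D - W
    # Start from a vector orthogonal to the all-ones vector (the trivial eigenvector).
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iters):
        # w = (c*I - L) v, so the smallest nonzero eigenvalue of L becomes dominant.
        w = [c * v[i] - (deg[i] * v[i] - sum(W[i][j] * v[j] for j in range(n)))
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]            # project out the all-ones direction
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [0 if x < 0 else 1 for x in v]


# Hypothetical similarity matrix: two tightly connected triangles joined by one weak edge.
W = [
    [0.0, 0.8, 0.8, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.8, 0.8, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.8, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.8, 0.0],
]

if __name__ == "__main__":
    print(fiedler_partition(W))
```

The sign split separates the two triangles, which corresponds to cutting only the weak edge of weight 0.1. A practical implementation would more likely use a numerical eigensolver and run k-means on several eigenvectors.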
Implementation of k-means
Initialize k, u: k = 2; u = [0 2; 2 −1]
$c^{(i)} := \arg\min_j \lVert x^{(i)} - u_j \rVert^2$ (1)
(2)
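A minimal Python sketch of the procedure outlined above follows. The assignment step implements equation (1); the update step, which moves each centroid to the mean of its assigned points, is the standard k-means update and is assumed here because the slide's second equation did not survive. The data points are hypothetical, and reading the initial u as the two centroids (0, 2) and (2, −1) is also an assumption.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iters=100):
    """Plain k-means: alternate the assignment step and the centroid update."""
    centroids = list(centroids)
    labels = []
    for _ in range(iters):
        # Assignment step, equation (1): c(i) = argmin_j ||x(i) - u_j||^2
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step (assumed standard form): each centroid becomes the mean of its points.
        new = []
        for j, u in enumerate(centroids):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(u)  # keep an empty cluster's centroid where it was
        if new == centroids:
            break
        centroids = new
    return labels, centroids


# Hypothetical 2-D data; k = 2 with the initial centroids read off the slide.
POINTS = [(0, 1), (0, 3), (1, 2), (3, -1), (2, -2), (3, 0)]
INIT = [(0, 2), (2, -1)]

if __name__ == "__main__":
    print(kmeans(POINTS, INIT))
```

On this toy data the first three points end up in one cluster and the last three in the other, and the centroids stop moving after the second pass.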
Unsupervised —clustering (e.g., k-means, mixture models, hierarchical clustering); hidden Markov models,
Discrete Mathematics and Its Applications (English Edition, 6th Edition): Teaching Plan

Background
Discrete Mathematics is a foundational subject for computer science and mathematics. It covers a wide range of topics, including logic, set theory, algorithms, graph theory, and combinatorics. As a result, students who understand Discrete Mathematics are better prepared for further study and for solving problems in a variety of fields.
The sixth edition of Discrete Mathematics and Its Applications by Kenneth H. Rosen is a widely used textbook for students studying Discrete Mathematics. This teaching plan is designed for instructors teaching a course using this text.

Course Goals
The primary goal of this course is for students to develop an understanding of Discrete Mathematics and its applications. This will be achieved through the following objectives:
• Develop an understanding of the fundamental concepts of Discrete Mathematics, including logic, set theory, counting, and graph theory.
• Learn how to apply these concepts to solve problems in computer science and other areas.
• Develop skills in problem-solving and logical reasoning.
• Explore further applications of Discrete Mathematics in a variety of fields.

Teaching Plan

Textbook and Resources
The primary textbook for this course will be Discrete Mathematics and Its Applications (6th edition) by Kenneth H. Rosen. In addition to the textbook, there are several resources that can be used to enhance student learning, including:
• Online lectures and tutorials
• Problem sets and exercises
• Supplemental reading materials and videos

Course Outline
1. Introduction to Discrete Mathematics
• Overview of topics covered in the course
• Introduction to sets and functions
• Basic logic and proof techniques
2. Propositional Logic
• Propositions and truth values
• Logical connectives and truth tables
• Logical equivalences and implications
3. Predicate Logic
• Quantifiers and predicates
• Universal and existential quantification
• Validity and satisfiability
4. Set Theory
• Basic concepts and notation
• Set operations and Venn diagrams
• Applications to counting and probability
5. Combinatorics
• Permutations and combinations
• The Pigeonhole Principle and its applications
• Binomial coefficients and Pascal's triangle
6. Relations and Functions
• Relations and their properties
• Equivalence relations and partitioning
• Functions and their properties
7. Graph Theory
• Basic concepts and notation
• Graph representations and connectivity
• Planar graphs and Euler's formula
8. Trees
• Tree properties and traversal algorithms
• Spanning trees and minimum weight trees
• Applications to network optimization
9. Boolean Algebra and Boolean Functions
• Boolean algebra and its laws
• Boolean functions and their representations
• Applications to digital logic and circuit design

Evaluation and Grading
Students will be evaluated based on their performance on the following components:
• Homework assignments (30%)
• Midterm exam (30%)
• Final exam (40%)
Homework assignments will consist of problem sets and exercises that reinforce the concepts covered in class. The midterm exam will cover material from the first half of the course, while the final exam will cover all material from the course.

Conclusion
Discrete Mathematics is a fundamental subject for computer science and mathematics, and the sixth edition of Discrete Mathematics and Its Applications is a widely used textbook for students studying this subject. This teaching plan is designed to help instructors teach this subject effectively and achieve the course objectives outlined above.
Data Clustering: A Review

A.K. JAIN
Michigan State University
M.N. MURTY
Indian Institute of Science
AND
P.J. FLYNN
The Ohio State University

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications - Computer vision; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering; I.2.6 [Artificial Intelligence]: Learning - Knowledge acquisition
General Terms: Algorithms
Additional Key Words and Phrases: Cluster analysis, clustering applications, exploratory data analysis, incremental clustering, similarity indices, unsupervised learning

Section 6.1 is based on the chapter "Image Segmentation Using Clustering" by A.K. Jain and P.J. Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.
Authors' addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2000 ACM 0360-0300/99/0900–0001 $5.00

1. INTRODUCTION
1.1 Motivation
Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between
data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible.
It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

The term "clustering" is used in several research communities to describe methods for grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

CONTENTS
1. Introduction
1.1 Motivation
1.2 Components of a Clustering Task
1.3 The User's Dilemma and the Role of Expertise
1.4 History
1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
5.1 Hierarchical Clustering Algorithms
5.2 Partitional Algorithms
5.3 Mixture-Resolving and Mode-Seeking Algorithms
5.4 Nearest Neighbor Clustering
5.5 Fuzzy Clustering
5.6 Representation of Clusters
5.7 Artificial Neural Networks for Clustering
5.8 Evolutionary Approaches for Clustering
5.9 Search-Based Approaches
5.10 A Comparison of Techniques
5.11 Incorporating Domain Constraints in Clustering
5.12 Clustering Large Data Sets
6. Applications
6.1 Image Segmentation Using Clustering
6.2 Object and Character Recognition
6.3 Information Retrieval
6.4 Data Mining
7. Summary

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities. The audience for this
paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

[Figure 1. Data clustering: (a) input patterns; (b) desired clusters, with points labeled 1 to 7 by cluster.]

1.2 Components of a Clustering Task
Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:
(1) pattern representation (optionally including feature extraction and/or selection),
(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).
Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various
communities [Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976]. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [Michalski and Stepp 1983]. Distance measures are discussed in Section 4.

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic [Brailovski 1991] and graph-theoretic [Zahn 1971] clustering methods. The variety of techniques for cluster formation is described in Section 5.

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976].

How is the output of a clustering algorithm evaluated? What characterizes a 'good' clustering result and a 'poor' one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data does contain clusters, some clustering algorithms may obtain 'better' clusters than others.
The assessment of a clustering procedure's output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. The interested reader is referred to Dubes [1987] and Cheng [1995] for information.

[Figure 2. Stages in clustering: patterns pass through feature selection/extraction to pattern representations, then through interpattern similarity to grouping, which outputs clusters; a feedback loop leads from grouping back to the earlier stages.]

Cluster validity analysis, by contrast, is the assessment of a clustering procedure's output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at subjectively. Hence, little in the way of 'gold standards' exists in clustering except in well-prescribed subdomains. Validity assessments are objective [Dubes 1993] and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm.
When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit. Indices used for this comparison are discussed in detail in Jain and Dubes [1988] and Dubes [1993], and are not discussed further in this paper.

1.3 The User's Dilemma and the Role of Expertise
The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand. In Dubes and Jain [1976], a set of admissibility criteria defined by Fisher and Van Ness [1971] are used to compare clustering algorithms. These admissibility criteria are based on: (1) the manner in which clusters are formed, (2) the structure of the data, and (3) sensitivity of the clustering technique to changes that do not affect the structure of the data. However, there is no critical analysis of clustering algorithms dealing with important questions such as:
- How should the data be normalized?
- Which similarity measure is appropriate to use in a given situation?
- How should domain knowledge be utilized in a particular clustering problem?
- How can a very large data set (say, a million patterns) be clustered efficiently?
These issues have motivated this survey, and its aim is to provide a perspective on the state of the art in clustering methodology and algorithms. With such a perspective, an informed practitioner should be able to confidently assess the tradeoffs of different techniques, and ultimately make a competent decision on a technique or suite of techniques to employ in a particular application.
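To make the idea of an external validity assessment from Section 1.1's successor discussion concrete (comparing a recovered structure with an a priori one), the sketch below computes the Rand index, one commonly used external index. The text does not name this index; it is chosen here purely as an illustration, and the function name and label vectors are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pattern pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two patterns in the
    same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    a_priori = [0, 0, 1, 1]
    recovered = [1, 1, 0, 0]   # same partition, different label names
    print(rand_index(a_priori, recovered))
```

Because the index looks only at pairs, it does not depend on how the clusters are named, which suits clustering, where category labels are data driven.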
There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. For example, consider the two-dimensional data set shown in Figure 1(a). Not all clustering techniques can uncover all the clusters present here with equal facility, because clustering algorithms often contain implicit assumptions about cluster shape or multiple-cluster configurations based on the similarity measures and grouping criteria used.

Humans perform competitively with automatic clustering procedures in two dimensions, but most real problems involve clustering in higher dimensions. It is difficult for humans to obtain an intuitive interpretation of data embedded in a high-dimensional space. In addition, data hardly follow the "ideal" structures (e.g., hyperspherical, linear) shown in Figure 1. This explains the large number of clustering algorithms which continue to appear in the literature; each new clustering algorithm performs slightly better than the existing ones on a specific distribution of patterns.

It is essential for the user of a clustering algorithm to not only have a thorough understanding of the particular technique being utilized, but also to know the details of the data gathering process and to have some domain expertise; the more information the user has about the data at hand, the more likely the user would be able to succeed in assessing its true class structure [Jain and Dubes 1988]. This domain information can also be used to improve the quality of feature extraction, similarity computation, grouping, and cluster representation [Murty and Jain 1995].

Appropriate constraints on the data source can be incorporated into a clustering procedure. One example of this is mixture resolving [Titterington et al. 1985], wherein it is assumed that the data are drawn from a mixture of an unknown number of densities (often assumed to be multivariate Gaussian).
The clustering problem here is to identify the number of mixture components and the parameters of each component. The concept of density clustering and a methodology for decomposition of feature spaces [Bajcsy 1997] have also been incorporated into traditional clustering methodology, yielding a technique for extracting overlapping clusters.

1.4 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition [Anderberg 1973], image processing [Jain and Flynn 1996] and information retrieval [Rasmussen 1992; Salton 1991], clustering has a rich history in other disciplines [Jain and Dubes 1988] such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning [Jain and Dubes 1988], numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980]. A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain et al. [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

1.5 Outline

This paper is organized as follows. Section 2 presents definitions of terms to be used throughout
the paper. Section 3 summarizes pattern representation, feature extraction, and feature selection. Various approaches to the computation of proximity between patterns are discussed in Section 4. Section 5 presents a taxonomy of clustering approaches, describes the major techniques in use, and discusses emerging techniques for clustering incorporating non-numeric constraints and the clustering of large sets of patterns. Section 6 discusses applications of clustering methods to image analysis and data mining problems. Finally, Section 7 presents some concluding remarks.

2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.

—A pattern (or feature vector, observation, or datum) $\mathbf{x}$ is a single data item used by the clustering algorithm. It typically consists of a vector of $d$ measurements: $\mathbf{x} = (x_1, \ldots, x_d)$.
—The individual scalar components $x_i$ of a pattern $\mathbf{x}$ are called features (or attributes).
—$d$ is the dimensionality of the pattern or of the pattern space.
—A pattern set is denoted $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The $i$th pattern in $\mathcal{X}$ is denoted $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,d})$. In many cases a pattern set to be clustered is viewed as an $n \times d$ pattern matrix.
—A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases.
More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.
—Hard clustering techniques assign a class label $l_i$ to each pattern $\mathbf{x}_i$, identifying its class. The set of all labels for a pattern set $\mathcal{X}$ is $\mathcal{L} = \{l_1, \ldots, l_n\}$, with $l_i \in \{1, \ldots, k\}$, where $k$ is the number of clusters.
—Fuzzy clustering procedures assign to each input pattern $\mathbf{x}_i$ a fractional degree of membership $f_{ij}$ in each output cluster $j$.
—A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns.

3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION

There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user's role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern. Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates
to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained.

A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then $(20, \text{black})$ is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]:

(1) Quantitative features: e.g.
    (a) continuous values (e.g., weight);
    (b) discrete values (e.g., the number of computers);
    (c) interval values (e.g., the duration of an event).
(2) Qualitative features:
    (a) nominal or unordered (e.g., color);
    (b) ordinal (e.g., military rank or qualitative evaluations of temperature ("cool" or "hot") or sound intensity ("quiet" or "loud")).

Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales.

One can also use structured features [Michalski and Stepp 1983] which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node "vehicle" may be a generalization of children labeled "cars," "buses," "trucks," and "motorcycles." Further, the node "cars" could be a generalization of cars of the type "Toyota," "Ford," "Benz," etc. A generalized representation of patterns, called symbolic objects, was proposed in Diday [1988]. Symbolic objects are defined by a logical conjunction of events. These events link values and features in which the features can take one or more values and all the objects
need not be defined on the same set of features.

It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a well-explored topic in statistical pattern recognition [Duda and Hart 1973]; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trial-and-error process where various subsets of features are selected, the resulting patterns clustered, and the output evaluated using a validity index. In contrast, some of the popular feature extraction processes (e.g., principal components analysis [Fukunaga 1990]) do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human.

4. SIMILARITY MEASURES

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

The most popular metric for continuous features is the Euclidean distance

$$d_2(\mathbf{x}_i, \mathbf{x}_j) = \Bigl(\sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2\Bigr)^{1/2} = \|\mathbf{x}_i - \mathbf{x}_j\|_2,$$

which is a special case ($p = 2$) of the Minkowski
metric

$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \Bigl(\sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p\Bigr)^{1/p} = \|\mathbf{x}_i - \mathbf{x}_j\|_p.$$

Figure 3. A curvilinear cluster whose points are approximately equidistant from the origin. Different pattern representations (coordinate systems) would cause clustering algorithms to yield different results for this data (see text).

The Euclidean distance has an intuitive appeal as it is commonly used to evaluate the proximity of objects in two or three-dimensional space. It works well when a data set has "compact" or "isolated" clusters [Mao and Jain 1996]. The drawback to direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)\,\Sigma^{-1}\,(\mathbf{x}_i - \mathbf{x}_j)^T,$$

where the patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ are assumed to be row vectors, and $\Sigma$ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; $d_M(\cdot,\cdot)$ assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.

Some clustering algorithms work on a matrix of proximity values instead of on
the original pattern set. It is useful in such situations to precompute all the $n(n-1)/2$ pairwise distance values for the $n$ patterns and store them in a (symmetric) matrix.

Computation of distances between patterns with some or all features being noncontinuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992]. A good summary of similarity measures between trees is given by Zhang [1995]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995] and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further in this paper.

There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in Michalski and Stepp [1983]. The similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$, given this context, is given by
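Returning to the Minkowski metric and the precomputed table of $n(n-1)/2$ pairwise distances discussed earlier in this section, the following is a minimal Python sketch of both ideas. The function names `minkowski` and `pairwise_distances`, the dictionary keyed by index pairs, and the toy patterns are illustrative assumptions and do not come from the survey.

```python
from itertools import combinations


def minkowski(x, y, p=2.0):
    """Minkowski distance d_p between two equal-length feature vectors.

    p = 2 gives the Euclidean distance, p = 1 the city-block distance.
    """
    if len(x) != len(y):
        raise ValueError("patterns must have the same dimensionality")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def pairwise_distances(patterns, p=2.0):
    """Precompute the n(n-1)/2 distances, keyed by index pair (i, j), i < j.

    Only the upper triangle is stored because the matrix is symmetric.
    """
    return {
        (i, j): minkowski(patterns[i], patterns[j], p)
        for i, j in combinations(range(len(patterns)), 2)
    }


# Hypothetical toy pattern set with n = 3 patterns and d = 2 features.
patterns = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = pairwise_distances(patterns)
```

Because the largest-scaled feature tends to dominate such distances, features would normally be normalized before calling `pairwise_distances`; that step is left out here to keep the sketch short.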
Lecture Topic: Assessment Rubrics

Assessment Rubric Lecture: Assessing Student Learning Effectively.

This lecture will provide an in-depth overview of assessment rubrics, a powerful tool for evaluating student learning. Participants will learn the purpose, types, and benefits of rubrics, as well as the key elements to consider when constructing and using them.

Through interactive exercises and examples, attendees will gain a thorough understanding of:
- The foundational principles of rubrics.
- The different types of rubrics and their appropriate use cases.
- The essential components of a well-designed rubric.
- How to create rubrics that align with learning objectives.
- Techniques for using rubrics to provide constructive feedback to students.
- Best practices for implementing and revising rubrics in the classroom.
Evaluating Quality of Text Clustering with ART1
L. Massey
Royal Military College of Canada
Kingston, Ontario, Canada, K7K 7B4

Abstract - Self-organizing large amounts of textual data in accordance with some topic structure is an increasingly important application of clustering. Adaptive Resonance Theory (ART) neural networks possess several interesting properties that make them appealing in this area. Although ART has been used in several research works as a text clustering tool, the level of quality of the resulting document clusters has not been clearly established yet. In this paper, we present experimental results with binary ART that address this issue by determining how close clustering quality is to an upper bound on clustering quality.

I. INTRODUCTION

We consider the application of clustering to the self-organization of a textual document collection. Clustering is the operation by which similar objects are grouped together in an unsupervised manner [1, 2]. Hence, when clustering textual documents, one is hoping to form sets of documents with similar content. Instead of exploring the whole collection of documents, a user can then browse the resulting clusters to identify and retrieve relevant documents. As such, clustering provides a summarized view of the information space by grouping documents by topics. Clustering is often the only viable solution to organize large text collections into topics. The advantage of clustering is realized when a training set and class definitions are unavailable, or when creating them is either cost prohibitive due to the sheer size of the collection or unrealistic due to the rapidly changing nature of the collection.

We specifically study text clustering with Adaptive Resonance Theory (ART) [3, 4] neural networks.
ART stability and plasticity properties as well as its ability to process dynamic data efficiently make it an attractive candidate for clustering large, rapidly changing text collections in real-life environments. Although ART has been investigated previously as a means of clustering text data, due to numerous variations in ART implementations, experimental data sets and quality evaluation methodologies, it is not clear whether ART performs well in this type of application. Since ART seems to be a logical and appealing solution to the rapidly growing amount of textual electronic information processed by organizations, it would be important to eliminate any confusion surrounding the quality of the text clusters it produces. In this paper, we present experimental results with a binary ART neural network (ART1) that address this issue by determining how close clustering quality achieved with ART is to an expected upper bound on clustering quality. We will consider other versions of ART in future work.

II. RELATED WORK

We consider one of the many applications of text clustering in the field of Information Retrieval (IR) [5], namely clustering that aims at self-organizing textual document collections. This application of text clustering can be seen as a form of classification by topics, hence making it the unsupervised counterpart to Text Categorization (TC) [6]. Text self-organization has become increasingly popular due to the availability of large document collections that change rapidly and that are quasi-impossible to organize manually. Even TC becomes unsuitable in such environments because supervised learning of classifiers is not plastic and thus requires retraining upon detection of novelty. Representative work on text clustering includes, among many others, [7, 8, and 9]. Due to space limitation, we will restrict our review of previous work to text clustering with ART.
As well, supervised versions of ART [10] applied to text categorization [11] will not be considered.

MacLeod and Robertson [12] were, as far as we know, the first researchers to consider ART for text clustering. They used a modified version of ART1 in which the similarity computation and weight updates involved domain and task specific knowledge. The inclusion of this knowledge makes the ART implementation more complex but may advantage it over a basic form of ART1. We intend to test this type of ART network in future work, but for now we are interested in establishing the baseline quality that can be achieved with a more basic implementation. The relatively small Keen and Cranfield text collections (800 and 1400 documents respectively) were used to test MacLeod's algorithm. ART clustering quality was evaluated with the F measure computed on the results of provided queries. This approach to cluster evaluation does not allow for a comprehensive evaluation of the resulting cluster structure, since it only considers the set of queries. The clustering quality results were not very good with F1 = 0.25 and 0.15 (minimum quality value is 0 and maximum 1), but were comparable with other non-neural clustering results published on the same data sets.

Merkl [13] compared ART with Self-Organizing Maps (SOM) [14] for the clustering of a small collection of documents and concludes that SOM forms better clusters based on a visual qualitative evaluation. We want to avoid
Furthermore, the multiple i era ions required o a ain convergence wi t h SOM are incompatible with a real-time environment.On the other hand, some research concluded that ART text clustering resulted in good quality clusters. Vlajic and Card [15] used a modified ART2 ne work o crea e a hierarchical clust ering of a small number of web pages.They report t hat clust ering was “appropriat e in all cases,when compared to human performance […]”, but provide no quan t i t a t ive resul t . Reference [16] also considers hierarchical clust ering, but wit h ART1 and wit h a small database of document titles. Clustering quality is deemed adequate based on t he percen t age of overlap be t weenclusters and expec t ed classificat ion. The evalua tion measure and t ext collect ion were non-st andard. Finally,Kondadadi and Kozma [17] compare KMART, t heir soft clustering version of ART, t o Fuzzy-ART and k-means.The text data consist of 2000 documents downloaded from the web as well as anot her 2000 newsgroup document s.Quality evaluat ion is based on a one-t o-one mat ch of t he documents in t he clus t ers wi t h t he documen t s in t he specified category. K-MART and Fuzzy-ART results are encouraging wit h above 50% mat ching for 100 t o 500document subsets of the original text collections, while k-means stays in the range of 22-35% matching. However,this evalua ion me t hod is op imis ic since i t does no t account for false positives (i.e. documents that are presentin a clus t er bu t do no t ma t ch documen t s in t he corresponding desired topic).III. E XPERIMENTAL S ETTINGSWe selec t ed t wo well-es t ablished clus t er quali t y evaluation measures: Jaccard (JAC) [18] and Fowlkes-Mallows (FM) [19]:JAC = a / (a+ b+ c)(1)FM = a / ((a+ b)( a+ c))1/2(2)where:a is t he pair-wise number of t rue posit ives, i.e. 
t he total number of document pairs grouped t oget her in the expect ed solut ion and t hat are indeed clust ered together by the clustering algorithm;b is t he pair-wise number of false posit ives, i.e. t he number of document pairs not expected to be grouped together bu t t ha t are clus t ered t oge t her by the clustering algorithm;c is t he pair-wise number of false negat ives, i.e. t he number of document pairs expect ed t o be grouped together bu ha are no clus ered oge her by he clustering algorithm.We also use a measure that computes the F 1 clustering quality value. I t usest he same underlying pair-wise counting procedure as Jaccard and Fowlkes-Mallow o establish a count of false negatives and false positives, but combines t hose values following the F-measure [5]formulae:F22p+r] (3)where p = a/( a + b) is known as the precision and r = a/( aprecision and recall.The "ModAp t e" spli t [20] of t he Reu t er-21578Distribution 1.01 data set is used for our experiments. This data set s is known t o be challenging because of skewedclass distribution, multiple overlapping categories, and itsreal-life origin (Reuter newswires during the year 1987, in chronological order). We evalua t e clus t ering resul t sagainst the desired solution originally specified by Reuter'shuman classifiers. We only use t he desired solu tion information o evalua e clus ering resul s, i.e. af er he clusters have been formed. Reuter is a benchmark data set for TC. Using this data set specific split and the F 1 quality measure makes comparison wit h published TC result s on the same split [21] possible. This is an important and webelieve innovat ive aspect of our experiment al approach:TC F 1 quali y resul s are used as an upper bound for cluster qualit y since learning in a supervised framework with labeled data provides the best possible automated text classification (wit h current t echnology). 
Thus, clustering can be expected to approach this level of quality but not exceed it since it relies solely on the information present in the data itself. This way of evaluating clustering quality allows one to clearly establish the level of quality obtained by a clustering algorithm as a percentage of the upper bound quality.

We use the k-means [22] clustering algorithm to establish a lower bound for quality. Our rationale is that since k-means represents one of the simplest possible approaches to clustering, one would expect that any slightly more advanced algorithm would exceed its clustering quality. The parameter k is set to the number of topics (93) specified by the domain experts who manually organized the Reuter text collection. K-means initial cluster centroids are determined randomly and clustering results are averaged over 10 trials to smooth out extreme values obtained from good and bad random initialization. Our hope is that ART clusters would exceed significantly the quality obtained with k-means and approach the quality of supervised TC.

In this set of experiments, we use the simplest form of ART, binary ART1 in fast learning mode, to establish what should be the baseline level of quality attainable by ART neural networks. We specifically use Moore's ART1 implementation [23]. Due to space constraints, the details of ART1 architecture and algorithm cannot be included here. The vigilance parameter $\rho \in (0, 1]$ determines the level of abstraction at which ART discovers clusters. Moreover, the minimal number of clusters present in the data can be determined by minimal vigilance [24], computed as $\rho_{\min} < 1/N$ where $N$ is the number of features (words) used to represent a document. We chose a value of $\rho_{\min} = 0.0005$ as the initial vigilance parameter and we increment it until we find the best clustering quality.

Footnote 1: Available from /resources/testcollections/reuters21578/
We stop increasing vigilance when more than 200 clusters are obtained because such a large number of clusters would simply result in information overload for a user and therefore not achieve the intended objective of text clustering.

A binary vector-space [25] representation was created for the Reuter ModApté test set. Only the test set was clustered for compatibility reasons with TC results and also because in unsupervised learning, one must assume that a training set is unavailable. A standard stop word list was used to remove frequent words and a simple feature reduction by term selection based on term frequency was applied to reduce the dimensionality of the original document feature space. This approach was judged very effective for TC by [26].

IV. EXPERIMENTAL RESULTS

We eliminated words that appear in 10, 20, 40 and 60 or fewer documents. In the first case, a total of 2282 term features were retained while in the last only 466 were. Our experiments indicated that less radical feature selection not only increased the number of features and consequently processing time, but also resulted in lower quality clusters in some cases (Fig. 1). Best quality is achieved at a vigilance value of 0.05, with 106 clusters, a number close to the expected number of topics specified by the domain experts who labeled the data (93). Vigilance levels past 0.1 result in over 250 clusters, which is not desirable for users as explained previously.

The results shown in Fig. 1 were obtained with a single pass in the data. In reality, ART converges to a stable representation after at most N-1 presentations of the data [27]. By stable, it is meant that if the same document is presented several times to the network, it should be assigned to the same category, and presenting the same inputs over and over should not change the cluster prototype values.
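The binary vector-space representation and the document-frequency term selection described above can be sketched as follows. This is a minimal illustration rather than the preprocessing actually used in the experiments: whitespace tokenization, the function name `binary_vectors`, the parameter `df_threshold` and the three toy documents are all assumptions.

```python
from collections import Counter


def binary_vectors(documents, stop_words=frozenset(), df_threshold=1):
    """Build binary term vectors after stop-word removal and term selection.

    Terms appearing in df_threshold documents or fewer are eliminated,
    mirroring the "appear in 10, 20, 40 and 60 or fewer documents" rule.
    """
    # Each document becomes the set of distinct non-stop-word tokens it contains.
    tokenized = [set(doc.lower().split()) - set(stop_words) for doc in documents]
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for terms in tokenized for term in terms)
    vocabulary = sorted(t for t, n in df.items() if n > df_threshold)
    # 1 if the term is present in the document, 0 otherwise.
    vectors = [[1 if t in terms else 0 for t in vocabulary] for terms in tokenized]
    return vocabulary, vectors


# Hypothetical toy collection.
docs = ["oil prices rise", "oil exports rise", "the grain exports fall"]
vocabulary, vectors = binary_vectors(docs, stop_words={"the"}, df_threshold=1)
# vocabulary == ["exports", "oil", "rise"]
```

Raising `df_threshold` shrinks the vocabulary, which corresponds to the "more radical" feature selection discussed in the results.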
Unstable clusters are problematic since an identical document submitted at different times may end up in different clusters. Furthermore, we show in Fig. 2a that cluster quality increases after ART has stabilized. Only 4 iterations were required to attain a stable representation, which is much less than the theoretical upper bound of N-1. However, in real world, high-volume operations this could still be a problem as little idle time in the system operation may be available to stabilize topics representation.

We have processed the documents in their natural order, i.e. the chronological order in which they have been created and thus the order in which they would be submitted to a classification system. This simulates the real-world environment where there is no control over the order in which documents are created. As with any on-line clustering algorithm, ART gives different results depending on the order of presentation of the data. This is expected since clustering decisions are taken for each sequentially submitted document, compared to batch clustering that considers all data at once. We submitted the data set in 15 different random orders to ART and averaged clustering quality for each order. Fig. 2b shows that other orders of presentation are much worse than the natural, chronological order of the documents in Reuter. This is encouraging because if quality was higher for other orders of presentation, one would face the problem of finding the best order among the very large number of possibilities or design a way to combine results from different orders. It is possible that other text collections face this problem. Maybe it just happens that Reuter natural order is simply compatible with the desired solution, while in other cases this may not happen. After all, there are many ways to organize large text collections.

Fig. 1. (a) More radical term selection (removing words appearing in 60 documents or less) results in better clustering quality in some cases (at vigilance 0.05), compared to removing terms appearing in only 20 documents or less. (b) More radical feature selection also results in much smaller data set dimensionality, which in turn allows for more rapid processing. (c) Vigilance 0.05 finds a number of clusters close to the expected number of 93. Vigilance of 0.1 creates too many clusters for users.

Some versions of ART use a similarity measure claimed to make it less susceptible to order variations [28]. Our initial experiments with [28] showed that document vectors of small magnitude are unable to create enough output activation to pass the vigilance test under the new similarity measure, even with low values of vigilance.

Fig. 2c shows that ART1 cluster quality clearly exceeds the lower bound established by k-means. However, clustering quality achieved by ART with random orders of presentation is comparable to k-means. This implies that if the natural order of presentation does not correspond to the chosen organization of the data, ART1 will not do better than k-means. We now compare ART1 clustering quality to the upper bound expected for cluster quality: the best TC results obtained with Support Vector Machines (SVM) and k-Nearest Neighbors (kNN) published in [21] (Fig. 3). ART1 achieves 51.2% of the TC quality. Comparing to 16.3% level of quality for k-means, the lower bound, ART1 does much better but is still only about half-way to the optimal expected quality. There is also a potential for lower quality with other orders of presentation.
We must however point out that the solution we use to evaluate clusters is merely one among many other useful ways to organize the text collection. Hence, we merely evaluate the ability of ART to recover the specified solution. A user study may better validate the structure discovered by ART, but such studies are costly and also subjective.

We also built a small program that simulates ART cluster prototype updating behavior and attempts to assign each document to its desired topic. We found that 2.2% of documents could not be assigned to their designated topic. Thus, even in the best conditions, perfect clustering is not possible with ART1 with this data set in natural order since right from the start about 2% quality is lost to the order of presentation. This can be explained as follows: a document is assigned to a topic if it has a sufficiently large number of overlapping features, as determined by the vigilance parameter. Furthermore, over time as documents are submitted to the network, the number of active features in prototype vectors decreases, which makes matching a document with its desired topic more unlikely. This is caused by the prototype updating mechanism that intersects documents assigned to a topic with that topic's prototype. So, in our simulation, if no feature of a document overlaps with the prototype for the document's desired topic, it means that the expected solution cannot be satisfied with ART1 and with this order of presentation. Better prototype updating may be required to improve quality, such as [12], which used the union of a document and its prototype features.

Finally, while conducting our experiments, we noticed an interesting phenomenon: clusters do not necessarily form at the specified level of abstraction. In other words, ART sometimes discovers topics generalizing or specializing desired topics.
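The vigilance test and the intersection-based prototype update discussed above can be illustrated with a deliberately simplified, ART1-style clustering loop in Python. This is a sketch under stated assumptions, not Moore's implementation or the simulation program mentioned in the text: documents are represented as sets of active feature indices, candidate clusters are ranked with the commonly cited choice function |I ∧ w| / (β + |w|), and the names `art1_cluster`, `beta` and `max_passes` are hypothetical.

```python
def art1_cluster(documents, vigilance=0.05, beta=1.0, max_passes=10):
    """Cluster binary documents (sets of active features) in an ART1-like way.

    Returns (assignments, prototypes); repeated passes stop once nothing changes.
    """
    prototypes = []                      # one set of active features per cluster
    assignments = [None] * len(documents)
    for _ in range(max_passes):
        changed = False
        for idx, doc in enumerate(documents):
            if not doc:
                continue                 # an empty document cannot pass the vigilance test
            # Rank existing clusters by the choice function |I AND w| / (beta + |w|).
            ranked = sorted(
                range(len(prototypes)),
                key=lambda j: len(doc & prototypes[j]) / (beta + len(prototypes[j])),
                reverse=True,
            )
            winner = None
            for j in ranked:
                # Vigilance test: fraction of the document's features found in the prototype.
                if len(doc & prototypes[j]) / len(doc) >= vigilance:
                    winner = j
                    break
            if winner is None:
                prototypes.append(set(doc))          # create a new cluster
                winner = len(prototypes) - 1
                changed = True
            else:
                updated = prototypes[winner] & doc   # fast learning: intersection
                if updated != prototypes[winner]:
                    prototypes[winner] = updated
                    changed = True
            if assignments[idx] != winner:
                assignments[idx] = winner
                changed = True
        if not changed:
            break                        # stable: another pass would change nothing
    return assignments, prototypes


# Hypothetical toy documents given as sets of active feature indices.
docs = [{1, 2, 3}, {1, 2, 4}, {7, 8}, {7, 8, 9}]
assignments, prototypes = art1_cluster(docs, vigilance=0.5)
```

In this toy run the first prototype shrinks from {1, 2, 3} to {1, 2} once the second document is absorbed, which is the loss of active features that makes later matches harder. Replacing the intersection with a union would give the alternative update attributed to [12].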
A generalization is a cluster that includes two or more classes (a class being a group of documents representing a desired topic). A specialization, on the other hand, is a class that includes two or more clusters. In a sense, such behavior can be expected from clustering algorithms since they rely solely on similarity among data items rather than on directions by a domain expert. Fig. 4 shows a portion of the confusion table built from the matches between clusters and classes to illustrate generalizations and specializations. A match is the number of documents a class and a cluster have in common. We note from Fig. 4 that:
- cluster 0 is dominated by 169 documents from class 25 (topic earnings) out of a total of 1087 (16%) documents expected to belong to that class;
- cluster 1 is dominated by 741 documents from class 25 (earnings) out of a total possible of 1087 (68%);
- cluster 3 is dominated by 598 documents from class 16 (acquisitions) out of a total possible of 719 (83%), but there is also a strong presence from class 0 (trade), class 1 (grain), class 2 (crude), class 11 (shipping) and class 18 (interest rate).

Hence, one can consider that clusters 0 and 1 actually correspond to class 25 (earnings) with 910 of the 1087 documents, so topic earnings is specialized.

Fig. 2. All results shown for vigilance 0.05. (a) Stabilization improves ART clustering quality. (b) Random orders of presentation, even when stabilized, give much worse clusters than the natural order of the documents in Reuter. (c) ART clusters (in natural order, stabilized) are of better quality than k-means (k = 93).

Fig. 3. ART1 clustering F1 with stabilization at vigilance 0.05 for data in its natural order (ART1 Nat). For both TC methods (SVM and k-NN), the micro-averaged F1 value is used for compatibility with our F1pair measure.
Cluster 3 corresponds to class 16 (acquisitions), class 0 (trade), class 1 (grain), class 2 (crude), class 11 (shipping), and class 18 (interest rate) for a total of 1106 of 1395 documents. So all these classes are generalized by cluster 3. Other class-cluster matches may be deemed to simply lower generalization and specialization quality. We designed a quality evaluation methodology that, contrary to existing clustering evaluation measures, does not penalize generalizations and specializations. For each class, it looks for all clusters that match the class. This allows for the discovery of specializations. Then, for each of the matching clusters, it also looks for all matching classes, which accounts for generalizations. Extraneous documents in the latter case are the clustering errors, and an F1 value is computed based on these errors. Fig. 5 shows quality evaluation with this measure, which we call Sub-Graph Dispersion (SGD) because a sub-graph is created for each class when looking for matches. If one considers generalizations and specializations of the expected solution as acceptable, higher quality can be computed at lower vigilance, but the quality of generalizations and specializations decreases as vigilance increases. This method of evaluation needs to be refined before final conclusions on the true impact of learning at different levels of abstraction on clustering quality can be drawn. For instance, all matches are currently considered, but some negatively affect quality and should not be included as being part of generalizations or specializations.

V. CONCLUSIONS AND FUTURE WORK

Text clustering work conducted with ART up to now has used many different forms of ART-based architectures, as well as different and non-comparable text collections and evaluation methods. This situation resulted in confusion as to the level of clustering quality achievable with ART.
As a first step towards resolving this situation, we have tested a simple ART1 network implementation and evaluated its text clustering quality on the benchmark Reuter data set and with the standard F1 measure. K-means clustering quality was used as the lower bound on quality, while published results with supervised TC were used as an upper bound on quality. Our experiments have demonstrated that text clusters formed by ART1 achieve 51% of the TC upper bound and exceed the lower bound considerably. Consequently, about half of the evidence needed to recover the expected document organization solution is available directly from the data under the form of inter-document similarity rather than from costly and time-consuming handcrafting of a large labeled training data set. Whether this level of quality is sufficient is a task-specific question and ultimately a matter of cost/quality trade-off: it is a choice between higher quality supervised document categorization obtained at high development and maintenance cost versus lower quality clusters obtained at basically no cost. At least we provide here a clear picture of the quality aspect by establishing the baseline quality to be expected with ART. Although ART clusters were of medium quality, ART has the unique advantage of proceeding entirely without human intervention, plus offers the interesting properties of plasticity and stability. Should novelty be detected by the network, a new topic would automatically be created as part of normal system operation. This contrasts with supervised TC, which would require downtime for re-training and related human intervention to prepare a new training set. Therefore, despite lower quality, there may be some situations where ART-based text clustering is a necessity. Furthermore, clustering quality can be increased if one considers discovery of topics at other levels of abstraction as acceptable.
Hence, an important area of future research is to explore evaluation measures that do not penalize specializations and generalizations. Moreover, more advanced ART architectures with non-binary representation (such as ART2 [29], fuzzy ART [30], MacLeod's ART [12] and FOSART [31]) and better feature selection may improve cluster quality. We are currently investigating these avenues. As well, in the Reuter collection, topics are not mutually exclusive, while ART1 clustering is. ART-based soft clustering such as with KMART [17] will be explored as yet another possible way to improve clustering. We are also exploring issues related to lower quality with other orders of presentation and the impact of time requirements for stabilization in a real-life environment.

Comparison with other clustering methods has not been our objective. We rather focused on establishing the level of quality achieved by ART within the range defined by a lower and an upper bound. We believe this gives a better appreciation of the level of quality by setting it in a wider framework. Some investigators have evaluated clustering quality with other algorithms on Reuter-21578 and with the F1 measure, but have used non-standard splits [7, 32]. So our results cannot be compared directly with theirs. We plan to eventually evaluate other clustering methodologies to compare their clustering quality to ART's. As well, testing on other text collections is needed to verify if quality results apply to document sets displaying various characteristics.

Fig. 4. A shortened version of the confusion table: computed clusters 0-3 on rows and desired classes 0-25 on columns. The first row shows the class number. Highlighted values indicate best match.

Fig. 5. Increased quality is computed by SGD at lower vigilance by not penalizing generalizations and specializations. Stabilized results shown.

REFERENCES

[1] A.K. Jain, M.N. Murty and P.J. Flynn. “Data clustering: A review”, ACM Computing Surveys, Vol. 31, No. 3, Sept 1999.
[2] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, 1990.
[3] S. Grossberg. “Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors”, Biological Cybernetics, Vol 23, 1976, pp 121-134.
[4] G.A. Carpenter and S. Grossberg. “Adaptive Resonance Theory (ART)”. In: Handbook of Brain Theory and Neural Networks, Ed.: Arbib M.A., MIT Press, 1995.
[5] C.J. Van Rijsbergen. Information Retrieval. London: Butterworths, 1979.
[6] F. Sebastiani. “Machine learning in automated text categorization”, ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
[7] M. Steinbach, G. Karypis and V. Kumar. “A comparison of document clustering techniques”, In: Proc. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2000, Boston, MA, USA.
[8] D. Cutting, D. Karger, J. Pedersen and J. Tukey. “Scatter-gather: A cluster-based approach to browsing large document collections”, In: Proceedings of SIGIR'92, 1992.
[9] T. Kohonen, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero and A. Saarela. “Self organization of a document collection”, IEEE Transactions on Neural Networks, Vol 11, No. 3, May 2000.
[10] G.A. Carpenter, S. Grossberg and J.H. Reynolds. “Artmap: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network”. Neural Networks, 4:565-588, 1991.
[11] V. Petridis, V.G. Kaburlasos, P. Fragkou and A. Kehagias. “Text classification using the sigma-FLNMAP neural network”. In: Proc. of IJCNN 2001, Washington, July 2001, p1362.
[12] K.J. MacLeod and W. Robertson. “A neural algorithm for document clustering”. Information Processing & Management, 27(4):337-346, 1991.
[13] D. Merkl. “Content-based software classification by self-organization.” In: Proc. of the IEEE Int'l Conference on Neural Networks (ICNN'95). Perth, Australia. Nov 27 - Dec 1, 1995. pp 1086-1091.
[14] T. Kohonen. Self-Organizing Maps, Springer series in information science, 3rd ed., 2001.
[15] N. Vlajic and H.-C. Card. “Categorizing web pages using modified ART”. In: Proceedings of IEEE 1998 Canadian Conference on Electrical and Computer Engineering, 1998.
[16] L. Massey. “Structure discovery in text collections”, In: Proc. of the Sixth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'2002), September 2002, Italy.
[17] R. Kondadadi and R. Kozma. “A modified fuzzy art for soft document clustering”, In: Proc. International Joint Conference on Neural Networks, Honolulu, HI, May 2002.
[18] M. Downton and T. Brennan. “Comparing classifications: an evaluation of several coefficients of partition agreement”, In: Proc. Meeting of the Classification Soc., Boulder, CO, June 1980.
[19] E. Fowlkes and C. Mallows. “A method for comparing two hierarchical clusterings”. Journal of American Statistical Association, 78, pp. 553-569, 1983.
[20] C. Apte, F. Damerau and S.M. Weiss. “Automated learning of decision rules for text categorization”, ACM Transactions on Information Systems, 12(2):233-251, 1994.
[21] Y. Yang and X. Liu. “A re-examination of text categorization methods”. In: Proc. of Int'l ACM Conference on Research and Development in Information Retrieval (SIGIR-99), pages 42-49, 1999.
[22] J. MacQueen. “Some methods for classification and analysis of multivariate observations”, In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Vol 1, Statistics. Edited by L.M. Le Cam and J. Neyman. Univ of California Press, 1967.
[23] B. Moore. “ART and pattern clustering”, In: Proceedings of the 1988 Connectionist Models Summer School, pp. 174-183, 1988.
[24] L. Massey. “Determination of clustering tendency with art neural networks”, In: Proc. of Recent Advances in Soft-Computing (RASC02), Nottingham, UK, Dec. 2002.
[25] G. Salton and M.E. Lesk. “Computer evaluation of indexing and text processing”. Journal of the ACM, Vol 15, no. 1, pp 8-36, January 1968.
[26] Y. Yang and J.O. Pedersen. “A comparative study on feature selection in text categorization”. In: Proc. of ICML-97, 14th International Conf. on Machine Learning (Nashville, USA), pp. 412-420, 1997.
[27] M. Georgiopoulos, G.L. Heileman and J. Huang. “Convergence properties of learning in ART1”. Neural Computation, 2(4):502-509, 1990.
[28] R. Sadananda and G.R.M. Sudhakara Rao. “ART1: model algorithm characterization and alternative similarity metric for the novelty detector”, In: Proc. IEEE International Conference on Neural Networks, Vol. 5, pp. 2421-2425, 1995.
[29] G.A. Carpenter, S. Grossberg and D.B. Rosen. “Art 2-a: An adaptive resonance algorithm for rapid category learning and recognition”. Neural Networks, 4:493-504, 1991.
[30] G.A. Carpenter, S. Grossberg and D.B. Rosen. “Fuzzy art: Fast stable learning and categorization of analog patterns by an adaptive resonance system”. Neural Networks, 4:759-771, 1991.
[31] A. Baraldi and E. Alpaydin. “Constructive feedforward art clustering networks – Part II”. IEEE Transactions on Neural Networks, Vol. 13, No. 3, May 2002.
[32] B. Larsen and C. Aone. “Fast and effective text mining using linear-time document clustering”, In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999, pp. 16-22.
Luo Lisheng - Academic Comprehensive English, Unit 5
Part A Part B
Pre-listening
Background Information
Listening
New Words and Expressions
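anchor: host of a radio or television program
cloaked in stigma and shame: shrouded in stigma and shame
headstone: gravestone
chant: to shout or say something repeatedly in a monotone
traumatic: causing trauma
reach out: to extend a hand (to seek or offer help)
secured a spot on the Western Kentucky team as a walk-on: earned a place on the Western Kentucky team as a reserve player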
Part A Part B
Pre-listening
Task 1 Listen and Take Notes
Listening
Task 2 Listen for Details
5. What are students encouraged to do when faced with troubles?
To seek counseling and reach out for help.
6. What is the main purpose of this TV show?
To save precious lives.
give up eating for fear of choking
Suicide on Campus The TV program Early Show has invited a correspondent, some parents whose sons have committed suicide, a psychiatrist, and others to the studio to talk about campus suicides with the ultimate purpose of alarming people and saving lives.
• $\beta$ is a factor (in $F_\beta$) indicating the relative importance of precision and recall
• When $\beta = 1$, both have equal weight: $F_1 = \frac{2PR}{P + R}$ (harmonic mean)
• The $F_2$-measure puts more emphasis on recall
• The $F_{0.5}$-measure puts more emphasis on precision
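The effect of $\beta$ can be checked numerically with a small sketch. It assumes the standard definition $F_\beta = (\beta^2 + 1)PR / (\beta^2 P + R)$; the function name `f_beta` and the sample precision/recall values are illustrative only.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); returns 0.0 if undefined."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# A classifier with low precision but perfect recall (illustrative numbers).
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"F_{beta}: {f_beta(p, r, beta):.4f}")
```

Because recall is the stronger of the two values here, the score rises as $\beta$ grows: larger $\beta$ weights recall more heavily, smaller $\beta$ weights precision more heavily.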
• Popular measures include:
– Minkowski distance:
$$d(t_i, t_j) = \sqrt[N]{|a_{i1} - a_{j1}|^N + |a_{i2} - a_{j2}|^N + \dots + |a_{ip} - a_{jp}|^N}$$
where $t_i = (a_{i1}, a_{i2}, \dots, a_{ip})$ and $t_j = (a_{j1}, a_{j2}, \dots, a_{jp})$ are two p-dimensional data objects, and N is a positive integer
– Manhattan (or city block) distance: the case N = 1 of the Minkowski distance
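A minimal sketch of the Minkowski distance follows, with the Manhattan (N = 1) and Euclidean (N = 2) cases obtained by changing one argument. The function name `minkowski` and the sample points are illustrative, not part of the slides.

```python
def minkowski(t_i, t_j, n=2):
    """Minkowski distance of order n between two p-dimensional objects."""
    if len(t_i) != len(t_j):
        raise ValueError("objects must have the same number of attributes")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return sum(abs(a - b) ** n for a, b in zip(t_i, t_j)) ** (1.0 / n)


t1, t2 = (0, 0), (3, 4)
print(minkowski(t1, t2, n=1))  # Manhattan (city block) distance
print(minkowski(t1, t2, n=2))  # Euclidean distance
```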
Problems with Supervised Training
• Lack of sufficient annotated training data
– Mostly multiple labels (NER, OrgN, PlcN, Time, …)
– Different classes may have similar linguistic variations of expressions
• Distance/dissimilarity matrix
– object-by-object structure
– n objects here!
– Range is [0,1] if normalized
$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \dots & \dots & 0
\end{bmatrix}$$
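The object-by-object structure can be sketched as follows. The Manhattan distance is used only as an example metric, and normalizing by the largest entry is one assumed way to obtain the [0,1] range; the slides do not say which normalization is intended. All names are illustrative.

```python
def manhattan(t_i, t_j):
    """City block distance between two objects with the same attributes."""
    return sum(abs(a - b) for a, b in zip(t_i, t_j))


def dissimilarity_matrix(objects, dist=manhattan, normalize=False):
    """Symmetric n-by-n matrix with d[i][j] = dist(objects[i], objects[j])."""
    n = len(objects)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # compute the lower triangle, mirror to the upper one
            d[i][j] = d[j][i] = float(dist(objects[i], objects[j]))
    if normalize:
        largest = max((max(row) for row in d), default=0.0)
        if largest > 0:
            # Assumption: divide by the largest distance to map entries into [0, 1].
            d = [[value / largest for value in row] for row in d]
    return d


points = [(0, 0), (3, 4), (6, 8)]
for row in dissimilarity_matrix(points, normalize=True):
    print(row)
```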
Macro/Micro Averaged Precision
• Macro-averaging gives each class equal weight; micro-averaging takes the sample size of each class into account
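The difference between the two averages shows up clearly on an imbalanced example. The sketch below assumes per-class counts of true positives and false positives; the class names and counts are made up for illustration.

```python
def macro_precision(counts):
    """Average of per-class precision; every class counts equally."""
    per_class = [tp / (tp + fp) if (tp + fp) else 0.0 for tp, fp in counts.values()]
    return sum(per_class) / len(per_class)


def micro_precision(counts):
    """Precision over pooled decisions; large classes dominate."""
    total_tp = sum(tp for tp, _ in counts.values())
    total_predicted = sum(tp + fp for tp, fp in counts.values())
    return total_tp / total_predicted if total_predicted else 0.0


# (true positives, false positives) per class; hypothetical numbers.
counts = {"large_class": (90, 10), "small_class": (1, 9)}
print(f"macro: {macro_precision(counts):.4f}")
print(f"micro: {micro_precision(counts):.4f}")
```

The small class performs poorly, which pulls the macro average down, while the micro average stays close to the precision of the large class.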
Other Measures
• F-Measure and F1-Measure
$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$
– $d(t_i, t_j) \ge 0$
– $d(t_i, t_i) = 0$
– $d(t_i, t_j) = d(t_j, t_i)$
– $d(t_i, t_j) \le d(t_i, t_k) + d(t_k, t_j)$
Remember this? $x^2 + y^2 = z^2$
• Distance can be given different weights on different attributes
– Consider the importance of different attributes
– Can be assigned subjectively or by experiments
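One common way to apply attribute weights is to multiply each term of the Minkowski sum by a non-negative weight. The slides do not give a formula, so the weighted form below, the function name `weighted_minkowski` and the example weights are assumptions for illustration.

```python
def weighted_minkowski(t_i, t_j, weights, n=2):
    """Assumed weighted form: (sum_k w_k * |a_ik - a_jk|^n)^(1/n)."""
    if not (len(t_i) == len(t_j) == len(weights)):
        raise ValueError("objects and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(w * abs(a - b) ** n for w, a, b in zip(weights, t_i, t_j))
    return total ** (1.0 / n)


t1, t2 = (0, 0), (3, 4)
print(weighted_minkowski(t1, t2, weights=(1, 1)))  # equal weights: plain Euclidean
print(weighted_minkowski(t1, t2, weights=(1, 0)))  # second attribute ignored
```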
Numeric Distance Measures
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• $d_{ij} = 1 - s_{ij}$, with $0 \le d_{ij} \le 1$
Validations
• n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets
• Use each subset as the test set and combine the remaining n-1 subsets as the training set to learn a classifier
• The procedure is run n times, which gives n accuracies
• The final estimated accuracy of learning is the average of the n accuracies
• 10-fold and 5-fold cross-validations are commonly used
• This method is used when the available data is not large
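The procedure above can be sketched in a few lines. The helper names `n_fold_splits` and `cross_validate`, and the `train_and_score` callback, are hypothetical; the callback stands in for whatever classifier is being evaluated and is expected to return an accuracy for one train/test split.

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


def cross_validate(n_samples, n_folds, train_and_score):
    """Average the accuracy returned by train_and_score over all folds."""
    scores = [train_and_score(train, test)
              for train, test in n_fold_splits(n_samples, n_folds)]
    return sum(scores) / len(scores)


for train, test in n_fold_splits(n_samples=10, n_folds=5):
    print("train:", train, "test:", test)
```

Setting `n_folds` equal to `n_samples` gives one test example per fold, which is the leave-one-out special case.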
Clustering
• Dissimilarity/similarity metric: similarity function d(ti, tj)
• There is a separate “quality” (cost) function that measures the “goodness” of a cluster (as you may get different sets of clusters)
• The definitions of distance functions can be very different for interval-scaled, Boolean (binary), categorical, and ordinal variables (ordered categorical data)
• Weights should be associated with different variables based on applications and data semantics
• It is hard to define “similar enough” or “good enough”
Division of Data for Validation
• Validation set: the available data is divided into three subsets,
– a training set,
– a validation set and
– a test set
Other validations
• Leave-one-out cross-validation: a special case of cross-validation, used when the data set is very small
– Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training
– Manhattan distance (N = 1):
$$d(t_i, t_j) = |a_{i1} - a_{j1}| + |a_{i2} - a_{j2}| + \dots + |a_{ip} - a_{jp}|$$
Numeric Distance Measures(2)
– Euclidean distance: the case N = 2 of the Minkowski distance:
$$d(t_i, t_j) = \sqrt{|a_{i1} - a_{j1}|^2 + |a_{i2} - a_{j2}|^2 + \dots + |a_{ip} - a_{jp}|^2}$$
• Properties
• Only grouping is produced; semantic labels are derived by humans after the fact
Semi-Supervised Methods
• Make use of limited annotated resources
– Weakly supervised/bootstrapping methods
• Expansion: the training set is iteratively expanded with similar examples
• Self-training: the examples chosen for the next training step are those to which the current classifier assigns labels with most certainty; labels of already-trained data may change
• Co-training: the examples chosen for the next training step are those to which two or more current classifiers, each using an independent feature set, assign labels with most certainty
• Active learning: examples are labeled by humans, but the machine picks the examples
• “George has done an excellent job”
• vs. “Microsoft has done an excellent job”
• Features that trigger a semantic pattern have limited examples
• Large quantities of unlabeled data are available
– Clustering for completely untrained data
Lecture 4: Evaluation for Classification and Semi(un)-supervised Learning
Qin Lu HITSZ
Intrinsic Evaluations
• For a classifier of $C_i$