Generalizing wrapper induction across multiple Web sites using named entities and post-proc


Intelligent Rollups in Multidimensional OLAP Data


Intelligent Rollups in Multidimensional OLAP Data

Gayatri Sathe, Sunita Sarawagi
Indian Institute of Technology Bombay, India
gayatri,sunita@it.iitb.ac.in

Abstract

In this paper we propose a new operator for advanced exploration of large multidimensional databases. The proposed operator can automatically generalize from a specific problem case in detailed data and return the broadest context in which the problem occurs. Such a functionality would be useful to an analyst who, after observing a problem case, say a drop in sales for a product in a store, would like to find the exact scope of the problem. With existing tools he would have to manually search around the problem tuple trying to draw a pattern. This process is both tedious and imprecise. Our proposed operator can automate these manual steps and return in a single step a compact and easy-to-interpret summary of all possible maximal generalizations along various roll-up paths around the case. We present a flexible cost-based framework that can generalize various kinds of behaviour (not simply drops) while requiring little additional customization from the user. We design an algorithm that can work efficiently on large multidimensional hierarchical data cubes so as to be usable in an interactive setting.

1 Introduction

In this paper we propose a new operator called RELAX for automatically generalizing the scope of a specific problem cell of a large multidimensional database. Multidimensional database products were commercially popularized as Online Analytical Processing (OLAP) [Cod93, CD97] systems for helping analysts do decision support on large historical data. They expose a multidimensional view of the data with categorical attributes like Products and Stores forming the dimensions and numeric attributes like Sales and Revenue forming the measures or cells of the multidimensional cube. Dimensions are usually organized into levels of hierarchy, as in Figure 1.

Figure 1: Dimensions and hierarchies of the software revenue data used in Scenario 1; the number in brackets indicates the size of that level of the dimension. Product name (67) → Prod_Category (14) → Prod_Group (3); Platform name (43) → Plat_Type (6) → Plat_User (2); Geography (4); Year (5).

Figure 2: A problematic drop in revenue from 1993 to 1994 observed for Product = "HRM/PayRoll," Geography = "United States" and Platform = "Single-User Other."

Figure 3: Output of the RELAX operator for the problem case marked in Figure 2. The bottom table shows details of the summarized exception E2.1. This table is not part of the result.

The analyst who observes the drop of Figure 2 would want to know whether the same Product-Platform pair had problems in other Geographies besides "United States," whether the same Product had problems in other Platforms besides "Single-User Other," and so on. To answer such questions he could explore further around the problem case to find a pattern. He has to view this case in succession in the context of other Geographies, the three levels of hierarchies of the Platform and Product dimensions, and then further outward for combinations of two or more dimensions and hierarchies. In each case, he needs to check if one or more of them had a similar drop and explore further out trying to find a pattern. With existing tools, this has to be done manually by performing a series of roll-ups and drill-downs along different combinations of dimensions. This process can get tedious even for this small dataset. Searching in larger company datasets can get even more daunting.

We propose to automate this search through the RELAX operator. The result of the operator as shown in Figure 3 is a set of two maximal generalizations G1 and G2. In the figure, the
first row shows the problem case. The first generalization G1 starts from the second row. The "*" on the columns represents the dimensions that can be generalized. Thus G1 shows that we can generalize simultaneously along the Product and Geography dimensions for the same Platform: every Product of ProdCategory "Cross Industry Apps," for every Geography and for Platform "Single-User Other," had a drop from 1993 to 1994. The last column "Count" shows the number of cases which conform to the generalization. G1 includes 25 tuples around the problem case. The next three rows show cases that violate the generalization. We call these exceptions, as they are subsumed by the generalization but did not have a drop. For example, for exception E1.1 sales increased from 0.3 in 1993 to 2.0 in 1994 for the Product "Other Office Apps" and Geography "Rest of World." The second generalization G2 shows that we can generalize along the Product dimension up to two levels of hierarchy for the same Geography and Platform as the problem case, subsuming a total of 13 rows. The next three rows marked E2.1 through E2.3 show exceptions to this generalization where sales increased from 1993 to 1994. The first exception summarizes all three Products under Category "Home software" that had an increase from 1993 to 1994. This is indicated by the "*" in the Product column. Below the result table we show the three rows subsumed by this summarized exception. Such summarizations provide a significant reduction in the amount of data that the user has to inspect.

1.1.2 Scenario 2

In the above example, we generalized Boolean relationships — that is, those based simply on whether the value in one cell was less or greater than another. We next consider a more involved generalization based on whether two values have the same ratio. We consider another dataset, our university's student enrollment data from 1989 to 1998. As shown in the figure below, the data consists of five dimensions: Student category, Gender, Program with a two-level hierarchy, Department and Year. The measure is the number of students enrolled.

Figure: Dimensions of the enrollment data — Student Category (9), Gender (2), Program (10) with hierarchy level ProgCat (3), Dept (28), Year (10).

Suppose a new manager hired in 1996 to administer enrollment of "MTechs" in the "Self finance" category observes that the fraction of females is significantly lower than that of males. He would like to analyze whether this case also held for other Years and for other Categories of students in other Programs, or was peculiar to his particular case.
Using the university's enrollment cube, he could start from his case of interest as shown in Figure 4 and explore around this case manually. A better alternative is to invoke the RELAX operator to generalize as long as the ratio is close to the factor of 10 that he observed in his case. The result of the operator as shown in Figure 5 consists of a broad generalization covering the Category, Program and Year dimensions and including a total of 79 tuples. The next few rows list exceptions to this generalization. E1.1 states that in "1990" the ratio was 22, which is significantly higher than claimed by G1. E1.1.1 is an exception to this exception where the ratio was 4 instead of 22 for Category "Indian," Program "M.Des" and Year "1990." The last three rows show exceptions at various levels of summarization where the ratio was less than 1. This summary gives the new analyst a solid impression of the trends in the university.

Figure 4: Number of females and males enrolled for Program "M.Tech" in Category "Self finance" in 1996.

Figure 5: Generalization for the problem case marked in Figure 4.

1.1.3 Scenario 3

We now consider a scenario where an analyst wants to generalize a trend involving multiple measures. Suppose an analyst observes a steady increase in revenue from 1990 to 1994 for Product "Project Management," Platform "Single-user MAC OS" and Geography "Rest of World" (as shown in Figure 6). He is interested in knowing whether the scope of this increasing trend extends to other Geographies, Products or Platforms. The result of invoking the RELAX operator is shown in Figure 7. Here we see that the increasing trend generalizes to all Products and Geographies for Platform "Single-user MAC OS." Also listed are the exceptions to this generalization. For example, for exception E1.1 the revenue keeps on dropping after 1992.

Outline

In Section 2 we present a flexible framework for expressing all the above three kinds of generalizations and many more, while requiring little customization when adapting to the various forms. We present a flexible cost-based formulation that can generalize myriad forms of relationships and present a report that is compact and easy to interpret. In Section 3 we present our algorithm that can work efficiently on large multidimensional hierarchical data cubes. The algorithm exploits the OLAP engine for preliminary filtering and for reducing the amount of data read into the application. Experiments on large OLAP benchmarks show the feasibility of deploying the operator in an interactive setting. These are described in Section 4.
In Section 5 we discuss other related work done in the direction of integrating mining operations with OLAP. Finally we present conclusions and future work in Section 6.

Figure 6: Increasing revenues along Time for Product "Project Management," Geography "Rest of World" and Platform "Single-user MAC OS."

Figure 7: Generalization of the problem case marked in Figure 6.

2 Problem Formulation

In this section we present a formulation of the RELAX operator. Our goal is to provide a unifying framework for expressing several kinds of generalizations. The challenge is in designing a framework that requires as little additional work as possible when plugging in different kinds of generalizations. Also, the formulation should lead to compact, easy-to-comprehend reports.

The user invokes the operator by specifying a detailed tuple and a property of that tuple that he wants to generalize. The tuple has constant values along some subsets of dimensions. For example, in Figure 2 the tuple has constant values along three dimensions: Product = "HRM/PayRoll," Geography = "United States" and Platform = "Single-User Other." We claim that generalization is possible along a dimension if most rows obtained by replacing the constant value with other members of that dimension satisfy the property closely. Similarly, for generalization along two dimensions we need to check against all tuples obtained by replacing the two constants by the cross product of the different member values along the two dimensions, and so on for multiple dimensions and hierarchies. Our goal is to report all possible consistent and maximal generalizations. A generalization along a set of dimensions is consistent if all subsets of these dimensions also generalize, and maximal if no super-set of these dimensions can yield consistent generalizations.

We next precisely formulate how to define a generalization. Three issues arise when attempting this definition. First, how does a user specify the property to be generalized? We discuss this in Section 2.1. Second, what is the criterion for generalization along a dimension, that is, how many tuples need to satisfy the property and to what extent before we can generalize them? We discuss this in Section 2.2. Finally, how can we improve generalization accuracy by listing a few violating tuples as exceptions? We discuss this in Section 2.3.

2.1 Generalization property

We need a unified mechanism for specifying various different types of properties. Examples are: sales in the current year is less than sales in the previous year, or, profit is 20% of revenue. One option is to specify a predicate that is true when a tuple satisfies the property and false otherwise.
This formulation is coarse grained — it does not recognize the fact that different tuples could satisfy a property to different degrees. This is particularly limiting for multiplicative properties like "profit is 20% of revenue," where adjacent tuples will rarely follow the exact "20%" ratio. We therefore formulate the property as a function that returns a real value measuring how closely a tuple conforms to the generalization property. This value is called the generalization error; it is zero whenever a tuple is very close to the specific tuple and increases as the tuple gets further away from the generalization property of the specific tuple.

2.2 Generalization criteria

Given the error function, when can we claim that it is possible to generalize along a dimension? Clearly, we can generalize when the error of all tuples along the dimension is zero. Often, however, errors will be non-zero and varying. One way is to ask the user to specify a threshold and we generalize as long as all tuples have error less than that threshold. There are two problems with this. First, it is often hard for a user to specify an absolute threshold. Second, in real-life cases we can rarely find generalizations where all tuples satisfy the error condition without making the threshold so large that the relaxation becomes uninteresting.

We remove the need for a threshold by associating a penalty for excluding tuples which are similar to the specific tuple but not included in the generalization around it. We require our generalization to be maximal, i.e., it should not be possible to expand out further from the reported generalization. Therefore, each generalization biases a user towards thinking that the tuples just outside the generalization are very different from the specific problem tuple. Accordingly, we define a penalty function that is large when a tuple is close to the specific tuple and rapidly diminishes towards zero as the difference between them increases. This behavior is the opposite of that of the error function. We allow a generalization whenever the sum of the penalties exceeds the sum of the errors over all tuples in it.

While the user can choose any penalty function he wants, we propose the following method that is derived from the error function and requires only a little additional work. The user specifies the least deviant example tuple, i.e., the tuple closest to the specific tuple outside the generalization, that he thinks does not generalize the problem case. This might be an existing tuple in the cube or a made-up tuple with hypothetical measures. We evaluate its error as a measure of its deviance. The user implicitly assumes that tuples not included in a generalization are more deviant than this example. Thus the penalty is zero for tuples at least as deviant as the example, while less deviant tuples pay a penalty; the resulting penalty function can be expressed as in Equation (1).
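In notation introduced here only for concreteness (E(t) for the generalization error of a tuple t with respect to the specific tuple, R(t) for the penalty, G for a candidate generalization, and t_d for the user-supplied least deviant excluded tuple), one plausible reading of this criterion and of Equation (1) is the following sketch, not a statement of the paper's exact formula:

\[
\sum_{t \in G} R(t) \;>\; \sum_{t \in G} E(t) \qquad \text{(allow the generalization } G\text{)}
\]
\[
R(t) \;=\; \max\bigl(E(t_d) - E(t),\; 0\bigr) \qquad \text{(one reading of Equation (1))}
\]

Under this reading, R(t) is large for tuples that conform closely to the property (small E(t)) and drops to zero once a tuple is at least as deviant as t_d, which matches the behaviour described above.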
2.3 Exceptions to generalizations

Often we might find that all but a few members of a generalization closely satisfy the property. We improve accuracy by explicitly listing such violating values as exceptions. For example, in Figure 3 there are three exceptions E1.1 through E1.3 to the first generalization G1. We want to report the exceptions as compactly as possible. We do so by grouping together exceptions that are similar with respect to the property being generalized.

Each group of similar exceptions is represented in the final answer with just a single tuple that is most representative of the group. To determine what set of tuples could be grouped together we need a method for determining the similarity of two tuples. We propose to use the same error function, modified to take two tuples as arguments; it indicates the degree to which the property of the first tuple is satisfied by the second. The one-argument form then corresponds to the old function where the error is measured with respect to the specific tuple. If previously the error was defined as whether the change in sales from 1993 to 1994 is negative, the new error would be whether a tuple's change in sales from 1993 to 1994 has the same sign as that of the reference tuple. This redefinition allows us to express the error in summarizing related exceptions in the same functional form as the error in generalizing tuples. The error of a summarization is measured as the sum, over the members of the group, of each member's error with respect to the representative tuple of the group.

We cannot group together arbitrary tuples — only those that can be represented by a single tuple with some dimension value set to "*" to denote its members. For example, in Figure 3 E2.1 has the Product dimension set to "*", indicating that its members correspond to all possible values of the Product dimension. Sometimes, a group might contain a few members that are significantly different from the rest.
We allow such members to be listed explicitly as exceptions within their group. Thus a group of tuples listed as exceptions to a generalization might in turn have other nested exceptions. For example, in Figure 5 E1.1.1 is an exception to its group E1.1. In general this nesting can be any number of levels deep.

The next issue that arises is how many such exceptions we are allowed to return. Clearly, there is a trade-off between the answer size and the error. If there is no limit on the answer size, we can achieve zero error by returning all possible exceptions. In practice, a user might be happier with a less accurate but more compact answer. We allow the user to specify a loose upper bound on the maximum size of the answer that he would like to observe. This limit does not imply that all exceptions will have exactly that size; it just specifies to the system the maximum size that the user is willing to inspect. In practice this limit will be set by considerations such as how many rows can be simultaneously eyed on a screen and so on. Our goal then is to find the solution with the smallest error given the limit on the answer size. We can find the total error of the answer as follows.

Summarization error: For each tuple covered by the generalization, if it is explicitly listed as an exception its error is zero. Otherwise, find the closest representative tuple in the answer that subsumes it and add its error with respect to that representative. The representative could be either the outermost generalization (for which the representative measures are those of the specific tuple) or one of the summarized exception rows.

2.4 Final formulation

The final formulation is as follows. The user specifies a specific tuple, an upper bound on the size of the exceptions and two error functions: one that measures the error of including a tuple in a generalization around the specific tuple and one that measures the penalty of excluding it from the generalization. The penalty can be specified either explicitly or implicitly using the construction of Equation 1 after specifying the closest deviant tuple to be excluded from the generalization. The goal of the system is to return all possible maximal and consistent generalizations around the specific tuple. We define a generalization as an aggregation around the specific tuple for which the criterion of Section 2.2 holds. For each generalization we are allowed a bounded number of rows within which to report its exceptions such that the total error as calculated in Section 2.3 is minimized.

2.5 Example Scenarios

We now consider two example scenarios in this general framework.

2.5.1 Boolean generalizations

In this case, each tuple is associated with two measures, say the 1993 and 1994 sales, as illustrated in Scenario 1 of Section 1.1.1. The error of a tuple is defined to be zero if the sign of its change matches the sign of the change of the specific tuple, and one otherwise (Equation 2). The penalty function is the complement: one on a match and zero on a mismatch (Equation 3). This implies that as long as the number of mismatches is less than the number of matches, we generalize. Alternately, a user can specify a scaled penalty function, which implies that we generalize as long as the number of mismatches is less than a chosen multiple of the number of matches.
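As a concrete illustration of the Boolean case, the following minimal sketch (Python; the tuple layout, measure names and data are invented for illustration and are not from the paper) checks whether a set of tuples around the problem case generalizes when the property is "sales dropped from 1993 to 1994":

def sign(x):
    return (x > 0) - (x < 0)

def boolean_error(t, t_s):
    # One reading of Equation (2): 0 if the change has the same sign as for the
    # specific tuple t_s, 1 otherwise.
    return 0 if sign(t["y1994"] - t["y1993"]) == sign(t_s["y1994"] - t_s["y1993"]) else 1

def boolean_penalty(t, t_s):
    # One reading of Equation (3): penalize excluding a tuple that matches the problem case.
    return 1 - boolean_error(t, t_s)

def generalizes(tuples, t_s):
    # Allow the generalization when the total penalty for exclusion exceeds the total
    # error of inclusion, i.e. when matches outnumber mismatches.
    return sum(boolean_penalty(t, t_s) for t in tuples) > sum(boolean_error(t, t_s) for t in tuples)

problem = {"y1993": 10.0, "y1994": 4.0}            # the problem case: a drop
neighbours = [{"y1993": 8.0, "y1994": 3.0},        # drop -> match
              {"y1993": 0.3, "y1994": 2.0},        # rise -> exception
              {"y1993": 5.0, "y1994": 1.0}]        # drop -> match
print(generalizes(neighbours, problem))            # True: 2 matches vs 1 mismatch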
2.5.2 Ratio generalizations

In this case too, we assume that each tuple is associated with two measures. The user is interested in generalizing along all tuples where the ratio between the two measures is the same as that of the specific tuple (Example Scenario 2, Figure 4). We want the error to be small when a tuple's ratio is close to that of the specific tuple. However, we also need to consider the absolute values of the measures, because a small value can cause the ratio to be large, resulting in a disproportionately large value of the error. Therefore, we need to attach a weighting function. Both these requirements are met very well by the symmetric KL distance function routinely used to measure the distance between two distributions. In designing the penalty function, suppose the user indicates that tuples outside the generalization will be assumed to have a ratio at least twice that of the specific tuple. We can then write the penalty as in Equation (5).

Summary: We presented a flexible framework for expressing various kinds of properties that a user might wish to generalize. Our formulation requires a user to just specify an error function that shows how much a tuple deviates from the desired property and a penalty function (often derivable from the error function) for leaving relevant tuples out of a generalization. Even though we have gone to a great extent in reducing the amount of user input needed in specifying a generalization, a casual user might not want to go to the trouble of specifying any function. We believe that many of the common scenarios like the three listed above will be built in within the same framework, much like advanced database systems provide both built-in functions and the facility for registering functions for the advanced user.

3 Algorithm

In this section we discuss algorithms for finding generalizations and their exceptions based on the formulation discussed in Section 2. Our tool will work as an attachment to an OLAP data source. In designing the algorithm, our goal was to exploit the capabilities of the source for efficiently processing typical multidimensional queries. Only when we encounter a computation-intensive sub-task that relies extensively on in-memory state maintained within a run do we fetch data out of the DBMS. This led us to design a two-stage algorithm. In the first stage, we find all possible maximal generalizations using a succession of aggregation queries pushed to the DBMS. In the second stage, we find summarized exceptions to each of the generalizations using data fetched into memory. In the second stage we have to deal with only those tuples that are covered by the maximal generalizations. In most cases, a generalization covers only a small amount of data, so the amount of data required to be fetched in the second stage is limited.

3.1 Finding the generalizations

Finding generalizations involves making multiple searches over gradually increasing subsets of data around the specified tuple. We first find one-dimensional generalizations. Starting from the specified tuple, we check if generalization is possible along each dimension. This check requires us to go over all tuples along a dimension while keeping the other dimension values the same as the specified tuple, summing up the values of the error and penalty functions, and checking whether the criterion of Section 2.2 holds. This check can be easily posed as a "group-by" query. Once we get the single-dimension generalizations, we try combinations of dimensions to check if generalization is possible. We require the generalizations to be consistent. Therefore, before we can claim that a set of dimensions generalizes, we need to check that all subsets of the set generalize. This property enables us to deploy Apriori-style [AS94] subset pruning. We use a similar multi-pass algorithm. In each pass, we use the set of generalizations found in the previous pass to generate new potential generalizations using the Apriori-style candidate generation phase. This is followed by a pruning phase where we eliminate candidates any of whose subsets did not generalize. We then check if the candidate generalizations conform to the generalization criteria. This check is done very differently from Apriori's step of scanning the entire database. Our check involves an aggregate query to the database where only the tuples covered by the candidate generalization are subsetted and the aggregated error and penalty values are returned.
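The level-wise search itself is small; the sketch below (Python) assumes a helper criterion_holds(dims) that issues the aggregate group-by query described above and reports whether the criterion is met for that candidate set of dimensions. All names are illustrative, not the authors' code:

from itertools import combinations

def find_maximal_generalizations(dimensions, criterion_holds):
    # Level-wise, Apriori-style search: a k-dimensional set is a candidate only if all of
    # its (k-1)-dimensional subsets were already found to generalize.
    current = [frozenset([d]) for d in dimensions if criterion_holds(frozenset([d]))]
    found = set(current)
    k = 2
    while current:
        candidates = set()
        for a, b in combinations(current, 2):
            c = a | b
            if len(c) == k and all(c - {d} in found for d in c):
                candidates.add(c)
        current = [c for c in candidates if criterion_holds(c)]   # one aggregate query each
        found.update(current)
        k += 1
    return [s for s in found if not any(s < t for t in found)]   # keep only the maximal sets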
3.2 Finding Summarized Exceptions

At this stage our goal is to find exceptions to each maximal generalization, compacted to within the user-specified number of rows and yielding the minimum total error calculated as discussed in Section 2.3. Ideally we would generate the exceptions in the database itself. This is hard for several reasons. First, there is no absolute criterion for determining whether a tuple is an exception or not for all possible functions. Thus we cannot push to the DBMS a filter query that will just return the exceptions. Even for functions where such filters exist, for example Boolean functions (Section 2.5.1), a good summarization might require us to consider some non-exceptions too when nested exceptions are involved. Therefore we resort to algorithms that fetch data from the DBMS but minimize cost by reducing the number of passes over the data. Also, we do not want to assume that all the data fetched can be buffered in memory. We present an efficient bottom-up algorithm that can return the optimal answer in one pass of the data in some cases. We describe our algorithm in stages, first assuming that there is just a single dimension with levels of hierarchy and later handling multiple dimensions.

3.2.1 Single dimension with multiple levels of hierarchy

One factor that crucially affects the design of the algorithm is the form of the error function. Some functions can be rewritten as a comparison of per-tuple summaries, where the summary returns a known finite set of values irrespective of the number of values its argument can take. We call this the finite-domain property of a function. For example, the error function (Equation 2) of the Boolean scenarios in Section 2.5.1 satisfies this property with the summary defined as the sign of the change; it returns either "+" or "-" for all values of its arguments. The error function for Ratio generalization (Section 2.5.2) does not satisfy the finite-domain property because the ratio can be any real number. The error function for Trend generalization satisfies the property if we assume that the number of time points is known and finite; the range size in this case covers all possible sign combinations of the changes. Note that the range of the error function itself has nothing to do with whether it satisfies the finite-domain property or not. We first present an algorithm for finite-domain functions and later generalize to other functions.

Optimal solution for finite-domain functions: When the error function satisfies the finite-domain property, we can find the optimal solution in a single pass of the data in an online manner, i.e., without buffering too much data in memory. For ease of exposition, we describe the algorithm assuming two-valued properties — extension to other cases is straightforward. Let us denote the two values as "+" or "-". The error for two tuples is 0 when the signs match; otherwise it is 1. Let us further assume that for the specific tuple the value is "+".

Consider first an even simpler case where we have just one level of hierarchy, with some number of tuples in the group. An obvious way to find the best answer of size at most the allowed number of rows for this group is as follows. Find the majority value of the tuples. If the majority value is the same as the specific tuple's, i.e., "+", then report as exceptions at most that many tuples with value "-". If the majority value is "-", make the first "-" valued tuple a representative for all tuples with value "-" and fill the remaining slots with tuples of value "+".
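The "obvious" offline selection just described can be sketched as follows (Python; tuples are represented only by their signs, n_slots is the answer-size limit, and the problem tuple is assumed to be "+"; this is an illustration, not the paper's implementation):

def best_answer_one_level(signs, n_slots):
    # Returns (summary_sign, indices_listed_as_exceptions, residual_error).
    plus = [i for i, s in enumerate(signs) if s == "+"]
    minus = [i for i, s in enumerate(signs) if s == "-"]
    if len(plus) >= len(minus):
        # Majority agrees with the problem tuple: list up to n_slots "-" tuples as exceptions;
        # each unlisted "-" tuple contributes error 1 against the "+" generalization.
        listed = minus[:n_slots]
        return "+", listed, len(minus) - len(listed)
    # Majority is "-": one slot holds a "-" representative summarizing all "-" tuples, and the
    # remaining slots list "+" tuples explicitly (assumes n_slots >= 1); each unlisted "+"
    # tuple contributes error 1 against the "-" representative.
    listed = plus[:max(n_slots - 1, 0)]
    return "-", listed, len(plus) - len(listed)

# Example: best_answer_one_level(["+", "-", "-", "-", "+"], n_slots=2)  ->  ("-", [0], 1)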
The problem with this algorithm is that it is not online. Unless all the tuples are scanned, we do not know the majority value and therefore cannot know whether to retain "-" tuples or "+" tuples as exceptions. We solve this problem by maintaining two intermediate solutions at all times, corresponding to the two possible values by which the group could be represented. At the end of the scan we pick the one with the smaller cost.

We next consider the case of multiple levels of hierarchy. The tuples in the generalization can be arranged in a tree. The slots need to be filled by tuples from the most detailed level, or by representatives for groups from any level of the hierarchy, so as to minimize total error. We propose the following bottom-up solution. We scan the relevant tuples from the DBMS sorted according to the levels of the hierarchy. For each lowermost subtree of tuples we can find the optimal solution in one pass for any given number of slots. However, in this case we cannot know in advance the number of slots to allocate for a subtree. We therefore find the solution and the corresponding error for all possible sizes from 0 up to the limit. Thus for each subtree we have the best solution for every size between zero and the limit and for all possible values of the default representative, which in our case could be "+" or "-".

Recursively, for each internal node we merge the solutions of its subtrees by choosing the partitioning of the slots that leads to the smallest total error. This step also needs to be done in an efficient online manner — a parent node cannot buffer the solutions of its subtrees while they are processed. We solve this problem by maintaining at each parent node the best solution, for all possible answer sizes, over all subtrees processed so far; this is the intermediate value of the solution after the first few children of the node have been scanned. After a new subtree has seen all its data, it passes all its solutions to its parent for merging with the current solution. This merge can be done optimally at the node using a dynamic programming formulation over all sizes and representative values (Equation 6).

After all subtrees of a node have arrived, we need one last step before finishing with this node. For each size and representative we need to consider whether choosing a new representative with a sign different from the current one will lead to a smaller-cost solution, even if that leaves one slot less for the rest of the tuples. This can only happen if the error of the alternative representative is less than the error of the current one once all children of the node have been scanned. The final solution at a node after all its children are scanned is given by Equation (7). The final solution for the whole generalization is the entry at the topmost node of the tree. We illustrate the working of the algorithm with an example.

Figure 8: Illustration of the algorithm.

Example: In Figure 8 we present an example. The leftmost subtree under node 1.1 has

Human-level concept learning through probabilistic program induction


new concept, and even children can make meaningful generalizations via “one-shot learning” (1–3). In contrast, many of the leading approaches in machine learning are also the most data-hungry, especially “deep learning” models that have achieved new levels of performance on object and speech recognition benchmarks (4–9). Second, people learn richer representations than machines do, even for simple concepts (Fig. 1B), using them for a wider range of functions, including (Fig. 1, ii) creating new exemplars (10), (Fig. 1, iii) parsing objects into parts and relations (11), and (Fig. 1, iv) creating new abstract categories of objects based on existing categories (12, 13). In contrast, the best machine classifiers do not perform these additional functions, which are rarely studied and usually require specialized algorithms. A central challenge is to explain these two aspects of human-level concept learning: How do people learn new concepts from just one or a few examples? And how do people learn such abstract, rich, and flexible representations? An even greater challenge arises when putting them together: How can learning succeed from such sparse data yet also produce such rich representations? For any theory of

New-curriculum senior high school English, Unit 10 "Connections" courseware (Beijing Normal University Press, Selective Compulsory Book 4)


Theme: People and society. Subject competency: Quality of thinking. Difficulty: ★★★★★. [Reading guide] This passage briefly introduces the principle of the "small-world theory" and its wide application in social networks.
The Small-World Phenomenon and Decentralized Search The small-world phenomenon—the principle that we are all linked by short chains of acquaintances, or “six degrees of separation”—is a fundamental issue in social networks. It is a basic statement about the abundance of short paths in a graph whose nodes are people, with links joining pairs who know one another. It is also a topic on which the feedback between social, mathematical, and computational issues has been particularly fluid.
○ Key sentence analysis
1.The small-world phenomenon—the principle that we are all linked by short chains of acquaintances, or “six degrees of separation”—is a fundamental issue in social networks.
Working much more recently, applied mathematicians Duncan Watts and Steve Strogatz proposed thinking about networks with this small-world property as a superposition: a highly clustered sub-network consisting of the "local acquaintances" of nodes, together with a collection of random long-range shortcuts that help produce short paths.
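The Watts–Strogatz construction described above can be reproduced in a few lines; the sketch below uses the networkx library with illustrative parameter values that are not taken from the passage:

import networkx as nx

# Start from a ring of n nodes, each joined to its k nearest neighbours (the highly
# clustered "local acquaintances"), then rewire each edge with probability p to create
# the random long-range shortcuts that produce short paths.
n, k, p = 1000, 6, 0.05
G = nx.connected_watts_strogatz_graph(n, k, p)

print(nx.average_clustering(G))             # remains fairly high (clustered)
print(nx.average_shortest_path_length(G))   # becomes short: the "small world"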

Turbulent Combustion Models


Contents
1. Introduction .................................................. 195
2. Balance equations .............................................. 196
6. Tools for turbulent combustion modeling ........................ 212
   6.1. Introduction .............................................. 212
   6.2. Scalar dissipation rate ................................... 214
   6.3. Geometrical description ................................... 214
        6.3.1. G-field equation .................................... 214
        6.3.2. Flame surface density description ................... 216
        6.3.3. Flame wrinkling description ......................... 218
   6.4. Statistical approaches: probability density function ...... 218
        6.4.1. Introduction ........................................ 218
        6.4.2. Presumed probability density functions .............. 219
        6.4.3. Pdf balance equation ................................ 219
        6.4.4. Joint velocity/concentrations pdf ................... 221
        6.4.5. Conditional moment closure (CMC) .................... 221
   6.5. Similarities and links between the tools .................. 221

Self-Organizing Map (SOM) R Package Documentation


Package 'som'    October 14, 2022

Version: 0.3-5.1
Date: 2010-04-08
Title: Self-Organizing Map
Author: Jun Yan <***************.edu>
Maintainer: Jun Yan <***************.edu>
Depends: R (>= 2.10)
Description: Self-Organizing Map (with application in gene clustering).
License: GPL (>= 3)
Repository: CRAN
Date/Publication: 2016-07-06 10:26:15
NeedsCompilation: yes

R topics documented: filtering, normalize, plot.som, qerror, som, summary.som, yeast

filtering -- Filter data before feeding the som algorithm, for gene expression data
Description: Filter data by a certain floor, ceiling, max/min ratio, and max-min difference.
Usage: filtering(x, lt=20, ut=16000, mmr=3, mmd=200)
Arguments:
  x    a data frame or matrix of input data.
  lt   floor value; entries less than it are replaced with the value.
  ut   ceiling value; entries greater than it are replaced with the value.
  mmr  the max/min ratio; rows with max/min < mmr will be removed.
  mmd  the max-min difference; rows with (max - min) < mmd will be removed.
Value: A data frame or matrix after the filtering.
Author(s): Jun Yan <***************.edu>
See Also: normalize.

normalize -- Normalize data before feeding the som algorithm
Description: Normalize the data so that each row has mean 0 and variance 1.
Usage: normalize(x, byrow=TRUE)
Arguments:
  x      a data frame or matrix of input data.
  byrow  whether to normalize by row or by column; the default is by row.
Value: A data frame or matrix after the normalizing.
Author(s): Jun Yan <***************.edu>
See Also: filtering.

plot.som -- Visualizing a SOM
Description: Plot the SOM in a 2-dim map with means and sd bars.
Usage:
  ## S3 method for class 'som'
  plot(x, sdbar=1, ylim=c(-3,3), color=TRUE, ntik=3, yadj=0.1, xlab="", ylab="", ...)
Arguments:
  x      a som object
  sdbar  the length of the sd bar in sd units; no sd bar if sdbar=0
  ylim   the range of the y axis in each cell of the map
  color  whether or not to use color plotting
  ntik   the number of tick marks on the vertical axis
  yadj   the proportion used to place the number of observations
  xlab   x label
  ylab   y label
  ...    other options passed to plot
Note: This function is not cleanly written. The original purpose was to mimic what GENECLUSTER does. The ylim is hardcoded so that only standardized data can be properly plotted. There are visualization methods like umat and sammon in SOM_PAK 3.1, but they are not implemented here.
Author(s): Jun Yan <***************.edu>
Examples:
  foo <- som(matrix(rnorm(1000), 250), 3, 5)
  plot(foo, ylim=c(-1, 1))

qerror -- Quantization accuracy
Description: Get the average distortion measure.
Usage: qerror(obj, err.radius=1)
Arguments:
  obj         a 'som' object
  err.radius  radius used in calculating qerror
Value: An average of the following quantity (a weighted distance measure) over all x in the sample: ||x - m_i|| h_ci, where h_ci is the neighbourhood kernel for the ith code.
Author(s): Jun Yan <***************.edu>
Examples:
  foo <- som(matrix(rnorm(1000), 100), 2, 4)
  qerror(foo, 3)

som -- Function to train a Self-Organizing Map
Description: Produces an object of class "som", which is a Self-Organizing Map fit of the data.
Usage:
  som.init(data, xdim, ydim, init="linear")
  som(data, xdim, ydim, init="linear", alpha=NULL, alphaType="inverse", neigh="gaussian", topol="rect", radius=NULL, rlen=NULL, err.radius=1, inv.alp.c=NULL)
  som.train(data, code, xdim, ydim, alpha=NULL, alphaType="inverse", neigh="gaussian", topol="rect", radius=NULL, rlen=NULL, err.radius=1, inv.alp.c=NULL)
  som.update(obj, alpha=NULL, radius=NULL, rlen=NULL, err.radius=1, inv.alp.c=NULL)
  som.project(obj, newdat)
Arguments:
  obj         a 'som' object.
  newdat      a new dataset that needs to be projected onto the map.
  code        a matrix of initial code vectors in the map.
  data        a data frame or matrix of input data.
  xdim        an integer specifying the x-dimension of the map.
  ydim        an integer specifying the y-dimension of the map.
  init        a character string specifying the initializing method. The following are permitted: "sample" uses a random sample from the data; "random" uses random draws from N(0,1); "linear" uses the linear grids upon the first two principal components direction.
  alpha       a vector of initial learning rate parameters for the two training phases. Decreases linearly to zero during training.
  alphaType   a character string specifying the learning rate function type. Possible choices are the linear function ("linear") and the inverse-time type function ("inverse").
  neigh       a character string specifying the neighborhood function type. The following are permitted: "bubble", "gaussian".
  topol       a character string specifying the topology type when measuring distance in the map. The following are permitted: "hexa", "rect".
  radius      a vector of initial radii of the training area in the som algorithm for the two training phases. Decreases linearly to one during training.
  rlen        a vector of running lengths (number of steps) in the two training phases.
  err.radius  a numeric value specifying the radius used when calculating the average distortion measure.
  inv.alp.c   the constant C in the inverse learning rate function: alpha0*C/(C+t).
Value: 'som.init' initializes a map and returns the code matrix. 'som' does the two-step som training in a batch fashion and returns a 'som' object. 'som.train' takes data, code, and training parameters and performs the requested som training. 'som.update' takes a 'som' object and further trains it with updated parameters. 'som.project' projects new data onto the map.
An object of class "som" representing the fit, which is a list containing the following components:
  data      the dataset on which som was applied.
  init      a character string indicating the initializing method.
  xdim      an integer specifying the x-dimension of the map.
  ydim      an integer specifying the y-dimension of the map.
  code      a matrix with nrow = xdim*ydim, each row corresponding to the code vector of a cell in the map. The mapping from cell coordinate (x, y) to the row index in the code matrix is: rownumber = x + y*xdim.
  visual    a data frame of three columns, with the same number of rows as data: x and y are the coordinates of the corresponding observation in the map, and qerror is the quantization error computed as the squared distance (depends on topol) between the observation vector and its coding vector.
  alpha0    a vector of initial learning rate parameters for the two training phases.
  alpha     a character string specifying the learning rate function type.
  neigh     a character string specifying the neighborhood function type.
  topol     a character string specifying the topology type when measuring distance in the map.
  radius0   a vector of initial radii of the training area for the two training phases.
  rlen      a vector of running lengths in the two training phases.
  qerror    a numeric value of the average distortion measure.
  code.sum  a data frame summarizing the number of observations in each map cell.
Author(s): Jun Yan <***************.edu>
References: Kohonen, Hynninen, Kangas, and Laaksonen (1995), SOM-PAK, the Self-Organizing Map Program Package (version 3.1). http://www.cis.hut.fi/research/papers/som_tr96.ps.Z
Examples:
  data(yeast)
  yeast <- yeast[, -c(1, 11)]
  yeast.f <- filtering(yeast)
  yeast.f.n <- normalize(yeast.f)
  foo <- som(yeast.f.n, xdim=5, ydim=6)
  foo <- som(yeast.f.n, xdim=5, ydim=6, topol="hexa", neigh="gaussian")
  plot(foo)

summary.som -- Summarize a som object
Description: Print out the configuration parameters of a som object.
Usage:
  ## S3 method for class 'som'
  summary(object, ...)
  ## S3 method for class 'som'
  print(x, ...)
Arguments:
  object, x  a 'som' object
  ...        nothing yet
Author(s): Jun Yan <***************.edu>

yeast -- Yeast cell cycle
Description: The yeast data frame has 6601 rows and 18 columns, i.e., 6601 genes measured at 18 time points.
Usage: data(yeast)
Format: This data frame contains the column Gene (a character vector of gene names) and the numeric vectors zero, ten, twenty, thirty, fourty, fifty, sixty, seventy, eighty, ninety, hundred, one.ten, one.twenty, one.thirty, one.fourty, one.fifty, one.sixty.
References: Tamayo et al. (1999), Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, PNAS V96, pp 2907-2912, March 1999.

Index (by keyword): arith: qerror; cluster: som; datasets: yeast; hplot: plot.som; manip: filtering, normalize; print: summary.som.

MaxDEA


Detailed Contents
Chapter 1: Main Features of MaxDEA ..................................................8
1.1 Main Features ............................................ 8
1.2 Models in MaxDEA ......................................... 9
1.3 What's NEW ............................................... 12
1.4 Compare MaxDEA Editions .................................. 17
3.1 Import Data .............................................. 19
3.2 Define Data .............................................. 24
3.3 Set and Run Model ........................................ 25
3.4 Export Results ........................................... 77

Common Mode Filter Design Guide


Common Mode Filter Design Guide

Introduction
The selection of component values for common mode filters need not be a difficult and confusing process. The use of standard filter alignments can be utilized to achieve a relatively simple and straightforward design process, though such alignments may readily be modified to utilize pre-defined component values.

General
Line filters prevent excessive noise from being conducted between electronic equipment and the AC line; generally, the emphasis is on protecting the AC line. Figure 1 shows the use of a common mode filter between the AC line (via impedance matching circuitry) and a (noisy) power converter. The direction of common mode noise (noise on both lines occurring simultaneously, referred to earth ground) is from the load and into the filter, where the noise common to both lines becomes sufficiently attenuated. The resulting common mode output of the filter onto the AC line (via impedance matching circuitry) is then negligible.

Figure 1. Generalized line filtering

The design of a common mode filter is essentially the design of two identical differential filters, one for each of the two polarity lines, with the inductors of each side coupled by a single core:

Figure 2. The common mode inductor

For a differential input current ((A) to (B) through L1 and (B) to (A) through L2), the net magnetic flux which is coupled between the two inductors is zero. Any inductance encountered by the differential signal is then the result of imperfect coupling of the two chokes; they perform as independent components with their leakage inductances responding to the differential signal: the leakage inductances attenuate the differential signal. When the inductors, L1 and L2, encounter an identical signal of the same polarity referred to ground (common mode signal), they each contribute a net, non-zero flux in the shared core; the inductors thus perform as independent components with their mutual inductance responding to the common signal: the mutual inductance then attenuates this common signal.

The First Order Filter
The simplest and least expensive filter to design is a first order filter; this type of filter uses a single reactive component to store certain bands of a spectral energy without passing this energy to the load. In the case of a low pass common mode filter, a common mode choke is the reactive element employed. The value of inductance required of the choke is simply the load in Ohms divided by the radian frequency at and above which the signal is to be attenuated. For example, attenuation at and above 4000 Hz into a 50 Ω load would require a 1.99 mH (50/(2π × 4000)) inductor. The resulting common mode filter configuration would be as follows:

Figure 3. A first order (single pole) common mode filter (50 Ω load, 1.99 mH choke)

The attenuation at 4000 Hz would be 3 dB, increasing at 6 dB per octave. Because of the predominant inductor dependence of a first order filter, the variations of actual choke inductance must be considered. For example, a ±20% variation of rated inductance means that the nominal 3 dB frequency of 4000 Hz could actually be anywhere in the range from 3332 Hz to 4999 Hz.
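The arithmetic above is easy to script; this short Python sketch (using the values from the example) computes the required inductance and the corner-frequency spread caused by a ±20% inductance tolerance:

import math

R_load = 50.0        # ohms
f_corner = 4000.0    # Hz, desired -3 dB point

L = R_load / (2 * math.pi * f_corner)            # about 1.99 mH
print(f"L = {L * 1e3:.2f} mH")

for tol in (-0.20, +0.20):                       # +/-20% inductance tolerance
    f3db = R_load / (2 * math.pi * L * (1 + tol))
    print(f"{tol:+.0%}: f_3dB = {f3db:.0f} Hz")  # roughly 5000 Hz and 3333 Hz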
It is typical for the inductance value of a common mode choke to be specified as a minimum requirement, thus insuring that the crossover frequency not be shifted too high. However, some care should be observed in choosing a choke for a first order low pass filter because a much higher than typical or minimum value of inductance may limit the choke's useful band of attenuation.

Second Order Filters
A second order filter uses two reactive components and has two advantages over the first order filter: 1) ideally, a second order filter provides 12 dB per octave attenuation (four times that of a first order filter) after the cutoff point, and 2) it provides greater attenuation at frequencies above inductor self-resonance (see Figure 4).

One of the critical factors involved in the operation of higher order filters is the attenuating character at the corner frequency. Assuming tight coupling of the filter components and reasonable coupling of the choke itself (conditions we would expect to achieve), the gain near the cutoff point may be very large (several dB); moreover, the time response would be slow and oscillatory. On the other hand, the gain at the crossover point may also be less than the presumed -3 dB (3 dB attenuation), providing a good transient response, but frequency response near and below the corner frequency could be less than optimally flat. In the design of a second order filter, the damping factor (usually signified by the Greek letter zeta (ζ)) describes both the gain at the corner frequency and the time response of the filter. Figure 5 shows normalized plots of the gain versus frequency for various values of zeta.

Figure 4. Analysis of a second order (two pole) common mode low pass filter:
  V_CMout(s)/V_CMin(s) = 1 / (LCs² + (L/R_L)s + 1) = 1 / (1 − (ω/ω_n)² + j·2ζ(ω/ω_n)),
  where ω_n ≡ 1/√(LC) is the radian corner frequency, ζ ≡ ω_n·L/(2R_L), and R_L ≡ the noise load resistance.

Figure 5. Second order frequency response for various damping factors (ζ); curves A–E correspond to ζ = 0.1, 0.5, 0.707, 1.0 and 4.0, plotted as gain (dB) versus radian frequency from 0.1ω_n to 10ω_n.

The design of a second order filter requires more care and analysis than a first order filter to obtain a suitable response near the cutoff point, but there is less concern needed at higher frequencies as previously mentioned. As the damping factor becomes smaller, the gain at the corner frequency becomes larger; the ideal limit for zero damping would be infinite gain. The inherent parasitics of real components reduce the gain expected from ideal components, but tailoring the frequency response within the few octaves of the critical cutoff point is still effectively a function of ideal filter parameters (i.e., frequency, capacitance, inductance, resistance). For some types of filters, the design and damping characteristics may need to be maintained to meet specific performance requirements. For many actual line filters, however, a damping factor of approximately 1 or greater and a cutoff frequency within about an octave of the calculated ideal should provide suitable filtering.

The following is an example of a second order low pass filter design:

1) Identify the required cutoff frequency: For this example, suppose we have a switching power supply (for use in equipment covered by UL478) that is actually 24 dB noisier at 60 kHz than permissible for the intended application. For a second order filter (12 dB/octave roll off) the desired corner frequency would be 15 kHz.

2) Identify the load resistance at the cutoff frequency: Assume R_L = 50 Ω.

3) Choose the desired damping factor: Choose a minimum of 0.707, which will provide 3 dB attenuation at the corner frequency while providing favorable control over filter ringing.

4) Calculate the required component values: ω_n = 2π × 15,000 ≈ 94,248 rad/sec; from ω_n = 1/√(LC) and ζ = ω_n·L/(2R_L) this gives L ≈ 0.75 mH and C ≈ 0.15 µF. Note: Damping factors much greater than 1 may cause unacceptably high attenuation of lower frequencies, whereas a damping factor much less than 0.707 may cause undesired ringing and the filter may itself produce noise.

5) Choose available components: C = 0.05 µF (largest standard capacitor value that will meet leakage current requirements for UL478/CSA C22.2 No. 1: a 300% decrease from design); L = 2.1 mH (approx. 300% larger than design to compensate for the reduction of capacitance: Coilcraft standard part #E3493-A).

6) Calculate the actual frequency, damping factor, and attenuation for the components chosen: f_n = 1/(2π√(LC)) = 15,532 Hz (very nearly 15 kHz); ζ = 2.05 (a damping factor of about 1 or more is acceptable); attenuation = (12 dB/octave) × 2 octaves = 24 dB.

7) The resulting filter is that of Figure 4 with: L = 2.1 mH; C = 0.05 µF; R_L = 50 Ω.
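The steps above reduce to a few lines of arithmetic. The sketch below (Python) uses the standard second-order relations ω_n = 1/√(LC) and ζ = ω_n·L/(2R_L) to reproduce the design values and then the actual f_n and ζ for the standard parts chosen:

import math

R_L = 50.0            # ohms, noise load resistance
f_c = 15_000.0        # Hz, desired corner frequency
zeta = 0.707          # chosen damping factor

# Step 4: ideal component values
w_n = 2 * math.pi * f_c
L = 2 * zeta * R_L / w_n              # ~0.75 mH
C = 1 / (w_n ** 2 * L)                # ~0.15 uF
print(f"design: L = {L * 1e3:.2f} mH, C = {C * 1e6:.2f} uF")

# Steps 5-6: standard parts actually chosen, and the resulting corner frequency / damping
L_act, C_act = 2.1e-3, 0.05e-6
f_n = 1 / (2 * math.pi * math.sqrt(L_act * C_act))       # ~15532 Hz
zeta_act = L_act / (2 * R_L * math.sqrt(L_act * C_act))  # ~2.05
print(f"actual: f_n = {f_n:.0f} Hz, zeta = {zeta_act:.2f}")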
Third Order Filters
A third order filter ideally yields an attenuation of 18 dB per octave above the cutoff point (or cutoff points if the three corner frequencies are not simultaneous); this is the prominently positive aspect of this higher order filter. The primary disadvantage is cost, since three reactive components are now required. Higher than third order filters are generally cost-prohibitive.

The design of a generic filter is readily accomplished by using standard alignments such as the Butterworth ("maximally flat") alignments. Figure 6 shows the general analysis and component relationships to the Butterworth alignments for a third order low pass filter. Butterworth alignments provide an inherent ζ of 0.707 and a -3 dB point at the crossover frequency.

Figure 6. Analysis of a third order (three pole) low pass filter, where ω1, ω2 and ω4 occur at the same -3 dB frequency ω0:
  V_CMout(s)/V_CMin(s) = 1 / ((L1·L2·C/R_L)s³ + L1·C·s² + ((L1 + L2)/R_L)s + 1).
  Matching this to the third order Butterworth form 1 / (1 + 2(s/ω_n) + 2(s/ω_n)² + (s/ω_n)³) gives
  (L1 + L2)/R_L = 2/ω_n,  L1·C = 2/ω_n²,  L1·L2·C/R_L = 1/ω_n³.

The Butterworth alignments for the first three orders of low pass filters are shown in Figure 7. The design of a line filter need not obey the Butterworth alignments precisely (although such alignments do provide a good basis for design); moreover, because of leakage current limits placed upon electronic equipment (thus limiting the amount of filter capacitance to ground), adjustments to the alignments are usually required, but they can be executed very simply as follows:

1) First design a second order low pass with ζ ≥ 0.5.
2) Add a third pole (which has the desired corner frequency) by cascading a second inductor between the second order filter and the noise load: L = R/(2πf_c), where f_c is the desired corner frequency.

Design Procedure
The following example determines the required component values for a third order filter (for the same requirements as the previous second order design example).

1) List the desired crossover frequency and load resistance: choose f_c = 15,000 Hz and R_L = 50 Ω.
2) Design a second order filter with ζ = 0.5 (see the second order example above).
3) Design the third pole: L2 = R_L/(2πf_c) = 50/(2π × 15,000) = 0.531 mH.
4) Choose available components and check the resulting cutoff frequency and attenuation: L2 = 0.508 mH (Coilcraft #E3506-A); f_n = R_L/(2πL2) = 15,665 Hz; attenuation at 60 kHz: 24 dB (second order filter) + 2.9 octaves × 6 dB = 41.4 dB.
5) The resulting filter configuration is that of Figure 6 with: L1 = 2.1 mH, L2 = 0.508 mH, R_L = 50 Ω.

Conclusions
Specific filter alignments may be calculated by manipulating the transfer function coefficients (component values) of a filter to achieve a specific damping factor. A step-by-step design procedure may utilize standard filter alignments, eliminating the need to calculate the damping factor directly for critical filtering. Line filters, with their unique requirements yet non-critical characteristics, are easily designed using a minimum allowable damping factor. Standard filter alignments assume ideal filter components; this does not necessarily hold true, especially at higher frequencies. For a discussion of the non-ideal character of common mode filter inductors refer to the application note "Common Mode Filter Inductor Analysis," available from Coilcraft.

Figure 7. The first three order low pass filters and their Butterworth alignments:
  First order:  e_o/e_i = 1 / (Ls/R_L + 1);                                       Butterworth: 1 / (1 + s/ω_n)
  Second order: e_o/e_i = 1 / (LCs² + Ls/R_L + 1);                                Butterworth: 1 / (1 + 1.414(s/ω_n) + (s/ω_n)²)
  Third order:  e_o/e_i = 1 / ((L1·L2·C/R_L)s³ + L1·C·s² + ((L1+L2)/R_L)s + 1);   Butterworth: 1 / (1 + 2(s/ω_n) + 2(s/ω_n)² + (s/ω_n)³)

GENERALIZED HOUGH TRANSFORM courseware (.ppt)

GENERALIZED HOUGH TRANSFORM
Recap on classical Hough Transform
1. In detecting lines
– The parameters r and θ are found relative to the origin (0,0)
5. The idea can be extended to shapes like ellipses, parabolas, etc.
Parameters for Analytic Forms

Analytic Form    Parameters    Equation
Line             r, θ          x cos θ + y sin θ = r
2. Complete specification of the exact shape of the target object is required
3. The Shape is specified in the form of the R-Table
4. Information that can be extracted are
• Advantages
1. A method for object recognition
2. Robust to partial deformation in shape
3. Tolerant to noise
4. Can detect multiple occurrences of a shape in an image
2. For each entry on the row calculate the candidate location of the reference point
xc = xi + r cos θ,   yc = yi + r sin θ
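A compact sketch of this voting step (Python; the R-table is assumed here to be a dict mapping a quantized gradient direction to a list of (r, α) offsets, and all names are illustrative rather than taken from the slides):

import math
from collections import defaultdict

def ght_vote(edge_points, r_table, angle_step=math.radians(5)):
    # Generalized Hough Transform accumulation: every edge point (x_i, y_i, gradient angle)
    # looks up its R-table row and votes for candidate reference-point locations.
    accumulator = defaultdict(int)
    for x_i, y_i, grad_angle in edge_points:
        row = r_table.get(round(grad_angle / angle_step), [])
        for r, alpha in row:
            x_c = x_i + r * math.cos(alpha)      # candidate reference point
            y_c = y_i + r * math.sin(alpha)
            accumulator[(round(x_c), round(y_c))] += 1
    return max(accumulator, key=accumulator.get) if accumulator else None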

Lineage Tracing for General Data Warehouse Transformations


Lineage Tracing for General Data Warehouse Transformations∗

Yingwei Cui and Jennifer Widom
Computer Science Department, Stanford University
{cyw,widom}@

Abstract

Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex "data cleansing" procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing.

1 Introduction

Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information [CD97, LW95]. Sometimes during data analysis it is useful to look not only at the information in the warehouse, but also to investigate how certain warehouse information was derived from the sources. Tracing warehouse data items back to the source data items from which they were derived is termed the data lineage problem [CWW00]. Enabling lineage tracing in a data warehousing environment has several benefits and applications, including in-depth data analysis and data mining, authorization management, view update, efficient warehouse recovery, and others as outlined in, e.g., [BB99, CW01, CWW00, HQGW93, LBM98, LGMW00, RS98, RS99, WS97].

In previous work [CW00, CWW00], we studied the warehouse data lineage problem in depth, but we only considered warehouse data defined as relational materialized views over the sources, i.e., views specified using SQL or relational algebra. Related work has focused on even simpler relational views [Sto75] or on multidimensional views [DB2, Pow]. In real production data warehouses, however, data imported from the sources is generally "cleansed", integrated, and summarized through a sequence or graph of transformations, and many commercial warehousing systems provide tools for creating and managing such transformations as part of the extract-transform-load (ETL) process, e.g., [Inf, Mic, PPD, Sag]. The transformations may vary from simple algebraic operations or aggregations to complex procedural code.

In this paper we consider the problem of lineage tracing for data warehouses created by general transformations. Since we no longer have the luxury of a fixed set of operators or the algebraic properties offered by relational views, the problem is considerably more difficult and open-ended than previous work on lineage tracing. Furthermore, since transformation graphs in real ETL processes can often be quite complex — containing as many as 60 or more transformations — the storage requirements and runtime overhead associated with lineage tracing are very important considerations. We develop an approach to lineage tracing for general transformations that takes advantage of known structure or properties of transformations when present, yet provides tracing facilities in the absence of such information as well. Our tracing algorithms apply to single transformations, to linear sequences of transformations, and to arbitrary acyclic transformation graphs. We present optimizations that effectively reduce the storage and runtime overhead in the case of large transformation graphs. Our results can be used as the basis for an in-depth data warehouse analysis and debugging tool, by which analysts can browse their warehouse data, then trace back to the source data that produced warehouse data items of interest. Our results also can guide the design of data warehouses that enable efficient lineage tracing.

The main contributions of this paper are summarized as follows:

• In Sections 2 and 3 we define data transformations formally and identify a set of relevant transformation properties. We define data lineage for general warehouse transformations exhibiting these properties, but we also cover "black box" transformations with no known properties. The transformation properties we consider can be specified easily by transformation authors, and they encompass a large majority of transformations used for real data warehouses.
• In Section 3 we develop lineage tracing algorithms for single transformations. Our algorithms take advantage of transformation properties when they are present, and we also suggest how indexes can be used to further improve tracing performance.
• In Sections 4–6 we develop a general algorithm for lineage tracing through a sequence or graph of transformations. Our algorithm includes methods for combining transformations so that we can reduce overall tracing cost, including the number of transformations we must trace through and the number of intermediate results that must be stored or recomputed for the purpose of lineage tracing.
• We have implemented a prototype lineage tracing system based on our algorithms, and in Section 7 we present a few initial performance results.

For examples in this paper we use the relational data model, but our approach and results clearly apply to data objects in general.

1.1 Related Work

There has been a significant body of work on data transformations in general, including aspects such as transforming data formats, models, and schemas, e.g., [ACM+99, BDH+95, CR99, HMN+99, LSS96, RH00, Shu87, Squ95]. Often the focus is on data integration or warehousing, but none of these papers considers lineage tracing through transformations, or even addresses the related problem of transformation inverses. Most previous work on data lineage focuses on coarse-grained (or schema-level) lineage tracing, and uses annotations to provide lineage information such as which transformations were involved in producing a given warehouse data item [BB99, LBM98], or which source attributes derive certain warehouse attributes [HQGW93, RS98]. By contrast, we consider fine-grained (or instance-level) lineage tracing: we retrieve the actual set of source data items that derived a given warehouse data item. As will be seen, in some cases we
well.Our tracing algorithms apply to single transformations,to linear sequences of transfor-mations,and to arbitrary acyclic transformation graphs.We present optimizations that effectively reduce the storage and runtime overhead in the case of large transformation graphs.Our results can be used as the basis for an in-depth data warehouse analysis and debugging tool,by which analysts can browse their warehouse data,then trace back to the source data that produced warehouse data items of interest.Our results also can guide the design of data warehouses that enable efficient lineage tracing.The main contributions of this paper are summarized as follows:•In Sections2and3we define data transformations formally and identify a set of relevant transformation properties.We define data lineage for general warehouse transformations exhibiting these properties, but we also cover“black box”transformations with no known properties.The transformation properties we consider can be specified easily by transformation authors,and they encompass a large majority of transformations used for real data warehouses.•In Section3we develop lineage tracing algorithms for single transformations.Our algorithms take advantage of transformation properties when they are present,and we also suggest how indexes can be used to further improve tracing performance.•In Sections4–6we develop a general algorithm for lineage tracing through a sequence or graph of transformations.Our algorithm includes methods for combining transformations so that we can reduce overall tracing cost,including the number of transformations we must trace through and the number of intermediate results that must be stored or recomputed for the purpose of lineage tracing.•We have implemented a prototype lineage tracing system based on our algorithms,and in Section7we present a few initial performance results.For examples in this paper we use the relational data model,but our approach and results clearly apply to data objects in general.1.1Related WorkThere has been a significant body of work on data transformations in general,including aspects such as transforming data formats,models,and schemas,e.g.,[ACM+99,BDH+95,CR99,HMN+99,LSS96,RH00, Shu87,Squ95].Often the focus is on data integration or warehousing,but none of these papers considers lineage tracing through transformations,or even addresses the related problem of transformation inverses.Most previous work on data lineage focuses on coarse-grained(or schema-level)lineage tracing,and uses annotations to provide lineage information such as which transformations were involved in producing a given warehouse data item[BB99,LBM98],or which source attributes derive certain warehouse attributes [HQGW93,RS98].By contrast,we considerfine-grained(or instance-level)lineage tracing:we retrieve the actual set of source data items that derived a given warehouse data item.As will be seen,in some cases weprod-id category valid111computer10/1/1998–Sony V AIO3280222computer12/1/1998–9/30/1999 Sony V AIO1950333electronics4/2/1999–Sony V AIO2750cust-id prod-list 01012/1/199901022/8/199903794/9/199905246/9/199907618/21/1999095211/8/1999102811/24/1999125012/15/19991We assume that valid is a simple string,which unfortunately is a typical ad-hoc treatment of time.Figure3:Transformations to derive SalesJump Namesplit ordersselect on product categoryjoin products and ordersaggregate and pivot quarterly salesadd a column avg3select on avg3remove columnsFigure4:Transformation summaryof ordered products with product ID 
and(parenthesized)quantity for each.Sample contents of small source tables are shown in Figures1and2.Suppose an analyst wants to build a warehouse table listing computer products that had a significant sales jump in the last quarter:the last quarter sales were more than twice the average sales for the preceding three quarters.A table SalesJump is defined in the data warehouse for this purpose.Figure3shows how the contents of table SalesJump can be specified using a transformation graph G with inputs Order and Product.G is a directed acyclic graph composed of the following seven transformations:•T1splits each input order according to its product list into multiple orders,each with a single ordered product and quantity.The output has schema order-id,cust-id,date,prod-id,quantity .•T2filters out products not in the computer category.•T3effectively performs a relational join on the outputs from T1and T2,with T1.prod-id=T2.prod-id and T1.date occurring in the period of T2.valid.T3also drops attributes cust-id and category,so the output has schema order-id,date,prod-id,quantity,prod-name,price,valid .•T4computes the quarterly sales for each product.It groups the output from T3by prod-name,computes the total sales for each product for the four previous quarters,and pivots the results to output a table with schema prod-name,q1,q2,q3,q4 ,where q1–q4are the quarterly sales.•T5computes from the output of T4the average sales of each product in thefirst three quarters.The output schema is prod-name,q1,q2,q3,avg3,q4 ,where avg3is the average sales(q1+q2+q3)/3.•T6selects those products whose last quarter’s sales were greater than twice the average of the preceding three quarters.•T7performs afinal projection to output SalesJump with schema prod-name,avg3,q4 .Figure4summarizes the transformations in G.Note that some of these transformations(T2,T5,T6,and T7) could be expressed as standard relational operations,while others(T1,T3,and T4)could not.As a simple lineage example,for the data in Figures1and2the warehouse table SalesJump contains tuple t= Sony VAIO,11250,39600 ,indicating that the sales of V AIO computers jumped from an average of11250in thefirst three quarters to39600in the last quarter.An analyst may want to see the relevant detailed information by tracing the lineage of tuple t,that is,by inspecting the original input data items that produced ing the techniques to be developed in this paper,from the source data in Figures1and2the analyst will be presented with the lineage result in Figure5.Orderorder-id dateAAA333(10),222(10)CCC222(5),333(5)DDD222(10)BBB222(10),333(10)Productprod-id category valid222computer12/1/1998–9/30/1999 Sony V AIO1980)I(= OXaabY2−1IFigure6:A transformation instanceonly one),as opposed to the entire input data set.Given a transformation instance T(I)=O and an output item o∈O,we call the actual set I∗⊆I of input data items that contributed to o’s derivation the lineage of o,and we denote it as I∗=T∗(o,I).The lineage of a set of output data items O∗⊆O is the union of the lineage of each item in the set:T∗(O∗,I)=o∈O∗T∗(o,I).A detailed definition of data lineage for different types of transformations will be given in Section3.Knowing something about the workings of a transformation is important for tracing data lineage—if we know nothing,any input data item may have participated in the derivation of an output item.Let us consider an example.Given a transformation T and its instance T(I)=O in Figure6,the lineage of the output item a,2 depends on T’s definition,as we will 
illustrate.Suppose T is a transformation thatfilters out input items with a negative Y value(i.e.,T=σY≥0in relational algebra).Then the lineage of output item o= a,2 should include only input item a,2 .Now,suppose instead that T groups the input data items based on their X values and computes the sum of their Y values multiplied by2(i.e.,T=αX,2∗sum(Y)as Y in relational algebra,whereαperforms grouping and aggregation).Then the lineage of output item o= a,2 should include input items a,−1 and a,2 ,because o is computed from both of them.We will refer back to these two transformations later(along with our earlier examples from Section1.2),so let us call thefirst one T8and the second one T9.Given a transformation specified as a standard relational operator or view,we can define and retrieve the exact data lineage for any output data item using the techniques introduced in[CWW00].On the other hand,if we know nothing at all about a transformation,then the lineage of an output item must be defined as the entire input set.In reality transformations often lie between these two extremes—they are not standard relational operators,but they have some known structure or properties that can help us identify and trace data lineage.The transformation properties we will consider often can be specified easily by the transformation author, or they can be inferred from the transformation definition(as relational operators,for example),or possibly even“learned”from the transformation’s behavior.In this paper,we do not focus on how properties are specified or discovered,but rather on how they are exploited for lineage tracing.3Lineage Tracing Using Transformation PropertiesWe consider three overall kinds of properties and provide algorithms that trace data lineage using these prop-erties.First,each transformation is in a certain transformation class based on how it maps input data items to output items(Section3.1).Second,we may have one or more schema mappings for a transformation, specifying how certain output attributes relate to input attributes(Section3.2).Third,a transformation may be accompanied by a tracing procedure or inverse transformation,which is the best case for lineage tracing (Section3.3).When a transformation exhibits many properties,we determine the best one to exploit for lin-O I(a) dispatcher I O (b) aggregatorI O (c) black−boxFigure 7:Transformation classes eage tracing based on a property hierarchy (Section 3.4).We also discuss nondeterministic transformations (Section 3.5),and how indexes can be used to further improve tracing performance (Section 3.6).3.1Transformation ClassesIn this section,we define three transformation classes:dispatchers ,aggregators ,and black-boxes .For each class,we give a formal definition of data lineage and specify a lineage tracing procedure.We also consider several subclasses for which we specify more efficient tracing procedures.Our informal studies have shown that about 95%of the transformations used in real data warehouses are dispatchers,aggregators,or their compositions (covered in Sections 4–6),and a large majority fall into the more efficient subclasses.3.1.1DispatchersA transformation T is a dispatcher if each input data item produces zero or more output data items indepen-dently:∀I ,T (I )= i ∈IT ({i }).Figure 7(a)illustrates a dispatcher,in which input item 1produces output items 1–4,input item 3produces output items 3–6,and input item 2produces no output items.The lineage of an output item o according to a dispatcher T is defined as T ∗(o,I )={i ∈I 
|o ∈T ({i })}.A simple procedure TraceDS(T ,O ∗,I )in Figure 8can be used to trace the lineage of a set of output items O ∗⊆O according to a dispatcher T .The procedure applies T to the input data items one at a time and returns those items that produce one or more items in O ∗.2Note that all of our tracing procedures are specified to take a set of output items as a parameter instead of a single output item,for generality and also so tracing procedures can be composed when we consider transformation sequences (Section 4)and graphs (Section 6).Example 3.1(Lineage Tracing for Dispatchers)Transformation T 1in Section 1.2is a dispatcher,because each input order produces one or more output orders via T 1.Given an output item o = 0101,AAA ,2/1/1999,222,10 based on the sample data of Figure 2,we can trace o ’s lineage according to T 1using proce-dure TraceDS(T 1,{o },Order )to obtain T ∗1(o,Order )={ 0101,AAA ,2/1/1999,“333(10),222(10)” }.Transformations T 2,T 5,T 6,and T 7in Section 1.2and T 8in Section 2.2all are dispatchers,and we can simi-larly trace data lineage for them.procedure TraceDS(T,O∗,I)I∗←∅;for each i∈I doif T({i})∩O∗=∅then I∗←I∗ {i}; return I∗;Figure8:Tracing procedure for dispatchers procedure TraceAG(T,O∗,I)L←all subsets of I sorted by size;for each I∗∈L in increasing order doif T(I∗)=O∗thenif T(I−I∗)=O−O∗then break;else L=all supersets of I∗sorted by size; return I∗;Figure9:Tracing procedure for aggregatorsTraceDS requires a complete scan of the input data set,and for each input item i it calls transformation T over{i}which can be very expensive if T has significant overhead(e.g.,startup time).In Section3.6we will discuss how indexes can be used to improve the performance of TraceDS.However,next we introduce a common subclass of dispatchers,filters,for which lineage tracing is trivial.Filters.A dispatcher T is afilter if each input item produces either itself or nothing:∀i∈I,T({i})={i} or T({i})=∅.Thus,the lineage of any output data item is the same item in the input set:∀o∈O, T∗(o)={o}.The tracing procedure for afilter T simply returns the traced item set O∗as its own lineage. 
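The dispatcher tracing procedure that Figure 8 gives in pseudocode can be rendered, for illustration only, as the following Python sketch, with a transformation modeled as a function from a frozenset of input items to a frozenset of output items (this modeling choice is an assumption of the sketch, not part of the paper):

def trace_ds(T, O_star, I):
    """TraceDS: the lineage of O* under a dispatcher T is every input item
    that, applied on its own, produces at least one traced output item."""
    O_star = frozenset(O_star)
    lineage = set()
    for i in I:
        if T(frozenset([i])) & O_star:   # does T({i}) intersect O*?
            lineage.add(i)
    return frozenset(lineage)

def trace_filter(O_star):
    """For a filter, each output item is its own lineage, so the traced
    set O* is simply returned."""
    return frozenset(O_star)

trace_filter here corresponds to the filter tracing procedure just described.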
It does not need to call the transformation T or scan the input data set,which can be a significant advantage in many cases(see Section4).Transformation T8in Section2.2is afilter,and the lineage of output item o= a,2 is the same item a,2 in the input set.Other examples offilters are T2and T6in Section1.2.3.1.2AggregatorsA transformation T is an aggregator if T is complete(defined momentarily),and for all I and T(I)= O={o1,...,o n},there exists a unique disjoint partition I1,...,I n of I such that T(I k)={o k}for k= 1..n.I1,...,I n is called the input partition,and I k is o k’s lineage according to T:T∗(o k,I)=I k.A transformation T is complete if each input data item always contributes to some output data item:∀I=∅, T(I)=∅.Figure7(b)illustrates an aggregator,where the lineage of output item1is input items{1,2},the lineage of output item2is{3},and the lineage of output item3is{4,5,6}.Transformation T9in Section2is an aggregator.The input partition is I1={ a,−1 , a,2 },I2= { b,0 },and the lineage of output item o= a,2 is I1.Among the transformations in Section1.2,T4,T5, and T7are aggregators.Note that transformations can be both aggregators and dispatchers(e.g.,T5and T7in Section1.2).We will address how overlapping properties affect lineage tracing in Section3.4.To trace the lineage of an output subset O∗according to an aggregator T,we can use the procedure TraceAG(T,O∗,I)in Figure9that enumerates subsets of input I.It returns the unique subset I∗such that I∗produces exactly O∗,i.e.,T(I∗)=O∗,and the rest of the input set produces the rest of the output set,i.e., T(I−I∗)=O−O∗.During the enumeration,we examine the subsets in increasing size.If wefind a subset I such that T(I )=O∗but T(I−I )=O−O∗,we then need to examine only supersets of I ,which can reduce the work significantly.TraceAG may call T as many as2|I|times in the worst case,which can be prohibitive.We introduce two common subclasses of aggregators,context-free aggregators and key-preserving aggregators,which allow us to apply much more efficient tracing procedures.procedure TraceCF(T,O∗,I)I∗←∅;pnum←0;for each i∈I doif pnum=0then I1←{i};pnum←1;continue;for(k←1;k≤pnum;k++)doif|T(I k∪{i})|=1then I k←I k∪{i};break;if k>pnum then pnum←pnum+1;I pnum←{i}; for k←1..pnum doif T(I k)⊆O∗then I∗←I∗∪I k;return I∗;Figure10:Tracing proc.for context-free aggregators procedure TraceKP(T,O∗,I)I∗←∅;for each i∈I doifπkey(T({i}))⊆πkey(O∗)then I∗←I∗ {i}; return I∗;Figure11:Tracing proc.for key-preserving aggrs. Context-Free Aggregators.An aggregator T is context-free if any two input data items either always belong to the same input partition,or they always do not,regardless of the other items in the input set.In other words, a context-free aggregator determines the partition that an input item belongs to based on its own value,and not on the values of any other input items.All example aggregators we have seen are context-free.As an example of a non-context-free aggregator,consider a transformation T that clusters input data points based on their x-y coordinates and outputs some aggregate value of items in each cluster.Suppose T specifies that any two points within distance d from each other must belong to the same cluster.T is an aggregator,but it is not context-free,since whether two items belong to the same cluster or not may depend on the existence of a third item near to both.We specify lineage tracing procedure TraceCF(T,O∗,I)in Figure10for context-free aggregators. 
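The subset enumeration performed by TraceAG can likewise be sketched in Python, for illustration only and without the superset-pruning refinement of Figure 9 (the transformation is again modeled as a function over frozensets, which is an assumption of the sketch):

from itertools import combinations

def trace_ag(T, O_star, I):
    """Brute-force aggregator tracing: find the unique subset I* of I such
    that T(I*) = O* and T(I - I*) = O - O*.  Exponential in |I| in the
    worst case."""
    I = frozenset(I)
    O_star = frozenset(O_star)
    rest_target = frozenset(T(I)) - O_star
    for size in range(len(I) + 1):
        for chosen in combinations(I, size):
            I_star = frozenset(chosen)
            if frozenset(T(I_star)) == O_star and frozenset(T(I - I_star)) == rest_target:
                return I_star
    return I   # conservative fallback: the entire input set

TraceCF, by contrast, avoids enumerating subsets and exploits context-freeness to build the input partition directly.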
This procedurefirst scans the input data set to create the partitions(which we could not do linearly if the ag-gregator were not context-free),then it checks each partition tofind those that produce items in O∗.TraceCF reduces the number of transformation calls to|I2|+|I|in the worst case,which is a significant improvement. Key-Preserving Aggregators.Suppose each input item and output item contains a unique key value in the relational sense,denoted i.key for item i.An aggregator T is key-preserving if given any input set I and its input partition I1,...,I n for output T(I)={o1,...,o n},all subsets of I k produce a single output item with the same key value as o k,for k=1..n.That is,∀I ⊆I k:T(I )={o k}and o k.key=o k.key.Theorem3.2All key-preserving aggregators are context-free.Proof:See Appendix A.1.All example aggregators we have seen are key-preserving.As an example of a context-free but non-key-preserving aggregator,consider a relational groupby-aggregation that does not retain the grouping attribute.TraceKP(T,O∗,I)in Figure11traces the lineage of O∗according to a key-preserving aggregator T. It scans the input data set once and returns all input items that produce output items with the same key as items in O∗.TraceKP reduces the number of transformation calls to|I|,with each call operating on a single input data item.We can further improve performance of TraceKP using an index,as discussed in Section3.6.3.1.3Black-box TransformationsAn atomic transformation is called a black-box transformation if it is neither a dispatcher nor an aggregator, and it does not have a provided lineage tracing procedure(Section3.3).In general,any subset of the input items may have been used to produce a given output item through a black-box transformation,as illustrated in Figure7(c),so all we can say is that the entire input data set is the lineage of each output item:∀o∈O, T∗(o,I)=I.Thus,the tracing procedure for a black-box transformation simply returns the entire input I.As an example of a true black-box,consider a transformation T that sorts the input data items and attaches a serial number to each output item according to its sorted position.For instance,given input data set I= { f,10 , b,20 , c,5 }and sorting by thefirst attribute,the output is T(I)={ 1,b,20 , 2,c,5 , 3,f,10 }, and the lineage of each output data item is the entire input set I.Note that in this case each output item,in particular its serial number,is indeed derived from all input data items.3.2Schema MappingsSchema information can be very useful in the ETL process,and many data warehousing systems require transformation programmers to provide some schema information.In this section,we discuss how we can use schema information to improve lineage tracing for dispatchers and aggregators.Sometimes schema informa-tion also can improve lineage tracing for a black-box transformation T,specifically when T can be combined with another non-black-box transformation based on T’s schema information(Section4).A schema specifi-cation may include:input schema A= A1,...,A p ,and input key A key⊆Aoutput schema B= B1,...,B q ,and output key B key⊆BThe specification also may include schema mappings,defined as follows.Definition3.3(Schema Mappings)Consider a transformation T with input schema A and output schema B.Let A⊆A and B⊆B be lists of input and output attributes.Let i.A denote the A attribute values of i,and similarly for o.B.Let f and g be functions from tuples of attribute values to tuples of attribute values. 
We say that T has a forward schema mapping f(A)T→B if we can partition any input set I into I1,...,I m based on equality of f(A)values,3and partition the output set O=T(I)into O1,...,O n based on equality of B values,such that m≥n and:1.for k=1..n,T(I k)=O k and I k={i∈I|f(i.A)=o.B for any o∈O k}.2.for k=(n+1)..m,T(I k)=∅.Similarly,we say that T has a backward schema mapping A T←g(B)if we can partition any input set I into I1,...,I m based on equality of A values,and partition the output set O=T(I)into O1,...,O n based on equality of g(B)values,such that m≥n and:1.for k=1..n,T(I k)=O k and I k={i∈I|i.A=g(o.B)for any o∈O k}.2.for k=(n+1)..m,T(I k)=∅.When f(or g)is the identity function,we simply write A T→B(or A T←B).If A T→B and A T←B we write A T↔B.Although Definition3.3may seem cumbersome,it formally and accurately captures the intuitive notion of schema mappings(certain input attributes producing certain output attributes)that transformations do fre-quently exhibit.Example3.4Schema information for transformation T5in Section1.2can be specified as: Input schema and key:A= prod-name,q1,q2,q3,q4 ,A key= prod-nameOutput schema and key:B= prod-name,q1,q2,q3,avg3,q4 .B key= prod-nameSchema mappings: prod-name,q1,q2,q3,q4 T5↔ prod-name,q1,q2,q3,q4f( q1,q2,q3 )T5→ avg3 ,where f( a,b,c )=(a+b+c)/3Theorem3.5Consider a transformation T that is a dispatcher or an aggregator,and consider any instance T(I)=O.Given any output item o∈O,let I∗be o’s lineage according to the lineage definition for T’s transformation class in Section3.1.If T has a forward schema mapping f(A)T→B,then I∗⊆{i∈I|f(i.A)=o.B}.If T has a backward schema mapping A T←g(B),then I∗⊆{i∈I|i.A=g(o.B)}. Proof:See Appendix A.2.Based on Theorem3.5,when tracing lineage for a dispatcher or aggregator,we can narrow down the lineage of any output data item to a(possibly very small)subset of the input data set based on a schema mapping.We can then retrieve the exact lineage within that subset using the algorithms in Section3.1.For example,consider an aggregator T with a backward schema mapping A T←g(B).When tracing the lineage of an output item o∈O according to T,we canfirstfind the input subset I ={i∈I|i.A=g(o.B)},then enumerate subsets of I using TraceAG(T,o,I )tofind o’s lineage I∗⊆I .If we have multiple schema mappings for T,we can use the intersection of the subsets for improved tracing efficiency.Although the narrowing technique of the previous paragraph is effective,when schema mappings satisfy certain additional conditions,we obtain transformation properties that permit very efficient tracing procedures. Definition3.6(Schema Mapping Properties)Consider a transformation T with input schema A,input keyA key,output schema B,and output keyB key.1.T is a forward key-map(fkmap)if it is complete(∀I=∅,T(I)=∅)and it has a forward schemamapping to the output key:f(A)T→B key.2.T is a backward key-map(bkmap)if it has a backward schema mapping to the input key:A key T←g(B).3.T is a backward total-map(btmap)if it has a backward schema mapping to all input attributes:A T←g(B).Suppose that schema information and mappings are given for all transformations in Section1.2.Then all of the transformations except T4are backward key-maps;T2,T5,and T6are backward total-maps;T4,T5,and T7are forward key-maps.Theorem3.7(1)Allfilters are backward total-maps.(2)All backward total-maps are backward key-maps.(3)All backward key-maps are dispatchers.(4)All forward key-maps are key-preserving aggregators. 
Proof:See Appendix A.3.Theorem3.8Consider a transformation instance T(I)=O.Given an output item o∈O,let I∗be o’s lineage based on T’s transformation class as defined in Section3.1.procedure TraceFM(T,O∗,I) //let f(A)T→B keyI∗←∅;for each i∈I doif f(i.A)∈πBkey (O∗)then I∗←I∗ {i}; return I∗;procedure TraceBM(T,O∗,I)//let A key T←g(B)I∗←∅;for each i∈I doif i.A key∈πg(B)(O∗)then I∗←I∗ {i};return I∗;procedure TraceTM(T,O∗)//let A T←g(B)returnπg(B)(O∗); Figure12:Tracing procedures using schema mappings1.If T is a forward key-map with schema mapping f(A)T→B key,then I∗={i∈I|f(i.A)=o.B key}.2.If T is a backward key-map with schema mapping A key T←g(B),then I∗={i∈I|i.A key=g(o.B)}.3.If T is a backward total-map with schema mapping A T←g(B),then I∗={g(o.B)}.Proof:See Appendix A.4.According to Theorem3.8,we can use the tracing procedures shown in Figure12for transformations with the schema mapping properties specified in Definition3.6.For example,procedure TraceFM(T,O∗,I)per-forms lineage tracing for a forward key-map T,which by Theorem3.7also could be traced using procedure TraceKP of Figure11.Both algorithms scan each input item once,however TraceKP applies transforma-tion T to each item,while TraceFM applies function f to some attributes of each item.Certainly f is very unlikely to be more expensive than T,since T effectively computes f and may do other work as well;f may in fact be quite a bit cheaper.TraceBM(T,O∗,I)uses a similar approach for a backward key-map,and is usually more efficient than TraceDS(T,O∗,I)of Figure8for the same reasons.TraceTM(T,O∗) performs lineage tracing for a backward total-map,which is very efficient since it does not need to scan the input data set and makes no transformation calls.Example3.9Considering some examples from Section1.2:•T1is a backward key-map with schema mapping order-id T1←order-id.We can trace the lineage of an output data item o using TraceBM,which simply retrieves items in Order that have the same order-id as o.•T4is a forward key-map with schema mapping prod-name T4→prod-name.We can trace the lineage of an output data item o using TraceFM,which simply retrieves the input items that have the same prod-name as o.•T5is a backward total-map with prod-name,q1,q2,q3,q4 T5← prod-name,q1,q2,q3,q4 .We can trace the lineage of an output data item o using TraceTM,which directly constructs o.prod-name, o.q1,o.q2,o.q3,o.q4 as o’s lineage.In Section3.6we will discuss how indexes can be used to further speed up procedures TraceFM and TraceBM.3.3Provided Tracing Procedure or Transformation InverseIf we are very lucky,a lineage tracing procedure may be provided along with the specification of a transfor-mation T.The tracing procedure TP may require access to the input data set,i.e.,TP(O∗,I)returns O∗’s。

Generalized Additive Mixed Models


Generalized Additive Mixed Models

Initial exploratory analysis using scatter plots indicated a nonlinear dependence of the response on the predictor variables. To overcome these difficulties, Hastie and Tibshirani (1990) proposed generalized additive models (GAMs). GAMs are extensions of generalized linear models (GLMs) in which the linear predictor, related to the expected response through a link function, is modeled as a sum of smooth functions of the covariates. The terms of the model can be local smoothers or simple transformations with fixed degrees of freedom (e.g. Maunder and Punt 2004). In general the model has the structure

g(\mu_i) = X_i^*\theta + f_1(x_{1i}) + f_2(x_{2i}) + \dots

where \mu_i = E(Y_i) and Y_i has an exponential family distribution. Y_i is the response variable, X_i^* is a row of the model matrix for any strictly parametric model components, \theta is the corresponding parameter vector, and the f_j are smooth functions of the covariates x_j.

In regression studies, the coefficients tend to be treated as fixed. However, there are cases in which it makes sense to assume that some coefficients are random, typically when the main interest is to make inferences about the entire population from which some levels are randomly sampled. A model with both fixed and random effects (a so-called mixed effects model) is then more appropriate. In the present study, observations were collected from the same individuals over time, so it is reasonable to assume that correlations exist among observations from the same individual; we therefore used generalized additive mixed models (GAMMs) to investigate the effects of covariates on movement probabilities. All models had the probability of inter-island movement obtained from the BBMM as the dependent term, the covariates (SST, month, chlorophyll concentration, maturity stage, and wave energy) as fixed effects, and the individual tagged shark as the random effect. The GAMM used in this study had Gaussian errors and an identity link function, and is given as

y_i = X_i\beta + f_1(x_{1i}) + \dots + f_q(x_{qi}) + Z_i b + \epsilon_i

where each f_k, k = 1, \dots, q, is an unknown centered smooth function of the k-th covariate, b is a vector of random effects following b \sim N(0, \psi), and \epsilon_i is the Gaussian error term. All models were implemented using the mgcv (GAM) and nlme (GAMM) packages in R (Wood 2006, R Development Core Team 2011).

Spatially dependent or environmental data may be autocorrelated, and using models that ignore this dependence can lead to inaccurate parameter estimates and inadequate quantification of uncertainty (Latimer et al., 2006). In the present GAMM models, we examined spatial autocorrelation by regressing consecutive residuals against each other and testing for a significant slope; if autocorrelation were present, there should be a linear relationship between consecutive residuals. The results of these regressions showed no evidence of autocorrelation.

Predictor terms used in the GAMMs (predictor: type; description; values):
- Sea surface temperature: continuous; monthly average SST in each grid cell; 20.7-27.5 °C
- Chlorophyll a: continuous; monthly average chlorophyll a concentration in each grid cell; 0.01-0.18 mg m-3
- Wave energy: continuous; monthly average wave energy in each grid cell; 0.01-1051.2 kW m-1
- Month: categorical; month in which the utilization distribution was generated; January to December (1-12)
- Maturity stage: categorical; maturity stage of the shark; mature male TL > 290 cm, mature female TL > 330 cm

Distribution of residuals and model diagnostics

The process of statistical modeling involves three distinct stages: formulating a model, fitting the model to data, and checking the model.
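The two checks used in this section (the lag-1 regression of consecutive residuals used to test for autocorrelation, and the residual histograms and Q-Q plots discussed next) can be sketched as follows. The analyses themselves were run in R with mgcv and nlme; this is only an illustrative Python sketch, and the residual vector passed in is assumed to come from the fitted model.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def check_gamm_residuals(residuals, alpha=0.05):
    """Lag-1 autocorrelation check plus simple normality diagnostics."""
    r = np.asarray(residuals, dtype=float)

    # Regress each residual on the preceding one; a significant slope
    # indicates serial correlation among the residuals.
    fit = stats.linregress(r[:-1], r[1:])
    autocorrelated = fit.pvalue < alpha

    # Residual histogram and normal Q-Q plot, as used to judge model fit.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(r, bins=30)
    ax1.set_xlabel("Residuals")
    ax1.set_ylabel("Frequency")
    stats.probplot(r, dist="norm", plot=ax2)
    ax2.set_title("Q-Q plot")
    fig.tight_layout()

    return {"slope": fit.slope, "p_value": fit.pvalue,
            "autocorrelated": autocorrelated}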
The relative effect of each covariate x_j on the dependent variable of interest was assessed using the distribution of partial residuals. The relative influence of each factor was then assessed from values normalized with respect to the standard deviation of the partial residuals. The partial residual plots also contain 95% confidence intervals. In the present study we used the distribution of residuals and quantile-quantile (Q-Q) plots to assess model fit. The residual distributions from the GAMM analyses appeared normal for both males and females.

[Figures: residual histograms (frequency vs. residuals) and normal Q-Q plots (sample quantiles vs. theoretical quantiles) for males and females; images not reproduced here.]

References
Hastie, T.J., and R.J. Tibshirani. 1990. Generalized Additive Models. CRC Press, Boca Raton, FL.
Latimer, A.M., Wu, S., Gelfand, A.E., and Silander, J.A. 2006. Building statistical models to analyze species distributions. Ecological Applications, 16: 33–50.
Maunder, M.N., and A.E. Punt. 2004. Standardizing catch and effort: a review of recent approaches. Fisheries Research 70: 141–159.
Wood, S.N. 2006. Generalized Additive Models: An Introduction with R. CRC Press, Boca Raton, FL.

GCV (Generalized Cross-Validation)


Plan
1. Generals
2. What Regularization Parameter λ is Optimal?
   - Examples
   - Defining the Optimal λ
3. Generalized Cross-Validation
   - Cross Validation
   - Generalized Cross Validation (GCV)
   - Convergence Result

References
- Spline Models for Observational Data (1990) - Grace Wahba
- Optimal Estimation of Contour Properties by Cross-Validated Regularization (1989) - Behzad Shahraray, David Anderson
- Smoothing Noisy Data with Spline Functions (1979) - Peter Craven, Grace Wahba
Contents
- Generals
- What Regularization Parameter λ is Optimal?
- Generalized Cross-Validation
- Discussion

Generalized Cross Validation
Mårten Marcus
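For reference, the GCV criterion covered by these slides, as defined in Craven and Wahba (1979) listed above, chooses the regularization parameter λ to minimize

V(\lambda) = \frac{\frac{1}{n}\,\lVert (I - A(\lambda))\,y \rVert^{2}}{\left[\frac{1}{n}\,\operatorname{tr}\bigl(I - A(\lambda)\bigr)\right]^{2}}

where A(\lambda) is the influence ("hat") matrix of the regularized estimator, so that \hat{y} = A(\lambda)\,y. The notation follows the standard presentation and is assumed here, since only the outline of the deck is reproduced above.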

Washing Machine / Dryer Operating Instructions


123456H |+ "drying, iron dry, cupboard dry, fluff/finishedn (Container). (Filter)Empty the condensate container.Clean the fluff filter and/or air cooler under running water a Page 4/6.Emptying condensationEmpty container after each drying operation!1.Pull out condensate container keeping it horizontal.2.Pour out condensation.3.Always push container in fully until it clicks into place.If n (Container) flashes in the display panel a What to do if..., Page 10.Cleaning the fluff filterClean the fluff filter after each drying operation.1.Open the door, remove fluff from door/door area.2.Pull out and fold open the fluff filter.3.Remove the fluff (by wiping the filter with your hand).If the fluff filter is very dirty or blocked, rinse with warm water and dry thoroughly.4.Close and reinsert the fluff filter.Switching off the dryerTurn the programme selector to 0 (Off).Do not leave laundry in the dryer.Removing the laundryThe automatic anti-crease function causes the drum to move at specific intervals, the washing remains loose and fluffy for an hour (two hours if the additional S c (Reduced Ironing) function is also selected-depending on model ).... and adapt to individual requirementsNever start the dryer if it is damaged!Inform your after-sales service.Inspecting thedryer Sorting and loading laundryRemove all items from pockets.Check for cigarette lighters.The drum must be empty prior to loading.See programme overview on page 7.See also separate instructions for “Woollens basket” (depending on model)Your new dryerCongratulations - You have chosen a modern, high-quality Bosch domestic appliance.The condensation dryer is distinguished by its economical energy consumption.Every dryer which leaves our factory is carefully checked to ensure that it functions correctly and is in perfect condition.Should you have any questions, our after-sales service will be pleased to help.Disposal in an environmentally-responsible manner This appliance is labelled in accordance with European Directive 2012/19/EU concerning used electrical and electronic appliances (waste electrical and electronic equipment - WEEE). The guideline determines the framework for the return and recycling of used appliances as applicable throughout the EU.For further information about our products, accessories, spare parts and services, please visit: Intended usePreparing for installation, see Page 8Selecting and adjusting the programmeDryingCondensate container Control panelʋfor domestic use only,ʋonly to be used for drying fabrics that have beenwashed with water.This appliance is intended for use up to a maximum height of 4000 metres above sea level.Keep children younger than 3 years old away from the dryer.Do not let children make the cleaning andmaintenance work on the dryer without supervision.Do not leave children unsupervised near the dryer.Keep pets away from the dryer.The dryer can be operated by children 8 years old and older, by persons with reduced physical, sensory or mental abilities and by persons with insufficient experience or knowledge if they are supervised or have been instructed in its use by a responsible adult.Select the drying programme ...Press the (Start/Stop) button123Make sure your hands are dry. 
Hold the plug only.Connecting themains plugDryingInformation on laundry ...Labelling of fabricsFollow the manufacturer's care information.(c Drying at normal temperature.'c Drying at low temperature a also select V (Low Heat).)c Do not machine dry.Observe safety instructions without fail a Page 11!Do not tumble-dry the following fabrics for example:–Impermeable fabrics (e.g. rubber-coated fabrics).–Delicate materials (silk or curtains made from synthetic material) a they may crease –Laundry contaminated with oil.Drying tips–To ensure a consistent result, sort the laundry by fabric type and drying programme.–Always dry very small items (e.g. baby socks) together with large items of laundry (e.g. hand towel).–Close zips, hooks and eyelets, and button up covers.Tie fabric belts, apron strings, etc. together–Do not over-dry easy-care laundry a risk of creasing!Allow laundry to finish drying in the air.–Do not dry woolens in the dryer, only use to freshen them up a Page 7, /c Wool finish Programme (depending on model).–Do not iron laundry immediately after drying, fold items up and leave for a while a the remaining moisture will then be distributed evenly.–The drying result depends on the type of water used during washing. a Fine adjustment of the drying result a Page 5/6.–Machine-knitted fabrics (e.g. T-shirts or jerseys) often shrink the first time they are dried a do not use the +: Cupboard Dry plus programme.–Starched laundry is not always suitable for dryers a starch leaves behind a coating that adversely affects the drying operation.–Use the correct dosage of fabric softener as per the manufacturer's instructions when washing the laundry to be dried.–Use the timer programme for small loads a this improves the drying result.Environmental protection / Energy-saving tips–Before drying, spin the laundry thoroughly in the washing machine a the higher the spin speed the shorter the drying time will be (consumes less energy), also spin easy-care laundry.–Put in, but do not exceed, the maximum recommended quantity of laundry a programme overview a Page 7.–Make sure the room is well ventilated during drying.–Do not obstruct or seal up the air inlet.–Keep the air cooler clean a Page 6 “Care and cleaning”.Fine adjustment of the drying resultAdjustment of the levels of dryness1 x to the rightPress and hold V (Low Heat)and turn 5 x to the rightPress V (Low Heat) until the required level is reachedTurn to 0 (Off)Turn to 0 (Off)DrumAll buttons are sensitive and only need to be touched lightly.Only operate the dryer with the fluff filter inserted!Air inletFluff filterDrum interior light (depending on model)Maintenance flapProgramme end once lights up in the display.Interrupt programme removing or adding laundry.The drying cycle can be interrupted for a brief period so that laundry may be added or removed. The programme selected must then be resumed and completed.Never switch the dryer off before the drying process has ended.Drum and door may be hot!1.Open door, the drying process is interrupted.2.Load or remove laundry and close door.3.If required, select a new programme and additional functions.4.Press the (Start /Stop) button.Additional functionsProgramme selectorTime remainingDisplay panelSelect On/Off for a acoustic signal at end of programme.ʋ&(Buzzer)Reduced temperature for delicate fabrics 'that require a longer drying time;e.g. 
for polyacrylics, polyamide, elastane or acetate.ˎV (Low Heat)Reduces creasing and extends the anti-creasing phase once the program has ended.ˎS c (ReducedIroning)ContentsPageʋPreparation . . . . . . . . . . . . . . . . . . . . . .2ʋSetting the programmes . . . . . . . . . . . . .2ʋDrying . . . . . . . . . . . . . . . . . . . . . . . .3/4ʋInformation on laundry. . . . . . . . . . . . . . 5ʋFine adjustment of the drying result . .5/6ʋCare and cleaning . . . . . . . . . . . . . . . . .6ʋProgramme overview. . . . . . . . . . . . . . . .7ʋInstallation . . . . . . . . . . . . . . . . . . . . . . . .8ʋFrost protection / Transport. . . . . . . . . . .8ʋTechnical data . . . . . . . . . . . . . . . . . . . .9ʋOptional accessories. . . . . . . . . . . . . . . .9ʋWhat to do if... / After-sales service. . . .10ʋSafety instructions . . . . . . . . . . . . . . . .11Read these instructions and the separate Energy-saving mode instructions before operating the dryer.Observe the safety instructions on page 11.ˎh:min End of programme in 1*-24 hours (Press button several times if required)(*depending on the selected programme, e.g. duration 1:54h a 2h. Can always be selected to the next full hour.)Fine adjustment of the drying result The drying result (e.g. Cupboard Dry) can be adjusted over three levels (1 - max. 3) for the L Cottons ,I Easy-Care,L Mix and A Super Quick 40’ programmes a presetting = 0. After one of these programmes has been finely adjusted, the setting is retained for the others. Further information a Page 5/6.0, 1, 2, 3Fine adjustment of the drying resultCare and cleaningDryer housing, control panel, air cooler, moisture sensors–Wipe with a soft, damp cloth.–Do not use harsh cleaning agents and solvents.–Remove detergent and cleaning agent residue immediately.–During drying, water may collect between the door and seal. This does not affect your dryer's functions in any way.Clean the protective filter 5 - 6 times a year or if .(Filter) flashes after cleaning the fluff filter.Air cooler / Protective filterWhen cleaning, only remove the protective filter. Clean the air cooler behind the protective filter once a year.–Allow the dryer to cool.–Residual water may leak out, so place an absorbent towel underneath the maintenance door.1.Unlock the maintenance door.2.Open the maintenance door fully.3.Turn both locking levers towards each another.4.Pull out the protective filter/air cooler.Do not damage the protective filter or air cooler.Clean with warm water only. Do not use any hard or sharp-edged objects.5.Clean the protective filter/air cooler thoroughly,Allow to drip dry.6.Clean the seals.7.Re-insert the protective filter/air cooler,with the handle facing down.8.Turn back both locking levers.9.Close the maintenance door until the lock clicks into place.Moisture sensorsThe dryer is fitted with stainless steel moisture sensors. The sensors measure the level of moisture in the laundry. After a long period of operation, a fine layer of limescale may form on the sensors.1.Open the door and clean the moisture sensors with a damp spongewhich has a rough surface.Do not use steel wool or abrasive materials.L:00, L:01, L:02, L:03 are shown in sequenceShort signal when changing from L:03 to L:00, otherwise long signal.Page 11.Connect to an AC earthed socket. 
If in doubt, have the socket checked by an expert. The mains voltage and the voltage shown on the rating plate (see Page 9) must correspond. The connected load and the necessary fuse protection are specified on the rating plate; note the fuse protection of the socket. Make sure that the air inlet remains unobstructed and that the appliance stands on a clean, level surface. Do not operate the dryer if there is a danger of frost.

en  Instruction manual
Dryer WTE86363SN
If in doubt have the socket checked by an expert.The mains voltage and the voltage shown on the rating plate (a Page 9) must correspond.The connected load and necessary fuse protection are specified on the rating plate.Note the fuse protection of the socket.Make sure that the air inlet remains unobstructedClean and level press and hold selection then turn 3 x to the rightturn to 0(Off)setamperage off flashes33Do not operate the dryer if there is a danger of frost.en Instruction manualDryerWTE86363SN。

EPOCH-II High-Current Output Unit Manual


EPOCH-II ®High-Current Output UnitI 1000 VA of high current I Rugged, portable test setIUp to 187 Amperes maximum outputEPOCH-IIHigh-Current Output UnitDESCRIPTIONThe EPOCH-II ®is a high-current output unit designed to be controlled by the PULSAR ®in combination with the High-Current Interface Module to produce a rugged,portable high-current, high-volt/ampere test set. The EPOCH-II also is designed to be controlled by the EPOCH-10®Relay T est Set.PULSAR/EPOCH-II or EPOCH-10/EPOCH-II combination uses microprocessor-based, digitally synthesized sine wave generators and solid-state regulated power amplifiers to provide sinusoidal voltage and current outputs with precise control of the phase-angle relationships.These combinations will produce accurate test results even with a fluctuating power source or when testing nonlinear or highly saturable relays.An optional IEEE-488 GPIB interface transforms the EPOCH-10/EPOCH-II combination into a programmable automatic test system when used with an externalcontroller or computer. Amplitude and/or phase angle can be set to desired values, step changed, ramped, or pulsed to new values.APPLICATIONSPULSAR/EPOCH-II or EPOCH-10/EPOCH-II combination is designed to test both complex protective relays which require phase-shifting capability and simpler relays,including all overcurrent relays.The table above lists the different types of relays by device numbers, and the different combinations of a PULSAR or EPOCH-10s and an EPOCH-II required to test them.FEATURES AND BENEFITSI Output current source has a 5-minute duty cycle rating of 1000 volt-amperes.IFour output ranges at 0.01 ampere and two output ranges at 0.1 ampere are provided.IT ough steel, sealed enclosure provides a high shock and vibration resistance.ICompletely compatible with PULSAR via the High-Current Interface Module and the EPOCH-10 units.Device No.Relay TypesSpecifyOne EPOCH-II and PULSAR with High-Current Interface Module or one EPOCH-10One EPOCH-II and PULSAR with Interface Module or two EPOCH-10sInstantaneous Overcurrent up to 187 A at 1000 VADirectional Overcurrent up to 187 A at 1000 VA All of the above plus. . .Differential Distance (open-delta)50678721One EPOCH-II and PULSAR with Interface Modules or three EPOCH-10sDistance (wye)21Ground DirectionalOvercurrent up to 187 A at 1000 VA67NOvercurrent up to 187 A at 1000 VA 51 1981EPOCH-II High-Current Output UnitSPECIFICATIONSInputInput Voltage (specify one)115 V ±10%, 50/60 Hz, 30 A (at full rated output)OR230 V ±10%, 50/60 Hz, 20 AOutputOutput CurrentT o provide a variety of test circuit impedances, six output taps with two ranges are provided.High Range2.00 to 10.00 A at 100 V max3.00 to 15.00 A at 66.6 V max8.00 to 40.00 A at 25 V max10.00 to 50.00 A at 20 V max20.00 to 100.0 A at 10 V max34.00 to 170.0 A at 5.9 V maxOutput Power:1000 VALow Range2.00 to 10.00 A at 50 V max3.00 to 1 5.00 A at 33.3 V max8.00 to 40.00 A at 12.5 V max10.00 to 50.00 A at 10 V max20.00 to 100.0 A at 5 V max34.00 to 170.0 A at 2.95 V maxOutput Power:500 VAAccuracyT ypical:±0.5% of settingMaximum:±1.0% of settingAlarm will indicate when amplitude, phase angle, or waveform is in error.ResolutionFour ranges:0.01 AT wo ranges:0.l ADuty CycleFive minutes at full rated VA output. Fifteen minutes recovery time.Overrange CapabilityThe EPOCH-II has an overrange capability of +10% for each tap with a maximum output current of 187 A on the 170 A tap. 
DistortionLess than 1% typical, 3% max.Output of EPOCH-10/EPOCH-II CombinationFrequencyI Synchronized to input power sourceI60 Hz crystal-controlledI50 Hz crystal-controlled AccuracyI Synchronized, tracks input frequencyI±0.006 Hz for 60-Hz crystal control (±001%)I±0.005 Hz for 50-Hz crystal control (±0.01%)Output of PULSAR and Interface Module Connected to EPOCH-II The High-Current Interface Module provides a variable frequency signal to the Epoch-II High-Current Output Unit.Output frequency is continuously displayed for each channel with large, high-intensity LEDs with the following ranges:5.000 to 99.999 Hz100.01 to 999.99 HzFrequency Accuracy:±10 ppm at 23°C, ±2°C)Current Phase Angle ControlAngle is adjusted on the EPOCH-10 control unit by 4-digit, pushbutton control with large LED display of setting.Range:0.0 to 359.9°Resolution:0.1°Accuracy:less than ±0.3°typical, ±1.0°maxControl SectionThe control unit for the EPOCH-II high-current section is PULSAR (with the High-Current Interface Module), an EPOCH-I or EPOCH-10. Thus, the excellent operating and control features of PULSAR and the EPOCH-10 are used to control the EPOCH-II.Note:When the EPOCH-II is in use, the current output of the EPOCH-10 control unit is inoperative. All EPOCH-10s can control an EPOCH-II high-current section. When the high-current section of the EPOCH-II is not needed, the EPOCH-10 can be used independently or slaved together with other EPOCH units. ProtectionThe input line circuit is breaker-protected. The dc power supply is overcurrent-protected. In addition, overvoltage protection is provided on the input line circuit.The power amplifiers are forced-air cooled and are protected by thermal-overload sensors.Audio and visual alarms on the PULSAR and EPOCH-10 control units indicate whenever the current or potential outputs are overloaded.TemperatureOperating32 to 122°F (0 to 50°C)Reduced duty cycle above 113°F (45°C)Storage–13 to +158°F (–25°to +70°C)Dimensions7.6 H x 19.75 W x 21.6 D in.(193 H x 502 W x 549 D mm)Weight115 lb (52.3 kg)EPOCH-II with EPOCH-10 control unitEPOCH-II High-Current Output UnitUKArchcliffe Road Dover CT17 9EN EnglandT +44 (0) 1304 502101 F +44 (0) 1304 207342UNITED STATES4271 Bronze Way DallasTX75237-1017 USAT 800 723 2861 (USA only)T +1 214 330 3203F +1 214 337 3038OTHER TECHNICAL SALES OFFICESValley Forge USA, Toronto CANADA,Mumbai INDIA, Trappes FRANCE,Sydney AUSTRALIA, Madrid SPAINand the Kingdom of BAHRAIN.Registered to ISO 9001:2000 Reg no. Q 09290Registered to ISO 14001 Reg no. EMS 61597EPOCH_II_DS_en_V10The word ‘Megger’ is a registeredtrademark。

Stable Diffusion Advanced Syntax


In Stable Diffusion, advanced syntax can be used to control the quality and effect of image generation. Some commonly used advanced features are described below.

X/Y/Z plot: this can be enabled by selecting "X/Y/Z plot" in the lower-left corner of the txt2img/img2img interface. It creates a grid of images generated with different parameters. Use the X type and Y type fields to choose which parameters should vary across the rows and columns, and enter the corresponding values, separated by commas, in the X values / Y values fields. Integers, floating-point numbers, and ranges are supported (an illustrative configuration is sketched below).
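As an illustration (assuming the AUTOMATIC1111-style WebUI that this description appears to refer to; the particular axis names and values are chosen only for the example), a grid comparing sampling steps against guidance scales could be configured as:

X type:   Steps
X values: 10, 20, 30
Y type:   CFG Scale
Y values: 5, 7, 9

This renders a 3 x 3 grid: one image for every combination of the three step counts and the three CFG values, with all other settings held fixed, which makes it easy to see how each parameter affects the output.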

Advanced prompt usage: square brackets [] can be used for further syntax operations; for example, the prompt [a man holding an apple] is used to generate an image containing a scene of "a man holding an apple".

S/R (search/replace) mode: for more advanced operations, the S/R mode can replace a keyword in the prompt with another keyword. For example, starting from [a man holding an apple, 8k clean], "8k clean" in the prompt can be replaced with "a watermelon" (see the example below).
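As a concrete sketch of the search/replace idea (assuming the WebUI's X/Y/Z plot script with its "Prompt S/R" axis type, where the first value is the text searched for and each later value is substituted in turn):

Prompt:   a man holding an apple, 8k clean
X type:   Prompt S/R
X values: 8k clean, a watermelon, an orange

This would produce three images, generated from "a man holding an apple, 8k clean", "a man holding an apple, a watermelon", and "a man holding an apple, an orange", so the effect of swapping a single phrase can be compared directly.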

Sampler selection: different samplers can affect the look of the generated image, so choose a sampler appropriate to the specific situation.

Adjusting parameters: some generation parameters, such as resolution, number of steps, and temperature, can be adjusted to achieve better results.

Note that when using advanced syntax, the syntax rules and parameter settings need to be considered carefully to ensure that the quality and effect of the generated images meet expectations. It is also worth experimenting and exploring for each specific case to obtain the best generation results.


Minimal Perturbation in Dynamic SchedulingHani El Sakkout,Thomas Richards,and Mark WallaceAbstract.This paper describes an algorithm,unimodular probing,conceivedto optimally reconfigure schedules in response to a changing environ-ment.In the problems studied,resources may become unavailable,and scheduled activities may change.The total shift in the start andend times of activities should be kept to a minimum.This require-ment is captured in terms of a linear optimization function over lin-ear constraints.However,the disjunctive nature of many schedulingproblems impedes traditional mathematical programming approaches.The unimodular probing algorithm interleaves constraint program-ming and linear programming.The linear programming solver is ap-plied to a dynamically controlled subset of the problem constraints,to guarantee that the values returned are ing a repair strat-egy,these values are naturally integrated into the constraint program-ming search.We explore why the algorithm is effective and discussits applicability to a wider class of problems.It appears that otherproblems comprising disjunctive constraints and a linear optimiza-tion function may be suited to the algorithm.Unimodular probingoutperforms alternative algorithms on randomly generated bench-marks,and on a major airline application.1INTRODUCTIONIn this paper we tackle a dynamic variant of the classic schedulingproblem.The aim is to restore consistency in a schedule which hasbeen disrupted due to resource or activity changes.To minimize dis-ruption,the new schedule should differ minimally from the old one.Effective solutions to this NP-hard problem are theoretically andcommercially significant.Resource-constrained scheduling problemsgive rise to the NP-complete resource feasibility problem of planning(RFP)[3].A predecessor of our algorithm is a constraint satisfactionalgorithm proposed to solve the RFP.We are currently exploring theutility of using unimodular probing to guide the introduction of ac-tions in an incremental planner.However,work on the algorithm was originally driven by the com-mercial task of improving resource utilization.In resource utilizationproblems the number of activities is held constant while resourcesare removed to increase schedule efficiency.The task of the resched-uler is then to produce a consistent,minimally perturbed schedule.Inthis paper we define the features of the resource utilization problem,and use it to illustrate the deficiencies of constraint programmingand mathematical programming approaches.Unimodular probing isempirically compared with two other commonly used techniques onrandomly generated benchmark tests.Section2gives initial definitions,and a model for minimal per-turbation variants of the RFP.The following section outlines a con-straint model for this variant.Section4gives details of the unimod-Def.2:Resource Feasibility Problem(RFP)In the following,IN represents the set of natural numbers,and#the set cardinality function.An RFP is composed of:a set of possibly variable duration activities;for each,temporal start and end variables;a resource usage variable bounded by a capacity constraintIN,and a constraint equating it to the maximum resource overlap:I N #a set of linear inequalities that bound the temporal variables andimpose distance constraints between them.Bounding constraints are of the form,while distance constraints are of the form ,where:and IN.A solution to the RFP is an assignment to the start and end vari-ables satisfying the problem’s temporal and resource 
constraints.The algorithms compared in this paper are applied to the resource utilization problem.It is defined as follows:Def.3:Resource Utilization ProblemA resource utilization problem is a minimal perturbation problemwhere:is a CSP capturing the constraints of an RFP with an activity set and a resource bound;is a solution to;,such thatThisfinal problem is a good representative of dynamic scheduling problems with a linear perturbation cost function,because in non-trivial problems of this type,the required changes will lead to re-source contention,as discussed in the following section.3CONSTRAINT MODELThe RFP resource usage variable corresponds to the number of re-sources required(the maximum resource overlap over the temporal horizon).In fact,it is sufficient to count resource overlap at the start times of activities,since these are the points where increases in re-source usage occur.For each activity,a variable is introduced to count the number of overlapping resources at its start time.variables are defined in terms of Booleans.A Boolean is introduced for each pair of activities and.It is set when activity overlaps with the start of activity.1iff0otherwise(1)When linked with the corresponding,these Boolean variables link temporal and resource reasoning:(2)Each is bounded by the maximum resource capacity,via the constraints and.Initially,the start and time variables are notfixed and this is reflected in the range of values may take. By analyzing equation2,a search procedure may determine potential contention points and prioritize them.Once one is chosen,contention is relieved by forcing apart temporal variables(i.e.imposing new dis-tance constraints).Under certain conditions,this can cause the local consistency procedure to infer further distance constraints.Unimodular probing handles dynamic scheduling problems in a uniform manner using the above contention point strategy;tempo-ral inconsistencies(generated by imposing new temporal constraints) are transformed into resource contention[5].Resource utilization problems are thus representative of other dynamic scheduling prob-lems since they introduce resource contention directly.4UNIMODULAR PROBINGThe resource utilization problem shows up the deficiencies of con-straint programming(CP)and mathematical programming.CP algo-rithms are suited to exploring the disjunctive scheduling component, but local heuristics and consistency techniques are often not effec-tive at global optimization.Conversely,in mathematical program-ming the optimization function is usually central to the search,but disjunctive constraints are hard to satisfy.As a general strategy,unimodular probing counters these weak-nesses on specific problem classes.Mathematical programming,in the form of linear programming(LP),is integrated into the CP search. 
By contrast with previous algorithms combining CP with LP,the al-gorithm passes only a restricted subset of the linear constraints to the LP solver.This is to ensure that the solver-Primal or Dual Simplex -works on constraints that have a property known as total unimodu-larity(TU)[7].As detailed in Sect.4.1this property guarantees that the values returned by the LP algorithm are discrete.Instead of re-turning consistent solutions to a relaxed,non-discrete version of the discrete problem,LP is made to return values that are discrete but only partially consistent.In combination with CP,consistency is in-crementally restored.Applied to a sub-problem with TU,LP becomes a meaningful tool for solving discrete problems.In reality,resource scheduling prob-lems are often discrete since they have temporal granularity.The unimodular probing algorithm should be considered for con-straint optimization problems with the following features:A constraint subset with the TU property(the term easy set is laterused to describe this subset);A linear optimization function on some or all of the variables inthe easy set;Problem variables with discrete domain values.4.1Total UnimodularityTU problems are well known in the mathematical programming com-munity,and are components of many combinatorial optimization prob-lems.Ordinary networkflow problems,such as transportation,as-signment and minimum cost networkflow,have TU.The traveling salesman problem(TSP)is an example of a complex problem with an underlying“easy”networkflow component.The temporal con-straints of the RFP also have TU.In LP,non-unary constraints are represented by means of a con-straint matrix.The unary(bounding)constraints are handled sepa-rately,and do not violate TU when the bounds are integer.The nec-essary conditions for this property to hold in the matrix are not given here,but a more practical set of sufficient conditions are relayed[7].A constraint matrix may be configured in two ways.In thefirst, rows represent constraints,and columns represent variables,with the element of the matrix denoting the coefficient of a variable in constraint.In the dual configuration,the reverse holds,with rows corresponding to variables and columns to constraints.Brackets are used to distinguish the dual and primal versions of the matrix.Sufficient conditions:1.All variable coefficients are0,1,or-1,and all constants are integer.2.Two or less nonzero coefficients appear in each row(column).3.The columns(rows)of the matrix can be partitioned into two subsetsand such that(a)If a row(column)contains two nonzero coefficients with the same sign,one element is in each of the subsets.(b)If a row(column)contains two nonzero elements of opposite sign,bothelements are in the same subset.4.2The Abstract AlgorithmAn abstract version of the unimodular probing algorithm is given be-fore describing its application to the RFP.In unimodular probing, local consistency techniques and a repair algorithm work on the full set of problem constraints,while LP handles only a restricted but dynamically changing easy set.The local consistency techniques ap-plied depend on the specifics of the problem addressed by the al-gorithm.Section4.3gives the consistency methods selected for an algorithm designed to solve the RFP.At each search step the unimodular probing strategyfirst applies chosen local consistency procedures to the full set and then LP to the easy,TU set.The discrete solutions returned by LP,called unimodu-lar probes,satisfy the constraints in the easy set and are optimal with 
respect to the objective function.The algorithm identifies the con-straints in the full set violated by the latest unimodular probe.If there are no violated constraints then the unimodular probe is returned as the optimal solution.Otherwise,the algorithm selects a violated con-straint,and imposes an appropriate,new easy constraint.The new easy constraint corresponds to a repair,or a search decision that does not allow the variables of the violated constraint to hold the same val-ues in subsequent probes.A number of alternative repairs(new easy constraints)that reduce violation generally exist.The algorithm may backtrack through these choices in its search for optimality.Once a repair is selected,local consistency methods derive further easy con-straints that follow from this choice.In this way,as repair choices are made,easy constraints are incrementally added to the TU set un-til either the LP optimum satisfies all the constraints of the full set, or inconsistency is proven.Backtracking is initiated on inconsistency as in conventional depth-first search.Figure1gives the abstract search procedure.The search procedure utilizes four procedures,push cp,push lp to add and re-move constraints from the local consistency(CP)and LP constraint stores,implementing a depth-first search.It is assumed that before search all the problem constraints have been added to the CP con-straint store(including the Branch&Bound optimization cost bound), while initial and derived easy constraints have been added to LP’s.uni-modular search is then called with TRUE as the initial value for its argument new,representing thefirst(dummy)search decision.pusheasyThe form of easy constraint imposed is dependent on the problem class addressed.In the simplest case it is a unary constraint bounding orfixing a variable.filters the new constraints new cpnew returning new easy constraints lpnewfor addition to the LP constraint store.pushconstraints.select choice selects one repair from the set of alterna-tive repairs repair.The procedure recurses with this repair constraint, attempting to impose its negation on backtracking to search the alter-native search branch.Appropriate constraint store removal routines implement backtracking on failure.probingcp(new);if(cpnew FALSE)thenlpnew:=obtain constraints(new cpnew);lp:=pushconstraints(lp);if(conf)thenreturn lpelserepair:=select choice(conf,lp);lp:=unimodular search(repair);if(lp FALSE)thenreturn lpelse lp:=unimodular search(NOT(repair));if(lp FALSE)thenreturn lp;;;popcp;return FALSE;end unimodular searchfor each temporal variable.ranges over all temporal variablesand(activity durations may not befixed in the RFP).The abso-lute change in a temporal variable and is given by, where c represents the value given to in the previous solution.Thisexpression is captured by placing in the optimization function and adding the following linear delta constraints:(3)(4)Equation4does not satisfy the sufficient conditions for a TU set.The following lemma outlines a proof that Eqns.3&4do not destroythe TU property when added to a TU set.Lemma2:If is a set of constraints having TU thenhas TU,where is unconstrained in.The Pri-mal and Dual Simplex algorithms work by obtaining solutions at the vertices of the solution polyhedron,as defined by the linear inequali-ties of the problem.The TU property guarantees that the solutions at the vertices are discrete.The aim of the proof is to show that the new vertices created by the addition are also discrete.Let be a TU set.Let be extended to by adding a new 
vari-able and the two delta constraints3and4.adds a new dimen-sion to the polyhedron,and cuts it along the hyperplanesand.All existing vertices are projected in the dimen-sion onto these two hyperplanes.At the projected vertices takes the value of either or.Both values are discrete because is an integer constant and is discrete at the projected vertex since has TU.In addition,new vertices are added where the two new hyperplanes intersect at.But the hyperplane is a unary bounding constraint with an integer constant and as such guarantees the new vertices are discrete.Thus the extended set also has TU. Theorem:The temporal constraints of the RFP and the delta con-straints of its objective function constitute a TU set.Proof:By induction over delta constraint pair addition.Base Case:The temporal constraints have TU.Inductive Step:If has TU thenhas TU,where is unconstrained in.4.3.2The local consistency and the search heuristicsThe algorithm applies arc-B consistency propagation on the prob-lem’s constraints[10].An additional form of lookahead is appliedafter each decision;a minimum resource usage profile is built for af-fected portions of the schedule.Backtracking is initiated when the minimal usage exceeds the resource bound.Resource contention constraints are ordered according to the de-gree to which they are violated.The most violated constraint is cho-sen for repair,and a local heuristic is used to select the least con-straining temporal distance constraint viz.the one which yields the smallest reduction in the domains of its variables.This approach in-creases the chances offinding a good solution early.The heuristics and lookahead check are based on an earlier algo-rithm for solving the RFP[12].4.3.3Extending the TU setSince search decisions are made through the imposition of a newtemporal distance constraint,that means at every search node a at least one new temporal constraint preserving TU is added to the set. 
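The delta constraints referred to as Eqns. 3 and 4 do not survive in this copy of the text; the standard linearization of an absolute perturbation |s_i - c_i| introduces a variable d_i with d_i >= s_i - c_i and d_i >= c_i - s_i, which is consistent with the lemma above. The sketch below only illustrates that reformulation on an invented three-activity example; it is not the authors' implementation, and scipy's dual simplex merely stands in for the probing LP solver. Because the difference and delta constraints form a totally unimodular system with integer bounds, the optimal vertex returned is integral.

```python
# Minimal-perturbation rescheduling as a pure LP, following the delta-variable
# idea of Sect. 4.3.1.  The activity data and the imposed distance constraints
# are invented for illustration; scipy's dual simplex plays the role of the probe.
import numpy as np
from scipy.optimize import linprog

c_prev = np.array([0.0, 10.0, 20.0])        # start times in the old schedule
n = len(c_prev)

# Decision vector: [s1, s2, s3, d1, d2, d3]  (d_i models |s_i - c_i|)
objective = np.concatenate([np.zeros(n), np.ones(n)])

A_ub, b_ub = [], []

# New distance constraints forcing activities apart: s2 - s1 >= 15, s3 - s2 >= 5
for (i, j, gap) in [(0, 1, 15.0), (1, 2, 5.0)]:
    row = np.zeros(2 * n)
    row[i], row[j] = 1.0, -1.0              # s_i - s_j <= -gap
    A_ub.append(row)
    b_ub.append(-gap)

# Delta constraints:  d_i >= s_i - c_i  and  d_i >= c_i - s_i
for i in range(n):
    row = np.zeros(2 * n)
    row[i], row[n + i] = 1.0, -1.0          # s_i - d_i <= c_i
    A_ub.append(row)
    b_ub.append(c_prev[i])
    row = np.zeros(2 * n)
    row[i], row[n + i] = -1.0, -1.0         # -s_i - d_i <= -c_i
    A_ub.append(row)
    b_ub.append(-c_prev[i])

bounds = [(0, 100)] * n + [(0, None)] * n
res = linprog(objective, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=bounds, method="highs-ds")
print("new starts:", res.x[:n], "total perturbation:", res.fun)
```

On this toy instance the solver moves only the second activity, by five time units, which is the minimal total perturbation compatible with the imposed gaps, and the returned values are integral.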
As noted in Sect. 3, however, other temporal constraints may be derived from the subsequent constraint propagation sequence. The new temporal constraints are added to the TU set after each decision.

5 PERFORMANCE COMPARISONS

In our experience, the unimodular probing algorithm has been found to be substantially more effective than both an extended constraint programming algorithm and a mathematical programming approach, mixed integer programming (MIP), on a set of commercial aircraft utilization problems. However, the results given here are for approximately 1200 randomly generated resource utilization problems. In these, the density of temporal constraints was varied as well as the required resource reduction. Details of these and other benchmarks have been made available on the Internet for future comparisons.

Unimodular probing is compared with two algorithms. The first is an extension of a pure CP algorithm detailed in [12]. This algorithm, designated CP0, conducts the search in two search phases. In phase 1, CP0 reduces resource conflict by forcing apart temporal variables until resource conflicts are no longer possible. In phase 2, CP0 fixes the temporal variables to the values returned by an LP solver, which are optimal given the choices made in the first phase. The search is complete because the phase 1 choices may be backtracked. The second algorithm is a commercial MIP package, CPLEX. The default search settings are used, with only the resource Booleans declared as integer (see Sect. 3). Other settings were attempted on a representative sample of 200 problems for both the node and variable selection strategies, with minor or no improvement (3% in the best case). This confirms advice received from MIP practitioners that the default CPLEX settings are among the best for MIP search.

Table 1. Reduction in CP0 Solution Quality versus Density (table data not recovered from the source)
Figure 2. No. of LP Nodes versus No. of Activities
Table 2. CPU Seconds versus No. of Activities (MIP average CPU seconds; table data not recovered from the source)

The Charm Parton Content of the Nucleon


a rXiv:h ep-ph/71220v125Jan27February 2,2008MSU-HEP-070101The Charm Parton Content of the Nucleon J.Pumplin ∗a ,i a,b,c ,W.K.Tung a,b a Michigan State University,nsing,MI,USA b University of Washington,Seattle,WA,USA c Taipei Municipal University of Education,Taipei,Taiwan We investigate the charm sector of the nucleon structure phenomenologically,using the most up-to-date global QCD analysis.Going beyond the common assumption of purely radiatively generated charm,we explore possible degrees of freedom in the parton parameter space associated with nonperturbative (intrinsic)charm in the nucleon.Specifically,we explore the limits that can be placed on the intrinsic charm (IC)component,using all relevant hard-scattering data,according to scenarios in which the IC has a form predicted by light-cone wave function models;or a form similar to the light sea-quark distributions.We find that the range of IC is constrained to be from zero (no IC)to a level 2–3times larger than previous model estimates.The behaviors of typical charm distributions within this range are described,and their implications for hadron collider phenomenology are briefly discussed.Contents1Introduction2 2Charm Partons at the Scaleµ0≈m c3 3Global QCD Analysis with Intrinsic Charm5 4Comparison with Light Partons9 5The Charm Distribution at Various Energy Scales10 6Summary and Implications111IntroductionThe parton distribution functions(PDFs)that describe the quark and gluon structure of the nucleon at short distances are essential inputs to the calculation of all high energy processes at hadron colliders.They are therefore important for carrying out searches for New Physics,as well as for making precision tests of the Standard Model(SM),in regions where perturbative Quantum Chromodynamics(PQCD)theory is applicable.Although much progress has been made in the last twenty years in determining the PDFs by global QCD analysis of a wide range of hard processes,very little is known phenomenologically about the heavy quark (charm c and bottom b)and antiquark(¯c,¯b)content of the nucleon[1,2].Knowledge of the heavy quark components is inherently important as an aspect of the fundamental structure of the nucleon.In addition,the heavy quarks are expected to play an increasingly significant role in the physics programs of the Tevatron Run II and the Large Hadron Collider(LHC),since many new processes of interest,such as single-top production and Higgs production in the SM and beyond,are quite sensitive to the heavy quark content of the nucleon.Existing global QCD analyses extract the PDFs by comparing a wide range of hard-scattering data to perturbative QCD theory.In these analyses,one usually adopts the ansatz that heavy quark partons in the nucleon are“radiatively generated”,i.e.,they originate only from QCD evolution,starting from a null distribution at a factorization scale of approxi-mately the relevant quark mass.This is motivated,on the theoretical side,by the notion that heavy quark degrees of freedom should be perturbatively calculable;and on the practical side,by the lack of clearly identifiable experimental constraints on these degrees of freedom in existing data.Neither of those considerations justifies the ansatz,however—especially for charm,whose mass lies in between the soft and hard energy scales.In fact,many nonper-turbative models,particularly those based on the light-cone wave function picture,expect an“intrinsic charm”(IC)component of the nucleon at an energy scale comparable to m c, the mass of the charm quark.This IC 
component,if present at a low energy scale,will par-ticipate fully in QCD dynamics and evolve along with the other partons as the energy scale increases.It can therefore have observable consequences on physically interesting processes at high energies and short distances.With recent advances in the implementation of the general perturbative QCD formalism to incorporate heavy quark mass effects[3,4],and the availability of comprehensive precision data from HERA,the Tevatron,andfixed-target experiments,we are now in a position to study the charm content of the nucleon phenomenologically,with minimal model-dependent assumptions.This paper represents afirst systematic effort to perform this study and answer the following questions:(i)do current theory and experiment determine,or place useful limits on,the charm component of the nucleon at the scale of m c;(ii)if a non-vanishing charm distribution is allowed,can current global QCD analysis distinguish its shape between a form typical of light sea quarks(peaked at small x)and the form predicted in light-cone wavefunction models(concentrated at moderate and large x);and(iii)what are the implications of IC for the Tevatron and the LHC physics programs?2Charm Partons at the Scaleµ0≈m cLet f a(x,µ)denote the PDF of partonflavor a inside the proton at momentum fraction x and factorization scaleµ.At short distances,corresponding to largeµ,the scale-dependence of f a(x,µ)is governed by the QCD evolution equation,with perturbatively calculable evolu-tion kernals(splitting functions).Thus,the set of PDFs{f a(x,µ)}are fully determined once their functional form in x is specified at afixed scaleµ=µ0,providedµ0is large enough to be in the region where PQCD applies.In practice,µ0is usually chosen to be on the order1−2GeV,which is at the borderline between the short-distance(perturbative)and long-distance(nonperturbative)regions.For the gluon and the light quarks(a=g,u,d,s), the f a(x,µ0)are certainly nonperturbative in origin.They must be determined phenomeno-logically through global QCD analysis that compares the theoretical predictions with a wide range of experimental data on hard processes[4–6].In the case of charm and bottom quarks, we need to examine the situation more closely.In this paper,we shall focus in particular on charm.For convenience,we use the short-hand notation c(x,µ)≡f c(x,µ),and frequently omit the argumentµ,e c(x)in place of c(x,µ).Over the energy range of most interest to current high energy physics,the charm quark behaves as a parton,and is characterized by a PDF c(x,µ)that is defined forµ m c.It is common in global QCD analysis to consider charm a heavy quark,and to adopt the ansatz c(x,µ0=m c)=0as the initial condition for calculating c(x,µ)at higher energy scales by QCD evolution.This is the so-called radiatively generated charm scenario.In the context of global analysis,this ansatz implies that the charm parton does not have any independent degrees of freedom in the parton parameter space:c(x,µ)is completely determined by the gluon and light quark parton parameters.However,nature does not have to subscribe to this scenario.First,although m c(∼1.3GeV)is larger thanΛQCD(∼0.2−0.4GeV,depending on the number of effectiveflavors), it is actually of the same order of magnitude as the nucleon mass,which must certainly be considered as being of a nonperturbative scale.Secondly,the ansatz itself is ill-defined,since: (i)the initial condition c(x,µ0)=0depends sensitively on the choice ofµ0;and(ii)PQCD only suggests thatµ0be of the same order 
of magnitude as m c,but does not dictate any particular choice.Since theµ-dependence of c(x,µ)is relatively steep in the threshold region, the condition c(x,µ0)=0for a given choice ofµ0is physically quite different from that at a different choice ofµ0.Thirdly,many nonperturbative models give nonzero predictions for c(x,µ0)—again,for unspecifiedµ0∼m c[7–9].We therefore wish to study the nucleon structure in a formalism that allows for nonperturbative charm.To carry out a systematic study of the charm sector of nucleon structure,one needs:(i)ageneral global analysis framework that includes a coherent treatment of nonzero quark masses in PQCD;and(ii)comprehensive experimental inputs that have the potential to constrain the charm degrees of freedom.1Recent advances on both fronts make this study now possible. In the following,we extend the recent CTEQ6.5global analysis[4,12]to include a charm sector with its own independent degrees of freedom at the initial factorization scaleµ0=m c. We shall address the questions posed in the introduction by examining the results of global analyses performed under three representative scenarios.Thefirst two scenarios invoke the light-cone Fock space picture[13]of nucleon structure. In this picture,IC appears mainly at large momentum fraction x,because states containing heavy quarks are suppressed according to their off-shell distance,which is proportional to (p2⊥+m2)/x.Hence components with large mass m appear preferentially at large x.It has recently been shown that indeed a wide variety of light-cone models all predict similar shapes in x[9].The specific light-cone models we take as examples are the original model of Brodsky et al.[14](BHPS),and a model in which the intrinsic charm arises from virtual low-mass meson+baryon components such as1Earlier efforts[10,11]treated light partons and the IC component as dynamically uncoupled,and were done outside the framework of a global analysis.in Ref.[9]that the charm distributions in this model can be very well approximated byc(x)=A x1.897(1−x)6.095(2)¯c(x)=¯A x2.511(1−x)4.929(3) where the normalization constants A and¯A are determined by the requirement of the quark number sum rule 10[c(x)−¯c(x)]dx=0,which specifies A/¯A;and the overall magnitude of IC that is to be varied in our study.In contrast to these light-cone scenarios,we also examine a purely phenomenological scenario in which the shape of the charm distribution is sea-like—i.e.,similar to that of the lightflavor sea quarks,except for an overall mass-suppression.In this scenario,for simplicity, we assume c(x)=¯c(x)∝¯d(x)+¯u(x)at the starting scaleµ0=m c.In each of the three scenarios,the initial nonperturbative c(x)and¯c(x)specified above is used as input to the general-mass perturbative QCD evolution framework discussed in detail in[4].We then determine the range of magnitudes for IC that is consistent with our standard global analysisfit to data.This is described in the next Section.3Global QCD Analysis with Intrinsic CharmFor these globalfits,we use the same theoretical framework and experimental input data sets that were used for the recent CTEQ6.5analysis[4],which made the traditional ansatz of no IC.Notable improvements over previous CTEQ global analyses[5]are:(i)the theory includes a comprehensive treatment of quark mass effects in DIS according to the PQCD formalism of Collins[20];and(ii)the full HERA I data on Neutral Current and Charged Current total inclusive cross sections,as well as heavy quark production cross sections,are 
incorporated.Published correlated systematic errors are used wherever they are available. Fixed-target DIS,Drell-Yan,and hadron collider data that were used previously are also included[4,5].For each model of IC described in Sec.2,we carry out a series of globalfits with varying magnitudes of the IC component.From the results,we infer the ranges of the amount of IC allowed by current data within each scenario.It is natural to characterize the magnitude of IC by the momentum fraction x c+¯c carried by charm at our starting scale for evolutionµ=1.3GeV.This is just thefirst moment of the c+¯c momentum distribution:2x c+¯c= 10x[c(x)+¯c(x)]dx.(4)The quality of each globalfit is measured by a globalχ2global,supplemented by considera-tions of the goodness-of-fit to the individual experiments included in thefit.(The procedure has been fully described in[4,5].)The three curves in Fig.1showχ2global as a function of x c+¯c for the three models under consideration.Figure1:Goodness-of-fit vs.momentum fraction of IC at the starting scaleµ=1.3GeV for three models of IC:BHPS(solid curve);meson cloud(dashed curve);and sea-like(dotted curve).Round dots indicate the specificfits that are shown in Figs.2–4.We observefirst that in the lower range,0< x c+¯c∼0.01,χ2global varies very little, i.e.,the quality of thefit is very insensitive to x c+¯c in this interval.This means that the global analysis of hard-scattering data provides no evidence either for or against IC up to x c+¯c∼0.01.Beyond x c+¯c∼0.01,all three curves in Fig.1rise steeply with x c+¯c—globalfits do place useful upper bounds on IC!The upper dots along the three curves represent marginal fits in each respective scenario,beyond which the quality of thefit becomes unacceptable according to the procedure established in Refs.[4,5]—one or more of the individual experi-ments in the globalfit is no longerfitted within the90%confidence level.3This implies,the global QCD analysis rules out the possibility of an IC component much larger than0.02in momentum fraction.We note that the allowed range for x c+¯c is somewhat wider for the sea-like IC model. 
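As a small illustration of how the parametrized forms in Eqs. (2)-(3) are normalized, the sketch below fixes A/Abar from the quark-number sum rule and then sets the overall magnitude from a chosen momentum fraction, Eq. (4), using the analytic Beta-function moments of x^a(1-x)^b. The 0.96% target is simply the meson-cloud example value quoted for Fig. 3; the code is illustrative and is not part of the CTEQ fitting machinery.

```python
# Normalizing the meson-cloud parametrization of Eqs. (2)-(3).  Only the
# exponents are taken from the text; the target momentum fraction is an
# example value.
from scipy.special import beta

def moment(power, a, b):
    # integral over [0, 1] of x**power * x**a * (1 - x)**b
    return beta(a + power + 1.0, b + 1.0)

a_c, b_c = 1.897, 6.095        # c(x)    ~ A    * x**a_c  * (1-x)**b_c
a_cb, b_cb = 2.511, 4.929      # cbar(x) ~ Abar * x**a_cb * (1-x)**b_cb

# Quark-number sum rule: integral of [c(x) - cbar(x)] dx = 0 fixes A/Abar
ratio_A_over_Abar = moment(0, a_cb, b_cb) / moment(0, a_c, b_c)

# Fix the overall magnitude from the momentum fraction <x>_{c+cbar} of Eq. (4)
target_x = 0.0096
Abar = target_x / (ratio_A_over_Abar * moment(1, a_c, b_c) + moment(1, a_cb, b_cb))
A = ratio_A_over_Abar * Abar

print(f"A/Abar = {ratio_A_over_Abar:.4f}, A = {A:.4f}, Abar = {Abar:.4f}")
```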
This is understandable,because under this scenario,the charm component is more easily interchangeable with the other sea quark components,in its contribution to the inclusive cross sections that are used in the analysis;whereas the hard c and¯c components of the light-cone models are not easily mimicked by other sea quarks.The PDF sets that correspond to the three limiting cases(upper dots),along with threelower ones on the same curves that represent typical,more moderate,model candidates(lower dots),will be explored in detail next.BHPS model results:Figure2shows the charm distributions c(x)=¯c(x)at three factorization scales that arise from the BHPS model,along with results from the CTEQ6.5PDFs which have no IC.The short-dash curves correspond to the marginally allowed amountFigure2:Charm quark distributions from the BHPS IC model.The three panels correspondto scalesµ=2,µ=5,andµ=100GeV.The long-dash(short-dash)curve correspondsto x c+¯c=0.57%(2.0%).The solid curve and shaded region show the central value anduncertainty from CTEQ6.5,which contains no IC.of IC( x c+¯c=0.020)indicated by the upper dot on the BHPS curve in Fig.1.The long-dash curves correspond to IC that is weaker by a factor of∼3,indicated by the lower dot( x c+¯c=0.0057)on the BHPS curve in Fig.1.This point corresponds to the traditionalestimate of1%IC probability in the BHPS model,i.e., 10c(x)dx= 10¯c(x)dx=0.01,at the starting scaleµ0=1.3GeV.This physically motivated light-cone model estimate thuslies well within the phenomenological bounds set by our global analysis.We see that at low factorization scales,this model produces a peak in the charm distri-bution at x≈0.3.This peak survives in the form of a shoulder even at a scale as large asµ=100GeV.At that scale,IC strongly increases c(x)and¯c(x)above the gluon splittingcontribution at x>0.1,while making a negligible contribution at x<0.1.Meson Cloud model results:Figure3shows the charm distributions that arise from the D0Λ+c meson cloud model,together with the results from CTEQ6.5which has no IC.In the meson cloud model,the charm(c(x))and anti-charm(¯c(x))distributions are different.The short-dash(short-dash-dot)curves correspond to the maximum amount of IC c(x)(¯c(x))that is allowed by the data( x c+¯c=0.018),while the long-dash(long-dash-dot)curvesshow a smaller amount( x c+¯c=0.0096),and the shaded region shows CTEQ6.5which hasno IC.Again we see that IC can substantially increase the charm PDFs at x>0.1,even ata large factorization scale.Figure3:Same as Fig.2,except for the meson cloud model.The long-dash(short-dash) curves correspond to x c+¯c=0.96%(1.9%).The difference between c and¯c due to IC is seen to be potentially quite large.The sign of the difference is such that¯c(x)>c(x)for x→1,as explained in Ref.[9].Experimental evidence for c(x)=¯c(x)may be difficult to obtain;but it is worth considering because it would provide an important constraint on the nonperturbative physics—as well as supplying a direct proof of intrinsic charm,since c¯c pairs produced by gluon splitting are symmetric to NLO.4Sea-like model results:Figure4shows the charm distributions that arise from a model in which IC is assumed to have the same shape in x as the light-quark sea¯u(x)+¯d(x)at the starting scaleµ0=1.3GeV.The short-dash curve again corresponds to the maximum amount of IC of this type that is allowed by the data( x c+¯c=0.024),while the long-dash curve shows an intermediate amount( x c+¯c=0.011),and the shaded region shows CTEQ6.5which has no IC.Figure4:Same as Fig.2,except for the 
sea-like scenario.The long-dash(short-dash)curves correspond to x c+¯c=2.4%(1.1%).We see that IC in this sea-like form can increase the charm PDFs over a rather largeregion of moderate x.As a result,it may have important phenomenological consequences for hard processes that are initiated by charm.4Comparison with Light PartonsIn this section,we compare the charm content of the nucleon with the otherflavors at various hardness scales.Figure5shows the charm distribution with no IC,or IC with shape given by the BHPS model with the two strengths discussed previously,compared to gluon,light quark,and light antiquark distributions from the CTEQ6.5bestfit.By comparing the three panels of thisfigure,one can recognize the standard characteristics of DGLAP evolution:at increasing scale,the PDFs grow larger at small x and smaller at large x;while the differences between q and¯q,and the differences betweenflavors,all get smaller.Thisfigure shows that the light-cone form for IC has negligible effects for x<0.05, while it can make c(x)and¯c(x)larger than any of¯u(x),¯d(x),s(x),and¯s(x)for x>0.2. Similar results hold for the meson-cloud model(not shown),since the essential basis is the large-x behavior that is characteristic of the light-cone picture.Figure5:Comparison of charm with otherflavors:u,¯u(long-dash,long-dash-dot);d,¯d (short-dash,short-dash-dot);s=¯s(dash-dot-dot),g(dot).The solid curves are c=¯c with no IC(lowest)or the two magnitudes of IC in the BHPS model that are discussed in the text.Figure6similarly shows the charm content for the scenarios of no IC,or IC with a “sea-like”shape for the two normalizations discussed previously.In this case,unlike the light-cone forms,charm remains smaller than the other sea quarks,including s and¯s,at all values of x.However,thisfigure shows that IC can raise the c and¯c distributions by up to a factor of∼2above their traditional radiative-only estimates,so it can have an important effect on processes that are initiated by charm.Figures5–6also show that in every scenario,c(x)and¯c(x)remain small compared to u,d,and g.Figure6:Same as Fig.5,but for sea-like IC.5The Charm Distribution at Various Energy Scales The effect of evolution of the charm distributions in each scenario can be seen by comparing the three panels within each of Figs.2–4.But to show the evolution in more detail,we plot the x-dependence at scalesµ=1.3,2.0,3.16,5,20,100GeV together on the samefigure. In place of the logarithmic scale in x,we use a scale that is linear in x1/3in order to display the important large-x region more clearly.Figure7:Charm distributions at scaleµ=1.3(solid),2.0,3.16,5,20,100GeV(dotted). Left:no IC;center:BHPS model with maximal IC consistent with experiment;right:sea-like model with maximal IC consistent with experiment.The left panel of Fig.7has no IC;the center panel is the BHPS model(maximal level); and the right panel is the sea-like IC model(maximal level).In all cases,as the scale increases,the charm distribution becomes increasingly soft as the result of QCD evolution. 
As one would expect,the IC component dominates at low energy scales.The radiatively generated component(coming mainly from gluon splitting)increases rapidly in importance with increasing scale.A two-component structure appears in these plots at intermediate scales,sayµ=5GeV,where IC is dominant at large x and radiatively generated is dominant at small x.As noted previously,even at a rather high energy scale ofµ=100GeV the IC component is still very much noticeable in the light-cone wave function models.In all cases,the intrinsic component is quite large in magnitude compared to the purely radiatively generated case in the region,say,x>0.1.Thus,the existence of IC would have observable consequences in physical processes at future hadron colliders that depend on the charm PDF in this kinematic region.6Summary and ImplicationsAs a natural extension of the CTEQ6.5global analysis[4,12],we have determined the range of magnitudes for intrinsic charm that is consistent with an up-to-date global QCD analysis of hard-scattering data,for various plausible assumptions on the shape of the x-distribution of IC at a low factorization scale.For shapes suggested by light-cone models,wefind that the global analysis is consistent with anywhere from zero IC up to∼3times the amount that has been estimated in more model-dependent studies.In these models,there can be a large enhancement of c(x)and ¯c(x)at x>0.1,relative to previous PDF analyses which assume no IC.The enhancement persists to scales as large asµ∼100GeV,and can therefore have an important influence on charm-initiated processes at the Tevatron and rge differences between c(x)and ¯c(x)for x>0.1are also natural in some models of this type.For an assumed shape of IC similar to other sea quarks at low factorization scale, there can also be a significant enhancement of c(x)and¯c(x)relative to traditional no-IC analyses.In this case,the enhancement is spread more broadly in x,roughly over the region 0.01<x<0.50.What experimental data could be used to pin down the intrinsic charm contribution? An obvious candidate would be c and b production data from HERA;but these are already included in the analysis,and they don’t have very much effect because the errors are rather large and because the data are mostly at small x where the dominant partonic subprocess is γg→c¯c rather thanγc→cX.It may be possible to probe the subprocessγc→cX more effectively by measuring specific angular differential distributions—see[22].Future hadron collider measurements could place direct constraints on the charm PDF. 
For instance,in principle,the partonic process g+c→γ/Z+c is directly sensitive to the initial state charm distribution.The experimental signature would be Z plus a tagged charm jet in thefinal state.Such measurements would be experimentally very challenging, but potentially important[23,24].The proposed future ep colliders eRHIC[25]and LHeC [26]would also be very helpful for the study of heavy quark parton distributions.If IC is indeed present in the proton at a level on the order of what is allowed by the current data,it will have observable consequences in physical processes at future hadron colliders that depend on the charm PDF in the large x region.Interesting examples that have been proposed in the literature are:diffractive production of neutral Higgs bosons[27], and charged Higgs production at the LHC[28,29].Application of our results to the latterprocess will be presented in[12].The PDF sets with intrinsic charm that were used in this paper will be made available on the CTEQ webpage and via the LHAPDF standard,for use in predicting/analyzing experiments.They are designated by CTEQ6.5C.Acknowledgment We thank Joey Huston,Pavel Nadolsky,Daniel Stump,and C.-P.Yuan for useful discussions.This work was supported in part by the U.S.National Science Foun-dation under award PHY-0354838,and by National Science Council of Taiwan under grant 94(95)-2112-M-133-001.References[1]W.K.Tung,AIP Conf.Proc.753,15(2005)[hep-ph/0410139].[2]R.S.Thorne,“Parton distributions–DIS06,”[hep-ph/0606307].[3]R.S.Thorne,Phys.Rev.D73,054019(2006)[hep-ph/0601245],and references therein.[4]W.K.Tung,i,A.Belyaev,J.Pumplin,D.Stump and C.P.Yuan,[hep-ph/0611254].[5]J.Pumplin,D.R.Stump,J.Huston,i,P.Nadolsky and W.K.Tung,JHEP0207,012(2002)[hep-ph/0201195];D.Stump,J.Huston,J.Pumplin,W.K.Tung,i,S.Kuhlmann and J.F.Owens,JHEP0310,046(2003)[hep-ph/0303013].[6]R.S.Thorne,A.D.Martin and W.J.Stirling,[hep-ph/0606244],and references therein.[7]S.J.Brodsky,P.Hoyer,C.Peterson and N.Sakai,Phys.Lett.B93,451(1980).[8]S.J.Brodsky,C.Peterson and N.Sakai,Phys.Rev.D23,2745(1981);R.Vogt,Prog.Part.Nucl.Phys.45,S105(2000)[hep-ph/0011298];T.Gutierrez and R.Vogt,Nucl.Phys.B539,189(1999)[hep-ph/9808213];G.Ingelman and M.Thunman,Z.Phys.C 73,505(1997)[hep-ph/9604289];J.Alwall,[hep-ph/0508126].[9]J.Pumplin,Phys.Rev.D73,114015(2006)[hep-ph/0508184].[10]B.W.Harris,J.Smith and R.Vogt,Nucl.Phys.B461,181(1996)[hep-ph/9508403].[11]J.F.Gunion and R.Vogt,[hep-ph/9706252].[12]i,et al.,“The Strange Parton Distribution in the Nucleon:Global Analysis andApplications,”in preparation.[13]S.J.Brodsky,“Light-front QCD,”[hep-ph/0412101];S.J.Brodsky,Few Body Syst.36,35(2005)[hep-ph/0411056].[14]S.J.Brodsky,P.Hoyer,C.Peterson and N.Sakai,Phys.Lett.B93,451(1980).[15]F.S.Navarra,M.Nielsen,C.A.A.Nunes and M.Teixeira,Phys.Rev.D54,842(1996)[hep-ph/9504388];S.Paiva,M.Nielsen,F.S.Navarra,F.O.Duraes and L.L.Barz, Mod.Phys.Lett.A13,2715(1998)[hep-ph/9610310].[16]W.Melnitchouk and A.W.Thomas,Phys.Lett.B414,134(1997)[hep-ph/9707387].[17]F.M.Steffens,W.Melnitchouk and A.W.Thomas,Eur.Phys.J.C11,673(1999)[hep-ph/9903441].[18]J.F.Donoghue and E.Golowich,Phys.Rev.D15,3421(1977).[19]X.T.Song,Phys.Rev.D65,114022(2002)[hep-ph/0111129].[20]J.C.Collins and W.K.Tung,Nucl.Phys.B278,934(1986);J.C.Collins,Phys.Rev.D58,094002(1998)[hep-ph/9806259].[21]S.Catani,D.de Florian,G.Rodrigo and W.Vogelsang,Phys.Rev.Lett.93,152003(2004)[hep-ph/0404240].[22]L.N.Ananikyan and N.Ya.Ivanov,[hep-ph/0701076].[23]TeV4LHC QCD Working Group,[hep-ph/0610012].[24]P.Nadolsky,et al.,“Implications of New Global QCD 
Analyses for Hadron ColliderPhysics,”in preparation.[25]A.Deshpande,ner,R.Venugopalan,and W.Vogelsang,Ann.Rev.Nucl.Part.Sci.55,165(2005);cf.also /eic.[26]J.B.Dainton,M.Klein,P.Newman,E.Perez,and F.Willeke,[hep-ex/0603016].[27]S.J.Brodsky,B.Kopeliovich,I.Schmidt and J.Soffer,Phys.Rev.D73,113005(2006)[hep-ph/0603238].[28]H.J.He and C.P.Yuan,Phys.Rev.Lett.83,28(1999)[hep-ph/9810367];C.Balazs,H.J.He and C.P.Yuan,Phys.Rev.D60,114001(1999)[hep-ph/9812263].[29]U.Aglietti et al.,“Tevatron-for-LHC report:Higgs,”[hep-ph/0612172].。

Gaussian Processes (English translation)


The idea of using the correlation coefficient method to judge the correlation between two heavy metal elements can equally well be applied to the correlation among several elements.

In that case, we need to adopt the generalized correlation coefficient method.

Let x and y denote the concentration vectors of the two groups of heavy metal elements, and let $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$ denote the covariances, defined respectively as

$$\Sigma_{xx} = E\big[(x - Ex)(x - Ex)'\big], \quad \Sigma_{yy} = E\big[(y - Ey)(y - Ey)'\big], \quad \Sigma_{xy} = E\big[(x - Ex)(y - Ey)'\big].$$

Let $\lambda_i$ $(i = 1, 2, \ldots, n)$ denote the eigenvalues of $\Sigma_{xx}^{+}\Sigma_{xy}\Sigma_{yy}^{+}\Sigma_{yx}$, where $\Sigma_{xx}^{+}$ and $\Sigma_{yy}^{+}$ are obtained as the Moore-Penrose inverses of $\Sigma_{xx}$ and $\Sigma_{yy}$.

The generalized correlation coefficient is then easily obtained as the mean of these eigenvalues, which makes it possible to introduce some simplifying assumptions when building the Gaussian process model used to locate the pollution source.
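A minimal numerical sketch of this construction is given below; the two concentration matrices are random stand-ins for real measurements, and numpy's pseudo-inverse plays the role of the Moore-Penrose inverse.

```python
# Generalized correlation coefficient: the mean eigenvalue of
# pinv(S_xx) @ S_xy @ pinv(S_yy) @ S_yx, computed on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                   # 100 samples of 3 elements
y = 0.6 * x + 0.4 * rng.normal(size=(100, 3))   # correlated second group

xc, yc = x - x.mean(0), y - y.mean(0)
S_xx = xc.T @ xc / len(x)
S_yy = yc.T @ yc / len(y)
S_xy = xc.T @ yc / len(x)
S_yx = S_xy.T

M = np.linalg.pinv(S_xx) @ S_xy @ np.linalg.pinv(S_yy) @ S_yx
eigvals = np.linalg.eigvals(M).real             # squared canonical correlations
print("generalized correlation coefficient:", eigvals.mean())
```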

We now construct a Gaussian process to perform the interpolation and to estimate the pollution source.

In the first case, suppose we interpolate at a new location $x_*$ lying between two sampling points:

$$\big(y_1, \ldots, y_n, y_*\big)' \sim N_d\!\left(0,\; \begin{pmatrix} k(x_i, x_j) & k(x_i, x_*) \\ k(x_*, x_j) & 1 \end{pmatrix}\right), \quad i, j = 1, \ldots, n.$$

For the kernel function we choose the Ornstein-Uhlenbeck kernel and the radial basis function, as follows:

$$K(x_i, x_j) = \begin{cases} \exp\!\big\{-\tfrac{1}{h}\,|x_i - x_j|\big\}, & \text{Ornstein-Uhlenbeck}, \\ \exp\!\big\{-\tfrac{1}{h}\,(x_i - x_j)^2\big\}, & \text{radial basis function}. \end{cases}$$

The conditional distribution at the new location is

$$y_* \mid y_1, \ldots, y_n \sim N(\mathrm{mean}, \mathrm{var}), \quad \mathrm{mean} = K(x_*, x)\,K(x, x)^{-1}y, \quad \mathrm{var} = 1 - K(x_*, x)\,K(x, x)^{-1}K(x, x_*).$$

When two signals $y^{(1)}$ and $y^{(2)}$ are modelled jointly with correlation parameter $\rho$, the prior becomes

$$\begin{pmatrix} y^{(1)} \\ y^{(2)} \end{pmatrix} \sim N_d\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \otimes k(x_i, x_j)\right).$$
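The following sketch carries out this interpolation numerically for the single-signal case with the Ornstein-Uhlenbeck kernel; the sampling locations, standardized concentrations and length scale h are invented for illustration.

```python
# Gaussian-process interpolation between sampling points with the
# Ornstein-Uhlenbeck kernel, following the formulas above.
import numpy as np

def ou_kernel(a, b, h=2.0):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / h)

x = np.array([0.0, 1.0, 3.0, 4.0])        # sampling locations
y = np.array([0.2, 0.5, 1.1, 0.9])        # measured (standardized) concentrations
x_star = np.array([2.0])                  # point at which to interpolate

K = ou_kernel(x, x) + 1e-9 * np.eye(len(x))   # small jitter for numerical stability
k_star = ou_kernel(x_star, x)

alpha = np.linalg.solve(K, y)
mean = k_star @ alpha
var = 1.0 - np.einsum("ij,jk,ik->i", k_star, np.linalg.inv(K), k_star)
print("posterior mean:", mean, "posterior variance:", var)
```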


Generalizing wrapper induction across multiple Web sites using named entities and post-processingGeorgios Sigletos1,2, Georgios Paliouras1,Constantine D. Spyropoulos1, Michalis Hatzopoulos21 Institute of Informatics and Telecommunications, NCSR “Demokritos”,P.O. BOX 60228, Aghia Paraskeyh, GR-153 10, Athens, Greece{sigletos,paliourg,costass}@iit.demokritos.gr2 Department of Informatics and Telecommunications, University of Athens,TYPA Buildings, Panepistimiopolis, Athens, Greece{sigletos,mike}@di.uoa.grAbstract. This paper presents a novel method for extracting information fromcollections of Web pages across different sites. Our method uses a standardwrapper induction algorithm and exploits named entity information. Weintroduce the idea of post-processing the extraction results for resolvingambiguous facts and improve the overall extraction performance. Post-processing involves the exploitation of two additional sources of information:fact transition probabilities, based on a trained bigram model, and confidenceprobabilities, estimated for each fact by the wrapper induction system. Amultiplicative model that is based on the product of those two probabilities isalso considered for post-processing. Experiments were conducted on pagesdescribing laptop products, collected from many different sites and in fourdifferent languages. The results highlight the effectiveness of our approach.1 IntroductionWrapper induction (WI) [7] aims to generate extraction rules, called wrappers, by mining highly structured collections of Web pages that are labeled with domain-specific information. At run-time, wrappers extract information from unseen collections and fill the slots of a predefined template. These collections are typically built by querying an appropriate search form in a Web site and collecting the response pages, which commonly share the same content format.A central challenge to the WI community is Information Extraction (IE) from pages across multiple sites, including unseen sites, by a single trained system. Pages collected from different sites usually exhibit multiple hypertext markup structures, including tables, nested tables, lists, etc. Current WI research relies on learning separate wrappers for different structures. Training an effective site-independent IE system is an attractive solution in terms of scalability, since any domain-specific page could be processed, without relying heavily on the hypertext structure.In this paper we present a novel approach to IE from Web pages across different sites. The proposed method relies on domain specific named entities, identified withinWeb pages. Those entities are embedded within the Web pages as XML tags and can serve as a page-independent common markup structure among pages from different sites. A standard WI system can be applied and exploit the additional textual information. Thus, the new system relies more on page-independent named-entity markup tags for inducing delimiter-based rules for IE and less on the hypertext markup tags, which vary among pages from multiple sites.We experimented with STALKER [10], which performs extraction from a wide range of Web pages, by employing a special formalism that allows the specification of the output multi-place schema for the extraction task. However, information extraction from pages across different sites is a very hard problem, due to the multiple markup structures that cannot be described by a single formalism. In this paper we suggest the use of STALKER for single-slot extraction, i.e. 
extraction of isolated facts (i.e. extraction fields), from pages across different sites.A further contribution of this paper is a method for post-processing the system’s extraction results in order to disambiguate facts. When applying a set of single-slot extraction rules to a Web page, one cannot exclude the possibility of identical or overlapping textual matches within the page, among different rules. For instance, rules for extracting instances of the facts cd-rom and dvd-rom in pages describing laptop products may overlap or exactly match in certain text fragments, resulting in ambiguous facts. Among these facts, the correct choice must be made.To deal with the issue of ambiguous facts, two sources of information are explored: transitions between facts, incorporated in a bigram model, and prediction confidence values, generated by the WI system. Deciding upon the correct fact can be based on information from either the trained bigram model and/or the confidence assigned to each predicted fact. A multiplicative model that combines these two sources of information is also presented and compared to each of the two components.The rest of this paper is structured as follows: In Section 2 we outline the architecture of our approach. Section 2.1 briefly describes the named entity recognition task. Section 2.2 reviews STALKER and in Section 2.3 we discuss how STALKER can be used under the proposed approach to perform IE from pages across different sites. In Section 3 we discuss the issue of post-processing the output of the STALKER system in order to resolve ambiguous facts. Section 4 presents experimental results on datasets that will soon be publicly released. Related work is presented in Section 5. Finally we conclude in Section 6, discussing potential improvements of our approach.2 Information Extraction from multiple Web sitesOur methodology for IE from multiple Web sites is graphically depicted in Figure 1. Three component modules process each Web page. First, a named-entity recognizer (NER) that identifies domain-specific named entities across pages from multiple sites.A trained WI system is then applied to perform extraction. Finally, the extraction results are post-processed to improve the extraction performance.Fig. 1. Generic architecture for information extraction from multiple Web sites2.1 Named entity recognitionNamed entity recognition (NER) is an important subtask in most language engineering applications and has been included as such in all MUC competitions, e.g.[8]. NER is best known as the first step in the IE task, and involves the identification of a set of basic entity names, numerical expressions, temporal expressions, etc. The overall IE task aims to extract facts in the form of multi-place relations and NER provides the entities that fill the argument slots. NER has not received much attention in Web IE tasks.We use NER in order to identify basic named entities relevant to our task and thus reduce the complexity of fact extraction. The identified entities are identified within Web pages as XML tags and serve as a valuable source of information for the WI system that follows and extracts the facts. Some of the entity names (ne), numerical (numex) and temporal (timex) expressions, used in the laptop domain are shown in Table 1, along with the corresponding examples of XML tags.Table 1. 
Subset of named entities for the laptop domainEntity Entity Type Examples of XML tags<ne type=Model>Presario</ne>ProcessorNe Model,<ne type=Processor>Intel Pentium </ne> Numex Capacity,Speed <numex type=Speed>300 MHz </numex><numex type=Capacity>20 GB </numex> Timex Duration <timex type=Duration>1 year </timex>2.2 The STALKER wrapper induction systemSTALKER [10] is a sequential covering rule learning system that performs single-slot extraction from highly-structured Web pages. Multi-slot extraction –i.e. linking of theisolated facts- is feasible through an Embedded-Catalog(EC)Tree formalism, which may describe the common structure of a range of Web pages. The EC tree is constructed manually, usually for each site, and its leaves represent the individual facts. STALKER is capable of extracting information from pages with tabular organization of their content, as well as pages with hierarchically organized content.Each extraction rule in STALKER consists of two ordered lists of linear landmark automata (LA’s), which are a subclass of nondeterministic finite state automata. The first list constitutes the start rule, while the second list constitutes the end rule. Using the EC tree as a guide, the extraction in a given page is performed by applying –for each fact- the LA’s that constitute the start rule in the order in which they appear in the list. As soon as a LA is found that matches within the page, the matching process terminates. The process is symmetric for the end rule. More details on the algorithm can be found in [10].2.3 Adapting STALKER to multi-site information extractionThe EC tree formalism used in STALKER is generally not applicable for describing pages with variable markup structure. Different EC trees need to be manually built for different markup structures and thus different extraction rules to be induced. In this paper, we are seeking for a single domain-specific trainable system, without having to deal with each page structure separately. The paper focuses on the widely-used approach of single-slot extraction. Our motivation is that if isolated facts could be accurately identified, then it is possible to link those facts separately on a second step. We therefore specify our task as follows:For each fact, try to induce a list iteration rule as depicted in Figure 2.Fig. 2. Simplification of the EC tree. A list iteration rule is learned for each fact and applies to the whole content of a page, at run-timeThe EC tree depicted in Figure 2 has the following interpretation: a Web page that describes laptop products consists of a list of instances of the fact Manufacturer (e.g. “Compaq”), a list of instances of the fact ModelName (e.g. “Presario”), a list of Ram instances (e.g. “256MB”), etc. The system, during runtime, exhaustively applies each rule to the content of the whole page. This simplified EC tree is independent of any particular page structure. The proposed approach relies on the page-independent named entities to lead to efficient extraction rules.Since each extraction rule applies exhaustively within the complete Web page, rather than being constrained by the EC tree, we expect an extraction bias towards recall, i.e., overgeneration of extracts for each fact. The penalty is a potential loss in precision, since each rule applies to text regions that do not contain relevantinformation and may return erroneous instances. 
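The overgeneration effect is easy to reproduce. In the sketch below, two hand-written regular expressions stand in for induced single-slot rules and are applied exhaustively to a small entity-tagged snippet, so the same <numex type=Speed> fragment is claimed by two different facts. The snippet and the patterns are invented, and the real system induces landmark automata rather than regular expressions.

```python
# Illustration of exhaustive single-slot application on entity-tagged text.
# The patterns are hand-written stand-ins for induced STALKER rules.
import re

page = ('DVD <numex type=Speed>16 x</numex> , CD-ROM <numex type=Speed>24 x</numex> , '
        'RAM <numex type=Capacity>256 MB</numex>')

rules = {
    'dvdSpeed':   re.compile(r'<numex type=Speed>([^<]+)</numex>'),
    'cdromSpeed': re.compile(r'<numex type=Speed>([^<]+)</numex>'),
    'ram':        re.compile(r'RAM <numex type=Capacity>([^<]+)</numex>'),
}

instances = []
for fact, pattern in rules.items():
    for m in pattern.finditer(page):          # exhaustive application to the whole page
        instances.append((m.start(1), m.end(1), m.group(1), fact))

for start, end, text, fact in sorted(instances):
    print(f'({start:3d},{end:3d}) {text!r:12} -> {fact}')
```

Both speed fragments are extracted twice, once per speed fact, which is exactly the kind of ambiguity the post-processing step has to resolve.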
Therefore we seek a post-processing mechanism capable of discarding the erroneous instances and thus improving the overall precision.

3 Post-processing the extraction results

In single-slot IE systems, each rule is applied independently of the others. This may naturally cause identical or overlapping matches among different rules, resulting in multiple ambiguous facts for those matches. We would like to resolve such ambiguities and choose the correct fact. Choosing the correct fact and removing all the others shall improve the extraction precision.

3.1 Problem specification

In this paper we adopt a post-processing approach in order to resolve ambiguities in the extraction results of the IE system. More formally, the task can be described as follows:
1. Let D be the sequence of a document's tokens and T_j(s_j, e_j) a fragment of that sequence, where s_j and e_j are the start and end token bounds respectively.
2. Let I = {i_j | i_j: T_j → fact_j} be the set of instances extracted by all the rules, where fact_j is the predicted fact associated with instance T_j.
3. Let DT be the list of all distinct text fragments T_j appearing in the extracted instances in I. Note that T_1(s_1, e_1) and T_2(s_2, e_2) are different if either s_1 ≠ s_2 or e_1 ≠ e_2. The elements of DT are sorted in ascending order of s_j.
4. If for a distinct fragment T_i in DT there exist at least two instances i_k and i_l such that i_k: T_i → fact_k and i_l: T_i → fact_l, k ≠ l, then fact_k and fact_l are ambiguous facts for T_i.
5. The goal is to associate a single fact with each element of the list DT.

To illustrate the problem, if for the fragment T_j(24, 25) = "16 x" in a page describing laptops there are two extracted instances i_k and i_l, where fact_k = dvdSpeed and fact_l = cdromSpeed, then there are two ambiguous facts for T_j. One of them must be chosen and associated with T_j.

3.2 Formulating the task as a hill-climbing search

Resolving ambiguous facts can be viewed as a hill-climbing search in the space of all possible sequences of facts that can be associated with the sequence DT of distinct text fragments. This hill-climbing search can be formulated as follows:
1. Start from a hypothetical empty node, and transition at each step j to the next distinct text fragment T_j of the sorted sequence DT.
2. At each step apply a set of operations Choose(fact_k). Each operation associates T_j with the fact_k predicted by an instance i_k = {T_j → fact_k}. A weight is assigned to each operation, based on some predefined metric. The operation with the highest weight is selected at each step.
3. The goal of the search is to associate a single fact with the last distinct fragment of the sorted list DT, and thus return the final unambiguous sequence of facts for DT.
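A compact sketch of this search is given below: it builds the sorted list DT of Section 3.1 from a set of extracted instances and then greedily picks the highest-weighted fact at each step. The weight function is a stub standing in for the metrics of Sections 3.3-3.5, and the instances mirror the Table 2 example discussed next.

```python
# Sketch of the disambiguation walk: build DT, then choose one fact per
# distinct fragment by maximum weight.  The weight values are placeholders.
from collections import defaultdict

instances = [((33, 34), 'Processor Speed'),
             ((36, 37), 'Ram'),
             ((36, 37), 'Hard Disk Capacity'),
             ((39, 40), 'Hard Disk Capacity')]

candidates = defaultdict(set)
for fragment, fact in instances:
    candidates[fragment].add(fact)
DT = sorted(candidates)                       # distinct fragments, ascending start bound

def weight(prev_fact, fact):
    # stand-in for WI confidence, bigram transitions or their product
    return {'Ram': 0.21, 'Hard Disk Capacity': 0.20}.get(fact, 1.0)

disambiguated, prev = [], None
for fragment in DT:
    best = max(candidates[fragment], key=lambda f: weight(prev, f))
    disambiguated.append((fragment, best))
    prev = best
print(disambiguated)
```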
Table 2(c) shows the two possible fact sequences that can be associated with DT. After the processor speed fact prediction for T 1, two operations apply for predicting a fact for T 2: The choose (Ram) and choose (hard disk capacity) operations, each associated with a weight , according to a predefined metric. We assume that the former operation returns a higher weight value and therefore Ram is the chosen fact for T 2. The bold circles in the tree show the chosen sequence of facts {Processor speed , Ram , Hard disk capacity } that is attached to the sequence T 1 T 2 T 3. Table 2(d) illustrates the final extracted instances, after the disambiguation process.In this paper we explore three metrics for assigning weights to the choiceoperations:1. Confidence values, estimated for each fact by the WI algorithm.2. Fact -transition probabilities, learned by a bigram model.3. The product of the above probabilities, based on a simple multiplicative model.Selecting the correct instance, and thus the correct fact, at each step and discarding the others, results in improving the overall precision. However, an incorrect choice harms both the recall and the precision of a certain fact. The overall goal of the disambiguation process is to improve the overall precision while keeping recall unaffected.3.3 Estimating fact confidenceThe original STALKER algorithm does not assign confidence values to the extracted instances. In this paper we estimate confidence values by calculating a value for each extraction rule, i.e. for each fact. That value is calculated as the average precision obtained by a three-fold cross-validation methodology on the training set. According to this methodology, the training data is split into three equally-sized subsets and the learning algorithm is run three times. Each time two of the three pieces are used for training and the third is kept as unseen data for the evaluation of the induced extraction rules. Each of the three pieces acts as the evaluation set in one of the three runs and the final result is the average over the three runs.At runtime, each instance extracted by a single-slot rule will be assigned theprecision value of that rule. For example, if the text fragment “300 Mhz” was matched by the processor speed rule, then this fragment will be assigned the confidence associated with processor speed . The key insight into using confidence values is that among ambiguous facts, we can choose the one with the highest estimated confidence.3.4 Learning fact transition probabilitiesIn many extraction domains, some facts appear in an almost fixed order within each page. For instance, a page describing laptop products may contain instances of the processor speed fact, appearing almost immediately after instances of the processor name fact. Training a simple bigram model is a natural way of modeling such dependencies and can be easily implemented by calculating ratios of counts (maximum likelihood estimation) in the labeled data as follows:∑∈→→=→K j j i c j i c j i P )()()(, (1)where the nominator counts the transitions from fact i to fact j , according to the labeled training instances. The denominator counts the total number of transitions from fact i to all facts (including self-transitions). We also calculate a startingprobability for each fact, i.e. 
The motivation for using fact transitions is that between ambiguous facts we can choose the one with the highest transition probability given the preceding fact prediction. To illustrate this, consider that the text fragment "16 x" has been identified as both cdromSpeed and dvdSpeed within a page describing laptops. Assume also that the preceding fact prediction of the system is ram. If the transition from ram to dvdSpeed has a higher probability, according to the learned bigram model, than the transition from ram to cdromSpeed, then we choose the dvdSpeed fact. If ambiguity occurs at the first extracted instance, where no preceding fact prediction is available, then we choose the fact with the highest starting probability.

3.5 Employing a multiplicative model

A simple way to combine the two sources of information described above is through a multiplicative model, assigning a confidence value to each extracted instance i_k: T_i → fact_k based on the product of the confidence value estimated for fact_k and the transition probability from the preceding instance to fact_k. Using the example of Table 2, with the two ambiguous facts Ram and Hard Disk Capacity for the text fragment T2, Table 3 depicts the probabilities assigned to each fact by the two methods described in Sections 3.3 and 3.4 and by the multiplicative model.

Table 3. Probabilities assigned to each of the two ambiguous facts of the text fragment T2 of Table 2

T2 (36, 37) = "1 GB"     WI-Confidence   Bigram   Multiplicative
Ram                      0,7             0,3      0,21
Hard disk capacity       0,4             0,5      0,20

Using the WI confidence values, Ram is selected, whereas using the bigram probabilities, Hard disk capacity is selected. We also experimented with a model that averages the two probabilities rather than multiplying them; however, the experiments led to worse results.
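Assuming the per-fact confidence values of Section 3.3 and the bigram model of Section 3.4 are available as dictionaries, the three weighting metrics can be exposed as interchangeable scoring functions for the greedy search sketched after Section 3.2. Everything below, including the dictionary and function names, is an illustrative assumption rather than the paper's implementation.

```python
# Minimal sketch of the three weighting metrics of Sections 3.3-3.5.
# `confidence`, `transitions` and `starts` are assumed to be precomputed
# (per-fact cross-validation precision, bigram probabilities, starting
# probabilities).
from typing import Dict, Optional, Tuple

def make_weight_functions(confidence: Dict[str, float],
                          transitions: Dict[Tuple[str, str], float],
                          starts: Dict[str, float]):
    def wi_confidence(fact: str, prev_fact: Optional[str]) -> float:
        # Section 3.3: weight of Choose(fact) is the rule's estimated precision.
        return confidence.get(fact, 0.0)

    def bigram(fact: str, prev_fact: Optional[str]) -> float:
        # Section 3.4: transition probability from the preceding prediction,
        # or the starting probability at the first extracted instance.
        if prev_fact is None:
            return starts.get(fact, 0.0)
        return transitions.get((prev_fact, fact), 0.0)

    def multiplicative(fact: str, prev_fact: Optional[str]) -> float:
        # Section 3.5: product of the two scores above.
        return wi_confidence(fact, prev_fact) * bigram(fact, prev_fact)

    return wi_confidence, bigram, multiplicative
```

On the T2 fragment of Table 3 these functions reproduce the reported behaviour: 0,7 > 0,4 for the WI confidences, 0,3 < 0,5 for the bigram, and 0,21 > 0,20 for the product.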
4 Experiments

4.1 Dataset description

Experiments were conducted on four language corpora (Greek, French, English, Italian) describing laptop products. The corpora were collected in the context of CROSSMARC¹. Approximately 100 pages from each language were hand-tagged using a Web page annotation tool [14]. The corpus for each language was divided into two equally sized data sets for training and testing. Part of the test corpus was collected from sites not appearing in the training data. The named entities were embedded as XML tags within the pages of the training and test data, as illustrated in Table 1. A separate NER module was developed for each of the four languages of the project.

A total of 19 facts were hand-tagged for the laptop product domain. The pages were collected from multiple vendor sites and demonstrate a rich variety of structure, including tables, lists, etc. Examples of facts are the name of the manufacturer, the name of the model, the name of the processor, the speed of the processor, the ram, etc.

¹ http://www.iit.demokritos.gr/skel/crossmarc. Datasets will soon be available on this site.

4.2 Results

Our goal was to evaluate the effect of named-entity information on the extraction performance of STALKER and to compare the three different methods for resolving ambiguous facts. We therefore conducted two groups of experiments. In the first group we evaluated STALKER on the testing datasets for each language, with the named entities embedded as XML tags within the pages. Table 4 presents the results. The evaluation metrics are micro-average recall and micro-average precision [12] over all 19 facts. The last row of Table 4 averages the results over all languages.

Table 4. Evaluation results for STALKER in four languages

Language   Micro Precision (%)   Micro Recall (%)
Greek      60,5                  86,8
French     64,1                  93,7
English    52,2                  85,1
Italian    72,8                  91,9
Average    62,4                  89,4

The exhaustive application of each extraction rule to the whole content of a page resulted, as expected, in high recall accompanied by lower precision. However, named-entity information led a pure WI system like STALKER to achieve an acceptable level of extraction performance across pages with variable structure. We also trained STALKER on the same data without embedding named entities within the pages. The result was an unacceptably high training time, accompanied by rules with many disjuncts that mostly overfit the training data. Evaluation results on the testing corpora provided recall and precision figures below 30%.

In the second group of experiments, we evaluated the post-processing methodology for resolving ambiguous facts that was described in Section 3. Results are illustrated in Table 5.

Table 5. Evaluation results after resolving ambiguities

           Micro Precision (%)            Micro Recall (%)
Language   WI-Conf.   Bigram   Mult.      WI-Conf.   Bigram   Mult.
Greek      69,3       73,5     73,8       76,9       81,6     81,9
French     77,0       78,9     79,4       82,1       84,1     84,6
English    65,9       67,5     68,9       74,4       76,2     77,5
Italian    84,4       83,8     84,4       87,6       87,0     87,6
Average    74,2       75,9     76,6       80,3       82,2     82,9

Comparing the results of Table 4 to the results of Table 5, we conclude the following:

1. Choosing among ambiguous facts, using any of the three methods, achieves an overall increase in precision, accompanied by a smaller decrease in recall. The results are very encouraging, given the difficulty of the task.
2. Using bigram fact transitions for choosing among ambiguous facts achieves better results than using confidence values. However, the simple multiplicative model slightly outperforms both individual methods.

To corroborate the effectiveness of the multiplicative model, we counted the number of correct choices made by the three post-processing methods at each step of the hill-climbing process, as described in Section 3.2. Results are illustrated in Table 6.

Table 6. Counting the ambiguous predictions and the correct choices

Language   Distinct T_i   Ambiguous T_i   Corrected (WI-Conf.)   Corrected (Bigram)   Corrected (Mult.)
Greek      549            490             251                    331                  336
French     720            574             321                    364                  374
English    2203           1806            915                    996                  1062
Italian    727            670             538                    458                  483
Average    1050           885             506                    537                  563

The first column of Table 6 is the number of distinct text fragments T_i, as defined in Section 3.1, over all pages in the testing corpus. The second column counts the fragments T_i with more than one (ambiguous) fact, such as T2 in Table 2. The last three columns count the correct choices made by each of the three methods.

We conclude that by using a simple multiplicative model, based on the product of the bigram probabilities and the STALKER-assigned confidence values, we make more correct choices than by using either of the two methods individually.
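As a rough sketch of how the figures reported in Table 6 could be obtained (the gold-label lookup, the assumption that correct choices are counted over the ambiguous fragments only, and all names below are ours, not the authors' evaluation code), each method's choice for every ambiguous fragment can be compared against the hand-tagged fact.

```python
# Rough sketch of the bookkeeping behind Table 6: distinct fragments,
# ambiguous fragments, and correct choices on the ambiguous fragments.
# `candidates` maps fragment bounds to the candidate facts, `choices` to the
# fact selected by one method, `gold` to the hand-tagged fact; all assumed.
from typing import Dict, List, Tuple

def count_correct_choices(candidates: Dict[Tuple[int, int], List[str]],
                          choices: Dict[Tuple[int, int], str],
                          gold: Dict[Tuple[int, int], str]) -> Tuple[int, int, int]:
    distinct = len(candidates)
    ambiguous = sum(1 for facts in candidates.values() if len(set(facts)) > 1)
    correct = sum(1 for bounds, facts in candidates.items()
                  if len(set(facts)) > 1 and choices.get(bounds) == gold.get(bounds))
    return distinct, ambiguous, correct
```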
5 Related Work

Extracting information from multiple Web sites is a challenging issue for the WI community. Cohen and Fan [3] present a method for learning page-independent heuristics for IE from Web pages. However, they require as input a set of existing wrappers along with the pages they correctly wrap. Cohen et al. [4] also present one component of a larger system that extracts information from multiple sites. A common characteristic of both of the aforementioned approaches is that they need to encounter each different markup structure separately during training. In contrast, we examine the viability of trainable systems that can generalize over unseen sites, without encountering each page's specific structure.

An IE system that exploits shallow linguistic pre-processing information is presented in [2]. However, it generalizes extraction rules relying on lexical units (tokens), each one associated with shallow linguistic information, e.g., lemma, part-of-speech tag, etc. We generalize rules relying on named entities, which involve contiguous lexical units, thus providing higher flexibility to the WI algorithm.

An ontology-driven IE system for pages across different sites is presented in [5]. However, it relies on hand-crafted regular expressions (provided by an ontology), along with a set of heuristics, in order to identify single-slot facts within a document. We, on the other hand, try to induce such expressions using wrapper induction.

All systems mentioned in this section experiment with different corpora, and thus cannot easily be comparatively evaluated.

6 Conclusions and Future Work

This paper presented a methodology for extracting information from Web pages across different sites, based on a pipeline of three component modules: a named-entity recognizer, a standard wrapper induction system, and a post-processing module for disambiguating extracted facts. Experimental results showed the viability of our approach.

The issue of disambiguating facts is important for single-slot IE systems used on the Web. For instance, Hidden Markov Models (HMMs) [11] are a well-known learning method for performing single-slot extraction [6], [13]. According to this approach, a single HMM is trained for each fact. At run-time, each HMM is applied to a page, using the Viterbi procedure, to identify relevant matches. Identified matches across different HMMs may be identical or overlapping, resulting in ambiguous facts. Our post-processing methodology can thus be particularly useful for HMM-based extraction tasks.

Bigram modeling is a simplistic approach to the exploitation of dependencies among facts. We plan to explore higher-level interdependencies among facts, using higher-order n-gram models or probabilistic FSAs, e.g. as learned by the Alergia algorithm [1]. Our aim is to further increase the number of correct choices made for ambiguous facts, thus further improving both recall and precision. Dependencies among facts will also be investigated in the context of multi-slot extraction.

A bottleneck in existing approaches to IE is the labeling process. Despite the use of a user-friendly annotation tool [14], labeling is a tedious, time-consuming and error-prone task, especially when moving to a new domain. We plan to investigate active learning techniques [9] for reducing the amount of labeled data required. On the other hand, we anticipate that our labeled datasets will be of use as benchmarks for the comparative evaluation of other current and/or future IE systems.

References

1. Carrasco, R., Oncina, J., Learning stochastic regular grammars by means of a state-merging method. Grammatical Inference and Applications, ICGI'94, pp. 139-150, Spain (1994).
2. Ciravegna, F., Adaptive Information Extraction from Text by Rule Induction and Generalization. In Proceedings of the 17th IJCAI Conference, Seattle (2001).
3. Cohen, W., Fan, W., Learning page-independent heuristics for extracting data from Web pages. In Proceedings of the 8th International WWW Conference (WWW-99), Toronto, Canada (1999).
4. Cohen, W., Hurst, M., Jensen, L., A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In Proceedings of the 11th International WWW Conference, Hawaii, USA (2002).
5. Davulcu, H., Mukherjee, S., Ramakrishnan, I.V., Extraction Techniques for Mining Services from Web Sources. IEEE International Conference on Data Mining, Maebashi City, Japan (2002).
6. Freitag, D., McCallum, A.K., Information Extraction using HMMs and Shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 31-36 (1999).
7. Kushmerick, N., Wrapper Induction for Information Extraction. PhD Thesis, Department of Computer Science, University of Washington (1997).
8. MUC-7, /iaui/894.02/related_projects/muc.
9. Muslea, I., Active Learning with Multiple Views. PhD Thesis, University of Southern California (2002).
10. Muslea, I., Minton, S., Knoblock, C., Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114 (2001).
11. Rabiner, L., A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2) (1989).
12. Sebastiani, F., Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47 (2002).
13. Seymore, K., McCallum, A.K., Rosenfeld, R., Learning hidden Markov model structure for information extraction. AAAI Workshop on Machine Learning for Information Extraction, pp. 37-42 (1999).
14. Sigletos, G., Farmakiotou, D., Stamatakis, K., Paliouras, G., Karkaletsis, V., Annotating Web pages for the needs of Web Information Extraction Applications. Poster at WWW 2003, Budapest, Hungary, May 20-24 (2003).
