Overview of CMS virtual data needs
December 2000
1 Introduction
This document was put together to act as CMS input into the Griphyn Architecture meeting on December 20, 2000. It is a high-level overview that tries to be short rather than complete.
Structure of this document:
- Brief overview of CMS and its physics
- CMS virtual data description
- 2005 vs. current needs and activities
- References to some other documents that may be of interest
2 Brief overview of CMS and its physics
2.1 CMS
The CMS experiment is a high energy physics experiment located at CERN, that will start data taking in 2005. The CMS detector (figure 1) is one of the two general-purpose detectors of the LHC accelerator. It is being designed and built, and will be used, by a world-wide collaboration, the CMS collaboration, that currently consists of some 2200 people in 145 institutes, divided over 30 countries. In future operation, the LHC accelerator lets two bunches of particles cross each other inside the CMS detector 40,000,000 times each second. Every bunch contains some 10^11 protons. In every bunch crossing in the detector, on average 20 collisions occur between two protons from opposite bunches.
A bunch crossing with collisions is called an event. Figure 2 shows an example of an event. Note that the picture of the collision products is very complex, and represents a lot of information: CMS has 15 million individual detector channels. The measurements of the event, done by the detector elements in the CMS detector, are called the 'raw event data'. The size of the raw event data for a single CMS event is about 1 MB (after compression using 'zero suppression').
Figure 1: The CMS detector.
Of the 40,000,000 events in a second, some 100 are selected for storage and later analysis. This selection is done with a fast real-time filtering system. Data analysis is done interactively, by CMS physicists working all over the world. Apart from a central data processing system at CERN, there will be some 5-10 regional centres all over the world that will support data processing. Further processing may happen locally at the physicist's home institute, maybe also on the physicist's desktop machine. The 1 MB 'raw event' data for each event is not analyzed directly. Instead, for every stored raw event, a number of summary objects are computed. These summary objects range in size from a 100 KB 'reconstructed tracks' object to a 100 byte 'event tag' object; see section 3.1 for details. This summary data will be replicated widely, depending on needs and capacities. The original raw data will stay mostly in CERN's central robotic tape store, though some of it may be replicated too. Due to the slowness of random data access on tape robots, the access to raw data will be severely limited.
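For a rough sense of the resulting data volumes, the sketch below combines the numbers quoted above (100 stored events per second, 1 MB of raw data and roughly 100 KB of reconstructed tracks per event) with an assumed 10^7 seconds of data taking per year; the yearly live time is an illustrative assumption, not a figure from this document.

```python
# Back-of-the-envelope data volumes from the rates quoted above.
# The 1e7 seconds of data taking per year is an assumed figure for illustration.

STORED_EVENTS_PER_SECOND = 100        # after the real-time filter
RAW_EVENT_SIZE_MB = 1.0               # raw data per event
RECO_TRACKS_SIZE_MB = 0.1             # 100 KB 'reconstructed tracks' object
EVENT_TAG_SIZE_MB = 100 / 1e6         # 100 byte 'event tag' object
SECONDS_PER_YEAR = 1e7                # assumed live data-taking time per year

events_per_year = STORED_EVENTS_PER_SECOND * SECONDS_PER_YEAR
raw_pb_per_year = events_per_year * RAW_EVENT_SIZE_MB / 1e9      # MB -> PB
reco_pb_per_year = events_per_year * RECO_TRACKS_SIZE_MB / 1e9   # MB -> PB
tags_gb_per_year = events_per_year * EVENT_TAG_SIZE_MB / 1e3     # MB -> GB

print(f"stored events per year : {events_per_year:.1e}")
print(f"raw data per year      : ~{raw_pb_per_year:.1f} PB")
print(f"reconstructed tracks   : ~{reco_pb_per_year:.2f} PB")
print(f"event tags             : ~{tags_gb_per_year:.0f} GB")
```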
2.2 Physics analysis
By studying the momenta, directions, and other properties of the collision products in the event, physicists can learn more about the exact nature of the particles and forces that were involved in the collision.
For example, to learn more about Higgs bosons, one can study events in which a collision produced a Higgs boson that then decayed into four charged leptons. (A Higgs boson decays almost immediately after creation, so it cannot be observed directly; only its decay products can be observed.) A Higgs boson analysis effort can therefore start with isolating the set of events in which four charged leptons were produced. Not all events in this set correspond to the decay of a Higgs boson: there are many other physics processes that also produce charged leptons. Therefore, subsequent isolation steps are needed, in which 'background' events, in which the leptons were not produced by a decaying Higgs boson, are eliminated as much as possible. Background events can be identified by looking at other observables in the event record, like the non-lepton particles that were produced, or the momenta of particles that left the collision point. Once enough background events have been eliminated, some important properties of the Higgs boson can be determined by doing a statistical analysis on the set of events that are left.

Figure 2: A CMS event (simulation).
The data reduction factor in physics analysis is enormous. The final event set in the above example may contain only a few hundred events, selected from all the events that occurred in one year in the CMS detector. This gives a data reduction factor on the order of 1 in 10^12. Much of this reduction happens in the real-time filter before any data is stored; the rest happens through the successive application of 'cut predicates' to the stored events, to isolate ever smaller subsets.
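As an illustration of the 'cut predicate' style of selection, here is a minimal sketch in Python; the event fields and cut thresholds are invented and do not correspond to real CMS observables or cuts.

```python
# Minimal sketch of successive cut predicates, assuming each event is a dict
# with invented fields; real CMS cuts and observables would differ.

events = [
    {"n_charged_leptons": 4, "missing_energy": 12.0, "n_jets": 1},
    {"n_charged_leptons": 2, "missing_energy": 55.0, "n_jets": 4},
    {"n_charged_leptons": 4, "missing_energy": 80.0, "n_jets": 6},
]

# Each cut predicate keeps the events that are still compatible with the
# signal hypothesis (here: a Higgs boson decaying to four charged leptons).
cuts = [
    lambda e: e["n_charged_leptons"] == 4,   # signal signature
    lambda e: e["missing_energy"] < 50.0,    # reject one class of background
    lambda e: e["n_jets"] <= 2,              # reject another class of background
]

selected = events
for cut in cuts:
    selected = [e for e in selected if cut(e)]

print(f"{len(selected)} of {len(events)} events survive all cuts")
```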
3 CMS virtual data description
3.1 Structure of event data
CMS models all its data in terms of objects. We will call the objects that represent event data, or summaries of event data, 'physics objects'.
The CMS object store will contain a number of physics objects for each event, as shown in figure 3. In a Griphyn context, one can think of each object in figure 3 as a materialized virtual data object. Among themselves, the objects for each event form a hierarchy. At higher levels in the hierarchy, the objects become smaller, and can be thought of as holding summary descriptions of the data in the objects at a lower level. By accessing the smallest summary object whenever possible, physicists can save both CPU and I/O resources.

Figure 3: Example of the physics objects present for two events. The numbers indicate object sizes. The reconstructed object sizes shown reflect the CMS estimates in [1]. The sum of the sizes of the raw data objects for an event is 1 MB; this corresponds to the 1 MB raw event data specified in [1].
At the lowest level of the hierarchy are raw data objects, which store all the detector measurements made at the occurrence of the event. Every event has about 1 MB of raw data in total, which is partitioned into objects according to some predefined scheme that follows the physical structure of the detector. Above the raw data objects are the reconstructed objects, which store interpretations of the raw data in terms of physics processes. Reconstructed objects can be created by physicists as needed, so different events may have different types and numbers of reconstructed (materialized) objects. At the top of the hierarchy of reconstructed objects are event tag objects of some 100 bytes, which store only the most important properties of the event. Several versions of these event tag objects can exist at the same time.
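As a rough illustration of this hierarchy, the sketch below models the per-event physics objects as simple Python classes; the class names and the partitioning into objects are invented, and only the approximate sizes come from the text above.

```python
# Illustrative model of the per-event object hierarchy described above.
# Class names and raw-data partitioning are invented; sizes follow the text.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RawDataObject:           # lowest level: detector measurements
    detector_part: str
    size_bytes: int            # all raw objects of an event sum to ~1 MB

@dataclass
class ReconstructedTracks:     # mid level: interpretation of the raw data
    algorithm_version: str
    size_bytes: int = 100_000  # ~100 KB

@dataclass
class EventTag:                # top level: most important properties only
    version: int
    size_bytes: int = 100      # ~100 bytes

@dataclass
class Event:
    event_id: int
    raw: List[RawDataObject] = field(default_factory=list)
    reconstructed: List[ReconstructedTracks] = field(default_factory=list)
    tags: List[EventTag] = field(default_factory=list)

    def smallest_summary(self) -> Optional[EventTag]:
        # Physicists access the smallest adequate summary object whenever
        # possible, saving both CPU and I/O.
        return self.tags[-1] if self.tags else None
```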
3.2 Data dependencies
To interpret figure 3 in terms of virtual data, one has to make visible the way in which each of these objects was computed. This is done in figure 4: it shows the data dependencies for objects in figure 3. Note that the 'raw' objects are not computed; they correspond to detector measurements.
In figure 4, an arrow from one object to another signifies that the value of the second object depends on the first. The grey boxes represent physics algorithms used to compute the objects: note that these all have particular versions. The lowermost grey box represents some 'calibration constants' that specify the configuration of the detector over time. Calibration is outside the scope of this text.
Note that, usually, there is only one way in which a requested CMS virtual data product can be computed. This is in contrast with LIGO, where often many ways to compute a product are feasible, and where one challenge for the scheduler is to find the most efficient way.
Figure 4: Data dependencies for some of the objects in figure 3. An arrow from one object to another signifies that the value of the second depends on the first. The grey boxes represent the physics algorithms used to compute the objects, and some 'calibration constants' used by these algorithms.
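The kind of dependency information shown in figure 4 can be captured in a small data structure. The sketch below is one possible encoding, with invented object and algorithm names; it is not the actual CMS metadata format.

```python
# One possible encoding of per-event data dependencies, as in figure 4.
# Object and algorithm names are invented for illustration.

# Each derived object records the inputs it depends on and the versioned
# algorithm (or calibration constants) used to compute it.
dependencies = {
    "event_tag/v2": {
        "inputs": ["reconstructed_tracks/v3"],
        "algorithm": ("tag_builder", "2.0"),
    },
    "reconstructed_tracks/v3": {
        "inputs": ["raw/tracker", "calibration_constants/2000-12"],
        "algorithm": ("track_reconstruction", "3.1"),
    },
    # 'raw' objects are measured, not computed, so they have no entry here.
}

def materialization_plan(obj, deps, plan=None):
    """Return the objects to compute, in dependency order, to obtain `obj`."""
    if plan is None:
        plan = []
    for inp in deps.get(obj, {}).get("inputs", []):
        materialization_plan(inp, deps, plan)
    if obj in deps and obj not in plan:
        plan.append(obj)
    return plan

print(materialization_plan("event_tag/v2", dependencies))
# -> ['reconstructed_tracks/v3', 'event_tag/v2']
```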
3.3 Encoding data dependency specifics
As is obvious from figure 4, the data dependencies for any physics object can become very complex. Note however that, for the two events in figure 4, the dependency graphs are similar. It is possible to make a single big dependency graph, in a metadata repository, that captures all possible dependencies between the physics objects of every event. Figure 5 shows such a metadata graph for the events in figure 4.

Figure 5: Metadata graph for the events in figure 4. The numbers in the nodes representing physics objects are globally unique type numbers.
In the metadata graph, the physics objects are replaced by unique type numbers. These numbers can be used to represent the particular data dependencies and algorithms that went into the computation of any reconstructed object. A CMS virtual data addressing system could use the type numbers as keys in a transformation catalog that yields instructions on how to compute any virtual physics object.

3.4 Virtual data object identification
In CMS it is possible to uniquely identify every (virtual or real) physics object by a tuple (event ID, type number). The first part of the tuple is the (unique) identifier of the event the object belongs to; the second is the unique identifier of the corresponding location in the object dependency graph discussed above.
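A minimal sketch of how such an addressing scheme might look is given below; the catalog layout, type numbers and function names are assumptions made for illustration, not the CMS design.

```python
# Sketch of (event ID, type number) addressing with a transformation catalog.
# The catalog layout and type numbers are assumptions for illustration only.

# The transformation catalog maps a type number to the recipe for computing
# any virtual physics object of that type: the input type numbers it needs
# and the versioned algorithm to run.
transformation_catalog = {
    1: {"inputs": [], "algorithm": None},                      # raw data (measured)
    7: {"inputs": [1], "algorithm": ("track_reco", "3.1")},    # reconstructed tracks
    9: {"inputs": [7], "algorithm": ("tag_builder", "2.0")},   # event tag
}

def recipe(event_id, type_number):
    """Describe how to obtain the virtual object (event_id, type_number)."""
    entry = transformation_catalog[type_number]
    if entry["algorithm"] is None:
        return f"fetch stored raw data of event {event_id}"
    inputs = ", ".join(f"({event_id}, {t})" for t in entry["inputs"])
    name, version = entry["algorithm"]
    return f"run {name} v{version} on inputs {inputs}"

print(recipe(42, 9))   # -> run tag_builder v2.0 on inputs (42, 7)
```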
3.5 Importance of physics objects relative to other data products
Besides the physics objects as shown in figure 3, where each object holds data about a single event only, CMS will also work with data products that describe properties of sets of events. There will be several types of such products, with names like 'calibration data', 'tag collections', 'histograms', 'physics papers', etc. However, in terms of data volume these products will be much less significant than the physics objects.
In terms of virtual data grids, it is believed that the big challenge for CMS lies in making these grids compute and deliver physics objects to 'physics analysis jobs', where these physics jobs can output relatively small data products like histograms that need not necessarily be managed by the grid.
3.6 Typical physics job
A typical job computes a function value f(e) for every event e in some set S, and aggregates the function results. In contrast with LIGO, in CMS the set S will generally be a very sparse subset of the events taken over a very long time interval. To compute f(e), the values of one or more virtual data products for the event e are needed. Generally, f will request the same product(s) for every event, i.e. products with the same 'type numbers'.
In a high energy physics experiment, there is no special correlation between the time an event collision took place and any other property of the event: events with similar properties are evenly distributed over the time sequence of all events. In physics analysis jobs, events are treated as completely independent from each other. If function results are aggregated, the aggregation function does not depend on the order in which the function results are fed to it. Thus, from a physics standpoint, the order of traversal of a job event set does not influence the job result.
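A minimal sketch of such a job is given below, assuming a simple fetch interface to the virtual data grid; the product name, the per-event function and the histogram binning are invented for illustration.

```python
# Sketch of a typical physics job: evaluate f(e) over a sparse event set S
# and aggregate the results order-independently (here: into a histogram).
# The product name, per-event function and binning are invented.
from collections import Counter

def f(event_products):
    """Per-event function: reduce the requested products to one number."""
    return event_products["event_tag"]["invariant_mass"]

def run_job(sparse_event_ids, fetch_products):
    histogram = Counter()           # order-independent aggregation
    for event_id in sparse_event_ids:
        # Every event requests products with the same type numbers,
        # fetched (or materialized on demand) by the virtual data grid.
        products = fetch_products(event_id, ["event_tag"])
        value = f(products)
        histogram[int(value // 10) * 10] += 1   # 10-unit-wide bins
    return histogram

# Toy stand-in for the grid: deterministic fake products per event ID.
def fake_fetch(event_id, product_names):
    return {"event_tag": {"invariant_mass": (event_id * 37) % 200}}

print(run_job([3, 1005, 87214, 990001], fake_fetch))
```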
3.7 CMS virtual data grid characteristics
When comparing the grid requirements of a high energy physics experiment like CMS with the requirements of LIGO and SDSS, the points which are characteristic for CMS are:

- Not just very large, but extremely large, amounts of data.
- A large amount of CPU power needed to derive the needed virtual data products.
- The above two imply that fault-tolerant facilities for the mass-materialization of virtual data products on a large and very distributed system are essential.
- The baseline virtual data model does not have virtual data products that aggregate data from multiple events, so the model looks relatively simple from a scheduling standpoint.
- A requested set of data products generally corresponds to a very sparse subset of the events taken over a very long time interval.
4 2005 vs. current needs and activities
The preceding sections talk about the CMS data model and data analysis needs when the experiment is running from 2005 on. This section discusses current and near-future needs and activities, and contrasts these with the 2005 needs.
Currently CMS is performing large-scale simulation efforts, in which physics events are simulated as they occur inside a simulation of the CMS detector. These simulation efforts support detector design and the design of the real-time event filtering algorithms that will be used when CMS is running. The simulation efforts are in the order of hundreds of CPU years and terabytes of data. They will continue, and will grow in size, up to 2005 and then throughout the lifetime of the experiment. The simulation efforts and the software R&D for CMS data management are seen as strongly intertwined and complementary activities. In addition to performing grid-related R&D in the context of several projects, CMS is also already using some grid-type software 'in production' for its simulation efforts. Examples of this are the use of Condor-managed CPU power in some large-scale simulation efforts, and the use of some Globus components by GDMP [7], a software system developed by CMS that is currently being used in production to replicate files with simulation results in the wide area.
CMS simulation efforts currently still rely to a large extent on hand-coded shell and Perl scripts, and on the careful manual mapping of hardware resources to tasks. As more grid technology becomes available, CMS will be actively looking to use it in its simulation efforts, both as a way to save manpower and as a means to allow for greater scalability. On the grid R&D side, the CMS simulation codes could also be used inside testbeds that evaluate still-experimental grid technologies. It thus makes sense to look more closely here at the exact properties of the CMS simulation efforts, and how these differ from those of the 2005 CMS virtual data problem.
Each individual CMS simulation run can be modeled as a definition of a set of virtual data products and a request to materialize them into a set of files. Current CMS simulation runs have a batch nature, not an interactive nature. Each large run generally takes at least a few days to plan, with several people getting involved, and then at least a few weeks to execute. At most some tens of runs will be in progress at the same time. So there is a huge contrast with the 2005 situation, where CMS data processing requirements are expected to be dominated by 'chaotic' interactive physics analysis workloads generated by hundreds of physicists working independently. Also, in contrast to the 2005 workloads, requests for the data in sparse subsets of (simulated) event datasets will be rare, if they occur at all. Simulated event sets can be, and are, generated in such a way that events likely to be requested together are created together in the same database file or set of database files. Therefore, to support simulation runs in the near future, it would be possible to use a virtual data system that works at the granularity of files, rather than the finer granularity of events or objects. Going towards 2005, the data creation and transport needs of the CMS simulation exercises are expected to become increasingly 'chaotic' and fine-grained, but the exact pace at which change will happen is currently not known.
CMS currently has two distinct simulation packages. The Fortran-based CMSIM software takes care of the first steps in a full simulation chain. It produces files which are then read by the C++-based ORCA software. CMSIM uses flat files in the 'fz' format for its output; ORCA data storage is done using the Objectivity object database. CMSIM will be phased out in the next few years; it will be replaced by more up-to-date simulation codes using the C++-based next-generation GEANT4 physics simulation library. As targets for use in virtual data testbeds, CMSIM and ORCA each have their own strengths and weaknesses. In CMSIM, each simulation run produces one output file based on a set of runtime parameters. This yields a virtual data model that is almost embarrassingly simple: a data model of 'virtual output files', with each file having a runtime parameter set as its unique signature, and no dependencies between files. CMSIM is very portable and can be run on almost any platform. Installing the CMSIM software is not a major job. The simulations involving ORCA display a much more complicated pattern of data handling in which intermediate products appear [3], and a corresponding virtual data model would be much more complex, and more representative of the 2005 situation. The ORCA software is under rapid development, with cycles of a few months or less. ORCA is only supported on Linux and Solaris, and currently takes considerable effort and expertise to install. Work on more automated installation procedures is underway.
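As a sketch of the 'virtual output files' model described above, a file-granularity catalog could key each virtual output file on its runtime parameter set and materialize a file only when it has not been produced before. The parameter names and the stand-in simulation command below are invented for illustration.

```python
# Sketch of a file-granularity virtual data catalog for CMSIM-style runs:
# each virtual output file is identified by its runtime parameter set.
# Parameter names and the simulation command are invented for illustration.
import hashlib
import json

catalog = {}   # signature -> path of the materialized output file

def signature(params):
    """Canonical signature of a runtime parameter set."""
    return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

def materialize(params, run_simulation):
    """Return the output file for `params`, running the simulation if needed."""
    sig = signature(params)
    if sig not in catalog:                      # not materialized yet
        catalog[sig] = run_simulation(params)   # e.g. submit a batch job
    return catalog[sig]

# Toy stand-in for a simulation run that writes one output file.
def fake_run(params):
    path = f"/data/sim_{signature(params)[:8]}.fz"
    print(f"running simulation for {params} -> {path}")
    return path

materialize({"process": "higgs_4l", "n_events": 1000, "seed": 1}, fake_run)
materialize({"process": "higgs_4l", "n_events": 1000, "seed": 1}, fake_run)  # reuses the file
```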
5 References to some other documents that may be of interest
The CMS Computing Technical Proposal [1], written in 1996, is still a good source of overview material. More recent sources are [5], which has material on CMS physics and its software requirements, and [2], which has more details about the CMS 2005 data model and expected access patterns.
A short write-up on CMSIM and virtual data is [6]. More details on simulations using ORCA are in [3] and [4].
References
[1] CMS Computing Technical Proposal. CERN/LHCC 96-45, CMS collaboration, 19 December 1996.
[2] Koen Holtman. Introduction to CMS from a CS viewpoint. 21 Nov 2000. http://home.cern.ch/kholtman/introcms.ps
[3] David Stickland. The Design, Implementation and Deployment of Functional Prototype OO Reconstruction Software for CMS. The ORCA project. http://chep2000.pd.infn.it/abst/abs
