MPI's reduction operations in clustered wide area systems


MPI Primer / Developing With LAM

Ohio Supercomputer Center
The Ohio State University

LAM is a parallel processing environment and development system for a network of independent computers. It features the Message-Passing Interface (MPI) programming standard, supported by extensive monitoring and debugging tools.

LAM / MPI Key Features:

• full implementation of the MPI standard
• extensive monitoring and debugging tools, runtime and post-mortem
• heterogeneous computer networks
• add and delete nodes
• node fault detection and recovery
• MPI extensions and LAM programming supplements
• direct communication between application processes
• robust MPI resource management
• MPI-2 dynamic processes
• multi-protocol communication (shared memory and network)

How to Use This Document

This document is organized into four major chapters. It begins with a tutorial covering the simpler techniques of programming and operation. New users should start with the tutorial. The second chapter is an MPI programming primer emphasizing the commonly used routines. Non-standard extensions to MPI and additional programming capabilities unique to LAM are separated into a third chapter. The last chapter is an operational reference. It describes how to configure and start a LAM multicomputer, and how to monitor processes and messages.

This document is user oriented. It does not give much insight into how the system is implemented. It does not detail every option and capability of every command and routine. An extensive set of manual pages covers all the commands and internal routines in great detail and is meant to supplement this document.

The reader will note a heavy bias towards the C programming language, especially in the code samples. There is no Fortran version of this document. The text attempts to be language insensitive and the appendices contain Fortran code samples and routine prototypes.

We have kept the font and syntax conventions to a minimum.

    code        This font is used for things you type on the keyboard or
                see printed on the screen. We use it in code sections and
                tables but not in the main text.
    <symbol>    This is a symbol used to abstract something you would
                type. We use this convention in commands.
    Section     Italics are used to cross reference another section in the
                document or another document. Italics are also used to
                distinguish LAM commands.

Table of Contents

How to Use This Document
LAM Architecture
Debugging
MPI Implementation
How to Get LAM

LAM / MPI Tutorial Introduction
    Programming Tutorial
    The World of MPI
    Enter and Exit MPI
    Who Am I; Who Are They?
    Sending Messages
    Receiving Messages
    Master / Slave Example
    Operation Tutorial
    Compilation
    Starting LAM
    Executing Programs
    Monitoring
    Terminating the Session

MPI Programming Primer
    Basic Concepts
    Initialization
    Basic Parallel Information
    Blocking Point-to-Point
    Send Modes
    Standard Send
    Receive
    Status Object
    Message Lengths
    Probe
    Nonblocking Point-to-Point
    Request Completion
    Probe
    Message Datatypes
    Derived Datatypes
    Strided Vector Datatype
    Structure Datatype
    Packed Datatype
    Collective Message-Passing
    Broadcast
    Scatter
    Gather
    Reduce
    Creating Communicators
    Inter-communicators
    Fault Tolerance
    Process Topologies
    Process Creation
    Portable Resource Specification
    Miscellaneous MPI Features
    Error Handling
    Attribute Caching
    Timing

LAM / MPI Extensions
    Remote File Access
    Portability and Standard I/O
    Collective I/O
    Cubix Example
    Signal Handling
    Signal Delivery
    Debugging and Tracing

LAM Command Reference
    Getting Started
    Setting Up the UNIX Environment
    Node Mnemonics
    Process Identification
    On-line Help
    Compiling MPI Programs
    Starting LAM
    recon
    lamboot
    Fault Tolerance
    tping
    wipe
    Executing MPI Programs
    mpirun
    Application Schema
    Locating Executable Files
    Direct Communication
    Guaranteed Envelope Resources
    Trace Collection
    lamclean
    Process Monitoring and Control
    mpitask
    GPS Identification
    Communicator Monitoring
    Datatype Monitoring
    doom
    Message Monitoring and Control
    mpimsg
    Message Contents
    bfctl
    Collecting Trace Data
    lamtrace
    Adding and Deleting LAM Nodes
    lamgrow
    lamshrink
    File Monitoring and Control
    fstate
    fctl
    Writing a LAM Boot Schema
    Host File Syntax
    Low Level LAM Start-up
    Process Schema
    hboot

Appendix A: Fortran Bindings
Appendix B: Fortran Example Program

LAM Architecture

LAM runs on each computer as a single daemon (server) uniquely structured as a nano-kernel and hand-threaded virtual processes. The nano-kernel component provides a simple message-passing, rendez-vous service to local processes. Some of the in-daemon processes form a network communication subsystem, which transfers messages to and from other LAM daemons on other machines. The network subsystem adds features such as packetization and buffering to the base synchronization. Other in-daemon processes are servers for remote capabilities, such as program execution and parallel file access. The layering is quite distinct: the nano-kernel has no connection with the network subsystem, which has no connection with the servers. Users can configure in or out services as necessary.

The unique software engineering of LAM is transparent to users and system administrators, who only see a conventional daemon. System developers can de-cluster the daemon into a daemon containing only the nano-kernel and several full client processes. This developers' mode is still transparent to users but exposes LAM's highly modular components to simplified individual debugging. It also reveals LAM's evolution from Trollius, which ran natively on scalable multicomputers and joined them to a host network through a uniform programming interface.

The network layer in LAM is a documented, primitive and abstract layer on which to implement a more powerful communication standard like MPI (PVM has also been implemented).

[Figure 1: LAM's Layered Design. Layer labels: local msgs, client mgmt; network msgs; MPI, client / server; cmds, apps, GUIs]

Debugging

A most important feature of LAM is hands-on control of the multicomputer. There is very little that cannot be seen or changed at runtime. Programs residing anywhere can be executed anywhere, stopped, resumed, killed, and watched the whole time. Messages can be viewed anywhere on the multicomputer and buffer constraints tuned as experience with the application dictates. If the synchronization of a process and a message can be easily displayed, mismatches resulting in bugs can easily be found.
These and other services are available both as a programming library and as utility programs run from any shell.

MPI Implementation

MPI synchronization boils down to four variables: context, tag, source rank, and destination rank. These are mapped to LAM's abstract synchronization at the network layer. MPI debugging tools interpret the LAM information with the knowledge of the LAM / MPI mapping and present detailed information to MPI programmers.

A significant portion of the MPI specification can be and is implemented within the runtime system and independent of the underlying environment.

As with all MPI implementations, LAM must synchronize the launch of MPI applications so that all processes locate each other before user code is entered. The mpirun command achieves this after finding and loading the program(s) which constitute the application. A simple SPMD application can be specified on the mpirun command line while a more complex configuration is described in a separate file, called an application schema.

MPI programs developed on LAM can be moved without source code changes to any other platform that supports MPI.

How to Get LAM

LAM installs anywhere and uses the shell's search path at all times to find LAM and application executables. A multicomputer is specified as a simple list of machine names in a file, which LAM uses to verify access, start the environment, and remove it.

LAM is freely available under a GNU license via anonymous ftp.

LAM / MPI Tutorial Introduction

The example programs in this section illustrate common operations in MPI. You will also see how to run and debug a program with LAM.

For basic applications, MPI is as easy to use as any other message-passing library.

Programming Tutorial

The first program is designed to run with exactly two processes. One process sends a message to the other and then both terminate. Enter the following code in trivial.c or obtain the source from the LAM source distribution (examples/trivial/trivial.c).

    /*
     * Transmit a message in a two process system.
     */
    #include <mpi.h>

    #define BUFSIZE 64

    int buf[BUFSIZE];

    int
    main(int argc, char *argv[])
    {
        int size, rank;
        MPI_Status status;

        /*
         * Initialize MPI.
         */
        MPI_Init(&argc, &argv);

        /*
         * Error check the number of processes.
         * Determine my rank in the world group.
         * The sender will be rank 0 and the receiver, rank 1.
         */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (2 != size) {
            MPI_Finalize();
            return(1);
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /*
         * As rank 0, send a message to rank 1.
         * The count is given in elements (BUFSIZE), not in bytes.
         */
        if (0 == rank) {
            MPI_Send(buf, BUFSIZE, MPI_INT, 1, 11,
                     MPI_COMM_WORLD);
        }
        /*
         * As rank 1, receive a message from rank 0.
         */
        else {
            MPI_Recv(buf, BUFSIZE, MPI_INT, 0, 11,
                     MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return(0);
    }

The World of MPI

Note that the program uses standard C program structure, statements, variable declarations and types, and functions.

Processes are represented by a unique "rank" (integer) and ranks are numbered 0, 1, 2, ..., N-1. MPI_COMM_WORLD means "all the processes in the MPI application." It is called a communicator and it provides all information necessary to do message-passing.
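Because ranks run from 0 to N-1, a process can combine its rank with the process count to decide which share of a data set it owns. The following is a minimal sketch of one common scheme (a contiguous block split), not something prescribed by MPI or LAM. The names `struct block` and `block_range` are hypothetical; in a real program the `rank` and `size` arguments would come from MPI_Comm_rank() and MPI_Comm_size().

```c
/* Half-open range [lo, hi) of item indices owned by one rank. */
struct block {
    long lo;
    long hi;
};

/*
 * Split n items over `size` ranks as evenly as possible.
 * The first (n % size) ranks each receive one extra item.
 * Hypothetical helper for illustration; not an MPI routine.
 */
struct block block_range(int rank, int size, long n)
{
    long base = n / size;   /* items every rank receives */
    long extra = n % size;  /* leftover items, one per low rank */
    struct block b;

    b.lo = rank * base + (rank < extra ? rank : extra);
    b.hi = b.lo + base + (rank < extra ? 1 : 0);
    return b;
}
```

With 10 items and 4 processes, ranks 0 and 1 own three items each and ranks 2 and 3 own two each, so the ranges cover every item exactly once.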
Portable libraries do more with communicators to provide synchronization protection that most other message-passing systems cannot handle.

Enter and Exit MPI

As with other systems, two routines are provided to initialize and cleanup an MPI process:

    MPI_Init(int *argc, char ***argv);
    MPI_Finalize(void);

Who Am I; Who Are They?

Typically, a process in a parallel application needs to know who it is (its rank) and how many other processes exist. A process finds out its own rank by calling MPI_Comm_rank().

    MPI_Comm_rank(MPI_Comm comm, int *rank);

The total number of processes is returned by MPI_Comm_size().

    MPI_Comm_size(MPI_Comm comm, int *size);

Sending Messages

A message is an array of elements of a given datatype. MPI supports all the basic datatypes and allows a more elaborate application to construct new datatypes at runtime.

A message is sent to a specific process and is marked by a tag (integer) specified by the user. Tags are used to distinguish between different message types a process might send/receive. In the example program above, the additional synchronization offered by the tag is unnecessary. Therefore, any random value is used that matches on both sides.

    MPI_Send(void *buf, int count, MPI_Datatype dtype,
             int dest, int tag, MPI_Comm comm);

Receiving Messages

A receiving process specifies the tag and the rank of the sending process. MPI_ANY_TAG and MPI_ANY_SOURCE may be used to receive a message of any tag and from any sending process.

    MPI_Recv(void *buf, int count, MPI_Datatype dtype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status);

Information about the received message is returned in a status variable.
If wildcards are used, the received message tag is status.MPI_TAG and the rank of the sending process is status.MPI_SOURCE.

Another routine, not used in the example program, returns the number of datatype elements received. It is used when the number of elements received might be smaller than the number specified to MPI_Recv(). It is an error to send more elements than the receiving process will accept.

    MPI_Get_count(MPI_Status *status, MPI_Datatype dtype,
                  int *nelements);

Master / Slave Example

The following example program is a communication skeleton for a dynamically load balanced master/slave application. The source can be obtained from the LAM source distribution (examples/trivial/ezstart.c). The program is designed to work with a minimum of two processes: one master and one slave.

    #include <unistd.h>             /* sleep() */
    #include <mpi.h>

    #define WORKTAG         1
    #define DIETAG          2
    #define NUM_WORK_REQS   200

    static void master(void);
    static void slave(void);

    /*
     * main
     * This program is really MIMD, but is written SPMD for
     * simplicity in launching the application.
     */
    int
    main(int argc, char *argv[])
    {
        int myrank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD,   /* group of everybody */
                      &myrank);         /* 0 thru N-1 */

        if (myrank == 0) {
            master();
        } else {
            slave();
        }

        MPI_Finalize();
        return(0);
    }

    /*
     * master
     * The master process sends work requests to the slaves
     * and collects results.
     */
    static void
    master(void)
    {
        int ntasks, rank, work;
        double result;
        MPI_Status status;

        MPI_Comm_size(MPI_COMM_WORLD,
                      &ntasks);         /* #processes in app */

        /*
         * Seed the slaves.
         */
        work = NUM_WORK_REQS;           /* simulated work */

        for (rank = 1; rank < ntasks; ++rank) {
            MPI_Send(&work,             /* message buffer */
                     1,                 /* one data item */
                     MPI_INT,           /* of this type */
                     rank,              /* to this rank */
                     WORKTAG,           /* a work message */
                     MPI_COMM_WORLD);   /* always use this */
            work--;
        }

        /*
         * Receive a result from any slave and dispatch a new work
         * request until work requests have been exhausted.
         */
        while (work > 0) {
            MPI_Recv(&result,           /* message buffer */
                     1,                 /* one data item */
                     MPI_DOUBLE,        /* of this type */
                     MPI_ANY_SOURCE,    /* from anybody */
                     MPI_ANY_TAG,       /* any message */
                     MPI_COMM_WORLD,    /* communicator */
                     &status);          /* recv'd msg info */

            MPI_Send(&work, 1, MPI_INT, status.MPI_SOURCE,
                     WORKTAG, MPI_COMM_WORLD);
            work--;                     /* simulated work */
        }

        /*
         * Receive results for outstanding work requests.
         */
        for (rank = 1; rank < ntasks; ++rank) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }

        /*
         * Tell all the slaves to exit.
         */
        for (rank = 1; rank < ntasks; ++rank) {
            MPI_Send(0, 0, MPI_INT, rank, DIETAG,
                     MPI_COMM_WORLD);
        }
    }

    /*
     * slave
     * Each slave process accepts work requests and returns
     * results until a special termination request is received.
     */
    static void
    slave(void)
    {
        double result;
        int work;
        MPI_Status status;

        for (;;) {
            MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);

            /*
             * Check the tag of the received message.
             */
            if (status.MPI_TAG == DIETAG) {
                return;
            }

            sleep(2);
            result = 6.0;               /* simulated result */

            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD);
        }
    }

The workings of ranks, tags and message lengths should be mastered before constructing serious MPI applications.

Operation Tutorial

Before running LAM you must establish certain environment variables and search paths for your shell. Add the following commands or equivalent to your shell start-up file (.cshrc, assuming C shell). Do not add these to your .login as they would not be effective on remote machines when rsh is used to start LAM.

    setenv LAMHOME <LAM installation directory>
    set path = ($path $LAMHOME/bin)

The local system administrator, or the person who installed LAM, will know the location of the LAM installation directory.
After editing the shell start-up file, invoke it to establish the new values. This is not necessary on subsequent logins to the UNIX system.

    % source .cshrc

Many LAM commands require one or more nodeids. Nodeids are specified on the command line as n<list>, where <list> is a list of comma separated nodeids or nodeid ranges.

    n1
    n1,3,5-10

The mnemonic 'h' refers to the local node where the command is typed (as in 'here').

Compilation

Any native C compiler is used to translate LAM programs for execution. All LAM runtime routines are found in a few libraries. LAM provides a wrapping command called hcc which invokes cc with the proper header and library directories, and is used exactly like the native cc.

    % hcc -o trivial trivial.c -lmpi

The major, internal LAM libraries are automatically linked. The MPI library is explicitly linked. Since LAM supports heterogeneous computing, it is up to the user to compile the source code for each of the various CPUs on their respective machines. After correcting any errors reported by the compiler, proceed to starting the LAM session.

Starting LAM

Before starting LAM, the user specifies the machines that will form the multicomputer. Create a host file listing the machine names, one on each line. An example file is given below for the machines "ohio" and "osc". Lines starting with the # character are treated as comment lines.

    # a 2-node LAM
    ohio
    osc

The first machine in the host file will be assigned nodeid 0, the second nodeid 1, etc. Now verify that the multicomputer is ready to run LAM. The recon tool checks if the user has access privileges on each machine in the multicomputer and if LAM is installed and accessible.

    % recon -v <host file>

If recon does not report a problem, proceed to start the LAM session with the lamboot tool.

    % lamboot -v <host file>

The -v (verbose) option causes lamboot to report on the start-up process as it progresses. You should return to your own shell's prompt.
LAM presents no special shell or interface environment.

Even if all seems well after start-up, verify communication with each node. tping is a simple confidence building command for this purpose.

    % tping n0

Repeat this command for all nodes or ping all the nodes at once with the broadcast mnemonic, N. tping responds by sending a message between the local node (where the user invoked tping) and the specified node. Successful execution of tping proves that the target node, nodes along the route from the local node to the target node, and the communication links between them are working properly. If tping fails, press Control-Z, terminate the session with the wipe tool and then restart the system. See Terminating the Session.

Executing Programs

To execute a program, use the mpirun command. The first example program is designed to run with two processes. The -c <#> option runs copies of the given program on nodes selected in a round-robin manner.

    % mpirun -v -c 2 trivial

The example invocation above assumes that the program is locatable on the machine on which it will run. mpirun can also transfer the program to the target node before running it. Assuming your multicomputer for this tutorial is homogeneous, you can use the -s h option to run both processes.

    % mpirun -v -c 2 -s h trivial

If the processes executed correctly, they will terminate and leave no traces. If you want more feedback, try using tprintf() functions within the program.

Monitoring

The first example program runs too quickly to be monitored. Try changing the tag in the call to MPI_Recv() to 12 (from 11). Recompile the program and rerun it as before. Now the receiving process cannot synchronize with the message from the send process because the tags are unequal. Look at the status of all MPI processes with the mpitask command.

You will notice that the receiving process is blocked in a call to MPI_Recv(): a synchronizing message has not been received.
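The blocked receive comes down to the matching rule on the four synchronization variables (context, source rank, destination rank and tag). The sketch below is a plain C model of that rule under stated assumptions, not LAM's implementation: `struct envelope`, `struct recv_request`, `envelope_matches`, and the wildcard constants `ANY_SOURCE` and `ANY_TAG` are hypothetical stand-ins for what MPI keeps internally and for MPI_ANY_SOURCE and MPI_ANY_TAG.

```c
#include <stdbool.h>

#define ANY_SOURCE (-1)   /* stand-in for MPI_ANY_SOURCE */
#define ANY_TAG    (-1)   /* stand-in for MPI_ANY_TAG */

/* Synchronization data carried by a sent message. */
struct envelope {
    int context;   /* taken from the sender's communicator */
    int source;    /* rank of the sender */
    int dest;      /* rank of the intended receiver */
    int tag;       /* user-chosen tag */
};

/* Synchronization data supplied by a posted receive. */
struct recv_request {
    int context;   /* taken from the receiver's communicator */
    int source;    /* expected sender rank, or ANY_SOURCE */
    int me;        /* rank of the receiving process */
    int tag;       /* expected tag, or ANY_TAG */
};

/* Return true when the message satisfies the posted receive. */
bool envelope_matches(const struct envelope *msg,
                      const struct recv_request *req)
{
    if (msg->context != req->context)
        return false;
    if (msg->dest != req->me)
        return false;
    if (req->source != ANY_SOURCE && req->source != msg->source)
        return false;
    if (req->tag != ANY_TAG && req->tag != msg->tag)
        return false;
    return true;
}
```

In the modified tutorial program the message carries tag 11 and the receive asks for tag 12, so the tag comparison fails and the receive stays blocked even though context, source and destination all agree.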
From the code we know this is process rank 1 in the MPI application, which is confirmed in the first column, the MPI task identification. The first number is the rank within the world group. The second number is the rank within the communicator being used by MPI_Recv(), in this case (and in many applications with simple communication structure) also the world group. The specified source of the message is likewise identified. The synchronization tag is 12 and the length of the receive buffer is 64 elements of type MPI_INT.

    % mpitask
    TASK (G/L)     FUNCTION    PEER|ROOT    TAG    COMM     COUNT    DATATYPE
    0/0 trivial    Finalize
    1/1 trivial    Recv        0/0          12     WORLD    64       INT

The message was transferred from the sending process to a system buffer en route to process rank 1. MPI_Send() was able to return and the process has called MPI_Finalize(). System buffers, which can be thought of as message queues for each MPI process, can be examined with the mpimsg command.

    % mpimsg
    SRC (G/L)    DEST (G/L)    TAG    COMM     COUNT    DATATYPE    MSG
    0/0          1/1           11     WORLD    64       INT         n1,#0

The message shows that it originated from process rank 0 using MPI_COMM_WORLD and that it is waiting in the message queue of process rank 1, the destination. The tag is 11 and the message contains 64 elements of type MPI_INT. This information corresponds to the arguments given to MPI_Send(). Since the application is faulty and will never complete, we will kill it with the lamclean command.

    % lamclean -v

The LAM session should be in the same state as after invoking lamboot. You can also terminate the session and restart it with lamboot, but this is a much slower operation. You can now correct the program, recompile and rerun.

Terminating the Session

To terminate LAM, use the wipe tool.
The host file argument must be the same as the one given to lamboot.

    % wipe -v <host file>

MPI Programming Primer

Basic Concepts

Through the Message Passing Interface (MPI) an application views its parallel environment as a static group of processes. An MPI process is born into the world with zero or more siblings. This initial collection of processes is called the world group. A unique number, called a rank, is assigned to each member process from the sequence 0 through N-1, where N is the total number of processes in the world group. A member can query its own rank and the size of the world group. Processes may all be running the same program (SPMD) or different programs (MIMD). The world group processes may subdivide, creating additional subgroups with a potentially different rank in each group.

A process sends a message to a destination rank in the desired group. A process may or may not specify a source rank when receiving a message. Messages are further filtered by an arbitrary, user specified, synchronization integer called a tag, which the receiver may also ignore.

An important feature of MPI is the ability to guarantee independent software developers that their choice of tag in a particular library will not conflict with the choice of tag by some other independent developer or by the end user of the library. A further synchronization integer called a context is allocated by MPI and is automatically attached to every message. Thus, the four main synchronization variables in MPI are the source and destination ranks, the tag and the context.

A communicator is an opaque MPI data structure that contains information on one group and that contains one context. A communicator is an argument to all MPI communication routines.

After a process is created and initializes MPI, three predefined communicators are available.

    MPI_COMM_WORLD     the world group
    MPI_COMM_SELF      group with one member, myself
    MPI_COMM_PARENT    an intercommunicator between two groups:
                       my world group and my parent group
                       (See Dynamic Processes.)

Many applications require no other communicators beyond the world communicator. If new subgroups or new contexts are needed, additional communicators must be created.

MPI constants, templates and prototypes are in the MPI header file, mpi.h.

    #include <mpi.h>

Initialization

    MPI_Init           Initialize MPI state.
    MPI_Finalize       Clean up MPI state.
    MPI_Abort          Abnormally terminate.
    MPI_Comm_size      Get group process count.
    MPI_Comm_rank      Get my rank within process group.
    MPI_Initialized    Has MPI been initialized?

The first MPI routine called by a program must be MPI_Init(). The command line arguments are passed to MPI_Init().

    MPI_Init(int *argc, char **argv[]);

A process ceases MPI operations with MPI_Finalize().

    MPI_Finalize(void);

In response to an error condition, a process can terminate itself and all members of a communicator with MPI_Abort(). The implementation may report the error code argument to the user in a manner consistent with the underlying operating system.

    MPI_Abort(MPI_Comm comm, int errcode);

Basic Parallel Information

Two numbers that are very useful to most parallel applications are the total number of parallel processes and self process identification.
This information is learned from the MPI_COMM_WORLD communicator using the routines MPI_Comm_size() and MPI_Comm_rank().

    MPI_Comm_size(MPI_Comm comm, int *size);
    MPI_Comm_rank(MPI_Comm comm, int *rank);

Of course, any communicator may be used, but the world information is usually key to decomposing data across the entire parallel application.

Blocking Point-to-Point

    MPI_Send                Send a message in standard mode.
    MPI_Recv                Receive a message.
    MPI_Get_count           Count the elements received.
    MPI_Probe               Wait for message arrival.
    MPI_Bsend               Send a message in buffered mode.
    MPI_Ssend               Send a message in synchronous mode.
    MPI_Rsend               Send a message in ready mode.
    MPI_Buffer_attach       Attach a buffer for buffered sends.
    MPI_Buffer_detach       Detach the current buffer.
    MPI_Sendrecv            Send in standard mode, then receive.
    MPI_Sendrecv_replace    Send and receive from/to one area.
    MPI_Get_elements        Count the basic elements received.

This section focuses on blocking, point-to-point, message-passing routines. The term "blocking" in MPI means that the routine does not return until the associated data buffer may be reused. A point-to-point message is sent by one process and received by one process.

Send Modes

The issues of flow control and buffering present different choices in designing message-passing primitives. MPI does not impose a single choice but instead offers four transmission modes that cover the synchronization, data transfer and performance needs of most applications. The mode is selected by the sender through four different send routines, all with identical argument lists. There is only one receive routine.
The four send modes are:

    standard       The send completes when the system can buffer the
                   message (it is not obligated to do so) or when the
                   message is received.
    buffered       The send completes when the message is buffered in
                   application supplied space, or when the message is
                   received.
    synchronous    The send completes when the message is received.
    ready          The send must not be started unless a matching
                   receive has been started. The send completes
                   immediately.

Standard Send

Standard mode serves the needs of most applications. A standard mode message is sent with MPI_Send().

    MPI_Send(void *buf, int count, MPI_Datatype dtype,
             int dest, int tag, MPI_Comm comm);

An MPI message is not merely a raw byte array. It is a count of typed elements. The element type may be a simple raw byte or a complex data structure. See Message Datatypes.

The four MPI synchronization variables are indicated by the MPI_Send() parameters. The source rank is the caller's. The destination rank and message tag are explicitly given. The context is a property of the communicator.

Because MPI_Send() is a blocking routine, the buffer can be overwritten once it returns. Although most systems will buffer some number of messages, especially short messages, without any receiver, a programmer cannot rely upon MPI_Send() to buffer even one message. Expect that the routine will not return until there is a matching receiver.

Receive

A message in any mode is received with MPI_Recv().

    MPI_Recv(void *buf, int count, MPI_Datatype dtype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status);

Again the four synchronization variables are indicated, with source and destination swapping places.
The source rank and the tag can be ignored with the special values MPI_ANY_SOURCE and MPI_ANY_TAG. If both these wildcards are used, the next message for the given communicator is received.

Status Object

An argument not present in MPI_Send() is the status object pointer. The status object is filled with useful information when MPI_Recv() returns. If the source and/or tag wildcards were used, the actual received source rank and/or message tag are accessible directly from the status object.

    status.MPI_SOURCE    the sender's rank
    status.MPI_TAG       the tag given by the sender

Message Lengths

It is erroneous for an MPI program to receive a message longer than the specified receive buffer. The message might be truncated or an error condition might be raised or both. It is completely acceptable to receive a message shorter than the specified receive buffer. If a short message may arrive, the application can query the actual length of the message with MPI_Get_count().

    MPI_Get_count(MPI_Status *status, MPI_Datatype dtype,
                  int *count);
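The message length rule can be illustrated with a small plain C model. This is a sketch under stated assumptions, not how LAM or any MPI library delivers messages: `deliver_ints`, `RECV_OK` and `RECV_TRUNCATED` are hypothetical names, and the model picks one of the outcomes the text allows for an oversized message (truncate and report an error). The `received` output plays the role of the value an application would obtain from MPI_Get_count().

```c
/* Outcome of delivering one message into a receive buffer. */
enum recv_result {
    RECV_OK = 0,         /* message fit in the buffer */
    RECV_TRUNCATED = 1   /* message was longer than the buffer */
};

/*
 * Copy a message of msg_count ints into a buffer that can hold
 * buf_count ints. The number of elements actually stored is
 * written to *received. Hypothetical model, not an MPI routine.
 */
enum recv_result deliver_ints(const int *msg, int msg_count,
                              int *buf, int buf_count,
                              int *received)
{
    int n = msg_count < buf_count ? msg_count : buf_count;

    for (int i = 0; i < n; i++)
        buf[i] = msg[i];

    *received = n;
    return msg_count > buf_count ? RECV_TRUNCATED : RECV_OK;
}
```

A three-element message delivered into an eight-element buffer succeeds and reports three elements; the same message delivered into a two-element buffer is the erroneous case.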



Hyundai Elantra Product Manual

Dare to challenge the status quo and find greater courage without fear of failure.
Challenge tradition and prejudice to find opportunities for innovation.
Be in charge of today through passion and effort.
Open up the world of tomorrow using your own standards, not the world's.
Do not hesitate, but stay bold and go for those big dreams.
Dare to be you.

The answer is you.
Stand strong. Have faith in yourself, then see your true strengths unleashed.

Question the old rules.
The 'Parametric Dynamics' design accentuates geometric aesthetics of the elongated hood and sleek roof lines, completing the visionary and innovative style.

Parametric Jewel Surface
Three sections emerge out of three bold-edged lines intersecting at a single point, creating three different colors of light.

Parametric Jewel Pattern Grille
The stereoscopic Parametric Jewel-pattern design highlights the depth of the front grille, making it resemble diamond-cut gemstones, and the bold and elongated front headlights come together to give Elantra its sporty front look.

The edgy spoiler on the trunk and the integrated all-in-one taillight, representing Hyundai with its distinct H-shape design, help to create a high-tech, futuristic rear look.

H-Tail Lamp
LED Headlights
17" Alloy Wheels & Tires

Bigger, longer, and lower than ever.
Elantra's sporty look and elaborate lines highlight its bold presence.

Immersive Interface
The 10.25" cluster display and 8" audio display deliver a fully immersive space, tilted 10 degrees toward the driver for easier control and a high-tech feel.

Electric Parking Brake with Auto Hold
Simply engage/disengage the parking brake with a single switch. The electric parking brake comes with Auto Hold, which keeps the vehicle stationary while stopped.

BOSE Premium Sound System
Elantra's 8 high-performance speakers deliver precise, powerful sound, ranging from low- to high-pitched tones at volumes that adjust according to the speed of your vehicle.

Most sensuous.
The Immersive Interface cocoons the driver like a cockpit that surrounds the pilot, offering easier control and an enveloping, advanced driving experience.

Phone Projection
The main features of your smartphone are shown and controllable on the interior display.

Wireless Smartphone Charging
Wireless charging that is as simple as placing your phone on the charging pad.

Stay connected.
Elantra's highly advanced connectivity features are extremely intuitive and easy to use, keeping you connected with the car and bringing greater innovation to your driving experience.

* The availability of Hyundai CarPlay and Android Auto on mobile phones may vary depending on your country or region and will be in compliance with the policies of Google Play and App Store.

Make your move.
Elantra's newly developed 3rd generation platform delivers agile handling and stability powered by a fuel-efficient engine, giving you optimal driving performance wherever you go.

    Engine                              Max. Power               Max. Torque
    Gamma 1.6 MPi Gasoline Engine       127.5 ps / 6,300 rpm     15.77 kg·m / 4,850 rpm
    Smartstream G2.0 Gasoline Engine    159 ps / 6,200 rpm       19.5 kg·m / 4,500 rpm

Forward Collision-Avoidance Assist
If the preceding vehicle suddenly slows down, or if a forward collision risk is detected, such as a stopped vehicle or a pedestrian in front, it provides a warning. After the warning, if the risk of collision increases, it automatically assists with emergency braking. While driving, if there is a risk of collision with a cyclist, it automatically assists with emergency braking. If there is a risk of collision with an oncoming vehicle while turning left at an intersection, it automatically assists with emergency braking.

Blind-Spot Collision-Avoidance Assist
When operating the turn signal switch to change lanes, if there is a risk of collision with a rear side vehicle, it provides a warning. After the warning, if the risk of collision increases, it automatically controls the vehicle to help avoid a collision.

Driver Attention Warning
Displays the driver's attention level while driving. Provides a warning when signs of driver inattentiveness are detected, and recommends a rest if needed. The driver should stop and park in a safe place and get plenty of rest before driving. During a stop, the driver is alerted if the leading vehicle departs.

Smart Cruise Control
Smart Cruise Control helps maintain distance from the vehicle ahead and drive at a speed set by the driver. The vehicle stops automatically, and starts again automatically if the vehicle in front starts within a short time. If a period of time has elapsed, start again by pressing the accelerator pedal or operating the switch.

Lane Following Assist
Detects the lane and the vehicle ahead on the road with a front view camera on the front windshield, and assists the driver's steering to help keep the vehicle centered between the lanes.

Rear Cross-Traffic Collision-Avoidance Assist
If there is a risk of collision with an oncoming vehicle on the left or right side while reversing, it provides a warning.
Intel® MPI Library for Windows* OS Getting Started Guide

The Intel® MPI Library is a multi-fabric message passing library that implements the Message Passing Interface, v2 (MPI-2) specification.
Use it to switch interconnection fabrics without re-linking.

This Getting Started Guide explains how to use the Intel® MPI Library to compile and run a simple MPI program. This guide also includes basic usage examples and troubleshooting tips.

To quickly start using the Intel® MPI Library, print this short guide and walk through the example provided.

Copyright © 2003–2010 Intel Corporation
All Rights Reserved
Document Number: 316404-007
Revision: 4.0
World Wide Web:

Contents
1 About this Document
1.1 Intended Audience
1.2 Using Doc Type Field
1.3 Conventions and Symbols
1.4 Related Information
2 Using the Intel® MPI Library
2.1 Usage Model
2.2 Before you Begin
2.3 Quick Start
2.4 Compiling and Linking
2.5 Setting up SMPD Services
2.6 Selecting a Network Fabric
2.7 Running an MPI Program
3 Troubleshooting
3.1 Testing Installation
3.2 Compiling and Running a Test Program
4 Next Steps

Disclaimer and Legal Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined."
Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See /products/processor_number for details.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2007-2010, Intel Corporation.
All rights reserved.

1 About this Document

The Intel® MPI Library for Windows* OS Getting Started Guide contains information on the following subjects:
• First steps using the Intel® MPI Library
• Troubleshooting outlines first-aid troubleshooting actions

1.1 Intended Audience

This Getting Started Guide is intended for first time users.

1.2 Using Doc Type Field

This Getting Started Guide contains the following sections:

Document Organization
Section                                   Description
Section 1 About this Document             Section 1 introduces this document
Section 2 Using the Intel® MPI Library    Section 2 describes how to use the Intel® MPI Library
Section 3 Troubleshooting                 Section 3 outlines first-aid troubleshooting actions
Section 4 Next Steps                      Section 4 provides links to further resources

1.3 Conventions and Symbols

The following conventions are used in this document.

Table 1.3-1 Conventions and Symbols used in this Document
This type style      Document or product names
This type style      Hyperlinks
This type style      Commands, arguments, options, file names
THIS_TYPE_STYLE      Environment variables
<this type style>    Placeholders for actual values
[ items ]            Optional items
{ item | item }      Selectable items separated by vertical bar(s)
(SDK only)           For Software Development Kit (SDK) users only

1.4 Related Information

To get more information about the Intel® MPI Library, see the following resources:
Product Web Site
Intel® MPI Library Support
Intel® Cluster Tools Products
Intel® Software Development Products

2 Using the Intel® MPI Library

2.1 Usage Model

Using the Intel® MPI Library involves the following steps. These steps are described in detail in the corresponding sections.

Figure 1: Flowchart representing the usage model for working with the Intel® MPI Library.

2.2 Before You Begin

1. Before using the Intel® MPI Library, ensure that the library, scripts, and utility applications are installed.
See the Intel® MPI Library for Windows* OS Installation Guide for installation instructions.
2. To get the proper environment settings, use the following commands from the Start menu:
   Start > Programs > Intel(R) Software Development Tools > Intel(R) MPI Library 4.0 > Build Environment for the IA-32 architecture
   Start > Programs > Intel Software Development Tools > Intel(R) MPI Library 4.0 > Build Environment for the Intel® 64 architecture
   Alternatively, you can open a new console (cmd) window and run one of the following BAT files from the command line.
   <installdir>\ia32\bin\mpivars.bat
   <installdir>\em64t\bin\mpivars.bat
3. You need administrator privileges on all nodes of the cluster to start the smpd service on them.

2.3 Quick Start

1. Use the call batch command to get the proper environment settings from the mpivars.bat batch scripts included with the Intel® MPI Library. The script is located in the <installdir>\em64t\bin directory for the Intel® 64 architecture, or in the <installdir>\ia32\bin directory for the 32-bit mode.
2. Make sure the smpd services are installed and started on compute nodes. Otherwise install them manually from the command line by using the -install smpd option. If the smpd service stops, start it through Computer Management -> Services and Applications -> Services or from the command line manually using the -start smpd option.
3. (SDK only) Make sure that you have a compiler in your PATH.
4. (SDK only) Compile the test program using the appropriate compiler driver. For example:
   > mpicc.bat -o test <installdir>\test\test.c
5. Register your credentials using the wmpiregister GUI utility.
6. Execute the test using the GUI utility wmpiexec. Set the application name and a number of processes. In this case all processes start on the current host. To start a test on a remote host or on more than one host, press the Advanced Options button and fill in the appropriate fields. Use the Show Command button to check the command line.
Press the Execute button to start the program.

You can use the command line interface instead of the GUI interface.
1. Use the mpiexec -register option instead of the wmpiregister GUI utility to register your credentials.
2. Use the CLI mpiexec command to execute the test.
   > mpiexec.exe -n <# of processes> test.exe
   or
   > mpiexec.exe -hosts <# of hosts> <host1_name> \
     <host1 # of processes> <host2_name> \
     <host2 # of processes> … test.exe
See the rest of this document and the Intel® MPI Library Reference Manual for Windows* OS for more details.

2.4 Compiling and Linking

(SDK only)
To compile and link an MPI program with the Intel® MPI Library, do the following:
1. Create a Winxx Console project for Microsoft* Visual Studio* 2005.
2. Choose the x64 solution platform.
3. Add <installdir>\em64t\include to the include path.
4. Add <installdir>\em64t\lib to the library path.
5. Add impi.lib (Release) or impid.lib (Debug) to your target link command for C applications.
6. Add impi.lib and impicxx.lib (Release) or impid.lib and impicxxd.lib (Debug) to your target link command for C++ applications. Link the application with impimt.lib (Release) or impidmt.lib (Debug) for multithreading.
7. Build the program.
8. Place your application and all the dynamic libraries in a shared location or copy them to all the nodes.
9. Run the application using the mpiexec.exe command.

2.5 Setting up SMPD Services

The Intel® MPI Library uses a Simple Multi-Purpose Daemon (SMPD) job startup mechanism. In order to run programs compiled with Microsoft* Visual Studio* (or related), set up a SMPD service.

NOTE: You need administrator privileges to start the smpd service; once it is running, all users can launch processes with mpiexec.

To set up SMPD services:
1. During the Intel® MPI Library installation the smpd service is started. During installation you can cancel the smpd service startup.
2. You can start, restart, stop or remove the smpd service manually when the Intel® MPI Library is installed.
Find smpd.exe in the <installdir>\em64t\bin directory.
3. Use the following command on each node of the cluster to remove the previous smpd service:
   > smpd.exe -remove
4. Use the following command on each node of the cluster to install the smpd service manually:
   > smpd.exe -install

2.6 Selecting a Network Fabric

The Intel® MPI Library dynamically selects different fabrics for communication between MPI processes. To select a specific fabric combination, set the new I_MPI_FABRICS or the old I_MPI_DEVICE environment variable.

I_MPI_FABRICS (I_MPI_DEVICE)
Select the particular network fabrics to be used.

Syntax
I_MPI_FABRICS=<fabric>|<intra-node fabric>:<inter-nodes fabric>
Where
<fabric> := {shm, dapl, tcp}
<intra-node fabric> := {shm, dapl, tcp}
<inter-nodes fabric> := {dapl, tcp}

Deprecated Syntax
I_MPI_DEVICE=<device>[:<provider>]

Arguments
<fabric>    Define a network fabric
shm         Shared-memory
dapl        DAPL-capable network fabrics, such as InfiniBand*, iWarp*, Dolphin*, and XPMEM* (through DAPL*)
tcp         TCP/IP-capable network fabrics, such as Ethernet and InfiniBand* (through IPoIB*)

Correspondence with I_MPI_DEVICE
<device>    <fabric>
sock        tcp
shm         shm
ssm         shm:tcp
rdma        dapl
rdssm       shm:dapl

<provider>  Optional DAPL* provider name (only for the rdma and the rdssm devices)
I_MPI_DAPL_PROVIDER=<provider>

Use the <provider> specification only for the {rdma,rdssm} devices.
For example, to select the OFED* InfiniBand* device, use the following command:
   > mpiexec -n <# of processes> \
     -env I_MPI_DEVICE rdssm:ibnic0v2-scm <executable>
For these devices, if <provider> is not specified, the first DAPL* provider in the dat.conf file is used.

NOTE: Ensure the selected fabric is available. For example, use shm only if all the processes can communicate with each other through shared memory. Use rdma only if all the processes can communicate with each other through a single DAPL provider. Ensure that the dat.dll library is in your %PATH%.
Otherwise, use the -genv option of mpiexec.exe to set the I_MPI_DAT_LIBRARY environment variable to the fully-qualified path of the dat.dll library.

2.7 Running an MPI Program

Use the mpiexec command to launch programs linked with the Intel® MPI Library:
   > mpiexec.exe -n <# of processes> myprog.exe

NOTE: The wmpiexec utility is a GUI wrapper for mpiexec.exe. See the Intel® MPI Library Reference Manual for more details.

The -n option is the only required mpiexec option; it sets the number of processes on the local node.
Use the -hosts option to set names of hosts and number of processes:
   > mpiexec.exe -hosts 2 host1 2 host2 2 myprog.exe
If you are using a network fabric other than the default fabric, use the -genv option to set the I_MPI_DEVICE variable.
For example, to run an MPI program using the shm fabric, type in the following command:
   > mpiexec.exe -genv I_MPI_DEVICE shm -n <# of processes> \
     myprog.exe
You may use the -configfile option to run the program:
   > mpiexec.exe -configfile config_file
The configuration file contains:
   -host host1 -n 1 -genv I_MPI_DEVICE rdssm myprog.exe
   -host host2 -n 1 -genv I_MPI_DEVICE rdssm myprog.exe
For the rdma capable fabric, use the following command:
   > mpiexec.exe -hosts 2 host1 1 host2 1 -genv I_MPI_DEVICE rdma myprog.exe
You can select any supported device. For more information, see Section 2.6 Selecting a Network Fabric.
If you successfully run your application using the Intel® MPI Library, you can move your application from one cluster to another and use different fabrics between the nodes without re-linking.
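The correspondence between the deprecated I_MPI_DEVICE values and the newer I_MPI_FABRICS values listed in Section 2.6 can be captured in a small lookup. The following Python sketch is only an illustration of that table: the helper name fabrics_for_device is hypothetical and not part of the Intel® MPI Library, and the provider handling assumes the <device>[:<provider>] syntax shown above.

```python
# Mapping copied from the "Correspondence with I_MPI_DEVICE" table in Section 2.6.
DEVICE_TO_FABRICS = {
    "sock": "tcp",
    "shm": "shm",
    "ssm": "shm:tcp",
    "rdma": "dapl",
    "rdssm": "shm:dapl",
}


def fabrics_for_device(device_setting):
    """Translate an I_MPI_DEVICE value into (I_MPI_FABRICS value, provider).

    Hypothetical helper for illustration only. `device_setting` follows the
    deprecated syntax <device>[:<provider>]; provider is None when absent.
    """
    device, _, provider = device_setting.partition(":")
    if device not in DEVICE_TO_FABRICS:
        raise ValueError("unknown I_MPI_DEVICE value: " + device)
    return DEVICE_TO_FABRICS[device], provider or None


if __name__ == "__main__":
    print(fabrics_for_device("rdssm:ibnic0v2-scm"))
    print(fabrics_for_device("sock"))
```

According to the guide, a provider name is meaningful only for the rdma and rdssm devices; the sketch does not enforce that restriction.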
If you encounter problems, see Troubleshooting for possible solutions.

3 Troubleshooting

Use the following sections to troubleshoot problems with installation, setup, and running applications using the Intel® MPI Library.

3.1 Testing Installation

To ensure that the Intel® MPI Library is installed and functioning, complete the general checks below, then compile and run a test program.
To test the installation:
1. Verify through the Computer Management that the smpd service is started. It calls the Intel MPI Process Manager.
2. Verify that <installdir>\ia32\bin (<installdir>\em64t\bin for the Intel® 64 architecture in the 64-bit mode) is in your path:
   > echo %PATH%
   You should see the correct path for each node you test.
3. (SDK only) If you use Intel compilers, verify that the appropriate directories are included in the PATH and LIB environment variables:
   > mpiexec.exe -hosts 2 host1 1 host2 1 a.bat
   where a.bat contains
   echo %PATH%
   You should see the correct directories for these path variables for each node you test. If not, call the appropriate *vars.bat scripts.
For example, with Intel® C++ Compiler 11.0 for Windows* OS for the Intel® 64 architecture in the 64-bit mode, use the Windows program menu to select:
   Intel(R) Software Development Tools > Intel(R) C++ Compiler 11.0 > Build Environment for the Intel® 64 architecture
or, from the command line:
   %ProgramFiles%\Intel\Compiler\C++\11.0\em64t\bin\iclvars.bat

3.2 Compiling and Running a Test Program

The install directory <installdir>\test contains test programs which you can use for testing. To compile one of them or your test program, do the following:
1. (SDK only) Compile a test program as described in Section 2.4 Compiling and Linking.
2. If you are using InfiniBand* or other RDMA-capable network hardware and software, verify that everything is functioning.
3. Run the test program with all available configurations on your cluster.
   • Test the sock device using:
     > mpiexec.exe -n 2 -env I_MPI_DEBUG 2 -env I_MPI_DEVICE sock a.out
     You should see one line of output for each rank, as well as debug output indicating that the sock device is used.
   • Test the ssm devices using:
     > mpiexec.exe -n 2 -env I_MPI_DEBUG 2 -env I_MPI_DEVICE ssm a.out
     You should see one line of output for each rank, as well as debug output indicating that the ssm device is used.
   • Test any other fabric devices using:
     > mpiexec.exe -n 2 -env I_MPI_DEBUG 2 -env I_MPI_DEVICE <device> a.out
     where <device> can be shm, rdma, or rdssm.
For each of the mpiexec commands used, you should see one line of output for each rank, as well as debug output indicating which device was used. The device(s) should agree with the I_MPI_DEVICE setting.

4 Next Steps

To get more information about the Intel® MPI Library, explore the following resources:
The Intel® MPI Library Release Notes include key product details.
See the Intel® MPI Library Release Notes for updated information on requirements, technical support, and known limitations. Use the Windows program menu to select Intel(R) Software Development Tools > Intel(R) MPI Library > Intel(R) MPI Library for Windows* OS Release Notes.
For more information see Websites:
Product Web Site
Intel® MPI Library Support
Intel® Cluster Tools Products
Intel® Software Development Products




MPI's Reduction Operations in Clustered Wide Area Systems

Thilo Kielmann, Rutger F.H. Hofman, Henri E. Bal, Aske Plaat, Raoul A.F. Bhoedjang
Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
kielmann,rutger,bal,aske,

Abstract: The emergence of meta computers and computational grids makes it feasible to run parallel programs on large-scale, geographically distributed computer systems. Writing parallel applications for such systems is a challenging task which may require changes to the communication structure of the applications. MPI's collective operations (such as broadcast and reduce) allow for some of these changes to be hidden from the applications programmer. We have developed MagPIe, a library of collective communication operations optimized for wide area systems. MagPIe's algorithms are designed to send the minimal amount of data over the slow wide area links, and to only incur a single wide area latency. This paper discusses MPI's collective reduction operations. Compared to systems that do not take the topology into account, such as MPICH, large performance improvements are possible. For larger messages, best performance is achieved when the reduction function is associative.
On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, operations execute up to 8 times faster than MPICH; application kernels improve by up to a factor of 3. Due to the structure of our algorithms, the advantage increases for higher wide area latencies.

I. INTRODUCTION

Several research projects pursue the idea of integrating computing resources at different locations into a single, powerful parallel system. Metacomputing projects like Globus [15] and Legion [17] build the software infrastructure that makes such an integration possible. An important problem, however, is how to write parallel programs that run efficiently on metacomputers (or computational grids [16]). The key difference with traditional parallel programs is that communication between distant computers can be orders of magnitude slower than that between the processors within a parallel machine. Wide-area networks typically have a latency and bandwidth that are a factor of 100–1000 worse than those of local interconnects.
Earlier research showed that, for many applications, it is possible to overcome the slowness of wide-area links by optimizing programs at the application level [5], [13], [26]. Metacomputers typically consist of several clusters (parallel machines like MPPs or networks of workstations), connected by slow wide-area links. They typically have a hierarchical structure with slow and fast links. Programmers can take this hierarchical structure into account and minimize the amount of traffic over the slow wide-area links [26].

Such optimizations, however, can complicate metacomputer programming significantly. The goal of our work is to hide the hierarchical structure as much as possible from the programmer, by implementing the optimizations in a communication library. This works especially well for collective communication primitives, such as found in MPI. Current implementations of MPI (for example, MPICH [2]) are designed for "flat" systems and run inefficiently on wide-area, hierarchical systems. We have designed and implemented a new library, called MagPIe, whose collective communication routines are optimized for hierarchical systems.

In an earlier paper, we described the general design of MagPIe and the implementation and performance of some collective operations [20]. In this paper, we discuss in more detail the reduction operations [...]

II. ALGORITHM DESIGN

MagPIe implements wide area optimal algorithms for all of MPI's collective operations. We first outline the general structure of MagPIe's algorithms. Then, we discuss how associativity of the reduction operators influences the applicability of this structure to the reduction operations.

A. General Algorithm Structure

MagPIe's goal is to minimize completion time of a collective operation, which we define as the moment at which all processors have received all messages that belong to that operation. The performance of collective communication operations on a wide area system is dominated by the time spent on the wide area
links; local communication plays a minor role. In a collective operation, all processors have to communicate with each other, so the completion time cannot be less than the wide area latency. (On interconnects with varying latencies the highest latency dominates completion time. To simplify our analysis, we assume that all wide-area links have the same bandwidth and latency.) In the design of our wide area algorithms, we have used the following two conditions:
1. Every sender-receiver path used by an algorithm contains at most one wide area link.
2. Data items only travel to those clusters that need them; no data item travels multiple times to the same cluster.
Condition (1) ensures that the wide area latency contributes at most once to an operation's completion time. Condition (2) prevents waste of precious wide area bandwidth. We call algorithms that adhere to both conditions wide area optimal. Reduction operations have a high potential for optimization, by computing partial reductions locally in each cluster. The applicability of this optimization depends on associativity and commutativity of the reduction operation. We will discuss this issue in detail below.

In previous work [20], we showed how MPI's collective operations for synchronization and data exchange can be implemented by wide area optimal algorithms. In Section III we will show how MPI's reduction operations can be implemented accordingly.

We distinguish between the completion time t_s of a message send and the completion time t_r of the matching receive. We assume that messages are sent asynchronously: a message send completes when the message has been injected into the network. Note that t_s only depends on message size; t_r additionally depends on network bandwidth and latency. These performance characteristics determine the shape of the optimal communication graph [7], [19], [25].

Local communication contributes a negligible amount to the overall completion time; for local communication we use graph shapes that
best fit the needs of the operations, like binary or binomial trees. According to [7], binomial trees are optimal when t_r/t_s is small. However, for wide area communication, where t_r >> t_s, the optimal shape is a one-level flat tree [7]. Thus we have two communication graphs: the intra-cluster graph that connects the processors within a single cluster, and the inter-cluster graph that connects the different clusters. To interface both graphs, we designate a coordinator node for each cluster. Notice that the one-level tree satisfies condition (1). Condition (2) depends on the semantics of the actual operation, as will be discussed further in Section III.

[Fig. 1. Wide Area Optimal Communication Graphs: (a) symmetric operation, (b) asymmetric operation]

The optimal inter-cluster graph shape depends on the values for latency, bandwidth, message size, and the number of [...]. Computing optimal graph shapes therefore requires run time instrumentation (see, for example, [22]). MagPIe does not yet perform this analysis. In Section III we show that MagPIe's combination of one-level trees and binary/binomial trees fits the wide area case well enough to outperform MPICH's algorithms in our tests.
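Condition (1) can be checked mechanically for a given tree shape. The following Python sketch is an illustration only and not MagPIe code: it assumes 8 processes placed contiguously in 4 clusters of 2, takes rank 0 as the root, and uses the first rank of each cluster as its coordinator. It counts the wide area links on each path to the root for a topology-unaware binomial tree and for a two-level tree (flat between coordinators, local inside a cluster).

```python
NUM_RANKS = 8       # assumed process count for this illustration
CLUSTER_SIZE = 2    # assumed: ranks are placed contiguously, 2 per cluster


def cluster_of(rank):
    """Cluster index of a rank under the contiguous placement assumed above."""
    return rank // CLUSTER_SIZE


def binomial_parent(rank):
    """Parent in a binomial tree rooted at rank 0: clear the highest set bit."""
    return rank - (1 << (rank.bit_length() - 1))


def two_level_parent(rank):
    """Parent in a two-level tree: node -> cluster coordinator -> root (rank 0)."""
    coordinator = cluster_of(rank) * CLUSTER_SIZE  # first rank of the cluster
    return 0 if rank == coordinator else coordinator


def wan_links_on_path(rank, parent):
    """Count wide area links on the path from `rank` up to the root."""
    links = 0
    while rank != 0:
        up = parent(rank)
        if cluster_of(up) != cluster_of(rank):
            links += 1
        rank = up
    return links


binomial_worst = max(wan_links_on_path(r, binomial_parent) for r in range(NUM_RANKS))
two_level_worst = max(wan_links_on_path(r, two_level_parent) for r in range(NUM_RANKS))

if __name__ == "__main__":
    print("binomial tree, worst path:", binomial_worst, "wide area links")
    print("two-level tree, worst path:", two_level_worst, "wide area link(s)")
```

Under these assumptions the binomial tree has paths that cross two wide area links, while every path in the two-level tree crosses at most one, which is what condition (1) requires.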
To outline the general structure of our algorithms, we distinguish two kinds of algorithms. In the asymmetric algorithms one dedicated process, called the root, either acts as sender (in one-to-many algorithms) or as receiver (in many-to-one algorithms such as a [...] chosen arbitrarily. Figure 1 shows the communication graphs for symmetric and asymmetric operations; in the latter the root process is marked with a circle.

Asymmetric algorithms perform two steps. The nodes first send to their coordinator and then the coordinators send to the root. Symmetric algorithms perform three steps. First, nodes send to their coordinators. Second, the coordinators perform an all-to-all exchange, and third, coordinators send to their nodes. We use this basic structure for implementing wide area optimal algorithms for MPI's reduction operations.

B. Non-Associative Reduction Operators

We now turn to the group of reduction operators. MPI's reduction operations are parameterized by the actual operator that is applied to the data. MPI assumes all operators (such as sum, product, minimum, or maximum) to be associative; in addition, the programmer can mark them as commutative. Concerning execution order, the MPI standard states that "any implementation can take advantage of associativity, or associativity and commutativity in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition" [24]. Accordingly, application programmers must not rely on a specific execution order. For yielding reproducible results, the standard adds: "It is strongly recommended that [...]

[...] remainder of this paper, the wide area latency is set to 10 ms and the bandwidth is set to 1.0 MByte/s. (All latencies in this paper are one-way.) On most metacomputers, wide area latency will be significantly higher, and, since MagPIe's algorithms have been optimized for long latency, the advantage of MagPIe over MPICH will be even higher.

A. Basic Collective Operations

MagPIe implements all 14 collective communication operations defined by version 1.1 of the MPI standard. The operations for synchronization and data exchange have been presented in [20]. Some of them are used as building blocks for the reduction operations. We briefly discuss how they are implemented by MPICH and by MagPIe. We compare against version 1.1 of MPICH.

A.1 Broadcast

In [...], MPI also has personalized broadcast operations: [...]. MPICH implements the simplest possible scatter algorithm in which the root process linearly and directly sends the pieces of data to the respective nodes. This algorithm is wide area optimal, and MagPIe does not improve on it. The gather operation and the personalized all-to-all exchange are implemented similarly.

With [...]. In MagPIe's algorithm, the coordinators first gather data locally into a subvector of their local cluster. They then gather the complete data vector by exchanging their partial vectors with each [...]

[Fig. 3. Plot of completion time (ms) per number of clusters; caption truncated]

[...] MagPIe's algorithm that exploits user-asserted associativity first reduces the cluster-local data on the coordinator nodes. In a final wide area step, these partial results are sent to the root process which in turn combines them to compute the overall result. This algorithm is wide area optimal; it adheres to the single-latency condition, and also sends the minimal amount of data between clusters. The comparison of this algorithm with MPICH's tree algorithm is presented in Figure 3 for the case of 64 KB data vectors. Results are shown for 2, 4, and 8 clusters, with a total number of 16, 24, 32, and 40 processors, equally distributed over the clusters. The run times for MagPIe are shown in black while MPICH's times are shown in grey.
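The effect of reducing inside each cluster first can be reproduced without MPI. The following Python sketch is an illustration only: the data values and the two-cluster layout are assumptions chosen to expose rounding, and the function names are hypothetical. It compares a strict rank-order reduction with a clustered reduction in which only one partial result per cluster would cross the wide area network.

```python
from functools import reduce
import operator


def rank_order_reduce(values, op):
    """Apply `op` strictly left to right, in rank order."""
    return reduce(op, values)


def clustered_reduce(clusters, op):
    """Reduce inside each cluster first, then combine the per-cluster partials.

    Only the partial results (one per cluster) would cross the wide area links.
    Rank order is preserved, so this re-associates but does not commute.
    """
    partials = [reduce(op, cluster) for cluster in clusters]
    return reduce(op, partials)


# Assumed example data: two clusters with two processes each.
clusters = [[1e16, 0.5], [-1e16, 0.5]]
flat = [value for cluster in clusters for value in cluster]

sequential = rank_order_reduce(flat, operator.add)       # ((a + b) + c) + d
hierarchical = clustered_reduce(clusters, operator.add)  # (a + b) + (c + d)

if __name__ == "__main__":
    print("rank order:", sequential)
    print("clustered: ", hierarchical)
```

With IEEE double precision the two evaluation orders give 0.5 and 0.0 respectively, whereas for integer addition both orders agree. This is why the clustered algorithm is only legal when the user asserts associativity.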
For non-associative operators, MagPIe implements two more algorithms: one for short messages and one for long messages. The first algorithm gathers all data at the root, which then applies the operation in the prescribed order, irrespective of the network topology. This satisfies only the single-latency condition (by re-using [...]

[...], but delivers the reduction result to all processes. MPICH provides a so-called naive implementation by first reducing to the process with rank zero and subsequently broadcasting the result. This implementation always yields correct results, but is not wide area optimal. Sequential compositions of two collective operations need at least two times the wide area latency. Furthermore, both basic operations (as implemented by MPICH) are not wide area optimal themselves.

Again MagPIe provides three algorithms. For associative operators, the processes in each cluster first reduce to [...]

[Figure residue: plot of completion time (ms) per number of clusters]

[...] called [...] which delivers all data items at all processes with a single wide area latency. Then, all processes compute the reduction result locally. The completion time compared to MPICH is shown in Figure 4 (top) for 1-byte messages.
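The two-latency penalty of a sequential composition can be made concrete with a deliberately simple cost model. The Python sketch below is an assumption-laden illustration, not a reproduction of the measurements: it models one wide area transfer as latency plus size divided by bandwidth, interprets 1 MByte/s as 10^6 bytes per second, and ignores local communication and computation entirely.

```python
WAN_LATENCY_S = 0.010        # 10 ms one-way, the setting used in this paper
WAN_BANDWIDTH_BPS = 1.0e6    # 1 MByte/s, taken here as 10**6 bytes/s (assumption)


def wan_transfer_time(message_bytes):
    """Time for one wide area transfer under the simple latency + size/bandwidth model."""
    return WAN_LATENCY_S + message_bytes / WAN_BANDWIDTH_BPS


def composed_lower_bound(message_bytes):
    """Reduce followed by broadcast: two wide area transfers happen in sequence."""
    return 2 * wan_transfer_time(message_bytes)


def single_latency_lower_bound(message_bytes):
    """An algorithm satisfying condition (1): one wide area transfer on any path."""
    return wan_transfer_time(message_bytes)


if __name__ == "__main__":
    for size in (1, 64 * 1024):
        print(size, "bytes:",
              round(composed_lower_bound(size) * 1000, 3), "ms vs",
              round(single_latency_lower_bound(size) * 1000, 3), "ms")
```

These are lower bounds under the stated model, so they indicate only why a composition cannot beat twice the wide area latency; they say nothing about the measured factors reported in the figures.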
For long messages, MPICH's approach is used. Because MagPIe implements a wide area optimal broadcast, it is still faster than MPICH.

B.3 [...] is similar to [...] the resulting data vector is implicitly scattered among the processes such that each process gets a different part of the reduced data vector. MPICH implements a sequential composition, a reduce followed by a scatter operation. Analogous to [...]. Using a communicator object with re-ordered process ranks, the algorithm is only applicable to commutative operators. For non-associative operators, an [...]

The operation performed by [...]. It is based on [...]

[Fig. 6. Plot of completion time (ms) per number of clusters; caption truncated]

[...] with an average message size of 32 KByte for the broadcast of Householder vectors, and 8192 calls to [...]

TABLE I. COMPARISON OF ALGORITHMS
MPICH Operation Optim. Shape Implementation
near flat Gather no bin yes flat near flat Allgather no bin ; bin Reduce ; Bcast yes flat near flat Allgather no bin ; flat Reduce ; ScatterV yes flat near flat Allgather no bin yes flat

MMUL20002000 Reduce Bcast/Scan
sequential run time (s) 335843390
MagPIe MPICH MagPIe MPICH
parallel run time (s) 9630018.712.130.520.1
wide area msgs 3360855971551903.245357.522.4917.20
wide area latency (total) 4000200001615.5

[Fig. 8. Application Runtimes for [...]; vertical axis: seconds; 32 cpus]

[...]more allows all matrices to be fully distributed which significantly lowers the memory requirements of the parallel version.

For the runtimes shown in Figure 8, we enabled associative optimization. When the matrix elements are known to yield neither overflow nor underflow situations, this optimization is legal. If not, then even a result computed without this optimization could not be trusted, because overflows or underflows may still occur, although always the same ones. With the matrices being of size 2000 × 2000, [...] is called 4000 times with a message size of 16000 bytes. MagPIe outperforms MPICH by up to a factor of three. Due to large memory requirements and the related cache effects (the algorithm uses
straightforward, unblocked loops), MMUL achieves superlinear speedups. Relative to a single processor, the speedup on a single cluster of 64 processors is 100, both for MagPIe and MPICH. When the 64 processors are divided over 8 wide area clusters, MagPIe achieves a speedup of 64.8, MPICH only 18.4. Again, Table II shows that MagPIe sends fewer messages and less data across wide area links. It also chains fewer latencies.

The TRI kernel repeatedly invokes a solver for tridiagonal equation systems. Following [21], the solver treats the equations as a system of recurrences and use[...] can be used. Some additional numerical instability had to be introduced to get the right functionality without additional communication. In our measurements, the tridiagonal matrix had size 1000000. The solver calls [...] 1000 times with a message size of 48 bytes. MagPIe outperforms MPICH [...]

[Figure: seconds, 32 cpus]

Fig. 9. Application Runtimes for [...]

TABLE III. WIDE AREA SYSTEM RUN TIMES
40 processors MagPIe MPICH 325490246112313810

For reduction operations with short data vectors, MagPIe uses algorithms that are nearly wide area optimal. For non-associative operators, when execution order matters for correctness, MagPIe's algorithms are no longer wide area optimal, although by using better broadcast and scan algorithms, performance is still improved over MPICH for most reduction operations.

Although the MPI standard recognizes the possibility of exploiting associativity for optimization, it does not specify a standard way to implement this feature.

For a wide area latency of 10 ms and a bandwidth of 1 MByte/s, the measurements of the individual operations show improvements over MPICH that vary between a factor of 3 and 8, depending on the operation, the number of clusters, and the message length. For the application kernels QR, MMUL, and TRI, MagPIe consistently outperformed MPICH by a factor of 2 or more.

The current version of MagPIe assumes a static topology. In future work the best tree shape can be computed dynamically
based on wide area latency and bandwidth as measured at run time, as well as message size.

Writing correct and efficient parallel programs is hard. Having to take non-uniformity of the interconnect into account makes it even harder. MPI's collective operations provide a convenient abstraction that can be implemented efficiently for a non-uniform interconnect. For problems that heavily use the collective operations, the MagPIe library offers transparent optimization, and completely hides non-uniformity. The system is available as a plug-in for MPICH from [...]

ACKNOWLEDGEMENTS

This work is supported in part by a SION grant from the Dutch research council NWO, and by a USF grant from the Vrije Universiteit. We thank Kees Verstoep and John Romein for keeping the DAS in good shape, and Cees de Laat (University of Utrecht) for getting the wide area links of the DAS up and running.

REFERENCES

[1] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model - One step closer towards a realistic model for parallel computation. In Proc. Symposium on Parallel Algorithms and Architectures (SPAA), pages 95–105, Santa Barbara, CA, July 1995.
[2] Argonne National Laboratory. MPICH implementation home page. /Projects/mpi/mpich/, 1995.
[3] S. Bae, D. Kim, and S. Ranka. Vector Reduction and Prefix Computation on Coarse-Grained, Distributed-Memory Parallel Machines. In IPPS-98, International Parallel Processing Symposium, 1998.
[4] H. Bal, R. Bhoedjang, R. Hofman, C. Jacobs, K. Langendoen, T. Rühl, and F. Kaashoek. Performance Evaluation of the Orca Shared Object System. ACM Transactions on Computer Systems, 16(1), 1998.
[5] H. Bal, A. Plaat, M. Bakker, P. Dozy, and R. Hofman. Optimizing Parallel Applications for Wide-Area Clusters. In IPPS-98 International Parallel Processing Symposium, pages 784–790, Apr. 1998.
[6] M. Banikazemi, V. Moorthy, and D. Panda. Efficient Collective Communication on Heterogeneous Networks of Workstations. In International Conference on Parallel Processing, pages 460–467,
Minneapolis, MN, 1998.
[7] M. Bernaschi and G. Iannello. Collective Communication operations: experimental results vs. theory. Concurrency: Practice and Experience, 10(5):359–386, April 1998.
[8] M. Bernaschi, G. Iannello, and M. Lauria. Efficient Implementation of Reduce-Scatter in MPI. Submitted for publication, 1998. Available from ~iannello/dapaa.htm.
[9] R. Bhoedjang, T. Rühl, and H. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53–60, 1998.
[10] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: A Gigabit-per-second Local Area Network. IEEE Micro, 15(1):29–36, 1995.
[11] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. M.I.T. Press, 1990.
[12] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 1–12, San Diego, CA, May 1993.
[13] I. Foster, J. Geisler, W. Gropp, N. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke. Wide-Area Implementation of the Message Passing Interface. Parallel Computing, 24(11), 1998.
[14] I. Foster and N. Karonis. A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems. In SC'98, Orlando, FL, Nov. 1998.
[15] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. Journal of Supercomputer Applications, 11(2):115–128, Summer 1997.
[16] I. Foster and C. Kesselman, editors. The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.
[17] A. Grimshaw and W. A. Wulf. The Legion Vision of a Worldwide Virtual Computer. Comm. ACM, 40(1):39–45, Jan. 1997.
[18] G. Iannello, M. Lauria, and S. Mercolino. Cross-Platform Analysis of Fast Messages for Myrinet. In Proc. Workshop CANPC'98, number 1362 in Lecture Notes in Computer Science, pages 217–231, Las Vegas, Nevada, January 1998. Springer.
[19] R. M. Karp, A. Sahay, E. E. Santos, and K. E. Schauser. Optimal Broadcast and Summation in the LogP model. In Proc. Symposium on Parallel Algorithms and Architectures (SPAA), pages 142–153, Velen, Germany, June 1993.
[20] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems. Submitted for publication, 1998. Available from [...]
[21] F. T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, 1992.
[22] B. Lowekamp and A. Beguelin. ECO: Efficient Collective Operations for Communication on Heterogeneous Networks. In International Parallel Processing Symposium, pages 399–405, Honolulu, HI, 1996.
[23] Message Passing Interface Forum. MPI: A Message Passing Interface Standard. International Journal of Supercomputing Applications, 8(3/4), 1994.
[24] Message Passing Interface Forum. MPI Standard document, Version 1.1. Available from /Projects/mpi/standard.html, 1995.
[25] J.-Y. L. Park, H.-A. Choi, N. Nupairoj, and L. M. Ni. Construction of Optimal Multicast Trees Based on the Parameterized Communication Model. In Proc. Int. Conference on Parallel Processing (ICPP), volume I, pages 180–187, 1996.
[26] A. Plaat, H. Bal, and R. Hofman. Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects. In High Performance Computer Architecture HPCA-5, Orlando, FL, Jan. 1999.
[27] R. Wolski. Dynamically Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In 6th High-Performance Distributed Computing, Aug. 1997. The network weather service is at /.
