MPI's reduction operations in clustered wide area systems
MPI Application Examples

Introduction

Message Passing Interface (MPI) is a widely used message-passing standard and programming model for parallel computing. It lets multiple processes cooperate on a single problem by sending and receiving messages. This article looks at several real-world applications of MPI and at how it is used to address different computational challenges.

1. Weather and Climate Modeling

Weather prediction models need large amounts of data and processing power to simulate atmospheric conditions accurately. MPI lets scientists split the computational workload across multiple nodes or processors and exchange data as needed. Each processor works on a designated section of the simulation and communicates with the others by sending and receiving messages that carry quantities such as temperature, pressure, and wind speed. With MPI, weather models run efficiently on high-performance computing clusters and give meteorologists and disaster management agencies more accurate predictions.

2. Computational Fluid Dynamics (CFD)

CFD simulates fluid flow and heat transfer in industries such as aerospace, automotive, and energy. MPI supports CFD by allowing large grid systems to be processed in parallel. Grid cells or elements are divided among the processors, and each processor calculates the flow properties for its allocated cells. MPI carries the exchange of information between neighboring processors that updates boundary conditions and keeps the solution consistent across the entire domain. MPI's collective communication operations, such as scatter, gather, and reduce, shorten simulation time and let engineers and designers iterate on their designs faster.

3. Molecular Dynamics Simulations

Molecular dynamics simulations are widely used in chemistry, biochemistry, and materials science to study the structure and behavior of atoms and molecules. MPI distributes the calculation among processors, each responsible for a subset of the atoms in the system. The processors exchange forces and update particle positions to simulate the movement of atoms over time. With MPI features such as non-blocking communication and parallel I/O, scientists can simulate large systems and investigate chemical reactions, protein folding, and material properties.

4. Data Analytics and Machine Learning

As data volumes grow, so does the need for efficient data analysis and machine learning algorithms. MPI can be applied to distributed data analytics, where multiple processors work on different subsets of the data simultaneously. Each processor performs its calculations and uses MPI to communicate partial results or exchange intermediate data when needed. With MPI's collective operations, such as broadcast, reduce, and all-to-all communication, distributed machine learning algorithms can train models on massive datasets. MPI scales these computations to large clusters.

Conclusion

From weather modeling and computational fluid dynamics to molecular dynamics simulations and data analytics, MPI is applied across a wide range of scientific and engineering domains. As parallel computing becomes a necessity for complex problems, the efficient communication that MPI provides becomes essential. With continued advances in high-performance computing and MPI libraries, researchers and developers can apply parallelism to ever larger computational problems.
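The CFD and weather examples above both rely on dividing grid cells among processes. A minimal sketch of one common way to do that, a block decomposition, is given here in plain C. The name block_range and the cell_range type are hypothetical helpers introduced for illustration; they are not MPI routines, and a real code would obtain nprocs and rank from MPI_Comm_size() and MPI_Comm_rank().

```c
#include <assert.h>

/* Half-open range [lo, hi) of grid cells owned by one process. */
typedef struct {
    long lo;
    long hi;
} cell_range;

/*
 * Split ncells cells as evenly as possible over nprocs processes.
 * The first (ncells % nprocs) ranks each take one extra cell.
 * block_range is a hypothetical helper, not an MPI routine.
 */
cell_range block_range(long ncells, int nprocs, int rank)
{
    assert(ncells >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);

    long base  = ncells / nprocs;   /* cells that every rank receives */
    long extra = ncells % nprocs;   /* leftover cells to spread out */

    cell_range r;
    r.lo = rank * base + (rank < extra ? rank : extra);
    r.hi = r.lo + base + (rank < extra ? 1 : 0);
    return r;
}
```

Each process would then compute only the cells in its own range and exchange boundary values with the ranks that own the adjacent ranges.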
lam_MPI
Ohio Supercomputer Center, The Ohio State University

LAM is a parallel processing environment and development system for a network of independent computers. It features the Message-Passing Interface (MPI) programming standard, supported by extensive monitoring and debugging tools.

LAM / MPI Key Features:
• full implementation of the MPI standard
• extensive monitoring and debugging tools, runtime and post-mortem
• heterogeneous computer networks
• add and delete nodes
• node fault detection and recovery
• MPI extensions and LAM programming supplements
• direct communication between application processes
• robust MPI resource management
• MPI-2 dynamic processes
• multi-protocol communication (shared memory and network)

MPI Primer / Developing With LAM

How to Use This Document

This document is organized into four major chapters. It begins with a tutorial covering the simpler techniques of programming and operation. New users should start with the tutorial. The second chapter is an MPI programming primer emphasizing the commonly used routines. Non-standard extensions to MPI and additional programming capabilities unique to LAM are separated into a third chapter. The last chapter is an operational reference. It describes how to configure and start a LAM multicomputer, and how to monitor processes and messages.

This document is user oriented. It does not give much insight into how the system is implemented. It does not detail every option and capability of every command and routine. An extensive set of manual pages covers all the commands and internal routines in great detail and is meant to supplement this document.

The reader will note a heavy bias towards the C programming language, especially in the code samples. There is no Fortran version of this document. The text attempts to be language insensitive and the appendices contain Fortran code samples and routine prototypes.

We have kept the font and syntax conventions to a minimum.

    code        This font is used for things you type on the keyboard or see printed on the screen. We use it in code sections and tables but not in the main text.
    <symbol>    This is a symbol used to abstract something you would type. We use this convention in commands.
    Section     Italics are used to cross reference another section in the document or another document. Italics are also used to distinguish LAM commands.

Table of Contents

How to Use This Document
LAM Architecture
Debugging
MPI Implementation
How to Get LAM

LAM / MPI Tutorial Introduction
    Programming Tutorial
    The World of MPI
    Enter and Exit MPI
    Who Am I; Who Are They?
    Sending Messages
    Receiving Messages
    Master / Slave Example
    Operation Tutorial
    Compilation
    Starting LAM
    Executing Programs
    Monitoring
    Terminating the Session

MPI Programming Primer
    Basic Concepts
    Initialization
    Basic Parallel Information
    Blocking Point-to-Point
    Send Modes
    Standard Send
    Receive
    Status Object
    Message Lengths
    Probe
    Nonblocking Point-to-Point
    Request Completion
    Probe
    Message Datatypes
    Derived Datatypes
    Strided Vector Datatype
    Structure Datatype
    Packed Datatype
    Collective Message-Passing
    Broadcast
    Scatter
    Gather
    Reduce
    Creating Communicators
    Inter-communicators
    Fault Tolerance
    Process Topologies
    Process Creation
    Portable Resource Specification
    Miscellaneous MPI Features
    Error Handling
    Attribute Caching
    Timing

LAM / MPI Extensions
    Remote File Access
    Portability and Standard I/O
    Collective I/O
    Cubix Example
    Signal Handling
    Signal Delivery
    Debugging and Tracing

LAM Command Reference
    Getting Started
    Setting Up the UNIX Environment
    Node Mnemonics
    Process Identification
    On-line Help
    Compiling MPI Programs
    Starting LAM
    recon
    lamboot
    Fault Tolerance
    tping
    wipe
    Executing MPI Programs
    mpirun
    Application Schema
    Locating Executable Files
    Direct Communication
    Guaranteed Envelope Resources
    Trace Collection
    lamclean
    Process Monitoring and Control
    mpitask
    GPS Identification
    Communicator Monitoring
    Datatype Monitoring
    doom
    Message Monitoring and Control
    mpimsg
    Message Contents
    bfctl
    Collecting Trace Data
    lamtrace
    Adding and Deleting LAM Nodes
    lamgrow
    lamshrink
    File Monitoring and Control
    fstate
    fctl
    Writing a LAM Boot Schema
    Host File Syntax
    Low Level LAM Start-up
    Process Schema
    hboot

Appendix A: Fortran Bindings
Appendix B: Fortran Example Program

LAM Architecture

LAM runs on each computer as a single daemon (server) uniquely structured as a nano-kernel and hand-threaded virtual processes. The nano-kernel component provides a simple message-passing, rendezvous service to local processes. Some of the in-daemon processes form a network communication subsystem, which transfers messages to and from other LAM daemons on other machines. The network subsystem adds features such as packetization and buffering to the base synchronization. Other in-daemon processes are servers for remote capabilities, such as program execution and parallel file access. The layering is quite distinct: the nano-kernel has no connection with the network subsystem, which has no connection with the servers. Users can configure in or out services as necessary.

The unique software engineering of LAM is transparent to users and system administrators, who only see a conventional daemon. System developers can de-cluster the daemon into a daemon containing only the nano-kernel and several full client processes. This developers' mode is still transparent to users but exposes LAM's highly modular components to simplified individual debugging. It also reveals LAM's evolution from Trollius, which ran natively on scalable multicomputers and joined them to a host network through a uniform programming interface.

The network layer in LAM is a documented, primitive and abstract layer on which to implement a more powerful communication standard like MPI (PVM has also been implemented).

[Figure 1: LAM's Layered Design. Layers shown: local msgs, client mgmt; network msgs; MPI, client / server; cmds, apps, GUIs]

Debugging

A most important feature of LAM is hands-on control of the multicomputer. There is very little that cannot be seen or changed at runtime. Programs residing anywhere can be executed anywhere, stopped, resumed, killed, and watched the whole time. Messages can be viewed anywhere on the multicomputer and buffer constraints tuned as experience with the application dictates. If the synchronization of a process and a message can be easily displayed, mismatches resulting in bugs can easily be found.
These and other services are available both as a programming library and as utility programs run from any shell.

MPI Implementation

MPI synchronization boils down to four variables: context, tag, source rank, and destination rank. These are mapped to LAM's abstract synchronization at the network layer. MPI debugging tools interpret the LAM information with the knowledge of the LAM / MPI mapping and present detailed information to MPI programmers.

A significant portion of the MPI specification can be and is implemented within the runtime system and independent of the underlying environment.

As with all MPI implementations, LAM must synchronize the launch of MPI applications so that all processes locate each other before user code is entered. The mpirun command achieves this after finding and loading the program(s) which constitute the application. A simple SPMD application can be specified on the mpirun command line while a more complex configuration is described in a separate file, called an application schema.

MPI programs developed on LAM can be moved without source code changes to any other platform that supports MPI.

How to Get LAM

LAM installs anywhere and uses the shell's search path at all times to find LAM and application executables. A multicomputer is specified as a simple list of machine names in a file, which LAM uses to verify access, start the environment, and remove it. LAM is freely available under a GNU license via anonymous ftp.

LAM / MPI Tutorial Introduction

The example programs in this section illustrate common operations in MPI. You will also see how to run and debug a program with LAM. For basic applications, MPI is as easy to use as any other message-passing library.

Programming Tutorial

The first program is designed to run with exactly two processes. One process sends a message to the other and then both terminate. Enter the following code in trivial.c or obtain the source from the LAM source distribution (examples/trivial/trivial.c).

    /*
     * Transmit a message in a two process system.
     */
    #include <mpi.h>

    #define BUFSIZE 64

    int buf[BUFSIZE];

    int
    main(int argc, char *argv[])
    {
        int size, rank;
        MPI_Status status;

        /*
         * Initialize MPI.
         */
        MPI_Init(&argc, &argv);

        /*
         * Error check the number of processes.
         * Determine my rank in the world group.
         * The sender will be rank 0 and the receiver, rank 1.
         */
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (2 != size) {
            MPI_Finalize();
            return(1);
        }
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            /*
             * As rank 0, send a message to rank 1.
             * The count is in elements (BUFSIZE), not bytes.
             */
            MPI_Send(buf, BUFSIZE, MPI_INT, 1, 11, MPI_COMM_WORLD);
        } else {
            /*
             * As rank 1, receive a message from rank 0.
             */
            MPI_Recv(buf, BUFSIZE, MPI_INT, 0, 11, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return(0);
    }

Note that the program uses standard C program structure, statements, variable declarations and types, and functions.

The World of MPI

Processes are represented by a unique "rank" (integer) and ranks are numbered 0, 1, 2, ..., N-1. MPI_COMM_WORLD means "all the processes in the MPI application." It is called a communicator and it provides all information necessary to do message-passing. Portable libraries do more with communicators to provide synchronization protection that most other message-passing systems cannot handle.

Enter and Exit MPI

As with other systems, two routines are provided to initialize and clean up an MPI process:

    MPI_Init(int *argc, char ***argv);
    MPI_Finalize(void);

Who Am I; Who Are They?

Typically, a process in a parallel application needs to know who it is (its rank) and how many other processes exist. A process finds out its own rank by calling MPI_Comm_rank().

    MPI_Comm_rank(MPI_Comm comm, int *rank);

The total number of processes is returned by MPI_Comm_size().

    MPI_Comm_size(MPI_Comm comm, int *size);

Sending Messages

A message is an array of elements of a given datatype. MPI supports all the basic datatypes and allows a more elaborate application to construct new datatypes at runtime.

A message is sent to a specific process and is marked by a tag (integer) specified by the user. Tags are used to distinguish between different message types a process might send/receive. In the example program above, the additional synchronization offered by the tag is unnecessary. Therefore, any arbitrary value that matches on both sides can be used.

    MPI_Send(void *buf, int count, MPI_Datatype dtype,
             int dest, int tag, MPI_Comm comm);

Receiving Messages

A receiving process specifies the tag and the rank of the sending process. MPI_ANY_TAG and MPI_ANY_SOURCE may be used to receive a message of any tag and from any sending process.

    MPI_Recv(void *buf, int count, MPI_Datatype dtype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status);

Information about the received message is returned in a status variable.
If wildcards are used, the received message tag is status.MPI_TAG and the rank of the sending process is status.MPI_SOURCE.

Another routine, not used in the example program, returns the number of datatype elements received. It is used when the number of elements received might be smaller than the number specified to MPI_Recv(). It is an error to send more elements than the receiving process will accept.

    MPI_Get_count(MPI_Status *status, MPI_Datatype dtype,
                  int *nelements);

Master / Slave Example

The following example program is a communication skeleton for a dynamically load balanced master/slave application. The source can be obtained from the LAM source distribution (examples/trivial/ezstart.c). The program is designed to work with a minimum of two processes: one master and one slave.

    #include <mpi.h>
    #include <unistd.h>             /* sleep() */

    #define WORKTAG         1
    #define DIETAG          2
    #define NUM_WORK_REQS   200

    static void master(void);
    static void slave(void);

    /*
     * main
     * This program is really MIMD, but is written SPMD for
     * simplicity in launching the application.
     */
    int
    main(int argc, char *argv[])
    {
        int myrank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD,   /* group of everybody */
                      &myrank);         /* 0 thru N-1 */

        if (myrank == 0) {
            master();
        } else {
            slave();
        }

        MPI_Finalize();
        return(0);
    }

    /*
     * master
     * The master process sends work requests to the slaves
     * and collects results.
     */
    static void
    master(void)
    {
        int ntasks, rank, work;
        double result;
        MPI_Status status;

        MPI_Comm_size(MPI_COMM_WORLD,
                      &ntasks);         /* #processes in app */

        /*
         * Seed the slaves.
         */
        work = NUM_WORK_REQS;           /* simulated work */

        for (rank = 1; rank < ntasks; ++rank) {
            MPI_Send(&work,             /* message buffer */
                     1,                 /* one data item */
                     MPI_INT,           /* of this type */
                     rank,              /* to this rank */
                     WORKTAG,           /* a work message */
                     MPI_COMM_WORLD);   /* always use this */
            work--;
        }

        /*
         * Receive a result from any slave and dispatch a new work
         * request until work requests have been exhausted.
         */
        while (work > 0) {
            MPI_Recv(&result,           /* message buffer */
                     1,                 /* one data item */
                     MPI_DOUBLE,        /* of this type */
                     MPI_ANY_SOURCE,    /* from anybody */
                     MPI_ANY_TAG,       /* any message */
                     MPI_COMM_WORLD,    /* communicator */
                     &status);          /* recv'd msg info */

            MPI_Send(&work, 1, MPI_INT, status.MPI_SOURCE,
                     WORKTAG, MPI_COMM_WORLD);
            work--;                     /* simulated work */
        }

        /*
         * Receive results for outstanding work requests.
         */
        for (rank = 1; rank < ntasks; ++rank) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }

        /*
         * Tell all the slaves to exit.
         */
        for (rank = 1; rank < ntasks; ++rank) {
            MPI_Send(0, 0, MPI_INT, rank, DIETAG, MPI_COMM_WORLD);
        }
    }

    /*
     * slave
     * Each slave process accepts work requests and returns
     * results until a special termination request is received.
     */
    static void
    slave(void)
    {
        double result;
        int work;
        MPI_Status status;

        for (;;) {
            MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);

            /*
             * Check the tag of the received message.
             */
            if (status.MPI_TAG == DIETAG) {
                return;
            }

            sleep(2);
            result = 6.0;               /* simulated result */

            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

The workings of ranks, tags and message lengths should be mastered before constructing serious MPI applications.

Operation Tutorial

Before running LAM you must establish certain environment variables and search paths for your shell. Add the following commands or equivalent to your shell start-up file (.cshrc, assuming C shell). Do not add these to your .login as they would not be effective on remote machines when rsh is used to start LAM.

    setenv LAMHOME <LAM installation directory>
    set path = ($path $LAMHOME/bin)

The local system administrator, or the person who installed LAM, will know the location of the LAM installation directory.
After editing the shell start-up file, invoke it to establish the new values. This is not necessary on subsequent logins to the UNIX system.

    % source .cshrc

Many LAM commands require one or more nodeids. Nodeids are specified on the command line as n<list>, where <list> is a list of comma separated nodeids or nodeid ranges.

    n1
    n1,3,5-10

The mnemonic 'h' refers to the local node where the command is typed (as in 'here').

Compilation

Any native C compiler is used to translate LAM programs for execution. All LAM runtime routines are found in a few libraries. LAM provides a wrapping command called hcc which invokes cc with the proper header and library directories, and is used exactly like the native cc.

    % hcc -o trivial trivial.c -lmpi

The major, internal LAM libraries are automatically linked. The MPI library is explicitly linked. Since LAM supports heterogeneous computing, it is up to the user to compile the source code for each of the various CPUs on their respective machines. After correcting any errors reported by the compiler, proceed to starting the LAM session.

Starting LAM

Before starting LAM, the user specifies the machines that will form the multicomputer. Create a host file listing the machine names, one on each line. An example file is given below for the machines "ohio" and "osc". Lines starting with the # character are treated as comment lines.

    # a 2-node LAM
    ohio
    osc

The first machine in the host file will be assigned nodeid 0, the second nodeid 1, etc. Now verify that the multicomputer is ready to run LAM. The recon tool checks if the user has access privileges on each machine in the multicomputer and if LAM is installed and accessible.

    % recon -v <host file>

If recon does not report a problem, proceed to start the LAM session with the lamboot tool.

    % lamboot -v <host file>

The -v (verbose) option causes lamboot to report on the start-up process as it progresses. You should return to your own shell's prompt.
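As an aside on the n<list> nodeid notation introduced earlier in this section, the following plain C sketch shows which nodeids a specification such as n1,3,5-10 denotes. The function expand_nodeids is a hypothetical helper written for this illustration; it is not LAM's own parser, and it deliberately ignores the 'h' mnemonic.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Expand a numeric nodeid list such as "n1,3,5-10" into out[].
 * Returns the number of nodeids written, or -1 on a syntax error
 * or if more than max ids would be produced.
 * Hypothetical helper: this is not LAM's parser.
 */
int expand_nodeids(const char *spec, int *out, int max)
{
    assert(spec != NULL && out != NULL && max >= 0);

    if (*spec != 'n')
        return -1;                  /* every list starts with 'n' */
    const char *p = spec + 1;
    int n = 0;

    for (;;) {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p || first < 0)
            return -1;              /* no valid id where one was expected */
        long last = first;
        p = end;
        if (*p == '-') {            /* a range "a-b" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }
        for (long id = first; id <= last; id++) {
            if (n >= max)
                return -1;          /* caller's array is too small */
            out[n++] = (int)id;
        }
        if (*p == '\0')
            return n;               /* end of list */
        if (*p != ',')
            return -1;
        p++;                        /* skip the comma, parse the next item */
    }
}
```

With the host file above, n0 would refer to "ohio" and n1 to "osc", since nodeids are assigned in host file order.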
LAM presents no special shell or interface environment.

Even if all seems well after start-up, verify communication with each node. tping is a simple confidence building command for this purpose.

    % tping n0

Repeat this command for all nodes or ping all the nodes at once with the broadcast mnemonic, N. tping responds by sending a message between the local node (where the user invoked tping) and the specified node. Successful execution of tping proves that the target node, nodes along the route from the local node to the target node, and the communication links between them are working properly. If tping fails, press Control-Z, terminate the session with the wipe tool and then restart the system. See Terminating the Session.

Executing Programs

To execute a program, use the mpirun command. The first example program is designed to run with two processes. The -c <#> option runs copies of the given program on nodes selected in a round-robin manner.

    % mpirun -v -c 2 trivial

The example invocation above assumes that the program is locatable on the machine on which it will run. mpirun can also transfer the program to the target node before running it. Assuming your multicomputer for this tutorial is homogeneous, you can use the -s h option to run both processes.

    % mpirun -v -c 2 -s h trivial

If the processes executed correctly, they will terminate and leave no traces. If you want more feedback, try using tprintf() functions within the program.

Monitoring

The first example program runs too quickly to be monitored. Try changing the tag in the call to MPI_Recv() to 12 (from 11). Recompile the program and rerun it as before. Now the receiving process cannot synchronize with the message from the send process because the tags are unequal. Look at the status of all MPI processes with the mpitask command.

    % mpitask
    TASK (G/L)     FUNCTION   PEER|ROOT   TAG   COMM    COUNT   DATATYPE
    0/0 trivial    Finalize
    1/1 trivial    Recv       0/0         12    WORLD   64      INT

You will notice that the receiving process is blocked in a call to MPI_Recv(): a synchronizing message has not been received. From the code we know this is process rank 1 in the MPI application, which is confirmed in the first column, the MPI task identification. The first number is the rank within the world group. The second number is the rank within the communicator being used by MPI_Recv(), in this case (and in many applications with simple communication structure) also the world group. The specified source of the message is likewise identified. The synchronization tag is 12 and the length of the receive buffer is 64 elements of type MPI_INT.

The message was transferred from the sending process to a system buffer en route to process rank 1. MPI_Send() was able to return and the process has called MPI_Finalize(). System buffers, which can be thought of as message queues for each MPI process, can be examined with the mpimsg command.

    % mpimsg
    SRC (G/L)   DEST (G/L)   TAG   COMM    COUNT   DATATYPE   MSG
    0/0         1/1          11    WORLD   64      INT        n1,#0

The message shows that it originated from process rank 0 using MPI_COMM_WORLD and that it is waiting in the message queue of process rank 1, the destination. The tag is 11 and the message contains 64 elements of type MPI_INT. This information corresponds to the arguments given to MPI_Send(). Since the application is faulty and will never complete, we will kill it with the lamclean command.

    % lamclean -v

The LAM session should be in the same state as after invoking lamboot. You can also terminate the session and restart it with lamboot, but this is a much slower operation. You can now correct the program, recompile and rerun.

Terminating the Session

To terminate LAM, use the wipe tool.
The host file argument must be the same as the one given to lamboot.

    % wipe -v <host file>

MPI Programming Primer

Basic Concepts

Through Message Passing Interface (MPI) an application views its parallel environment as a static group of processes. An MPI process is born into the world with zero or more siblings. This initial collection of processes is called the world group. A unique number, called a rank, is assigned to each member process from the sequence 0 through N-1, where N is the total number of processes in the world group. A member can query its own rank and the size of the world group. Processes may all be running the same program (SPMD) or different programs (MIMD). The world group processes may subdivide, creating additional subgroups with a potentially different rank in each group.

A process sends a message to a destination rank in the desired group. A process may or may not specify a source rank when receiving a message. Messages are further filtered by an arbitrary, user specified, synchronization integer called a tag, which the receiver may also ignore.

An important feature of MPI is the ability to guarantee to independent software developers that their choice of tag in a particular library will not conflict with the choice of tag by some other independent developer or by the end user of the library. A further synchronization integer called a context is allocated by MPI and is automatically attached to every message. Thus, the four main synchronization variables in MPI are the source and destination ranks, the tag and the context.

A communicator is an opaque MPI data structure that contains information on one group and that contains one context. A communicator is an argument to all MPI communication routines.

After a process is created and initializes MPI, three predefined communicators are available.

    MPI_COMM_WORLD     the world group
    MPI_COMM_SELF      group with one member, myself
    MPI_COMM_PARENT    an intercommunicator between two groups: my world group and my parent group (See Dynamic Processes.)

Many applications require no other communicators beyond the world communicator. If new subgroups or new contexts are needed, additional communicators must be created.

MPI constants, templates and prototypes are in the MPI header file, mpi.h.

    #include <mpi.h>

Initialization

    MPI_Init           Initialize MPI state.
    MPI_Finalize       Clean up MPI state.
    MPI_Abort          Abnormally terminate.
    MPI_Comm_size      Get group process count.
    MPI_Comm_rank      Get my rank within process group.
    MPI_Initialized    Has MPI been initialized?

The first MPI routine called by a program must be MPI_Init(). The command line arguments are passed to MPI_Init().

    MPI_Init(int *argc, char **argv[]);

A process ceases MPI operations with MPI_Finalize().

    MPI_Finalize(void);

In response to an error condition, a process can terminate itself and all members of a communicator with MPI_Abort(). The implementation may report the error code argument to the user in a manner consistent with the underlying operating system.

    MPI_Abort(MPI_Comm comm, int errcode);

Basic Parallel Information

Two numbers that are very useful to most parallel applications are the total number of parallel processes and self process identification.
This information is learned from the MPI_COMM_WORLD communicator using the routines MPI_Comm_size() and MPI_Comm_rank().

    MPI_Comm_size(MPI_Comm comm, int *size);
    MPI_Comm_rank(MPI_Comm comm, int *rank);

Of course, any communicator may be used, but the world information is usually key to decomposing data across the entire parallel application.

Blocking Point-to-Point

    MPI_Send               Send a message in standard mode.
    MPI_Recv               Receive a message.
    MPI_Get_count          Count the elements received.
    MPI_Probe              Wait for message arrival.
    MPI_Bsend              Send a message in buffered mode.
    MPI_Ssend              Send a message in synchronous mode.
    MPI_Rsend              Send a message in ready mode.
    MPI_Buffer_attach      Attach a buffer for buffered sends.
    MPI_Buffer_detach      Detach the current buffer.
    MPI_Sendrecv           Send in standard mode, then receive.
    MPI_Sendrecv_replace   Send and receive from/to one area.
    MPI_Get_elements       Count the basic elements received.

This section focuses on blocking, point-to-point, message-passing routines. The term "blocking" in MPI means that the routine does not return until the associated data buffer may be reused. A point-to-point message is sent by one process and received by one process.

Send Modes

The issues of flow control and buffering present different choices in designing message-passing primitives. MPI does not impose a single choice but instead offers four transmission modes that cover the synchronization, data transfer and performance needs of most applications. The mode is selected by the sender through four different send routines, all with identical argument lists. There is only one receive routine.
The four send modes are:

    standard       The send completes when the system can buffer the message (it is not obligated to do so) or when the message is received.
    buffered       The send completes when the message is buffered in application supplied space, or when the message is received.
    synchronous    The send completes when the message is received.
    ready          The send must not be started unless a matching receive has been started. The send completes immediately.

Standard Send

Standard mode serves the needs of most applications. A standard mode message is sent with MPI_Send().

    MPI_Send(void *buf, int count, MPI_Datatype dtype,
             int dest, int tag, MPI_Comm comm);

An MPI message is not merely a raw byte array. It is a count of typed elements. The element type may be a simple raw byte or a complex data structure. See Message Datatypes.

The four MPI synchronization variables are indicated by the MPI_Send() parameters. The source rank is the caller's. The destination rank and message tag are explicitly given. The context is a property of the communicator.

Because MPI_Send() is a blocking routine, the buffer can be overwritten as soon as it returns. Although most systems will buffer some number of messages, especially short messages, without any receiver, a programmer cannot rely upon MPI_Send() to buffer even one message. Expect that the routine will not return until there is a matching receiver.

Receive

A message in any mode is received with MPI_Recv().

    MPI_Recv(void *buf, int count, MPI_Datatype dtype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status);

Again the four synchronization variables are indicated, with source and destination swapping places.
The source rank and the tag can be ignored with the special values MPI_ANY_SOURCE and MPI_ANY_TAG. If both these wildcards are used, the next message for the given communicator is received.

Status Object

An argument not present in MPI_Send() is the status object pointer. The status object is filled with useful information when MPI_Recv() returns. If the source and/or tag wildcards were used, the actual received source rank and/or message tag are accessible directly from the status object.

    status.MPI_SOURCE    the sender's rank
    status.MPI_TAG       the tag given by the sender

Message Lengths

It is erroneous for an MPI program to receive a message longer than the specified receive buffer. The message might be truncated or an error condition might be raised or both. It is completely acceptable to receive a message shorter than the specified receive buffer. If a short message may arrive, the application can query the actual length of the message with MPI_Get_count().

    MPI_Get_count(MPI_Status *status, MPI_Datatype dtype,
                  int *count);
Parallel Spatially Resolved Stochastic Cluster Dynamics Simulation of Irradiation Damage in Nuclear Materials
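The reaction rate R for diffusion of defect i between volume elements p and q is:
R = D A …
where N_i^p and N_i^q are the numbers of defects i in volume elements p and q respectively, A_pq is the area of the interface between volume elements p and q, and l_pq is the distance between the center points of volume elements p and q.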
2 Implementation of MISA-SCD1.0
2.1 Overview of MISA-SCD1.0
The computational flow of MISA-SCD1.0 is shown in Fig. 1,
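… implementation approach and key techniques, and applies it to simulating the precipitation of Cu-rich clusters in a reactor pressure vessel model alloy, verifying the correctness of the program and testing its parallel performance. The results show that MISA-SCD1.0 reproduces a Cu precipitation process that agrees with experimental results and with similar simulations, and that it has high parallel efficiency and good scalability.
Keywords: irradiation damage; spatially resolved stochastic cluster dynamics; kinetic Monte Carlo; parallel computing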
Atomic Energy Science and Technology, Vol. 55, No. 7, Jul. 2021
CHEN Dandan1, HE Xinfu2, YANG Wen2, CHU Genshen1, BAI He1, HU Changjun1,*
(1. University of Science and Technology Beijing, Beijing 100083, China; 2. China Institute of Atomic Energy, Beijing 102413, China)
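To enlarge the simulation volume of SRSCD, and to cope with the extra computation that a larger volume brings, the most effective approach is parallel processing. In deterministic methods, the parallel regions advance by the same time …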
A fast MPI-based association rule mining algorithm for cluster systems
… which needs only some logic operations such as 'and', 'or' and 'xor', decreases the implementation difficulty, reduces the network traffic and increases the mining efficiency. It is proved theoretically that the algorithm has …
A cluster system (cluster), also called a computer cluster, refers to "using high-speed comm…
… association rule mining algorithms mainly include the frequent-itemset method based on the Apriori algorithm proposed by Agrawal et al. [1]. To improve the efficiency of association rule mining, researchers have also proposed parallel mining algorithms [2-5], mainly including the CD algorithm proposed by Agrawal et al., the PDM algorithm proposed by Park et al., and the FDM and DMA algorithms proposed by Cheung et al.
… Bohai Univ., Jinzhou 121003, China)
Abstract: A cluster is a kind of distributed storage system that mostly adopts message passing to realize communication between nodes. Using the message-passing interface of MPI, this paper puts forward a fast parallel algorithm for mining association rules based on binary data storage and computing,
Mellanox Quantum QM8700 InfiniBand Switch Product Brief

HDR100
QM8700 together with the Mellanox ConnectX®-6 adapter card supports HDR100. By utilizing two pairs of two lanes per port, the QM8700 can support up to 80 ports of 100G to create the densest TOR switch available in the market. This is a perfect solution for double dense racks with more than 40 servers per rack, and it also helps small-medium deployments that need to scale to a 3-level fat-tree to lower power, latency and space.

MANAGEMENT
The QM8700's x86 ComEx Broadwell CPU comes with an on-board subnet manager, enabling simple, out-of-the-box bring-up for up to 2K nodes in the fabric. Running the MLNX-OS® software package, it delivers full chassis management through CLI, WebUI, SNMP or JSON interfaces.
QM8700 also incorporates Mellanox's Unified Fabric Manager (UFM®) software for managing scale-out InfiniBand computing environments to enable efficient provisioning, health indications and monitoring of the cluster. UFM® ensures that the fabric is up and running at maximum performance at all times.

COMPLIANCE
Safety: CB, cTUVus, CE, CU
EMC (Emissions): CE, FCC, VCCI, ICES, RCM
Operating Conditions:
• Temperature: operating 0ºC to 40ºC; non-operating -40ºC to 70ºC
• Humidity: operating 10% to 85% non-condensing; non-operating 10% to 90% non-condensing
• Altitude: up to 3200m
Acoustic: ISO 7779, ETS 300 753
Others: RoHS compliant; rack-mountable, 1U; 1-year warranty

FEATURES
Mellanox QM8700:
• 19'' rack mountable 1U chassis
• 40 QSFP56 non-blocking ports with aggregate data throughput up to 16Tb/s (HDR)
Switch Specifications:
• Compliant with IBTA 1.21 and 1.3
• 9 virtual lanes: 8 data + 1 management
• 256 to 4Kbyte MTU
• Adaptive Routing
• Congestion control
• Port Mirroring
• VL2VL mapping
• 4X48K entry linear forwarding database
Management Ports:
• 100/1000 RJ45 Ethernet port
• RS232 console port
• USB port
• DHCP
• Industry standard CLI
• Management over IPv6
• Management IP
• SNMP v1, v2, v3
• WebUI
Fabric Management:
• On-board Subnet Manager (SM) supporting fabrics of up to 2K nodes
• Unified Fabric Manager (UFM) agent
Connectors and Cabling:
• QSFP56 connectors
• Passive copper or active fiber cables
• Optical modules
Indicators:
• Per port status LED: Link, Activity
• System LEDs: System, fans, power supplies
• Unit ID LED
Power Supply:
• Dual redundant slots
• Hot plug operation
• Input range: 100-127VAC, 200-240VAC
• Frequency: 50-60Hz, single phase AC, 4.5A, 2.9A
Cooling:
• Front-to-rear or rear-to-front cooling option
• Hot-swappable fan unit
Power Consumption:
• Contact Mellanox Sales

Note: All tall-bracket adapters are shipped with the tall bracket mounted and a short bracket as accessory.
Support: For information about our support packages, please contact your Mellanox Technologies sales representative or visit our Support Index page.

© Copyright 2020. Mellanox, Mellanox logo, Connect-X, MLNX-OS, and UFM are registered trademarks of Mellanox Technologies, Ltd. Mellanox Quantum and Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) are trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners.
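A fast MPI-based association rule mining algorithm for cluster systems
3 Theoretical analysis of the algorithm
(1) Parallel performance analysis. Suppose the database DB_i on each node holds D/n transactions, where n is the number of parallel computing nodes, and suppose the item set I to be mined has size k, so that there are at most 2^k itemsets. If each scan of the database takes time t_s, the serial time is 2^k t_s and the serial time complexity is T_s = O(2^k). The parallel running time of each slave node is 2^k t_s / n and the running time of the master node is t_s, so in the worst case the parallel computation time is t_s + n * 2^k t_s / n. In the sense of order, the parallel cost equals the worst-case serial cost, so the algorithm is cost-optimal.
(2) Speedup analysis. The parallel running time t_p consists of two parts, a computation part t_comp and a communication part t_comm: t_p = t_comp + t_comm.
In the algorithm above, if itemset X is locally frequent on node i, the complexity of its communication volume is O(n), and t_comm = n (t_startup + w t_data), where t_startup is the message latency and t_data is the time to send one unit of data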
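The cost model above can be turned into a few lines of C to see how the communication term limits the speedup. This is only a sketch under stated assumptions: the computation part is taken to be the per-node time 2^k t_s / n from item (1), the speedup is taken to be the usual ratio of serial to parallel time (the source is cut off before it states its own formula), and the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Modeled serial time: at most 2^k itemsets, one database scan of cost t_scan each. */
double serial_time(unsigned k, double t_scan)
{
    assert(k < 63);                      /* keep the shift below well defined */
    return (double)(1ULL << k) * t_scan;
}

/* Modeled parallel time t_p = t_comp + t_comm, where
 *   t_comp = 2^k * t_scan / n           (assumption: work split evenly over n nodes)
 *   t_comm = n * (t_startup + w * t_data)                                        */
double parallel_time(unsigned k, unsigned n, double t_scan,
                     double t_startup, double w, double t_data)
{
    assert(n > 0);
    double t_comp = serial_time(k, t_scan) / n;
    double t_comm = n * (t_startup + w * t_data);
    return t_comp + t_comm;
}

/* Speedup as serial time over parallel time (standard definition, assumed here). */
double speedup(unsigned k, unsigned n, double t_scan,
               double t_startup, double w, double t_data)
{
    return serial_time(k, t_scan) /
           parallel_time(k, n, t_scan, t_startup, w, t_data);
}
```

With k = 10, n = 4, t_scan = 1, t_startup = 0.5, w = 2 and t_data = 0.25, the model gives a parallel time of 256 + 4 = 260 against a serial time of 1024, so the speedup stays a little below the node count of 4. Because t_comm grows with n while t_comp shrinks, adding nodes eventually stops helping in this model.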
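Property 1 [6]. For X_p, X_q ∈ I, a necessary and sufficient condition for X_p and X_q to have exactly k-2 elements in common is length(B_Xp or B_Xq) = k.
From the Apriori algorithm and Property 1 it follows that if X_p, X_q ∈ LL_{k-1}(j) and length(B_Xp or B_Xq) = k, then they can be combined into the local candidate k-itemset X_p ∪ X_q on node j. In what follows, the m-bit binary number p represents itemset X_p and the m-bit binary number q represents itemset X_q.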
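Property 1 is easy to check on concrete bit patterns. The sketch below assumes that "length" means the number of 1 bits in the binary code, that bit j of the code is set when item j is in the itemset, and that m is at most 64 so a code fits in a uint64_t; the function names are hypothetical and are not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 1 bits in x, i.e. the "length" of a binary itemset code. */
unsigned bit_count(uint64_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Encode an itemset as a bitmask: bit j is set when item j is present. */
uint64_t encode_itemset(const unsigned *items, unsigned n_items)
{
    uint64_t mask = 0;
    for (unsigned j = 0; j < n_items; j++) {
        assert(items[j] < 64);
        mask |= (uint64_t)1 << items[j];
    }
    return mask;
}

/* Property 1: two (k-1)-itemsets share exactly k-2 items
 * if and only if the OR of their codes has exactly k bits set. */
int joinable(uint64_t p, uint64_t q, unsigned k)
{
    return bit_count(p | q) == k;
}
```

For example, the 2-itemsets {0, 1} and {1, 2} encode to 011 and 110; their OR is 111, which has 3 bits set, so they join into the 3-itemset {0, 1, 2}. The pair {0, 1} and {2, 3} gives 1111 with 4 bits set, so it is rejected for k = 3.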
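2.3 Generating the local frequent itemsets on the slave nodes and computing support

(1) Algorithm for generating the local frequent itemsets
Input: the global frequent (k-1)-itemsets GL_{k-1}(i) distributed to node i.
Output: the local frequent k-itemsets LL_k(i).

Apriori_gen()
{ LL_k(i) = ∅;
  for each p ∈ GL_{k-1}(i)
    for each q ∈ GL_{k-1}(i)    // q ≠ p
    { pq = p OR q }             // generate a new k-sequence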
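The listing above breaks off after the OR step, so the following C sketch fills in only the join loop, under the assumption that itemsets are stored as bitmasks of at most 64 items and that a pair is kept when the OR has exactly k bits set. It does not include support counting or any pruning, and the name apriori_gen and its parameters are illustrative rather than the paper's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the 1 bits in x. */
static unsigned popcount64(uint64_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Join step: for every pair p != q of (k-1)-itemsets in prev, keep p|q when it
 * has exactly k bits set. Distinct candidates are written to out (at most
 * max_out of them); the return value is the number of candidates stored. */
size_t apriori_gen(const uint64_t *prev, size_t n_prev, unsigned k,
                   uint64_t *out, size_t max_out)
{
    assert(prev != NULL && out != NULL);
    size_t n_out = 0;
    for (size_t a = 0; a < n_prev; a++) {
        for (size_t b = a + 1; b < n_prev; b++) {   /* p|q == q|p, so b > a suffices */
            uint64_t pq = prev[a] | prev[b];
            if (popcount64(pq) != k)
                continue;
            int seen = 0;                            /* skip candidates already stored */
            for (size_t c = 0; c < n_out; c++) {
                if (out[c] == pq) {
                    seen = 1;
                    break;
                }
            }
            if (!seen && n_out < max_out)
                out[n_out++] = pq;
        }
    }
    return n_out;
}
```

Given the 2-itemsets 0011, 0101, 0110 and 1001 with k = 3, the loop stores the three candidates 0111, 1011 and 1101; the pair 0110 and 1001 is rejected because its OR has four bits set. The linear duplicate scan keeps the sketch short; a hash set would be the usual choice for large candidate lists.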
AIStation Artificial Intelligence Platform User Guide

Artificial intelligence development platform
Release AI computing power, accelerate intelligent evolution

AIStation - Artificial Intelligence Platform
[Overview diagram: user data flows through Development & Training (data import, pre-processing acceleration, training visualization, hyper-parameter tuning; PyTorch, Caffe, MXNet, PaddlePaddle, TensorFlow; Jupyter, WebShell, pipeline) to Deployment & Inference (TensorFlow Serving, TensorRT Inference Server, PyTorch Inference Server), with on-demand resources, automatic scheduling and optimization, serving telecom, finance, medical, manufacturing, transportation and internet scenarios such as recommendation systems, CV and NLP. Headline figures: utilization 40% → 80%; training 2 days → 4 hrs; deployment 2 days → 5 min.]

"Efficiency" has become a bottleneck restricting the development of enterprise AI business
• Data security issues
• Inefficient collaborative development
• Lack of centralized data management
• Low resource utilization
• Inconvenient for large-scale training
• Decentralized resource management
• Lack of synergy in R&D, slow business response
• R&D lacks a unified process

AIStation - Artificial intelligence development platform
Frameworks: TensorFlow, PyTorch, Paddle, Caffe, MXNet
AIStation integrated development platform:
• Model Development: build environment, model debugging, model optimization
• Model Training: task queuing, distributed training, visual analysis
• Model Deployment: model loading, service deployment, API service
Underlying resources: AI computing resources (CPU, GPU), training samples (NFS, BeeGFS, HDFS), application stack

Highlighted features
• Computing resource pooling: user quota management, improved GPU utilization, pool scheduling across GPU cards and MIG instances
• Data acceleration: dataset management, data pre-loading and cached data management on SSD; solves the data I/O bottleneck and accelerates large-scale dataset transfer and model training
• Automated DDP training: low threshold for DDP training, helping developers drive massive computing power to iteratively train models; AIStation TensorFlow with customized MPI operators

AIStation development platform architecture
• Hardware cluster: NVIDIA GPU series (P100, V100, V100s, A100, A30, …); Ethernet, InfiniBand and RoCE clusters; storage on NFS, BeeGFS, Lustre, HDFS and object storage
• Operating system: Linux OS; NVIDIA driver package for GPU, Mellanox NIC, etc.
• Resource engine: Kubernetes + docker with monitoring, scheduling, GPU plugin, operator, network plugin, SRIOV plugin and Multus CNI
• AI application development (data preparation, algorithm prototype, training, test): data management, Jupyter, image management, webshell/ssh, multi-instance, visualization, quota management, resource management, deployment, job workflow management, job lifecycle, project management, algorithm management, model management
• Business management: report, HA, multi-tenant, system settings, authentication
• APIs: third-party or user-defined system integration

Deployment Mode
• Computing nodes: storage SSD 2T-10T; GPU 8*V100
• Management node: storage size 4T-100T
• Management network: Ethernet @ 1/10Gbps
• IPMI: Ethernet @ 1Gbps
• Cluster size: 10-80 persons

Deployment Mode (Larger Scale + HA)
• Computing nodes: SSD 2T-10T, 8*V100 each
• Storage: 100T-200T
• Management nodes: 1*main, 2*backup
• Management network: Ethernet @ 1/10Gbps
• IPMI: Ethernet @ 1Gbps
• Interconnect: 100G EDR InfiniBand (EDR @ 100Gbps)
• Cluster size: 10-80 persons

Highlighted features
• One-stop AI development platform: one-stop service with full-cycle management and easy-to-use distributed training, covering AI frameworks, AI ops tools, GPU driver & CUDA, and GPU
• Heterogeneous: standard interface for AI chips, multiple AI chips supported; standard and unified management, pooling and scheduling of CPU, GPU, FPGA and ASIC (A100, A30, A40, V100, MLU270, MLU390, Cloud AIC 100)
• Intelligent maintenance & security: comprehensive resource usage statistics, data security and access control, automatic fault analysis and solutions

AIStation - Data Synergy Acceleration
Centralized data management facilitates collaborative development
• Personal data isolation
• Collaborative sharing of public data
• Unified data set management
Data cache acceleration effectively solves I/O bottlenecks
• Dataset preloading
• Data affinity scheduling
• Local cache management strategy
Security policy
• Data access control
• Data security sandbox, anti-download
• Multiple copies ensure secure data backup
Data sources: user data, training samples, shared data (NFS, HDFS, BeeGFS, cloud storage)

Data Management: Multi-storage System
• Supports a "main-node" storage usage mode
• Unified access and data usage for NFS, BeeGFS, HDFS and Lustre through the UI
• Built-in NFS storage supports small-file merging and transfer, optimizing the cache efficiency of massive small files
[Diagram: AIStation computing pool with main storage (SSD + BeeGFS), node storage (NFS, HDFS, Lustre), storage extension (storage interface, data exchange) and data acceleration.]

AIStation - Resource Scheduling
Resource allocation management
• User GPU resource quota limit
• User storage quota limit
• Resource partition: target users, resource usage
Flexible resource scheduling
• Network topology affinity scheduling
• PCIe affinity scheduling
• Device type affinity scheduling
• GPU fine-grained distribution
Dynamic scheduling
• Allocate computing resources on demand
• Automatically released when the task is completed

GPU Management: Fine-granularity GPU usage
[Chart: GPU memory (G) against time (H) for several users, with and without sharing.]
A GPU sharing scheduling policy based on CUDA realizes single-card GPU resource reuse and greatly improves computing resource utilization.
• Elastic sharing: resources are allocated based on the number of tasks to be multiplexed. A single card supports a maximum of 64 multiplexed tasks.
• Strict sharing: GPU memory is isolated and allocated at any granularity (minimum 1GB), and resources are isolated based on graphics memory.
• Flexible and convenient: "zero intrusion" for user applications, easy application code migration.

Scheduling with MIG
NVIDIA A100 MIG support (8 * A100 GPUs, 56 * MIG instances)
Improving GPU utilization
• A single A100 GPU achieves up to 7x instance partitioning and up to 56x performance on 8*A100 GPUs in Inspur NF5488A5
• Allocates appropriate computing power resources to tasks with different load requirements
• Automatic MIG resource pool management: on-demand application, release after use
Convenient operation and maintenance
• Set different sizes of pre-configured MIG instance templates
• Standard configuration UI for IT and DevOps teams
• Simplifies NVIDIA A100 utilization and resource management

Resource management: NUMA-based scheduling
[Diagram: Kubelet, resource management plugin, Inspur-DevicePlugin and AIStation scheduler exchanging GPU topology score, GPU resource updates and GPU allocation.]
Automatically detects the topology of compute nodes and preferentially allocates CPU and GPU resources in the same NUMA group to a container, to make full use of the communication bandwidth in the group.

AIStation - Integrated AI training framework
Image libraries: private image library, public image library, Inspur AI image library
Stack: AI development framework; AI development components and tools; GPU driver and development library; GPU computing resources
• TensorFlow, PyTorch, Paddle, Caffe, MXNet
• Build a private warehouse to manage the AI application stack
• Support image creation and external import
• Support open source repositories such as NGC and Docker Hub
• Built-in mainstream development tools and support for docking with a local IDE

Quickly build the development environment, focus on model development
Quickly enter development mode
• Built-in Jupyter and Shell tools
• Support docking with a local IDE
• Support command line operation
Rapidly build the model development environment
• Allocate computing resources on demand
• Quick creation through the interface
• Rapid copying of the development environment
Centralized management of the development environment
• Life cycle management
• Real-time monitoring of resource performance
• One-click submission of training tasks

Development Platform (Jupyter, WebShell, local IDE)
• Development platform status; development environment instance monitoring; the development environment saves the image
• Second-level build: on-demand GPU; TensorFlow/MXNet/PyTorch/Caffe; single-GPU, multi-GPU and distributed training; flexible adjustment of resources on demand decouples the runtime environment from computing power
• Interactive modeling: Jupyter / WebShell / IDE
• Visualization: TensorBoard / Visdom / Netscope
• Full cycle management: status monitoring, performance monitoring, port password memory; image save, copy, expansion, start, delete, etc.

AIStation - Training Management
Distributed task scheduling to speed up model training
Themes: multiple ways to support distributed training; quick start of distributed jobs; improved computing performance
• Enhanced affinity scheduling and an optimized distributed scheduling strategy; the multi-GPU training acceleration ratio can reach more than 90%
• Optimized most of the code based on open source
• Fixed a bug where workers and launchers could not start at the same time
• Task status is more detailed
• Supports distributed training for mainstream frameworks
• Provides one-page submission and command line submission of training tasks

AIStation - Resource Monitoring
Overall monitoring
• Usage status of cluster resources such as GPU, CPU and storage
• Computing node health and performance
• User task status and resource usage
Resource usage statistics
• Cluster-level resource usage statistics
• Cluster-level task scale statistics
• User-level resource usage statistics
• User-level task scale statistics
System alarm
• Hardware malfunction
• System health status
• Computing resource utilization

Multi-tenant Management
[Diagram: AIStation users and user groups mapped to Kubernetes namespaces and cluster resources.]
• Supports an administrator, tenant administrator and common user organization structure. Tenant administrators can conveniently manage user members and services in user groups, while platform administrators can restrict access to and use of resources and data in user groups.
• User authentication: LDAP as the user authentication system, supporting third-party LDAP/NIS systems.
• Resource quota control for users and user groups using K8S namespaces.
• User operations: users can be added, logged out, removed, and have their passwords reset in batches. Users can be enabled or disabled to download data and schedule urgent tasks.

Intelligent platform operation and maintenance
Intelligent diagnosis and recovery tool
• Based on the existing cluster monitoring, alarm and statistics functions, the operation monitoring component is separated out to support independent deployment and use.
• Health monitoring: obtain the status and list display (monitoring information and abnormal events) of components (micro-services and NFS).
• Abnormal repair: based on the operation and maintenance experience of AIStation, automatic or manual repair of sorted events such as interface timeouts and service abnormalities (microservice restart and NFS remount).
Intelligent fault tolerance
• Management node high availability: supports active and standby management node health monitoring, HA status monitoring, and smooth switchover between active and standby management nodes without affecting services.
• Computing node fault tolerance: monitors alarms for abnormal compute node resource usage to ensure the smooth running of compute nodes.
• Training mission fault tolerance: in the event of a system failure, the training task automatically starts smooth migration within 30 seconds.
• Critical services fault tolerance: monitors the status of key services and raises abnormal warnings to ensure the smooth operation of user core services.

Northbound interface
• Secure, flexible, extensible northbound interface based on REST APIs.
[Diagram: query and return URLs between AIStation and external systems for monitoring, developing and training, carrying status, usage, performance, logs, results, resources, framework, scripts, dataset, environment and login information; computing resources, datasets and applications (Caffe, TensorFlow, PyTorch, Keras, MXNet, Theano).]

AIStation product features
• Full AI business process support
• Integrated cluster management
• Efficient computing resource scheduling
• Data caching strategy
• Reliable security mechanisms

Use Case: Automatic driving
Background:
• Widely serving the commercial vehicle and passenger vehicle front-loading market.
• The company provides ADAS and ADS system products and solutions, as well as high-performance intelligent visual perception system products required for intelligent driving.
Solutions:
• Increasing computing cluster resource utilization by 30% with an efficient scheduler.
• One-stop service with full-cycle management, streamlined deployments.
• Computing support, data management.

Use Case: Communications technology company
Background:
• HD video conference and mobile conference are provided, and voice recognition and visual processing are the main scenarios.
• With an increased scale of sample data and distributed training deployment and management, a unified AI development platform is required to support the rapid development of services.
Problems:
• Increasing size of dataset (~1.5T), distributed training
• GPU resource allocating automatically
• Efficient and optimized management for the huge set of small files
Solutions:
• Quick deployment and distributed training
• GPU pooling
• Huge files reading and training optimization

Use Case: Build One-Stop AI Workflow for Largest Excavator Manufacturer
SANY HEAVY INDUSTRY CO., LTD
• Revenue: 15.7B$
• Excavators per year: 100,000+
• Factories: 30+
• AIStation built a one-stop AI workflow to connect cloud, edge and local clusters; supports 75 production systems.
• API calls per day: 25 M
• QoS: 0 miss per 10M calls
• Model development cycle: 2 weeks → 3 days
• Use AI to automate 90% of production lines, double production capacity.
[Diagram: SANY Cloud sends sensor data and downloads data to AIStation for model development and training; the inference service supports real-time work condition analysis through inference API calls.]
• Inference cluster: 200 * 5280M5, 800 * T4
• Training cluster: 40 * 5468M5, 320 * V100
Hyundai Elantra Product Brochure

Dare to challenge the status quo and find greater courage without fear of failure. Challenge tradition and prejudice to find opportunities for innovation. Be in charge of today through passion and effort. Open up the world of tomorrow using your own standards, not the world's. Do not hesitate, but stay bold and go for those big dreams. Dare to be you. The answer is you. Stand strong. Have faith in yourself, then see your true strengths unleashed.

Question the old rules.
The 'Parametric Dynamics' design accentuates geometric aesthetics of the elongated hood and sleek roof lines, completing the visionary and innovative style.

Parametric Jewel Surface
Three sections emerge out of three bold-edged lines intersecting at a single point, creating three different colors of light.

Parametric Jewel Pattern Grille
The stereoscopic Parametric Jewel-pattern design highlights the depth of the front grille, making it resemble diamond-cut gemstones, and the bold and elongated front headlights come together to give Elantra its sporty front look.

H-Tail Lamp
The edgy spoiler on the trunk and the integrated all-in-one taillight, representing Hyundai with its distinct H-shape design, help to create a high-tech, futuristic rear look.

LED Headlights; 17" Alloy Wheels & Tires

Bigger, longer, and lower than ever.
Elantra's sporty look and elaborate lines highlight its bold presence.

Most sensuous.
The Immersive Interface cocoons the driver like a cockpit that surrounds the pilot, offering easier control and an enveloping, advanced driving experience.

Immersive Interface
The 10.25" cluster display and 8" audio display deliver a fully immersive space, tilted 10 degrees toward the driver for easier control and a high-tech feel.

Electric Parking Brake with Auto Hold
Simply engage/disengage the parking brake with a single switch. The electric parking brake comes with Auto Hold, which keeps the vehicle stationary while stopped.

BOSE Premium Sound System
Elantra's 8 high-performance speakers deliver precise, powerful sound, ranging from low- to high-pitched tones at volumes that adjust according to the speed of your vehicle.

Stay connected.
Elantra's highly advanced connectivity features are extremely intuitive and easy to use, keeping you connected with the car and bringing greater innovation to your driving experience.

Phone Projection
The main features of your smartphone are shown and controllable on the interior display.

Wireless Smartphone Charging
Wireless charging that is as simple as placing your phone on the charging pad.

* The availability of Hyundai CarPlay and Android Auto on mobile phones may vary depending on your country or region and will be in compliance with the policies of Google Play and App Store.

Make your move.
Elantra's newly developed 3rd generation platform delivers agile handling and stability powered by a fuel-efficient engine, giving you optimal driving performance wherever you go.

Engine                             Max. Power              Max. Torque
Gamma 1.6 MPi Gasoline Engine      127.5 ps / 6,300 rpm    15.77 kg·m / 4,850 rpm
Smartstream G2.0 Gasoline Engine   159 ps / 6,200 rpm      19.5 kg·m / 4,500 rpm

Forward Collision-Avoidance Assist
If the preceding vehicle suddenly slows down, or if a forward collision risk is detected, such as a stopped vehicle or a pedestrian in front, it provides a warning. After the warning, if the risk of collision increases, it automatically assists with emergency braking. While driving, if there is a risk of collision with a cyclist, it automatically assists with emergency braking. If there is a risk of collision with an oncoming vehicle while turning left at an intersection, it automatically assists with emergency braking.

Blind-Spot Collision-Avoidance Assist
When operating the turn signal switch to change lanes, if there is a risk of collision with a rear side vehicle, it provides a warning. After the warning, if the risk of collision increases, it automatically controls the vehicle to help avoid a collision.

Driver Attention Warning
Displays the driver's attention level while driving. Provides a warning when signs of driver inattentiveness are detected, and recommends a rest if needed. The driver should stop and park in a safe place and get plenty of rest before driving. During a stop, the driver is alerted if the leading vehicle departs.

Smart Cruise Control
Smart Cruise Control helps maintain distance from the vehicle ahead and drive at a speed set by the driver. The vehicle stops automatically, and starts automatically if the vehicle in front starts within a short time. If a period of time has elapsed, start again by pressing the accelerator pedal or operating the …

Lane Following Assist
Detects the lane and the vehicle ahead on the road with a front view camera on the front windshield, and assists the driver's steering to help keep the vehicle centered between the lanes.

Rear Cross-Traffic Collision-Avoidance Assist
If there is a risk of collision with an oncoming vehicle on the left or right side while reversing, it provides a warning. After the warning, if the risk of collision increases, it automatically assists with emergency braking.

For those who dare to be courageous and are unafraid of failure, the all-new Elantra is with you.
You lead. We're right beside you.

Features
• Exterior Sideview Mirrors (heated, power folding, LED turn signal indicators)
• LED Rear Combination Lamp
• Bulb Rear Combination Lamp
• LED + Bulb Rear Combination Lamp
• Black Radiator Grille
• Chrome Radiator Grille
• LED Headlights
• Projector Headlights
• Basic Audio System
• 8" Display Audio System (matt, when selecting 4.2" color LCD cluster display)
• Manual Air Conditioner
• Dual Full Auto Air Conditioner
• Ventilated Front Seats
• Heated Rear Seats
• 10.25" Full Color Cluster Display
• Memory System For Driver's Seat
• 15" Alloy Wheel
• 15" Steel Wheel & Wheel Cover
• 16" Alloy Wheel
• 17" Alloy Wheel

Colors
Exterior colors: Polar White, Cyber Grey
Interior colors: Natural Leather, Tricot, Woven

Specifications
Unit: mm. Wheel tread is based on 15" tires, and the number in parentheses is based on 17" tires.
Overall Length   4,675
Overall Width    1,825
Overall Height   1,430
Wheelbase        2,720
Wheel Tread      1,593 (1,579) and 1,604 (1,590)

● Above values are based on internal testing results and are subject to change after final validation.
● Model specification may vary depending on sales region and country.
● Some of the equipment illustrated or described in this catalog may not be supplied as standard equipment and may be available at extra cost.
● Hyundai Motor Company reserves the right to change specifications and equipment without prior notice.
● The color plates shown may vary slightly from the actual colors due to the limitations of the printing process.
● Please consult your dealer for full information and availability on colors and trims.

Hyundai Motor Company
GEN. RHD 2008 ENG. FDCG
Copyright © 2020 Hyundai Motor Company. All Rights Reserved.
Intel MPI Library for Windows OS 入门指南说明书
Intel® MPI Library for Windows* OS Getting Started GuideThe Intel® MPI Library is a multi-fabric message passing library that implements the Message Passing Interface, v2 (MPI-2) specification. Use it to switch interconnection fabrics without re-linking.This Getting Started Guide explains how to use the Intel® MPI Library to compile and run a simple MPI program. This guide also includes basic usage examples and troubleshooting tips.To quickly start using the Intel® MPI Library, print this short guide and walk through the example provided.Copyright © 2003–2010 Intel CorporationAll Rights ReservedDocument Number: 316404-007Revision: 4.0World Wide Web: Contents1About this Document (4)1.1Intended Audience (4)1.2Using Doc Type Field (4)1.3Conventions and Symbols (4)1.4Related Information (5)2Using the Intel® MPI Library (6)2.1Usage Model (6)2.2Before you Begin (6)2.3Quick Start (7)2.4Compiling and Linking (8)2.5Setting up SMPD Services (8)2.6Selecting a Network Fabric (9)2.7Running an MPI Program (10)3Troubleshooting (12)3.1Testing Installation (12)3.2Compiling and Running a Test Program (12)4Next Steps (14)Disclaimer and Legal NoticesINFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. 
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See /products/processor_number for details.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2007-2010, Intel Corporation. All rights reserved.

1 About this Document

The Intel® MPI Library for Windows* OS Getting Started Guide contains information on the following subjects:
• First steps using the Intel® MPI Library
• Troubleshooting: outlines first-aid troubleshooting actions

1.1 Intended Audience

This Getting Started Guide is intended for first time users.

1.2 Using Doc Type Field

This Getting Started Guide contains the following sections:

Document Organization
Section                                  Description
Section 1 About this Document            Section 1 introduces this document
Section 2 Using the Intel® MPI Library   Section 2 describes how to use the Intel® MPI Library
Section 3 Troubleshooting                Section 3 outlines first-aid troubleshooting actions
Section 4 Next Steps                     Section 4 provides links to further resources

1.3 Conventions and Symbols

The following conventions are used in this document.

Table 1.3-1 Conventions and Symbols used in this Document
This type style        Document or product names
This type style        Hyperlinks
This type style        Commands, arguments, options, file names
THIS_TYPE_STYLE        Environment variables
<this type style>      Placeholders for actual values
[ items ]              Optional items
{ item | item }        Selectable items separated by vertical bar(s)
(SDK only)             For Software Development Kit (SDK) users only

1.4 Related Information

To get more information about the Intel® MPI Library, see the following resources:
Product Web Site
Intel® MPI Library Support
Intel® Cluster Tools Products
Intel® Software Development Products

2 Using the Intel® MPI Library

2.1 Usage Model

Using the Intel® MPI Library involves the following steps. These steps are described in detail in the corresponding sections.

Figure 1: Flowchart representing the usage model for working with the Intel® MPI Library.

2.2 Before You Begin

1. Before using the Intel® MPI Library, ensure that the library, scripts, and utility applications are installed. See the Intel® MPI Library for Windows* OS Installation Guide for installation instructions.
2. To get the proper environment settings, use one of the following commands from the Start menu:
   Start > Programs > Intel(R) Software Development Tools > Intel(R) MPI Library 4.0 > Build Environment for the IA-32 architecture
   Start > Programs > Intel(R) Software Development Tools > Intel(R) MPI Library 4.0 > Build Environment for the Intel® 64 architecture
   Alternatively, you can open a new console (cmd) window and run one of the following BAT files from the command line:
   <installdir>\ia32\bin\mpivars.bat
   <installdir>\em64t\bin\mpivars.bat
3. You need administrator privileges on all nodes of the cluster to start the smpd service on those nodes.

2.3 Quick Start

1. Use the call batch command to get the proper environment settings from the mpivars.bat batch scripts included with the Intel® MPI Library. The script is located in the <installdir>\em64t\bin directory for the Intel® 64 architecture, or in the <installdir>\ia32\bin directory for the 32-bit mode.
2. Make sure the smpd services are installed and started on the compute nodes. Otherwise install them manually from the command line by using the -install smpd option.
If the smpd service stops, start it through Computer Management -> Services and Applications -> Services, or manually from the command line using the -start smpd option.
3. (SDK only) Make sure that you have a compiler in your PATH.
4. (SDK only) Compile the test program using the appropriate compiler driver. For example:
   > mpicc.bat -o test <installdir>\test\test.c
5. Register your credentials using the wmpiregister GUI utility.
6. Execute the test using the GUI utility wmpiexec. Set the application name and the number of processes. In this case all processes start on the current host. To start a test on a remote host or on more than one host, press the Advanced Options button and fill in the appropriate fields. Use the Show Command button to check the command line. Press the Execute button to start the program.

You can use the command line interface instead of the GUI.
1. Use the mpiexec -register option instead of the wmpiregister GUI utility to register your credentials.
2. Use the CLI mpiexec command to execute the test:
   > mpiexec.exe -n <# of processes> test.exe
   or
   > mpiexec.exe -hosts <# of hosts> <host1_name> \
     <host1 # of processes> <host2_name> \
     <host2 # of processes> … test.exe

See the rest of this document and the Intel® MPI Library Reference Manual for Windows* OS for more details.

2.4 Compiling and Linking

(SDK only)
To compile and link an MPI program with the Intel® MPI Library, do the following steps:
1. Create a Winxx Console project for Microsoft* Visual Studio* 2005.
2. Choose the x64 solution platform.
3. Add <installdir>\em64t\include to the include path.
4. Add <installdir>\em64t\lib to the library path.
5. Add impi.lib (Release) or impid.lib (Debug) to your target link command for C applications.
6. Add impi.lib and impicxx.lib (Release) or impid.lib and impicxxd.lib (Debug) to your target link command for C++ applications. Link the application with impimt.lib (Release) or impidmt.lib (Debug) for multithreading.
7. Build the program.
8. Place your application and all the dynamic libraries in a shared location or copy them to all the nodes.
9. Run the application using the mpiexec.exe command.

2.5 Setting up SMPD Services

The Intel® MPI Library uses a Simple Multi-Purpose Daemon (SMPD) job startup mechanism. In order to run programs compiled with Microsoft* Visual Studio* (or related), set up an SMPD service.

NOTE: You need administrator privileges to start the smpd service. Once it is running, all users can launch processes with mpiexec.

To set up SMPD services:
1. During the Intel® MPI Library installation the smpd service is started. During installation you can cancel the smpd service startup.
2. You can start, restart, stop or remove the smpd service manually once the Intel® MPI Library is installed. Find smpd.exe in the <installdir>\em64t\bin
3. Use the following command on each node of the cluster to remove the previous smpd service:
   > smpd.exe -remove
4. Use the following command on each node of the cluster to install the smpd service manually:
   > smpd.exe -install

2.6 Selecting a Network Fabric

The Intel® MPI Library dynamically selects different fabrics for communication between MPI processes.
To select a specific fabric combination, set the new I_MPI_FABRICS or the old I_MPI_DEVICE environment variable.

I_MPI_FABRICS (I_MPI_DEVICE)
Select the particular network fabrics to be used.

Syntax
I_MPI_FABRICS=<fabric>|<intra-node fabric>:<inter-nodes fabric>
Where
<fabric> := {shm, dapl, tcp}
<intra-node fabric> := {shm, dapl, tcp}
<inter-nodes fabric> := {dapl, tcp}

Deprecated Syntax
I_MPI_DEVICE=<device>[:<provider>]

Arguments
<fabric>   Define a network fabric
shm        Shared-memory
dapl       DAPL-capable network fabrics, such as InfiniBand*, iWarp*, Dolphin*, and XPMEM* (through DAPL*)
tcp        TCP/IP-capable network fabrics, such as Ethernet and InfiniBand* (through IPoIB*)

Correspondence with I_MPI_DEVICE
<device>   <fabric>
sock       tcp
shm        shm
ssm        shm:tcp
rdma       dapl
rdssm      shm:dapl

<provider>   Optional DAPL* provider name (only for the rdma and the rdssm devices)
             I_MPI_DAPL_PROVIDER=<provider>

Use the <provider> specification only for the {rdma,rdssm} devices.
For example, to select the OFED* InfiniBand* device, use the following command:
> mpiexec -n <# of processes> \
  -env I_MPI_DEVICE rdssm:ibnic0v2-scm <executable>
For these devices, if <provider> is not specified, the first DAPL* provider in the dat.conf file is used.

NOTE: Ensure the selected fabric is available. For example, use shm only if all the processes can communicate with each other through shared memory. Use rdma only if all the processes can communicate with each other through a single DAPL provider. Ensure that the dat.dll library is in your %PATH%. Otherwise, use the -genv option of mpiexec.exe to set the I_MPI_DAT_LIBRARY environment variable to the fully-qualified path of the dat.dll library.

2.7 Running an MPI Program

Use the mpiexec command to launch programs linked with the Intel® MPI Library:
> mpiexec.exe -n <# of processes> myprog.exe

NOTE: The wmpiexec utility is a GUI wrapper for mpiexec.exe.
See the Intel® MPI Library Reference Manual for more details.

The only required mpiexec option is -n, which sets the number of processes on the local node.
Use the -hosts option to set the names of hosts and the number of processes on each:
> mpiexec.exe -hosts 2 host1 2 host2 2 myprog.exe
If you are using a network fabric other than the default fabric, use the -genv option to set the I_MPI_DEVICE variable.
For example, to run an MPI program using the shm fabric, type in the following command:
> mpiexec.exe -genv I_MPI_DEVICE shm -n <# of processes> \
  myprog.exe
You may use the -configfile option to run the program:
> mpiexec.exe -configfile config_file
The configuration file contains:
-host host1 -n 1 -genv I_MPI_DEVICE rdssm myprog.exe
-host host2 -n 1 -genv I_MPI_DEVICE rdssm myprog.exe
For the rdma capable fabric, use the following command:
> mpiexec.exe -hosts 2 host1 1 host2 1 -genv I_MPI_DEVICE rdma myprog.exe
You can select any supported device. For more information, see Section 2.6 Selecting a Network Fabric.

If you successfully run your application using the Intel® MPI Library, you can move your application from one cluster to another and use different fabrics between the nodes without re-linking. If you encounter problems, see Troubleshooting for possible solutions.

3 Troubleshooting

Use the following sections to troubleshoot problems with installation, setup, and running applications using the Intel® MPI Library.

3.1 Testing Installation

To ensure that the Intel® MPI Library is installed and functioning, complete a general test, then compile and run a test program.
To test the installation:
1. Verify through Computer Management that the smpd service is started. It calls the Intel MPI Process Manager.
2. Verify that <installdir>\ia32\bin (<installdir>\em64t\bin for the Intel® 64 architecture in the 64-bit mode) is in your path:
   > echo %PATH%
   You should see the correct path for each node you test.
3. (SDK only) If you use Intel compilers, verify that the appropriate directories are included in the PATH and LIB environment variables:
   > mpiexec.exe -hosts 2 host1 1 host2 1 a.bat
   where a.bat contains
   echo %PATH%
   You should see the correct directories for these path variables for each node you test. If not, call the appropriate *vars.bat scripts. For example, with Intel® C++ Compiler 11.0 for Windows* OS for the Intel® 64 architecture in the 64-bit mode, use the Windows program menu to select:
   Intel(R) Software Development Tools > Intel(R) C++ Compiler 11.0 > Build Environment for the Intel® 64 architecture
   or run from the command line:
   %ProgramFiles%\Intel\Compiler\C++\11.0\em64t\bin\iclvars.bat

3.2 Compiling and Running a Test Program

The install directory <installdir>\test contains test programs which you can use for testing. To compile one of them or your own test program, do the following:
1. (SDK only) Compile a test program as described in Section 2.4 Compiling and Linking.
2. If you are using InfiniBand* or other RDMA-capable network hardware and software, verify that everything is functioning.
3. Run the test program with all available configurations on your cluster.
   • Test the sock device using:
     > mpiexec.exe -n 2 -env I_MPI_DEBUG 2 -env I_MPI_DEVICE sock a.out
     You should see one line of output for each rank, as well as debug output indicating that the sock device is used.
   • Test the ssm devices using:
     > mpiexec.exe -n 2 -env I_MPI_DEBUG 2 -env I_MPI_DEVICE ssm a.out
     You should see one line of output for each rank, as well as debug output indicating that the ssm device is used.
   • Test any other fabric devices using:
     > mpiexec.exe -n 2 -env I_MPI_DEBUG 2 -env I_MPI_DEVICE <device> a.out
     where <device> can be shm, rdma, or rdssm
   For each of the mpiexec commands used, you should see one line of output for each rank, as well as debug output indicating which device was used. The device(s) should agree with the I_MPI_DEVICE setting.

4 Next Steps

To get more information about the Intel® MPI Library, explore the following resources:
The Intel® MPI Library Release Notes include key product details. See the Intel® MPI Library Release Notes for updated information on requirements, technical support, and known limitations.
Use the Windows program menu to select Intel(R) Software Development Tools > Intel(R) MPI Library > Intel(R) MPI Library for Windows* OS Release Notes.
For more information see Websites:
Product Web Site
Intel® MPI Library Support
Intel® Cluster Tools Products
Intel® Software Development Products
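The correspondence between the deprecated I_MPI_DEVICE values and the I_MPI_FABRICS values listed in Section 2.6 can be written down as a small lookup, which may help when migrating old launch scripts. The sketch below is only an illustration in C: the table entries are copied from the guide, but the helper name fabrics_for_device and the choice to ignore an optional ":<provider>" suffix are assumptions of this example, not part of the Intel® MPI Library.

```c
#include <stddef.h>
#include <string.h>

/* One row of the "Correspondence with I_MPI_DEVICE" table in Section 2.6. */
struct device_fabric {
    const char *device;   /* deprecated I_MPI_DEVICE value */
    const char *fabrics;  /* equivalent I_MPI_FABRICS value */
};

static const struct device_fabric DEVICE_TO_FABRICS[] = {
    {"sock",  "tcp"},
    {"shm",   "shm"},
    {"ssm",   "shm:tcp"},
    {"rdma",  "dapl"},
    {"rdssm", "shm:dapl"},
};

/* Hypothetical helper (not an Intel API): map an I_MPI_DEVICE setting such as
 * "rdssm" or "rdssm:<provider>" to the matching I_MPI_FABRICS string.
 * Returns NULL for device names that are not in the table. */
const char *fabrics_for_device(const char *device)
{
    /* Compare only the part before an optional ":<provider>" suffix. */
    size_t len = strcspn(device, ":");
    size_t rows = sizeof DEVICE_TO_FABRICS / sizeof DEVICE_TO_FABRICS[0];

    for (size_t i = 0; i < rows; i++) {
        const char *name = DEVICE_TO_FABRICS[i].device;
        if (strlen(name) == len && strncmp(name, device, len) == 0)
            return DEVICE_TO_FABRICS[i].fabrics;
    }
    return NULL;
}
```

With this table, the guide's example setting rdssm:ibnic0v2-scm maps to shm:dapl; the provider name itself would still have to be passed separately, for instance through I_MPI_DAPL_PROVIDER as described in Section 2.6.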
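MPI-Based Parallel Computing Method for the Incompressible N-S Equations

Document code: A

PARALLEL COMPUTING ALGORITHM OF INCOMPRESSIBLE N-S EQUATION BASED ON MPI
Wang Liansheng, Xiao Honglin, Guo Mingming
(Department of Mechanical Engineering, Tianjin Sino-German Vocational Technology Institute, Tianjin 300191, China)

0 Introduction

With the great increase in computer speed, computational fluid dynamics (CFD) has flourished, which has in turn advanced aerospace technology. Numerical computation has become a method for studying fluid problems on a par with theoretical analysis and experimental research, and direct numerical simulation (DNS) is currently one of the effective tools for the numerical study of turbulent motion.
However, owing to the complexity of turbulent motion, numerical computation at high Reynolds numbers [...]

[...] and message receiving. MPI supports the following four communication modes: standard mode, buffered mode, synchronous mode, and ready mode. MPI provides bindings for the C and Fortran languages.

2 Problem Description

After nondimensionalization, the continuity equation and the Navier-Stokes equations for incompressible flow give the following governing equations:

$\frac{\partial u_i}{\partial x_i} = 0$

[...]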
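MPI-based: Research on MPI Program Optimization Techniques on Multi-core Clusters (Wang Jie)

[...] clusters adopt multi-core processors as their core components, and clusters based on multi-core technology have become the mainstream platform in high-performance computing [1]. According to the world Top500 supercomputer ranking published in 2010, about 88% of the supercomputers use [...]

[...] performance optimization of MPI programs in this environment has become a research focus. Although factors such as the data locality of the algorithm and load balancing affect the performance of parallel MPI applications, they depend on the characteristics of the specific application. Porting existing MPI programs directly to a multi-core cluster platform does not improve application performance and scalability by much [4].

[...] the effectiveness of the optimization, and has a certain general applicability and research significance.

[...] mization performances were also analyzed.
Keywords: Multi-core cluster, Memory hierarchy, MPI programs optimization, Hybrid MPI/OpenMP, MPI runtime parameters, MPI process placement

[...] and L2 cache, exchange data through HyperTransport interface technology, and share a relatively small L3 cache. In IBM's Power4 and Power5, the two cores on a chip share the L2 cache, whereas in Power6 the two cores on a chip each have their own independent [...]

The ping-pong benchmark of the Intel MPI Benchmarks (IMB) was used to test how the cache structure of the three clusters affects communication performance. Figures 1 to 3 show the intra-chip, inter-chip, and inter-node communication bandwidth of the three multi-core clusters.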
MPI Error Code Reference Table

00CA : no resources available
00CB : configuration error
00CD : illegal call
00CE : module not found
00CF : driver not loaded
00D0 : hardware fault
00D1 : software fault
00D2 : memory fault
00D7 : no message
00D8 : storage fault
00DB : internal timeout
00E1 : too many channels open
00E2 : internal fault
00E7 : hardware fault
00E9 : sin_serv.exe not started
00EA : protected
00F0 : scp db file does not exist
00F1 : no global dos storage available
00F2 : error during transmission
00F2 : error during reception
00F4 : device does not exist
00F5 : incorrect sub system
00F6 : unknown code
00F7 : buffer too small
00F8 : buffer too small
00F9 : incorrect protocol
00FB : reception error
00FC : licence error
0101 : connection not established / parameterised
010A : negative acknowledgement received / timeout error
010C : data does not exist or disabled
012A : system storage no longer available
012E : incorrect parameter
0132 : no memory in DPRAM
0201 : incorrect interface specified
0202 : maximum amount of interfaces exceeded
0203 : PRODAVE already initialised
0204 : wrong parameter list
0205 : PRODAVE not initialised
0206 : handle cannot be set
0207 : data segment cannot be disabled
0300 : initialisation error
0301 : initialisation error
0302 : block too small, DW does not exist
0303 : block limit exceeded, correct amount
0310 : no HW found
0311 : HW defective
0312 : incorrect config param
0313 : incorrect baud rate / interrupt vector
0314 : HSA parameterised incorrectly
0315 : MPI address error
0316 : HW device already allocated
0317 : interrupt not available
0318 : interrupt occupied
0319 : sap not occupied
031A : no remote station found
031B : internal error
031C : system error
031D : error buffer size
0320 : hardware fault
0321 : DLL function error
0330 : version conflict
0331 : error com config
0332 : hardware fault
0333 : com not configured
0334 : com not available
0335 : serial drv in use
0336 : no connection
0337 : job rejected
0380 : internal error
0381 : hardware fault
0382 : no driver or device found
0384 : no driver or device found
03FF : system fault
0800 : toolbox occupied
4001 : connection not known
4002 : connection not established
4003 : connection is being established
4004 : connection broken down
8000 : function already actively occupied
8001 : not allowed in this operating status
8101 : hardware fault
8103 : object access not allowed
8104 : context is not supported
8105 : invalid address
8106 : type (data type) not supported
8107 : type (data type) not consistent
810A : object does not exist
8301 : memory slot on CPU not sufficient
8404 : grave error
8500 : incorrect PDU size
8702 : address invalid
D201 : syntax error block name
D202 : syntax error function parameter
D203 : syntax error block type
D204 : no linked block in storage medium
D205 : object already exists
D206 : object already exists
D207 : block exists in EPROM
D209 : block does not exist
D20E : no block available
D210 : block number too big
D241 : protection level of function not sufficient
D406 : information not available
EF01 : incorrect ID2
FFFB : TeleService Library not found
FFFE : unknown error FFFE hex
FFFF : timeout error. Check interface
LS-DYNABest-Prac...
Computing Technology
LS-DYNA® Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA® Performance
The manner in which HPC clusters are architected has a huge influence on the overall application performance and productivity – number of CPUs, usage of GPUs, the storage solution and the cluster interconnect. By providing low-latency, high-bandwidth and extremely low CPU overhead, InfiniBand has become the most deployed high-speed interconnect for HPC clusters, replacing proprietary or low-performance solutions. The InfiniBand Architecture (IBA) is an industry-standard fabric designed to provide high-bandwidth, low-latency computing, scalability for ten-thousand nodes and multiple CPU cores per server platform and efficient utilization of compute processing resources.
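An Introduction to Parallel Programming, Chapter 3

In theory, all of MPI's communication functionality can be implemented with its six basic calls:
MPI_INIT:       start the MPI environment
MPI_COMM_SIZE:  determine the number of processes
MPI_COMM_RANK:  determine the identifier (rank) of the calling process
MPI_SEND:       send a message
MPI_RECV:       receive a message
MPI_FINALIZE:   shut down the MPI environment

(1) MPI initialization: the MPI_Init function enters the MPI environment and completes all initialization work.

Commonly used MPI implementations
MPICH
  The most popular non-proprietary MPI implementation, developed jointly by Argonne National Laboratory and Mississippi State University; it offers better portability.
  The current latest version is MPICH 3.2.
LAM/MPI
  Implemented by the Open Systems Laboratory at Indiana University, USA.
Further commercial MPI versions
  HP-MPI, MS-MPI, ...

Part 3
MPI_Init(&argc,&argv); /* initialize the program */
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
/* get the rank of the current process */
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
/* get the total number of processes */
MPI_Get_processor_name(processor_name,&namelen);

Execution
mpiexec -n 1 ./mpi_hello
  runs the program with 1 process
mpiexec -n 4 ./mpi_hello
  runs the program with 4 processes
Copyright © 2010, Elsevier Inc. All rights Reserved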
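A Brief Analysis of MPI Cluster Communication Technology
Huang Yaling

3 Introduction to Cluster Systems

What is a cluster system?
A cluster system connects a group of high-performance workstations or high-end PCs in some topology over a high-speed general-purpose network.
How it works
Supported by parallel programming tools and a visual, interactive integrated development environment, a cluster is a system that is scheduled and managed as a whole to achieve efficient parallel processing. It relies mainly on message passing for communication between hosts. A parallel programming environment built on top of an ordinary operating system manages resources and coordinates cooperation, and it also hides the heterogeneity of the workstations and the network. Cluster systems are currently built on Unix and Windows operating systems.

Components of a cluster system

4 MPI Communication in Cluster Systems

The main communication modes used in MPI parallel programming in a cluster environment are:
  blocking communication
  non-blocking communication
  collective communication
Blocking and non-blocking communication are point-to-point, whereas collective communication is one-to-many. Point-to-point communication is communication between two processes, which may belong to the same process group or to different ones.

Blocking communication
  blocking send
  blocking receive

Non-blocking communication: MPI_ISEND
MPI_ISEND(buf, count, datatype, dest, tag, comm, request)
  IN  buf       starting address of the send buffer (choice)
  IN  count     number of elements to send (integer)
  IN  datatype  datatype of the elements to send (handle)
  IN  dest      rank of the destination process (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)
  OUT request   returned non-blocking communication request (handle)
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)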
POM:一个MPI程序的进程优化映射工具卢兴敬;商磊;陈莉【期刊名称】《计算机工程与科学》【年(卷),期】2009(031)0z1【摘要】Modern supercomputers contain more computing nodes with many multi-core processors in one node. Inter-node and intra-node hvae different bandwidth, and make up two different communication layers, the intra-node layer's communication performance is better. The default process mapping of MPI do not consider the difference of bandwidth, so it decreases the performance of the computing platform. To resolve the problem, this paper introduces an automatic tool of optimizing process mapping for MPI programs, which supplies a low cost method of getting the communication information and optimizes the distribution of the communication of the system. So we can leverage the communication performance of the platform, and also better the performance of the program. First, to present the communication layer of the computing platform, supercomputer was simplified into two layers. The top is different computing nodes connected by high speed networks, the base is the multi-core processors on the same node, which has wider bandwidth. Second, we introduce a method to transform the collective communication into point-to-point communication and add it to the communication information. In the last, using undirected graph with edges of differentweights to present the processes' communication relationship. So the process mapping problem now is a graph partitioning problem. This paper uses the open source software Chaco to solve the graph partitioning problem. 
The experiment proves that the POM can efficiently better the performance of MPI programs.%现代超级计算机具有越来越多的计算结点,同时结点内具有多个处理器核.由于互联带宽的差异,结点间与结点内构成两个通信性能不同的通信层次,后者的通信性能好于前者.但是,目前MPI程序的默认进程映射未考虑该通信层次差异,无法利用结点内较好的通信带宽,严重束缚了超级计算机的性能发挥.针对该问题,本文设计实现了能利用层次通信差异的MPI程序自动进程优化映射工具POM,提供了高效、低开销获取MPI程序通信信息的方法,最终通过优化通信在通信层次上的分布提高了程序的通信效率,从而提高了应用程序的性能.本文解决了硬件平台通信层次的抽象、MPI程序通信信息的低开销获取与映射方案的计算三个问题.首先,按照通信能力差异将超级计算机结构抽象为高速互联的不同计算结点与相同结点上的多个处理器核两层.其次,提出了将集合通信转化成点到点通信的简单实现方法.最后,利用无向加权边图来表示MPI程序的进程间通信关系,将MPI程序的进程映射问题转化为图划分问题.在曙光5000A和曙光4000A上的实验结果表明,利用POM工具能够显著提高MPI程序的性能.【总页数】5页(P201-205)【作者】卢兴敬;商磊;陈莉【作者单位】中国科学院计算技术研究所系统结构重点实验室,北京,100190;中国科学院计算技术研究所系统结构重点实验室,北京,100190;澳大利亚新南威尔士大学,悉尼,2052;中国科学院计算技术研究所系统结构重点实验室,北京,100190【正文语种】中文【中图分类】TP319【相关文献】1.基于程序定义及动态进程的PVM与MPI比较 [J], 马小玲2.基于MPI并行程序的性能评测可视化工具 [J], 刘华;徐炜民;孙强3.求解高次方程的一个C++/MPI并行程序 [J], 无4.基于MPI的进程拓扑感知映射研究 [J], 李东洋;王云岚5.利用冗余进程实现MPI程序错误检测 [J], 富弘毅;宋伟;杨学军因版权原因,仅展示原文概要,查看原文内容请购买。
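The graph formulation in the abstract can be made concrete with a small cost function: given the communication volume between every pair of processes and a candidate placement, sum the volume that has to cross node boundaries. A graph partitioner such as Chaco searches for a placement that keeps this quantity small. The C sketch below is only an illustration under stated assumptions; the function name inter_node_volume, the row-major matrix layout, and the use of byte counts as edge weights are choices of this example and are not taken from POM.

```c
#include <stddef.h>

/* Sum the communication volume that crosses node boundaries.
 *
 * volume   n*n row-major symmetric matrix; volume[i*n + j] is the weight of
 *          the edge between processes i and j (assumed here: bytes exchanged)
 * node_of  node_of[i] is the node that hosts process i
 * n        number of processes
 *
 * This is the weight of the "cut" of the communication graph induced by the
 * placement; a smaller value means more traffic stays inside a node. */
long inter_node_volume(const long *volume, const int *node_of, size_t n)
{
    long cut = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {  /* visit each undirected edge once */
            if (node_of[i] != node_of[j])
                cut += volume[i * n + j];
        }
    }
    return cut;
}
```

As a usage example, take four processes on two nodes where the pairs (0,2) and (1,3) exchange 100 units each and the pairs (0,1) and (2,3) exchange 1 unit each. The default block placement {0,0,1,1} sends 200 units between nodes, while the placement {0,1,0,1} sends only 2.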
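MPI-Based Parallel Implementation and Testing of a Finite Difference Scheme for the Two-Dimensional Poisson Equation

Authors: Yuan Ye; Yang Donghua
Journal: Journal of Harbin University of Commerce (Natural Sciences Edition)
Year (volume), issue: 2011, 27(6)
Pages: 854-856, 861 (4 pages)
Keywords: Poisson equation; Message Passing Interface; finite difference; speedup; parallel efficiency
Affiliation: Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin 150080
Language: Chinese
CLC number: TP319

Abstract: Message passing is a parallel programming model widely used in cluster environments. For the typical difference scheme of the first boundary value problem of a simple two-dimensional Poisson equation, this paper implements a parallel solver in an MPI environment using five-point difference discretization and the Jacobi iterative method. Tests show that, for problems of sufficient size, the parallel algorithm achieves good speedup and parallel efficiency.

With the development of high-speed networks and multi-core processors, cluster systems have achieved good performance. Because they are cost-effective and scale well, clusters are becoming the mainstream parallel platform. MPI (Message Passing Interface) message passing is a typical parallel programming model. Since a cluster is a typical distributed-memory system, MPI has become one of the most important parallel programming environments on clusters.

Scientific computing often requires the numerical solution of partial differential equations. The Poisson equation is widely used in electricity, magnetism, mechanics, thermal physics, and other fields, so computing its solution is one of the most basic problems in high-performance computing. The methods in common use are the finite difference method, the finite element method, and the finite volume method. Solving the Poisson equation by finite differences yields approximations, at the grid nodes, of the exact solution function; this method is mainly applied to time-dependent problems. Compared with the other two methods, the finite difference method is simple and easy to parallelize, so we use it to solve the Poisson equation. Setting aside complicated theoretical issues, this paper considers the five-point difference scheme for the boundary value problem of the two-dimensional Poisson equation on a rectangular region in a high-performance cluster environment. It gives a parallel solution of the resulting difference equations for a simple class of Poisson equations, using the Jacobi iterative method and the MPI message passing model, and tests the parallel performance of this solution quantitatively.

1 MPI Technology

Message passing is a widely used parallel programming model. MPI (Message Passing Interface) is a message passing interface released in May 1994. It defines the syntax and semantics of the core library routines used to write message passing applications in C and Fortran, and it has several notable properties. First, MPI provides a portable programming interface and a reliable communication interface; it allows memory-to-memory copies to be avoided and communication to be overlapped, and it has good communication performance. Second, it can be used transparently in heterogeneous systems, that is, it runs on processors of different architectures. Third, the interface MPI provides differs little from existing message passing interfaces (such as PVM and NX), yet it is more flexible and runs on more platforms. Finally, MPI is a standard that does not prescribe how the lower layers must be implemented, which gives vendors implementing the standard more freedom and makes MPI more extensible.

1.1 The Most Basic MPI

MPI is a complex system. It contains 128 functions (1994 standard); the MPI-2 standard, revised in 1997, has more than 200. About 30 are in common use, yet a complete MPI program can be written with only the six most basic functions:
MPI_INIT        initialize MPI
MPI_FINALIZE    terminate the MPI computation
MPI_COMM_SIZE   determine the number of processes
MPI_COMM_RANK   determine the identifier of the current process
MPI_SEND        send a message
MPI_RECV        receive a message

1.2 Collective Communication

MPI provides powerful collective communication facilities for communication within a group of processes. Collective communication generally serves three purposes: communication, synchronization, and computation. Communication transfers data within the group and comes in three forms: one-to-many, many-to-one, and many-to-many. Synchronization brings all processes in the group to the same point in their execution. Computation is more complex: it performs a given operation on the supplied data. The collective functions fall into five classes: synchronization (Barrier), broadcast (Bcast), gather (Gather), scatter (Scatter), and reduction (Reduce).

1.3 Communicators

A communicator consists of a process group and a process context. The process group is a finite, ordered set of processes. The context is a system-assigned super-tag that safely separates communications that would otherwise conflict. Communicators provide independent, safe message passing in MPI: different communication libraries use separate communicators, so communication operations in different communicators do not interfere with each other.

2 Introduction to the Poisson Equation

2.1 Definition of the Poisson Equation

The Poisson equation is a partial differential equation in mathematics:

$\Delta\varphi = f$

where $\Delta$ denotes the Laplace operator, and $f$ and $\varphi$ may be real- or complex-valued functions on a manifold. When the manifold is a Euclidean space, the Laplace operator is usually written $\nabla^2$, so the Poisson equation is usually written

$\nabla^2\varphi = f$

In a two-dimensional Cartesian coordinate system, the Poisson equation can be written

$\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right)\varphi(x, y) = f(x, y)$

2.2 Difference Discretization of the Two-Dimensional Poisson Equation

Consider the first boundary value problem of the Poisson equation on the region $\Omega = (0, a) \times (0, b)$ [equation missing in source]. Divide $\Omega$ uniformly into $m$ and $n$ equal parts along the $x$ and $y$ axes, so that the step size is $a/m$ in the $x$ direction and $b/n$ in the $y$ direction, and denote the grid nodes by $(x_i, y_j)$ $(i = 1, \ldots, m-1,\ j = 1, \ldots, n-1)$. Let $u_{ij}$ denote the difference approximation of $u$ at the node $(x_i, y_j)$. The discretized difference equations, their boundary conditions, and the rearranged form are [equations (3)-(8) missing in source].

Schemes (3)-(7) can be solved by various iterative methods. Commonly used ones include successive over-relaxation, the conjugate gradient method, the preconditioned conjugate gradient method, alternating methods, and multigrid methods. Among these, the Jacobi iterative method has long attracted attention because it is simple, practical, and easy to parallelize. The Jacobi iteration for schemes (3)-(8) is [equation missing in source].

3 Test Data and Analysis of Results

We now study a simple class of Poisson equations [condition and analytic solution missing in source]. We use domain decomposition and solve the equation in parallel in each subdomain with five-point difference discretization and the Jacobi iterative method. The hardware test environment is a 16-node HP high-performance cluster connected by Gigabit Ethernet. Each node has two Intel Xeon 2.66 GHz processors, 16 GB of memory, a 72 GB SAS disk, and an NFS shared file system. The software environment is Red Hat Enterprise Linux 4.6 with kernel version 2.6.9-67; the C compiler is Intel C++ 12.0 and the MPI version is Intel MPI 4.0.

Table 1 lists the problem sizes tested and the execution times of the serial computation. In general, the serial execution time of a problem is shorter than the single-machine execution time of its parallel version, mainly because the latter includes the overhead introduced by parallelization.

Table 1 Problem size and serial program execution time
Problem size    Serial execution time
400×400         3.0000e+00
800×800         4.9000e+01
1600×1600       2.0000e+02

Table 2 lists the problem sizes tested and the multiprocessor execution times. The performance of a parallel algorithm is assessed mainly through speedup and parallel efficiency. We take the ratio of the single-processor execution time to the multiprocessor execution time as the speedup, and define parallel efficiency as the ratio of the speedup to the number of CPUs. Figure 1 shows the parallel speedup for the different problem sizes. Figure 2 shows the parallel efficiency for the different problem sizes.

Table 2 Multiprocessor program execution time for different problem sizes
Problem size    Parallel execution time
                4 node        9 node        16 node
400×400         2.5862e+00    3.0134e+00    3.2024e+00
800×800         7.5078e+00    4.9370e+00    4.7353e+00
1600×1600       5.2700e+01    1.8333e+01    9.2666e+00

Figure 1 shows that for a matrix of order 400 the problem is small: the speedup declines steadily as the number of nodes grows, and the maximum speedup of 1.16 is obtained at node=4. For a matrix of order 800 the problem is fairly large: the speedup is always greater than 1 and grows almost linearly. For a matrix of order 1600 the problem is large: as nodes are added the speedup first increases gradually and then drops quickly. The maximum speedup of 2.9655 is obtained at node=9, and the speedup at node=16 is greater than at node=4, but Figure 2 shows that the parallel efficiency has dropped by 61.6%. It can be predicted that speedup and parallel efficiency will continue to fall as more nodes are added. Figure 2 shows that when the problem is relatively small (matrix order below 800), parallel efficiency decreases as the number of nodes increases, but the larger the problem, the higher the parallel efficiency.

4 Conclusion

The experimental data show that it is hard to obtain ideal parallel efficiency and speedup for this class of Poisson equations. The main reason is the problem size. If the problem is small (for example, a matrix of order 400 or 800), the parallel workload is small and a few processors suffice; with too many processors, optimal load balance is hard to achieve and the processors are not fully used. Conversely, if the problem is large (for example, a matrix of order 1600), more processors are needed. As the number of processors increases, however, the speedup of the parallel algorithm declines after its peak, and the parallel efficiency declines as well.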
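Because the difference equations of the paper did not survive in this copy, the following C sketch only illustrates the general technique it names: one Jacobi sweep over the five-point discretization. It rests on assumptions that are not taken from the paper: the equation is written as u_xx + u_yy = f, the grid is square with the same spacing h in both directions, boundary values are fixed (first boundary value problem), and the function name jacobi_sweep and the row-major array layout are hypothetical.

```c
#include <stddef.h>

/* One Jacobi sweep for the five-point scheme of  u_xx + u_yy = f
 * on a uniform (n+1) x (n+1) grid with spacing h (assumed equal in x and y).
 *
 * u_old  values of the previous iterate, row-major, (n+1)*(n+1) entries
 * u_new  receives the new interior values; its boundary entries are not
 *        written, so the caller must have stored the boundary values there
 * f      right-hand side sampled at the grid nodes
 *
 * Returns the largest absolute change of any interior value, which a caller
 * can compare against a tolerance to decide when to stop iterating. */
double jacobi_sweep(const double *u_old, double *u_new, const double *f,
                    size_t n, double h)
{
    size_t stride = n + 1;
    double max_change = 0.0;

    for (size_t i = 1; i < n; i++) {
        for (size_t j = 1; j < n; j++) {
            size_t k = i * stride + j;
            /* Average of the four neighbours, corrected by the source term. */
            double v = 0.25 * (u_old[k - stride] + u_old[k + stride] +
                               u_old[k - 1] + u_old[k + 1] - h * h * f[k]);
            double change = v > u_old[k] ? v - u_old[k] : u_old[k] - v;

            if (change > max_change)
                max_change = change;
            u_new[k] = v;
        }
    }
    return max_change;
}
```

In a domain-decomposed MPI version of the kind the paper describes, each process would typically own a block of rows, call a sweep like this on its block, and exchange its edge rows with neighbouring processes before the next sweep. Each new value depends only on the previous iterate, which is why the Jacobi method parallelizes easily.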
Research on MPI Program Optimization Techniques on Multi-core Clusters

Authors: Wang Jie; Zhong Lujie; Zeng Yu
Journal: Computer Science
Year (volume), issue: 2011 (038) 010
Pages: 281-284 (4 pages)
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Graduate University of the Chinese Academy of Sciences, Beijing 100049; Beijing Computing Center, Beijing 100005
Language: Chinese

Abstract: The new features of multi-core processors make the memory hierarchy of multi-core clusters more complex and also open up new optimization space for MPI programs. Researchers have proposed many methods and techniques for optimizing MPI programs on multi-core clusters. We tested the communication performance of three different multi-core clusters and evaluated several generally applicable optimization techniques on Intel and AMD multi-core clusters: hybrid MPI/OpenMP, tuning of MPI runtime parameters, and optimization of MPI process placement. The experimental results and the performance of the optimizations are also analyzed.
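A Layer-Level Parallel Polygon Union Algorithm Based on Cluster MPI (Fan Junfu)

A Layer-Level Parallel Polygon Union Algorithm Based on Cluster MPI
Fan Junfu 1,2, Ma Ting 1, Zhou Chenghu 1, Ji Min 3, Zhou Yuke 1, Xu Tao 1,2
(1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101; 2. University of Chinese Academy of Sciences, Beijing 100049; 3. Geomatics College, Shandong University of Science and Technology, Qingdao 266590)

Abstract: In a cluster environment, parallel polygon union based on the MPI parallel programming model and the OGC Simple Features specification has to handle the many-to-many mapping between the features of the overlaid layers. Polygons that are spatially adjacent are not necessarily consecutive in the feature sequence, so tasks cannot be assigned to child nodes by feature sequence, which makes parallel task mapping difficult.
This paper studies parallel polygon union algorithms in a cluster environment. By comparing how the two kinds of polygon mapping relationship in overlay analysis affect parallelization, it proposes six parallel task mapping strategies based on R-tree spatial indexing, precise spatial queries in MySQL, and the MPI communication mechanism, and it analyzes and compares the strengths and weaknesses of the six strategies experimentally.
The results show that the direct union strategy with R-tree pre-filtering has the highest serial efficiency of all the algorithms and excellent parallel performance.
Although pre-filtering with precise spatial queries in MySQL is relatively time-consuming, it effectively filters out polygons that do not actually intersect and thus makes the union operation more efficient.
Therefore, in a cluster MPI environment, pre-filtering strategies based on the R-tree and on precise spatial queries in MySQL are an effective way to solve the parallel task mapping problem and to implement layer-level parallel polygon union.
Keywords: polygon union; pre-filtering; task mapping; parallel computing; MPI communication
DOI: 10.3724/SP.J.1047.2014.00517

1 Introduction

Overlay analysis of vector data splits the original features to generate new features, which combine the attributes of the original two or more layers [1-2].
Polygon overlay analysis comprises four basic operations (difference, intersection, union, and symmetric difference) and derived operations such as inverse intersection, update, identity, and spatial join. It has high algorithmic complexity and is computationally time-intensive [3], and it is the core problem of vector data overlay analysis [4-6].
In GIS analysis, polygon overlay union is used to merge vector buffers; in cartographic generalization it is distinct from the simplifying merge of polygons. It is widely used in land-use analysis, cadastral management, and many other spatial analysis tools, in mapping, and in industry applications.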
MPI's Reduction Operations in Clustered Wide Area Systems

Thilo Kielmann, Rutger F.H. Hofman, Henri E. Bal, Aske Plaat, Raoul A.F. Bhoedjang
Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
{kielmann,rutger,bal,aske,raoul}@cs.vu.nl

Abstract: The emergence of meta computers and computational grids makes it feasible to run parallel programs on large-scale, geographically distributed computer systems. Writing parallel applications for such systems is a challenging task which may require changes to the communication structure of the applications. MPI's collective operations (such as broadcast and reduce) allow for some of these changes to be hidden from the applications programmer. We have developed MagPIe, a library of collective communication operations optimized for wide area systems. MagPIe's algorithms are designed to send the minimal amount of data over the slow wide area links, and to only incur a single wide area latency. This paper discusses MPI's collective reduction operations. Compared to systems that do not take the topology into account, such as MPICH, large performance improvements are possible. For larger messages, best performance is achieved when the reduction function is associative.
On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, operations execute up to 8 times faster than MPICH; application kernels improve by up to a factor of 3. Due to the structure of our algorithms, the advantage increases for higher wide area latencies.

I. INTRODUCTION

Several research projects pursue the idea of integrating computing resources at different locations into a single, powerful parallel system. Metacomputing projects like Globus [15] and Legion [17] build the software infrastructure that makes such an integration possible. An important problem, however, is how to write parallel programs that run efficiently on metacomputers (or computational grids [16]). The key difference with traditional parallel programs is that communication between distant computers can be orders of magnitude slower than that between the processors within a parallel machine. Wide-area networks typically have a latency and bandwidth that is a factor 100-1000 worse than that of local interconnects.
Earlier research showed that,for many applications,it is possible to overcome the slowness of wide-area links by optimizing programs at the application level[5],[13], [26].Metacomputers typically consist of several clus-ters(parallel machines like MPPs or networks of worksta-tions),connected by slow wide-area links.They typically have a hierarchical structure with slow and fast links.Pro-grammers can take this hierarchical structure into account and minimize the amount of traffic over the slow wide-area links[26].Such optimizations,however,can complicate metacom-puter programming significantly.The goal of our work is to hide the hierarchical structure as much as possible from the programmer,by implementing the optimizations in a communication library.This works especially well for collective communication primitives,such as found in MPI.Current implementations of MPI(for example, MPICH[2])are designed for“flat”systems and run in-efficiently on wide-area,hierarchical systems.We have designed and implemented a new library,called M AG PI E, whose collective communication routines are optimized for hierarchical systems.In an earlier paper,we described the general design of M AG PI E and the implementation and performance of some collective operations[20].In this paper,we dis-cuss in more detail the reduction operations:,,and2II.A LGORITHM D ESIGNM AG PI E implements wide area optimal algorithms for all of MPI’s collective operations.Wefirst outline the gen-eral structure of M AG PI E’s algorithms.Then,we discuss how associativity of the reduction operators influences the applicability of this structure to the reduction operations.A.General Algorithm StructureM AG PI E’s goal is to minimize completion time of a collective operation,which we define as the moment at which all processors have received all messages that be-long to that operation.The performance of collective com-munication operations on a wide area system is dominated by the time spent on the wide area 
links; local communication plays a minor role. In a collective operation, all processors have to communicate with each other, so the completion time cannot be less than the wide area latency. (On interconnects with varying latencies the highest latency dominates completion time. To simplify our analysis, we assume that all wide-area links have the same bandwidth and latency.)

In the design of our wide area algorithms, we have used the following two conditions:

1. Every sender-receiver path used by an algorithm contains at most one wide area link.
2. Data items only travel to those clusters that need them; no data item travels multiple times to the same cluster.

Condition (1) ensures that the wide area latency contributes at most once to an operation’s completion time. Condition (2) prevents waste of precious wide area bandwidth. We call algorithms that adhere to both conditions wide area optimal. Reduction operations have a high potential for optimization, by computing partial reductions locally in each cluster. The applicability of this optimization depends on associativity and commutativity of the reduction operation. We will discuss this issue in detail below.

In previous work [20], we showed how MPI’s collective operations for synchronization and data exchange can be implemented by wide area optimal algorithms. In Section III we will show how MPI’s reduction operations can be implemented accordingly.

We distinguish between the completion time t_s of a message send and the completion time t_r of the matching receive. We assume that messages are sent asynchronously: a message send completes when the message has been injected into the network. Note that t_s only depends on message size; t_r additionally depends on network bandwidth and latency. These performance characteristics determine the shape of the optimal communication graph [7], [19], [25]. Local communication contributes a negligible amount to the overall completion time; for local communication we use graph shapes that
best fit the needs of the operations, like binary or binomial trees. According to [7], binomial trees are optimal when t_r is not much larger than t_s. However, for wide area communication, where t_r is much larger than t_s, the optimal shape is a one-level flat tree [7].

Fig. 1. Wide Area Optimal Communication Graphs: (a) symmetric operation, (b) asymmetric operation.

Thus we have two communication graphs: the intra-cluster graph that connects the processors within a single cluster, and the inter-cluster graph that connects the different clusters. To interface both graphs, we designate a coordinator node for each cluster. Notice that the one-level tree satisfies condition (1). Condition (2) depends on the semantics of the actual operation, as will be discussed further in Section III.

The optimal inter-cluster graph shape depends on the values for latency, bandwidth, message size, and the number of clusters. Computing optimal graph shapes therefore requires run time instrumentation (see, for example, [22]). MagPIe does not yet perform this analysis. In Section III we show that MagPIe’s combination of one-level trees and binary/binomial trees fits the wide area case well enough to outperform MPICH’s algorithms in our tests.
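Condition (1) can be made concrete with a small model. The following Python sketch is an illustration added for exposition, not MagPIe code. It assumes that ranks are assigned to clusters in consecutive blocks (the helper `cluster_of` is hypothetical) and uses one common binomial tree layout, which is not necessarily the one used by MPICH. It counts how many wide area links the worst root-to-leaf path crosses; each such link adds one wide area latency to the completion time.

```python
def cluster_of(rank, procs_per_cluster):
    """Hypothetical mapping: ranks are assigned to clusters in consecutive blocks."""
    return rank // procs_per_cluster


def binomial_parent(rank):
    """Parent of `rank` in a binomial tree rooted at 0: clear the highest set bit."""
    return rank - (1 << (rank.bit_length() - 1))


def wan_hops_binomial(nprocs, procs_per_cluster):
    """Largest number of wide area links on any root-to-leaf path of the binomial tree."""
    worst = 0
    for rank in range(1, nprocs):
        hops, node = 0, rank
        while node != 0:
            parent = binomial_parent(node)
            # An edge is a wide area link when it connects two different clusters.
            if cluster_of(node, procs_per_cluster) != cluster_of(parent, procs_per_cluster):
                hops += 1
            node = parent
        worst = max(worst, hops)
    return worst


def wan_hops_flat(nclusters):
    """One-level flat tree between coordinators: the root reaches every
    remote coordinator directly; the local trees add no wide area links."""
    return 1 if nclusters > 1 else 0
```

Under these assumptions, 16 processes in 8 clusters of 2 give a worst path with three wide area links in the topology-unaware binomial tree, whereas the flat inter-cluster tree always has one.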
To outline the general structure of our algorithms, we distinguish two kinds of algorithms. In the asymmetric algorithms one dedicated process, called the root, either acts as sender (in one-to-many algorithms) or as receiver (in many-to-one algorithms [...]) [...] chosen arbitrarily. Figure 1 shows the communication graphs for symmetric and asymmetric operations; in the latter the root process is marked with a circle.

Asymmetric algorithms perform two steps. The nodes first send to their coordinator and then the coordinators send to the root. Symmetric algorithms perform three steps. First, nodes send to their coordinators. Second, the coordinators perform an all-to-all exchange, and third, coordinators send to their nodes. We use this basic structure for implementing wide area optimal algorithms for MPI’s reduction operations.

B. Non-Associative Reduction Operators

We now turn to the group of reduction operators. MPI’s reduction operations are parameterized by the actual operator that is applied to the data. MPI assumes all operators (such as sum, product, minimum, or maximum) to be associative; in addition, the programmer can mark them as commutative. Concerning execution order, the MPI standard states that “any implementation can take advantage of associativity, or associativity and commutativity in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition” [24]. Accordingly, application programmers must not rely on a specific execution order. For yielding reproducible results, the standard adds: “It is strongly recommended that [...]

[...] remainder of this paper, the wide area latency is set to 10 ms and the bandwidth is set to 1.0 MByte/s. (All latencies in this paper are one-way.) On most metacomputers, wide area latency will be significantly higher, and, since MagPIe’s algorithms have been optimized for long latency, the advantage of MagPIe over MPICH will be even higher.

A. Basic Collective Operations

MagPIe implements all 14 collective communication operations defined by version 1.1 of the MPI standard. The operations for synchronization and data exchange have been presented in [20]. Some of them are used as building blocks for the reduction operations. We briefly discuss how they are implemented by MPICH and by MagPIe. We compare against version 1.1 of MPICH.

A.1 Broadcast

In [...], MPI also has personalized broadcast operations: [...].

MPICH implements the simplest possible scatter algorithm in which the root process linearly and directly sends the pieces of data to the respective nodes. This algorithm is wide area optimal, and MagPIe does not improve on it. The gather operation and the personalized all-to-all exchange are implemented similarly.

With [...]. In MagPIe’s algorithm, the coordinators first gather data locally into a sub-vector of their local cluster. They then gather the complete data vector by exchanging their partial vectors with each [...]

Fig. 3. [...] (plots of completion time in ms; first panel: 2 clusters)

With [...] MagPIe’s algorithm that exploits user-asserted associativity first reduces the cluster-local data on the coordinator nodes. In a final wide area step, these partial results are sent to the root process, which in turn combines them to compute the overall result. This algorithm is wide area optimal; it adheres to the single-latency condition, and also sends the minimal amount of data between clusters. The comparison of this algorithm with MPICH’s tree algorithm is presented in Figure 3 for the case of 64 KB data vectors. Results are shown for 2, 4, and 8 clusters, with a total number of 16, 24, 32, and 40 processors, equally distributed over the clusters. The run times for MagPIe are shown in black while MPICH’s times are shown in grey.
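The two-step reduction for associative operators described above can be sketched as a small simulation. This is a minimal Python illustration under stated assumptions, not MagPIe’s implementation: the function name and its parameters are hypothetical, and the root is assumed to act as the coordinator of its own cluster, so its cluster contributes no wide area message.

```python
from functools import reduce as fold
import operator


def two_level_reduce(values, cluster_ids, root, op=operator.add):
    """Reduce `values` (one per process) to `root`, cluster by cluster.

    Returns (result, wide_area_messages). Assumes `op` is associative; it
    must also be commutative unless ranks are numbered consecutively within
    clusters, because partial results are combined in cluster order.
    """
    # Step 1: each cluster reduces its local data on a coordinator (no wide area traffic).
    partial = {}
    for value, cid in zip(values, cluster_ids):
        partial[cid] = value if cid not in partial else op(partial[cid], value)

    # Step 2: every coordinator outside the root's cluster sends one partial result to the root.
    root_cluster = cluster_ids[root]
    wide_area_messages = sum(1 for cid in partial if cid != root_cluster)

    # The root combines the partial results into the overall result.
    result = fold(op, (partial[cid] for cid in sorted(partial)))
    return result, wide_area_messages
```

In this model the number of wide area messages equals the number of clusters minus one, independent of the number of processes per cluster, and every message crosses one wide area link.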
For non-associative operators, MagPIe implements two more algorithms: one for short messages and one for long messages. The first algorithm gathers all data at the root, which then applies the operation in the prescribed order, irrespective of the network topology. This satisfies only the single-latency condition (by re-using [...]

[...], but delivers the reduction result to all processes. MPICH provides a so-called naive implementation by first reducing to the process with rank zero and subsequently broadcasting the result. This implementation always yields correct results, but is not wide area optimal. Sequential compositions of two collective operations need at least two times the wide area latency. Furthermore, both basic operations (as implemented by MPICH) are not wide area optimal themselves.

Again MagPIe provides three algorithms. For associative operators, the processes in each cluster first reduce to [...]

Fig. 4. [...] (plots of completion time in ms; first panel: 2 clusters)

[...] is called, which delivers all data items at all processes with a single wide area latency. Then, all processes compute the reduction result locally. The completion time compared to MPICH is shown in Figure 4 (top) for 1-byte messages.
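Why the evaluation order matters can be shown without MPI. The Python sketch below is an illustration only, with hypothetical function names: it models the short-message strategy described above, in which all contributions are delivered to all processes and each process then reduces them locally in rank order. For contrast, `clustered_reduce` regroups the operands per cluster, as an associativity-exploiting algorithm would.

```python
import operator


def ordered_reduce(values, op):
    """Apply `op` strictly left to right, in rank order."""
    result = values[0]
    for value in values[1:]:
        result = op(result, value)
    return result


def allreduce_by_allgather(values, op):
    """Every process receives all contributions, then reduces them locally in rank order."""
    gathered = list(values)  # stands in for the data exchange between all processes
    return [ordered_reduce(gathered, op) for _ in values]


def clustered_reduce(values, cluster_sizes, op):
    """Reduce per cluster first, then combine the partial results (regroups the operands)."""
    partials, start = [], 0
    for size in cluster_sizes:
        partials.append(ordered_reduce(values[start:start + size], op))
        start += size
    return ordered_reduce(partials, op)


# Floating point addition is not associative for these operands.
data = [1e20, 1.0, -1e20, 1.0]
in_order = allreduce_by_allgather(data, operator.add)
regrouped = clustered_reduce(data, [2, 2], operator.add)
```

Here every simulated process obtains the same rank-order result, while the regrouped evaluation over two clusters of two processes yields a different value. This is the reason the cluster-local optimization is applied only when associativity has been asserted.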
For long messages, MPICH’s approach is used. Because MagPIe implements a wide area optimal broadcast, it is still faster than MPICH.

B.3 [...]

[...] is similar to [...] the resulting data vector is implicitly scattered among the processes such that each process gets a different part of the reduced data vector. MPICH implements a sequential composition, a reduce followed by a scatter operation. Analogous to [...]. Using a communicator object with re-ordered process ranks, the algorithm is only applicable to commutative operators. For non-associative operators, an [...]

The operation performed by [...]. It is based on [...]

Fig. 6. [...] (plots of completion time in ms; first panel: 2 clusters)

[...] with an average message size of 32 KByte for the broadcast of Householder vectors, and 8192 calls to [...]

TABLE I. COMPARISON OF ALGORITHMS
Column headings: MPICH; Operation; Optim.; Shape; Implementation
Cell contents in extraction order: near flat; Gather; no; bin; yes; flat; near flat; Allgather; no; bin; bin; Reduce; Bcast; yes; flat; near flat; Allgather; no; bin; flat; Reduce; ScatterV; yes; flat; near flat; Allgather; no; bin; yes; flat

Table residue for the application kernels:
Column headings: MMUL 2000 × 2000; Reduce; Bcast/Scan; MagPIe; MPICH; MagPIe; MPICH
sequential run time (s): 335843390
parallel run time (s): 9630018.712.130.520.1
wide area msgs: 3360855971551903.245357.522.4917.20
wide area latency (total): 4000200001615.5

Fig. 8. Application Runtimes for [...] (y-axis: seconds; panel: 32 cpus)

[...] furthermore allows all matrices to be fully distributed, which significantly lowers the memory requirements of the parallel version.

For the runtimes shown in Figure 8, we enabled associative optimization. When the matrix elements are known to yield neither overflow nor underflow situations, this optimization is legal. If not, then even a result computed without this optimization could not be trusted, because overflows or underflows may still occur, although always the same ones. With the matrices being of size 2000 × 2000, [...] is called 4000 times with a message size of 16000 bytes. MagPIe outperforms MPICH by up to a factor of three. Due to large memory requirements and the related cache effects (the algorithm uses
straightforward, un-blocked loops), MMUL achieves superlinear speedups. Relative to a single processor, the speedup on a single cluster of 64 processors is 100, both for MagPIe and MPICH. When the 64 processors are divided over 8 wide area clusters, MagPIe achieves a speedup of 64.8, MPICH only 18.4. Again, Table II shows that MagPIe sends fewer messages and less data across wide area links. It also chains fewer latencies.

The TRI kernel repeatedly invokes a solver for tridiagonal equation systems. Following [21], the solver treats the equations as a system of recurrences and uses [...] can be used. Some additional numerical instability had to be introduced to get the right functionality without additional communication. In our measurements, the tridiagonal matrix had size 1000000. The solver calls [...] 1000 times with a message size of 48 bytes. MagPIe outperforms MPICH [...]

Fig. 9. Application Runtimes for [...] (y-axis: seconds; panel: 32 cpus)

TABLE III. WIDE AREA SYSTEM RUN TIMES
Column headings: 40 processors; MagPIe; MPICH
Values as extracted: 325490246112313810

For reduction operations with short data vectors, MagPIe uses algorithms that are nearly wide area optimal. For non-associative operators, when execution order matters for correctness, MagPIe’s algorithms are no longer wide area optimal, although by using better broadcast and scan algorithms, performance is still improved over MPICH for most reduction operations. Although the MPI standard recognizes the possibility of exploiting associativity for optimization, it does not specify a standard way to implement this feature.

For a wide area latency of 10 ms and a bandwidth of 1 MByte/s, the measurements of the individual operations show improvements over MPICH that vary between a factor of 3 and 8, depending on the operation, the number of clusters, and the message length. For the application kernels, QR, MMUL, and TRI, MagPIe consistently outperformed MPICH by a factor of 2 or more.

The current version of MagPIe assumes a static topology. In future work the best tree shape can be computed dynamically
based on wide area latency and bandwidth as measured at run time, as well as message size.

Writing correct and efficient parallel programs is hard. Having to take non-uniformity of the interconnect into account makes it even harder. MPI’s collective operations provide a convenient abstraction that can be implemented efficiently for a non-uniform interconnect. For problems that heavily use the collective operations, the MagPIe library offers transparent optimization, and completely hides non-uniformity. The system is available as a plug-in for MPICH from http://www.cs.vu.nl/albatross/.

ACKNOWLEDGEMENTS

This work is supported in part by a SION grant from the Dutch research council NWO, and by a USF grant from the Vrije Universiteit. We thank Kees Verstoep and John Romein for keeping the DAS in good shape, and Cees de Laat (University of Utrecht) for getting the wide area links of the DAS up and running.

REFERENCES

[1] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model - One step closer towards a realistic model for parallel computation. In Proc. Symposium on Parallel Algorithms and Architectures (SPAA), pages 95–105, Santa Barbara, CA, July 1995.
[2] Argonne National Laboratory. MPICH implementation home page. /Projects/mpi/mpich/, 1995.
[3] S. Bae, D. Kim, and S. Ranka. Vector Reduction and Prefix Computation on Coarse-Grained, Distributed-Memory Parallel Machines. In IPPS-98, International Parallel Processing Symposium, 1998.
[4] H. Bal, R. Bhoedjang, R. Hofman, C. Jacobs, K. Langendoen, T. Rühl, and F. Kaashoek. Performance Evaluation of the Orca Shared Object System. ACM Transactions on Computer Systems, 16(1), 1998.
[5] H. Bal, A. Plaat, M. Bakker, P. Dozy, and R. Hofman. Optimizing Parallel Applications for Wide-Area Clusters. In IPPS-98 International Parallel Processing Symposium, pages 784–790, Apr. 1998.
[6] M. Banikazemi, V. Moorthy, and D. Panda. Efficient Collective Communication on Heterogeneous Networks of Workstations. In International Conference on Parallel
Processing, pages 460–467, Minneapolis, MN, 1998.
[7] M. Bernaschi and G. Iannello. Collective Communication operations: experimental results vs. theory. Concurrency: Practice and Experience, 10(5):359–386, April 1998.
[8] M. Bernaschi, G. Iannello, and M. Lauria. Efficient Implementation of Reduce-Scatter in MPI. Submitted for publication, 1998. Available from http://grid.grid.unina.it/~iannello/dapaa.htm.
[9] R. Bhoedjang, T. Rühl, and H. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53–60, 1998.
[10] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: A Gigabit-per-second Local Area Network. IEEE Micro, 15(1):29–36, 1995.
[11] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. M.I.T. Press, 1990.
[12] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 1–12, San Diego, CA, May 1993.
[13] I. Foster, J. Geisler, W. Gropp, N. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke. Wide-Area Implementation of the Message Passing Interface. Parallel Computing, 24(11), 1998.
[14] I. Foster and N. Karonis. A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems. In SC’98, Orlando, FL, Nov. 1998.
[15] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. Journal of Supercomputer Applications, 11(2):115–128, Summer 1997.
[16] I. Foster and C. Kesselman, editors. The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.
[17] A. Grimshaw and W. A. Wulf. The Legion Vision of a Worldwide Virtual Computer. Comm. ACM, 40(1):39–45, Jan. 1997.
[18] G. Iannello, M. Lauria, and S. Mercolino. Cross-Platform Analysis of Fast Messages for Myrinet. In Proc. Workshop CANPC’98, number 1362 in Lecture Notes in Computer Science, pages 217–231, Las Vegas, Nevada, January 1998. Springer.
[19] R. M. Karp, A. Sahay, E. E. Santos, and K. E. Schauser. Optimal Broadcast and Summation in the LogP model. In Proc. Symposium on Parallel Algorithms and Architectures (SPAA), pages 142–153, Velen, Germany, June 1993.
[20] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems. Submitted for publication, 1998. Available from http://www.cs.vu.nl/albatross/.
[21] F. T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, 1992.
[22] B. Lowekamp and A. Beguelin. ECO: Efficient Collective Operations for Communication on Heterogeneous Networks. In International Parallel Processing Symposium, pages 399–405, Honolulu, HI, 1996.
[23] Message Passing Interface Forum. MPI: A Message Passing Interface Standard. International Journal of Supercomputing Applications, 8(3/4), 1994.
[24] Message Passing Interface Forum. MPI Standard document, Version 1.1. Available from /Projects/mpi/standard.html, 1995.
[25] J.-Y. L. Park, H.-A. Choi, N. Nupairoj, and L. M. Ni. Construction of Optimal Multicast Trees Based on the Parameterized Communication Model. In Proc. Int. Conference on Parallel Processing (ICPP), volume I, pages 180–187, 1996.
[26] A. Plaat, H. Bal, and R. Hofman. Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects. In High Performance Computer Architecture HPCA-5, Orlando, FL, Jan. 1999.
[27] R. Wolski. Dynamically Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In 6th High-Performance Distributed Computing, Aug. 1997. The network weather service is at /.