Parallel Computing Methods for Phase-Field Simulation
Hybrid MPI + OpenMP Approach to Improve the Scalability of a Phase-Field-Crystal Code
Reuben D. Budiardja (reubendb@utk.edu), ECSS Symposium, March 19, 2013
Project Background
- Project team (University of Michigan): Katsuyo Thornton (P.I.), Victor Chan
- Phase-field-crystal (PFC) formulation to study the dynamics of various metal systems
- Original in-house code written in C++
- Has been run on 2D and 3D systems
- Solves multiple Helmholtz equations, a reduction, then an explicit time step
Goal
- Scale to solve larger problems
  – Weak scaling: maintain the time-to-solution with an increasing number of processes and a fixed problem size per process
- Decrease the time to solution to 1 sec / time step
  – Strong scaling: decrease the time-to-solution with an increasing number of processes and a fixed problem size
  – Exploit other parallelism (with OpenMP?)
  – Investigate a better preconditioner
  – Try a different method (library?) to solve the equations
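For reference, the two scaling regimes above correspond to the standard efficiency metrics below (these definitions are assumed here, not spelled out in the slides), where $T(P)$ is the wall-clock time per time step on $P$ processes:

$$E_{\text{strong}}(P) = \frac{T(1)}{P\,T(P)} \;\;\text{(fixed total problem size)}, \qquad E_{\text{weak}}(P) = \frac{T(1)}{T(P)} \;\;\text{(fixed problem size per process)}.$$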
Solving the Helmholtz Equations
$\nabla^2 \phi + k_n \phi = f$
- Originally used GMRES with an Algebraic Multigrid (AMG) preconditioner from HYPRE
- In 3D, the discretization matrix is large and may become indefinite, which makes the system difficult to solve and drives up the iteration count
- Poor weak-scaling results
- Prohibitively long run times in the indefinite-matrix case
- Memory requirements grow with the number of iterations
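To see where the indefiniteness comes from, consider the 1D analogue on a uniform grid with spacing $h$ and Dirichlet boundaries (a standard illustration, not taken from the slides, writing $k$ for one of the $k_n$). The centered-difference approximation of $\nabla^2$ has eigenvalues $-\frac{4}{h^2}\sin^2\!\left(\frac{m\pi h}{2}\right)$, so the discretized operator has eigenvalues

$$\lambda_m = k - \frac{4}{h^2}\sin^2\!\left(\frac{m\pi h}{2}\right), \qquad m = 1, \dots, N,$$

which take both signs whenever $k$ falls strictly between the smallest and largest eigenvalues of the discrete $-\nabla^2$. An indefinite system of this kind is exactly the regime in which Krylov methods such as GMRES tend to need a strong preconditioner and many iterations.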
Complex Iterative Jacobi Solver

Hadley, G. R., "A complex Jacobi iterative method for the indefinite Helmholtz equation," J. Comput. Phys. 203 (2005) 358-370.

- Replaced HYPRE with a modification of the standard Jacobi method
- The update $\phi^{n+1}$ is computed from $\phi^{n}$, with the spatial derivatives approximated by centered differences
- Easily parallelized, with low memory requirements
- The convergence rate depends on resolution but is roughly constant from problem to problem, so a larger problem (at similar resolution) should not require more iterations
- A draft version was quickly implemented by the project team (Victor Chan) and tested
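As an illustration, here is a minimal sketch of what one sweep of such a solver can look like for one of the Helmholtz equations, $\nabla^2\phi + k\phi = f$, on a uniform 3D grid. It assumes real-valued fields, a uniform grid spacing h, and a damping factor omega; all names (Field, jacobi_sweep, etc.) are illustrative rather than taken from the project code, and Hadley's method additionally uses a complex iteration parameter that is not reproduced here.

#include <cstddef>
#include <vector>

// Simple 3D field with a one-cell ghost layer on each side.
struct Field {
  std::size_t nx, ny, nz;              // interior points per dimension
  std::vector<double> data;            // size (nx+2)*(ny+2)*(nz+2)
  Field(std::size_t nx_, std::size_t ny_, std::size_t nz_)
    : nx(nx_), ny(ny_), nz(nz_), data((nx_ + 2) * (ny_ + 2) * (nz_ + 2), 0.0) {}
  double& at(std::size_t i, std::size_t j, std::size_t k) {
    return data[(i * (ny + 2) + j) * (nz + 2) + k];
  }
  double at(std::size_t i, std::size_t j, std::size_t k) const {
    return data[(i * (ny + 2) + j) * (nz + 2) + k];
  }
};

// One damped Jacobi sweep for the centered-difference discretization of
//   laplacian(phi) + k*phi = f.
void jacobi_sweep(Field& phi_new, const Field& phi, const Field& f,
                  double k, double h, double omega) {
  const double D = k - 6.0 / (h * h);  // diagonal of the discrete operator
  for (std::size_t i = 1; i <= phi.nx; ++i)
    for (std::size_t j = 1; j <= phi.ny; ++j)
      for (std::size_t l = 1; l <= phi.nz; ++l) {
        // Off-diagonal (neighbor) part of the 7-point stencil.
        const double nbr = (phi.at(i - 1, j, l) + phi.at(i + 1, j, l) +
                            phi.at(i, j - 1, l) + phi.at(i, j + 1, l) +
                            phi.at(i, j, l - 1) + phi.at(i, j, l + 1)) / (h * h);
        // Solve the stencil equation for the center value, then damp.
        const double update = (f.at(i, j, l) - nbr) / D;
        phi_new.at(i, j, l) = (1.0 - omega) * phi.at(i, j, l) + omega * update;
      }
}

Because each point is updated only from the previous iterate, a sweep needs no synchronization beyond a halo exchange per iteration, which is what makes the method cheap in memory and straightforward to parallelize.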
Profiling the Code with CrayPAT
- Measure before you optimize
- Can use sampling or tracing
- Using CrayPAT is simple: load the module, re-compile, build the instrumented code, re-run
- CrayPAT can trace only a specified group, e.g. mpi, io, heap, fftw, ...

> module load perftools
> make clean
> make
> pat_build -g mpi pfc_jacobi.exe
> aprun -n 48 pfc_jacobi.exe+pat
> pat_report -o profile.txt <output_data>.xf
That Should Have Worked!
CrayPAT Workaround
- Use the API for "fine-grained" instrumentation
- Add PAT_region_{begin/end} calls to most subroutines
- After narrowing the hotspots down to a couple of major subroutines, split the labels into "computation" and "communication"
- The communication subroutine eventually dominates beyond a certain MPI size
#include <pat_api.h>
...
void Complex_Jacobi(...) {
  ...
  int PAT_ID, ierr;

  // Label the halo exchange as a "communication" region.
  PAT_ID = 41;
  ierr = PAT_region_begin(PAT_ID, "communication");
  MPI_Internal_Communicate(...);
  MPI_Boundary_Communicate(...);
  ierr = PAT_region_end(PAT_ID);

  // Label the residual update as a "computation" region.
  PAT_ID = 42;
  ierr = PAT_region_begin(PAT_ID, "computation");
  for (int i = 1; i < size.L1 + 1; i++) {
    for (int j = 1; j < size.L2 + 1; j++) {
      for (int k = 1; k < size.L3 + 1; k++) {
        residual(i, j, k) = (1.0 / D) * (. . .);
      }
    }
  }
  ierr = PAT_region_end(PAT_ID);
  ...
}
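Given the hybrid MPI + OpenMP theme of the talk, the "computation" region above is the natural candidate for threading. The sketch below shows how the loop nest of the illustrative jacobi_sweep from earlier might be annotated, assuming each grid-point update depends only on the previous iterate; the directives and clauses actually used in the project are not shown in this excerpt, so this is an assumption-laden sketch rather than the project's implementation.

// Threaded variant of the earlier jacobi_sweep sketch (compile with -fopenmp).
// Each (i, j, l) update reads only the previous iterate phi, so iterations are
// independent and can be split among threads; MPI still handles the halo
// exchange between ranks before each sweep.
void jacobi_sweep_omp(Field& phi_new, const Field& phi, const Field& f,
                      double k, double h, double omega) {
  const double D = k - 6.0 / (h * h);  // diagonal of the discrete operator
  #pragma omp parallel for collapse(2) schedule(static)
  for (std::size_t i = 1; i <= phi.nx; ++i)
    for (std::size_t j = 1; j <= phi.ny; ++j)
      for (std::size_t l = 1; l <= phi.nz; ++l) {
        const double nbr = (phi.at(i - 1, j, l) + phi.at(i + 1, j, l) +
                            phi.at(i, j - 1, l) + phi.at(i, j + 1, l) +
                            phi.at(i, j, l - 1) + phi.at(i, j, l + 1)) / (h * h);
        phi_new.at(i, j, l) = (1.0 - omega) * phi.at(i, j, l)
                            + omega * (f.at(i, j, l) - nbr) / D;
      }
}

With MPI ranks across nodes and OpenMP threads within a rank, this is the usual division of labor for a hybrid port of a stencil code.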