CUDA Massively Parallel Programming


Problem Instance   Avg. # non-zeros per row   Description
G3_circuit         4.8                        AMD circuit simulation
thermal2           7.0                        FEM, steady state thermal problem
kkt_power          6.2                        Optimal power flow, nonlinear optimization
Freescale1         5.0                        Freescale circuit simulation
SPMV Throughput on GTX280
[Figure: G200 block diagram. Repeated TPC blocks, each grouping three SMs. Components labeled inside an SM: Streaming Processor (SP), Special Function Unit (SFU), Double Precision FPU.]
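GPGPU
Core idea
• Describe general-purpose computing problems in a graphics language
• Map the data onto the vertex or fragment processors
However
• Hardware resources are underused
• Memory access patterns are severely restricted
• Debugging and troubleshooting are difficult
• Advanced graphics-processing and programming skills are required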
NVIDIA G200 Architecture
[Figure: G200 block diagram, TPCs each grouping three SMs]
Static Timing Analysis. ICCAD. 2019.
Static Timing Analysis Results on GTX280
Instance
CPU (#paths per second)
Traditional GPU Architecture
Graphics program → Vertex processors → Fragment processors → Pixel operations → Output image
The GPU's Massive Computing Power
• Data-level parallelism: uniform computation across the data
Problem Instance   # rows     # columns   # non-zeros
Lin                256000     256000      1766400
t2em               921632     921632      4590832
ecology1           1000000    1000000     4996000
cont11             1468599    1961394     5382999
sls                1748122    62729       6804304
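As a quick arithmetic check on the figures above, the average number of non-zeros per row is the non-zero count divided by the row count. The short Python sketch below does this for the five instances listed; the variable names are illustrative and do not come from the slides.

```python
# (name, number of rows, number of non-zeros) copied from the table above.
instances = [
    ("Lin", 256000, 1766400),
    ("t2em", 921632, 4590832),
    ("ecology1", 1000000, 4996000),
    ("cont11", 1468599, 5382999),
    ("sls", 1748122, 6804304),
]

# Average non-zeros per row, rounded to one decimal place.
avg_nnz_per_row = {name: round(nnz / rows, 1) for name, rows, nnz in instances}

for name, avg in avg_nnz_per_row.items():
    print(f"{name:10s} {avg:.1f}")
```

The rounded results (6.9, 5.0, 5.0, 3.7, 3.9) agree with the per-row averages the deck reports for these instances.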
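Hybrid Computing Model
CUDA: an integrated CPU + GPU C application
CPU: executes the sequential code
GPU: a massively data-parallel coprocessor
• Executes large numbers of fine-grained threads in bulk
CPU Serial Code
GPU Parallel Code: kernel 0
...
CPU Serial Code
Concurrent execution!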
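The alternation of serial CPU code and parallel kernels can be imitated in plain Python. The sketch below is only an analogy: a single worker thread stands in for the GPU, `gpu_kernel` and `cpu_serial_step` are hypothetical names, and no GPU code runs. It shows the host submitting a kernel, continuing with its own work, and synchronizing only when it needs the result.

```python
from concurrent.futures import ThreadPoolExecutor


def gpu_kernel(data):
    # Stand-in for a data-parallel kernel: the same operation on every element.
    return [v * v for v in data]


def cpu_serial_step(values):
    # Stand-in for sequential host code.
    return sum(values)


# One worker thread plays the role of the device (an assumption of this sketch).
with ThreadPoolExecutor(max_workers=1) as device:
    data = list(range(8))                          # CPU serial code: prepare input
    pending = device.submit(gpu_kernel, data)      # launch "kernel 0", returns at once
    checksum = cpu_serial_step(data)               # CPU work overlaps with the kernel
    squares = pending.result()                     # wait for kernel 0 to finish
    pending = device.submit(gpu_kernel, squares)   # launch "kernel 1"
    fourth_powers = pending.result()               # wait for kernel 1 to finish

print(checksum, squares, fourth_powers)
```

The host blocks only at `result()`, which is the point where it depends on the kernel's output.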
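Problem Instance   GPU (GFLOPS)   Speed-up
Lin                9.23           36.04
t2em               12.41          43.44
ecology1           9.03           37.43
cont11             10.66          33.84
sls                10.10          36.49
G3_circuit         8.86           41.45
thermal2           8.97           41.89
kkt_power          5.70           22.01
Freescale1         11.56          40.37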
SMVP Application: Static Timing Analysis
Adapted from Ramalingam, A. et al. An Accurate Sparse Matrix Based Framework for Statistical
sls: 3.9 avg. non-zeros per row; Large least-squares problem

Problem Instance   # rows     # columns   # non-zeros
G3_circuit         1585478    1585478     7660826
thermal2           1228045    1228045     8580313
kkt_power          2063494    2063494     12771361
Freescale1         3428755    3428755     17052626
// Smith-Waterman gene sequence comparison
Black Scholes: 4.7 GOptions/sec // option pricing model
VMD: 290 GFLOPS // molecular dynamics visualization
Problem Instances for Sparse Matrix Vector Product (SMVP)
[Figure: Memory bandwidth (GB/s) by year, 2003 to 2007, GPU vs. CPU. GPU points: NV30, NV40, G71, G80, G80 Ultra. CPU points: Northwood, Prescott EE, Woodcrest, Harpertown.]
• Dedicated memory channels
• Memory latency is hidden effectively
General Purpose Computing on GPU (GPGPU)
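CUDA: Compute Unified Device Architecture
A general-purpose parallel computing model
Single-instruction, multiple-data (SIMD) execution
• All threads execute the same code (1000s of threads in flight)
• Abundant parallel compute resources work on different data, hiding memory latency
• Raises the ratio of computation to communication
• Coalesces memory accesses to adjacent addresses
• Fast thread switching: 1 cycle on the GPU vs. ~1000 cycles on the CPU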
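The "same code, different data" idea in the list above can be sketched by emulating a kernel launch in plain Python. This is an illustration and not CUDA: `launch` and `saxpy_kernel` are hypothetical names, and the threads run one after another here rather than in parallel. Each emulated thread derives a global index from its block and thread indices, the usual CUDA indexing pattern, and touches exactly one element.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, n, a, x, y, out):
    # Every thread runs this same function; only its indices differ.
    i = block_idx * block_dim + thread_idx
    # Guard: the last block may contain threads past the end of the data.
    if i < n:
        # Thread i touches element i, so neighbouring threads touch
        # neighbouring addresses (the pattern that allows coalescing).
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    # Sequential stand-in for a grid of grid_dim blocks of block_dim threads.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered

launch(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y, out)
print(out)
```

With `n = 10` and `block_dim = 4` the launch uses three blocks, so two of the twelve emulated threads fall outside the data and are filtered out by the bounds check.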
Problem Instance   Avg. # non-zeros per row   Description
Lin                6.9                        Large sparse Eigenvalue problem
t2em               5.0                        Electromagnetic problems
ecology1           5.0                        Circuit theory applied to animal/gene flow
cont11             3.7                        Linear programming
GPU Parallel Code: kernel 1
...
CUDA Success Stories
CUDA Performance
BLAS3: 127 GFLOPS // basic linear algebra: matrix-matrix
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec // computational electrodynamics
SSEARCH: 5.2 Gcells/sec
Problem Instance   CPU (GFLOPS)
Lin                0.26
t2em               0.29
ecology1           0.24
cont11             0.31
sls                0.28
G3_circuit         0.21
thermal2           0.21
kkt_power          0.26
Freescale1         0.29
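For orientation, a sparse matrix-vector product and a GFLOPS figure for it can be sketched as follows. The slides do not state the storage format or how operations were counted, so both the CSR layout and the convention of two floating-point operations (one multiply, one add) per stored non-zero are assumptions of this sketch. `spmv_csr` and `gflops` are hypothetical helper names.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    # y = A @ x for a matrix A stored in compressed sparse row (CSR) form.
    # Row r owns the entries vals[row_ptr[r] : row_ptr[r + 1]].
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[r] = acc
    return y


def gflops(nnz, seconds):
    # Assumed convention: 2 floating-point operations per stored non-zero.
    return 2.0 * nnz / seconds / 1e9


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]

y = spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0])
print(y)

# Hypothetical timing, chosen only to show the unit conversion.
rate = gflops(nnz=500_000_000, seconds=1.0)
print(rate)
```

Under this convention the rate depends only on the non-zero count and the elapsed time, which is why matrices with similar sizes but different sparsity patterns can still report different throughput.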