CUDA Programming Optimization: Lecture Notes
CUDA编程优化
CUDA Programming Optimization

Outline
1. GPU
Why Add GPUs? Accelerate Applications
CPU: raise the clock frequency; speed up with multicore processors and simultaneous multithreading (SMT)
GPU: tremendous computational horsepower and very high memory bandwidth
New release – P100
CPU vs. GPU: Low Latency or High Throughput?
[Figure: the CPU die devotes most of its area to CONTROL logic and a large L2 CACHE beside a few ALUs and DRAM; the GPU die devotes most of its area to 100s of ALUs, with a small L2 CACHE and DRAM]
CPU
• low-latency access to cached data
• strong at speculative execution, out-of-order execution, etc.
• reduces latency within each thread
• most of the chip area is devoted to caches

GPU
• data-level parallel (DLP) computation
• many more cores emphasize massive multithreaded throughput and de-emphasize latency
• most of the chip area is devoted to compute logic
How to implement?
Small Changes, Big Speed-up
Program code = the compute-intensive code, offloaded to the GPU for parallel acceleration, + the remaining serial code, which stays on the CPU (a minimal sketch follows).
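As a minimal illustration of this split (not from the lecture; the kernel, names, and sizes here are hypothetical), a compute-intensive element-wise loop can be moved into a CUDA kernel while the surrounding program logic stays on the CPU:

    #include <cuda_runtime.h>

    // Hypothetical hot loop: scale every element of a large array.
    __global__ void scale(float *data, float factor, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x; // one element per thread
        if (idx < n)
            data[idx] *= factor;
    }

    void run(float *d_data, int n) // d_data already resides on the device
    {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;    // round up to cover all n elements
        scale<<<blocks, threads>>>(d_data, 2.0f, n); // GPU: the compute-intensive part
        cudaDeviceSynchronize();                     // wait before serial CPU code continues
        // ... the remaining serial code runs on the CPU as before ...
    }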
Outline
2. CUDA
What is CUDA?
GPU Programming Languages
CUDA Software Environment

Programming Model – Threads Hierarchy
• Thread (线程)
• Thread block (线程块)
• Grid (栅格)

Programming Model – Memory Hierarchy
[Figure: each thread has its own per-thread local memory (fast); each thread block has per-block shared memory; all thread blocks, across Grid 0 and Grid 1, share global memory (slow)]
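A minimal sketch (not from the lecture) of how this hierarchy appears in code: built-in indices identify a thread within its block and within the grid, and __shared__ declares per-block shared memory. It assumes a launch with 256 threads per block:

    // Hypothetical kernel illustrating the thread and memory hierarchy.
    __global__ void hierarchy_demo(const float *in, float *out)
    {
        __shared__ float tile[256];  // per-block shared memory
        int local  = threadIdx.x;                           // index within the block
        int global = blockIdx.x * blockDim.x + threadIdx.x; // index within the grid
        tile[local] = in[global];    // stage data in fast shared memory
        __syncthreads();             // every thread in the block reaches this point
        out[global] = tile[local];   // write back to (slow) global memory
    }
    // e.g. hierarchy_demo<<<numBlocks, 256>>>(d_in, d_out);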
Outline
3. Optimization

Optimize GPU programs
Levels of Optimization
• pick a good algorithm
• follow efficient coding practices
• architecture-level optimization
• instruction-level bit-twiddling micro-optimization

APOD – Systematic Optimization
Analyze → Parallelize → Optimize → Deploy
A – analyze: profile and time the application (tools: • nsight • nvvp • nvprof); a timing sketch follows
P – parallelize
O – optimize
D – deploy
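A minimal sketch of the measurement side of the analyze step, using only the CUDA event API (the helper name and callback shape are hypothetical, not from the lecture):

    #include <cuda_runtime.h>

    // Hypothetical helper: measure elapsed GPU time around whatever
    // work `launch` enqueues, using CUDA events.
    float time_gpu_ms(void (*launch)(void))
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);   // mark the start on the default stream
        launch();                    // enqueue the kernel(s) under measurement
        cudaEventRecord(stop, 0);    // mark the end
        cudaEventSynchronize(stop);  // block until the stop event has occurred
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }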
Case Study: Matrix Transpose C code
[Figure: the element (a, b) at row i, column j of the input matrix moves to (b, a) at row j, column i of the output]

    const int N = 1024; // matrix size is NxN

    void transpose_CPU(float in[], float out[])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j + i * N] = in[i + j * N]; // out(j, i) = in(i, j)
    }

    int main(int argc, char **argv)
    {
        int numbytes = N * N * sizeof(float);
        float *in = (float *)malloc(numbytes);
        float *out = (float *)malloc(numbytes);
        fill_matrix(in); // extra fill-in-matrix function, not listed here
        transpose_CPU(in, out);
        return 0;
    }

An Initial CUDA Version

    #include <stdio.h>
    #include "gputimer.h"

    __global__ void transpose_serial(float in[], float out[])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j + i * N] = in[i + j * N];
    }

    int main(int argc, char **argv)
    {
        int numbytes = N * N * sizeof(float);
        float *in = (float *)malloc(numbytes);  // on Host
        float *out = (float *)malloc(numbytes);
        float *d_in, *d_out;                    // on Device
        cudaMalloc(&d_in, numbytes);
        cudaMalloc(&d_out, numbytes);
        cudaMemcpy(d_in, in, numbytes, cudaMemcpyHostToDevice);

        GpuTimer timer;
        timer.Start();
        transpose_serial<<<1, 1>>>(d_in, d_out); // launch kernel on a single thread
        timer.Stop();
        cudaMemcpy(out, d_out, numbytes, cudaMemcpyDeviceToHost);
        printf("transpose_serial: %g ms.\n", timer.Elapsed());
    }

2nd CUDA Version – parallel per row
Launch configuration: <<<1, N>>>, one block of N threads, one thread per row.

    __global__ void transpose_parallel_per_row(float in[], float out[])
    {
        int i = threadIdx.x;                // each thread handles one row i
        for (int j = 0; j < N; j++)
            out[j + i * N] = in[i + j * N];
    }
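A sketch of how this kernel slots into the same main() as the serial version; the <<<1, N>>> launch configuration is from the slide, and the timing lines mirror the first version:

    GpuTimer timer;
    timer.Start();
    transpose_parallel_per_row<<<1, N>>>(d_in, d_out); // one block of N threads
    timer.Stop();
    cudaMemcpy(out, d_out, numbytes, cudaMemcpyDeviceToHost);
    printf("transpose_parallel_per_row: %g ms.\n", timer.Elapsed());

With N = 1024, all rows are copied concurrently by one thread block instead of serially by a single thread.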