Graphics Hardware and GPU Architecture (revised 2010-04)

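GPGPU computing on traditional static-branch architectures
GPU-based MPEG-2 motion estimation algorithm
Shader Unit improvements for general-purpose computing
An HLSL program is mapped onto the following three modules:
PSU (Pixel Shader Unit)
TMU (Texture Mapping Unit)
BP (Branch Processor)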
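Turing completeness
An infinitely long TAPE. (Data memory)
A read/write HEAD. (Load/store)
A set of control rules, the TABLE. Based on the machine's current state and the symbol in the cell under the head, it determines the head's next action and updates the state register, moving the machine into a new state. (Program, including branch and add)
A state register. (Program counter)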
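The four components listed above can be sketched as a tiny simulator. This is a minimal illustration, not something taken from the slides: the machine (binary increment), the state names `kCarry` and `kHalt`, and the function `increment_binary` are all hypothetical, and the infinite tape is approximated by a string with one spare blank cell.

```cpp
#include <map>
#include <string>
#include <utility>

// One row of the control TABLE: symbol to write, head movement, next state.
struct Rule {
    char write;
    int move;  // -1 = left, 0 = stay, +1 = right
    int next_state;
};

constexpr int kCarry = 0;  // hypothetical state: a carry is still propagating
constexpr int kHalt = 1;   // hypothetical state: the machine stops

// Adds one to a binary number written on the tape, most significant bit first.
std::string increment_binary(std::string tape) {
    const char blank = '_';
    // TABLE: (current state, symbol under the head) -> action.
    const std::map<std::pair<int, char>, Rule> table = {
        {{kCarry, '1'}, {'0', -1, kCarry}},
        {{kCarry, '0'}, {'1', 0, kHalt}},
        {{kCarry, blank}, {'1', 0, kHalt}},
    };
    tape.insert(tape.begin(), blank);           // finite stand-in for the infinite TAPE
    int head = static_cast<int>(tape.size()) - 1;  // HEAD starts on the last bit
    int state = kCarry;                         // state register
    while (state != kHalt) {
        const Rule& rule = table.at({state, tape[head]});
        tape[head] = rule.write;  // store
        head += rule.move;        // move the head
        state = rule.next_state;  // branch to the next state
    }
    if (tape.front() == blank) tape.erase(tape.begin());
    return tape;
}
```

Calling `increment_binary("1011")` walks the head leftwards, flipping trailing ones until it can absorb the carry. The point of the analogy is that a processor with data memory, load/store, branching and arithmetic has the same ingredients.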
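SGI InfiniteReality
Released in 1996; the pinnacle of SGI graphics hardware. Tile-based rendering.
HP VISUALIZE FX6
Commercially released in 1998; a textbook example of a graphics subsystem built from separate boards.
Raster accelerator board (Rast), geometry accelerator board (GA), texture accelerator board (TA)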
Intel 740
Fragment processors
Pixel operations
Output image
DX10: Unified Shader Architecture
What did DX10 bring?
USA (Unified Shader Architecture)
Float32 precision
Unlimited dynamic flow control
4096 temp registers
Shader programs longer than 64K instructions
3Dlabs P10
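NV30 (NVIDIA GeForce FX, the GeForce 5 series)
DX9: the beginning of general-purpose computing on GPUs
Dynamic branching (Dynamic Control) was first introduced with DX9.0c-class (NV4X) GPUs.
Z-buffer + render-to-texture + clip can be used to emulate dynamic branching.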
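One common reading of the emulation trick mentioned above is a two-pass scheme: a cheap pass evaluates the branch condition and records it as a depth mask, then the expensive pass is drawn with the depth test enabled so that rejected fragments never run it. The CPU sketch below only models that control flow. `expensive_shader`, `ShadeResult` and `shade_with_depth_mask` are hypothetical names, and the condition (`pixel > 0`) is an arbitrary assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a costly fragment program.
float expensive_shader(float x) { return x * x; }

struct ShadeResult {
    std::vector<float> pixels;  // image after the second pass
    std::size_t shaded;         // fragments that ran the expensive shader
};

ShadeResult shade_with_depth_mask(std::vector<float> pixels) {
    // Pass 1: a cheap condition pass fills the "Z-buffer" mask
    // (true = the fragment will pass the depth test in pass 2).
    std::vector<bool> depth_pass(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        depth_pass[i] = pixels[i] > 0.0f;
    }
    // Pass 2: fragments rejected by the mask are culled before shading,
    // so the expensive program runs only on the "taken" side of the branch.
    std::size_t shaded = 0;
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        if (!depth_pass[i]) continue;
        pixels[i] = expensive_shader(pixels[i]);
        ++shaded;
    }
    return {pixels, shaded};
}
```

On real hardware the saving depends on the depth test rejecting fragments before shading, which is why this works as a coarse substitute for a per-pixel `if` on GPUs that lack dynamic branching.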
Traditional GPU architecture
Split-Shader Architecture (SSA)
Post-vertex Cache
Hierarchical-Z
Fast-Z Clear
Z/Color Compression
Prefetch Texture Cache
Matrox Parhelia 512
What is stream computing? What problems does it mainly solve? What restrictions did NV50 add on top of traditional stream computing?
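Basic concepts of stream computers
Stream computing originated in traditional DSP applications. Typical applications:
Video encoding and decoding
Digital image processing
Pattern recognition
Computer graphics
Software-defined radio
What these stream-computing applications have in common: they can be implemented as hardware/software pipelines.
Characteristics of traditional stream computing: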
Stream processing is especially suitable for applications that exhibit three application characteristics:
Compute Intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
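As a rough worked example of the compute-intensity ratio defined above, the sketch below counts operations and memory references per output element of an n x n matrix multiply, first naively and then with on-chip tiling. The function names are hypothetical, and the tiled figure assumes n is a multiple of the tile width and that every fetched operand is reused from on-chip storage.

```cpp
// Arithmetic operations per global memory reference.
double compute_intensity(double arithmetic_ops, double memory_refs) {
    return arithmetic_ops / memory_refs;
}

// Naive n x n matrix multiply: each output element performs n multiplies
// and n adds, and fetches 2n operands from memory.
double naive_matmul_intensity(int n) {
    return compute_intensity(2.0 * n, 2.0 * n);
}

// Tiled variant (assumption: n is a multiple of tile): each operand fetched
// from memory is reused `tile` times from on-chip storage, so only
// 2n / tile memory references remain per output element.
double tiled_matmul_intensity(int n, int tile) {
    return compute_intensity(2.0 * n, 2.0 * n / tile);
}
```

Under these assumptions the naive kernel sits at 1:1 while a 16-wide tile reaches 16:1, which is why on-chip staging matters for reaching the ratios that stream processors are designed around.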
NV40 (NVIDIA GeForce 6)
How does the NV40 architecture differ from that of SGI InfiniteReality?
NV47 (NVIDIA GeForce 7)
The pinnacle of the split shader unit architecture
Multi-channel memory technology?
Graphics program
Vertex processors
.loc 28 86 0
ld.global.f32 %f2, [%rd23+0];
st.shared.f32 [%rd14+0], %f2;
.loc 28 87 0
ld.global.f32 %f3, [%rd19+0];
st.shared.f32 [%rd15+0], %f3;
.loc 28 90 0
bar.sync 0;
Texture Cache Access
%f132, %f80;
mul.f32 %f133, %f82, %f73;
Data Parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
Traditional stream processors
Imagine Stream Processor
The Imagine architecture is organized around a three-tiered storage bandwidth hierarchy: a streaming memory system (2.1GB/s), a 128KB stream register file (25.6GB/s), and direct forwarding of results among arithmetic units via local register files (435GB/s).
Imagine is able to sustain performance of up to 18.3GOPS on key applications.
Imagine is designed to fit on a 2.56cm^2 0.18um CMOS chip and to operate at 400MHz.
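To put the three Imagine bandwidth tiers in proportion, the short sketch below divides the figures quoted in these slides (2.1, 25.6 and 435 GB/s) against each other. The constant and function names are hypothetical; only the three numbers come from the text.

```cpp
// Bandwidth figures quoted for the three Imagine storage tiers, in GB/s.
constexpr double kMemoryGBs = 2.1;               // streaming memory system
constexpr double kStreamRegisterFileGBs = 25.6;  // 128KB stream register file
constexpr double kLocalRegisterFileGBs = 435.0;  // local register files

// How many times more bandwidth the inner tier offers than the outer one.
constexpr double bandwidth_ratio(double inner, double outer) {
    return inner / outer;
}
```

With these figures the stream register file offers roughly 12 times the memory bandwidth, and the local register files roughly 17 times that again (about 207 times the memory system). An application therefore needs to reuse data at about those rates in the inner tiers to keep the arithmetic units busy.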
// store the sub-matrix of B
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load the matrices from device memory
// to shared memory; each thread loads
// one element of each matrix
AS(ty, tx) = A[a + wA * ty + tx];
BS(ty, tx) = B[b + wB * ty + tx];
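Traditional stream processors
Bandwidth demands of different applications at each level of the memory hierarchy:
The CUDA stream computing model
How does CUDA's memory architecture differ from that of a traditional stream processor?
How is this reflected in PTX?
Shared Memory
From device memory to shared memory
CUDA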
// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
Data Locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.
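The producer-consumer locality described above can be sketched on the CPU by processing the stream in short strips, so that the intermediate stream lives only in a small buffer and is never written back to main memory. This is an illustrative assumption about how such a pipeline might be organized: `produce`, `consume`, `run_pipeline` and the strip size `kStrip` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kStrip = 2;  // hypothetical on-chip buffer capacity

float produce(float x) { return 2.0f * x; }  // hypothetical first kernel
float consume(float x) { return x + 1.0f; }  // hypothetical second kernel

std::vector<float> run_pipeline(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float strip[kStrip];  // stand-in for on-chip stream storage
    for (std::size_t base = 0; base < in.size(); base += kStrip) {
        const std::size_t count = std::min(kStrip, in.size() - base);
        // Kernel 1 produces the intermediate stream once...
        for (std::size_t i = 0; i < count; ++i) {
            strip[i] = produce(in[base + i]);
        }
        // ...kernel 2 reads it once, and it is never needed again.
        for (std::size_t i = 0; i < count; ++i) {
            out[base + i] = consume(strip[i]);
        }
    }
    return out;
}
```

Only the input and the final output touch the large arrays; the intermediate values stay in the strip buffer, which is the kind of traffic that on-chip stream storage is intended to absorb.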
ic.expert @gmail.com
Summary
History of graphics hardware
Early GPGPU computing
Modern GPU architecture
Programming models for modern GPUs
Outlook
The OpenGL 3D graphics pipeline
Texture
Vertex Processing
Rasterizer
Fragment Processing
Framebuffer
threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

float u = x / (float) width;
float v = y / (float) height;

// transform coordinates
u -= 0.5f;
v -= 0.5f;
float tu = u*cosf(theta) - v*sinf(theta) + 0.5f;
float tv = v*cosf(theta) + u*sinf(theta) + 0.5f;

// read from texture and write to global memory
g_odata[y*width + x] = tex2D(tex, tu, tv);
}
#endif // #ifndef _SIMPLETEXTURE_KERNEL_H_
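The coordinate arithmetic in the kernel fragment above can be checked on the CPU without a texture unit. The sketch below reproduces only the normalize, re-center and rotate steps; the function name `rotated_texcoord` is hypothetical, and the `tex2D` fetch is deliberately left out.

```cpp
#include <cmath>
#include <utility>

// Returns the rotated texture coordinate (tu, tv) that the thread for
// pixel (x, y) would sample, for an image of width x height pixels.
std::pair<float, float> rotated_texcoord(unsigned int x, unsigned int y,
                                         int width, int height, float theta) {
    // Normalize to [0, 1) and move the origin to the image center.
    float u = x / static_cast<float>(width) - 0.5f;
    float v = y / static_cast<float>(height) - 0.5f;
    // Rotate by theta around the center, then shift the origin back.
    float tu = u * std::cos(theta) - v * std::sin(theta) + 0.5f;
    float tv = v * std::cos(theta) + u * std::sin(theta) + 0.5f;
    return {tu, tv};
}
```

Two properties are easy to confirm by hand: with `theta = 0` the coordinate is unchanged, and the center pixel maps to (0.5, 0.5) for any angle.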
.tex .u64 tex;
mov.f32
// Declaration of the shared memory array As used to
// store the sub-matrix of A
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
// Declaration of the shared memory array Bs used to
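Commercially released in 1998. Intel's first graphics card: a single-chip design with no geometry acceleration unit.
S3G Savage 2000
Commercially released in 1999. The first consumer-grade graphics card with an integrated T&L (transform and lighting) geometry unit.
The traditional GPU graphics pipeline
Around 2000, graphics accelerators completed the consolidation from discrete components and separate boards into single-chip graphics hardware.
Graphics hardware then turned a new page and set out on the road to programmability.
1999: DirectX 7 (added T&L)
2000: DirectX 8 (added vertex shaders)
2002: DirectX 9 (added pixel shaders)
2004: DirectX 9.0c (added dynamic branching)
2006: DirectX 10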
// Synchronize to make sure the matrices are loaded
__syncthreads();
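The matrix-multiplication fragments in these slides follow a tiling pattern: stage one tile of A and one tile of B in fast storage, synchronize, then accumulate partial products from the staged tiles. The CPU-only sketch below mimics that structure so it can be run without a GPU. It is an assumption-laden illustration rather than the original kernel: `tiled_matmul` and `kBlockSize` are hypothetical, matrices are square and row-major, and tiles that hang over the edge are zero-padded.

```cpp
#include <vector>

constexpr int kBlockSize = 2;  // hypothetical tile width, stands in for BLOCK_SIZE

// C = A * B for square n x n row-major matrices, computed tile by tile.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    for (int by = 0; by < n; by += kBlockSize) {
        for (int bx = 0; bx < n; bx += kBlockSize) {
            // Loop over the tiles of A and B needed for this block of C.
            for (int k0 = 0; k0 < n; k0 += kBlockSize) {
                // Stage the tiles (the role of the shared arrays As and Bs).
                float As[kBlockSize][kBlockSize] = {};
                float Bs[kBlockSize][kBlockSize] = {};
                for (int ty = 0; ty < kBlockSize; ++ty) {
                    for (int tx = 0; tx < kBlockSize; ++tx) {
                        if (by + ty < n && k0 + tx < n)
                            As[ty][tx] = A[(by + ty) * n + (k0 + tx)];
                        if (k0 + ty < n && bx + tx < n)
                            Bs[ty][tx] = B[(k0 + ty) * n + (bx + tx)];
                    }
                }
                // On the GPU, __syncthreads() would separate staging from use.
                for (int ty = 0; ty < kBlockSize; ++ty) {
                    for (int tx = 0; tx < kBlockSize; ++tx) {
                        if (by + ty >= n || bx + tx >= n) continue;
                        for (int k = 0; k < kBlockSize; ++k)
                            C[(by + ty) * n + (bx + tx)] += As[ty][k] * Bs[k][tx];
                    }
                }
            }
        }
    }
    return C;
}
```

Each staged element is read `kBlockSize` times from the small arrays instead of from the large ones, which is the reuse that shared memory provides on the GPU.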
PTX
$Lt_0_2818:
//<loop> Loop body line 71, nesting depth: 1, estimated iterations: unknown
__global__ void transformKernel(float* g_odata, int width, int height, float theta)
{
// calculate normalized texture coordinates
unsigned int x = blockIdx.x*blockDim.x +
What is DX10 still missing?
Communication
Memory access
Shader Unit improvements
FIFO-based Shader Unit (traditional GPUs)
Shader Unit improvements
Thread-based Shader Unit (DX10-class GPUs)
NV50 architecture diagram
Programming model: stream computing