§4 Systolic Array Processors


Characteristics

Practical realizations (e.g. Intel iWARP) use quite general processors

Enable variety of algorithms on same hardware
Data transfer directly from register to register across channel
General purpose systems work well for same algorithms (locality etc.)

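Motivated by the needs of specific, computation-heavy algorithms in signal/image processing and scientific computing
Proposed in 1978 by the Chinese-American computer scientist H. T. Kung of Carnegie Mellon University as the systolic array processor
Offers a high degree of computational parallelism
Principle of the systolic array structure
General-purpose systolic array structure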

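Algorithms suited to systolic architectures

Linear algebra: matrix-matrix and matrix-vector multiplication, solving systems of linear equations
String search and pattern matching
Digital filters, e.g. one-, two-, and three-dimensional digital filters
Motion estimation in video data compression
Finite-field arithmetic, e.g. elliptic curve operations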
Two Communication Styles
[Figure: two panels. "Systolic communication": three CPUs, each with a Local Memory. "Memory communication": three CPU / Local Memory pairs.]
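\[
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}, \quad
B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}, \quad
C = AB = \begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{pmatrix}
\]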
3 4 2 2 5 3 3 2 5

But dedicated interconnect channels

Specialized, and same problems as SIMD

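Systolic array configurations

One-dimensional linear
Two-dimensional rectangular
Two-dimensional hexagonal
Two-dimensional binary tree
Two-dimensional triangular
Three-dimensional ...
Example

Multiply two 3 x 3 matrices on a two-dimensional systolic array
Each processing element (PE) contains one multiplier and one adder and performs one inner-product (multiply-accumulate) step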
\[
c_{ij} = \sum_{k=1}^{3} a_{ik} b_{kj}, \qquad 1 \le i \le 3, \; 1 \le j \le 3
\]
Matrix Multiplication
a11 a12 a13     b11 b12 b13
a21 a22 a23  *  b21 b22 b23
a31 a32 a33     b31 b32 b33
A with columns 1 & 3 flipped:
a13 a12 a11
a23 a22 a21
a33 a32 a31
B with rows 1 & 3 flipped:
b31 b32 b33
b21 b22 b23
b11 b12 b13
and finally stagger the data sets for input.
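The flip-and-stagger step can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function name `stagger` and the `pad` placeholder are assumptions introduced here. Row i of A is delayed by i ticks and column j of B by j ticks, so matching operands meet in the right processor at the right time. Reversing a stream gives the flipped order drawn in the slides, where the first element to enter sits closest to the array.

```python
def stagger(A, B, pad=0):
    """Build per-tick input streams for an N x N systolic array.

    rows[i][t] is the value entering row i from the left at tick t.
    cols[j][t] is the value entering column j from the top at tick t.
    `pad` marks ticks on which nothing useful enters (hypothetical name).
    """
    n = len(A)
    # Row i of A is delayed by i ticks.
    rows = [[pad] * i + list(A[i]) + [pad] * (n - 1 - i) for i in range(n)]
    # Column j of B is delayed by j ticks.
    cols = [[pad] * j + [B[k][j] for k in range(n)] + [pad] * (n - 1 - j)
            for j in range(n)]
    return rows, cols


A = [["a11", "a12", "a13"], ["a21", "a22", "a23"], ["a31", "a32", "a33"]]
B = [["b11", "b12", "b13"], ["b21", "b22", "b23"], ["b31", "b32", "b33"]]
rows, cols = stagger(A, B, pad="--")

print(rows[0])                   # ['a11', 'a12', 'a13', '--', '--']
print(rows[1])                   # ['--', 'a21', 'a22', 'a23', '--']
print(list(reversed(rows[0])))   # ['--', '--', 'a13', 'a12', 'a11']
print(cols[0])                   # ['b11', 'b21', 'b31', '--', '--']
```

Each stream has 2N - 1 entries, so the last operand enters the array at tick 2N - 2 (counting from 0).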
[Figure: input streams b31 b21 b11 and a13 a12 a11 feeding P1]
Systolic Architectures

Orchestrate (arrange, compose) data flow for high throughput with less memory access
Different from pipelining

Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
Each PE may do something different
VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in regular pattern

    c11 c12 c13
=   c21 c22 c23
    c31 c32 c33
Conventional Method: O(N^3)
For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
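For reference, the conventional triple loop can be written as runnable Python. This is a plain sketch with 0-based indices; the name `matmul_conventional` and the multiplication counter are additions made here to make the N^3 cost visible.

```python
def matmul_conventional(A, B):
    """Multiply two N x N matrices with the textbook triple loop."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    mults = 0  # number of multiply-accumulate steps performed
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C, mults = matmul_conventional(A, B)
print(C)      # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
print(mults)  # 27, i.e. N^3 for N = 3
```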
Systolic Method
[Figure: 3 x 3 grid of processors P1 to P9. Input streams b32 b22 b12 and b33 b23 b13 feed P2 and P3; input streams a23 a22 a21 and a33 a32 a31 feed P4 and P7.]
At every tick of the global system clock, data is passed to each processor from two different directions; the two values are multiplied and the product is added to the running sum held in the processor's register.
This runs in O(N) time instead of O(N^3). To achieve that we need N x N processing units; in this case we need 9.
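The tick-by-tick behaviour can be checked with a small software simulation. The sketch below is one way to model it, not the slides' implementation: it assumes each processor keeps its own sum, values from A move one processor to the right per tick, and values from B move one processor down per tick. The name `systolic_matmul` and the register arrays are illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an N x N systolic array computing C = A * B.

    Returns the result matrix and the number of clock ticks used.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # running sum held in each processor
    a_reg = [[0] * n for _ in range(n)]   # value received from the left
    b_reg = [[0] * n for _ in range(n)]   # value received from above
    ticks = 3 * n - 2                     # last product forms at tick 3N - 3
    for t in range(ticks):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i             # row i of A is staggered by i ticks
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # pass right
                if i == 0:
                    k = t - j             # column j of B is staggered by j ticks
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # pass down
        a_reg, b_reg = new_a, new_b
        # Every processor multiplies its two inputs and accumulates.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, ticks


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C, ticks = systolic_matmul(A, B)
print(C)      # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
print(ticks)  # 7
```

Under these assumptions the 3 x 3 product needs 3N - 2 = 7 ticks, which is linear in N, while the total number of multiplications is still N^3 spread over the nine processors.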
P1 P2 P3
P4 P5 P6
P7 P8 P9
We need to modify the input data, like so: flip columns 1 & 3 of A and rows 1 & 3 of B.
Different from SIMD

Initial motivation

Systolic Architectures
[Figure: "Conventional": memory M connected to a single PE. "Systolic arrays": memory M connected to a chain of PEs.]
Replace a processing element (PE) with an array of PEs without increasing I/O bandwidth
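A rough count makes the bandwidth argument concrete for N x N matrix multiplication. The sketch below rests on two simplifying assumptions that are not from the slides: a single conventional PE fetches both operands from memory for every multiply-accumulate with no reuse, and a systolic array reads each element of A and B from memory exactly once. The function name `operand_reads` is hypothetical.

```python
def operand_reads(n):
    """Compare operand fetches from memory for an N x N matrix product."""
    conventional = 2 * n ** 3   # two operands per multiply-accumulate, no reuse
    systolic = 2 * n ** 2       # each element of A and B enters the array once
    return conventional, systolic, conventional // systolic


print(operand_reads(3))    # (54, 18, 3)
print(operand_reads(100))  # (2000000, 20000, 100)
```

Under these assumptions the array reuses each fetched value N times as it passes between PEs, which is how more computation is done per word of I/O.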