Systolic Array Processors


[Figure: systolic array at clock tick 1, with P1 computing 3*3]
Clock tick: 1
P1 P2 P3 P4 P5 P6 P7 P8 P9
9  0  0  0  0  0  0  0  0
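Configurations of systolic array structures
One-dimensional linear, two-dimensional rectangular, two-dimensional hexagonal, two-dimensional binary tree, two-dimensional triangular, three-dimensional, ...
Example
Multiplying two 3×3 matrices on a two-dimensional systolic array
Each processing element (PE) contains one multiplier and one adder and performs one inner-product computation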
$$
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
$$
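§4 Systolic Array Processors
Developed to meet the needs of specific algorithms in signal/image processing and scientific computing that require very large amounts of computation
Proposed in 1978 by H. T. Kung, a Chinese-American researcher at Carnegie Mellon University
Offers a high degree of computational parallelism
Principles of systolic array structures
General-purpose systolic array structures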
Systolic Architectures
Orchestrate data flow for high throughput with less memory access
Data transfer directly from register to register across channel
Specialized, and same problems as SIMD
General purpose systems work well for same algorithms (locality etc.)
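$$
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\times
\begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}
=
\begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{pmatrix}
$$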
Conventional Method: O(N^3)
For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
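The conventional triple loop can be written as a small runnable Python sketch. The function name `matmul_conventional` and the multiply-accumulate counter are illustrative additions, not part of the original slides; the input is the 3×3 numerical example used elsewhere in this deck.

```python
def matmul_conventional(a, b):
    """Triple-loop product of two square matrices given as lists of lists."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mac_count = 0  # number of multiply-accumulate steps performed
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                mac_count += 1
    return c, mac_count


m = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
product, macs = matmul_conventional(m, m)
print(product)  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
print(macs)     # 27, i.e. N^3 for N = 3
```

On a single processor the 27 multiply-accumulate steps run one after another, which is the cost the systolic method spreads over several processing elements.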
Different from pipelining
Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory
Different from SIMD
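$$
B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}
$$
$$
C = A \cdot B = \begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{pmatrix}
$$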
$$
c_{ij} = \sum_{k=1}^{3} a_{ik} \cdot b_{kj}, \qquad 1 \le i \le 3,\ 1 \le j \le 3
$$
Matrix Multiplication
a13 a12 a11   P1 P2 P3
a23 a22 a21   P4 P5 P6
a33 a32 a31   P7 P8 P9
At every tick of the global system clock, data is passed to each processor from two different directions; the two values are multiplied, and the product is added to the sum held in the processor's register.
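The per-tick behaviour of a single processor can be sketched as follows. This is a minimal model under stated assumptions: the class name `ProcessingElement`, the `tick` method, the use of `None` for "no data this tick", and the forwarding directions (a to the right, b downward) are illustrative choices, not details given in the slides.

```python
class ProcessingElement:
    """Hypothetical model of one PE: a multiplier, an adder, and a register."""

    def __init__(self):
        self.acc = 0  # running inner-product sum held in the register

    def tick(self, a_in, b_in):
        # None stands for "no data on this input during this tick".
        if a_in is not None and b_in is not None:
            self.acc += a_in * b_in
        # The operands are handed on unchanged to the neighbouring PEs
        # (assumed here: a to the right, b downward).
        return a_in, b_in


pe = ProcessingElement()
# Row 1 of A (3 4 2) meets column 1 of B (3 2 3), one pair per tick.
for a, b in [(3, 3), (4, 2), (2, 3)]:
    pe.tick(a, b)
print(pe.acc)  # 23
```

After three ticks the register holds 3*3 + 4*2 + 2*3 = 23, which is the value c11 in the numerical example.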
Two Communication Styles
Systolic communication
[Figure: three CPU blocks and three Local Memory blocks]
Memory communication
[Figure: three CPU blocks and three Local Memory blocks]
Flip columns 1 & 3
a13 a12 a11
a23 a22 a21
a33 a32 a31
Flip rows 1 & 3
b31 b32 b33
b21 b22 b23
b11 b12 b13
and finally stagger the data sets for input.
        b33
    b32 b23
b31 b22 b13
b21 b12
b11
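The flip-and-stagger preparation can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the helper name `skew` and the use of `None` for an idle input slot are assumptions. Streams are listed in time order (index 0 enters the array first), which is why the drawings show the rows and columns reversed: the element that enters first is drawn closest to the array.

```python
def skew(vectors):
    """Delay vector r by r ticks, padding with None so all streams have equal length."""
    n = len(vectors)
    return [[None] * r + list(v) + [None] * (n - 1 - r)
            for r, v in enumerate(vectors)]


a = [["a11", "a12", "a13"], ["a21", "a22", "a23"], ["a31", "a32", "a33"]]
b = [["b11", "b12", "b13"], ["b21", "b22", "b23"], ["b31", "b32", "b33"]]

row_streams = skew(a)                             # rows of A, fed from one side
col_streams = skew([list(c) for c in zip(*b)])    # columns of B, fed from the other

for stream in col_streams:
    print(stream)
# ['b11', 'b21', 'b31', None, None]
# [None, 'b12', 'b22', 'b32', None]
# [None, None, 'b13', 'b23', 'b33']
```

Reading the three column streams at the same time index gives one row of the staggered drawing, for example index 2 yields b31, b22, b13.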
Systolic Architectures
[Figure: Conventional: memory (M) connected to a single PE. Systolic arrays: memory (M) connected to several PEs.]
Replace a processing element (PE) with an array of PEs without increasing I/O bandwidth
Characteristics
Practical realizations (e.g. Intel iWARP) use quite general processors
Enable variety of algorithms on same hardware
But dedicated interconnect channels
Each PE may do something different
Initial motivation
VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in a regular pattern
Systolic Method
This will run in O(N) time! To achieve this we need N × N processing units; here N = 3, so we need 9.
P1 P2 P3
P4 P5 P6
P7 P8 P9
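The O(N) running time can be made concrete with a short derivation. It assumes the usual staggering, which the slides do not state explicitly: row i of A and column j of B are each delayed by one tick per index step, and operands advance one processor per tick. The symbols t and T are introduced here for illustration.

```latex
% Tick at which a_{ik} and b_{kj} meet in the processor at row i, column j
% (ticks counted from 1):
\[
  t(i,j,k) = i + j + k - 2, \qquad 1 \le i, j, k \le N
\]
% The last product is formed in the bottom-right processor with k = N:
\[
  T = t(N,N,N) = 3N - 2, \qquad N = 3 \;\Rightarrow\; T = 7
\]
```

Under these assumptions the 3×3 product finishes after 7 clock ticks, compared with 27 sequential multiply-accumulate steps in the conventional method.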
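We need to modify the input data first: flip it, then stagger it for input.
A numerical example:
$$
\begin{pmatrix} 3 & 4 & 2 \\ 2 & 5 & 3 \\ 3 & 2 & 5 \end{pmatrix}
\times
\begin{pmatrix} 3 & 4 & 2 \\ 2 & 5 & 3 \\ 3 & 2 & 5 \end{pmatrix}
=
\begin{pmatrix} 23 & 36 & 28 \\ 25 & 39 & 34 \\ 28 & 32 & 37 \end{pmatrix}
$$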
Let's try this using a systolic array.
                  5
               2  3
            3  5  2
            2  4
            3
    2 4 3   P1 P2 P3
  3 5 2     P4 P5 P6
5 2 3       P7 P8 P9
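The whole run can be simulated tick by tick. The sketch below is one possible model, not the slides' own implementation: the function name `systolic_matmul`, the register arrays `a_reg` and `b_reg`, and the choice that A moves right while B moves down are assumptions made for illustration.

```python
def systolic_matmul(a, b):
    """Simulate an n x n systolic array; return (result, per-tick trace)."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]        # register of each PE
    a_reg = [[None] * n for _ in range(n)]   # A operand currently at each PE
    b_reg = [[None] * n for _ in range(n)]   # B operand currently at each PE
    trace = []
    for t in range(3 * n - 2):
        new_a = [[None] * n for _ in range(n)]
        new_b = [[None] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i  # row i is delayed by i ticks
                    new_a[i][j] = a[i][k] if 0 <= k < n else None
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # from the left neighbour
                if i == 0:
                    k = t - j  # column j is delayed by j ticks
                    new_b[i][j] = b[k][j] if 0 <= k < n else None
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # from the upper neighbour
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
        trace.append([acc[i][j] for i in range(n) for j in range(n)])
    return acc, trace


m = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
result, trace = systolic_matmul(m, m)
for tick, state in enumerate(trace, start=1):
    print("Clock tick:", tick, state)
print(result)  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

The first trace line shows 9 in P1 and 0 everywhere else, matching the clock tick 1 state in the deck, and the array finishes after 7 ticks.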