Automatic Application-Specific Instruction-Set Extensions

7/16
The 6:2 Compressor: an Alternative to the 6:3 Counter and 6-input GPC
[Figure: a 6:3 counter, a 6-input GPC, and a 6:2 compressor side by side. For the 6:3 counter, all inputs have rank 0 and the outputs have ranks 2, 1, 0.]
10/16
Similarities Between Shared Arithmetic Mode and the 6:2 Compressor
[Figure: the ALM in shared arithmetic mode (ALM inputs, FAs built from LUTs, FAs, ALM outputs) beside the 6:2 compressor (rank-0 inputs, cin,1, cin,0, FAs and an HA).]
Generalized Full/Half Adders [Dadda 1967]
Count number of input bits set to 1
Output is a value in the range [0, m]
[Stelling et al. TComp 2019]
LUT drawbacks
Routing delays
Can't use carry chains
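As a minimal sketch of the m:n counter described on this slide (the function names are illustrative and not taken from the talk), the counter counts the input bits set to 1 and encodes that count, a value in [0, m], on n output bits:

```python
def counter_output_bits(m: int) -> int:
    """Number of output bits n needed to encode a count in the range [0, m]."""
    return m.bit_length()


def parallel_counter(bits):
    """Behavioral model of an m:n counter.

    Counts the inputs set to 1 and returns the count as a list of
    n bits, least significant (rank 0) first.
    """
    m = len(bits)
    n = counter_output_bits(m)
    count = sum(bits)
    return [(count >> rank) & 1 for rank in range(n)]


if __name__ == "__main__":
    # A 6:3 counter: six rank-0 inputs, three outputs of rank 0, 1, 2.
    print(counter_output_bits(6))
    print(parallel_counter([1, 0, 1, 1, 0, 1]))
```

With six inputs the count lies in [0, 6], so three output bits are enough, which is why the slides call this a 6:3 counter.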
Configures the Altera Stratix II/III ALM as a 6:2 compressor
1 HA, 2 FA, 2 muxes, plus wires
Best results combine GPC mapping with 6:2 compressors
Average speedup: 1.41x over 3-ADD
Average increase in ALM usage: 1.19x over 3-ADD
9/16
6:2 Compressors: Microarchitecture
[Figure: 6:2 compressor microarchitecture. Blocks: FA, FA, HA, FA, FA. Ports: rank-0 inputs, cin,1, cin,0, cout,1, cout,0, sum outputs.]
No combinational path from carry-in to carry-out bits
This is not ripple-carry
14/16
Experimental Results
[Charts: Speedup, Area, and Routing Delay, each normalized to 3-ADD]
[Wallace 1966, Dadda 1967]
Parallel multipliers Many video/signal processing circuits
FIR Filters H.264/AVC video coding 3G wireless base station channel cards
FPGA Synthesis
Stratix II/III carry chain
LUTs (shared arithmetic mode)
Ternary addition
Map poorly onto LUTs
Poor flexibility in mapping
[Wallace 1966]
6:3 Counter: steady state of 3 bits per column
6:2 Compressor: steady state of 2 bits per column
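The 3-bits-per-column figure for counters can be checked with a small sketch. It assumes one counter is placed on every column and that a counter on column j emits one bit into each of columns j, j+1, and so on, one per output bit. The function name is hypothetical. The model covers plain counters only; the 6:2 compressor reaches 2 bits per column, as stated on the slide, by passing carries between adjacent cells, which is not modeled here.

```python
def heights_after_one_stage(num_columns: int, out_bits: int):
    """Column heights after placing one counter on every input column.

    A counter on column j emits one bit into each of the columns
    j, j+1, ..., j+out_bits-1 (output ranks 0 .. out_bits-1).
    """
    heights = [0] * (num_columns + out_bits - 1)
    for j in range(num_columns):
        for rank in range(out_bits):
            heights[j + rank] += 1
    return heights


if __name__ == "__main__":
    # 6:3 counters on six columns: interior columns settle at 3 bits.
    print(heights_after_one_stage(6, 3))
```

Interior columns receive one bit from each of three neighbouring counters, so a layer of 6:3 counters still leaves three bits per column to be reduced.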
11/16
6:2 Compressors Form a Carry Chain
[Figure: six adjacent 6:2 compressors, each taking 6 input bits, connected as a carry chain]
[Charts: results per benchmark for 3-ADD, GPC, 6:2, and 6:2+GPC]
Benchmarks: add2I, add2Q, add2Y, fir3, g72x, m12x12, m16x16, fir6, MotionEst, RQGQBQ, RYGYBY, Average
6:2+GPC is the best in all cases
3-ADD has the smallest area in all cases
GPC has the largest area in all cases
No uniform trends
GPC does not use carry chains; the others do!
4 ALMs per LAB to reduce complexity
4 Mapping Algorithms
3-ADD: Ternary adder trees
GPC: GPC mapping [Parandeh-Afshar et al. ASPDAC 2019]
6:2: Mapping using 6:2 compressors only
6:2 + GPC: The best of both worlds
2/16
Circuit Transformation
[Figure: flowgraph before transformation, built from step 0 to step 3 and delta 7, delta 4, delta 2, delta 1 inputs, shifts (>>), ANDs (&), comparisons (=), SEL nodes, additions (+), and an HA]
[Verma and Ienne, DATE 2019]
4/16
The Altera Stratix II/III ALM: Shared Arithmetic Mode
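[Figure: ALM in shared arithmetic mode. Four 3-LUTs produce sum_r, carry_r, sum_{r+1}, and carry_{r+1}, which feed the FAs driving the ALM output of rank r; shown alongside the 6:2 compressor (FAs, cout,1, cout,0, sum outputs).]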
12/16
Proposed Logic Cell: 2 Designs
13/16
Experimental Methodology
Platform: VPR
Modeled island-style FPGA
Altera-like ALMs and LABs
5/16
Generalized Parallel Counters (GPCs)
Extension to m:n counters
Input bits can have different ranks
i.e., (k_{n-1}, …, k_1, k_0; S)
Each 6:2 compressor is a logic cell
Carry chains between adjacent cells bypass local routing
This is not an over-glorified ripple-carry structure
16/16
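[Figure: ADPCM vpdiff computation. The original flowgraph of SEL, +, &, and = nodes is transformed so that a compressor tree followed by a final addition (+) produces vpdiff.]
[Verma and Ienne, ICCAD 2019]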
3/16
Compressor Tree Synthesis
ASIC Synthesis
Ripple-carry addition
Carry-save representation
Ternary addition
Full/Half Adder Trees
m:n counters
Flowgraph transformations to expose compressor trees
[Verma and Ienne, ICCAD 2019]
Generally applicable to arithmetic circuits
Merge disparate add, mul operations to form compressor trees
FPGA vs. ASIC
ASIC: √ Performance, √ Area Utilization, √ Power Consumption
FPGA: √ Flexibility, √ Time-to-Market
Performance gap between FPGAs and ASICs
[Kuon and Rose, FPGA 2019 and TCAD 2019]
Software synthesis heuristic/ILP
[Parandeh-Afshar et al. ASPDAC 2019, DATE 2019]
Faster than ternary adder trees or DSP blocks
Stratix II/III and Xilinx Virtex-5 FPGAs
A Novel FPGA Logic Block for Improved Arithmetic Performance
Hadi Parandeh-Afshar Philip Brisk Paolo Ienne
16th ACM/SIGDA International Symposium on FPGAs
Monterey, California, USA, February 26, 2019
Arithmetic circuits exacerbate the disparities
Focus on compressor trees
1/16
Compressor Trees
A circuit that sums k > 2 integer values
Carry-save representation
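A compressor tree can be sketched in software with the standard 3:2 carry-save step, assuming non-negative integer operands. The function names are illustrative only, and this sketch does not model the FPGA mapping discussed in the talk. Each step replaces three operands with a sum word and a carry word without propagating carries, and a single ordinary addition finishes the job once two operands remain.

```python
def carry_save_add(a: int, b: int, c: int):
    """3:2 carry-save step: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c                              # per-bit sum
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry, rank + 1
    return sum_word, carry_word


def compressor_tree_sum(operands):
    """Sum k operands by reducing them to two, then adding once."""
    values = list(operands)
    while len(values) > 2:
        a, b, c = values.pop(), values.pop(), values.pop()
        values.extend(carry_save_add(a, b, c))
    # Final carry-propagate addition of the carry-save pair.
    return sum(values)


if __name__ == "__main__":
    print(compressor_tree_sum([3, 5, 7, 9, 11]))
```

Only the last addition needs a carry-propagate adder; every earlier step works column by column, which is what makes the structure attractive for summing k > 2 values.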
Examples: (2, 3; 3) and (0, 4; 3)
Column weights: 2^{n-1}, …, 2^1, 2^0
(2, 3; 3): output range [0, 7], S = 3
(0, 4; 3): a 4:3 counter
Number of input bits: M = k_{n-1} + … + k_1 + k_0
Number of output bits: S
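The GPC notation can be made concrete with a short sketch. The helper names are hypothetical, and the lists here are written rank 0 first, which is the reverse of the slide's (k_{n-1}, …, k_1, k_0; S) order. A GPC adds its input bits weighted by rank, and S must be large enough to hold the largest possible result.

```python
def gpc_max_output(k):
    """Largest output value of a GPC; k[i] is the number of rank-i inputs."""
    return sum(count << rank for rank, count in enumerate(k))


def gpc_output_bits(k):
    """Smallest S able to encode every value in [0, gpc_max_output(k)]."""
    return gpc_max_output(k).bit_length()


def gpc_evaluate(bits_by_rank):
    """Output value of a GPC; bits_by_rank[i] lists the rank-i input bits."""
    return sum(sum(bits) << rank for rank, bits in enumerate(bits_by_rank))


if __name__ == "__main__":
    # (2, 3; 3) on the slide: k_1 = 2 and k_0 = 3, written rank 0 first here.
    k = [3, 2]
    print(sum(k), gpc_max_output(k), gpc_output_bits(k))
    # Two of three rank-0 bits set, one of two rank-1 bits set.
    print(gpc_evaluate([[1, 1, 0], [1, 0]]))
```

For (2, 3; 3) the largest output is 3·1 + 2·2 = 7, matching the slide's output range [0, 7] with S = 3. For (0, 4; 3) all four inputs have rank 0, which is the ordinary 4:3 counter.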
6/16
Compressor Tree Synthesis on FPGAs via GPC Mapping
Input ranks may vary
[Figure: a 6:3 counter (all inputs have rank 0; output ranks 2, 1, 0) next to a GPC with inputs of rank 2, rank 1, and rank 0, and a 6:2 compressor with ports cin,0, cin,1, cout,0, cout,1 and output ranks 1, 0]
8/16
Why are 6:2 compressors more effective than 6:3 counters?
15/16
Conclusion
Compressor trees are an important class of arithmetic circuits
Previous work: GPC mapping outperforms 3-ADD
Cannot use carry-chain
Contribution: New carry chain
M = 6 inputs; S = 3 or 4 outputs
GPCs were mapped onto 6-LUTs
Unable to exploit the carry chain, except for the final add
Contribution: A new carry chain that we can use!