Computer Organization and Design, 5th Edition: Solutions


Principles of Computer Organization, 5th Edition, Bai Zhongying (detailed): Chapter 4 Exercise Reference Answers


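Chapter 4 Exercise Reference Answers

1. ASCII is a 7-bit code. If main memory is designed with a 32-bit word and the instruction length is 12 bits, is that reasonable? Why?

Answer: It is not reasonable.

An instruction should preferably be a half word or a full word long, so 16 bits would be a better choice.

An ASCII character takes 7 bits, so a 32-bit memory word can hold four characters. That is workable, although accessing a single character takes a little more time. An instruction, however, occupies at least one memory word but uses only 12 of its 32 bits, so the other 20 bits are wasted. Because single-word instructions are usually numerous, the total waste is large, which is why the design is unreasonable.

2. A computer has 32-bit instructions in three forms (two-operand, one-operand and zero-operand), and its instruction set has 70 instructions. Design an instruction format that meets these requirements.

Answer: The word length is 32 bits and the instruction set has 70 instructions, so the opcode needs at least 7 bits (2^6 = 64 < 70 <= 128 = 2^7).

[Format diagrams for the two-operand, one-operand and zero-operand instructions are not recoverable from the source.]

3. The instruction format is shown below (field boundaries at bits 15-10, 9-8, 7-4 and 3-0). Analyze the characteristics of the instruction format and its addressing mode.

Answer: The instruction format and addressing mode have the following characteristics:

(1) It is a single-word, two-address instruction.

(2) The 6-bit opcode field OP can specify 2^6 = 64 operations.

(3) The source and the destination are both general-purpose registers (each field can select one of 16 registers), so this is an RR-type instruction: both operands are in registers.

(4) This format is typically used for register-to-register data transfers and for arithmetic and logic instructions.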
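The field analysis in answer 3 can be sketched as a small decoder. This is only an illustration: the function name `decode_rr` is hypothetical, and the assignment of bits 7-4 to the source register and bits 3-0 to the destination register is an assumption based on the field boundaries 15-10, 9-8, 7-4, 3-0; the source does not show the labelled diagram.

```python
def decode_rr(word: int) -> dict:
    """Split a 16-bit RR-type instruction word into its fields.

    Assumed layout (not confirmed by the source diagram):
      bits 15-10: OP, bits 9-8: unused, bits 7-4: source, bits 3-0: destination.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit instruction word")
    return {
        "op": (word >> 10) & 0x3F,   # 6 bits -> 2**6 = 64 possible operations
        "src": (word >> 4) & 0xF,    # 4 bits -> one of 16 registers
        "dst": word & 0xF,           # 4 bits -> one of 16 registers
    }


if __name__ == "__main__":
    # OP = 3, unused = 00, source = R5, destination = R10
    print(decode_rr(0b000011_00_0101_1010))
```

The masks make the counts in the answer visible: a 6-bit mask (0x3F) allows 64 opcodes, and each 4-bit mask (0xF) selects one of 16 registers.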

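4. The instruction format is shown below (first-word field boundaries at bits 15-10, 9-8, 7-4 and 3-0). Analyze the characteristics of the instruction format and its addressing mode.

Answer: The instruction format and addressing mode have the following characteristics:

(1) It is a double-word, two-address instruction used to access memory.

(2) The opcode field OP can specify 2^6 = 64 operations.

(3) It is an RS-type instruction: one operand is in a general-purpose register (one of 16), and the other operand is in main memory.

The effective address is obtained by indexed addressing, that is, the effective address equals the contents of the index register (one of 16) plus the displacement.

5. The instruction format is shown below (field boundaries at bits 15-12, 11-9, 8-6, 5-3 and 2-0). Analyze the characteristics of the instruction format and its addressing mode.

Answer: The instruction format and addressing mode have the following characteristics:

(1) It is a single-word, two-operand instruction. The source operand and the destination operand are each specified by an addressing-mode field and a register field; each side can select one of 8 registers and one of 8 addressing modes. Depending on the addressing modes chosen, the instruction can be of RR, RS or SS type.

(2) Because OP is 4 bits wide, there can be at most 16 operations.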

Computer Assembly and Maintenance, 5th Edition: Reference Answers to End-of-Chapter Exercises (Industry)


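Chapter 1: Computer Fundamentals

1. What is computer hardware? Into which categories can it be divided?

Computer hardware refers to the physical components that make up a computer, including the processor, memory, hard disk, display, keyboard and so on.

Computer hardware can be divided into the following categories:

1. Central processing unit (CPU): processes the computer's instructions and data.

2. Main memory: stores programs and data.

3. Input devices: bring external information into the computer, for example the keyboard and mouse.

4. Output devices: present the results of processing, for example the display and printer.

5. Secondary storage: keeps data and programs over the long term, for example hard disks and optical discs.

6. Expansion devices: for example network devices, sound cards and graphics cards.

2. What is computer software? Into which categories can it be divided?

Computer software refers to the programs and data in a computer system. It can be divided into the following categories:

1. System software: operating systems, compilers, database management systems and the like, used to manage and control the hardware resources of the computer.

2. Application software: office software, graphics and image processing software, entertainment software and the like, used to meet the various needs of users.

3. Development tools: programming languages, integrated development environments and the like, used to develop other software.

3. Briefly describe the five major components of a computer system.

The five major components of a computer system are: input devices, output devices, storage devices, the central processing unit (CPU) and the controller/scheduler.

• Input devices bring external information into the computer, for example the keyboard and mouse.

• Output devices present the results of processing, for example the display and printer.

• Storage devices keep data and programs over the long term, for example hard disks and optical discs.

• The central processing unit (CPU) processes the computer's instructions and data.

• The controller/scheduler coordinates the work of the components of the computer system.

4. What is the bit width of a computer? How does increasing the bit width affect performance?

The bit width of a computer is the number of binary bits it uses to represent a unit of data.

Common bit widths are 8, 16, 32 and 64 bits.

Increasing the bit width affects performance in the following ways:

1. Memory capacity: a larger bit width extends the range of memory the computer can address, so it can work with larger amounts of data.

2. Computing speed: a larger bit width increases the computing capability of the machine and raises its speed of operation.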
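The addressing-range point in item 1 above can be made concrete with a minimal sketch. It assumes a byte-addressable machine whose address width equals its bit width, which is a simplification (real processors often implement fewer physical address lines); the function name `addressable_bytes` is hypothetical.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    if address_bits <= 0:
        raise ValueError("address width must be positive")
    return 2 ** address_bits


if __name__ == "__main__":
    for bits in (16, 32, 64):
        print(f"{bits}-bit addresses reach {addressable_bytes(bits):,} bytes")
```

Under this assumption a 16-bit address reaches 64 KiB, a 32-bit address reaches 4 GiB, and a 64-bit address reaches 16 EiB.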

Bai Zhongying, Principles of Computer Organization (5th Edition): Notes and Detailed Solutions to End-of-Chapter Exercises

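Chapter 1 Overview of Computer Systems
1.1 Review notes
1.2 Detailed solutions to end-of-chapter exercises
Chapter 2 Arithmetic Methods and the Arithmetic Unit
2.1 Review notes
2.2 Detailed solutions to end-of-chapter exercises
Chapter 3 The Multilevel Memory
3.1 Review notes
3.2 Detailed solutions to end-of-chapter exercises
Chapter 4 The Instruction System
4.1 Review notes
4.2 Detailed solutions to end-of-chapter exercises
Chapter 5 The Central Processing Unit
5.1 Review notes
5.2 Detailed solutions to end-of-chapter exercises
Chapter 6 The Bus System
6.1 Review notes
6.2 Detailed solutions to end-of-chapter exercises
Chapter 7 External Storage and I/O Devices
7.1 Review notes
7.2 Detailed solutions to end-of-chapter exercises
Chapter 8 The Input/Output System
8.1 Review notes
8.2 Detailed solutions to end-of-chapter exercises
Chapter 9 Parallel Organization and Structure
9.1 Review notes
9.2 Detailed solutions to end-of-chapter exercises
Chapter 10 Design of Course Laboratory Experiments
Chapter 11 Integrated Course Design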

Principles of Computer Organization: End-of-Chapter Answers (Bai Zhongying, ed., 5th Edition, multimedia textbook), Part 2


Chapter 2

1. (8-bit codes)
(1) -35 = (-100011)2: sign-magnitude 10100011, ones' complement 11011100, two's complement 11011101
(2) 127 = (1111111)2: sign-magnitude 01111111, ones' complement 01111111, two's complement 01111111
(3) -127 = (-1111111)2: sign-magnitude 11111111, ones' complement 10000000, two's complement 10000001
(4) -1 = (-0000001)2: sign-magnitude 10000001, ones' complement 11111110, two's complement 11111111

2. [x]comp = a0.a1a2...a6 (two's complement); find the condition for x > -0.5.
Method 1:
(1) If a0 = 0, then x >= 0, which also satisfies x > -0.5; a1 to a6 may take any value.
(2) If a0 = 1, then x < 0. To satisfy x > -0.5 we need a1 = 1, that is, a0 = 1, a1 = 1, and at least one of a2 to a6 is nonzero.
Method 2:
-0.5 = (-0.1)2 = -0.100000, whose two's complement is 1.100000.
(1) If x >= 0, then a0 = 0 and a1 to a6 are arbitrary.
(2) If x < 0, then x > -0.5 requires [x]comp > 1.100000, that is, a0a1 = 11 and a2 to a6 are not all 0 (at least one of them is 1).

3. A 32-bit floating-point number has an 8-bit exponent E in excess (biased) code and a 23-bit mantissa in two's complement (sign Ms followed by M21...M0); the base is 2.
(1) Largest number: E = 11111111, Ms = 0, M = 11...1 (all ones); value 2^(2^7 - 1) × (1 - 2^-22).
(2) Smallest number: E = 11111111, Ms = 1, M = 00...0 (all zeros); value -2^(2^7 - 1).
(3) Normalized range:
    Largest positive:  E = 11...1, Ms = 0, M = 11...1;   value 2^(2^7 - 1) × (1 - 2^-22)
    Smallest positive: E = 00...0, Ms = 0, M = 100...0;  value 2^(-2^7) × 2^-1
    Largest negative (closest to 0): E = 00...0, Ms = 1, M = 011...1;  value -2^(-2^7) × (2^-1 + 2^-22)
    Smallest negative: E = 11...1, Ms = 1, M = 00...0;   value -2^(2^7 - 1) × 1
    As a set, the normalized numbers lie in
    [-2^(2^7 - 1), -2^(-2^7) × (2^-1 + 2^-22)] ∪ [2^(-2^7) × 2^-1, 2^(2^7 - 1) × (1 - 2^-22)].

4. In the IEEE 754 standard, the value of a normalized 32-bit floating-point number x is X = (-1)^S × (1.M) × 2^(E - 127).
(1) 27/64 = 0.011011 = 1.1011 × 2^-2
    E = -2 + 127 = 125 = 0111 1101, S = 0, M = 1011 0000 0000 0000 0000 000
    Result: 0 01111101 10110000000000000000000
(2) -27/64 = -0.011011 = -1.1011 × 2^-2
    E = -2 + 127 = 125 = 0111 1101, S = 1, M = 1011 0000 0000 0000 0000 000
    Result: 1 01111101 10110000000000000000000

5. Compute x + y using two's complement with a double sign bit.
(1) [x]comp = 00 11011, [y]comp = 00 00011
    00 11011 + 00 00011 = 00 11110
    No overflow; x + y = 11110.
(2) [x]comp = 00 11011, [y]comp = 11 01011
    00 11011 + 11 01011 = 00 00110 (carry out of the sign bits discarded)
    No overflow; x + y = 00110.
(3) [x]comp = 11 01010, [y]comp = 11 11111
    11 01010 + 11 11111 = 11 01001 (carry out of the sign bits discarded)
    No overflow; x + y = -10111.

6. [x - y]comp = [x]comp + [-y]comp
(1) [x]comp = 00 11011, [-y]comp = 00 11111
    00 11011 + 00 11111 = 01 11010
    The sign bits are 01, so the result overflows in the positive direction.
(2) [x]comp = 00 10111, [-y]comp = 11 00101
    00 10111 + 11 00101 = 11 11100
    No overflow; x - y = -00100.
(3) [x]comp = 00 11011, [-y]comp = 00 10011
    00 11011 + 00 10011 = 01 01110
    The sign bits are 01, so the result overflows in the positive direction.

7. (1) Sign-magnitude array multiplier: [x]sm = 0 11011, [y]sm = 1 11111. The sign is handled separately: |x| = 11011, |y| = 11111.
    Summing the five shifted partial products gives 11011 × 11111 = 1101000101.
    Product sign = 0 XOR 1 = 1, so [x × y]sm = 1 1101000101.
    Two's complement array multiplier: [x]comp = 0 11011, [y]comp = 1 00001; the product sign is 1; |x| = 11011, |y| = 11111; |x| × |y| = 1101000101.
    [x × y]comp = 1 0010111011.
(2) Sign-magnitude array multiplier: [x]sm = 1 11111, [y]sm = 1 11011. The sign is handled separately: |x| = 11111, |y| = 11011.
    11111 × 11011 = 1101000101.
    Product sign = 1 XOR 1 = 0, so [x × y]sm = 0 1101000101.
    Two's complement array multiplier: [x]comp = 1 00001, [y]comp = 1 00101; the product sign is 0; |x| = 11111, |y| = 11011; |x| × |y| = 1101000101.
    [x × y]comp = 0 1101000101.

8. (Non-restoring division on the magnitudes; the sign of the quotient is handled separately.)
(1) [|x|]comp = 0 11000, [|y|]comp = 0 11111, [-|y|]comp = 1 00001

    Step                         Partial remainder   Quotient bit
    x + [-|y|]                   1 11001 (negative)  q0 = 0
    shift left 1 10010, + [|y|]  0 10001 (positive)  q1 = 1
    shift left 1 00010, + [-|y|] 0 00011 (positive)  q2 = 1
    shift left 0 00110, + [-|y|] 1 00111 (negative)  q3 = 0
    shift left 0 01110, + [|y|]  1 01101 (negative)  q4 = 0
    shift left 0 11010, + [|y|]  1 11001 (negative)  q5 = 0
    restore: + [|y|]             0 11000

    Hence [x ÷ y]sm = 1.11000, that is, x ÷ y = -0.11000, with remainder 0.11000 × 2^-5.
(2) [|x|]comp = 0 01011, [|y|]comp = 0 11001, [-|y|]comp = 1 00111

    Step                         Partial remainder   Quotient bit
    x + [-|y|]                   1 10010 (negative)  q0 = 0
    shift left 1 00100, + [|y|]  1 11101 (negative)  q1 = 0
    shift left 1 11010, + [|y|]  0 10011 (positive)  q2 = 1
    shift left 1 00110, + [-|y|] 0 01101 (positive)  q3 = 1
    shift left 0 11010, + [-|y|] 0 00001 (positive)  q4 = 1
    shift left 0 00010, + [-|y|] 1 01001 (negative)  q5 = 0
    restore: + [|y|]             0 00010

    x ÷ y = -0.01110, with remainder 0.00010 × 2^-5.

9. (1) x = 2^-011 × 0.100101, y = 2^-010 × (-0.011110)
    [x]float = 11101, 0.100101; [y]float = 11110, -0.011110
    Ex - Ey = 11101 + 00010 = 11111 (that is, -1), so x is aligned to y: [x]float = 11110, 0.010010(1)
    x + y: 00.010010(1) + 11.100010 = 11.110100(1)
    Normalization (shift left by 2): mantissa 1.010010, exponent 11100
    x + y = -0.101110 × 2^-4
    x - y: 00.010010(1) + 00.011110 = 00.110000(1)
    Normalization: mantissa 0.110000(1), exponent 11110; after rounding,
    x - y = 2^-2 × 0.110001
(2) x = 2^-101 × (-0.010110), y = 2^-100 × 0.010110
    [x]float = 11011, -0.010110; [y]float = 11100, 0.010110
    Ex - Ey = 11011 + 00100 = 11111 (that is, -1), so x is aligned to y: [x]float = 11100, 1.110101(0)
    x + y: 11.110101 + 00.010110 = 00.001011
    Normalization (shift left by 2): mantissa 0.101100, exponent 11010
    x + y = 0.101100 × 2^-6
    x - y: 11.110101 + 11.101010 = 11.011111
    Normalization: mantissa 1.011111, exponent 11100
    x - y = -0.100001 × 2^-4

10. (1) Ex = 0011, Mx = 0.110100; Ey = 0100, My = 0.100100
    Ez = Ex + Ey = 0111
    Mx × My = 0.1101 × 0.1001 = 0.01110101
    After normalization: 2^6 × 0.111011
(2) Ex = 1110, Mx = 0.011010; Ey = 0011, My = 0.111100
    Ez = Ex - Ey = 1110 + 1101 = 1011
    [Mx]comp = 00.011010, [My]comp = 00.111100, [-My]comp = 11.000100
    Non-restoring division of the mantissas (alternately adding [-My]comp and [My]comp) produces the quotient bits 0.011011.
    Quotient = 0.110110 × 2^-6, remainder = 0.101100 × 2^-6

11. For the 4-bit adder (figure not reproduced in the source):
    C_i = A_i B_i + A_i C_(i-1) + B_i C_(i-1) = A_i B_i + (A_i + B_i) C_(i-1) = A_i B_i + (A_i ⊕ B_i) C_(i-1)
(1) Ripple (serial) carry:
    C1 = G1 + P1 C0
    C2 = G2 + P2 C1
    C3 = G3 + P3 C2
    C4 = G4 + P4 C3
    where G1 = A1 B1, G2 = A2 B2, G3 = A3 B3, G4 = A4 B4
    and P1 = A1 ⊕ B1 (A1 + B1 is also correct), P2 = A2 ⊕ B2, P3 = A3 ⊕ B3, P4 = A4 ⊕ B4
(2) Carry-lookahead (parallel) carry:
    C1 = G1 + P1 C0
    C2 = G2 + P2 G1 + P2 P1 C0
    C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 C0
    C4 = G4 + P4 G3 + P4 P3 G2 + P4 P3 P2 G1 + P4 P3 P2 P1 C0

12. (1) The carry output of the 74181 that forms the lowest four bits is
    C4 = C(n+4) = G + P Cn = G + P C0, where C0 is the carry into bit 0,
    G = y3 + y2 x3 + y1 x2 x3 + y0 x1 x2 x3, and P = x0 x1 x2 x3, so
    C5 = y4 + x4 C4
    C6 = y5 + x5 C5 = y5 + x5 y4 + x5 x4 C4
(2) Let the delay of a standard gate be T and the delay of an AND-OR-INVERT gate be 1.5T. The carry signal travels from C0 at the lowest bit to C6 through one inverter and two levels of AND-OR-INVERT gates, so the longest delay is T + 2 × 1.5T = 4T.
(3) The longest addition time is counted from the moment the operands are applied to the ALU: the first 74181 contributes 3 levels of AND-OR-INVERT gates (producing the control parameters x0, y0 and C(n+4)); the second and third 74181 together contribute 2 levels of inverters and 2 levels of AND-OR-INVERT gates (the carry chain); the fourth 74181 contributes the sum logic (1 level of AND-OR-INVERT gate and 1 level of half adder, the latter with delay 3T). The total addition time is therefore
    t0 = 3 × 1.5T + 2T + 2 × 1.5T + 1.5T + 3T = 14T

13. Let the two excess-3 coded operands be Xi and Yi, let the sum of the first binary addition be Si' with carry C(i+1)', and let the corrected excess-3 sum be Si with carry C(i+1). Then:
    Xi = Xi3 Xi2 Xi1 Xi0
    Yi = Yi3 Yi2 Yi1 Yi0
    Si' = Si3' Si2' Si1' Si0'
    When C(i+1)' = 1, Si = Si' + 0011, and C(i+1) is produced.
    When C(i+1)' = 0, Si = Si' + 1101.
    From this analysis, the unit circuit of an excess-3 decimal adder can be drawn as in the figure.
    [Figure: a row of four full adders performing the binary addition of Xi3..Xi0 and Yi3..Yi0 to give Si3'..Si0', followed by a second row of full adders performing the decimal correction (+3 or -3) to give Si3..Si0 and C(i+1).]
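The correction rule in answer 13 can be checked with a short behavioural sketch. It models one digit position of the excess-3 adder in software rather than at gate level; the function name `excess3_digit_add` and its interface are hypothetical, chosen only for this illustration.

```python
def excess3_digit_add(x: int, y: int, carry_in: int = 0) -> tuple:
    """Add two excess-3 coded digits (each 3..12) and return (sum_digit, carry_out).

    Step 1: plain 4-bit binary addition gives S' and the carry C'.
    Step 2: correction as stated in the answer:
            C' = 1 -> S = S' + 0011 (add 3)
            C' = 0 -> S = S' + 1101 (subtract 3, modulo 16)
    """
    raw = x + y + carry_in
    carry_out = raw >> 4          # carry out of the 4-bit binary adder
    s_prime = raw & 0xF           # uncorrected 4-bit sum S'
    correction = 0b0011 if carry_out else 0b1101
    return (s_prime + correction) & 0xF, carry_out


if __name__ == "__main__":
    # 4 + 5 = 9: excess-3 inputs 7 and 8, expected excess-3 result 12, no carry
    print(excess3_digit_add(4 + 3, 5 + 3))
    # 5 + 5 = 10: expected excess-3 digit 3 (decimal 0) with carry 1
    print(excess3_digit_add(5 + 3, 5 + 3))
```

The rule works because adding two excess-3 digits leaves an excess of 6: when no carry is produced the excess must be reduced by 3, and when a carry is produced the wrap-around at 16 has already removed 6, so 3 must be added back.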

Computer Organization and Design, 5th Edition (Chapter 2)


Chapter 2 Solutions

2.1
    addi f, h, -5      (note, no subi)
    add  f, f, g

2.2 f = g + h + i

2.3
    sub $t0, $s3, $s4
    add $t0, $s6, $t0
    lw  $t1, 16($t0)
    sw  $t1, 32($s7)

2.4 B[g] = A[f] + A[1+f];

2.5
    add $t0, $s6, $s0
    add $t1, $s7, $s1
    lw  $s0, 0($t0)
    lw  $t0, 4($t0)
    add $t0, $t0, $s0
    sw  $t0, 0($t1)

2.6
2.6.1
    temp = Array[0];
    temp2 = Array[1];
    Array[0] = Array[4];
    Array[1] = temp;
    Array[4] = Array[3];
    Array[3] = temp2;
2.6.2
    lw $t0, 0($s6)
    lw $t1, 4($s6)
    lw $t2, 16($s6)
    sw $t2, 0($s6)
    sw $t0, 4($s6)
    lw $t0, 12($s6)
    sw $t0, 16($s6)
    sw $t1, 12($s6)

2.7
    Little-Endian          Big-Endian
    Address  Data          Address  Data
    12       ab            12       12
    8        cd            8        ef
    4        ef            4        cd
    0        12            0        ab

2.8 2882400018

2.9
    sll  $t0, $s1, 2     # $t0 <-- 4*g
    add  $t0, $t0, $s7   # $t0 <-- Addr(B[g])
    lw   $t0, 0($t0)     # $t0 <-- B[g]
    addi $t0, $t0, 1     # $t0 <-- B[g]+1
    sll  $t0, $t0, 2     # $t0 <-- 4*(B[g]+1) = Addr(A[B[g]+1])
    lw   $s0, 0($t0)     # f <-- A[B[g]+1]

2.10 f = 2*(&A);

2.11
    Instruction          Type    opcode  rs  rt  rd/immed
    addi $t0, $s6, 4     I-type  8       22  8   4
    add  $t1, $s6, $0    R-type  0       22  0   9
    sw   $t1, 0($t0)     I-type  43      8   9   0
    lw   $t0, 0($t0)     I-type  35      8   8   0
    add  $s0, $t1, $t0   R-type  0       9   8   16

2.12
2.12.1 50000000
2.12.2 overflow
2.12.3 B0000000
2.12.4 no overflow
2.12.5 D0000000
2.12.6 overflow

2.13
2.13.1 128 + x > 2^31 - 1, x > 2^31 - 129 and 128 + x < -2^31, x < -2^31 - 128 (impossible)
2.13.2 128 - x > 2^31 - 1, x < -2^31 + 129 and 128 - x < -2^31, x > 2^31 + 128 (impossible)
2.13.3 x - 128 < -2^31, x < -2^31 + 128 and x - 128 > 2^31 - 1, x > 2^31 + 127 (impossible)

2.14 r-type, add $s0, $s0, $s0
2.15 i-type, 0xAD490020
2.16 r-type, sub $v1, $v1, $v0, 0x00621822
2.17 i-type, lw $v0, 4($at), 0x8C220004

2.18
2.18.1 opcode would be 8 bits; rs, rt, rd fields would be 7 bits each
2.18.2 opcode would be 8 bits; rs and rt fields would be 7 bits each
2.18.3
    more registers → more bits per instruction → could increase code size
    more registers → fewer register spills → fewer instructions
    more instructions → more appropriate instruction → decrease code size
    more instructions → larger opcodes → larger code size

2.19
2.19.1 0xBABEFEF8
2.19.2 0xAAAAAAA0
2.19.3 0x00005545

2.20
    srl $t0, $t0, 11
    sll $t0, $t0, 26
    ori $t2, $0, 0x03ff
    sll $t2, $t2, 16
    ori $t2, $t2, 0xffff
    and $t1, $t1, $t2
    or  $t1, $t1, $t0

2.21 nor $t1, $t2, $t2

2.22
    lw  $t3, 0($s1)
    sll $t1, $t3, 4

2.23 $t2 = 3

2.24 jump: no, beq: no

2.25
2.25.1 i-type
2.25.2
    addi $t2, $t2, -1
    beq  $t2, $0, loop

2.26
2.26.1 20
2.26.2
    i = 10;
    do {
        B += 2;
        i = i - 1;
    } while (i > 0)
2.26.3 5*N

2.27
           addi $t0, $0, 0
           beq  $0, $0, TEST1
    LOOP1: addi $t1, $0, 0
           beq  $0, $0, TEST2
    LOOP2: add  $t3, $t0, $t1
           sll  $t2, $t1, 4
           add  $t2, $t2, $s2
           sw   $t3, ($t2)
           addi $t1, $t1, 1
    TEST2: slt  $t2, $t1, $s1
           bne  $t2, $0, LOOP2
           addi $t0, $t0, 1
    TEST1: slt  $t2, $t0, $s0
           bne  $t2, $0, LOOP1

2.28 14 instructions to implement and 158 instructions executed

2.29
    for (i=0; i<100; i++) {
        result += MemArray[s0];
        s0 = s0 + 4;
    }

2.30
          addi $t1, $s0, 400
    LOOP: lw   $s1, 0($t1)
          add  $s2, $s2, $s1
          addi $t1, $t1, -4
          bne  $t1, $s0, LOOP

2.31
    fib:   addi $sp, $sp, -12    # make room on stack
           sw   $ra, 8($sp)      # push $ra
           sw   $s0, 4($sp)      # push $s0
           sw   $a0, 0($sp)      # push $a0 (N)
           bgt  $a0, $0, test2   # if n>0, test if n=1
           add  $v0, $0, $0      # else fib(0) = 0
           j    rtn
    test2: addi $t0, $0, 1
           bne  $t0, $a0, gen    # if n>1, gen
           add  $v0, $0, $t0     # else fib(1) = 1
           j    rtn
    gen:   addi $a0, $a0, -1     # n-1 (MIPS has no subi)
           jal  fib              # call fib(n-1)
           add  $s0, $v0, $0     # copy fib(n-1)
           addi $a0, $a0, -1     # n-2
           jal  fib              # call fib(n-2)
           add  $v0, $v0, $s0    # fib(n-1)+fib(n-2)
    rtn:   lw   $a0, 0($sp)      # pop $a0
           lw   $s0, 4($sp)      # pop $s0
           lw   $ra, 8($sp)      # pop $ra
           addi $sp, $sp, 12     # restore sp
           jr   $ra
    # fib(0) = 12 instructions, fib(1) = 14 instructions,
    # fib(N) = 26 + 18N instructions for N >= 2

2.32 Due to the recursive nature of the code, it is not possible for the compiler to in-line the function call.

2.33 After calling function fib:
    old $sp ->  0x7ffffffc
                -4    contents of register $ra for fib(N)
                -8    contents of register $s0 for fib(N)
    $sp ->      -12   contents of register $a0 for fib(N)
    There will be N-1 copies of $ra, $s0 and $a0.

2.34
    f:  addi $sp, $sp, -12
        sw   $ra, 8($sp)
        sw   $s1, 4($sp)
        sw   $s0, 0($sp)
        move $s1, $a2
        move $s0, $a3
        jal  func
        move $a0, $v0
        add  $a1, $s0, $s1
        jal  func
        lw   $ra, 8($sp)
        lw   $s1, 4($sp)
        lw   $s0, 0($sp)
        addi $sp, $sp, 12
        jr   $ra

2.35 We can use the tail-call optimization for the second call to func, but then we must restore $ra, $s0, $s1, and $sp before that call. We save only one instruction (jr $ra).

2.36 Register $ra is equal to the return address in the caller function, registers $sp and $s3 have the same values they had when function f was called, and register $t5 can have an arbitrary value. For register $t5, note that although our function f does not modify it, function func is allowed to modify it, so we cannot assume anything about the value of $t5 after function func has been called.

2.37
    MAIN:  addi $sp, $sp, -4
           sw   $ra, ($sp)
           addi $t6, $0, 0x30    # '0'
           addi $t7, $0, 0x39    # '9'
           add  $s0, $0, $0
           add  $t0, $a0, $0
    LOOP:  lb   $t1, ($t0)
           slt  $t2, $t1, $t6
           bne  $t2, $0, DONE
           slt  $t2, $t7, $t1
           bne  $t2, $0, DONE
           sub  $t1, $t1, $t6
           beq  $s0, $0, FIRST
           mul  $s0, $s0, 10
    FIRST: add  $s0, $s0, $t1
           addi $t0, $t0, 1
           j    LOOP
    DONE:  add  $v0, $s0, $0
           lw   $ra, ($sp)
           addi $sp, $sp, 4
           jr   $ra

2.38 0x00000011

2.39 Generally, all solutions are similar:
    lui $t1, top_16_bits
    ori $t1, $t1, bottom_16_bits

2.40 No, jump can go up to 0x0FFFFFFC.

2.41 No, range is 0x604 + 0x1FFFC = 0x0002 0600 to 0x604 - 0x20000 = 0xFFFE 0604.

2.42 Yes, range is 0x1FFFF004 + 0x1FFFC = 0x2001F000 to 0x1FFFF004 - 0x20000 = 0x1FFDF004.

2.43
    trylk: li   $t1, 1
           ll   $t0, 0($a0)
           bnez $t0, trylk
           sc   $t1, 0($a0)
           beqz $t1, trylk
           lw   $t2, 0($a1)
           slt  $t3, $t2, $a2
           bnez $t3, skip
           sw   $a2, 0($a1)
    skip:  sw   $0, 0($a0)

2.44
    try:   ll   $t0, 0($a1)
           slt  $t1, $t0, $a2
           bnez $t1, skip
           move $t0, $a2
           sc   $t0, 0($a1)
           beqz $t0, try
    skip:

2.45 It is possible for one or both processors to complete this code without ever reaching the SC instruction. If only one executes SC, it completes successfully. If both reach SC, they do so in the same cycle, but one SC completes first and then the other detects this and fails.

2.46
2.46.1 Answer is no in all cases. Slows down the computer.
    CCT = clock cycle time
    ICa = instruction count (arithmetic)
    ICls = instruction count (load/store)
    ICb = instruction count (branch)
    new CPU time = 0.75*old ICa*CPIa*1.1*oldCCT + oldICls*CPIls*1.1*oldCCT + oldICb*CPIb*1.1*oldCCT
    The extra clock cycle time adds sufficiently to the new CPU time such that it is not quicker than the old execution time in all cases.
2.46.2 107.04%, 113.43%

2.47
2.47.1 2.6
2.47.2 0.88
2.47.3 0.533333333
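The address arithmetic in 2.41 and 2.42 above can be rechecked with a minimal sketch. It follows the solution's own convention of measuring the branch reach directly from the given address (+0x1FFFC forward, -0x20000 backward) and wraps results to 32 bits; the PC+4 adjustment of a real MIPS branch is deliberately ignored, and the function name `branch_range` is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # addresses wrap around at 32 bits


def branch_range(pc: int) -> tuple:
    """Return (backward_limit, forward_limit) for a 16-bit word-offset branch.

    Uses the offsets quoted in the solution: -0x20000 and +0x1FFFC bytes,
    measured from pc itself (an assumption taken from the solution text).
    """
    backward = (pc - 0x20000) & MASK32
    forward = (pc + 0x1FFFC) & MASK32
    return backward, forward


if __name__ == "__main__":
    for pc in (0x604, 0x1FFFF004):
        low, high = branch_range(pc)
        print(f"pc=0x{pc:08X}: backward 0x{low:08X}, forward 0x{high:08X}")
```

For pc = 0x604 the backward limit wraps below zero to 0xFFFE0604, which is why 2.41 quotes an address near the top of the address space.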


Principles of Computer Organization, 5th Edition, Bai Zhongying (detailed): Chapter 3 Exercise Answers


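Chapter 3 Exercise Answers

1. A memory has 20 address bits and a 32-bit word length. (1) How many bytes of information can the memory store? (2) If the memory is built from 512K × 8-bit SRAM chips, how many chips are needed? (3) How many address bits are needed for chip selection?

Solution:
(1) The memory can store 2^20 × 32/8 = 4M bytes.
(2) The number of chips needed is (2^20 × 32)/(512K × 8) = (2^20 × 32)/(2^19 × 8) = 8.
(3) To build a 32-bit-wide memory from 512K × 8-bit chips, every 4 chips form one group for word-width expansion, and 2 such groups are then combined for capacity expansion.

So only one address bit, the highest one, is needed for chip selection.

2. A 64-bit machine uses semiconductor main memory with a 26-bit address. The largest main memory the machine allows is to be built from 4M × 8-bit DRAM chips, organized as memory modules. (1) If each memory module is 16M × 64 bits, how many modules are needed? (2) How many DRAM chips are there in each module? (3) How many DRAM chips does the main memory need in total, and how does the CPU select a module?

Solution:
(1) Number of modules: (2^26 × 64)/(16M × 64) = 4.
(2) Chips per module: (16M × 64)/(4M × 8) = 32.
(3) Total number of DRAM chips: (2^26 × 64)/(4M × 8) = 128. There are 4 modules, so the CPU selects a module with the two highest address bits, A24 and A25, through a 2:4 decoder; the remaining 24 address lines select a cell inside the module.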
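The chip counts in answers 1 and 2 follow one pattern: width expansion first, then capacity expansion. The sketch below captures that pattern under the assumption that all chips are identical and fully used; the helper name `expansion_plan` is hypothetical.

```python
import math


def expansion_plan(words: int, word_bits: int, chip_words: int, chip_bits: int) -> tuple:
    """Return (chips_per_group, groups, total_chips, select_bits).

    chips_per_group: chips placed side by side to reach the word width
    groups:          groups stacked to reach the required number of words
    select_bits:     high-order address bits needed to choose a group
    """
    chips_per_group = math.ceil(word_bits / chip_bits)   # width expansion
    groups = math.ceil(words / chip_words)               # capacity expansion
    select_bits = math.ceil(math.log2(groups)) if groups > 1 else 0
    return chips_per_group, groups, chips_per_group * groups, select_bits


if __name__ == "__main__":
    # Answer 1: 2^20 words x 32 bits from 512K x 8 chips
    print(expansion_plan(2**20, 32, 512 * 1024, 8))
    # Answer 2 (3): 2^26 words x 64 bits from 4M x 8 chips
    print(expansion_plan(2**26, 64, 4 * 2**20, 8))
    # Answer 2 (1): the same memory viewed as 16M x 64 modules
    print(expansion_plan(2**26, 64, 16 * 2**20, 64))
```

Treating a whole 16M × 64 module as one "chip" reproduces the module-level result of answer 2: four modules selected by two address bits.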

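3. Build a 64K × 32-bit memory from 16K × 8-bit DRAM chips. Requirements: (1) Draw the logic block diagram of the memory.

(2) The memory read/write cycle is 0.5 μs, and the CPU accesses memory at least once every 1 μs.

Which refresh scheme is more reasonable? What is the maximum interval between two refreshes? How much actual refresh time is needed to refresh all storage cells once?

Solution:
(1) Building a 64K × 32-bit memory from 16K × 8-bit DRAM chips takes (64K × 32)/(16K × 8) = 4 × 4 = 16 chips. Every 4 chips form a 16K × 32-bit group for word-width expansion: within a group only the data lines are kept separate (connected to D0~D7, D8~D15, D16~D23 and D24~D31 respectively), while all other pins of the same name are tied together. The low 14 address bits (A0~A13) serve as the internal cell address of every chip in a module and are fed in two steps, as row address and column address, through pins A0~A6. The 4 groups are then combined for capacity expansion, and the two high address bits A14 and A15 go through a 2:4 decoder to select one of the 4 groups.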

Computer Organization and Design, 5th Edition: Solutions


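Chapter 4 Solutions

4.1
4.1.1 The signal values are as follows:

    RegWrite  MemRead  ALUMux   MemWrite  ALUop  RegMux  Branch
    0         0        1 (Imm)  1         ADD    X       0

ALUMux is the control signal that controls the Mux at the ALU input: 0 (Reg) selects the output of the register file, and 1 (Imm) selects the immediate from the instruction word as the second input to the ALU.

RegMux is the control signal that controls the Mux at the data input of the register file: 0 (ALU) selects the output of the ALU, and 1 (Mem) selects the output of memory.

A value of X means "don't care" (it does not matter whether the signal is 0 or 1).

4.1.2 All except the branch Add unit and the write port of the Registers.

4.1.3 Outputs that are not used: branch Add, write port of the Registers. No outputs: none (all units produce outputs).

4.2
4.2.1 This instruction uses the instruction memory, both register read ports, the ALU to add Rd and Rs together, the data memory, and the write port of the Registers.

4.2.2 None.

This instruction can be implemented using the existing blocks.

4.2.3 None.

This instruction can be implemented without adding new control signals.

It only requires changes to the control logic.

4.3
4.3.1 The clock cycle time is determined by the critical path.

For the given latencies it happens to be the path that gets the data value for the load instruction: I-Mem (read instruction), Regs (takes longer than Control), Mux (select ALU input), ALU, Data Memory, and Mux (select the value from memory to be written into the Registers).

The latency of this path is 400 ps + 200 ps + 30 ps + 120 ps + 350 ps + 30 ps = 1130 ps.

With the modification it becomes 1430 ps (1130 ps + 300 ps, because the ALU is on the critical path).

4.3.2 The speedup comes from the change in clock cycle time and the change in the number of clock cycles the program needs: the program needs 5% fewer cycles, but the cycle time is 1430 ps instead of 1130 ps, so the speedup is (1/0.95) × (1130/1430) = 0.83, which means we are actually slowing down.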
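The figures in 4.3.1 and 4.3.2 can be rechecked with a short sketch. The latencies are the ones quoted in the solution, paired with the units in the order the critical path lists them; the dictionary keys and variable names are hypothetical labels chosen for this example.

```python
# Latencies (ps) along the load-instruction critical path, as quoted in 4.3.1
critical_path_ps = {
    "I-Mem": 400,
    "Regs": 200,
    "Mux (ALU input)": 30,
    "ALU": 120,
    "D-Mem": 350,
    "Mux (write-back)": 30,
}

old_cycle_ps = sum(critical_path_ps.values())   # baseline clock cycle time
new_cycle_ps = old_cycle_ps + 300               # the ALU is on the critical path

# The program needs 5% fewer cycles, but each cycle is longer.
cycle_count_ratio = 0.95
speedup = (1 / cycle_count_ratio) * (old_cycle_ps / new_cycle_ps)

if __name__ == "__main__":
    print(f"old cycle = {old_cycle_ps} ps, new cycle = {new_cycle_ps} ps")
    print(f"speedup = {speedup:.2f}")
```

A speedup below 1 confirms the conclusion in the text: the 5% saving in cycle count does not compensate for the roughly 27% longer clock cycle.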

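4.3.3 The cost is always the total cost of all components (not just those on the critical path), so the original processor has a cost of I-Mem, Regs, Control, ALU, D-Mem, 2 Add units and 3 Mux units, for a total of 1000 + 200 + 500 + 100 + 2000 + 2 × 30 + 3 × 10 = 3890. We will compute costs relative to this baseline.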

Computer Organization and Design, 5th Edition: Solutions


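Computer Organization and Design: Computer Organization and Design is a book published in 2010 by China Machine Press; its author is David A. Patterson.

The book uses a MIPS processor to present the fundamentals of computer hardware technology, pipelining, the memory hierarchy and I/O.

It also includes some introductory material on the x86 architecture.

Synopsis: This best-selling computer organization book has been thoroughly updated to focus on the revolutionary change taking place in computer architecture today: the move from uniprocessors to multicore microprocessors.

In addition, the ARM edition of the book was published to highlight the importance of embedded systems to the computing industry across Asia, and it uses the ARM processor to discuss the instruction set and arithmetic of a real computer.

ARM is the most popular instruction set architecture for embedded devices, and about 4 billion embedded devices are sold worldwide each year.

It uses ARMv6 (the ARM 11 family) as the main architecture to present the fundamentals of instruction sets and computer arithmetic.

It covers the revolutionary change from sequential to parallel computing, with a new chapter on parallelism and sections in every chapter that highlight parallel hardware and software topics.

A new appendix, written by the chief scientist and the director of architecture at NVIDIA, covers the emergence and importance of the modern GPU and describes in detail, for the first time, this highly parallel, multithreaded, multicore processor optimized for visual computing.

It describes a unique approach to measuring multicore performance, the "Roofline model", with benchmarks and analysis of the AMD Opteron X4, Intel Xeon 5000, Sun UltraSPARC T2 and IBM Cell.

It includes new content on flash memory and virtual machines.

It provides a large number of thought-provoking exercises, more than 200 pages in all.

The AMD Opteron X4 and Intel Nehalem serve as running examples throughout Computer Organization and Design: The Hardware/Software Interface (English edition, 4th Edition, ARM Edition).

All processor performance examples have been updated with the SPEC CPU2006 suite.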

Table of contents:

1 Computer Abstractions and Technology
  1.1 Introduction
  1.2 Below Your Program
  1.3 Under the Covers
  1.4 Performance
  1.5 The Power Wall
  1.6 The Sea Change: The Switch from Uniprocessors to Multiprocessors
  1.7 Real Stuff: Manufacturing and Benchmarking the AMD Opteron X4
  1.8 Fallacies and Pitfalls
  1.9 Concluding Remarks
  1.10 Historical Perspective and Further Reading
  1.11 Exercises
2 Instructions: Language of the Computer
  2.1 Introduction
  2.2 Operations of the Computer Hardware
  2.3 Operands of the Computer Hardware
  2.4 Signed and Unsigned Numbers
  2.5 Representing Instructions in the Computer
  2.6 Logical Operations
  2.7 Instructions for Making Decisions
  2.8 Supporting Procedures in Computer Hardware
  2.9 Communicating with People
  2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes
  2.11 Parallelism and Instructions: Synchronization
  2.12 Translating and Starting a Program
  2.13 A C Sort Example to Put It All Together
  2.14 Arrays versus Pointers
  2.15 Advanced Material: Compiling C and Interpreting Java
  2.16 Real Stuff: MIPS Instructions
  2.17 Real Stuff: x86 Instructions
  2.18 Fallacies and Pitfalls
  2.19 Concluding Remarks
  2.20 Historical Perspective and Further Reading
  2.21 Exercises
3 Arithmetic for Computers
  3.1 Introduction
  3.2 Addition and Subtraction
  3.3 Multiplication
  3.4 Division
  3.5 Floating Point
  3.6 Parallelism and Computer Arithmetic: Associativity
  3.7 Real Stuff: Floating Point in the x86
  3.8 Fallacies and Pitfalls
  3.9 Concluding Remarks
  3.10 Historical Perspective and Further Reading
  3.11 Exercises
4 The Processor
  4.1 Introduction
  4.2 Logic Design Conventions
  4.3 Building a Datapath
  4.4 A Simple Implementation Scheme
  4.5 An Overview of Pipelining
  4.6 Pipelined Datapath and Control
  4.7 Data Hazards: Forwarding versus Stalling
  4.8 Control Hazards
  4.9 Exceptions
  4.10 Parallelism and Advanced Instruction-Level Parallelism
  4.11 Real Stuff: the AMD Opteron X4 (Barcelona) Pipeline
  4.12 Advanced Topic: an Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations
  4.13 Fallacies and Pitfalls
  4.14 Concluding Remarks
  4.15 Historical Perspective and Further Reading
  4.16 Exercises
5 Large and Fast: Exploiting Memory Hierarchy
  5.1 Introduction
  5.2 The Basics of Caches
  5.3 Measuring and Improving Cache Performance
  5.4 Virtual Memory
  5.5 A Common Framework for Memory Hierarchies
  5.6 Virtual Machines
  5.7 Using a Finite-State Machine to Control a Simple Cache
  5.8 Parallelism and Memory Hierarchies: Cache Coherence
  5.9 Advanced Material: Implementing Cache Controllers
  5.10 Real Stuff: the AMD Opteron X4 (Barcelona) and Intel Nehalem Memory Hierarchies
  5.11 Fallacies and Pitfalls
  5.12 Concluding Remarks
  5.13 Historical Perspective and Further Reading
  5.14 Exercises
6 Storage and Other I/O Topics
  6.1 Introduction
  6.2 Dependability, Reliability, and Availability
  6.3 Disk Storage
  6.4 Flash Storage
  6.5 Connecting Processors, Memory, and I/O Devices
  6.6 Interfacing I/O Devices to the Processor, Memory, and Operating System
  6.7 I/O Performance Measures: Examples from Disk and File Systems
  6.8 Designing an I/O System
  6.9 Parallelism and I/O: Redundant Arrays of Inexpensive Disks
  6.10 Real Stuff: Sun Fire x4150 Server
  6.11 Advanced Topics: Networks
  6.12 Fallacies and Pitfalls
  6.13 Concluding Remarks
  6.14 Historical Perspective and Further Reading
  6.15 Exercises
7 Multicores, Multiprocessors, and Clusters
  7.1 Introduction
  7.2 The Difficulty of Creating Parallel Processing Programs
  7.3 Shared Memory Multiprocessors
  7.4 Clusters and Other Message-Passing Multiprocessors
  7.5 Hardware Multithreading
  7.6 SISD, MIMD, SIMD, SPMD, and Vector
  7.7 Introduction to Graphics Processing Units
  7.8 Introduction to Multiprocessor Network Topologies
  7.9 Multiprocessor Benchmarks
  7.10 Roofline: A Simple Performance Model
  7.11 Real Stuff: Benchmarking Four Multicores Using the Roofline Model
  7.12 Fallacies and Pitfalls
  7.13 Concluding Remarks
  7.14 Historical Perspective and Further Reading
  7.15 Exercises
Index
CD-ROM CONTENT
A Graphics and Computing GPUs
  A.1 Introduction
  A.2 GPU System Architectures
  A.3 Scalable Parallelism: Programming GPUs
  A.4 Multithreaded Multiprocessor Architecture
  A.5 Parallel Memory System
  A.6 Floating Point Arithmetic
  A.7 Real Stuff: The NVIDIA GeForce 8800
  A.8 Real Stuff: Mapping Applications to GPUs
  A.9 Fallacies and Pitfalls
  A.10 Concluding Remarks
  A.11 Historical Perspective and Further Reading
B1 ARM and Thumb Assembler Instructions
  B1.1 Using This Appendix
  B1.2 Syntax
  B1.3 Alphabetical List of ARM and Thumb Instructions
  B1.4 ARM Assembler Quick Reference
  B1.5 GNU Assembler Quick Reference
B2 ARM and Thumb Instruction Encodings
B3 Instruction Cycle Timings
C The Basics of Logic Design
D Mapping Control to Hardware
ADVANCED CONTENT
HISTORICAL PERSPECTIVES & FURTHER READING
TUTORIALS
SOFTWARE

About the author: David A. Patterson is a professor in the Computer Science Division at the University of California, Berkeley.


Computer Organization and Design, 5th Edition: Solutions, Chapter 5


Chapter 5 Solutions S-35.15.1.1 45.1.2 I, J5.1.3 A[I][J]5.1.4 3596 ϭ 8 ϫ 800/4 ϫ 2Ϫ8ϫ8/4 ϩ 8000/45.1.5 I, J5.1.6 A(J, I)5.25.2.130000 001103M1801011 0100114M430010 1011211M20000 001002M1911011 11111115M880101 100058M1901011 11101114M140000 1110014M1811011 0101115M440010 1100212M1861011 10101110M2531111 11011513M5.2.230000 001101M1801011 0100112M430010 101125M20000 001001H1911011 1111117M880101 100054M1901011 1110117H140000 111007M1811011 0101112H440010 110026M1861011 1010115M2531111 1101156MS-4 ChapterSolutions55.2.330000 001103M1M0M1801011 0100224M2M1M430010 101153M1M0M20000 001002M1M0M1911011 1111237M3M1M880101 1000110M0M0M1901011 1110236M3H1H140000 111016M3M1M1811011 0101225M2H1M440010 110054M2M1M1861011 1010232M1M0M2531111 1101315M2M1MCache 1 miss rate ϭ 100%Cache 1 total cycles ϭ 12 ϫ 25 ϩ 12 ϫ 2 ϭ 324Cache 2 miss rate ϭ 10/12 ϭ 83%Cache 2 total cycles ϭ 10 ϫ 25 ϩ 12 ϫ 3 ϭ 286Cache 3 miss rate ϭ 11/12 ϭ 92%Cache 3 total cycles ϭ 11 ϫ 25 ϩ 12 ϫ 5 ϭ 335Cache 2 provides the best performance.5.2.4 First we must compute the number of cache blocks in the initial cacheconfi guration. For this, we divide 32 KiB by 4 (for the number of bytes per word)and again by 2 (for the number of words per block). Th is gives us 4096 blocks anda resulting index fi eld width of 12 bits. We also have a word off set size of 1 bit and abyte off set size of 2 bits. Th is gives us a tag fi eld size of 32 Ϫ 15 ϭ 17 bits. Th ese tagbits, along with one valid bit per block, will require 18 ϫ 4096 ϭ 73728 bits or 9216bytes. 
Th e total cache size is thus 9216 ϩ 32768 ϭ 41984 bytes.Th e total cache size can be generalized tototalsize ϭ datasize ϩ (validbitsize ϩ tagsize) ϫ blockstotalsize ϭ 41984datasize ϭ blocks ϫ blocksize ϫ wordsizewordsize ϭ 4tagsize ϭ 32 Ϫ log2(blocks) Ϫ log2(blocksize) Ϫ log2(wordsize)validbitsize ϭ 1Chapter 5 Solutions S-5 Increasing from 2-word blocks to 16-word blocks will reduce the tag size from17 bits to 14 bits.In order to determine the number of blocks, we solve the inequality:41984 Ͻϭ 64 ϫ blocks ϩ 15 ϫ blocksSolving this inequality gives us 531 blocks, and rounding to the next power oftwo gives us a 1024-block cache.Th e larger block size may require an increased hit time and an increased misspenalty than the original cache. Th e fewer number of blocks may cause a higherconfl ict miss rate than the original cache.5.2.5 Associative caches are designed to reduce the rate of confl ict misses. Assuch, a sequence of read requests with the same 12-bit index fi eld but a diff erenttag fi eld will generate many misses. For the cache described above, the sequence0, 32768, 0, 32768, 0, 32768, …, would miss on every access, while a 2-way setassociate cache with LRU replacement, even one with a signifi cantly smaller overallcapacity, would hit on every access aft er the fi rst two.5.2.6 Y es, it is possible to use this function to index the cache. However,information about the fi ve bits is lost because the bits are XOR’d, so you mustinclude more tag bits to identify the address in the cache.5.35.3.1 85.3.2 325.3.3 1ϩ (22/8/32) ϭ 1.0865.3.4 35.3.5 0.255.3.6 ϽIndex, tag, dataϾϽ0000012, 00012, mem[1024]ϾϽ0000012, 00112, mem[16]ϾϽ0010112, 00002, mem[176]ϾϽ0010002, 00102, mem[2176]ϾϽ0011102, 00002, mem[224]ϾϽ0010102, 00002, mem[160]ϾS-6 ChapterSolutions55.45.4.1 Th e L1 cache has a low write miss penalty while the L2 cache has a highwrite miss penalty. A write buff er between the L1 and L2 cache would hide thewrite miss latency of the L2 cache. 
Th e L2 cache would benefi t from write buff erswhen replacing a dirty block, since the new block would be read in before the dirtyblock is physically written to memory.5.4.2 On an L1 write miss, the word is written directly to L2 without bringingits block into the L1 cache. If this results in an L2 miss, its block must be broughtinto the L2 cache, possibly replacing a dirty block which must fi rst be written tomemory.5.4.3 Aft er an L1 write miss, the block will reside in L2 but not in L1. A subsequentread miss on the same block will require that the block in L2 be written back tomemory, transferred to L1, and invalidated in L2.5.4.4 One in four instructions is a data read, one in ten instructions is a datawrite. For a CPI of 2, there are 0.5 instruction accesses per cycle, 12.5% of cycleswill require a data read, and 5% of cycles will require a data write.Th e instruction bandwidth is thus (0.0030 ϫ 64) ϫ 0.5 ϭ 0.096 bytes/cycle. Th edata read bandwidth is thus 0.02 ϫ (0.13ϩ0.050) ϫ 64 ϭ 0.23 bytes/cycle. Th etotal read bandwidth requirement is 0.33 bytes/cycle. Th e data write bandwidthrequirement is 0.05 ϫ 4 ϭ 0.2 bytes/cycle.5.4.5 Th e instruction and data read bandwidth requirement is the same as in5.4.4. Th e data write bandwidth requirement becomes 0.02 ϫ 0.30 ϫ (0.13ϩ0.050)ϫ 64 ϭ 0.069 bytes/cycle.5.4.6 For CPIϭ1.5 the instruction throughput becomes 1/1.5 ϭ 0.67 instructionsper cycle. Th e data read frequency becomes 0.25 / 1.5 ϭ 0.17 and the write frequencybecomes 0.10 / 1.5 ϭ 0.067.Th e instruction bandwidth is (0.0030 ϫ 64) ϫ 0.67 ϭ 0.13 bytes/cycle.For the write-through cache, the data read bandwidth is 0.02 ϫ (0.17 ϩ0.067) ϫ64 ϭ 0.22 bytes/cycle. Th e total read bandwidth is 0.35 bytes/cycle. 
The data write bandwidth is 0.067 × 4 = 0.27 bytes/cycle.
For the write-back cache, the data write bandwidth becomes 0.02 × 0.30 × (0.17 + 0.067) × 64 = 0.091 bytes/cycle.

Address | 0 | 4 | 16 | 132 | 232 | 160 | 1024 | 30 | 140 | 3100 | 180 | 2180
Line ID | 0 | 0 | 1 | 8 | 14 | 10 | 0 | 1 | 9 | 1 | 11 | 8
Hit/miss | M | H | M | M | M | M | M | H | H | M | M | M
Replace | N | N | N | N | N | N | Y | N | N | Y | N | Y

5.5
5.5.1 Assuming the addresses are given as byte addresses, each group of 16 accesses will map to the same 32-byte block, so the cache will have a miss rate of 1/16. All misses are compulsory misses. The miss rate is not sensitive to the size of the cache or the size of the working set. It is, however, sensitive to the access pattern and block size.
5.5.2 The miss rates are 1/8, 1/32, and 1/64, respectively. The workload is exploiting spatial locality.
5.5.3 In this case the miss rate is 0.
5.5.4
AMAT for B = 8: 0.040 × (20 × 8) = 6.40
AMAT for B = 16: 0.030 × (20 × 16) = 9.60
AMAT for B = 32: 0.020 × (20 × 32) = 12.80
AMAT for B = 64: 0.015 × (20 × 64) = 19.20
AMAT for B = 128: 0.010 × (20 × 128) = 25.60
B = 8 is optimal.
5.5.5
AMAT for B = 8: 0.040 × (24 + 8) = 1.28
AMAT for B = 16: 0.030 × (24 + 16) = 1.20
AMAT for B = 32: 0.020 × (24 + 32) = 1.12
AMAT for B = 64: 0.015 × (24 + 64) = 1.32
AMAT for B = 128: 0.010 × (24 + 128) = 1.52
B = 32 is optimal.
5.5.6 B = 128
5.6
5.6.1 P1: 1.52 GHz; P2: 1.11 GHz
5.6.2 P1: 6.31 ns, 9.56 cycles; P2: 5.11 ns, 5.68 cycles
5.6.3 P1: 12.64 CPI, 8.34 ns per inst; P2: 7.36 CPI, 6.63 ns per inst
5.6.4 6.50 ns, 9.85 cycles. Worse.
5.6.5 13.04
5.6.6 P1 AMAT = 0.66 ns + 0.08 × 70 ns = 6.26 ns
P2 AMAT = 0.90 ns + 0.06 × (5.62 ns + 0.95 × 70 ns) = 5.23 ns
For P1 to match P2's performance:
5.23 = 0.66 ns + MR × 70 ns
MR = 6.5%
5.7
5.7.1 The cache would have 24 / 3 = 8 blocks per way and thus an index field of 3 bits.

Word address | Binary address | Tag | Index | Hit/Miss | Way 0 | Way 1
3 | 0000 0011 | 0 | 1 | M | T(1)=0 |
180 | 1011 0100 | 11 | 2 | M | T(1)=0, T(2)=11 |
43 | 0010 1011 | 2 | 5 | M | T(1)=0, T(2)=11, T(5)=2 |
2 | 0000 0010 | 0 | 1 | M | T(1)=0, T(2)=11, T(5)=2 | T(1)=0
191 | 1011 1111 | 11 | 7 | M | T(1)=0, T(2)=11, T(5)=2, T(7)=11 | T(1)=0
88 | 0101 1000 | 5 | 4 | M | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5 | T(1)=0
190 | 1011 1110 | 11 | 7 | H | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5 | T(1)=0
14 | 0000 1110 | 0 | 7 | M | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5 | T(1)=0, T(7)=0
181 | 1011 0101 | 11 | 2 | H | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5 | T(1)=0, T(7)=0
44 | 0010 1100 | 2 | 6 | M | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5, T(6)=2 | T(1)=0, T(7)=0
186 | 1011 1010 | 11 | 5 | M | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5, T(6)=2 | T(1)=0, T(7)=0, T(5)=11
253 | 1111 1101 | 15 | 6 | M | T(1)=0, T(2)=11, T(5)=2, T(7)=11, T(4)=5, T(6)=2 | T(1)=0, T(7)=0, T(5)=11, T(6)=15

5.7.2 Since this cache is fully associative and has one-word blocks, the word address is equivalent to the tag. The only possible way for there to be a hit is a repeated reference to the same word, which doesn't occur for this sequence.

Address | Hit/Miss | Contents
3 | M | 3
180 | M | 3, 180
43 | M | 3, 180, 43
2 | M | 3, 180, 43, 2
191 | M | 3, 180, 43, 2, 191
88 | M | 3, 180, 43, 2, 191, 88
190 | M | 3, 180, 43, 2, 191, 88, 190
14 | M | 3, 180, 43, 2, 191, 88, 190, 14
181 | M | 181, 180, 43, 2, 191, 88, 190, 14
44 | M | 181, 44, 43, 2, 191, 88, 190, 14
186 | M | 181, 44, 186, 2, 191, 88, 190, 14
253 | M | 181, 44, 186, 253, 191, 88, 190, 14

5.7.3

Address | Tag | Hit/Miss | Contents
3 | 1 | M | 1
180 | 90 | M | 1, 90
43 | 21 | M | 1, 90, 21
2 | 1 | H | 1, 90, 21
191 | 95 | M | 1, 90, 21, 95
88 | 44 | M | 1, 90, 21, 95, 44
190 | 95 | H | 1, 90, 21, 95, 44
14 | 7 | M | 1, 90, 21, 95, 44, 7
181 | 90 | H | 1, 90, 21, 95, 44, 7
44 | 22 | M | 1, 90, 21, 95, 44, 7, 22
186 | 143 | M | 1, 90, 21, 95, 44, 7, 22, 143
253 | 126 | M | 1, 90, 126, 95, 44, 7, 22, 143

The final reference replaces tag 21 in the cache, since tags 1 and 90 had been re-used at time = 3 and time = 8 while 21 hadn't been used since time = 2.
Miss rate = 9/12 = 75%
This is the best possible miss rate, since there were no misses on any block that had been previously evicted from the cache.
In fact, the only eviction was for tag 21, which is only referenced once.
5.7.4 L1 only:
.07 × 100 = 7 ns
CPI = 7 ns / .5 ns = 14
Direct mapped L2:
.07 × (12 + 0.035 × 100) = 1.1 ns
CPI = ceiling(1.1 ns / .5 ns) = 3
8-way set associative L2:
.07 × (28 + 0.015 × 100) = 2.1 ns
CPI = ceiling(2.1 ns / .5 ns) = 5
Doubled memory access time, L1 only:
.07 × 200 = 14 ns
CPI = 14 ns / .5 ns = 28
Doubled memory access time, direct mapped L2:
.07 × (12 + 0.035 × 200) = 1.3 ns
CPI = ceiling(1.3 ns / .5 ns) = 3
Doubled memory access time, 8-way set associative L2:
.07 × (28 + 0.015 × 200) = 2.2 ns
CPI = ceiling(2.2 ns / .5 ns) = 5
Halved memory access time, L1 only:
.07 × 50 = 3.5 ns
CPI = 3.5 ns / .5 ns = 7
Halved memory access time, direct mapped L2:
.07 × (12 + 0.035 × 50) = 1.0 ns
CPI = ceiling(1.0 ns / .5 ns) = 2
Halved memory access time, 8-way set associative L2:
.07 × (28 + 0.015 × 50) = 2.1 ns
CPI = ceiling(2.1 ns / .5 ns) = 5
5.7.5 .07 × (12 + 0.035 × (50 + 0.013 × 100)) = 1.0 ns
Adding the L3 cache does reduce the overall memory access time, which is the main advantage of having an L3 cache. The disadvantage is that the L3 cache takes real estate away from other types of resources, such as functional units.
5.7.6 Even if the miss rate of the L2 cache was 0, a 50 ns access time gives AMAT = .07 × 50 = 3.5 ns, which is greater than the 1.1 ns and 2.1 ns given by the on-chip L2 caches. As such, no size will achieve the performance goal.
5.8
5.8.1 1096 days, or 26304 hours
5.8.2 0.9990875912 (about 99.9%)
5.8.3 Availability approaches 1.0. With the emergence of inexpensive drives, having a nearly 0 replacement time for hardware is quite feasible. However, replacing file systems and other data can take significant time. Although a drive manufacturer will not include this time in their statistics, it is certainly a part of replacing a disk.
5.8.4 MTTR becomes the dominant factor in determining availability. However, availability would be quite high if MTTF also grew measurably.
If MTTF is 1000 times MTTR, the specific value of MTTR is not significant.
5.9
5.9.1 Need to find the minimum p such that 2^p >= p + d + 1 and then add one. Thus 9 total bits are needed for SEC/DED.
5.9.2 The (72,64) code described in the chapter requires an overhead of 8/64 = 12.5% additional bits to tolerate the loss of any single bit within 72 bits, providing a protection rate of 1.4%. The (137,128) code from part a requires an overhead of 9/128 = 7.0% additional bits to tolerate the loss of any single bit within 137 bits, providing a protection rate of 0.73%. The cost/performance of both codes is as follows:
(72,64) code => 12.5/1.4 = 8.9
(137,128) code => 7.0/0.73 = 9.6
The (72,64) code has a better cost/performance ratio.
5.9.3 Using the bit numbering from section 5.5, bit 8 is in error so the value would be corrected to 0x365.
5.10 Instructors can change the disk latency, transfer rate and optimal page size for more variants. Refer to Jim Gray's paper on the five-minute rule ten years later.
5.10.1 32 KB
5.10.2 Still 32 KB
5.10.3 64 KB. Because the disk bandwidth grows much faster than seek latency, future paging cost will be closer to constant, thus favoring larger pages.
5.10.4 1987/1997/2007: 205/267/308 seconds (or roughly five minutes).
5.10.5 1987/1997/2007: 51/533/4935 seconds (or 10 times longer for every 10 years).
5.10.6 (1) DRAM cost/MB scaling trend dramatically slows down; or (2) disk $/access/sec dramatically increases.
(2) is more likely to happen due to the emerging flash technology.
5.11
5.11.1

Address | Virtual page | TLB H/M | TLB entries (valid, tag, physical page)
4669 | 1 | TLB miss, PT hit, PF | (1, 11, 12) (1, 7, 4) (1, 3, 6) (1 (last access 0), 1, 13)
2227 | 0 | TLB miss, PT hit | (1 (last access 1), 0, 5) (1, 7, 4) (1, 3, 6) (1 (last access 0), 1, 13)
13916 | 3 | TLB hit | (1 (last access 1), 0, 5) (1, 7, 4) (1 (last access 2), 3, 6) (1 (last access 0), 1, 13)
34587 | 8 | TLB miss, PT hit, PF | (1 (last access 1), 0, 5) (1 (last access 3), 8, 14) (1 (last access 2), 3, 6) (1 (last access 0), 1, 13)
48870 | 11 | TLB miss, PT hit | (1 (last access 1), 0, 5) (1 (last access 3), 8, 14) (1 (last access 2), 3, 6) (1 (last access 4), 11, 12)
12608 | 3 | TLB hit | (1 (last access 1), 0, 5) (1 (last access 3), 8, 14) (1 (last access 5), 3, 6) (1 (last access 4), 11, 12)
49225 | 12 | TLB miss, PT miss | (1 (last access 6), 12, 15) (1 (last access 3), 8, 14) (1 (last access 5), 3, 6) (1 (last access 4), 11, 12)

5.11.2

Address | Virtual page | TLB H/M | TLB entries (valid, tag, physical page)
4669 | 0 | TLB miss, PT hit | (1, 11, 12) (1, 7, 4) (1, 3, 6) (1 (last access 0), 0, 5)
2227 | 0 | TLB hit | (1, 11, 12) (1, 7, 4) (1, 3, 6) (1 (last access 1), 0, 5)
13916 | 0 | TLB hit | (1, 11, 12) (1, 7, 4) (1, 3, 6) (1 (last access 2), 0, 5)
34587 | 2 | TLB miss, PT hit, PF | (1 (last access 3), 2, 13) (1, 7, 4) (1, 3, 6) (1 (last access 2), 0, 5)
48870 | 2 | TLB hit | (1 (last access 4), 2, 13) (1, 7, 4) (1, 3, 6) (1 (last access 2), 0, 5)
12608 | 0 | TLB hit | (1 (last access 4), 2, 13) (1, 7, 4) (1, 3, 6) (1 (last access 5), 0, 5)
49225 | 3 | TLB hit | (1 (last access 4), 2, 13) (1, 7, 4) (1 (last access 6), 3, 6) (1 (last access 5), 0, 5)

A larger page size reduces the TLB miss rate but can lead to higher fragmentation and lower utilization of the physical memory.
5.11.3
Two-way set associative

Address | Virtual page | Tag | Index | TLB H/M | TLB entries (valid, tag, physical page, index)
4669 | 1 | 0 | 1 | TLB miss, PT hit, PF | (1, 11, 12, 0) (1, 7, 4, 1) (1, 3, 6, 0) (1 (last access 0), 0, 13, 1)
2227 | 0 | 0 | 0 | TLB miss, PT hit | (1 (last access 1), 0, 5, 0) (1, 7, 4, 1) (1, 3, 6, 0) (1 (last access 0), 0, 13, 1)
13916 | 3 | 1 | 1 | TLB miss, PT hit | (1 (last access 1), 0, 5, 0) (1 (last access 2), 1, 6, 1) (1, 3, 6, 0) (1 (last access 0), 1, 13, 1)
34587 | 8 | 4 | 0 | TLB miss, PT hit, PF | (1 (last access 1), 0, 5, 0) (1 (last access 2), 1, 6, 1) (1 (last access 3), 4, 14, 0) (1 (last access 0), 1, 13, 1)
48870 | 11 | 5 | 1 | TLB miss, PT hit | (1 (last access 1), 0, 5, 0) (1 (last access 2), 1, 6, 1) (1 (last access 3), 4, 14, 0) (1 (last access 4), 5, 12, 1)
12608 | 3 | 1 | 1 | TLB hit | (1 (last access 1), 0, 5, 0) (1 (last access 5), 1, 6, 1) (1 (last access 3), 4, 14, 0) (1 (last access 4), 5, 12, 1)
49225 | 12 | 6 | 0 | TLB miss, PT miss | (1 (last access 6), 6, 15, 0) (1 (last access 5), 1, 6, 1) (1 (last access 3), 4, 14, 0) (1 (last access 4), 5, 12, 1)

Direct mapped

Address | Virtual page | Tag | Index | TLB H/M | TLB entries (valid, tag, physical page, index)
4669 | 1 | 0 | 1 | TLB miss, PT hit, PF | (1, 11, 12, 0) (1, 0, 13, 1) (1, 3, 6, 2) (0, 4, 9, 3)
2227 | 0 | 0 | 0 | TLB miss, PT hit | (1, 0, 5, 0) (1, 0, 13, 1) (1, 3, 6, 2) (0, 4, 9, 3)
13916 | 3 | 0 | 3 | TLB miss, PT hit | (1, 0, 5, 0) (1, 0, 13, 1) (1, 3, 6, 2) (1, 0, 6, 3)
34587 | 8 | 2 | 0 | TLB miss, PT hit, PF | (1, 2, 14, 0) (1, 0, 13, 1) (1, 3, 6, 2) (1, 0, 6, 3)
48870 | 11 | 2 | 3 | TLB miss, PT hit | (1, 2, 14, 0) (1, 0, 13, 1) (1, 3, 6, 2) (1, 2, 12, 3)
12608 | 3 | 0 | 3 | TLB miss, PT hit | (1, 2, 14, 0) (1, 0, 13, 1) (1, 3, 6, 2) (1, 0, 6, 3)
49225 | 12 | 3 | 0 | TLB miss, PT miss | (1, 3, 15, 0) (1, 0, 13, 1) (1, 3, 6, 2) (1, 0, 6, 3)

All memory references must be cross-referenced against the page table, and the TLB allows this to be performed without accessing off-chip memory (in the common case). If there were no TLB, memory access time would increase significantly.
5.11.4 Assumption: "half the memory available" means half of the 32-bit virtual address space for each running application.
The tag size is 32 − log2(8192) = 32 − 13 = 19 bits. All five page tables would require 5 × (2^19 / 2 × 4) bytes = 5 MB.
5.11.5 In the two-level approach, the 2^19 page table entries are divided into 256 segments that are allocated on demand. Each of the second-level tables contains 2^(19−8) = 2048 entries, requiring 2048 × 4 = 8 KB each and covering 2048 × 8 KB = 16 MB (2^24) of the virtual address space.
If we assume that "half the memory" means 2^31 bytes, then the minimum amount of memory required for the second-level tables would be 5 × (2^31 / 2^24) × 8 KB = 5 MB. The first-level tables would require an additional 5 × 128 × 6 bytes = 3840 bytes.
The maximum amount would be if all segments were activated, requiring the use of all 256 segments in each application. This would require 5 × 256 × 8 KB = 10 MB for the second-level tables and 7680 bytes for the first-level tables.
5.11.6 The page index consists of address bits 12 down to 0, so the LSB of the tag is address bit 13.
A 16 KB direct-mapped cache with 2 words per block would have 8-byte blocks and thus 16 KB / 8 bytes = 2048 blocks, and its index field would span address bits 13 down to 3 (11 bits to index, 1 bit word offset, 2 bit byte offset).
As such, the tag LSB of the cache tag is address bit 14.
The designer would instead need to make the cache 2-way associative to increase its size to 16 KB.
5.12
5.12.1 Worst case is 2^(43−12) entries, requiring 2^(43−12) × 4 bytes = 2^33 bytes = 8 GB.
5.12.2 With only two levels, the designer can select the size of each page table segment. In a multi-level scheme, reading a PTE requires an access to each level of the table.
5.12.3 In an inverted page table, the number of PTEs can be reduced to the size of the hash table plus the cost of collisions. In this case, serving a TLB miss requires an extra reference to compare the tag or tags stored in the hash table.
5.12.4 It would be invalid if it was paged out to disk.
5.12.5 A write to page 30 would generate a TLB miss. Software-managed TLBs are faster in cases where the software can pre-fetch TLB entries.
5.12.6 When an instruction writes to VA page 200, an interrupt would be generated because the page is marked as read only.
5.13
5.13.1 0 hits
5.13.2 1 hit
5.13.3 1 hit or fewer
5.13.4 1 hit. Any address sequence is fine so long as the number of hits is correct.
5.13.5 The best block to evict is the one that will cause the fewest misses in the future. Unfortunately, a cache controller cannot know the future! Our best alternative is to make a good prediction.
5.13.6 If you knew that an address had limited temporal locality and would conflict with another block in the cache, it could improve miss rate. On the other hand, you could worsen the miss rate by choosing poorly which addresses to cache.
5.14
5.14.1 Shadow page table: (1) VM creates page table, hypervisor updates shadow table; (2) nothing; (3) hypervisor intercepts page fault, creates new mapping, and invalidates the old mapping in TLB; (4) VM notifies the hypervisor to invalidate the process's TLB entries. Nested page table: (1) VM creates new page table, hypervisor adds new mappings in PA to MA table.
(2) Hardware walks both page tables to translate VA to MA; (3) VM and hypervisor update their page tables, hypervisor invalidates stale TLB entries; (4) same as shadow page table.
5.14.2 Native: 4; NPT: 24 (instructors can change the levels of page table)
Native: L; NPT: L × (L + 2)
5.14.3 Shadow page table: page fault rate.
NPT: TLB miss rate.
5.14.4 Shadow page table: 1.03
NPT: 1.04
5.14.5 Combining multiple page table updates
5.14.6 NPT caching (similar to TLB caching)
5.15
5.15.1 CPI = 1.5 + 120/10000 × (15 + 175) = 3.78
If VMM performance impact doubles => CPI = 1.5 + 120/10000 × (15 + 350) = 5.88
If VMM performance impact halves => CPI = 1.5 + 120/10000 × (15 + 87.5) = 2.73
5.15.2 Non-virtualized CPI = 1.5 + 30/10000 × 1100 = 4.80
Virtualized CPI = 1.5 + 120/10000 × (15 + 175) + 30/10000 × (1100 + 175) = 7.60
Virtualized CPI with half I/O = 1.5 + 120/10000 × (15 + 175) + 15/10000 × (1100 + 175) = 5.69
I/O traps often require long periods of execution time that can be performed in the guest O/S, with only a small portion of that time needing to be spent in the VMM. As such, the impact of virtualization is less for I/O bound applications.
5.15.3 Virtual memory aims to provide each application with the illusion of the entire address space of the machine. Virtual machines aim to provide each operating system with the illusion of having the entire machine at its disposal. Thus they both serve very similar goals, and offer benefits such as increased security. Virtual memory can allow many applications running in the same memory space to not have to manage keeping their memory separate.
5.15.4 Emulating a different ISA requires specific handling of that ISA's API. Each ISA has specific behaviors that will happen upon instruction execution, interrupts, trapping to kernel mode, etc. that therefore must be emulated. This can require many more instructions to be executed to emulate each instruction than was originally necessary in the target ISA.
This can cause a large performance impact and make it difficult to properly communicate with external devices. An emulated system can potentially run faster than on its native ISA if the emulated code can be dynamically examined and optimized. For example, if the underlying machine's ISA has a single instruction that can handle the execution of several of the emulated system's instructions, then potentially the number of instructions executed can be reduced. This is similar to the case with the recent Intel processors that do micro-op fusion, allowing several instructions to be handled by fewer instructions.
5.16
5.16.1 The cache should be able to satisfy the request since it is otherwise idle when the write buffer is writing back to memory. If the cache is not able to satisfy hits while writing back from the write buffer, the cache will perform little or no better than the cache without the write buffer, since requests will still be serialized behind writebacks.
5.16.2 Unfortunately, the cache will have to wait until the writeback is complete since the memory channel is occupied. Once the memory channel is free, the cache is able to issue the read request to satisfy the miss.
5.16.3 Correct solutions should exhibit the following features:
1. The memory read should come before memory writes.
2. The cache should signal "Ready" to the processor before completing the write.
Example (simpler solutions exist; the state machine is somewhat underspecified in the chapter):
[State machine diagram not recoverable from the extracted text.]
5.17
5.17.1 There are 6 possible orderings for these instructions.
Ordering 1: Results: (5,5)
Ordering 2: Results: (5,5)
Ordering 3: Results: (6,3)
Ordering 4: Results: (5,3)
Ordering 5: Results: (6,5)
Ordering 6: Results: (6,3)
If coherency isn't ensured:
P2's operations take precedence over P1's: (5,2)
5.17.2
5.17.3 Best case:
Orderings 1 and 6 above, which require only two total misses.
Worst case:
Orderings 2 and 3 above, which require 4 total cache misses.
5.17.4
Ordering 1: Result: (3,3)
Ordering 2: Result: (2,3)
Ordering 3: Result: (2,3)
Ordering 4: Result: (0,3)
Ordering 5: Result: (0,3)
Ordering 6: Result: (2,3)
Ordering 7: Result: (2,3)
Ordering 8: Result: (0,3)
Ordering 9: Result: (0,3)
Ordering 10: Result: (2,1)
Ordering 11: Result: (0,1)
Ordering 12: Result: (0,1)
Ordering 13: Result: (0,1)
Ordering 14: Result: (0,1)
Ordering 15: Result: (0,0)
5.17.5 Assume B = 0 is seen by P2 but not preceding A = 1
Result: (2,0)
5.17.6 Write back is simpler than write through, since it facilitates the use of exclusive access blocks and lowers the frequency of invalidates. It prevents the use of write-broadcasts, but this is a more complex protocol.
The allocation policy has little effect on the protocol.
5.18
5.18.1 Benchmark A
AMAT_private = (1/32) × 5 + 0.0030 × 180 = 0.70
AMAT_shared = (1/32) × 20 + 0.0012 × 180 = 0.84
Benchmark B
AMAT_private = (1/32) × 5 + 0.0006 × 180 = 0.26
AMAT_shared = (1/32) × 20 + 0.0003 × 180 = 0.68
Private cache is superior for both benchmarks.
5.18.2 Shared cache latency doubles for shared cache.
Memory latency doubles for private cache.
Benchmark A
AMAT_private = (1/32) × 5 + 0.0030 × 360 = 1.24
AMAT_shared = (1/32) × 40 + 0.0012 × 180 = 1.47
Benchmark B
AMAT_private = (1/32) × 5 + 0.0006 × 360 = 0.37
AMAT_shared = (1/32) × 40 + 0.0003 × 180 = 1.30
Private is still superior for both benchmarks.
5.18.3
5.18.4 A non-blocking shared L2 cache would reduce the latency of the L2 cache by allowing hits for one CPU to be serviced while a miss is serviced for another CPU, or allow for misses from both CPUs to be serviced simultaneously. A non-blocking private L2 would reduce latency assuming that multiple memory instructions can be executed concurrently.
5.18.5 4 times.
5.18.6 Additional DRAM bandwidth, dynamic memory schedulers, multi-banked memory systems, higher cache associativity, and additional levels of cache.
Processor: out-of-order execution, larger load/store queue, multiple hardware threads;
Caches: more miss status handling registers (MSHR)
Memory: memory controller to support multiple outstanding memory requests
5.19
5.19.1 srcIP and refTime fields. 2 misses per entry.
5.19.2 Group the srcIP and refTime fields into a separate array.
5.19.3 peak_hour (int status); // peak hours of a given status
Group srcIP, refTime and status together.
5.19.4 Answers will vary depending on which data set is used.
Conflict misses do not occur in fully associative caches.
Compulsory (cold) misses are not affected by associativity.
Capacity miss rate is computed by subtracting the compulsory miss rate and the fully associative miss rate (compulsory + capacity misses) from the total miss rate. Conflict miss rate is computed by subtracting the cold and the newly computed capacity miss rate from the total miss rate.
The values reported are miss rate per instruction, as opposed to miss rate per memory instruction.
5.19.5 Answers will vary depending on which data set is used.
5.19.6 apsi/mesa/ammp/mcf all have such examples.
Example cache: 4-block caches, direct-mapped vs. 2-way LRU.
Reference stream (blocks): 1 2 2 6 1.
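The AMAT comparisons in 5.18.1 and 5.18.2 all use the same two-term expression, so they can be rechecked mechanically. The sketch below is only an illustration: the function name `amat`, its parameter names, and the reading of 1/32 as the fraction of accesses that reach the L2 are assumptions, not taken from the solution text. The latencies and miss rates are the ones quoted in 5.18.1.

```python
def amat(l2_latency, mem_latency, mem_miss_rate, l2_access_rate=1 / 32):
    """Two-term AMAT used in 5.18.1: L2 term plus memory term, in cycles.

    Parameter names are hypothetical; only the numbers come from the solutions.
    """
    return l2_access_rate * l2_latency + mem_miss_rate * mem_latency


# (L2 latency, memory latency, miss rate) for each benchmark and organization
cases = {
    ("A", "private"): (5, 180, 0.0030),
    ("A", "shared"): (20, 180, 0.0012),
    ("B", "private"): (5, 180, 0.0006),
    ("B", "shared"): (20, 180, 0.0003),
}

results = {key: amat(*args) for key, args in cases.items()}

for (bench, kind), value in results.items():
    print(f"Benchmark {bench}, {kind}: {value:.2f}")
```

Changing the latencies to the 5.18.2 values (memory latency 360 for the private case, L2 latency 40 for the shared case) should reproduce that part as well.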

Principles of Computer Organization, 5th Edition: Answers

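Principles of Computer Organization is a core course in computer science and technology programs. It gives students a basic understanding of computer hardware structure and of how that hardware works.

This article answers some of the questions from Principles of Computer Organization, 5th Edition, to help learners understand and master the course material.

1. What is computer organization? Computer organization is the discipline that studies how computer hardware systems are built and how they work. It mainly covers the basic structure of a computer, the instruction set, data representation, the arithmetic unit, the control unit, the memory hierarchy, and the input/output system.

Studying it gives a deeper understanding of how a computer works internally and lays a solid foundation for later courses such as computer architecture, operating systems, and compiler principles.

2. What are the basic components of a computer? The basic components are the central processing unit (CPU), memory, input devices, output devices, and external devices.

The CPU is the core component. It comprises the arithmetic unit and the control unit, and it executes the instructions of a program and controls the operation of the computer.

Memory stores data and programs, input devices receive external data, output devices deliver the results of processing, and external devices exchange data with the computer.

3. What is an instruction set? An instruction set is the collection of machine instructions used for control and operation in a computer. It includes data transfer instructions, arithmetic instructions, logic instructions, and control transfer instructions.

The design and implementation of the instruction set strongly affect a computer's performance and capabilities; a well-designed instruction set improves both efficiency and flexibility.

4. How is data represented in a computer? The main representations are sign-magnitude, ones' complement, two's complement, and floating point.

Sign-magnitude is the most basic representation. For a negative number, the ones' complement keeps the sign bit of the sign-magnitude form and inverts the remaining bits, and the two's complement is the ones' complement plus 1. For a non-negative number, all three forms are identical.

Floating-point representation stores real numbers in scientific notation, with an exponent and a mantissa, and can represent very large or very small values.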
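The relationship among the three integer encodings in answer 4 can be checked with a short sketch. This is an illustration under stated assumptions: an 8-bit width is chosen arbitrarily, and the function name `encodings` and the dictionary keys are hypothetical, not from the original answer.

```python
def encodings(value: int, bits: int = 8) -> dict:
    """Return the sign-magnitude, ones'-complement and two's-complement bit strings."""
    limit = 1 << (bits - 1)
    # Restrict to the range all three encodings can represent (e.g. -127..127 for 8 bits).
    if not -limit < value < limit:
        raise ValueError("value out of range for all three encodings")
    mask = (1 << bits) - 1
    magnitude = abs(value)
    if value >= 0:
        # Non-negative numbers look the same in all three encodings.
        sign_mag = ones = twos = magnitude
    else:
        sign_mag = limit | magnitude      # set the sign bit, keep the magnitude
        ones = sign_mag ^ (mask >> 1)     # keep the sign bit, invert the other bits
        twos = (ones + 1) & mask          # ones' complement plus one
    fmt = f"0{bits}b"
    return {
        "sign_magnitude": format(sign_mag, fmt),
        "ones_complement": format(ones, fmt),
        "twos_complement": format(twos, fmt),
    }


print(encodings(-5))
print(encodings(5))
```

For -5 the three bit strings differ, while for +5 they coincide, which matches the rule stated in the answer.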

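5. What levels make up the memory hierarchy? The memory hierarchy consists of registers, cache, main memory, and secondary storage.

Registers are storage inside the CPU, with the highest speed and the smallest capacity. Cache is a buffer memory between the CPU and main memory, fairly fast and fairly small. Main memory is the computer's primary storage, with moderate speed and larger capacity. Secondary storage is external storage, slower but very large.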

A Course in Computer Organization Principles, 5th Edition (Zhang Jiwen): Answers to the Long-Form End-of-Chapter Exercises

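Difference between ROM and RAM: ROM is read-only memory and is non-volatile; RAM is random-access memory and is volatile.

Storage cell: the storage space that holds one bit of data.

Storage unit: the storage space that holds one word or one byte of data.

Memory bank: the component of a memory that actually holds the stored data.

Storage unit address: each storage unit is numbered so that data accesses are not confused. A storage unit (2) is made up of many storage cells (1), and the cell is the smallest unit of data storage. A memory bank (3) is made up of many storage units (2), and the unit is the basic unit of data storage. Numbering the units to avoid confusion gives the storage unit addresses (4).

Memory hierarchy: it resolves the trade-off among cost, speed, and capacity. The condition that makes it work is the locality of memory accesses, and the measure of a memory system is its hit rate.

Physical space and logical (virtual) space: what they share is that both are used to store data. They differ in that physical space really exists, as the memory modules on the motherboard, while virtual space is part of the disk set aside by the computer to act as memory. Physical memory is faster to access than virtual memory. Physical memory is generally fixed in size, while logical space can vary. Physical memory communicates with the CPU, while virtual memory addresses a shortage of physical memory.

With virtual memory, the memory space is extended: part of the data is placed in virtual space and the operating system moves it in and out automatically, which is simple and flexible.

Without virtual memory, memory space runs short: data has to be loaded in batches, and the user has to do this manually.

Can an enlarged cache replace main memory? No. The cost is too high, and the speeds of two adjacent levels must not differ too much.

Before the electronic era, could an automatically operating computer have been built successfully? No. Internal power and a program are the conditions for automatic operation, and internal power has to be provided by electronic components.

Chinese characters need an input code; does English need one? Yes. The ASCII code of an English letter serves as its internal machine code.

Reducing data distortion: increase the sampling frequency and the quantization precision.

Role of the instruction set: the instruction set is an extension of the CPU's functions and states which basic operations the CPU performs. The CPU's job is to fetch, decode, and execute instructions, so the functions a CPU design is based on also come from the instruction set. It is the basic reference for CPU design.

Operating speed depends on the CPU clock frequency, main memory capacity depends on the installed memory, and computational precision depends on the machine word length.

Computer performance is evaluated by operating speed, machine word length, storage capacity, bandwidth balance, environmental friendliness and energy efficiency, user-friendliness, price/performance ratio, reliability, and availability.

Controller control modes and their characteristics. Fixed-length CPU cycle: the timing circuit is simple but time is wasted. Variable-length CPU cycle: time is allocated as needed, which effectively raises execution speed but makes the timing components more complex. Extended-beat method: it combines central and local control and keeps the advantages of both centralized and distributed control.

Relationship between machine instructions and microinstructions: each machine instruction is interpreted and executed by a microprogram made up of microinstructions.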

Principles of Computer Organization, 5th Edition: End-of-Chapter Exercise Answers

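Principles of Computer Organization
Chapter 1
4. What is the main design idea of the von Neumann computer? What are its main components? Answer: The design idea is that a digital computer uses binary and that the computer executes in the order given by the program.

The main components are the arithmetic unit, the controller, the memory, and the input and output devices.

5. What is storage capacity? What is a unit address? What is a data word? What is an instruction word?
Answer: The total number of storage units in a memory is called its storage capacity.

A memory is made up of many storage units. Each unit has a number, which is called its unit address.

A word that represents data to be processed is called a data word.

A word that is an instruction is called an instruction word.

6. What is an instruction? What is a program?
Answer: Each basic operation (such as addition, subtraction, multiplication, or division) is an instruction.

A sequence of instructions that solves a particular problem is a program.

7. Instructions and data are both stored in memory. How does the computer tell whether a word is an instruction or data?
Answer: In general, the information read from memory during the fetch cycle is the instruction stream, which flows to the controller. The information read from memory during the execute cycle is the data stream, which flows from memory to the arithmetic unit.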

Principles of Computer Organization, 5th Edition: End-of-Chapter Answers

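Chapter 1: Overview of Computer Systems
1. What are a computer system, computer hardware, and computer software? Which is more important, hardware or software? Solution (p. 3): Computer system: the combination of the computer's hardware system and software system.

Computer hardware: the electronic circuits and physical devices in the computer.

Computer software: the programs and related materials the computer needs in order to run.

Hardware and software depend on each other in a computer system and neither can be missing, so they are equally important.

5. What are the characteristics of the von Neumann computer? Solution (p. 8): The characteristics of the von Neumann computer are:
● The computer consists of five major parts: arithmetic unit, controller, memory, input devices, and output devices.
● Instructions and data are stored in memory on an equal footing and can be accessed by address.
● Instructions and data are both represented in binary.
● An instruction consists of an opcode and an address code. The opcode gives the nature of the operation, and the address code gives the location of the operand in memory.
● Instructions are stored sequentially in memory and are normally fetched and executed in sequence automatically.
● The machine is centered on the arithmetic unit (the original von Neumann machine).

7. Explain the following concepts: host, CPU, main memory, storage unit, storage element, storage primitive, storage cell, storage word, storage word length, storage capacity, machine word length, instruction word length.

Solution (pp. 9-10): Host: the main part of the computer hardware, formed by the CPU and the main memory (MM).

CPU: the central processing unit, the core hardware component, made up of the arithmetic unit and the controller. (In early machines the arithmetic unit and controller were not on the same chip. A modern CPU contains the arithmetic unit and the controller and also integrates cache.)

Main memory: the memory that holds the programs and data currently in use. It is the computer's main working memory, supports random access, and consists of the memory bank, various logic components, and control circuits.

Storage unit: a unit of storage that can hold one machine word and has its own storage address.

Storage element: the physical element that stores one bit of binary information. It is the smallest unit of storage in a memory, is also called a storage primitive or storage cell, and cannot be accessed on its own.

Storage word: the logical unit formed by the binary code held in one storage unit.

Storage word length: the number of bits of binary code held in one storage unit.

Storage capacity: the total amount of binary code a memory can hold. (Main and secondary storage capacities are usually stated separately.)

Machine word length: the number of bits of binary data the CPU can process at one time, usually related to the width of the CPU's registers.

Instruction word length: the number of bits in the binary code of one instruction.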

Computer Organization and Design, 5th Edition: Answers (CH01_Solution)

Chapter 1 Solutions
1.1 Personal computer (includes workstation and laptop): Personal computers emphasize delivery of good performance to single users at low cost and usually execute third-party software.
Personal mobile device (PMD, includes tablets): PMDs are battery operated with wireless connectivity to the Internet and typically cost hundreds of dollars, and, like PCs, users can download software ("apps") to run on them. Unlike PCs, they no longer have a keyboard and mouse, and are more likely to rely on a touch-sensitive screen or even speech input.
Server: Computer used to run large problems and usually accessed via a network.
Warehouse scale computer: Thousands of processors forming a large cluster.
Supercomputer: Computer composed of hundreds to thousands of processors and terabytes of memory.
Embedded computer: Computer designed to run one application or one set of related applications and integrated into a single system.
1.2
a. Performance via Pipelining
b. Dependability via Redundancy
c. Performance via Prediction
d. Make the Common Case Fast
e. Hierarchy of Memories
f. Performance via Parallelism
g. Design for Moore's Law
h. Use Abstraction to Simplify Design
1.3 The program is compiled into an assembly language program, which is then assembled into a machine language program.
1.4
a. 1280 × 1024 pixels = 1,310,720 pixels => 1,310,720 × 3 = 3,932,160 bytes/frame.
b. 3,932,160 bytes × (8 bits/byte) / 100E6 bits/second = 0.31 seconds
1.5
a. performance of P1 (instructions/sec) = 3 × 10^9 / 1.5 = 2 × 10^9
performance of P2 (instructions/sec) = 2.5 × 10^9 / 1.0 = 2.5 × 10^9
performance of P3 (instructions/sec) = 4 × 10^9 / 2.2 = 1.8 × 10^9
b. cycles(P1) = 10 × 3 × 10^9 = 30 × 10^9
cycles(P2) = 10 × 2.5 × 10^9 = 25 × 10^9
cycles(P3) = 10 × 4 × 10^9 = 40 × 10^9
c. No. instructions(P1) = 30 × 10^9 / 1.5 = 20 × 10^9
No. instructions(P2) = 25 × 10^9 / 1 = 25 × 10^9
No. instructions(P3) = 40 × 10^9 / 2.2 = 18.18 × 10^9
CPI_new = CPI_old × 1.2, then CPI(P1) = 1.8, CPI(P2) = 1.2, CPI(P3) = 2.6
f = No. instr. × CPI / time, then
f(P1) = 20 × 10^9 × 1.8 / 7 = 5.14 GHz
f(P2) = 25 × 10^9 × 1.2 / 7 = 4.28 GHz
f(P3) = 18.18 × 10^9 × 2.6 / 7 = 6.75 GHz
1.6
a. Class A: 10^5 instr. Class B: 2 × 10^5 instr. Class C: 5 × 10^5 instr. Class D: 2 × 10^5 instr.
Time = No. instr. × CPI / clock rate
Total time P1 = (10^5 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 3) / (2.5 × 10^9) = 10.4 × 10^−4 s
Total time P2 = (10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 2) / (3 × 10^9) = 6.66 × 10^−4 s
CPI(P1) = 10.4 × 10^−4 × 2.5 × 10^9 / 10^6 = 2.6
CPI(P2) = 6.66 × 10^−4 × 3 × 10^9 / 10^6 = 2.0
b. clock cycles(P1) = 10^5 × 1 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 3 = 26 × 10^5
clock cycles(P2) = 10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 2 = 20 × 10^5
1.7
a. CPI = T_exec × f / No. instr.
Compiler A CPI = 1.1
Compiler B CPI = 1.25
b. f_B / f_A = (No. instr.(B) × CPI(B)) / (No. instr.(A) × CPI(A)) = 1.37
c. T_A / T_new = 1.67
T_B / T_new = 2.27
1.8
1.8.1 C = 2 × DP / (V^2 × F)
Pentium 4: C = 3.2E−8 F
Core i5 Ivy Bridge: C = 2.9E−8 F
1.8.2 Pentium 4: 10/100 = 10%
Core i5 Ivy Bridge: 30/70 = 42.9%
1.8.3 (S_new + D_new) / (S_old + D_old) = 0.90
D_new = C × V_new^2 × F
S_old = V_old × I
S_new = V_new × I
Therefore:
V_new = [D_new / (C × F)]^(1/2)
D_new = 0.90 × (S_old + D_old) − S_new
S_new = V_new × (S_old / V_old)
Pentium 4:
S_new = V_new × (10/1.25) = V_new × 8
D_new = 0.90 × 100 − V_new × 8 = 90 − V_new × 8
V_new = [(90 − V_new × 8) / (3.2E−8 × 3.6E9)]^(1/2)
V_new = 0.85 V
Core i5:
S_new = V_new × (30/0.9) = V_new × 33.3
D_new = 0.90 × 70 − V_new × 33.3 = 63 − V_new × 33.3
V_new = [(63 − V_new × 33.3) / (2.9E−8 × 3.4E9)]^(1/2)
V_new = 0.64 V
1.9
1.9.1
1 | 2.56E9 | 1.28E9 | 2.56E8 | 7.94E10 | 39.7 | 1
2 | 1.83E9 | 9.14E8 | 2.56E8 | 5.67E10 | 28.3 | 1.4
4 | 9.12E8 | 4.57E8 | 2.56E8 | 2.83E10 | 14.2 | 2.8
8 | 4.57E8 | 2.29E8 | 2.56E8 | 1.42E10 | 7.10 | 5.6
1.9.2
1 | 41.0
2 | 29.3
4 | 14.6
8 | 7.33
1.9.3 3
1.10
1.10.1 die area(15 cm) = wafer area / dies per wafer = pi × 7.5^2 / 84 = 2.10 cm^2
yield(15 cm) = 1 / (1 + (0.020 × 2.10 / 2))^2 = 0.9593
die area(20 cm) = wafer area / dies per wafer = pi × 10^2 / 100 = 3.14 cm^2
yield(20 cm) = 1 / (1 + (0.031 × 3.14 / 2))^2 = 0.9093
1.10.2 cost/die(15 cm) = 12 / (84 × 0.9593) = 0.1489
cost/die(20 cm) = 15 / (100 × 0.9093) = 0.1650
1.10.3 die area(15 cm) = wafer area / dies per wafer = pi × 7.5^2 / (84 × 1.1) = 1.91 cm^2
yield(15 cm) = 1 / (1 + (0.020 × 1.15 × 1.91 / 2))^2 = 0.9575
die area(20 cm) = wafer area / dies per wafer = pi × 10^2 / (100 × 1.1) = 2.86 cm^2
yield(20 cm) = 1 / (1 + (0.03 × 1.15 × 2.86 / 2))^2 = 0.9082
1.10.4 defects per area(0.92) = (1 − y^.5) / (y^.5 × die_area / 2) = (1 − 0.92^.5) / (0.92^.5 × 2 / 2) = 0.043 defects/cm^2
defects per area(0.95) = (1 − y^.5) / (y^.5 × die_area / 2) = (1 − 0.95^.5) / (0.95^.5 × 2 / 2) = 0.026 defects/cm^2
1.11
1.11.1 CPI = clock rate × CPU time / instr. count
clock rate = 1 / cycle time = 3 GHz
CPI(bzip2) = 3 × 10^9 × 750 / (2389 × 10^9) = 0.94
1.11.2 SPEC ratio = ref. time / execution time
SPEC ratio(bzip2) = 9650 / 750 = 12.86
1.11.3 CPU time = No. instr. × CPI / clock rate
If CPI and clock rate do not change, the CPU time increase is equal to the increase in the number of instructions, that is 10%.
1.11.4 CPU time(before) = No. instr. × CPI / clock rate
CPU time(after) = 1.1 × No. instr. × 1.05 × CPI / clock rate
CPU time(after) / CPU time(before) = 1.1 × 1.05 = 1.155. Thus, CPU time is increased by 15.5%.
1.11.5 SPECratio = reference time / CPU time
SPECratio(after) / SPECratio(before) = CPU time(before) / CPU time(after) = 1/1.155 = 0.86. The SPECratio is decreased by 14%.
1.11.6 CPI = (CPU time × clock rate) / No. instr.
CPI = 700 × 4 × 10^9 / (0.85 × 2389 × 10^9) = 1.37
1.11.7 Clock rate ratio = 4 GHz / 3 GHz = 1.33
CPI @ 4 GHz = 1.37, CPI @ 3 GHz = 0.94, ratio = 1.45
They are different because, although the number of instructions has been reduced by 15%, the CPU time has been reduced by a lower percentage.
1.11.8 700/750 = 0.933. CPU time reduction: 6.7%
1.11.9 No. instr. = CPU time × clock rate / CPI
No. instr. = 960 × 0.9 × 4 × 10^9 / 1.61 = 2146 × 10^9
1.11.10 Clock rate = No. instr. × CPI / CPU time.
Clock rate_new = No. instr. × CPI / (0.9 × CPU time) = 1/0.9 × clock rate_old = 3.33 GHz
1.11.11 Clock rate = No. instr. × CPI / CPU time.
Clock rate_new = No. instr. × 0.85 × CPI / (0.80 × CPU time) = 0.85/0.80 × clock rate_old = 3.18 GHz
1.12
1.12.1 T(P1) = 5 × 10^9 × 0.9 / (4 × 10^9) = 1.125 s
T(P2) = 10^9 × 0.75 / (3 × 10^9) = 0.25 s
clock rate(P1) > clock rate(P2), performance(P1) < performance(P2)
1.12.2 T(P1) = No. instr. × CPI / clock rate
T(P1) = 2.25 × 10^−1 s
T(P2) = N × 0.75 / (3 × 10^9), then N = 9 × 10^8
1.12.3 MIPS = Clock rate × 10^−6 / CPI
MIPS(P1) = 4 × 10^9 × 10^−6 / 0.9 = 4.44 × 10^3
MIPS(P2) = 3 × 10^9 × 10^−6 / 0.75 = 4.0 × 10^3
MIPS(P1) > MIPS(P2), performance(P1) < performance(P2) (from 11a)
1.12.4 MFLOPS = No. FP operations × 10^−6 / T
MFLOPS(P1) = .4 × 5E9 × 1E−6 / 1.125 = 1.78E3
MFLOPS(P2) = .4 × 1E9 × 1E−6 / .25 = 1.60E3
MFLOPS(P1) > MFLOPS(P2), performance(P1) < performance(P2) (from 11a)
1.13
1.13.1 T_fp = 70 × 0.8 = 56 s. T_new = 56 + 85 + 55 + 40 = 236 s. Reduction: 5.6%
1.13.2 T_new = 250 × 0.8 = 200 s, T_fp + T_l/s + T_branch = 165 s, T_int = 35 s. Reduction in INT time: 58.8%
1.13.3 T_new = 250 × 0.8 = 200 s, T_fp + T_int + T_l/s = 210 s. NO
1.14
1.14.1 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles / clock rate = clock cycles / (2 × 10^9)
clock cycles = 512 × 10^6; T_CPU = 0.256 s
To halve the number of clock cycles by improving the CPI of FP instructions:
CPI_improved fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles / 2
CPI_improved fp = (clock cycles / 2 − (CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.)) / No. FP instr.
CPI_improved fp = (256 − 462) / 50 < 0 ==> not possible
1.14.2 Using the clock cycle data from a.
To halve the number of clock cycles by improving the CPI of L/S instructions:
CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_improved l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles / 2
CPI_improved l/s = (clock cycles / 2 − (CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_branch × No. branch instr.)) / No. L/S instr.
CPI_improved l/s = (256 − 198) / 80 = 0.725
1.14.3 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles / clock rate = clock cycles / (2 × 10^9)
CPI_int = 0.6 × 1 = 0.6; CPI_fp = 0.6 × 1 = 0.6; CPI_l/s = 0.7 × 4 = 2.8; CPI_branch = 0.7 × 2 = 1.4
T_CPU (before improv.) = 0.256 s; T_CPU (after improv.) = 0.171 s
1.15
1 | 100
2 | 50 | 4 | 54 | 100/54 = 1.85 | 1.85/2 = .93
4 | 25 | 4 | 29 | 100/29 = 3.44 | 3.44/4 = 0.86
8 | 12.5 | 4 | 16.5 | 100/16.5 = 6.06 | 6.06/8 = 0.75
16 | 6.25 | 4 | 10.25 | 100/10.25 = 9.76 | 9.76/16 = 0.61
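The rows of the 1.15 table follow one pattern, which the sketch below recomputes. It rests on an assumed reading of the columns: processor count, compute time split evenly across processors, a fixed overhead of 4 time units once more than one processor is used, total time, speedup over the 100-unit single-processor time, and speedup per processor. The function name `scaling_table` and its parameters are hypothetical.

```python
def scaling_table(serial_time=100.0, overhead=4.0, processors=(2, 4, 8, 16)):
    """Return (p, total time, speedup, speedup per processor) for each processor count."""
    rows = []
    for p in processors:
        total = serial_time / p + overhead   # evenly divided work plus fixed overhead
        speedup = serial_time / total        # relative to the single-processor time
        rows.append((p, total, speedup, speedup / p))
    return rows


for p, total, speedup, per_processor in scaling_table():
    print(f"{p:2d}  {total:6.2f}  {speedup:5.2f}  {per_processor:4.2f}")
```

The printed values are rounded, so the last digit can differ slightly from the table, which appears to truncate (for example 3.44 for 100/29).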

(Complete Edition) Principles of Computer Organization: End-of-Chapter Exercise Answers (5th Edition, Bai Zhongying)

Principles of Computer Organization, 5th Edition: Exercise Answers
Contents: Chapter 1, Chapter 2, Chapter 3, Chapter 4, Chapter 5, Chapter 6, Chapter 7, Chapter 8, Chapter 9
Chapter 1
1. The defining feature of an analog computer is that values are represented by continuous quantities, and the computation itself is also continuous.

Computer Organization and Design, 5th Edition: Answers (CH06_Solution)

Chapter 6 Solutions
6.1 There is no single right answer for this question. The purpose is to get students to think about parallelism present in their daily lives. The answer should have at least 10 activities identified.
6.1.1 Any reasonable answer is correct here.
6.1.2 Any reasonable answer is correct here.
6.1.3 Any reasonable answer is correct here.
6.1.4 The student is asked to quantify the savings due to parallelism. The answer should consider the amount of overlap provided through parallelism and should be less than or equal to (if no parallelism was possible) the original time computed if each activity was carried out serially.
6.2
6.2.1 For this set of resources, we can pipeline the preparation. We assume that we do not have to reheat the oven for each cake.
Preheat oven.
Mix ingredients in bowl for Cake 1.
Fill cake pan with contents of bowl and bake Cake 1. Mix ingredients for Cake 2 in bowl.
Finish baking Cake 1. Empty cake pan. Fill cake pan with bowl contents for Cake 2 and bake Cake 2. Mix ingredients in bowl for Cake 3.
Finish baking Cake 2. Empty cake pan. Fill cake pan with bowl contents for Cake 3 and bake Cake 3.
Finish baking Cake 3. Empty cake pan.
6.2.2 Now we have 3 bowls, 3 cake pans and 3 mixers. We will name them A, B, and C.
Preheat oven.
Mix ingredients in bowl A for Cake 1.
Fill cake pan A with contents of bowl A and bake for Cake 1. Mix ingredients for Cake 2 in bowl A.
Finish baking Cake 1. Empty cake pan A. Fill cake pan A with contents of bowl A for Cake 2. Mix ingredients in bowl A for Cake 3.
Finish baking Cake 2. Empty cake pan A. Fill cake pan A with contents of bowl A for Cake 3.
Finish baking Cake 3. Empty cake pan A.
The point here is that we cannot carry out any of these items in parallel because we either have one person doing the work, or we have limited capacity in our oven.
6.2.3 Each step can be done in parallel for each cake.
The time to bake 1 cake, 2 cakes, or 3 cakes is exactly the same.
6.2.4 The loop computation is equivalent to the steps involved to make one cake. Given that we have multiple processors (or ovens and cooks), we can execute instructions (or cook multiple cakes) in parallel. The instructions in the loop (or cooking steps) may have some dependencies on prior instructions (or cooking steps) in the loop body (cooking a single cake).
Data-level parallelism occurs when loop iterations are independent (i.e., no loop-carried dependencies).
Task-level parallelism includes any instructions that can be computed on parallel execution units, and is similar to the independent operations involved in making multiple cakes.

6.3
6.3.1 While binary search has very good serial performance, it is difficult to parallelize without modifying the code. So part A asks to compute the speedup factor, but increasing X beyond 2 or 3 should have no benefits. While we can perform the comparison of low and high on one core, the computation for mid on a second core, and the comparison for A[mid] on a third core, without some restructuring or speculative execution, we will not obtain any speedup. The answer should include a graph showing that no speedup is obtained after the values of 1, 2, or 3 (this value depends somewhat on the assumption made) for Y.
6.3.2 In this question, we suggest that we can increase the number of cores (to equal the number of array elements). Again, given the current code, we really cannot obtain any benefit from these extra cores. But if we create threads to compare the N elements to the value X and perform these in parallel, then we can get ideal speedup (Y times speedup), and the comparison can be completed in the amount of time to perform a single comparison.

6.4 This problem illustrates that some computations can be done in parallel if serial code is restructured.
But more importantly, we may want to provide for SIMD operations in our ISA, and allow for data-level parallelism when performing the same operation on multiple data items.
6.4.1 This is a straightforward computation. The first instruction is executed once, and the loop body is executed 998 times.
    Version 1: 17,965 cycles
    Version 2: 22,955 cycles
    Version 3: 20,959 cycles
6.4.2 Array elements D[j] and D[j-1] will have loop-carried dependencies. These correspond to $f4 in the current iteration and $f0 in the next iteration.
6.4.3 This is a very challenging problem and there are many possible implementations for the solution. The preferred solution will try to utilize the two nodes by unrolling the loop 4 times (this already gives you a substantial speedup by eliminating many loop increment, branch, and load instructions). The loop body running on node 1 would look something like this (the code is not the most efficient code sequence):
          addiu $s1, $zero, 996    # loop bound compared against $s0
          l.d   $f0, -16($s0)
          l.d   $f2, -8($s0)
    loop: add.d $f4, $f2, $f0
          add.d $f6, $f4, $f2
          Send  (2, $f4)           # send two operands to node 2
          Send  (2, $f6)
          s.d   $f4, 0($s0)
          s.d   $f6, 8($s0)
          Receive($f8)             # receive the sum computed by node 2
          add.d $f10, $f8, $f6
          add.d $f0, $f10, $f8
          Send  (2, $f10)
          Send  (2, $f0)
          s.d   $f8, 16($s0)
          s.d   $f10, 24($s0)
          s.d   $f0, 32($s0)
          Receive($f2)
          s.d   $f2, 40($s0)
          addiu $s0, $s0, 48
          bne   $s0, $s1, loop
          add.d $f4, $f2, $f0      # instructions executed after the loop exits
          add.d $f6, $f4, $f2
          add.d $f10, $f8, $f6
          s.d   $f4, 0($s0)
          s.d   $f6, 8($s0)
          s.d   $f8, 16($s0)
The code on node 2 would look something like this:
          addiu $s2, $zero, 0      # iteration counter
    loop: Receive ($f12)           # receive two operands from node 1
          Receive ($f14)
          add.d $f16, $f14, $f12
          Send  (1, $f16)          # return the sum to node 1
          Receive ($f12)
          Receive ($f14)
          add.d $f16, $f14, $f12
          Send  (1, $f16)
          Receive ($f12)
          Receive ($f14)
          add.d $f16, $f14, $f12
          Send  (1, $f16)
          Receive ($f12)
          Receive ($f14)
          add.d $f16, $f14, $f12
          Send  (1, $f16)
          addiu $s2, $s2, 1
          bne   $s2, 83, loop
Basically Node 1 would compute 4 adds each loop iteration, and Node 2 would compute 4 adds. The loop takes 1463 cycles, which is much better than close to 18K.
But the unrolled loop would run faster given the current send instruction latency.
6.4.4 The loop network would need to respond within a single cycle to obtain a speedup. This illustrates why using distributed message passing is difficult when loops contain loop-carried dependencies.

6.5
6.5.1 This problem is again a divide and conquer problem, but utilizes recursion to produce a very compact piece of code. In part A the student is asked to compute the speedup when the number of cores is small. When forming the lists, we spawn a thread for the computation of left in the MergeSort code, and spawn a thread for the computation of the right. If we consider this recursively, for m initial elements in the array, we can utilize 1 + 2 + 4 + 8 + 16 + … log2(m) processors to obtain speedup.
6.5.2 In this question, log2(m) is the largest value of Y for which we can obtain any speedup without restructuring. But if we had m cores, we could perform sorting using a very different algorithm. For instance, if we have greater than m/2 cores, we can compare all pairs of data elements, swap the elements if the left element is greater than the right element, and then repeat this step m times. So this is one possible answer for the question. It is known as parallel comparison sort. Various comparison sort algorithms include odd-even sort and cocktail sort.

6.6
6.6.1 This problem presents an "embarrassingly parallel" computation and asks the student to find the speedup obtained on a 4-core system. The computations involved are: (m × p × n) multiplications and (m × p × (n − 1)) additions. The multiplications and additions associated with a single element in C are dependent (we cannot start summing up the results of the multiplications for an element until two products are available).
So in this question, the speedup should be very close to 4.
6.6.2 This question asks about how speedup is affected due to cache misses caused by the 4 cores all working on different matrix elements that map to the same cache line. Each update would incur the cost of a cache miss, and so will reduce the speedup obtained by a factor of 3 times the cost of servicing a cache miss.
6.6.3 In this question, we are asked how to fix this problem. The easiest way to solve the false sharing problem is to compute the elements in C by traversing the matrix across columns instead of rows (i.e., using index-j instead of index-i). These elements will be mapped to different cache lines. Then we just need to make sure we process the matrix index that is computed (i, j) and (i + 1, j) on the same core. This will eliminate false sharing.

6.7
6.7.1
    x = 2, y = 2, w = 1, z = 0
    x = 2, y = 2, w = 3, z = 0
    x = 2, y = 2, w = 5, z = 0
    x = 2, y = 2, w = 1, z = 2
    x = 2, y = 2, w = 3, z = 2
    x = 2, y = 2, w = 5, z = 2
    x = 2, y = 2, w = 1, z = 4
    x = 2, y = 2, w = 3, z = 4
    x = 3, y = 2, w = 5, z = 4
6.7.2 We could set synchronization instructions after each operation so that all cores see the same value on all nodes.

6.8
6.8.1 If every philosopher simultaneously picks up the left fork, then there will be no right fork to pick up. This will lead to starvation.
6.8.2 The basic solution is that whenever a philosopher wants to eat, she checks both forks. If they are free, then she eats. Otherwise, she waits until a neighbor contacts her. Whenever a philosopher finishes eating, she checks to see if her neighbors want to eat and are waiting. If so, then she releases the fork to one of them and lets them eat.
The difficulty is to first be able to obtain both forks without another philosopher interrupting the transition between checking and acquisition. We can implement this a number of ways, but a simple way is to accept requests for forks in a centralized queue, and give out forks based on the priority defined by being closest to the head of the queue. This provides both deadlock prevention and fairness.
6.8.3 There are a number of right answers here, but basically showing a case where the request of the head of the queue does not have the closest forks available, though there are forks available for other philosophers.
6.8.4 By periodically repeating the request, the request will move to the head of the queue. This only partially solves the problem unless you can guarantee that all philosophers eat for exactly the same amount of time, and can use this time to schedule the issuance of the repeated request.

6.9
    A3B1, B4A1, A2B1, B4A1, A4B2A1B3A1A2A1A1B1B2B1A3A4B2B4
    A1B1A1B1A1B2A2B3A3B4A4

6.10 This is an open-ended question.

6.11
6.11.1 The answer should include a MIPS program that includes 4 different processes that will compute ¼ of the sums. Assuming that memory latency is not an issue, the program should get linear speedup when run on the 4 processors (there is no communication necessary between threads). If memory is being considered in the answer, then the array blocking should consider preserving spatial locality so that false sharing is not created.
6.11.2 Since this program is highly data parallel and there are no data dependencies, an 8× speedup should be observed. In terms of instructions, the SIMD machine should have fewer instructions (though this will depend upon the SIMD extensions).

6.12 This is an open-ended question that could have many possible answers. The key is that the student learns about MISD and compares it to an SIMD machine.

6.13 This is an open-ended question that could have many answers.
The key is that the students learn about warps.

6.14 This is an open-ended programming assignment. The code should be tested for correctness.

6.15 This question will require the students to research on the Internet both the AMD Fusion architecture and the Intel QuickPath technology. The key is that students become aware of these technologies. The actual bandwidth and latency values should be available right off the company websites, and will change as the technology evolves.

6.16
6.16.1 For an n-cube of order N (2^N nodes), the interconnection network can sustain N − 1 broken links and still guarantee that there is a path to all nodes in the network.
6.16.2 The plot below shows the number of network links that can fail and still guarantee that the network is not disconnected.
[Figure: number of faulty links versus network order]

6.17
6.17.1 Major differences between these suites include:
    Whetstone: designed for floating point performance specifically
    PARSEC: these workloads are focused on multithreaded programs
6.17.2 Only the PARSEC benchmarks should be impacted by sharing and synchronization. This should not be a factor in Whetstone.

6.18
6.18.1 Any reasonable C program that performs the transformation should be accepted.
6.18.2 The storage space should be equal to (R + R) times the size of a single-precision floating-point number + (m + 1) times the size of the index, where R is the number of non-zero elements and m is the number of rows. We will assume each floating-point number is 4 bytes, and each index is a short unsigned integer that is 2 bytes.
For Matrix X this equals 111 bytes.
6.18.3 The answer should include results for both a brute-force computation and a computation using the Yale Sparse Matrix Format.
6.18.4 There are a number of more efficient formats, but their impact should be marginal for the small matrices used in this problem.

6.19
6.19.1 This question presents three different CPU models to consider when executing the following code:
    if (X[i][j] > Y[i][j])
        count++;
6.19.2 There are a number of acceptable answers here, but they should consider the capabilities of each CPU and also its frequency. What follows is one possible answer:
Since X and Y are FP numbers, we should utilize the vector processor (CPU C) to issue 2 loads, 8 matrix elements in parallel from A and 8 matrix elements from B, into a single vector register and then perform a vector subtract. We would then issue 2 vector stores to put the result in memory.
Since the vector processor does not have comparison instructions, we would have CPU A perform 2 parallel conditional jumps based on floating point registers. We would increment two counts based on the conditional compare. Finally, we could just add the two counts for the entire matrix. We would not need to use core B.
6.19.3 The point of the problem is to show that it is difficult to perform an operation on individual vector elements when utilizing a vector processor. What might be a nice instruction to add would be a vector comparison that would allow us to compare two vectors and produce a scalar value of the number of elements where one vector was larger than the other. This would reduce the computation to a single instruction for the comparison of 8 FP number pairs, and then an integer computation for summing up all of these values.

6.20 This question looks at the amount of queuing that is occurring in the system given a maximum transaction processing rate, and the latency observed on average by a transaction.
The latency includes both the service time (which is computed by the maximum rate) and the queue time.
6.20.1 So for a max transaction processing rate of 5000/sec, and we have 4 cores contributing, we would see an average latency of 0.8 ms if there was no queuing taking place. Thus, each core must have 1.25 transactions either executing or in some amount of completion on average.
So the answers are:
    Latency    Max TP rate    Transactions per core
    1 ms       5000/sec       1.25
    2 ms       5000/sec       2.5
    1 ms       10,000/sec     2.5
    2 ms       10,000/sec     5
6.20.2 We should be able to double the maximum transaction rate by doubling the number of cores.
6.20.3 The reason this does not happen is due to memory contention on the shared memory system.
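
As a rough check of the numbers in 6.20.1, the sketch below applies Little's law (transactions in flight = throughput × latency) and divides by the number of cores. The function name and the use of Python are illustrative assumptions; they are not part of the original solution.

```python
def in_flight_per_core(latency_ms, max_rate_per_s, cores=4):
    """Average number of transactions in flight on each core.

    Little's law gives the number in the whole system as
    rate * latency; dividing by the core count gives the per-core figure.
    """
    in_system = max_rate_per_s * latency_ms / 1000  # latency converted to seconds
    return in_system / cores


# The four (latency, maximum rate) combinations listed in 6.20.1.
cases = [(1, 5000), (2, 5000), (1, 10000), (2, 10000)]
results = [in_flight_per_core(lat, rate) for lat, rate in cases]

# Service time with no queuing: 4 cores sharing 5000 transactions per second.
service_time_ms = 1000 * 4 / 5000

if __name__ == "__main__":
    for (lat, rate), n in zip(cases, results):
        print(f"{lat} ms at {rate}/sec: {n} transactions per core")
    print(f"service time without queuing: {service_time_ms} ms")
```

Any observed latency above the no-queuing service time is time spent waiting in the queue, which is why more than one transaction per core is in flight on average.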

Computer Organization and Design, Fifth Edition: Answers


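Computer Organization and Design: Computer Organization and Design is a book published in 2010 by China Machine Press; its author is David A. Patterson. The book uses a MIPS processor to present fundamentals such as computer hardware technology, pipelining, the memory hierarchy, and I/O.

In addition, the book includes some introductory material on the x86 architecture.

Summary: This best-selling computer organization book has been thoroughly updated to focus on the revolutionary change now taking place in computer architecture: the move from uniprocessors to multicore microprocessors.

The ARM edition of the book was published to highlight the importance of embedded systems to the computing industry across Asia, and it uses the ARM processor to discuss the instruction set and arithmetic of a real computer.

This is because ARM is the most popular instruction set architecture for embedded devices, and about 4 billion embedded devices are sold worldwide each year.

It uses ARMv6 (the ARM 11 family) as the main architecture to present the fundamentals of instruction sets and computer arithmetic.

It covers the revolutionary change from sequential to parallel computing, with a new chapter on parallelism and sections in every chapter that highlight parallel hardware and software topics.

A new appendix, written by NVIDIA's chief scientist and head of architecture, covers the emergence and importance of the modern GPU and describes in detail, for the first time, this highly parallel, multithreaded, multicore processor optimized for visual computing.

It describes a novel approach to measuring multicore performance, the "Roofline model", with benchmarks and analysis of the AMD Opteron X4, Intel Xeon 5000, Sun UltraSPARC T2, and IBM Cell.

It includes new content on flash memory and virtual machines.

It provides a large number of thought-provoking exercises, running to more than 200 pages.

It uses the AMD Opteron X4 and Intel Nehalem as running examples throughout Computer Organization and Design: The Hardware/Software Interface (English edition, 4th Edition, ARM Edition).

It updates all processor performance examples using the SPEC CPU2006 suite.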

Table of contents:
1 Computer Abstractions and Technology
1.1 Introduction
1.2 Below Your Program
1.3 Under the Covers
1.4 Performance
1.5 The Power Wall
1.6 The Sea Change: The Switch from Uniprocessors to Multiprocessors
1.7 Real Stuff: Manufacturing and Benchmarking the AMD Opteron X4
1.8 Fallacies and Pitfalls
1.9 Concluding Remarks
1.10 Historical Perspective and Further Reading
1.11 Exercises
2 Instructions: Language of the Computer
2.1 Introduction
2.2 Operations of the Computer Hardware
2.3 Operands of the Computer Hardware
2.4 Signed and Unsigned Numbers
2.5 Representing Instructions in the Computer
2.6 Logical Operations
2.7 Instructions for Making Decisions
2.8 Supporting Procedures in Computer Hardware
2.9 Communicating with People
2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes
2.11 Parallelism and Instructions: Synchronization
2.12 Translating and Starting a Program
2.13 A C Sort Example to Put It All Together
2.14 Arrays versus Pointers
2.15 Advanced Material: Compiling C and Interpreting Java
2.16 Real Stuff: MIPS Instructions
2.17 Real Stuff: x86 Instructions
2.18 Fallacies and Pitfalls
2.19 Concluding Remarks
2.20 Historical Perspective and Further Reading
2.21 Exercises
4.10 Parallelism and Advanced Instruction-Level Parallelism
4.11 Real Stuff: the AMD Opteron X4 (Barcelona) Pipeline
4.12 Advanced Topic: an Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations
4.13 Fallacies and Pitfalls
4.14 Concluding Remarks
4.15 Historical Perspective and Further Reading
4.16 Exercises
5 Large and Fast: Exploiting Memory Hierarchy
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance
5.4 Virtual Memory
5.5 A Common Framework for Memory Hierarchies
5.6 Virtual Machines
5.7 Using a Finite-State Machine to Control a Simple Cache
5.8 Parallelism and Memory Hierarchies: Cache Coherence
5.9 Advanced Material: Implementing Cache Controllers
5.10 Real Stuff: the AMD Opteron X4 (Barcelona) and Intel Nehalem Memory Hierarchies
5.11 Fallacies and Pitfalls
5.12 Concluding Remarks
5.13 Historical Perspective and Further Reading
5.14 Exercises
6 Storage and Other I/O Topics
6.1 Introduction
6.2 Dependability, Reliability, and Availability
6.3 Disk Storage
6.4 Flash Storage
6.5 Connecting Processors, Memory, and I/O Devices
6.6 Interfacing I/O Devices to the Processor, Memory, and Operating System
6.7 I/O Performance Measures: Examples from Disk and File Systems
6.8 Designing an I/O System
6.9 Parallelism and I/O: Redundant Arrays of Inexpensive Disks
6.10 Real Stuff: Sun Fire x4150 Server
6.11 Advanced Topics: Networks
6.12 Fallacies and Pitfalls
6.13 Concluding Remarks
6.14 Historical Perspective and Further Reading
6.15 Exercises
7 Multicores, Multiprocessors, and Clusters
7.1 Introduction
7.2 The Difficulty of Creating Parallel Processing Programs
7.3 Shared Memory Multiprocessors
7.4 Clusters and Other Message-Passing Multiprocessors
7.5 Hardware Multithreading
7.6 SISD, MIMD, SIMD, SPMD, and Vector
7.7 Introduction to Graphics Processing Units
7.8 Introduction to Multiprocessor Network Topologies
7.9 Multiprocessor Benchmarks
7.10 Roofline: A Simple Performance Model
7.11 Real Stuff: Benchmarking Four Multicores Using the Roofline Model
7.12 Fallacies and Pitfalls
7.13 Concluding Remarks
7.14 Historical Perspective and Further Reading
7.15 Exercises
Index
CD-ROM CONTENT
A Graphics and Computing GPUs
A.1 Introduction
A.2 GPU System Architectures
A.3 Scalable Parallelism: Programming GPUs
A.4 Multithreaded Multiprocessor Architecture
A.5 Parallel Memory System
A.6 Floating Point Arithmetic
A.7 Real Stuff: The NVIDIA GeForce 8800
A.8 Real Stuff: Mapping Applications to GPUs
A.9 Fallacies and Pitfalls
A.10 Concluding Remarks
A.11 Historical Perspective and Further Reading
B1 ARM and Thumb Assembler Instructions
B1.1 Using This Appendix
B1.2 Syntax
B1.3 Alphabetical List of ARM and Thumb Instructions
B1.4 ARM Assembler Quick Reference
B1.5 GNU Assembler Quick Reference
B2 ARM and Thumb Instruction Encodings
B3 Instruction Cycle Timings
C The Basics of Logic Design
D Mapping Control to Hardware
ADVANCED CONTENT
HISTORICAL PERSPECTIVES & FURTHER READING
TUTORIALS
SOFTWARE

About the author: David A. Patterson is a professor of computer science at the University of California, Berkeley.


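Chapter 4 Solutions

4.1
4.1.1 The values of the signals are as follows:
    RegWrite  MemRead  ALUMux   MemWrite  ALUOp  RegMux  Branch
    0         0        1 (Imm)  1         ADD    X       0
ALUMux is the control signal that controls the Mux at the ALU input: 0 (Reg) selects the output of the register file, and 1 (Imm) selects the immediate from the instruction word as the second input.
RegMux is the control signal that controls the Mux at the data input of the register file: 0 (ALU) selects the output of the ALU, and 1 (Mem) selects the output of memory.
A value of X means "don't care" (it does not matter whether the signal is 0 or 1).
4.1.2 All except the branch Add unit and the write port of the Registers.
4.1.3 Outputs that are not used: branch Add, write port of Registers. No outputs: none (all units produce outputs).

4.2
4.2.1 This instruction uses the instruction memory, both register read ports, the ALU to add Rd and Rs, the data memory, and the write port in Registers.
4.2.2 None. This instruction can be implemented using existing blocks.
4.2.3 None. This instruction can be implemented without adding new control signals. It only requires changes to the control logic.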

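4.3
4.3.1 The clock cycle time is determined by the critical path, which for the given latencies is the path that gets the data value for the load instruction: I-Mem (read instruction), Regs (takes longer than Control), Mux (select ALU input), ALU, Data Memory, and Mux (select the value from memory to be written into Registers). The latency of this path is 400 ps + 200 ps + 30 ps + 120 ps + 350 ps + 30 ps = 1130 ps. With the change it is 1430 ps (1130 ps + 300 ps, because the ALU is on the critical path).
4.3.2 The speedup comes from changes in the clock cycle time and in the number of clock cycles the program needs. The program needs 5% fewer cycles, but the cycle time is 1430 instead of 1130, so the speedup is (1/0.95) * (1130/1430) = 0.83, which means we are actually slowing down.
4.3.3 The cost is always the total cost of all components (not just those on the critical path), so the original processor has the cost of I-Mem, Regs, Control, ALU, D-Mem, 2 Add units, and 3 Mux units, for a total of 1000 + 200 + 500 + 100 + 2000 + 2*30 + 3*10 = 3890.
We will compute cost relative to this baseline. The performance relative to this baseline is the speedup computed previously, and the cost/performance relative to the baseline is as follows:
    New cost: 3890 + 600 = 4490
    Relative cost: 4490/3890 = 1.15
    Cost/performance: 1.15/0.83 = 1.39
We are paying significantly more for significantly worse performance; the cost/performance is a lot worse than with the unmodified processor.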
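
The arithmetic in 4.3.1 through 4.3.3 can be re-checked with a short script. This is only a sketch: the variable names are made up for illustration, and the pairing of each cost figure with a unit is read off the order in which the units and the terms of the sum are listed in 4.3.3.

```python
# Unit latencies in ps, taken from the sum in 4.3.1.
latency_ps = {"I-Mem": 400, "Regs": 200, "Mux": 30, "ALU": 120, "D-Mem": 350}

# Critical path of the load instruction; the Mux appears twice.
load_path = ["I-Mem", "Regs", "Mux", "ALU", "D-Mem", "Mux"]
base_cycle_ps = sum(latency_ps[unit] for unit in load_path)
new_cycle_ps = base_cycle_ps + 300  # the ALU, which is on the path, gets 300 ps slower

# 5% fewer cycles, but each cycle takes longer.
speedup = (1 / 0.95) * (base_cycle_ps / new_cycle_ps)

# Component costs in the order listed in 4.3.3 (assumed pairing).
base_cost = 1000 + 200 + 500 + 100 + 2000 + 2 * 30 + 3 * 10
new_cost = base_cost + 600
relative_cost = new_cost / base_cost
cost_per_performance = relative_cost / speedup

if __name__ == "__main__":
    print(f"cycle time: {base_cycle_ps} ps -> {new_cycle_ps} ps")
    print(f"speedup: {speedup:.2f}")
    print(f"relative cost: {relative_cost:.2f}")
    print(f"cost/performance: {cost_per_performance:.2f}")
```

A speedup below 1 combined with a relative cost above 1 is what makes the modified design worse on both counts.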

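4.4
[…] select that value instead of PC + 4. Note that the path through the other Add unit is shorter, because the latency of I-Mem is longer than the latency of the Add unit. We have: 200 ps + 15 ps + 10 ps + 70 ps + 20 ps = 315 ps
4.4.3 Conditional branches have the same long-latency path that computes the branch address as unconditional branches do. In addition, they have a long-latency path that goes through Registers, Mux, and ALU to compute the PCSrc condition. The critical path is the longer of the two. For these latencies, the path through PCSrc is longer: 200 ps + 90 ps + 20 ps + 90 ps + 20 ps = 420 ps
4.4.4 PC-relative branches.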
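
The two path latencies quoted for unconditional and conditional branches can be re-added in a few lines. The sketch below keeps the figures in the order they appear in the two sums and deliberately does not attach unit names to them, since the source text does not preserve which unit each figure belongs to; the variable names are illustrative only.

```python
# Latencies in ps, in the order they appear in the two sums above.
branch_address_path_ps = [200, 15, 10, 70, 20]  # path that computes the branch address
pcsrc_path_ps = [200, 90, 20, 90, 20]           # path that computes the PCSrc condition

# An unconditional branch only needs the branch-address path.
unconditional_ps = sum(branch_address_path_ps)

# A conditional branch needs both, so its critical path is the longer one.
conditional_ps = max(sum(branch_address_path_ps), sum(pcsrc_path_ps))

if __name__ == "__main__":
    print(f"unconditional branch: {unconditional_ps} ps")
    print(f"conditional branch:   {conditional_ps} ps")
```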

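4.4.5 PC-relative unconditional branch instructions. We saw in part c that this is not on the critical path of conditional branches, and it is only needed for PC-relative branches. Note that MIPS does not have actual unconditional branches (beq zero, zero, Label plays that role, so there is no need for unconditional branch opcodes), so for MIPS the answer to this question is actually "None".
4.4.6 Of the two instructions (bne and add), bne has the longer critical path, so it determines the clock cycle time. Note that every path for add is shorter than or equal to the corresponding path for bne, so changes in unit latency will not affect this. We therefore focus on how the unit's latency affects the critical path of bne. This unit is not on the critical path, so the only way for it to become critical is to increase its latency until the address-computation path through sign extend, shift left, and branch add becomes longer than the PCSrc path through Registers, Mux, and ALU. […]

4.5
4.5.1 […] + 10% = 35%
4.5.2 The sign-extend circuit actually computes a result in every cycle.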
