Computer Architecture Chapter Lecture Notes 5-4
Computer Architecture: A Quantitative Approach, Fifth Edition: Solutions to Exercises
Chapter 1 Solutions
Chapter 2 Solutions
Chapter 3 Solutions
Chapter 4 Solutions
Chapter 5 Solutions
Chapter 6 Solutions
Appendix A Solutions
Appendix B Solutions
Appendix C Solutions

Solutions to Case Studies and Exercises

Chapter 1 Solutions

Case Study 1: Chip Fabrication Cost

1.1 a. \( \text{Yield} = \left(1 + \frac{0.30 \times 3.89}{4.0}\right)^{-4} = 0.36 \)
b. It is fabricated in a larger technology, which is an older plant. As plants age, their process gets tuned, and the defect rate decreases.

1.2 a. \( \text{Dies per wafer} = \frac{\pi \times (30/2)^2}{1.5} - \frac{\pi \times 30}{\sqrt{2 \times 1.5}} = 471 - 54.4 = 416 \)
\( \text{Yield} = \left(1 + \frac{0.30 \times 1.5}{4.0}\right)^{-4} = 0.65 \)
Profit = 416 × 0.65 × $20 = $5408
b. \( \text{Dies per wafer} = \frac{\pi \times (30/2)^2}{2.5} - \frac{\pi \times 30}{\sqrt{2 \times 2.5}} = 283 - 42.1 = 240 \)
\( \text{Yield} = \left(1 + \frac{0.30 \times 2.5}{4.0}\right)^{-4} = 0.50 \)
Profit = 240 × 0.50 × $25 = $3000
c. The Woods chip
d. Woods chips: 50,000/416 = 120.2 wafers needed
Markon chips: 25,000/240 = 104.2 wafers needed
Therefore, the most lucrative split is 120 Woods wafers, 30 Markon wafers.

1.3 a. \( \text{Defect-free single core} = \left(1 + \frac{0.75 \times 1.99/2}{4.0}\right)^{-4} = 0.28 \)
No defects = 0.28^2 = 0.08
One defect = 0.28 × 0.72 × 2 = 0.40
No more than one defect = 0.08 + 0.40 = 0.48
b. \( \$20 = \frac{\text{Wafer size}}{\text{old dpw} \times 0.28} \)
\( x = \frac{\text{Wafer size}}{1/2 \times \text{old dpw} \times 0.48} = \frac{\$20 \times 0.28}{1/2 \times 0.48} = \$23.33 \)

Case Study 2: Power Consumption in Computer Systems

1.4 a. 0.80x = 66 + 2 × 2.3 + 7.9; x = 99
b. 0.6 × 4 W + 0.4 × 7.9 W = 5.56 W
c. Solve the following four equations:
seek7200 = 0.75 × seek5400
seek7200 + idle7200 = 100
seek5400 + idle5400 = 100
seek7200 × 7.9 + idle7200 × 4 = seek5400 × 7 + idle5400 × 2.9
idle7200 = 29.8%

1.5 a. \( \frac{14\ \text{kW}}{66\ \text{W} + 2.3\ \text{W} + 7.9\ \text{W}} = 183 \)
b. \( \frac{14\ \text{kW}}{66\ \text{W} + 2.3\ \text{W} + 2 \times 7.9\ \text{W}} = 166 \)
c. 200 W × 11 = 2200 W
2200/76.2 = 28 racks
Only 1 cooling door is required.

1.6 a. The IBM x346 could take less space, which would save money in real estate. The racks might be better laid out. It could also be much cheaper. In addition, if we were running applications that did not match the characteristics of these benchmarks, the IBM x346 might be faster. Finally, there are no reliability numbers shown. Although we do not know that the IBM x346 is better in any of these areas, we do not know it is worse, either.

1.7 a. (1 - 0.8) + 0.8/2 = 0.2 + 0.4 = 0.6
b. \( \frac{\text{Power new}}{\text{Power old}} = \frac{(V \times 0.60)^2 \times (F \times 0.60)}{V^2 \times F} = 0.6^3 = 0.216 \)
c. 0.75 = (1 - x) + x/2; x = 50%
d. \( \frac{\text{Power new}}{\text{Power old}} = \frac{(V \times 0.75)^2 \times (F \times 0.60)}{V^2 \times F} = 0.75^2 \times 0.6 = 0.338 \)

Exercises

1.8 a. (1.35)^10 = approximately 20
b. 3200 × (1.4)^12 = approximately 181,420
c. 3200 × (1.01)^12 = approximately 3605
d. Power density, which is the power consumed over the increasingly small area, has created too much heat for heat sinks to dissipate. This has limited the activity of the transistors on the chip. Instead of increasing the clock rate, manufacturers are placing multiple cores on the chip.
e. Anything in the 15–25% range would be a reasonable conclusion based on the decline in the rate over history. As the sudden stop in clock rate shows, though, even the declines do not always follow predictions.

1.9 a. 50%
b. Energy = ½ load × V^2. Changing the frequency does not affect energy, only power. So the new energy is ½ load × (½ V)^2, reducing it to about ¼ the old energy.

1.10 a. 60%
b. 0.4 + 0.6 × 0.2 = 0.58, which reduces the energy to 58% of the original energy.
c. newPower/oldPower = ½ Capacitance × (Voltage × 0.8)^2 × (Frequency × 0.6)/(½ Capacitance × Voltage^2 × Frequency) = 0.8^2 × 0.6 = 0.256 of the original power.
d. 0.4 + 0.3 × 0.2 = 0.46, which reduces the energy to 46% of the original energy.

1.11 a. 10^9/100 = 10^7
b. 10^7/(10^7 + 24) ≈ 1
c. [need solution]

1.12 a. 35/10000 × 3333 = 11.67 days
b. There are several correct answers. One would be that, with the current system, one computer fails approximately every 5 minutes. 5 minutes is unlikely to be enough time to isolate the computer, swap it out, and get the computer back on line again. 10 minutes, however, is much more likely. In any case, it would greatly extend the amount of time before 1/3 of the computers have failed at once. Because the cost of downtime is so huge, being able to extend this is very valuable.
c. $90,000 = (x + x + x + 2x)/4
$360,000 = 5x
$72,000 = x
4th quarter = $144,000/hr

1.13 a. Itanium, because it has a lower overall execution time.
b. Opteron: 0.6 × 0.92 + 0.2 × 1.03 + 0.2 × 0.65 = 0.888
c. 1/0.888 = 1.126

1.14 a. See Figure S.1.
b. 2 = 1/((1 - x) + x/10)
5/9 = x = 0.56 or 56%
c. 0.056/0.5 = 0.11 or 11%
d. Maximum speedup = 1/(1/10) = 10
5 = 1/((1 - x) + x/10)
8/9 = x = 0.89 or 89%
e. Current speedup: 1/(0.3 + 0.7/10) = 1/0.37 = 2.7
Speedup goal: 5.4 = 1/((1 - x) + x/10), so x = 0.91
This means the percentage of vectorization would need to be 91%.

Figure S.1 Plot of the equation: y = 100/((100 - x) + x/10).

1.15 a. old execution time = 0.5 new + 0.5 × 10 new = 5.5 new
b. In the original code, the unenhanced part is equal in time to the enhanced part sped up by 10, therefore:
(1 - x) = x/10
10 - 10x = x
10 = 11x
10/11 = x = 0.91

1.16 a. 1/(0.8 + 0.20/2) = 1.11
b. 1/(0.7 + 0.20/2 + 0.10 × 3/2) = 1.05
c. fp ops: 0.1/0.95 = 10.5%, cache: 0.15/0.95 = 15.8%

1.17 a. 1/(0.6 + 0.4/2) = 1.25
b. 1/(0.01 + 0.99/2) = 1.98
c. 1/(0.2 + 0.8 × 0.6 + 0.8 × 0.4/2) = 1/(0.2 + 0.48 + 0.16) = 1.19
d. 1/(0.8 + 0.2 × 0.01 + 0.2 × 0.99/2) = 1/(0.8 + 0.002 + 0.099) = 1.11

1.18 a. 1/(0.2 + 0.8/N)
b. 1/(0.2 + 8 × 0.005 + 0.8/8) = 2.94
c. 1/(0.2 + 3 × 0.005 + 0.8/8) = 3.17
d. 1/(0.2 + log N × 0.005 + 0.8/N)
e. d/dN(1/((1 - P) + log N × 0.005 + P/N)) = 0

Chapter 2 Solutions

Case Study 1: Optimizing Cache Performance via Advanced Techniques

2.1 a. Each element is 8B. Since a 64B cacheline has 8 elements, and each column access will result in fetching a new line for the non-ideal matrix, we need a minimum of 8x8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8B = 1KB.
b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64B/8B = 8 row elements. Each column requires 64B × 256 of storage, or 16KB. Thus, column elements will be replaced in the cache before they can be used again.
Hence the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.
c.
    /* Blocked transpose; B is the block size and is assumed to divide 256. */
    for (i = 0; i < 256; i = i + B) {
        for (j = 0; j < 256; j = j + B) {
            for (m = 0; m < B; m++) {
                for (n = 0; n < B; n++) {
                    output[j + n][i + m] = input[i + m][j + n];
                }
            }
        }
    }
d. 2-way set associative. In a direct-mapped cache the blocks could be allocated so that they map to overlapping regions in the cache.
e. You should be able to determine the level-1 cache size by varying the block size. The ratio of the blocked and unblocked program speeds for arrays that do not fit in the cache in comparison to blocks that do is a function of the cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your machine has a write-through level-1 cache and the write buffer becomes a limiter of performance.

2.2 Since the unblocked version is too large to fit in the cache, processing eight 8B elements requires fetching one 64B row cache block and 8 column cache blocks. Since each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.

2.3 Open hands-on exercise, no fixed solution.

Case Study 2: Putting it all Together: Highly Parallel Memory Systems

2.4 a. The second-level cache is 1MB and has a 128B block size.
b. The miss penalty of the second-level cache is approximately 105ns.
c. The second-level cache is 8-way set associative.
d. The main memory is 512MB.
e. Walking through pages with a 16B stride takes 946ns per reference.
With 250 such references per page, this works out to approximately 240µs per page.

2.5 a. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 4KB for the graph above.
b. Hint: Take independent strides by the page size and look for increases in latency not attributable to cache sizes. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
c. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 15ns in the graph above.
d. Hint: Take independent strides that are multiples of the page size to see if the TLB is fully associative or set associative. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.

2.6 a. Hint: Look at the speed of programs that easily fit in the top-level cache as a function of the number of threads.
b. Hint: Compare the performance of independent references as a function of their placement in memory.

2.7 Open hands-on exercise, no fixed solution.

Exercises

2.8 a. The access time of the direct-mapped cache is 0.86ns, while the 2-way and 4-way are 1.12ns and 1.37ns respectively. This makes the relative access times 1.12/0.86 = 1.30 or 30% more for the 2-way and 1.37/0.86 = 1.59 or 59% more for the 4-way.
b. The access time of the 16KB cache is 1.27ns, while the 32KB and 64KB are 1.35ns and 1.37ns respectively. This makes the relative access times 1.35/1.27 = 1.06 or 6% larger for the 32KB and 1.37/1.27 = 1.078 or 8% larger for the 64KB.
c. Avg.
access time = hit% × hit time + miss% × miss penalty, miss% = misses per instruction/references per instruction = 2.2% (DM), 1.2% (2-way), 0.33% (4-way), 0.09% (8-way).
Direct mapped access time = 0.86ns @ 0.5ns cycle time = 2 cycles
2-way set associative = 1.12ns @ 0.5ns cycle time = 3 cycles
4-way set associative = 1.37ns @ 0.83ns cycle time = 2 cycles
8-way set associative = 2.03ns @ 0.79ns cycle time = 3 cycles
Miss penalty = 10/0.5 = 20 cycles for DM and 2-way; 10/0.83 = 13 cycles for 4-way; 10/0.79 = 13 cycles for 8-way.
Direct mapped: (1 - 0.022) × 2 + 0.022 × 20 = 2.396 cycles => 2.396 × 0.5 = 1.2ns
2-way: (1 - 0.012) × 3 + 0.012 × 20 = 3.2 cycles => 3.2 × 0.5 = 1.6ns
4-way: (1 - 0.0033) × 2 + 0.0033 × 13 = 2.036 cycles => 2.036 × 0.83 = 1.69ns
8-way: (1 - 0.0009) × 3 + 0.0009 × 13 = 3 cycles => 3 × 0.79 = 2.37ns
Direct mapped cache is the best.

2.9 a. The average memory access time of the current (4-way 64KB) cache is 1.69ns.
64KB direct mapped cache access time = 0.86ns @ 0.5ns cycle time = 2 cycles
Way-predicted cache has cycle time and access time similar to direct mapped cache and miss rate similar to 4-way cache.
The AMAT of the way-predicted cache has three components: miss, hit with way prediction correct, and hit with way prediction mispredict: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 3) × (1 - 0.0033) = 2.26 cycles = 1.13ns
b. The cycle time of the 64KB 4-way cache is 0.83ns, while the 64KB direct-mapped cache can be accessed in 0.5ns.
This provides 0.83/0.5 = 1.66 or 66% faster cache access.
c. With 1 cycle way misprediction penalty, AMAT is 1.13ns (as per part a), but with a 15 cycle misprediction penalty, the AMAT becomes: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 15) × (1 - 0.0033) = 4.65 cycles or 2.3ns.
d. The serial access is 2.4ns/1.59ns = 1.509 or 51% slower.

2.10 a. The access time is 1.12ns, while the cycle time is 0.51ns, which could be potentially pipelined as finely as 1.12/0.51 = 2.2 pipestages.
b. The pipelined design (not including latch area and power) has an area of 1.19 mm^2 and energy per access of 0.16nJ. The banked cache has an area of 1.36 mm^2 and energy per access of 0.13nJ. The banked design uses slightly more area because it has more sense amps and other circuitry to support the two banks, while the pipelined design burns slightly more power because the memory arrays that are active are larger than in the banked case.

2.11 a. With critical word first, the miss service would require 120 cycles. Without critical word first, it would require 120 cycles for the first 16B and 16 cycles for each of the next 3 16B blocks, or 120 + (3 × 16) = 168 cycles.
b. It depends on the contribution to Average Memory Access Time (AMAT) of the level-1 and level-2 cache misses and the percent reduction in miss service times provided by critical word first and early restart. If the percentage reduction in miss service times provided by critical word first and early restart is roughly the same for both level-1 and level-2 miss service, then if level-1 misses contribute more to AMAT, critical word first would likely be more important for level-1 misses.

2.12 a. 16B, to match the level 2 data cache write path.
b. Assume merging write buffer entries are 16B wide. Since each store can write 8B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A non-merging write buffer would take 4 cycles to write the 8B result of each store.
This means the merging write buffer would be 2 times faster.
c. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn't change the required number of write buffer entries. With non-blocking caches, writes can be processed from the write buffer during misses, which may mean fewer entries are needed.

2.13 a. A 2GB DRAM with parity or ECC effectively has 9-bit bytes, and would require 18 1Gb DRAMs. To create 72 output bits, each one would have to output 72/18 = 4 bits.
b. A burst length of 4 reads out 32B.
c. The DDR2-667 DIMM bandwidth is 667 × 8 = 5336 MB/s.
The DDR2-533 DIMM bandwidth is 533 × 8 = 4264 MB/s.

2.14 a. This is similar to the scenario given in the figure, but tRCD and CL are both 5. In addition, we are fetching two times the data in the figure. Thus it requires 5 + 5 + 4 × 2 = 18 cycles of a 333MHz clock, or 18 × (1/333MHz) = 54.0ns.
b. The read to an open bank requires 5 + 4 = 9 cycles of a 333MHz clock, or 27.0ns. In the case of a bank activate, this is 14 cycles, or 42.0ns. Including 20ns for miss processing on chip, this makes the two 42 + 20 = 62ns and 27.0 + 20 = 47ns. Including time on chip, the bank activate takes 62/47 = 1.32 or 32% longer.

2.15 The costs of the two systems are 2 × $130 + $800 = $1060 with the DDR2-667 DIMM and 2 × $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to service a level-2 miss is 14 × (1/333MHz) = 42ns 80% of the time and 9 × (1/333MHz) = 27ns 20% of the time with the DDR2-667 DIMM. It is 12 × (1/266MHz) = 45ns (80% of the time) and 8 × (1/266MHz) = 30ns (20% of the time) with the DDR2-533 DIMM. The CPI added by the level-2 misses in the case of DDR2-667 is 0.00333 × 42 × 0.8 + 0.00333 × 27 × 0.2 = 0.130, giving a total of 1.5 + 0.130 = 1.63. Meanwhile the CPI added by the level-2 misses for DDR2-533 is 0.00333 × 45 × 0.8 + 0.00333 × 30 × 0.2 = 0.140, giving a total of 1.5 + 0.140 = 1.64.
Thus the drop is only 1.64/1.63 = 1.006, or 0.6%, while the cost is $1060/$1000 = 1.06 or 6.0% greater. The cost/performance of the DDR2-667 system is 1.63 × 1060 = 1728 while the cost/performance of the DDR2-533 system is 1.64 × 1000 = 1640, so the DDR2-533 system is a better value.

2.16 The cores will be executing 8 cores × 3GHz/2.0 CPI = 12 billion instructions per second. This will generate 12 × 0.00667 = 80 million level-2 misses per second. With the burst length of 8, this would be 80 × 32B = 2560MB/sec. If the memory bandwidth is sometimes 2X this, it would be 5120MB/sec. From Figure 2.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.

2.17 a. The system built from 1Gb DRAMs will have twice as many banks as the system built from 2Gb DRAMs. Thus the 1Gb-based system should provide higher performance since it can have more banks simultaneously open.
b. The power required to drive the output lines is the same in both cases, but the system built with the x4 DRAMs would require activating banks on 18 DRAMs, versus only 9 DRAMs for the x8 parts. The page size activated on each x4 and x8 part is the same, and takes roughly the same activation energy.
Thus since there are fewer DRAMs being activated in the x8 design option, it would have lower power.

2.18 a. With policy 1,
Precharge delay Trp = 5 × (1/333 MHz) = 15ns
Activation delay Trcd = 5 × (1/333 MHz) = 15ns
Column select delay Tcas = 4 × (1/333 MHz) = 12ns
Access time when there is a row buffer hit:
\( T_h = \frac{r\,(Tcas + Tddr)}{100} \)
Access time when there is a miss:
\( T_m = \frac{(100 - r)(Trp + Trcd + Tcas + Tddr)}{100} \)
With policy 2,
Access time = Trcd + Tcas + Tddr
If A is the total number of accesses, the tip-off point will occur when the net access time with policy 1 is equal to the total access time with policy 2, i.e.,
\( \frac{r}{100}(Tcas + Tddr)A + \frac{100 - r}{100}(Trp + Trcd + Tcas + Tddr)A = (Trcd + Tcas + Tddr)A \)
\( \Rightarrow r = \frac{100 \times Trp}{Trp + Trcd} \)
r = 100 × 15/(15 + 15) = 50%
If r is less than 50%, then we have to proactively close a page to get the best performance, else we can keep the page open.
b. The key benefit of closing a page is to hide the precharge delay Trp from the critical path. If the accesses are back to back, then this is not possible. This new constraint will not impact policy 1.
The new equations for policy 2,
Access time when we can hide precharge delay = Trcd + Tcas + Tddr
Access time when precharge delay is in the critical path = Trcd + Tcas + Trp + Tddr
Equation 1 will now become,
\( \frac{r}{100}(Tcas + Tddr)A + \frac{100 - r}{100}(Trp + Trcd + Tcas + Tddr)A = 0.9 \times (Trcd + Tcas + Tddr)A + 0.1 \times (Trcd + Tcas + Trp + Tddr)A \)
\( \Rightarrow r = 90 \times \left(\frac{Trp}{Trp + Trcd}\right) \)
r = 90 × 15/30 = 45%
c. For any row buffer hit rate, policy 2 requires additional r × (2 + 4) nJ per access. If r = 50%, then policy 2 requires 3nJ of additional energy.

2.19 Hibernating will be useful when the static energy saved in DRAM is at least equal to the energy required to copy from DRAM to Flash and then back to DRAM. DRAM dynamic energy to read/write is negligible compared to Flash and can be ignored.
\( \text{Time} = \frac{8 \times 10^{9} \times 2 \times 2.56 \times 10^{-6}}{64 \times 1.6} = 400 \text{ seconds} \)
The factor 2 in the above equation is because to hibernate and wake up, both Flash and DRAM have to be read and written once.

2.20 a. Yes. The application and production environment can be run on a VM hosted on a development machine.
b. Yes. Applications can be redeployed on the same environment on top of VMs running on different hardware. This is commonly called business continuity.
c. No. Depending on support in the architecture, virtualizing I/O may add significant or very significant performance overheads.
d. Yes. Applications running on different virtual machines are isolated from each other.
e. Yes. See "Devirtualizable virtual machines enabling general, single-node, online maintenance," David Lowell, Yasushi Saito, and Eileen Samberg, in the Proceedings of the 11th ASPLOS, 2004, pages 211–223.

2.21 a. Programs that do a lot of computation but have small memory working sets and do little I/O or other system calls.
b. The slowdown above was 60% for 10%, so 20% system time would run 120% slower.
c. The median slowdown using pure virtualization is 10.3, while for paravirtualization the median slowdown is 3.76.
d. The null call and null I/O call have the largest slowdown.
These have no real work to outweigh the virtualization overhead of changing protection levels, so they have the largest slowdowns.

2.22 The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it was running on a host without VT-x technology.

2.23 a. As of the date of the Computer paper, AMD-V adds more support for virtualizing virtual memory, so it could provide higher performance for memory-intensive applications with large memory footprints.
b. Both provide support for interrupt virtualization, but AMD's IOMMU also adds capabilities that allow secure virtual machine guest operating system access to selected devices.

2.24 Open hands-on exercise, no fixed solution.

2.25 a. These results are from experiments on a 3.3GHz Intel® Xeon® Processor X5680 with Nehalem architecture (Westmere at 32nm). The number of misses per 1K instructions of L1 Dcache increases significantly by more than 300X when input data size goes from 8KB to 64KB, and keeps relatively constant around 300/1K instructions for all the larger data sets. Similar behavior with different flattening points on L2 and L3 caches is observed.
b. The IPC decreases by 60%, 20%, and 66% when input data size goes from 8KB to 128KB, from 128KB to 4MB, and from 4MB to 32MB, respectively. This shows the importance of all caches. Among all three levels, L1 and L3 caches are more important. This is because the L2 cache in the Intel® Xeon® Processor X5680 is relatively small and slow, with capacity being 256KB and latency being around 11 cycles.
c. For a recent Intel i7 processor (3.3GHz Intel® Xeon® Processor X5680), when the data set size is increased from 8KB to 128KB, the number of L1 Dcache misses per 1K instructions increases by around 300, and the number of L2 cache misses per 1K instructions remains negligible.
With an 11-cycle miss penalty, this means that without prefetching or latency tolerance from out-of-order issue we would expect there to be an extra 3300 cycles per 1K instructions due to L1 misses, which means an increase of 3.3 cycles per instruction on average. The measured CPI with the 8KB input data size is 1.37. Without any latency tolerance mechanisms we would expect the CPI of the 128KB case to be 1.37 + 3.3 = 4.67. However, the measured CPI of the 128KB case is 3.44. This means that memory latency hiding techniques such as OOO execution, prefetching, and non-blocking caches improve the performance by more than 26%.

Chapter 3 Solutions

Case Study 1: Exploring the Impact of Microarchitectural Techniques

3.1 The baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48, if no new instruction's execution could be initiated until the previous instruction's execution had completed, is 40. See Figure S.2. Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units; since every instruction must execute, the loop will take at least that many clock cycles). To that base number, we add the extra latency cycles. Don't forget the branch shadow cycle.

    Loop: LD    F2,0(Rx)      1 + 4
          DIVD  F8,F2,F0      1 + 12
          MULTD F2,F6,F2      1 + 5
          LD    F4,0(Ry)      1 + 4
          ADDD  F4,F0,F4      1 + 1
          ADDD  F10,F8,F2     1 + 1
          ADDI  Rx,Rx,#8      1
          ADDI  Ry,Ry,#8      1
          SD    F4,0(Ry)      1 + 1
          SUB   R20,R4,Rx     1
          BNZ   R20,Loop      1 + 1
          cycles per loop iter   40

Figure S.2 Baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48.

3.2 How many cycles would the loop body in the code sequence in Figure 3.48 require if the pipeline detected true data dependencies and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? The answer is 25, as shown in Figure S.3. Remember, the point of the extra latency cycles is to allow an instruction to complete whatever actions it needs, in order to produce its correct output. Until that output is ready, no dependent instructions can be executed. So the first LD must stall the next instruction for three clock cycles. The MULTD produces a result for its successor, and therefore must stall 4 more clocks, and so on.

Figure S.3 Number of cycles required by the loop body in the code sequence in Figure 3.48.

3.3 Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require? The answer is 22, as shown in Figure S.4. The LD goes first, as before, and the DIVD must wait for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run in the second pipe along with the DIVD, since there's no dependency between them. (Note that they both need the same input, F2, and they must both wait on F2's readiness, but there is no constraint between them.) The LD following the MULTD does not depend on the DIVD nor the MULTD, so had this been a superscalar-order-3 machine, that LD could conceivably have been executed concurrently with the DIVD and the MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in the cycle following the DIVD/MULTD. The loop overhead instructions at the loop's bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.

    Execution pipe 0                 Execution pipe 1
    Loop: LD F2,0(Rx)              ; <nop>
    <stall for LD latency>         ; <nop>
    <stall for LD latency>         ; <nop>
    <stall for LD latency>         ; <nop>
    <stall for LD latency>         ; <nop>
    DIVD F8,F2,F0                  ; MULTD F2,F6,F2
    LD F4,0(Ry)                    ; <nop>
    <stall for LD latency>         ; <nop>
    <stall for LD latency>         ; <nop>
    <stall for LD latency>         ; <nop>
    <stall for LD latency>         ; <nop>
    ADDD F4,F0,F4                  ; <nop>
    <stall due to DIVD latency>    ; <nop>
    <stall due to DIVD latency>    ; <nop>
    <stall due to DIVD latency>    ; <nop>
    <stall due to DIVD latency>    ; <nop>
    <stall due to DIVD latency>    ; <nop>
    <stall due to DIVD latency>    ; <nop>
    ADDD F10,F8,F2                 ; ADDI Rx,Rx,#8
    ADDI Ry,Ry,#8                  ; SD F4,0(Ry)
    SUB R20,R4,Rx                  ; BNZ R20,Loop
    <nop>                          ; <stall due to BNZ>
    cycles per loop iter 22

Figure S.4 Number of cycles required per loop.

3.4 Possible answers:
1. If an interrupt occurs between N and N + 1, then N + 1 must not have been allowed to write its results to any permanent architectural state. Alternatively, it might be permissible to delay the interrupt until N + 1 completes.
2. If N and N + 1 happen to target the same register or architectural state (say, memory), then allowing N to overwrite what N + 1 wrote would be wrong.
3. N might be a long floating-point op that eventually traps. N + 1 cannot be allowed to change arch state in case N is to be retried.
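The 40-cycle total reported for Figure S.2 can be checked mechanically: with no overlap, each instruction costs one issue cycle plus its extra latency. The following is a minimal Python sketch of that sum; the names FIGURE_S2 and baseline_cycles are illustrative and do not come from the solutions manual, and the latency values are simply the "1 + k" entries of the figure.

```python
# (instruction, extra latency cycles) pairs, copied from the "1 + k" column of Figure S.2.
FIGURE_S2 = [
    ("LD F2,0(Rx)", 4),
    ("DIVD F8,F2,F0", 12),
    ("MULTD F2,F6,F2", 5),
    ("LD F4,0(Ry)", 4),
    ("ADDD F4,F0,F4", 1),
    ("ADDD F10,F8,F2", 1),
    ("ADDI Rx,Rx,#8", 0),
    ("ADDI Ry,Ry,#8", 0),
    ("SD F4,0(Ry)", 1),
    ("SUB R20,R4,Rx", 0),
    ("BNZ R20,Loop", 1),  # the extra cycle here is the branch shadow cycle
]


def baseline_cycles(listing):
    """Cycles per iteration when no instruction starts before the previous one finishes."""
    return sum(1 + extra for _, extra in listing)


print(baseline_cycles(FIGURE_S2))  # 40
```

The 25-cycle and 22-cycle answers of 3.2 and 3.3 need a dependence-aware schedule rather than a plain sum, so they are not covered by this sketch.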
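Computer Architecture Courseware
Understand the basic principles and design methods of instruction set architecture, processor design, storage systems, and input/output systems.
Foster students' interest in and enthusiasm for the field of computer architecture, laying a solid foundation for future study and work.
CHAPTER
02
Overview of Computer Architecture
Definition of Computer Architecture
Computer architecture refers to the overall design and organization of a computer system, including the way its hardware and software interact.
CHAPTER
06
Parallel Processing and Multicore Processors
Overview of Parallel Processing
Parallel processing
The ability to complete two or more pieces of work at the same instant or within the same time interval.
Classification of parallel processing
Temporal parallelism, spatial parallelism, data parallelism, and pipeline parallelism.
Advantages of parallel processing
Higher computing speed, greater computing capability, and better resource utilization.
Multicore Processors
Multicore processor
Multiple cores are integrated on a single processor, and each core can execute instructions independently.
Indirect addressing
In indirect addressing, the effective address of the operand is supplied indirectly through a register: the computer first reads the address held in the register, then uses that address to fetch the operand and operate on it.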
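The two-step lookup described for register-indirect addressing can be sketched in a few lines of Python. This is only an illustration under simple assumptions: memory is modeled as a list, registers as a dictionary, and the function name load_register_indirect and the register name R1 are hypothetical.

```python
def load_register_indirect(registers, memory, reg):
    """Return the operand whose address is held in register `reg`."""
    effective_address = registers[reg]  # step 1: read the address from the register
    return memory[effective_address]    # step 2: fetch the operand at that address


memory = [0] * 16
memory[9] = 42           # the operand lives at address 9
registers = {"R1": 9}    # R1 holds the operand's address, not the operand itself
print(load_register_indirect(registers, memory, "R1"))  # 42
```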
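CHAPTER
04
Storage Systems
Overview of Storage Systems
The storage system is an important part of computer architecture and is responsible for storing and retrieving data and instructions.
A storage system usually consists of several levels of memory, including main memory, external storage, and cache.
Computer Architecture PPT Courseware
CONTENTS
• Introduction • Overview of Computer Architecture • Instruction Systems • Storage Systems • Input/Output Systems • Parallel Processing and Multicore Processors • Pipelining • Computer Architecture Optimization Techniques
CHAPTER
01
Introduction
Course Overview
Computer architecture is a core course in computer science. It studies the basic components, organization, operating principles, and design methods of computer systems.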
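Computer System Architecture Courseware
2.1.1.1 Composition of floating-point numbers. The composition of a floating-point number closely resembles what is usually called "scientific notation"; the only difference is that every part has a finite number of digits, as shown below.
It has 8 main parameters:
m: the mantissa, generally a pure fraction that satisfies the normalization rule (the absolute value of the most significant digit is nonzero), represented in sign-magnitude or two's complement form;
e: the exponent, an integer, usually represented in biased (excess) code (explained below);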
= 1.25 × 80% × IC_A × 1.1 × CYCLE_A = 1.1 × IC_A × CYCLE_A < Te_A. In this case machine B is somewhat faster.
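• Problem 12 (p. 33)
Apply the Amdahl's law formula and substitute the known quantities.
With Se = 20 it becomes a function of one variable:
Sn = 20/(20 - 19Fe)
Draw the curve using the three-point plotting method.
[Figure: plot of Sn against Fe]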
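The one-variable form follows from the general Amdahl's law expression Sn = 1/((1 - Fe) + Fe/Se) once Se = 20 is substituted. The Python sketch below evaluates it at three sample values of Fe; the function name amdahl_speedup and the choice of sample points (0, 0.5, 1) are illustrative assumptions, since the slide does not say which three points to use.

```python
def amdahl_speedup(fe, se=20.0):
    """Overall speedup Sn when a fraction fe of the work is sped up by a factor se."""
    return 1.0 / ((1.0 - fe) + fe / se)


# Three candidate points for sketching the curve Sn = 20 / (20 - 19 * Fe).
for fe in (0.0, 0.5, 1.0):
    closed_form = 20.0 / (20.0 - 19.0 * fe)
    print(fe, round(amdahl_speedup(fe), 3), round(closed_form, 3))
```

Printing both columns makes it easy to confirm that the general formula and the simplified one agree at each point.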
2001.9.1
Computer System Architecture
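• = 1.25 × 80% × IC_A × 1.25 × CYCLE_A
• = 1.25 × IC_A × CYCLE_A > Te_A
• Clearly, machine A is faster.
Selected Examples (5)
• Example 1.5 (p. 12): apply the Te formula with CYCLE_B from the previous example changed to 1.1 × CYCLE_A; then finally
Te_B = 1.25 × IC_B × CYCLE_B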
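Assembly language machine: assembly language programmer (uses assembly language)
(translated by the assembler into machine language and operating system primitives)
Operating system language machine: operating system user (uses operating system primitives)
(translated by primitive-interpreting subroutines into machine language)
Traditional machine language machine: traditional machine programmer (uses binary machine language)
(interpreted by microprograms into microinstruction sequences)
Microinstruction language machine: microprogrammer (uses microinstruction language)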
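Chapter 5: Computer Architecture (95-slide PPT material)
The time for the instructions is: T = (1 + 2n)t
[Diagram: three instructions, each with fetch, analyze, and execute phases, overlapped once]
Main advantages: instruction execution time is shortened; utilization of the functional units rises markedly
Main disadvantages: some additional hardware is required; the control process is slightly more complex
3. Two-level overlapped execution
If the three phases take equal time, the time to execute n instructions is: T = (2 + n)t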
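The two timing formulas above can be compared numerically. The sketch below is a small Python illustration; the function names are hypothetical, and the fully sequential case T = 3nt is an assumption added for comparison (three equal phases of length t with no overlap), since the excerpt does not state it.

```python
def single_overlap_time(n, t):
    """One-level overlap, as given above: T = (1 + 2n) * t."""
    return (1 + 2 * n) * t


def double_overlap_time(n, t):
    """Two-level overlap, as given above: T = (2 + n) * t."""
    return (2 + n) * t


def sequential_time(n, t):
    """Assumption, not stated in the excerpt: no overlap, three phases of length t each."""
    return 3 * n * t


n, t = 100, 1
print(sequential_time(n, t), single_overlap_time(n, t), double_overlap_time(n, t))
# 300 201 102
```

For large n the two-level overlap approaches one instruction per phase time t, which is why it is roughly three times faster than the sequential case.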
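5.2.1 How a Pipeline Works
1. A simple pipeline
[Diagram: input → analyzer (analyzing instruction k+1) → pipeline latch → execution unit (executing instruction k) → pipeline latch → output; stage times t1 and t2]
Each stage of a pipeline is called a pipeline step, pipeline stage, pipeline segment, pipeline functional segment, functional segment, pipeline level, pipeline beat, and so on.
A register must be placed at the end or the beginning of every pipeline stage; it is called a pipeline register, pipeline latch, or pipeline gate register. It adds to the instruction execution time.
[Figure: space-time diagram of a static pipeline. Horizontal axis: time. Vertical axis (space): input, exponent difference, exponent alignment, mantissa add, normalization, mantissa multiply, accumulate, output. Floating-point addition tasks 1 to n are followed by fixed-point multiplication tasks.]
Dynamic pipeline: within the same period of time, the stages of a multifunction pipeline can be connected in different ways so that several functions execute at once.
[Figure: space-time diagram of a dynamic pipeline. Horizontal axis: time. Vertical axis (space): the same stages as above. Floating-point addition and fixed-point multiplication tasks overlap in time.]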
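Chapter 5-4: Rank Correlation Coefficients
When rank correlation applies
Spearman rank correlation; Kendall rank correlation (Kendall's W coefficient)
Homework: 6, 7
Kendall rank correlation (Kendall's W coefficient)
Kendall's coefficient of concordance (W): without tied ranks; with tied ranks
With tied ranks
R_i is the sum of the K ranks received by each rated object; n is the number of tied ranks. Example 5-7
Without tied ranks
R_i is the sum of the ranks the rated object received from the K raters; N is the number of rated objects; K is the number of raters. Example 5-6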
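The slide defines R_i, N, and K, but the formula itself did not survive extraction. Assuming it is the standard textbook form for the no-ties case, W = 12S / (K^2 (N^3 - N)) with S = ΣR_i^2 - (ΣR_i)^2 / N, a minimal Python sketch looks like this; the function name kendalls_w and the sample rankings are illustrative only.

```python
def kendalls_w(rankings):
    """Kendall's W without tied ranks.

    rankings[k][i] is the rank that rater k gave to object i (1..N, no ties).
    Assumes the standard form W = 12 * S / (K**2 * (N**3 - N)).
    """
    k = len(rankings)       # K: number of raters
    n = len(rankings[0])    # N: number of rated objects
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]  # R_i
    s = sum(x * x for x in rank_sums) - sum(rank_sums) ** 2 / n
    return 12 * s / (k ** 2 * (n ** 3 - n))


# Three raters in complete agreement give W = 1.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0
```

W ranges from 0 (no agreement) to 1 (complete agreement); the tied-ranks variant needs a correction term that this sketch does not include.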
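Kendall's coefficient of concordance (W)
1. It measures how consistent the rankings of more than two raters are. This consistency is expressed as a correlation coefficient over several columns of rank variables.
2. Data suited to this method are usually collected by ranking: K judges (subjects) rate N items, or one judge (subject) rates N items on K successive occasions.
3. In the ranking method each rater orders the N items; the smallest rank is 1 and the largest is N. Tied items share equally the ranks they would jointly occupy.
4. Data that could be analyzed with product-moment correlation lose precision if rank correlation is used instead.
Formula without tied ranks
\( r_R = 1 - \frac{6 \sum_{i=1}^{n} D^2}{N(N^2 - 1)} \)
N: the number of paired ranks; D = R_X - R_Y: the difference between the paired ranks of the two variables
Example 5-3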
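For the no-ties case, the Spearman coefficient r_R = 1 - 6ΣD^2 / (N(N^2 - 1)) is straightforward to compute once both variables have been converted to ranks. The following Python sketch assumes the inputs are already ranks without ties; the function name and the sample data are illustrative and are not the data of Example 5-3.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """Spearman r_R for two equally long lists of ranks with no ties."""
    n = len(rank_x)  # N: number of paired ranks
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))  # sum of D^2
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


print(spearman_rank_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

Identical rankings give 1 and exactly reversed rankings give -1, which is a quick sanity check on any implementation.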
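Formula with tied ranks
N: the number of paired data; n is the number of tied ranks within each column variable (note the summation). Note how ranks are assigned to tied values. See the example.
When rank correlation applies
For correlation between rank data, or when the population distribution is non-normal and the conditions for product-moment correlation are not met.
Because it makes no assumption about the population distribution, it is also called a nonparametric correlation method.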
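Computer Architecture Courseware
05
Development Trends in Computer Architecture
Multicore Processors
Summary
Multicore processor technology is one of the important development trends in computer architecture. By integrating multiple processor cores on one chip, it raises a computer's processing capability and energy efficiency.
Details
With advances in integrated circuit technology, multicore processors have become a reality and are widely used in all kinds of computer systems. A multicore processor can execute several threads at once, which improves parallel processing capability and makes the computer more efficient on complex tasks.
Memory is the component of a computer used to store data and instructions.
Details
Memory comes in different types, such as random access memory (RAM), read-only memory (ROM), and cache. They store data and instructions in binary form and allow the stored data to be read, written, and modified.
Controller
Summary
The controller is the component that coordinates the work of the computer's parts.
Details
The controller directs the workflow of the computer's components and ensures that they operate in the correct order and at the correct time. It usually consists of an instruction counter, an instruction register, and control logic, and it can decode instructions and coordinate the work of the components.
Hardware Virtualization Technology
Summary
Hardware virtualization is another important development trend in computer architecture. It abstracts physical hardware resources into virtual resources, which enables resource sharing and flexible configuration.
Details
Hardware virtualization lets several operating systems run on the same physical hardware, with each operating system believing it owns the complete hardware resources. This not only raises hardware utilization but also strengthens system reliability
03
Computer architecture determines a computer's energy consumption and cost, both of which are very important considerations for modern computer systems.
Classification of Computer Architectures
Classification by instruction set architecture
Architectures can be divided into complex instruction set computers (CISC) and reduced instruction set computers (RISC).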
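Computer Architecture Courseware
The input/output system is the hardware that receives external input (keyboard, mouse, sensors, and so on) and outputs data (display, printer, speakers, and so on). Its performance and reliability are critical to the computer's overall performance and user experience.
Buses and Interfaces
Summary
Buses and interfaces are the channels that connect a computer's components and carry their communication.
Details
Buses and interfaces are the channels through which a computer's components communicate. A bus is the shared channel connecting the components, while an interface is the channel connecting external devices to the computer. Through buses and interfaces the components communicate and cooperate to deliver the computer's overall function. Their performance and stability are critical to the computer's overall performance and user experience.
longer battery life.
Extending functionality
03
Adding input/output interfaces, supporting more data types, and similar measures extend a computer's functions and range of applications.
Classification of Computer Architectures
By instruction set
Complex instruction set computers (CISC) and reduced instruction set computers (RISC).
By data type
Fixed-length data and variable-length data.
By addressing mode
Direct addressing, indirect addressing, base-plus-index addressing, and so on.
03
Computer Instruction Systems
Instruction Set Architectures
Complex instruction set architecture (CISC)
Provides many complex instructions that can perform a variety of high-level operations.
Reduced instruction set architecture (RISC)
Contains only simple, basic instructions and emphasizes speeding up execution through parallel processing.
Very long instruction word architecture (VLIW)
Achieves parallel processing by packing multiple operands and opcodes into a single instruction.
Instruction Formats and Addressing Modes
Fixed-length instruction formats
Reconfigurable computing faces challenges in energy efficiency, scalability, programming models, and other areas; how to design more efficient
THANKS
Thank you for watching
Details
Memory is the hardware a computer uses to store data and programs. Depending on speed, capacity, and price, a computer contains several types of memory, such as random access memory (RAM), read-only memory (ROM), and cache. Memory capacity and speed have a large effect on computer performance.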
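Computer Operating Systems OS-chapter 5
4) Relocation (address translation): converting a logical address into the corresponding physical address is called relocation.
5) Program loading and linking
Program loading
Absolute loading: the actual memory addresses are given directly at compile or assembly time; this suits only a uniprogramming environment.
Physical addresses supplied by the programmer (demanding for the programmer)
Physical addresses supplied by the compiler or assembler
Relocatable loading: every program is addressed starting from 0, and the other addresses in the program are relative to address 0. When the program is loaded into memory, the physical addresses differ from the logical addresses, so not only the instruction addresses but also the instruction contents must be modified.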
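Relocatable loading as described here amounts to adding the load address to every instruction address and to every address operand inside the instructions. The Python sketch below illustrates this static relocation on a one-instruction program; the program representation, the function name relocate, and the load address 10000 are all hypothetical choices made for the example.

```python
def relocate(program, base):
    """Statically relocate a program addressed from 0 so that it starts at `base`.

    `program` maps a logical instruction address to (opcode, register, operand_address).
    Both the instruction address and the address operand are shifted by `base`.
    """
    loaded = {}
    for logical_addr, (opcode, reg, operand_addr) in program.items():
        loaded[base + logical_addr] = (opcode, reg, base + operand_addr)
    return loaded


# A "Load 1,2500" instruction at logical address 1000, loaded at a hypothetical base of 10000.
program = {1000: ("LOAD", 1, 2500)}
print(relocate(program, 10000))  # {11000: ('LOAD', 1, 12500)}
```

Because the instruction contents themselves are rewritten at load time, a program relocated this way cannot simply be moved again later without repeating the fix-up; dynamic relocation defers the addition to a hardware base register instead.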
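3. Sharing memory contents: let multiple programs share memory dynamically, and ideally share the information held in memory as well.
4. Address translation (relocation) (requires hardware support)
[Diagram: high-level language source program (logical address space) → compile → relocatable object files → link (with library files) → object code (.EXE) → load → memory]
1) Logical address (relative address): users always program with addresses that start from 0; the addresses used in programming are called logical addresses.
2) Physical address (memory address, absolute address): memory consists of many storage cells, and the number assigned to each cell is its physical address.
3) Address space. Logical address space: the user's programming space, the range swept out by the CPU's address bus. Physical address space: the space made up of physical storage cells, the range swept out by the memory's address bus.
5.1 Functions of Memory Management
1. Memory allocation and reclamation: different management schemes use different allocation and reclamation algorithms. Whatever the scheme, an effective one must respond at once to a user's request and allocate memory, and must reclaim memory as soon as the user has finished with it so that other users can use it. To that end, allocation needs the following mechanisms: record the state of every region (allocated or free); perform the allocation (modify the data structures); accept regions released by the system or by users (modify the data structures).
After a program has been loaded it cannot be moved within memory.
[Figure: address space starting at 0, with the instruction Load 1,2500 at address 1000]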
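Computer Architecture Explained
Computer architecture refers to the relationship between computer hardware and software and the way they are organized within a computer system.
In computer science, computer architecture is regarded as one of the field's core concepts.
This article covers the main aspects of computer architecture: its definition, history, basic principles, main types, and applications.
1. Definition. Computer architecture is a conceptual model that describes the relationship between computer hardware and software.
It describes how the components and subsystems inside a computer are connected, how data flows among them, and how they are controlled.
Computer architecture covers not only a computer's physical structure but also its logical structure and mode of operation.
2. History. The concept of computer architecture first appeared in the von Neumann architecture of the late 1940s.
As computer science developed, computer architecture evolved into several types, including the von Neumann architecture, the Harvard architecture, superscalar architectures, and multicore architectures.
3. Basic principles. The basic principles of computer architecture include instruction set architecture, data representation and processing, the memory hierarchy, and processor organization and control.
The instruction set architecture defines the computer's instructions and how they execute; data representation and processing concerns the internal representation of data and how arithmetic and logic operations are carried out; the memory hierarchy describes how memory is organized and accessed; processor organization and control describes the processor's internal structure and operation.
4. Main types. Classified by organization and characteristics, the common types are the von Neumann architecture, the Harvard architecture, superscalar architectures, and multicore architectures.
The von Neumann architecture is one of the earliest. It stores program instructions and data in the same memory and executes instructions sequentially.
The Harvard architecture stores program instructions and data in separate memories to improve the parallel handling of instructions and data.
A superscalar architecture can execute several instructions at once, which improves the computer's efficiency.
A multicore architecture integrates several processor cores to process multiple tasks in parallel.
5. Applications. Computer architecture is widely applied in the design, development, and optimization of computer hardware and software.
In hardware design, the choice of architecture directly affects a computer's performance and energy consumption.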
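Computer Architecture: A Basic Tutorial
Computer architecture is one of the core concepts of computer science. It describes the relationships and interactions between computer hardware and software.
A solid understanding of computer architecture is important for studying and applying computer science and engineering.
This article is a basic tutorial on computer architecture covering the following topics:
1. What is computer architecture?
2. Why computer architecture matters
3. Components and layers of computer architecture
4. History of computer architecture
5. Main types of computer architecture
6. How to choose a suitable computer architecture
7. Future trends in computer architecture
1. What is computer architecture?
Computer architecture is the organization of, and the interaction between, computer hardware and software.
It covers the key internal components of a computer, such as the central processing unit (CPU), memory, input/output devices, and buses, together with the way they are connected and the way data moves between them.
Computer architecture is an abstraction: it describes how a computer works logically, without the details of its physical implementation.
2. Why computer architecture matters
Computer architecture is a foundation of computer science and engineering; it provides the knowledge needed to understand how computers work and how they perform.
By studying it, we can better understand and design computer hardware and software and improve the efficiency and performance of computer systems.
It also helps us optimize programs and diagnose problems in computer systems.
3. Components and layers of computer architecture
A computer architecture consists of several components and several layers.
The most basic component is the central processing unit (CPU), which contains an arithmetic unit and a control unit.
The arithmetic unit performs arithmetic and logic operations, while the control unit executes instructions and controls the other parts of the computer.
Other important components are memory, input/output devices, and buses, which store data, interact with external devices, and transfer data between components.
From bottom to top, the layers are the hardware layer, the machine-level layer, the operating system layer, the application layer, and the user layer.
Each layer has its own tasks and functions, and the layers interact through interfaces and protocols.
The hardware layer implements the physical components; the machine-level layer provides instruction execution and data operations; the operating system layer controls and manages the computer's resources; the application layer provides specific applications and services; the user layer handles user interaction and operation of applications.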
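Computer Architecture courseware
Application areas of computer architecture
1 Cloud computing
Understand the characteristics and application areas of cloud architectures, such as Infrastructure as a Service (IaaS) and Software as a Service (SaaS).
2 Internet of Things
Explore the design principles and suitable scenarios of IoT architectures, such as smart homes and smart cities.
3 Artificial intelligence
Understand the computer architecture of AI systems, including deep learning and neural networks.
Summary and outlook
In this courseware we examined the definition, importance, classic models, and application areas of computer architecture. We hope this helps you better understand and apply its principles and ideas.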
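3 Multi-core processors
Understand how multi-core processors work and how to exploit multi-core architectures to improve system performance.
Evolution of computer architecture
Mainframe era
A look back at early large computers, such as the IBM System/360 family.
Personal computer era
The rise of the personal computer, such as the IBM PC and the Apple Macintosh.
Cloud computing era
The concept and development of cloud computing, such as Amazon Web Services and Microsoft Azure.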
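Computer Architecture PPT courseware
Welcome to the Computer Architecture slides. Here we look at the definition, importance, classic models, and application areas of computer architecture, and at where the field is heading.
Course introduction
Exploring computer architecture
Learn the basic concepts and learning objectives of computer architecture and how to apply this knowledge.
Importance and applications
Explore the importance and applications of computer architecture in areas such as cloud computing, the Internet of Things, and artificial intelligence.
2 Scalability
A well-designed computer architecture makes a system scalable, so that it can adapt to growing demand.
3 Reliability
A sound computer architecture improves system reliability and reduces failures and interruptions.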
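Computer Architecture, 2nd edition courseware, Chapter 4, Lecture 5
The FPS-164 is a more recent machine; each of its instruction words contains 10 operations, one for each of 10 different functional units.
Basic VLIW organization
A VLIW machine uses several independent functional units. Several different operations are packed into one long instruction word, and each functional unit has its own field in the VLIW instruction.
Schedule the code above on a superscalar pipeline to obtain more instruction-level parallelism.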
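Integer-instruction column of the superscalar schedule (the other columns of the table were not recovered):
Loop: LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
LD F14,-24(R1)
LD F18,-32(R1)
SD 0(R1),F4
SD -8(R1),F8
SD -16(R1),F12
SD -24(R1),F16
SUBI R1,R1,#40
BNEZ R1,Loop
SD -32(R1),F20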
[Table: VLIW schedule, partially recovered. Slots that survive:]
Mem Ref2: L.D F6,-8(R1)
FP1: ADD.D F4,F0,F2; ADD.D F12,F10,F2
FP2: ADD.D F8,F6,F2
Stores: S.D F4,24(R1); S.D F12,8(R1); S.D F8,16(R1)
Int/branch: DADDUI R1,R1,#-24; BNE R1,R2,Loop
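Each functional unit typically occupies 16 to 24 bits of the instruction word. For example, with 2 integer, 2 floating-point, 2 memory-access, and 1 branch unit (7 fields), the instruction is 112 to 168 bits long.
The VLIW hardware simply routes each field of the instruction word to its functional unit; which operation each unit performs in which clock cycle is decided by the compiler.
If a functional unit has no work in a given cycle, it executes a NOP.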
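The word-length arithmetic and the NOP-filling rule above can be sketched as follows. The slot names and the helper functions are illustrative assumptions; the slot mix (2 integer, 2 floating-point, 2 memory, 1 branch) and the 16 to 24 bits per field come from the slide.

```python
# One field per functional unit, in a fixed order (names are hypothetical).
SLOTS = ["int1", "int2", "fp1", "fp2", "mem1", "mem2", "branch"]


def word_length_bits(bits_per_slot: int) -> int:
    """Length of one VLIW instruction word when every field has the same width."""
    return len(SLOTS) * bits_per_slot


def make_bundle(ops: dict[str, str]) -> list[str]:
    """Place the compiler-chosen operations in their fields; idle units get NOP."""
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"no such functional unit: {sorted(unknown)}")
    return [ops.get(slot, "NOP") for slot in SLOTS]


shortest = word_length_bits(16)
longest = word_length_bits(24)
bundle = make_bundle({"mem2": "L.D F6,-8(R1)", "fp1": "ADD.D F4,F0,F2"})
```

The sketch makes the cost of the approach visible: every bundle carries all seven fields, so cycles in which the compiler finds little parallelism still spend the full word on NOPs.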
VLIW example
Int/branch: DADDUI R1,R1,#-80; BNE R1,R2,Loop
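Computer System Architecture 5-4, PPT courseware
➢ No functional-unit conflict: there is usually only one functional unit of each type, so instructions such as V3←V1×V2 and V6←V4×V5, which both need the multiplier, cannot run in parallel.
Chaining
Chaining applies the idea of dedicated forwarding paths to resolve RAW dependences on vector registers.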
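Example:
V3←A (load from memory)
V2←V0+V1
V4←V2×V3
The memory access and the floating-point add can operate in parallel.
[Figure: chaining of the memory-access unit, the floating-point adder, and the multiplier through vector registers V0 to V4; the diagram was not recovered.]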
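Classification of vector processor structures
By where vector elements and results are held, vector processors fall into two types: memory-to-memory (M-M) and register-to-register (R-R).
❖ Control section: control unit and buffer units (intermediate registers)
❖ Scalar pipeline: functional units and scalar registers (S)
❖ Vector pipeline: functional units, load/store units, and registers (V, VM, VL)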
Memory-to-memory type
[Figure: memory system → read buffer → pipelined processing unit → write buffer → memory system]
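Instruction format (the opcode followed by fields 1 to 6):
Operation | X | A | Y | B | Z | C
X, Y, and Z give the register numbers and displacements of the source and destination vectors; A, B, and C hold the vector base addresses and lengths.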
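How can vector instruction performance be improved?
Ways to improve vector processing performance
Operate several functional units of the vector machine in parallel.
Conditions for parallel operation:
➢ No vector-register conflict: RAW, WAR, WAW, and RAR dependences on vector registers are not allowed.
Characteristics: vector length is not limited, but the number of memory accesses increases. Grouped processing is suitable: vertical within each group, horizontal between groups.
Group 1, in two steps: ① B[1..n] + C[1..n] = E[1..n]; ② A[1..n] × E[1..n] = D[1..n]
Group 2, in two steps: ① B[n+1..2n] + C[n+1..2n] = E[n+1..2n]; ② A[n+1..2n] × E[n+1..2n] = D[n+1..2n]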
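Pipelined vector processing
Ways to improve pipeline performance:
- Increase the number of pipeline stages to reduce Δt
- Start several instructions in each clock cycle
- Reduce dependences and the number of function switches, and increase the number of instructions processed.
Pipelined vector processing (continued)
Characteristics of vector operations
- Operations on the elements of a vector are independent of each other and identical
- One vector instruction is equivalent to a scalar loop, so the demand on instruction bandwidth is low
- Multi-bank interleaved memory can be used to reduce memory access latency.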
[Figure: space-time diagram of a superscalar processor of degree m=3; horizontal axis: time]
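Basic structure of superscalar processors
Ordinary pipelined processor:
- One instruction pipeline
- One multi-function execution unit; the average number of instructions executed per clock cycle is less than 1.
Processor with multiple execution units:
- One instruction pipeline
- Several independent execution units, which may or may not be pipelined
- Instruction-level parallelism is less than 1
Typical superscalar processor:
- Several instruction pipelines
- Advanced superscalar processors contain a fixed-point unit (CPU), a floating-point unit (FPU), and a graphics acceleration unit (GPU)
- A large number of general-purpose registers and two level-1 caches
- Instruction-level parallelism is greater than 1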
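Example: the Motorola MC88110
- 10 execution units
- Two register files: an integer general-purpose register file with 32 32-bit registers, and a floating-point extended register file with 32 80-bit registers. Each register file has 8 ports, connected to 8 internal buses, plus a load look-ahead queue of depth 4 and a store write-back queue of depth 3.
- Two independent caches of 8 KB each, two-way set associative.
- A branch target instruction cache which, for a two-way branch, holds the instructions on one of the two paths.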
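Vector and scalar performance of several supercomputers
Machine | Vector performance (Mflops) | Scalar performance (Mflops) | Vector balance point
Cray 1S | 85.0 | 9.8 | 0.90
Cray 2S | 151.5 | 11.2 | 0.93
Cray X-MP | 143.3 | 13.1 | 0.92
Cray Y-MP | 201.6 | 17.0 | 0.92
Hitachi S820 | 737.3 | |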
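The instruction set contains two classes of instructions: vector and scalar.
Vector arithmetic instructions
- Vector V1 produces vector V2, e.g. V2 = SIN(V1)
- Vector V produces scalar S, e.g. $S = \sum_{i=1}^{n} V_i$
- Vector V1 and vector V2 produce vector V3, e.g. V3 = V1 ^ V2
- Vector V1 and scalar S produce vector V2, e.g. V2 = S * V1
Special operation instructions
- Vector compare instructions
- Vector compress instructions
- Merge instructions
- Vector transfer instructions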
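The instruction classes listed above can be illustrated with plain Python lists standing in for vector registers. This is a simplified sketch: the function names are hypothetical, and the exact semantics of compare, compress, and merge differ between real vector machines.

```python
import math
from operator import mul


def vv(op, v1, v2):
    """Vector op vector -> vector (element by element)."""
    return [op(a, b) for a, b in zip(v1, v2)]


def vs(op, s, v):
    """Scalar op vector -> vector, e.g. V2 = S * V1."""
    return [op(s, x) for x in v]


def compare_gt(v1, v2):
    """Vector compare: produces a mask vector (one flag per element)."""
    return [a > b for a, b in zip(v1, v2)]


def compress(v, mask):
    """Vector compress: keep only the elements whose mask flag is set."""
    return [x for x, m in zip(v, mask) if m]


def merge(v1, v2, mask):
    """Merge: take the element from v1 where the flag is set, else from v2."""
    return [a if m else b for a, b, m in zip(v1, v2, mask)]


v1 = [1.0, 5.0, 3.0, 8.0]
v2 = [4.0, 2.0, 6.0, 7.0]
sines = [math.sin(x) for x in v1]   # vector -> vector, V2 = SIN(V1)
total = sum(v1)                     # vector -> scalar, S = sum of V_i
scaled = vs(mul, 2.0, v1)           # scalar * vector
mask = compare_gt(v1, v2)
kept = compress(v1, mask)
largest = merge(v1, v2, mask)
```

The mask produced by the compare step plays the role of the vector mask register (VM) mentioned elsewhere in these slides: it lets later instructions act on selected elements only.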
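Horizontal processing causes N dependences and 2N function switches; it suits scalar loops
Vertical processing: bi+ci->ki, ki*ai->di
Causes 1 dependence and 1 function switch; can be pipelined
Vertical-horizontal (grouped) processing: divide the vectors into groups, and process vertically within a group and horizontally between groups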
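The three processing orders for D = A×(B+C) can be compared with a short sketch. The function names and the way switches are counted are illustrative assumptions; the counts follow the slide's claims of 2N function switches for horizontal processing and 1 for vertical processing.

```python
def horizontal(a, b, c):
    """Element by element: add, then multiply, for each i (a scalar loop)."""
    d, switches = [], 0
    for ai, bi, ci in zip(a, b, c):
        k = bi + ci        # uses the adder
        d.append(k * ai)   # switches to the multiplier
        switches += 2      # adder -> multiplier, then back for the next element
    return d, switches


def vertical(a, b, c):
    """Whole vectors at a time: one vector add, then one vector multiply."""
    k = [bi + ci for bi, ci in zip(b, c)]
    d = [ki * ai for ki, ai in zip(k, a)]
    return d, 1            # a single switch from adder to multiplier


def grouped(a, b, c, n):
    """Grouped order: vertical within each group of n elements,
    horizontal (one group after another) between groups."""
    d = []
    for start in range(0, len(a), n):
        part, _ = vertical(a[start:start + n], b[start:start + n], c[start:start + n])
        d.extend(part)
    return d


A = [1, 2, 3, 4, 5]
B = [1, 1, 1, 1, 1]
C = [2, 2, 2, 2, 2]
d_h, sw_h = horizontal(A, B, C)
d_v, sw_v = vertical(A, B, C)
d_g = grouped(A, B, C, n=2)
```

All three orders compute the same D; they differ in how often the pipeline must change function and in how long the intermediate vector is. The grouped order keeps the intermediate vector no longer than n elements, which is what lets it fit in a fixed-length vector register.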
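Vector pipelined processors
- Instruction set of vector pipelined processors
- Structure of vector pipelined processors
- Examples of super vector pipelined processors
Instruction set of vector pipelined processors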
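Machine | Vector performance (Mflops) | Scalar performance (Mflops) | Vector balance point
Hitachi S820 (continued) | | 17.8 | 0.98
NEC SX2 | 424.2 | 9.5 | 0.98
Fujitsu VP400 | 207.1 | 6.6 | 0.97
The vector balance point is defined as the percentage of vector code in a program at which the vector hardware and the scalar hardware are equally utilized.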
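One way to read this definition, offered as an assumption rather than as the slides' own derivation: if a fraction f of the work runs on vector hardware at rate Pv and the rest on scalar hardware at rate Ps, the two are busy for equal time when f/Pv = (1-f)/Ps, which gives f = Pv/(Pv+Ps). The sketch below applies that formula to the tabulated Mflops figures; the function name is hypothetical.

```python
def balance_point(vector_mflops: float, scalar_mflops: float) -> float:
    """Fraction of vector code at which vector and scalar units are busy
    for equal time: f / Pv == (1 - f) / Ps  =>  f = Pv / (Pv + Ps)."""
    return vector_mflops / (vector_mflops + scalar_mflops)


# (vector Mflops, scalar Mflops, balance point as tabulated in the slides)
machines = {
    "Cray 1S": (85.0, 9.8, 0.90),
    "Cray 2S": (151.5, 11.2, 0.93),
    "Cray X-MP": (143.3, 13.1, 0.92),
    "Cray Y-MP": (201.6, 17.0, 0.92),
    "Hitachi S820": (737.3, 17.8, 0.98),
    "NEC SX2": (424.2, 9.5, 0.98),
    "Fujitsu VP400": (207.1, 6.6, 0.97),
}

computed = {
    name: round(balance_point(pv, ps), 2)
    for name, (pv, ps, _) in machines.items()
}
```

Under this reading the formula reproduces every tabulated balance point to two decimal places, which suggests that the closer a machine's balance point is to 1, the more heavily vectorized a program must be to keep its vector hardware as busy as its scalar hardware.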
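§4 Supercomputers with high instruction-level parallelism
- Superscalar processors
- Superpipelined processors
- Superscalar superpipelined processors
- Very long instruction word processors
Superscalar processors
- Use several instruction pipelines (degree = m)
- Provide several sets of functional units, instruction decoders, and buses; the registers also have multiple ports and multiple buses.
- Well suited to problems with sparse vectors and matrices
- Examples: IBM RS/6000, DEC 21064, Intel i960CA, Tandem Cyclone
Superscalar processors (continued)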
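[Figure residue: vertical axis "unit" with pipeline stages, from top to bottom: write result, execute, decode, instruction fetch]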
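Structure of the superscalar processor MC88110
[Figure: eight execution units (two integer units, a bit-operation unit, a floating-point add unit, a multiply unit, a divide unit, and two graphics units) connected by internal buses to the load/store unit, the general-purpose register file, the extended register file, and the target instruction cache; an instruction dispatch and branch unit; a data cache (8 KB) and an instruction cache (8 KB) connected to the system bus; 32-bit address bus.]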
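Vector operations are well suited to pipelined or parallel processing.
Pipelined vector processing (continued)
The vector processing sequence
- Set VL, VM, and A
- Load the vectors into V
- Perform the operation.
The elements of a vector are processed in pipelined fashion. A parallel (SIMD) processor processes the elements of a vector in parallel instead.
Pipelined vector processing (continued)
Ways of processing vectors, e.g. D = A×(B+C)
Horizontal processing: bi+ci->k, k*ai->di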
the XT4 supercomputer, which was expected to approach 1 Pflops (10^15 floating-point operations per second) in 2009.
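Examples of super vector pipelined processors (continued)
- 1973: CDC introduced its first supercomputer, the STAR-100
- 1964: CDC-6600, with RISC characteristics
- 1982: CYBER 205
- 1999: acquired by Syntegra
- ETA10: 8 CPUs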
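Structure of vector pipelined processors
- The CRAY-1 vector pipelined processor was first delivered in 1976 (Cray Research itself was founded in 1972)
- A distributed heterogeneous multiprocessor system made up of a central processor, a diagnostic and maintenance control processor, a large-capacity disk storage subsystem, and a front-end processor
- 6 pipelined single-function units: integer add, logical operations, shift, floating-point add, floating-point multiply, and floating-point reciprocal approximation by iteration
- The vector registers comprise 512 64-bit registers, divided into 8 groups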
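[Figure: CRAY-1 block diagram. Main memory with read/write (R/W) paths; vector register groups V0 to V7 (8×64); vector functional units (shift, logical, add); floating-point functional units (add, multiply, reciprocal by iteration); scalar registers S0 to S7 with buffer T; address registers A with buffer B; vector mask register VM; vector length register VL; vector control.]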
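Examples of super vector pipelined processors
- 1972: Cray Research founded; it has since built more than 400 supercomputers
- 1979: CRAY-1S, an improved CRAY-1 with 10 pipelines
- 1983: CRAY X-MP, built from 4 CRAY-1 processors
- 1985: CRAY-2S
- 1988: CRAY Y-MP, with 8 processors
- 1991: CRAY Y-MP C-90
- December 1996: Cray Research acquired by SGI for US$750 million
- 2000: merged into Tera, which renamed itself Cray the same year
- Current products: MTA, SV1, SX_6, T3E
- 2002: Cray X1, with a peak speed of 52 Tflops and support for 65.5 TB of memory; the company announced the goal of sustained processing at 1 Pflops before 2010
- Cray stated that in 2008 it would use quad-core AMD Opteron processors to build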