Advanced Computer Architecture 10: Memory Hierarchy
2'. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
• Overlap $ access with VA translation: requires $ index to remain invariant across translation
– Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple direct mapping
– Can overlap tag check with data transmission since no choice
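A minimal C sketch of why direct mapping permits this overlap (line count, block size, and field widths are assumed, not from the slides): there is exactly one candidate line, so the data can be read out while the tag comparison is still in flight, and simply discarded on a mismatch.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES   256        /* assumed: 256 direct-mapped lines */
#define BLOCK_BYTES 32         /* assumed: 32-byte blocks          */

struct line { uint32_t tag; bool valid; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_LINES];

/* Direct-mapped read: the single candidate line is known from the index
 * alone, so hardware can stream the data word out while the tag is
 * still being compared; if the compare fails, the data is dropped.   */
bool dm_read(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr % BLOCK_BYTES;
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_LINES;
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_LINES);

    struct line *l = &cache[index];
    uint8_t speculative = l->data[offset];   /* fetched in parallel      */
    if (l->valid && l->tag == tag) {         /* tag check overlaps fetch */
        *out = speculative;
        return true;                         /* hit: data already in hand */
    }
    return false;                            /* miss: discard the data    */
}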
1. Fast Hit times via Small and Simple Caches
• Why does the Alpha 21164 have an 8KB instruction cache and an 8KB data cache + a 96KB second-level cache?
– Small data cache (faster) and clock rate (on-chip)
• In shade is the Delayed Write Buffer; it must be checked on reads: either complete the write or read from the buffer
4. Fast Writes on Misses Via Small Subblocks
• If most writes are 1 word, the subblock size is 1 word, and the cache is write through, then always write the subblock & tag immediately
2. Fast hits by Avoiding Address Translation
• Send virtual address to cache: called a Virtually Addressed Cache, or just Virtual Cache, vs. Physical Cache
Virtually Addressed Caches
Figure: three cache organizations.
– Conventional organization: CPU sends the VA to the TB (translation buffer); the resulting PA indexes the $ and, on a miss, MEM.
– Virtually addressed cache: CPU sends the VA straight to the $ (VA tags); the TB is consulted only on a miss to reach MEM. Translate only on miss; synonym problem.
– Overlapped access: CPU sends the VA to the $ and the TB in parallel; the TB's PA is compared against the physical tags, backed by an L2 $ and MEM.
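A hedged C sketch of the middle organization in the figure (the helpers tlb_translate and fetch_block_from_memory are illustrative stubs, not a real API): the cache is indexed and tagged with virtual addresses, so translation happens only on a miss.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative stubs: assumed helpers, not from the slides. */
uint32_t tlb_translate(uint32_t va);                   /* VA -> PA        */
void     fetch_block_from_memory(uint32_t pa, uint8_t *dst);

#define LINES 128
struct vline { uint32_t vtag; bool valid; uint8_t data[32]; };
static struct vline vcache[LINES];

/* Virtually addressed cache: index and tag both come from the VA,
 * so a hit never touches the TLB; translation only on a miss.     */
uint8_t vcache_read(uint32_t va)
{
    uint32_t idx  = (va / 32) % LINES;
    uint32_t vtag = va / (32 * LINES);
    struct vline *l = &vcache[idx];

    if (!(l->valid && l->vtag == vtag)) {     /* miss path only:   */
        uint32_t pa = tlb_translate(va);      /* translate now     */
        fetch_block_from_memory(pa, l->data);
        l->vtag = vtag; l->valid = true;
    }
    return l->data[va % 32];
}

The synonym problem noted in the figure is visible in this sketch: two virtual addresses mapping to the same physical address would occupy two different lines, so a write through one alias is not seen through the other.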
• Can be applied recursively to Multilevel Caches
– Danger is that time to DRAM will grow with multiple levels in between
– First attempts at L2 caches can make things worse, since the increased worst case is worse
• Only STORES in the pipeline; empty during a miss
Store r2, (r1)    Check r1
Add               -
Sub               -
Store r4, (r3)    M[r1] <- r2 & check r3

Figure: CPU (in/out) -> write buffer -> DRAM (or lower mem)
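A C sketch of the two-stage discipline, under assumed structures: stage 1 tag-checks the current store while stage 2 retires the previous store from the delayed write buffer into the cache array, and loads must check that buffer.

#include <stdint.h>
#include <stdbool.h>

/* Delayed write buffer: holds the previous store between its tag
 * check (stage 1) and its data write into the array (stage 2).   */
struct dwb { bool full; uint32_t addr; uint32_t data; };
static struct dwb delayed;

void cache_array_write(uint32_t addr, uint32_t data);  /* assumed helper */
bool tag_check(uint32_t addr);                         /* assumed helper */

/* One store per cycle: check the new store's tag while retiring the
 * previously checked store from the delayed write buffer.           */
void pipelined_store(uint32_t addr, uint32_t data)
{
    if (delayed.full)                        /* stage 2: previous store */
        cache_array_write(delayed.addr, delayed.data);
    (void)tag_check(addr);                   /* stage 1: current store  */
    delayed = (struct dwb){ true, addr, data };
}

/* Reads must check the buffer ("in shade" on the slide): either let the
 * buffered write complete first, or forward the data from the buffer.  */
bool load_check_buffer(uint32_t addr, uint32_t *out)
{
    if (delayed.full && delayed.addr == addr) { *out = delayed.data; return true; }
    return false;
}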
• Solution to cache flush
– Add process identifier tag that identifies process as well as address within process: cannot get a hit if wrong process
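A minimal sketch, assuming an 8-bit PID field and a 256-line cache with 32B blocks: the PID is stored alongside each tag, and a hit requires both to match, so another process's blocks miss instead of falsely hitting.

#include <stdint.h>
#include <stdbool.h>

struct ptag_line { uint32_t vtag; uint8_t pid; bool valid; };
static struct ptag_line lines[256];
static uint8_t current_pid;                 /* set by the OS on a switch */

/* Hit now requires tag match AND process-id match: no flush needed
 * on a context switch, since the wrong process simply misses.      */
bool hit(uint32_t va)
{
    uint32_t idx  = (va >> 5) & 0xff;       /* assumed: 32B blocks, 256 lines */
    uint32_t vtag = va >> 13;
    struct ptag_line *l = &lines[idx];
    return l->valid && l->vtag == vtag && l->pid == current_pid;
}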
• Access time estimate for 90 nm using CACTI model 4.0
– Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
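Those ratios matter only through average memory access time, AMAT = hit time + miss rate x miss penalty. A worked example in C with made-up miss rates and miss penalty (only the 1.32x hit-time ratio comes from the slide):

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.  All numbers except the
 * 1.32x CACTI hit-time ratio are illustrative assumptions.           */
int main(void)
{
    double hit_dm = 1.00, hit_2way = 1.32;   /* CACTI median ratio, 90 nm */
    double mr_dm  = 0.021, mr_2way = 0.019;  /* assumed miss rates        */
    double penalty = 50.0;                   /* assumed miss penalty      */

    printf("direct-mapped AMAT: %.2f cycles\n", hit_dm   + mr_dm   * penalty); /* 2.05 */
    printf("2-way         AMAT: %.2f cycles\n", hit_2way + mr_2way * penalty); /* 2.27 */
    return 0;
}

With these assumed numbers the direct-mapped cache wins (2.05 vs. 2.27 cycles) despite its higher miss rate, echoing the warning elsewhere in these notes about concentrating on just one parameter.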
• Direct Mapped, on chip
– Advantage: overlap tag check & data transfer
1. Fast Hit times via Small and Simple Caches
• Index tag memory and then compare takes time
– Tag match and valid bit already set: Writing the block was proper, & nothing lost by setting valid bit on again.
– Tag match and valid bit not set: The tag match means that this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on.
• Trace cache in Pentium 4
1. Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
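A speculative C sketch of that distinction (purely illustrative, not the actual Pentium 4 structure): a trace line is keyed by the starting PC plus the branch outcomes along the dynamic path, and stores already-decoded micro-ops, whereas a conventional I-cache line is keyed by address alone.

#include <stdint.h>
#include <stdbool.h>

/* A trace line caches a *dynamic* path: up to N instructions as they
 * executed, across taken branches, keyed by start PC plus the branch
 * directions along the way.  Field names are assumptions.            */
#define TRACE_LEN 16
struct trace_line {
    uint32_t start_pc;
    uint32_t branch_dirs;     /* bitmap of taken/not-taken decisions    */
    uint32_t uops[TRACE_LEN]; /* decoded micro-ops: no x86 re-decoding  */
    bool     valid;
};

/* Lookup must match both fields: the same start_pc with different
 * branch behaviour is a different trace.                           */
bool trace_hit(const struct trace_line *t, uint32_t pc, uint32_t dirs)
{
    return t->valid && t->start_pc == pc && t->branch_dirs == dirs;
}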
• 3 Cs: Compulsory, Capacity, Conflict Misses
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations (see the blocking sketch below)
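As an illustration of item 7, a standard blocking (tiling) sketch in C; N and B are assumed values, with B chosen so the working tiles fit in the cache.

/* Blocking (tiling) for matrix multiply: operate on BxB tiles so each
 * tile is reused from the cache instead of streaming whole rows and
 * columns through it, cutting capacity misses.                       */
#define N 512
#define B 32

void matmul_blocked(const double A[N][N], const double Bm[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
}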
– Tag mismatch: This is a miss and will modify the data portion of the block. Since it is a write-through cache, no harm was done; memory still has an up-to-date copy of the old value. Only the tag to the address of the write and the valid bits of the other subblocks need be changed.
• Doesn’t work with write back due to last case
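The three cases collapse into one action sequence. A hedged C sketch, assuming 4 one-word subblocks and an external memory_write helper for the write-through path:

#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS 4                      /* assumed: 4 one-word subblocks */
struct sb_line { uint32_t tag; bool valid[SUBBLOCKS]; uint32_t word[SUBBLOCKS]; };

void memory_write(uint32_t addr, uint32_t data);   /* write-through path */

/* Write-through store with 1-word subblocks: never stalls on a miss,
 * and the same actions cover all three slide cases.                  */
void store(struct sb_line *l, uint32_t tag, unsigned sub, uint32_t addr, uint32_t data)
{
    if (l->tag != tag)                   /* case 3: tag mismatch          */
        for (unsigned i = 0; i < SUBBLOCKS; i++)
            l->valid[i] = false;         /* invalidate the old block      */
    l->tag = tag;                        /* (re)write tag immediately     */
    l->word[sub] = data;                 /* all cases: write the subblock */
    l->valid[sub] = true;
    memory_write(addr, data);            /* memory stays up to date       */
}

The invalidation step is what breaks under write back: the displaced block might be dirty, and overwriting its tag without writing it back would lose data.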
5. Fast Hit times via Trace Cache (Pentium 4 only; and last time?)
• Find more instruction-level parallelism? How to avoid translation from x86 to micro-ops?
Lecture 10: Memory Hierarchy: Reducing Hit Time, Main Memory, & Examples
Spring 2010, Super Computing Lab.
Review: Reducing Misses
– Every time a process is switched, logically the cache must be flushed; otherwise get false hits
» Cost is time to flush + “compulsory” misses from empty cache
– Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
Figure: address layout. Bits 31..12 are the Page Address and bits 11..0 the Page Offset; the cache fields Address Tag | Index | Block Offset are aligned so that Index and Block Offset fall entirely within the Page Offset.
• Limits cache to page size: what if we want bigger caches while using the same trick?
– Higher associativity moves barrier to right
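The constraint can be stated numerically: the index + block-offset bits must lie within the page offset, so cache size <= page size x associativity. A small C sketch under an assumed 4 KB page:

#include <stdio.h>

/* For the index bits to be untranslated, index + block-offset bits
 * must fit in the page offset: cache_size <= page_size * assoc.
 * Raising associativity "moves the barrier to the right".         */
int main(void)
{
    unsigned page = 4096;                       /* assumed 4 KB pages */
    unsigned assoc[] = {1, 2, 4, 8};
    for (int i = 0; i < 4; i++)
        printf("%u-way: largest overlap-indexable cache = %u KB\n",
               assoc[i], page * assoc[i] / 1024);  /* 4, 8, 16, 32 KB */
    return 0;
}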
• Remember danger of concentrating on just one parameter when evaluating performance
Reducing Miss Penalty Summary
• Five techniques
– Read priority over write on miss
– Subblock placement
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under Miss)
– Second Level Cache
• Small cache can help hit time since smaller memory takes less time to index
– E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
– hit time: read tag + compare
– I/O must interact with cache, so need virtual address
• Solution to aliases
– HW guarantee: every cache block has a unique physical address
– SW guarantee: lower n bits must have the same address; as long as they cover the index field and the cache is direct mapped, they must be unique; called page coloring
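A C sketch of that software guarantee, with an assumed value of n (14 bits, i.e. a 16 KB direct-mapped cache): the OS only creates mappings whose virtual and physical addresses agree in the low n bits, so aliases always land in the same cache set.

#include <stdint.h>
#include <stdbool.h>

/* Page-coloring check: VA and PA must agree in the low bits that the
 * cache uses for index + block offset (n bits, assumed value below). */
#define INDEX_PLUS_OFFSET_BITS 14        /* assumed: 16 KB direct-mapped */

bool page_color_ok(uint32_t va, uint32_t pa)
{
    uint32_t mask = (1u << INDEX_PLUS_OFFSET_BITS) - 1;
    return ((va ^ pa) & mask) == 0;      /* same "color": same cache set */
}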
• If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag
– Page coloring
3. Fast Hit Times Via Pipelined Writes
• Pipeline Tag Check and Update Cache as separate stages: the current write's tag check overlaps the previous write's cache update