L10… Cache Coherence in Scalable Machines Scalable Cache Coherent Systems


Multi-core Cache Coherence

(4) Invalid: the copy is inconsistent with memory or with the copies in other caches, or the block is not present in this cache.

3. Local commands
(1) P-Read: the local processor reads its own cache copy.
(2) P-Write: the local processor writes its own cache copy.

4. Coherence commands
(1) Read-blk: read a valid copy from another cache.

Read transitions:
(5) If C1 is Invalid and C2 is Dirty, a read of C1 misses and only C2 holds correct data. A Read-blk brings the copy from C2 into C1 and memory is updated at the same time, after which both C1 and C2 become Valid (in the state-transition diagram, P-Read (2) makes C1 Valid and Read-blk (3) makes C2 Valid).

Write transitions:
(5) If C1 is Invalid and C2 is Dirty, memory and C2 disagree. Copy the block from C2 to C1, apply the write in C1, set C1 to Dirty, and issue Read-inv so that every other cached copy becomes Invalid.
(6) If C1 is Invalid and C2 is Reserved, memory and C2 agree. Copy the block from C2 to C1 and apply the write in C1; C1 now differs from memory, so set C1 to Dirty and issue Read-inv to invalidate all other copies. (A code sketch of these two write-miss cases follows.)
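To make the two write-miss cases concrete, here is a minimal C sketch using only the four states named in this excerpt (Invalid, Valid, Reserved, Dirty). It is an illustration, not a full protocol: the Read-blk and Read-inv bus commands appear only as comments.

```c
#include <stdio.h>

/* Cache-line states used in this excerpt (write-once style protocol). */
typedef enum { INVALID, VALID, RESERVED, DIRTY } line_state;

/* Write miss in cache C1 while another cache C2 holds the block.
 * Mirrors write cases (5) and (6) above: fetch the copy from C2
 * (Read-blk), apply the write, mark C1 Dirty, and invalidate every
 * other copy (Read-inv). */
static void write_miss(line_state *c1, line_state *c2)
{
    if (*c1 == INVALID && (*c2 == DIRTY || *c2 == RESERVED)) {
        /* Read-blk: copy block C2 -> C1, then perform the write in C1 */
        *c1 = DIRTY;      /* C1 now holds the only up-to-date copy     */
        *c2 = INVALID;    /* Read-inv: all other copies become invalid */
    }
}

int main(void)
{
    line_state c1 = INVALID, c2 = RESERVED;
    write_miss(&c1, &c2);
    printf("C1=%d C2=%d\n", c1, c2);   /* C1=DIRTY(3) C2=INVALID(0) */
    return 0;
}
```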
5. The problem with write-update
Because every copy must be updated on each write, the overhead of write-update is high.

1.1 The cache coherence problem  1.2 Bus-snooping protocols
1.2.1 The write-once protocol
1.3 Directory-based cache coherence protocols  1.4 Three cache coherence strategies

1.2 Bus-snooping protocols (snoopy protocols)
Coherence between the caches and shared memory is maintained by snooping on the bus. Applicability analysis:

The cache coherence problem and its solutions
Author: Liaoning Technical University

Abstract: Cache coherence means that the data held in the cache must stay synchronized (consistent) with the data in main memory. A multi-core processor integrates more than one computing core into a single processor and raises compute performance through parallel computation across those cores. The chip multiprocessor (CMP) architecture is one of the most closely watched topics in this area. This paper briefly discusses the multi-level cache structure of a CMP; the multi-level structure gives rise to the cache coherence problem, and the choice of coherence protocol has a major impact on the performance of the CMP system. Which cache coherence model to use, and how to design it, is the main subject of this paper.
Key words: CMP; cache coherence; memory; protocol; replacement strategy

1 Introduction
Over the past two decades, processor fabrication technology and processor architecture have advanced rapidly, and computers have become able to complete most of the tasks given to them.

MIPS Chip Architecture Description

MIPS32™ Architecture For Programmers, Volume I: Introduction to the MIPS32™ Architecture. Document Number: MD00082, Revision 2.00, June 8, 2003. MIPS Technologies, Inc., 1225 Charleston Road, Mountain View, CA 94043-1353.

Contents: Chapter 1, About This Book; Chapter 2, The MIPS Architecture: An Introduction; Chapter 3, Application Specific Extensions; Chapter 4, Overview of the CPU Instruction Set; Chapter 5, Overview of the FPU Instruction Set; Appendix A, Instruction Bit Encodings; Appendix B, Revision History.
Chapter 1 About This Book

The MIPS32™ Architecture For Programmers Volume I comes as a multi-volume set.
• Volume I describes conventions used throughout the document set, and provides an introduction to the MIPS32™ Architecture
• Volume II provides detailed descriptions of each instruction in the MIPS32™ instruction set
• Volume III describes the MIPS32™ Privileged Resource Architecture which defines and governs the behavior of the privileged resources included in a MIPS32™ processor implementation
• Volume IV-a describes the MIPS16e™ Application-Specific Extension to the MIPS32™ Architecture
• Volume IV-b describes the MDMX™ Application-Specific Extension to the MIPS64™ Architecture and is not applicable to the MIPS32™ document set
• Volume IV-c describes the MIPS-3D™ Application-Specific Extension to the MIPS64™ Architecture and is not applicable to the MIPS32™ document set
• Volume IV-d describes the SmartMIPS™ Application-Specific Extension to the MIPS32™ Architecture

1.1 Typographical Conventions
This section describes the use of italic, bold and courier fonts in this book.

1.1.1 Italic Text
• is used for emphasis
• is used for bits, fields, and registers that are important from a software perspective (for instance, address bits used by software, and programmable fields and registers), and various floating point instruction formats, such as S, D, and PS
• is used for the memory access types, such as cached and uncached

1.1.2 Bold Text
• represents a term that is being defined
• is used for bits and fields that are important from a hardware perspective (for instance, register bits, which are not programmable but accessible only to hardware)
• is used for ranges of numbers; the range is indicated by an ellipsis. For instance, 5..1 indicates numbers 5 through 1
• is used to emphasize UNPREDICTABLE and UNDEFINED behavior, as defined below

1.1.3 Courier Text
Courier fixed-width font is used for text that is displayed on the screen, and for examples of code and instruction pseudocode.

1.2 UNPREDICTABLE and UNDEFINED
The terms UNPREDICTABLE and UNDEFINED are used throughout this book to describe the behavior of the processor in certain cases. UNDEFINED behavior or operations can occur only as the result of executing instructions in a privileged mode (i.e., in Kernel Mode or Debug Mode, or with the CP0 usable bit set in the Status register). Unprivileged software can never cause UNDEFINED behavior or operations. Conversely, both privileged and unprivileged software can cause UNPREDICTABLE results or operations.

1.2.1 UNPREDICTABLE
UNPREDICTABLE results may vary from processor implementation to implementation, instruction to instruction, or as a function of time on the same implementation or instruction. Software can never depend on results that are UNPREDICTABLE. UNPREDICTABLE operations may cause a result to be generated or not. If a result is generated, it is UNPREDICTABLE. UNPREDICTABLE operations may cause arbitrary exceptions.
UNPREDICTABLE results or operations have several implementation restrictions:
• Implementations of operations generating UNPREDICTABLE results must not depend on any data source (memory or internal state) which is inaccessible in the current processor mode
• UNPREDICTABLE operations must not read, write, or modify the contents of memory or internal state which is inaccessible in the current processor mode. For example, UNPREDICTABLE operations executed in user mode must not access memory or internal state that is only accessible in Kernel Mode or Debug Mode or in another process
• UNPREDICTABLE operations must not halt or hang the processor

1.2.2 UNDEFINED
UNDEFINED operations or behavior may vary from processor implementation to implementation, instruction to instruction, or as a function of time on the same implementation or instruction. UNDEFINED operations or behavior may vary from nothing to creating an environment in which execution can no longer continue. UNDEFINED operations or behavior may cause data loss.
UNDEFINED operations or behavior has one implementation restriction:
• UNDEFINED operations or behavior must not cause the processor to hang (that is, enter a state from which there is no exit other than powering down the processor). The assertion of any of the reset signals must restore the processor to an operational state.

1.3 Special Symbols in Pseudocode Notation
In this book, algorithmic descriptions of an operation are described as pseudocode in a high-level language notation resembling Pascal. Special symbols used in the pseudocode notation are listed in Table 1-1.

Table 1-1 Symbols Used in Instruction Operation Statements
← : Assignment
=, ≠ : Tests for equality and inequality
|| : Bit string concatenation
x^y : A y-bit string formed by y copies of the single-bit value x
b#n : A constant value n in base b. For instance 10#100 represents the decimal value 100, 2#100 represents the binary value 100 (decimal 4), and 16#100 represents the hexadecimal value 100 (decimal 256). If the "b#" prefix is omitted, the default base is 10.
x_y..z : Selection of bits y through z of bit string x. Little-endian bit notation (rightmost bit is 0) is used. If y is less than z, this expression is an empty (zero length) bit string.
+, − : 2's complement or floating point arithmetic: addition, subtraction
∗, × : 2's complement or floating point multiplication (both used for either)
div : 2's complement integer division
mod : 2's complement modulo
/ : Floating point division
<, >, ≤, ≥ : 2's complement less-than, greater-than, less-than-or-equal, greater-than-or-equal comparisons
nor, xor, and, or : Bitwise logical NOR, XOR, AND, OR
GPRLEN : The length in bits (32 or 64) of the CPU general-purpose registers
GPR[x] : CPU general-purpose register x. The content of GPR[0] is always zero.
SGPR[s,x] : In Release 2 of the Architecture, multiple copies of the CPU general-purpose registers may be implemented. SGPR[s,x] refers to GPR set s, register x. GPR[x] is a short-hand notation for SGPR[SRSCtl_CSS, x].
FPR[x] : Floating Point operand register x; also Floating Point (Coprocessor unit 1) general register x
FCC[CC] : Floating Point condition code CC. FCC[0] has the same value as COC[1].
CPR[z,x,s] : Coprocessor unit z, general register x, select s
CP2CPR[x] : Coprocessor unit 2, general register x
CCR[z,x] : Coprocessor unit z, control register x
CP2CCR[x] : Coprocessor unit 2, control register x
COC[z] : Coprocessor unit z condition signal
Xlat[x] : Translation of the MIPS16e GPR number x into the corresponding 32-bit GPR number
BigEndianMem : Endian mode as configured at chip reset (0 → Little-Endian, 1 → Big-Endian). Specifies the endianness of the memory interface (see LoadMemory and StoreMemory pseudocode function descriptions), and the endianness of Kernel and Supervisor mode execution.
BigEndianCPU : The endianness for load and store instructions (0 → Little-Endian, 1 → Big-Endian). In User mode, this endianness may be switched by setting the RE bit in the Status register. Thus, BigEndianCPU may be computed as (BigEndianMem XOR ReverseEndian).
ReverseEndian : Signal to reverse the endianness of load and store instructions. This feature is available in User mode only, and is implemented by setting the RE bit of the Status register. Thus, ReverseEndian may be computed as (SR_RE and User mode).
LLbit : Bit of virtual state used to specify operation for instructions that provide atomic read-modify-write. LLbit is set when a linked load occurs; it is tested and cleared by the conditional store. It is cleared, during other CPU operation, when a store to the location would no longer be atomic. In particular, it is cleared by exception return instructions.
I:, I+n:, I-n: : This occurs as a prefix to Operation description lines and functions as a label. It indicates the instruction time during which the pseudocode appears to "execute." Unless otherwise indicated, all effects of the current instruction appear to occur during the instruction time of the current instruction. No label is equivalent to a time label of I. Sometimes effects of an instruction appear to occur either earlier or later, that is, during the instruction time of another instruction. When this happens, the instruction operation is written in sections labeled with the instruction time, relative to the current instruction I, in which the effect of that pseudocode appears to occur. For example, an instruction may have a result that is not available until after the next instruction. Such an instruction has the portion of the instruction operation description that writes the result register in a section labeled I+1. The effect of pseudocode statements for the current instruction labeled I+1 appears to occur "at the same time" as the effect of pseudocode statements labeled I for the following instruction. Within one pseudocode sequence, the effects of the statements take place in order. However, between sequences of statements for different instructions that occur "at the same time," there is no defined order. Programs must not depend on a particular order of evaluation between such sections.
PC : The Program Counter value. During the instruction time of an instruction, this is the address of the instruction word. The address of the instruction that occurs during the next instruction time is determined by assigning a value to PC during an instruction time. If no value is assigned to PC during an instruction time by any pseudocode statement, it is automatically incremented by either 2 (in the case of a 16-bit MIPS16e instruction) or 4 before the next instruction time. A taken branch assigns the target address to the PC during the instruction time of the instruction in the branch delay slot.
PABITS : The number of physical address bits implemented is represented by the symbol PABITS. As such, if 36 physical address bits were implemented, the size of the physical address space would be 2^PABITS = 2^36 bytes.
FP32RegistersMode : Indicates whether the FPU has 32-bit or 64-bit floating point registers (FPRs). In MIPS32, the FPU has 32 32-bit FPRs in which 64-bit data types are stored in even-odd pairs of FPRs. In MIPS64, the FPU has 32 64-bit FPRs in which 64-bit data types are stored in any FPR. In MIPS32 implementations, FP32RegistersMode is always 0. MIPS64 implementations have a compatibility mode in which the processor references the FPRs as if it were a MIPS32 implementation; in such a case FP32RegistersMode is computed from the FR bit in the Status register. If this bit is 0, the processor operates as if it had 32 32-bit FPRs; if this bit is 1, the processor operates with 32 64-bit FPRs.
InstructionInBranchDelaySlot : Indicates whether the instruction at the Program Counter address was executed in the delay slot of a branch or jump. This condition reflects the dynamic state of the instruction, not the static state. That is, the value is false if a branch or jump occurs to an instruction whose PC immediately follows a branch or jump, but which is not executed in the delay slot of a branch or jump.
SignalException(exception, argument) : Causes an exception to be signaled, using the exception parameter as the type of exception and the argument parameter as an exception-specific argument. Control does not return from this pseudocode function; the exception is signaled at the point of the call.

1.4 For More Information
Various MIPS RISC processor manuals and additional information about MIPS products can be found at the MIPS URL:

L1 cache policy
The L1 cache policy is a caching scheme used in computer systems to speed up data access. This article introduces the principle and role of the L1 cache from a practical point of view.

The L1 cache is the first-level cache inside the CPU; its job is to hold the data and instructions the CPU accesses most frequently. Compared with main memory or other slower storage, the L1 cache can be read and written much faster, greatly reducing the time the CPU spends waiting for data.

The L1 cache organizes data in units called cache lines. A cache line is typically 64 bytes and covers a run of consecutive memory addresses. When the CPU needs to read data, it first looks in the L1 cache. If the data is already present in a cache line, the CPU reads it directly from the cache without touching memory, which greatly reduces access latency and improves overall system performance.

However, the L1 cache has limited capacity, typically between a few tens and a few hundreds of kilobytes. When the data the CPU needs is not in the L1 cache, a cache miss occurs: the CPU must fetch the data from the slower main memory and install it in the L1 cache for later accesses. Cache misses add latency and reduce system performance.

To keep the number of misses low, the L1 cache uses a cache replacement algorithm. Common replacement algorithms include least recently used (LRU) and random replacement. These algorithms choose which cache line to evict so that the most frequently used data stays in the L1 cache, improving overall performance. A sketch of LRU victim selection follows.
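As a concrete illustration of LRU victim choice, here is a minimal C sketch for one set of an 8-way set-associative cache. The associativity and the age-counter scheme are assumptions for illustration, not a description of any particular CPU.

```c
#include <stdint.h>

#define WAYS 8   /* assumed associativity, for illustration only */

/* One cache set: per-way tags plus an age counter used for LRU. */
struct set {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    uint64_t last_use[WAYS];   /* larger value = used more recently */
};

/* Pick the way to evict: an invalid way if one exists, otherwise the
 * least recently used way. */
static int choose_victim(const struct set *s)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w])
            return w;                           /* free slot: no eviction */
        if (s->last_use[w] < s->last_use[victim])
            victim = w;                         /* older than current pick */
    }
    return victim;
}
```

On every hit the cache would set last_use[w] of the touched way to a monotonically increasing access counter, so the smallest value always marks the least recently used way.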

To summarize, the L1 cache policy improves data-access efficiency by keeping frequently accessed data in the CPU's first-level cache. It uses cache lines and a replacement algorithm to organize and manage that data and to reduce the number of cache misses. Used well, the L1 cache substantially improves system performance and speeds up data access.

AppH

Contents: H.1 Introduction (H-2); H.2 Interprocessor Communication: The Critical Performance Issue (H-3); H.3 Characteristics of Scientific Applications (H-6); H.4 Synchronization: Scaling Up (H-12); H.5 Performance of Scientific Applications on Shared-Memory Multiprocessors (H-21); H.6 Performance Measurement of Parallel Processors with Scientific Applications (H-33); H.7 Implementing Cache Coherence (H-34); H.8 The Custom Cluster Approach: Blue Gene/L (H-41); H.9 Concluding Remarks (H-44)

…cessor node. By using a custom node design, Blue Gene achieves a significant reduction in the cost, physical size, and power consumption of a node. Blue Gene/L, a 64K-node version, is the world's fastest computer in 2006, as measured by the linear algebra benchmark, Linpack.

Professor Zhang Xiaobin of Tsinghua University on atomic operations and contention

Original title: Atomic operations and contention. Translated by Zhang Xiaobin (Professor, Tsinghua University). This is a translation of the post "Atomic operations and contention" by Fabian "ryg" Giesen, a programmer at RAD Game Tools, published on his blog and shared on InfoQ China with the author's permission.

Last time (in the cache coherency primer) we covered the basics of how cache coherence works. Today we look at the primitives needed to build something useful on top of coherent caches, and at how they work.

Atomicity and atomic operations. The most important building blocks here are atomic operations. "Atomic" has nothing to do with physics; it comes from the word atom, the Greek "ἄτομος", meaning indivisible. An atomic operation is one that cannot be subdivided, or at least appears indivisible to every other processor in the system.

To see why atomic operations matter, consider two processors incrementing the same counter in almost the same way, i.e. counter++ in C. Here is what can happen:

Cycle   Processor 1              Processor 2
0       reg = load(&counter);
1       reg = reg + 1;           reg = load(&counter);
2       store(&counter, reg);    reg = reg + 1;
3                                store(&counter, reg);

In compiled code this operation splits into a load, a register increment, and finally a store (shown above in C-like pseudocode). The three steps execute independently and in order. (For x86 this is true at the microarchitectural level; at the instruction-set level the three steps can look like a single read-modify-write instruction: add [memory], value.) And because the steps take several cycles, Processor 2 can read the counter in the window after Processor 1 has read it (and started incrementing it) but before Processor 1 has written the result back.
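For comparison, the same increment can be made atomic with C11 atomics. The small program below (an illustration added here, not part of the original article) runs two threads that each add 1,000,000 to a shared counter; with atomic_fetch_add the result is always 2,000,000, whereas a plain int counter would usually lose updates exactly as described above.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* atomic_fetch_add makes the whole read-modify-write one indivisible
 * operation (on x86 it compiles to a LOCK-prefixed add). */
static atomic_int counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&counter, 1);         /* atomic counter++ */
    return NULL;
}

int main(void)                                  /* build with -pthread */
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%d\n", atomic_load(&counter));      /* always 2000000 */
    return 0;
}
```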

The MESI cache coherence protocol and memory barriers

1. A brief introduction to CPU caches. CPU caches were introduced mainly to bridge the gap between ever-faster CPU execution and comparatively slow main-memory access. A CPU has only a limited number of registers, and memory-addressing instructions frequently have to read operands from memory or write register contents back to memory. Because memory access is very slow relative to the CPU itself, the CPU can only wait during such accesses, and machine efficiency suffers. Designers therefore placed a cache between the CPU and memory.

Registers are tiny but extremely fast to access; main memory is large but, by comparison, very slow. The cache sits between the two in both capacity and speed, acting as a buffer that bridges the large gap in access speed between the registers and main memory.

With a cache in place, whenever the CPU accesses an address in main memory, the cache intercepts the access and checks whether the requested data is already cached. On a hit, the cached data is handed to the CPU directly; on a miss, a normal memory access is performed, the data is given to the CPU, and it is also installed in the cache. Because the cache is far smaller than memory, when the cache is full and new data must be brought in, some algorithm has to pick a cache entry to evict and replace. Because memory accesses exhibit locality, the cache greatly improves the efficiency with which the CPU accesses storage.

2. The cache coherence problem: coherence between the cache and memory. A cache hit means that memory and the cache hold two copies of the same data. If a store instruction hits in the cache, only the cached copy is modified, and the cache and memory now disagree. In early single-core systems this seemed like a minor problem, since all memory operations came from the one CPU. But even on a single core, advanced designs introduced DMA to offload I/O from the CPU and improve I/O efficiency. A DMA engine accesses memory directly; if the CPU has modified the cache so that it no longer matches memory, the data DMA actually writes back to disk will not be the data the program intended to write.

Computer Architecture Homework (High Performance)

Lecture 1: Computer System Fundamentals
1. The same program P is run on three computers with different instruction sets. Machine A executes 1.0×10^8 instructions, machine B executes 2.0×10^8 instructions, and machine C executes 4.0×10^8 instructions, but the actual execution time is 10 seconds in every case. Compute the actual execution rate of each machine on program P, in MIPS. Which machine has the highest performance when running P, and why?
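A worked sketch for question 1 (added here, not from the original answer key): MIPS = instruction count / (execution time × 10^6), so A runs at 1.0×10^8 / (10 × 10^6) = 10 MIPS, B at 20 MIPS, and C at 40 MIPS. Since all three machines finish the same program in the same 10 seconds, their performance on P is identical; the differing MIPS figures only reflect the different instruction counts of the three instruction sets, which is why MIPS is a poor metric for comparing machines with different ISAs.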
13. Analyze what determines the setup time, hold time, and CLK-to-Q delay of a CMOS edge-triggered D flip-flop (EDFF). For the circuit shown in the figure, assume an inverter delay of 1 ns, a transmission-gate source-to-drain (or drain-to-source) delay of 0.5 ns, and a gate-to-drain (or gate-to-source) delay of 0.75 ns, and ignore the effect of latch fight on inverter delay. Derive, conceptually, the setup time and hold time of this circuit, and show your analysis.

7. Discuss the main characteristics of the von Neumann architecture. a) Look up and report the peak performance and memory bandwidth of one commercial processor each from Intel, AMD, and IBM. b) Analyze the relationship between the memory bandwidth of these three processors and their memory-hierarchy parameters (L1 cache size and latency, L2 cache size and latency, and so on).

8. On a personal computer (for example one with a Pentium 4, Core, or Opteron CPU): a) Look up the machine's peak floating-point performance.

2. Suppose a vector unit is added to a scalar processor, and that vector mode runs 8 times faster than scalar mode; call the fraction of time spent in vector mode the vectorization percentage. a) Plot speedup against vectorization percentage, with the vectorization percentage on the X axis and speedup on the Y axis. b) At what vectorization percentage does the speedup reach 2? When the speedup is 2, what fraction of the run time is spent in vector mode? At what vectorization percentage does the speedup reach half of the maximum speedup? c) Suppose the program is 70% vectorizable. To improve performance further, one option is to spend more on hardware and double the speed of the vector unit; another is to improve the compiler so that a larger fraction of the program runs in vector mode. By how much would the vectorization percentage have to increase to match the gain from doubling the vector unit's speed? Which design would you recommend?
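A worked sketch for part (b) of question 2 (again added here, not the original answer key), using Amdahl's law with the stated 8× vector speedup: overall speedup S(f) = 1 / ((1 − f) + f/8), where f is the vectorization percentage. Setting S(f) = 2 gives (1 − f) + f/8 = 0.5, so f = 4/7 ≈ 57.1%. At that point vector mode accounts for (f/8) / 0.5 = 1/7 ≈ 14.3% of the run time. The maximum speedup is 8 (at f = 1), and reaching half of it, S(f) = 4, requires (1 − f) + f/8 = 0.25, i.e. f = 6/7 ≈ 85.7%.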

The cache coherency problem (Cache Coherency)

Introduction. More and more systems use caches. As soon as a cache is used, the same data may be written to and stored in both the cache and the database, and any such double write raises a data consistency problem; only if there were a single copy of the data would the problem disappear.

What a cache is for: 1. Temporary storage that speeds up data lookups. For example, the CPU's L1 and L2 caches provide a fast staging area between the CPU and main memory: the CPU looks in the cache first and reads directly on a hit, and only goes to main memory on a miss. 2. Lower response times and higher concurrency.

Where the consistency problem comes from: mainly from the fact that the two writes do not happen at the same instant. For example, during MySQL master-slave replication there is a window in which the master and the slave disagree while data is being copied from the master to the slave.

How to keep the cache and the database consistent: the cache-aside pattern. There are two cases, reads and writes (updates). 1. Reads: read the cache first; on a miss, read the database, put the result into the cache, and then return it to the client. 2. Updates: update the database first, then delete the cached entry (some argue for deleting the cache first and then updating the database).

Why delete the cache rather than update it when the database is updated? The main reason is lazy loading. Some cached values are used in fairly complex scenarios: they are not simply rows copied out of the database. For example, a cached value may be computed from the updated field together with fields from several other tables, so recomputing the cache on every update of that field is expensive. Whether to refresh the cache eagerly therefore depends on the situation: if a field is modified 60 times in a minute, eagerly maintaining the dependent cache entry means recomputing it 60 times, even though it may be read only once in that minute; computing it lazily, when it is actually read, reduces that to a single computation and cuts the overhead sharply. A sketch of the read and write paths follows.
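The following C sketch shows the cache-aside read and write paths just described. The helpers cache_get, cache_put, cache_delete, db_query and db_update are hypothetical placeholders standing in for a real cache (e.g. Redis) and database client, so this illustrates the control flow only, under those assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache/database handles and helpers (declarations only). */
typedef struct cache cache_t;
typedef struct db db_t;
bool cache_get(cache_t *c, const char *key, char *out, size_t n);
void cache_put(cache_t *c, const char *key, const char *val);
void cache_delete(cache_t *c, const char *key);
bool db_query(db_t *d, const char *key, char *out, size_t n);
void db_update(db_t *d, const char *key, const char *val);

/* Read path: try the cache first; on a miss read the database and
 * lazily populate the cache before returning. */
bool read_value(cache_t *c, db_t *d, const char *key, char *out, size_t n)
{
    if (cache_get(c, key, out, n))
        return true;                 /* cache hit                      */
    if (!db_query(d, key, out, n))
        return false;                /* not in the database either     */
    cache_put(c, key, out);          /* lazy-load into the cache       */
    return true;
}

/* Write path: update the database first, then delete (not update) the
 * cached entry, as argued above; the next read recomputes it lazily. */
void write_value(cache_t *c, db_t *d, const char *key, const char *val)
{
    db_update(d, key, val);
    cache_delete(c, key);
}
```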

Cache memory coherence

In the MESI protocol each cache line is in one of four states, encoded in two bits:

M (Modified): the line is valid, the data has been modified and differs from memory, and the data exists only in this cache.
E (Exclusive): the line is valid, the data matches memory, and the data exists only in this cache.
S (Shared): the line is valid, the data matches memory, and copies exist in many caches.
I (Invalid): the line holds no valid data.

Under this protocol every cache controller snoops the system bus continuously, but only three kinds of events are acted on: read misses, write misses, and write hits to a shared line. Any valid line that observes a read-snoop hit moves to the S state and asserts a snoop-hit indication, but an M-state line must first write itself back to main memory; any valid line that observes a write-snoop hit moves to the I state, and an M-state line that receives an RWITM (read with intent to modify) must likewise write back first. The snooping logic is therefore not complicated and adds little bus traffic, yet MESI effectively guarantees that dirty copies of a memory block stay consistent across multiple caches and are written back in time, keeping cache and main-memory accesses correct.

There is a further issue with multiple private caches: if data is passed by having one core write a location and another core read it, then the coherence described above matters a great deal. Eventually a value written to a location becomes visible to all readers, but coherence alone does not say when it becomes visible. When writing a parallel program we usually want to establish an order between writes and reads; that is, we need an ordering model against which programmers can reason about the results and correctness of their programs. That model is memory consistency.

A complete consistency model therefore has two complementary parts: cache coherence, which defines the behavior of reads and writes to the same memory location, and the memory consistency model, which defines the behavior of reads and writes across all locations. In a shared address space, multiple processes perform concurrent reads and writes to different locations, and each process sees some order in which those operations complete.
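The following compact C sketch models the state changes just described, under the assumption of the four MESI states only; write_back_block() and broadcast_invalidate() are hypothetical hooks shown as comments, not real APIs.

```c
#include <stdio.h>

/* MESI state of one cache line, as listed above. */
typedef enum { M, E, S, I } mesi_t;

/* How this cache's copy reacts to a transaction snooped on the bus:
 * on a remote read miss a dirty (M) line is written back and drops to S;
 * on a remote write miss / RWITM it is written back and invalidated. */
static mesi_t snoop(mesi_t cur, int remote_write)
{
    if (cur == I)
        return I;                       /* nothing cached, nothing to do */
    if (cur == M) {
        /* write_back_block(); */       /* flush dirty data first        */
    }
    return remote_write ? I : S;        /* invalidate or share           */
}

/* Local write hit: S must broadcast an invalidate before writing,
 * E can be written silently; either way the line ends up M. */
static mesi_t local_write_hit(mesi_t cur)
{
    if (cur == S) {
        /* broadcast_invalidate(); */
    }
    return M;
}

int main(void)
{
    mesi_t line = M;
    line = snoop(line, 0);              /* remote read: M -> S (after write-back) */
    line = local_write_hit(line);       /* local write hit: S -> M                */
    printf("final state: %d\n", line);  /* prints 0, i.e. M                       */
    return 0;
}
```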

How the L1 cache reads and writes

In a computer system the CPU fetches instructions and data from memory. Memory access is slow, so to improve efficiency modern systems add a hierarchy of caches, of which the L1 cache is the level closest to the CPU. The L1 cache is a small, fast memory whose main job is to hold the instructions and data the CPU accesses most frequently; because it sits so close to the CPU, it can answer read and write requests in a very short time, which greatly speeds up the system.

L1 cache operation has two sides: reads and writes. Consider reads first. When the CPU needs an instruction or a datum, it first checks whether the L1 cache holds it. If it does, the CPU takes it directly from the L1 cache; if not, the CPU requests the cache line that contains it. A cache line is the smallest unit of storage in the L1 cache, typically 64 bytes. When the requested data lies in some cache line, the whole line is loaded into the L1 cache at once, so that subsequent nearby accesses are fast.

A read in the L1 cache is located by indexing with the address, a process called cache mapping. Common mappings are direct mapping, fully associative mapping, and set-associative mapping: direct mapping sends each memory address to exactly one cache line, fully associative mapping allows any address to be placed in any line, and set-associative mapping sits between the two. The purpose of the mapping is to reduce conflicts and raise the L1 hit rate.
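To make the mapping concrete, here is a small C sketch that splits an address into offset, set index and tag for a set-associative L1. The geometry (32 KB, 64-byte lines, 8 ways, hence 64 sets) is an assumption chosen for illustration, not a statement about any specific processor.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64   /* assumed cache-line size                     */
#define NUM_SETS  64   /* assumed: 32 KB / 64 B per line / 8 ways      */

/* Split a physical address into offset-within-line, set index and tag,
 * which is how a set-associative L1 locates a line. */
static void split(uint64_t addr, uint64_t *tag, uint64_t *set, uint64_t *off)
{
    *off = addr % LINE_SIZE;               /* low 6 bits               */
    *set = (addr / LINE_SIZE) % NUM_SETS;  /* next 6 bits              */
    *tag = addr / (LINE_SIZE * NUM_SETS);  /* remaining high bits      */
}

int main(void)
{
    uint64_t tag, set, off;
    split(0x12345678, &tag, &set, &off);
    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)off);
    return 0;
}
```

With direct mapping the "ways" collapse to one line per set; with full associativity there is a single set and the index bits disappear.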

Now consider writes. When the CPU needs to store data, it first writes it into the L1 cache; the L1 cache then propagates it to memory according to its write policy. The common policies are write-through and write-back. With write-through, every write to the L1 cache is immediately written to memory as well; with write-back, data is written to memory only when the line is evicted or an explicit write-back is triggered. Write-back reduces the number of memory writes and so improves write efficiency.
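The difference between the two policies can be sketched in C as follows. The structure, the dirty bit, and the memory_write() helper are illustrative assumptions, not a real cache implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the path to main memory; a real cache would issue a
 * bus write here.  Hypothetical helper for illustration only. */
static void memory_write(uint64_t addr, const uint8_t *data, int n)
{
    (void)data;
    printf("memory write: %d byte(s) at %#llx\n", n, (unsigned long long)addr);
}

/* One cache line with a dirty bit. */
struct line { uint64_t tag; bool valid, dirty; uint8_t data[64]; };

/* Write-through: every store is propagated to memory immediately, so
 * memory never goes stale but every store pays the memory cost. */
static void store_write_through(struct line *l, uint64_t addr, uint8_t v, int off)
{
    l->data[off] = v;
    memory_write(addr, &l->data[off], 1);
}

/* Write-back: only the cached copy is updated and the line is marked
 * dirty; memory is updated later, when the line is evicted. */
static void store_write_back(struct line *l, uint8_t v, int off)
{
    l->data[off] = v;
    l->dirty = true;
}

static void evict(struct line *l, uint64_t line_addr)
{
    if (l->valid && l->dirty)
        memory_write(line_addr, l->data, 64);   /* flush once, on eviction */
    l->valid = l->dirty = false;
}

int main(void)
{
    struct line l = { .tag = 0, .valid = true, .dirty = false };
    store_write_through(&l, 0x1000, 1, 0);   /* memory written immediately */
    store_write_back(&l, 2, 1);              /* memory write deferred      */
    store_write_back(&l, 3, 2);              /* still deferred             */
    evict(&l, 0x1000);                       /* one 64-byte write-back     */
    return 0;
}
```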

Cache notes: Memory Coherence and Memory Consistency

At this point we should formally discuss memory coherence and memory consistency, terms that speak to the ordering behavior of operations on the memory system. We illustrate their definitions within the scope of the race condition example and then discuss them from the perspective of their use in modern multiprocessor systems design.

Memory Coherence: The principle of memory coherence indicates that the memory system behaves rationally. For instance, a value written does not disappear (fail to be read at a later point) unless that value is explicitly overwritten. Write data cannot be buffered indefinitely; any write data must eventually become visible to subsequent reads unless overwritten. Finally, the system must pick an order: if it is decided that write X comes before write Y, then at a later point the system may not act as if Y came before X.

Memory Consistency: Whereas coherence defines rational behavior, the consistency model indicates how long and in what ways the system is allowed to behave irrationally with respect to a given set of references. A memory-consistency model indicates how the memory system interleaves read and write accesses, whether they are to the same memory location or not. If two references refer to the same address, their ordering is obviously important. On the other hand, if two references refer to two different addresses, there can be no data dependencies between the two, and it should, in theory, be possible to reorder them as one sees fit. This is how modern DRAM memory controllers operate, for instance. However, as the example in Figure 4.6 shows, though a data dependence may not exist between two variables, a causal dependence may nonetheless exist, and thus it may or may not be safe to reorder the accesses. Therefore, it makes sense to define allowable orderings of references to different addresses. In the race condition example, depending on one's consistency model, the simple code fragment…
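The causal dependence mentioned above is usually illustrated with the message-passing pattern: one thread writes data and then sets a flag, the other waits for the flag and then reads the data. There is no data dependence between the two variables, yet reordering is unsafe without explicit ordering. The C11 sketch below (an added illustration, not the book's Figure 4.6) makes the pattern safe with release/acquire ordering on the flag.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;
static atomic_int flag;

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                                              /* plain write */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* publish     */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* spin        */
    printf("%d\n", data);                                   /* prints 42   */
    return NULL;
}

int main(void)                                              /* build with -pthread */
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

The release store and acquire load establish exactly the "allowable ordering" a consistency model has to specify: the write to data is guaranteed to be visible before the flag is seen as set.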

A survey of multi-core cache affinity

Overview. Exploiting affinity reduces the performance lost when a process migrates, improves cache hit rates, and lets a serial program make use of all of the on-chip cache. Exploiting it requires support from the operating-system scheduler and some hardware support. Studies show that cache affinity brings little improvement on single-core multiprocessors but a large improvement on multi-core multiprocessors. This article covers the definition of affinity, its effect on performance, and how to exploit it with operating-system and hardware support.

Introduction. The chip multiprocessor (CMP) has become one of the main forms of high-performance multiprocessor, and one of the key factors affecting its performance is cache utilization. Traditionally each core has its own private L1 cache, while all cores on the chip share a larger L2 cache. To improve cache utilization we must consider data reuse in the caches, contention for the shared cache among all the cores, and the coherence miss rate among the private caches.

Definition of affinity: affinity is the tendency of a process to run on a given CPU or core for as long as possible without being migrated to another processor. In Linux, the kernel scheduler naturally exhibits soft affinity, meaning that processes normally do not migrate frequently between processors or cores. This is what we want, because infrequent migration means low overhead and better performance.

On a symmetric multiprocessor (SMP), the operating system's scheduler must decide which processes run on each CPU. This poses two challenges. The scheduler must make full use of all processors, avoiding the situation where a runnable process waits while a core sits idle, which clearly hurts efficiency. At the same time, once a process has run on a particular core, the scheduler should keep scheduling it on that same core, because migrating a process from one processor to another has a performance cost. In general a process stays on the same core or CPU and is moved only when the load becomes badly unbalanced; this minimizes the cache-migration penalty while keeping the processors load-balanced.

How affinity affects program performance: on a multi-core processor, the cache affinity between a process and a processor is assessed by observing how much of the process's state, that is, its data and instructions, has accumulated in that processor's cache.
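On Linux, the soft affinity described above can be turned into hard affinity with the sched_setaffinity system call. The example below pins the calling process to CPU 2; the choice of CPU 2 is an arbitrary assumption (the machine must have at least three CPUs), and the call is Linux-specific.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                          /* allow only CPU 2 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("running on CPU %d\n", sched_getcpu());
    return 0;
}
```

Pinning like this keeps the process's working set in one core's private caches, which is exactly the cache-affinity benefit the article discusses, at the cost of giving up the scheduler's freedom to balance load.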

A comparison of the ARM Cortex processor families

Cortex-M series
M0: the Cortex-M0 is currently the smallest ARM processor. Its silicon area is very small, its energy consumption extremely low, and the code footprint needed to program it is small, letting developers skip 16-bit parts and obtain 32-bit performance at close to 8-bit cost. Its very low gate count also suits it to emulation and mixed-signal devices.
M0+: based on the Cortex-M0, it keeps full instruction-set and data compatibility while further reducing energy consumption and raising performance. It has a 2-stage pipeline and reaches 1.08 DMIPS/MHz.
M1: the first ARM processor designed specifically for implementation in FPGAs. The Cortex-M1 targets all major FPGA devices and includes support for the leading FPGA synthesis tools, letting designers choose the best implementation for each project.
M3: aimed at highly deterministic real-time applications. It was developed so that partners can build high-performance, low-cost platforms for a wide range of devices, including microcontrollers, automotive body systems, industrial control systems, and wireless networking and sensor products. It offers excellent compute performance and outstanding response to events while meeting real-world demands for low dynamic and static power.
M4: the latest embedded processor developed by ARM for the digital signal control market, which needs an efficient, easy-to-use blend of control and signal-processing capability.
M7: the highest-performing member of the Cortex-M family. It has a six-stage superscalar pipeline, flexible system and memory interfaces (including AXI and AHB), caches, and tightly coupled memory (TCM), giving MCUs excellent integer, floating-point, and DSP performance. Key parameters:
• Interconnect: 64-bit AMBA4 AXI, AHB peripheral port (64 MB to 512 MB)
• Instruction cache: 0 to 64 KB, two-way set associative, with optional ECC
• Data cache: 0 to 64 KB, four-way set associative, with optional ECC
• Instruction TCM: 0 to 16 MB, with optional ECC
• Data TCM: 0 to 16 MB, with optional ECC

Cortex-A series: the ARM Cortex-A series are application processors for complex operating systems and user applications. Cortex-A processors support the ARM, Thumb, and Thumb-2 instruction sets.

Cache Coherence: a literature survey

Background. How to maintain cache coherence has always been a key question in the design of shared-memory architectures; compared with keeping the caches coherent, merely transferring the data is simple. A cache coherence protocol is responsible for keeping every processor's view of the data consistent. Coherence is usually enforced over the cache bus or the interconnect. A cache miss is satisfied from memory unless some agent in the system (a processor or an I/O controller) holds a modified copy. To perform a write, a processor must change the block's state, usually to an exclusive state, and every other agent on the bus must invalidate its copy; the current owner of the block then becomes the source of the data. Consequently, when another agent requests the block, the owner, not memory, must supply it; the latest data is written back to memory only when the owner has to make room for other data. Protocols differ in the details, and the above is only the most basic scheme. Protocols may be hardware-based or software-based, and may use write-invalidate or write-update.

The two main classes of coherence scheme used in practice are as follows.

Snooping (broadcast) protocols: every address is sent to every agent in the system. Each agent checks (snoops) its local cache, and after a few clock cycles the system determines the global snoop result. Broadcast protocols give the lowest possible latency, especially when cache-to-cache transfer is the dominant form of transfer. The bandwidth a snooping protocol can deliver is limited, roughly to: bandwidth = cache-bus bandwidth × bus clock cycles / clock cycles needed per snoop. This is discussed in more detail below.

Directory (point-to-point) protocols: each address is sent only to those agents that are interested in the cached data. The sharing state of each block of physical memory is kept in a single place, called the directory. Directory coherence costs more, especially in latency, because the protocol itself is more complex, but the aggregate bandwidth can be much higher than that of a snooping protocol, so directories are used in larger systems and above all in distributed systems. This, too, is discussed in detail below.

The architectures involved in cache coherence fall into several classes. The first is the centralized-memory architecture, also called the symmetric (shared-memory) multiprocessor (SMP), and also known as uniform memory access (UMA), because every processor sees the same latency to memory.

Basic concepts of cache coherence

DMA reads and writes by PCI devices to cacheable memory are a fairly involved process, and cache coherence is a topic that could fill a book of its own. Different processor systems also differ considerably in the structure of their cache hierarchies and in how the caches are accessed, and this is one of the central concerns of modern processor design. This section introduces only the most basic concepts of the cache system that are relevant to DMA by PCI devices.

Most processor systems describe the implementation of cache coherence with the following concepts.

1. The cache coherence protocol. Most SMP systems use the MESI protocol, also known as the Illinois protocol, to keep the caches of multiple processors coherent; it is very widely used in SMP systems. MESI describes each cache line with four state bits.

• The M (Modified) bit. M = 1 means the data in this cache line differs from the data in memory, is valid only in this CPU's cache, has no copy in any other CPU's cache, and is the most recent copy of the data in the system. When the CPU replaces such a line it must generate a write cycle on the system bus to bring memory back in sync with the line.

• The E (Exclusive) bit. E = 1 means the data in this cache line is valid, exists only in this CPU's cache with no copies elsewhere, is the most recent copy in the system, and matches the data in memory.

• The S (Shared) bit. S = 1 means the data in this cache line is valid and that at least one other CPU also holds a copy; the line holds the most recent copy of the data in the system and matches memory.

• The I (Invalid) bit. I = 1 means the cache line holds no valid data or the line is not enabled. When MESI has to replace a cache line, lines with I = 1 are chosen first.

A two-level, dual-granularity management strategy for the code cache in binary translation

Authors: Yang Hao; Wu Chenggang; Feng Xiaobing (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080)
Journal: Application Research of Computers, 2007, 24(6), pp. 302-305
Abstract: Proposes an LRC (Level-Region-Chunk) strategy for managing the code cache in binary translation. It combines the advantages of the full-flush policy, the FIFO policy, and multi-level caches, and it takes the program's temporal and spatial locality, execution behavior, and replacement overhead into account, achieving good performance and efficient management of the code cache.
Language: Chinese. Classification: TP319.


1
L10… Cache Coherence in Scalable Machines

2
Scalable Cache Coherent Systems
• Scalable, distributed memory plus coherent replication • Scalable distributed memory machines
• cache miss satisfied transparently from local or remote memory
• Natural tendency of cache is to replicate
• but coherence? • no broadcast medium to snoop on
• Not only hardware latency/bw, but also protocol must scale

3
What Must a Coherent System Do?
• Provide set of states, state transition diagram, and actions • Manage coherence protocol
• (0) is done the same way on all systems
• state of the line is maintained in the cache • protocol is invoked if an "access fault" occurs on the line
• Different approaches distinguished by (a) to (c)

4
Bus-based Coherence
• All of (a), (b), (c) done through broadcast on bus
• faulting processor sends out a "search" • others respond to the search probe and take necessary action
• Could do it in scalable network too
• Scalable coherence:
• can have same cache states and state transition diagram • different mechanisms to manage protocol

5
Approach #1: Hierarchical Snooping
• Extend snooping approach: hierarchy of broadcast media
• tree of buses or rings (KSR-1) • processors are in the bus- or ring-based multiprocessors at the leaves • parents and children connected by two-way snoopy interfaces – snoop both buses and propagate relevant transactions • main memory may be centralized at root or distributed among leaves
• Handled similarly to bus, but not full broadcast
• faulting processor sends out "search" bus transaction on its bus • propagates up and down hierarchy based on snoop results
• Problems:
• high latency: multiple levels, and snoop/lookup at every level • bandwidth bottleneck at root
• Not popular today

6
Scalable Approach #2: Directories
[Figure: (a) Read miss to a block in dirty state: the requestor sends a read request to the directory node for the block (1), the directory replies with the owner's identity (2), the requestor sends a read request to the owner (3), and the node with the dirty copy returns a data reply to the requestor (4a) and a revision message to the directory (4b). (b) Write miss to a block with two sharers: the requestor sends an RdEx request to the directory node (1), the directory replies with the sharers' identity (2), invalidation requests go to the two sharers (3a, 3b), and invalidation acknowledgments come back (4a, 4b). Nodes are drawn as processor (P), cache (C), communication assist (A), and memory/directory (M/D).]
• Many alternatives for organizing directory information

7
A Popular Middle Ground
• Two-level "hierarchy" • Individual nodes are multiprocessors, connected nonhierarchically
• e.g. mesh of SMPs
• Coherence across nodes is directory-based
• directory keeps track of nodes, not individual processors
• Coherence within nodes is snooping or directory
• orthogonal, but needs a good interface of functionality
• Examples:
• Convex Exemplar: directory-directory • Sequent, Data General, HAL: directory-snoopy

8
Example Two-level Hierarchies
[Figure: four two-level organizations built from processor-cache (P, C) nodes with a communication assist (A) and memory/directory (M/D), connected locally by a bus or ring (B1, Network1) carrying main memory and a snooping adapter, directory adapter or dir/snoopy adapter, and joined by a second-level interconnect (B2, Network2): (a) snooping-snooping, (b) snooping-directory, (c) directory-directory, (d) directory-snooping.]

9
Advantages of Multiprocessor Nodes
• Potential for cost and performance advantages
• amortization of node fixed costs over multiple processors – applies even if processors simply packaged together but not coherent • can use commodity SMPs • less nodes for directory to keep track of • much communication may be contained within node (cheaper) • nodes prefetch data for each other (fewer "remote" misses) • combining of requests (like hierarchical, only two-level) • can even share caches (overlapping of working sets) • benefits depend on sharing pattern (and mapping) – good for widely read-shared: e.g. tree data in Barnes-Hut – good for nearest-neighbor, if properly mapped – not so good for all-to-all communication

10
Disadvantages of Coherent MP Nodes
• Bandwidth shared among nodes
• all-to-all example • applies to coherence traffic as well
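Slide 6 leaves open how directory information is organized. As one concrete illustration, here is a minimal C sketch of a flat, full-bit-vector directory entry and of the read-miss handling drawn in figure (a) above. The 64-node machine size, the field names, and the helper functions are assumptions made for this sketch, not something taken from the slides.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 64   /* assumed machine size: one presence bit per node */

/* A flat, full-bit-vector directory entry, one per memory block. */
struct dir_entry {
    uint64_t presence;   /* bit i set -> node i holds a cached copy          */
    int      dirty;      /* 1 -> exactly one bit set, and that node owns it  */
};

/* Owner = the single node whose presence bit is set (meaningful only
 * while the entry is dirty). */
static int find_owner(uint64_t presence)
{
    for (int i = 0; i < NUM_NODES; i++)
        if (presence & (1ull << i))
            return i;
    return -1;
}

/* Read miss from 'node': if the block is dirty, the request is forwarded
 * to the owner (steps 2, 3, 4a, 4b of figure (a)); the owner supplies the
 * data and its revision message clears the dirty state at the directory.
 * Either way, 'node' joins the sharer set. */
static void read_miss(struct dir_entry *e, int node)
{
    if (e->dirty) {
        printf("forward read to owner node %d\n", find_owner(e->presence));
        e->dirty = 0;
    }
    e->presence |= 1ull << node;
}

int main(void)
{
    struct dir_entry e = { .presence = 1ull << 5, .dirty = 1 };
    read_miss(&e, 9);
    printf("presence=%#llx dirty=%d\n", (unsigned long long)e.presence, e.dirty);
    return 0;
}
```

A full bit vector is only one of the "many alternatives": limited-pointer and coarse-vector schemes trade precision for directory storage, which is exactly the design space the slide alludes to.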