FTMCTRL: 32-bit (P)ROM EDAC Checksum Programming
Application note
2018-04-17
Doc. No GRLIB-AN-0011
Issue 1.0

CHANGE RECORD
Issue  Date        Section / Page  Description
1.0    2018-04-17  All             First issue.

TABLE OF CONTENTS
1 INTRODUCTION
1.1 Scope of the document
1.2 Reference documents
2 ABBREVIATIONS
3 OVERVIEW
3.1 Overview
3.2 FTMCTRL PROM EDAC
3.3 Sources of memory accesses
3.4 Programming parallel checkbits
3.5 Alternatives
4 CONTROLLING THE PARALLEL CHECKBIT BUS
4.1 UT699, UT699E and UT700
4.2 GR712RC
4.3 GR740
4.4 LEON3FT-RTAX
4.5 Other designs with FTMCTRL
5 GRMON

1 INTRODUCTION
1.1 Scope of the document
This document describes programming of parallel EDAC checksum (also referred to as checkbits in this document) in systems that make use of the FTMCTRL memory controller. The focus is on programming checksums for non-volatile memories, specifically parallel NOR Flash, that require special address and data sequences to issue commands to the memory devices.
Parallel EDAC checksum is only used when EDAC protection is enabled with 32-bit data width.
1.2 Reference documents
[RD1] GRLIB IP Core User's Manual, Cobham Gaisler AB, https:///grip.pdf
[RD2] GRLIB-AN-0011-flash32 software package, available via https:///notes

2 ABBREVIATIONS
BCH      Bose–Chaudhuri–Hocquenghem, class of cyclic error-correcting codes
EDAC     Error Detection And Correction
FTMCTRL  Fault-Tolerant Memory controller
MCFG     Memory Configuration Register, control register for memory controller
TCB      Test Check Bits, field in FTMCTRL MCFG3 register

3 OVERVIEW
3.1 Overview
The FTMCTRL memory controller is commonly used in LEON3FT and LEON4FT processor devices and also in custom designs based on the GRLIB IP library [RD1].
The memory controller is a combined 8/16/32-bit memory controller that provides a bridge between external memory and the on-chip bus. It is configured through memory-mapped registers referred to as the Memory Configuration (MCFG) registers. The memory controller can handle four types of devices: PROM, asynchronous static RAM (SRAM), synchronous dynamic RAM (SDRAM) and memory-mapped I/O devices (IO). The PROM, SRAM and SDRAM areas can be EDAC-protected using a (39,7) BCH code. The BCH code provides single-error correction and double-error detection for each 32-bit memory word.
The PROM device type above typically means that parallel NOR Flash, MRAM or EEPROM is connected to the memory controller. A block diagram of how FTMCTRL can be connected to external devices and the on-chip system is shown in the figure below.
Figure 1: FTMCTRL generic block diagram
The types of devices supported and the signals available on external pins of a device depend on the specific device implementation.

3.2 FTMCTRL PROM EDAC
The FTMCTRL is provided with a BCH EDAC that can correct one error and detect two errors in a 32-bit word. For each word, a 7-bit checksum is generated. A correctable error will be handled transparently by the memory controller.
If an uncorrectable error (double error) is detected, the current AHB cycle will end with an AMBA ERROR response. The EDAC is enabled for the PROM area by setting the corresponding EDAC enable bit in the MCFG3 register. When working in 32-bit mode, the checksum is present on the CB bus (see figure 1) and will be stored in a memory device present in parallel with the device(s) providing the 32-bit data bus.
For 8-bit mode, the EDAC checkbit bus (CB[7:0]) is not used but it is still possible to use EDAC protection. Data is always accessed as words (4 bytes at a time) and the corresponding checkbits are located at the address acquired by inverting the word address (bits 2 to 27) and using it as a byte address. Please refer to the relevant device user's manual or the FTMCTRL IP core documentation for further documentation on the 8-bit mode and EDAC.
When using a parallel device to hold the checkbits, the only way to set the data bus of that device to an arbitrary value is to use the write bypass and read bypass functionality provided by FTMCTRL. If the MCFG3.WB (Memory Configuration register 3, WB field, write bypass) bit is set, then the value in the MCFG3.TCB field will replace the normal checkbits during memory write cycles. If the MCFG3.RB (read bypass) bit is set, the memory checkbits of the loaded data will be stored in the TCB field during memory read cycles. This bypass functionality has some limitations:
• When read bypass is activated, any memory read access will cause MCFG3.TCB to be updated.
• The read bypass functionality requires that EDAC is enabled.
• When write bypass is activated, any memory write access will make use of the MCFG3.TCB field for the checksum value.
This means that accesses to the memory controller must be limited in order for the read bypass and write bypass functionality to be reliable. The next section describes sources of memory accesses.

3.3 Sources of memory accesses
This document covers parallel checkbits for the PROM area.
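As an illustration of the write bypass, read bypass and 8-bit checkbit addressing described in section 3.2, the following Python sketch models the register values and the address calculation as plain integer arithmetic. It does not access hardware. The bit positions (TCB in bits 7:0, PROM EDAC enable in bit 8, RB in bit 10, WB in bit 11) and all function names are assumptions made for this sketch and must be checked against [RD1] and the device user's manual. The address helper is a literal reading of "inverting the word address (bits 2 to 27) and using it as a byte address" and is also an assumption.

```python
# Assumed MCFG3 layout for illustration only; verify against [RD1].
MCFG3_TCB_MASK = 0xFF   # TCB field, assumed bits 7:0
MCFG3_PE = 1 << 8       # PROM EDAC enable, assumed bit 8
MCFG3_RB = 1 << 10      # read bypass, assumed bit 10
MCFG3_WB = 1 << 11      # write bypass, assumed bit 11

WORD_ADDR_BITS = 26     # address bits 2 to 27 form a 26-bit word address
WORD_ADDR_MASK = (1 << WORD_ADDR_BITS) - 1


def with_write_bypass(mcfg3: int, checkbits: int) -> int:
    """Return an MCFG3 value where TCB holds `checkbits` and WB is set."""
    if not 0 <= checkbits <= MCFG3_TCB_MASK:
        raise ValueError("checkbits must fit in the 8-bit TCB field")
    return (mcfg3 & ~MCFG3_TCB_MASK) | checkbits | MCFG3_WB


def with_read_bypass(mcfg3: int) -> int:
    """Return an MCFG3 value with RB set; PROM EDAC is enabled as well,
    since read bypass requires EDAC to be enabled."""
    return mcfg3 | MCFG3_RB | MCFG3_PE


def without_bypass(mcfg3: int) -> int:
    """Return an MCFG3 value with both bypass bits cleared."""
    return mcfg3 & ~(MCFG3_WB | MCFG3_RB)


def read_tcb(mcfg3: int) -> int:
    """Extract the TCB field (the checkbits captured by read bypass)."""
    return mcfg3 & MCFG3_TCB_MASK


def checkbit_byte_address_8bit(byte_address: int) -> int:
    """8-bit mode: invert the word address (bits 2 to 27) of the data word
    and use the result as the byte address of its checkbits."""
    word_address = (byte_address >> 2) & WORD_ADDR_MASK
    return ~word_address & WORD_ADDR_MASK
```

On a target these values would be written to and read from the memory-mapped MCFG3 register while all other accesses to the memory controller are held off, which is the constraint the rest of this document is concerned with.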
Since the same memory controller often also provides access to RAM used as the primary memory for a processor system, unwanted accesses may be caused by:
• Processor instruction fetches due to misses in the instruction cache
• Processor data fetches due to misses in the data cache
• Peripherals that perform direct-memory access (DMA)

3.4 Programming parallel checkbits
Programming of parallel checkbits is straightforward for memory types that accept write operations performed in the same way as a read operation, with the difference that a write signal is asserted. Other devices, such as NOR Flash devices using the Common Flash Interface (CFI), require that both the address bus and the data bus are controlled when issuing commands to the memory devices and reading the responses to these commands. Controlling the data bus means that the write bypass functionality must be used in FTMCTRL, and reading responses from a memory device means that the read bypass functionality needs to be enabled.
The read bypass and write bypass functionality of FTMCTRL can be used safely from a debugger such as GRMON by stopping the processor(s) and all on-chip peripherals capable of DMA in the system. If the bypass functionality is to be used from software running on the processor, then it is possible to design a program, taking into account the cache structure and replacement policy of the processor implementation, that runs completely from cache.
It is not possible to guarantee that the sequence will run from cache in an environment where radiation effects can cause single-event upsets in the processor's cache, or if interrupts are enabled, which can lead to a changed flow of execution and changes in the cache state (and also to unintended write accesses from interrupt handling).
It should also be noted that a complicating, but not blocking, factor is that since read bypass requires EDAC to be enabled, it is necessary to handle the corresponding AMBA ERROR, which leads to a processor trap when reading CFI command responses via read bypass.
Because of the limitations described above, it is considered infeasible to perform CFI Flash programming with parallel checkbits from a processor that is executing from memory mapped to the same FTMCTRL, when using an operating system, or when operating in an environment where L1 cache parity errors may be encountered.

3.5 Alternatives
Configurations with memory devices with 32-bit data and parallel checkbits may be wanted due to attainable memory size and memory access latency. If the non-volatile memory devices need to be reprogrammed during operation, then use of NOR Flash devices needs to be considered in combination with the limitations described in section 3.4. It can also be noted that if the non-volatile memory needs to be updated at random addresses, Flash devices usually only support erase operations on a page granularity. Alternatives to NOR Flash include MRAM devices and EEPROM devices.
A hybrid solution, usable unless the boot software needs to be updated, is to boot from FTMCTRL with EDAC enabled and make use of parallel checkbits. Once the system is up and running from RAM, the EDAC functionality for the PROM area can be disabled. EDAC for other parts of the PROM can then be implemented in software by creating a checksum for EDAC pages and storing it as part of the data that is memory-mapped.
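A minimal sketch of such a software scheme is given below in Python, assuming a hypothetical page layout of 252 data bytes followed by a 4-byte checksum. CRC-32 (from the standard zlib module) is swapped in here as the checksum; it only detects errors and does not correct them, so a real replacement for EDAC would need an error-correcting code. The page size and the function names seal_page and open_page are illustrative and do not come from this application note or from [RD2].

```python
import zlib

# Hypothetical layout: 252 data bytes + 4 checksum bytes = one 256-byte page.
PAGE_DATA_SIZE = 252
CHECKSUM_SIZE = 4


def seal_page(data: bytes) -> bytes:
    """Append a CRC-32 of the data so that the checksum is stored
    in the same memory-mapped page as the data it protects."""
    if len(data) != PAGE_DATA_SIZE:
        raise ValueError("data must be exactly PAGE_DATA_SIZE bytes")
    checksum = zlib.crc32(data)
    return data + checksum.to_bytes(CHECKSUM_SIZE, "big")


def open_page(page: bytes) -> bytes:
    """Validate the stored CRC-32 and return the data part of the page.
    Raises ValueError if the page size or the checksum is wrong."""
    if len(page) != PAGE_DATA_SIZE + CHECKSUM_SIZE:
        raise ValueError("unexpected page size")
    data = page[:PAGE_DATA_SIZE]
    stored = int.from_bytes(page[PAGE_DATA_SIZE:], "big")
    if zlib.crc32(data) != stored:
        raise ValueError("page checksum mismatch")
    return data
```

Software would call seal_page before writing a page to non-volatile memory and open_page after reading it back.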
This way software will calculate and validate checksums for the memory blocks that it reads and writes from non-volatile memory. The memory controller will not cause traps due to EDAC errors from the PROM after the EDAC is disabled.

4 CONTROLLING THE PARALLEL CHECKBIT BUS
The subsections below contain device-specific observations and recommendations for CFI Flash programming.

4.1 UT699, UT699E and UT700
To safely read and control the parallel checkbit bus on a UT699 device from the LEON3FT processor, all accesses to the shared FTMCTRL memory controller must be controlled. This means that:
• All DMA units must be stopped
• Interrupts must be disabled
• Flash programming routines and their corresponding data must reside in L1 cache (cannot be guaranteed if L1 cache encounters parity errors due to single-event upsets)
For the UT699 processor, L1 cache coherency through bus snooping cannot be used and this functionality will be disabled by software. For the UT699E and UT700 the cache snooping functionality can optionally be enabled by software. If bus snooping is enabled, snooping will invalidate cache lines due to DMA traffic, and this could affect software implementations that rely on data being present in cache for PROM programming.

4.2 GR712RC
For software running out of external memory, the same limitations apply as described for the UT699E and UT700 in section 4.1.
Many of the limitations come from the need to execute software from the same memory controller as the one that provides access to the external non-volatile memory. The GR712RC also has an on-chip RAM.
If this RAM is utilized to hold the programming application, then it is sufficient that the following rules are met:
• All DMA units using external memory must be stopped
• The full program, including trap table, must reside in the on-chip RAM
• An adapted trap handler for handling AMBA ERROR responses caused by reading memory device command responses with read bypass must be installed.
A software example for programming NOR Flashes with parallel checkbits is available [RD2].

4.3 GR740
The FTMCTRL in GR740 supports 8- and 16-bit interfaces. EDAC check bits are programmed in the memory-mapped area and the special precautions described in this document do not need to be considered for the GR740.

4.4 LEON3FT-RTAX
The same restrictions as the ones listed in section 4.1 apply.

4.5 Other designs with FTMCTRL
The same restrictions as the ones listed in section 4.1 apply for devices that have one FTMCTRL that provides access to both RAM and non-volatile memory. For devices that have other RAM or ROM that software can use, the restrictions described in 4.2 apply.

5 GRMON
GRMON versions 1.x.y and 2.x.y do not support programming parallel check bits. Support will be added for GRMON3 and this document will be updated with version information once the feature is available in GRMON3.

Copyright © 2018 Cobham Gaisler.
Information furnished by Cobham Gaisler is believed to be accurate and reliable. However, no responsibility is assumed by Cobham Gaisler for its use, or for any infringements of patents or other rights of third parties which may result from its use. No license is granted by implication or otherwise under any patent or patent rights of Cobham Gaisler.
All information is provided as is. There is no warranty that it is correct or suitable for any purpose, neither implicit nor explicit.
FAULT CODE table

Fault map classes
Internal fault map class 1A: fault reports of this class affect the MO's functionality; the faulty hardware is part of the signalling MO.
Internal fault map class 1B: fault reports of this class affect the MO's functionality; the origin of the fault is external to the signalling MO.
Internal fault map class 2A: fault reports of this class do not affect the MO's functionality; the faulty hardware is part of the signalling MO.
External condition map class 1 (EC1): reports of this class affect the MO's functionality; the condition is external to the TG (transceiver group).
External condition map class 2 (EC2): reports of this class do not affect the MO's functionality; the condition is external to the TG (transceiver group).
Replacement unit map (RU map): replacement units.

Table columns: Fault, Fault Description, Recommended Corrective Action, Related Fault

Fault codes listed: CF EC1-4, CF EC1-5, CF EC2-9, CF I1A-0, CF I1A-1, CF I1A-2, CF I1A-3, CF I1A-4, CF I1A-5, CF I1A-6, CF I1A-7, CF I1A-8, CF I1A-9, CF I1A-10, CF I1A-11, CF I1A-12, CF I1A-13, CF I1A-14

Fault descriptions (in the order given in the source)
Reset, failed restart attempt. A reset has occurred on the DXU.
Reset, power on. As above.
Reset, switch. As above.
Reset, watchdog. As above.
Reset, software fault. As above.
Reset, RAM fault. As above.
Reset, internal function change. As above.
Xbus fault.
Timing Unit VCO fault. Indicates three possible faults:
1. The VCO control value is out of range and the VCO needs to be recalibrated. The fault CF I2A:13 will probably warn before it is too late.
2. The VCO temperature is too low because the start-up heater is stuck. Try to power the DXU off and on.
3. The VCO is not distributing any 13 MHz signal.
Timing Bus fault.
Indoor temp out of safe range (macro RBS only). The temperature in the master cabinet is outside the range 0-55C.
DC voltage out of range. The DC voltage (in the master cabinet) is below 21.2 V.
Local Bus fault. The DXU is not able to send any data on the local bus. Three possible causes: 1. the DXU hardware is faulty; 2. the DXU has poor contact with the backplane; 3. the DXU backplane is faulty.
L (local) / R (remote) SWI (the BTS is in local mode).
L (local) / R (remote) TI (the link is faulty).
RBS door (2101, 2102). The RBS door is open. The alarm ceases 5 minutes after the RBS door is closed.

Related faults and cease conditions (as given in the source)
Fault ceases when DC voltage is above 22.2 V. CF I2A:18, TRXC I1B:3. 21VBFU, 20V-BDM
Fault ceases when temperature is in range 2-53C. TRXC I1B:1
X-SEL error code table

Error | Cause | Description
802 | SCIF Receiving ER Status (IAI protocol reception) | Interference. Check for noise, disconnected equipment and improper communication settings. This error is also generated when communication is established while the PC/TP is wrongly connected to SIO-CH1 opened by the user.
803 | Reception Time-out Status (IAI protocol reception) | The interval after the first byte of data was received is too long. The communication cable or the equipment may also be disconnected.
804 | SCIF Overrun Status (SEL reception) | Check for noise, disconnected equipment and improper communication settings.
805 | SCIF Receiving ER Status (SEL reception) | Check for noise, disconnected equipment and improper communication settings.
806 | SCIF Other Factor Reception ER Status (SEL reception) | Interference. Handle in the same way as error Nos. 804 and 805.
807 | Shutdown Relay ER Status | The motor(s) are still energized although the system is in shutdown status. The relay may have welded shut.
808 | Slave Parameter Write During OFF Status | Power was turned off during a slave parameter write. (Detectable only when the backup battery is used.)
809 | Data Flash ROM Write During OFF Status | Power was turned off during a data flash ROM write. (Detectable only when the backup battery is used.)
80A | Enhancing SIO Overrun Status (SEL reception) | Check for noise, disconnected equipment and improper communication settings.
80B | Enhancing SIO Parity ER Status (SEL reception) | Check for noise, disconnected equipment and improper communication settings.
80C | Enhancing SIO Framing ER Status (SEL reception) | Check for noise, disconnected equipment and improper communication settings.
80D | Enhancing SIO Other Factor Reception ER Status (SEL reception) | Interference. Handle in the same way as error Nos. 80A, 80B and 80C.
80E | Enhancing SIO Reception Buffer Overflow Status (SEL reception) | The reception buffer overflowed. Too much data was received from outside.
900 | Empty Step Shortage Error | There are not enough empty steps to save the step data. Secure the empty steps needed for saving.
901 | Step No. Error | Invalid step No.
902 | Symbol Definition Table No. Error | Invalid symbol definition table No.
903 | Point No. Error | Invalid point No.
904 | Variable No. Error | Invalid variable No.
905 | Flag No. Error | Invalid flag No.
906 | I/O Port Flag No. Error | Invalid I/O port flag No.
930 | Coordinate System No. Error | Invalid coordinate system No. (new SCARA only)
931 | Coordinate System Type Error | Invalid coordinate system type. (new SCARA only)
932 | Coordinate System Definition Data Number Specification Error | The specified number of coordinate system definition data is invalid. (new SCARA only)
933 | Axis No. Error | Invalid axis No. (new SCARA only)
934 | SCARA ABS Reset Special Movement Operation Type Error | The SCARA ABS reset special movement operation type is invalid. (new SCARA only)
935 | Positioning Operation Type Error | The positioning operation type is invalid. (new SCARA only)
936 | Interference Check Zone No. Error | The interference check zone No. is invalid. (new SCARA only)
937 | Interference Check Zone Specification Invalid Error | The error type specified for interference check zone invasion is invalid. (new SCARA only)
801 | SCIF Overrun Status (IAI protocol reception) | Check for noise, disconnected equipment and improper communication settings.
ECB | Communication Error (PC) | Loss of communication, possibly due to loss of power.
938 | Interference Check Zone Data Number Specification Error | The specified number of interference check zone data is invalid. (new SCARA only)
939 | Interference Check Zone Invasion Detection (message level specification) | An interference check zone invasion was detected (message level specification). (new SCARA only)
93A | Rotary Axis CP Jog Prohibition Error Outside Range of Motion | Move into the range of motion using each-axis jog. (new SCARA only; when a tool XY offset exists, etc.)
A01 | Low System Memory Backup Battery Warning | The voltage of the system memory backup battery is low. Replace the battery. (Voltage level at which data can still be backed up.)
A02 | Low System Memory Backup Battery Warning | The voltage of the system memory backup battery is low. Replace the battery. (Voltage level at which data cannot be backed up.)
A03 | Low Absolute Encoder Data Backup Battery | The voltage of the absolute encoder data battery is low. Check the battery connection or replace the battery. A chattering contact in the brake connection can also be the cause.
A04 | System Mode Error During Core Section Update | An update command was received while the system was not in core update mode. When updating the core, confirm that the chip resistor for core update mode is set on the board. (for maintenance)
A05 | Motorola S-record Format Error | The update program file is invalid. Check the file.
A06 | Motorola S-record Checksum Error | The update program file is invalid. Check the file.
A07 | Motorola S-record Loading Address Error | The update program file is invalid. Check the file.
A08 | Motorola S-record Writing Address Over Error | The update program file is invalid. Check the file.
A09 | Flash ROM Timing Limit Excess Error (write) | Flash ROM write has timed out.
A0A | Flash ROM Timing Limit Excess Error (erase) | Flash ROM erase has timed out.
A0B | Flash ROM Verify Error | Flash ROM write/erase is invalid.
A0C | Flash ROM ACK Time-out | Flash ROM erase/write is invalid.
A0D | First Sector No. Specification Error | Flash ROM erase error.
A0E | Sector Number Specification Error | Flash ROM erase error.
A0F | Offset Address Error at Writing Destination (odd-numbered address) | Flash ROM write is invalid.
A10 | Writing Source Data Buffer Address Error (odd-numbered address) | Flash ROM write is invalid.
A11 | Core Part Code Sector Block ID Invalid Error | The core program currently written in flash ROM is invalid.
A12 | Core Part Code Sector Block ID Erase Count Exceeded | The flash ROM erase count exceeds the permissible count.
A13 | Flash ROM Write Request Error when Erase is Incomplete | During an update, a flash ROM write command was received before the flash ROM erase command. Check the update program file and update again.
A14 | Busy Status Release Time-out Error at EEPROM Write | Timed out waiting for the busy state to be released after an EEPROM write.
A15 | Unmounted Target EEPROM Write Request Error | An EEPROM write was requested for a unit with a CPU (driver etc.) that is not mounted.
A16 | Unmounted Target EEPROM Read Request Error | An EEPROM read was requested for a unit with a CPU (driver etc.) that is not mounted.
A17 | Sum Check Error (IAI protocol reception) | The checksum of the received data is invalid.
A18 | Header Error (IAI protocol reception) | The header of the received data is invalid.
A19 | Message Addressing Error (IAI protocol reception) | The receiving address is invalid.
A1A | ID Error (IAI protocol reception) | The ID of the received data is invalid.
A1C | Conversion Error | The transmitted data does not conform to the protocol or includes illegal data. Check the transmitted data.
A1D | Start Mode Error | Program start is not permitted in the present mode (MANU/AUTO).
A1E | Start Condition Failure Error | Program start was requested while writing flash ROM, during E-STOP or with the safety gate open.
A1F | Axis Multiple Use Error (SIO) | The axis is already in use.
A20 | Servo Use Right Acquisition Error (SIO) | No servo use right is available.
A21 | Servo Use Right Already Acquired Error (SIO) | The servo use right has already been acquired.
A22 | Servo Use Right Unacquisition Error (SIO) | Continued acquisition of the servo use right failed.
A23 | Low ABS Encoder Backup Battery Warning (main analysis) | The voltage of the absolute encoder data battery is low. Check the battery connection or replace the battery.
A25 | Step Number Specification Error | The step number specification is invalid.
A26 | Program Number Specification Error | The program number specification is invalid.
A27 | Program Unregistration Error | The specified program is not registered.
A28 | Reorganization Rejection Error While Program is Running | A program area reorganization was attempted while a program was running. End all running programs first.
A29 | Program Edit Rejection Error While Program is Running | An edit operation was attempted on a program that was running. End the program first.
A2A | Defined Program is not Running | The specified program is not running.
A2B | Illegal Program Execution in AUTO Mode | Program execution from the software is prohibited in AUTO mode.
A2C | Program No. Error | Invalid program No.
A2D | Program Restart Error | A restart was requested for a program that was not running.
A2E | Program Temporary Stop Error | A temporary stop was requested for a program that was not running.
A2F | Breakpoint Error | The step No. specified for a breakpoint is invalid.
A30 | Breakpoint Setting Number Specification Error | The number of breakpoints set exceeds the limit.
A31 | Parameter Change Number Error | The number of parameter changes is invalid.
A32 | Parameter Type Error | Invalid parameter type.
A33 | Parameter No. Error | Invalid parameter No.
A34 | Card Parameter Buffer Read Error | The card parameter buffer read is invalid.
A35 | Card Parameter Buffer Write Error | The card parameter buffer write is invalid.
A36 | Parameter Change Rejection Error During Movement | Parameters cannot be changed while an axis is in motion.
A37 | Card Manufacturing and Function Information Change Rejection Error | Changing the card manufacturing and function information is prohibited.
A38 | Parameter Change Rejection Error at Servo ON | Parameters cannot be changed while the servo is on.
A39 | Deferred Collection Card Parameter Change Error | An attempt was made to change a parameter of a card that had not been recognized at reset.
A3A | Device No. Error | The device No. is invalid.
A3C | Memory Initialization Type Specification Error | The specified memory initialization type is invalid.
A3D | Unit Type Error | The unit type is invalid.
A3E | SEL Write Data Type Specification Error | The specified SEL write data type is invalid.
A3F | Flash ROM Write Rejection Error when Program is Being Executed | Writing flash ROM while a program is running is prohibited.
A40 | Data Change Rejection Error During Flash ROM Write | Changing data while flash ROM is being written is prohibited.
A41 | Multiple Flash ROM Write Instruction Rejection Error | A flash ROM write instruction was received again while flash ROM was being written.
A42 | Direct Monitor Prohibition Error During Flash ROM Write | Direct monitoring while flash ROM is being written is prohibited.
A43 | P0 and P3 Area Direct Monitor Prohibition Error | Direct monitoring of the P0 and P3 areas is prohibited.
A44 | Point Data Number Specification Error | The specified number of point data is invalid.
A45 | Specification of Number of Symbol Records Error | The specified number of symbol records is invalid.
A46 | Variable Data Number Specification Error | The specified number of variable data is invalid.
A48 | Error Details Inquiry Type 1 Error | Error details inquiry type 1 is invalid.
A49 | Error Details Inquiry Type 2 Error | Error details inquiry type 2 is invalid.
A4A | Monitor Data Type Error | The data type of the monitor data inquiry is invalid.
A4B | Specification of Number of Monitor Records Error | The specified number of records in the monitor data inquiry is invalid.
A4C | Monitor Operation Special Command Register Busy Error | The driver special command ACK timed out during a monitor operation.
A4E | Parameter Register Busy Error When Slave Command is Issued | The driver special command ACK timed out when a slave command was issued.
A4F | Software Reset Rejection Error while In-Motion | Software reset (SIO) is prohibited during operation (while a program is running or the servo is in use).
A50 | Driving Source Restoration Demand Rejection Error | The cause of the drive-source cutoff (error, deadman switch, safety gate, emergency stop, etc.) has not been released.
A51 | Temporary Stop of Motion Release Request Rejection Error | The cause of the temporary stop of all operations has not been released.
A53 | Rejection Error when Servo is In-Use | An operation that is not permitted while the servo is in use was attempted.
A54 | Function Unsupport Rejection Error | The function is not supported.
A55 | Function Use Rejection Error Only for Manufacturer | An operation that is open only to the manufacturer was attempted.
A56 | Data Failure Rejection Error | The data is invalid.
A57 | Program Multiple Start Error | Illegal execution of a program that is already running.
A58 | BCD Failure Warning | The data being written to variable 99 is not a valid BCD number.
A59 | IN/OUT Instruction Port Flag Invalidity Warning | The number of input and output ports may exceed 32. Check I/O Parameters 2-9 for proper settings.
A5B | Character String -> Numerical Value Conversion Invalidity Warning | The number of characters specified for the conversion is invalid, or the string contains a character that cannot be converted to a numerical value.
A5C | SCPY Instruction Copy Character Number Invalidity Warning | The number of characters specified for the copy is invalid.
A5D | SIO Channel Opening Error in Non-AUTO mode | The user SIO channel can only be opened in AUTO mode.
A5E | Specification of Number of I/O Port Flags Error | The specified number of I/O port flags is invalid.
A5F | Field Bus Error (LERROR-ON) | LERROR-ON was detected.
A60 | Field Bus Error (LERROR-BLINK) | LERROR-BLINK was detected.
A61 | Field Bus Error (HERROR-ON) | HERROR-ON was detected.
A62 | Field Bus Error (HERROR-BLINK) | HERROR-BLINK was detected.
A63 | Field Bus Not Ready | Field bus ready cannot be confirmed.
A64 | SCIF Overrunning Error (at the SIO bridge) | Check for noise, disconnected equipment and improper communication settings.
A65 | SCIF Receiving Error (at the SIO bridge) | Check for noise, disconnected equipment and improper communication settings.
A66 | SCI Overrunning Error (at the SIO bridge) | Check for noise, disconnected equipment and improper communication settings.
A67 | SCI Framing Error (at the SIO bridge) | Check for noise, disconnected equipment and improper communication settings.
A68 | SCI Parity Error (at the SIO bridge) | Check for noise, disconnected equipment and improper communication settings.
A69 | Data Change Rejection Error while Running | Program data cannot be changed while a program is running.
A6A | Software Reset During Flash ROM Write | The controller was reset through I/O while flash ROM was being written. Write again.
A6B | Field Bus Error (FBRS link Error) | An FBRS link error was detected.
A6C | Program Start in AUTO Mode Error | While Other Parameter #45 is set to 0, a program cannot be started serially in AUTO mode. Change Other Parameter #45 to 1.
A6D | P0, P3, and FROM area Direct Write Prohibition Error | Direct writes to the P0, P3 and FROM areas are prohibited.
A6E | Rejection Error During Write | An operation that is not permitted during a slave parameter write or data flash ROM write was attempted.
B00 | SCHA Setting Error | The ending character set using SCHA is invalid.
B01 | TPCD Setting Error | The set value must be either 0 or 1.
B02 | SLEN Setting Error | The SLEN setting is out of range.
B03 | Homing Method Error | The homing method specified in Specific Axis Parameter #10 is invalid for that particular axis.
B04 | 1 Shot Pulse Output Simultaneous Use Error | The number of BTPN and BTPF timers operating simultaneously in one program exceeds the upper limit (16).
B05 | Over Travel Error During Homing | Possibly a bad limit switch or creep sensor.
B06 | Multiple OPEN Error | An attempt was made to OPEN a channel that was already opened in another program.
B07 | SIO Channel Usage Error | An attempt was made to use a channel that has not been opened (OPEN command).
B08 | Multiple WRIT Error | The WRIT instruction was executed at the same time by two or more tasks for the same channel.
B09 | SIO RS485 WRIT/READ Simultaneous Error | The WRIT instruction and the READ instruction were executed at the same time.
B0A | Usage of Unassigned SIO Channel | An attempt was made to use a channel that has not been properly assigned. Check I/O parameters No. 100-111 and the state of the I/O slot.
B10 | Z-phase Search Time-out Error | The Z-phase cannot be detected. Check the encoder connections; the encoder may be defective.
B11 | HOME Limit Switch Time-out Error | Release from the home sensor cannot be confirmed. Check for restricted movement, the wiring, the home sensor, etc.
B12 | SEL Command Return Code Storage Variable No. Error | The SEL command return code storage variable No. is invalid. Check "Other parameter No.24 READ instruction return code storage local variable No." etc.
B13 | Backup SRAM Data Checksum Error | The backup SRAM data is destroyed. Check the battery.
B14 | Flash ROM 8Mbit Version Unsupport Function Error | An attempt was made to use a function that is not supported on a board with 8 Mbit flash ROM. (HT connection specification etc.)
B15 | Input Port Debugging Filter Type Error | The input port debugging filter type setting is invalid.
B16 | SEL Operand Specification Error | The operand specification of a SEL instruction word is invalid.
B17 | Parameter Register Busy Error when Slave Command is Issued | The driver special command ACK timed out when a slave command was issued.
B18 | Device No. Error | The device No. is invalid.
B19 | Unit Type Error | The unit type is invalid.
B1A | ABS Reset Specification Error | The ABS reset specification is invalid (two or more axes specified simultaneously, or an axis without an ABS encoder specified).
C03 | Unregistered Program Specification Error | The specified program is not registered.
C04 | Program Entry Point Undetected Error | Execution was requested for a program No. in which no program step is registered.
C05 | Program First Step BGSR Error | The first command in a program cannot be BGSR.
C06 | Executable Step Undetected Error | The program being executed contains no executable steps.
C07 | Subroutine Undefined Error | The subroutine that was called is not defined.
C08 | Multiple Subroutine Defined Error | The same subroutine No. is defined in two or more places.
C0A | Multiple Tag Define Error | The same tag No. is defined in two or more places.
C0B | Tag Undefined Error | The tag specified as the jump destination of the GOTO command is not defined.
C0C | DW, IF, IS, and SL Pair End Mismatch Error | The syntax of the branch instruction is invalid. The EDIF, EDDO or EDSL does not match the most recent branch instruction. Check the correspondence between the IF/IS instructions and EDIF, the correspondence of the DO instruc
C0D | DW, IF, IS, and SL Pair End Shortage Error | No EDIF, EDDO or EDSL is found. Check the correspondence between the IF/IS instructions and EDIF, between the DO instruction and EDDO, and between the SLCT instructions and EDSL.
C0E | BGSR Pair End Shortage Error | There is no EDSR corresponding to BGSR, or no BGSR corresponding to EDSR. Check the correspondence between BGSR and EDSR.
C0F | DO, IF, and IS Over Nest Steps Number Error | The nesting depth of the DO and IF/IS instructions exceeds the limit of 15 nested statements. Also check for exiting with a GOTO command before the end of the statement.
C10 | Over SLCT Nest Steps Number Error | The nesting depth of SLCT exceeds the limit of 15 nested statements. Also check for exiting with a GOTO command before the end of the statement.
C11 | Over Error of Frequency of Subroutine Nest | The nesting depth of subroutines exceeds the limit of 15 nested subroutines. Also check for exiting with a GOTO command before the end of the subroutine.
C12 | DO, IF, and IS Nest Steps Number Error Under | The position of EDIF or EDDO is invalid. Check the correspondence between the IF/IS instructions and EDIF and between the DO instruction and EDDO, and check whether a GOTO instruction branches outside the syntax.
C13 | SLCT Nest Steps Number Error Under | The position of EDSL is invalid. Check the correspondence between SLCT and EDSL, and check whether a GOTO instruction branches outside the syntax.
C14 | Error of Frequency of Subroutine Nest Under | The position of EDSR is invalid. Check the correspondence between BGSR and EDSR, and check whether a GOTO instruction branches outside the syntax.
C15 | Error of the Next Step Instruction Code of SLCT | The program step following SLCT must be WHEQ, WHNE, WHGT, WHGE, WHLT, WHLE, WSEQ, WSNE, OTHE or EDSL.
C16 | Create Stack Failure | Initialization of the input condition state storage stack failed.
C17 | Enhancing Condition Code Error | The program step is invalid. The code of the enhancing condition is invalid.
C18 | Enhancing Condition LD Simultaneous Processing Numerical Excessive Error | The number of simultaneous LD operations exceeds the limit.
C19 | Enhancing Condition LD Shortage Detection Error 1 | LD is insufficient when enhancing condition A or O is used.
C1A | Enhancing Condition LD Shortage Detection Error 2 | LD is insufficient when enhancing condition AB or OB is used.
C1C | Unused LD Detection Error | An attempt was made to execute a command without using an input condition that was saved by using LD two or more times with enhancing condition AB or OB.
C1F | Input Condition CND Shortage Detection Error | A required input condition is missing when the enhancing condition is used.
C21 | Input Condition Use Error at Input Condition Prohibition Command | An input condition cannot be used with the specified command.
C22 | Positional Command Illegal Error at Input Condition Prohibition Command | A command that prohibits input conditions must not be placed in the middle of nested input conditions.
C23 | Operand Invalid Error | The program step is invalid. Necessary operand data is invalid.
C24 | Operand Type Error | The program step is invalid. The data type of the operand is invalid.
C25 | Actuator Control Declaration Error | The set value of the actuator control declaration instruction is invalid.
C26 | Over Set Range Error of Time of Timer | Invalid setting for the timer.
C27 | Over Set Range Error of Time-out Time at WAIT | Invalid setting of the time-out time.
C28 | Tick Frequency Setting Range Error | Invalid setting of the tick frequency.
C29 | HOME Limit Switch Time-out Error | 0 was specified as the divisor of a DIV instruction.
C2A | Range Error when SQR is Ordered | The operand value of the SQR instruction is invalid. The operand must be greater than zero.
C2B | BCD Mark Digit Number Range Error | The specified number of BCD mark digits is invalid. Specify a value from 1 to 8.
C2C | Program No. Error | Program No. is invalid.
C2D | Step No. Error | Step No.
is invalidC2E Empty Step ShortageErrorAn empty step to preserve the step data isinsufficient. Please secure an empty stepnecessary for preservationC2F Axis No. Error Axis No. is invalidC30 Axis Pattern Error The axis pattern is invalidC32 Operation AxisAddition Error whenCommand is BeingExecutedIt moved a continuous point or the operation axisof the point data was added while processing theOts movement calculationError Cause DescriptionC33 Base Axis No. Error Base axis No. is invalidC34 Zone No. Error Zone No. is invalid. - Only X-SEL : C35 Point No. Error Point No. is invalidC36 I/O port Flag No. Error I/O port flag No. is invalidC37 Flag No. Error Flag No. is invalidC38 Tag No. Error Tag No. is invalidC39 Subroutine No. Error Subroutine No. is invalidC3A Open User Serial Channel No.ErrorInvalid User SIO Channel specified.Verify settings in I/O Parameter Nos.100-111C3B Parameter No. Error Parameter No. is invalid C3C Variable No. Error Variable No. is invalid C3D String No. Error String No. is invalidC3E String Variable Data NumberSpecification ErrorString length goes beyond range of columns(1 ~ 999)C40 Delimiter Non Detection Errorin String VariableThe delimiter in the string variablecannot be detectedC41 String Variable Copy Size OverErrorThe size of the string variable copy is toolargeC42 Character Number Error whenString is ProcessedThe character string length when thestring is processed is not defined. Pleaseexecute the string processing instructionafter it defines it by the SLENinstructionC43 Character String Length Errorwhen String is ProcessedThe character string length when thestring is processed is invalid. Pleaseconfirm the value of the character stringlength defined by the SLEN instructionC46 Source Symbol Storage TableEmpty Area Shortage ErrorAn empty area to store the source symbolis insufficient. 
Please confirm thesource symbol use frequencyC47 Symbol Retrieval Error The definition of the symbol used in the program step is not foundC48 Error SIO Continuous SwitchingErrorSIO transmission is not corresponding tothe format or includes illegal dataC49 Error During Tasks Other ThanSEL-SIOSIO is being used by other interpretertasksC4A SCIF Non-Open Error It is not opened by the task that cereal of the user open channel 1 tried to use. Please open the channel previously by the OPEN instructionC4B Delimiter Error The ending character is not defined. Please set the ending character before OPEN using the SCHA commandC4E Illegal OPEN of SIO Channel 1 User SIO Channel 1 was opened illegally. If I/O Parameter #90 is set to 2, channel 1 cannot be used for user SIO, only IAI Protocol appliesC4F SEL Program Source Symbol SumCheck ErrorFlash ROM data is destroyedC50 Symbol Definition Table SumCheck ErrorFlash ROM data is destroyedC51 Point Data Sum Check Error Flash ROM data is destroyedC52 Backup SRAM Data DestructionErrorBackup SRAM data is destroyed. Pleaseconfirm the batteryC53 Flash ROM SEL Global Data ErrorList Invalid ErrorSEL global data Error list in flash ROM isinvalidC54 Flash ROM SEL Global Data ErrorList Overlap ErrorSEL global data Error list in flash ROMoverlapsC55 SEL Global Data Error List FlashROM Erase No. Over ErrorFlash ROM erase limit has been exceededC56 Timing Limit Over Error (flashROM erase)Erase is invalid of flash ROMC57 Flash ROM Verifying Error(flash ROM erase)Erase is invalid of flash ROMC58 Flash ROM ACK Time-out Error(flash ROM erase)Erase is invalid of flash ROMC59 Front Sector No. DesignationError (flash ROM erase)Erase is invalid of flash ROMC5A Front Sector No. 
DesignationError (flash ROM erase)Erase is invalid of flash ROMC5B Timing Limit Over Error (flashROM write)The write is invalid of flash ROMC5C Flash ROM Verifying Error(flash ROM write)The write is invalid of flash ROMC5D Flash ROM ACK Time-out Error(flash ROM write)The write is invalid of flash ROMC5E Writing Destination OffsetAddress Error (flash ROM write)The write is invalid of flash ROMC5F Writing Source Data BufferAddress Error (flash ROM write)The write is invalid of flash ROMC60 Error Without SEL Global DataError Restructuring AreaThere is no erase settlement SEL globaldata error restructuring area。
Analysis of Checksum-Based Execution Schemes for Pipelined Processors

Bernhard Fechner
FernUniversität in Hagen, Department of Computer Science, 58084 Hagen, Germany
bernhard.fechner@fernuni-hagen.de

Abstract

The performance requirements for contemporary microprocessors are increasing as rapidly as their number of applications grows. By accelerating the clock, performance can be gained easily but only with high additional power consumption. The electrical potential between logic '0' and '1' is decreased as integration and clock rates grow, leading to a higher susceptibility to transient faults, caused e.g. by power fluctuations or Single Event Upsets (SEUs). We introduce a technique based on the well-known cyclic redundancy check codes (CRCs) to secure the pipelined execution of common microprocessors against transient faults. This is done by computing signatures over the control signals of each pipeline stage, including dynamic out-of-order scheduling. To compute the checksums correctly, we resolve the time dependency of instructions in the pipeline.

We first discuss important physical properties of Single Event Upsets (SEUs). Then we present a model of a simple processor with the applied scheme as an example. The scheme is extended to support n-way simultaneous multithreaded systems, resulting in two basic schemes. A cost analysis of the proposed SEU-detection schemes leads to the conclusion that both schemes are applicable at reasonable cost for pipelines with 5 to 10 stages and at most 4 hardware threads.

A worst-case simulation using software fault injection of transient faults in the processor model showed that an average of 83% of errors can be detected even at a fault rate of $10^{-2}$. Furthermore, the scheme detects an error within an average of only 5.05 cycles.

1. Introduction and motivation

The growing number of applications for computing systems has led to rapidly growing performance requirements. Performance is easily gained by accelerating the clock frequency F. It is inevitable to decrease the main current at high clock rates, because the power consumption grows as $E \sim F^3$ [16]. To decrease the energy consumption, the potential between logic '0' and '1' is reduced. By increasing the integration density, new algorithms and paradigms can be implemented in hardware and energy consumption can be reduced again. Examples are dual-core systems or Simultaneous Multithreading (SMT) [1]. The Semiconductor Industry Association (SIA) roadmap shows an increase of integration density to 22 nm until 2016 (/Files/ITRS_Overview.pdf). At 90 nm and below [11], a problem occurs at sea level that was previously known only from aerospace applications: the collision of high-energy neutrons from deep space with silicon. At higher altitudes, the fault rates in SRAMs increase by a factor of 3-10, approximately by 1.3 per 1000 ft [12].

These so-called total ionizing dose effects are caused by electromagnetic waves or particle radiation that can ionize atoms or molecules so that electrons are removed. The loose ions are very reactive and can cause severe circuit damage. Electromagnetic radiation can cause ionization if the wavelength is below 100 nm, since the photon then has enough energy to separate one electron. Single Event Effects (SEEs) [6][7] are caused by the interaction of a single particle with silicon. They have been investigated since the late seventies, leading to the discovery of memory faults in terrestrial [5] and extra-terrestrial [4] environments. SEEs can be separated into non-destructive soft errors (or transient faults), which cause a temporary malfunction or disturbance of digital information, and destructive effects, which cause permanent failures. With high integration, protons are able to induce Single Event Upsets (SEUs), leading to a higher SEU susceptibility of the affected circuits, especially for deep-space applications. R.
Baumann showed that the downtime costs caused by SEUs have increased dramatically [27]. It is forecast that the soft-error rate (SER) in combinatorial circuits will increase approximately by $10^5$ from 1992 until 2011 [13]. Thus, we have to deal with a total soft-error rate of $10^4$ FIT (Failure in Time = 1 error in $10^9$ hours of operation), i.e. a fault rate of $10^{-5}$/h, in combinatorial circuits for the next decade [13]. Therefore it is necessary to secure the execution of future processors, making them more reliable in the face of the increasing number of SEUs. This paper makes the following contributions:

- It introduces an error detection scheme which computes checksums out of the control path to detect transient errors and proposes extensions to support SMT.
- We estimate the area for all proposed schemes and present the results of a fault-coverage analysis based on software-implemented fault injection.

The rest of the paper is organized as follows: We look at related work in Section 2. Section 3 discusses the fault model and important SEU properties. Section 4 presents two schemes to secure the pipelined execution and Section 5 an estimation of the costs. Section 6 discusses the simulation methodology and presents the results of the fault-coverage analysis. Section 7 concludes the paper.

2. Related work

Pipeline signatures resemble control-flow monitoring techniques, where incorrect branches are detected. The software-implemented execution checking of loop-free intervals by [18] could detect all sequence errors that resulted in a branch outside the interval. Macroinstruction control-flow monitoring divides the application program into several blocks. The blocks are checked instruction by instruction for control-flow faults [19][20]. In [21] signature instruction streams (SIS) were introduced. The CRC of the instruction stream is inserted into the binary code after a branch. The monitor reads the CRC and compares it with the computed CRC.
An error is detected if the checksums do not match. With a branch occurring every fourth to tenth instruction, the overhead to store the signatures was between 10% and 25% of the original program code. Because the monitor is much simpler than the processor it monitors, performance degrades (because of extra memory cycles). The effectiveness of SIS was verified by hardware fault injection for a Motorola 68000 system [21][23]. SIS raised the error detection rate to 25% in comparison to the original system. Smolens et al. [17] proposed an error detection scheme called fingerprinting. Fingerprinting summarizes the execution history of a processor. By using two processors, errors can be detected by comparing the fingerprints. In contrast to all schemes above, we are able to detect errors in the pipelined execution and save hardware by using SMT. As a consequence, we do not have to replicate hardware to achieve structural redundancy or insert checksums into the instruction stream. Since the detection works very fast, countermeasures such as error recovery will work more efficiently (time/cost/power consumption).

3. Physical effects and the fault model

Temporal errors are the main cause of errors in semiconductors. They are difficult to locate in time, because they are not always in the state causing the error. The probability of a temporal error is 5 to 100 times higher than that of a permanent error [14]. Temporal errors need not be repaired, since the hardware is not physically damaged. Apart from radiation, they can be caused e.g. by power fluctuations, loosely coupled units, timing faults, meta-stable states and environmental influences (temperature, humidity, force). Seldom discussed in computer science literature are important properties of SEUs such as creation, duration and energy level. These properties help to understand the effects caused by particles and to find appropriate countermeasures.
Defect types in the silicon lattice are, amongst others, characterized by different, discrete energy levels in the band gap, the entropy-change factors, and the annealing temperatures at which the bonds break. Along its path through the silicon lattice, an ionizing particle creates electron-hole pairs through Coulomb scattering. The energy of the incident particle can be measured through the energy loss on its path, dE/dx, in units of keV/µm, through the linear energy transfer ($\mathrm{LET} = dE/(\rho\,dx)$) in units of eV cm$^2$ mg$^{-1}$ ($\rho$ is the density of the matter), or in charge per unit length (pC/µm). For silicon, 3.6 eV are needed to create an electron-hole pair. The charge-collecting dynamic has a fast (O(100 ps)) and a slow (O(ns)) component [8][9]. It is important to know the duration of a transient fault when fault-detection/correction schemes work on a tight time basis, as in this work. The effects of an SEU could last longer than it takes to correct the error, misleading the correction mechanism into assuming that a permanent fault occurred, or resulting in no error being detected. To determine the duration of a particle-induced transient fault, we cite [23]: "The prompt charge is collected in much less than 1 ns, which is shorter than the response time of most MOS transistor." In [24] it is shown, using a 0.6 µm CMOS process, that the slow component of the charge-collecting dynamic of an SEU is active over 0.5 ns only in the substrate (at a maximum of 0.5 mA). The quick component of this dynamic is active for ≤ 0.2 ns (over 3 mA). Targeting an FPGA implementation, this is of no concern for modern FPGAs, since their clock cycle is still well above this limit. For further details on fault types and rates in submicron technologies, see [10]. The fault model assumes transient faults in the form of SEUs (Single Event Upsets). Furthermore, we assume one fault at a time in one component (pipeline stage). SEUs are modeled through bit flips in latches or flip-flops.
In [15] it has been shown that this modeling closely matches the real faulty behavior.

4. Pipeline signatures

In this section we present a scheme to compute signatures on a micro-architectural level for the control path of a simple microprocessor, exploiting the pipelined execution scheme. The scheme will be extended to support SMT. Let P be a p-stage pipelined (superscalar) processor with a finite instruction set of $B \neq \emptyset$ instructions and $t \in \mathbb{N}$ sets of instruction streams

$$I_1 = \{i_{1,1}, \ldots, i_{1,m}\},\ \ldots,\ I_t = \{i_{t,1}, \ldots, i_{t,m}\} \subseteq \wp(B),$$

where $\wp(B)$ is the power set of B. We assume two redundant RAMs with equal code and data contents. Thus, we set t = 2 (although it is possible to have multiple threads reading from one RAM), and $I_1 = I_2$ in the fault-free case. Instruction streams do not necessarily have to be finite, because of program loops. Each stage includes a storage where the processor saves the thread-ID and the control part of the instruction being processed in that stage. For the fetch stage, the control part will be the fetched instruction. For the decode stage, it will be the part of a microprogram driving the execution unit(s), etc. The signature computation involves the well-known cyclic redundancy check codes (CRCs) [2][3]. Cyclic binary codes are a subgroup of linear codes. They are codes with a fixed number of words $2^m$ and a fixed word length n, where $m \leq n$, over the alphabet $x_i \in \{0,1\}$. Code words are gained by polynomial division of the message polynomial $v(x) = \sum_{i=1}^{n} v_i x^i$ by the generator polynomial $g(x) = \sum_{i=1}^{n} g_i x^i$. The selection of a generator polynomial g(x) is not a trivial task. It must be chosen in a way that enough code words are produced and the Hamming [22] distance

$$a, b \in \{0,1\}^n;\quad d_H(a,b) := \sum_{i=1}^{n} (a_i \nleftrightarrow b_i)$$

is maximal. For example, if $g(x) = x + 1$, all single-bit errors can be detected, since g(x) is equivalent to the computation of the parity of v(x). The situation is different with pipeline checksums.
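The polynomial division behind a CRC can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `poly_mod` and `parity` and the integer encoding (bit i of an integer holds the coefficient of $x^i$, counting from i = 0) are assumptions chosen for illustration. The sketch checks the statement that dividing by $g(x) = x + 1$ amounts to computing the parity of v(x), so every single-bit error changes the remainder.

```python
def poly_mod(message: int, generator: int) -> int:
    """Remainder of message(x) divided by generator(x) over GF(2).

    Polynomials are encoded as integers: bit i is the coefficient of x**i.
    """
    if generator <= 0:
        raise ValueError("generator polynomial must be non-zero")
    gen_degree = generator.bit_length() - 1
    remainder = message
    # Cancel the leading term until the degree drops below deg(generator).
    while remainder.bit_length() - 1 >= gen_degree:
        shift = (remainder.bit_length() - 1) - gen_degree
        remainder ^= generator << shift
    return remainder


def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


G_PARITY = 0b11  # g(x) = x + 1

# The remainder modulo x + 1 equals the parity of the message bits.
for v in range(1 << 10):
    assert poly_mod(v, G_PARITY) == parity(v)

# Flipping any single bit therefore always changes the remainder.
word = 0b1011_0010
for k in range(8):
    assert poly_mod(word ^ (1 << k), G_PARITY) != poly_mod(word, G_PARITY)
```

A generator of higher degree yields a longer remainder and can catch more error patterns, which is why the choice of g(x) matters for the coverage figures reported later.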
Here, the message v(x) is composed of the instruction stream and the contents of the pipeline latches containing the control information for a stage. To clarify this, we start with the computation of a signature for a simple pipeline (Figure 1).

Figure 1. Signature computation

The checksums will be saved in a special memory with the corresponding PC and thread-ID. We call this storage the checksum memory. Table 1 shows one checksum memory entry with the corresponding bit ranges.

Table 1. A checksum memory entry
Thread-ID | Program counter | Checksum
37 | 36:5 | 4:0

We can already apply multithreading at a coarse-grained level for this pipeline. We switch the processor context if latency-causing instructions (e.g. branches) are encountered. Branch instructions within instruction streams will lead to the storage of the checksum and to the selection of another instruction stream. If the second checksum entry is created with the same PC, the checksums are compared. If the entry is not found, it is assumed that a fault corrupted one of the PCs and an error is signaled. If the checksums are equal, either no fault occurred or the checksums were changed by a transient fault in a way that both checksums are now equal. If the checksums are not equal, an error will be signaled. From the calculation of checksum parts concerning a single stage in Figure 1, we see that different generator polynomials g(x), h(x) can be applied to compute single checksum bits (Figure 2). For a cost-effective implementation we considered only one generator polynomial for all pipeline stages.

Figure 2. Checksum calculation in a stage

In fact, we have a two-level polynomial scheme applied if we regard different feedback stages. The implementation of error correcting codes checking each latch in a pipeline stage is possible, but was omitted due to the high additional power consumption and performance loss.
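The checksum memory entry of Table 1 and the comparison rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field widths follow Table 1 (thread-ID in bit 37, program counter in bits 36:5, checksum in bits 4:0), while the function names `pack_entry`, `unpack_entry` and `error_signaled`, and the modeling of the checksum memory as a plain list of packed integers, are hypothetical and not part of the paper.

```python
from __future__ import annotations

CHECKSUM_BITS = 5                        # bits 4:0
PC_BITS = 32                             # bits 36:5
PC_SHIFT = CHECKSUM_BITS
THREAD_SHIFT = CHECKSUM_BITS + PC_BITS   # bit 37


def pack_entry(thread_id: int, pc: int, checksum: int) -> int:
    """Pack one checksum memory entry into a 38-bit integer."""
    if thread_id not in (0, 1):
        raise ValueError("thread-ID must fit in one bit")
    if not 0 <= pc < (1 << PC_BITS):
        raise ValueError("program counter must fit in 32 bits")
    if not 0 <= checksum < (1 << CHECKSUM_BITS):
        raise ValueError("checksum must fit in 5 bits")
    return (thread_id << THREAD_SHIFT) | (pc << PC_SHIFT) | checksum


def unpack_entry(entry: int) -> tuple[int, int, int]:
    """Return (thread_id, pc, checksum) of a packed entry."""
    checksum = entry & ((1 << CHECKSUM_BITS) - 1)
    pc = (entry >> PC_SHIFT) & ((1 << PC_BITS) - 1)
    thread_id = (entry >> THREAD_SHIFT) & 1
    return thread_id, pc, checksum


def error_signaled(stored_entries: list[int], new_entry: int) -> bool:
    """Apply the comparison rule to the entry of the second stream.

    An error is signaled if no stored entry has the same PC (a PC is
    assumed to be corrupted) or if the checksums for that PC differ.
    """
    _, new_pc, new_sum = unpack_entry(new_entry)
    for entry in stored_entries:
        _, pc, checksum = unpack_entry(entry)
        if pc == new_pc:
            return checksum != new_sum
    return True


first_stream = [pack_entry(0, 0x1000, 0b10101)]
print(error_signaled(first_stream, pack_entry(1, 0x1000, 0b10101)))  # False
print(error_signaled(first_stream, pack_entry(1, 0x1000, 0b10100)))  # True
print(error_signaled(first_stream, pack_entry(1, 0x1004, 0b10101)))  # True
```

As the text notes, equal checksums do not prove a fault-free run: a transient fault can alter the checksums so that they match again, and this sketch reports no error in that case as well.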
Parity computation for each pipeline latch will also affect performance, since we have to build the parity for all signals from the latches of a pipeline stage. This number can be quite large (e.g. signals from the microcode), so we have to build fan-in trees to compensate for fan-in effects, leading to a slowdown. We also consider the contents of out-of-order pipeline stages to be a part of the checksum. This complicates the situation from Figure 1, since dynamic scheduling will lead to different parts of the control and data stream exiting the execution stage at different times. If the fetch and execution policy operates on a cycle-by-cycle basis, we can realize this part easily if we choose the generator polynomial in a way that no feedback affects the out-of-order stages. Since XOR is an associative and commutative operation, dynamic execution will not affect the checksum. For flexibility we want to use any fetch and execution policy, so we cannot use the scheme from Figure 1 without modification. The problem is to resolve the time dependency of instructions in the out-of-order stage. The solution is based on a two-level scheme. Two checksums are calculated separately for the out-of-order and the other stages. Figure 3 shows the checksum calculation including the out-of-order stage.

Figure 3. Out-of-order checksum calculation

For clarity, the control paths are marked with dotted lines. Both checksums are stored in FIFO buffers. Results from the execution stages are XORed until checksum enable is set. Then both checksums will be XORed. So we shifted the time dependence to the last stage. The scheme will not increase the costs except for the XORs in the last stage. Since this applies to all following schemes, we will not mention the costs for the final stage explicitly.

5. Cost analysis

The number of XOR gates per stage is equal to $s_{\max}$. Typically it is $s_{\max} = \max\{s_1, \ldots, s_p\}$, where $s_i$ is the number of (control) wires from stage i to stage i+1.
Let p be the number of pipeline stages, t the number of threads, b the width of instructions and d the depth of the FIFO buffer from stage to stage for each instruction stream. To get an upper estimate for the costs, we assume $\forall i \in \mathbb{N}: g_i = 1$, i.e. $g(x) = \sum_{i=1}^{p} x^i$, for the generator polynomial. This means that the last stage is connected to all previous stages. For simplicity, we set the instruction width to b = 2*32 bit, the average control path width of pipeline stages to $s_i$ = 64 bit and the FIFO depth to d = 4. The gate costs for the scheme in Figure 1 are relatively low. Using Table 2, they compute to ($\forall i: s_i = 64$):

$$C(\mathrm{PIPECRC}_1(p,d)) = \sum_{i=1}^{p} s_i\,C(\mathrm{XOR}) + \sum_{i=1}^{p+1} d\,s_i\,C(\mathrm{FF}) = 4\sum_{i=1}^{p} s_i + 32\sum_{i=1}^{p+1} s_i = 2304p + 2048.$$

Since the n-to-m switch will be used in the following estimations, we explicitly calculate its costs. An n-to-m switch directs the input x[n-1:0] (width n) to one of m outputs y[n-1:0]. All other m-1 outputs will be set to zero. The output is selected by s[ld(m)-1:0]. For the number of NOT gates, we need the number of zeroes within a binary number of length s. This can be easily calculated recursively if we consider the following: Let s be the number of digits of a binary number. Then $2^s$ binary numbers are possible, $2^s/2$ of them beginning with zero. The remaining zeroes are two times the number of zeroes of the binary numbers with s-1 digits. This is:

$$\mathrm{Zero}_\#(0) = 0;\quad \mathrm{Zero}_\#(1) = 1;\quad \mathrm{Zero}_\#(2) = 4;\quad \mathrm{Zero}_\#(s) = \frac{2^s}{2} + 2\,\mathrm{Zero}_\#(s-1).$$

The solution of this recurrence is:

$$\mathrm{Zero}_\#(s) := 2^s\,\mathrm{Zero}_\#(0) + s\,2^{s-1} = s\,2^{s-1}.$$

Thus, the cost for an n-to-m switch is (m is a power of two):

$$C(\mathrm{dec}(n,m)) = n\left(m\,2^{m-1}\,C(\mathrm{NOT}) + 2^{m}\,C(\mathrm{AND})\right) = n\left(m\,2^{m-1} + 2^{m+1}\right).$$

Analogously, the cost for an m-to-n switch, selecting one signal group x[n-1:0] of width n out of m groups y[n-1:0], is:

$$C(\mathrm{mplex}(n,m)) = C(\mathrm{dec}(n,m)) + n\,C(\mathrm{OR}) = n\left(m\,2^{m-1} + 2^{m+1} + 2\right).$$

Figure 4 shows the signature calculation for a two-way SMT system. It can be seen that hardware costs double (at least) for each hardware thread.
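The zero-count recurrence and the base cost of the scheme in Figure 1 can be checked numerically. The following Python sketch is an illustration only: the function names are hypothetical, and the gate costs (XOR = 4, flip-flop = 8) are the values the paper takes from Table 2. It compares the recurrence with its closed form and with a brute-force count, and confirms that the gate cost evaluates to 2304p + 2048 for $s_i$ = 64 and d = 4.

```python
GATE_COST = {"NOT": 1, "AND": 2, "OR": 2, "XOR": 4, "FF": 8}


def zero_count_recursive(s: int) -> int:
    """Number of zero digits in all binary numbers with s digits (recurrence)."""
    if s == 0:
        return 0
    return 2 ** s // 2 + 2 * zero_count_recursive(s - 1)


def zero_count_closed_form(s: int) -> int:
    """Closed form s * 2**(s-1), written with integer arithmetic."""
    return (s * 2 ** s) // 2


def zero_count_brute_force(s: int) -> int:
    """Count the zero digits directly by enumerating all s-digit numbers."""
    if s == 0:
        return 0
    return sum(format(v, "b").zfill(s).count("0") for v in range(2 ** s))


def pipecrc1_cost(p: int, d: int = 4, width: int = 64) -> int:
    """Gate cost of the single-thread scheme with uniform width s_i = width."""
    xor_cost = sum(width * GATE_COST["XOR"] for _ in range(p))
    ff_cost = sum(d * width * GATE_COST["FF"] for _ in range(p + 1))
    return xor_cost + ff_cost


for s in range(0, 11):
    assert zero_count_recursive(s) == zero_count_closed_form(s)
    assert zero_count_recursive(s) == zero_count_brute_force(s)

for p in range(1, 11):
    assert pipecrc1_cost(p) == 2304 * p + 2048

print(pipecrc1_cost(5))   # 13568 gate-cost units for a 5-stage pipeline
```

The flip-flop term dominates: with d = 4 and a flip-flop cost of 8, each 64-bit stage buffer contributes 2048 units against 256 units for the XOR gates of a stage.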
Activation and propagation signals for the checksums are not shown for clarity. The checksum will be calculated depending on which thread isactive. Each part of the checksum is activated by the thread-ID, indicating which thread is active in a stage.Since the processor is working on the same data and code, the checksums will not be different in the fault-free case. The additional gate cost for a t-way multithreaded pipeline execution scheme in reference to Table 2 calculates to ()()211t ((,))((,4))((64,))((64,))2304p+2048+2128p+32pt+128+32t +128.pi C PIPECRC p t t C PIPECRC p C dec t C mplex t t ==⋅++=⋅⋅⋅⋅∑Table 2. Gate cost and delay (from [26])Gate Cost Delay NOT 1 1 NAND/NOR 2 1 AND/ OR 2 1 XOR/ XNOR 4 2 Flip-Flop (FF) 8 4Figure 4. Checksum calculation for two threads To compare the calculated checksums in a multithreaded system, t context switches have to occur (t is equal to the number of hardware threads). If the execution was fault-free, the same number of instructions has been executed. Then all FIFOs will have the same contents. Transient faults in the checksum mechanism will lead to different checksums and to a detection of the error. If instructions are pre-decoded, a branch - the criteria for a context change - can easily be recognized. At this point, instructions of other threads may be in the pipeline. We will have to wait for these instructions to exit the pipeline to compute the check-sum. To do this, we use a change in the thread IDs in the last stage to initiate a checksum comparison (checksum en-able ). Additionally to the scheme presented in Figure 4, the scheme in Figure 5 tries to save XOR-gates, since this num-ber can be quite large.The thread IDs in Figure 4 and Figure 5 will assign a part of an instruction in a stage to a signature. Therefore faultythread IDs will be detected, because the wrong signaturewill be selected. Then instruction streams will have differ-ent contents and lengths.Figure 5. 
Extended checksum calculationThe costs for this kind of checksum calculation compute to: ()1311-1-1((,))((64,))64()4((64,))2((64,))64()12826421282048(1)(38421922272).p i pi t t t t C PIPECRC p t C mplex t t C FF C dec t C mplex t C XOR t t p p t +===+++⋅+=⋅+⋅++⋅++⋅+⋅⋅+∑∑The contour plot in Figure 6 shows the difference ∆ of thecost functions 3((,))C PIPECRC p t and2((,))C PIPECRC p t for the checksum schemes in Figure 4 and Figure 5. The x-axis shows the number of hardware threads t, the y-axis the number of pipeline stages p and thez-axis the costs. We see that the costs for the scheme fromFigure 4 are always lower than those from Figure 5. Bothschemes are applicable at reasonable costs for pipelines with 5 to 10 stages and a maximal number of 4 threads. Figure 6. Contour plot of cost function ∆6.Fault-coverage analysisFor the software simulation, we generated a random stream of 1000 32 bit instructions, which was used as an input for the modeled processor. In the first experiment we wanted to determine the best polynomial to detect an error. Branches were created with probabilitybranchp, assessing the number of instructions between checksum comparisons. The prob-abilities for a branch in Table 3 were gained from SPEC95 benchmark simulations by using SimpleScalar [25].Table 3. Values for p branch (%) Benchmark Go Ijpeg Compress p branch (%)19.355 15.349 9.463 Benchmark Cc1 Apsi Vortexp branch (%)24.251 22.546 22.931We computed the checksum for a 32 bit instruction stream without fault. Then we simulated transient faults in the sec-ond instruction stream by flipping single, randomly chosen bits at random stages with a fault rate of 10-2. We chose such a high error rate to speed up fault-injection experi-ments. This was done for 1000 fault injection runs. In each fault-injection run transient errors were injected. As model we selected a multithreaded 5-stage pipeline with an inter-nal control-path width of 32 bit from stage to stage. 
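The kind of software fault-injection experiment described here can be sketched in Python. The sketch is a deliberately simplified stand-in and an assumption throughout, not the author's simulator: it computes a CRC signature over whole 32-bit instruction words instead of per-stage pipeline latches, compares and resets both signatures at every branch, and uses hypothetical names (`poly_mod`, `run_experiment`). It is therefore not expected to reproduce the coverage figures of the paper.

```python
import random


def poly_mod(message: int, generator: int) -> int:
    """Remainder of message(x) / generator(x) over GF(2); bit i <-> x**i."""
    if generator <= 0:
        raise ValueError("generator polynomial must be non-zero")
    gen_degree = generator.bit_length() - 1
    while message.bit_length() - 1 >= gen_degree:
        message ^= generator << (message.bit_length() - 1 - gen_degree)
    return message


def run_experiment(generator, n_instructions=1000, p_branch=0.2,
                   fault_rate=1e-2, seed=0):
    """Return (faulty_segments, detected_segments) for one injection run.

    A segment is the instruction sequence between two branches.
    """
    rng = random.Random(seed)
    faulty = detected = 0
    sig_ref = sig_dut = 0            # signatures of the two streams
    segment_has_fault = False
    for _ in range(n_instructions):
        instr = rng.getrandbits(32)
        corrupted = instr
        if rng.random() < fault_rate:
            corrupted ^= 1 << rng.randrange(32)   # inject a single bit flip
            segment_has_fault = True
        # Fold the next instruction word into each running signature.
        sig_ref = poly_mod((sig_ref << 32) ^ instr, generator)
        sig_dut = poly_mod((sig_dut << 32) ^ corrupted, generator)
        if rng.random() < p_branch:               # branch: compare and flush
            if segment_has_fault:
                faulty += 1
                detected += int(sig_ref != sig_dut)
            sig_ref = sig_dut = 0
            segment_has_fault = False
    return faulty, detected


if __name__ == "__main__":
    g = 0b11000  # encodes x^4 + x^3 under the bit-i <-> x^i convention
    faulty, detected = run_experiment(g, seed=1)
    print(f"faulty segments: {faulty}, detected: {detected}")
```

Running the function for several generator values and seeds gives a rough feel for how the generator polynomial and the branch probability influence detection in this reduced model.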
For a worst case study, we assumed that the pipeline will be flushed each time a fault is detected or a branch is encoun-tered. On a branch in the second instruction stream both checksums were compared. Due to its simple design, we chose to simulate the checksum scheme from Figure 4. Figure 7 shows the results for the fault coverage analysis toFigure 7. Fault coverage in %Polynomials are given as numbers in the x-axis, where e.g. ‘28’ represents the polynomial 432()g x x x x=++. The y-axis shows the fault coverage in %. We conclude from Figure 7 that the best fault coverage is achieved by applying the polynomial 43()g x x x=+ (83%). Figure 8 shows the fault coverage in relation to the probability of a branch in % for g(x). We see that the fault coverage is strongly depend-ent from the number of branches. The probability for a branch was chosen to range from 0.2 to 0.0032.Figure 8. Fault coverage-branch relationBut how fast are errors detected? To find an answer, the gained polynomial was used to compute the checksums in the second step of the analysis. p branch was set to the upper average of the values from Table 3 (20%). As the number of branches substantially determines the number of checks, errors will be detected afterbranch22pn≥ executed instruc-tions (two instruction streams generating checksums). Figure 9 shows the experimental results - the latency in cycles to detect an error. Note that ‘Time’ on the x-axis is a non-linear factor, since errors occur randomly. The high latency at the beginning results from the initialization phase of the scheme. Since the pipeline is cleared on every branch, this affects the fault coverage and latency, since a feedback with zero does not result in a checksum with high fault coverage.Figure 9. Latency in cycles to detect an errorThe average number of cycles to detect an error was com-puted to 5.05. 7. 
7. Summary and Conclusion

In this paper we presented a scheme to detect transient errors in the pipeline stages of a microprocessor by fetching from two RAMs with identical code and data contents and calculating a checksum using a generator polynomial. Checksums are compared on every second branch. Since branches occur with an average probability of approximately 20% in the instruction stream, checksums are compared often enough. The worst-case analysis, using generated 32-bit instruction streams for a multithreaded 5-stage pipelined processor with an internal control-path width of 32 bits, showed that an average of 83% of all injected faults can be detected, even at a fault rate of $10^{-2}$. We chose such a high fault rate to speed up the fault injection experiments. Overall, the presented scheme is simple and efficient enough to be integrated into most contemporary microprocessors. It can detect an error very quickly, within an average of 5 cycles. The redundant RAMs can be omitted if the memory is secured against transient faults by using error-correcting codes and the fetch bandwidth is large enough. Future work will comprise a Field Programmable Gate Array implementation and an analysis of power consumption, size and performance.
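As a back-of-the-envelope check of the statement that checksums are compared often enough, the spacing between branches can be estimated. This is a sketch under an added assumption that is not part of the paper: each instruction is a branch independently with probability $p_{\text{branch}}$. The symbols $N$ (instructions up to and including the next branch) and $N_2$ (instructions up to and including the second branch) are introduced here for illustration only.

```latex
\begin{align*}
P(N = k) &= (1 - p_{\text{branch}})^{k-1}\, p_{\text{branch}}, \qquad k = 1, 2, \dots \\
E[N]     &= \frac{1}{p_{\text{branch}}} = \frac{1}{0.2} = 5 \\
E[N_2]   &= \frac{2}{p_{\text{branch}}} = \frac{2}{0.2} = 10
\end{align*}
```

Under the same assumption, the lower end of the simulated range, $p_{\text{branch}} = 0.0032$, gives $E[N] = 1/0.0032 = 312.5$ instructions between branches, which illustrates why the fault coverage depends so strongly on the number of branches.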