Cortex-M4 Core and DSP Instructions
• Thumb-2 Technology
• DSP and SIMD extensions
• Single cycle MAC (Up to 32 x 32 + 64 -> 64)
• Optional single precision FPU
• Integrated configurable NVIC
• Compatible with Cortex-M3
Single-cycle SIMD instructions
SIMD stands for Single Instruction, Multiple Data
Performs several operations simultaneously on groups of 8-bit or 16-bit data
CLASS           INSTRUCTION                             CORTEX-M3   CORTEX-M4
Arithmetic      ALU operation (not PC)                  1           1
                ALU operation to PC                     3           3
                CLZ                                     1           1
                QADD, QDADD, QSUB, QDSUB                n/a         1
                QADD8, QADD16, QSUB8, QSUB16            n/a         1
                QDADD, QDSUB                            n/a         1
                QASX, QSAX, SASX, SSAX                  n/a         1
                SHASX, SHSAX, UHASX, UHSAX              n/a         1
                SADD8, SADD16, SSUB8, SSUB16            n/a         1
                SHADD8, SHADD16, SHSUB8, SHSUB16        n/a         1
                UQADD8, UQADD16, UQSUB8, UQSUB16        n/a         1
                UHADD8, UHADD16, UHSUB8, UHSUB16        n/a         1
                UADD8, UADD16, USUB8, USUB16            n/a         1
                UQASX, UQSAX, USAX, UASX                n/a         1
                UXTAB, UXTAB16, UXTAH                   n/a         1
                USAD8, USADA8                           n/a         1
Multiplication  MUL, MLA                                1-2         1
                MULS, MLAS                              1-2         1
                SMULL, UMULL, SMLAL, UMLAL              5-7         1
                SMULBB, SMULBT, SMULTB, SMULTT          n/a         1
                SMLABB, SMLABT, SMLATB, SMLATT          n/a         1
                SMULWB, SMULWT, SMLAWB, SMLAWT          n/a         1
                SMLALBB, SMLALBT, SMLALTB, SMLALTT      n/a         1
                SMLAD, SMLADX, SMLALD, SMLALDX          n/a         1
                SMLSD, SMLSDX                           n/a         1
                SMLSLD, SMLSLDX                         n/a         1
                SMMLA, SMMLAR, SMMLS, SMMLSR            n/a         1
                SMMUL, SMMULR                           n/a         1
                SMUAD, SMUADX, SMUSD, SMUSDX            n/a         1
                UMAAL                                   n/a         1
Division        SDIV, UDIV                              2-12        2-12
Different Core
                              Cortex-M3          Cortex-M4
Architecture version          v7M                v7ME
Instruction set architecture  Thumb + Thumb-2    Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz                     1.25               1.25
Integrated NVIC               Yes                Yes
Number of interrupts          1-240 + NMI        1-240 + NMI
Interrupt priorities          8-256              8-256
Single-cycle multiply         Yes                Yes
Hardware divide               Yes                Yes
Single-cycle DSP/SIMD         No                 Yes
Floating-point hardware       No                 Yes
Bus protocol                  AHB Lite, APB      AHB Lite, APB
Audio applications
[Figure: audio waveforms on an amplitude axis from -1.5 to 1.5, shown without saturation and with saturation]
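The difference between the two audio cases can be sketched in portable C. This is only an illustration: the function names `add16_wrap` and `add16_sat` are hypothetical, and the saturating version models the per-sample behaviour of a 16-bit saturating add such as one lane of QADD16. It is not code from the slides.

```c
#include <stdint.h>

/* Plain 16-bit add: the result is truncated, so an overflow wraps around.
 * The conversion back to int16_t is implementation-defined in C, but it
 * wraps on the usual two's-complement compilers (assumed here). */
static int16_t add16_wrap(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Saturating 16-bit add: the result is clipped to the int16_t range. */
static int16_t add16_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Mixing two loud samples such as 30000 and 10000 gives 32767 with `add16_sat`, while `add16_wrap` flips the sign of the result, which is heard as a harsh click.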
Control applications
The integral term of a PID controller is accumulated continuously over time. Saturation automatically limits its value and saves several CPU cycles per regulator.
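A minimal sketch of a saturating integrator, assuming a 32-bit fixed-point integral term. The names `qadd32` and `pid_integrate` are hypothetical; `qadd32` emulates in portable C what the QADD instruction does in one cycle, so the range check written out here is the work the hardware removes.

```c
#include <stdint.h>

/* Portable model of a 32-bit saturating add (the behaviour of QADD). */
static int32_t qadd32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Accumulate the error into the integral term without ever wrapping.
 * Returns the updated integral. */
static int32_t pid_integrate(int32_t *integral, int32_t error)
{
    *integral = qadd32(*integral, error);
    return *integral;
}
```

On a Cortex-M4 target the body of `qadd32` would typically be replaced by the CMSIS `__QADD` intrinsic; that substitution is an assumption about the toolchain, not something shown on the slide.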
Cycle counts, one row per entry of the OPERATION and INSTRUCTIONS lists, in the same order:
CM3     CM4
n/a     1
n/a     1
n/a     1
n/a     1
n/a     1
n/a     1
n/a     1
n/a     1
1       1
2       1
5-7     1
5-7     1
n/a     1
n/a     1
n/a     1
All the above operations are single cycle on the Cortex-M4 processor
Microarchitecture
• 3-stage pipeline with branch speculation
• 3x AHB-Lite Bus Interfaces
Configurable for ultra low power
• Deep Sleep Mode, Wakeup Interrupt Controller
• Power down features for Floating Point Unit
OPERATION
16 x 16 = 32
16 x 16 + 32 = 32
16 x 16 + 64 = 64
16 x 32 = 32
(16 x 32) + 32 = 32
(16 x 16) ± (16 x 16) = 32
(16 x 16) ± (16 x 16) + 32 = 32
(16 x 16) ± (16 x 16) + 64 = 64
32 x 32 = 32
32 ± (32 x 32) = 32
32 x 32 = 64
(32 x 32) + 64 = 64
(32 x 32) + 32 + 32 = 64
32 ± (32 x 32) = 32 (upper)
(32 x 32) = 32 (upper)
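The "(upper)" entries keep only the top 32 bits of the 64-bit product. A portable sketch of that behaviour follows, with the function names chosen to mirror the SMMUL and SMMULR mnemonics. It assumes that right-shifting a negative 64-bit value is an arithmetic shift, which is implementation-defined in C but holds on common compilers.

```c
#include <stdint.h>

/* Top 32 bits of the signed 64-bit product (truncating), as in SMMUL. */
static int32_t smmul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> 32);
}

/* Same, but rounded: 0x80000000 is added before the upper word is taken,
 * as in SMMULR. */
static int32_t smmulr(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)((product + 0x80000000LL) >> 32);
}
```

This form is convenient for fixed-point fractional multiplies, where the low word of the product carries only precision that will be discarded anyway.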
Content
An introduction to the Cortex-M4 core
An introduction to some of the DSP instructions
Cortex-M4 processor microarchitecture
ARMv7ME Architecture
SIMD techniques operate with packed data
SIMD description in the CMSIS 2.0 help
Quad 8-bit addition example
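The slide's own illustration did not survive, so the following is a hedged stand-in rather than the original example. It models in portable C what a quad 8-bit addition does to one 32-bit register: four independent byte lanes are added at once, either with wrap-around (the behaviour of UADD8) or with saturation (UQADD8). The lower-case function names are hypothetical.

```c
#include <stdint.h>

/* Four unsigned 8-bit additions in one word; each lane wraps modulo 256
 * and never carries into its neighbour. */
static uint32_t uadd8(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        unsigned shift = 8u * lane;
        uint32_t sum = ((x >> shift) & 0xFFu) + ((y >> shift) & 0xFFu);
        result |= (sum & 0xFFu) << shift;
    }
    return result;
}

/* Same lanes, but each sum is clipped to 255 instead of wrapping. */
static uint32_t uqadd8(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        unsigned shift = 8u * lane;
        uint32_t sum = ((x >> shift) & 0xFFu) + ((y >> shift) & 0xFFu);
        if (sum > 0xFFu) sum = 0xFFu;
        result |= sum << shift;
    }
    return result;
}
```

For example, adding 0x01020304 and 0x10203040 lane by lane gives 0x11223344. The loop here takes many cycles; the point of the SIMD instruction is that the hardware does all four lanes in one.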
Cortex-M4 DSP instructions compared
Cycle counts
Saturated arithmetic
Intrinsically prevents variable overflow by clipping results to the min/max boundaries, which removes the CPU burden of software range checks
Benefits
Ex: dual 16-bit MAC (Result = 16x16 + 16x16 + 32)
Ex: Quad 8-bit SUB / ADD
Benefits
• Parallelizes operations (2x to 4x speed gain)
• Minimizes the number of load/store instructions needed to move data between memory and the register file (2 or 4 values transferred at once) when full 32-bit values are not needed
• Maximizes register file use (1 register holds 2 or 4 values)
Flexible configurations for wider applicability
• Configurable Interrupt Controller (1-240 Interrupts and Priorities)
• Optional Memory Protection Unit
• Optional Debug & Trace
• DSP instructions
• Floating point unit
Instruction example: single cycle MAC
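As a hedged illustration of the widest MAC form, (32 x 32) + 64 = 64, the sketch below models a signed multiply-accumulate long in portable C and uses it in a dot product. The names `smlal` and `dot_product` are hypothetical helpers for this example; the accumulator wraps modulo 2^64 like the hardware register pair, which is done through unsigned arithmetic to avoid signed overflow in C.

```c
#include <stddef.h>
#include <stdint.h>

/* acc + a * b with a 64-bit accumulator, the operation SMLAL performs. */
static int64_t smlal(int64_t acc, int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int64_t)((uint64_t)acc + (uint64_t)product);
}

/* Dot product of two 32-bit vectors: one MAC per element pair. */
static int64_t dot_product(const int32_t *x, const int32_t *y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = smlal(acc, x[i], y[i]);
    }
    return acc;
}
```

A compiler targeting the Cortex-M4 will normally turn the multiply-and-add in `smlal` into a single instruction without any intrinsic, so the loop body costs one MAC plus the loads.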
SIMD operation example
SIMD instructions perform multiple operations in one cycle (A, B, C and D are 16-bit data)
Sum = Sum + (A x C) + (B x D)
[Figure: data widths labelled 32-bit and 64-bit]
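The formula Sum = Sum + (A x C) + (B x D) can be sketched in portable C as a model of the dual 16-bit MAC with a 32-bit accumulator (the behaviour of SMLAD). Which of A/B sits in the low or high halfword is an assumption made here for illustration, `pack16` is a hypothetical helper, and the sketch assumes the sum stays within 32 bits.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (low, then high). */
static uint32_t pack16(int16_t low, int16_t high)
{
    return ((uint32_t)(uint16_t)high << 16) | (uint32_t)(uint16_t)low;
}

/* sum + A*C + B*D, where x holds A (low) and B (high),
 * and y holds C (low) and D (high). */
static int32_t smlad(uint32_t x, uint32_t y, int32_t sum)
{
    /* Conversions to int16_t assume two's-complement wrap-around. */
    int16_t a = (int16_t)(x & 0xFFFFu);
    int16_t b = (int16_t)(x >> 16);
    int16_t c = (int16_t)(y & 0xFFFFu);
    int16_t d = (int16_t)(y >> 16);
    return sum + (int32_t)a * c + (int32_t)b * d;
}
```

With A = 3, B = -2, C = 4, D = 5 and Sum = 10, the result is 10 + 12 - 10 = 12. Both multiplies and both additions correspond to one instruction on the Cortex-M4.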
STM32F4 Core, DSP, FPU & Library
An introduction to the Cortex-M4
An introduction to some DSP instructions
INSTRUCTIONS
SMULBB, SMULBT, SMULTB, SMULTT
SMLABB, SMLABT, SMLATB, SMLATT
SMLALBB, SMLALBT, SMLALTB, SMLALTT
SMULWB, SMULWT
SMLAWB, SMLAWT
SMUAD, SMUADX, SMUSD, SMUSDX
SMLAD, SMLADX, SMLSD, SMLSDX
SMLALD, SMLALDX, SMLSLD, SMLSLDX
MUL
MLA, MLS
SMULL, UMULL
SMLAL, UMLAL
UMAAL
SMMLA, SMMLAR, SMMLS, SMMLSR
SMMUL, SMMULR
Multiplication
Single cycle MAC
Division
Packed data types
• Several instructions operate on "packed" data types
• Byte or halfword quantities are packed into words
• Allows more efficient access to packed structure types
• SIMD instructions can act on packed data
• Instructions are provided to extract and pack data
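What "packing" and "extracting" mean for a 32-bit word can be sketched in portable C. All helper names below are hypothetical and chosen for this illustration. On the Cortex-M4, dedicated pack and extend instructions (for example PKHBT and SXTH) do this work, whereas the sketch uses plain shifts and masks.

```c
#include <stdint.h>

/* Two halfwords in one word: low in bits 0-15, high in bits 16-31. */
static uint32_t pack_halfwords(int16_t low, int16_t high)
{
    return ((uint32_t)(uint16_t)high << 16) | (uint32_t)(uint16_t)low;
}

/* Extract with sign extension (assumes two's-complement conversion). */
static int16_t low_halfword(uint32_t word)
{
    return (int16_t)(word & 0xFFFFu);
}

static int16_t high_halfword(uint32_t word)
{
    return (int16_t)(word >> 16);
}

/* Four bytes in one word: b0 is the least significant lane. */
static uint32_t pack_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Read byte lane 0..3 of a packed word. */
static uint8_t byte_lane(uint32_t word, unsigned lane)
{
    return (uint8_t)(word >> (8u * (lane & 3u)));
}
```

A single 32-bit load of such a word brings two or four operands into a register at once, which is where the reduction in load/store traffic comes from.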
Different Instruction Set
Cortex-M0/M1
• A subset of the Cortex-M3 instructions
Cortex-M3
• 16-bit and 32-bit instructions
• Mixed in the same instruction stream (Thumb-2)
Cortex-M4