APU Parallel Programming
Integrate CPU & GPU in silicon
GPU Compute C++ support
Unified Address Space for CPU and GPU
Unified Memory Controller
User mode scheduling
GPU uses pageable system memory via CPU pointers
Poor
Expert programmers
– C and C++ subsets
– Compute-centric APIs, data types
– Multiple address spaces with explicit data movement
– Specialized work-queue-based structures
– Kernel mode dispatch
THE PROGRAMMER’S GUIDE TO THE APU GALAXY
Phil Rogers, Corporate Fellow AMD
THE OPPORTUNITY WE ARE SEIZING
Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.
Performance:
Up to 29GB/s System Memory Bandwidth
Up to 500 Gflops of Single Precision Compute
7 | The Programmer’s Guide to the APU Galaxy | June 2011
DirectX®
A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era
Enabled by: Moore’s Law, Voltage Scaling
Constrained by: Power, Complexity
Programming: Assembly, C/C++, Java …
Multi-Core Era
Enabled by: Moore’s Law, SMP architecture
Constrained by: Power, Parallel SW, Scalability
Programming: pthreads, OpenMP / TBB …
Heterogeneous Systems Era
Enabled by: Abundant data parallelism, Power efficient GPUs
Temporarily Constrained by: Programming models, Communication overhead
Programming: Shader, CUDA, OpenCL !!!
[Figure: one performance-over-time curve per era, marked "we are here"; recovered y-axis label: Modern Application Performance; a "?" annotation appears on the chart.]
AMD Fusion System Architecture
– Roadmap
– Software evolution
– A visual view of the new command and data flow
Discrete-class DirectX® 11 performance
80 Stream Processors
3rd Generation Unified Video Decoder
PCIe® Gen2
Single-channel DDR3 @ 1066
18W TDP
FUSION SYSTEM ARCHITECTURE – AN OPEN PLATFORM
Open Architecture, published specifications
– FSAIL virtual ISA
– FSA memory model
– FSA dispatch
ISA agnostic for both CPU and GPU
FSA FEATURE ROADMAP
Physical Integration
Optimized Platforms
Architectural Integration
System Integration
GPU compute context switch
GPU graphics pre-emption
LOW POWER E-SERIES AMD FUSION APU: “ZACATE”
E-Series APU
2 x86 “Bobcat” CPU cores
Array of Radeon™ Cores
Quality of Service
Common Manufacturing Technology
Bi-Directional Power Management between CPU and GPU
Fully coherent memory between CPU & GPU
Extend to Discrete GPU
MAINSTREAM A-SERIES AMD FUSION APU: “LLANO”
A-Series APU
Up to four x86 CPU cores
AMD Turbo CORE frequency acceleration
OUTLINE
The APU today and its programming environment
The future of the heterogeneous platform
Mainstream programmers
– Full C++
– GPU as a co-processor
– Unified coherent address space
– Task parallel runtimes
– Nested Data Parallel programs
– User mode dispatch
– Pre-emption and context switching
CUDA™, Brook+, etc
See Herb Sutter’s Keynote tomorrow for a cool example of plans for the architected era!
2002 - 2008
2009 - 2011
2012 - 2020
3rd Generation Unified Video Decoder
PCIe® Gen2
Single-channel DDR3 @ 1066
6W TDP w/ Local Hardware Thermal Control
Performance:
Up to 8.5GB/s System Memory Bandwidth
Suitable for sealed, passively cooled designs
EVOLUTION OF HETEROGENEOUS COMPUTING
Architecture Maturity & Programmer Accessibility (axis; top label: Excellent)
Standards Drivers Era
– OpenCL™, DirectCompute
– Driver-based APIs
Architected Era
– AMD Fusion System Architecture
– GPU Peer Processor
APU: ACCELERATED PROCESSING UNIT
The APU has arrived and it is a great advance over previous platforms
Combines scalar processing on CPU with parallel processing on the GPU and high bandwidth access to memory
How do we make it even better going forward?
– Easier to program
– Easier to optimize
– Easier to load balance
– Higher performance
– Lower power
[Figure axes, "A New Era of Processor Performance": Single-thread Performance over Time; Throughput Performance over Time (# of processors); Time (Data-parallel exploitation), marked "we are here".]
Proprietary Drivers Era
– Graphics & Proprietary Driver-based APIs
“Adventurous” programmers
– Exploit early programmable “shader cores” in the GPU
– Make your program look like “graphics” to the GPU
TABLET Z-SERIES AMD FUSION APU: “DESNA”
Z-Series APU
2 x86 “Bobcat” CPU cores
Array of Radeon™ Cores
Discrete-class DirectX® 11 performance
80 Stream Processors
Array of Radeon™ Cores
Discrete-class DirectX® 11 performance
3rd Generation Unified Video Decoder
Blu-ray 3D stereoscopic display
PCIe® Gen2
Dual-channel DDR3
45W TDP
Performance:
Up to 8.5GB/s System Memory Bandwidth
Up to 90 Gflops of Single Precision Compute
COMMITTED TO OPEN STANDARDS
AMD drives open and de-facto standards
– Compete on the best implementation
Open standards are the basis for large ecosystems
Open standards always win over time
– SW developers want their applications to run on multiple platforms from multiple hardware vendors