GPU.Programming.Guide(FULL)
GPU_Computing_Guide_2015
CST STUDIO SUITE® 2015 — GPU Computing Guide

Contents recoverable from this extract:
• Solvers and Features
• Hardware: NVIDIA Tesla K20X (for servers; 2013 release), NVIDIA Quadro K6000, NVIDIA Tesla K80 (for servers; 2014, SP6)
• NVIDIA drivers download and installation: run the installer executable (the screen may turn black momentarily); if Windows warns that the driver has not passed Windows Logo testing, continue anyway; if a driver was previously installed, the InstallShield wizard removes it first; on Linux, note that nvidia-settings cannot run on a system without a graphics driver, even if an X server is installed
• Correct installation of GPU hardware and NVIDIA drivers (see Figure 2: the "Add or Remove Programs" dialog on Windows); uninstall procedure on Linux via the "--uninstall" option
• GPU computing simulations in batch mode
• Guidelines: switching off the ECC feature via the NVIDIA Control Panel (Figure 3: switch off the ECC feature for all Tesla cards); enabling Tesla Compute Cluster (TCC) mode on Windows; Exclusive Mode
• Remote computing and GPU computing: on recent Windows versions the CST STUDIO SUITE solver server runs by default under the Local System account (Figure 4); depending on the license, GPU computing using RDP is possible only for certain card combinations — note that in RDP sessions some server cards can't be used
• Running multiple simulations at the same time; assignment of available GPU cards (Figure 5: assignment of GPUs to specific DC Solver Servers)
• NVIDIA GPU Boost: GPU Boost™ is a feature available on recent NVIDIA Tesla products. The feature takes advantage of any power and thermal headroom to boost performance by increasing the GPU core and memory clock rates. Tesla GPUs are designed to a specific Thermal Design Power (TDP); HPC workloads frequently do not come close to reaching this power limit and therefore have power headroom.
• If you have tried the steps above with no success, please contact CST technical support.
NVIDIA Developer Toolkit User Manual
NVIDIA empowers real-time developers with the latest graphics technology by providing the NVIDIA® Developer Toolkit, a suite of cutting-edge content creation, code integration, performance analysis, and educational resources. Packed with powerful solutions to challenging problems, the Toolkit will help you create more compelling applications for the latest graphics hardware. Several components of the Toolkit have won Game Developer magazine's prestigious Front Line Awards, which recognize tools that enable faster and more efficient game creation for advancing the state of the art.

NVPerfHUD 4
Analyze your application like an NVIDIA engineer.
• Performance Dashboard: Identify high-level bottlenecks with graphs and directed experiments.
• Debug Console: Review Microsoft® DirectX® Debug runtime messages, NVPerfHUD warnings, and custom messages.
• Frame Debugger: Freeze the current frame and step through your scene one draw call at a time, dissecting them with advanced State Inspectors.
• Frame Profiler: Automatically identify your most expensive rendering states and draw calls and improve performance with detailed GPU instrumentation data.

NVPerfKit
NVPerfKit is a suite of performance tools that gives you access to low-level performance counters inside the driver and hardware counters inside the GPU itself. The performance counters are available to OpenGL® and DirectX applications via the Windows Performance Data Helper (PDH) interface and the NVIDIA Plug-in for Microsoft® PIX for Windows®. The performance counters can be used to determine exactly how your application is using the GPU, identify bottlenecks, and confirm that performance problems have been resolved. Now, for the first time ever, this information is available to third-party developers. NVPerfKit includes:
• NVPerfHUD 4
• NVPerfSDK: an API for accessing GPU signals in your applications, with code samples for OpenGL and Direct3D
• NVIDIA Plug-in for Microsoft PIX for Windows: access GPU performance counters in PIX
• GLExpert: debug OpenGL API usage errors and performance issues
• gDEBugger trial version: profile and debug OpenGL applications

(Figures: NVPerfHUD's brand-new Frame Profiler and improved Performance Dashboard. 3DMark06 used with permission from Futuremark Corporation.)

FX Composer
An integrated shader development environment with unique real-time preview and optimization features.
• CREATE your shaders in a high-powered development environment.
• DEBUG your shaders with visual shader debugging features.
• TUNE your shader performance with advanced analysis and optimization.

NVIDIA Melody
Create high-quality normal maps that make a low-poly model look like a high-poly model. Also create height maps, ambient occlusion maps, and more.

NVIDIA SDK
Our award-winning SDK includes hundreds of code samples, effects, and libraries to help you take advantage of the latest in graphics technology.
• Compelling techniques for DirectX and OpenGL.
• Powerful effects ready to use in your application.
• A convenient browser enables quick searching for what you need.

NVIDIA Texture Tools & Photoshop Plug-ins
This comprehensive set of command-line tools and plug-ins gives artists full control.
• Generate normal maps from height fields
• Texture compression with visual preview
• Powerful mipmap creation and mipmap level editing
• Helpful "Mipster" and "Cube Map Shuffler" automation scripts for Adobe Photoshop®

Additional Resources
The NVIDIA Developer Toolkit also includes:
• Cg Toolkit 1.5 Alpha
• NVIDIA Scene Graph SDK
• Training videos
• Conference presentations
• GPU Programming Guide
• Utilities, demos, drivers, and more

Become a Registered Developer Today
Join our FREE registered developer program for early access to NVIDIA drivers, developer tools, online forums, direct bug reporting, and more…
NVIDIA Virtual GPU (vGPU) Software Product Overview
Revision History
Date        Version  Author     Description
2022.10.27  V1.0     Leon Wang  Overview
2022.11.28  V1.5     Leon Wang  Installation and deployment; vGPU configuration and installation manual: vWS 14.3 + vSphere 7.0.3 + vCenter 7.0.3

1. NVIDIA vGPU Overview
1.1 What is NVIDIA vGPU
NVIDIA virtual GPU (vGPU) software delivers powerful GPU performance for workloads ranging from graphics-rich virtual workstations to data science and AI, enabling IT to leverage the management and security benefits of virtualization along with the NVIDIA GPU performance that modern workloads demand.
NVIDIA vGPU software is installed on a physical GPU in a cloud or enterprise data center server and creates virtual GPUs that can be shared across multiple virtual machines, which can be accessed anywhere, anytime, from any device.
1.2 NVIDIA vGPU Software Product Editions

1.2.1 NVIDIA Virtual Compute Server (vCS)
vCS supports CUDA only, and the guest OS must be Linux.
It accelerates virtualized AI compute workloads on KVM-based infrastructure.
For infrastructure based on VMware vSphere, refer to the NVIDIA AI Enterprise software suite.
AI, deep learning, and data science workflows demand exceptional compute power.
With the latest NVIDIA data center GPUs, including the NVIDIA A30 Tensor Core GPU, NVIDIA Virtual Compute Server (vCS) helps data centers accelerate server virtualization so that the most compute-intensive workloads, such as artificial intelligence, deep learning, and data science, can run in virtual machines (VMs) powered by NVIDIA vGPU technology.

1.2.2 NVIDIA RTX Virtual Workstation (vWS)
vWS supports graphics APIs such as DirectX, OpenGL, and Vulkan, along with the CUDA and OpenCL compute APIs. It provides virtual workstations for creative and technical professionals who use graphics applications.
Math Libraries on GPUs
A math library on the GPU (graphics processing unit) is a numerical computing library optimized for graphics processors and used to accelerate numerical and scientific computation.

These libraries typically cover linear algebra, matrix operations, vector operations, Fourier transforms, random number generation, and more, and they exploit the GPU's parallel compute capability and high-bandwidth memory to accelerate mathematical operations.

A common example is NVIDIA's CUDA (Compute Unified Device Architecture), which provides a rich set of math libraries such as cuBLAS (basic linear algebra subprograms), cuFFT (fast Fourier transform), and cuRAND (random number generation), along with GPU-optimized linear algebra and matrix routines.
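To make this concrete, here is a minimal, hedged sketch (not from the original text) that uses cuBLAS to run a SAXPY (y = alpha*x + y) on the GPU; the array size and values are arbitrary placeholders, and it would be compiled with something like `nvcc saxpy.cu -lcublas`.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1 << 20;                 /* one million elements (arbitrary) */
    const float alpha = 2.0f;
    float *h_x = (float *)malloc(n * sizeof(float));
    float *h_y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    /* Move the operands into GPU memory. */
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    /* y = alpha * x + y, executed in parallel on the GPU by cuBLAS. */
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 4.0)\n", h_y[0]);

    cudaFree(d_x); cudaFree(d_y); free(h_x); free(h_y);
    return 0;
}
```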
In addition, AMD's ROCm (Radeon Open Compute platform) provides math libraries optimized for AMD GPUs, including the ROCm math libraries (rocBLAS, rocFFT, and others) and MIOpen (a deep learning acceleration library), supporting applications such as deep learning, scientific computing, and numerical simulation.

Beyond the vendor-supplied libraries, there are also open standards for GPU math computing, such as OpenCL (Open Computing Language) and SYCL (a C++ single-instruction, multiple-data programming model); they provide cross-platform GPU compute capability and support a variety of hardware architectures.
Overall, GPU math libraries fully exploit the parallel compute power of graphics processors to accelerate numerical and scientific computation, providing strong support for deep learning, artificial intelligence, big-data analytics, and other fields.

When choosing a library, consider hardware compatibility, performance optimization, feature richness, and similar factors so as to meet the needs of the specific application.
GPU_Computing_Guide
CST STUDIO SUITE™ 2010 — GPU Computing Guide

2 Technical Requirements
This section provides the requirements necessary to successfully set up GPU computing. Please ensure that your system meets these points to avoid problems during the setup of your system.

2.1 Supported Hardware
GPU hardware currently supported for GPU computing (workstation recommendations):
• NVIDIA Quadro FX 5800
• NVIDIA Tesla C1060
• NVIDIA Quadro Plex 2200 D2
• NVIDIA Tesla M1060 (no display link): 4 GB GDDR3 per M1060, approx. 40 million mesh cells

2.2 Supported Drivers — Download and Installation
Note that a Microsoft Remote Desktop session does not expose the GPU to the remote computer; use the TCC driver listed in the table from the original guide and reboot your computer. The output of HwAccDiagnostics_AMD64.exe confirms the result.

Windows: run the installer executable. The NVIDIA InstallShield Wizard welcome screen will begin (the screen may turn black momentarily). If a warning appears that the hardware has not passed Windows Logo testing (Figure 4), select "Continue Anyway." Select "I want to restart my computer now" and click Finish. It is recommended that you then run HwAccDiagnostics_AMD64.exe, which can be found in the installation folder, to confirm that the hardware has been set up correctly.

Linux: install the driver as root. If the installation has problems, try running the installer in a terminal without the X server running.

Uninstalling NVIDIA drivers: use the "Add or Remove Programs" dialog on Windows; on Linux, use the "--uninstall" option (this requires root permissions).

GPU Computing: enable "GPU Computing" and specify how many GPUs to use. Note that the maximum number of GPUs available is limited by the number of tokens in your license.

Batch mode: when running in batch mode (e.g., via an external job queue), the -withGPU switch can be used to enable GPU computing. The switch is used as follows:
"CST DESIGN ENVIRONMENT.exe" -m -r -withGPU="<NUMBER OF GPUs>"

Guidelines (fragmentary in this extract): 64-bit (x64) operating system combinations required for the desktop simulation to function properly (Windows XP/Vista/7); remote GPU computing via UltraVNC on Windows; the solver service runs under the local system account by default; support for multiple simulations; driver and video notes; conditions, features, and changes.
Multi-core and GPU Programming: Tools, Methods, and Implementations
Multi-core CPU and GPU programming involves the following tools, methods, and implementations:

1. Parallel software libraries: for multi-core CPU programming, parallel libraries provide common algorithms, optimized code, and control functions; common examples include Intel TBB, OpenMP, and pthreads. For GPU programming, library frameworks such as CUDA and OpenCL likewise help us write efficient GPU code.

2. Assembly language: assembly is the low-level language that programmers can sometimes use directly; with x86, ARM, and similar instruction sets, multi-core processing can be handled effectively.

3. Compilers and runtime libraries: compilers operate on code to compile, optimize, and package it into executables; common compilers include Intel's, Microsoft's, and GCC. Runtime libraries provide common functions and interfaces so that users can customize and tune their programs; common examples include Intel MKL (Math Kernel Library) and Intel IPP (Integrated Performance Primitives).

4. Frameworks and APIs: recently, many open frameworks and APIs have been proposed to support multi-core and GPU programming in practice; common frameworks include Intel Cilk Plus, ARM AcC, and NVIDIA CUDA AS, while commonly used APIs include OpenCL and CUDA, which let users develop custom code against the GPU's instruction set and operations.

In short, the tools, methods, and implementations for multi-core and GPU programming are numerous, and the above is only a subset; only by choosing the right tools and methods can you achieve well-performing multi-core and GPU code.
NVIDIA Data Center GPU Driver version 450.80.02 (Linux) / 452.39 (Windows)
Release Notes

Table of Contents
Chapter 1. Version Highlights
  1.1. Software Versions
  1.2. Fixed Issues
  1.3. Known Issues
Chapter 2. Virtualization
Chapter 3. Hardware and Software Support

Chapter 1. Version Highlights
This section provides highlights of the NVIDIA Data Center GPU R450 Driver (version 450.80.02 Linux and 452.39 Windows). For changes related to the 450 release of the NVIDIA display driver, review the file "NVIDIA_Changelog" available in the .run installer packages.
Driver release date: 09/30/2020

1.1. Software Versions
07/28/2020: For this release, the software versions are listed below.
‣ CUDA Toolkit 11: 11.03. Note that starting with CUDA 11, individual components of the toolkit are versioned independently. For a full list of the individual versioned components (e.g. nvcc, CUDA libraries, etc.), see the CUDA Toolkit Release Notes.
‣ NVIDIA Data Center GPU Driver: 450.80.02 (Linux) / 452.39 (Windows)
‣ Fabric Manager: 450.80.02 (use nv-fabricmanager -v)
‣ GPU VBIOS:
  ‣ 92.00.19.00.01 (NVIDIA A100 SKU200 with heatsink, for HGX A100 8-way and 4-way)
  ‣ 92.00.19.00.02 (NVIDIA A100 SKU202 without heatsink, for HGX A100 4-way)
‣ NVSwitch VBIOS: 92.10.14.00.01
‣ NVFlash: 5.641
Due to a revision lock between the VBIOS and driver, VBIOS versions >= 92.00.18.00.00 must use corresponding drivers >= 450.36.01. Older VBIOS versions will work with newer drivers. For more information on getting started with the NVIDIA Fabric Manager on NVSwitch-based systems (for example, HGX A100), refer to the Fabric Manager User Guide.

1.2. Fixed Issues
‣ Various security issues were addressed. For additional details on the med-high severity issues, review NVIDIA Security Bulletin 5075.
‣ Fixed an issue where using the CUDA_VISIBLE_DEVICES environment variable to restrict devices seen by CUDA on a multi-GPU A100 system (such as DGX A100 or HGX A100) may cause an out-of-memory error for some workloads, for example when running with CUDA IPC.
‣ Fixed an issue with ECC DBE handling on A100 resulting in an incorrect part of GPU memory being retired. The faulty memory would continue to be available even after resetting the GPU/rebooting the system, and hitting the same DBE every time could make the GPU unusable.

1.3. Known Issues
General
‣ By default, Fabric Manager runs as a systemd service. If using DAEMONIZE=0 in the Fabric Manager configuration file, then the following steps may be required:
  1. Disable the FM service from auto-starting (systemctl disable nvidia-fabricmanager).
  2. Once the system is booted, manually start the FM process (/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg). Note that since the process is not a daemon, the SSH/shell prompt will not be returned (use another SSH shell for other activities or run FM as a background task).
‣ There is a known issue with cross-socket GPU to GPU memory consistency that is currently under investigation.
‣ When starting the Fabric Manager service, the following error may be reported: "detected NVSwitch non-fatal error 10003 on NVSwitch pci". This error is not fatal and no functionality is affected.
This issue will be resolved in a future driver release.
‣ On NVSwitch systems with Windows Server 2019 in shared NVSwitch virtualization mode, the host may hang or crash when a GPU is disabled in the guest VM. This issue is under investigation.
‣ In some cases, after a system reboot, the first run of nvidia-smi shows an ERR! for the power status of a GPU in a multi-GPU A100 system. This issue is not observed when running with persistence mode enabled.

GPU Performance Counters
The use of developer tools from NVIDIA that access various performance counters requires administrator privileges. For example, reading NVLink utilization metrics from nvidia-smi (nvidia-smi nvlink -g 0) requires administrator privileges.

NoScanout Mode
NoScanout mode is no longer supported on NVIDIA Data Center GPU products. If NoScanout mode was previously used, then the following line in the "Screen" section of /etc/X11/xorg.conf should be removed to ensure that X server starts on data center products:
Option "UseDisplayDevice" "None"
NVIDIA Data Center GPU products now support one display of up to 4K resolution.

Unified Memory Support
Some Unified Memory APIs (for example, CPU page faults) are not supported on Windows in this version of the driver. Review the CUDA Programming Guide for the system requirements for Unified Memory.
CUDA and unified memory are not supported when used with Linux power management states S3/S4.

IPMI FRU for Volta GPUs
The driver does not support the IPMI FRU multi-record information structure for NVLink. See the Design Guide for Tesla P100 and Tesla V100-SXM2 for more information.

Video Memory Support
For Windows 7 64-bit, this driver recognizes up to the total available video memory on data center cards for Direct3D and OpenGL applications.
For Windows 7 32-bit, this driver recognizes only up to 4 GB of video memory on data center cards for DirectX, OpenGL, and CUDA applications.

Experimental OpenCL Features
Select features in OpenCL 2.0 are available in the driver for evaluation purposes only. The following are the features, as well as a description of known issues with these features in the driver:

Device-side enqueue
‣ The current implementation is limited to 64-bit platforms only.
‣ OpenCL 2.0 allows kernels to be enqueued with global_work_size larger than the compute capability of the NVIDIA GPU. The current implementation supports only combinations of global_work_size and local_work_size that are within the compute capability of the NVIDIA GPU. The maximum supported CUDA grid and block size of NVIDIA GPUs is available at /cuda/cuda-c-programming-guide/index.html#computecapabilities. For a given grid dimension, the global_work_size can be determined by CUDA grid size x CUDA block size.
‣ For executing kernels (whether from the host or the device), OpenCL 2.0 supports non-uniform ND-ranges where global_work_size does not need to be divisible by the local_work_size.
This capability is not yet supported in the NVIDIA driver, and is therefore not supported for device-side kernel enqueues.

Shared virtual memory
‣ The current implementation of shared virtual memory is limited to 64-bit platforms only.

Chapter 2. Virtualization
To make use of GPU passthrough with virtual machines running Windows and Linux, the hardware platform must support the following features:
‣ A CPU with hardware-assisted instruction set virtualization: Intel VT-x or AMD-V.
‣ Platform support for I/O DMA remapping.
‣ On Intel platforms the DMA remapper technology is called Intel VT-d.
‣ On AMD platforms it is called AMD IOMMU.
Support for these features varies by processor family, product, and system, and should be verified at the manufacturer's website.

Supported Hypervisors
The following hypervisors are supported:
Tesla products now support one display of up to 4K resolution.

Supported Graphics Cards
The following GPUs are supported for device passthrough:

Chapter 3. Hardware and Software Support
Support for these features varies by processor family, product, and system, and should be verified at the manufacturer's website.

Supported Operating Systems for NVIDIA Data Center GPUs
The Release 450 driver is supported on the following operating systems:
‣ Windows x86_64 operating systems:
  ‣ Microsoft Windows® Server 2019
  ‣ Microsoft Windows® Server 2016
  ‣ Microsoft Windows® 10
‣ The table below summarizes the supported Linux 64-bit distributions. For a complete list of distributions and kernel versions supported, see the CUDA Linux System Requirements documentation. Note that SUSE Linux Enterprise Server (SLES) 15.1 is provided as a preview for Arm64 server since there are known issues when running some CUDA applications related to dependencies on glibc 2.27.

Supported Operating Systems and CPU Configurations for HGX A100
The Release 450 driver is validated with HGX A100 on the following operating systems and CPU configurations:
‣ Linux 64-bit distributions:
  ‣ Red Hat Enterprise Linux 8.1 (in 4/8/16-GPU configurations)
  ‣ CentOS Linux 7.7 (in 4/8/16-GPU configurations)
  ‣ Ubuntu 18.04.4 LTS (in 4/8/16-GPU configurations)
  ‣ SUSE SLES 15.1 (in 4/8/16-GPU configurations)
‣ Windows 64-bit distributions:
  ‣ Windows Server 2019 (in 4/8/16-GPU configurations)
‣ CPU configurations:
  ‣ AMD Rome in PCIe Gen4 mode
  ‣ Intel Skylake/Cascade Lake (4-socket) in PCIe Gen3 mode

Supported Virtualization Configurations
The Release 450 driver is validated with HGX A100 on the following configurations:
‣ Passthrough (full visibility of GPUs and NVSwitches to guest VMs):
  ‣ 8-GPU configurations with Ubuntu 18.04.4 LTS
‣ Shared NVSwitch (guest VMs only have visibility of GPUs and full NVLink bandwidth between GPUs in the same guest VM):
  ‣ 16-GPU configurations with Ubuntu 18.04.4 LTS

API Support
This release supports the following APIs:
‣ NVIDIA® CUDA® 11.0 for NVIDIA® Kepler™, Maxwell™, Pascal™, Volta™, Turing™ and NVIDIA Ampere architecture GPUs
‣ OpenGL® 4.5
‣ Vulkan® 1.1
‣ DirectX 11
‣ DirectX 12 (Windows 10)
‣ Open Computing Language (OpenCL™ software) 1.2
Note that for using graphics APIs on Windows (i.e. OpenGL, Vulkan, DirectX 11 and DirectX 12) or any WDDM 2.0+ based functionality on Tesla GPUs, vGPU is required. See the vGPU documentation for more information.

Supported NVIDIA Data Center GPUs
The NVIDIA Data Center GPU driver package is designed for systems that have one or more Tesla products installed.
This release of the driver supports CUDA C/C++ applications and libraries that rely on the CUDA C Runtime and/or CUDA Driver API.
How to Program the GPU for General-Purpose Computing Tasks

As the programmability and performance of modern graphics processing units (GPUs) have improved, application developers have hoped that graphics hardware could take on the compute-intensive tasks that previously only general-purpose CPUs could complete.
Although general-purpose GPU computing is promising, traditional graphics APIs still abstract the GPU as a renderer of textures, triangles, and pixels.

Finding a mapping algorithm that uses these primitives is not a simple task, even for the most advanced graphics developers.

Fortunately, GPU-based computing is conceptually easy to understand, and a variety of high-level languages and software tools now simplify GPU programming.

However, developers must first understand how the GPU works when rendering images before they can identify the components usable for computation.
When rendering an image, the GPU first receives geometry data from the host system in the form of triangle vertices.

These vertex data are processed by a programmable vertex processor, which can perform geometric transformations, lighting calculations, and any other per-triangle computation.

Next, a fixed-function rasterizer converts the triangles into the individual "fragments" displayed on screen.

Before display, each fragment passes through a programmable fragment processor that computes its final color value.
Figure 1: A simple Brook code example that adds two vectors.

Brook supports all of C syntax, extended with stream data; streams are stored in GPU memory, and kernel functions also execute on the GPU.
The operations that compute a fragment's color generally include vector math and fetching stored data from "textures," bitmaps that store surface material colors.

The final rendered scene can be shown on an output device, or copied back from GPU memory to the host processor.

The programmable vertex processor and fragment processor provide many of the same capabilities and instruction sets.

However, most GPU programmers use only the fragment processor for general-purpose computing, because it usually delivers better performance and can write its output directly to memory.
A simple example of computing with the fragment processor is adding two vectors.

First, we issue one large triangle containing as many fragments as the vectors have elements.

The resulting fragments are processed by the fragment processor, which executes the code in single-instruction, multiple-data (SIMD) parallel fashion.
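For comparison, the same two-vector addition looks like this as a modern CUDA kernel; this is a hedged sketch assuming a CUDA-capable GPU (CUDA postdates the fragment-processor techniques described above), with names and sizes chosen for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread adds one pair of elements -- the role played by one
// fragment in the fragment-processor formulation above.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit copies also work.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[10] = %f (expected 30.0)\n", c[10]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```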
NVIDIA Ampere GPU Architecture Tuning Guide
Ampere Tuning Guide, Release 12.3
Chapter 1. NVIDIA Ampere GPU Architecture
Chapter 2. CUDA Best Practices
The performance guidelines and best practices described in the CUDA C++ Programming Guide and the CUDA C++ Best Practices Guide apply to all CUDA-capable GPU architectures. Programmers must primarily focus on following those recommendations to achieve the best performance. The high-priority recommendations from those guides are as follows:
▶ Find ways to parallelize sequential code.
▶ Minimize data transfers between the host and the device.
▶ Adjust kernel launch configuration to maximize device utilization.
▶ Ensure global memory accesses are coalesced.
▶ Minimize redundant accesses to global memory whenever possible.
▶ Avoid long sequences of diverged execution by threads within the same warp.
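To make the coalescing recommendation concrete, the hedged sketch below contrasts a coalesced access pattern with a strided one; the kernels are illustrative, not from the guide:

```cuda
// Coalesced: consecutive threads in a warp touch consecutive floats,
// so each warp's loads combine into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so each warp issues many separate transactions and wastes bandwidth.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```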
Books for Understanding CPUs and GPUs
- "GPU Gems"系列:由NVIDIA开发者和研究人员编写的一系列书籍,涵盖了GPU编程的各个方面。
- "CUDA by Example: An Introduction to General-Purpose GPU Programming":由Jason Sanders和Edward Kandrot编写的书籍,介绍了CUDA(Compute Unified Device Architecture)编程模型,它是NVIDIA开发的用于通用GPU编程的平台。
- "Programming Massively Parallel Processors: A Hands-on Approach":由David B. Kirk和Wen-mei W. Hwu编写的书籍,介绍了并行计算和GPU编程的基本原理和技术。
- "OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 4.5":由John Kessenich、Graham Sellers和Dave Shreiner编写的书籍,是学习OpenGL 图形编程的经典指南。
- "GPU Computing Gems Emerald Edition":由Wen-mei W. Hwu编辑的书籍,是关于GPU计算的宝贵资源,收集了来自学术界和工业界的专家分享的实用技巧和经验。
这些书籍可以帮助你深入了解CPU和GPU的工作原理和编程方法,但具体选择哪本书籍还需根据你的需求和兴趣来决定。
NVIDIA CUDA Basics Tutorial
Data Copies
cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
- returns after the copy is complete
- blocks the CPU thread until all bytes have been copied
- doesn't start copying until previous CUDA calls complete
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
Non-blocking memcopies are provided
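Below is a hedged sketch of a non-blocking copy using cudaMemcpyAsync; a stream and pinned (page-locked) host memory are assumed, since both are needed for the copy to overlap with other work:

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost(&h_data, bytes);   // pinned host memory enables async copies
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns to the CPU immediately; the copy proceeds in `stream`.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);

    // ...independent CPU work can overlap with the transfer here...

    cudaStreamSynchronize(stream);    // wait for the copy to finish
    cudaStreamDestroy(stream);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```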
Code Walkthrough 1

Built-in variables: threadIdx, blockIdx, blockDim, gridDim. These can be unique for each grid.

[Figure: a 2D grid of thread blocks — Block (0, 1), Block (1, 1), Block (2, 1), … — where each block, e.g. Block (1, 1), contains threads Thread (0, 0) through Thread (4, 0).]

Parallel code (a kernel) is launched and executed on a device by many threads. Threads are grouped into thread blocks. Parallel code is written for a thread.
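As a hedged illustration of how a thread uses these built-in variables to locate itself in a 2D grid like the one sketched above (the kernel name and dimensions are hypothetical):

```cuda
__global__ void kernel2D(float *data, int width, int height) {
    // Global 2D coordinates of this thread, built from block and thread indices.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] = (float)(x + y);   // each thread writes one element
}

// Launch: a 2D grid of 2D blocks covering a width x height domain.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// kernel2D<<<grid, block>>>(d_data, width, height);
```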
GPU Programming Technology
Geometry stage: from world space to eye space

Each of us observes the world from our own viewpoint, whether the subjective world or the objective one. Likewise, a computer can render objects from only one viewpoint at a time. Games offer free camera movement: what the screen shows changes as the viewpoint changes. This works because the GPU transforms object vertex coordinates from world space to eye space. Eye space takes the camera (the viewpoint) as its origin; together, the view direction, the field of view, and the near and far planes define a frustum-shaped 3D volume called the viewing frustum. The near plane, the smaller rectangle of the frustum, serves as the projection plane, and the far plane is the larger rectangle. All vertex data inside this frustum are visible, while scene data outside it are removed by the viewpoint (frustum culling, also called view-frustum clipping).
The GPU graphics rendering pipeline

Geometry stage

The main work of the geometry stage is transforming 3D vertex coordinates and computing lighting; graphics card specifications often list a hardware unit labeled "T&L," short for Transform & Lighting. Why transform 3D vertices between coordinate spaces? The input to the computer is a set of 3D coordinate points, but what we ultimately need to see is the scene observed from a particular viewpoint (put differently, the 3D points must be displayed on a 2D screen). In general, the GPU performs this conversion for us automatically. The display is two-dimensional; the GPU's job is to draw three-dimensional data onto the 2D screen and make it appear to leap off the page. Every step of vertex transformation exists for this purpose: to give the 2D picture a convincing sense of three-dimensional depth. In order of transformation, the main coordinate spaces (or coordinate types) are: object space (model coordinates), world space, eye space (viewing coordinates), and clip and projection space (screen coordinates).
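In the standard formulation (a summary, not a quotation from this article), each object-space vertex is carried through these spaces by a chain of matrix multiplications:

$$ v_{\text{clip}} = P \cdot V \cdot M \cdot v_{\text{object}} $$

where $M$ (the model, or world, matrix) maps object space to world space, $V$ (the view matrix) maps world space to eye space, and $P$ (the projection matrix) maps eye space to clip space; the subsequent perspective divide and viewport transform then yield final screen coordinates.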
OpenCL Programming Guide (English Edition, PDF)
**Introduction**

In the age of ever-increasing computational demands, the need for efficient and scalable parallel computing solutions has become paramount. OpenCL, or Open Computing Language, is an open standard for parallel programming of heterogeneous systems, enabling the efficient use of various processing units, such as CPUs, GPUs, DSPs, and FPGAs. This guide aims to provide a comprehensive understanding of OpenCL programming, covering its fundamentals, applications, and best practices.

**Chapter 1: Introduction to OpenCL**

OpenCL is a framework that allows software developers to write programs that can run across multiple platforms and devices. It enables the utilization of the full potential of modern hardware, especially GPUs, for general-purpose computing. OpenCL abstracts the underlying hardware details, providing a uniform programming interface for developers.

**Chapter 2: OpenCL Architecture and Components**

The OpenCL architecture consists of two main components: the host and the device. The host is the central processing unit (CPU) that manages the execution of the program, and the device is the hardware accelerator, such as a GPU, that performs the parallel computations. OpenCL also provides a runtime library and a set of APIs for developers to program and control the devices.

**Chapter 3: OpenCL Programming Basics**

OpenCL programming involves writing kernels, which are small functions that are executed in parallel on the device. Kernels are written in a subset of the C programming language and are compiled into executable code for the target device. This chapter covers the syntax and semantics of OpenCL kernels, including memory management, data parallelism, and synchronization.

**Chapter 4: OpenCL Memory Management**

Memory management in OpenCL is crucial for achieving optimal performance. This chapter discusses the different memory objects in OpenCL, such as buffers, images, and sub-buffers, and their usage. It also covers memory allocation, data transfer between the host and device, and memory access patterns for efficient data locality.

**Chapter 5: OpenCL Applications and Use Cases**

OpenCL finds applications in various domains, including graphics, physics simulations, machine learning, and bioinformatics. This chapter explores some real-world use cases of OpenCL, demonstrating its power and flexibility in parallel computing.

**Chapter 6: OpenCL Performance Optimization**

Achieving optimal performance in OpenCL programming requires careful consideration of various factors, such as workload distribution, memory access patterns, and kernel design. This chapter provides guidelines and best practices for optimizing OpenCL programs, including profiling tools and techniques for identifying and addressing performance bottlenecks.

**Conclusion**

OpenCL is a powerful framework for parallel computing, enabling the efficient utilization of modern hardware accelerators. This guide has provided a comprehensive overview of OpenCL programming, covering its architecture, programming basics, memory management, applications, and performance optimization. With the knowledge gained from this guide, developers can harness the full potential of OpenCL to create efficient and scalable parallel computing solutions.
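To ground the chapters summarized above, here is a minimal, hedged OpenCL sketch (it is not taken from the guide itself): a vector-add kernel plus the host-side sequence needed to run it. Error checking is omitted, and the OpenCL 1.2 clCreateCommandQueue API is assumed.

```c
#include <stdio.h>
#include <CL/cl.h>

/* OpenCL C kernel: one work-item adds one pair of elements. */
static const char *src =
    "__kernel void vadd(__global const float *a,"
    "                   __global const float *b,"
    "                   __global float *c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Device buffers; inputs are initialized from the host arrays. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    /* Build the kernel from source at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);
    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t global = N;               /* one work-item per element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %f (expected 30.0)\n", c[10]);
    return 0;
}
```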
NVIDIA Tesla V100 GPU Application Performance Guide
TESLA V100 PERFORMANCE GUIDE
Deep Learning and HPC Applications

Modern high performance computing (HPC) data centers are key to solving some of the world's most important scientific and engineering challenges. The NVIDIA® Tesla® accelerated computing platform powers these modern data centers with the industry-leading applications to accelerate HPC and AI workloads. The Tesla V100 GPU is the engine of the modern data center, delivering breakthrough performance with fewer servers, resulting in faster insights and dramatically lower costs. Improved performance and time-to-solution can also have significant favorable impacts on revenue and productivity.

Every HPC data center can benefit from the Tesla platform. Over 500 HPC applications in a broad range of domains are optimized for GPUs, including all 15 of the top 15 HPC applications and every major deep learning framework.
- To get the latest catalog of GPU-accelerated applications visit: /teslaapps
- To get up and running fast on GPUs with a simple set of instructions for a wide range of accelerated applications visit: /gpu-ready-apps

Deep Learning
Deep learning is solving important scientific, enterprise, and consumer problems that seemed beyond our reach just a few years back. Every major deep learning framework is optimized for NVIDIA GPUs, enabling data scientists and researchers to leverage artificial intelligence for their work. When running deep learning training and inference frameworks, a data center with Tesla V100 GPUs can save up to 85% in server and infrastructure acquisition costs.
Key features of the Tesla platform and V100 for deep learning training:
- Caffe, TensorFlow, and CNTK are up to 3x faster with Tesla V100 compared to P100
- 100% of the top deep learning frameworks are GPU-accelerated
- Up to 125 TFLOPS of TensorFlow operations
- Up to 16 GB of memory capacity with up to 900 GB/s memory bandwidth
View all related applications at: /deep-learning-apps
CAFFE: a popular, GPU-accelerated deep learning framework developed at UC Berkeley. Version: 1.0. Accelerated features: full framework accelerated. Scalability: multi-GPU.

Molecular Dynamics
Molecular Dynamics (MD) represents a large share of the workload in an HPC data center. 100% of the top MD applications are GPU-accelerated, enabling scientists to run simulations they couldn't perform before with traditional CPU-only versions of these applications. When running MD applications, a data center with Tesla V100 GPUs can save up to 80% in server and infrastructure acquisition costs.
Key features of the Tesla platform and V100 for MD:
- Servers with V100 replace up to 54 CPU servers for applications such as HOOMD-blue and Amber
- 100% of the top MD applications are GPU-accelerated
- Key math libraries like FFT and BLAS
- Up to 15.7 TFLOPS of single precision performance per GPU
- Up to 900 GB per second of memory bandwidth per GPU
View all related applications at: /molecular-dynamics-apps
HOOMD-BLUE: a particle dynamics package written from the ground up for GPUs. Version: 2.1.6. Accelerated features: CPU and GPU versions available. Scalability: multi-GPU and multi-node. More information: /hoomd-blue/index.html
AMBER: a suite of programs to simulate molecular dynamics on biomolecules. Version: 16.8. Accelerated features: PMEMD explicit solvent and GB; explicit and implicit solvent, REMD, aMD. Scalability: multi-GPU and single-node. More information: /gpus

Quantum Chemistry
Quantum chemistry (QC) simulations are key to the discovery of new drugs and materials and consume a large part of the HPC data center's workload. 60% of the top QC applications are accelerated with GPUs today. When running QC applications, a data center with Tesla V100 GPUs can save over 30% in server and infrastructure acquisition costs.
Key features of the Tesla platform and V100 for QC:
- Servers with V100 replace up to 5 CPU servers for applications such as VASP
- 60% of the top QC applications are GPU-accelerated
- Key math libraries like FFT and BLAS
- Up to 7.8 TFLOPS of double precision performance per GPU
- Up to 16 GB of memory capacity for large datasets
View all related applications at: /quantum-chemistry-apps
VASP: a package for performing ab-initio quantum-mechanical molecular dynamics (MD) simulations. Version: 5.4.4. Accelerated features: RMM-DIIS, blocked Davidson, k-points, and exact-exchange. Scalability: multi-GPU and multi-node. More information: /vasp

Physics
From fusion energy to high energy particles, physics simulations span a wide range of applications in the HPC data center. Many of the top physics applications are GPU-accelerated, enabling insights previously not possible. A data center with Tesla V100 GPUs can save up to 75% in server acquisition cost when running GPU-accelerated physics applications.
Key features of the Tesla platform and V100 for physics:
- Servers with V100 replace up to 75 CPU servers for applications such as GTC-P, QUDA, and MILC
- Most of the top physics applications are GPU-accelerated
- Up to 7.8 TFLOPS of double precision floating point performance
- Up to 16 GB of memory capacity with up to 900 GB/s memory bandwidth
View all related applications at: /physics-apps
GTC-P: a development code for optimization of plasma physics. Version: 2017. Accelerated features: push, shift, and collision. Scalability: multi-GPU. More information: /gtc-p
QUDA: a library for lattice quantum chromodynamics on GPUs. Version: 2017. Accelerated features: all. Scalability: multi-GPU and multi-node. More information: /quda
MILC: lattice quantum chromodynamics (LQCD) codes simulate how elemental particles are formed and bound by the "strong force" to create larger particles like protons and neutrons. Version: 2017. Accelerated features: staggered fermions, Krylov solvers, gauge-link fattening. Scalability: multi-GPU and multi-node. More information: /milc

Geoscience
Geoscience simulations are key to the discovery of oil and gas and performing geological modeling. Many of the top geoscience applications are accelerated with GPUs today. When running geoscience applications, a data center with Tesla V100 GPUs can save up to 70% in server and infrastructure acquisition costs.
Key features of the Tesla platform and V100 for geoscience:
- Servers with V100 replace up to 82 CPU servers for applications such as RTM and SPECFEM 3D
- Top oil and gas applications are GPU-accelerated
- Up to 15.7 TFLOPS of single precision floating point performance
- Up to 16 GB of memory capacity with up to 900 GB/s memory bandwidth
View all related applications at: /oil-and-gas-apps
RTM: reverse time migration (RTM) modeling is a critical component in the seismic processing workflow of oil and gas exploration. Version: 2017. Accelerated features: batch algorithm. Scalability: multi-GPU and multi-node.
SPECFEM 3D: simulates seismic wave propagation. Version: 7.0.0. Scalability: multi-GPU and multi-node. More information: https:///cig/software/specfem3d_globe

Engineering
Engineering simulations are key to developing new products across industries by modeling flows, heat transfers, finite element analysis, and more. Many of the top engineering applications are accelerated with GPUs today. When running engineering applications, a data center with NVIDIA® Tesla® V100 GPUs can save over 20% in server and infrastructure acquisition costs and over 50% in software licensing costs.
Key features of the Tesla platform and V100 for engineering:
- Servers with Tesla V100 replace up to 4 CPU servers for applications such as SIMULIA Abaqus and ANSYS Fluent
- The top engineering applications are GPU-accelerated
- Up to 16 GB of memory capacity
- Up to 900 GB/s memory bandwidth
- Up to 7.8 TFLOPS of double precision floating point
SIMULIA ABAQUS: simulation tool for analysis of structures. Version: 2017. Accelerated features: direct sparse solver, AMS eigen solver, steady-state dynamics solver. Scalability: multi-GPU and multi-node. More information: /simulia-abaqus
ANSYS FLUENT: general purpose software for the simulation of fluid dynamics. Version: 18. Accelerated features: pressure-based coupled solver and radiation heat transfer. Scalability: multi-GPU and multi-node. More information: /ansys-fluent

Benchmarks
Benchmarks provide an approximation of how a system will perform at production scale and help to assess the relative performance of different systems. The top benchmarks have GPU-accelerated versions and can help you understand the benefits of running GPUs in your data center.
Key features of the Tesla platform and V100 for benchmarking:
- Servers with Tesla V100 replace up to 67 CPU servers for benchmarks such as CloverLeaf, MiniFE, Linpack, and HPCG
- The top benchmarks are GPU-accelerated
- Up to 7.8 TFLOPS of double precision floating point and up to 16 GB of memory capacity
- Up to 900 GB/s memory bandwidth
CLOVERLEAF: benchmark mini-app for hydrodynamics. Version: 1.3. Accelerated features: Lagrangian-Eulerian explicit hydrodynamics mini-application. Scalability: multi-node (MPI). More information: http://uk-mac.github.io/CloverLeaf
MINIFE: benchmark mini-app for finite element analysis. Version: 0.3. Accelerated features: all. Scalability: multi-GPU. More information: https:///about/applications
LINPACK: benchmark that measures floating point computing power. Version: 2.1. Accelerated features: all. Scalability: multi-GPU and multi-node. More information: /project/linpack
HPCG: benchmark that exercises computational and data access patterns that closely match a broad set of important HPC applications. Version: 3. Accelerated features: all. Scalability: multi-GPU and multi-node. More information: /index.html

TESLA V100 PRODUCT SPECIFICATIONS

Assumptions and Disclaimers
The percentage of top applications that are GPU-accelerated is from the top 50 app list in the i360 report: HPC Support for GPU Computing. Calculation of throughput and cost savings assumes a workload profile where applications benchmarked in the domain take equal compute cycles: /industry/reports.php?id=131
The number of CPU nodes required to match a single GPU node is calculated using lab performance results of the GPU node application speed-up and the multi-CPU node scaling performance. For example, the molecular dynamics application HOOMD-blue has a GPU node application speed-up of 37.9x. When scaling CPU nodes to an 8-node cluster, the total system output is 7.1x, so the scaling factor is 8 divided by 7.1 (or 1.13). To calculate the number of CPU nodes required to match the performance of a single GPU node, you multiply 37.9 (GPU node application speed-up) by 1.13 (CPU node scaling factor), which gives you 43 nodes.
GPU_Programming_Guide
3.7. Antialiasing
3.7.1. Coverage Sampled Anti-Aliasing (CSAA)
Chapter 4. GeForce 8 and 9 Series Programming Tips
NVIDIA Tesla GPU RMA Processing Guide
RMA Process — Contents
Introduction
Tools and Diagnostics
  2.1. nvidia-bug-report
  2.2. nvidia-healthmon
  2.3. NVIDIA Field Diagnostic
Common System Level Issues
RMA Checklist and Flowchart
RMA Process Flow
nvidia-healthmon ’s output may include a troubleshooting report designedto address common problems, and will often suggest a number of possible solutions. These troubleshooting steps should be undertaken from the top down, as the most likely solution is listed at the top.For more details, command lines arguments, configuration options, and instructions for interpreting the results of the tool, refer to the nvidia-healthmon User Guide.2.3. NVIDIA Field DiagnosticThe NVIDIA Field Diagnostic is a comprehensive Linux based hardware diagnostic tool that provides confirmation of the numerical processing Linux engines in the GPU, integrity of data transfers to and from the GPU, and test coverage of the full onboard memory address space that is available to NVIDIA® CUDA® programs. In the event that any software or system configuration issue cannot be identified (for example,by nvidia-healthmon) and resolved, the NVIDIA Field Diagnostic should be run to determine whether the Tesla GPU may be faulty.The NVIDIA Field Diagnostic can be run with the command “./fieldiag”Note: NVIDIA T esla GPU products have ECC memory protection enabled by default. The NVIDIA Field Diagnostic runs only on boards that have ECC enabled. If the user has previously disabled ECC on a suspect board, ECC must be re-enabled prior to running the NVIDIA Field Diagnostic on that board. NVIDIA will not accept RMA requests for failures that occur only with ECC disabled. Any failure must occur with ECC enabled to be eligible for RMA return.For more details or product-specific command lines arguments, refer to the NVIDIA Field Diagnostic Quick Start Guide (DU-05711-001) and the NVIDIA Field Diagnostic Software Guide (DU-05363-001) included in the NVIDIA Field Diagnostic software package.Upon completion of the diagnostic, a fieldiag.log file is generated.Note: This file should be included with any RMA request. Failure to include this log file may resultin delays to the processing of the RMA request. For more information, see the section titled, “RMA Checklist and Flowchart.”A passing result with the NVIDIA Field Diagnostics is an indication that the NVIDIA Tesla GPU hardware is in good condition, and pointing to a potential software application-level issue.Note: In the event that the NVIDIA Field Diagnostic returns a passing result, NVIDIA requests that data be provided illustrating that the failure follows the particular NVIDIA T esla GPU board and details of the observed failures. Having this data will better allow NVIDIA to reproduce the issue and resolve any potential test weakness in the existing diagnostics.Depending on the type and severity of the observed issue, there may be situations where it may not be possible to run nvidia-bug-report, nvidia-healthmon, or the field diagnostic. In order to better ensure that the failure is attributable to the Tesla GPU, rather than a system-level issue, and avoid any potential delays to the processing of the RMA request as a result, NVIDIA recommends that the following steps be taken to further isolate the cause of the failure.In addition to the power provided by the PCIe slot connector, Tesla GPU boards also require additional power from the host system. 
1. Ensure that the appropriate PCIe 8-pin and/or 6-pin auxiliary power cables are properly connected to the board. Consult the product specifications for the specific Tesla GPU in use to determine the auxiliary power requirements for that particular product.
2. Physically remove the Tesla GPU board from the system and reinstall it to ensure that it is fully seated in the PCIe slot.
3. If available, replace the suspect Tesla GPU with a known good board to confirm that the observed issue or failure does not occur with the replacement.
4. If possible, install the suspect Tesla GPU in a different system to determine whether the observed issue or failure follows the board (or the system).
Note: The RMA submission process will request information demonstrating that common system-level causes have been eliminated. Submitting the RMA with the information as described in Step 1 through Step 4, indicating that system-level issues were eliminated, will help to accelerate the RMA approval process.

Table 1. RMA Checklist
- nvidia-bug-report log file (nvidia-bug-report.log.gz)
- NVIDIA Field Diagnostic log file (fieldiag.log)
- In the event the NVIDIA Field Diagnostic returns a passing result, or the observed failure is such that the NVIDIA tools and diagnostics cannot be run, the following information should be included with the RMA request:
  - Steps taken to eliminate common system-level causes: check PCIe auxiliary power connections; verify board seating in the PCIe slot; determine whether the failure follows the board or the system
  - Details of the observed failure: the application running at the time of failure; a description of how the product failed; step-by-step instructions to reproduce the issue; the frequency of the failure
- Is there any known or obvious physical damage to the board?
- Submit the RMA request at
Note: NVIDIA Tesla GPU products have ECC memory protection enabled by default. NVIDIA will not accept RMA requests for failures that occur only with ECC disabled. Any failure must occur with ECC enabled to be eligible for RMA return.
GPU Logic Operations

Overview
In computer science, logic operations are the process of operating on logical values.
Logic operations usually involve operators such as AND, OR, and NOT, and are used to decide whether a condition holds.

The GPU (graphics processing unit) is a hardware device specialized for graphics and parallel computation.

Its highly parallel nature gives it strong computational power for logic operations.

This article introduces the principles, applications, and advantages of GPU logic operations.
Principles of GPU logic operations
GPU logic operations are implemented by logic circuits on the GPU.

Logic circuits are built from logic gates, the basic logic elements used to implement logic operations.

Typical logic gates include AND, OR, and NOT gates.

In a GPU, the logic circuitry consists of a very large number of logic gates, and high-speed logic operations are achieved through parallel computation.
Applications of GPU logic operations
GPU logic operations are widely used in many fields.

The most common application is graphics rendering.

In computer graphics, large amounts of graphics data must be processed and computed.

The GPU's parallel compute capability lets it perform logic operations quickly, enabling efficient graphics rendering.

In addition, GPU logic operations are widely used in scientific computing, data mining, artificial intelligence, and other fields.
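As a hedged illustration (not from this article), an elementwise logical operation maps naturally onto one GPU thread per element; the kernel below computes bitwise AND, OR, and NOT over whole arrays in parallel:

```cuda
// Each thread evaluates one element's worth of logical operations.
__global__ void logicOps(const unsigned *a, const unsigned *b,
                         unsigned *andOut, unsigned *orOut, unsigned *notOut,
                         int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        andOut[i] = a[i] & b[i];   // bitwise AND
        orOut[i]  = a[i] | b[i];   // bitwise OR
        notOut[i] = ~a[i];         // bitwise NOT
    }
}
```

Launched with enough blocks to cover all n elements, every gate evaluation proceeds in parallel, which is exactly the property the section above attributes to GPU logic circuits.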
Advantages of GPU logic operations
Compared with the CPU (central processing unit), the GPU has the following advantages for logic operations:

1. Parallel compute capability: the GPU has a large number of parallel processing units and can execute many logic operation tasks simultaneously. This lets the GPU complete more logic operations in the same amount of time, improving computational efficiency.

2. High-speed caches: the GPU has high-speed caches that can store and access large amounts of data. This reduces the number of memory accesses and speeds up logic operations.

3. Purpose-built optimization: the GPU is designed specifically for graphics and parallel computing, and its logic circuits are carefully optimized. Compared with the CPU, the GPU offers higher compute performance and lower power consumption for logic operations.

4. Flexibility: the GPU is programmable and can be programmed for different logic operation needs. This allows the GPU to adapt to different application scenarios and perform logic operations more flexibly.
Summary
GPU logic operations use the GPU's parallel compute capability to perform logic operations. Through combinations of logic circuits and parallel computation, the GPU performs logic operations efficiently and is widely applied in graphics rendering, scientific computing, artificial intelligence, and other fields.
MxGPU Setup Guide
Table of Contents
1. Overview
2. Hardware and Software Requirements
  2.1 Hardware Requirements
    2.1.1 Host/Server
    2.1.2 Client
  2.2 Software Requirements
3. MxGPU Setup
  3.1 Programming SR-IOV Parameters for MxGPU
  3.2 VF Pass Through
  3.3 VF-to-VM Auto-Assignment
4. Appendix
  4.1 Host Server Configuration
  4.2 Manual Installation for GPUV Driver for VMware ESXi
    4.2.1 Upload GPUV Driver
    4.2.2 Install GPUV Driver
    4.2.3 Configure GPUV Driver
    4.2.4 Un-Install GPUV Driver
    4.2.5 Update GPUV Driver
The basic setup steps for the VMware software is detailed in the companion document to this one.2.Hardware and Software RequirementsThe sections below lists the hardware and software that are required for setting up the VMware environment.2.1Hardware Requirements2.1.1Host/ServerGraphics Adapter: AMD FirePro™ S7100X, S7150, S7150x2 for MxGPU and/orpassthrough***note that the AMD FirePro™ S7000, S9000 and S9050 can be used for passthroughonlySupported Server Platforms:•Dell PowerEdge R730 Server•HPE ProLiant DL380 Gen9 Server•SuperMicro 1028GQ-TR ServerAdditional Hardware Requirements:•CPU: 2x4 and up•System memory: 32GB & up; more guest VMs require more system memory•Hard disk: 500G & up; more guest VMs require more HDD space•Network adapter: 1000M & up2.1.2ClientAny of the following client devices can be used to access the virtual machine once theseVMs are started on the host server:•Zero client (up to 4 connectors) with standard mouse/keyboard and monitor•Thin client with standard mouse/keyboard and monitor running Microsoft®Windows® Embedded OS•Laptop/Desktop with standard mouse/keyboard and monitor running withMicrosoft® Windows® 7 and up2.2 Software RequirementsProductType Install OnSectionVersion/Download LocationAMD FirePro™ VIB Driver Hypervisor Driver Host (Server) 3.1 /en-us/download/workstation?os=VMware%20vSphere%20ESXi%206.0#catalyst-pro AMD VIB Install Utility Script Host (Server) 3.1 /en-us/download/workstation?os=VMware%20vSphere%20ESXi%206.0#catalyst-pro PuTTYSSH client Host Admin. System / SSH Secure ShellSSH Client andDownload UtilityHost Admin. System3.1Table 1 : Required Software for Document(Links to non-AMD software provided as examples)3.MxGPU SetupThe following sections describe the steps necessary to enable MxGPU on the graphics adapter(s) in the host. Before proceeding, refer to the Appendix to ensure that the host system is enabled for virtualization and SR-IOV. Once virtualization capabilities are confirmed for the host system, follow the steps in the next two sections to program the graphics adapter(s) for SR-IOV functionality and to connect the virtual functions created to available virtual machines.3.1Programming SR-IOV Parameters for MxGPU1.Download and unzip the vib and MxGPU-Setup-Script-Installer.zip from Table 1.2.Copy script and vib file to the same directory (Example : /vmfs/volumes/datastore1)ing an SSH utility, log into the directory on the host and change the attribute of mxgpu-install.sh to be executable # chmod +x mxgpu-install.sh4.Run command: # sh mxgpu-install.sh to get a list of available commands5.Run command: # sh mxgpu-install.sh –i <amdgpuv…vib>•If a vib driver is specified, then that file will be used. 
   • If no vib driver is specified, the script assumes the latest amdgpuv driver in the current directory.
   • The script will check for system compatibility before installing the driver.
   • After confirming system compatibility, the script will display all available AMD adapters.
6. Next, the script will show three options: Auto/Hybrid/Manual.
   1) Auto: automatically creates a single config string for all available GPUs.
      • The script first prompts for the number of virtual machines desired (per GPU) and sets all settings accordingly (frame buffer, time slice, etc.).
      • Next, the script prompts the user to choose whether to enable Predictable Performance, a feature that keeps performance fixed independent of the number of active VMs.
      • The settings are applied to all AMD GPUs available on the bus.
      • If an S7150x2 is detected, the script will add pciHole parameters to the VMs.
      • A reboot is required for the changes to take effect.
   2) Hybrid: configure once and apply to all available GPUs.
      • The script first prompts for the number of virtual machines desired (per GPU) and sets all settings accordingly (frame buffer, time slice, etc.).
      • Next, the script prompts the user to choose whether to enable Predictable Performance.
      • The settings are applied to the selected AMD GPU; the process repeats for the next GPU.
      • If an S7150x2 is detected, the script will add pciHole parameters to the VMs.
      • A reboot is required for the changes to take effect.
   3) Manual: configure GPUs one by one.
      • The script prompts the user to enter the VF number, FB size per VF, and time slice.
      • Next, the script prompts the user to choose whether to enable Predictable Performance.
      • The settings are applied to the selected AMD GPU; the process repeats for the next GPU.
      • If an S7150x2 is detected, the script will add pciHole parameters to the VMs.
      • A reboot is required for the changes to take effect.

Figure 1: Screenshot of MxGPU setup script installation flow.

For users who want to understand the individual steps required for vib installation and configuration, the manual procedure is described in the Appendix (Section 4.2).

3.2 VF Pass Through
Once the VFs (virtual functions) are set up, passing through these VFs follows the same procedure as passing through a physical device. To successfully pass through the VFs, the physical device CANNOT be configured as a passthrough device. If the physical device is being passed through to the VM, the GPUV driver will not install properly. If that happens, the VFs will not be enabled and no VFs will be shown.
Once the VFs are enabled, they will be listed in the available device list for pass through, and the status of the PF will be changed to unavailable for pass through.
No additional operation is needed to move a VF into the pass-through device list.

3.3 VF-to-VM Auto-Assignment
1. After rebooting the system and confirming that the VFs are populated in the device list, navigate to the directory containing the mxgpu-install.sh script.
2. Specify eligible VMs for auto-assignment in the "vms.cfg" file.
   • Note: if all registered VMs should be considered eligible, skip to step 4.
3. Edit the vms.cfg file to include the VMs that should be considered for auto-assignment.
   • Use # vi vms.cfg to edit the configuration file.
   • For help with using vi, an example can be found on the following VMware page: https:///selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1020302
   • Examples are provided in the vms.cfg configuration file showing how to specify which VMs should be considered eligible.
   • Note: make sure to delete the regular expression .* if you are specifying your own VMs.
4. Start the auto-assign option: # sh mxgpu-install.sh -a
5. Select the command to execute [assign|unassign|list|help].
6. The assign command will continue if the number of eligible VMs does not exceed the number of VFs.
7. Once the VM is powered on, a VF will appear in the Device Manager as a graphics device. The graphics driver can now be installed.

Figure 2: Screenshot of the default contents of the "vms.cfg" file.
Figure 3: Screenshot of multi-assign usage.
Figure 4: Screenshot of the multi-assign "list" command after assigning VFs to VMs.

4. Appendix

4.1 Host Server Configuration
To enable the MxGPU feature, some basic virtualization capabilities need to be enabled in the SBIOS. These capabilities may be configured from the SBIOS configuration page during system bootup. Different system BIOS vendors expose these capabilities differently: some may have one control that enables a number of them, while others expose controls for some capabilities and hardcode the rest. The following settings, taken from an American Megatrends system BIOS, list the minimal set of capabilities that have to be enabled:
• Server CPU supports MMU.
• Server chipset supports AMD IOMMU or Intel VT-d: the option "Intel VT for Directed I/O (VT-d)" should be enabled. Example path: IntelRCSetup → IIO Configuration → Intel(R) VT for Directed I/O (VT-d) → Intel VT for Directed I/O (VT-d)
• Server SBIOS supports PCIe standard SR-IOV: the option "SR-IOV Support" should be enabled. Example path: Advanced → PCI Subsystem Settings → SR-IOV Support
• Server SBIOS supports ARI (Alternative Routing-ID Interpretation): the option "ARI Forwarding" should be enabled. Example path: Advanced → PCI Subsystem Settings → PCI Express GEN 2 Settings → ARI Forwarding
• Server SBIOS and chipset (root port/bridge) support address space between 32-bit and 40-bit: if there is an "Above 4G Decoding" option, enable it. Example path: Advanced → PCI Subsystem Settings → Above 4G Decoding
• Server chipset (root port/bridge) supports more than 4G of address space: there may be an option "MMIO High Size" for this function (the default may be 256G). Example path: IntelRCSetup → Common RefCode Configuration → MMIO High Size

The following examples demonstrate implementations from other system BIOS vendors. The first shows how to enable SR-IOV on a Dell R730 platform. On some platforms, the SBIOS configuration page provides more options to control virtualization behavior; one of these options is ARI (Alternative Routing-ID Interpretation), as shown below. In addition, some platforms also provide controls to enable/disable SVM and/or IOMMU capability.
These options must be enabled on the platform.

4.2 Manual Installation for GPUV Driver for VMware ESXi
Note: the GPUV driver refers to the vib driver.

4.2.1 Upload GPUV Driver
1. Download the GPUV driver to the administrator system from Table 1.
2. Start the SSH Secure File Transfer utility and connect to the host server.
3. On the left (the administrator system), navigate to the directory where the GPUV driver is saved; on the right (the host system), navigate to /vmfs/volumes/datastore1.
4. Right-click on the GPUV driver file and select "Upload" to upload it to /vmfs/volumes/datastore1.

4.2.2 Install GPUV Driver
1. In the vSphere client, place the system into maintenance mode.
2. Start the SSH Secure Shell client, connect to the host, and run the following command:
   esxcli software vib install --no-sig-check -v /vmfs/volumes/datastore1/amdgpuv-<version>.vib
   Note: the vib name is used as an example. You should see output confirming a successful installation.
3. In the vSphere client, exit maintenance mode.
4. In the SSH Secure Shell client window, run the following command:
   esxcli system module set -m amdgpuv -e true
   This command makes the amdgpuv driver load on ESXi boot up.
5. In the vSphere client, reboot the server.

4.2.3 Configure GPUV Driver
1. Find out the BDF (bus number, device number, and function number) of the SR-IOV adapter. In the SSH Secure Shell client, type in the command:
   lspci
   You should see output like the picture referenced; the BDF for the adapter in this example is 05:00.0.
2. In the SSH Secure Shell client window, run the following command to specify the settings for the SR-IOV adapter:
   esxcfg-module -s "adapter1_conf=<bus>,<dev>,<func>,<num>,<fb>,<intv>" amdgpuv
   The configuration is done through the esxcfg-module command, with parameters in the form [bus, dev, func, num, fb, intv] to quickly set all VFs on one GPU to the same FB size and time slice:
   • bus: the bus number, in decimal
   • dev: the device number, in decimal
   • func: the function number
   • num: the number of enabled VFs
   • fb: the size of the frame buffer for each VF
   • intv: the interval of VF switching (the time slice)
   For example:
   • esxcfg-module -s "adapter1_conf=1,0,0,15,512,7000" amdgpuv
     Enables 15 virtual functions, each VF with 512M FB and a 7 millisecond time slice, for the adapter located @ 01:00.0.
   • esxcfg-module -s "adapter1_conf=5,0,0,8,256,7000 adapter2_conf=7,0,0,10,256,10000" amdgpuv
     Enables 8 VFs, each with 256M FB and a 7 millisecond time slice, for the adapter located @ 05:00.0, and 10 VFs, each with 256M FB and a 10 millisecond time slice, for the adapter located @ 07:00.0.
   • esxcfg-module -s "adapter1_conf=14,0,0,6,1024,7000 adapter2_conf=130,0,0,4,1920,7000" amdgpuv
     Enables 6 VFs, each with 1024M FB and a 7 millisecond time slice, for the adapter located @ 0E:00.0, and 4 VFs, each with 1920M FB and a 7 millisecond time slice, for the adapter located @ 82:00.0.
   Notes:
   1) Every time the command is executed, the previous configuration is overwritten. If you want to configure a newly added GPU, you must append the new parameters to the previous parameters in a single command; otherwise the previous configuration for the existing GPUs is lost.
   2) If you use lspci to find the BDF of the GPU, the value is shown in hexadecimal rather than decimal. In the last example, the first adapter is located at bus 14 (decimal), which lspci shows as 0E:00.0; the second adapter is located at bus 130, which lspci shows as 82:00.0. A small conversion sketch follows.
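Since mixing the two bases is an easy mistake, the following small helper, a sketch that is not part of the AMD tooling, converts an lspci-style hexadecimal BDF into the decimal values that esxcfg-module expects:

```cpp
// bdf2dec: convert an lspci-style hex BDF ("0e:00.0") into the decimal
// bus/device/function values used in the adapter_conf parameter string.
#include <cstdio>

int main(int argc, char** argv)
{
    unsigned bus, dev, func;
    if (argc < 2 || std::sscanf(argv[1], "%x:%x.%x", &bus, &dev, &func) != 3) {
        std::fprintf(stderr, "usage: bdf2dec <bus:dev.func in hex>\n");
        return 1;
    }
    // e.g. "0e:00.0" -> 14,0,0 and "82:00.0" -> 130,0,0
    std::printf("%u,%u,%u\n", bus, dev, func);
    return 0;
}
```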
3. In order for the new configuration to take effect, a server reboot is needed: in the vSphere client, reboot the server.

4.2.4 Un-Install GPUV Driver
1. Unload the GPUV driver by typing the following command in the SSH Secure Shell client:
   vmkload_mod -u amdgpuv
2. In the vSphere client, set the system to maintenance mode.
3. In the SSH Secure Shell client, type the command:
   esxcli software vib remove -n amdgpuv
4. Start the SSH Secure File Transfer utility and connect to the host server. On the right (the host system), navigate to /vmfs/volumes/datastore1, select the amdgpuv driver, right-click, and select "Delete".
5. In the vSphere client, reboot the server.

4.2.5 Update GPUV Driver
1. Follow the sequence in Section 4.2.4 to remove the old driver.
2. Follow the sequence in Section 4.2.1 to download and upload the new driver.
3. Follow the sequence in Section 4.2.2 to install the new driver.
4. Follow the sequence in Section 4.2.3 to configure the new driver.
NVIDIA SLI Technology Introduction
CPU-Bound Applications
• SLI cannot help when the application is CPU-bound.
• Reduce CPU work, or better, move CPU work onto the GPU.
• Don't throttle the frame rate by waiting on the GPU (for example, with glFinish()): such a call doesn't return until all rendering is finished, which prevents parallelism. A short sketch follows.
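The following minimal OpenGL sketch contrasts the two patterns. It is illustrative only, not code from the original slides; renderScene() is a hypothetical placeholder for the application's draw calls.

```cpp
#include <windows.h>
#include <GL/gl.h>

void renderScene();  // hypothetical placeholder, defined elsewhere

void renderFrame(HDC hdc)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    renderScene();  // queue draw calls; do not wait for them to finish

    // BAD: glFinish() blocks until the GPU is completely idle, serializing
    // the CPU and GPU (and both GPUs under SLI AFR):
    // glFinish();

    // GOOD: SwapBuffers() queues the swap and returns, so the driver can
    // overlap this frame's GPU work with the next frame's CPU work.
    SwapBuffers(hdc);
}
```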
Using Multiple GPUs Without SLI
• Some applications want to handle distributing load between multiple GPUs themselves.
  – Hence not a 2x speed-up.
AFR Advantages
• All work is parallelized: pixel fill, raster, and vertex transform.
• AFR is the preferred SLI mode.
• AFR works best when each frame is self-contained:
  – Don't use immediate mode; avoiding it also reduces CPU overhead.
  – Render the entire frame: don't use glViewport or glScissor, which disable load balancing in SFR mode.
  – Use render-to-texture rather than glCopyTexSubImage: glCopyTexSubImage requires the texture to be copied to both GPUs. A render-to-texture sketch follows.
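To make the last point concrete, here is a hedged render-to-texture sketch using the GL_EXT_framebuffer_object extension, one common render-to-texture path in drivers of this era. It assumes the extension entry points are loaded (for example via GLEW) and omits error handling; the names are illustrative.

```cpp
#include <GL/glew.h>  // loads the EXT_framebuffer_object entry points

// Create a texture that can be rendered into directly on each GPU,
// avoiding the cross-GPU copy that glCopyTexSubImage2D would require.
GLuint createRenderTarget(int w, int h, GLuint* fboOut)
{
    GLuint tex, fbo;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, tex, 0);
    // While this FBO is bound, draw calls render straight into 'tex'.
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);  // back to the window
    *fboOut = fbo;
    return tex;
}
```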
Compiling GPU Code
Compiling GPU code is the process of turning source code into an executable that runs on the graphics processor (GPU); it is what allows the GPU to execute workloads such as AI models. As with compiling CPU code, GPU compilation requires a compiler that can translate source code into an executable supported by the target architecture. GPU compilation technology lets developers convert source code and resources into code that the local GPU can recognize. Here, the source code may be C/C++ or GPU-specific code such as OpenCL or CUDA, whose instructions supply the architectural parameters the GPU needs for execution; the resources are external data files and similar assets referenced by the code.
In most cases, GPU compilation proceeds in two steps: preprocessing and compilation. Preprocessing converts the source code and resources into a format the compiler can understand and prepares for the subsequent compile step. Compilation then translates the preprocessed files into byte code that the GPU can recognize. The compiler must account for differences between GPU architectures, so it converts its output into code compatible with the specified target architecture; this ensures that the executable produced by GPU compilation works correctly on different GPUs. During compilation the compiler can also optimize the source code and generate efficient byte code. The GPU compilation and optimization process is complex but important: optimization improves runtime performance on the GPU and, in turn, improves the performance of AI models at larger scale. A minimal runtime-compilation sketch using OpenCL follows.
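As a concrete, non-authoritative illustration of the two-step flow described above, the following C++ sketch uses the OpenCL runtime compiler: clCreateProgramWithSource hands the source to the driver, and clBuildProgram preprocesses and compiles it for the selected GPU. The kernel is a trivial example.

```cpp
#include <CL/cl.h>
#include <cstdio>

static const char* kSource =
    "__kernel void scale(__global float* v, float s) {"
    "    v[get_global_id(0)] *= s;"
    "}";

int main()
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);

    // Compile for this specific GPU; build options could add optimizations,
    // e.g. "-cl-fast-relaxed-math".
    if (clBuildProgram(prog, 1, &device, "", NULL, NULL) != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        std::fprintf(stderr, "build failed:\n%s\n", log);
    }
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```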
NVIDIA GPU Programming Guide
Version 2.5.0

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, the NVIDIA logo, GeForce, and NVIDIA Quadro are registered trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2006 by NVIDIA Corporation. All rights reserved.

History of Major Revisions
Version  Date        Changes
2.5.0    03/01/2006  Updated Performance Tools and PerfHUD-related sections
2.4.0    07/08/2005  Updated cover; added GeForce 7 Series content
2.3.0    02/08/2005  Added 2D & Video Programming chapter; added more SLI information
2.2.1    11/23/2004  Minor formatting improvements
2.2.0    11/16/2004  Added normal map format advice; added ps_3_0 performance advice; added General Advice chapter
2.1.0    07/20/2004  Added Stereoscopic Development chapter
2.0.4    07/15/2004  Updated MRT section

Table of Contents
Chapter 1. About This Document
  1.1. Introduction
Chapter 2. How to Optimize Your Application
  2.1. Making Accurate Measurements
  2.2. Finding the Bottleneck
    2.2.1. Understanding Bottlenecks
    2.2.2. Basic Tests
    2.2.3. Using PerfHUD
  2.3. Bottleneck: CPU
  2.4. Bottleneck: GPU
Chapter 3. General GPU Performance Tips
  3.1. List of Tips
  3.2. Batching
    3.2.1. Use Fewer Batches
  3.3. Vertex Shader
    3.3.1. Use Indexed Primitive Calls
  3.4. Shaders
    3.4.1. Choose the Lowest Pixel Shader Version That Works
    3.4.2. Compile Pixel Shaders Using the ps_2_a Profile
    3.4.3. Choose the Lowest Data Precision That Works
    3.4.4. Save Computations by Using Algebra
    3.4.5. Don't Pack Vector Values into Scalar Components of Multiple Interpolants
    3.4.6. Don't Write Overly Generic Library Functions
    3.4.7. Don't Compute the Length of Normalized Vectors
    3.4.8. Fold Uniform Constant Expressions
    3.4.9. Don't Use Uniform Parameters for Constants That Won't Change Over the Life of a Pixel Shader
    3.4.10. Balance the Vertex and Pixel Shaders
    3.4.11. Push Linearizable Calculations to the Vertex Shader If You're Bound by the Pixel Shader
    3.4.12. Use the mul() Standard Library Function
    3.4.13. Use D3DTADDRESS_CLAMP (or GL_CLAMP_TO_EDGE) Instead of saturate() for Dependent Texture Coordinates
    3.4.14. Use Lower-Numbered Interpolants First
  3.5. Texturing
    3.5.1. Use Mipmapping
    3.5.2. Use Trilinear and Anisotropic Filtering Prudently
    3.5.3. Replace Complex Functions with Texture Lookups
  3.6. Performance
    3.6.1. Double-Speed Z-Only and Stencil Rendering
    3.6.2. Early-Z Optimization
    3.6.3. Lay Down Depth First
    3.6.4. Allocating Memory
  3.7. Antialiasing
Chapter 4. GeForce 6 & 7 Series Programming Tips
  4.1. Shader Model 3.0 Support
    4.1.1. Pixel Shader 3.0
    4.1.2. Vertex Shader 3.0
    4.1.3. Dynamic Branching
    4.1.4. Easier Code Maintenance
    4.1.5. Instancing
    4.1.6. Summary
  4.2. GeForce 7 Series Features
  4.3. Transparency Antialiasing
  4.4. sRGB Encoding
  4.5. Separate Alpha Blending
  4.6. Supported Texture Formats
  4.7. Floating-Point Textures
    4.7.1. Limitations
  4.8. Multiple Render Targets (MRTs)
  4.9. Vertex Texturing
  4.10. General Performance Advice
  4.11. Normal Maps
Chapter 5. GeForce FX Programming Tips
  5.1. Vertex Shaders
  5.2. Pixel Shader Length
  5.3. DirectX-Specific Pixel Shaders
  5.4. OpenGL-Specific Pixel Shaders
  5.5. Using 16-Bit Floating-Point
  5.6. Supported Texture Formats
  5.7. Using ps_2_x and ps_2_a in DirectX
  5.8. Using Floating-Point Render Targets
  5.9. Normal Maps
  5.10. Newer Chips and Architectures
  5.11. Summary
Chapter 6. General Advice
  6.1. Identifying GPUs
  6.2. Hardware Shadow Maps
Chapter 7. 2D and Video Programming
  7.1. OpenGL Performance Tips for Video
    7.1.1. POT with and without Mipmaps
    7.1.2. NP2 with Mipmaps
    7.1.3. NP2 without Mipmaps (Recommended)
    7.1.4. Texture Performance with Pixel Buffer Objects (PBOs)
Chapter 8. NVIDIA SLI and Multi-GPU Performance Tips
  8.1. What is SLI?
  8.2. Choosing SLI Modes
  8.3. Avoid CPU Bottlenecks
  8.4. Disable VSync by Default
  8.5. DirectX SLI Performance Tips
    8.5.1. Limit Lag to At Least 2 Frames
    8.5.2. Update All Render-Target Textures in All Frames that Use Them
    8.5.3. Clear Color and Z for Render Targets and Frame Buffers
  8.6. OpenGL SLI Performance Tips
    8.6.1. Limit OpenGL Rendering to a Single Window
    8.6.2. Request PFD_SWAP_EXCHANGE Pixel Formats
    8.6.3. Avoid Front Buffer Rendering
    8.6.4. Limit pbuffer Usage
    8.6.5. Render Directly into Textures Instead of Using glCopyTexSubImage
    8.6.6. Use Vertex Buffer Objects or Display Lists
    8.6.7. Limit Texture Working Set
    8.6.8. Render the Entire Frame
    8.6.9. Limit Data Readback
    8.6.10. Never Call glFinish()
Chapter 9. Stereoscopic Game Development
  9.1. Why Care About Stereo?
  9.2. How Stereo Works
  9.3. Things That Hurt Stereo
    9.3.1. Rendering at an Incorrect Depth
    9.3.2. Billboard Effects
    9.3.3. Post-Processing and Screen-Space Effects
    9.3.4. Using 2D Rendering in Your 3D Scene
    9.3.5. Sub-View Rendering
    9.3.6. Updating the Screen with Dirty Rectangles
    9.3.7. Resolving Collisions with Too Much Separation
    9.3.8. Changing Depth Range for Different Objects in the Scene
    9.3.9. Not Providing Depth Data with Vertices
    9.3.10. Rendering in Windowed Mode
    9.3.11. Shadows
    9.3.12. Software Rendering
    9.3.13. Manually Writing to Render Targets
    9.3.14. Very Dark or High-Contrast Scenes
    9.3.15. Objects with Small Gaps between Vertices
  9.4. Improving the Stereo Effect
    9.4.1. Test Your Game in Stereo
    9.4.2. Get "Out of the Monitor" Effects
    9.4.3. Use High-Detail Geometry
    9.4.4. Provide Alternate Views
    9.4.5. Look Up Current Issues with Your Games
  9.5. Stereo APIs
  9.6. More Information
Chapter 10. Performance Tools Overview
  10.1. PerfHUD
  10.2. PerfSDK
  10.3. GLExpert
  10.4. ShaderPerf
  10.5. NVIDIA Melody
  10.6. FX Composer
  10.7. Developer Tools Questions and Feedback

Chapter 1. About This Document

1.1. Introduction
This guide will help you to get the highest graphics performance out of your application, graphics API, and graphics processing unit (GPU). Understanding the information in this guide will help you to write better graphical applications. This document is organized in the following way:
• Chapter 1 (this chapter) gives a brief overview of the document's contents.
• Chapter 2 explains how to optimize your application by finding and addressing common bottlenecks.
• Chapter 3 lists tips that help you address bottlenecks once you've identified them. The tips are categorized and prioritized so you can make the most important optimizations first.
• Chapter 4 presents several useful programming tips for GeForce 7 Series, GeForce 6 Series, and NV4X-based Quadro FX GPUs. These tips focus on features, but also address performance in some cases.
• Chapter 5 offers several useful programming tips for NVIDIA® GeForce™ FX and NV3X-based Quadro FX GPUs. These tips focus on features, but also address performance in some cases.
• Chapter 6 presents general advice for NVIDIA GPUs, covering a variety of different topics such as performance, GPU identification, and more.
• Chapter 7 covers 2D and video programming on NVIDIA GPUs.
• Chapter 8 explains NVIDIA's Scalable Link Interface (SLI) technology, which allows you to achieve dramatic performance increases with multiple GPUs.
• Chapter 9 describes how to take advantage of our stereoscopic gaming support. Well-written stereo games are vibrant and far more visually immersive than their non-stereo counterparts.
• Chapter 10 provides an overview of NVIDIA's performance tools.

Chapter 2. How to Optimize Your Application
This section reviews the typical steps to find and remove performance bottlenecks in a graphics application.

2.1. Making Accurate Measurements
Many convenient tools allow you to measure performance while providing tested and reliable performance indicators. For example, PerfHUD's yellow line (see the PerfHUD documentation for more information) measures total milliseconds (ms) per frame and displays the current frame rate. To enable valid performance comparisons:
• Verify that the application runs cleanly. For example, when the application runs with Microsoft's DirectX Debug runtime, it should not generate any errors or warnings.
• Ensure that the test environment is valid.
  (That is, make sure you are running release versions of the application and its DLLs, as well as the release runtime of the latest version of DirectX.)
• Use release versions (not debug builds) for all software.
• Make sure all display settings are set correctly. Typically, this means that they are at their default values. Anisotropic filtering and antialiasing settings particularly influence performance.
• Disable vertical sync. This ensures that your frame rate is not limited by your monitor's refresh rate (a Direct3D 9 sketch follows this checklist).
• Run on the target hardware. If you're trying to find out if a particular hardware configuration will perform sufficiently, make sure you're running on the correct CPU, GPU, and with the right amount of memory on the system. Bottlenecks can change significantly as you move from a low-end system to a high-end system.
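For Direct3D 9 applications, the vertical-sync item above corresponds to choosing an immediate presentation interval when creating the device. The following is a minimal sketch under that assumption; device creation itself is elided.

```cpp
#include <windows.h>
#include <d3d9.h>

// Fill in present parameters so that Present() does not wait for vblank,
// leaving measured frame rates uncapped by the monitor's refresh rate.
// 'pp' would then be passed to IDirect3D9::CreateDevice.
void setupPresentParams(D3DPRESENT_PARAMETERS* pp)
{
    ZeroMemory(pp, sizeof(*pp));
    pp->Windowed             = TRUE;
    pp->SwapEffect           = D3DSWAPEFFECT_DISCARD;
    pp->BackBufferFormat     = D3DFMT_UNKNOWN;
    pp->PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;  // no vsync
}
```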
2.2. Finding the Bottleneck

2.2.1. Understanding Bottlenecks
At this point, assume we have identified a situation that shows poor performance. Now we need to find the performance bottleneck. The bottleneck generally shifts depending on the content of the scene. To make things more complicated, it often shifts over the course of a single frame. So "finding the bottleneck" really means "Let's find the bottleneck that limits us the most for this scenario." Eliminating this bottleneck achieves the largest performance boost.

Figure 1. Potential Bottlenecks

In an ideal case, there won't be any one bottleneck—the CPU, AGP bus, and GPU pipeline stages are all equally loaded (see Figure 1). Unfortunately, that case is impossible to achieve in real-world applications—in practice, something always holds back performance.
The bottleneck may reside on the CPU or the GPU. PerfHUD's green line (see the PerfHUD documentation for more information) shows how many milliseconds the GPU is idle during a frame. If the GPU is idle for even one millisecond per frame, it indicates that the application is at least partially CPU-limited. If the GPU is idle for a large percentage of frame time, or if it's idle for even one millisecond in all frames and the application does not synchronize CPU and GPU, then the CPU is the biggest bottleneck. Improving GPU performance simply increases GPU idle time. Another easy way to find out if your application is CPU-limited is to ignore all draw calls with PerfHUD (effectively simulating an infinitely fast GPU). In the Performance Dashboard, simply press N. If performance doesn't change, then you are CPU-limited and you should use a tool like Intel's VTune or AMD's CodeAnalyst to optimize your CPU performance.

2.2.2. Basic Tests
You can perform several simple tests to identify your application's bottleneck. You don't need any special tools or drivers to try these, so they are often the easiest to start with.
• Eliminate all file accesses. Any hard disk access will kill your frame rate. This condition is easy enough to detect—just take a look at your computer's "hard disk in use" light or disk performance monitor signals using Windows' perfmon tool, AMD's CodeAnalyst (/us-en/Processors/DevelopWithAMD/0,,30_2252_3604,00.html) or Intel's VTune (/software/products/vtune/). Keep in mind that hard disk accesses can also be caused by memory swapping, particularly if your application uses a lot of memory.
• Run identical GPUs on CPUs with different speeds. It's helpful to find a system BIOS that allows you to adjust (i.e., down-clock) the CPU speed, because that lets you test with just one system. If the frame rate varies proportionally depending on the CPU speed, your application is CPU-limited.
• Reduce your GPU's core clock. You can use publicly available utilities such as Coolbits (see Chapter 6) to do this. If a slower core clock proportionally reduces performance, then your application is limited by the vertex shader, rasterization, or the fragment shader (that is, shader-limited).
• Reduce your GPU's memory clock. You can use publicly available utilities such as Coolbits (see Chapter 6) to do this. If the slower memory clock affects performance, your application is limited by texture or frame buffer bandwidth (GPU bandwidth-limited).
Generally, changing CPU speed, GPU core clock, and GPU memory clock are easy ways to quickly determine CPU bottlenecks versus GPU bottlenecks. If underclocking the CPU by n percent reduces performance by n percent, then the application is CPU-limited. If underclocking the GPU's core and memory clocks by n percent reduces performance by n percent, then the application is GPU-limited.

2.2.3. Using PerfHUD
PerfHUD provides a vast array of debugging and profiling tools to help improve your application's performance. Here are some guidelines to get you started. The PerfHUD User Guide contains detailed methodology for identifying and removing bottlenecks, troubleshooting, and more. It is available at /object/PerfHUD_home.html.
1. Navigate your application to the area you want to analyze.
2. If you notice any rendering issues, use the Debug Console and Frame Debugger to solve those problems first.
3. Check the Debug Console for performance warnings.
4. When you notice a performance issue, switch to Frame Profiler Mode (if you have a GeForce 6 Series or later GPU) and use the advanced profiling features to identify the bottleneck. Otherwise, use the pipeline experiments in Performance Dashboard Mode to identify the bottleneck.

2.3. Bottleneck: CPU
If an application is CPU-bound, use profiling to pinpoint what's consuming CPU time. The following modules typically use significant amounts of CPU time:
• Application (the executable as well as related DLLs)
• Driver (nv4disp.dll, nvoglnt.dll)
• DirectX Runtime (d3d9.dll)
• DirectX Hardware Abstraction Layer (hal32.dll)
Because the goal at this stage is to reduce the CPU overhead so that the CPU is no longer the bottleneck, it is relatively important what consumes the most CPU time. The usual advice applies: choose algorithmic improvements over minor optimizations. And of course, find the biggest CPU consumers to yield the largest performance gains.
Next, we need to drill into the application code and see if it's possible to remove or reduce code modules. If the application spends large amounts of CPU in hal32.dll, d3d9.dll, or nvoglnt.dll, this may indicate API abuse. If the driver consumes large amounts of CPU, is it possible to reduce the number of calls made to the driver? Improving batch sizes helps reduce driver calls. Detailed information about batching is available in the following presentations:
• /docs/IO/8230/BatchBatchBatch.ppt
• /developer/presentations/GDC_2004/Dx9Optimization.pdf
PerfHUD also helps to identify driver overhead. It can display the amount of time spent in the driver per frame (plotted as a red line) and it graphs the number of batches drawn per frame.
Other areas to check when performance is CPU-bound:
• Is the application locking resources, such as the frame buffer or textures? Locking resources can serialize the CPU and GPU, in effect stalling the CPU until the GPU is ready to return the lock. So the CPU is actively waiting and not available to process the application code. Locking therefore causes CPU overhead.
• Does the application use the CPU to protect the GPU? Culling small sets of triangles creates work for the CPU and saves work on the GPU, but the GPU is already idle! Removing these CPU-side optimizations actually increases performance when CPU-bound.
• Consider offloading CPU work to the GPU. Can you reformulate your algorithms so that they fit into the GPU's vertex or pixel processors?
• Use shaders to increase batch size and decrease driver overhead. For example, you may be able to combine two materials into a single shader and draw the geometry as one batch, instead of drawing two batches each with its own shader. Shader Model 3.0 can be useful in a variety of situations to collapse multiple batches into one, and reduce both batch and draw overhead. See Section 4.1 for more on Shader Model 3.0.

2.4. Bottleneck: GPU
GPUs are deeply pipelined architectures. If the GPU is the bottleneck, we need to find out which pipeline stage is the largest bottleneck. For an overview of the various stages of the graphics pipeline, see /docs/IO/4449/SUPP/GDC2003_PipelinePerformance.ppt.
PerfHUD simplifies things by letting you force various GPU and driver features on or off. For example, it can force a mipmap LOD bias to make all textures 2 × 2. If performance improves a lot, then texture cache misses are the bottleneck. PerfHUD similarly permits control over pixel shader execution times by forcing all or part of the shaders to run in a single cycle.
PerfHUD also gives you detailed access to GPU performance counters and can automatically find your most expensive render states and draw calls, so we highly recommend that you use it if you are GPU-limited.
If you determine that the GPU is the bottleneck for your application, use the tips presented in Chapter 3 to improve performance.

Chapter 3. General GPU Performance Tips
This chapter presents the top performance tips that will help you achieve optimal performance on GeForce FX, GeForce 6 Series, and GeForce 7 Series GPUs. For your convenience, the tips are organized by pipeline stage. Within each subsection, the tips are roughly ordered by importance, so you know where to concentrate your efforts first.
A great place to get an overview of modern GPU pipeline performance is the Graphics Pipeline Performance chapter of the book GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. The chapter covers bottleneck identification as well as how to address potential performance problems in all parts of the graphics pipeline. Graphics Pipeline Performance is freely available at /object/gpu_gems_samples.html.

3.1. List of Tips
When used correctly, recent GPUs can achieve extremely high levels of performance. This list presents an overview of available performance tips that the subsequent sections explain in more detail.

Poor Batching Causes CPU Bottleneck
• Use fewer batches.
• Use texture atlases to avoid texture state changes.
  (/object/nv_texture_tools.html)
• In DirectX, use the Instancing API to avoid SetMatrix and similar instancing state changes.

Vertex Shaders Cause GPU Bottleneck
• Use indexed primitive calls.
• Use DirectX 9's mesh optimization calls [ID3DXMesh::OptimizeInplace() or ID3DXMesh::Optimize()].
• Use our NVTriStrip utility if an indexed list won't work. (/object/nvtristrip_library.html)

Pixel Shaders Cause GPU Bottleneck
• Choose the minimum pixel shader version that works for what you're doing.
  – When developing your shader, it's okay to use a higher version. Make it work first, then look for opportunities to optimize it by reducing the pixel shader version.
• If you need ps_2_* functionality, use the ps_2_a profile.
• Choose the lowest data precision that works for what you're doing; prefer half to float. Use the half type for everything that you can: varying parameters, uniform parameters, variables, and constants.
• Balance the vertex and pixel shaders.
• Push linearizable calculations to the vertex shader if you're bound by the pixel shader.
• Don't use uniform parameters for constants that will not change over the life of a pixel shader.
• Look for opportunities to save computations by using algebra.
• Replace complex functions with texture lookups (for example, per-pixel specular lighting).
  – Use FX Composer to bake programmatically generated textures to files.
  – But sincos, log, and exp are native instructions and do not need to be replaced by texture lookups.

Texturing Causes GPU Bottleneck
• Use mipmapping.
• Use trilinear and anisotropic filtering prudently.
  – Match the level of anisotropic filtering to texture complexity.
  – Use our Photoshop plug-in to vary the anisotropic filtering level and see what it looks like. (/object/nv_texture_tools.html)
  – Follow this simple rule of thumb: if the texture is noisy, turn anisotropic filtering on.

Rasterization Causes GPU Bottleneck
• Double-speed z-only and stencil rendering.
• Early-z (Z-cull) optimizations.
• Antialiasing: how to take advantage of antialiasing.

3.2. Batching

3.2.1. Use Fewer Batches
"Batching" refers to grouping geometry together so many triangles can be drawn with one API call, instead of using (in the worst case) one API call per triangle. There is driver overhead whenever you make an API call, and the best way to amortize this overhead is to call the API as little as possible. In other words, reduce the total number of draw calls by drawing several thousand triangles at once. Using a smaller number of larger batches is a great way to improve performance. As GPUs become ever more powerful, effective batching becomes ever more important in order to achieve optimal rendering rates.

3.3. Vertex Shader

3.3.1. Use Indexed Primitive Calls
Using indexed primitive calls allows the GPU to take advantage of its post-transform-and-lighting vertex cache. If it sees a vertex it's already transformed, it doesn't transform it a second time—it simply uses a cached result.
In DirectX, you can use the ID3DXMesh class's OptimizeInPlace() or Optimize() functions to optimize meshes and make them more friendly towards the vertex cache. You can also use our own NVTriStrip utility to create optimized cache-friendly meshes. NVTriStrip is a standalone program that is available at /object/nvtristrip_library.html. A sketch of a single batched, indexed draw call follows.
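The following hedged Direct3D 9 sketch shows the shape of such a call: a single indexed draw submits an entire mesh, letting the post-transform-and-lighting cache reuse vertices. Buffer creation and the vertex declaration/FVF are assumed to be set up elsewhere; the names are illustrative.

```cpp
#include <d3d9.h>

// Draw a whole mesh with one API call instead of one call per triangle.
void drawMesh(IDirect3DDevice9* dev,
              IDirect3DVertexBuffer9* vb,
              IDirect3DIndexBuffer9* ib,
              UINT vertexCount, UINT triCount, UINT stride)
{
    dev->SetStreamSource(0, vb, 0, stride);
    dev->SetIndices(ib);
    // One call submits thousands of triangles; indexed submission also
    // lets the post-T&L vertex cache reuse already-transformed vertices.
    dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,
                              0,            // BaseVertexIndex
                              0,            // MinVertexIndex
                              vertexCount,  // NumVertices
                              0,            // StartIndex
                              triCount);    // PrimitiveCount
}
```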
3.4. Shaders
High-level shading languages provide a powerful and flexible mechanism that makes writing shaders easy. Unfortunately, this means that writing slow shaders is easier than ever. If you're not careful, you can end up with a spontaneous explosion of slow shaders that brings your application to a halt. The following tips will help you avoid writing inefficient shaders for simple effects. In addition, you'll learn how to take full advantage of the GPU's computational power. Used correctly, the high-end GeForce FX GPUs can deliver more than 20 operations per clock cycle! And the latest GeForce 6 and 7 Series GPUs can deliver many times more performance.

3.4.1. Choose the Lowest Pixel Shader Version That Works
Choose the lowest pixel shader version that will get the job done. For example, if you're doing a simple texture fetch and a blend operation on a texture that's just 8 bits per component, there's no need to use a ps_2_0 or higher shader.

3.4.2. Compile Pixel Shaders Using the ps_2_a Profile
Microsoft's HLSL compiler (fxc.exe) adds chip-specific optimizations based on the profile that you're compiling for. If you're using a GeForce FX GPU and your shaders require ps_2_0 or higher, you should use the ps_2_a profile, which is a superset of ps_2_0 functionality that directly corresponds to the GeForce FX family. Compiling to the ps_2_a profile will probably give you better performance than compiling to the generic ps_2_0 profile. Please note that the ps_2_a profile was only available starting with the July 2003 HLSL release.
In general, you should use the latest version of fxc (with DirectX 9.0c or newer), since Microsoft will add smarter compilation and fix bugs with each release. For GeForce 6 and 7 Series GPUs, simply compiling with the appropriate profile and latest compiler is sufficient.

3.4.3. Choose the Lowest Data Precision That Works
Another factor that affects both performance and quality is the precision used for operations and registers. The GeForce FX, GeForce 6 Series, and GeForce 7 Series GPUs support 32-bit and 16-bit floating-point formats (called float and half, respectively), and a 12-bit fixed-point format (called fixed). The float data type is very IEEE-like, with an s23e8 format. The half is also IEEE-like, in an s10e5 format. The 12-bit fixed type covers a range from [-2,2) and is not available in the ps_2_0 and higher profiles. The fixed type is available with the ps_1_0 to ps_1_4 profiles in DirectX, and with either the NV_fragment_program extension or Cg in OpenGL.
The performance of these various types varies with precision:
• The fixed type is fastest and should be used for low-precision calculations, such as color computation.
• If you need floating-point precision, the half type delivers higher performance than the float type. Prudent use of the half type can triple frame rates, with more than 99% of the rendered pixels within one least-significant bit (LSB) of a fully 32-bit result in most applications!
• If you need the highest possible accuracy, use the float type.
You can use the /Gpp flag (available in the July 2003 HLSL update) to force everything in your shaders to half precision. After you get your shaders working and follow the tips in this section, enable this flag to see its effect on performance and quality. If no errors appear, leave this flag enabled. Otherwise, manually demote to half precision when it is beneficial (/Gpp provides an upper performance bound that you can work toward).
When you use the half or fixed types, make sure you use them for varying parameters, uniform parameters, variables, and constants. (A sketch of forcing partial precision at compile time follows.)
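As a sketch of driving the same partial-precision choice from application code, D3DX exposes a compile flag equivalent to /Gpp. The file and entry-point names below are hypothetical.

```cpp
#include <d3dx9.h>

// Compile an HLSL pixel shader with everything forced to half precision,
// mirroring the /Gpp command-line flag described above.
HRESULT compileHalfPrecision(LPD3DXBUFFER* code)
{
    LPD3DXBUFFER errors = NULL;
    HRESULT hr = D3DXCompileShaderFromFile(
        "lighting.hlsl",             // shader source file (hypothetical)
        NULL, NULL,                  // no macros, no custom includes
        "main", "ps_2_0",            // entry point and profile
        D3DXSHADER_PARTIALPRECISION, // force half precision
        code, &errors, NULL);
    if (FAILED(hr) && errors)
        OutputDebugStringA((const char*)errors->GetBufferPointer());
    return hr;
}
```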
If you're using DirectX's assembly language with the ps_2_0 profile, use the _pp modifier to reduce the precision of your calculations. If you're using the OpenGL ARB_fragment_program language, use the ARB_precision_hint_fastest option if you want to minimize execution time, with possibly reduced precision, or use the NV_fragment_program option if you want to control precision on a per-instruction basis (see /dev_content/nvopenglspecs/GL_NV_fragment_program_option.txt). A minimal loading sketch follows.
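To make the OpenGL option concrete, here is a hedged sketch that loads a trivial ARB fragment program opting into reduced precision. It assumes the extension entry points are loaded (for example via GLEW) and omits error checking.

```cpp
#include <GL/glew.h>
#include <cstring>

// A trivial fragment program; the OPTION line lets the driver trade
// precision for speed across the whole program.
static const char* kFastFp =
    "!!ARBfp1.0\n"
    "OPTION ARB_precision_hint_fastest;\n"
    "MOV result.color, fragment.color;\n"
    "END\n";

GLuint loadFragmentProgram()
{
    GLuint prog;
    glGenProgramsARB(1, &prog);
    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, prog);
    glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei)std::strlen(kFastFp), kFastFp);
    return prog;
}
```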