Programming with Intel Math Kernel Library in Integrated Development Environments (IDE)
vasp在intel单机上的并行安装

Please type a selection: 1
通常来说就是 1,典型安装所有组件 接下来就是让同意一个协议,同意的话输入 accept(当然得同意,不 同意就装不了) 之后会出现让选择安装到哪个目录: Processing Intel(R) Fortran Compiler for applications running on
Please make your selection by entering an option from the choices below:
1. Provide the license file name with full path (/<path>/<name>.lic)
or the port and hostname of Flexlm* license server (port@hostname)
1. Provide your serial number [Recommended] Use this option if you have a serial number to install
and register your software. The Intel(R) Software Setup
进入
[david@localhost compiler]$ ls
l_fc_p_10.1.018 l_fc_p_10.1.018.tar.gz l_fcpro_RFS7H5XD.lic
[david@localhost compiler]$ cd l_fc_p_10.1.018
[david@localhost l_fc_p_10.1.018]$ ls
Assistant may connect to the Intel(R) Software Development
MKL在VS2008下的配置与使用要点

1.tools->options->Projects and solutions->VC++ Directoriesa)Show directories for -> Executable files->add mkl的安装目录下的bin:如:“D:\Program Files\Intel\MKL\10.0.3.021\ia32\bin”b)Show directories for -> Include files ->add mkl 的include目录,如“D:\ProgramFiles\Intel\MKL\10.0.3.021\include”c)Show directories for -> Library files -> add mkl 的lib目录,如”D:\ProgramFiles\Intel\MKL\10.0.3.021\ia32\lib”2.在自己创建的工程里面添加:Project->…PropertiesLinker->Input->Additional Dependencies 添加a)mkl_c.libb)libguide40.libc)mkl_solver.libd)mkl_core.libe)libguide.lib(可能不用)f)libiomp5mt.lib(可能不用)此外,有人推荐使用的库/en-us/forums/intel-ma th-kernel-library/topic/62018/mkl_intel_c.libmkl_intel_thread.libmkl_core.liblibiomp5md.lib可以参见userguide.pdf chapter 5/en-us/forums/intel-math-kernel-library/topic/62018/Visual Studio* 2008 or Microsoft* Visual Studio* 2005:1.Select View » Solution Explorer (and make sure this window is active).2.Select Tools » Options » Projects and Solutions » VC++ Directories.3.In the drop down menu titled Show directorie s for:, select Include Files, and thentype in the directory for the Intel MKL include files (e.g. default: C:\ProgramFiles\Intel\MKL\10.0\include).4.In the drop down menu titled Show directorie s for: select Library Files, and thentype in the directory for the Intel MKL library files (e.g. default: C:\ProgramFiles\Intel\MKL\10.0\ia32\lib).5.In the drop down menu titled Show directorie s for: select Executable Files, andthen type in the directory for the Intel MKL executable files (e.g. default: C:\ProgramFiles\Intel\MKL\10.0\ia32\bin).6.On the main toolbar select Project » Properties » Configuration Properties »Linker » Input and in the "Additional Dependencies" line, add the libraries you requireAs Todd mentioned, for a 32-bit system something like the following could be used:mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.libFor more information about Linking application with Intel® MKL version 10.0 you can find in userguide.pdf chapter 5 ).--Gennady3.在使用MKL时容易出现所调用的函数因为MKL本身而内存泄漏,此时在调用计算函数结束时使用MKL_FreeBuffers();可以解决问题。
Intel Integrated Performance Primitives (IPP)

Convert
– polar/cart, complex/real – integer/float – up/down sample
Transforms
– FFT, DFT, Goertzel, DCT, wavelet
Statistics
– norms, threshold, min / max / std.dev., mean, powerspectr
• 信号的生成 • 信号的转换 • 时域频域的转换(FFTs) • 滤波器(FFT, FIR)
®
Intel® IPP Lab Exercise: Signal Processing
IPP 对信号处理的优化
Support
– conj, copy, imag, real, zero, set
Filters
Intel® Integrated Performance Primitives (IPP)
Intel Itanium® Architecture
IA32 Intel IA32 Pentium® IA32 and Xeon processors
Intel® PCA Intel® PCA application application processors processors
MMX™ technology Streaming SIMD Extensions Streaming SIMD Extensions-2
Itanium® Architecture
Intel PCA application processors based on XScale™ technology
Transforms
常用数值计算库

lapack
软件名称 Linear Algebra Package
程序设计语言 Fortran 77
资源网址 /lapack
功能概述 线性代数计算子程序包
lapack++
软件名称 Linear Algebra Package in c++
资源网址 /blas
功能概述 Blas是执行向量和
BLAS in C++ with expression templates. 表达式模版形式的 C++中的BLAS ,
gsl
软件名称 GNU Scientific Library (linux)
程序设计语言 c++
资源网址 /lapack++/
功能概述 c++版的线性代数计算子程序包
BLAS
软件名称 Basic Linear Algebra Subroutines
程序设计语言 Fortran 77
主要开发者 Kagstrom B. ,Ling P. ,Van Loan C.
6.向量统计学库(VSL)
7.高级离散傅立叶变换
IMSL
软件名称 IMSL C Numerical Library(不兼容vc6编译器)
程序设计语言 C, Forton, C#, Java
资源网址 /
功能概述 分为统计库和数学库两部分. 数学库包含应用数学和特殊函数.IMSL 程序库 - 已成为数值分析解决方案的工业标准。 IMSL 程序库提供最完整与最值得信赖的函数库。 IMSL 数值程序库提供目前世界上最广泛被使用的 IMSL 算法,有超过 370 验证过、最正确与 thread-safe 的数学与统计程序。 IMSL FORTRAN 程序库提供新一代以 FORTRAN 90 为程序库基础的程序,能展现出最佳化的演算法能力应用于多处理器与其它高效能运算系统。
IPP程序设计-第七章

武汉大学 电子信息学院
What is IPP?
Integrated Performance Primitives 集成性能基元
主要内容
Intel IPP简介 编程基础 编程示例
IPP简介
面向Intel处理器和芯片的函数库
信号处理,图像处理,多媒体,向量处理等
10. Speech Recognition
11. Data Compression
12. Cryptography 13. String Processing
* Intel IPP crypto usage in Open SSL* * “ippgrep” – regular expression matching
Intel IPP is suitable for a very wide range of applications
• Video broadcasting, Video/Voice Conferencing
• Consumer Multimedia • Medical Imaging, Document Imaging • Computer Vision /Object Tracking / Machine Learning • Databases and Enterprise Data Management • Information Security • Embedded Applications • Mathematics and Science
Integrated Performance Primitives (IPP)
Itanium® Architecture Pentium® II processor Pentium® III processor Pentium® 4 processor Xeon™ processor
全国计算机等级考试三级网络技术英文单词

第一章计算机基础Computer计算机Client客户机Server服务器Peer To Peer对等,P2P计算机辅助工程:Computer Aided Design CAD计算机辅助设计Computer Aided Manufacturing CAM计算机辅助制造Computer Aided Engineering CAE计算机辅助工程Computer Aided Instruction CAI计算机辅助教学Computer Aided Testing CAT计算机辅助测试GIS地理信息系统计算机分类:Mainframe大型主机Minicomputer小型计算机/迷你电脑Personal Computer个人计算机,Microcomputer微型计算机Workstation工作站Supercomputer巨型计算机/超级计算机Minisuper小巨型计算机/小超级计算机服务器按处理器体系结构划分:Complex Instruction Set Computer CISC复杂指令集计算机Reduced Instruction Set Computer RISC精简指令集计算机Very Long Instruction Word VLIW超长指令字Explicitly Parallel Instruction Computing EPIC清晰并行指令计算/简明平行指令计算Intel Architecture IA英特尔架构Blade Serer刀片式服务器计算机分类:Server服务器Workstation工作站Desktop PC台式机Notebook笔记本,Mobile PC便携机/移动PCHandheld PC掌上电脑,Sub-Notebook亚笔记本Ultra Mobile PC UMPC超便携计算机PDA个人数字助理LCD液晶显示器Serial Advanced Technology Attachment SATA串行高级技术附件Serial Attached SCSI串行SCSI硬盘Redundant Array Of Independent Disks RIAD独立磁盘冗余阵列,Disk Array磁盘阵列计算机的技术指标:Million Instruction Per Second,MIPS,单字长定点指令的平均执行速度Million Floating Instruction Per Second,MFLOPS,单字长浮点指令的平均执行速度Bits Per Second,Bps,每秒传输位数Mean Time Between Failure,MTBF,平均无故障时间Mean Time To Repair,MTTR,平均故障修复时间奔腾芯片的技术特点:Superscalar超标量Superpipeline,超流水线Peripheral Component Interconnect,PCI,外围部件互联Video Electronic Standard Association,VESA,视频电子标准协会Streaming SIMD Extension,SSE,流式的单指令流、多数据流扩展指令Mainboard主板、主机板,Motherboard,母版Adapter Card网卡、适配卡软件按授权方式分类:Commercial-Ware商业软件Share Ware共享软件Freeware自由软件信息的形式:Number数字Text文本Graphic图形Image图像Sound声音Media媒体Multimedia多媒体Videodisk视频光盘Speech语音Audio音响Multimedia PC,MPC,多媒体计算机Media Player媒体播放器Sound Recorder录音机Object Linking And Embedding,OLE,对象链接和嵌入数据压缩编码方法:Source Coding源编码Hybrid Coding混合编码Entropy Coding信息熵编码法Huffman Coding哈尔曼编码Run Length Coding游程编码Arithmetic Coding算术编码Prediction Coding预测编码法Differential Pulse Code Modulation,DPCM,微分脉码调制Delta Modulation,DM,Δ调制Transformation Coding变换编码法Discrete Fourier Transform,DFT,离散傅里叶变换Discrete Cosine Transform,DCT,离散余弦变换Discrete Hadamard Transform,DHT,离散哈达玛变换Vector Quantization Coding矢量量化编码法Joint Photographic Experts Group,JPEG,联合图像专家组International Organization For Standardization,ISO,国际标准化组织CCITT国际电报电话咨询委员会Baseline Sequential Codec基线顺序编解码Moving Picture Experts Group,MPEG,运动图像专家组HDTV高清晰度电视ITU国际电信联盟ISDN综合业务数字网IECNode结点Link链接Streaming Media流媒体第二章网络技术基础Advanced Research Projects Agency,ARPA,美国国防部高级研究计划局System Network Architecture,SNA,系统网络体系结构Distributed Computer Architecture,DCA,数字网络体系结构Open System Interconnection,OSI,开放系统互连Ethernet以太网Token Bus令牌总线Token Ring令牌环Fiber Distributed Data Interface,FDDI,光纤分布式数据接口National Information Infrastructure,NII,国家信息基础设施Global Information Infrastructure Committee,GIIC,全球信息基础设施委员会B-ISDN宽带业务综合数据网ATM异步传输模式IEEE美国电子电气工程师协会PSTN公用电话交换网CNNIC中国互联网网络信息中心计算机网络按覆盖的地理范围分类:Local Area Network,LAN,局域网Metropolitan Area Network,MAN,城域网Wide Area Network,WAN,广域网CATV有线电视网Nyquist奈奎斯特Shannon香农Circuit Switching电路交换Store-And-Forward Switching存储转发交换Message Switching报文交换Packet Switching报文分组交换Datagram,DG,数据报Virtual Circuit,VC,虚电路Message报文Packet报文分组Protocol协议Network Architecture计算机网络体系结构Implementation实现Interconnection互连性Interoperation互操作性Portability可移植性Service Definition服务定义Protocol Specification协议规格说明Physical Layer物理层Data Link Layer数据链路层Network Layer网络层Transport Layer传输层Session Layer会话层Presentation Layer表示层Application Layer应用层End-To-End端到端User Agent用户代理FTAM文件传送访问和管理VT虚拟终端TP事务处理RDA远程数据库访问MMS制造业报文规范Intercommunication互通Internet Layer互联层Host-To-Network Layer主机-网络层Transport Control Protocol,TCP,传输控制协议User Datagram Protocol,UDP,用户数据报协议Byte Stream字节流Byte Segment字节段Telnet远程登录协议File Transfer Protocol,FTP,文件传输协议Simple Mail Transfer Protocol,SMTP,简单邮件传输协议Domain Name Service,DNS,域名服务Router Information Protocol,RIP,路由信息协议Network File System,NFS,网络文件系统Hypertext Transfer Protocol,HTTP,超文本传输协议Page页面Web Site Web站点CERN欧洲粒子物理实验室Podcast播客Blog,Weblog博客,网络日志,网志Internet Protocol Television,IPTV,互联网协议电视/网络电视:Video On Demand,VOD,视频点播技术Live TV直播电视Time Shift TV时移电视Instant Messaging,IM,即时通信Wireless MAN,WMAN,无线城域网Bluetooth蓝牙Personal Operating Space,POS,个人操作空间Personal Area Network,PAN,个人区域网络Wireless Personal Area Network,WPAN,无线个人区域网络Mobile Ad Hoc Network,MANET,移动Ad Hoc网络Wireless Sensor Network,WSN,无线传感器网络Packet Radio Network,PRNET,分组无线网第三章局域网基础Fast Ethernet,FE,快速以太网Gigabit Ethernet,GE,千兆以太网Collision冲突Media Access Control,MAC,介质访问控制Logical Link Control,LLC,逻辑链路控制WG工作组TAG技术行动组Carrier Sense Multiple Access With Collision Detection,CSMA/CD,带冲突检测的载波侦听多路访问Truncated Binary Exponential Backoff截止二进制指数后退延迟Unicast Address单一节点地址Multicast Address多点地址Broadcast Address广播地址FCS帧校验字段CRC循环冗余校验Registration Authority Committee,RAC,注册管理委员会Company-Id公司标识Organizationally Unique Identifier,OUI,机构唯一标识符Extended Unique Identifier扩展的唯一标识符EPROM网卡的只读存储器Share LAN共享式局域网Switched LAN交换式局域网Media Independent Interface,MII,介质独立接口Gigabit Media Independent Interface,GMII,千兆介质独立接口High Speed Study Group,HSSG,高速研究组Switched Ethernet交换式以太网Ethernet Switch以太网交换机Hub集线器Cut Through直通Store And Forward存储转发Virtual Network虚拟网络Virtual LAN,VLAN,虚拟局域网Nomadic Access漫游访问Infrared Radio,IR,红外无线Channel Encoder信道编码器Frequence Hopping Spread Spectum,FHSS,跳频扩频通信Direct Sequence Spread Spectrum,DSSS,直接序列扩频Point Coordination Function,PCF,点协调功能Distributed Coordination Function,DCF,分布协调功能Collision Avoidance,CA,冲突避免Interframe Space,IFS,帧间间隔Bridge网桥网桥按路由表的建立方法分类:Transparent Bridge透明网桥Source Routing Bridge源路由网桥Spanning Tree生成树Discovery Frame发现帧第四章服务器操作系统Network Operating System,NOS,网络操作系统Process进程File Handle文件句柄File Allocation Table,FAT,文件表Virtual File Allocation Table,VFAT,虚拟文件表High Performance File System,HPFS,高性能文件系统Basic Input/Output System,BIOS,基本输入/输出系统Graphics Device Interface,GDI,图形设备接口Application Programming Interface,API,应用编程接口Kernel内核Monolithic Kernel单内核Microkernel微内核Nanokernel超微内核Exokernel外核Hardware Abstract Layer,HAL,硬件抽象层Directory Service,DS,目录服务Network Server网络服务器Network Station网络工作站网络操作系统的基本功能:File Service文件服务Print Service打印服务Database Service数据库服务Communication Service通信服务Message Service信息服务Distributed Service分布式服务Network Management Service网络管理服务IntranetSQL结构化查询语言Graphic User Interface,GUI,图形用户界面Domain域Primary Domain Controller主域控制器Backup Domain Controller备份域控制器Thread线程Preemptive抢占式NDIS网络驱动接口规范TDI传输驱动接口Netbeui扩展用户接口Active Directory Manager活动目录管理Tree域树Forest域森林Organizational Unit,OU,组织单元Role角色DEP数据执行保护NAP网络访问保护NAT自动网络地址转换Server Core服务器内核Powershell外壳Business Intelligence,BI,商务智能Netware Core Protocol,NCP,Netware核心协议System Failure Tolerance,SFT,系统容错File Server Mirroring文件服务器镜像Transaction Tracking System,TTS,事物跟踪系统Novell Directory Services,NDS,Novell目录服务Swapping对换Independent Software Vendors,ISV,独立软件厂商Dynamic Logic Partition动态处理器备用SWA软件助手OE操作环境第五章Internet基础ISP互联网服务提供商Remote Access Server远程访问服务器Modem调制解调器ADSL非对称数字用户线路Hybrid Fiber Coaxial,HFC,混合光纤同轴电缆网Cable TV,CATV,有线电视网DDNATMNetid网络号Hosted主机号NATAddress Resolution Protocol,ARP,地址解析协议Dynamic Binding动态绑定Cache缓存区Datagram数据报Maximum Transmission Unit,MTU,最大传输单元源路由选项的分类:Strict Source Route严格源路由选项Loose Source Route松散源路由选项Time Stamp时间戳Universal Time格林尼治时间Internet Control Message Protocol,ICMP,互联网控制报文协议Source Quench源站抑制Routing路由选择Router路由器Metric度量值度量值中经常使用的特征:Hop Count跳数Bandwidth带宽Delay延迟Load负载Reliability可靠性Cost开销应用最广的路由选择协议:Routing Information Protocol,RIP,路由信息协议Open Shortest Path First,OSPF,开放式最短路径优先协议Vector-Distance,V-D,向量-距离,Bellman-FordLink-Status,L-S,链路-状态Convergence收敛CIDR无类域间寻址DHCP动态主机配置协议Qos服务质量保证TCP提供的服务的特征:Connection Orientation面向连接Complete Reliability完全可靠性Full Duplex Communication全双工通信Stream Interface流接口Reliable Connection Startup&Graceful Connection Shutdown连接的可靠建立和优雅关闭Retransmission重发Acknowledgement确认Round Trip Time,RTT,往返时间3-Way Handshake3次握手Window窗口Well-Known Port著名端口第六章Internet基本服务服务器处理多个并发请求的方案:Iterative Server重复服务器Concurrent Server并发服务器First In,First Out先进先出Daemon守护进程Master主服务器Slave从服务器Worm蠕虫互联网的命名机制:Flat Naming无层次命名机制Hierarchy Naming层次型命名机制Label标号Domain域域名解析的两种方式:Recursive Resolution递归解析Iterative Resolution反复解析资源记录的组成:Domain Name域名Time To Live,TTL,最大生存周期,有效期Type类型Class类别Value(域名的)具体值Network Virtual Terminal,NVT,网络虚拟终端Real Terminal实终端数据连接建立的模式:Active主动模式Passive被动模式电子邮件传输协议:Simple Mail Transfer Protocol,SMTP,简单邮件传输协议Post Office Protocol,POP,邮局协议Interactive Mail Access Protocol,IMAP,RFC822将电子邮件报文分为两部分:Mail Header邮件头Mail Body邮件体Multipurpose Internet Mail Extensions,MIME,多用途Internet邮件扩展MIME-Version版本号Content-Type数据类型Content-Transfer-Encoding数据编码类型Quoted-Printable打印编码World Wide Web,WWW,European Center For Nuclear Research,CERN,欧洲核物理研究中心Hyper Text Markup Language,HTML,超文本标记语言Uniform Resource Locator,URL,统一资源定位符History历史Bookmark书签Default默认状态Tag标记Attitude属性Secure Sockets Layer,SSL,安全套接层NTFS第七章网络管理与网络安全网络管理的功能:Configuration Management配置管理Fault Management故障管理Accounting Management计费管理Performance Management性能管理Security Management安全管理NME网管代理模块IETF Internet工程任务组SNMP简单网络管理协议Manager管理者Agent代理者Polling轮询Interrupt-Based基于中断MIB管理信息库Trap-Directed Polling陷入制导轮询方法CIMP公共管理信息协议Association Control Protocol,ACP,联系控制协议Remote Operation Protocol,ROP,远程操作协议Protocol Data Unit,PDU,协议数据单元NCSC国家计算机安全中心Trusted Computer Standard Evaluation Criteria可信任计算机标准评估准则Orange Book橘皮书Dos拒绝服务Ddos分布式拒绝服务DES数据加密标准DEA数据加密算法AES高级加密算法RSANIST美国国家标准和技术研究所Key Distribution Center,KDC,密钥分发中心Certification Authority,CA,认证中心信息完整性认证方法:Massage Authentication Code,MAC,消息认证码Manipulation Detection Code,MDC,篡改检测吗认证函数:Message Encryption Function,MEF,信息加密函数Massage Authentication Code,MAC,信息认证码Hash Function散列函数DSS数字签名标准Token持证MIT麻省理工学院安全电子邮件常用技术:Pretty Good Privacy,PGP,非常好的私密性Secure/Multipurpose Internet Mail Extension,S/MIME,安全/通用Internet邮件扩充Passphrase口令短语Clear-Signed透明签名Ipsec IP安全协议:Authentication Head,AH,身份认证头Encapsulation Security Payload,ESP,封装安全负载TLS运输层安全Internetwork Security Monitor,互联网安全监视器HAR主机审计记录Generic Decryption,GD,类属解密第八章网络应用技术Multicast Backbone,Mbone,组播主干网Unicast单播Broadcast广播Multicast组播IANA管理局组播的相关协议:Internet Group Management Protocol,IGMP,互联网组管理协议CGMPRouter-Port Group Management Protocol,RGMP,路由器-端口组管理协议Dense-Mode Multicast Routing Protocol密集模式组播路由协议Flooding洪泛Distance Vector Multicast Routing Protocol,DVMRP,距离矢量组播路由协议Multicast For Open Shortest Path First,MOSPF,开放最短路径优先的组播扩展协议Protocol Independent Multicast-Dense Mode,PIM-DM,独立组播密集模式Core Based Trees,CBT,基于核心的Multiprotocol Border Gateway Protocol,MBGP,多协议边界网关协议Multicast Source Discovery Protocol,MSDP,组播源发现协议Centralized Topology集中式拓扑结构Decentralized Unstructured Topology分布式非结构化拓扑Distributed Hash Table,DHT,分布式散列表Node ID结点标识符Object ID资源标识符Chum波动Hybrid Structure混合式结构Instant Messaging And Presence Protocol Working Group,IPPWG,IMPP工作小组Request For Comment,RFC,请求评论Internet Engineering Task Force,IETF,Internet工程任务组IM系统的附加功能:Voice/Video Chat音频/视频聊天Application Sharing应用共享File Transfer文件传输File Sharing文件共享Game Request游戏邀请Remote Assistance远程助理Whiteboard白板Session会话Session Initiation Protocol,SIP,会话初始化协议SIP For Instant Messaging And Presence Leverage Extension,SIMPLEExtensible Messaging And Presence Protocol,XMPP,SIP系统的组成:User Agent用户代理User Agent Client,UAC,用户代理客户机User Agent Server,UAS,用户代理服务器Proxy Server代理服务器Redirect Server重定向服务器Registrar注册服务器SIP消息的类型:Request请求Response响应SIP消息的组成:Start-Line起始行Field字段Message Body消息体Entity Header实体头Request-Line请求行Status-Line状态行Message Session Relay Protocol,MSRP,消息中断协议Presence Information呈现信息Presence Service呈现服务呈现服务包括:Presence User Agent,PUA,呈现用户代理Presence Agent,PA,呈现代理Presence Server,PS,呈现服务器Watcher申请者Set Top Box机顶盒Near Video On Demand,NVOD,就近式点播电视True Video On Demand,TVOD,真实点播电视Interactive Video On Demand,IVOD,交互式点播电视Voice Over IP,Voip,IP电话,Internet Protocol PhoneIP电话的实现方法:PC-to-PCPC-to-PhonePhone-to-PhoneIP电话的组成:Terminal终端设备Gateway网关Multipoint Control Unit,MCU,多点控制单元Gatekeeper网守Common Gate Interface,CGI,公共网关接口Page Rank网页等级Store Server存储服务器Searcher搜索器Spiders蜘蛛/搜索器Robot机器人/搜索器Crawlers爬虫/搜索器Indexer索引器Sorter排序器Repository知识库Work Stemming词干法Word Truncation截词Link popularity链接流行度Hyperlink超链接。
mkl使用指南

Using MKL
Introduction Linking Interface layer Threading layer
$ module load intel-mkl
Computational layer Run-time library layer
MKL provides routines in the following areas: •BLAS •Sparse BLAS •LAPACK •PBLAS •ScaLAPACK •Sparse Solver routines •Vector Mathematical Library functions •Vector Statistical Library functions •Fourier Transform functions (FFT) •Cluster FFT •Trigonometric Transform routines •Poisson, Laplace, and Helmholtz Solver routines •Optimization (Trust-Region) Solver routines •GMP arithmetic functions MKL supports for C/C++ and Fortran, however one should note that not all the functions are available directly for both C/C++ and Fortran. In these cases mixed language
Linking model Static & Dynamic Examples
1 1 2 2 2 3 4 5 6
3-英特尔-英特尔面向XPU计算平台统一编程工具oneAPI

英特尔HPC 技术创新论坛英特尔面向XPU 计算平台的统一编程工具:英特尔® oneAPI刘蕴英特尔数据中心XPU 产品高性能计算应用架构师刘蕴oneAPI生态▪面向直接编程开发人员的DPC++▪面向基于API 编程的应用开发者的通用加速库▪面向底层硬件的level0▪Codeplay在DPC++ 上做对Nvidia GPU 的支持▪Argonne,OLCF 在做对AMD GPU 的SYCL* 支持▪Fugaku超级计算机上做了在A64FX 上oneDNN的实现。
▪华为在DPC++ 上做对升腾AI芯片的支持。
oneAPI生态的业界响应DPC++ 与SYCL 生态▪SYCL 是Kronos 下的异构编程模型。
OpenCL 的升级。
▪即保留了OpenCL 在kernel里的性能,又充分利用C++ 的抽象特性。
▪一些库和应用已经有了SYCL*实现(如Eigen, Gromacs)。
▪欢迎大家到第三天的oneAPI专场学习SYCL* 编程。
oneAPI 软件包HPC 用户应下载Base toolkit + HPC toolkit ,相当于以前的Parallel Studio XE 。
包含DPC++ 编译器,MKL 等加速库,gdb/vTune 等工具,同时支持CPU 和XPU 。
Intel ®oneAPI Rendering Toolkit可视化软件库Intel ®oneAPI Tools for HPC包含Fortran ,OpenMP ,MPI 等HPC 应用的支持。
Intel ®oneAPI Tools for IoTIOT 平台软件库Intel ®oneAPI AI Analytics Toolkit加速AI 和大数据应用的软件包。
Intel®Data Parallel C++▪作为产品,DPC++ 也支持openMP和OpenCL 的offload。
选择稀疏矩阵乘法最优存储格式的研究

选择稀疏矩阵乘法最优存储格式的研究李佳佳;张秀霞;谭光明;陈明宇【摘要】稀疏矩阵向量乘法(sparse matrix vector multiplication,SpMV)是科学和工程领域中重要的核心子程序之一,也是稀疏基本线性代数子程序(basic linear algebra subprograms,BLAS)库的重要函数.目前很多SpMV的优化工作在不同程度上获得了性能提升,但大多数优化工作针对特定存储格式或一类具有特定特征的稀疏矩阵缺乏通用性,因此高性能的SpMV实现并没有广泛地应用于实际应用和数值解法器中.另外,稀疏矩阵具有众多存储格式,不同存储格式的SpMV存在较大性能差异.根据以上现象,提出一个SpMV的自动调优器(SpMV auto-tuner,SMAT).对于一个给定的稀疏矩阵,SMAT结合矩阵特征选择并返回其最优的存储格式,应用程序通过调用SMAT来得到合适的存储格式,从而获得性能提升,同时随着SMAT 中存储格式的扩展,更多的SpMV优化工作可以将性能优势在实际应用中发挥作用.使用佛罗里达大学的2 366个稀疏矩阵作为测试集,在Intel上SMAT分别获得9.11GFLOPS(单精度)和2.44GFLOPS(双精度)的最高浮点性能,在AMD平台上获得了3.36GFLOPS(单精度)和1.52GFLOPS(双精度)的最高浮点性能.相比Intel的核心数学函数库(math kernel library,MKL)数学库,SMAT平均获得1.4~1.5倍的性能提升.【期刊名称】《计算机研究与发展》【年(卷),期】2014(051)004【总页数】13页(P882-894)【关键词】SpMV;自动调优;数值解法器;稀疏矩阵;SpBLAS【作者】李佳佳;张秀霞;谭光明;陈明宇【作者单位】计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京100190;中国科学院大学北京100049;计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京100190;中国科学院大学北京100049;计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京100190;计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京100190【正文语种】中文【中图分类】TP302稀疏矩阵向量乘法(简称“SpMV”)是科学和工程领域中重要的核心子程序之一,SpMV的性能对应用起着至关重要的作用,如激光聚变大规模数值模拟和电路模拟等应用.由于应用中的求解线性方程组的部分通常通过调用数值解法器[1]实现,因此SpMV在数值解法器中的性能直接决定了SpMV在实际应用中的性能.本文用数值解法器代替实际应用来评测SpMV的性能.目前常见的数值解法器有 MATLAB[2],PETSc[3],Trilinos[4],hypre[5]等.其中劳伦斯利弗莫尔国家实验室开发的高性能预处理器(hypre[5])中,经测试SpMV 占其中代数多重网格算法[6]运行总行时间的89%.由此可见,SpMV的优化对数值解法器和实际应用都是十分必要的.自20世纪70年代起,SpMV的优化工作一直进行,包括存储格式的优化[7-12]和结合计算机体系结构的优化[13-18]等.虽然许多优化方法显著提高了SpMV的性能,但在数值解法器和实际应用中这些优化工作并未加以很好利用.由于压缩稀疏行(列)方法(即CSR/CSC格式)具有较高的压缩率且实现简单,该格式仍然是数值解法器中最通用的稀疏矩阵存储格式.上面提到的4种常见的数值解法器都使用压缩行存储(compressed row storage,CSR)格式作为基本存储格式.虽然hypre和PETSc采用分块CSR格式或分块的对角线存储(diagonal,DIA)格式作为扩展的存储格式,但相比于SpMV的众多优化方法,数值解法器中的SpMV并没有发挥出其应有的最优性能.以hypre库为例,其中SpMV的性能与Im等人[14]优化的SpMV相比,性能差距达到2倍.由此我们观察到以下现象:目前解法器和实际应用中调用的SpMV核心程序与SpMV程序本身的优化工作之间存在着不小的性能差距,该差距将对数值解法器和实际应用的性能造成极大影响.我们分两方面详细说明产生这一差距的原因:1)虽然SpMV的优化方法达数十种之多,但缺少通用的优化方法.大多数优化方法针对特定的存储格式或一类具有相似特征的稀疏矩阵,如分块CSR格式和分块DIA格式[15]针对具有稠密子块的矩阵.当稀疏矩阵所含的稠密子块比例较大时,分块格式才会表现出明显的性能优势.这种优化的存储格式的适用局限性使得不能将其作为解法器中的单一存储格式.因为对于不具有稠密子块特征的稀疏矩阵,该存储格式会造成这些矩阵的性能下降.因此,SpMV优化方法的不通用性导致不能使用单一优化方法来提升数值解法器的性能.2)在真实应用和某些数值算法中,计算过程中稀疏矩阵并非固定不变.如多重网格算法[6]中,当原始矩阵(第0层网格的矩阵算子)为5对角矩阵时,矩阵具有明显的对角线特征.随着网格层次的增加,粗化算法使得矩阵算子的对角特征越来越弱.在计算过程中矩阵特征的不固定性使得单一的优化方法不能满足整个数值算法和应用的需要.综上,SpMV优化方法的不通用性和计算过程中矩阵特征的不固定性,使得单一的优化方法并不能满足数值解法器对SpMV的调用要求,导致目前的数值解法器仍主要使用基本的CSR存储格式.由于单一优化方法不能满足解法器和应用的需求,我们有必要对优化方法进行动态选择.由于SpMV的优化方法以存储格式的优化为主,因此本文针对SpMV在不同存储格式之间进行动态选择.本文的主要贡献如下:1)使用佛罗里达大学的稀疏矩阵集[19](共2 366个矩阵)进行测试,总结了不同存储格式的特征并提取了稀疏矩阵的特征参数(见第2节).2)构建了一个SpMV的自动调优器,根据不同的输入矩阵在不同平台上自动选择最优的矩阵存储格式.SMAT中采用离线特征提取和在线搜索相结合的方法(见第3节).3)在Intel上SMAT分别获得9.11GFLOPS(单精度)和2.44GFLOPS(双精度)的最高浮点性能,在AMD平台上获得了3.36GFLOPS(单精度)和1.52GFLOPS(双精度)的最高浮点性能.相比Intel的 MKL 数学库[20],SMAT 平均获得1.4~1.5倍的性能提升.同时,SMAT的预测时间约为CSR -SpMV执行时间的9~22倍,在调用上百次SpMV的应用中是可以接受的(见第4节).1 背景1.1 存储格式为节省存储空间、减少矩阵的运算时间,稀疏矩阵通常采用压缩方法进行存储,即只存储矩阵中的非零元素,这种存储方法即为稀疏矩阵的存储格式.存储格式发展于20世纪70年代,4种基本存储格式被广泛接受:CSR[21],DIA[21],ELL[21],COO[21].根据稀疏矩阵的不同特征,基于这4种存储格式产生了新型存储格式,主要分为以下3类:分块格式(VBR[12],BCSR[15],BDIA[15])、需要进行矩阵重排的格式(JAD[21],CSX[9])、需要进行矩阵划分的格式(PKT[13],HYB[13],Cocktail[10]l)等数十种存储格式.本文选取4种基本存储格式的原因如下:1)这4种存储格式在数据结构上差异明显,适合不同特征的稀疏矩阵,并将差异性体现在SpMV的性能中;2)这4种格式均保留了稀疏矩阵的原始特征,其他存储格式基于这4种基本格式并可能改变原始矩阵的特征,分块格式的性能是以基本存储格式为基础,如BCSR和BDIA;矩阵重排和矩阵划分改变了矩阵的特征,并且重排或划分后仍然使用现存的格式进行存储;3)基于这4种基本格式进行研究,为日后SMAT扩展到更多的存储格式打下良好基础.因此,我们选择DIA,ELL,CSR,COO这4种稀疏矩阵的存储格式作为本文的关注点.图1使用一个例子矩阵给出了4种格式的数据结构及其SpMV程序的基本实现.从图1中我们得到不同格式SpMV基本程序的特征对比,在表1中给出.这些特征对SpMV的性能起着不同程度的影响.存储格式对SpMV性能的影响将在第3节详细讨论.Fig.1 Data structure of four basic storage formats and their SpMV implementation.图1 4种存储格式数据结构及其SpMV实现Table 1 Comparison of SpMV Characteristics of Four Formats表1 4种存储格式SpMV特征的对比① Decided by architecture characterisitics.In cache architecture,times of writing vector Yin COO format is 1.Storage Format Characteristics of XAccess Length of Inner-loop Extra Computing Times of Yre-writing onals ELL Irregular Equal Decided by Nonzero Ratio Maximum Number of Nonzeros per Row DIA Continuous Mostly Equal Decided by Nonzero Ratio Number of Diag CSR Irregular Inequal No 1 COO Irregular No 1①1.2 佛罗里达稀疏矩阵集佛罗里达稀疏矩阵集[19](简称UF矩阵集)从1991年开始从实际应用中收集矩阵.该矩阵集用来缩小计算科学家和稀疏矩阵算法开发者之间的距离.有了这些真实应用中的矩阵,程序开发者可以利用它们真实反映算法在实际应用中将会取得的性能.相比其他稀疏矩阵集,如 Matrix Market[22]和Harwell-Boeing [23]集,UF 矩阵集包含了更多的应用且具有更大的矩阵规模(这两个矩阵集已包含在UF矩阵集中),而且此矩阵集仍在不断更新.鉴于以上原因,我们选择UF矩阵集作为我们的测试集,更加真实地反映SpMV在真实应用中表现的性能.UF矩阵集有两种划分方法:1)根据问题来源分为矩阵组(matrix group);2)根据应用范围分为矩阵类(matrix kind).我们关注矩阵类的划分,UF矩阵集所含矩阵类在表2中给出.按照矩阵所属领域,矩阵集分为24个矩阵类,即包含24个应用领域.Table 2 The Distribution of Application Domain in UF Collection表2 UF矩阵集中应用领域的分布App 327 graph 323 structural 277 combinatorial 266 circuit simulation 260 computational fluid dynamics 168 optimization 138 2D/3D 121 economic 71 model reduction 70 chemical process simulation 64 power network 61 theoretical quantum chemistry 47 electromagnetics 33 semiconductor device 33 thermal 29 materials 26 least squares 21 computer graphics vision 12 statistical mathematical 10 counter-example 8 acoustics 7 biochemical network 3 lication Domain Number of Matrices linear programming robotics 3 1.3 动机不同存储格式在具有不同特征的稀疏矩阵上获得不同的性能.我们以“Linear Programming”为例,对4种存储格式的SpMV进行性能测试,画出该矩阵类中矩阵在4种存储格式中的性能(如图2所示).其中84%的矩阵在CSR格式取得最优性能,13%的矩阵在COO上取得最优性能,少数矩阵在DIA或ELL上取得最优性能.如图2所示,当矩阵在DIA格式上取得最优性能时,其性能值远好于其他格式,因此属于同一应用领域的矩阵其最优存储格式并不唯一,矩阵特征也不一致,这使得在某个应用领域中仍难以简单地决定哪种格式是该领域的最优存储格式.Fig.2 The ratio of matrices with the best storage format to all the matrices in the same matrix group.图2 最优存储格式的矩阵在各矩阵类所占比例从图2表达的性能中可知DIA和ELL格式的SpMV性能较高.但并不是使用这两种格式存储的矩阵都能获得最优性能.表3对UF矩阵集——适用矩阵集和最优矩阵集进行分类.由于存储空间的限制,DIA和ELL这两种格式都需要对矩阵进行补零操作.我们对其零元素比例需要进行限制:一个矩阵的对角线条数或最大每行非零元个数不能超过平均非零元个数的20倍,即非零元所占比例至少为存储数据的5%.因此DIA和ELL并不能适用于所有稀疏矩阵,我们将可使用某种存储格式(如DIA)的矩阵集合称为“适用矩阵集”(如表3中DIA_mats),将具有同一最优格式的矩阵集合称为“最优矩阵集”(如表3中good_DIA_mats).每个矩阵有一个最优格式和多个可用格式.由于不同应用中稀疏矩阵特征的不一致性,详细的矩阵信息对选择存储格式从而选择有效的优化方法至关重要.根据提取的矩阵特征,我们建立了SpMV自动调优器SMAT,在不同体系结构上自动选择SpMV的最优存储格式.SMAT在减少应用级程序员工作量的同时保证了SpMV的性能.Table 3 Matrix Sets of the Four Storage Formats表3 4种存储格式对应的矩阵集Size of Matrix Set Storage Formats Suitable Matrix Set Best Matrix Set DIA DIA_mats(458)good_DIA_mats(206)ELL ELL_mats(1878)good_CSR_mats(1 496)COO all_mats(2 366)good_ELL_mats(169)CSR all_mats(2 366)good_COO_mats(507)2 矩阵特征提取下面我们对每个子矩阵集的特征进行分析.为简便起见,我们用M,N表示矩阵的行数和列数,NNZ表示矩阵的非零元个数.本节中每个特征与SpMV性能的关系图均为对特征取值进行分段划分,观察每段区间上的矩阵分布.下文图中使用“GOOD”表示某存储格式(如DIA)为最优格式的矩阵所占比例,因此所有的GOOD矩阵集合即表3中的“good_DIA_mats”.“BAD”表示该格式未能获得最好性能的矩阵比例.“GOOD”和“BAD”矩阵的集合就是“DIA_mats”.我们使用佛罗里达大学的稀疏矩阵集中的2 366个矩阵进行不同格式的SpMV性能测试,通过测试统计出不同存储格式的特征.2.1 DIA格式从表1可以看出,对角线条数和非零元所占比例分别影响SpMV的额外计算量和写Y的次数,对SpMV性能造成影响.我们用Ndiags和ER_DIA分别代表对角线条数和非零元所占比例,ER_DIA的计算公式为式(1):ER_DIA=NNZ/(Ndiags×M).(1)我们对这2个参数在子矩阵集DIA_mats上测试其SpMV性能(如图3、图4所示).Fig.3 The influence of Ndiags on DIA-SpMV.图3 Ndiags对DIA-SpMV 性能的影响Fig.4 The influence of ER_DIAon DIA-SpMV.图4 ER_DIA对DIA格式SpMV性能的影响1)对角线条数(Ndiags):DIA-SpMV 中写Y的次数为Ndiags,随着对角线条数增多,对向量Y的重复读写次数增加,对SpMV性能造成影响.图3给出了Ndiags与DIA-SpMV性能的关系.图3中横坐标为Ndiags的数目,分为9个取值区间;纵坐标为矩阵所占比例.其中“GOOD”指DIA为最优格式的矩阵所占比例,可知所有的GOOD矩阵集合即表3中的“good_DIA_mats”;而“BAD”指DIA未能获得最好性能的矩阵比例.从图3看出,当对角线条数大于300时,DIA格式基本在绝大多数矩阵上不再获得最高性能.结论1.当稀疏矩阵的对角线条数较少时,SpMV使用DIA格式具有性能优势.2)DIA格式中非零元所占比例(ER_DIA):即使一条对角线上只有一个非零元,DIA格式也需要存储整条对角线,包含存储额外的零元素.大量的补零操作降低了非零元所占比例,增加了SpMV的额外计算,从而影响其性能.ER_DIA与DIA-SpMV的性能如图4所示.可知,当矩阵中非零元所占比例过小(<20%)时,DIA格式的SpMV不会取得较好的性能.结论2.只有当非零元所占比例大于某一阈值(>50%)时,DIA-SpMV会取得4种格式中的最好性能.从图4可以看出,当ER_DIA取值在30%~50%之间时,仍有一部分矩阵的最优格式为DIA.因此我们引进新的参数“NTdiags_ratio”来更好地提取矩阵的特征.3)真对角线所占比例(NTdiags_ratio):我们将非零元比例大于一定值(>50%)的对角线称为“真对角线”(Tdiag).在真对角线上使用DIA格式SpMV会取得很好的性能.NTdiags_ratio用来表示真对角线在所有对角线中所占比例,计算公式如下:NTdiags_ratio=真对角线条数/Ndiags.(2)从图5得到如下结论:结论3.当NTdiags_ratio较高(>40%)时,稀疏矩阵因使用DIA格式获得最优性能.通过3个矩阵特征参数与DIA-SpMV的性能关系的测试,我们提取了表示DIA格式特征的3个特征参数(Ndiags,ER_DIA,NTdiags_ratio),并观察到当这3个阈值满足一定条件时,DIA格式的SpMV会以较高的加速比获得最优性能.Fig.5 The influence of NTdiags_ratio on DIA-SpMV.图5 NTdiags_ratio对DIA格式SpMV性能的影响2.2 ELL格式我们使用类似的方法对ELL矩阵进行分析.ELL格式对于每行非零元个数不相等的情况同样需要补零操作.为方便起见,我们将每行的非零元个数简称为“行度(row_degree)”.从表1可以看出,最大行非零元个数和非零元的填充率分别影响SpMV的额外计算量和写Y的次数,因此对SpMV整体性能造成影响.我们用max_RD和ER_ELL分别代表最大行非零元个数(即最大行度)和ELL格式中非零元所占比例,其计算公式为式(3)和式(4).我们在ELL的适用子矩阵集ELL_mats上测试ELLSpMV性能,继而分析这两个参数对SpMV的性能影响.1)矩阵最大行度(max_RD):ELL-SpMV中写向量Y的次数为max_RD.当最大行度增加时,对Y的重复写次数也增加,与DIA格式类似,这同样会造成ELL-SpMV的性能降低.图6给出了max_RD和ELL-SpMV性能的关系.从图6得出如下结论:结论4.只有最大行度较小时(<10),ELLSpMV才可能取得最好性能.Fig.6 The influence of max_RDon ELL-SpMV.图6 max_RD对ELL格式SpMV性能的影响2)ELL格式中非零元所占比例(ER_ELL):当一个稀疏矩阵每行非零元个数不一致时,ELL以最大行非零元个数为依据,对其他行进行补零操作.大量的补零操作给SpMV带来额外计算,从而影响其性能.ER_ELL与ELL-SpMV的性能如图7所示.我们得到如下结论:结论5.只有当非零元所占比例大于某一阈值(如90%)时,ELL-SpMV会取得最好的性能.与ER_DIA不同的是,ER_ELL的阈值取值更大,这是由于ELL-SpMV程序的性能优势要小于DIA-SpMV,因此对其参数取值要求也更加严格.Fig.7 The influence of ER_RDon ELL-SpMV.图7 ER_RD对ELL格式SpMV性能的影响在ELL-SpMV的测试过程中,我们同样观察到影响其性能的另一个因素——行度的波动.当行度波动大时,非零元填充率相应降低,影响SpMV性能.这个参数对于提取适合ELL格式的矩阵起到了辅助作用.3)行度波动值(var_RD):我们用矩阵行度的方差来表示矩阵中行度的波动情况.计算公式如下:var_RD与ELL-SpMV性能的关系如图8所示.从图8得到如下结论:结论6.当每行非零元个数波动很小时(<1),ELL-SpMV才可能取得较好的性能.Fig.8 The influence of var_RDon ELL-SpMV.图8 var_RD对ELL格式SpMV性能的影响通过3个矩阵特征参数与ELL-SpMV的性能关系的测试,我们提取了表示ELL格式特征的3个特征参数(max_RD,ER_ELL,var_RD),并观察到当这3个阈值满足一定条件时,ELL格式的SpMV就会获得最优性能,但加速比并不高.2.3 COO格式使用以上2种格式存储稀疏矩阵时都可能引入不必要的零元素填充,从而影响其SpMV性能.CSR和COO格式则不存在这个问题,它们只存储矩阵中的非零元素,不会引入不必要的开销.由上面对DIA和ELL的分析我们得出了适合这两种存储格式的矩阵特征.当一个稀疏矩阵不具有上述特征时,我们需要在CSR和COO两种格式中进行选择.从文献[11]得知,在GPU上COO-SpMV在幂律矩阵[24]取得这4种格式的最优性能.因此我们通过对幂律矩阵的测试在CPU 上验证这一规律,并将这一规律作为COO的特征.我们使用SNAP[25]中的部分幂律矩阵进行测试,38个矩阵中有20个矩阵在COO格式下取得的最高性能.可得如下结论:结论7.COO格式在小世界网络矩阵中的优势在CPU上并没有GPU上体现得明显.在CPU上并不是所有幂律矩阵都在COO格式上取得最优性能.因此我们只使用幂律特征作为CSR和COO格式选择的一个参考标准.幂律特征值(R):该参数从幂律分布[24]中得到,见式(6):P(k)~k-R,(6)其中,k表示矩阵中不同行度值,P(k)表示该行度出现的频率.我们对矩阵的分布用最小二乘法对该幂律公式中的R值进行计算.为了确定R值的取值范围,我们对good_COO_mats中的矩阵进行测试,如图9所示.由图9可知如下结论:结论8.当R 取值在[1,4][11,24]区间内,COOSpMV会取得好于CSR-SpMV的性能.Fig.9 The influence of Ron COO-SpMV.图9 R对COO格式SpMV性能的影响我们通过对不同矩阵子集性能的测试,得到了影响不同存储格式的特征参数(见表4第1列),并且得到了相应结论.从这些结论中我们得到了每个特征参数的阈值.参数的阈值也在表4中用符号进行表示,阈值的取值将在第3节中介绍.Table 4 Characteristics Parameter List of Sparse Matrices,Their Meaningsand the Symbols of Thresholds表4 稀疏矩阵的特征参数和含义及其对应的阈值表示Parameter Threshold Meaning#of Rows N#of Columns NNZ #of Nonzeros M Ndiags Ndiags_STH Ndiags_LTH#of Diagonals NTdiags _ratio NTdiags_ratio_STH NTdiags_ratio_LTH Ratio of“True Diagonals”ER_DIA ER_DIA_STH ER_DIA_LTH Nonzero Ratio of DIA Format max_RD max_RD_STH max_RD_LTH Maximum Value of“row _degree”var_RD var_RD_STH var_RD_LTH Variat on of“row_degree”ER_ELL ER_ELL_STH ER_ELL_LTH Nonzero Ratio of ELL Format R R_STH R_LTH Chracteristics Value of Power Law Best format The Best Storage Format3 SMAT自动调优器基于这4种基本存储格式我们建立了SpMV自动调优器,用来选择最优的存储格式,从而使SpMV获得最高性能.SMAT架构如图10所示.SMAT架构的输入为CSR格式存储的稀疏矩阵,输出该矩阵的最优格式.该架构分为两大部分:离线部分和在线部分.离线部分通过对稀疏矩阵训练集进行测试,在不同体系结构上通过在测试矩阵集上运行多种SpMV算法,确定特征参数阈值.在线部分通过矩阵特征的实时提取来进行格式预测和运行反馈,从而得到输入矩阵的最优存储格式.下面我们详细介绍每部分的构成.Fig.10 SMAT architecture.图10 SMAT架构3.1 离线部分我们从UF矩阵集中选出2 055个矩阵进行离线部分中每种存储格式的性能测试.类似于第3节的分析,我们在表3中的“适用矩阵集”上测试这4种格式的SpMV性能,从而通过统计得到具有平台差异性的参数阈值.如表4所示,每个参数对应两个阈值,一个称为“紧阈值”.当该参数值满足该阈值时,“适用矩阵集”中90%的矩阵可以在某一存储格式上取得最优性能.以DIA格式中的Ndiags参数为例,我们使用DIA_mats测试DIA-SpMV的性能.当Ndiags<20时,90%以上的矩阵在DIA格式下取得最优性能.因此Ndiags的紧阈值Ndiags_STH=20.另一个阈值为“松阈值”.在“最优矩阵集”中,当多于90%的矩阵该参数取值分布于大于(或小于)某一数值的范围内,我们将该数值称为松阈值.如DIA的“最优矩阵集(good_DIA_mats)”中90%以上的矩阵分布在Ndiags<100的范围内,因此Ndiags_LTH=100.因此,紧阈值确保了预测的正确性,而松阈值则确保了预测的少遗漏性.这样,在SMAT的离线过程我们确定了每个参数的阈值大小,从而确定了在线过程中的存储格式预测模型.3.2 在线部分在线部分对输入的CSR格式稀疏矩阵进行实时处理,通过提取其矩阵特征得到其最优存储格式.在线部分,SMAT首先对输入的CSR格式矩阵进行特征提取,从而得到表4中每个特征参数的取值.由于离线部分已经确定了每个参数的阈值,我们使用一个可信度数组(tag)来标识每个格式的预测情况(如图11所示).数组中的每一位对应一个存储格式,从左到右依次对应DIA,ELL,CSR,COO的标签.当矩阵特征参数全部满足某一存储格式对应的紧阈值时,数组的相应位置的标签为2,即预测该矩阵使用该存储格式很大可能会取得最优性能.当矩阵特征参数不能全部满足紧阈值,而满足全部松阈值对应的条件时,数组的对应位置的标签置为1,即该存储格式可能成为该矩阵的最优格式.当特征参数不能满足全部的松阈值时,数组对应位置0.通过该数组的值,SMAT的在线过程可以通过这些阈值的比较,推测该稀疏矩阵适合的存储格式.图11给出了SMAT的在线流程图.Fig.11 The prediction procedure of each format and the confirmation ofconfidence array value.图11 每种格式的判断流程图及预测可信度数组值的设定Fig.12 The online prediction procedure of SMAT.图12 SMAT在线预测流程图由于提取一个矩阵的全部特征需要的时间较长,并且在进行每个存储格式的预测时,并不需要全部的特征参数,因此SMAT将矩阵的特征提取过程和存储格式预测两部分混合在一起.先对矩阵进行部分特征的提取,再根据格式预测的结果决定是否对下一个存储格式进行特征提取.举例来说,当图11中的可信度数组中DIA格式为2,则认为找到了最优存储格式为DIA而停止下一个存储格式的特征提取和格式预测.这样,在不影响SMAT预测准确性的前提下,节省了矩阵特征提取过程不必要的时间开销.从SMAT的在线流程图(如图12所示)中的分支流程可以看出提取特征和格式预测的交替过程.为了进一步确保SMAT预测的准确度,当tag数组的值都不为2时,我们使用运行反馈模块确保其准确性.当可信度数组中没有标签值为2,SMAT使用实际运行并对比测试性能的做法得到最优存储格式.对于数组中标签为1的位执行一次对应格式的SpMV,并记录其性能.根据性能记录,从中选择性能最大值,其对应的存储格式则为最优存储格式.图12中的运行部分即为运行反馈模块.运行及反馈使得对特征不明显的稀疏矩阵也可以准确地选择存储格式,提高了SMAT的预测准确性.4 性能及分析4.1 实验平台及数据集我们选择Intel和AMD两个平台测试SMAT性能并进行分析.Intel平台采用Intel Xeon X5680为CPU,主频为3.33GHz.该系统配置了24GB的内存和可用磁盘容量为384GB.AMD系统包含主频为1.9GHz的Opteron CPU和。
如何编写高效代码 - How To Write Fast Numerical Code

2
First we note, as explained above, the exponential increase in CPU frequency. This results in a “free” speedup for numerical software. In other words, legacy code written for an obsolete predecessor will run faster without any extra programming effort. However, the theoretical performance of computers has evolved at a faster pace due to increases in the processors’ parallelism. This parallelism comes in several forms, including pipelining, superscalar processing, vector processing and multi-threading. Single-instruction multiple-data (SIMD) vector instructions enable the execution of an operation on 2, 4, or more data elements in parallel. The latest generations are also “multicore,” which means 2, 4, or more processing cores1 exist on a single chip. Exploiting parallelism in numerical software is not trivial, it requires implementation effort. Legacy code typically neither includes vector instructions, nor is it multi-threaded to take advantage of multiple processor cores or multiple processors. Ideally, compilers would take care of this problem by automatically vectorizing and parallelizing existing source code. However, while much outstanding compiler research has attacked these problems (e.g., [2–4]), they are in general still unsolved. Experience shows that this is particularly true for numerical problems. The reason is, for numerical problems, taking advantage of the platform’s available parallelism often requires an algorithm structured differently than the one that would be used in the corresponding sequential code. Compilers cannot be made to change or restructure algorithms since doing so requires knowledge of the algorithm domain. Similar problems are caused by the computer’s memory hierarchy, independently of the available parallelism. The fast processor speeds have made it increasingly difficult to “feed all floating point execution units” at the necessary rate to keep them busy. Moving data from and to memory has become the bottleneck. The memory hierarchy, consisting of registers and multiple levels of cache, aims to address this problem, but can only work if data is accessed in a suitable order. One cache miss may incur a penalty of 20–100s CPU cycles, a time in which 100 or more floating point operations could have been performed. Again, compilers are inherently limited in optimizing for the memory hierarchy since optimization may require algorithm restructuring or an entirely different choice of algorithm to begin with. Adding to these problems is the fact that CPU frequency scaling is approaching its end due to limits to the chip’s possible power density (see Fig. 1): since 2004 it has hovered around 3 GHz. This implies the end of automatic speedup; future performance gains will be exclusively due to increasing parallelism. In summary, two main problems can be identified from Fig. 1: – Years of exponential increase in CPU frequency meant free speed-up for existing software but also have caused and worsened the processor-memory bottleneck. This means to achieve the highest possible performance, code has to be restructured and tuned to the memory hierarchy.
Intel IPP简介---文本资料

Intel® Integrated Performance Primitives (IPP)
Processorspecific Functions
Intel Itanium® Architecture
IA32 Intel Pentium® IA32 IA32 and Xeon processors
Interleaving: • L/R/L/R/L/R… stereo • RGBRGBRGB… color image • RGBARGBA… color +alpha
®
Color Models: • YCbCr (4:4:4, 4:2:2) • YUV (4:2:2, 4:2:0) • YCC, HLS, RGB, RGBA …
Convert
– Pixel/planar – Color conversions
Filters
– User-defined and built-in
Arithmetic & Logical
– Abs, Add, convolve, crosscorrelation, div, exp, ln, LShift, normalize, mul, RShift, sqr, sqrt, sub, threshold – And, not, or – Compare – Phase, magnitude
®
Intel® IPP Lab Exercise: Signal Processing
IPP 对信号处理的优化
IPP 为信号处理应用提供了优化的性能
• 信号的生成 • 信号的转换 • 时域频域的转换(FFTs) • 滤波器(FFT, FIR)
A203 Mini PC 用户指南说明书

A203 Mini PC User GrideA203 M ini PC User GuideNoticePacking ListProduct IntroductionBrief SpecificationsInstall DimensionI nterfacesJetpack KEY FEATURES IN JETPACKSample ApplicationsDevelop Tool1.1 NoticePlease read manual carefully before install, operate, or transport device.•Ensure that the correct power range is being used before powering the device.•Avoid hot plugging.•To properly turn off the power, please shut down the Ubuntu system first, and then cut off the power. Due to the particularity of the Ubuntu system, on the Nvidia developer kit, if the power is turned off when the startup is not completed, there will be a 0.03% probability of abnormality, which will cause the device to fail to start.Due to the use of the Ubuntu system, the same problem also exists on the device.•Do not use cables or connectors other than described in this manual.•Do not use device near strong magnetic fields.•Backup your data before transportation or device is idle.•Recommend to transport device in its original packaging.1.2 Packing ListA203 mini PC x 1Antenna x2Power adapter(Without Power cord ) x 1Processor NVIDIA Jetson Xavier NXAIPerformance21 TOPS (INT8)GPU 384-core NVIDIA Volta™ GPU with 48 Tensor CoresGPU Max Freq 1100 MHz1.3 A203 Mini PC Product IntroductionBriefA203 Mini PC is a powerful and extremely small intelligent edge computer to bring modern AI to the edge, the smaller form factor than the Jetson NX Developer Kit delivers the same AI power for up to 21 TOPs. For smart cities, security, industrial automation, smart factories, and other edge AI solution providers, A203 Industrial Mini PC combines exceptional AI performance, and sufficient storage with a rich set of IOs— HDMI, 2x USB3s, RS232, I2Cs, and GPIOs, supports operating range from -20°C to 80°C for AI embedded industrial and functional safety applications in a power-efficient, small form factor.SpecificationsProcessor ModuleCPU 6-core NVIDIA Carmel ARM®v8.2 64-bit CPU 6MB L2 + 4MB L3CPU Max Freq 2-core @ 1900MHz 4/6-core @ 1400MhzMemory 8 GB 128-bit LPDDR4x @ 1866MHz59.7GB/sStorage16 GB eMMC 5.1 Power10W|15W|20WPCIe 1 x1 + 1x4(PCIe Gen3, Root Port & Endpoint)CSI Camera Up to 6 cameras (36 via virtual channels)12 lanes MIPI CSI-2D-PHY 1.2 (up to 30 Gbps)Video Encode 2x 4K60 |4x 4K30 |10x 1080p60 |22x 1080p30 (H.265) 2x 4K60 |4x 4K30 |10x 1080p60 |20x 108p30 (H.264)Video Decode 2x 8K30 |6x 4K60 |12x 4K30 |22x 1080p60 |44x 1080p30 (H.265)2x 4K60 |6x 4K30 |10x 1080p60 |22x 1080p30(H.264)2 x4K30 |6x1080p60 |14x1080p30(VP9)Display 2 multi-mode DP 1.4/eDP 1.4/HDMI 2.0DLVisionAccelerator7-Way VLIW Vision Processor Networking 10/100/1000 BASE-TEthernetI nterfaceSpecification Network1 x RJ45 Gigabit Ethernet Connector (10/100/1000)Video Output1 x HDMI 2.0 (TYPE A)USB2 x USB 3.0 (TYPE A) + 1 x micro-USB SIM Card1 x SIM Card Function Keys 1 x Reset Button + 1 x Power Button Power SupplySpecification Input TypeDC Input Voltage+9V to +19V DC Input @ 3A Typical consumption 30WMechanicalSpecification Dimensions (W x H x D)100mm x 44mm x 59mmWeight EnvironmentalSpecification Operating Temperature-20℃-60℃, 0.2~0.3m/s air flow Storage Temperature-25℃ ~ +80℃Storage Humidity 10%-90% non-condensingI/OPower SupplyMechanicalEnvironmentalUp view (Unit:mm)Front view (Unit:mm)Left view (Unit:mm)Right view (Unit:mm)Mounting Hole (Unit:mm)1.4 I nstall DimensionDimensions as below:1.5 InterfacesInterface Name NoteHDMI HDMI1x HDMIUSB 3.0USB 3.02x USB3.0 Type-A(compatible USB2.0)RJ45Ethernet1GbE portSIM_Card SIM card slot for SIM cardNote: can work with SSDInterfaces Name NoteDC DC+9V(3A)~+19V(3A) RES Reset ButtonPOWER Power Buttonmicro-USB micro-USB1 x micro-USBNote: This product is self-starting when on powerInterfaces Name Note Debug Debug For debugging R S232RS232CAN CANIIC IIC1.6 Jetpack KEY FEATURES IN JETPACK1.6.1.1 JetPackNVIDIA JetPack SDK is the most comprehensive solution for building AI applications. It bundles Jetson platform software including TensorRT, cuDNN, CUDA Toolkit, VisionWorks, GStreamer, and OpenCV, all built on top of L4T with LTS Linux kernel.JetPack includes NVIDIA container runtime, enabling cloud-native technologies and workflows at the edge.JetPack SDK Cloud-Native on Jetson1.6.1.2 L4TNVIDIA L4T provides the Linux kernel, bootloader, NVIDIA drivers, flashing utilities, sample filesystem, and more for the Jetson platform.You can customize L4T software to fit the needs of your project. By following the platform adaptation and bring-up guide, you can optimize your use of the complete Jetson product feature set. Follow the links below for details about the latest software libraries, frameworks, and source packages.1.6.1.3 DeepStream SDK on JetsonNVIDIA’s DeepStream SDK delivers a complete streaming analytics toolkit for AI-based multi-sensor processing, video and image understanding. DeepStream is an integral part of NVIDIA Metropolis, the platform for building end-to-end services and solutions that transform pixel and sensor data to actionable insights. Learn about the latest 5.0 developer preview features in our developer news article.1.6.1.4 Isaac SDKThe NVIDIA Isaac SDK makes it easy for developers to create and deploy AI-powered robotics. The SDK includes the Isaac Engine (application framework), Isaac GEMs (packages with high-performance robotics algorithms), Isaac Apps (reference applications) and Isaac Sim for Navigation (a powerful simulation platform). These tools and APIs accelerate robot development by making it easier to add artificial intelligence (AI) for perception and navigation into robots.1.6.2 KEY FEATURES IN JETPACKOS NVIDIA L4T provides the bootloader, Linux kernel 4.9, necessary firmwares, NVIDIA drivers, sample filesystem based on Ubuntu 18.04, and more.JetPack 4.6.1 includes L4T 32.7.1 with these highlights:Support for Jetson AGX Xavier 64GB and Jetson Xavier NX 16GBTensorRT TensorRT is a high performance deep learning inference runtime for image classification, segmentation, and object detection neural networks. TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.cuDNN CUDA Deep Neural Network library provides high-performance primitives for deep learning frameworks. It provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.CUDAMultimedia APIComputer VisionCUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. The toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of your applications.VPI (Vision Programing Interface) is a software library that provides Computer Vision / Image Processing algorithms implemented on PVA1 (Programmable Vision Accelerator), GPU and CPUOpenCV is a leading open source library for computer vision, image processing and machine learning.VisionWorks2 is a software development package for Computer Vision (CV) and image processing.JetPack 4.6.1 includes VPI 1.2The Jetson Multimedia API package provides low level APIs for flexible application development.Camera application API: libargus offers a low-level frame-synchronous API for camera applications, with per frame camera parameter control, multiple (including synchronized) camera support, and EGL stream outputs. RAW output CSI cameras needing ISP can be used with either libargus or GStreamer plugin. In either case, the V4L2 media-controller sensor driver API is used.DeveloperTools SupportedSDKs andTools Cloud NativeJetPack componentSample locations on reference filesystem TensorRT/usr/src/tensorrt/samples/cuDNN/usr/src/cudnn_samples_/CUDA/usr/local/cuda-/samples/Multimedia API /usr/src/tegra_multimedia_api/VisionWorks /usr/share/visionworks/sources/samples//usr/share/visionworks-tracking/sources/samples//usr/share/visionworks-sfm/sources/samples/OpenCV/usr/share/OpenCV/samples/VPI /opt/nvidia/vpi/vpi-/samples1.7 Sample ApplicationsJetPack includes several samples which demonstrate the use of JetPack components. These are stored in the reference filesystem and can be compiled on the developer kit.CUDA Toolkit provides a comprehensive development environment for C and C++developers building high-performance GPU-accelerated applications with CUDA libraries.The toolkit includes Nsight Eclipse Edition, debugging and profiling tools including Nsight Compute, and a toolchain for cross-compiling applications.NVIDIA Nsight Systems is a low overhead system-wide profiling tool, providing the insightsdevelopers need to analyze and optimize software performance.NVIDIA DeepStream SDK is a complete analytics toolkit for AI-based multi-sensorprocessing and video and audio understanding.DeepStream SDK 6.0 supports JetPack 4.6.1NVIDIA Triton™ Inference Server simplifies deployment of AI models at scale. Triton Inference Server is open source and supports deployment of trained AI models fromNVIDIA TensorRT, TensorFlow and ONNX Runtime on Jetson. On Jetson, Triton InferenceServer is provided as a shared library for direct integration with C API.Jetson brings Cloud-Native to the edge and enables technologies like containers andcontainer orchestration. NVIDIA JetPack includes NVIDIA Container Runtime withDocker integration, enabling GPU accelerated containerized applications on Jetsonplatform.NVIDIA hosts several container images for Jetson on NVIDIA NGC. Some are suitable forsoftware development with samples and documentation and others are suitable forproduction software deployment, containing only runtime components. Find moreinformation and a list of all container images at the Cloud-Native on Jetson page.1.8Developer ToolsJetPack includes the following developer tools. Some are used directly on a Jetson system, and others run on a Linux host computer connected to a Jetson system.Tools for application development and debugging:NSight Eclipse Edition for development of GPU accelerated applications: Runs on Linux host computer.Supports all Jetson products.CUDA-GDB for application debugging: Runs on the Jetson system or the Linux host computer. Supports all Jetson products.CUDA-MEMCHECK for debugging application memory errors: Runs on the Jetson system. Supports allJetson products.Tools for application profiling and optimization:NSight Systems for application multi-core CPU profiling: Runs on the Linux host computer. Helps youimprove application performance by identifying slow parts of code. Supports all Jetson products.NVIDIA® Nsight™ Compute kernel profiler: An interactive profiling tool for CUDA applications. It providesdetailed performance metrics and API debugging via a user interface and command line tool.NSight Graphics for graphics application debugging and profiling: A console-grade tool for debugging andoptimizing OpenGL and OpenGL ES programs. Runs on the Linux host computer. Supports all Jetsonproducts.Abbreviation CECCANDPeDP eMMC HDMII2CI2SLDO LPDDR4x PCIe (PEX) PCMPHYPMICRTCSDIOSLVSSPIUARTUFSUSB DefinitionConsumer Electronic ControlController Area NetworkVESA® DisplayPort® (output)Embedded DisplayPortEmbedded MMCHigh Definition Multimedia InterfaceInter ICInter IC Sound InterfaceLow Dropout (voltage regulator)Low Power Double Data Rate DRAM, Fourth-generation Peripheral Component Interconnect Express interface Pulse Code ModulationPhysical LayerPower Management ICReal Time ClockSecure Digital I/O InterfaceScalable Low Voltage SignalingSerial Peripheral InterfaceUniversal Asynchronous Receiver-Transmitter Universal Flash StorageUniversal Serial BusAbbreviations and Definitions。
Intel-IPP7

Product BriefIntel® Integrated Performance Primitives 7.0For Windows*, Linux* and Mac OS* X Multicore Power for Multimedia and Data Processing∙Library of functions for multimedia, dataprocessing, and communications applications∙Outstanding performance – highlyoptimized and multicore ready Intel® Integrated Performance Primitives (Intel® IPP) is an extensive library of multicore-ready, highly optimized software functions for multimedia, data processing, and communications applications. Intel IPP offers thousands of optimized functions covering frequently usedfundamental algorithms.“Intel® IPP provided a 300 percent improvement in the number of users who can simultaneously participate in a webcast.”Leo Volfson, President and Chief Technology Officer, Inetcam, Inc.PerformanceIntel® IPP vs. Original zlib, up to 1.4x fasterIntel® bzip2 vs. Original zlib, up to 1.6x fasterIntel® IPP vs. Original gzip, up to 1.8x fasterIntel® IPP vs. Original Izopack, up to 1.9x fasterIntel® IPP vs. Original OpenSSL, up to 2.5x fasterContinuous Improvement:Intel® IPP 7.0 vs. Intel IPP 6.1 up to 4.3x fasteFeaturesPerformanceInstruction set level optimizationsIntel IPP functions are designed to deliver performance beyond what optimizing compilers alone can deliver. For each Intel® Architecture-compatible processor, Intel IPP automatically detects the instruction set level and dispatches optimized code to take advantage of the Intel Architecture SIMD instructions.For detailed performance data, visit the Intel IPP product Web page at /software/products/ipp .Support for multicore processors Intel® IPP functions are fully thread-safe, and many are internally threaded to help you get the most out of today’s multicore processors. See below for a complete list of supported CPUs. ProductivityRich set of pre-defined functions With more than 11,000 functions across 15 domains, Intel® IPP provides a rich set of algorithms to speed your application development.Source code usage samplesJumpstart your application development with source code samples incorporating Intel® IPP, includingvideo/audio/speech codecs, image processing, data compression, and other high-level algorithm implementations. Additionally, there are samples showing how to use IPP in Java* and .NET* applications. Future ProofSupport for future instruction sets and additional CPU cores Intel® IPP is optimized for current multicore and future manycore processors. As new instruction sets become supported in Intel CPUs, just relink with the latest version of Intel IPP to achieve the greater application performance provided by the new instruction sets.Royalty-free redistribution Redistribute unlimited copies of the runtime libraries with your application. New Features in Intel® IPPIntel® Advanced Vector Extensions performance optimizations Achieve new performance optimizations for the Intel® Advanced Vector Extensions (Intel AVX) for faster floating-point operations in the signal processing and image processing domains for Sandy Bridge and later processors. New instruction optimizations for AES and CRC32CAccess Advanced Encryption Standard (AES) and CRC32C new instruction optimizations for major performance increases in data compression and cryptography funct ions for Intel® Core™ i7 processors. Windows* Imaging Component API supportEnjoy faster and easier adoption of Intel® IPP image codecs by Windows* developers. JPEG codec performance improvement Dramatically improve JPEG codec performance scaling up to 6x over 8 cores.New JPEG-XR codec sample (previously known as HD Photo)A new image compression standard:Get up to 2x the compression level for the same image quality without the need for greater memory or computing resources.Support lossless and lossy compression as well as incremental decompression of specific image regions. Support higher dynamic range and color depth than existing image codecs.Improved data compression algorithmsBenefit from improved and fully productized binary and source drop-in data compression algorithms (bzip2, zlib and gzip).Purchase Options: Language Specific SuitesSeveral suites are available combining the tools to build, verify and tune your application. The products covered in this product brief are highlighted in green. Single or multi-user licenses and volume, academic, and student discounts are available.C o m p o n e n t s Intel® C / C++ Compiler ● ● ● ● ● ● Intel® Fortran Compiler● ● ● ● ● ● Intel® Integrated Performance Primitives 3 ● ● ● ● ● ● Intel® Math Kernel Library 3● ● ● ● ● ● ● ● I ntel® Cilk™ Plus● ● ● ● ● ● Intel® Threading Building Blocks ● ● ● ● ● ● Intel® Inspector XE ● ● ● ● Intel® VTune™ Amplifier XE ● ● ● ● Static Security Analysis ● ● ● ● Intel® MPI Library● ● Intel® Trace Analyzer & Collector ● ● Rogue Wave IMSL* Library 2 ● Operating System 1W, LW, LW, LW, LW, L, MW, L, MW, LW, LNote: (1)1 Operating System: W=Windows, L= Linux, M= Mac OS* X. (2)2 Available in Intel® Visual Fortran Composer XE for Windows with IMSL* (3)3 Not available individually on Mac OS X, it is included in Intel® C++ & Fortran Composer XE suites for Mac OS XProcessor supportValidated for use with multiple generations of Intel® and compatible processors including but not limited to: 2nd Generation Intel® Core™2 processor, Intel® Core™2 processor, Intel® Core™ processor, Intel® Xeon™ processor, Intel® Atom™ processor, Intel® Pentium® D processor, Intel® Pentium® M processor.Operating systems Use the same API for application development on multiple operating systems: Windows*, Linux*. and Mac OS* XDevelopment tools and environmentsFully compatible with other development tools from Intel such as compilers, performance and threading analyzers, and other Intel® performance libraries. In addition, Intel IPP is easily used and integrated with popular development tools andenvironments such as Microsoft Visual Studio* (2005, 2008, 2010), Xcode*, Eclipse*, and the GNU Compiler Collection* (GCC*). Programming languages Natively supports C and C++ development; cross-language usage examples provided for C#/.NET and Java*.System requirements Please refer to /software/products/systemrequirements/ for details on hardware and software requirements. SupportAll product updates, Intel® Premier Support services and Intel® Support Forums are included for one year. Intel Premier Support gives you confidential support, technical notes, application notes, and the latest documentation. Join the Intel® Support Forums community to learn, contribute, or just browse! /en-us/forums.Download a trial version today/software/products/eval© 2011, Intel Corporation. All rights reserved. Intel, the Intel logo, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. INTEL_IPP_PB/Rev1011Optimization Notice Notice revision #20110804Intel’s compilers may or may not optimize to the same degree for non -Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.。
Intel Parallel Studio XE 2013 for Windows 安装指南和发行

Intel® Parallel Studio XE 2013for Windows*Installation Guide and Release Notes Document number: 323803-003US25 June 2013Table of Contents1 Introduction (1)1.1 What’s New (1)1.1.1 Changes from Intel Parallel Studio XE 2011 (2)1.2 Product Contents (2)1.3 System Requirements (2)1.4 Documentation (4)1.5 Technical Support (4)2 Installation (4)2.1 Pre-Installation Steps (4)2.1.1 Configure Visual Studio for 64-bit Applications (4)2.2 Installation (4)2.2.1 Using a License Server (5)3 Disclaimers, Notices and Legal Information (5)1 IntroductionThis document describes system requirements and how to install Intel® Parallel Studio XE 2013. Additional release notes for each component, with details of changes and additional technical information, can be found after installation, in the respective components’ Documentation folder.First-time users should view the Getting Started page that is displayed at the end of installation.1.1 What’s NewThis section highlights important changes from the previous product version and changes in product updates. For information on what is new in each component, please read the individual component release notes.Update 4 – June 2013∙Component products updated to current versionsUpdate 3 – March 2013∙Component products updated to current versionsUpdate 2 – February 2013∙Component products updated to current versionsUpdate 1 – October 2012∙Component products updated to current versions1.1.1 Changes from Intel Parallel Studio XE 2011∙Intel® Advisor XE is a new component included in this product.∙Other components updated to current versions∙Support added for Microsoft Windows 8* and Microsoft Windows Server 2012*∙Microsoft Visual Studio 2012* is now supported∙Microsoft Visual Studio 2005* is no longer supported∙Microsoft Windows Vista* and Microsoft Windows Server 2003* are no longer supported ∙Microsoft Windows XP* is deprecated – support will be removed in a future release1.2 Product ContentsIntel® Parallel Studio XE 2013 includes the following components:∙Intel® C++ Composer XE 2013 Update 5 - includes Intel® Integrated Performance Primitives (Intel® IPP), Intel® Threading Building Blocks (Intel® TBB) and Intel® MathKernel Library (Intel® MKL)∙Intel® Visual Fortran Composer XE 2013 Update 5 - includes Intel® Math Kernel Library (Intel® MKL)∙Intel® Advisor XE 2013 Update 3∙Intel® Inspector XE 2013 Update 6∙Intel® VTune™ Amplifier XE 2013 Update 9∙Sample programs∙On-disk documentation1.3 System RequirementsFor an explanation of architecture names, see http://intel.ly/mXIljK∙ A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium® 4 processor or later, or compatible non-Intel processor)o Incompatible or proprietary instructions in non-Intel processors may cause the analysis capabilities of this product to function incorrectly. Any attempt to analyzecode not supported by Intel® processors may lead to failures in this product.o For the best experience, a multi-core or multi-processor system is recommended ∙2GB RAM∙8GB free disk space for all product features and architectures∙Microsoft Windows XP*, Microsoft Windows 7*, Microsoft Windows 8*, Microsoft Windows Server 2012* or Microsoft Windows Server 2008*; 32-bit or “x64” editions -embedded editions not supportedo Support of Microsoft Windows XP is deprecated – a future major release of Intel® Parallel Studio XE will not support Windows XP∙One or more of:o Microsoft Visual Studio 2012* Professional Edition (or higher edition) with C++ component installedo Microsoft Visual Studio 2010* Professional Edition (or higher edition) with C++ and “x64 Compiler and Tools” components installed [1]o Microsoft Visual Studio 2008* Standard Edition (or higher edition) SP1 with C++ and “x64 Compiler and Tools” components installed [1]o For Intel® Visual Fortran, Intel® Advisor XE, Intel® Inspector XE and Intel® VTune™ Amplifier XE use only, Microsoft Visual Studio 2010* Shell and Librariesfrom the Intel® Visual Fortran Composer XE installationo For Intel® Visual Fortran, Intel® Advisor XE, Intel® Inspector XE and Intel® VTune™ Amplifier XE use only, Microsoft Visual Studio 2008* Shell and Librariesfrom an earlier version of Intel® Visual Fortran Composer XE or Intel® VisualFortran Compiler Professional EditionNotes:1. Microsoft Visual Studio 2008 Standard Edition installs the “x64 Compiler and Tools”component by default –the Professional and higher editions require a “Custom” install to select this. Microsoft Visual Studio 2010 and Visual Studio 2012 include this component by default.2. The default for the Intel® compilers is to build IA-32 architecture applications that requirea processor supporting the Intel® SSE2 instructions - for example, the Intel® Pentium®4 processor. A compiler option is available to generate code that will run on any IA-32architecture processor. However, if your application uses Intel® Math Kernel Library,Intel® Integrated Performance Primitives or Intel® Threading Building Blocks, executing the application will require a processor supporting the Intel® SSE2 instructions.3. Applications built with Intel® Compilers can be run on the same Windows versions asspecified above for development. Applications may also run on non-embedded 32-bitversions of Microsoft Windows earlier than Windows XP, though Intel does not test these for compatibility. Your application may depend on a Win32 API routine not present inolder versions of Windows. You are responsible for testing application compatibility. You may need to copy certain run-time DLLs onto the target system to run your application.1.4 DocumentationProduct documentation can be accessed through the Help menu in Microsoft Visual Studio. It can also be found, along with “Getting Started” information, in the Windows “Start” menu und er Intel Parallel Studio XE 2013. Please note that if you view the documentation in Microsoft Internet Explorer*, the browser may display a security warning when you click on links to open a documentation set. If you see this warning, you should click the option to proceed.1.5 Technical SupportIf you did not register your compiler during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit/software/products/supportNote: If your distributor provides technical support for this product, please contact them for support rather than Intel.2 Installation2.1 Pre-Installation Steps2.1.1 Configure Visual Studio for 64-bit ApplicationsIf you will be developing 64-bit applications you may need to change the configuration of Visual Studio to add 64-bit support.If you are using Visual Studio 2008 Standard Edition, Visual Studio 2012 or Visual Studio 2010, no configuration is needed to build 64-bit applications. For other editions:1. From Control Panel > Add or Remove Programs, sele ct “Microsoft Visual Studio 2008 >Change/Remove. The Visual Studio Maintenance Mode window will appear. Click Next.2. Click Add or Remove Features3. Under “Select features to install”, expand Language Tools > Visual C++4. If the box “X64 Compiler and Tools” is not checked, check it, then click Update. If thebox is already checked, click Cancel.2.2 InstallationThe installation of the product requires a valid license file or serial number. If you are evaluating the product, you can also choose the “Evaluate this product (no serial number required)” option during installation.If you received your product on DVD, insert the first product DVD in your computer’s DVD drive; the installation should start automatically. If it does not, open the top-level folder of the DVD drive in Windows Explorer and double-click on setup.exe.If you received your product as a downloadable file, double-click on the executable file (.EXE) to begin installation. Note that there are several different downloadable files available, each providing different combinations of components. Please read the download web page carefully to determine which file is appropriate for you.You do not need to uninstall previous versions or updates before installing a newer version –the new version will coexist with the older versions. If you want to remove older versions, you may do so before or after installing the newer one.2.2.1 Using a License ServerIf you have purchased a “floating” license, see http://intel.ly/oPEdEe for information on how to install using a license file or license server. This article also provides a source for the Intel® License Manager for FLEXlm* product that can be installed on any of a wide variety of systems.3 Disclaimers, Notices and Legal InformationINFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.The information here is subject to change without notice. Do not finalize a design with this information.The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:/design/literature.htmCeleron, Centrino, Cilk, Intel, Intel logo, Intel386, Intel486, Intel Atom, Intel Core, Itanium, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others.Copyright © 2013 Intel Corporation. All Rights Reserved.。
ivf配置mkl

Some examples do not pause before the end of execution. To see the results printed in the Console window, set a breakpoint at the very end of the program or add the 'pause' statement before the last 'end' statement.
Right-click the Header Files folder under <project name> and select Add > Existing Item... from the drop-down menu.
The Add Existing Item - <project name> window opens.
The Add Existing Item - <project name> window closes, and the selected files appear in the Source Files folder in Solution Explorer. Some examples with the "use" statements require the next two steps.
The Additional Library Directories window opens.
MKL在MinGW环境下的编译

MKL在MinGW环境下的编译MKL(Math Kernel Library)是英特尔公司出品的数学核心库,优化了科学计算、数据处理等计算密集型工作的效率,广泛应用于机器学习、数据挖掘、信号处理、图像处理等领域。
MKL的编译一般关联到英特尔编译器,但是很多时候我们想在MinGW环境下编译MKL,这里给出一份具体的教程。
第一步:准备必要的环境在开始编译前,需要确保你有以下环境:- 需要使用MinGW-w64的GCC 5.1.0及以上版本。
- 下载MKL相关的头文件和链接库。
- 需要安装Cmake来构建MinGW Makefiles。
第二步:准备环境变量在MinGW的安装文件夹中的“bin”目录下,添加以下路径到环境变量中:- C:\MinGW\bin- C:\MinGW\mingw32\bin同时,在MKL的安装目录下,添加以下的路径到系统环境变量中:- C:\Program Files(x86)\IntelSWTools\compilers_and_libraries\windows\mkl\bin\in tel64- C:\Program Files(x86)\IntelSWTools\compilers_and_libraries\windows\mpi\intel6 4\bin- C:\Program Files(x86)\IntelSWTools\compilers_and_libraries\windows\tbb\bin 第三步:配置Cmake在命令行工具中进入MKL的安装路径,并创建一个新的文件夹,如“build”。
接着运行以下命令:$ cmake -G "MinGW Makefiles" -DFortran_COMPILER_NAME=gfortran -DCMAKE_Fortran_COMPILER=gfortran ../其中,“-G”选项用来制定生成的Makefile的类型,设置为MinGW即可。
Divide and Conquer Eigenvalue Solver Parallelization分而治之的特征值求解器的并行化-PPT课件

• GMP
– arbitrary precision arithmetic operations on integer numbers
Intel® Math Kernel Library Contents
– Solvers and eigensolvers. Many hundreds of routines total!
• ScaLAPACK
– computational, driver and auxiliary routines for distributed-memory architectures via MPI/BLACS
– May have with SMT-on
• If the user is running MPI, and we cannot detect that it is
being used in a thread-safe mode, we will scale down to 1 thread
• If the user is a parallel region, and MKL_DYNAMIC is TRUE
• We have special threading environment variables
– MKL_DYNAMIC – MKL_NUM_THREADS – MKL_DOMAIN_NUM_THREADS – (routine equivalents always have precedence)
Demonstrates we don’t focus on latency, or more than about 30 MPI calls
Intel 64 和 IA-32 架构软件开发者手册(第二卷)说明书

Intel® 64 and IA-32 ArchitecturesSoftware Developer’s ManualVolume 2 (2A, 2B, 2C & 2D):Instruction Set Reference, A-ZNOTE:The Intel 64 and IA-32 Architectures Software Developer's Manual consists of four volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-Z, Order Number 325383; System Programming Guide, Order Number 325384; Model-Specific Registers, Order Number 335592. Refer to all four volumes when evaluating your design needs.Order Number: 325383-071USOctober 2019Intel technologies features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at , or from the OEM or retailer.No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifica-tions. Current characterized errata are available on request.This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmapsCopies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting /design/literature.htm.Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.*Other names and brands may be claimed as the property of others.Copyright © 1997-2019, Intel Corporation. All Rights Reserved.CONTENTSPAGECHAPTER 1ABOUT THIS MANUAL1.1INTEL® 64 AND IA-32 PROCESSORS COVERED IN THIS MANUAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1 1.2OVERVIEW OF VOLUME 2A, 2B, 2C AND 2D: INSTRUCTION SET REFERENCE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4 1.3NOTATIONAL CONVENTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5 1.3.1Bit and Byte Order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5 1.3.2Reserved Bits and Software Compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5 1.3.3Instruction Operands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5 1.3.4Hexadecimal and Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-6 1.3.5Segmented Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-6 1.3.6Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-6 1.3.7 A New Syntax for CPUID, CR, and MSR Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-7 1.4RELATED LITERATURE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7CHAPTER 2INSTRUCTION FORMAT2.1INSTRUCTION FORMAT FOR PROTECTED MODE, REAL-ADDRESS MODE, AND VIRTUAL-8086 MODE. . . . . . . . . . . . . . . . . . . . 2-1 2.1.1Instruction Prefixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1 2.1.2Opcodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3 2.1.3ModR/M and SIB Bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3 2.1.4Displacement and Immediate Bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3 2.1.5Addressing-Mode Encoding of ModR/M and SIB Bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-4 2.2IA-32E MODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7 2.2.1REX Prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-8 2.2.1.1Encoding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-8 2.2.1.2More on REX Prefix Fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-8 2.2.1.3Displacement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11 2.2.1.4Direct Memory-Offset MOVs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11 2.2.1.5Immediates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11 2.2.1.6RIP-Relative Addressing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12 2.2.1.7Default 64-Bit Operand Size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12 2.2.2Additional Encodings for Control and Debug Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12 2.3INTEL® ADVANCED VECTOR EXTENSIONS (INTEL® AVX). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13 2.3.1Instruction Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13 2.3.2VEX and the LOCK prefix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13 2.3.3VEX and the 66H, F2H, and F3H prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13 2.3.4VEX and the REX prefix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13 2.3.5The VEX Prefix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14 2.3.5.1VEX Byte 0, bits[7:0] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15 2.3.5.2VEX Byte 1, bit [7] - ‘R’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15 2.3.5.33-byte VEX byte 1, bit[6] - ‘X’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16 2.3.5.43-byte VEX byte 1, bit[5] - ‘B’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16 2.3.5.53-byte VEX byte 2, bit[7] - ‘W’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16 2.3.5.62-byte VEX Byte 1, bits[6:3] and 3-byte VEX Byte 2, bits [6:3]- ‘vvvv’ the Source or Dest Register Specifier. . . . . 2-16 2.3.6Instruction Operand Encoding and VEX.vvvv, ModR/M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17 2.3.6.13-byte VEX byte 1, bits[4:0] - “m-mmmm”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18 2.3.6.22-byte VEX byte 1, bit[2], and 3-byte VEX byte 2, bit [2]- “L” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18 2.3.6.32-byte VEX byte 1, bits[1:0], and 3-byte VEX byte 2, bits [1:0]- “pp”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18 2.3.7The Opcode Byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19 2.3.8The MODRM, SIB, and Displacement Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19 2.3.9The Third Source Operand (Immediate Byte) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19 2.3.10AVX Instructions and the Upper 128-bits of YMM registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19 2.3.10.1Vector Length Transition and Programming Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19Vol. 2A iiiCONTENTSiv Vol. 2A PAGE2.3.11AVX Instruction Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20 2.3.12Vector SIB (VSIB) Memory Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20 2.3.12.164-bit Mode VSIB Memory Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21 2.4AVX AND SSE INSTRUCTION EXCEPTION SPECIFICATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21 2.4.1Exceptions Type 1 (Aligned memory reference) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-26 2.4.2Exceptions Type 2 (>=16 Byte Memory Reference, Unaligned). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27 2.4.3Exceptions Type 3 (<16 Byte memory argument) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-28 2.4.4Exceptions Type 4 (>=16 Byte mem arg no alignment, no floating-point exceptions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29 2.4.5Exceptions Type 5 (<16 Byte mem arg and no FP exceptions). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-30 2.4.6Exceptions Type 6 (VEX-Encoded Instructions Without Legacy SSE Analogues) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-31 2.4.7Exceptions Type 7 (No FP exceptions, no memory arg) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32 2.4.8Exceptions Type 8 (AVX and no memory argument) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32 2.4.9Exceptions Type 11 (VEX-only, mem arg no AC, floating-point exceptions). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33 2.4.10Exceptions Type 12 (VEX-only, VSIB mem arg, no AC, no floating-point exceptions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34 2.5VEX ENCODING SUPPORT FOR GPR INSTRUCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34 2.5.1Exceptions Type 13 (VEX-Encoded GPR Instructions). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35 2.6INTEL® AVX-512 ENCODING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35 2.6.1Instruction Format and EVEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-36 2.6.2Register Specifier Encoding and EVEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-38 2.6.3Opmask Register Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-38 2.6.4Masking Support in EVEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-39 2.6.5Compressed Displacement (disp8*N) Support in EVEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-39 2.6.6EVEX Encoding of Broadcast/Rounding/SAE Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41 2.6.7Embedded Broadcast Support in EVEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41 2.6.8Static Rounding Support in EVEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41 2.6.9SAE Support in EVEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41 2.6.10Vector Length Orthogonality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41 2.6.11#UD Equations for EVEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-42 2.6.11.1State Dependent #UD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-42 2.6.11.2Opcode Independent #UD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-42 2.6.11.3Opcode Dependent #UD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-43 2.6.12Device Not Available. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-44 2.6.13Scalar Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-44 2.7EXCEPTION CLASSIFICATIONS OF EVEX-ENCODED INSTRUCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-44 2.7.1Exceptions Type E1 and E1NF of EVEX-Encoded Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-47 2.7.2Exceptions Type E2 of EVEX-Encoded Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-49 2.7.3Exceptions Type E3 and E3NF of EVEX-Encoded Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-50 2.7.4Exceptions Type E4 and E4NF of EVEX-Encoded Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-52 2.7.5Exceptions Type E5 and E5NF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-54 2.7.6Exceptions Type E6 and E6NF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-56 2.7.7Exceptions Type E7NM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-58 2.7.8Exceptions Type E9 and E9NF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-59 2.7.9Exceptions Type E10 and E10NF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-61 2.7.10Exception Type E11 (EVEX-only, mem arg no AC, floating-point exceptions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-63 2.7.11Exception Type E12 and E12NP (VSIB mem arg, no AC, no floating-point exceptions). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-64 2.8EXCEPTION CLASSIFICATIONS OF OPMASK INSTRUCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-66CHAPTER 3INSTRUCTION SET REFERENCE, A-L3.1INTERPRETING THE INSTRUCTION REFERENCE PAGES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1 3.1.1Instruction Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-1 3.1.1.1Opcode Column in the Instruction Summary Table (Instructions without VEX Prefix). . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-2 3.1.1.2Opcode Column in the Instruction Summary Table (Instructions with VEX prefix). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-3 3.1.1.3Instruction Column in the Opcode Summary Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5 3.1.1.4Operand Encoding Column in the Instruction Summary Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-8 3.1.1.564/32-bit Mode Column in the Instruction Summary Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-8 3.1.1.6CPUID Support Column in the Instruction Summary Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-9 3.1.1.7Description Column in the Instruction Summary Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-9 3.1.1.8Description Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-9CONTENTSPAGE 3.1.1.9Operation Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-93.1.1.10Intel® C/C++ Compiler Intrinsics Equivalents Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-12 3.1.1.11Flags Affected Section. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-14 3.1.1.12FPU Flags Affected Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-14 3.1.1.13Protected Mode Exceptions Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-14 3.1.1.14Real-Address Mode Exceptions Section. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-15 3.1.1.15Virtual-8086 Mode Exceptions Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-15 3.1.1.16Floating-Point Exceptions Section. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-16 3.1.1.17SIMD Floating-Point Exceptions Section. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-16 3.1.1.18Compatibility Mode Exceptions Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-16 3.1.1.1964-Bit Mode Exceptions Section. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-16 3.2INSTRUCTIONS (A-L). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17AAA—ASCII Adjust After Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18AAD—ASCII Adjust AX Before Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20AAM—ASCII Adjust AX After Multiply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22AAS—ASCII Adjust AL After Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24ADC—Add with Carry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26ADCX — Unsigned Integer Addition of Two Operands with Carry Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29ADD—Add. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31ADDPD—Add Packed Double-Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33ADDPS—Add Packed Single-Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36ADDSD—Add Scalar Double-Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39ADDSS—Add Scalar Single-Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-41ADDSUBPD—Packed Double-FP Add/Subtract. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-43ADDSUBPS—Packed Single-FP Add/Subtract. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45ADOX — Unsigned Integer Addition of Two Operands with Overflow Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-48AESDEC—Perform One Round of an AES Decryption Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-50AESDECLAST—Perform Last Round of an AES Decryption Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52AESENC—Perform One Round of an AES Encryption Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54AESENCLAST—Perform Last Round of an AES Encryption Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56AESIMC—Perform the AES InvMixColumn Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-58AESKEYGENASSIST—AES Round Key Generation Assist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-59AND—Logical AND. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-61ANDN — Logical AND NOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63ANDPD—Bitwise Logical AND of Packed Double Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-64ANDPS—Bitwise Logical AND of Packed Single Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67ANDNPD—Bitwise Logical AND NOT of Packed Double Precision Floating-Point Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-70ANDNPS—Bitwise Logical AND NOT of Packed Single Precision Floating-Point Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73ARPL—Adjust RPL Field of Segment Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-76BEXTR — Bit Field Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-78BLENDPD — Blend Packed Double Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-79BLENDPS — Blend Packed Single Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-81BLENDVPD — Variable Blend Packed Double Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83BLENDVPS — Variable Blend Packed Single Precision Floating-Point Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-85BLSI — Extract Lowest Set Isolated Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-88BLSMSK — Get Mask Up to Lowest Set Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-89BLSR — Reset Lowest Set Bit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-90BNDCL—Check Lower Bound. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91BNDCU/BNDCN—Check Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-93BNDLDX—Load Extended Bounds Using Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-95BNDMK—Make Bounds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-98BNDMOV—Move Bounds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-100BNDSTX—Store Extended Bounds Using Address Translation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-103BOUND—Check Array Index Against Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-106BSF—Bit Scan Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-108BSR—Bit Scan Reverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-110BSWAP—Byte Swap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-112BT—Bit Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-113BTC—Bit Test and Complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-115Vol. 2A v。
Intel-AVX-On-Server

How Intel® AVX Improves Performance on Server ApplicationSubmitted by Thai Le (Intel) on Thu, 02/27/2014 - 14:04The latest Intel® Xeon® processor E7 v2 family includes a feature called Intel® Advanced Vector Extensions (Intel® AVX), which can potentially improve application performance. Here we will explain the context, and provide an example of how using Intel® AVX improved performance for a commonly known benchmark.For existing vectorized code that uses floating point operations, you can gain a potential performance boost when running on newer platforms such as the Intel® Xeon® processor E7 v2 family, by doing one of the following:1. Recompile your code, using the Intel® compiler with the proper AVX switch to convert existingSSE code. See the Intel® AVX State Transitions: Migrating SSE Code to AVX2. Modify your code's function calls to leverage the Intel® Math Kernel Libraries (Intel® MKL) whichare already optimized to use AVX where supported3. Use the AVX intrinsic instructions. For high level language (such as C or C++) developers, youcan use Intel® Intrinsic instructions to make the call and recompile code. See the Intel® IntrinsicGuide and Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details4. Code in assembly instructions directly. For low level language (such as assembly) developers,you can use those equivalent AVX instruction from their existing SSE code. See the Intel® 64and IA-32 Architectures Optimization Reference Manual for more detailsIn this article, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain of three different sized workloads (30K, 40K, and 75K) from Intel AVX running on Windows* and Linux* operating systems. I will also share the list of AVX instructions that were executed and the equivalent SSE instructions for developers who are interested in direct coding.I used the following platform for the experiment:Procedure for running LINPACK:1. Download and install the following:a. Intel® Math Kernel Library – LINPACK Downloadhttps:///en-us/articles/intel-math-kernel-library-linpack-downloadb. Intel® Math Kernel Library (MKL)https:///en-us/intel-math-kernel-library-evaluation-options2. Create three different input files for 30K, 40K, and 75K from the “...\linpack” directory3. For AVX runs update files as follows:a. For Windows, update the runme_xeon64.bat file to take new input files you that havecreated runme_xeon64.bat file. For Linux, update the runme_xeon64 shell script file totake the new input files.b. The results will be in Glops similar to the Table 24. For Intel SSE runs, you will need to have an Intel AVX disabled processor and repeat the abovesteps.What are the Intel AVX and the equivalent Intel SSE instructions that were executed?Table 1 has a list of Intel AVX instructions that were executed during the Intel AVX runs. I have provided the equivalent Intel SSE instructions for those developers who are thinking of moving their existing Intel SSE code to Intel AVX.Table 1– Intel® AVX and Equivalent Intel SSE InstructionsThe list in Table 1 is just a subset of the AVX instructions available. The full list can be obtained from the Intel® 64 and IA-32 Architectures Optimization Reference Manual.What is the performance gain for running the LINPACK benchmark with Intel AVX vs. Intel SSE enabled on the Intel Xeon E7 4890 v2 server?Table 2 shows the results from the three different workloads running on Windows* and Linux*. In the Ratio column, the numbers show that the LINPACK benchmark produces ~1.6x-1.7x better performance when running with the combination of an Intel AVX optimized LINPACK and an Intel AVX capable processor. This is just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.Table2 – Results and Performance Gain from the LINPACK benchmarkConclusionFrom our LINPACK experiment, we see compelling performance benefits when going to an AVX-enabled Intel Xeon processor; in this specific case, we saw a performance increase of ~1.6x-1.7x in our test environment,which is a strong case for developers who have SSE-enabled code and are weighing the benefit of moving to a newer Intel® Xeon® processor-based system with AVX. The reference materials below can help developers learn how to migrate existing SSE code to Intel AVX code.References:∙AVX on Intel Developer Zone∙Introduction to Intel Advanced Vector Extensions∙Which applications are most likely to benefit from recompilation for Intel Advanced Vector Extensions (Intel AVX)?∙How to compile for Intel AVX∙Intel® Advanced Vector Extensions Programming Reference∙Intel® AVX State Transitions: Migrating SSE Code to AVXCategories:∙Cloud Computing∙Cluster Computing∙Vectorization∙Intel® Math Kernel Library∙Intel® Parallel Studio XE Composer Edition∙Intel® Advanced Vector Extensions∙Intel® Streaming SIMD Extensions∙C/C++∙Server∙Desktop∙DevelopersTags:∙vectorization∙MKL∙AVX∙cloud。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Visual Studio环境下联合Intel Math Kernel Library编程翻译:Lyra(Godblessmee)说明:本文翻译自Intel C++ Composer XE 2011的帮助文档,要实现文中所述功能需要先安装Intel C++ Composer XE 2011以及Visual Studio C++集成模块,与Microsoft Visual Studio 2005/2008/2010。
Visual C++环境下连接MKL配置对于VS2010:●依次选择:Project->Properties->Configuration Properties->Intel Performance Libraries;●改变Use MKL属性设置,方法为:选择Parallel、Sequential或者Cluster;对于VS2005/2008:●依次选择:Project->Intel C++ Composer XE 2011->Select Build Components;●从Use MKL下拉式菜单中,选择Parallel、Sequential或者Cluster。
特殊的MKL库取决于你的项目设置。
更多细节请查阅相关文档。
创建、配置、运行Intel C++或VC++ 2008项目本示例演示如何创建一个包含Intel MKL的VC++项目,环境为VS 2008。
下面的步骤创建了一个Win32/Debug项目,包含Intel MKL的示例。
关于如何创建不同类型的VS项目,请参考MSDN。
创建步骤如下:1.创建一个C项目:a)打开VS2008;b)在主菜单中选择File->New->Project,打开New Project窗口;c)选择Project Types->Visual C++->Win32,然后选择Templates->Win32 ConsoleApplication,在Name处输入<Project name>,比如MKL_CBLAS_CAXPYIX,点击OK,New Project窗口关闭,然后Win32 Application Wizard-<Project name>窗口打开;d)选择Next,然后选择Application Settings,选择Additional options->Empty project,点击Finish,Win32 Application Wizard-<Project name>窗口关闭;下面的步骤将在Solution Explorer窗口中进行,在主菜单中选择View Solution Explorer打开它。
2.(可选)要切换到Intel C/C++项目,右键点击<Project name>,从下拉式菜单中选择Convert,即可使用Intel C++系统;3.将Intel MKL例程加入项目:a)在<Project name>下面右击Source Files文件夹,在下拉式菜单中选择Add->Existing Item…,此时Add Existing Item-<Project name>窗口打开;b)浏览Intel MKL例程目录,比如<mkl directory>\examples\cblas\source,选择后缀名为“.c”的例程文件与支持文件,比如cblas_caxpyix.c和common_func.c,更多关于例程支持文件的信息,查看Support Files for Intel MKL Examples,点击Add。
随后Add Existing Item-<Project name>窗口关闭,而选择的文件在Solution Explorer的Source Files文件夹下面出现。
下面的步骤用于调整项目属性。
4.选择<Project name>;5.在主菜单中,选择Project->Properties来打开<Project name> Properties Page窗口;6.选择Intel MKL Include dependencies:a)选择Configuration Properties->C/C++->General,在窗口右侧选择AdditionalInclude Directories->…(浏览按钮),然后Additional Include Directories窗口开启;b)单击New Line按钮,当新的一行出现在窗口中,点击浏览按钮,随后SelectDirectory窗口开启;c)浏览<mkl directories>\include目录,点击OK,Select Directories窗口关闭,IntelMKL的完整包含路径出现在Additional Include Directories窗口中;d)点击OK,关闭窗口。
7.设置library dependencies:a)选择Configuration Properties->Linker->General,在窗口右侧,选择AdditionalLibrary Directories->…(浏览按钮),随后Additional Library Directories窗口打开;b)点击New Line按钮(最上一行第一个按钮),当新的一行出现在窗口以后,点击浏览按钮,随后Select Directory窗口开启;c)浏览到MKL目录<mkl directories>\lib\<architecture>,此处<architecture>在{ ia32,intel64 }中选择,例如:<mkl directories>\lib\ia32(对绝大多数笔记本电脑和台式机选择ia32),单击OK,随后Select Directory窗口关闭,而Intel MKL库的完整路径将出现在Additional Library Directories窗口;d)再次单击New Line按钮,新的一行出现后,单击浏览按钮,随后Select Directory窗口开启;e)浏览到<Composer XE directory>compiler\lib\<architecture>,此处<architecture>在{ ia32,intel64 }中选择,例如:<Composer XE directories>\compiler\lib\ia32,点击OK,Select Directory窗口关闭,规定的完整路径出现在Additional LibraryDirectories窗口中;f)单击OK,关闭Additional Library Directories窗口;g)选择Configuration Properties->Linker->Input,在窗口右侧选择AdditionalDependencies->…(浏览按钮),随后additional Dependencies窗口开启;h)输入所需的库,比如:当<architecture>=ia32时,输入mkl_intel_c.lib、mkl_intel_thread.lib、mkl_core.lib、libiomp5md.lib,更多信息请查看SelectingLibraries to Link文档;i)单击OK,关闭Additional Dependencies窗口;j)如果Intel MKL例程目录下没有包含数据目录,跳至下一步;8.为Intel MKL 例程设置data dependencies:a)选择Configuration Properties->Debugging,在窗口右侧选择CommandArguments->-><Edit…>,随后Command Arguments窗口打开;b)输入加引号的数据文件路径,数据文件名称和例程文件名称相同,再加上一个”d”扩展,例如:”<mkl directory>\examples\cblas\data\cblas_caxpyix.d”;c)单击OK关闭Command Arguments窗口;9.点击OK,关闭<Project name> Properties Page窗口;10.某些例程在执行结束前不会暂停。
要观察控制台窗口显示的打印结果,在最后的“return 0;”前加入断点,或者在最后的“return 0;”加入“getchar();”;11.Build整个项目,选择Build->Build Solution;注意:你可能会看见关于不安全的函数或变量的警告,要消除这些警告信息,进入Project->Properties,在<Project name>Property Pages窗口打开以后,进入Configuration->C/C++->Preprocessor,在窗口右侧,选择Preprocess Definition,加入_CRT_SECURE_NO_WARNINGS,单击OK;12.运行例程,选择Debug->Start Debugging;13.从控制台窗口中你能看到运行结果,如果你加入了断点,按下Enter完成运行;如果你加入了getchar(),选择Debug->Continue。
Intel MKL例程支持文件下面的列表是各例程项目需要添加的支持文件:examples\cblas\source: common_func.cexamples\dftc\source: dfti_example_status_print.c、dfti_example_support.c已知项目创建过程限制不能按照Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project 的流程从下列路径中创建例程项目:examples\blasexamples\blas95examples\cdftcexamples\cdftfexamples\dftfexamples\fftw2x_cdfexamples\fftw2xcexamples\fftw2xfexamples\fftw3xcexamples\fftw3xfexamples\javaexamples\lapackexamples\lapack95在VS中浏览Intel MKL文档对于VS2005/2008:●选择目录下的Help->Contents,VS帮助文档将被显示;●点击Intel Math Kernel Library Help;●在展开的帮助文件树形图中,点击Intel MKL Reference Manual;对于VS2010:●将IDE配置为本地帮助,方法为:Help->Manage Help Settings,点击I want to use onlinehelp;●使用Help->View Help菜单列出可得的帮助文档,并打开Intel MKL 文档。