Cache-Based Architectures for High Performance Computing

合集下载

UniversityofWisconsin-Madison(

UniversityofWisconsin-Madison(

University of Wisconsin-Madison(UMW)周玉龙1101213442 计算机应用UMW简介美国威斯康辛大学坐落于美国密歇根湖西岸的威斯康辛州首府麦迪逊市,有着风景如画的校园,成立于1848年, 是一所有着超过150年历史的悠久大学。

威斯康辛大学是全美最顶尖的三所公立大学之一,是全美最顶尖的十所研究型大学之一。

在美国,它经常被视为公立的常青藤。

与加利福尼亚大学、德克萨斯大学等美国著名公立大学一样,威斯康辛大学是一个由多所州立大学构成的大学系统,也即“威斯康辛大学系统”(University of Wisconsin System)。

在本科教育方面,它列于伯克利加州大学和密歇根大学之后,排在公立大学的第三位。

除此之外,它还在本科教育质量列于美国大学的第八位。

按美国全国研究会的研究结果,威斯康辛大学有70个科目排在全美前十名。

在上海交通大学的排行中,它名列世界大学的第16名。

威斯康辛大学是美国大学联合会的60个成员之一。

特色专业介绍威斯康辛大学麦迪逊分校设有100多个本科专业,一半以上可以授予硕士、博士学位,其中新闻学、生物化学、植物学、化学工程、化学、土木工程、计算机科学、地球科学、英语、地理学、物理学、经济学、德语、历史学、语言学、数学、工商管理(MBA)、微生物学、分子生物学、机械工程、哲学、西班牙语、心理学、政治学、统计学、社会学、动物学等诸多学科具有相当雄厚的科研和教学实力,大部分在美国大学相应领域排名中居于前10名。

学术特色就学术方面的荣耀而言,威斯康辛大学麦迪逊校区的教职员和校友至今共获颁十七座诺贝尔奖和二十四座普立兹奖;有五十三位教职员是国家科学研究院的成员、有十七位是国家工程研究院的成员、有五位是隶属于国家教育研究院,另外还有九位教职员赢得了国家科学奖章、六位是国家级研究员(Searle Scholars)、还有四位获颁麦克阿瑟研究员基金。

威斯康辛大学麦迪逊校区虽然是以农业及生命科学为特色,但是令人注目,同时也是吸引许多传播科系学子前来留学的最大诱因,则是当前任教于该校新闻及传播研究所、在传播学界有「近代美国传播大师」之称的杰克·麦克劳(Jack McLauld)。

16节点COMA型

16节点COMA型

十六节点的阶层构造COMA 协议设计计算机应用技术COMA 简介【1】:COMA 是并行机结构体系中的一种,它的全称是Cache Only Memory Architecture ,它是NUMA 模型的一种特例,它利用各节点的高速缓存构成全局地址空间。

16结点的COMA 结构如图1所示。

由图可以看出,COMA 是由阶层BUS 构成的,其中P 指的是各个结点的Processor ,M 指的是各个结点的Memory ,D 指的是目录Directory 。

COMA 模型的各处理器结点中没有存储层次结构,全部的Memory 组成了全局地址空间。

而各个Memory 由上层的目录Directory 管理。

Directory 在结构上也是Memory ,但其中存储的不是数据而是某一页内容的位置信息。

图1 16结点COMA 型并行机 在开始时,COMA 使用特别的OS ,将所需的数据分配给各个Memory ,初始状态下每页只有一份,而Directory 中记录着哪些数据存在哪里。

P 使用某页时先向本Memory申请,若本Memory 无此页则向上一层Directory 申请,直到找到该页,将此页拷贝至本结点的Memory 。

这样系统将各个Memory 用作Cache ,运行一段时间后形成数据的最优配置。

因此,各个结点中的Memory 实际上是另外一种意义上的高速缓存Cache ,并且此Cache 的容量一般都大于2级Cache 的容量。

为了不在下文中引起不必要的误解,因此在下文中各节点的Memory 统一成为Cache 。

由于数据会在各个Cache 之间传递,因此某些数据就会在不止一个的Cache 中出现,这就会涉及到了Cache 一致性的问题。

本文基于16结点的COMA 并行机设计了一种Cache 一致性的方法。

本文所设计的COMA 并行机如图1所示,此模型共有三级目录结构,第一级的目录记录在其目录下的二个结点的Cache 中所存储的每一页的数据驻留在Cache 中的信息,包括每一页的位置及状态,第二级的目录记录了三级目录下的所有页的信息。

德尔·韦玛网络S4048T-ON交换机说明书

德尔·韦玛网络S4048T-ON交换机说明书

The Dell EMC Networking S4048T-ON switch is the industry’s latest data center networking solution, empowering organizations to deploy modern workloads and applications designed for the open networking era. Businesses who have made the transition away from monolithic proprietary mainframe systems to industry standard server platforms can now enjoy even greater benefits from Dell EMC open networking platforms. By using industry-leading hardware and a choice of leading network operating systems to simplify data center fabric orchestration and automation, organizations can tailor their network to their unique requirements and accelerate innovation.These new offerings provide the needed flexibility to transform data centers. High-capacity network fabrics are cost-effective and easy to deploy, providing a clear path to the software-defined data center of the future with no vendor lock-in.The S4048T-ON supports the open source Open Network Install Environment (ONIE) for zero-touch installation of alternate network operating systems, including feature rich Dell Networking OS.High density 1/10G BASE-T switchThe Dell EMC Networking S-Series S4048T-ON is a high-density100M/1G/10G/40GbE top-of-rack (ToR) switch purpose-builtfor applications in high-performance data center and computing environments. Leveraging a non-blocking switching architecture, theS4048T-ON delivers line-rate L2 and L3 forwarding capacity within a conservative power budget. The compact S4048T-ON design provides industry-leading density of 48 dual-speed 1/10G BASE-T (RJ45) ports, as well as six 40GbE QSFP+ up-links to conserve valuable rack space and simplify the migration to 40Gbps in the data center core. Each40GbE QSFP+ up-link can also support four 10GbE (SFP+) ports with a breakout cable. In addition, the S4048T-ON incorporates multiple architectural features that optimize data center network flexibility, efficiency and availability, including I/O panel to PSU airflow or PSU to I/O panel airflow for hot/cold aisle environments, and redundant, hot-swappable power supplies and fans. S4048T-ON supports feature-rich Dell Networking OS, VLT, network virtualization features such as VRF-lite, VXLAN Gateway and support for Dell Embedded Open Automation Framework.• The S4048T-ON is the only switch in the industry that supports traditional network-centric virtualization (VRF) and hypervisorcentric virtualization (VXLAN). The switch fully supports L2 VX-• The S4048T-ON also supports Dell EMC Networking’s Embedded Open Automation Framework, which provides enhanced network automation and virtualization capabilities for virtual data centerenvironments.• The Open Automation Framework comprises a suite of interre-lated network management tools that can be used together orindependently to provide a network that is flexible, available andmanageable while helping to reduce operational expenses.Key applicationsDynamic data centers ready to make the transition to software-defined environments• High-density 10Gbase-T ToR server access in high-performance data center environments• Lossless iSCSI storage deployments that can benefit from innovative iSCSI & DCB optimizations that are unique only to Dell NetworkingswitchesWhen running the Dell Networking OS9, Active Fabric™ implementation for large deployments in conjunction with the Dell EMC Z-Series, creating a flat, two-tier, nonblocking 10/40GbE data center network design:• High-performance SDN/OpenFlow 1.3 enabled with ability to inter-operate with industry standard OpenFlow controllers• As a high speed VXLAN Layer 2 Gateway that connects thehypervisor based ovelray networks with nonvirtualized infrastructure Key features - general• 48 dual-speed 1/10GbE (SFP+) ports and six 40GbE (QSFP+)uplinks (totaling 72 10GbE ports with breakout cables) with OSsupport• 1.44Tbps (full-duplex) non-blocking switching fabric delivers line-rateperformance under full load with sub 600ns latency• I/O panel to PSU airflow or PSU to I/O panel airflow• Supports the open source ONIE for zero-touch• installation of alternate network operating systems• Redundant, hot-swappable power supplies and fansDELL EMC NETWORKING S4048T-ON SWITCHEnergy-efficient 10GBASE-T top-of-rack switch optimized for data center efficiencyKey features with Dell EMC Networking OS9Scalable L2 and L3 Ethernet switching with QoS and a full complement of standards-based IPv4 and IPv6 features, including OSPF, BGP and PBR (Policy Based Routing) support• Scalable L2 and L3 Ethernet switching with QoS and a full complement of standards-based IPv4 and IPv6 features, including OSPF, BGP andPBR (Policy Based Routing) support• VRF-lite enables sharing of networking infrastructure and provides L3traffic isolation across tenants• Increase VM Mobility region by stretching L2 VLAN within or across two DCs with unique VLT capabilities like Routed VL T, VLT Proxy Gateway • VXLAN gateway functionality support for bridging the nonvirtualizedand the virtualized overlay networks with line rate performance.• Embedded Open Automation Framework adding automatedconfiguration and provisioning capabilities to simplify the management of network environments. Supports Puppet agent for DevOps• Modular Dell Networking OS software delivers inherent stability as well as enhanced monitoring and serviceability functions.• Enhanced mirroring capabilities including 1:4 local mirroring,• Remote Port Mirroring (RPM), and Encapsulated Remote PortMirroring (ERPM). Rate shaping combined with flow based mirroringenables the user to analyze fine grained flows• Jumbo frame support for large data transfers• 128 link aggregation groups with up to 16 members per group, usingenhanced hashing• Converged network support for DCB, with priority flow control(802.1Qbb), ETS (802.1Qaz), DCBx and iSCSI TLV• S4048T-ON supports RoCE and Routable RoCE to enable convergence of compute and storage on Active FabricUser port stacking support for up to six units and unique mixed mode stacking that allows stacking of S4048-ON with S4048T-ON to providecombination of 10G SFP+ and RJ45 ports in a stack.Physical48 fixed 10GBase-T ports supporting 100M/1G/10G speeds6 fixed 40 Gigabit Ethernet QSFP+ ports1 RJ45 console/management port with RS232signaling1 USB 2.0 type A to support mass storage device1 Micro-USB 2.0 type B Serial Console Port1 8 GB SSD ModuleSize: 1RU, 1.71 x 17.09 x 18.11”(4.35 x 43.4 x 46 cm (H x W x D)Weight: 23 lbs (10.43kg)ISO 7779 A-weighted sound pressure level: 65 dB at 77°F (25°C)Power supply: 100–240V AC 50/60HzMax. thermal output: 1568 BTU/hMax. current draw per system:4.6 A at 460W/100VAC,2.3 A at 460W/200VACMax. power consumption: 460 WattsT ypical power consumption: 338 WattsMax. operating specifications:Operating temperature: 32°F to 113°F (0°C to45°C)Operating humidity: 5 to 90% (RH), non-condensing Max. non-operating specifications:Storage temperature: –40°F to 158°F (–40°C to70°C)Storage humidity: 5 to 95% (RH), non-condensingRedundancyHot swappable redundant powerHot swappable redundant fansPerformance GeneralSwitch fabric capacity:1.44Tbps (full-duplex)720Gbps (half-duplex)Forwarding Capacity: 1080 MppsLatency: 2.8 usPacket buffer memory: 16MBCPU memory: 4GBOS9 Performance:MAC addresses: 160KARP table 128KIPv4 routes: 128KIPv6 hosts: 64KIPv6 routes: 64KMulticast routes: 8KLink aggregation: 16 links per group, 128 groupsLayer 2 VLANs: 4KMSTP: 64 instancesVRF-Lite: 511 instancesLAG load balancing: Based on layer 2, IPv4 or IPv6headers Latency: Sub 3usQOS data queues: 8QOS control queues: 12Ingress ACL: 16KEgress ACL: 1KQoS: Default 3K entries scalable to 12KIEEE compliance with Dell Networking OS9802.1AB LLDP802.1D Bridging, STP802.1p L2 Prioritization802.1Q VLAN T agging, Double VLAN T agging,GVRP802.1Qbb PFC802.1Qaz ETS802.1s MSTP802.1w RSTP802.1X Network Access Control802.3ab Gigabit Ethernet (1000BASE-T)802.3ac Frame Extensions for VLAN T agging802.3ad Link Aggregation with LACP802.3ae 10 Gigabit Ethernet (10GBase-X) withQSA802.3ba 40 Gigabit Ethernet (40GBase-SR4,40GBase-CR4, 40GBase-LR4) on opticalports802.3u Fast Ethernet (100Base-TX)802.3x Flow Control802.3z Gigabit Ethernet (1000Base-X) with QSA 802.3az Energy Efficient EthernetANSI/TIA-1057 LLDP-MEDForce10 PVST+Max MTU 9216 bytesRFC and I-D compliance with Dell Networking OS9General Internet protocols768 UDP793 TCP854 T elnet959 FTPGeneral IPv4 protocols791 IPv4792 ICMP826 ARP1027 Proxy ARP1035 DNS (client)1042 Ethernet Transmission1305 NTPv31519 CIDR1542 BOOTP (relay)1812 Requirements for IPv4 Routers1918 Address Allocation for Private Internets 2474 Diffserv Field in IPv4 and Ipv6 Headers 2596 Assured Forwarding PHB Group3164 BSD Syslog3195 Reliable Delivery for Syslog3246 Expedited Assured Forwarding4364 VRF-lite (IPv4 VRF with OSPF, BGP,IS-IS and V4 multicast)5798 VRRPGeneral IPv6 protocols1981 Path MTU Discovery Features2460 Internet Protocol, Version 6 (IPv6)Specification2464 Transmission of IPv6 Packets overEthernet Networks2711 IPv6 Router Alert Option4007 IPv6 Scoped Address Architecture4213 Basic Transition Mechanisms for IPv6Hosts and Routers4291 IPv6 Addressing Architecture4443 ICMP for IPv64861 Neighbor Discovery for IPv64862 IPv6 Stateless Address Autoconfiguration 5095 Deprecation of T ype 0 Routing Headers in IPv6IPv6 Management support (telnet, FTP, TACACS, RADIUS, SSH, NTP)VRF-Lite (IPv6 VRF with OSPFv3, BGPv6, IS-IS) RIP1058 RIPv1 2453 RIPv2OSPF (v2/v3)1587 NSSA 4552 Authentication/2154 OSPF Digital Signatures Confidentiality for 2328 OSPFv2 OSPFv32370 Opaque LSA 5340 OSPF for IPv6IS-IS1142 Base IS-IS Protocol1195 IPv4 Routing5301 Dynamic hostname exchangemechanism for IS-IS5302 Domain-wide prefix distribution withtwo-level IS-IS5303 3-way handshake for IS-IS pt-to-ptadjacencies5304 IS-IS MD5 Authentication5306 Restart signaling for IS-IS5308 IS-IS for IPv65309 IS-IS point to point operation over LANdraft-isis-igp-p2p-over-lan-06draft-kaplan-isis-ext-eth-02BGP1997 Communities2385 MD52545 BGP-4 Multiprotocol Extensions for IPv6Inter-Domain Routing2439 Route Flap Damping2796 Route Reflection2842 Capabilities2858 Multiprotocol Extensions2918 Route Refresh3065 Confederations4360 Extended Communities4893 4-byte ASN5396 4-byte ASN representationsdraft-ietf-idr-bgp4-20 BGPv4draft-michaelson-4byte-as-representation-054-byte ASN Representation (partial)draft-ietf-idr-add-paths-04.txt ADD PATHMulticast1112 IGMPv12236 IGMPv23376 IGMPv3MSDP, PIM-SM, PIM-SSMSecurity2404 The Use of HMACSHA- 1-96 within ESPand AH2865 RADIUS3162 Radius and IPv63579 Radius support for EAP3580 802.1X with RADIUS3768 EAP3826 AES Cipher Algorithm in the SNMP UserBase Security Model4250, 4251, 4252, 4253, 4254 SSHv24301 Security Architecture for IPSec4302 IPSec Authentication Header4303 ESP Protocol4807 IPsecv Security Policy DB MIBdraft-ietf-pim-sm-v2-new-05 PIM-SMwData center bridging802.1Qbb Priority-Based Flow Control802.1Qaz Enhanced Transmission Selection (ETS)Data Center Bridging eXchange (DCBx)DCBx Application TLV (iSCSI, FCoE)Network management1155 SMIv11157 SNMPv11212 Concise MIB Definitions1215 SNMP Traps1493 Bridges MIB1850 OSPFv2 MIB1901 Community-Based SNMPv22011 IP MIB2096 IP Forwarding T able MIB2578 SMIv22579 T extual Conventions for SMIv22580 Conformance Statements for SMIv22618 RADIUS Authentication MIB2665 Ethernet-Like Interfaces MIB2674 Extended Bridge MIB2787 VRRP MIB2819 RMON MIB (groups 1, 2, 3, 9)2863 Interfaces MIB3273 RMON High Capacity MIB3410 SNMPv33411 SNMPv3 Management Framework3412 Message Processing and Dispatching forthe Simple Network ManagementProtocol (SNMP)3413 SNMP Applications3414 User-based Security Model (USM) forSNMPv33415 VACM for SNMP3416 SNMPv23417 Transport mappings for SNMP3418 SNMP MIB3434 RMON High Capacity Alarm MIB3584 Coexistance between SNMP v1, v2 andv34022 IP MIB4087 IP Tunnel MIB4113 UDP MIB4133 Entity MIB4292 MIB for IP4293 MIB for IPv6 T extual Conventions4502 RMONv2 (groups 1,2,3,9)5060 PIM MIBANSI/TIA-1057 LLDP-MED MIBDell_ITA.Rev_1_1 MIBdraft-grant-tacacs-02 TACACS+draft-ietf-idr-bgp4-mib-06 BGP MIBv1IEEE 802.1AB LLDP MIBIEEE 802.1AB LLDP DOT1 MIBIEEE 802.1AB LLDP DOT3 MIB sFlowv5 sFlowv5 MIB (version 1.3)DELL-NETWORKING-SMIDELL-NETWORKING-TCDELL-NETWORKING-CHASSIS-MIBDELL-NETWORKING-PRODUCTS-MIBDELL-NETWORKING-SYSTEM-COMPONENT-MIBDELL-NETWORKING-TRAP-EVENT-MIBDELL-NETWORKING-COPY-CONFIG-MIBDELL-NETWORKING-IF-EXTENSION-MIBDELL-NETWORKING-FIB-MIBIT Lifecycle Services for NetworkingExperts, insights and easeOur highly trained experts, withinnovative tools and proven processes, help you transform your IT investments into strategic advantages.Plan & Design Let us analyze yourmultivendor environment and deliver a comprehensive report and action plan to build upon the existing network and improve performance.Deploy & IntegrateGet new wired or wireless network technology installed and configured with ProDeploy. Reduce costs, save time, and get up and running cateEnsure your staff builds the right skills for long-termsuccess. Get certified on Dell EMC Networking technology and learn how to increase performance and optimize infrastructure.Manage & SupportGain access to technical experts and quickly resolve multivendor networking challenges with ProSupport. Spend less time resolving network issues and more time innovating.OptimizeMaximize performance for dynamic IT environments with Dell EMC Optimize. Benefit from in-depth predictive analysis, remote monitoring and a dedicated systems analyst for your network.RetireWe can help you resell or retire excess hardware while meeting local regulatory guidelines and acting in an environmentally responsible way.Learn more at/lifecycleservicesLearn more at /NetworkingDELL-NETWORKING-FPSTATS-MIBDELL-NETWORKING-LINK-AGGREGATION-MIB DELL-NETWORKING-MSTP-MIB DELL-NETWORKING-BGP4-V2-MIB DELL-NETWORKING-ISIS-MIBDELL-NETWORKING-FIPSNOOPING-MIBDELL-NETWORKING-VIRTUAL-LINK-TRUNK-MIB DELL-NETWORKING-DCB-MIBDELL-NETWORKING-OPENFLOW-MIB DELL-NETWORKING-BMP-MIBDELL-NETWORKING-BPSTATS-MIBRegulatory compliance SafetyCUS UL 60950-1, Second Edition CSA 60950-1-03, Second Edition EN 60950-1, Second EditionIEC 60950-1, Second Edition Including All National Deviations and Group Differences EN 60825-1, 1st EditionEN 60825-1 Safety of Laser Products Part 1:Equipment Classification Requirements and User’s GuideEN 60825-2 Safety of Laser Products Part 2: Safety of Optical Fibre Communication Systems FDA Regulation 21 CFR 1040.10 and 1040.11EmissionsInternational: CISPR 22, Class AAustralia/New Zealand: AS/NZS CISPR 22: 2009, Class ACanada: ICES-003:2016 Issue 6, Class AEurope: EN 55022: 2010+AC:2011 / CISPR 22: 2008, Class AJapan: VCCI V-3/2014.04, Class A & V4/2012.04USA: FCC CFR 47 Part 15, Subpart B:2009, Class A RoHSAll S-Series components are EU RoHS compliant.CertificationsJapan: VCCI V3/2009 Class AUSA: FCC CFR 47 Part 15, Subpart B:2009, Class A Available with US Trade Agreements Act (TAA) complianceUSGv6 Host and Router Certified on Dell Networking OS 9.5 and greater IPv6 Ready for both Host and RouterUCR DoD APL (core and distribution ALSAN switch ImmunityEN 300 386 V1.6.1 (2012-09) EMC for Network Equipment\EN 55022, Class AEN 55024: 2010 / CISPR 24: 2010EN 61000-3-2: Harmonic Current Emissions EN 61000-3-3: Voltage Fluctuations and Flicker EN 61000-4-2: ESDEN 61000-4-3: Radiated Immunity EN 61000-4-4: EFT EN 61000-4-5: SurgeEN 61000-4-6: Low Frequency Conducted Immunity。

计算机组成与设计英文版第四版课程设计 (2)

计算机组成与设计英文版第四版课程设计 (2)

Course Design for Computer Organization and Design4th EditionIntroductionThe course design for Computer Organization and Design (COD) 4th edition is designed to provide students with an understanding of the fundamental principles of computer organization and design. The course is divided into three mn units: the first unit covers the basic principles of digital logic and circuit design, the second unit covers the architecture and organization of computer systems, and the third unit covers advanced topics such as parallel processing and memory hierarchies.Course ObjectivesUpon completion of the course, students will be able to: - Understand the basic principles of digital logic and circuit design, including boolean algebra, gates, and flip-flops - Understand the architecture and organization of basic computer systems, including the CPU, memory, and I/O devices - Design and implement basic digital circuits using logic gates and flip-flops - Understand the purpose and function of assembly language and machine language, including instruction sets, memory addressing modes, and CPU operations - Understand the basic principles of pipelining and parallel processing, including pipelined CPU design and parallel processing architectures - Understand the principles of memory hierarchies and caching, including the function and organization of cache memoryCourse Outline•Unit 1: Digital Logic and Circuit Design–Introduction to digital logic–Boolean algebra and logic gates–Flip-flops and sequential circuits–Design of basic digital circuits•Unit 2: Computer System Organization and Architecture–Introduction to computer system organization–CPU organization and instruction sets–Memory organization and addressing modes–Input/output (I/O) devices and interfaces–Assembly language programming and machine language •Unit 3: Advanced Topics in Computer Organization and Design –Pipelining and parallel processing–Cache memory and memory hierarchies–Advanced CPU design and architectures–High-performance computer systemsCourse Delivery and AssessmentThe course will be delivered through a combination of lectures, tutorials, and laboratory work. There will be a mid-term exam and afinal exam as well as regular assessments throughout the course. The lab work will be designed to provide hands-on experience with digitalcircuit design, CPU design, and assembly language programming.ConclusionThe design of the COD 4th edition course is intended to provide a strong foundation in the principles of computer organization and design. By the end of the course, students will have gned the skills necessary to design and implement basic digital circuits and understand the architecture and organization of computer systems. With the rapid changes and developments in computer hardware, it is essential for students to have a solid understanding of the fundamental principles underlying computer systems.。

RDMA技术介绍

RDMA技术介绍

RDMA : new communication mechanism
Problem
basic, principal target data position & size vs memory capacity on destination host distributed file system: memory management(data persistence , data reliability , data efficiency…)
RDMA vs TCP/IP Traditional TCP/IP socket-based byte stream communication, long history , commonplace widely-used (intranet/internet), while RDMA (only intranet) RDMA -Low latency – stack bypass and copy avoidance -Kernel bypass – reduces CPU utilization -High memory bandwidth utilization -available
One side RDMA read and server write create race condition !Data persistence
Challenges for Cache Management -Server Unaware of RDMA reads, difficult to keep track of popularity of cached items. (inefficient cache replacement scheme leads to severe performance downgrade) - when the server evicts an items, it needs to invalidate the remote pointer cached in the client side. (broadcast on every eviction is significant overhead) - Goals: - 1) deliver sustainable high hit rate with limited cache capacity. - 2) not to compromise the benefit of RDMA read - 3) provide reliable resource reclamation scheme. - Design decisions: - 1) Client-Assisted Cache Management - 2) Managing Key-Value Pairs via Leases - 3) Popularity Differentiated Lease Assignment - 4) Classifying Hot-Cold Items for Access History Report - 5) Delayed Memory Reclamation

QCon-Project Pravega

QCon-Project Pravega

Project Pravega Storage Reimagined for a Streaming WorldOpen SourceCommunityMassiveData Growth Emergence of Real-Time Apps Infrastructure Commoditizationand Scale-OutRapid Dissemination of Data to Apps Monetize Datawith AnalyticsData Velocity and VarietyMarket DriversToday’s “Accidental Architecture”BatchReal-TimeInteractive explorationby Data ScientistsReal-time intelligence at the NOCSensorsMirrorMakerDR SiteMobile DevicesApp LogsA New Architecture Emerges: Streaming• A new class of streaming systems is emerging to address the accidental architecture’s problems and enable new applications not possible before •Some of the unique characteristics of streaming applications –Treat data as continuous and infinite–Compute correct results in real-time with stateful, exactly-once processing •These systems are applicable for real-time applications, batch applications, and interactive applications•Web-scale companies (Google, Twitter) are beginning to demonstrate the disruptive value of streaming systems•What are the implications for storage in a streaming world?Let’s Rewind A Bit: The Importance of Log Storage Traditional Apps/Middleware Streaming Apps/MiddlewareBLOCKS •Structured Data •Relational DBs FILES•Unstructured Data•Pub/Sub•NoSQL DBsOBJECTS•Unstructured Data•Internet Friendly (REST)•Scale over Semantics•GeoLOGS•Append-only•Low-latency•Tail Read/WriteThe Importance of Log StorageThe Fundamental Data Structure for Scale-out Distributed SystemsAPPEND-ONLY LOGx=5z=6y=2x=4a=7y=5……older newerHigh Throughput catch-up readsLow Latencytailing readswritesOur Goal: Refactor the “Accidental Storage Stack”Ingest Buffer & Pub/Sub“Pravega Streams”Scale-out SDSNoSQL DB SearchAnalytics EnginesUsing Logs as a Shared Storage PrimitiveIngest Buffer & Pub/SubProprietary Log StorageLocal Files DASKafkaNoSQL DBProprietary Log StorageLocal Files DASCassandra et alIntroducing Pravega StreamsA New Log Primitive Designed Specifically For Streaming Architectures•Pravega is an open source distributed storage service offering a new storage abstraction called a stream• A stream is the foundation for building reliable streaming systems: a high-performance,durable,elastic,and infinite append-only log with strict ordering and consistency• A stream is as lightweight as a file–you can create millions of them in a single cluster•Streams greatly simplify the development and operation of a variety ofdistributed systems: messaging, databases, analytic engines, search engines, and so onPravega Architecture Goals•All data is durable–Data is replicated and persisted to disk before being acknowledged •Strict ordering guarantees and exactly once semantics –Across both tail and catch-up reads–Client tracks read offset, Producers use transactions •Lightweight, elastic, infinite, high performance–Support tens of millions of streams–Dynamic partitioning of streams based on load and throughput SLO–Size is not bounded by the capacity of a single node–Low (<10ms) latency writes; throughput bounded by network bandwidth –Read pattern (e.g. many catch-up reads) doesn’t affect write performanceStreaming Storage SystemArchitectureStream AbstractionPravega Streaming ServiceCloud Scale Storage (Isilon or ECS)•High-Throughput •High-Scale, Low-CostLow-Latency StorageApache BookkeeperAuto-TieringCache (Rocks)Messaging AppsReal-Time / Batch / Interactive Predictive AnalyticsStream Processors: Spark, Flink, …Other Apps & MiddlewarePravega Design Innovations 1.Zero-Touch Dynamic Scaling-Automatically scale read/writeparallelism based on load and SLO -No service interruptions-No manual reconfiguration of clients -No manual reconfiguration of service resources2.Smart Workload Distribution-No need to over-provision servers for peak load3.I/O Path Isolation-For tail writes -For tail reads-For catch-up reads4.Tiering for “Infinite Streams”5.Transactions For “Exactly Once”Pravega FundamentalsSegments•Base storage primitive is a segment• A segment is an append-only sequence of bytes •Writes durably persisted before acknowledgement….. 01110110 01100001 0110110001000110AppendRead01000110SegmentPravega• A stream is composed of one or more segments•Routing key determines the target segment for a stream write•Write order preserved by routing key; consistent tail and catch-up readsStream (011010010111011001001010)011101100111011001101111011010010110100101101111PravegaStream•There are no architectural limits on the number of streams or segments •Each segment can live in a different server•System is not limited in any way by the capacity of a single server (011010010111011001001010)011101100111011001101111011010010110100101101111PravegaStreamSegment Sealing• A segment may be sealed• A sealed segment cannot be appended to any more•Basis for advanced features such as stream elasticity and transactions (011010010111011001001010)011101100111011001101111011010010110100101101111XPravegaPravega System ArchitectureCommodity ServerPravega ClientControllerBookkeeper Segment StoreCacheCommodity Server Segment StoreBookkeeper CacheCommodity ServerSegment StoreBookkeeperCacheStream Create, Open, Transactions, …Read, Write, …Long-Term Storage(Object, NFS, HDFS)Beyond the Fundamentals Stream Elasticity, Unbounded Streams, Transactions, Exactly Once•Data arrival volume increases –more parallelism needed!PravegaStreamAppend….. 01110110 01100001 01101100 01000110Stream•Seal original segment •Replace with two new ones!•New segments may be distributed throughout the cluster balancing load….. 01110110 01100001 01101100AppendPravega010001100100011001000110Routing Key Space0.01.0TimeSplitSplitMerget 0t 1t 201000110010001100100011001000110010001100100011001000110010001100100011001000110Routing Key Space0.01.0Timet 0t 1t 201000110010001100100011001000110010001100100011001000110010001100100011001000110Key ranges are dynamically assigned to segments Split SplitMergeScale up Scale downStream ElasticityZero-Touch Scaling: Segment Splitting & MergingSegment 7Segment 4Segment 3Segment 1Segment 2Segment 0Time3210t0t1t2t3Segment 6Segment 5t4Data KeysStream SStreamUnbounded Streams•Segments are automatically tiered to long-term storage•Data in tiered segments is transparently accessible for catch-up reads •Preserves stream abstraction while lowering storage costs for older dataPravega01110110011101100110100101101001011011110110111101101111Long-T erm Storage (Object , HDFS, NFS)Segment Store Setup10 : 2Writer ID = 10Segment StoreXExactly OnceWriter ID = 10Ack Segment StoreID Position101ID Position102Writer ID = 10ID Position102Writer ID = 10Ack Segment Store ID Position103Stream01000110Initial State01110110 StreamBegin TXTX segments0111011001000110Stream01100001Write to TX01100001TX segments0111011001000110 Stream01100000Write toTX01100001TX segments0111011001000110 01100000StreamUpon commitSeal TX segments01110110010001100110000001100001Stream0100011001100001Merge TXsegments into stream segmentsUpon commit0111011001100000Stream01000110Initial State01110110 StreamBegin TXTX segments0111011001000110Stream01100001Write to TX01100001TX segments0111011001000110 Stream01100000Write toTX01100001TX segments0111011001000110 01100000StreamUpon abortSeal TX segments01110110 010001100110000001100001Transactional Semantics For “Exactly Once”New ItemStream 1, Segment 1…Stream 1, Segment 1, TX-230New ItemNew ItemStream ProcessorApp StateApp LogicWorker WorkerPravega Optimizations for Stream ProcessorsInput Stream (Pravega)…Worker…Segment Memory-SpeedStorageDynamically split input stream into parallel logs : infinite sequence, low-latency, durable, re-playable with auto-tiering from hot to cold storage.1Coordinate via protocol between streaming storage and streaming engine to systematically scale up and down the number of logs and source workers based on load variance over time2Support streaming write COMMIT operation to extend Exactly Once processing semantics across multiple, chained applications3S o c i a l , I o T P r o d u c e r sOutput Stream (Pravega)Stream Processor2nd AppSegmentSinkSegmentSegmentNautilus PlatformSecure | Integrated | Efficient | Elastic | ScalableA Turn-Key Streaming Data PlatformPravegaStreaming StorageFlinkStateful Stream ProcessorStreaming SQL API ZeppelinNotebook ExperienceNotebook APIK8SCommodity Servers or CloudS e c u r i t yServiceabilityStreaming Storage API Digital WorldReal-Time/Batch Analytics Frameworks and AppsInteractive ExplorationSummary1.“Streaming Architecture” replaces “Accidental Architecture”–Data: infinite/continuous vs. static/finite–Correctness in real-time: Exactly once processing + consistent storage 2.Pravega Streaming Storage Enables Storage Refactoring–Infinite, durable, scalable, re-playable, elastic append-only log–Open source project3.Unified Storage + Unified Data Pipeline–The New Data Stack!Comparing Pravega and Kafka Design PointsUnlike Kafka, Pravega is designed to be a durable and permanent storage systemQuality Pravega Goal Kafka Design Point Data Durability Replicated and persisted to disk before ACK Replicated but not persisted to disk before ACK Strict Ordering Consistent ordering on tail and catch-up reads Messages may get reorderedExactly Once Producers can use transactions for atomicity Messages may get duplicatedScale Tens of millions of streams per cluster Thousands of topics per clusterElastic Dynamic partitioning of streams based on load and SLO Statically configured partitionsSize Log size is not bounded by the capacity of any single nodePartition size is bounded by capacity of filesystemon its hosting nodeTransparently migrate/retrieve data from Tier 2 storage forolder parts of the logExternal ETL required to move data to Tier 2storage; no access to data via Kafka once movedPerformance Low (<10ms) latency durable writes; throughput bounded bynetwork bandwidthLow-latency achieved only by reducingreplication/reliability parametersRead pattern (e.g. many catch-up readers) does not affectwrite performanceRead patterns adversely affects write performancedue to reliance on OS filesystem cache✗✗✗✗✗✗✗✗✗。

戴尔产品组12TB Microsoft SQL Server 2012快速跟踪数据仓库参考配置说明书

戴尔产品组12TB Microsoft SQL Server 2012快速跟踪数据仓库参考配置说明书

Database Solutions Engineering Dell Product Group Mayura Deshmukh April 2013This document is for informational purposes only and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.© 2013 Dell Inc. All rights reserved. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dell, the Dell logo, and PowerEdge are trademarks of Dell Inc. Intel and Xeon are registered trademarks of Intel Corporation in the U.S. and other countries. Microsoft, Windows, and Windows Server are either trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.February 2013 | Rev 1.0ContentsExecutive Summary (4)FTDW Reference Architectures Using PowerEdge R720xd Server (4)12TB Dell R720XD FTDW Reference Architecture (5)Hardware Components (5)Internal Storage Controller (PERC H710P Mini) Settings (7)Application Configuration (9)Capacity Details (10)Performance Benchmarking (11)Conclusion (13)References (14)TablesTable 1: Dell Fast Track Reference Architectures for PowerEdge R720xd Server (4)Table 2: Tested Dell FTDW Reference Architecture Components (5)Table 3: Mount Point Naming and Storage Enclosure Mapping (9)Table 4: Capacity Metrics (10)Table 5: Performance Metrics (11)FiguresFigure 1: Proposed Dell Fast Track Reference Architecture (5)Figure 2: Memory Slot Locations (7)Figure 3: Virtual Disk Settings (7)Figure 4: Internal Storage Controller Settings (8)Figure 5: RAID Configuration (8)Figure 6: Storage System Components (9)Figure 7: SQLIO Line Rate Test from Cache (Small 5MB File) (12)Figure 8: SQLIO Real Rate Test from Disk (Large 25GB File) (12)Executive SummaryThe performance and stability of any data warehouse solution is based on the integration between solution design and hardware platform. Choosing the correct solution architecture requires balancing the application’s intended purpose and expected use with the hardware platform’s components. Poor planning, bad design, and misconfigured or improperly sized hardware often lead to increased costs, increased risks and, even worse, unsuccessful projects.This white paper provides guidelines to achieve a compact, balanced, optimized 12TB Microsoft® SQL Server® 2012 data warehouse configuration for Dell™PowerEdge™ R720 and R720xd servers using Microsoft Fast Track Data Warehouse (FTDW) principles. Benefits of implementing this reference architecture include:∙Achieve a balanced and optimized system at all levels of the stack by following hardware and software best practices.∙Avoid over-provisioning hardware resources to reduce costs.∙Implement a tested and validated configuration with proven methodologies and performance behaviors to help avoid the pitfalls of improperly designed and configured systems.∙Easily migrate from a small- to medium-sized data warehouse configuration (5TB) to a large data warehouse configuration (12TB).Data center space comes at a premium. This configuration provides a compact, high-performance solution for large data warehouses with 12TB of data or more.FTDW Reference Architectures Using PowerEdge R720xd Server The Microsoft FTDW reference architecture achieves an efficient resource balance between SQL Server data processing capability and realized component hardware throughput to take advantage of improved out-of-the-box performance.As most data warehouse queries scan large volumes of data, FTDW system design and configuration are optimized for sequential reads and are based on concurrent query workloads. Understanding performance and maintaining a balanced configuration helps reduce costs by avoiding over provisioning of components.Dell provides various Fast Track reference architectures for SQL 2012 built using the Dell PowerEdge12th Generation servers. These solutions are differentiated depending on the data warehouse capacity and scan rate requirements. Table 1 summarizes FTDW configurations with Dell R720XD server.Table 1: Dell Fast Track Reference Architectures for PowerEdge R720xd ServerThe 12TB R720XD configuration described in this white paper is also available as a rapid deployment, with hardware, software, and services included in the Dell™ Quickstart Data Warehouse Appliance 2000 (QSDW 2000). This configuration provides a low-cost and easier migration path for customers who wantto go from a 5TB to 12TB solution. For more information on Dell QSDW 2000, see Dell Quickstart Data Warehouse Appliance.12TB Dell R720XD FTDW Reference ArchitectureThe following sections of this paper describe the hardware, software, capacity, and performance characteristics of a 12TB Microsoft SQL Server 2012 FTDW solution with scan rates of about 2GBps using PowerEdge R720XD servers.Hardware ComponentsRedundant and robust tests have been conducted on PowerEdge servers to determine best practices and guidelines for building a balanced FTDW system. Table 2 provides the detailed hardware configuration of the reference architecture.Figure 1: Proposed Dell Fast Track Reference ArchitectureTested Dell Fast Track Reference Architecture Component DetailsTable 2: Tested Dell FTDW Reference Architecture ComponentsPowerEdge R720xd ServerThe PowerEdge R720xd server is a two-socket, 2U high-capacity, multi-purpose rack server offering an excellent balance of internal storage, redundancy, and value in a compact chassis. For technical specifications of the R720xd server, see the Power Edge R720xd Technical Guide.ProcessorsThe Fast Track Data Warehouse Reference Guide for SQL Server 2012 describes how to achieve a balance between components such as storage, memory, and processors. To balance available internal storage and memory for the PowerEdge R720xd, the architecture uses two Intel Xeon E5-2643 four-core processors operating at 3.3GHz.MemoryFor SQL Server 2012 reference architectures, Microsoft recommends using 128GB to 256GB of memory for dual-socket configuration. Selection of memory DIMMS will also play a critical role in the performance of the entire stack.This configuration was tested with various memory sizes running at different speeds—for example,192GB running at 1333MHz, 192GB running at 1600MHz, 112GB running at 1600MHz, and so on. Using DIMMs with memory rate of 1600MHz showed significant performance improvement (about 400MBs/s) over DIMMS with memory rate of 1333MHz. In the test configuration, the database server is configured with 128GB of RAM running at 1600 MHz to which create a well-balanced configuration.To achieve 128GB of RAM on the PowerEdge R720xd server, place eight 16GB RDIMMS in slots A1-A4 and B1-B4 (white connectors). See Figure 2: Memory Slot LocationsFigure 2 for memory slot locations.Figure 2: Memory Slot LocationsInternal Storage Controller (PERC H710P Mini) SettingsThe Dell PERC H710P Mini is an enterprise-level RAID controller that provides disk management capabilities, high availability, and security features in addition to improved performance of up to6GB/s throughput. Figure 3 shows the management console accessible through the BIOS utility.Figure 3: Virtual Disk SettingsStripe element sizeBy default, the PERC H710P Mini creates virtual disks with a segment size of 64KB. For most workloads, the 64KB default size provides an adequate stripe element size.Read policyThe default setting for the read policy on the PERC H710P Mini is “adaptive read ahead.” This configuration was tested with “adaptive read ahead,” “No read ahead,”and “Read Ahead” settings.During testing, it was observed that the default setting of “adaptive read ahead” gave the best performance.Figure 4: Internal Storage Controller SettingsRAID configurationWhen deploying a new storage solution, selecting the appropriate RAID level is a critical decision that impacts application performance. The FTDW configuration proposed in this paper uses RAID 1 disk groups for database data files and database log files, nine RAID 1 data disk groups, and one RAID 1 log disk group, each created with a single virtual disk. Additionally, two drives in RAID 0 are assigned as a staging area. Figure 5 shows the proposed RAID configuration.Figure 5: RAID ConfigurationRAID 1 Data 5RAID 1Data 6RAID 1Data 7RAID 0StageRAID 1LogsRear Bay DrivesDrive slot configuration:∙Slots 0-17: Nine RAID 1 disk groups were created, each configured with a single virtual disk dedicated for the primary user data∙Slots 18-19: One RAID 1 disk group created from two disks and a single virtual disk dedicated tohost the database log files∙Slots 20-21: RAID 0 disk group created from two disks dedicated for staging∙Slots 22-23: Remaining two disks assigned as global hot spares∙Slots 24-25 (rear bay drives): One RAID 1 disk group for operating systemFor FTDW architectures, it is recommended to use mount-point rather than drive letters for storage access. It is also important to assign the appropriate virtual disk and mount-point names to theconfiguration to simplify troubleshooting and performance analysis. Mount-point names should be assigned in such a way that the logical file system reflects the underlying physical storage enclosure mapping. Table 3 shows the virtual disk and mount-point names used for the specific reference configuration and the appropriate storage layer mapping. All of the logical volumes are mounted to the C:\FT folder.Table 3: Mount Point Naming and Storage Enclosure MappingFigure 6 represents the storage system configuration for the proposed FTDW reference architecture.Figure 6: Storage System ComponentsThe production, staging, and system temp databases are deployed per the recommendations provided in the Fast Track Data Warehouse Reference Guide for SQL Server 2012.Application ConfigurationThe following sections explain the settings applied to operating system and database layers.Windows Server 2008 R2 SP1Enable Lock Pages In Memory to prevent the system from paging memory to disk. For more information, see How to: Enable the Lock Pages in Memory Option.SQL Server ConfigurationThe following startup options were added to the SQL Server Startup options:∙-E: This parameter increases the number of contiguous extends that are allocated to a database table in each file as it grows to improve sequential access.∙-T1117: This trace flag ensures the even growth of all files in a file group when auto growth is enabled. It should be noted that the FTDW reference guidelines recommend pre-allocating the data file space rather than allowing auto-grow.SQL Server Maximum Memory: FTDW for SQL Server 2012 guidelines suggest allocating no more than 92% of total server RAM to SQL Server. If additional applications will share the server, then adjust the amount of RAM left available to the operating system accordingly. For this reference architecture, the maximum server memory was set at 119808 MB (117GB).Resource Governor:For SQL Server 2012, Resource Governor provides a maximum of 25% of SQL Server memory resources to each session. The Resource Governor setting can be used to reduce the maximum memory consumed per query. While it can be beneficial for many data warehouse workloads to limit the amount of system resources available to an individual session, this is best measured through analysis of concurrent query workloads. This configuration was tested with both 25% and 19% memory grant, and the 25% setting was found to be optimal for the proposed configuration. For more information, see Using the Resource Governor.Max Degree of Parallelism: The SQL Server configuration option Max degree of parallelism controls the number of processors used for the parallel execution of a query. For the configuration, settings of 12 and 0 were tested. The default setting of 0 provided maximum performance benefits. For more information, see Maximum degree of parallelism configuration option.Capacity DetailsTable 4Table 4 shows the capacity metrics reported for the recommended reference configuration.Table 4: Capacity MetricsPerformance BenchmarkingMicrosoft FTDW guidelines help to achieve optimized database architecture with balanced CPU and storage bandwidth. Table 5 shows the performance numbers reported for the recommended reference configuration.Table 5: Performance MetricsThe following sections describe the detailed performance characterization activities carried out for the validated Dell Microsoft FTDW reference architecture.Baseline Hardware Characterization Using Synthetic I/OThe goal of hardware validation is to determine actual baseline performance characteristics of key hardware components in the database stack to ensure that system performance is not bottlenecked in intermediate layers.The disk characterization tool, SQLIO, was used to validate the configuration. The results in Figure 7 show the maximum baseline that the system can achieve from a cache (called Line Rate). A small file is placed on the storage, and large sequential reads are issued against it with SQLIO. This test verifies the maximum bandwidth available in the system to ensure no bottlenecks are within the data path.Figure 7: SQLIO Line Rate Test from Cache (Small 5MB File)PERC H710P Mini ControllerSynthetic I/O rate: 2674 MB/sThe second synthetic I/O test with SQLIO was performed with a large file to ensure reads are serviced from the storage system hard drives instead of from cache. Figure 8 shows the maximum real rate that the system is able to provide with sequential reads.Figure 8: SQLIO Real Rate Test from Disk (Large 25GB File)PERC H710P Mini ControllerSynthetic I/O rate: 2616 MB/sFTDW Database ValidationThe performance of a FTDW database configuration is measured using two core metrics: Maximum CPU Consumption Rate (MCR) and Benchmark Consumption Rate (BCR).MCR - MCR indicates the per-core I/O throughput in MB or GB per second. This is measured by executing a pre-defined query against the data in the buffer cache, and then measuring thetime taken to execute the query against the amount of data processed in MB or GB. For thevalidated configuration with two Intel E5-2643 four-core processors, the system aggregate MCR was 2488 MB/s. The realized MCR value per core was 311 MB/s.BCR - BCR is calculated in terms of total read bandwidth from the storage hard drives—not from the buffered cache as in the MCR calculation. This is measured by running a set ofstandard queries specific to the data warehouse workload. The queries range from I/Ointensive to CPU and memory intensive, and provide a reference to compare variousconfigurations. For the validated FTDW configuration, the aggregate BCR was 1909 MB/s.During the evaluation cycle, the system configuration was analyzed for multiple query variants (simple, average, and complex) with multiple sessions and different degrees of parallelism(MAXDOP) options to arrive at the optimal configuration. The evaluation results at each stepwere validated and verified jointly by Dell and Microsoft.FTDW Database Validation with Column Store Index (CSI)SQL Server 2012 implements CSI technology as a nonclustered indexing option for pre-existing tables. Significant performance gains are often achieved when CSI query plans are active, and this performance can be viewed as incremental to the basic system design.After the test configuration was validated, CSI was added. Then, the same set of I/O and CPU-intensive queries were executed to compare throughput achieved using CSI. Throughput rating of 4337.5 MB/s was achieved for CSI-enhanced benchmarks. These numbers can be used to approximate the positive impact to query performance expected under a concurrent query workload.ConclusionThe Dell Microsoft FTDW architecture provides a uniquely well-balanced data warehouse solution. By following best practices at all stack layers, a balanced data warehouse environment can be achieved with a greater performance benefits than traditional data warehouse systems.ReferencesDell SQL Server Solutions\sqlDell Services\servicesDell Support\supportMicrosoft Fast Track Data Warehouse and Configuration Guide Information /fasttrackAn Introduction to Fast Track Data Warehouse Architectures/en-us/library/dd459146.aspxHow to: Enable the Lock Pages in Memory Option/fwlink/?LinkId=141863SQL Server Performance Tuning & Trace Flags/kb/920093Using the Resource Governor/en-us/library/ee151608.aspxMaximum degree of parallelism configuration option/kb/2023536Power Edge R720xd Technical Guide/support/edocs/systems/per720/en/index.htm。

[转载]Scratchpad

[转载]Scratchpad

[转载]Scratchpad RAM原⽂地址:Scratchpad RAM作者:TracyScratchpad memory (SPM), also known as scratchpad, scatchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval.It can be considered similar to the L1 cache in that it is the next closest memory to the ALU after the internal registers, with explicit instructions to move data from and to main memory, often using DMA-based data transfer. In contrast with a system that uses caches, a system with scratchpads is a system with Non-Uniform Memory Access latencies, because the memory access latencies to the different scratchpads and the main memory vary. Another difference with a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in the main memory.Scratchpads are employed for simplification of caching logic, and to guarantee a unit can work without main memory contention in a system employing multiple processors, especially in multiprocessor system-on-chip for embedded systems. They are mostly suited for storing temporary results (as it would be found in the CPU stack) that typically wouldn't need to always be committing to the main memory; however when fed by DMA, they can also be used in place of a cache for mirroring the state of slower main memory. The same issues of locality of reference apply in relation to efficiency of use; although some systems allow strided DMA to access rectangular data sets. Another difference is that scratchpads are explicitly manipulated by applications.Scratchpads are not used in mainstream desktop processors where generality is required for legacy software to run from generation to generation, in which the available on-chip memory size may change. They are better implemented in embedded systems, special-purpose processors and game consoles, where chips are often manufactured as MPSoC, and where software is often tuned to one hardware configuration.Contents1 Examples of use2 Alternatives2.1 Cache control vs Scratchpads2.2 Shared L2 vs Cell local storesExamples of useThe Cyrix 6x86, the only x86-compatible desktop processor to incorporate a dedicated Scratchpad.SuperH, used in Sega's consoles, could lock cachelines to an address outside of main memory for use as aScratchpad.The Sony PS1's R3000 had a Scratchpad instead of an L1 cache. It was possible to place the CPU stack here, an example of the temporary workspace usage.Sony's PS2Emotion Engine employed a 16KiB Scratchpad, to and from which DMA transfers could be issued to its GS, and main memory.The Cell's SPEs are restricted purely to working in their "local-store", relying on DMA for transfers from/to main memory and between local stores, much like a Scratchpad. In this regard, additional benefit is derived from the lack of hardware to check and update coherence between multiple caches: the design takes advantage of the assumption that each processor's workspace is separate and private. It is expected this benefit will become more noticeable as the number of processors scales into the "many-core" future.Many other processors allow L1 cache lines to be locked.Most DSPs use a Scratchpad. Many past 3D accelerators and game consoles (including the PS2) have used DSPs for vertex transformations. This differs with the stream based approach of modern GPUs which have more in common witha CPU cache's functions.NVIDIA's 8800GPU running under CUDA provides 16KiB of Scratchpad per thread-bundle when being used for gpgputasks.Ageia's PhysX chip utilizes Scratchpad RAM in a manner similar to the Cell; its theory states that a cache hierarchy is of less use than software managed physics and collision calculations. These memories are also banked and a switch manages transfers between them.AlternativesCache control vs ScratchpadsMany architectures such as PowerPC attempt to avoid the need for cacheline locking or scratchpads through the use of cache control instructions. Marking an area of memory with "Data Cache Block: Zero" (allocating a line but setting its contents to zero instead of loading from main memory) and discarding it after use ('Data Cache Block: Invalidate', signaling that main memory needn't receive any updated data) the cache is made to behave as a scratchpad. Generality is maintained in that these are hints and the underlying hardware will function correctly regardless of actual cache size.Shared L2 vs Cell local storesRegarding interprocessor communication in a multicore setup, there are similarities between the Cell's inter-localstore DMA and a Shared L2 cache setup as in the Intel Core 2 Duo or the Xbox 360's custom powerPC: the L2 cache allows processors to share results without those results having to be committed to main memory. This can be an advantage where the working set for an algorithm encompasses the entirety of the L2 cache. However, when a program is written to take advantage of inter-localstore DMA, the Cell has the benefit of each-other-Local-Store serving the purpose of BOTH the private workspace for a single processor AND the point of sharing between processors; i.e., the other Local Stores are on a similar footing viewed from one processor as the shared L2 cache in a conventional chip. The tradeoff is that of memory wasted in buffering and programming complexity for synchronization, though this would be similar to precached pages in a conventional chip. Domains where using this capability is effective include:Pipeline processing (where one achieves the same effect as increasing the L1 cache's size by splitting one job into smaller chunks).Extending the working set, e.g., a sweet spot for a merge sort where the data fits within 8x256KiBShared code uploading, like loading a piece of code to one SPU, then copy it from there to the others to avoid hitting the main memory again.It would be possible for a conventional processor to gain similar advantages with cache-control instructions, for example, allowing the prefetching to the L1 bypassing the L2, or an eviction hint that signaled a transfer from L1 to L2 but not committing to main memory; however, at present no systems offer this capability in a usable form and such instructions in effect should mirror explicit transfer of data among cache areas used by each core.Retrieved from "/w/index.php?title=Scratchpad_RAM&oldid=415734064"。

973国家重点科研项目

973国家重点科研项目

973国家重点科研项目课题一l 2008年1. Yashuai Lv, Li Shen, Libo Huang, Zhiying Wang, Nong Xiao, Customizing Computation accelerators for extensible multi-issued processors with efficient optimization techniques, in the 45th Design Automatic Conference (DAC), Jun. 20082. Mingche Lai, Zhiying Wang, Lei Gao, Hongyi Lu, A dynamically allocated virtual channel architecture with congestion awareness for on-chip routers, in the 45th Design Automatic Conference (DAC), Jun. 20083. Mingche Lai, Lei Gao, Zhiying Wang, Novel automated approach to guide the processor element for multimedia domain, in Journal of Information and Computational Science, 2008.02: 819-8284. Chen Wei, Lu Hongyi, Shen Li, Wang Zhiying, Xiao Nong, DBTIM: an advanced hardware assisted full virtualization architecture, in Proc. of the 5th International Conference on Embedded and Ubiquitous Computing (EUC), Dec. 2008: 399-4045. Bin Chen, Nong Xiao, Zhiping Cai, Zhiying Wang, An optimal COW block device driver in VMM for fast, on-demand software deployment, in EUV’08, 2008.126. Chen Wei, Lu Hongyi, Shen Li, Wang Zhiying, Xiao Nong, Chen Dan, A novel hardware assisted full virtualization technique, in the 9th International Conference for Young Computer Scientists (ICYCS), Nov. 2008: 1292-12977. 蔡志平、陈彬、肖侬、王志英. 虚拟计算环境中的虚拟网络. 计算机工程与科学. 2008.118. 沈立、吕雅帅、王志英. 传输触发结构ASIP软件工具的自动定制. 计算机辅助设计与图形学学报, 2008.069. 吕雅帅、沈立、黄立波、王志英. 面向嵌入式应用的指令集自动扩展. 电子学报, 2008. 05: 985-988l 2009年1. Mingche Lai, Lei Gao, Nong Xiao, Zhiying Wang, An accurate and efficient performance analysis approach based on queuing model for network on chip, in ICCAD 2009: 563-5702. Miao Wang, Francois Bodin, Sebastien Matz, Automatic Data Distribution for Improving Data Locality on the Cell BE Architecture, in LCPC 20093. Yashuai Lv, Li Shen, Zhiying Wang, Nong Xiao, Dynamically utilizing computation accelerators for extensible processors in a software approach, in CODES+ISSS 20094. Yashuai Lv, Li Shen, Libo Huang, Zhiying Wang, Nong Xiao, Optimal subgraph covering for customizable VLIW processors, in IET Computers & Digital Techniques, Jan. 20095. Libo Huang, Li Shen, Sheng Ma, Nong Xiao, Zhiying Wang, DM-SIMD: a new SIMD predication mechanism for exploiting SLP, in the 8th IEEE International Conference on ASIC, Dec. 20096. Chen Wei, Shen Li, Lu Hongyi, Wang Zhiying, Xiao Nong, A light-weight code cache design for dynamic binary translation, in the 15th International Conference on Parallel and Distributed Systems (ICPADS), Dec. 2009: 120-1257. Li Shen, Libo Huang, Nong Xiao, Zhiying Wang, Implicit data permutation for SIMD devices, EC-Com, Dec. 20098. 沈立, 张晨曦, 吕雅帅, 王志英. 指令扩展中相关子图的分析. 计算机辅助设计与图形学学报, 2009.109. 陈彬, 肖侬, 蔡志平, 王志英. 基于优化的COW虚拟块设备的虚拟机按需部署机制. 计算机学报. 2009.1010. Bein Chen, Nong Xiao, Zhiping Cai, Zhiying Wang, Ji Wang, DPM: a demand-driven virtual disk prefetch mechanism for mobile personal computing environments, in Proc. of the 6th IFIP International Conference on Network and Parallel Computing (NPC), Oct. 200911. Bin Chen, Nong Xiao, Zhiping Cai, Fuyong Chu, Zhiying Wang, Virtual disk reclamation for software updates in virtual machine environments, in Proc. of the 4th IEEE International Conference on Networking, Architecture, Storage (NAS), Jul. 2009: 43-5012. 褚福勇, 肖侬, 蔡志平, 陈彬. 虚拟机备份机制研究. 计算机工程与科学. 2009.0913. Bin Chen, Nong Xiao, Zhiping Cai, Zhiying Wang, Ji Wang, Fast, on-demand software deployment with lightweight, independent virtual disk images, in Proc. of the 8th International Conference on Grid and Cooperative Computing (GCC), Aug. 2009: 16-2314. Chen Wei, Lu Hongyi, Shen Li, Wang Zhiying, Xiao Nong, Using pcache to speedup interpretation in dynamic binary translation, in Proc. of IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Aug. 2009: 525-530l 2010年1. Libo Huang, Li Shen, Zhiying Wang, Wei Shi, Nong Xiao, Sheng Ma, SIF: Overcoming the Limitations of SIMD Devices via Implicit Permutation, in the 16th IEEE International Symposium on High Performance Computer Architecture (HPCA), Jan. 2010, Bangalore2. Wei Shi, Zhiying Wang, Hongguang Ren, Ting Cao, Wei Chen, Bo Su, Hongyi Lu. DSS: Applying Asynchronous Techniques to Architectures Exploiting ILP at Compile Time. In Proc. of the 28th IEEE International Conference on Computer Design (ICCD), best paper award, pp. 321-327, Oct. 20103. Mingche Lai, Lei Gao, Zhiying Wang, Exploration and implementation of a highly efficient processor element for multimedia and signal processing domains, in IET Computers & Digital Techniques, May. 2010: 374-3874. Miao Wang, Nicolas Benoity, Francois Bodinz, Zhiying Wang, Model driven iterative multi-dimensional parallelization of multi-task programs for the Cell BE: a generic algorithm-based approach, in EUR-PDP 20105. Libo Huang, Xin Zhang, Zhiying Wang, Streaming processing in general purpose processor, in the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010 (poster)6. 陈彬, 肖侬, 蔡志平, 王志英. 虚拟机环境下软件按需部署中的预取机制研究. 软件学报. 2010.12: 3186-31987. Libo Huang, Zhiying Wang, SV: enhancing SIMD architecture via combined SIMD-Vector approach, in the 10th International Conference on Algorithms and Architectures for Parallel Processing, 20108. Zhiping Cai, Fang Liu, Nong Xiao, Qiang Liu, Zhiying Wang, Virtual Network Embedding for evolving networks, in Proc. of IEEE Globecom, Dec. 20109. Xu Fan, Shen Li, Wang Zhiying, A dynamic binary translation framework based on page fault mechanism in Linux kernel, in the 10th International Conference on Computer and Information Technology, Jun. 2010: 2284-228910. Chen Wei, Wang Zhiying, Chen Dan, An emulator for executing IA-32 applications on ARM-based systems, in Journal of Computers, 2010.0711. Zhiping Cai, Zhijun Wang, Kai Zheng, A distributed TCAM coprocessor architecture for integrated policy filtering and content filtering, in Proc. of IEEE International Conference on Communications (ICC), May. 2010: 23-2712. 沈立、王志英、肖侬. 多核平台下应用程序的动态优化. 计算机科学与探索. 2010.04l 2011年1. Libo Huang, Sheng Ma, Li Shen, Zhiying Wang, Low Cost Binary 128 Floating-Point FMA Unit Design with SIMD Support, in IEEE Transactions on Computers, Accepted2. Sheng Ma, Natalie Enright Jerger, Zhiying Wang, DBAR: An Efficient Routing Algorithm to Support Multiple Concurrent Applications in Networks-on-Chip, in ISCA 2011, Jun. 2011, San Jose3. Libo Huang, Zhiying Wang, Li Shen, Hongyi Lu, Nong Xiao, Cong Liu, A Specialized Low-cost Vectorized Loop Buffer for Embedded Processors, Design, Automation & Test in Europe (DATE), Mar. 20114. Shi Wei, Hongguang Ren, Qiang Dou, Zhiying Wang, Li Shen, and Cong Liu, accepted by ICCD 2011, Oct. 2011, Boston5. [19] Mingche Lai, Lei Gao, Sheng Ma, Nong Xiao, Zhiying Wang, A practical low-latency router architecture with wing channel for on-chip network, in Microprocessors and Microsystems, 2011.02: 98-1096. 陈微, 王志英, 肖侬, 沈立, 陆洪毅, 降低协同设计虚拟机启动开销的译码后指令缓存技术, 计算机研究与发展, 2011.01: 19-277. Libo Huang, Zhiying Wang, Nong Xiao, VBON: towards efficient on-chip networks via hierarchic virtual bus, in the 48th Design Automatic Conference (DAC), 2011 (poster)8. Wei Chen, Weixia Xu, Zhiying Wang, Qiang Dou, Yongwen Wang, Baokang Zhao, Baozhang Wang, A formalization of an emulation based co-designed virtual machine, in the 5th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, 20119. Cong Liu, Li Shen, Libo Huang, Zhiying Wang, Sheng Ma, Tuning parallelism of sequential applications via thread level speculation, in the 3rd International Conference on Computer and Network Techniques, 201110. 陈顼颢, 郑重, 沈立, 王志英, 二进制翻译中代码生成的子图覆盖算法, 计算机科学与探索, 2011.0711. 郑重, 陈顼颢, 沈立, 王志英, 浮点到定点的高效翻译策略研究, 计算机科学与探索, 2011.0512. 陈彬, 蔡志平, 肖侬, 褚福勇. 虚拟机管理器中面向虚拟块设备的一种通用快照扩展机制. 计算机工程与科学. 2011.05: 54-5813. Libo Huang, Li Shen, Yashuai Lv, Zhiying Wang, Kui Dai, MAC or Non-MAC: Not a Problem, in Journal of Circuits, Systems, and Computers, Accepted14. Chen Wei, Wang Zhiying, Zheng Zhong, Shen Li, Lu Hongyi, Xiao Nong, TransARM: An Efficient Instruction Architecture Emulator, in Chinese Journal of Electronics, Accepted15. Wen Chen, Dan Chen, Zhiying Wang, An approach to minimizing the interpretation overhead in dynamic binary translation, in the Journal of Supercomputing, accepted16. Bin Chen, Nong Xiao, Zhiping Cai, A demand-driven virtual disk prefetch mechanism for seamless mobility ofpersonal computing environment, in the Journal of Supercomputing, accepted17. 徐帆, 沈立, 王志英. 基于多核平台的多线程动态优化框架. 计算机工程与科学, 已录用18. 蔡志平、刘强、吕品、肖侬、王志英. 虚拟网络映射模型及其优化算法. 软件学报. 已录用<<12345678>>。

Java Runtime Systems Characterization and Architectural Implications

Java Runtime Systems Characterization and Architectural Implications

Java Runtime Systems: Characterization and Architectural ImplicationsRamesh Radhakrishnan,Member,IEEE,N.Vijaykrishnan,Member,IEEE, Lizy Kurian John,Senior Member,IEEE,Anand Sivasubramaniam,Member,IEEE,Juan Rubio,Member,IEEE,and Jyotsna SabarinathanAbstractÐThe Java Virtual Machine(JVM)is the cornerstone of Java technology and its efficiency in executing the portable Java bytecodes is crucial for the success of this technology.Interpretation,Just-In-Time(JIT)compilation,and hardware realization are well-known solutions for a JVM and previous research has proposed optimizations for each of these techniques.However,each technique has its pros and cons and may not be uniformly attractive for all hardware platforms.Instead,an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms.Toward this goal,this paper examines architectural issues from both the hardware and JVM implementation perspectives.The paper starts by identifying the important execution characteristics of Javaapplications from a bytecode perspective.It then explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs and investigates the CPU and cache architectural support that would benefit JVMimplementations.We also study the available parallelism during the different execution modes using applications from the SPECjvm98 benchmarks.At the bytecode level,it is observed that less than45out of the256bytecodes constitute90percent of the dynamic bytecode stream.Method sizes fall into a trinodal distribution with peaks of1,9,and26bytecodes across all benchmarks.Thearchitectural issues explored in this study show that,when Java applications are executed with a JIT compiler,selective translation using good heuristics can improve performance,but the saving is only10-15percent at best.The instruction and data cacheperformance of Java applications are seen to be better than that of C/C++applications except in the case of data cache performance in the JIT mode.Write misses resulting from installation of JIT compiler output dominate the misses and deteriorate the data cacheperformance in JIT mode.A study on the available parallelism shows that Java programs executed using JIT compilers haveparallelism comparable to C/C++programs for small window sizes,but falls behind when the window size is increased.Java programs executed using the interpreter have very little parallelism due to the stack nature of the JVM instruction set,which is dominant in the interpreted execution mode.In addition,this work gives revealing insights and architectural proposals for designing an efficient Java runtime system.Index TermsÐJava,Java bytecodes,CPU and cache architectures,ILP,performance evaluation,benchmarking.æ1I NTRODUCTIONT HE Java Virtual Machine(JVM)[1]is the cornerstone of Java technology,epitomizing theªwrite-once run-any-whereºpromise.It is expected that this enabling technology will make it a lot easier to develop portable software and standardized interfaces that span a spectrum of hardware platforms.The envisioned underlying platforms for this technology include powerful(resource-rich)servers,net-work-based and personal computers,together with resource-constrained environments such as hand-held devices,specialized hardware/embedded systems,and even household appliances.If this technology is to succeed,it is important that the JVM provide an efficient execution/ runtime environment across these diverse hardware plat-forms.This paper examines different architectural issues, from both the hardware and JVM implementation perspec-tives,toward this goal.Applications in Java are compiled into the bytecode format to execute in the Java Virtual Machine(JVM).The core of the JVM implementation is the execution engine that executes the bytecodes.This can be implemented in four different ways:1.An interpreter is a software emulation of the virtualmachine.It uses a loop which fetches,decodes,andexecutes the bytecodes until the program ends.Dueto the software emulation,the Java interpreter has anadditional overhead and executes more instructionsthan just the bytecodes.2.A Just-in-time(JIT)compiler is an execution modelwhich tries to speed up the execution of interpretedprograms.It compiles a Java method into nativeinstructions on the fly and caches the nativesequence.On future references to the same method,the cached native method can be executed directlywithout the need for interpretation.JIT compilers.R.Radhakrishnan,L.K.John,and J.Rubio are with the Laboratory forComputer Architecture,Department of Electrical and Computer Engineer-ing,University of Texas at Austin,Austin,TX78712.E-mail:{radhakri,ljohn,jrubio}@..N.Vijaykrishnan and A.Sivasubramaniam are with the Department ofComputer Science and Engineering,220Pond Lab.,Pennsylvania State University,University Park,PA16802.E-mail:{vijay,anand}@..J.Sabarinathan is with the Motorola Somerset Design Center,6263McNeil Dr.#1112,Austin,TX78829.E-mail:jyotsna@.Manuscript received28Apr.2000;revised16Oct.2000;accepted31Oct.2000.For information on obtaining reprints of this article,please send e-mail to:tc@,and reference IEEECS Log Number112014.0018-9340/01/$10.00ß2001IEEEhave been released by many vendors,like IBM[2],Symantec[3],and piling duringprogram execution,however,inhibits aggressiveoptimizations because compilation must only incura small overhead.Another disadvantage of JITcompilers is the two to three times increase in theobject code,which becomes critical in memoryconstrained embedded systems.There are manyongoing projects in developing JIT compilers thataim to achieve C++-like performance,such asCACAO[4].3.Off-line bytecode compilers can be classified intotwo types:those that generate native code and thosethat generate an intermediate language like C.Harissa[5],TowerJ[6],and Toba[7]are compilersthat generate C code from bytecodes.The choice of Cas the target language permits the reuse of extensivecompilation technology available in different plat-forms to generate the native code.In bytecodecompilers that generate native code directly,likeNET[8]and Marmot[9],portability becomesextremely difficult.In general,only applications thatoperate in a homogeneous environment and thosethat undergo infrequent changes benefit from thistype of execution.4.A Java processor is an execution model thatimplements the JVM directly on silicon.It not onlyavoids the overhead of translation of the bytecodesto another processor's native language,but alsoprovides support for Java runtime features.It can beoptimized to deliver much better performance than ageneral purpose processor for Java applications byproviding special support for stack processing,multithreading,garbage collection,object addres-sing,and symbolic resolution.Java processors can becost-effective to design and deploy in a wide rangeof embedded applications,such as telephony andweb tops.The picoJava[10]processor from SunMicrosystems is an example of a Java processor.It is our belief that no one technique will be universally preferred/accepted over all platforms in the immediate future.Many previous studies[11],[12],[13],[10],[14]have focused on enhancing each of the bytecode execution techniques.On the other hand,a three-pronged attack at optimizing the runtime system of all techniques would be even more valuable.Many of the proposals for improve-ments with one technique may be applicable to the others as well.For instance,an improvement in the synchronization mechanism could be useful for an interpreted or JIT mode of execution.Proposals to improve the locality behavior of Java execution could be useful in the design of Java processors,as well as in the runtime environment on general purpose processors.Finally,this three-pronged strategy can also help us design environments that efficiently and seamlessly combine the different techniques wherever possible.A first step toward this three-pronged approach is to gain an understanding of the execution characteristics of different Java runtime systems for real applications.Such a study can help us evaluate the pros and cons of the different runtime systems(helping us selectively use what works best in a given environment),isolate architectural and runtime bottlenecks in the execution to identify the scope for potential improvement,and derive design enhance-ments that can improve performance in a given setting.This study embarks on this ambitious goal,specifically trying to answer the following questions:.Do the characteristics seen at the bytecode level favor any particular runtime implementation?Howcan we use the characteristics identified at thebytecode level to implement more efficient runtimeimplementations?.Where does the time go in a JIT-based execution(i.e., in translation to native code or in executing thetranslated code)?Can we use a hybrid JIT-inter-preter technique that can do even better?If so,whatis the best we can hope to save from such a hybridtechnique?.What are the execution characteristics when execut-ing Java programs(using an interpreter or JITcompiler)on general-purpose CPU(such as theSPARC)?Are these different from those for tradi-tional C/C++programs?Based on such a study,canwe suggest architectural support in the CPU(eithergeneral-purpose or a specialized Java processor)thatcan enhance Java executions?To our knowledge,there has been no prior effort that has extensively studied all these issues in a unified framework for Java programs.This paper sets out to answer some of the above questions using applications drawn from the SPECjvm98[15]benchmarks,available JVM implementa-tions such as JDK1.1.6[16]and Kaffe VM0.9.2[17],and simulation/profiling tools on the Shade[18]environment. All the experiments have been conducted on Sun Ultra-SPARC machines running SunOS5.6.1.1Related WorkStudies characterizing Java workloads and performance analysis of Java applications are becoming increasingly important and relevant as Java increases in popularity,both as a language and software development platform.A detailed characterization of the JVM workload for the UltraSparc platform was done in[19]by Barisone et al.The study included a bytecode profile of the SPECjvm98 benchmarks,characterizing the types of bytecodes present and its frequency distribution.In this paper,we start with such a study and extend it to characterize other metrics, such as locality and method sizes,as they impact the performance of the runtime environment very strongly. Barisone et e the profile information collected from the interpreter and JIT execution modes as an input to a mathematical model of a RISC architecture to suggest architectural support for Java workloads.Our study uses a detailed superscalar processor simulator and also includes studies on available parallelism to understand the support required in current and future wide-issue processors. Romer et al.[20]studied the performance of interpreters and concluded that no special hardware support is needed for increased performance.Hsieh et al.[21]studied the cache and branch performance of interpreted Java code,C/C++version of the Java code,and native code generated by Caffine (a bytecode to native code compiler)[22].They attribute the inefficient use of the microarchitectural resources by the interpreter as a significant performance penalty and suggest that an offline bytecode to native code translator is a more efficient Java execution model.Our work differs from these studies in two important ways.First,we include a JIT compiler in this study which is the most commonly used execution model presently.Second,the benchmarks used in our study are large real world applications,while the above-mentioned study uses microbenchmarks due to the unavailability of a Java benchmark suite at the time of their study.We see that the characteristics of the application used affects favor different execution modes and,therefore,the choice of benchmarks used is important.Other studies have explored possibilities of improving performance of the Java runtime system by understand-ing the bottlenecks in the runtime environment and ways to eliminate them.Some of these studies try to improve the performance through better synchronization mechan-isms [23],[24],[25],more efficient garbage collection techniques [26],and understanding the memory referen-cing behavior of Java applications [27],etc.Improving the runtime system,tuning the architecture to better execute Java workloads and better compiler/interpreter perfor-mance are all equally important to achieve efficient performance for Java applications.The rest of this paper is organized as follows:The next section gives details on the experimental platform.In Section 3,the bytecode characteristics of the SPECjvm98are presented.Section 4examines the relative performance of JIT and interpreter modes and explores the benefits of a hybrid strategy.Section 5investigates some of the questions raised earlier with respect to the CPU and cache architec-tures.Section 6collates the implications and inferences that can be drawn from this study.Finally,Section 7summarizes the contributions of this work and outlines directions for future research.2E XPERIMENTAL P LATFORMWe use the SPECjvm98benchmark suite to study the architectural implications of a Java runtime environment.The SPECjvm98benchmark suite consists of seven Java programs which represent different classes of Java applica-tions.The benchmark programs can be run using three different inputs,which are named s100,s10,and s1.Theseproblem sizes do not scale linearly,as the naming suggests.We use the s1input set to present the results in this paper and the effects of larger data sets,s10and s100,has also been investigated.The increased method reuse with larger data sets results in increased code locality,reduced time spent in compilation as compared to execution,and other such issues as can be expected.The benchmarks are run at the command line prompt and do not include graphics,AWT (graphical interfaces),or networking.A description of the benchmarks is given in Table 1.All benchmarks except mtrt are single-threaded.Java is used to build applications that span a wide range,which includes applets at the lower end to server-side applications on the high end.The observations cited in this paper hold for those subsets of applications which are similar to the SPECjvm98bench-marks when run with the dataset used in this study.Two popular JVM implementations have been used in this study:the Sun JDK 1.1.6[16]and Kaffe VM 0.9.2[17].Both these JVM implementations support the JIT and interpreted mode.Since the source code for the Kaffe VM compiler was available,we could instrument it to obtain the behavior of the translation routines for the JIT mode in detail.Some of the data presented in Sections 4and 5are obtained from the instrumented translate routines in Kaffee.The results using Sun's JDK are presented for the other sections and only differences,if any,from the KaffeVM environment are mentioned.The use of two runtime implementations also gives us more confidence in our results,filtering out any noise due to the implementation details.To capture architectural interactions,we have obtained traces using the Shade binary instrumentation tool [18]while running the benchmarks under different execution modes.Our cache simulations use the cachesim5simulators available in the Shade suite,while branch predictors have been developed in-house.The instruction level parallelism studies are performed utilizing a cycle-accurate superscalar processor simulator This simulator can be configured to a variety of out-of-order multiple issue configurations with desired cache and branch predictors.3C HARACTERISTICSAT THEB YTECODE L EVELWe characterize bytecode instruction mix,bytecode locality,method locality,etc.in order to understand the benchmarks at the bytecode level.The first characteristic we examine is the bytecode instruction mix of the JVM,which is a stack-oriented architecture.To simplify the discussion,weRADHAKRISHNAN ET AL.:JAVA RUNTIME SYSTEMS:CHARACTERIZATION ANDARCHITECTURAL IMPLICATIONS 133TABLE 1Description of the SPECjvm98Benchmarksclassify the instructions into different types based on their inherent functionality,as shown in Table 2.Table 3shows the resulting instruction mix for the SPECjvm98benchmark suite.The total bytecode count ranges from 2million for db to approximately a billion for compress .Most of the benchmarks show similar distribu-tions for the different instruction types.Load instructions outnumber the rest,accounting for 35.5percent of the total number of bytecodes executed on the average.Constant pool and method call bytecodes come next with average frequen-cies of 21percent and 11percent,respectively.From an architectural point of view,this implies that transferring data elements to and from the memory space allocated for local variables and the Java stack paring this with the benchmark 126.gcc from the SPEC CPU95suite,which has roughly 25percent of memory access operations when run on a SPARC V.9architecture,it can be seen that the JVM places greater stress on the memory system.Consequently,we expect that techniques such as instruction folding proposed in [28]for Java processors and instructioncombining proposed in [29]for JIT compilers can improve the overall performance of Java applications.The second characteristic we examine is the dynamic size of a method.1Invoking methods in Java is expensive as it requires the setting up of an execution environment and a new stack for each new method [1].Fig.1shows the method sizes for the different benchmarks.A trinodal distribution is observed,where most of the methods are either 1,9,or 26bytecodes long.This seems to be a characteristic of the runtime environment itself (and not of any particular application)and can be attributed to a frequently used library.However,the existence of single bytecode methods indicates the presence of wrapper methods to implement specific features of the Java language like private and protected methods or interfaces .These methods consist of a control transfer instruction which transfers control to an appropriate routine.Further analysis of the traces shows that a few unique bytecodes constitute the bulk of the dynamic bytecode134IEEE TRANSACTIONS ON COMPUTERS,VOL.50,NO.2,FEBRUARY 2001TABLE 2Classification ofBytecodesTABLE 3Dynamic Instruction Mix at the BytecodeLevel1.A java method is equivalent to a ªfunctionºor ªprocedureºin a procedural language like C.stream.In most benchmarks,fewer than 45distinct bytecodes constitute 90percent of the executed bytecodes and fewer than 33bytecodes constitute 80percent of the executed bytecodes (Table 4).It is observed that memory access and memory allocation-related bytecodes dominate the bytecode stream of all the benchmarks.This also suggests that if the instruction cache can hold the JVM interpreter code corresponding to these bytecodes (i.e.,all the cases of the switch statement in the interpreter loop),the cache performance will be better.Table 5presents the number of unique methods and the frequency of calls to those methods.The number of methods and the dynamic calls are obtained at runtime by dynamically profiling the application.Hence,only methods that execute at least once have been counted.Table 5also shows that the static size of the benchmarks remain constant across the different data sets (since the number of unique methods does not vary),although the dynamic instruction count increases for the bigger data sets (due to increased method calls).The number of unique calls has an impact on the number of indirect call sites present in the application.Looking at the three data sets,we see that there is very little difference in the number of methods across data sets.Another bytecode characteristic we look at is the method reuse factor for the different data sets.The method reuse factor can be defined as the ratio of method calls to number of methods visited at least once.It indicates the locality of methods.The method reuse factor is presented in Table 6.The performance benefits that can be obtained from using a JIT compiler are directly proportional to the method reuse factor since the cost of compilation is amortized over multiple calls in JIT execution.The higher number of method calls indicates that the method reuse in the benchmarks for larger data sets would be substantially more.This would then lead to better performance for the JITs (as observed in the next section).In Section 5,we show that the instruction count when the benchmarks are executed using a JIT compiler is much lower than when using an interpreter for the s100data set.Since there is higher method reuse in all benchmarks for the larger data sets,using a JIT results in better performance over an interpreter.The bytecode characteristics described in this section help in understanding some of the issues involved in the performance of the Java runtime system (presented in the remainder of the paper).4W HENORW HETHERTOT RANSLATEDynamic compilation has been popularly used [11],[30]to speed up Java executions.This approach avoids the costly interpretation of JVM bytecodes while sidestepping the issue of having to precompile all the routines that could ever be referenced (from both the feasibility and perfor-mance angles).Dynamic compilation techniques,however,pay the penalty of having the compilation/translation to native code falling in the critical path of program execution.Since this cost is expected to be high,it needs to be amortized over multiple executions of the translated code.Or else,performance can become worse than when the code is just interpreted.Knowing when to dynamically compile a method (using a JIT),or whether to compile at all,is extremely important for good performance.To our knowledge,there has not been any previous study that has examined this issue in depth in the context of Java programs,though thereRADHAKRISHNAN ETAL.:JAVA RUNTIME SYSTEMS:CHARACTERIZATION AND ARCHITECTURAL IMPLICATIONS 135Fig.1.Dynamic method size.TABLE 4Number of Distinct Bytecodes that Account for 80Percent,90Percent,and 100Percent of the Dynamic Instruction StreamTABLE 5Total Number ofMethod Calls (Dynamic)and Unique Methods for the Three Data Setshave been previous studies [13],[31],[12],[4]examining efficiency of the translation procedure and the translated code.Most of the currently available execution environ-ments,such as JDK 1.2[16]and Kaffe [17],employ limited heuristics to decide on when (or whether)to JIT.They typically translate a method on its first invocation,regardless of how long it takes to interpret/translate/execute the method and how many times the method is invoked.It is not clear if one could do better (with a smarter heuristic)than what many of these environments provide.We investigate these issues in this section using five SPECjvm98[15]benchmarks (together with a simple HelloWorld program 2)on the Kaffe environment.Fig.2shows the results for the different benchmarks.All execution times are normalized with respect to the execu-tion time taken by the JIT mode on Kaffe.On top of the JIT execution bar is given the ratio of the time taken by this mode to the time taken for interpreting the program using Kaffe VM.As expected (from the method reuse character-istics for the various benchmarks),we find that translating (JIT-ing)the invoked methods significantly outperforms interpreting the JVM bytecodes for the SPECjvm98.The first bar,which corresponds to execution time using the default JIT,is further broken down into two components,the total time taken to translate/compile the invoked methods and the time taken to execute these translated (native code)methods.The considered workloads span the spectrum,from those in which the translation times dominate,such as hello and db (because most of the methods are neither time consuming nor invoked numerous times),to those in which the native code execution dominates,such as compress and jack (where the cost of translation is amortized over numerous invocations).The JIT mode in Kaffe compiles a method to native code on its first invocation.We next investigate how well the smartest heuristic can do so that we compile only those methods that are time consuming (the translation/compila-tion cost is outweighed by the execution time)and interpret the remaining methods.This can tell us whether we should strive to develop a more intelligent selective compilation heuristic at all and,if so,what the performance benefit is that we can expect.Let us say that a method i takes s i time to interpret, i time to translate,and i i time to execute the translated code.Then,there exists a crossover point x i i a s i Ài i ,where it would be better to translate themethod if the number of times a method is invoked n i b x i and interpret it otherwise.We assume that an oracle supplies n i (the number of times a method is invoked)and x i (the ideal cut-off threshold for a method).If n i `x i ,we interpret all invocations of the method,and otherwise translate it on the very first invocation.The second bar in Fig.2for each application shows the performance with this oracle,which we shall call opt .It can be observed that there is very little difference between the naive heuristic used by Kaffe and opt for compress and jack since most of the time is spent in the execution of the actual code anyway (very little time in translation or interpretation).As the translation component gets larger (applications like db ,javac ,or hello ),the opt model suggests that some of the less time-consuming (or less frequently invoked)methods be inter-preted to lower the execution time.This results in a 10-15percent savings in execution time for these applica-tions.It is to be noted that the exact savings would definitely depend on the efficiency of the translation routines,the translated code execution and interpretation.The opt results give useful insights.Fig.2shows that,by improving the heuristic that is employed to decide on when/whether to JIT,one can at best hope to trim 10-15percent in the execution time.It must be observed that the 10-15percent gains observed can vary with the amount of method reuse and the degree of optimization that is used.For example,we observed that the translation time for the Kaffe JVM accounts for a smaller portion of overall execution time with larger data sets (7.5percent for the s10dataset (shown in Table 7)as opposed to the 32percent for the s1dataset).Hence,reducing the translation overhead will be of lesser importance when execution time dominates translation time.However,as more aggressive optimizations are used,the translation time can consume a significant portion of execution time for even larger datasets.For instance,the base configuration of the translator in IBM's Jalapeno VM [32]takes negligible translation time when using the s100data set for javac.However,with more aggressive optimizations,about 30percent of overall execution time is consumed in translation to ensure that the resulting code is executed much faster [32].Thus,there exists a trade-off between reducing the amount of time spent in optimizing the code and the amount of time spent in actually executing the optimized code.136IEEE TRANSACTIONS ON COMPUTERS,VOL.50,NO.2,FEBRUARY2001Fig.2.Dynamic compilation:How well can we do?2.While we do not make any major conclusions based on this simple program,it serves to observe the behavior of the JVM implementation while loading and resolving system classes during system initialization.TABLE 6Method Reuse Factor for the Different DataSets。

存储HCIP模考试题与答案

存储HCIP模考试题与答案

存储HCIP模考试题与答案一、单选题(共38题,每题1分,共38分)1.某应用的数据初始容量是 500GB.,备份频率是每周 1 次全备,6 次增备,全备和增备的数据保存周期均为 4 周,冗余比为 20%。

则 4 周的后端存储容量为:A、3320GB.B、3504GB.C、4380GB.D、5256GB.正确答案:D2.以下哪个不是 NAS 系统的体系结构中必须包含的组件?A、可访问的磁盘阵列B、文件系统C、访问文件系统的接口D、访问文件系统的业务接口正确答案:D3.华为主备容灾方案信息调研三要素不包括哪一项?A、项目背景B、客户需求与提炼C、项目实施计划D、现网环境确认正确答案:C4.以下哪个不是华为 WushanFS 文件系统的特点A、性能和容量可单独扩展B、全对称分布式架构,有元数据节点C、元数据均匀分散,消除了元数据节点性能瓶颈D、支持 40PB 单一文件系统正确答案:B5.站点 A 需要的存储容量为 2543GB.,站点 B 需要的存储容量为3000GB.,站点 B 的备份数据远程复制到站点 A保存。

考虑复制压缩的情况,压缩比为 3,计算站点 A 需要的后端存储容量是多大?A、3543GB.B、4644GB.C、3865GB.D、4549GB.正确答案:A6.关于华为 Oceanstor 9000 各种类型节点下硬盘的性能,由高到低的排序正确的是哪一项?A、P25 Node SSD 硬盘-〉P25 Node SAS 硬盘-〉P12 SATA 硬盘-〉P36 Node SATA 硬盘-〉C36 SATA 硬盘B、P25 Node SSD 硬盘-〉P25 Node SAS 硬盘-〉P12 SATA 硬盘-〉C36 SATA 硬盘C、P25 Node SSD 硬盘-〉P25 Node SAS 硬盘-〉P36 Node SATA 硬盘-〉P12 SATA 硬盘-〉C36 SATA 硬盘D、P25 Node SSD 硬盘-〉P25 Node SAS 硬盘-〉P36 Node SATA 硬盘-〉C36 SATA 硬盘-〉P12 SATA 硬盘正确答案:C7.关于华为 Oceanstor 9000 软件模块的描述不正确的事哪一项?A、OBS(Object-Based Store)为文件系统元数据和文件数据提供可靠性的对象存储功能B、CA(Chent Agent)负责 NFS/CIFS/FTP 等应用协议的语义解析,并发送给底层模块处理C、快照、分级存储、远程复制等多种增值特性功能是由 PVS 模块提供的D、MDS(MetaData Service)管理文件系统的元数据,系统的每一个节点存储可所以元数据正确答案:D8.下面属于华为存储 danger 类型的高危命令的时哪个命令?A、reboot systemB、import configuration_dataC、show alarmD、chang alarm clear sequence list=3424正确答案:A9.华为 Oceanstor 9000 系统提供的文件共享借口不包括以下哪个选项?A、ObjectB、NFSC、CIFSD、FTP正确答案:A10.以下哪个选项不属于 oceanstor toolkit 工具的功能?A、数据迁移功能B、升级功能C、部署功能D、维护功能正确答案:A11.某用户用 Systemreporter 分析其生产设备 Oceanstor 9000 时发现某分区的部分节点 CPU 利用率超过 80%,但平均CPU 利用率约为 50%,另外发现某个节点的读写带宽始终保持在性能规格的 80%以上,其他节点则均在 60%以下,该场景下,推荐使用如下哪种负载均衡策略?A、轮循方式B、按 CPU 利用率C、按节点吞吐量D、按节点综合负致正确答案:D12.下列选项中关于对象存储服务(兼容 OpenStack Swift 接口) 概念描述错误的是:A、Account 就是资源的所有者和管理者,使用Account 可以对Container 进行增、删、查、配置属性等操作,也可以对 Object 进行上传、下载、查询等操作。

NVIDIA Kepler兼容性指南说明书

NVIDIA Kepler兼容性指南说明书

DA-06287-001_v1.0 | April 2012 Application NoteDOCUMENT CHANGE HISTORYTABLE OF CONTENTS Chapter 1.Kepler Compatibility (1)1.1About This Document (1)1.2Application Compatibility on Kepler (1)1.3Verifying Kepler Compatibility for Existing Applications (2)1.3.1Applications Using CUDA Toolkit 4.1 or Earlier (2)1.3.2Applications Using CUDA Toolkit 4.2 (2)1.4Building Applications with Kepler Support (3)1.4.1CUDA Runtime API Applications (3)1.4.2CUDA Driver API Applications (6)APPENDIX A.Revision History (8)A.1Version 1.0 (8)1.1ABOUT THIS DOCUMENTThis application note, Kepler Compatibility Guide for CUDA Applications, is intended to help developers ensure that their NVIDIA® CUDA TM applications will run effectively on GPUs based on the NVIDIA® Kepler Architecture. This document provides guidance to developers who are already familiar with programming in CUDA C/C++ and want to make sure that their software applications are compatible with Kepler.1.2APPLICATION COMPATIBILITY ON KEPLERThe NVIDIA CUDA C compiler, nvcc, can be used to generate both architecture-specific cubin files and forward-compatible PTX versions of each kernel. Each cubin file targets a specific compute-capability version and is forward-compatible only with CUDA architectures of the same major version number. For example, cubin files that target compute capability 2.0 are supported on all compute-capability 2.x (Fermi) devices but are not supported on compute-capability 3.0 (Kepler) devices. For this reason, to ensure forward compatibility with CUDA architectures introduced after the application has been released, it is recommended that all applications support launching PTX versions of their kernels.1Applications that already include PTX versions of their kernels should work as-is on Kepler-based GPUs. Applications that only support specific GPU architectures via cubin files, however, will need to be updated to provide Kepler-compatible PTX or cubin s.1 CUDA Runtime applications containing both cubin and PTX code for a given architecture will automatically use the cubin by default, keeping the PTX path strictly for forward-compatibility purposes.1.3VERIFYING KEPLER COMPATIBILITY FOREXISTING APPLICATIONSThe first step is to check that Kepler-compatible device code (at least PTX) is compiled in to the application. The following sections show how to accomplish this for applications built with different CUDA Toolkit versions.1.3.1Applications Using CUDA Toolkit 4.1 or Earlier CUDA applications built using CUDA Toolkit versions2.1 through 4.1 are compatible with Kepler as long as they are built to include PTX versions of their kernels. To test that PTX JIT is working for your application, you can do the following:④Download and install the latest driver from /drivers.④Set the environment variable CUDA_FORCE_PTX_JIT=1④Create an empty temporary directory on your system.④Set the environment variable CUDA_CACHE_PATH to be the path to this empty directory.④Launch your application.When starting a CUDA application for the first time with the above environment flag, the CUDA driver will JIT-compile the PTX for each CUDA kernel that is used into native cubin code. The generated cubin for the target GPU architecture is cached on disk by the CUDA driver.If you set the environment variables above and then launch your program and it works properly, and if the directory you specified with the CUDA_CACHE_PATH environment variable is now populated with cache files, then you have successfully verified Kepler compatibility. Note that it is not necessary to inspect the contents of the cache files themselves; just check that the previously empty cache directory is now non-empty.Be sure to unset these two environment variables when you are done testing if you do not normally use them. The temporary cache directory you created is safe to delete. 1.3.2Applications Using CUDA Toolkit 4.2CUDA applications built using CUDA Toolkit 4.2 are compatible with Kepler as long as they are built to include kernels in either Kepler-native cubin format (see Section 1.4) or PTX format (see Section 1.3.1 above) or both.1.4BUILDING APPLICATIONS WITH KEPLERSUPPORTThe methods used to build your application with support for Kepler depend on the version of the CUDA Toolkit used and on the choice of the CUDA Runtime API or CUDA Driver API.Note: The CUDA Runtime API is characterized by the use of functions named with the cuda*() prefix and by launching kernels using the triple-angle-bracket <<<>>> notation. The CUDA driver API functions use the cu*() prefix, including for kernel launch. 1.4.1CUDA Runtime API ApplicationsWhen a CUDA application launches a kernel, the CUDA Runtime determines the compute capability of each GPU in the system and uses this information to automatically find the best matching cubin or PTX version of the kernel that is available. If a cubin file supporting the architecture of the target GPU is available, it is used; otherwise, the CUDA Runtime will load the PTX and JIT-compile that PTX to the GP U’s native cubin format before launching it. If neither is available, then the kernel launch will fail.The main advantages of providing native cubin s are as follows:④It saves the end user the time it takes to PTX JIT a kernel that has been compiled asPTX. (However, since the CUDA driver will cache the cubin generated as a result of the PTX JIT, this is mostly a one-time cost for a given user.)④PTX JIT-compiled kernels often cannot take advantage of architectural features ofnewer GPUs, meaning that native-compiled code may be faster or of greater accuracy.1.4.1.1Applications Using CUDA Toolkit 4.1 or EarlierThe compilers included in CUDA Toolkit 4.1 or earlier generate cubin files native to earlier NVIDIA architectures such as Fermi, but they cannot generate cubin files native to the Kepler architecture. To allow support for Kepler and future architectures when using version 4.1 or earlier of the CUDA Toolkit, the compiler must generate a PTX version of each kernel.Below are compiler settings that could be used to build mykernel.cu to run on Fermi and earlier devices natively and on Kepler devices via PTX JIT. In these examples, the lines shown in blue provide compatibility with earlier architectures, and the lines shown in red provide a PTX path for compatibility with Kepler and later architectures.Note that compute_XX refers to a PTX version and sm_XX refers to a cubin version. The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Kepler compatibility.Windows:nvcc.exe -ccbin "C:\vs2008\VC\bin"-Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"–gencode=arch=compute_10,code=sm_10–gencode=arch=compute_20,code=sm_20–gencode=arch=compute_20,code=compute_20--compile -o "Release\mykernel.cu.obj" "mykernel.cu"Mac/Linux:/usr/local/cuda/bin/nvcc–gencode=arch=compute_10,code=sm_10–gencode=arch=compute_20,code=sm_20–gencode=arch=compute_20,code=compute_20-O2 -o mykernel.o -c mykernel.cuAlternatively, you may be familiar with the simplified nvcc command-line option -arch=sm_XX , which is a shorthand equivalent to the following more explicit –gencode= command-line options used above. -arch=sm_XX expands to the following:–gencode=arch=compute_XX,code=sm_XX–gencode=arch=compute_XX,code=compute_XXHowever, while the -arch=sm_XX command-line option does result in inclusion of a PTX back-end target by default, it can only specify a single target cubin architecture at a time, and it is not possible to use multiple -arch= options on the same nvcc command line, which is why the examples above use -gencode= explicitly.1.4.1.2Applications Using CUDA Toolkit 4.2Beginning with version 4.2 of the CUDA Toolkit, nvcc can generate cubin files native to the Kepler architecture (compute capability 3.0). When using CUDA Toolkit 4.2, to ensure that nvcc will generate cubin files for all released GPU architectures as well as a PTX version for forward compatibility with future GPU architectures, specify the appropriate -gencode= parameters on the nvcc command line as shown in the examples below.In these examples, the lines shown in blue provide compatibility with earlier architectures, the lines shown in green provide native cubin s for Kepler, and the lines in red provide a PTX path for compatibility with future architectures.Windows:nvcc.exe -ccbin "C:\vs2008\VC\bin"-Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"-gencode=arch=compute_10,code=sm_10-gencode=arch=compute_20,code=sm_20-gencode=arch=compute_30,code=sm_30-gencode=arch=compute_30,code=compute_30--compile -o "Release\mykernel.cu.obj" "mykernel.cu"Mac/Linux:/usr/local/cuda/bin/nvcc-gencode=arch=compute_10,code=sm_10-gencode=arch=compute_20,code=sm_20-gencode=arch=compute_30,code=sm_30-gencode=arch=compute_30,code=compute_30-O2 -o mykernel.o -c mykernel.cuNote that compute_XX refers to a PTX version and sm_XX refers to a cubin version. The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one should be PTX to provide compatibility with future architectures.1.4.2CUDA Driver API ApplicationsApplications that use the CUDA Driver API load their own kernels explicitly. Therefore, the kernel-loading portions of such applications must include a path capable of loading PTX when native cubin s for the target GPU(s) are not available.④Compile CUDA kernel files to PTX, even if also compiling native cubin files forexisting architectures. If multiple compilation target types/versions are to be used, nvcc must be called separately for each generated output file of either type, and the type and version must be specified explicitly at compile time. (An advancedtechnique is to use a “fat binary” (fatbin) file, which contains both cubin and PTX formats. This technique is outside the scope of this document.)A common pattern in many applications is to include cubin s for all supported existingarchitectures plus PTX of the highest-available version for forward compatibility to future architectures.The example below demonstrates compilation of compute_20 PTX, which will work on devices of compute capability 2.x and 3.0, but not on devices of compute capability1.x. Presumably an application using this example as-is would also include cubin s forcompute capability 1.x and/or 2.x as well.Windows:nvcc.exe -ccbin "C:\vs2008\VC\bin"-Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"-ptx -arch=compute_20-o "pute_20.ptx" "mykernel.cu"Mac/Linux:/usr/local/cuda/bin/nvcc-ptx -arch=compute_20-O2 -o pute_20.ptx "mykernel.cu"④At runtime, your application will need to explicitly check the compute capability ofthe current GPU with the CUDA Driver API function in order to select the best-available cubin or PTX to load. The deviceQueryDrv code sample from the NVIDIA GPU Computing SDK includes a detailed example of the use of this function.cuDeviceComputeCapability(&major, &minor, dev)④Refer to the “PTX Just-in-Time Compilation” (ptxjit) code sample GPU ComputingSDK, available at the URL below, which demonstrates how to use the CUDA Driver API to launch PTX kernels./cuda-cc-sdk-code-samplesA more complex example can be found in the matrixMulDrv code sample from the GPU Computing SDK, which follows a pattern similar to the following: CUmodule cuModule;CUfunction cuFunction = 0;string ptx_source;// Helper function load PTX source to a stringfindModulePath ("matrixMul_kernel.ptx",module_path, argv, ptx_source));// We specify PTX JIT compilation with parametersconst unsigned int jitNumOptions = 3;CUjit_option *jitOptions = new CUjit_option[jitNumOptions];void **jitOptVals = new void*[jitNumOptions];// set up size of compilation log bufferjitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;int jitLogBufferSize = 1024;jitOptVals[0] = (void *)(size_t)jitLogBufferSize;// set up pointer to the compilation log bufferjitOptions[1] = CU_JIT_INFO_LOG_BUFFER;char *jitLogBuffer = new char[jitLogBufferSize];jitOptVals[1] = jitLogBuffer;// set up maximum # of registers to be usedjitOptions[2] = CU_JIT_MAX_REGISTERS;int jitRegCount = 32;jitOptVals[2] = (void *)(size_t)jitRegCount;// Loading a module will force a PTX to be JITstatus = cuModuleLoadDataEx(&cuModule, ptx_source.c_str(),jitNumOptions, jitOptions,(void **)jitOptVals);printf("> PTX JIT log:\n%s\n", jitLogBuffer);A.1 VERSION 1.0Initial public release.Kepler Compatibility Guidefor CUDA Applications DA-06287-001_v1.0| 8 NoticeALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FORA PARTICULAR PURPOSE.Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.TrademarksNVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.Copyright© 2012 NVIDIA Corporation. All rights reserved.。

多核处理器cache一致性技术综述

多核处理器cache一致性技术综述

多核处理器cache一致性技术综述摘要:本文介绍了实现多核处理器cache一致性的基本实现技术,并分析了其存在的问题。

根据存在的问题,介绍了几种最新的解决方案。

关键词:cache 一致性监听协议目录协议性能能耗1 基本实现技术:实现cache一致性的关键在于跟踪所有共享数据块的状态。

目前广泛采用的有以下2种协议,它们分别使用不同的技术跟踪共享数据:1.监听协议( Snooping)处理器在私有的缓存中保存共享数据的复本。

同时处理器对总线进行监听,如果总线上的请求与自己相关,则进行处理,否则忽略总线请求信号。

2.目录式(Directory based)使用目录来存放各节点cache中共享数据的信息,把cache一致性请求只发给存放有相应数据块的节点,从而支持cache的一致性。

下面分别就监听协议和目录协议做简单的介绍:1.1 监听协议监听协议通过总线监听机制实现cache和共享内存之间的数据一致性。

因为其可以通过内存的总线来查询cache的状态。

所以监听协议是目前多核处理器主要采用的一致性技术。

监听协议有两种。

一种称为写无效协议(write invalidate protocol) ,它在处理器写数据块之前通过总线广播使其它该数据的共享复本(主要存储在其它处理器的私有缓存中)变为无效,以保证该处理器能独占访问数据块,实现一致性。

另一种称为写更新(write update) ,它在处理器写入数据块时更新该数据块所有的副本。

因为在基于总线的多核处理器中总线和内存带宽是最紧张的资源,而写无效协议不会给总线和内存带来太大的压力,所以目前处理器实现主要都是应用写无效协议。

读请求:如果处理器在私有缓存中找到数据块,则读取数据块。

如果没有找到数据块,它向总线广播读缺失请求。

其它处理器监听到读缺失请求,则检查对应地址数据块状态:无效状态下,向总线发读缺失,总线向内存请求数据块;共享状态下,直接把数据发送给请求处理器;独占状态下,远端处理器节点把数据回写,状态改为共享,并将数据块发送给请求处理器。

戴尔(Dell)PowerSwitch S3100系列交换机说明书

戴尔(Dell)PowerSwitch S3100系列交换机说明书

Dell PowerSwitch S3100 Series © 2022 Dell Inc. or its subsidiaries.The S3100 switch series offers a power-efficient and resilient Gigabit Ethernet (GbE) switching solution with integrated 10GbE uplinks for advanced Layer 3 distribution for offices and campus networks. The S3100 switch series has high-performance capabilities and wire-speed performance utilizing a non-blocking architecture to easily handle unexpected traffic loads. Use dual internal hot-swappable 80PLUS-certified power supplies for high availability and power efficiency. The switches offer simple management and scalability via an 84Gbps (full-duplex) highavailability stacking architecture that allows management of up to 12 switches from a single IP address.Modernize campus network architecturesModernize campus network architectures with apower-efficient and resilient 1/10GbE switching solution with dense Power over Ethernet Plus (PoE+). SelectS3100 models offer 24 or 48 ports of PoE+ to deliver clean power to network devices such as wireless access points (APs), Voice-over-IP (VoIP) handsets, video conferencing systems and security cameras. For greater interoperability in multivendor networks, S3100 series switches offer the latest open-standard protocols and include technology to interface with Cisco protocol PVST+. The S3100 series supports Dell OS9, VLT and network virtualization features such as VRF-lite and support for Dell Embedded Open Automation Framework.Leverage familiar tools and practicesAll S3100 switches include Dell OS9 for easier deployment and greater interoperability. One common command line interface (CLI) using a well-known command language means a faster learning curve for network administrators.Deploy with confidence at any scaleS3100 series switches help create performanceassurance with a data rate up to 260Gbps (full duplex) and a forwarding rate up to 193Mpps. Scale easily with built-in rear stacking ports. Switch stacks of up to 624 ports can be managed from a single screen using the highly-available stacking architecture for high-density aggregation with seamless redundant availability.Hardware, performance and efficiency•Up to 48 line-rate GbE ports of copper or 24 line-rate ports of fiber, two combo ports for fiber/copper flexibili -ty, and two integrated 10GbE SFP+ ports• Up to 48 ports of PoE+ in 1RU without an external power supply• Hot swappable expansion module supporting dual-port SFP+ or dual-port 10GBaseT• Integrated stacking ports with support up to 84Gbps •Up to 624 ports in a 12-unit stack for high-density, high-availability aggregation and distribution in wiring closets/MDFs. Non-stop forwarding and fast failover in stack configurations•Available with dual 80PLUS-certified hot swappable power supplies. Variable speed fan operation helps decrease cooling and power costs•Energy-Efficient Ethernet and lower-power PHYsreduce power to inactive ports and idle links, providing energy savings from the power cord to the port •Dell Fresh Air compliance for operation in environ-ments up to 113°F (45°C) helps reduce cooling costsin temperature constrained deploymentsDELL POWERSWITCH S3100 SERIESHigh-performance managed Ethernet switches designed for non-blocking access2Dell PowerSwitch S3100 Series© 2022 Dell Inc. or its subsidiaries.**Requires C15 plugDeploying, configuring and managing• Tool-less ReadyRails™ significantly reduces rack installation time•Management via an intuitive and familiar CLI, SN -MP-based manage- ment console application (includingDell OpenManage Network Manager), Telnet or serialconnection • Private VLAN support• AAA authorization, TACACS+ accounting and RADIUS support for comprehensive secure access•Authentication tiering allows network administrators to tier port authentication methods such as 802.1x, MAC Authentication Bypass in priority order so that a single port can provide flexible access and security•Achieve high availability and full bandwidth utilization with VLT and support firmware upgrades without taking the network offline•Interfaces with PVST+ protocol for greater flexibility and interoperability in Cisco networks • Advanced Layer 3 IPv4 and IPv6 functionality • Flexible routing options with policy-based routing to route packets based on assigned criteria beyond destination address• Routed Port Monitoring (RPM) covers a Layer 3 domain without costly dedicated network taps•OpenFlow 1.3 provides the ability to separate thecontrol plane from the forwarding plane for deployment in SDN environments*Contact your Dell Technologies representative for a full list of validated storage arrays.3Dell PowerSwitch S3100 Series © 2022 Dell Inc. or its subsidiaries.Physical2 rear stacking ports (21Gbps) supporting up to 84Gbps (full-duplex)2 integrated front 10GbE SFP+ dedicated ports Out-of-band management port (10/100/1000BASE-T)USB (Type A) port for configuration via USB flash driveAuto-negotiation for speed and flow control Auto-MDI/MDIX, port mirroringEnergy-Efficient Ethernet per port settings Redundant variable speed fans Air flow: I/O to power supplyRJ45 console/management port with RS232 signaling (RJ-45 to female DB-9 connector cable included)Dual firmware images on-boardSwitching engine model: Store and forward ChasisSize (1RU): 1.7126in x 17.0866in x 16.0236in (43.5mm x 434.0mm x 407.0mm) (H x W x D)Approximate weight: 13.2277lbs/6kg (S3124 and S3124F), 14.5505lbs/6.6kg (S3124P), 15.2119lbs/6.9kg (S3148P)ReadyRails rack mounting system, no tools requiredEnvironmentalPower supply: 200W (S3124, S3124F and S3148), 715W or 1,100W (S3124P), 1,100W (S3148P)Power supply efficiency: 80% or better in all operating modesMax. thermal output (BTU/hr): 182.55 (S3124), 228.96 (S3124F), 4391.42 (S3124P), 221.11 (S3148), 7319.04 (S3148P)Power consumption max (watts): 52.8 (S3124), 67.1 (S3124F), 1,287 (S3124P), 74.8 (S3148), 2,145 (S3148P)Operating temperature: 32° to 113°F (0° to 45°C)Operating relative humidity: 95%Storage temperature: –40° to 149°F (–40° to 65°C)Storage relative humidity: 85%PerformanceMAC addresses: 56K (80K in L2 scaled mode)Static routes: 16K (IPv4)/8K (IPv6)Dynamic routes: 16K (IPv4)/8K (IPv6) Switch fabric capacity: 212Gbps (S3124, S3124F and S3124P) (full duplex) 260Gbps (S3148 and S3148P)Forwarding rate: 158Mpps (S3124, S3124F and S3124P) 193Mpps (S3148 and S3148P)Link aggregation: 16 links per group, 128 groups Priority queues per port: 8Line-rate Layer 2 switching: All (non-blocking)Line-rate Layer 3 routing: All (non-blocking)Flash memory: 1GPacket buffer memory: 4MB CPU memory: 2GB DDR3Layer 2 VLANs: 4K MSTP: 64 instances VRF-lite: 511 instancesLine-rate Layer 2 switching: All protocols, including IPv4 and IPv6Line-rate Layer 3 routing: IPv4 and IPv6IPv4 host table size: 22K (42K in L3 scaled hosts mode)IPv6 host table size: 16K (both global + Link Local)(32K in L3 scaled hosts mode)IPv4 Multicast table size: 8KLAG load balancing: Based on Layer 2, IPv4 or IPv6 headersIEEE compliance 802.1AB LLDP802.1D Bridging, STP 802.1p L2 Prioritization 802.1Q VLAN T agging 802.1Qbb PFC 802.1Qaz ETS 802.1s MSTP 802.1w RSTP802.1x Network Access Control 802.1x-2010 Port Based Network Access Control802.3ab Gigabit Ethernet (1000BASE-T)802.3ac Frame Extensions for VLAN T agging802.3ad Link Aggregation with LACP 802.1ax Link Aggregation Revision - 2008 and 2011802.3ae 10 Gigabit Ethernet (10GBase-X)802.3af PoE (for S3124P and S3148P)802.3at PoE+ (for S3124P and S3148P)802.3az Energy Efficient Ethernet (EEE)802.3u Fast Ethernet (100Base-TX) on mgmt ports 802.3x Flow Control 802.3z Gigabit Ethernet (1000Base-X) ANSI/TIA-1057 LLDP-MED Force10 PVST+MTU 12,000 bytes RFC and I-D compliance General Internet protocols 768 UDP 793 TCP 854 Telnet 959 FTPGeneral IPv4 protocols 791 IPv4792 ICMP 826 ARP 1027 Proxy ARP 1035 DNS (client)1042 Ethernet Transmission 1305 NTPv31519 CIDR 1542 BOOTP (relay)1812 Requirements for IPv4 Routers 1918 Address Allocation for Private Internets 2474 Diffserv Field in IPv4 and Ipv6 Headers 2596 Assured Forwarding PHB Group 3164 BSD Syslog 3195 Reliable Delivery for Syslog3246Expedited Assured Forwarding4364 VRF-lite (IPv4 VRF with OSPF and BGP)5798VRRPGeneral IPv6 protocols 1981 Path MTU Discovery Features 2460 Internet Protocol, Version 6 (IPv6) Specification 2464 Transmission of IPv6 Packets over Ethernet Networks 2711 IPv6 Router Alert Option 4007 IPv6 Scoped Address Architecture 4213 Basic Transition Mechanisms for IPv6 Hosts and Routers 4291 IPv6 Addressing Architecture 4443 ICMP for IPv64861 Neighbor Discovery for IPv64862 IPv6 Stateless Address Autoconfiguration 5095 Deprecation of Type 0 Routing Headers in IPv6IPv6 Management support (telnet, FTP , TACACS, RADIUS, SSH, NTP)RIP 1058RIPv1 2453 RIPv2OSPF (v2/v3) 1587 NSSA 4552 Authentication/ 2154 OSPF Digital Signatures 2328 OSPFv2 OSPFv3 2370 Opaque LSA 5340 OSPF for IPv6IS-IS 5301 Dynamic hostname exchange mechanism for IS-IS 5302 Domain-wide prefix distribution with two- level IS-IS5303 Three way handshake for IS-IS point- to-point adjacencies 5308 IS-IS for IPv6BGP 1997 Communities 2385 MD5 2545 BGP-4 Multiprotocol Extensions for IPv6 Inter-Domain Routing 2439 Route Flap Damping 2796 Route Reflection 2842 Capabilities 2858 Multiprotocol Extensions 2918 Route Refresh 3065 Confederations 4360 Extended Communities 4893 4-byte ASN 5396 4-byte ASN representations draft-ietf-idr-bgp4-20 BGPv4draft-michaelson-4byte-as-representation-05 4-byte ASN Representation (partial) draft-ietf-idr-add-paths-04.txt ADD PATH Multicast 1112 IGMPv1 2236 IGMPv2 3376 IGMPv3 MSDPdraft-ietf-pim-sm-v2-new-05PIM-SMw4Dell PowerSwitch S3100 Series © 2022 Dell Inc. or its subsidiaries.Network management 1155 SMIv11157 SNMPv11212 Concise MIB Definitions 1215 SNMP Traps 1493 Bridges MIB 1850 OSPFv2 MIB 1901 Community-Based SNMPv22011 IP MIB 2096 IP Forwarding Table MIB 2578 SMIv22579 Textual Conventions for SMIv22580 Conformance Statements for SMIv22618 RADIUS Authentication MIB 2665 Ethernet-Like Interfaces MIB 2674 Extended Bridge MIB 2787 VRRP MIB 2819 RMON MIB (groups 1, 2, 3, 9)2863 Interfaces MIB 3273 RMON High Capacity MIB 3410 SNMPv33411 SNMPv3 Management Framework 3412 Message Processing and Dispatching for the Simple Network Management Protocol (SNMP)3413 SNMP Applications 3414 User-based Security Model (USM) for NMPv33415 VACM for SNMP 3416 SNMPv23417 Transport mappings for SNMP 3418 SNMP MIB 3434 RMON High Capacity Alarm MIB 3584 Coexistence between SNMP v1, v2 and v34022 IP MIB 4087 IP Tunnel MIB 4113 UDP MIB 4133 Entity MIB 4292 MIB for IP 4293 MIB for IPv6 Textual Conventions 4502 RMONv2 (groups 1,2,3,9)5060 PIM MIBANSI/TIA-1057 LLDP-MED MIB Dell_ITA.Rev_1_1 MIBdraft-grant-tacacs-02 TACACS+draft-ietf-idr-bgp4-mib-06 BGP MIBv1IEEE 802.1AB LLDP MIBIEEE 802.1AB LLDP DOT1 MIB IEEE 802.1AB LLDP DOT3 MIB sFlowv5 sFlowv5 MIB (version 1.3)FORCE10-BGP4-V2-MIB Force10 BGP MIB (draft-ietf-idr-bgp4-mibv2-05)FORCE10-IF-EXTENSION-MIB FORCE10-LINKAGG-MIBFORCE10-COPY-CONFIG-MIB FORCE10-PRODUCTS-MIB FORCE10-SS-CHASSIS-MIB FORCE10-SMI FORCE10-TC-MIBFORCE10-TRAP-ALARM-MIBFORCE10-FORWARDINGPLANE-STATS-MIBRegulatory compliance SafetyUL/CSA 60950-1, Second Edition EN 60950-1, Second EditionIEC 60950-1, Second Edition Including All National Deviations and Group Differences EN 60825-1 Safety of Laser Products Part 1: Equipment Classification Requirements and User’s GuideEN 60825-2 Safety of Laser Products Part 2: Safety of Optical Fibre Communication Systems FDA Regulation 21 CFR 1040.10 and 1040.11EmissionsUSA: FCC CFR 47 Part 15, Subpart B:2011, Class AImmunityEN 300 386 V1.4.1:2008 EMC for Network EquipmentEN 55024: 1998 + A1: 2001 + A2: 2003EN 61000-3-2: Harmonic Current Emissions EN 61000-3-3: Voltage Fluctuations and Flicker EN 61000-4-2: ESDEN 61000-4-3: Radiated Immunity EN 61000-4-4: EFT EN 61000-4-5: SurgeEN 61000-4-6: Low Frequency Conducted ImmunityRoHSAll S Series components are EU RoHS compliant.CertificationsAvailable with US Trade Agreements Act (TAA) complianceUSGv6 Host and Router Certified on Dell NetworkingOS 9.7 and greaterIPv6 Ready for both Host and Router DoD UC-APL approved switchFIPS 140-2 Approved Cryptography WarrantyLifetime Limited Hardware Warranty© 2022 Dell Inc. or its subsidiaries. All Rights Reserved. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. February 2022 | V1.9Dell PowerSwitch S3100 Series Spec SheetContact a Dell Technologies ExpertView more resourcesLearn more about Dell Networking solutions Join the conversation with@DellNetworkingIT Lifecycle Services for NetworkingExperts, insights and easeOur highly trained experts, with innovative tools and proven processes, help you transform your IT investments into strategic advantages.Plan & DesignLet us analyze your multivendor environment and deliver a comprehensive report and action plan to build upon the existing network and improve performance.Deploy & IntegrateGet new wired or wireless network technology installed and configured with ProDeploy. Reduce costs, save time, and get up and running fast.EducateEnsure your staff builds the right skills for long-term success. Get certified on Dell Networking technology and learn how to increase performance and optimize infrastructure.Manage & SupportGain access to technical experts and quickly resolve multivendor networking challenges with ProSupport. Spend less time resolving network issues and more time innovating.OptimizeMaximize performance for dynamic ITenvironments with Dell Optimize. Benefit from in-depth predictive analysis, remote monitoring and a dedicated systems analyst for your network.RetireWe can help you resell or retire excess hardware while meeting local regulatory guidelines and acting in an environmentally responsible way.Learn more at /Services。

factorization (RRQRF)

factorization (RRQRF)
Fig. 1. Traditional Algorithm for the QR Factorization with Column Pivoting
Permutation vector: perm(j) = j; j = 1 : n Column norm vector: colnorms(j) = kAej jj2; j = 1 : n 2
All authors were partially supported by the Advanced Research Projects Agency, under contract DM28E04120 and P-95006. Quintana also received support from the European ESPRIT Project 9072{GEPPCOM. Sun also received support from National Science Foundation agent ASC-ASC9005933. Bischof also received supported from the Mathematical, Information, and Computational Sciences Division subprogram of the O ce of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. y Departamento de Informatica, Universidad Jaime I, Campus Penyeta Roja, 12071 Castellon, Spain, gquintan@inf.uji.es. This work was partially performed while the author was visiting the Departments of Computer Science at Duke University and the University of Tennessee. z Department of Computer Science, Duke University, D107, Levine Science Research Center, Durham, NC 27708-0129, xiaobai@. This work was partially performed while the author was visiting the Department of Computer Science at the University of Tennessee. x Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Ave., Argonne, IL 60439-4844, bischof@. 1

Ultrastar DC SS530 SAS SSD说明书

Ultrastar DC SS530 SAS SSD说明书

Maximize Storage and Server Scalability with SAS SSDsData is transforming the world and growing at an exponential pace. The SAS interface continues its dominance in traditional enterprise storage arrays, making the ever-increasing volume of data rapidly and reliably available. Storage solutions depend on SAS’s protection features, such as dual-port failover for redundancy, and SAS’s high reliability to power mission-critical applications such as ERP, OLTP, OLAP, and more. Designed with a dual-port 12Gb/s SAS interface for seamless integration into enterprise environments, the Ultrastar ® DC SS530 SAS SSD is available in capacities from 400GB to 15.36TB¹, double the capacity of prior generations. Delivering performance up to 440,000 random read and 320,000 random write IOPS—best in class among current 12Gb/s SAS SSDs—the Ultrastar DC SS530 can help to drive faster data analytics, drive higher productivity, and power business decision-making.Proven Architecture with Industry-Leading Quality and ReliabilityUltrastar DC SS530 leverages the proven architecture of the prior generation Ultrastar SS300 with second generation 3D TLC NAND flash memory. The Ultrastar DC SS530 achieves an extraordinary 0.35% annual failure rate (AFR) or 2.5 million hours mean-time-between-failure (MTBF). DC SS530 offers three endurance options of 1, 3 and 10 drive writes per day (DW/D) to meet the most stringent data center requirements.Help keep confidential data secure by deploying self-encrypting drive technology that supports Trusted Computing Group (TCG) Enterprise standards for security services and FIPS 140-2 validation for cryptographic-enabled drives that are required for certain government applications. The Ultrastar DC SS530 is backed by a five-year limited warranty or the maximum petabytes (PB) written (based on capacity), whichever comes first.Trust Your Storage Systems with SSD Products Developed by Experts in Enterprise StorageUltrastar SAS SSDs leverage decades of proven enterprise storage expertise in Serial Attached SCSI (SAS) design, reliability, firmware, customer qualification, and system integration. The synergistic relationship between the new throughput-enhancing SSDs and traditional HDDs provides cost-effective, end-to-end enterprise-class storage options that deliver reliability, compatibility, capacity, cost savings, and systemperformance. This combination makes Ultrastar storage drives an ideal choice to help meet escalating reliability, endurance, and performance requirements in the most demanding data center environments.Highlights•2nd generation 3D TLC NAND flash for ultra-high performance and endurance •12Gb/s SAS interface for maximum throughput •Advanced power-loss and data-management technology •Self-encrypting models conform to TCG’s Enterprise specificationApplications•Ultra-high performance tier-0 enterprise storage •Enterprise-class servers and high performance computing (HPC) •Software-defined storage (SDS) •Online transaction processing (OLTP) •Finance and e-commerce •Database analyticsDATA SHEETSAS DATA CENTER SSDFeatures & Benefits15.36TB – 400GB | 3D TLC 2.5-inch SFF | SAS 12Gb/sWDS34-EN-US-0919-02© 2018-2019 Western Digital Corporation or its affiliates. All rights reserved. Produced 07/18, rev. 09/19. Western Digital, the Western Digital logo and Ultrastar are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. All other marks are the property of their respective owners. References in this publication to Wesstern Digital products, programs, or services do not imply that they will be made avillable in all countries. Product specifications provided are sample specifications that are subject to change and do not constitute a warranty. Please visit the Support section of our website, , for additional information on product specifications. Pictures shown may vary from actual products.Specifications10DW/D3DW/D1DW/DModel Numberx in Model Number denotes Encryption level: 0 = Instant Secure Erase 1 = TCG Encryption4 = No Encryption, Secure Erase5 = TCG + FIPSWUSTM3232ASS20x WUSTM3216ASS20x WUSTM3280ASS20x WUSTM3240ASS20xWUSTR6464ASS20x WUSTR6432ASS20x WUSTR6416ASS20x WUSTR6480ASS20x WUSTR6440ASS20xWUSTR1515ASS20x WUSTR1576ASS20x WUSTR1538ASS20x WUSTR1519ASS20x WUSTR1596ASS20x WUSTR1548ASS20xConfigurationInterface SAS 6/12Gb/s supports wide port @ 12Gb/s Capacity13.2TB / 1.6TB / 800GB / 400GB6.4TB / 3.2TB / 1.6TB / 800GB / 400GB15.36TB / 7.68TB / 3.84TB / 1.92TB /960GB / 480GBEndurance (Drive Writes per Day - DW/D)21031Maximum Terabytes Written (TBW)²59,690 / 29,410 / 15,220 / 7,63036,170 / 17,520 / 9,410 / 4,700 / 2,35030,110 / 15,050 / 7,000 / 3,760 / 1,880 / 940Form Factor2.5-inch 15mm SFF Flash Memory Technology3D TLC NANDPerformance³Read Throughput (max MiB/s, Seq 128KiB)2,1502,1502,150Write Throughput (max MiB/s, Seq 128KiB)2,1202,1202,120Read IOPS (max, Rnd 4KiB)440,000440,000440,000Write IOPS (max, Rnd 4KiB)320,000240,000100,000Mixed IOPS (70/30 R/W, max, 4KiB)430,000330,000190,000Read/Write Latency⁴ (μs, avg)92 / 2692 / 2792 / 36ReliabilityUnrecoverable Bit Error Rate (UBER) 1 in 1017MTBF 5 (M hours)2.5Annualized Failure Rate⁵ (AFR)0.35%Availability (hrs/day x days/wk)24x7Limited Warranty⁶5 yearsPowerRequirement (+/- 5%)+5 VDC, +12VDCOperating (W, typical)9, 11, 14Idle (W, average)3.2Physical Sizez-height (mm)15Dimensions (width x depth, mm)70.1 x 100.45Weight (g, max)175EnvironmentalOperating Temperature⁷0o to 75o C Non-operating Temperature-40o to 85o CUltrastar ® DC SS5305601 Great Oaks ParkwaySan Jose, CA 95119, USAUS (Toll-Free): 800.801.4618International: /dc-ss530One megabyte (MB) is equal to one million bytes, one gigabyte (GB) is equal to 1,000MB (one billion bytes), and one terabyte (TB) is equal to 1,000GB (one trillion bytes) when referring to storage capacity. Accessible capacity will vary from the stated capacity due to formatting, system software, and other factors.² Endurance rating based on DW/D using 4KiB random write workload over 5 years³ Performance will vary by capacity point, or with the changes in useable capacity. Consult product manual for further details. All performancemeasurements are in full sustained mode and are peak values. Subject to change. 1MiB=1,048,576 bytes or 2, 1KiB= 1,024 bytes or 2.⁴ Average R/W latency at 4KiB QD=1⁵ MTBF and AFR specifications are based on a sample population and are estimated by statistical measurements and acceleration algorithms under typical operating conditions for this drive model. MTBF and AFR ratings do not predict an individual drive’s reliability and do not constitute a warranty.⁶ The warranty for the product will expire on the earlier of (i) the date when the flash media has reached one-percent (1%) of its remaining life or (ii) the expiration of the time period associated with the product.⁷ Internal drive temperature as measured via the drive’s temperature sensor.How to Read the Ultrastar Model NumberExample: WUSTR6464ASS201=6.4TB, SAS 12Gb/s, TCG W = Western Digital U = Ultrastar S = StandardTR = NAND type/endurance(TM=TLC/mainstream endurance, TR= TL C/read-intensive)64 = Full capacity (6.4TB)64 = Capacity of this model(15=15.2TB, 76=7.6TB, 38=3.84TB 32=3.2TB, 19=1.92TB,16=1.6TB, 96=960GB, 80=800GB, 48=480GB, 40=400GB)A = Generation codeS = Small form factor (2.5” SFF)S2 = I nterface, SAS 12Gb/s 1 = Encryption setting(0=Instant Secure Erase, 1=TCG Enterprise encryption,4=No encryption/Secure Erase, 5 = TCG+FIPS)。

Synopsys DesignWare IP for HPC SoCs 2说明书

Synopsys DesignWare IP for HPC SoCs 2说明书

DesignWare IP for Cloud Computing SoCs2High-Performance ComputingToday’s high-performance computing (HPC) solutions provide detailed insights into the world around us and improve our quality of life. HPC solutions deliver the data processing power for massive workloads required for genome sequencing, weather modeling, video rendering, engineering modeling and simulation, medical research, big data analytics, and many other applications. Whether deployed in the cloud or on-premise, these solutions require high performance and low-latency compute, networking, and storage resources, as well as leading edge artificial intelligence capabilities. Synopsys provides a comprehensive portfolio of high-quality, silicon-proven IP that enables designers to develop HPC SoCs for AI accelerators, networking, and storage systems.Benefits of Synopsys DesignWare IP for HPC• Industry’s widest selection of high-performance interface IP , including DDR, PCI Express, CXL, CCIX, Ethernet, and HBM, offers high bandwidth and low latency to meet HPC requirements• Highly integrated, standards-based security IP solutions enable the most efficient silicon design and highest levels of data protection• Low latency embedded memories with standard and ultra-low leakage libraries, optimized for a range of cloud processors, provide a power- and performance-efficient foundation for SoCsIP for HPC SoCs in Cloud ComputingOverviewHyperscale cloud data centers continue to evolve due to tremendous Internet traffic growth from online collaboration, smartphones and other IoT devices, video streaming, augmented and virtual reality (AR/VR) applications, and connected AI devices. This is driving the need for new architectures for compute, storage, and networking such as AI accelerators, Software Defined Networks (SDNs), communications network processors, and solid state drives (SSDs) to improve cloud data center efficiency and performance. Re-architecting the cloud data center for these latest applications is driving the next generation of semiconductor SoCs to support new high-speed protocols to optimize data processing, networking, and storage in the cloud. Designers building system-on-chips (SoCs) for cloud and high performance computing (HPC) applications need a combination of high-performance and low-latency IP solutions to help deliver total system throughput. Synopsys provides a comprehensive portfolio of high-quality, silicon-proven IP that enables designers to develop SoCs for high-end cloud computing, including AI accelerators, edge computing, visual computing, compute/application servers, networking, and storage applications. Synopsys’ DesignWare ® Foundation IP , Interface IP , Security IP , and Processor IP are optimized for high performance, low latency, and low power, while supporting advanced process technologies from 16-nm to 5-nm FinFET and future process nodes.3Benefits of Synopsys DesignWare IP for AI Accelerators• Industry’s widest selection of high-performance interface IP , including DDR, USB, PCI Express (PCIe), CXL, CCIX, Ethernet, and HBM, offers high bandwidth and low latency to meet the high-performance requirements of AI servers• Highly integrated, standards-based security IP solutions enable the most efficient silicon design and highest levels of data protection• Low latency embedded memories with standard and ultra-low leakage libraries, optimized for a range of cloud processors, provide a power- and performance-efficient foundation for SoCsArtificial Intelligence (AI) AcceleratorsAI accelerators process tremendous amounts of data for deep learning workloads including training and inference which require large memory capacity, high bandwidth, and cache coherency within the overall system. AI accelerator SoC designs have myriad requirements, including high performance, low power, cache coherency, integrated high bandwidth interfaces that are scalable to many cores,heterogeneous processing hardware accelerators, Reliability-Availability-Serviceability (RAS), and massively parallel deep learning neural network processing. Synopsys offers a portfolio of DesignWare IP in advanced FinFET processes that address the specialized processing, acceleration, and memory performance requirements of AI accelerators.IP for Core AI AcceleratorBenefits of Synopsys DesignWare IP for Edge Computing• Industry’s widest selection of high-performanceinterface IP , including DDR, USB, PCI Express, CXL, CCIX, Ethernet, and HBM, offers high bandwidth and low latency to meet the high-performance requirements of edge computing servers• Highly integrated, standards-based security IP solutions enable the most efficient silicon design and highest levels of data protection• Low latency embedded memories with standard and ultra-low leakage libraries, optimized for a range of edge systems, provide a power- and performance-efficient foundation for SoCsIP for Edge Server SoCEdge ComputingThe convergence of cloud and edge is bringing cloud services closer to the end-user for richer, higher performance, and lower latency experiences. At the same time, it is creating new business opportunities for cloud service providers and telecom providers alike as they deliver localized, highly responsive services that enable new online applications.These applications include information security, traffic and materials flow management, autonomous vehicle control, augmented and virtual reality, and many others that depend on rapid response. For control systems in particular, data must be delivered reliably and with little time for change between data collection and issuing of commands based on that data.To minimize application latency, service providers are moving the data collection, storage, and processing infrastructure closer to the point of use—that is, to the network edge. To create the edge computing infrastructure, cloud service providers are partnering with telecommunications companies to deliver cloud services on power- and performance-optimized infrastructure at the network edge.ServersThe growth of cloud data is driving an increase in compute density within both centrally located hyperscale data centers and remote facilities at the network edge. The increase in compute density is leading to demand for more energy-efficient CPUs to enable increased compute capability within the power and thermal budget of existing data center facilities. The demand for more energy-efficient CPUs has led to a new generation of server CPUs optimized for performance/watt.This same increase in data volume is also driving demand for faster server interfaces to move data within and between servers. Movement of data within the server can be a major bottleneck and source of latency. Minimizing data movement as much as possible and providing high-bandwidth, low-latency interfaces for moving data when required are key to maximizing performance and minimizing both latency and power consumption for cloud and HPC applications. To improve performance, all internal server interfaces are getting upgrades:• DDR5 interfaces are moving to 6400 MBps• Doubling the bandwidth of PCIe interfaces as they move from PCIe 4.0 at 16GT/s to PCIe 5.0 at 32GT/s and PCIe 6.0 at 64GT/s • Compute Express Link (CXL) provides a cache coherent interface that runs over the PCIe electrical interface and reduces the amount of data movement required in a system by allowing multiple processors/accelerators to share data and memory efficiently• New high-speed SerDes technology at 56Gbps and 112Gbps using PAM4 encoding and supporting protocols enable faster interfaces between devices including die, chips, accelerators, and backplanesCloud server block diagram Benefits of Synopsys DesignWare IP for Cloud Compute Servers• Silicon-proven PCIe 5.0 IP is used by 90% of leadingsemiconductor companies• CXL IP is built on silicon-proven DesignWare PCIExpress 5.0 IP for reduced integration risk and supports storage class memory (also referred to as persistentmemory) for speed approaching that of DRAM withSSD-like capacity and cost• 112Gbps XSR/USR SerDes supports a wide range ofdata rates (2.5 to 112 Gbps) with area-optimized RXVisual ComputingAs cloud applications evolve to include more visual content, support for visual computing has emerged as an additional function of cloud infrastructure. Applications for visual computing include streaming video for business applications, online collaboration, on-demand movies, online gaming, and image analysis for ADAS, security, and other systems that require real-time image recognition. The proliferation of visual computing as a cloud service has led to the integration of high-performance GPUs into cloud servers, connected to the host CPU infrastructure via high-speed accelerator interfaces.Server-based graphics accelerator block diagram45NetworkingTraditional data centers use a tiered network topology consisting of switched Ethernet with VLAN tagging. This topology only defines one path to the network, which has traditionally handled north-south data traffic. The transition to a flat, two-tier leaf-spine hyperscale data center network using up to 800G Ethernet links enables virtualized servers to distribute workflows among many virtual machines, creating a faster, more scalable cloud data center environment.Smart network interface cards (NICs) combine hardware, programmable AI acceleration, and security resources to offload server processors, freeing the processors to run applications. Integrated security, including a root of trust, protects coefficient and biometric data as it moves to and from local memories. Smart NICs accelerate embedded virtual switch, transport offloads, and protocol overlay encapsulation/decapsulation such as NVGRE, VXLAN and MPLS. By offering dedicated hardware offloads including NVMe-over-Fabric (NVMEoF) protocols, Smart NICs free the server CPU to focus compute cycles on cloud application software and enable efficient data sharing across nodes for HPC workloads.Network switch SoCs enable cloud data center top-of-rack and end-of-row switches and routers to scale port densities and speeds to quickly adapt to changing cloud application workloads. By scaling port speeds from 10Gb Ethernet to 400/800G Ethernet and extending port densities from dozens to hundreds of ports, the latest generation Ethernet switch SoCs must scale to provide lowest latency and highest throughput flow control and traffic management. Synopsys’ DesignWare Interface IP portfolio supports high-performanceprotocols such as Ethernet, PCI Express, CXL, CCIX, USB, DDR, and HBM. DesignWare Interface IP is optimized to help designers meet the high-throughput, low-latency connectivity needs of cloud computing networking applications. Synopsys’ Foundation IP offers configurable embedded memories for performance, power, and area, as well as high-speed logic libraries for all processor munication service providers are turning towards server virtualization to increase efficiency, flexibility, and agility tooptimize network packet processing. The latest communications architecture uses Open vSwitch Offloads (OVS), OVS over Data Plane Development Kits (DPDK), network overlay virtualization, SR-IOV, and RDMA to enable software defined data center and Network Function Virtualization (NFV), acceleratingcommunications infrastructure. To achieve higher performance, communications network processors can accelerate OVS offloads for efficiency and security. Synopsys provides a portfolio of high-speed interface IP including DDR, HBM, Ethernet for up to 800G links, CXL for cache coherency, and PCI Express for up to 64GT/s data rates. DesignWare Security IP enables the highest levels of security encryption, and embedded ARC processors offer fast, energy-efficient solutions to meet throughput and QoS requirements. Synopsys’ Foundation IP delivers low-latency embedded memories with standard and ultra-low leakage libraries for a range of cloud processors.IP for Smart NIC in cloud computing networkIP for cloud computing network switchIP for communication network processorsStorageNVMe-based Solid-State Drives (SSDs) can utilize a PCIe interface to directly connect to the server CPU and function as a cache accelerator allowing frequently accessed data, or “hot” data, to be cached extremely fast. High-performance PCIe-based NVMe SSDs with extremely efficient input/ output operation and low-read latency improve server efficiency and avoid having to access the data through an external storage device. NVMe SSD server acceleration is ideal for high transaction applications such as AI acceleration or database queries queries, as well as HPC workloads that require high-performance, low-latency access to large data sets. PCIe-based NVMe SSDs not only reduce power and cost but also minimize area compared to hard disk drives (HDDs). Synopsys’ portfolio of DesignWare Interface IP for advanced foundry processes, supporting high-speed protocols such as PCI Express, USB, and DDR, are optimized to help designers meet their high-throughput, low-power, and low-latency connectivity for cloud computing storage applications. Synopsys’ Foundation IP offers configurable embedded memories for performance, power, and area, as well as high-speed logic libraries for all processor cores. Synopsys also provides processor IP ideally suited for flash SSDs.Storage• High-performance, low-latency PCI Express controllersand PHYs supporting data rates up to 64GT/s enableNVMe-based SSDs• High-performance, low-power ARC processors supportfast read/write speeds for NVMe-based SSDs• Portfolio of interface IP including Ethernet, USB,PCI Express, and DDR provides low latency andfast read/write operationsFigure 6: IP for cloud computing storage6©2021 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks is available at /copyright.html . All other names mentioned herein are trademarks or registered trademarks of their respective owners.05/04/21.CS610890866-SG Bro-Cloud Computing Brochure.。

CacheCoherence文献综述

CacheCoherence文献综述

Cache Coherence文献综述文献阅读背景如何选择高速缓存一致性的解决方案一直以来都是设计共享存储器体系结构的关键问题。

相对于维护高速缓存一致性而言,数据的传输也显得简单了。

高速缓存一致性协议致力于保证每个处理器的数据一致性。

一致性通常是在高速缓存总线或者网线上得到保证。

高速缓存的缺失可以从内存中得到数据,除非有些系统(处理器或者输入输出控制器)设备修改了高速缓存总线。

为了进行写操作,该处理器必须进行状态的转换,通常是转换为独占的状态,而总线上其他的系统设备都必须将他们的数据无效化,目前该数据块的拥有者就成为了数据来源。

因此,当其他设备提出需要此数据块时,该数据块的拥有者,而不是内存,就必须提供数据。

只有当该数据块的拥有者必须腾出空间用以存放其他的数据时,才将最新的数据重新写回内存中。

当然,在这方面,各种协议也有区别,上文所诉只是最基本的一些解决方案1。

并且,协议也包括基于硬件的以及基于软件的协议两个种类。

也有写无效和写更新的区别。

下面概述性地介绍下体系结构中所采用的两种主要的一致性方案:监听式(也称广播式)协议:所有的地址都送往所有的系统设备中。

每个设备在本地缓存中检查(监听)高速缓存总线的状态,系统在几个时钟周期后决定了全局的监听结果。

广播式协议提供了最低的可能延迟,尤其当缓存之间的传输是基本的传输方式。

监听式协议传输数据的带宽也是有一定限制的,通常被限制在:带宽=缓存总线带宽×总线时钟周期/每次监听的时钟周期数。

这将在下文中详细提到。

目录式(也称点对点式)协议:每个地址都被送往系统设备中对缓存数据感兴趣的那些设备。

物理存储器的共享状态放在一个地点,称之为目录。

目录式一致性的开销更大,尤其在延时等方面,因为协议本身的复杂性。

但是整体的带宽可以比监听式协议高很多,往往应用于比较大型的系统,最主要的应用是分布式系统。

这将在下文详细提到。

缓存一致性涉及的体系结构主要有如下几种:第一种类型是集中式存储体系结构,也称作为对称(共享存储器)多处理器系统(SMPs),这种体系结构也称为均匀存储器访问(UMA),这是因为所有的处理器访问存储器都有相同的时延。

FPGA可编程逻辑器件芯片XCVU13P-2FHGA2104I中文规格书

FPGA可编程逻辑器件芯片XCVU13P-2FHGA2104I中文规格书

General DescriptionXilinx® UltraScale™ architecture comprises high-performance FPGA, MPSoC, and RFSoC families that address a vast spectrum of system requirements with a focus on lowering total power consumption through numerous innovative technological advancements.Kintex® UltraScale FPGAs: High-performance FPGAs with a focus on price/performance, using both monolithic andnext-generation stacked silicon interconnect (SSI) technology. High DSP and block RAM-to-logic ratios and next-generation transceivers, combined with low-cost packaging, enable an optimum blend of capability and cost.Kintex UltraScale+™ FPGAs: Increased performance and on-chip UltraRAM memory to reduce BOM cost. The ideal mix of high-performance peripherals and cost-effective system implementation. Kintex UltraScale+ FPGAs have numerous power options that deliver the optimal balance between the required system performance and the smallest power envelope.Virtex® UltraScale FPGAs: High-capacity, high-performance FPGAs enabled using both monolithic and next-generation SSI technology. Virtex UltraScale devices achieve the highest system capacity, bandwidth, and performance to address key market and application requirements through integration of various system-level functions.Virtex UltraScale+ FPGAs: The highest transceiver bandwidth, highest DSP count, and highest on-chip and in-package memory available in the UltraScale architecture. Virtex UltraScale+ FPGAs also provide numerous power options that deliver the optimal balance between the required system performance and the smallest power envelope.Zynq® UltraScale+ MPSoCs: Combine the Arm® v8-based Cortex®-A53 high-performance energy-efficient 64-bit application processor with the Arm Cortex-R5F real-time processor and the UltraScale architecture to create the industry's first programmable MPSoCs. Provide unprecedented power savings, heterogeneous processing, and programmable acceleration. Zynq® UltraScale+ RFSoCs: Combine RF data converter subsystem and forward error correction with industry-leading programmable logic and heterogeneous processing capability. Integrated RF-ADCs, RF-DACs, and soft decision FECs (SD-FEC) provide the key subsystems for multiband, multi-mode cellular radios and cable infrastructure.Family ComparisonsDS890 (v3.14) September 14, 2020Product Specification Table 1:Device ResourcesKintex UltraScale FPGAKintexUltraScale+FPGAVirtexUltraScaleFPGAVirtexUltraScale+FPGAZynqUltraScale+MPSoCZynqUltraScale+RFSoCMPSoC Processing System✓✓RF-ADC/DAC✓SD-FEC✓System Logic Cells (K)318–1,451356–1,843783–5,541862–8,938103–1,143489–930 Block Memory (Mb)12.7–75.912.7–60.844.3–132.923.6–94.5 4.5–34.622.8–38.0 UltraRAM (Mb)0–8190–3600–3613.5–45.0 HBM DRAM (GB)0–16DSP (Slices)768–5,5201,368–3,528600–2,8801,320–12,288240–3,5281,872–4,272 DSP Performance (GMAC/s)8,1806,2874,26821,8976,2877,613 Transceivers12–6416–7636–12032–1280–728–16 Max. Transceiver Speed (Gb/s)16.332.7530.558.032.7532.75 Max. Serial Bandwidth (full duplex) (Gb/s)2,0863,2685,6168,3843,2681,048 Memory Interface Performance (Mb/s)2,4002,6662,4002,6662,6662,666I/O Pins312–832280–668338–1,456208–2,07282–668152–408Cache Coherent Interconnect for Accelerators (CCIX)CCIX is a chip-to-chip interconnect operating at data rates up to 25Gb/s that allows two or more devices to share memory in a cache coherent manner. Using PCIe for the transport layer, CCIX can operate at several standard data rates (2.5, 5, 8, and 16Gb/s) with an additional high-speed 25Gb/s option. The specification employs a subset of full coherency protocols and ensures that FPGAs used as accelerators can coherently share data with processors using different instruction set architectures.PCIE4C blocks support CCIX data rates up to 16Gb/s and contain one CCIX port. Each CCIX port requires the use of one integrated block for PCIe. If not used with a CCIX port, the integrated blocks for PCIe can still be used for PCIe communication.are stored in registers that can be accessed via internal FPGA (DRP), JTAG, PMBus, or I2C interfaces. The I2C interface and PMBus allow the on-chip monitoring to be easily accessed by the System Manager/Host before and after device configuration.The System Monitor in the PS MPSoC and RFSoC uses a 10-bit, 1 mega-sample-per-second (MSPS) ADC to digitize the sensor outputs. The measurements are stored in registers and are accessed via the Advanced Peripheral Bus (APB) interface by the processors and the platform management unit (PMU) in the PS.ConfigurationThe UltraScale architecture-based devices store their customized configuration in SRAM-type internal latches. The configuration storage is volatile and must be reloaded whenever the device is powered up. This storage can also be reloaded at any time. Several methods and data formats for loading configuration are available, determined by the mode pins, with more dedicated configuration datapath pins to simplify the configuration process.UltraScale architecture-based devices support secure and non-secure boot with optional Advanced Encryption Standard - Galois/Counter Mode (AES-GCM) decryption and authentication logic. If only authentication is required, the UltraScale architecture provides an alternative form of authentication in the form of RSA algorithms. For RSA authentication support in the Kintex UltraScale and Virtex UltraScale families, go to UG570, UltraScale Architecture Configuration User Guide.UltraScale architecture-based devices also have the ability to select between multiple configurations, and support robust field-update methodologies. This is especially useful for updates to a design after the end product has been shipped. Designers can release their product with an early version of the design, thus getting their product to market faster. This feature allows designers to keep their customers current with the most up-to-date design while the product is already deployed in the field.Booting MPSoCs and RFSoCsZynq UltraScale+MPSoCs and RFSoCs use a multi-stage boot process that supports both a non-secure and a secure boot. The PS is the master of the boot and configuration process. For a secure boot, the AES-GCM, SHA-3/384 decryption/authentication, and 4096-bit RSA blocks decrypt and authenticate the image.Upon reset, the device mode pins are read to determine the primary boot device to be used: NAND, Quad-SPI, SD, eMMC, or JTAG. JTAG can only be used as a non-secure boot source and is intended for debugging purposes. One of the CPUs, Cortex-A53 or Cortex-R5F, executes code out of on-chip ROM and copies the first stage boot loader (FSBL) from the boot device to the on-chip memory (OCM).After copying the FSBL to OCM, the processor executes the FSBL. Xilinx supplies example FSBLs or users can create their own. The FSBL initiates the boot of the PS and can load and configure the PL, or configuration of the PL can be deferred to a later stage. The FSBL typically loads either a user application or an optional second stage boot loader (SSBL) such as U-Boot. Users obtain example SSBL from Xilinx or a third party, or they can create their own SSBL. The SSBL continues the boot process by loading code from any of the primary boot devices or from other sources such as USB, Ethernet, etc. If the FSBL did not configure the PL, the SSBL can do so, or again, the configuration can be deferred to a later stage.。

Program Chairs

Program Chairs

International Conference on ParallelArchitectures andCompilationTechniques Barcelona, Spain September 8-12, 2001P A C T ’01InvitationThe Organizing Committee of PACT’01 is pleased to announce its advance program. The purpose of the PACT series of conferences is to bring together researchers from the architecture and compiler communities to present ground-breaking research and debate key issues of common interest. This conference, the tenth in the series, will be held in lovely Barcelona. The city represents a unique combination of beauty, culture, history, charm and advanced technology, and fosters an ideal environment for the enjoyable and stimulating exchange of ideas. Come to PACT'01 and see our high quality technical program, including three outstanding keynote speakers (Randall D. Isaac, IBM; Justin Rattner, Intel; Joel Emer, Compaq), 26 cutting-edge research papers, and a special session on Work in Progress. Attend the workshops on OpenMP, Binary Translation, Memory Access Decoupled Architectures, Compilers and Operating Systems for Low Power, and Ubiquitous Computing. Attend the tutorials on 3G Wireless Architecture, and the IBM Research Jalapeno JVM. Meet the experts in the field! After that, spend some days in Barcelona. From the old-world allure of the gothic quarter to the more modern, abstract appeal of the works of Gaudi, Barcelona will captivate you with a diversity of interesting places and museums to visit, charming people to meet, peaceful places to relax or simply enjoy the weather relaxing in the beautiful Mediterranean beaches. Finally, the catalan cuisine is one of the most select and varied in the Mediterranean area. We will try to bring you a piece of all that with our social program.The conference hotel is the Barcelona Hilton. The hotel has agreed to hold a limited number of rooms for conference attendees at a special rate until August 25, 2001. The advance registration deadline is July 25, 2001. Students are strongly encouraged to attend PACT’01. The conference has received some funding for student travel grants through generous corporate sponsorship. Public institutions and corporations have also provided funds to sponsor the activities of the conference.See you in Barcelona!General ChairMateo Valero, UPCProgram ChairsTodd Mowry, CMUJohn Shen, Intel/CMUFinance ChairJosep Torrellas, UIUCLocal Arrangements ChairJosep-Lluis Larriba, UPCPublication ChairGuang Gao, U. of DelawarePublicity ChairSally McKee, U. of UtahTutorials ChairMikko Lipasti, U. Wisconsin - Madison Workshops ChairsEvelyn Duesterwald, HP LabsGabby Silberman, IBMWeb MastersEduard Ayguade, UPC (Conference)Chris Colohan, CMU (Program Committee)Organizing CommitteeProgram CommitteeSarita Adve, UIUCNader Bagherzadeh, U. California,IrvineRas Bodik, U. Wisconsin-MadisonBrad Calder, U. California. San DiegoMichel Cosnard, INRIA, FranceAlan Cox, Rice U.Jim Dehnert, TransmetaSandhya Dwarkadas, U. RochesterKemal Ebcioglu, IBMBabak Falsafi, Carnegie Mellon U.Jesse Fang, IntelGuang Gao, U. DelawareAntonio Gonzalez, UPCDirk Grunwald, U. ColoradoMark Heinrich, Cornell U.Ali Hurson, Penn State U.Steve Keckler, U. Texas–AustinJohn Kubiatowicz, U.C. BerkeleyJames Larus, Microsoft ResearchMikko Lipasti, U. Wisconsin–MadisonProgram Committee (cont.)Margaret Martonosi, Princeton U.Kathryn McKinley, U. MassachusettsBilha Mendelson, IBMDavid Padua, UIUCPen Yew, U. MinnesotaSteering CommitteeNader Bagherzadeh, U. California, IrvineMichel Cosnard, INRIA, FranceKemal Ebcioglu, IBMParaskevas Evripidou, U. CyprusJean-Luc Gaudiot, U. Southern CaliforniaAli Hurson, Penn State U.Gabby Silberman, IBMMary-Lou Soffa, U. PittsburghAdvance ProgramSunday, September 9th20:00 - Welcoming ReceptionMonday, September 10th08:45 - Conference Opening09:00 - Keynote AddressRandall D. Isaac (VP Science and Technology, IBM Research).10:00 - Session 1: Simulation and Modeling "Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications".Tim Sherwood, Erez Perelman and Brad Calder, (University of California, San Diego)"Modeling Superscalar Processors via Statistical Simulation".Sebastien Nussbaum and James Smith (Dept. of Electrical and Computer Engineering,University of Wisconsin-Madison)"Hybrid Analytical-Statistical Modeling for Efficiently Exploring Architecture and Workload Design Spaces".Lieven Eeckhout and Koen De Bosschere (Department of Electronics and Information Systems, Ghent University)11:30 - Coffee Break12:00 - Session 2: Efficient Caches "Filtering Techniques to Improve Trace-Cache Efficiency".Roni Rosner, Avi Mendelson and Ronny Ronen (Israel Design Center, Intel)"Reactive-Associative Caches".Brannon Batson (1) and T. Vijaykumar (2).(1) Compaq and (2) Purdue University"Adaptive Mode Control: A Static-Power-Efficient Cache Design".Huiyang Zhou, Mark Toburen, Eric Rotenberg and Thomas Conte (North Carolina State University)13:30 - Lunch Break15:00 - Session 3: Specialized Instruction Sets"Implementation and Evaluation of the Complex Streamed Instruction Set".Ben Juurlink (1), Dmitri Tcheressiz (2), Stamatis Vassiliadis (1), Harry Wijshoff (2).(1) Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft. (2) Department of Computer Science, Leiden University, Leiden."On the Efficiency of Reductions in micro-SIMD media extensions".Jesus Corbal, Roger Espasa, and Mateo Valero (Computer Architecture Department, UPC)17:00 - Excursion to Montserrat and Reception at Cavas Vinery Tuesday, September 11th09:00 - Keynote AddressJustin Rattner (Intel Fellow and Director of Microprocessor Research Labs).10:00 - Session 4: Prediction and Recovery "Boolean Formula-based Branch Prediction for Future Technologies".Daniel Jimenez (1), Heather Hanson (2) and Calvin Lin (1).(1) Department of Computer Sciences, The Universityof Texas at Austin. (2) Department of Electrical & Computer Engineering, The University of Texas at Austin."Using Dataflow Based Context for Accurate Value Prediction".Renju Thomas and Manoj Franklin (University of Maryland)"Recovery mechanism for latency misprediction". Enric Morancho, Jose Maria Llaberia and Angel Olive (Computer Architecture Department, UPC)11:30 - Coffee Break12:00 - Session 5: Memory Optimization"A Cost Framework for Evaluating Integrated Restructuring Optimizations".Bharat Chandramouli, John Carter, Wilson Hsieh and Sally McKee (University of Utah)"Compiling for the Impulse Memory Controller". Xianglong Huang, Zhenlin Wang and Kathryn McKinley (Computer Science Dept., University of Massachusetts, Amherst)"On the Stability of Temporal Data Reference Profiles".Trishul Chilimbi (Microsoft Research)13:30 - Lunch Break15:00 - Session 6: Program Optimization "Code Reordering and Speculation Support for Dynamic Optimization Systems".Erik Nystrom, Ronald Barnes, Matthew Merten andWen-mei Hwu (University of Illinois)"A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors". Josep Codina, Jesus Sanchez and Antonio Gonzalez (Computer Architecture Department, UPC)"Cache-Friendly Implementations of Transitive Closure".Michael Penner and Viktor Prasanna (University of Southern California)16:30 - Coffee Break17:00 - Session 7: Technology Implications "The Effect of Technology Scaling on CMP Throughput".Jaehyuk Huh, Doug Burger and Stephen Keckler (University of Texas at Austin)"Area and System Clock Effects on SMT/CMP Processors".James Burns (Intel) and Jean-Luc Gaudiot (USC)18:15 - Work In Progress Session21:00 - Conference Banquet (Hilton Hotel)Wednesday, September 12th09:00 - Keynote AddressJoel Emer (Compaq Staff Fellow).10:00 - Session 8: Parallel Machines"Limits on Speculative Module-level Parallelism in Imperative and Object-oriented Programs on CMP Platforms".Fredrik Warg and Per Stenstrom (Chalmers University of Technology)"Compiler and Runtime Analysis for Efficient Communication in Data Intensive Applications". Renato Ferreira (1), Gagan Agrawal (2) and Joel Saltz (1).(1) University of Maryland, (2) University of Delaware "Architectural Support for Parallel Reductions in Scalable Shared-Memory Multiprocessors”. Maria Jesus Garzaran (1), Milos Prvulovic (2), Ye Zhang (2), Alin Jula (3), Hao Yu (3), Lawrence Rauchwerger (3) and Josep Torrellas (2).(1) Universidad de Zaragoza, Spain, (2) University of Illinois at Urbana-Champaign, (3) Texas A&M University11:30 - Coffee Break12:00 - Session 9: Data Prefetching "Optimizing Software Data Prefetches with Rotating Registers".Gautam Doshi, Rakesh Krishnaiyer and Kalyan Muthukumar (Intel Corporation)"Multi-Chain Prefetching: Effective Exploitation of Inter-Chain Memory Parallelism for Pointer-Chasing Codes".Nicholas Kohout (1), Seungryul Choi (2), Dongkeun Kim (3), Donald Yeung (3).(1) Intel Corp., (2) Department of Computer Science, University of Maryland at College Park, (3) Department of Electrical and Computer Engineering, University of Maryland at College Park"Data Flow Analysis for Software Prefetching Linked Data Structures in Java".Brendon Cahoon and Kathryn McKinley (University of Massachusetts)"Comparing and Combining Read Miss Clustering and Software Prefetching".Vijay Pai (1) and Sarita Adve (2).(1) Rice University, (2) University of Illinois14:00 - Final Conference AddressWorkshopsSaturday, September 8thEWOMP'01 (full day).European Workshop on OpenMP.WBT'01 (full day).Workshop on Binary Translation.MEDEA'01 (half day, morning). Workshop on Memory Access Decoupled Architectures.Sunday, September 9thEWOMP'01 (full day).Continuation from previous day.COLP'01 (full day).Workshop on Compilers and Operating Systems for Low Power.WUCC'01 (half day, morning).Workshop on Ubiquitous Computing and Communication.TutorialsSaturday, September 8th. “3G Wireless Infrastructure: Architecture, Algorithms, and Applications”. Allan Berenbaum, Nevin Heintze, Stefanos Kaxiras, Girija Narlikar.Sunday, September 9th. “The Design and Implementation of the Jalapeño JVM”. Michael Hind, IBM Research.Registration InformationEarly registration fees are valid up to July 25, 2001. To benefit from the "member" discount, you must indicate your ACM/IEEE/IFIP membership number. To benefit from the "full-time student" discount, you must send via fax a letter from the appropriate institution that demonstrates your position. The student discount only applies to the conference fee.1. Conference registration feesThe Conference registration fees include attendance in the Conference from September 10-12, coffee breaks, a copy of the conference proceedings, the welcoming reception (September 9), the excursion and conference reception (September 10), and the banquet (September 11).Early Late/On-site Member81700 Ptas.100700 Ptas. Non member103500 Ptas.124450 Ptas. Student53200 Ptas.57000 Ptas.If you plan to come with one or more accompanying people, the price for the excursion and welcoming reception is 13500 Ptas. and the conference banquet is 13500 Ptas., for each accompanying guest.2. Tutorial registration feesThe Tutorial registration includes attendance in one Tutorial on September 8 or 9, coffee breaks, and a copy of the notes.Early Late/On-site Member 29450 Ptas. 36100 Ptas. Non member36100 Ptas.44650 Ptas.3. Workshop registration feesThese fees apply to delegates who plan to attend both the workshops and the conference. The fees include a small discount assuming combined conference and workshop registration.EWOMP'01 Workshop (Two days)The Workshop registration fee includes attendance in EWOMP'01 on September 8 and 9, coffee breaks, the workshop dinner and a copy of the workshop proceedings.Early Late/On-site Member 19000 Ptas. 23750 Ptas. Non member23750 Ptas.29450 Ptas.One-day PassThe One-day Pass entitles you to attend any workshop held during September 8 or 9. The fee includes attendance in the workshops, coffee breaks, and a copy of one of the workshop proceedings.Early Late/On-site Member 9500 Ptas. 15200 Ptas.Non member15200 Ptas.19000 Ptas.HotelsThe Conference will be held at the Hilton Barcelona Hotel. The Hotel is located in the business, commercial and shopping districtof the city, 15 minutes from Barcelona International Airport.Hilton provides PACT'01 attendees with a limited number of rooms at special rates. The rate is 25.500 Ptas. for a single room and 27.500 Ptas. for double room (breakfast included, 7% tax not included). The reservation cut-off date will be August 25, 2001.We have also reserved some additional rooms in other hotels nearby: • Hotel Husa l'Illa ****23.000 Ptas. (single) / 26.000 Ptas. (double) • Hotel Husa Arenas ****15.500 Ptas. (single) / 18.000 Ptas. (double) • Hotel Viladomat ***16.000 Ptas. (single) / 17.500 Ptas. (double)• Hotel Husa Bonanova ***10.850 Ptas. (single) / 14.300 Ptas. (double)All prices include breakfast. 7% tax not included. Please use the reservation form available at the conference web site for each hotel and send if by fax.Important: We strongly suggest that you make your room reservation in advance.Student travel grantsA limited number of grants are available for students to travel to PACT’01. An application form can be downloaded from the conference web site. Applicants must submit the form by July 15, 2001. These travel grants are provided by Hewlett- Packard HPL, IBM Research, Intel and Microsoft Research.Supporting OrganizationsThe Organizing Committee of PACT’01 gratefully acknowledges the support received from public institutions (Spanish Ministry of Education through the CICYT, Catalan Government through the CIRIT, Technical University of Catalunya UPC) and corporations (Compaq, IBM and SGI).。

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

Army Research LaboratoryAberdeen Proving Ground, MD 21005-50672002ARL-MR-528 February Cache-Based Architectures forHigh Performance ComputingDaniel M. PresselComputational and Information Sciences Directorate, ARLApproved for public release; distribution is unlimited.AbstractMany researchers have noted that scientific codes perform poorly on computer architectures involving a memory hierarchy (cache). Furthermore, a number of researchers and some vendors concluded that simply making the caches larger would not solve this problem. Alternatively, some vendors of HPC systems have opted to equip their systems with fast memory interfaces, but witha limited amount of on-chip cache and no off-chip cache.Some RISC-based HPC systems supported some sort of prefetching or streaming facility that allows one to more efficiently stream data between main memory and the processor (e.g., the Cray T3E). However, there are fundamental limitations on the benefits of these approaches which makes it difficult to seehow these approaches by themselves will eliminate the “Memory Wall.” It hasbeen shown that if one relies solely on this approach for the Cray T3E, one is unlikely to achieve much better than 4–6% of the machine’s peak performance.Does this mean that as the speed of RISC/CISC processors increases, systems designed to process scientific data are doomed to hit the Memory Wall? The answer to that question depends on the ability of programmers to find innovative ways to take advantage of caches. This report discusses some of the techniques that can be used to overcome this hurdle allowing one to considerwhat types of hardware resources are required to support these techniques.iiAcknowledgmentsThe author would like to thank Marek Behr for permission to use his results in this report. He would also like to thank the entire CHSSI CFD-6 team for their assistance in this work as part of that team. He would like to thank his many colleagues that have graciously assisted him in all aspects of the preparation of this report. Additional acknowledgments go to Tom Kendall, Denice Brown, and the systems staff for all of their help. He would also like to thank the employees of Business Plus Corp., especially Claudia Coleman and Maria Brady, who assisted in the preparation and editing of this report.This work was made possible by the grant of computer time by the DOD HPCM Program. Additionally, it is largely based on work that was funded as part of the CHSSI administered by the DOD HPCM Program.Note: All items in bold are in the Glossary.iiiI NTENTIONALLY LEFT BLANK. ivContentsAcknowledgments 1 List of Tables vii1. Introduction 12. Caches and High Performance Computing 13. Understanding the Limitations of a Stride-1 Access Pattern 34. Results 45. Prefetching and Stream Buffers vs. Large Caches 86. Prefetching and Stream Buffers in Combination With Large Caches 97. Conclusions 108. References 11 Glossary 13 Distribution List 15 Report Documentation Page 19vI NTENTIONALLY LEFT BLANK. viList of TablesTable 1. Single processor results from the STREAM benchmark for commonly used HPC systems.a (4)Table 2. The size of the working set for a 1-million grid point problem (5)Table 3. The performance of the RISC optimized version of F3D for single processor runs for the 3-million grid point test case (6)Table 4. The performance of the RISC optimized version of F3D for single processor runs on the SGI Origin 2000, the SGI Origin 3000, the SUNHPC 10000, and the HP Superdome for a range of test cases (7)Table 5. A summary of the test cases (7)Table 6. Comparative performance from running two versions of F3D using eight processors with the 1-million grid point test case (8)Table 7. Comparative performance from running the Department of Energy (DOE) Parallel Climate Model (PCM) using 16 processors.a,b (8)Table 8. The predicted performance increase resulting from upgrading a 195-MHz R10000 Processor to a 300-MHz R12000 Processor in an SGIOrigin 2000 (9)viiI NTENTIONALLY LEFT BLANK. viii1. IntroductionMany researchers have noted that scientific codes perform poorly on computer architectures involving a memory hierarchy (cache) (Bailey 1993; Mucci and London 1998). Furthermore, as a result of simulation studies, running microbenchmarks on real machines, and running real codes on real machines, a number of researchers and some vendors concluded that simply making the caches larger would not solve this problem. As a result of these conclusions, some vendors of high performance computing systems have opted to equip their systems with fast memory interfaces, but with a limited amount of on-chip cache and no off-chip cache (e.g., the Cray T3D, Cray T3E, and the IBM SP with the POWER 2 Super Chip). However, none of these systems approach the memory bandwidth of a vector processor. For example, it has been shown that if one relies solely on this approach for the Cray T3E, one is unlikely to achieve much better than 4–6% of the machine’s peak performance (O’Neal and Urbanic 1997). Does this mean that as the speed of RISC/CISC processors increases, systems designed to process scientific data are doomed to hit the “Memory Wall”? The answer to that question depends on the ability of programmers to find innovative ways to take advantage of caches. This report discusses some of the techniques that can be used to overcome this hurdle. Once these techniques have been identified, one can then consider what types of hardware resources are required to support them.It is important to note that this work is based on the following two key concepts:(1) It is acceptable to make significant modifications to the programs at theimplementation level.(2) Not all computer architectures are created equal. Therefore, one willfrequently have to define a minimum set of resources for tuningpurposes (e.g., cache size).2. Caches and High Performance ComputingMany researchers have noted that scientific codes perform poorly on computer architectures involving a memory hierarchy (cache) (Bailey 1993; Mucci and London 1998). Furthermore, as a result of simulation studies (Kessler 1991), running microbenchmarks on real machines (Mucci and London 1998), and1running real codes on real machines, a number of people and some vendors concluded that simply making the caches larger would not solve this problem. In fact, one group of researchers observed the following:“For all the benchmarks except cgm, there was very little temporal reuse, and the cache size that had approximately the same miss ratio as streams is proportional to the data set size” (Palacharla and Kessler 1994).As a result of these conclusions, some vendors of high performance computing systems have opted to equip their systems with fast memory interfaces but with a limited amount of on-chip cache and no off-chip cache. Examples of such conclusions are as follows:(1) Intel Paragon: 16-kB instruction cache, 16-kB data cache.(2) Cray T3D: 16-kB instruction cache, 16-kB data cache.(3) Cray T3E: 8-kB primary instruction cache, 8-kB primary data cache,96-kB combined instruction/data secondary cache.(4) IBM SP with the Power 2 Super Chip: 64-kB instruction cache, 128-kBdata cache.Can a way be found to beat these conclusions? If so, how and why are these techniques not used more frequently? The following is a list of techniques that have been used to improve the cache miss rate for a variety of scientific codes:(1) Reordering the indices of matrices to improve spatial locality.(2) Combining matrices to improve spatial locality.(3) Blocking the code to improve both spatial and temporal locality.(4) Tiling the matrices to improve spatial locality.(5) Reordering the operations in a manner that will improve the temporallocality of the code.(6) Recognizing that if one is no longer dealing with a vector processor, itmay be possible to eliminate some scratch arrays entirely, whilesubstantially reducing the size of other arrays. When done well, this canincrease both the spatial and temporal locality by an order of magnitude.(7) Writing the code as an out-of-core solver. In many cases, it would notactually be necessary to perform input/output (I/O). However, byrestricting the size of the working arrays, in theory, one could significantly decrease the rate of cache misses that miss all the way backto main memory. This method is especially good at improving thetemporal locality.(8) Borrowing the concept of domain decomposition, which is frequentlyused as an approach to parallelizing programs. While this approach isnot without its consequences, it can significantly decrease the size of theworking set (or help to create one where it would otherwise not exist).Again, this method is aimed at improving the temporal locality.This demonstrates that there are methods for significantly decreasing the cache miss rate. However, as will be seen later in this report, some of these techniques work best when dealing with large caches. Unfortunately, many of the more popular MPP s either lacked caches entirely (e.g., the NCUBE2 and the CM5 when equipped with vector units) or were equipped with small to modest sized caches (e.g., the Intel Paragon, Cray T3D and T3E, and the IBM SP with the POWER2 Super Chip processors). As a result, for many programmers working on high performance computers, there was no opportunity to experiment with ways to tune code for large caches. Furthermore, since many codes are required to be portable across platforms, there was little incentive to tune for architectural features that were not uniformly available.3. Understanding the Limitations of a Stride-1 Access Pattern Before continuing, we will briefly discuss spatial and temporal locality. Let us consider the case of an R12000-based SGI Origin 2000 with prefetching turned off, specifically, a 300-MHz processor generating one load per cycle with a Stride 1 access pattern and no temporal locality (this is an example of pure spatial locality of reference), with a cache line size of 128 bytes for 64-bit data. This arrangement will have a 6.25% cache miss rate. Assuming no other methods of latency hiding are used and assuming a memory latency of 945 ns (Laudon and Lenoski 1997), then this processor will spend 95% of its time stalled on cache misses. Phrasing this another way, if one assumes that the peak speed of the processor is one multiply-add instruction per cycle, the best that the processor will deliver is 32 MFLOPS out of a peak of 600 MFLOPS. This result compares favorably with the measured performance in Table 1.From this, it can seen that for large problem sizes, relying on spatial locality alone will not produce an acceptable level of performance. Instead, one must combine spatial locality with temporal locality (data reuse at the cache level). However, if a vector optimized code is run on this machine with the same assumptions, one can, at best, work on 131,072 values per megabyte of cache (theTable 1. Single processor results from the STREAM benchmark for commonly used HPC systems.a System Peak Speed (MFLOPS) TRIAD(MFLOPS)Cray T3E-900 (Alpha 21164) 900 47.3 Cray T3D (Alpha 21064)150 14.7 IBM SP P2SC (120 MHz)480 65.6 IBM SP Power 3 SMP High (222 MHz)888 51.2 SGI Origin 2000 (R12K - 300 MHz)600 32.3 SUN HPC 10000 (Ultra SPARC II - 400 MHz)800 24.7a McCalpin (2000). R12000-based SGI Origin 2000 is currently being sold with 8-MB secondary caches). Table 2 demonstrates where some of the strengths and weaknesses of this approach lie. Clearly the two most important concepts are:(1) Maximize the processing of the data a grid point at a time. (2) Minimize the amount of data that needs to be stored in cache at one time(minimize the size of the working set).Assuming that the techniques mentioned in the previous section have improved the cache miss rate to 1%, then the peak delivered level of performance rises to 157 MFLOPS (or spending 74.9% of the time stalled on cache misses). Similar results are obtained when analyzing all CISC- and RISC-based architectures. However, only those architectures with large caches lend themselves to some of these tuning techniques.4. ResultsInitial attempts to run a 3-million grid point test case with F3D on an SGI Power Challenge (75-MHz R8000 processor—300 MFLOPS) for 10 time steps took over 5 hours to complete (Pressel 1997). The same run when run on a Cray C90 took roughly 10 minutes to complete. There was never a chance of running it that fast on the Power Challenge, since the processor is slower. However, it was hoped that run times of roughly 30 minutes might be achievable. Table 3 lists the speed of the RISC optimized code when run on a variety of platforms. The speed has been adjusted to remove the startup and termination costs, which are heavily influenced by factors that are not relevant to this discussion.Table 2. The size of the working set for a 1-million grid point problem.Problem Description Number of Variables(per Grid Point) Size of the Working Set(Bytes)One-Dimensional (1-D) 14301008,000 32,000 240,000 800,000Two-Dimensional (2-D) 1000 H 1000 (Processed as a 1-D problem)14301008,00032,000240,000800,0002-D 1000 H 1000 (Processed as 2-D vector optimized problem, 1 row or column at a time)14301008322408002-D 1000 H 1000 (Processed one grid point at a time for maximum temporal reuse)1430100832240800Three-Dimensional (3-D) 100 H 100 H 100 (Processed as a 1-D problem)14301008,00032,000240,000800,0003-D 100 H 100 H 100 (Processed as a plane of data at a time as a 1-D problem)1430100803202,4008,0003-D 100 H 100 H 100(Processed as a 3-D vector optimized problem, 1 row or column at a time)14301008003,20024,00080,0003-D 100 H 100 H 100 (Processed one grid point at a time for maximum temporal reuse)1430100832240800Block of data 32 H 32 (Processed as a block)14301008,19232,768245,760819,200for the 3-million grid point test case.System Name Processor Clock Rate Peak Speed PerformanceSteps/Hr)(MFLOPS)(MFLOPS)(MHz)(TimeHP PA 7200 120 240 16. 63.Convex ExemplarSPP-1600Cray C90a Proprietary 238 952 81. 319.HP Superdome HP PA 8600 552 2208 135. 532.SGI Challenge R4400 200 100 10. 39.SGI Origin 2000 R10000 195 390 41. 162.SGI Origin 2000 R12000 300 600 61. 241.SGI Origin 3000 R12000 400 800 89. 351.SGI Power Challenge R8000 75 300 23. 91.SGI Power Challenge R10000 195 390 32. 126.400 800 46. 181.SUN HPC 10000 Ultra SPARCIIa Cray C90 ran the vector optimized code.Table 4 lists the speed of the RISC optimized code for a variety of problem sizeswhen run on the SGI Origin 2000 (R12000), the SGI Origin 3000 (R12000), theSUN HPC 10000, and the HP Superdome (HP PA 8600). The SGI Origin 2000was equipped with 128, 300-MHz R12000 processors with 8-MB secondarycaches and 2 GB of memory per 2-processor node. The SGI Origin 3000 wasequipped with 256, 400-MHz R12000 processors with 8-MB secondary caches and4 GB of memory per 4-processor node. The SUN HPC 10000 was equipped with64, 400-MHz Ultra SPARC II processors with either 4 or 8 MB of secondarycaches (one of our systems was upgraded before the series of runs was finished)and 4 GB of memory per 4-processor node. There was insufficient memory torun the 206-million grid point test case on the SUN HPC 10000. The HP Superdome was configured with 48, 552-MHz HP PA 8600 processors with 1 MBof on-chip data cache and 1 GB of memory per processor. For some unknown reasons, we could not run several of the cases on this system, even when therewas more than enough memory to run the job.on the SGI Origin 2000, the SGI Origin 3000, the SUN HPC 10000, and the HP Superdome for a range of test cases. Test Case Size (Millions of Grid Points) Speed (Time Steps/Hr) Performance(Time Steps/Million Grid Points-Hr)SGI Origin 2000 SGI Origin 3000 SUN HP Superdome SGI Origin 2000 SGIOrigin 3000 SUNHPSuperdome 1.00 181. 275. 138. 403. 181. 275. 138. 403.3.01 61. 89. 46. 135. 184. 268. 138. 406. 12.0 11.7 22. 10.6 30. 140. 264. 127. 360.35.6 4.0 6.92 3.4 10.4 142. 246. 121. 370.59.4 2.3 3.93 2.1 Would not run. 137. 233. 125. Would not run.124. 1.05 1.48 0.93 NA 130. 184. 115. NA 206. 0.62 0.99 NA NA 128. 204. NA NATable 5 lists the dimensions of the grids used for each of these test cases. For historical reasons, there were some differences between the 1- and 3-million grid point test cases. All of the remaining test cases were based on the 3-million grid point test case. Only the 1-million grid point test case has been run out to a converged solution. The remaining test cases were only used for scalability testing.Table 5. A summary of the test cases.Test Case Size JMAX KMAX LMAX(Millions of Grid Points) Zone 1 Zone 2 Zone 31.00 15 87 89 75 703.01 15 87 89 225 7012.0 15 87 89 450 14035.6 29 173 175 450 21059.4 29 173 175 450 350124. 43 254 266 450 490206. 71 421 442 450 4905. Prefetching and Stream Buffers vs. Large CachesNow that it has been established that large caches can be of value, let us consider the relative performance of systems that stressed prefetching and/or a fast low latency memory system vs. those that include a large cache. Tables 6 and 7 contain some real world examples of codes that were able to benefit from the presence of a large cache. This is not to say that all codes will benefit from the presence of a large cache. In particular, it is no accident that the version of F3D that was parallelized using compiler directives was able to take advantage of a large cache. It was extensively tuned for such an architecture. Other codes might perform better on the Cray T3E, especially if they were never tuned for cache-based systems.Table 6. Comparative performance from running two versions of F3D using eight processors with the 1-million grid point test case.System Peak Processor Speed ParallelizationMethod Performance(MFLOPS) (TimeSteps/Hr)(MFLOPS)SGI R10K Origin 2000 390 CompilerDirectives793 1.04E3SGI R12K Origin 3000 800 CompilerDirectives1764 2.31E3SUN HPC 10000 800 CompilerDirectives999 1.31E3HP V-Class 1760 CompilerDirectives1632 2.13E3HP Superdome 2208 CompilerDirectives2851 3.74E3SGI R12K Origin 2000a 600 SHMEM 349 4.56E2Cray T3E-1200a 1200SHMEM 3824.99E2IBM SP a 640MPI1992.60E2a Results provided courtesy of Marek Behr, formerly of the U.S. Army High Performance Computing ResearchCenter (AHPCRC).Table 7. Comparative performance from running the Department of Energy (DOE)Parallel Climate Model (PCM) using 16 processors.a,bSystem Peak Processor Speed(MFLOPS)Performance (MFLOPS/PE)SGI R10K Origin 2000 500 60 Cray T3E-900 900 38a This data is based on runs done using the T42L18 test case.b Bettge et al. (1999).6. Prefetching and Stream Buffers in Combination WithLarge CachesPreviously, this report pointed out the limitation of relying solely on prefetching and stream buffers. However, there is also a problem with relying solely on caches, even large caches, to solve all of the performance problems. In particular, there is no reason to believe that as the processor speed increases, the cache miss rate will automatically decrease. Even if one were to increase the sizes of the cache while increasing the speed of the processor, it would seem unlikely that the cache miss rate would significantly decline. (As Table 2 demonstrates, the cache miss rate is a function of the size of the cache and the size of the working set. Once the working set comfortably fits in cache, additional increases in the size of the cache will be of minimal value.) If the memory latency is kept constant, then the gain in performance from increasing the speed of the processor will be sublinear. Table 8 shows an example of this. This is what is known as the Memory Wall.Table 8. The predicted performance increase resulting from upgrading a 195-MHz R10000 Processor to a 300-MHz R12000 Processor in an SGI Origin 2000.Percentage of Time Spent on Cache Misses(R10K) Speedup (%)0 5410 4625 3650 2175 1090 4100 0 However, there is nothing that says one cannot combine both caches and some form of prefetching/stream buffers. The goal of this would not be to prefetch values far enough in advance that they would arrive prior to the time needed. With latencies of over 100 cycles, such a design would effectively be a vector processor such as the Cray SV1. We are also not trying to emulate a vector processor’s ability to stream in a large vector of data while encountering the costof a single cache miss. Instead, the goal is to overlap two or more memory latencies, thereby effectively decreasing the average latency by a factor of two or more. A more thorough discussion of this topic can be found in Pressel (2001).7. ConclusionsIt is possible to tune some scientific codes to take good advantage of systems with a memory hierarchy. It appears as though two- and three-dimensional problems have an inherent advantage to one-dimensional problems. Also, algorithms that do a lot of work per time step (e.g., implicit CFD codes) but exhibit a rapid rate of convergence may be better suited for use with caches than algorithms that do very little work per time step but require a large number of time steps to generate an answer (e.g., explicit CFD codes). For example, if Code A performs 1000 floating point operations per time step per grid point and requires 1000 time steps to converge, then it will perform 1 million floating point operations per grid point. In contrast, if Code B performs 3500 floating point operations per time step per grid point and requires 5000 time steps to produce an answer, then it will perform 1.5 million floating point operations per grid point. If one assumes that both programs are efficiently implemented, then Code A might have twice as many cache misses per time step as Code B. However, over the life of the run, Code B will have 2.5 times as many cache misses as Code A. Presumably, Code B will take close to 2.5 times as long to run as Code A. This is an example of how performing more work per time step can increase the potential for data reuse. In any case, one should be prepared to spend a significant amount of time and effort retuning the code.On a side note, a surprising outcome of this work is that BLAS 1, and, to a lesser extent, BLAS 2, subroutines should be avoided when working with systems that use cache. The BLAS 1 subroutines have little or no ability to optimize for either spatial or temporal locality if it does not already exist. The BLAS 2 subroutines can generate spatial locality through the use of blocking but are inherently unlikely to support temporal locality since they operate on planes of data. Similarly, it was shown that other programming styles that were commonly used with vector processors are distinctly suboptimal for the newer systems. Therefore, while some researchers have expressed a strong desire to maintain a single code for use with both RISC- and vector-based systems, it appears as though this is not a good idea.To an increasing extent when designing or buying a computer for high performance computing, the correct choice when faced with the choice of large cache or prefetching/stream buffers will be both. Of course, this assumes that the rest of the system is compatible with that choice.8. ReferencesBailey, D. H. “RISC Microprocessors and Scientific Computing.” Proceedings for Supercomputing 93, Los Alamitos, CA: IEEE Computer Society Press, 1993. Bettge, T., A. Craig, R. James, W. G. Strand, Jr., and V. Wayland. “Performance of the Parallel Climate Model on the SGI Origin 2000 and the Cray T3E.” The 41st Cray Users Group Conference, Minneapolis, MN: Cray Users Group, May 1999.Kessler, R. E. “Analysis of Multi-Megabyte Secondary CPU Cache Memories.”University of Wisconsin, Madison, WI, <ftp:///markhill /Theses/richard-kessler-body.pdf>, 1991.Laudon, J., and D. Lenoski. “The SGI Origin: A ccNUMA Highly Scalable Server.” Proceedings for the 24th Annual International Symposium on Computer Architecture, New York: ACM, 1997.McCalpin, J. D. “STREAM Standard Results.” < /streams/standard/MFLOPS.html>, 2000.Mucci, P. J., and K. London. “The CacheBench Report.” CEWES MSRC/PET TR/98-25, Vicksburg, MS, <>, 1998.O’Neal, D., and J. Urbanic. “On Performance and Efficiency: Cray Architectures.” Pittsburgh Supercomputing Center, < /~oneal/eff/eff.html>, August 1997.Palacharla, S., and R. E. Kessler. “Evaluating Stream Buffers as a Secondary Cache Replacement.” Proceedings for the 21st Annual International Symposium on Computer Architecture, Los Alamitos, CA: IEEE Computer Society Press, 1994.Pressel, D. M. “Early Results From the Porting of the Computational Fluid Dynamics Code, F3D, to the Silicon Graphics Power Challenge.” ARL-TR-1562, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, December 1997.Pressel, D. M. “Fundamental Limitations on the Use of Prefetching and Stream Buffers for Scientific Applications.” ARL-TR-2538, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, June 2001.I NTENTIONALLY LEFT BLANK. 12GlossaryAHPCRC Army High Performance Computing Research CenterBLAS Basic linear algebra subroutinesCFD Computational fluid dynamicsCHSSI Common High Performance Computing Software Support Initiative CISC Complicated instruction set computerDOD Department of DefenseDOE Department of EnergyHPC High performance computingHPCM High performance computing modernizationMFLOPS Million floating point operations per secondMPP Massively parallel processorRISC Reduced instruction set computer13I NTENTIONALLY LEFT BLANK. 14TECHNICAL 3 DIRECTOR2 DEFENSECENTER US ARMY RESEARCH LAB INFORMATIONOCA AMSRL CI LLDTIC8725 JOHN J KINGMAN RD 2800 POWDER MILL RD0944 ADELPHISTE20783-1197MDFT BELVOIR VA 22060-62183 DIRECTOR1 HQDA US ARMY RESEARCH LABFDT AMSRL CI IS TDAMOPENTAGON 2800 POWDER MILL RD 400ARMYWASHINGTON DC 20310-0460 ADELPHI20783-1197MD1 OSDOUSD(A&T)/ODDR&E(R)ABERDEEN PROVING GROUND DR R J TREW2 DIRUSARL3800 DEFENSE PENTAGONAMSRL CI LP (BLDG 305) WASHINGTON DC 20301-3800GENERAL1 COMMANDINGUS ARMY MATERIEL CMDAMCRDATF5001 EISENHOWER AVEALEXANDRIA VA 22333-00011 INST FOR ADVNCD TCHNLGYTHE UNIV OF TEXAS AT AUSTIN3925 W BRAKER LN STE 400AUSTIN TX 78759-5316ACADEMYMILITARY1 USMATH SCI CTR EXCELLENCEMADNMATHHALLTHAYERWEST POINT NY 10996-17861 DIRECTORUS ARMY RESEARCH LABDAMSRLDR D SMITH2800 POWDER MILL RDMD20783-1197ADELPHI1 DIRECTORUS ARMY RESEARCH LABAMSRL CI AI R2800 POWDER MILL RD20783-1197MDADELPHI151 PROGRAMDIRECTOR 1 NAVAL RSCH LABHENRY R RAMAMURTI CODE 6410C1010 N GLEBE RD STE 510 4555 OVERLOOK AVE SW22201 WASHINGTON DC 20375-5344 ARLINGTONVA1 DPTY PROGRAM DIRECTOR 1 ARMYAEROFLIGHT DAVIS DYNAMICSLDIRECTORATE 1010 N GLEBE RD STE 510 R MEAKIN M S 258 1VA22201 MOFFETT FIELD CA 94035-1000 ARLINGTON1 DISTRIBUTED CENTERS 1 NAVAL RSCH LABOFFICER HEAD OCEAN DYNAMICS PROJECTTHOMAS & PREDICTION BRANCHV1010 N GLEBE RD STE 510 J W MCCAFFREY JR CODE 732022201 STENNIS SPACE CENTER MS ARLINGTONVA395291 HPC CTRS PROJECT MNGRBAIRD 1 US AIR FORCE WRIGHT LABJ1010 N GLEBE RD STE 510 WLFIMARLINGTON22201 J J S SHANGVA2645 FIFTH ST STE 6MNGR WPAFB OH 45433-79121 CHSSIPROJECTPERKINSL1010 N GLEBE RD STE 510 1 US AIR FORCE PHILIPS LABARLINGTON22201 OLAC PL RKFEVACAPT S G WIERSCHKEUNIVERSITY 10 E SATURN BLVD1 RICEMECHANICAL ENGRNG & EDWARDS AFB CA 93524-7680SCIENCEMATERIALSM BEHR MS 321 1 NAVAL RSCH LABST DR D PAPACONSTANTOPOLOUS MAIN6100HOUSTON TX 77005 CODE63904555 OVERLOOK AVE SW1 NAVAL RSCH LAB WASHINGTON DC 20375-5000J OSBURN CODE 5594BLDG A49 RM 15 1 AIR FORCE RSCH LAB DEHE4555 OVERLOOK AVE SW RPETERKINWASHINGTON DC 20375-5340 3550 ABERDEEN AVE SEKIRTLAND AFB NM 87117-57761 NAVAL RSCH LABJ BORIS CODE 6400 1 NAVAL RSCH LAB4555 OVERLOOK AVE SW RSCH OCEANOGRAPHER CNMOCWASHINGTON DC 20375-5344 GHEBURNBLDG 1020 RM 178FIMC STENNIS SPACE CENTER MS1 WLSTRANG 39529BBLDG4502645 FIFTH ST STE 7 1 AIR FORCE RSCH LABWPAFB OH 45433-7913 INFORMATIONDIRECTORATER W LINDERMAN26 ELECTRONIC PKWYROME NY 13441-451416。

相关文档
最新文档