Measuring cache and TLB performance and their effect on benchmark run times


A Pre-Comparison TLB Structure for Low Power

HOU Jin-yong, XING Zuo-cheng (College of Computer, National University of Defense Technology, Changsha 410073, China). Journal of National University of Defense Technology, Vol. 28, No. 5, 2006; article no. 1001-2486(2006)05-0084-06.

Abstract: This paper presents a low-power TLB structure.

The idea behind the proposed TLB is based on program locality: by combining the block buffering technique [1] with a modified CAM structure, a pre-comparison TLB is obtained that achieves low power. SimpleScalar 3.0 was used to simulate the miss ratios of the proposed TLB and of several conventional TLB structures. Results from a modified CACTI 3 [2] show that the proposed TLB reduces the average power-delay product by about 85% compared with a fully associative TLB (FA-TLB), by 80% compared with a micro-TLB, by 66% compared with a victim-TLB, and by more than 66% compared with a banked TLB. The proposed structure therefore meets its goal of reducing power consumption.

Keywords: TLB; low power; CAM; block buffer

A TLB (translation look-aside buffer) is a fast buffer in a microprocessor used to translate virtual addresses into physical addresses [3].
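The pre-comparison idea can be illustrated with a small software model. The sketch below is a minimal C analogue of the hardware scheme, not the authors' design: it checks the most recently hit entry before falling back to a full CAM-style search, so consecutive accesses to the same page avoid the power-hungry full comparison. The entry count, page size, and structure names are assumptions for the example.

```c
/* Illustrative software model of a pre-comparison (block-buffered) TLB lookup.
 * Not the paper's hardware design; sizes and names are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES 64
#define PAGE_SHIFT  12

typedef struct {
    uint32_t vpn;   /* virtual page number (the CAM tag)  */
    uint32_t pfn;   /* physical frame number              */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[NUM_ENTRIES];
static int last_hit = 0;            /* "block buffer": index of the last hit */

/* Returns true on a hit and fills *pfn; a miss would fall through to the
 * page-table walker, which is outside the scope of this sketch. */
bool tlb_lookup(uint32_t vaddr, uint32_t *pfn)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;

    /* Pre-comparison: same page as the previous access? Skip the CAM. */
    if (tlb[last_hit].valid && tlb[last_hit].vpn == vpn) {
        *pfn = tlb[last_hit].pfn;
        return true;
    }

    /* Fall back to the full (here: sequential) CAM search. */
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            last_hit = i;
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;   /* TLB miss */
}
```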

Handling Hardware Caches and TLBs (Xu Bo)

Lazy TLB mode
• How does the Linux kernel implement lazy TLB mode? The kernel keeps a global array cpu_tlbstate[] of tlb_state structures, one per CPU. Each tlb_state structure holds two pieces of information: the mm_struct (from the task_struct) of the process currently running, or most recently run, on that CPU, and a state field indicating whether the CPU is in lazy TLB mode.
• If the state is TLBSTATE_OK, the CPU flushes its TLB on a flush request; if the state is TLBSTATE_LAZY, it ignores the interrupt request and does not flush.
• If the mm_struct whose TLB entries are to be flushed differs from the mm_struct active on that CPU, there is no need to flush that CPU's TLB at all.
Summary: handling hardware caches and TLBs
• Why? • What? • Handling the hardware cache: how to maximize the cache hit rate
Handling hardware caches and TLBs
White book pp. 79-83; red book pp. 228-230
Xu Bo
Why?
RAM and the CPU run at different speeds; caching reduces the time the CPU spends waiting.
RAM is already quite fast compared with a hard disk, but its performance still falls far short of the CPU's. Today's CPU clock rates approach several GHz, while a dynamic RAM access takes hundreds of clock cycles, so an instruction that fetches an operand from RAM or stores a result back to RAM can leave the CPU waiting for a long time. To narrow the speed mismatch between CPU and RAM, the hardware cache memory was introduced. Hardware caches are based on the locality principle, which applies to program structure as well as to data structure, so it makes sense to add a small, fast memory that holds the most recently and most frequently used code and data.
When a CPU changes the page mappings of the process that owns an mm_struct, it sends an inter-processor interrupt to every CPU whose bit is set in cpu_vm_mask, telling them to update their TLBs; those CPUs respond to the interrupt in smp_invalidate_interrupt().
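The flush decision just described can be summarized in a short sketch. This is a simplified, illustrative model, not the Linux kernel source; the handler name, the CPU-count constant, and the omission of locking are assumptions for the example.

```c
/* Simplified model of the lazy-TLB flush decision described above.
 * Illustrative only; names and signatures are assumptions, not kernel code. */
struct mm_struct;                       /* opaque: the address space          */

enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

struct tlb_state {
    struct mm_struct *active_mm;        /* mm running (or last run) on CPU    */
    int state;                          /* TLBSTATE_OK or TLBSTATE_LAZY       */
};

#define NR_CPUS 64
struct tlb_state cpu_tlbstate[NR_CPUS]; /* one entry per CPU                  */

extern void local_flush_tlb(void);      /* flush this CPU's TLB               */

/* Run on a CPU that receives the "invalidate" inter-processor interrupt. */
void handle_invalidate_ipi(int cpu, struct mm_struct *flush_mm)
{
    /* The flush concerns a different address space: nothing to do here. */
    if (cpu_tlbstate[cpu].active_mm != flush_mm)
        return;

    if (cpu_tlbstate[cpu].state == TLBSTATE_OK)
        local_flush_tlb();              /* actively using this mm: flush now  */
    /* TLBSTATE_LAZY: ignore the request; no flush is performed.              */
}
```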

Cloud Computing HCIP Exam Questions and Answers

I. Single-choice questions (52 questions, 1 point each, 52 points in total)

1. In FusionCompute, which of the following cannot be configured when creating a trunk-type port group?
A. Outbound traffic shaping  B. Fill TCP checksum  C. IP-MAC binding  D. DHCP isolation
Correct answer: C

2. When configuring an uplink in FusionCompute, which type of port needs to be added?
A. Storage interface  B. Port group  C. BMC network port  D. Bonded (aggregated) network port
Correct answer: D

3. A thin client contains no operating system; it is merely a hardware device used to display the remote virtual desktop and to attach peripherals.

A. TRUE  B. FALSE
Correct answer: B

4. A FusionCompute distributed virtual switch can centrally configure, manage, and monitor the virtual switches of multiple servers.

It also ensures that a virtual machine's network configuration remains consistent when the VM migrates between servers.

A. TRUE  B. FALSE
Correct answer: A

5. The differential disk of a linked-clone virtual machine stores the user's temporary system data; simply shutting down the virtual machine clears the differential disk automatically.

A. TRUE  B. FALSE
Correct answer: A

6. When using a cloud desktop, choppy video playback belongs to which category of fault?
A. Login/connection fault  B. Peripheral fault  C. Performance/experience fault  D. Service provisioning fault
Correct answer: C

7. In FusionCompute, which of the following is the correct sequence of steps for adding an IP SAN storage resource so that virtual machines can use it?
A. Add host storage interface > add storage resource > add datastore > scan storage devices
B. Add storage resource > add host storage interface > scan storage devices > add datastore
C. Add host storage interface > add storage resource > scan storage devices > add datastore
D. Add datastore > add host storage interface > add storage resource > scan storage devices
Correct answer: C

8. In a FusionCompute distributed virtual switch, which port do virtual machines rely on to communicate with the outside world?
A. Port group  B. Uplink  C. Storage interface  D. Mgnt
Correct answer: B

9. When the consistency-snapshot option is selected, FusionCompute saves the data currently in the virtual machine's memory, so restoring the virtual machine can bring back the memory state it had when the snapshot was taken.

Memory Bad Block Management

Memory is one of the key components of a computer and is used to store programs and data.

However, after prolonged use or for other reasons, blocks of memory can become damaged, so managing bad memory blocks is important for system administrators and programmers.

First, what is a memory bad block? A bad block is one or more contiguous storage cells that have suffered physical damage and can no longer be read or written correctly. Bad blocks can be caused by hardware faults, voltage problems, overheating, or simply long use.

Bad blocks hurt both the stability and the performance of a computer system.

First, they can cause crashes or slowdowns: when a program tries to read or write a damaged block, errors may occur that make the program crash or run slowly. Second, they can cause data loss or corruption: if important data is stored in a damaged block, that data may be permanently lost or become unusable.

Several measures help manage bad blocks. A memory test tool can detect them: such tools scan all of memory and mark the locations of any bad blocks found. Once a bad block is discovered, it can be flagged as unusable so that programs never touch it. Memory can also be reallocated to route around bad blocks: when a bad block is found, it is replaced with a usable block, either by the operating system or by the program itself (a sketch of this idea follows below).
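Here is a minimal sketch of that detect-and-mark step, assuming a simple write/read-back pattern test, directly addressable pages, and a flat bad-page list. All names are hypothetical; real systems do this in firmware, in the operating system, or with ECC hardware.

```c
/* Illustrative bad-block scan: test each page and record failures so the
 * allocator can skip them. Assumes pages are directly addressable. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define MAX_BAD   1024

static uintptr_t bad_pages[MAX_BAD];
static size_t    bad_count;

/* Write two complementary patterns and read them back. */
static bool page_is_bad(volatile uint8_t *page)
{
    static const uint8_t patterns[] = { 0xAA, 0x55 };
    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < PAGE_SIZE; i++)
            page[i] = patterns[p];
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (page[i] != patterns[p])
                return true;            /* read-back mismatch: mark as bad */
    }
    return false;
}

void scan_region(uintptr_t start, uintptr_t end)
{
    for (uintptr_t addr = start; addr + PAGE_SIZE <= end; addr += PAGE_SIZE) {
        if (page_is_bad((volatile uint8_t *)addr) && bad_count < MAX_BAD)
            bad_pages[bad_count++] = addr;   /* the allocator skips these */
    }
}
```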

Finally, regular maintenance is an important part of bad block management: periodically checking the physical condition of the memory and clearing out stale data reduces the chance of bad blocks appearing.

Beyond these measures, some precautions help prevent bad blocks in the first place. Keep the computer's supply voltage stable, since voltage that is too high or too low can damage memory. Pay attention to cooling: overheating can also damage memory, so make sure the machine's cooling system is working properly. Regularly clearing stale data out of memory is another useful preventive step.

In practice, bad block management is a complex but important task. System administrators and programmers need to watch the state of memory closely and detect and repair bad blocks promptly. For critical systems and data, redundant memory or backups can further improve stability and fault tolerance, and keeping hardware and software up to date is another effective safeguard.

Four-Bay Synology DiskStation DS918+ NAS Datasheet

Synology DiskStation DS918+ is a 4-bay NAS designed for small and medium-sized businesses and IT enthusiasts. Powered by a new quad-core processor, DS918+ provides outstanding performance and data encryption acceleration along with real-time transcoding of 4K Ultra HD source contents. Synology DS918+ is backed by Synology’s 3-year limited warranty.DiskStationDS918+Highlights• Powerful and scalable 4-bay NAS for growing businesses• Encrypted sequential throughput at over 225 MB/s reading and 221 MB/s writing 1• Quad-core processor with AES-NI hardware encryption engine• 4GB DDR3L-1866 memory, expandable up to 8GB• Dedicated M.2 NVMe SSD slots for system cache support• Dual 1GbE LAN with failover and Link Aggregation support• Scalable up to 9 drives with Synology DX5172• Advanced Btrfs file system offering 65,000 system-wide snapshots and 1,024 snapshots per shared folder • Dual-channel H.265/H.264 4K video transcoding on the fly 3High-speed Scalable Storage ServerSynology DS918+ is a 4-bay network attached storage solution equipped with an quad-core processor and 4GB DDR3L memory (expandable up to 8GB). With Link Aggregation enabled, DS918+ delivers great sequential throughput performance at over 226 MB/s reading and 222 MB/s writing 1. With AES-NI hardware accelerated encryption, DS918+ delivers encrypted data throughput at over 225 MB/s reading and 221 MB/s writing 1. DS918+ newly supports M.2 NVMe 2280 SSDs , allowing fast system cache creation without occupying internal drive bays.DS918+ can supp ort up to 9 drives when connected to one Synology DX5172 expansion unit. Storage capacity can be expanded according to your business needs with minimal effort.Btrfs: Next Generation Storage EfficiencyDS918+ introduces the Btrfs file system , bringing the most advanced storage technologies to meet the management needs of modern businesses:• Built-in data integrity check detects data and file system corruption with data and meta-data checksums and improves the overall stability.• Flexible Shared Folder/User Quota System provides comprehensive quota control over all user accounts and shared folders.• Advanced snapshot technology with customizable backup schedule allows up to 1,024 copies of shared folder backups in a minimum 5-minute interval without occupying huge storage capacity and system resources.• File or folder level data restoration brings huge conveniences and saves time for users who wish to restore only a specific file or folder.• File self-healing : Btrfs file system can auto-detect corrupted files with mirroredmetadata, and recover broken data using the supported RAID volumes, which include RAID1, 5, 6, and 10.Comprehensive Business ApplicationsPowered by the innovative Synology DiskStation Manager (DSM), DS918+ comes fully-equipped with applications and features designed specifically for small or growing businesses:• Windows® AD and LDAP support allow easy integration with existing business directory services, without needing to recreate user accounts.• Windows ACL support provides fine-grained access control and efficient privilege settings, allowing DS918+ to fit seamlessly into current infrastructure.• Internet file access is simplified by the encrypted FTP server and Synology File Station , a web-based file explorer. 
HTTPS, firewall, and IP auto-block support ensure file sharing over the Internet is protected with a high level of security.• Application Privileges controls access to applications and packages based on each individual account, user group, or IP address.• MailPlus allows your business to set up a secure, reliable, and private mail server while giving your employees a modern mail client for receiving and sending messages.• The powerful Collaboration Suite integrates Synology Office , Calendar , and Chat , ensuring secure and effective private communications and allowing your organizations to easily manage and control relevant contents.Virtualization SolutionsSynology’s Virtual Machine Manager opens up abundant possibilities, allowing you to set up and run various virtual machines, including Windows , Linux , and Virtual DSM . You can also test new software versions in a sandbox environment, isolate yourcustomers' machines, and increase the flexibility of your DS918+ with reduced hardware deployment and maintenance costs.Synology iSCSI storage fully supports most of the virtualization solutions, enhancing work efficiency with an intuituve management interface. VMware vSphere™ 6.5 and VAAI integration help offload storage operations and optimize computation efficiency. Windows Offloaded Data Transfer (ODX) speeds up data transfer and migration rate, while OpenStack Cinder support transforms your Synology NAS into a block-based storage component.4K Ultra HD Multimedia LibrarySynology DiskStation DS918+ features real-time transcoding for up to 2 channels of H.264/H.265 4K videos at the same time, bringing more powerful support to thelatest multimedia formats and contents. With Video Station , you can organize personal digital video library with comprehensive media information, and watch 4K Ultra HD movies and films. The intuitive design of Photo Station allows you to effortlessly organize photos into customized categories, smart albums and blog posts, and link them to social networking websites within a few clicks. 
Audio Station comes with Internet radio and lossless audio formats support, and provides music playback viaDLNA and AirPlay®-compliant devices.Virtual Machine Manager Virtual Machine Manager implements various virtualization solutions, allowing you to manage multiple virtual machines on your DS918+, including Windows , Linux , and Virtual DSM.Synology Collaboration SuiteA powerful and secure private cloud solution for business collaboration and organization, offering intuitive yet effective management options.Hardware OverviewTechnical SpecificationsHardwareCPUIntel Celeron J3455 quad-core 1.5GHz, burst up to 2.3GHz Hardware encryption engine Yes (AES-NI)Hardware transcoding engine • Supported codecs: H.264 (AVC), H.265 (HEVC), MPEG-2, VC-1• Maximum resolution: 4K (4096 x 2160)• Maximum frames per second (FPS): 30Memory4 GB DDR3L SO-DIMM (expandable up to 8 GB)Compatible drive type • 4 x 3.5" or 2.5" SATA SSD/HDD (drives not included)• 2 x M.2 NVMe 2280 SSD (drives not included)External port • 2 x USB 3.0 port • 1 x eSATA port Size (HxWxD)166 x 199 x 223 mm Weight 2.28 kg LAN2 x 1GbE (RJ-45)Wake on LAN/WAN Yes Scheduled power on/off YesSystem fan2 (92 x 92 x 25 mm)AC input power voltage 100V to 240V AC Power frequency 50/60Hz, single phase Operating temperature 5°C to 40°C (40°F to 104°F)Storage temperature -20°C to 60°C (-5°F to 140°F)Relative humidity5% to 95% RH Maximum operating altitude5,000 m (16,400 ft)1Status indicator2Drive status indicator 3Drive tray lock 4USB 3.0 port 5Power button and indicator 6Drive tray 71GbE RJ-45 port 8Reset button9eSATA port 10Power port11Fan12Kensington Security Slot13USB 3.0 port14M.2 NVMe SSD slot (bottom side)6General DSM SpecificationNetworking protocol SMB, AFP, NFS, FTP, WebDAV, CalDAV, iSCSI, Telnet, SSH, SNMP, VPN (PPTP, OpenVPN ™, L2TP)File system • Internal: Btrfs, ext4• External: Btrfs, ext4, ext3, FAT, NTFS, HFS+, exFAT4Supported RAID type Synology Hybrid RAID (SHR), Basic, JBOD, RAID 0, RAID 1, RAID 5, RAID 6, RAID 10Storage management • Maximum internal volumes: 512• Maximum iSCSI targets: 32• Maximum iSCSI LUNs: 256• iSCSI LUN clone/snapshot supportSSD cache SSD read-write cache supportFile sharing capability • Maximum local user accounts: 2,048• Maximum local groups: 256• Maximum shared folders: 512• Maximum concurrent SMB/NFS/AFP/FTP connections: 1,000Privilege Windows Access Control List (ACL), application privilegeDirectory service Windows® AD integration: Domain users login via SMB/NFS/AFP/FTP/File Station, LDAP integration Virtualization VMware vSphere® 6.5, Microsoft Hyper-V®, Citrix®, OpenStack®Security Firewall, encrypted shared folder, SMB encryption, FTP over SSL/TLS, SFTP, rsync over SSH, login auto block, Let's Encrypt support, HTTPS (Customizable cipher suite)Supported client Windows 7 and 10, Mac OS X® 10.11 onwardsSupported browser Chrome®, Firefox®, Internet Explorer® 10 onwards, Safari® 10 onwards; Safari (iOS 10 onwards), Chrome (Android™ 6.0 onwards)Interface Language English, Deutsch, Français, Italiano, Español, Dansk, Norsk, Svensk, Nederlands, Русский, Polski, Magyar, Português do Brasil, Português Europeu, Türkçe, Český,Packages and ApplicationsFile Station Virtual drive, remote folder, Windows ACL editor, compressing/extracting archived files, bandwidth control for specific users or groups, creating sharing links, transfer logsFTP Server Bandwidth control for TCP connections, custom FTP passive port range, anonymous FTP, FTP SSL/TLS and SFTP protocol, boot over the network with TFTP and PXE support, transfer 
logsUniversal Search Offer global search into applications and filesHyper Backup Support local backup, network backup, and backup data to public clouds Active Backup for Server Centralize data backup for Windows and Linux servers without client installationBackup tools DSM configuration backup, macOS Time Machine support, Cloud Station Backup Shared folder sync - maximum tasks: 8Cloud Station Suite Sync data between multiple platforms by installing the client utilities on Windows, Mac, Linux, Android and iOS devices, while retaining up to 32 historical versions of filesMaximum concurrent file transfers: 1,000Cloud Sync One or two-way synchronization with public cloud storage providers including Amazon Drive, Amazon S3-compatible storage, Baidu cloud, Box, Dropbox, Google Cloud Storage, Google Drive, hubiC, MegaDisk, Microsoft OneDrive, OpenStack Swift-compatible storage, WebDAV servers, Yandex DiskSurveillance Station Maximum IP cameras: 40 (total of 1,200 FPS at 720p, H.264) (includes two free camera licenses; additional cameras require the purchasing of additional licenses)Virtual Machine Manager Deploy and run various virtual machines on Synology NAS, including Windows, Linux, or Virtual DSM High Availability Manager Reduce service downtime by setting up two identical NAS into one high-availability clusterSnapshot Replication • Maximum replications: 64• Maximum shared folder snapshots: 1,024• Maximum system snapshots: 65,536Active Directory Server A flexible and cost-effective domain controller solutionVPN Server Maximum connections: 20, supported VPN protocol: PPTP, OpenVPN™, L2TP/IPSecMailPlus Server Secure, reliable, and private mail solution with high-availability, load balancing, security and filtering design (Includes 5 free email account licenses; additional accounts require the purchasing of additional licenses)MailPlus Intuitive webmail interface for MailPlus Server, customizable mail labels, filters, and user interfaceCollaboration tools • Chat maximum users: 1,500• Office maximum users: 200, maximum simultaneous editing users: 80• Calendar: support CalDAV and access via mobile devicesNote Station Rich-text note organization and versioning, encryption, sharing, media embedding and attachmentsStorage Analyzer Volume and quota usage, total size of files, volume usage and trends based on past usage, size of shared folders, largest/most/least frequently modified filesAntivirus Essential Full system scan, scheduled scan, white list customization, virus definition auto updateSYNOLOGY INC.Synology is dedicated to taking full advantage of the latest technologies to bring businesses and home users reliable and affordable ways to centralize data storage, simplify data backup, share and sync files across different platforms, and access data on-the-go. Synology aims to deliver products with forward-thinking features and the best in class customer services.Copyright © 2017, Synology Inc. All rights reserved. Synology, the Synology logo are trademarks or registered trademarks of Synology Inc. Other product and company names mentioned herein may be trademarks of their respective companies. Synology may make changes to specification and product descriptions at anytime, without notice.DS918+-2017-ENU-REV003Headqu artersSynology Inc. 3F-3, No. 106, Chang An W. Rd., Taipei, Taiwan Tel: +886 2 2552 1814 Fax: +886 2 2552 1824ChinaSynology Shanghai 200070, Room 516, No. 638 Hengfeng Rd., Zhabei Dist. 
Shanghai, ChinaUnited KingdomSynology UK Ltd.Unit C, Denbigh WestBusiness Park, Third AvenueBletchley, Milton KeynesMK1 1DH, UKTel: +44 1908 366380GermanySynology GmbHGrafenberger Allee125 40237 DüsseldorfDeutschlandTel: +49 211 9666 9666North & South AmericaSynology America Corp.3535 Factoria Blvd SE #200Bellevue, WA 98006, USATel: +1 425 818 1587FranceSynology France SARL39 rue Louis Blanc, 92400Courbevoie, FranceTel: +33 147 176288Download Station Supported download protocols: BT, HTTP, FTP, NZB, eMule, Thunder, FlashGet, QQDL Maximum concurrent download tasks: 80Web Station Virtual host (up to 30 websites), PHP/MariaDB®, 3rd-party applications supportOther packages Video Station, Photo Station, Audio Station, DNS Server, RADIUS Server, iTunes® Server, Log Center, additional 3rd-party packages are available in Package CenteriOS/Android™ applications DS audio, DS cam, DS cloud, DS file, DS finder, DS get, DS note, DS photo, DS video, MailPlus Windows Phone® applications DS audio, DS file, DS finder, DS get, DS photo, DS videoEnvironment and PackagingEnvironment safety RoHS compliantPackage content • DS918+ main unit x 1• Quick Installation Guide x 1• Accessory pack x 1• AC power adapter x 1• RJ-45 LAN cable x 2Optional accessories • D3NS1866L-4G• Expansion Unit DX517• VisualStation VS360HD, VS960HD • Surveillance Device License Pack • MailPlus License PackWarranty 3 years*Model specifications are subject to change without notice. Please refer to for the latest information.1. Performance figures may vary depending on environment, usage, and configuration.2. DS918+ supports one Synology DX517, sold separately.3. DS918+ can transcode 4K video to 1080p or lower. The maximum number of concurrent video transcoding channels supported may vary depending on the video codec, resolution,bitrate and FPS.4. exFAT Access is purchased separately in Package Center.。

DesignWare ARC EM Overlay Management Unit Datasheet

DESIGNWARE IP DATASHEETThe DesignWare ® ARC ® EM Overlay Management Unit (OMU) option enables address translation and access permission validation with minimal power and area overhead while boosting the ability to run larger and more data intensive operations, such as those increasingly prevalent within AIoT, storage and wireless baseband applications, on an ARC EM processor. This hardware-based Overlay Management Unit provides support for virtual memory addressing with a Translation Lookaside Buffer (TLB) for address translation and protection of 4KB, 8KB or 16KB memory pages. In addition, fixed mappings of untranslated memory are supported, enabling the system to achieve increased performance over a large code base residing in a slow secondary storage memory, with the option to be paged in as needed into faster small on-chip page RAM (PRAM) in an efficient way. This is particularly suited for operating environments in which virtual address aliasing is avoided in software.In systems that run all code as a single process (single PID), using a large virtual address space with a one-to-one correspondence between the virtual address and a large selected area of secondary storage space (such as flash memory or DRAM), the address-translation facility of the Overlay Management Unit can be used to detect when a section (or one or more pages) of code is resident in the PRAM and provide the physical address to the page in the PRAM. Virtual address spacePhysical address space Figure 1: Virtual to Physical Address TranslationHighlights• Lightweight hardware-based memorymanagement unit (MMU) enablingaddress translation and accesspermission validation• Fully associative Instruction andData µTLBs• Configurable joint TLB depth of 64, 128 or256 entries• Common address space forinstruction and data• Independent rd/wr/execute flags for user/kernel modes per page• Optimized TLB programming withsoftware managed JTLB and hardwareassisted replacement policy• 32-bit unified instruction/dataaddress space–2GB virtual translated addressspace, mapping to 4GB physicaladdress space• Configurable page size: 4 KB, 8 KB, 16 KB• Per page cache control• Optional ECC for JTLB RAMsTarget Applications• AIoT• Storage• Wireless• NetworkingARC Overlay Management Unit forMemory ModelThe EM processor supports virtual memory addressing when the Overlay Management Unit is present. If the Overlay Management Unit is not present or if it is present but disabled, all the virtual addresses are mapped directly to physical addresses. By default, the Overlay Management Unit is disabled after reset. Note that the data uncached region is always active even if the Overlay Management Unit is disabled.The Overlay Management Unit features a TLB for address translation and protection of 4 KB, 8 KB or 16 KB memory pages, and fixed mappings of an untranslated memory. The upper half of the untranslated memory section is uncached for I/O uses while the lower half of the untranslated memory is cached for a system kernel.With the Overlay Management Unit option enabled, the ARC EM cache-based cores define a common address space for both instruction and data accesses in which the memory translation and protection systems can be arranged to provide separate, non-overlapping protected regions of memory for instruction and data access within a common address space. 
The programming interface to the Overlay Management Unit is independent of the configuration of the TLB in terms of the associativity of number of entries (Figure 2).Virtual address space Physical address space in secondary storagePhysical address space in on-chip page RAM Figure 2: Memory Address Mapping with Overlay Management ComponentsPage Table LookupThe system management or micro-kernel software tracks the mapping of pages from the program store in the level-3 memory to smaller level-2 memory. The number of entries used/required for this varies based on the Overlay Management Unit page size and the size of the level-2 memory. The Overlay Management Unit acts as a software-controlled cache into this page table, performs hardware address translation, and checks access permissions (Figure 3).Two levels of cache are provided:• The first level consists of micro TLBs (or μTLBs). These are very small, fully associative caches into the second level of the OLM cache. They allow for single-cycle translation and permission checking in the processor pipeline. The μTLBs are updated automatically from the second level of the cache.• The second level of the cache is called the joint TLB (JTLB). This consists of a larger, RAM-based 4-way set-associative TLB. The JTLB is loaded by special kernel mode handlers known as TLB miss handlers.• The final level of the hierarchy is the main page table itself. This contains the complete details of each page mapped for use by kernel or user tasks. The μTLBs, JTLB, and miss handlers combine to implement cached access into the OS page table.Figure 3: Overlay Manager Table StructureTranslation Lookaside BuffersTo provide fast translation from virtual to physical memory addresses the Overlay Management Unit contains Translation Lookaside Buffers (TLBs). The TLB architecture of the ARC EM’s Overlay Management Unit can be thought of as a two level cache for page descriptors: “micro-TLBs” for instruction and data (μITLB & μDTLB) as level one, and the “Joint” (J-TLB) as level two. The μITLB and μDTLB contain copies of the content in the joint TLB. The μTLBs may have descriptors not contained in the joint TLB. In addition to providing address translation, the TLB system also provides cache control and memory protection features for individual pages. The ARC EM implementation features a system configured as follows:• The μITLB and μDTLB are fully associative and physically located alongside the instruction cache and data cache respectively, where they perform single-cycle virtual to physical address translation and permission checking. The μITLB and μDTLB are hardware managed. On a μITLB (or μDTLB) page miss, the hardware fetches the missing page mapping from the JTLB.• The JTLB consists of a four-way set associative Joint Translation Lookaside Buffer with 64, 128 or 256 entries and is software managed. On a joint TLB page miss, special kernel-mode TLB miss handlers fetch the missing page descriptor from memory and store it in the JTLB, as well as swapping in the required contents from the main memory store into the level-2 memory. No part of the Overlay Management Unit has direct access to the main memory. 
The JTLB is filled by software through an auxiliary register interface.DocumentationThe following documentation is available for the DesignWare ARC Overlay Management Unit Option for ARC EM:• ARCv2 ISA Programmers Reference Manual• ARC EM Databook• DesignWare ARC EM Integration GuideTesting, Compliance, and QualityVerification of the ARC EM Overlay Management Unit follows a bottom-up verification methodology from block-level through system-level. Each functional block within the product follows a functional, coverage-driven test plan. The plan includes testing for ARCv2 ISA compliance as well as state- and control-specific coverage points that have been exercised using constrained pseudo-random environments and a random instruction sequence generatorARC EM ProcessorsThe ARC EM processors, built on the ARCv2 instruction set architecture (ISA) are designed to meet the needs of next-generation system-on-chip (SoC) applications and enable the development of a full range of 32-bit processor cores – from low-end, extremely power-efficient embedded cores to very high-performance host solutions that are binary compatible and designed with common pipeline elements. ARC EM processors can be precisely targeted to meet the specific performance and power requirements for each instance on a SoC, while offering the same software programmer’s model to simplify program development and task partitioning.©2021 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks isavailable at /copyright.html . All other names mentioned herein are trademarks or registered trademarks of their respective owners.。
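To make the software-managed JTLB refill described above concrete, here is a conceptual sketch in C. The helper functions, page-table layout, and the way an entry is written are placeholders, not the documented ARC EM auxiliary-register interface.

```c
/* Conceptual sketch of a software JTLB miss handler.
 * All helpers and the register interface are placeholders, not ARC EM APIs. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                 /* 4 KB pages (8/16 KB also possible) */

typedef struct {
    uint32_t vpn;       /* virtual page number                   */
    uint32_t pfn;       /* physical frame number in page RAM     */
    uint32_t perms;     /* rd/wr/exec flags for user and kernel  */
    bool     valid;
} pte_t;

extern pte_t *page_table_lookup(uint32_t vpn);        /* OS page-table walk   */
extern void   page_in_from_secondary(uint32_t vpn);   /* copy page into PRAM  */
extern void   jtlb_write_entry(uint32_t vpn, uint32_t pfn, uint32_t perms);

/* Kernel-mode handler run on a JTLB miss for the faulting address `vaddr`. */
void jtlb_miss_handler(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    pte_t   *pte = page_table_lookup(vpn);

    if (pte == NULL || !pte->valid) {
        /* Page not resident in PRAM: page it in from secondary storage.
         * (Assumes the page-in leaves a valid mapping behind.) */
        page_in_from_secondary(vpn);
        pte = page_table_lookup(vpn);
    }

    /* Install the descriptor; hardware refills the uTLBs from the JTLB. */
    jtlb_write_entry(vpn, pte->pfn, pte->perms);
}
```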

Storage HCIP Test Questions and Answers

1. Regarding the performance of the disks in the various Huawei OceanStor 9000 node types, which ordering from highest to lowest is correct?
A. P25 node SSD > P25 node SAS > P12 SATA > P36 node SATA > C36 SATA
B. P25 node SSD > P25 node SAS > P12 SATA > C36 SATA
C. P25 node SSD > P25 node SAS > P36 node SATA > P12 SATA > C36 SATA
D. P25 node SSD > P25 node SAS > P36 node SATA > C36 SATA > P12 SATA
Answer: C

2. Which of the following is not a high-risk command on Huawei storage?
A. import license  B. export license  C. poweroff disk  D. import configuration_data
Answer: B

3. Erasure-code redundancy supports higher reliability and more flexible redundancy policies than traditional RAID algorithms. Which of the following statements about erasure coding is incorrect?
A. When data is written, it is split into N data blocks of equal size
B. For every N consecutive data blocks, the erasure-code algorithm computes M parity blocks
C. The system stores the N+M blocks in parallel on different nodes
D. In erasure-code storage mode, the system can tolerate at most M-1 faulty disks
Answer: D

4. Which of the following statements about InfoRevive is correct?
A. The InfoRevive feature of OceanStor 9000 ensures that, even when the number of failed nodes/disks exceeds the redundancy limit, video-surveillance service continuity is completely unaffected.

B. After the read fault-tolerance mode is enabled, when failures in the system exceed the configured redundancy limit, all of the damaged video file data can still be salvaged and read.

Impact of JIT JVM Optimizations on Java Application Performance

Impact of JIT JVM Optimizations on Java Application Performance¢K. Shiv , R. Iyer , C. Newburn , J. Dahlstedt , M. Lagergren and O. Lindholm Intel Corporation BEA Systems¢ ¢ ¢ ¡ ¡ ¡ ¡AbstractWith the promise of machine independence and efficient portability, JAVA has gained widespread popularity in the industry. Along with this promise comes the need for designing an efficient runtime environment that can provide high-end performance for Java-based applications. In other words, the performance of Java applications depends heavily on the design and optimization of the Java Virtual Machine (JVM). In this paper, we start by evaluating the performance of a Java server application (SPECjbb2000 ) on an Intel platform running a rudimentary JVM. We present a measurement-based methodology for identifying areas of potential improvement and subsequently evaluating the effect of JVM optimizations and other platform optimizations. The compiler optimizations presented and discussed in this paper include peephole optimizations and Java specific optimizations. In addition, we also study the effect of optimizing the garbage collection mechanism and the effect of improved locking strategies. The identification and analysis of these optimizations are guided by the detailed knowledge of the micro-architecture and the use of performance measurement and profiling tools (EMON and VTune) on Intel platforms.1 IntroductionThe performance of Java client/server applications has been the topic of significant interest in the recent years. The attraction that Java offers is the promise of portability across all hardware platforms. This is accomplished by using a managed runtime engine called the Java Virtual Machine (JVM) that runs a machine-independent representation of Java applications called bytecodes. The most common mode of application execution is based on a Just-InTime (JIT) compiler that compiles the bytecodes into native machine instructions. These native machine instructions are also cached in order to allow for fast re-use of the frequently executed code sequences. Apart from the JIT compilation, the JVM also performs several functions including thread management and garbage collection. This brings us to the reason for our study i.e. Java application performance depends very heavily on the efficient execution of the Java Virtual Machine (JVM). Our goal in this paper is to characterize, optimize and evaluate a JVM while running a representative Java application. Over the last few years, several projects (from academia as well as in the industry) [1,2,4,7,8,9,10,15,16,21] have studied various aspects of Java applications, compilers and interpreters. We found that R. Radhakrishnan et al. [16] cover a brief description of much of the recent work on this subject. In addition, they also provide insights on architectural implications of Java client workloads based on SPECjvm98 [18]. Overall, the published work can be classified into the following general areas of focus: (1) presenting the design of a compiler, JVM or interpreter, (2) optimizing a certain aspect of Java code execution, and (3) discussing the application performance and architectural characterization. In this paper, we take a somewhat different approach touching upon all the three aspects listed above. We present the software architecture of a commercial JVM, identify several optimizations and characterize the performance of a representative Java server benchmark through several phases of code generation optimizations carried out on a JVM. 
Our contributions in this paper are as follows. We start by characterizing SPECjbb2000 [17] performance on Intel platforms running an early version of BEA’s JRockit JVM [3]. We then identify various possible optimizations (borrowing ideas from literature wherever possible), present the implementation details of these optimizations in the JVM and analyze the effect of each optimization on the execution characteristics and overall performance. Our performance characterization and evaluation methodology is based on hardware measurements on Intel platforms - using performance counters (EMON) and a sophisticated profiler (VTune [11]) that allows us to characterize various regions of software execution. The code generation enhancements that we implement and evaluate include (1) code quality improvements such as peephole optimizations, (2) dynamic code optimizations, (3) parallel garbage collection and (4) fine-grained locks. The outcome of our work is the detailed analysis and breakdown of benefits based on these individual optimizations added to the JVM. The rest of this paper is organized as follows. Section 2 covers a detailed overview of the BEA JRockit JVM, the measurement-based characterization methodology and the SPECjbb2000 benchmark. Section 3 discusses the opti-¤ ¥£2 Background and MethodologyIn this section, we present a detailed overview of JRockit (the commercial JVM used) [3], SPECjbb2000 (the Java server benchmark) [17] and the optimization and performance evaluation methodology and tools.Figure 1. The SPECjbb2000 Benchmark Process2.1 Architecture of the JRockit JVMThe goal of the JRockit project is to build a fast and efficient JVM for server applications. The virtual machine should be made as platform independent as possible without sacrificing platform specific advantages. Some of the considerations included reliability, scalability, nondisruptiveness and of course, high performance. JRockit starts up differently from most ordinary Java JVMs by first JIT-compiling the methods it encounters during startup. When the Java application is running, JRockit has a bottleneck detector active in the background to collect runtime statistics. If a method is executed frequently and found to be a bottleneck, it is sent to the Optimization Manager subsystem for aggressive optimization. The old method is replaced by the optimized one while the program is running. In this way, JRockit is using adaptive optimization to improve code performance. JRockit relies upon a fast JIT-compiler for unoptimized methods, as opposed to interpretative byte-code execution. Other JVMs such as Jalapeno/Jikes [23] have used similar approaches. It is important to optimize the garbage collection mechanism in any JVM in order to avoid disruption and provide maximum performance to the Java application. JRockit provides several alternatives for garbage collection. The ”parallel collector” utilizes all available processors on the host computer when doing a garbage collection. This means that the garbage collector runs on all processors, but not concurrently with the Java program. JRockit also has a concurrent collector which is designed to run without ”stopping the world”, if non-disruptiveness is the most important factor. 
To complete the server side design, JRockit also contains an advanced thread model, that makes it possible to run several thousands of Java threads as light weight tasks in a very scalable fashion.The database component requirement common to three-tier workloads is emulated using binary trees of objects. The clients are similarly replaced by driver threads. Thus, the whole benchmark runs on a single computer system, and all the three tiers run within the same JVM. The benchmark process is illustrated in Figure 1. The SPECjbb2000 application is somewhat loosely based on the TPC-C [20] specification for its schema, input generation, and operation profile. However, the SPECjbb2000 benchmark only stresses the server-side Java execution of incoming requests and replaces all database tables with Java classes and all data records with Java objects. Unlike TPC-C where the database execution requires disk I/O to retrieve tables, in SPECjbb2000 disk I/O is completely avoided by holding the objects in memory. Since users do not reside on external client systems, there is no network IO in SPECjbb2000 [17]. SPECjbb2000 measures the throughput of the underlying Java platform, which is the rate at which business operations are performed per second. A full benchmark run consists of a sequence of measurement points with an increasing number of warehouses (and thus an increasing number of threads), and each measurement point is work done during a 2-minute run at a given number of warehouses. The number of warehouses is increased from 1 until at least 8. The throughputs for all the points from N warehouses to 2*N inclusive warehouses are averaged, where N is the number of warehouses with best performance. This average is the SPECjbb2000 metric.2.3 Performance Optimization and Evaluation MethodologyThe approach that we have taken is evolutionary. Beginning with an early version of JRockit, performance was analyzed and potential improvements were identified. Appropriate changes were made to the JVM and the new version of the JVM was then tested to verify that the modifications did deliver the expected improvements. The new version of the JVM was then analyzed in its turn for the next stage of performance optimizations. The types of performance optimizations that we investigated were two-fold. Changes were made to the JIT so that the quality of the generated code was superior, and changes were made to other parts of the JVM, particularly to the Garbage Collector, Object Allo-2.2 Overview of the SPECjbb2000 BenchmarkSPECjbb2000 is Java Business Benchmark from SPEC that evaluates the performance of Server Side Java. It emulates a three-tier system, with business logic and object manipulation, the work of the middle layer predominating.$ ' i7 f  8"" h g`e V (d"cb #! '6 § a X  & !! ' § $ '$1 Y "! (#© W"   %$V T S RE BIHG FE D A U(QP 3(#3 BC 3 9@  &! ¦ &6 4 $ $© #! (87 #(¨5$ #! 321 " # 0 $ ' ' ' "#) ( & $ #!  %© $ #"  !   © §   ¨¦mizations - how they were identified, implemented and their performance evaluation. Section 4 summarizes the breakdown of the performance benefits and where they came from. Section 5 concludes this paper with some direction on future work in this area.Figure 2. 
Processor Scaling on an early JRockit JVM version2.4.2 VTune Performance Monitoring Tool Intel’s VTune performance tools provide a rich set of features to aid in performance analysis and tuning: (1) Timebased and event-based sampling, (2) Attribution of events to code locations, viewed in source and/or assembly, (3) Call graph analysis and (4) Hot spot analysis with the AHA tool, which indicates how the measured ranges of event count values compare with other applications, and which provides some guidance on what other events to collect and how to address common performance issues. One of the key tools provides the means for providing the percentage contribution of a small instruction address range to the overall program performance, and for highlighting differences in performance among versions of applications and different hardware platforms.2.4 Overview of Performance Tools - EMON and VTuneThis section describes the rich set of event monitoring facilities available in many of Intel’s processors, commonly called EMON, and a powerful performance analysis tool based on those facilities, called VTune [11].2.4.1 EMON Hardware and Events Used The event monitoring hardware provides several facilities including simple event counting, time-based sampling, event sampling and branch tracing. A detailed explanation of these techniques is not within the scope of this paper. Some of th key EMON events leveraged in our performance analysis include (1) Instructions – the number of instructions architecturally retired, (2) Unhalted cycles – the number of processor cycles that the application took to execute, not counting when that processor was halted, (3) Branches – the number of branches architecturally retired which are useful for noting reductions in branches due to optimizations, (4) Branch Mispredictions – the number of branches that experienced a performance penalty on the order of 50 clocks, due to a misprediction, (5)Locks – the number of locked cmpxchg instructions, or instructions with a lock prefix and (6) Cache misses – the number of misses and its breakdown at each level of the cache hierarchy. The reader is referred to the Pentium 4 Processor Optimization Guide [24] for more details on these events.3 JVM Optimizations and Performance ImpactIn this section we describe the various JVM improvements that we studied and document their impact on performance. We also show the analysis of JVM behavior and the identification of performance inhibitors that informed the improvements that were made.3.1 Performance Characteristics of an early JVMThe version of JRockit with which we began our experiments was a complete JVM in the sense that all of the required JVM components were functional. Unlike several other commercial JVMs though, JRockit does not include an interpreter. Instead, all application code is compiled before execution. This could slow down the start of an application slightly, but this approach enables greater performance. JRockit also included a selection of Garbage Collectors and two threading models. Figure 3 shows the performance for increasing numbers of warehouses for a 1-processor and a 4-processor system.d–‘  ‰ ˆ ‡ q y „  y t w q s v v u t sq ’cY¨† u ¥Y ‚ƒ €xYdg"rp d • d (™ “ • Q“ ” – Q“ ” — Q“ ” ™ ˜ ”Q“ ™ ”– ”• ™ hf ge kf ij fm l f hqr omp ncator and synchronization, to enhance the processor scaling of the system. Our experiments were conducted on a 4 processor, 1.6 GHz, Xeon platform with 4GB of memory. The processors had a 1M level-3 cache along with a 256K level-2 cache. 
The processors accessed memory through a shared 100 MHz, quad-pumped, front side bus. The network and disk I/O components of our system were not relevant to studying the performance of SPECjbb2000, since this benchmark does not require any I/O. Several performance tools assisted us in our experiments. Perfmon, a tool supplied with Microsoft’s operating systems, was useful in identifying problems at a higher level, and allowed us to look at processor utilization patterns, context switch rates, frequency of system calls and so on. EMON gave us insight into the impact of the workload on the underlying micro-architecture and into the types of processor stalls that were occurring, and that we could target for optimizations. VTune permitted us to dig deeper by identifying precisely the regions of the code where various processor micro-architecture events were happening. This tool was also used to study the generated assembly code. The next section describes the performance tools – EMON and VTune – in some more detail.Table 1. System Performance Characteristics for early JVMFigure 3. Performance Scaling with Increasing WarehousesThere is a marked roll-off in performance from the peak at 3 warehouses in the 4-processor case. The JVM can thus be seen to be having some difficulty with increasing numbers of threads. Data obtained using Perfmon is shown in Table 1. While the utilization of the 1 processor is quite good at 94%, the processor utilization in the 4-processor case is only 56%. It is clear that improvements are needed to increase the processor utilization. The context switch and system call rates are two orders of magnitude larger in the 4P than in the 1P. The small processor queue length indicates the absence of work in the system. These aspects along with the sharp performance roll-off with increased threads, all point to a probable issue related to synchronization. It appears likely that one or more locks are being highly contended, resulting in a large number of the threads being in a state of suspension waiting for the lock. While being fully functional, this version of JRockit (we call it early JVM) had not been optimized for performance. It thus served as an excellent test-bed for our studies. The processor scaling seen with the initial, nonoptimized early JVM is shown in Figure 2. It is obvious that we can do much better on scaling. Many other statically compiled workloads exhibit scaling of 3X or better from 1 processor to 4 processors, for instance.œ™ Qœ 8 Ÿ¥ ww¦ ˜—¦ 53wQŸ œ¦Ÿ wQQ3— ¥ › œ œ ™œœ wœ 8w› ¥™ Q 8š œ ™œœ wœ 8w› ‰ˆ s ‚ 3u‡€Š ֘™ º› w› —š QQŸ ›——œ 8QwQŸ œœ› w8w ¥ › œ ¦™Ÿ 5ž ¨w Ÿ™ wž 8¦ ™ž 5Ÿ ¨w †…ƒ ‚ €‹„¨ŠÎ Í Ì Ë Ê ÈÄ Å Ç À Æ Å Ä Ã¿ ¼ Á ¿ ¾ ¼ „u€‹8É ‹½ W‹€½ Y"€‹ÂÀ 8 ¼½ g»â ÛâØ Ýá Ü à ß ÞÝ Ü ÛÚ ÙØ Q"#¨3¨Û ¨ÂQ38¨¤w× Õ Ñœ™ wœ ¨— ›— ¨#› ¥ Qw¦ ž¥ž wQQ˜ ¥ œ œ ™œœ Qœ 8Q#› ˜™¦ w˜ 8› œ ™œœ Qœ 8Q#› ‰ˆ s ‚ wu‡wš™ 5Ÿ 8 ¦ Q— ¥ Q— œ˜ 53¦ ¥ œ ™¦ w¥ 8w— ™ Q¥ 8œ › ™˜ 8š €— †…ƒ ‚ ¨g„w Ô¯ 3¢#¨#„#8xQ83#Q(w ‹‘ Ž ©”¹ ”ª”ª ¸’••”“ ’ “” µ Ž« ’ Ž 5#• #• #w ’ª 8” ( ¨© “” µ•”¯ · Ž¶ Ž© Q• (Q#w“ Ž   ¨± #3” #5° “” µ  ° ³ Ž ² w• (•  5´5” (• €± Ž(Qª5|35” QgQŽ 33wgQ¬ ©  ° £ ’¯ Œ  ®“ ­ – 3 ’ª ¨” (© 8 ‹Œ Ž« ’ Ž ¨ Ž –’”• § Ž 5¨#8 ‹Œ – £”¢   ’ ‘ Ž ¤wQw”  Q¡   #8 ‹Œ –’••”“ ’ ‘ Ž 853#w#w #8 ‹Œ v€ x v ~}{ vt wwt 8t w Y5| xyz wus Ò Ï ÒÑ ÐwÏ Ò Ó ä Ñ QÓ Ð æä åã éä çè äë ê ÐÑ æïðíëî ì3.2 Granularity of Heap LocksThe early version of JRockit performed almost all object allocation globally with all allocating threads increasing a pointer atomically to allocate. 
In order to avoid this contention, thread local allocation and Thread Local Areas (TLAs) were introduced. In that scheme, each thread has its own TLA to allocate from and the atomic operation for increasing the current pointer could be removed, only the allocation of TLAs required synchronizations. A chain is never stronger than its weakest link, once a contention on a lock or an atomic operation is removed, the problem usually pops up somewhere else. The next problem to solve was the allocation of the TLAs. For each TLA that was allocated, the allocating thread had to take a ”heap lock”, find a large enough block on a free list and release the lock. The phase of object allocation that requires space to be allocated from a free list requires a lock. This lock acquisition and release showed up on all our measurements with VTune as a hot spot, marking it as a highly contended lock. One attempt was made to reduce the contention of this lock by letting the allocating thread allocate a couple of TLAs and putting them in a smaller temporary storage where they could be allocated using only atomic operations by other threads. This attempt was a dead end. Even if the thread that had the heap lock put a large amount of TLAs in the temporary storage, all threads still ended up waiting most of the time, either for the heap lock or that the holder of the heap lock would give away TLAs. The final solution was to create several TLA free lists. Each thread has a randomly allotted ”home” free list from which to allocate the TLAs it needs. If the chosen list was empty, the allocating thread tried to take the heap lock and fill that particular free list with several TLAs. After this, the thread would choose another ”home” free list randomly to allocate from. By having several lists, usually only one thread would try to take the heap lock at the same time and the contention of the heap lock was reduced dramatically. Contention was further reduced by providing a TLA cache; the thread that acquires the heap lock moves 1MB of memory into the cache. A thread that finds its TLA free list empty checks for TLAs in the cache before taking the heap lock. Figure 4 shows the marked improvement in processor scaling in the modified JVM, the JVM with the heap lock contention reduction. Scaling at 2 processors has increased from 1.08X to 1.70X, and the scaling at 4 processors has improved to 2.46X from 1.29X. The perfmon data with these changes is interesting, and is shown in Table 2. The increase in processor utilization and the decrease in system calls and context switches are all very dramatic.3.3 Garbage Collection OptimizationsThe early version of JRockit included both a single and multi generational concurrent garbage collector, designed toFigure 4. Improvement of Processor Scaling with Heap Lock contention scalingˆ ¦ˆ — ‡  S¡— ‡ ƒ ‡ S¡‡ — ƒ ‰ ¡¦©‡ ˆ ‡ ˆ ˆ ˆ ˆ ¡ˆ ¡‡ ˆ ¦‘ ƒ ˆ ‘ ¡ƒ ¡‰ c h ` FHg Y 9 X W i @ f¡pF¦R ™ y‡ F‡ — „ ¦¡† ˆ ¦‘ ˆ ‘ ¡™ ˆ ‡ ˆ ™ ‡ F„ FS‰ ˆ ¡— ˆ ƒ ‡ ¡ FS‰ c b a dSF` Y 9 X W i @ f¡pF¦R ˆ ¡ˆ ƒ † ‰ F¡— ƒ ™  — ¡¦† ˆ — ƒ † ¡¡¦F ‰ ©‡ ˆ ˆ ˆ ˆ ¦ˆ ¡‡ ‰ ¡ƒ ‘ ˆ ˆ ˆ ¦ˆ ¡‡ c h ` FHg Y 9 X W T f¦HV eG ™ y‡ F‡  ‘ ¦¦† ‡   ˆ S¦¡¦† ˆ ˆ ‡ ƒ ¡S¡ƒ ‰ ©‡ ˆ — † „ S¡ƒ † ¡„ — ƒ „ † S¡ƒ c b a dSF` Y 9 X W T 4¡HV UGFigure 5. Impact of Parallel Garbage Collection on Processor Scalinghave really short pause times and fair throughput. Throughput in a concurrent collector is usually not a problem since a full collection is rarely noticed, even less on a multiprocessor system. 
The problem occurs when objects are allocated in such a fast rate that even if the garbage collector collects all the time on one processor and lets the other processors run the program, the collector still doesn’t manage to keep up the pace. This problem started to hurt performance badly in JRockit when running 8 warehouses on 8-way systems. To solve this, the so-called ”parallel collector” was developed. The base was a normal Mark and Sweep [13] collector with one marking thread per processor. Each thread had its own marking stack, and if a stack is empty the thread could work-steal references from other stacks [5]. Normal pushing and popping required no synchronization or atomic operations, only the work-stealing required one atomic operation. Each thread also had an expandable local stack to handle overflow in the exposed marking stack. Sweeping is also done in parallel by splitting the heap in N sections and letting each thread allocate a section, sweep it, allocate a new section and so forth until all sections were swept. The sweeping algorithm focused on performance more than accuracy, creating room for fragmentation if we were unlucky. A partial compaction scheme was employed to reduce this fragmentation. These GC optimizations resulted in an increase in the re-Figure 6. Impact of Parallel Garbage Collection on SPECjbb2000 Performanceported SPECjbb2000 result in a 4P system, and improved processor scaling from 2.46 to 2.92, as illustrated in Figure 5. The benefits of this were more noticeable at higher numbers of warehouses and therefore lead to a much flatter roll-off from the peak, as shown in Figure 6.3.4 Code Quality ImprovementsSeveral code quality improvements were made during the benchmarking process. A new code generation pipeline was developed and merged into the product. This enabled us to do a lot more versatile and low-level optimizations on code than previously was possible. Based on the SPECjbb2000 characteristics measured and analyzed in the previous section, we were able to identify several patterns at the native code level that were suboptimal. The JRockit team replaced these with better code through peephole optimizations (commonly used for compiler optimizations as in [6, 14]) or more efficient code generation methodologies. While the compiler optimizations listed below are well-known and understood, the requirement here is¾ Àà ¿Ð É Ð Æ Ë Ï Ê Î Í Ì Ë Ê É È Ç Æ ¦¦¡edÉ d8U¦ed¦½S4ž ©¿Ã¾Table 2. System Performance Characteristics after Heap Lock Improvements S‘¼ » º º ¹ ¸ · ² ¶ µ ´ ³ ² ½S‚³ º ‚¸ ‚|QƒSe2² f±‰ † ~ „ { }   € ~ { Qˆ ‡ fdm4ƒ‚eUU} |z  SŽ °ƒ¤ £  ¦¡ ¡  ¯ £ £ ¢ ®y­¡¬Fm¦F  ¨ « ª © ¡ § ¦ ¥ ¢ ¡ 4y| £¤ ¡     Œ ¦ ‹ Œ ¦Š ‹  Œ Ž ‹ Œ © ‹ Š Ž ¿ à Á ¾ Ä À Â × ÕÖ Ò Ù ØÒ ÞÒ ÜÔÝ Ù ÚÛ ÓÔ ÑÒ Ÿ“ •ž š ›œ š ™“ “ –—˜ ”• ’“ 3428¤©5 1) 7 6 3 1) ' 420( ¦& ¨ þ ûúõùòôøø÷õ ò Y üý YuóYöô óñ¨ ¥ k s¡”©S©x©Sv¦S¡¤¡r ev e y w y f y f u w r € € y x w x y r s g w s ¦©€ ©€ ¡f w dy ©e d x y r € y k ¤€ ©¡©¡x ’s dn Fy ©4l t s s s e r x y r u t l p s o ¦€ ¤ u€ 4qy ©€ dn s©e¡fS4mFy ¦Q¡s FF¡Q¦h r l • t w k q ’ j x r i ‚ ¡f w Sy ©e ©t ¡Uq s g w s d u s r ‚ w y € ˜ u s r SSe©t ¡Uq ‚ • y ” ’ w v u s r –¡¡¡ uy ¦ ’“ e©t ¡Uq ‚ w r € € y x w v u s r S¡©¡r e©t ¡Uq A E B A P I G E A @ ¡¡@ S@ ¡R QHFD C B ¡09 ¨ ©£ £ §ÿ¥ ¢¡   ¢ ¤£  ¢ ¦¥  ÿ   $% #    !"Figure 8. An Example of Copy PropagationP £b`I00D qik jbD ih IiUtiqQ(V @ H 8 k 8 m A l G A @ A 8 H 8 X 8 h DFigure 7. 
A Simple Example of Peep-Hole Optimization• ”5Ii˜—• U5y d ™ € – ” “ ‡ "y †  u € y w v u s %iixItr @ P A @ RQ8 I8 0H G F D B A @ 8 0EC 097that the compile time overhead be kept to a minimum since it is a part of the execution time; and as such, not all known optimizations and techniques could be added. These are by no means a complete list of improvements, but give some perspective on things that were done to enhance code quality. 1. Peephole Optimizations: The new JRockit code generator made it possible to work with native code just before emission, i.e. there would be IR operations for each native code operation. Several small peephole optimizations were implemented on this. We present one example of this kind of pattern matching here: Java contains a lot of load/store patterns, where a field is loaded from memory, modified and then rewritten. Literal translation of a Java getfield/putfield sequence would result in three instructions on IA32 as shown in Figure 7(left). IA32 allows most operations to operate directly on addresses, so the above sequence could be collapsed to a single instruction as shown in Figure 7(right). 2. Better use of IA32 FPU instructions: Java has precise floating-point semantics, and works either in 32bit or 64-bit precision. This is usually a problem if one wants to use fast 80-bit floating points that there is hardware support for on IA32, but in some cases we don’t need fp-strict calculations and can use built in FPU instructions. JRockit was modified to determine when this is possible. 3. Better SSA reverse transform: Most code optimizations take place in SSA form. There were some problems with artifacts in the form of useless copies not being removed from the code when transforming back to normal form. The transform was modified to get rid of these, with good results. Register pressure dropped significantly for optimized code. 4. Faster checks: The implementation of several Java runtime checks was speeded up. Some Java runtime checks are quite complicated, such as the non-trivial case of an array store check. These were treated as special native calls, but without using all available registers. Special interference information for these simplified methods was passed to the register allocator, enabling less saves and restores of volatile registers.Table 3. Impact of Better Code Generation on Application Performance5. Specializations for common operations: Array allocation was re-implemented with specialized allocation policies for individual array element sizes. The Java ”arraycopy” function was also specialized, depending on if it was operating on primitives or references and on elements of specific sizes. Other common operations were also specialized. 6. Better Copy Propagation: The copy propagation algorithm was improved and also changed to work on the new low level IR, with all its addressing modes and operations. An example of better copy propagation is shown in Figure 8. These improvements to the JIT were undertaken to reduce the code required to execute an application. It is possible that the techniques used to lower the path length could increase the CPI of the workload, and end up hurting throughput. One example of this would be the usage of a complex instruction to replace a set of simpler instructions. However, Table 3 shows that while the efforts to reduce the path length were well rewarded with a 27% improvement for SPECjbb2000, these optimizations did not hurt the CPI in any significant way. 
The path length improvement resulted in a 34% boost to the reported SPECjbb2000 result.3.5 Dynamic OptimizationThe initial compile time that is tolerable limits the extent to which compiler optimizations can be applied. This implies that while JRockit provides better code in general than an interpreter, for the few functions that other JITs do choose to compile, there is a risk of under-performance. JRockit has chosen to handle this issue by providing a secondary compilation phase that can include more sophisticated optimizations, and using this secondary compilation during the application run to compile a few frequently used hot functions.© 3  § 1 ' )  ' !   © $ # !   § ¥ 24 6% ¡0%("&%"¡© ¨5¤      § ¥ © 2© ¨5¤ 8 p H h g V q0IiYfe e ‘ ˆ e ¡’ ‚ e i„ G X V A F dcbI8 @ 5a gUˆ ‚ ˆ ‘ q‰ ƒ ˆ ƒ £„ Q‚G X V A DF `YW8 UT08 SfI0qQ e ‘  U‰ ƒ ˆ ‚     § ¥ © © ¨¦¤ © 3  § 1 ' )  ' !   © $ # !   § ¥ 4 2 ¡&0("&%"¡© ¨¦¤      § ¥ © © ¨¦¤ü ¢   ÿ ú ô ó þ ý û úù ÷ õ ô 0£¡U0F|ü ‚2øöóë ð ï î é ä à å ä 4S©Q‚ƒíß ã Qà ä ì ê è æ å ä fë dé yç ƒƒpß ã ‚áß â à ä òß ã ©fë dé yç ñ‚áß å ì ê è æ â à。

The MESI Cache Coherence Protocol and Memory Barriers

I. A brief introduction to CPU caches

The CPU cache was introduced mainly to bridge the gap between ever faster CPUs and comparatively slow main-memory access. A CPU has only a limited number of registers, and when executing memory-addressing instructions it frequently has to read the data an instruction needs from memory or write register contents back to memory. Because memory access is very slow relative to the CPU itself, the CPU can only wait during these accesses, and machine efficiency suffers. Designers therefore placed a cache between the CPU and memory.

Registers have tiny capacity but extremely fast access; main memory has large capacity but, compared with registers, very slow access. The cache's capacity and access speed lie between the two, so it acts as a buffer that bridges the large speed gap between registers and main memory.

引⼊⾼速缓存后,CPU在需要访问主存中某⼀地址空间时,⾼速缓存会拦截所有对于内存的访问,并判断所需数据是否已经存在于⾼速缓存中。

如果缓存命中,则直接将⾼速缓存中的数据交给CPU;如果缓存未命中,则进⾏常规的主存访问,获取数据交给CPU的同时也将数据存⼊⾼速缓存。

但由于⾼速缓存容量远⼩于内存,因此在⾼速缓存已满⽽⼜需要存⼊新的内存映射数据时,需要通过某种算法选出⼀个缓存单元调度出⾼速缓存,进⾏替换。

由于对内存中数据的访问具有局部性,使⽤⾼速缓存能够极⼤的提⾼CPU访问存储器的效率。
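To make the locality point concrete, a minimal C sketch can sum the same matrix twice, once row by row and once column by column. The row-major walk reuses each cache line it loads; the column-major walk strides across lines and usually runs several times slower. The matrix size and the timings are illustrative only and depend entirely on the machine.

#include <stdio.h>
#include <time.h>

#define N 4096

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    static int a[N][N];               /* ~64 MB, far larger than typical caches */
    long sum = 0;
    double t;

    for (int i = 0; i < N; i++)       /* fill with something non-trivial */
        for (int j = 0; j < N; j++)
            a[i][j] = i ^ j;

    t = now_sec();
    for (int i = 0; i < N; i++)       /* row-major walk: sequential, cache friendly */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-major:    %.3f s\n", now_sec() - t);

    t = now_sec();
    for (int j = 0; j < N; j++)       /* column-major walk: large strides, poor locality */
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-major: %.3f s\n", now_sec() - t);

    return (int)(sum & 1);            /* keep the sums from being optimized away */
}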

2. The cache coherence problem

Consistency between the cache and memory: a cache hit means that memory and the cache now hold two copies of the same data. When the CPU executes an instruction that modifies data in memory and the access hits in the cache, only the copy in the cache is modified, and at that point the cache and memory disagree.

In early single-core systems this inconsistency did not look like a serious problem, since every memory operation came from the one and only CPU. But even on a single core, advanced hardware designs introduced the DMA mechanism to take I/O load off the CPU and improve I/O efficiency. A DMA chip accesses memory directly while it works; if the CPU has modified the cache first so that it no longer matches memory, the content that DMA actually writes back to disk will differ from what the program intended to write.
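A common software-side answer to this DMA problem is to write dirty cache lines back to memory before the device reads a buffer, and to invalidate the lines before the CPU reads data the device has produced. The sketch below only shows this ordering; cache_wb_range, cache_inv_range and the dma_* calls are invented stand-ins (stubbed out so the sketch compiles) for whatever primitives a real platform provides.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* No-op stand-ins: a real platform supplies cache-maintenance and DMA
 * primitives; the names below are invented for this sketch. */
static void cache_wb_range(void *p, size_t len)  { (void)p; (void)len; }
static void cache_inv_range(void *p, size_t len) { (void)p; (void)len; }
static void dma_to_device(const void *p, size_t len) { (void)p; (void)len; }
static void dma_from_device(void *p, size_t len)     { (void)p; (void)len; }

/* CPU -> device: write dirty lines back so the DMA engine reads what the
 * CPU actually wrote, not an older copy still sitting only in the cache. */
static void send_buffer(void *buf, size_t len) {
    cache_wb_range(buf, len);
    dma_to_device(buf, len);
}

/* Device -> CPU: invalidate the range so later CPU loads miss the cache
 * and fetch the fresh data the device placed in RAM. */
static void recv_buffer(void *buf, size_t len) {
    cache_inv_range(buf, len);
    dma_from_device(buf, len);
}

int main(void) {
    char buf[256];
    memset(buf, 0xAB, sizeof buf);
    send_buffer(buf, sizeof buf);
    recv_buffer(buf, sizeof buf);
    puts("ordering demo only; the cache/DMA calls are stubs");
    return 0;
}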

lmbench Portable Tools for Performance Analysis

The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical ConferenceSan Diego, California, January 1996 lmbench: Portable Tools for Performance AnalysisLarry McVoy, Silicon GraphicsCarl Staelin, Hewlett-Packard LaboratoriesFor more information about USENIX Association contact:1. Phone:510 528-86492. FAX:510 548-57383. Email:office@4. WWW URL:lmbench: Portable tools for performance analysisLarry McV oySilicon Graphics, Inc.Carl StaelinHewlett-Packard LaboratoriesAbstractlmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simu-lations, software development, and networking. In almost all cases, the individual tests are the result of analysis and isolation of a customer’s actual perfor-mance problem.These tools can be, and currently are, used to compare different system implementations from different vendors. In several cases, the bench-marks have uncovered previously unknown bugs and design flaws. The results have shown a strong correla-tion between memory system performance and overall performance.lmbench includes an extensible database of results from systems current as of late 1995.1. Introductionlmbench provides a suite of benchmarks that attempt to measure the most commonly found perfor-mance bottlenecks in a wide range of system applica-tions. These bottlenecks have been identified, iso-lated, and reproduced in a set of small micro-benchmarks, which measure system latency and band-width of data movement among the processor and memory,network, file system, and disk.The intent is to produce numbers that real applications can repro-duce, rather than the frequently quoted and somewhat less reproducible marketing performance numbers.The benchmarks focus on latency and bandwidth because performance issues are usually caused by latency problems, bandwidth problems, or some com-bination of the two. 
Each benchmark exists because it captures some unique performance problem present in one or more important applications.For example, the TCP latency benchmark is an accurate predictor of the Oracle distributed lock manager’s performance, the memory latency benchmark gives a strong indication of Verilog simulation performance, and the file system latency benchmark models a critical path in software development.lmbench was dev e loped to identify and evaluate system performance bottlenecks present in many machines in 1993-1995.It is entirely possible that computer architectures will have changed and advanced enough in the next few years to render parts of this benchmark suite obsolete or irrelevant.lmbench is already in widespread use at many sites by both end users and system designers.In some cases,lmbench has provided the data necessary to discover and correct critical performance problems that might have gone unnoticed.lmbench uncovered a problem in Sun’s memory management software that made all pages map to the same location in the cache, effectively turning a 512 kilobyte (K) cache into a 4K cache.lmbench measures only a system’s ability to transfer data between processor,cache, memory,net-work, and disk.It does not measure other parts of the system, such as the graphics subsystem, nor is it a MIPS, MFLOPS, throughput, saturation, stress, graph-ics, or multiprocessor test suite.It is frequently run on multiprocessor (MP) systems to compare their perfor-mance against uniprocessor systems, but it does not take advantage of any multiprocessor features.The benchmarks are written using standard, portable system interfaces and facilities commonly used by applications, so lmbench is portable and comparable over a wide set of Unix systems. lmbench has been run on AIX, BSDI, HP-UX, IRIX, Linux, FreeBSD, NetBSD, OSF/1, Solaris, and SunOS. Part of the suite has been run on Win-dows/NT as well.lmbench is freely distributed under the Free Software Foundation’s General Public License [Stall-man89], with the additional restriction that results may be reported only if the benchmarks are unmodified.2. Prior workBenchmarking and performance analysis is not a new endeavor. There are too many other benchmark suites to list all of them here.We compare lmbench to a set of similar benchmarks.•I/O (disk) benchmarks:IOstone [Park90]wants to be an I/O benchmark, but actually measures the mem-ory subsystem; all of the tests fit easily in the cache. IObench [Wolman89]is a systematic file system and disk benchmark, but it is complicated and unwieldy. In [McV oy91]we reviewed many I/O benchmarks and found them all lacking because they took too long to run and were too complex a solution to a fairly simpleproblem. We wrote a small, simple I/O benchmark, lmdd that measures sequential and random I/O far faster than either IOstone or IObench.As part of [McV oy91]the results from lmdd were checked against IObench (as well as some other Sun internal I/O benchmarks).lmdd proved to be more accurate than any of the other benchmarks.At least one disk vendor routinely uses lmdd to do performance testing of its disk drives.Chen and Patterson [Chen93, Chen94]measure I/O per-formance under a variety of workloads that are auto-matically varied to test the range of the system’s per-formance. 
Our efforts differ in that we are more inter-ested in the CPU overhead of a single request, rather than the capacity of the system as a whole.•Berkeley Software Distribution’s microbench suite:The BSD effort generated an extensive set of test benchmarks to do regression testing (both quality and performance) of the BSD releases.We did not use this as a basis for our work (although we used ideas) for the following reasons: (a) missing tests — such as memory latency, (b) too many tests, the results tended to be obscured under a mountain of numbers, and (c) wrong copyright — we wanted the Free Software Foundation’s General Public License.•Ousterhout’s Operating System benchmark: [Ousterhout90]proposes several system benchmarks to measure system call latency, context switch time, and file system performance.We used the same ideas as a basis for our work, while trying to go farther.We measured a more complete set of primitives, including some hardware measurements; went into greater depth on some of the tests, such as context switching; and went to great lengths to make the benchmark portable and extensible.•Networking benchmarks:Netperf measures net-working bandwidth and latency and was written by Rick Jones of Hewlett-Packard.lmbench includes a smaller,less complex benchmark that produces similar results.ttcp is a widely used benchmark in the Internet com-munity.Our version of the same benchmark routinely delivers bandwidth numbers that are within 2% of the numbers quoted by ttcp.•McCalpin’s stream benchmark:[McCalpin95]has memory bandwidth measurements and results for a large number of high-end systems.We did not use these because we discovered them only after we had results using our versions. We will probably include McCalpin’s benchmarks in lmbench in the future.In summary,we rolled our own because we wanted simple, portable benchmarks that accurately measured a wide variety of operations that we con-sider crucial to performance on today’s systems. While portions of other benchmark suites include sim-ilar work, none includes all of it, few are as portable,and almost all are far more complex. Lessfilling, tastes great.3. Benchmarking notes3.1. Sizing the benchmarksThe proper sizing of various benchmark parame-ters is crucial to ensure that the benchmark is measur-ing the right component of system performance.For example, memory-to-memory copy speeds are dramat-ically affected by the location of the data: if the size parameter is too small so the data is in a cache, then the performance may be as much as ten times faster than if the data is in memory.On the other hand, if the memory size parameter is too big so the data is paged to disk, then performance may be slowed to such an extent that the benchmark seems to ‘neverfinish.’lmbench takes the following approach to the cache and memory size issues:•All of the benchmarks that could be affected by cache size are run in a loop, with increasing sizes (typ-ically powers of two) until some maximum size is reached. 
The results may then be plotted to see where the benchmark no longer fits in the cache.•The benchmark verifies that there is sufficient mem-ory to run all of the benchmarks in main memory.A small test program allocates as much memory as it can, clears the memory,and then strides through that memory a page at a time, timing each reference.If any reference takes more than a few microseconds, the page is no longer in memory.The test program starts small and works forward until either enough memory is seen as present or the memory limit is reached.3.2. Compile time issuesThe GNU C compiler,gcc,is the compiler we chose because it gav e the most reproducible results across platforms.When gcc was not present, we used the vendor-supplied cc.All of the benchmarks were compiled with optimization-O except the benchmarks that calculate clock speed and the context switch times, which must be compiled without opti-mization in order to produce correct results.No other optimization flags were enabled because we wanted results that would be commonly seen by application writers.All of the benchmarks were linked using the default manner of the target system.For most if not all systems, the binaries were linked using shared libraries.3.3. Multiprocessor issuesAll of the multiprocessor systems ran the bench-marks in the same way as the uniprocessor systems. Some systems allow users to pin processes to a partic-ular CPU, which sometimes results in better cache reuse. We do not pin processes because it defeats theMP scheduler.In certain cases, this decision yields interesting results discussed later.3.4. Timing issues•Clock resolution :The benchmarks measure the elapsed time by reading the system clock via the gettimeofday interface. On some systems this interface has a resolution of 10 milliseconds, a long time relative to many of the benchmarks which have results measured in tens to hundreds of microseconds.To compensate for the coarse clock resolution, the benchmarks are hand-tuned to measure many opera-tions within a single time interval lasting for many clock ticks.Typically,this is done by executing the operation in a small loop, sometimes unrolled if the operation is exceedingly fast, and then dividing the loop time by the loop count.•Caching :If the benchmark expects the data to be in the cache, the benchmark is typically run several times; only the last result is recorded.If the benchmark does not want to measure cache per-formance it sets the size parameter larger than the cache. 
For example, the bcopy benchmark by default copies 8 megabytes to 8 megabytes, which largely defeats any second-level cache in use today.(Note that the benchmarks are not trying to defeat the file or process page cache, only the hardware caches.)•Variability :The results of some benchmarks, most notably the context switch benchmark, had a tendency to vary quite a bit, up to 30%.We suspect that the operating system is not using the same set of physical pages each time a process is created and we are seeing the effects of collisions in the external caches.We compensate by running the benchmark in a loop and taking the minimum ers interested in the most accurate data are advised to verify the results on their own platforms.Many of the results included in the database were donated by users and were not created by the authors.Good benchmarking practice suggests that one shouldName Vender Multi Operating SPEC List used &model or Uni System CPU Mhz Year Int92price IBM PowerPC IBM 43P Uni AIX 3.?MPC604 133’95 17615k IBM Power2 IBM 990 Uni AIX 4.?Power2 71’93 126110k FreeBSD/i586 ASUS P55TP4XE Uni FreeBSD 2.1Pentium 133’95 1903k HP K210HP 9000/859MP HP-UX B.10.01 PA 7200 120’95 16735k SGI Challenge SGI Challenge MP IRIX 6.2-αR4400 200’94 14080k SGI Indigo2SGI Indigo2Uni IRIX 5.3 R4400200 ’94135 15k Linux/Alpha DEC Cabriolet Uni Linux 1.3.38Alpha 21064A 275 ’94189 9k Linux/i586 Triton/EDO RAM Uni Linux 1.3.28 Pentium 120 ’95155 5k Linux/i686 Intel Alder Uni Linux 1.3.37Pentium Pro 200 ’95˜320 7k DEC Alpha@150DEC 3000/500Uni OSF13.0 Alpha 21064 150’93 8435k DEC Alpha@300DEC 8400 5/300MP OSF13.2 Alpha 21164 300’95 341?250k Sun Ultra1Sun Ultra1Uni SunOS 5.5 UltraSPARC 167’95 25021k Sun SC1000Sun SC1000MP SunOS 5.5-βSuperSPARC 50’92 6535k Solaris/i686 Intel Alder Uni SunOS 5.5.1Pentium Pro 133 ’95˜215 5k Unixware/i686 Intel Aurora Uni Unixware 5.4.2Pentium Pro 200 ’95˜320 7kTable 1.System descriptions.run the benchmarks as the only user of a machine,without other resource intensive or unpredictable pro-cesses or daemons.3.5. Using the lmbench databaselmbench includes a database of results that is useful for comparison purposes.It is quite easy to build the source, run the benchmark, and produce a table of results that includes the run.All of the tables in this paper were produced from the database included in lmbench .This paper is also included with lmbench and may be reproduced incorporating new results. For more information, consult the file lmbench-HOWTO in the lmbench distribution.4. Systems testedlmbench has been run on a wide variety of plat-forms. This paper includes results from a representa-tive subset of machines and operating -parisons between similar hardware running different operating systems can be very illuminating, and we have included a few examples in our results.The systems are briefly characterized in Table 1.Please note that the list prices are very approximate as is the year of introduction.The SPECInt92 numbers are a little suspect since some vendors have been ‘‘optimizing’’for certain parts of SPEC.We try and quote the original SPECInt92 numbers where we can.4.1. Reading the result tablesThroughout the rest of this paper,we present tables of results for many of the benchmarks.All of the tables are sorted, from best to worst. Some tables have multiple columns of results and those tables are sorted on only one of the columns.The sorted col-umn’s heading will be in bold .5. Bandwidth benchmarksBy bandwidth, we mean the rate at which a partic-ular facility can move data. 
We attempt to measure the data movement ability of a number of differentfacilities: library bcopy,hand-unrolled bcopy, direct-memory read and write (no copying), pipes, TCP sockets, the read interface, and the mmap inter-face.5.1. Memory bandwidthData movement is fundamental to any operating system. In the past, performance was frequently mea-sured in MFLOPS because floating point units were slow enough that microprocessor systems were rarely limited by memory bandwidth.Today,floating point units are usually much faster than memory bandwidth, so many current MFLOP ratings can not be main-tained using memory-resident data; they are ‘‘cache only’’ratings.We measure the ability to copy, read, and write data over a varying set of sizes.There are too many results to report all of them here, so we concentrate on large memory transfers.We measure copy bandwidth two ways. Thefirst is the user-level library bcopy interface. The second is a hand-unrolled loop that loads and stores aligned 8-byte words. In both cases, we took care to ensure that the source and destination locations would not map to the same lines if the any of the caches were direct-mapped. In order to test memory bandwidth rather than cache bandwidth, both benchmarks copy an 8M1area to another 8M area.(As secondary caches reach 16M, these benchmarks will have to be resized to reduce caching effects.)The copy results actually represent one-half to one-third of the memory bandwidth used to obtain those results since we are reading and writing mem-ory.If the cache line size is larger than the word stored, then the written cache line will typically be read before it is written.The actual amount of mem-ory bandwidth used varies because some architectures have special instructions specifically designed for the bcopy function. Those architectures will move twice as much memory as reported by this benchmark; less advanced architectures move three times as much memory: the memory read, the memory read because it is about to be overwritten, and the memory written.The bcopy results reported in Table 2 may be correlated with John McCalpin’s stream [McCalpin95]benchmark results in the following man-ner: the stream benchmark reports all of the mem-ory moved whereas the bcopy benchmark reports the bytes copied.So our numbers should be approxi-mately one-half to one-third of his numbers.Memory reading is measured by an unrolled loop that sums up a series of integers. On most (perhaps all) systems measured the integer size is 4 bytes.The loop is unrolled such that most compilers generate code that uses a constant offset with the load, resulting 1Some of the PCs had less than 16M of available memory; those machines copied 4M.in a load and an add for each word of memory.The add is an integer add that completes in one cycle on all of the processors.Given that today’s processor typi-cally cycles at 10 or fewer nanoseconds (ns) and that memory is typically 200-1,000 ns per cache line, the results reported here should be dominated by the memory subsystem, not the processor add unit.The memory contents are added up because almost all C compilers would optimize out the whole loop when optimization was turned on, and would generate far too many instructions without optimization.The solution is to add up the data and pass the result as an unused argument to the ‘‘finish timing’’function.Memory reads represent about one-third to one-half of the bcopy work, and we expect that pure reads should run at roughly twice the speed of bcopy. 
Exceptions to this rule should be studied, for excep-tions indicate a bug in the benchmarks, a problem in bcopy,or some unusual hardware.Bcopy Memory System unrolled libc read write IBM Power2 242171 205364 Sun Ultra185 167129 152 DEC Alpha@30085 80120 123 HP K21078 57117 126 Unixware/i686 655823588 Solaris/i686 524815971 DEC Alpha@15046 4579 91 Linux/i686 425620856 FreeBSD/i586 39427383 Linux/Alpha 39397371 Linux/i586 38427475 SGI Challenge35 3665 67 SGI Indigo231 3269 66 IBM PowerPC 2121 6326 Sun SC100017 1538 31Table 2.Memory bandwidth (MB/s) Memory writing is measured by an unrolled loop that stores a value into an integer (typically a 4 byte integer) and then increments the pointer.The proces-sor cost of each memory operation is approximately the same as the cost in the read case.The numbers reported in Table 2 are not the raw hardware speed in some cases.The Power22is capa-ble of up to 800M/sec read rates [McCalpin95]and HP PA RISC (and other prefetching) systems also do bet-ter if higher levels of code optimization used and/or the code is hand tuned.The Sun libc bcopy in Table 2 is better because they use a hardware specific bcopy routine that uses instructions new in SPARC V9 that were added specif-ically for memory movement.The Pentium Pro read rate in Table 2 is much higher than the write rate because, according to Intel, 2Someone described this machine as a $1,000 pro-cessor on a $99,000 memory subsystem.the write transaction turns into a read followed by a write to maintain cache consistency for MP systems.5.2. IPC bandwidthInterprocess communication bandwidth is fre-quently a performance issue.Many Unix applications are composed of several processes communicating through pipes or TCP sockets. Examples include the groff documentation system that prepared this paper,the X Window System ,remote file access,and World Wide Web servers.Unix pipes are an interprocess communication mechanism implemented as a one-way byte stream.Each end of the stream has an associated file descrip-tor; one is the write descriptor and the other the read descriptor.TCP sockets are similar to pipes except they are bidirectional and can cross machine bound-aries.Pipe bandwidth is measured by creating two pro-cesses, a writer and a reader,which transfer 50M of data in 64K transfers.The transfer size was chosen so that the overhead of system calls and context switch-ing would not dominate the benchmark time.The reader prints the timing results, which guarantees that all data has been moved before the timing is finished.TCP bandwidth is measured similarly,except the data is transferred in 1M page aligned transfers instead of 64K transfers.If the TCP implementation supports it, the send and receive socket buffers are enlarged to 1M, instead of the default 4-60K.We hav e found that setting the transfer size equal to the socket buffer size produces the greatest throughput over the most imple-mentations.System Libc bcopy pipeTCPHP K21057 9334Linux/i686 5689 18IBM Power2 17184 10Linux/Alpha 3973 9Unixware/i686 5868 -1Sun Ultra1167 6151DEC Alpha@30080 4611Solaris/i686 4838 20DEC Alpha@15045 359SGI Indigo232 3422Linux/i586 4234 7IBM PowerPC 2130 17FreeBSD/i586 4223 13SGI Challenge 36 1731Sun SC100015911Table 3.Pipe and local TCP bandwidth (MB/s)bcopy is important to this test because the pipewrite/read is typically implemented as a bcopy into the kernel from the writer and then a bcopy from the kernel to the reader.Ideally,these results would be approximately one-half of the bcopy results. 
It is possible for the kernel bcopy to be faster than the C library bcopy since the kernel may have access tobcopy hardware unavailable to the C library.It is interesting to compare pipes with TCP because the TCP benchmark is identical to the pipe benchmark except for the transport mechanism.Ide-ally,the TCP bandwidth would be as good as the pipe bandwidth. It is not widely known that the majority of the TCP cost is in the bcopy ,the checksum, and the network interface driver. The checksum and the driver may be safely eliminated in the loopback case and if the costs have been eliminated, then TCP should be just as fast as pipes.From the pipe and TCP results in Table 3, it is easy to see that Solaris and HP-UX have done this optimization.Bcopy rates in Table 3 can be lower than pipe rates because the pipe transfers are done in 64K buffers, a size that frequently fits in caches, while the bcopy is typically an 8M-to-8M copy, which does not fit in the cache.In Table 3, the SGI Indigo2, a uniprocessor,does better than the SGI MP on pipe bandwidth because of caching effects - in the UP case, both processes share the cache; on the MP,each process is communicating with a different cache.All of the TCP results in Table 3 are in loopback mode — that is both ends of the socket are on the same machine.It was impossible to get remote net-working results for all the machines included in this paper.We are interested in receiving more results for identical machines with a dedicated network connect-ing them.The results we have for over the wire TCP bandwidth are shown below.System NetworkTCP bandwidthSGI PowerChallenge hippi 79.3Sun Ultra1100baseT 9.5HP 9000/735fddi 8.8FreeBSD/i586 100baseT 7.9SGI Indigo210baseT .9HP 9000/73510baseT .9Linux/i586@90Mhz 10baseT .7Table 4.Remote TCP bandwidth (MB/s)The SGI using 100MB/s Hippi is by far the fastest in Table 4.The SGI Hippi interface has hardware sup-port for TCP checksums and the IRIX operating sys-tem uses virtual memory tricks to avoid copying data as much as possible.For larger transfers, SGI Hippi has reached 92MB/s over TCP.100baseT is looking quite competitive when com-pared to FDDI in Table 4, even though FDDI has packets that are almost three times larger.We wonder how long it will be before we see gigabit ethernet interfaces.5.3. Cached I/O bandwidthExperience has shown us that reusing data in the file system page cache can be a performance issue.This section measures that operation through twointerfaces,read and mmap.The benchmark here is not an I/O benchmark in that no disk activity is involved. We wanted to measure the overhead of reusing data, an overhead that is CPU intensive,rather than disk intensive.The read interface copies data from the kernel’s file system page cache into the process’s buffer,using 64K buffers. The transfer size was chosen to mini-mize the kernel entry overhead while remaining realis-tically sized.The difference between the bcopy and the read benchmarks is the cost of the file and virtual memory system overhead. In most systems, the bcopy speed should be faster than the read speed. The exceptions usually have hardware specifically designed for the bcopy function and that hardware may be available only to the operating system.The read benchmark is implemented by reread-ing a file (typically 8M) in 64K buffers. Each buffer is summed as a series of integers in the user process. 
The summing is done for two reasons: for an apples-to-apples comparison the memory-mapped benchmark needs to touch all the data, and the file system can sometimes transfer data into memory faster than the processor can read the data.For example,SGI’s XFS can move data into memory at rates in excess of 500M per second, but it can move data into the cache at only 68M per second.The intent is to measure perfor-mance delivered to the application, not DMA perfor-mance to memory.Libc File Memory File System bcopy read read mmap IBM Power2 171187 205106 HP K21057 88117 52 Sun Ultra1167 85129 101 DEC Alpha@30080 67120 78 Unixware/i686 5862 235200 Solaris/i686 485215994 DEC Alpha@15045 4079 50 Linux/i686 564020836 IBM PowerPC 2140 6351 SGI Challenge36 3665 56 SGI Indigo232 3269 44 FreeBSD/i586 4230 7353 Linux/Alpha 3924 7318 Linux/i586 4223749 Sun SC100015 2038 28Table 5.File vs. memory bandwidth (MB/s)The mmap interface provides a way to access the kernel’sfile cache without copying the data.The mmap benchmark is implemented by mapping the entire file (typically 8M) into the process’s address space. Thefile is then summed to force the data into the cache.In Table 5, a good system will have File read as fast as (or even faster than)Libc bcopy because as the file system overhead goes to zero, the file reread case is virtually the same as the library bcopy case. How-ev e r, file reread can be faster because the kernel may have access to bcopy assist hardware not available to the C library.Ideally,File mmap performance should approach Memory read performance, but mmap is often dramatically worse. Judging by the results, this looks to be a potential area for operating system improvements.In Table 5 the Power2 does better on file reread than bcopy because it takes full advantage of the memory subsystem from inside the kernel. The mmap reread is probably slower because of the lower clock rate; the page faults start to show up as a significant cost.It is surprising that the Sun Ultra1 was able to bcopy at the high rates shown in Table 2 but did not show those rates for file reread in Table 5.HP has the opposite problem, they get file reread faster than bcopy, perhaps because the kernel bcopy has access to hardware support.The Unixware system has outstanding mmap reread rates, better than systems of substantially higher cost.Linux needs to do some work on the mmap code.6. Latency measurementsLatency is an often-overlooked area of perfor-mance problems, possibly because resolving latency issues is frequently much harder than resolving band-width issues.For example, memory bandwidth may be increased by making wider cache lines and increas-ing memory ‘‘width’’and interleave,but memory latency can be improved only by shortening paths or increasing (successful) prefetching.The first step toward improving latency is understanding the current latencies in a system.The latency measurements included in this suite are memory latency, basic operating system entry cost, signal handling cost, process creation times, context switching, interprocess communication, file system latency, and disk latency.6.1. 
Memory read latency backgroundIn this section, we expend considerable effort to define the different memory latencies and to explain and justify our benchmark.The background is a bit tedious but important, since we believe the memory latency measurements to be one of the most thought-provoking and useful measurements in lmbench.The most basic latency measurement is memory latency since most of the other latency measurements can be expressed in terms of memory latency. For example, context switches require saving the current process state and loading the state of the next process. However, memory latency is rarely accurately mea-sured and frequently misunderstood.。
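A memory-read-latency measurement of the kind discussed here is usually built as a pointer chase, where each load depends on the result of the previous one, so that latency rather than bandwidth is observed. The sketch below is written in that spirit only; it is not lmbench source code, it uses a simple stride (a random permutation would defeat hardware prefetching more thoroughly), and the buffer sizes and iteration count are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Build a circular pointer chain through `len` bytes at the given stride,
 * then time dependent loads around the chain.  Each load's address comes
 * from the previous load, so the loop measures latency, not bandwidth. */
static double chase(size_t len, size_t stride, long iters) {
    char *buf = malloc(len);
    if (!buf)
        return 0.0;
    size_t n = len / stride;
    for (size_t i = 0; i < n; i++)            /* element i points to element i+1 */
        *(char **)(buf + i * stride) = buf + ((i + 1) % n) * stride;

    char **p = (char **)buf;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long i = 0; i < iters; i++)
        p = (char **)*p;                      /* dependent load */
    gettimeofday(&t1, NULL);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e6 +
                 (t1.tv_usec - t0.tv_usec)) * 1e3 / iters;
    free(buf);
    return p ? ns : 0.0;                      /* use p so the chase is not optimized away */
}

int main(void) {
    /* Sweep the working set from 4 KB to 64 MB; the ns/load figure jumps
     * each time a cache level is exceeded. */
    for (size_t len = 4 * 1024; len <= 64u * 1024 * 1024; len *= 2)
        printf("%8zu KB: %6.1f ns/load\n",
               len / 1024, chase(len, 64, 10 * 1000 * 1000));
    return 0;
}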

Storage HCIP Question Bank with Reference Answers

I. Single-choice questions (38 questions, 1 point each, 38 points in total)

1. Which scenario is the Huawei OceanStor V3 SmartCache feature suited to? A. Random small-I/O writes B. Random small-I/O reads C. Sequential-I/O writes D. Sequential-I/O reads. Correct answer: B
2. Which feature provides file-system snapshots? A. InfoLocker B. InfoStamper C. InfoProtector D. InfoEqualizer. Correct answer: B
3. Which of the following statements about managing block storage resources is incorrect? A. When a volume's status has not been refreshed in time, the administrator can refresh it manually. B. The administrator can create, view, and delete block storage service levels. C. Storage resources are distributed to users through block resource pools; the administrator can view and modify resource pool information, or delete a resource pool. D. The administrator can view the volume resources created by service-group users or administrators. Correct answer: C
4. When a Linux client accesses a Huawei OceanStor 9000 cluster, client parameters can be tuned to improve access performance. Which of the following is incorrect? A. NFS uses TCP to send and receive data, and appropriately enlarging the buffer size speeds up TCP transfers. B. Enabling jumbo frames reduces TCP overhead and improves performance, provided every network device between the OceanStor 9000 front-end network ports and the client ports supports jumbo frames. C. To support jumbo frames, all OceanStor 9000 nodes must be configured accordingly at the same time. D. For the OceanStor 9000 to support jumbo frames, the MTU must be set on all front-end and back-end service network ports. Correct answer: D
5. What distinguishes the OceanStor 9000 WiseLink and SmartConnect technologies? A. Domain-name resolution B. Domain-name design and zone management C. The rebalance function D. Inter-node IP failover. Correct answer: C
6. A storage device has raised a fault alarm. Which of the following methods does not help collect fault information about the storage system? A. Checking the zone configuration B. Checking all events C. Exporting system data D. Checking alarm information. Correct answer: A
7. When an OceanStor V3 remote replication pair is created, what is the default synchronization speed? A. Low B. High C. Highest D. Medium. Correct answer: D
8. Which feature does not improve the performance of OceanStor V3 storage products? A. SmartCache B. SmartPartition C. SmartTier D. SmartVirtualization. Correct answer: D
9. An enterprise application performed a full backup of all files and directories on February 1 (Feb1) and cumulative incremental backups on February 2 (Feb2), February 3 (Feb3), and February 4 (Feb4). The service disk failed on February 5. To restore to the February 4 version, which backup files are needed? A. Feb1+Feb2 B. Feb1+Feb3 C. Feb1+Feb2+Feb3+Feb4 D. Feb1+Feb4. Correct answer: D
10. Huawei's hyper-converged storage uses a distributed storage logical architecture. Which of the following is a function of the MDC module? A. Managing hard disks B. Providing standard SCSI/iSCSI services to VMs and databases C. Managing storage cluster status D. Hash partitioning. Correct answer: C
11. Which statement about recovery reconstruction in an OceanStor V3 storage system is incorrect? A. Recovery reconstruction is performed when an in-place disk suddenly fails. B. Completing recovery reconstruction requires the data difference log. C. Recovery reconstruction applies when a disk cannot be used normally and the RAID group is in a degraded state. D. Priority data can be restored quickly, and data that has not been updated does not need to be recovered. Correct answer: A
12. Which type of LUN does the Huawei SmartErase feature apply to? A. Mirror LUN B. Snapshot LUN C. Thick LUN D. Thin LUN. Correct answer: C
13. Which of the following is not a characteristic of the Huawei WushanFS file system? A. Performance and capacity can be scaled independently. B. Fully symmetric distributed architecture with metadata nodes. C. Metadata is evenly distributed, eliminating the metadata-node performance bottleneck. D. A single file system of up to 40 PB is supported. Correct answer: B
14. In the Huawei OceanStor backup solution, if the retention time and retention cycles of backup data are configured as 13 days and 2 respectively, and backups actually run on a 3-day cycle, for how many days is the data retained? A. 13 B. 26 C. 6 D. 4. Correct answer: A
15. At what unit does the Huawei OceanStor 9000 provide CIFS/NFS/FTP shares? A. Directory B. File C. User D. User group. Correct answer: A
16. Which of the following protocols uses the SHA-256 algorithm? A. FTP 2.0 B. SMB 2.0 C. NFS 3.0 D. NFS 4.0. Correct answer: B
17. A user has a capacity quota of the mandatory type, with a recommended threshold of 3 GB, a soft threshold of 4 GB, a hard threshold of 6 GB, and a grace period of 10 days. Which description is incorrect? A. When the user's used space reaches 3 GB, the system does not report an alarm and the user can continue writing. B. When the user's used space reaches 5 GB and has exceeded 4 GB for 11 days, the user's writes will be restricted. C. When the user's used space reaches 4 GB, the system reports an alarm but does not restrict writes within the grace period. D. When the user's used space reaches 6 GB, the system forbids further writes. Correct answer: A
18. If HyperMetro uses quorum-server mode, at which site should the quorum server be placed? A. The primary site B. One at each of the primary, disaster-recovery, and third-party sites C. A third-party site D. The disaster-recovery site. Correct answer: C
19. The deduplication and compression features of Huawei OceanStor V3 save storage space and reduce TCO. Which statement about them is incorrect? A. Deduplication can be applied to file systems and thin LUNs. B. The smallest unit of capacity on which data fingerprints are computed is the extent. C. Deduplicated data can then be compressed; combining deduplication and compression gives a better overall result. D. If the system is equipped with a deduplication/compression accelerator card, fingerprint computation, compression, and decompression are performed by that card. Correct answer: B
20. Which of the following descriptions is not a functional characteristic of scale-out file storage? A. Distributed metadata management, supporting linear scaling from 3 to 4000 nodes.

TLB (Fast Table)

The TLB, or Translation Lookaside Buffer, is commonly called the "fast table". The fast table is a small-capacity associative memory built from high-speed cache cells; it is fast, and the hardware guarantees parallel lookup by content. It is generally used to hold the page numbers of the small set of active pages that are currently accessed most frequently. The purpose of the fast table is to speed up the translation of linear addresses: the first time a linear address is used, the corresponding physical address is computed through a slow access to the page tables in RAM; at the same time, that physical address is stored in a TLB entry so that later references to the same linear address can be translated quickly.

TLB: Translation Lookaside Buffer, also called the bypass translation buffer or page-table cache; it stores page-table entries (the virtual-to-physical translation entries).

Addressing in x86 protected mode proceeds as: segmented logical address -> linear address -> paged address, where paged address = page base address + offset within the page. On the virtual side the unit is called a page; on the physical side it is called a frame. System memory in the x86 architecture holds a two-level page table: the first level is called the page directory and the second level the page table.

There is no essential difference between the TLB and the CPU's L1 and L2 caches, except that the TLB caches page-table data while the other two cache actual data.

2. Internal organization:

1) The first practical use of a TLB in x86 CPUs dates back to Intel's 486. An x86 CPU generally contains the following four groups of TLBs:
Group 1: an Instruction-TLB caching ordinary page-table entries (4 KB pages);
Group 2: a Data-TLB caching ordinary page-table entries (4 KB pages);
Group 3: an Instruction-TLB caching large-page entries (2 MB/4 MB pages);
Group 4: a Data-TLB caching large-page entries (2 MB/4 MB pages).

2) TLB hit and TLB miss: if the TLB happens to hold the required page-table entry, this is called a TLB hit; if the TLB does not hold the required entry, it is called a TLB miss.

Embedded Systems Paper (English)

1. Propose I-Cache locking techniques to minimize the ACET of embedded applications by exploring applications' statistical profile information and application-specific foreknowledge.
2. Prove that the static I-Cache locking problem for ACET reduction is an NP-Hard problem, and propose a fully locking algorithm and a partially locking algorithm.
Received: 27 August 2010 / Revised: 31 August 2011 / Accepted: 21 November 2011 © Springer Science+Business Media, LLC 2011
Abstract Cache is effective in bridging the gap between processor and memory speed. It is also a source of unpredictability because of its dynamic and adaptive behavior. A lot of modern processors provide cache locking capability which locks instructions or data of a program into cache so that a more precise estimation of execution time can be obtained. The selection of instructions or data to be locked in cache has dramatic influence on the system performance. For real-time systems, cache locking is mostly utilized to improve the Worst-Case Execution Time (WCET). However, Average-Case Execution Time (ACET) is also an important criterion for some embedded systems, especially for soft real-time embedded systems, such as image processing systems. This paper aims to utilize instruction cache (I-Cache) locking technique to guarantee a minimized estimable ACET for embedded systems by exploring the probability profile information. A Probability Execution Flow Tree (PEFT) is introduced to model an embedded application with runtime profile information. The static I-Cache locking problem is proved to be NP-Hard and two kinds of locking, fully locking and partially locking, are proposed to find the instructions to be locked. Dynamic I-Cache locking can further improve the ACET. For dynamic I-Cache locking, an algorithm that leverages the application’s branching information is proposed. All the algorithms are executed during the compilation time and the results are applied during the runtime. Experimental

How the CPU reads and writes RAM when an L1 cache is present

The L1 and L2 caches are high-speed caches located inside the CPU, and they are faster to access than main memory (RAM). Their purpose is to reduce the number of trips to main memory and raise the CPU's execution efficiency. When data or instructions are found in a cache, the CPU can fetch and process them more quickly, which speeds up computation as a whole. Only when the data or instructions are not available in the caches does the CPU access main memory (RAM) to get what it needs. This layered cache design helps improve the performance of the computer system.

With a level-1 cache present, the process by which the CPU reads and writes RAM (Random Access Memory) can be described in the following steps (a toy model of these steps follows the list):

1. The CPU checks the L1 cache: before executing an instruction or reading data, the CPU first checks whether the required data or instruction is already present in the L1 cache.

2. Cache miss: if the data or instruction is not in the L1 cache (a cache miss), the CPU raises a cache-miss signal.

3. Access the L2 cache: the CPU sends the request on to the L2 cache to look for the required data or instruction there.

4. Cache miss: if the data or instruction is not in the L2 cache either, the CPU raises another cache-miss signal.

5. Access main memory (RAM): the CPU issues a request to main memory to fetch the required data or instruction. At this point the CPU has to wait out the main-memory response time, which is a comparatively slow operation.

6. Data transfer: once the data or instruction is available in main memory, it is transferred over the system bus into the CPU's cache hierarchy. After the transfer completes, the CPU can access the data or instruction directly.

7. Processing: once the data or instruction is available in the cache, the CPU processes it, carrying out the corresponding operation, for example executing the instruction or performing a computation.
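The seven steps above can be condensed into a toy lookup routine. The cycle costs are made-up round numbers chosen only to show the ordering L1 -> L2 -> RAM; they are not measurements of any particular processor.

#include <stdbool.h>
#include <stdio.h>

/* Made-up costs in CPU cycles, for illustration only. */
enum { L1_COST = 4, L2_COST = 12, RAM_COST = 200 };

typedef struct {
    bool in_l1;
    bool in_l2;
} LineState;

/* Follow the steps in the text: check L1, then L2, then go to RAM and
 * install the line into the hierarchy on the way back. */
static int access_cost(LineState *line) {
    if (line->in_l1)
        return L1_COST;                    /* step 1: L1 hit */
    if (line->in_l2) {                     /* steps 2-3: L1 miss, L2 hit */
        line->in_l1 = true;                /* refill L1 */
        return L1_COST + L2_COST;
    }
    line->in_l2 = true;                    /* steps 4-6: miss both, fetch from RAM */
    line->in_l1 = true;                    /* line installed in both levels */
    return L1_COST + L2_COST + RAM_COST;
}

int main(void) {
    LineState line = { false, false };
    printf("first access : %d cycles (RAM)\n", access_cost(&line));
    printf("second access: %d cycles (L1)\n", access_cost(&line));
    line.in_l1 = false;                    /* pretend the line was evicted from L1 only */
    printf("third access : %d cycles (L2)\n", access_cost(&line));
    return 0;
}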

Cache Performance Optimization in Computer Organization

In computer organization, the cache is a very important concept. It refers to a block of high-speed memory inside the CPU that holds the data and instructions the CPU uses most often; its purpose is to speed up access to that data and those instructions and so raise the overall performance of the machine. At the same time, cache behaviour itself affects overall system performance, so optimizing the cache is an important topic in computer organization. There are many ways to optimize cache performance; a few of them are discussed briefly below.

Increase the cache size

Cache size is directly tied to performance: broadly speaking, the larger the cache, the more of a program's working set it holds, and the faster the CPU's memory accesses complete on average. Increasing the cache size is therefore an effective way to improve cache performance. There are limits, however, because cache size affects not only access speed but also the CPU's cost and power consumption. When enlarging the cache, all of these factors have to be weighed together and a reasonable trade-off made.
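The size trade-off can be illustrated with a toy direct-mapped cache model that replays the same random access pattern against two cache sizes and compares hit rates. All parameters below are arbitrary; the only point is that a larger cache captures more of the working set.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE 64                      /* bytes per cache line */

/* Replay a synthetic access pattern against a direct-mapped cache of
 * `lines` lines and return the hit rate. */
static double hit_rate(size_t lines, size_t working_set, long accesses) {
    uint64_t *tags = calloc(lines, sizeof *tags);
    if (!tags)
        return 0.0;
    uint64_t rng = 0x12345678;       /* same seed, so both runs see the same trace */
    long hits = 0;
    for (long i = 0; i < accesses; i++) {
        rng = rng * 6364136223846793005ULL + 1442695040888963407ULL;  /* simple LCG */
        uint64_t addr  = (rng >> 16) % working_set;
        uint64_t block = addr / LINE;
        size_t   set   = (size_t)(block % lines);
        if (tags[set] == block + 1)
            hits++;
        else
            tags[set] = block + 1;   /* store the tag; +1 keeps 0 meaning "empty" */
    }
    free(tags);
    return (double)hits / accesses;
}

int main(void) {
    const size_t ws = 256 * 1024;    /* 256 KB working set */
    printf("32 KB cache:  hit rate %.2f\n", hit_rate(32 * 1024 / LINE, ws, 1000000));
    printf("256 KB cache: hit rate %.2f\n", hit_rate(256 * 1024 / LINE, ws, 1000000));
    return 0;
}

With these parameters the small cache hits only roughly one access in eight, while the cache that covers the whole working set hits almost every time after warm-up.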

Use a faster cache

A high-speed cache is simply a cache with a shorter access time. An ordinary cache hierarchy has levels 1 and 2, while what this article calls high-speed caches also include level-3 and level-4 caches. Compared with an ordinary cache, a faster cache responds to the CPU's requests more quickly and so improves overall performance. Using one is also straightforward: the cache is upgraded or replaced.

Use a multi-level cache

Using multiple cache levels is another effective way to improve cache performance. A multi-level cache usually has three layers: L1, L2, and L3. L1 sits inside the CPU; it is small but very fast. L2 sits between the CPU and memory; it is larger than L1 and slower. L3 sits on the motherboard; it is larger than L2 and slower still. The advantage of a multi-level hierarchy is that when the limited L1 cache overflows, data and instructions can spill into L2 and be cached in still larger quantity in L3. This cuts down the number of accesses that must go all the way to main memory and so improves CPU performance.
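A software-side complement to the multi-level hierarchy described above is loop blocking (tiling), which restructures a computation so that each small block of data is reused while it still sits in the faster cache levels. The tiled matrix multiply below is the standard textbook example; the block size is a tunable guess, not a value taken from this article.

#include <stdio.h>
#include <string.h>

#define N 512
#define B 64              /* tile size: chosen so the working tiles of doubles
                             fit comfortably within the faster cache levels */

static double A[N][N], Bm[N][N], C[N][N];

/* Tiled matrix multiply: each B x B tile of A, Bm and C is reused many
 * times while it is still resident in cache. */
static void matmul_tiled(void) {
    memset(C, 0, sizeof C);
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += aik * Bm[k][j];
                    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            Bm[i][j] = 2.0;
        }
    matmul_tiled();
    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}

Comparing this against a plain triple loop over the full matrices, on most machines, shows the effect of keeping the active data resident in the upper cache levels.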

Use hardware-accelerated caches

A hardware-accelerated cache is dedicated hardware designed specifically for the cache, used mainly to raise cache performance.

CMEM: Solving the Cache Coherence Problem

1. CMEM: solving the cache coherence problem

In a multi-core design, data held in the shared level-2 caches can become inconsistent, and the private caches of the different CPU cores can also hold inconsistent data; this is the cache coherence problem. Solutions fall broadly into software methods and hardware methods, each with its own strengths. DaVinci uses the software approach.

1) Data integrity.

The CMEM module in Codec Engine (CE) is used for shared-memory allocation. The application runs on MVista Linux, so any buffer obtained with malloc inside the application is a virtual address whose underlying physical pages are not necessarily contiguous. The moment such a pointer is handed to the algorithm, a data-integrity problem appears, because the algorithm runs on DSP/BIOS, a world of physical addresses only. To solve this, the CMEM API has to be called when dynamically allocating space in the shared buffers. The CMEM driver can be used on the ARM side to allocate physically contiguous buffers.

2) Cache coherence.

Both the ARM and the DSP cores have caches to make their use of external memory more efficient. Each core manages the coherence of its own cache for its own reads and writes. But when data is sent from one core to the other through shared memory, the cores never manage the caches for each other, because they have no way of knowing what the other has done. Like the virtual-address problem, this one is easy to handle as long as the programmer is aware of it: when the client side writes a client stub that performs a cache write-back on a variable or buffer about to be sent to the server, the server side writes a server stub that invalidates those variables and buffers on receipt.
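Putting the pieces of this article together (allocate a physically contiguous shared buffer on the ARM side, hand its physical address to the DSP, write back before sending, invalidate before reading) might look roughly like the sketch below. Every function name in it (shared_alloc, shared_virt_to_phys, cache_wb, cache_inv, notify_dsp) is an invented placeholder stubbed out so the sketch compiles; none of them is the real CMEM or Codec Engine API.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Invented placeholders: a real system would use the calls provided by the
 * platform.  Only the ordering of the operations matters here. */
static void     *shared_alloc(size_t len)       { return malloc(len); }
static uintptr_t shared_virt_to_phys(void *p)   { return (uintptr_t)p; }
static void      cache_wb(void *p, size_t len)  { (void)p; (void)len; }
static void      cache_inv(void *p, size_t len) { (void)p; (void)len; }
static void      notify_dsp(uintptr_t phys, size_t len) { (void)phys; (void)len; }

/* Client stub on the ARM: fill the buffer, write the cached data back to
 * RAM, then pass the *physical* address across to the DSP side. */
static void client_stub_send(void *buf, size_t len) {
    cache_wb(buf, len);
    notify_dsp(shared_virt_to_phys(buf), len);
}

/* Server-side stub (conceptually on the DSP): invalidate before reading so
 * stale cached lines are not used instead of the data just produced. */
static void server_stub_receive(void *buf, size_t len) {
    cache_inv(buf, len);
    /* ... process buf ... */
}

int main(void) {
    size_t len = 4096;
    void *buf = shared_alloc(len);        /* stands in for a contiguous CMEM block */
    memset(buf, 0x5A, len);
    client_stub_send(buf, len);
    server_stub_receive(buf, len);
    puts("stub ordering demo; all platform calls are placeholders");
    free(buf);
    return 0;
}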

Design tradeoffs for software-managed TLBs

Citation:D. Nagle, R. Uhlig, T. Stanley, T. Mudge, S. Sechrest and R. Brown. Design Tradeoffs for Software-Managed TLBs. Proc. of the 20th Ann. Int. Symp. Com-puter Architecture, May 1993, pp. 27-38.AbstractAn increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties, which are highly dependent on the operating system’s structure and its use of vir-tual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of operating systems including monolithic and microkernel designs. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation run-ning Ultrix, OSF/1, and three versions of Mach 3.0.Results: New operating systems are changing the relative fre-quency of different types of TLB misses, some of which may not be efficiently handled by current architectures. For the same applica-tion binaries, total TLB service time varies by as much as an order of magnitude under different operating systems. Reducingthe handling cost for kernel TLB misses reduces total TLB service time up to 40%. For TLBs between 32 and 128 slots, each dou-bling of the TLB size reduces total TLB service time by as much as 50%.Keywords: Translation Lookaside Buffer (TLB), Simulation,Hardware Monitoring, Operating Systems.1IntroductionMany computers support virtual memory by providing hardware-managed translation lookaside buffers (TLBs). However, some computer architectures, including the MIPS RISC [1] and the DEC Alpha [2], have shifted TLB management responsibility into the operating system. These software-managed TLBs can simplify hardware design and provide greater flexibility in page table structure, but typically have slower refill times than hardware-managed TLBs [3].Design Tradeoffs for Software-Managed TLBsDavid Nagle, Richard Uhlig, Tim Stanley, Stuart Sechrest, Trevor Mudge & Richard Brown Department of Electrical Engineering and Computer ScienceUniversity of Michigane-mail: uhlig@, bassoon@At the same time, operating systems such as Mach 3.0 [4] are moving functionality into user processes and making greater use of virtual memory for mapping data structures held within the ker-nel. These and related operating system trends place greater stress upon the TLB by increasing miss rates and, hence, decreasing overall system performance.This paper explores these issues by examining design trade-offs for software-managed TLBs and their impact, in conjunction with various operating systems, on overall system performance.To examine issues which cannot be adequately modeled with sim-ulation, we have developed a system analysis tool called Monster,which enables us to monitor actual systems. We have also devel-oped a novel TLB simulator called Tapeworm, which is compiled directly into the operating system so that it can intercept all of the actual TLB misses caused by both user process and OS kernel memory references. The information that Tapeworm extracts from the running system is used to obtain TLB miss counts and to sim-ulate different TLB configurations.The remainder of this paper is organized as follows: Section 2examines previous TLB and OS research related to this work.Section 3 describes our analysis tools, Monster and Tapeworm.The MIPS R2000 TLB structure and its performance under Ultrix,OSF/1 and Mach 3.0 is examined in Section 4. 
Experiments, anal-ysis and hardware-based performance improvements are pre-sented in Section 5. Section 6 summarizes our conclusions.2Related WorkBy caching page table entries, TLBs greatly speed up virtual-to-physical address translations. However, memory references that require mappings not in the TLB result in misses that must be ser-viced either by hardware or by software. In their 1985 study, Clark and Emer examined the cost of hardware TLB management by monitoring a V AX-11/780. For their workloads, 5% to 8% of a user program’s run time was spent handling TLB misses [5]. More recent papers have investigated the TLB’s impact on user program performance. Chen, Borg and Jouppi [6], using traces generated from the SPEC benchmarks, determined that the amount of physical memory mapped by the TLB is strongly linked to the TLB miss rate. For a reasonable range of page sizes,the amount of the address space that could be mapped was more important than the page size chosen. Talluri et al. [7] have shown that although older TLBs (as in the V AX-11/780) mapped large regions of memory, TLBs in newer architectures like the MIPS doThis paper was presented at the 20th International Symposium onComputer Architecture, San Diego, California, May 1993 (pages 27-38)not. They showed that increasing the page size from 4 KBytes to 32 KBytes decreases the TLB’s contribution to CPI by a factor of at least 31.Operating system references also have a strong impact on TLB miss rates. Clark and Emer’s measurements showed that although only 18% of all memory references were made by the operating system, these references resulted in 70% of all TLB misses. Sev-eral recent papers [8-10] have pointed out that changes in the structure of operating systems are altering the utilization of the TLB. For example, Anderson et al. [8] compared an old-style monolithic operating system (Mach 2.5) and a newer microkernel operating system (Mach 3.0), and found a 600% increase in TLB misses requiring a full kernel entry. Kernel TLB misses were far and away the most frequently invoked system primitive for the Mach 3.0 kernel.This work distinguishes itself from previous work through its focus on software-managed TLBs and its examination of the impact of changing operating system technology on TLB design. Unlike hardware-managed TLB misses, which have a relatively small refill penalty, the design trade-offs for software-managed TLBs are rather complex. Our measurements show that the cost ofhandling a single TLB miss on a DECstation 3100 running Mach 3.0 can vary from 20 to more than 400 cycles. Because of this wide variance in service times, it is important to analyze the fre-quency of various types of TLB misses, their cost and the reasons behind them. The particular mix of TLB miss types is highly dependent on the implementation of the operating system. We therefore focus on the operating system in our analysis and dis-cussion.3Analysis Tools and ExperimentalEnvironmentTo monitor and analyze TLB behavior for benchmark programs running on a variety of operating systems, we have developed a hardware monitoring system called Monster and a TLB simulator called Tapeworm. The remainder of this section describes these tools and the experimental environment in which they are used.3.1System Monitoring with MonsterThe Monster monitoring system enables comprehensive analyses of the interaction between operating systems and architectures. 
Monster is comprised of a monitored DECstation 31002, an attached logic analyzer and a controlling workstation. Monster’s capabilities are described more completely in [11].In this study, we used Monster to obtain the TLB miss handling costs by instrumenting each OS kernel with marker instructions that denoted the entry and exit points of various code segments (e.g. kernel entry, TLB miss handler, kernel exit). The instru-mented kernel was then monitored with the logic analyzer whose state machine detected and dumped the marker instructions and a nanosecond-resolution timestamp into the logic analyzer’s trace buffer. Once filled, the trace buffer was post-processed to obtain a histogram of time spent in the different invocations of the TLB miss handlers. This technique allowed us to time code paths with 1. The TLB contribution is as high as 1.7 cycles per instruction for somebenchmarks.2. The DECstation 3100 contains an R2000 microprocessor (16.67 MHz)and 16 Megabytes of memory.far greater accuracy than can be obtained using a system clock with its coarser resolution or, as is often done, by repeating a code fragment N times and then dividing the total time spent by N. 3.2TLB Simulation with TapewormMany previous TLB studies have used trace-driven simulation to explore design trade-offs [5-7, 12]. However, there are a number of difficulties with trace-driven TLB simulation. First, it is diffi-cult to obtain accurate traces. Code annotation tools like pixie [13] or AE [14] generate user-level address traces for a single task. However, more complex tools are required in order to obtain real-istic system-wide address traces that account for multiprocess workloads and the operating system itself [5, 15]. Second, trace-driven simulation can consume considerable processing and stor-age resources. Some researchers have overcome the storage resource problem by consuming traces on-the-fly [6, 15]. This technique requires that system operation be suspended for extended periods of time while the trace is processed, thus intro-ducing distortion at regular intervals. Third, trace-driven simula-tion assumes that address traces are invariant to changes in the structural parameters or management policies3 of a simulated TLB. While this may be true for cache simulation (where misses are serviced by hardware state machines), it is not true for soft-ware-managed TLBs where a miss (or absence thereof) directly changes the stream of instruction and data addresses flowing through the processor. Because the code that services a TLB miss can itself induce a TLB miss, the interaction between a change in TLB structure and the resulting system address trace can be quite complex.We have overcome these problems by compiling our TLB sim-ulator, Tapeworm, directly into the OSF/1 and Mach 3.0 operating system kernels. Tapeworm relies on the fact that all TLB misses in an R2000-based DECstation 3100 are handled by software. We modified the operating systems’ TLB miss handlers to call the Tapeworm code via procedural “hooks” after every miss. This 3. Structural parameters include the page size, the number of TLB slotsand the partition of TLB slots into pools reserved for different pur-poses. Management policies include the placement policy (direct mapped, 2-way set-associative, fully-associative, etc.) and the replace-ment policy (FIFO, LRU, random, etc.).Software Trap on TLB Missmechanism passes the relevant information about all user and ker-nel TLB misses directly to the Tapeworm simulator. 
Tapeworm uses this information to maintain its own data structures and to simulate other possible TLB configurations.A simulated TLB can be either larger or smaller than the actual TLB. Tapeworm ensures that the actual TLB only holds entries available in the simulated TLB. For example, to simulate a TLB with 128 slots using only 64 actual TLB slots (Figure 1), Tape-worm maintains an array of 128 virtual-to-physical address map-pings and checks each memory reference that misses the actual TLB to determine if it would have also missed the larger, simu-lated one. Thus, Tapeworm maintains a strict inclusion property between the actual and simulated TLBs. Tapeworm controls the actual TLB management policies by supplying placement and replacement functions called by the operating system miss han-dlers. It can simulate TLBs with fewer entries than the actual TLB by providing a placement function that never utilizes certain slots in the actual TLB. Tapeworm uses this same technique to restrict the associativity of the actual TLB 1. By combining these policy functions with adherence to the inclusion property, Tapeworm can simulate the performance of a wide range of different-sized TLBs with different degrees of associativity and a variety of placement and replacement policies.mately the same amount of time to run (100-200 seconds under Mach 3.0).All measurements cited are the average of three runs.The Tapeworm design avoids many of the problems with trace-driven TLB simulation cited above. Because Tapeworm is driven by procedure calls within the OS kernel, it does not require address traces at all; the various difficulties with extracting, stor-ing and processing large address traces are completely avoided.Because Tapeworm is invoked by the machine’s actual TLB miss handling code, it considers the impact of all TLB misses whether they are caused by user-level tasks or the kernel itself. The Tape-worm code and data structures are placed in unmapped memory and therefore do not distort simulation results by causing addi-tional TLB misses. Finally, because Tapeworm changes the struc-tural parameters and management policies of the actual TLB, the behavior of the system itself changes automatically, thus avoiding the distortion inherent in fixed traces.3.3Experimental EnvironmentAll experiments were performed on an R2000-based DECstation 3100 (16.7 MHz) running three different base operating systems (Table 1): Ultrix, OSF/1, Mach 3.0. Each of these systems includes a standard UNIX file system (UFS) [16]. Two additionalversions of Mach 3.0 include the Andrew file system (AFS) cache manager [17]. One version places the AFS cache manager in the Mach Unix Server while the other migrates the AFS cache man-ager into a separate server task.To obtain measurements, all of the operating systems were instrumented with counters and markers. For TLB simulation,Tapeworm was imbedded in the OSF/1 and Mach 3.0 kernels.Because the standard TLB handlers for OSF/1 and Mach 3.0implement somewhat different management policies, we modified OSF/1 to implement the same policies as Mach 3.0.Throughout the paper we use the benchmarks listed in Table 1.The same benchmark binaries were used on all of the operating systems. Each measurement cited in this paper is the average of three trials.1. 
The actual R2000 TLB is fully-associative, but varying degrees of associativity can be emulated by using certain bits of a mapping’s vir-tual page number to restrict the slot (or set of slots) into which the mapping may be placed.4OS Impact on Software-Managed TLBsOperating system references have a strong influence on TLB per-formance. Yet, few studies have examined these effects, with mostconfined to a single operating system [3, 5]. However, differences between operating systems can be substantial. To illustrate this point, we ran our benchmark suite on each of the operating sys-tems listed in Table 1. The results (Table 2) show that although the same application binaries were run on each system, there is signif-icant variance in the number of TLB misses and total TLB service time. Some of these increases are due to differences in the func-tionality between operating systems (i.e. UFS vs. AFS). Other increases are due to the structure of the operating systems. For example, the monolithic Ultrix spends only 11.82 seconds han-dling TLB misses while the microkernel-based Mach 3.0 spends 80.01 seconds.Notice that while the total number of TLB misses increases 4fold (from 9,177,401 to 36,639,834 for AFSout), the total time spent servicing TLB misses increases 11.4 times. This is due to the fact that software-managed TLB misses fall into different cat-egories, each with its own associated cost. For this reason, it is important to understand page table structure, its relationship to TLB miss handling and the frequencies and costs of different types of misses.4.1Page Tables and Translation HardwareOSF/1 and Mach 3.0 both implement a linear page table structure (Figure 2). Each task has its own level 1 (L1) page table, which is maintained by machine-independent pmap code [18]. Because the user page tables can require several megabytes of space, they are themselves stored in the virtual address space. This is supported through level 2 (L2 or kernel) page tables, which also map other kernel data. Because kernel data is relatively large and sparse, the L2 page tables are also mapped. This gives rise to a 3-level page table hierarchy and four different page table entry (PTE) types.The R2000 processor contains a 64-slot, fully-associative TLB, which is used to cache recently-used PTEs. When thevant PTE must be held by the TLB. If the PTE is absent, the hard-ware invokes a trap to a software TLB miss handling routine that finds and inserts the missing PTE into the TLB. The R2000 sup-ports two different types of TLB miss vectors. The first, called the user TLB (uTLB) vector, is used to trap on missing translations for L1U pages. This vector is justified by the fact TLB misses on L1U PTEs are typically the most frequent [3]. All other TLB miss types (such as those caused by references to kernel pages, invalid pages or read-only pages) and all other interrupts and exceptionsFor the purposes of this study, we define TLB miss types (Table 3) to correspond to the page table structure implemented by OSF/1 and Mach 3.0. In addition to L1U TLB misses, we define five subcategories of kernel TLB misses (L1K, L2, L3,modify and invalid). Table 3 also shows our measurements of the time required to handle the different types of TLB misses. The wide differential in costs is primarily due to the two different miss vectors and the way that the OS uses them. L1U PTEs can be retrieved within 16 cycles because they are serviced by a highly-tuned handler inserted at the uTLB vector. 
However, all other miss types require from about 300 to over 400 cycles because they are serviced by the generic handler residing at the generic excep-tion vector.The R2000 TLB hardware supports partitioning of the TLB into two sets of slots. The lower partition is intended for PTEs with high retrieval costs, while the upper partition is intended to hold more frequently-used PTEs that can be re-fetched quickly (e.g. L1U) or infrequently-referenced PTEs (e.g L3). The TLB hardware also supports random replacement of PTEs in the upper partition through a hardware index register that returns random numbers in the range 8 to 63. This effectively fixes the TLB parti-tion at 8, so that the lower partition consists of slots 0 through 7,while the upper partition consists of slots 8 through 63.4.2OS Influence on TLB PerformanceIn the operating systems studied, there are three basic factors which account for the variation in the number of TLB misses and their associated costs (Table 4 & Figure 3). The central issues are (1) the use of mapped memory by the kernel (both for page tables and other kernel data structures), (2) the placement of functional-ity within the kernel, within a user-level server process (service migration) or divided among several server processes (OS decom-position) and (3) the range of functionality provided by the system (additional OS services). The rest of Section 4 uses our data to examine the relationship between these OS characteristics and TLB performance.4.2.1Mapping Kernel Data StructuresMapping kernel data structures adds a new category of TLB misses: L1K misses. In the MIPS R2000 architecture, an increase in the number of L1K misses can have a substantial impact on TLB performance because each L1K miss requires several hun-dred cycles to service 1 .Ultrix places most of its data structures in a small, fixed por-tion of unmapped memory that is reserved by the OS at boot time.However, to maintain flexibility, Ultrix can draw upon the much larger virtual space if it exhausts this fixed-size unmapped mem-ory. Table 5 shows that few L1K misses occur under Ultrix.In contrast, OSF/1 and Mach 3.0 2 place most of their kernel data structures in mapped virtual space, forcing them to rely heavily on the TLB. Both OSF/1 and Mach 3.0 mix the L1K PTEs and L1U PTEs in the TLB’s 56 upper slots. This contention pro-duces a large number of L1K misses. Further, handling an L1K miss can result in an L3 miss 3 . In our measurements, OSF/1 and Mach 3.0 both incur more than 1.5 million L1K misses. OSF/1must spend 62% of its TLB handling time servicing these misses while Mach 3.0 spends 37% of its TLB handling time servicing L1K misses.4.2.2Service MigrationIn a traditional operating system kernel such as Ultrix or OSF/1(Figure 3), all OS services reside within the kernel, with only the kernel’s data structures mapped into the virtual space. Many of these services, however, can be moved into separate server tasks,increasing the modularity and extensibility of the operating sys-tem [8]. For this reason, numerous microkernel-based operating systems have been developed in recent years (e.g. Chorus [19],Mach 3.0 [4], V [20]).By migrating these services into separate user-level tasks,operating systems like Mach 3.0 fundamentally change the behav-ior of the system for two reasons. First, moving OS services into user space requires both their program text and data structures to be mapped. Therefore, they must share the TLB with user tasks,possibly conflicting with the user tasks’ TLB footprints. 
Compar-ing the number of L1U misses in OSF/1 and Mach 3.0, we see a 2.2 fold increase from 9.8 million to 21.5 million. This is directly due to moving OS services into mapped user space. The second change comes from moving OS data structures from mapped ker-nel space to mapped user space. In user space, the data structures are mapped by L1U PTEs which are handled by the fast uTLB handler (20 cycles for Mach 3.0). In contrast, the same data struc-tures in kernel space are mapped by L1K PTEs which are serviced by the general exception (294 cycles for Mach 3.0).4.2.3Operating System DecompositionMoving OS functionality into a monolithic UNIX server does not achieve the full potential of a microkernel-based operating sys-tem. Operating system functionality can be further decomposed into individual server tasks. The resulting system is more flexible and can provide a higher degree of fault tolerance.Unfortunately, experience with fully decomposed systems has shown severe performance problems. Anderson et al. [8] com-pared the performance of a monolithic Mach 2.5 and a microker-nel Mach 3.0 operating system with a substantial portion of the file system functionality running as a separate AFS cache manager task. Their results demonstrate a significant performance gap1. From 294 to 355 cycles, depending on the operating system (Table 3).2. Like Ultrix, Mach3.0 reserves a portion of unmapped space for dynamic allocation of data structures. However, it appears that Mach 3.0 quickly uses this unmapped space and must begin to allocate mapped memory. Once Mach 3.0 has allocated mapped space, it does not distinguish between mapped and unmapped space despite their dif-fering costs.3. L1K PTEs are stored in the mapped, L2 page tables (Figure 2).between the two systems with Mach 2.5 running 36% faster than Mach 3.0, despite the fact that only a single additional server task is used. Later versions of Mach 3.0 have overcome this perfor-mance gap by integrating the AFS cache manager into the UNIX Server.We compared our benchmarks running on the Mach3+AFSin system, against the same benchmarks running on the Mach3+AFSout system. The only structural difference between the systems is the location of the AFS cache manager. The results (Table 5) show a substantial increase in the number of both L2 and L3 misses. Many of the L3 misses are due to missing mappings needed to service L2 misses.The L2 PTEs compete for the R2000’s 8 lower TLB slots. Yet, the number of slots required is proportional to the number of tasks concurrently providing an OS service. As a result, adding just a single, tightly-coupled service task overloads the TLB’s ability to map L2 page tables. Thrashing results. This increase in L2 misses will grow ever more costly as systems continue to decompose ser-vices into separate tasks.4.2.4Additional OS FunctionalityIn addition to OS decomposition and migration, many systems provide supplemental services (e.g. X, AFS, NFS, Quicktime). Each of these services, when interacting with an application, can change the operating system behavior and how it interacts withthe TLB hardware.For example, adding a distributed file service (in the form of an AFS cache manager) to the Mach 3.0 Unix server adds 10.39 sec-onds to the L1U TLB miss handling time (Table 6). This is due solely to the increased functionality residing in the Unix server. However, L1K misses also increase, adding 14.3 seconds. These misses are due to the additional management the Mach 3.0 kernel must provide for the AFS cache manager. 
Increased functionality will have an important impact on how architectures support oper-ating systems and to what degree operating systems can increase and decompose functionality.5Improving TLB PerformanceIn this section, we examine hardware-based techniques for improving TLB performance under the operating systems ana-lyzed in the previous section. However, before suggesting changes, it is helpful to consider the motivations behind the design of the R2000 TLB.The MIPS R2000 TLB design is based on two principal assumptions [3]. First, L1U misses are assumed to be the most frequent (> 95%) of all TLB miss types. Second, all OS text and most of the OS data structures (with the exception of user page tables) are assumed to be unmapped. The R2000 TLB design reflects these assumptions by providing two types of TLB miss vectors: the fast uTLB vector and the much slower general excep-tion vector (described in Section 4.1). These assumption are also reflected in the partitioning of the 64 TLB slots into two disjoint sets of 8 lower slots and 56 upper slots (also described previ-ously). The 8 lower slots are intended to accommodate a tradi-tional UNIX task (which requires at least three L2 PTEs) and UNIX kernel (2 PTEs for kernel data), with three L2 PTEs left for additional data segments [3].Our measurements (Table 5) demonstrate that these design choices make sense for a traditional UNIX operating system such as Ultrix. For Ultrix, L1U misses constitute 98.3% of all misses. The remaining miss types impose only a small penalty. However,these assumptions break down for the OSF/1- and Mach 3.0-based systems. In these systems, the non-L1U misses account for the majority of time spent handling TLB misses. Handling these misses substantially increases the cost of software-TLB manage-ment (Table 6).The rest of this section proposes and explores four hardware-based improvements for software-managed TLBs. First, the cost of certain types of TLB misses can be reduced by modifying the TLB vector scheme. Second, the number of L2 misses can be reduced by increasing the number of lower slots1. Third, the fre-quency of most types of TLB misses can be reduced if more total TLB slots are added to the architecture. Finally, we examine the tradeoffs between TLB size and associativity.Throughout these experiments, software policy issues do not change from those originally implemented in Mach 3.0. The PTE replacement policy is FIFO for the lower slots and Random for the upper slots. The PTE placement policy stores L2 PTEs in the lower slots and all other PTEs in the upper slots. The effectiveness of these and other software-based techniques are examined in a related work [21].5.1Additional TLB Miss VectorsThe data in Table 5 show a significant increase in L1K misses for OSF/1 and Mach 3.0 when compared against Ultrix. This increase is due to both systems reliance on dynamic allocation of kernel mapped memory. The R2000’s TLB performance suffers, how-ever, because L1K misses must be handled by the costly generic exception vector which requires 294 cycles (Mach 3.0).To regain the lost TLB performance, the architecture could vector all L1K misses through the uTLB handler, as is done in the newer R4000 processor. Based on our timing and analysis of the 1. The newer MIPS R4000 processor [1] implements both of thesechanges.。
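The arithmetic behind this kind of accounting—miss counts per category multiplied by per-category handler costs—can be made concrete with a short calculation. The sketch below uses assumed miss counts, cycle costs and clock rate chosen only to illustrate the bookkeeping in the spirit of Tables 3 and 4; they are not the figures measured in the study.

```c
/* Sketch: estimating total TLB miss service time from per-category
 * miss counts and handler costs.  All counts, cycle costs and the
 * clock rate are illustrative placeholders, not measured values. */
#include <stdio.h>

enum miss_type { L1U, L1K, L2, L3, MODIFY, INVALID, NUM_TYPES };

int main(void) {
    const char *name[NUM_TYPES] = { "L1U", "L1K", "L2", "L3", "modify", "invalid" };
    /* Handler cost in cycles: L1U uses the fast uTLB vector; the rest go
     * through the much slower generic exception vector. (assumed) */
    double cycles[NUM_TYPES]    = { 16, 294, 294, 294, 375, 336 };
    double count[NUM_TYPES]     = { 21.5e6, 1.5e6, 0.5e6, 0.3e6, 0.1e6, 0.1e6 }; /* assumed */
    double clock_hz = 16.67e6;  /* an R2000-class clock rate (assumed) */

    double total_cycles = 0.0;
    for (int t = 0; t < NUM_TYPES; t++) {
        double c = count[t] * cycles[t];
        total_cycles += c;
        printf("%-8s %12.0f misses x %5.0f cycles = %14.0f cycles\n",
               name[t], count[t], cycles[t], c);
    }
    printf("total TLB service time: %.2f s\n", total_cycles / clock_hz);
    return 0;
}
```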

Tiered memory ("memory分离式方案")

As computer technology develops, the demand for computing performance keeps rising, and in fields such as big data and artificial intelligence the demand for memory in particular keeps growing. To meet this demand, researchers have proposed a technique referred to here as the tiered (disaggregated) memory scheme.

In a tiered memory scheme, the computer's memory is divided into multiple levels, each with its own characteristics and performance. An appropriate memory level can then be chosen for each application scenario and requirement, providing more efficient computation and storage.

In a tiered memory scheme, memory is usually divided into three levels: the L1 cache, main memory, and auxiliary storage. The L1 cache is a high-speed cache inside the CPU; its access time is very short but its capacity is small. Main memory is the larger memory space of the machine; it holds more data but is comparatively slow to access. Auxiliary storage refers to external devices such as hard disks and solid-state drives, which offer far larger capacity at an even slower access time.

Data accesses proceed level by level. The CPU first looks for the required data in the L1 cache; if it is not there, the data is fetched from main memory, and if main memory does not hold it either, it must be read from auxiliary storage. In this way the characteristics of the different levels are exploited to improve overall system performance.
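This level-by-level lookup can be summarized as an effective (average) access time. The following sketch is a minimal illustration; the hit rates and per-level latencies are assumed values chosen only to show the calculation.

```c
/* Sketch: effective access time of a three-level hierarchy
 * (L1 cache -> main memory -> auxiliary storage).
 * Hit rates and latencies are assumed, illustrative numbers. */
#include <stdio.h>

int main(void) {
    double l1_hit = 0.95, mem_hit = 0.99;                  /* assumed hit rates */
    double t_l1 = 2e-9, t_mem = 80e-9, t_disk = 100e-6;    /* assumed latencies (s) */

    /* A reference is served by the first level that holds the data. */
    double t_eff = l1_hit * t_l1
                 + (1 - l1_hit) * mem_hit * (t_l1 + t_mem)
                 + (1 - l1_hit) * (1 - mem_hit) * (t_l1 + t_mem + t_disk);

    printf("effective access time: %.1f ns\n", t_eff * 1e9);
    return 0;
}
```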

Beyond raw performance, a tiered memory scheme has other advantages. First, it saves cost: L1 cache is expensive per byte, while main memory and auxiliary storage are comparatively cheap, so a sensible configuration of the levels can hold cost down while preserving performance. Second, it improves energy efficiency: because the L1 cache responds quickly, the CPU spends less time waiting for data.

However, tiered memory also brings challenges and limitations. First, it requires complex hardware design and software optimization: with several memory levels involved, suitable caching mechanisms and access policies must be designed, along with the corresponding software algorithms. Second, the division into levels and the migration of data between them must be planned carefully, because moving data across levels costs time and resources and has to be managed. Overall, tiered memory is an important technique for improving both the performance and the energy efficiency of computer systems.
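One of the planning questions raised above—when it pays to migrate data between tiers—can be phrased as a simple cost/benefit check. The sketch below is illustrative only; the block size, migration bandwidth, latencies and predicted access count are assumptions, not recommended settings.

```c
/* Sketch: a cost/benefit check for promoting a block of data from a
 * slower tier to a faster one.  All numbers are assumed placeholders. */
#include <stdio.h>

int main(void) {
    double block_bytes   = 4096;     /* size of the candidate block (assumed) */
    double bw_migrate    = 200e6;    /* migration bandwidth, bytes/s (assumed) */
    double t_slow        = 50e-6;    /* access time on the slow tier, s (assumed) */
    double t_fast        = 80e-9;    /* access time on the fast tier, s (assumed) */
    double expected_hits = 500;      /* predicted future accesses to the block (assumed) */

    double migrate_cost = block_bytes / bw_migrate;          /* time to move the block */
    double saving       = expected_hits * (t_slow - t_fast); /* time saved if promoted */

    if (saving > migrate_cost)
        printf("promote: saves %.1f us after paying %.1f us to migrate\n",
               (saving - migrate_cost) * 1e6, migrate_cost * 1e6);
    else
        printf("keep in place: migration would not pay off\n");
    return 0;
}
```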

Storage HCIP Question Bank (with reference answers)

Part I: Single-choice questions (38 questions, 1 point each, 38 points in total)

1. Which of the following statements about LAN-free and Server-free backup networking is correct?
A. During a Server-free backup a copy of the data is generated, and that copy is mounted to the backup server. B. Server-free networking can back up data directly to the backup media. C. LAN-free backup networking has no impact on the production host. D. In LAN-free backup, the backup data flow path is production storage -> backup server -> backup media.
Correct answer: A

2. Which of the following is NOT an advantage brought by a global cache?
A. Better overall system performance B. Higher hit ratio for cached data access C. Better cache space utilization D. Better system security and reliability
Correct answer: D

3. In the CLI management mode of Huawei OceanStor storage, the command for viewing controller status is:
A. showsys B. showcotrolinfo C. showsystemgeneral D. showcotrollergeneral
Correct answer: D

4. Which of the following statements about OceanStor 9000 maintenance is incorrect?
A. The system provides the admin (super administrator) account by default; this administrator account is not allowed to be … B. Only a super administrator user can run the createsystemuser command to create users. C. The information of all accounts can be viewed with the command cat /etc/passwd. D. The ps -ef command checks TCP services/ports.
Correct answer: D

5. Which of the following cables is not supported by the OceanStor V3 18800 product?
A. mini SAS cable B. SAS cable C. AOC cable D. FC cable
Correct answer: B

6. Which of the following scenarios is the Huawei OceanStor V3 SmartCache feature suitable for?
A. Random small-I/O write scenarios B. Random small-I/O read scenarios C. Sequential-write I/O scenarios D. Sequential-read I/O scenarios
Correct answer: B

7. In the big data era, upgrades to traditional technology can no longer satisfy big data processing needs. Which of the following is NOT a direction of technology development in the big data era?
A. Development toward virtualization B. Networks developing toward higher speed, lower protocol overhead and greater efficiency C. Moving from non-relational databases toward relational databases D. Computing developing toward clusters
Correct answer: C

8. Through which file system does Huawei OceanStor 9000 provide unified service access?
A. NTFS B. HDFS C. WushanFS D. EXT3
Correct answer: C

9. In OceanStor 9000, which of the following statements about the WushanFS global cache technology is incorrect?
A. The GlobalCache of WushanFS logically combines the memory space of all storage servers into a single memory resource pool. B. Data in one node's cache cannot be hit by read/write services on other nodes. C. The global cache technology helps improve memory resource sharing across nodes. D. WushanFS uses distributed locks to manage global cache data; a given piece of service data is cached on only one node, and when another node needs to access that data it obtains the cached copy by acquiring the lock.
Correct answer: B

10. In OceanStor V3 storage, which of the following authentication methods is NOT supported by NFS?
A. NIS B. LDAP C. Local authentication D. AD domain
Correct answer: D

11. Which description of the Huawei OceanStor 9000 software modules is incorrect?
A. OBS (Object-Based Store) provides reliable object storage for file system metadata and file data. B. CA (ClientAgent) parses the semantics of application protocols such as NFS/CIFS/FTP and passes the requests to lower-layer modules for processing. C. Value-added features such as snapshot, tiered storage and remote replication are provided by the PVS module. D. MDS (MetaDataService) manages file system metadata; every node in the system stores all of the metadata.
Correct answer: D

12. In Huawei OceanStor 9000 (C30), which of the following features does NOT require activation of a corresponding license?
A. Automatic storage tiering B. Quota management C. Directory snapshot D. Round-robin load balancing
Correct answer: D

13. Among the software features of OceanStor 9000, which feature provides file system snapshots?
A. InfoEqualizer B. InfoStamper C. InfoProtector D. InfoLocker
Correct answer: B

14. A colleague is designing an active-active disaster recovery solution. Testing shows that the bandwidth from the quorum server to the storage systems at both sites is 4 Mbit/s, the latency (RTT) is ≤ 40 ms, and the packet loss rate is ≤ 1%. This solution is:
A. Not feasible, because active-active DR requires a latency RTT ≤ 120 ms B. Not feasible, because active-active DR requires a packet loss rate ≤ 0.1%.

Cache Memory Optimization and Configuration Recommendations (Part 8)

In computer systems, the cache plays a crucial role. It is a level of the memory hierarchy that sits between the processor and main memory and holds the data and instructions the processor has accessed most recently. By keeping data and instructions in the cache, accesses are served much faster and overall system performance improves. However, optimizing and configuring a cache well is a complex problem; below are some optimization approaches and configuration recommendations.

1. Cache size and hit rate

The size of the cache has a significant effect on system performance. In general, the larger the cache, the higher the cache hit rate and the better the system performs. However, a larger cache also costs more and consumes more power, so the chosen size must balance performance requirements against cost. A common practice is to determine a suitable cache size through experiment and analysis, based on the application's memory access characteristics and the available budget.

2. Cache replacement policies

When the cache is full but new data or instructions must be cached, an existing entry has to be replaced. Common replacement policies include least recently used (LRU), first-in first-out (FIFO), and random replacement. In practice, choosing a suitable replacement policy is crucial for raising the cache hit rate. LRU is among the most widely used policies: based on the history of use, it evicts the data or instructions that have gone unused for the longest time. Because it exploits the locality of reference in programs, it tends to achieve a good hit rate.
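To make the LRU idea concrete, here is a minimal sketch of a tiny, fully associative cache managed with LRU replacement; the number of entries and the access trace are arbitrary illustrative choices.

```c
/* Sketch: LRU replacement for a tiny fully associative cache.
 * Entry count and the access trace are illustrative only. */
#include <stdio.h>

#define ENTRIES 4

typedef struct { long tag; long last_used; int valid; } line_t;

static line_t cache[ENTRIES];
static long now = 0, hits = 0, misses = 0;

static void access_block(long tag) {
    int victim = 0;
    now++;
    for (int i = 0; i < ENTRIES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {   /* hit */
            cache[i].last_used = now;
            hits++;
            return;
        }
        /* Track an invalid line, or the least recently used one, as the victim. */
        if (!cache[i].valid ||
            (cache[victim].valid && cache[i].last_used < cache[victim].last_used))
            victim = i;
    }
    misses++;                                          /* miss: replace the LRU line */
    cache[victim].tag = tag;
    cache[victim].valid = 1;
    cache[victim].last_used = now;
}

int main(void) {
    long trace[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3 };   /* illustrative block addresses */
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        access_block(trace[i]);
    printf("hits=%ld misses=%ld\n", hits, misses);
    return 0;
}
```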

3. Trading associativity against access latency

Cache associativity describes how much freedom there is in choosing where a given block of data or instructions may be placed in the cache. The usual organizations are direct-mapped, fully associative, and set-associative, and they differ in both performance and implementation difficulty. A direct-mapped cache is the simplest form: a mapping function assigns each block to exactly one cache location. It is easy to implement, but conflicts arise easily, which lowers the hit rate. A fully associative cache is the ideal case, but also the most expensive. A set-associative cache is the compromise between the two: the cache is divided into a number of sets, each holding several blocks, which improves the hit rate to a useful extent.
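The practical difference between these organizations is how an address is split into block offset, set index, and tag. The sketch below shows that split for an assumed 32 KB, 4-way set-associative cache with 64-byte lines; all of the sizes are illustrative assumptions. A direct-mapped cache corresponds to one way per set, and a fully associative cache to a single set.

```c
/* Sketch: splitting an address into offset, set index and tag for an
 * assumed 32 KB, 4-way set-associative cache with 64-byte lines. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t cache_bytes = 32 * 1024;   /* assumed */
    const uint64_t line_bytes  = 64;          /* assumed */
    const uint64_t ways        = 4;           /* assumed */
    const uint64_t sets        = cache_bytes / (line_bytes * ways);

    uint64_t addr   = 0x7ffd1234abcdULL;      /* example address */
    uint64_t offset = addr % line_bytes;
    uint64_t set    = (addr / line_bytes) % sets;
    uint64_t tag    = addr / (line_bytes * sets);

    printf("sets=%llu  addr=%#llx -> tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)sets, (unsigned long long)addr,
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}
```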


To appear IEEE Transactions on Computers, 1995.Measuring Cache and TLB Performance and Their Effect on Benchmark Run Times†Rafael H. Saavedra‡ ‡ Alan Jay Smith‡ABSTRACTIn previous research, we have developed and presented a model for measuring machines and analyzing programs, and for accurately predicting the running time of any analyzed program on any measured machine. That work is extended here by: (a) developing a high level program to measure the design and performance of the cache and TLB for any machine; (b) using those measurements, along with published miss ratio data, to improve the accuracy of our run time predictions; (c) using our analysis tools and measurements to study and compare the design of several machines, with particular reference to their cache and TLB performance. As part of this work, we describe the design and performance of the cache and TLB for ten machines. The work presented in this paper extends a powerful technique for the evaluation and analysis of both computer systems and their workloads; this methodology is valuable both to computer users and computer system designers.1. Introduction The performance of a computer system is a function of the speed of the individual functional units, such as the integer, branch, floating-point units, caches, bus, memory system, I/O units, and of the workload presented to the system. In our previous research [Saav89, 92b, 92c], described briefly below, we have measured the performance of the parts of the CPU on corresponding portions of various workloads, but this work has not explicitly considered the† The material presented here is based on research supported principally by NASA under grant NCC2-550, and also in part by the National Science Foundation under grants MIP-8713274, MIP-9116578 and CCR-9117028, by the State of California under the MICRO program, and by Sun Microsystems, Mitsubishi Electric Research Laboratories, Philips Laboratories/Signetics, Apple Computer Corporation, Intel Corporation, Digital Equipment Corporation, and IBM. ‡ Computer Science Department, Henry Salvatori Computer Science Center, University of Southern California, Los Angeles, California 90089-0781 (e-mail: saavedra@). ‡ Computer Science Division, EECS Department, University of California, Berkeley, California 94720. ‡2 behavior and performance of the cache memory. It is well known (see e.g. [Smit82]) that caches are a critical component of any high performance computer system, and that access time to the cache and the misses from the cache are frequently the single factor most constraining performance. In this paper we extend our work on machine characterization and performance prediction to include the effect of cache memories and cache memory misses. Our research in the area of performance evaluation has focused on developing a uniprocessor machine-independent model (the Abstract Machine Model) of program execution to characterize machine and application performance, and the effectiveness of compiler optimization. In previous papers we have shown that we can measure the performance of a CPU on various abstract operations, and can separately measure the frequency of these operations in various workloads. By combining these separate measurements, we can make fairly accurate estimates of execution times for arbitrary machine/program combinations [Saav89, Saav92abc]. Our technique allows us to identify those operations, either on the machine or in the programs, which dominate the benchmark results. 
This information helps designers to improve the performance of future machines, and users to tune their applications to better utilize the performance of existing machines. Recently, the abstract machine concept was used by Culler et al. to evaluate the mechanisms for fine-grained parallelism in the J-machine and CM-5 [Sper93]. The model presented in the previous papers omitted any consideration of TLB and cache misses, i.e. program locality. Our measurement technique involves the timing of operations executed repeatedly within small loops; in such cases, few cache and TLB misses are encountered. Thus for workloads with high miss ratios, that technique will underestimate run times. Our results on the SPEC and Perfect benchmarks as reported in [Saav92] do not show large errors because the locality on most of these programs is relatively high [Pnev90, GeeJ93]. In this paper we deal with the issue of locality and incorporate this factor in our performance model. We straightforwardly extend our basic model to include a term which accounts for the time delay experienced by a program as a result of bringing data to the processor from different levels and components of the memory hierarchy. We focus on characterizing cache and TLB units by running experiments which measure their most important parameters, such as cache and TLB size, miss penalty, associativity and line (page) size; we then present cache and TLB measurements for a variety of machines. These measurements are then combined with published cache and TLB miss ratios for the SPEC benchmarks to compute the delay experienced by these programs as a result of the cache and TLB misses. These new results are then used to evaluate how much our execution time predictions for the SPEC benchmarks improve when we incorporate these memory delays. We find that the prediction errors decrease in most3 of the programs, although the improvement is modest. The SPEC (Fortran) benchmark workload is then used to evaluate the effect of memory delay in the overall performance of various machines. Finally, we discuss in some detail the performance differences between the caches and TLBs of four machines based on the same family of processors. We find that the relative benchmark results on these machines is explained by their clock rates and memory systems. This paper is organized as follows: Section 2 contains a brief discussion of the Abstract Machine Performance model. Section 3 presents our approach to characterizing the memory hierarchy and the experimental methodology followed throughout the paper. The experimental results are presented in Section 4. The effect of locality in the SPEC benchmarks is contained in Section 5, followed by a discussion of the results in Section 6. 2. Background Material We have developed a performance model based on the concept of the abstract machine that allows us to characterize the performance of the CPU, predict the execution time of uniprocessor applications, and evaluate the effectiveness of compiler optimizations. In this section we briefly discuss and explain this model. 2.1. The Abstract Machine Performance Model We call the approach we have used for performance evaluation the abstract machine performance model. The idea is that every machine is modeled as and is considered to be a high level language machine that executes the primitive operations of Fortran. 
We have used Fortran for three reasons: (a) Most standard benchmarks and large scientific programs are written in Fortran; (b) Fortran is relatively simple to work with; (c) Our work is funded by NASA, which is principally concerned with the performance of high end machines running large scientific programs written in Fortran. Our methodology applies as well to other similar high level languages such as C, Ada, or Modula-3. There are three basic parts to our methodology. In the first part, we analyze each physical machine by measuring the execution time of each primitive Fortran operation on that machine. Primitive operations include things like add-real-single-precision, store-single-precision, etc; the full set of operations is presented in [Saav89, 92a]. Measurements are made by using timing loops with and without the operation being be measured. Such measurements are complicated by the fact that some operations are not separable from other operations (e.g. store) at the source (Fortran) language level, and that it is very difficult to get precise values in the presence of noise4 (e.g. cache misses, task switching) and low resolution clocks [Saav89, 92a]. We have also called this machine analysis phase narrow spectrum benchmarking. This approach, of using the abstract machine model, is extremely powerful, since it saves us from considering the peculiarities of each machine, as would be done in an analysis at the machine instruction level [Peut77]. The second part of our methodology is to analyze Fortran programs. This analysis has two parts. In the first, we do a static parsing of the source program and count the number of primitive operations per line. In the second, we execute the program and count the number of times each line is executed. From those two sets of measurements, we can determine the number of times each primitive operation is executed in an execution of the entire program. The third part of our methodology is to combine the operation times and the operation frequencies to predict the running time of a given program on a given machine without having run that program on that machine. As part of this process, we can determine which operations account for most of the running time, which parts of the program account for most of the running time, etc. In general, we have found our run time predictions to be remarkably accurate [Saav92a, 92b]. We can also easily estimate the performance of hypothetical machines (or modifications of existing machines) on a variety of real or proposed workloads by replacing measured parameters in our models with proposed or hypothetical ones. It is very important to note and explain that we separately measure machines and programs, and then combine the two as a linear model. We do not do any curve fitting to improve our predictions. The feedback between prediction errors and model improvements is limited to improvements in the accuracy of measurements of specific parameters, and to the creation of new parameters when the lumping of different operations as one parameter were found to cause unacceptable errors. The curve fitting approach has been used and has been observed to be of limited accuracy [Pond90]. The main problems with curve-fitting are that the parameters produced by the fit have no relation to the machine and program characteristics, and they tend to vary widely with changes in the input data. 
In [Saav89] we presented a CPU Fortran abstract machine model consisting of approximately 100 abstract operations and showed that it was possible to use it to characterize the raw performance of a wide range of machines ranging from workstations to supercomputers. These abstract operations were also combined into a set of reduced parameters, each of which was associated with the performance of a specific CPU functional unit. The use of such reduced parameters permitted straightforward machine to machine comparisons. In [Saav92a,b] we studied the characteristics of the SPEC, Perfect Club and other common benchmarks using the same abstract machine model and showed that it is possible to predict the execution time of arbitrary programs on a large number of machines. Our results were successful in accurately predicting 'inconsistent' machine performance, i.e. that machine A is faster than B for program x, but slower for program y. Both of these studies assumed that programs were compiled and executed without optimization. In [Saav92c] we extended our model to include the effect of (scalar) compiler optimization. It is very difficult to predict which optimizations will be performed by a compiler and also to predict their performance impact. We found, surprisingly, that we could model the performance improvement due to optimization as an improvement in the implementation of the abstract machine (an "optimized" machine) while assuming that the behavior of the program remains unchanged. We showed that it is possible to accurately predict the execution time of optimized programs in the large majority of cases, with accuracy that was only slightly less than that for unoptimized code.

2.2. Adding Locality to the Abstract Machine Model

The variations in execution time due to changes in locality are not captured by our performance model, which ignores how the stream of references affects the content of both the cache and the TLB. This is a direct consequence of using a linear model, and it is clearly expressed in the following equation

    T_{A,M} = \sum_{i=1}^{n} C_{i,A} P_{i,M}    (1)

where T_{A,M} is the total execution time of the program, C_{i,A} is the number of times operation i is executed by program A, and P_{i,M} is the execution time of parameter i on machine M. Equation (1) does not include a term to account for cache and TLB misses. Nevertheless, we have found that with a few exceptions (e.g. MATRIX300 without use of a blocking preprocessor), our predictions have been quite good. This has been the case because most of the programs that have been analyzed (almost all of which are standard benchmarks) have relatively low miss ratios. It is straightforward to extend equation (1) to include cache and TLB misses (and/or misses at any other level of the memory hierarchy):

    T_{A,M} = \sum_{i=1}^{n} C_{i,A} P_{i,M} + \sum_{i=1}^{m} F_{i,A} D_{i,M}    (2)

where F_{i,A} (faults) is the number of misses at level i of the memory hierarchy, and D_{i,M} (delay) is the penalty paid by the respective miss. How many levels of the memory hierarchy exist varies between machines, but in most machines there are one or two levels of caches, a TLB¹, main memory, and disk. In order to use equation (2) we need: 1) to measure the number of misses at each level of the hierarchy, or at least on those levels which significantly affect the execution time, and 2) to measure the set of penalties due to different types of misses. Measurement of the number of misses by a given program for a given memory hierarchy can be done either by trace driven simulation (see e.g.
[Smit82, 85]) or by hardware measurement. The former can be extremely time consuming for any but the shortest programs ([Borg90, GeeJ93]), and the latter requires both measurement tools (a hardware monitor or logical analyzer) and access to the necessary electrical signals. This measurement of miss ratios, however, is beyond the scope of this paper; we are principally concerned here with analysis of the memory hierarchy and performance prediction. We rely on measurements taken by others [GeeJ93] for the miss ratios used in this paper. 3. Characterizing the Performance of the Cache and TLB We have written a set of experiments (narrow spectrum benchmarks or micro benchmarks) to measure the physical and performance characteristics of the memory hierarchy in uniprocessors, in particular, the primary and secondary (data) caches and the TLB. Each experiment measures the average time per iteration required to read, modify, and write a subset of the elements belonging to an array of a known size. The number of misses will be a function of the size of the array and the stride between consecutive addresses referenced. From the number of misses and the number of references, as we vary the stride and array size, we can compute the relevant memory hierarchy parameters, including the size of the data cache and the TLB, the size of a cache line and the granularity of a TLB entry, the time needed to satisfy a cache or TLB miss, and the cache and TLB associativity. Other parameters such as the number of sets in the cache or entries in the TLB are obtained easily from the above parameters. Note that this technique only permits us to measure the characteristics of the data (or unified) cache; measuring the performance of an instruction cache would suggest the use of jump tables and/or self-modifying code. The results in [GeeJ93] show that instruction misses account for very little performance loss for the SPEC benchmarks, and we do not further consider that issue here.1 The TLB is not a level in the memory hierarchy, but it is a high-speed buffer which maintains recently used virtual and real memory address pairs [Smit82]. It is a level in the process of accessing the memory hierarchy, however, and to simplify our discussion in the rest of the paper we refer to it as part of the memory hierarchy. Doing this does not affect in any way our methodology or conclusions.7 At least one previous study used a similar, but much simpler technique to measure the cache miss penalty, although the measurement was made at the machine instruction level, not using a high level language program. Peuto and Shustek [Peut77] wrote an assembly language loop which generated a predictable number of cache misses; from this, they were able to calculate the cache miss penalty for the IBM 3033 and the Amdahl 470V/6. They also determined the effectiveness of the write buffers in the 3033. For both machines, however, they knew the cache design parameters (e.g. cache size) and so didn’t need to deduce them. 3.1. Experimental Methodology We explain how we measure data cache parameters by assuming that there is only one level of the memory hierarchy to measure; to the extent that the characteristics of two levels (e.g. cache and TLB) are sufficiently different, it is straightforward to calculate the parameters of each from these measurements. 
In what follows we assume the existence of separate instruction and data caches, although this is done only to simplify the discussion; the instruction loop that we use is so small that the measurements are virtually identical for a unified cache. We also assume that the data cache is virtually addressable, i.e. that an array which occupies a contiguous region of virtual memory also occupies a contiguous region of physical memory. By making this assumption, it is possible to identify from the array access pattern which cache set, relative to the start of the array, is referenced. For machines with real address caches, however, it is possible that the virtual to real mapping is actually random and that there are mapping conflicts in the cache that would not be present in a virtually addressed cache. Our experience has been that the results we obtain are generally consistent with this assumption. 2 To the extent that the data we present later seems a little "noisy", we believe that it is due to minor failures of this assumption. Assume that a machine has a cache capable of holding C k -byte words, a line size of b words, and an associativity a . The number of sets in the cache is given by C /ab . We also assume that the replacement algorithm is LRU, and that the lowest available address bits are used to select the cache set. For a machine that did not use LRU replacement (e.g. the IBM 3033 [Smit82]) in the cache, this technique would not work without modifications.In practice, we can minimize this problem by running the experiments in a machine that has been recently booted, when the physical memory has not suffered fragmentation. Furthermore, it is always possible, from the measurements, to detect when the results are unreliable due to memory mapping conflicts.28 Each of our experiments consists of computing a simple floating-point function on each of a subset of elements taken from a one-dimensional array of N k -byte elements. We run each experiment several times to eliminate experimental noise [Saav89]. The reason for the (arbitrary) floating point computation is to avoid having a measurement loop which actually does nothing and is therefore eliminated by the compiler optimizer from the program. The subset includes the following elements (by sequence number): 1, s + 1, 2s + 1, ..., N − s + 1. Thus, each experiment is characterized by a particular value of N and s . The stride s allows us to change the rate at which misses are generated by controlling the number of consecutive accesses to the same cache line, page, etc. The magnitude of s varies from 1 to N /2 in powers of two. Computing a new value on a particular element involves first reading the element into the CPU, computing the new value using a simple recursive equation, and writing the result back into the cache. Thus, on each iteration the cache gets two consecutive requests, one read and one write, both having the same address. Of these two requests only the read can generate a cache miss, and it is the time needed to fetch the value for the read that our experiments measure. Note that we are implicitly assuming a blocking cache; i.e. one which halts on a cache miss even if the target of the miss is not immediately used. In the case of non-blocking caches, we would have to run two experiments, one in which the data was used immediately (to determine the miss penalty) and one in which the data was not used immediately (to show the non-blocking nature of the cache). 
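The experiment described above can be sketched directly in C. The fragment below is a minimal, illustrative version of the read-modify-write timing loop over an array of N elements with stride s; the array sizes, repetition count, update formula and timer are arbitrary choices, and a real harness would also have to defeat compiler optimization and subtract loop overhead, as discussed in the text.

```c
/* Sketch: the read-modify-write stride experiment described above.
 * For a given N (array elements) and stride s, touch elements
 * 1, s+1, 2s+1, ..., N-s+1 (here 0-indexed) and report the average
 * time per iteration.  Sizes, repetitions and timer are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double run(double *a, long n, long s, int reps) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (long i = 0; i < n; i += s)
            a[i] = a[i] * 1.000001 + 0.5;   /* read, simple FP update, write back */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return secs / ((double)reps * ((n + s - 1) / s));   /* time per iteration */
}

int main(void) {
    for (long n = 4096; n <= (1L << 22); n *= 2) {       /* array sizes in elements */
        double *a = calloc(n, sizeof *a);
        if (!a) return 1;
        printf("N = %ld elements\n", n);
        for (long s = 1; s <= n / 2; s *= 2)
            printf("  s = %6ld : %.2f ns/iter\n", s, run(a, n, s, 8) * 1e9);
        free(a);
    }
    return 0;
}
```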
Depending on the values of N and s and the size of the cache (C), the line size (b), and the associativity (a), there are four possible regimes of operation; each of these is characterized by the rate at which misses occur in the cache. A summary of the characteristics of the four regimes is given in Table 1.

Regime | Size of Array | Stride          | Frequency of Misses          | Time per Iteration
1      | 1 ≤ N ≤ C     | 1 ≤ s ≤ N/2     | no misses                    | T_no-miss
2.a    | C < N         | 1 ≤ s < b       | one miss every b/s elements  | T_no-miss + Ds/b
2.b    | C < N         | b ≤ s < N/a     | one miss every element       | T_no-miss + D
2.c    | C < N         | N/a ≤ s ≤ N/2   | no misses                    | T_no-miss

Table 1: Cache miss patterns as a function of N and s. No misses are generated when N ≤ C. When N > C, the rate of misses is determined by the stride between consecutive elements. D is the delay penalty.

Regime 1: N ≤ C. The complete array fits into the cache and thus, for all values of the stride s, once the array is loaded for the first time, there are no more misses. The execution time per iteration (T_no-miss) includes the time to read one element from the cache, compute its new value, and store the result back into the cache. Note that in a cache where the update policy is write-through, T_no-miss may also include the time that the processor is forced to wait if the write buffer backs up.

Regime 2.a: N > C and 1 ≤ s < b. The array is bigger than the cache, and there are b/s consecutive accesses to the same cache line. The first access to the line always generates a miss, because every cache line is displaced from the cache before it can be re-used in subsequent computations of the function. This follows from condition N > C. Therefore, the execution time per iteration is T_no-miss + Ds/b, where D is the delay penalty and represents the time that it takes to read the data from main memory and resume execution.

Regime 2.b: N > C and b ≤ s < N/a (a > 1). The array is bigger than the cache and there is a cache miss every iteration, as each element of the array maps to a different line. Again, every cache line is displaced from the cache before it can be re-used. The execution time per iteration is T_no-miss + D.

Regime 2.c: N > C and N/a ≤ s ≤ N/2. The array is bigger than the cache, but the number of addresses mapping to a single set is less than the set associativity; thus, once the array is loaded, there are no more misses. Even when the array has N elements, only N/s < a of these are touched by the experiment, and all of them can fit in a single set. This follows from the fact that N/a ≤ s. The execution time per iteration is T_no-miss.

Figure 1 illustrates the state of the cache in each of the four regimes. In these examples we assume that the cache size is large enough to hold 32 4-byte elements, the cache line is 4 elements long, and the (set) associativity is 2. We also assume that the replacement policy is LRU, and that the first element of the array maps to the first element of the first line of the cache. On each of the cache configurations we highlight those elements that are read and generate a miss, those that are read but do not generate a miss, and those that are loaded into the cache as a result of accessing other elements in the same line, but are not touched by the experiment. The four diagrams in the upper part of the figure correspond to regime 1.
If we double N, which is represented by the lower half of the figure, then cache misses will occur at a rate which depends on the value of s. The leftmost diagram represents regime 2.a, the middle two diagrams regime 2.b, and the rightmost diagram regime 2.c.

Figure 1: The figure illustrates the four different regimes of cache accesses produced by a particular combination of N and s. Each diagram shows the mapping of elements to cache entries, assuming that the first element of the array maps to the first entry of the first cache line in the cache. The replacement policy is LRU. The four diagrams on the upper part of the figure correspond to regime 1. For the diagrams in the lower half of the figure, the leftmost diagram corresponds to regime 2.a, the two in the middle to regime 2.b, and the rightmost to regime 2.c. The sequence of elements referenced by an experiment is: 1, s + 1, 2s + 1, ..., N − s + 1. (The diagrams themselves, drawn for C = 32 4-byte elements, b = 4 elements, and a = 2, are not reproduced here.)

3.2. Measuring the Characteristics of the Cache

By making a plot of the value of the execution time per iteration as a function of N and s, we can identify where our experiments make a transition from one regime to the next, and using this information we can obtain the values of the parameters that affect the performance of the cache and the TLB.
In what follows we explain how these parameters are obtained.

3.2.1. Cache Size
Measuring the size of the cache is achieved by increasing the value of N until cache misses start to occur. When this happens the time per iteration becomes significantly larger than T_no-miss. The cache size is given by the largest N such that the average time per iteration is equal to T_no-miss.

3.2.2. Average Miss Delay
An experiment executing in regime 2.b generates a miss every iteration, while one in regime 1 generates no misses, so the difference between their respective times gives the memory delay per miss. An alternative technique is to measure the difference in the iteration time between regime 2.a and regime 1, and then multiply this difference by b/s, the number of references per miss.

3.2.3. Cache Line Size
In regime 2.a, the rate at which misses occur is one every b/s iterations. This rate increases with s, and achieves its maximum when s ≥ b, when there is a miss on every iteration (regime 2.b). The value of s at which the transition between regimes 2.a and 2.b occurs gives the cache line size.

3.2.4. Associativity
The associativity of the cache (for a ≥ 2) is obtained in the following way. Assume that our experiments cover a region of memory N which is larger than the cache size C, and that the stride s equals the cache line size b. It is clear that under these assumptions a cache miss occurs on every reference and that the number of elements mapping to a set is larger than the associativity. If we start increasing the stride, as long as its value is less than C/a (the number of different sets in the cache), the number of data elements mapping to a particular set in the cache remains the same, while the number of sets touched decreases. Once the stride crosses the boundary only a single set is referenced, and the number of elements mapping to this set starts to decrease. When the stride is at its maximum, N/2, only two elements are mapped to the surviving set. Therefore, if at some point the effect of cache misses disappears, the stride at which this occurs gives the associativity through the formula a = N/s. Otherwise, it is obvious that the cache has to be direct mapped.
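These rules can be applied mechanically to the measured times. The fragment below illustrates extracting the miss penalty D and the line size b from a single row of made-up measurements taken with N > C; the numbers and the tolerance used to detect the regime 2.a/2.b transition are assumptions.

```c
/* Sketch: extracting the line size and miss penalty from one row of
 * measurements taken with N > C (regimes 2.a and 2.b).  row[j] is the
 * ns/iteration at stride strides[j]; t_base is the regime-1 time.
 * Strides are kept well below N/a so regime 2.c does not appear.
 * All numbers are illustrative, not measurements. */
#include <stdio.h>

int main(void) {
    long   strides[] = { 1, 2, 4, 8, 16, 32, 64 };                   /* elements */
    double row[]     = { 12.5, 15.0, 20.0, 30.0, 30.0, 30.0, 30.0 }; /* ns/iter */
    double t_base    = 10.0;                                         /* ns, regime 1 */
    int    n         = sizeof row / sizeof row[0];

    /* Miss penalty D: plateau time (one miss per iteration) minus t_base. */
    double D = row[n - 1] - t_base;

    /* Line size b: the smallest stride whose time has reached the plateau,
     * i.e. the transition from regime 2.a to regime 2.b. */
    long b = strides[n - 1];
    for (int j = 0; j < n; j++)
        if (row[j] >= row[n - 1] - 0.5) { b = strides[j]; break; }   /* 0.5 ns slack, assumed */

    printf("miss penalty D = %.1f ns\n", D);
    printf("line size b    = %ld elements\n", b);
    return 0;
}
```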
