Virtual Memory on Data Diffusion Architectures
Operating Systems, ch09: virtual memory-70

Example of a page table snapshot:
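[Figure: a page table in which each entry holds a frame number and a valid-invalid bit; four entries are marked v (valid) and three are marked i (invalid).]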
During address translation, if the valid-invalid bit in the page table entry is i, a page fault occurs.
Copy-on-Write (COW) allows both parent and child processes to initially share the same pages in memory. If either process modifies a shared page, only then is the page copied. COW allows more efficient process creation, as only modified pages are copied. Free pages are allocated from a pool of zeroed-out pages.
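The copy-on-write behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Frame` and `Process` classes, the reference count, and the `fork` and `write` methods are hypothetical names chosen for the example, not part of any operating system's interface.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A physical frame with a reference count (hypothetical model)."""
    data: bytearray
    refs: int = 1


@dataclass
class Process:
    """A process whose page table maps page numbers to (possibly shared) frames."""
    pages: dict = field(default_factory=dict)

    def fork(self) -> "Process":
        # Parent and child initially share every frame; nothing is copied.
        for frame in self.pages.values():
            frame.refs += 1
        return Process(pages=dict(self.pages))

    def write(self, page: int, offset: int, value: int) -> None:
        frame = self.pages[page]
        if frame.refs > 1:
            # First write to a shared page: copy it and detach from the original.
            frame.refs -= 1
            frame = Frame(bytearray(frame.data))
            self.pages[page] = frame
        frame.data[offset] = value


parent = Process({0: Frame(bytearray(4)), 1: Frame(bytearray(4))})
child = parent.fork()
child.write(0, 0, 7)  # only page 0 is copied; page 1 stays shared
```

After the write, the child holds a private copy of page 0, while page 1 is still a single frame shared by both processes.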
Need For Page Replacement
Basic Page Replacement
1. Find the location of the desired page on disk
2. Find a free frame:
- If there is a free frame, use it
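The steps listed so far can be sketched as a toy fault handler. The `Memory` class and its fields are hypothetical, and the choice of a FIFO victim when no frame is free is an assumption made for the example, since the list above stops at the free-frame case.

```python
from collections import deque


class Memory:
    """Toy physical memory with a fixed number of frames (hypothetical model)."""

    def __init__(self, num_frames: int, disk: dict):
        self.disk = disk                      # page number -> contents on disk
        self.free = list(range(num_frames))   # free frame numbers
        self.frames = {}                      # frame number -> (page, contents)
        self.page_table = {}                  # resident page -> frame number
        self.loaded = deque()                 # resident pages in load order

    def handle_fault(self, page: int) -> int:
        contents = self.disk[page]            # 1. locate the desired page on disk
        if self.free:                         # 2. find a free frame
            frame = self.free.pop(0)          #    a free frame exists: use it
        else:
            # Assumption: FIFO victim selection (not stated in the list above).
            victim = self.loaded.popleft()
            frame = self.page_table.pop(victim)
            self.disk[victim] = self.frames[frame][1]  # write the victim back
        self.frames[frame] = (page, contents)
        self.page_table[page] = frame
        self.loaded.append(page)
        return frame


mem = Memory(num_frames=2, disk={0: "A", 1: "B", 2: "C"})
placements = [mem.handle_fault(p) for p in (0, 1, 2)]
```

With two frames, the third fault finds no free frame, so the oldest resident page is evicted and its frame is reused.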
Page fault
1. The operating system looks at another table to decide:

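Virtualization basics: memory virtualization (repost). Category: Virtualization. Author: 刘苏平
Memory virtualization basics
Before managing memory resources, you should understand how ESX/ESXi virtualizes and uses them.
The VMkernel manages all machine memory. (One exception is the memory allocated to the service console in ESX.) The VMkernel dedicates part of this managed machine memory to its own use.
Virtual machines use machine memory for two purposes: each virtual machine requires its own memory, and the VMM requires some memory plus dynamic overhead memory for its code and data.
The virtual memory space is divided into blocks, typically 4 KB each, called pages.
Physical memory is also divided into blocks, also typically 4 KB each.
ESX/ESXi also supports large pages (2 MB).
Because ESX/ESXi hosts use memory management techniques, virtual machines can use more memory than the physical machine (the host) has available.
For example, you might have a host with 2 GB of memory running four virtual machines with 1 GB of memory each.
To improve memory utilization, the ESX/ESXi host transfers memory from idle virtual machines to virtual machines that need more memory.
The ESX/ESXi system uses a proprietary page-sharing technique to securely eliminate redundant copies of memory pages.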
Principles of Computer Organization (Chinese-English parallel text)
COMPUTER HARDWARE
Computer systems consist of hardware and software. Hardware is the physical part of the system. Once designed, hardware is difficult and expensive to change. Software is the set of programs that instruct the hardware and is easier to modify than hardware.
Every computer has four basic hardware components:
• Input devices.
• Output devices.
• Main memory.
• Central processing unit (CPU).
A PROCESSOR
A processor is composed of two functional units (a control unit and an arithmetic/logic unit) and a set of special workspaces called registers.
The Control Unit
The Arithmetic and Logic Unit
Registers
MEMORY SYSTEMS
Memory Devices
RANDOM-ACCESS MEMORY
READ-ONLY MEMORY
MAGNETIC DISKS
CD-ROMS
MAGNETIC TAPE
INPUT/OUTPUT SYSTEM
Programmed I/O
Programmed I/O, also known as direct I/O, is accomplished by a program executing on the processor itself to control the I/O operations and transfer the data.
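Therefore, to meet the needs of AI and machine learning applications, location increasingly applies both to where the data needs to reside and to the memory that stores it.
At the AI Hardware Summit roundtable on memory interconnect challenges and solutions, Rambus fellow Steve Woo said: “We are all working on different aspects of artificial intelligence.”
Igor Arsovski, CTO of Marvell's ASIC business unit, who has extensive experience with SRAM, said that beer is not a bad analogy for memory interconnect.
HBM vs LPDDR
The race in external memory is essentially between DRAM-GDDR and HBM.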
Address Translation: Hardware converts virtual addresses to physical addresses via an OS-managed lookup table (page table)
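A minimal sketch of this lookup at page granularity follows. The 4 KB page size, the list-of-tuples page table, and the `translate` and `PageFault` names are assumptions made for illustration; real hardware performs the same split of the address into page number and offset, but not through this interface.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class PageFault(Exception):
    """Raised when the page table entry for a page is marked invalid."""


def translate(page_table, vaddr):
    """Translate a virtual address using a list of (frame, valid) entries."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame, valid = page_table[page]
    if not valid:
        # The operating system would now bring the page in and retry.
        raise PageFault(page)
    return frame * PAGE_SIZE + offset


# Hypothetical snapshot: pages 0 and 1 are resident, page 2 is not.
table = [(5, True), (2, True), (0, False)]
paddr = translate(table, 1 * PAGE_SIZE + 0x10)
```

The offset is carried over unchanged; only the page number is replaced by a frame number, which is why the table needs one entry per page rather than one per address.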
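Virtual Memory
Programs need more memory than the computer actually has
Programs need to be relocatable in memory
Programs need to be isolated and protected from one another
Virtual memory lets each program run in its own independent, sufficiently large virtual address space!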
Processor Signals Controller
(1) Initiate Block Read
Read block of length P starting at disk address X and store starting at memory address Y
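Page table
Pages are placed in physical memory using a fully associative organisation
Full associativity lets the operating system use sophisticated replacement policies, which improves the hit rate
The mapping from virtual addresses to physical addresses is implemented by the page table
The page table is stored contiguously in physical memory
The location of the page table is given by a dedicated register, the page table register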
Page Faults (“Cache Misses”)
[Figure: physical addresses, with locations labelled 0:, 1:]

[Figure: writes and reads over the day, with time marks at 9:00 AM, 6:00 PM and 10:00 PM.]
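Performance estimate based on 1,000 desktops of 50 GB each:
• A single user takes 120 seconds to log in: IOPS = (25000/120) = 208 IOPS per user. With users logging in in five batches of 200 simultaneous users, 40,800 IOPS are needed, and this load lasts ten minutes;
• A single user takes 30 seconds to log in: IOPS = (2,400/30) = 80 IOPS per user. With 1,000 users logging in simultaneously, 80,000 IOPS are needed, and this load lasts 30 seconds;
• In normal operation each user needs 20~30 IOPS, so the steady-state requirement for a 1,000-user system is at least 20,000~30,000 IOPS, which must be sustained at the same time;
• In terms of capacity, at least 50 TB of usable capacity is required.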
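Single-disk performance is limited. Adding disks raises performance, but this approach has a high purchase cost and takes more rack space, air conditioning and power, so its price-performance is poor
Flash erase/write cycle limits, temperature sensitivity and write amplification can all adversely affect data safety
2015.10.28 | Beijing, China
薛友逢, Storage Architect, NimbleStorage
1. NimbleStorage company overview
2. Virtual desktop storage performance analysis
3. High-performance solutions for virtual desktops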
Jorge Buenabad-Chávez
A thesis submitted to the University of Bristol in accordance with the requirements for the degree of Doctor of Philosophy in the Faculty of Engineering, Department of Computer Science.
July 1998
Abstract
Data diffusion architectures (also known as COMA machines) are scalable multiprocessors that provide a shared address space on top of distributed main memory. Their distinctive feature is that data “diffuses”, or migrates and replicates, in main memory according to whichever processors are using the data; thus effective access time tends to be local access time. This property is possible due to the associative organisation of main memory, which in effect decouples each address and its data item from any physical location. A data item can thus be placed and replicated where it is needed. Also, the physical address space does not have to be a fixed and contiguous address range. It can be any set of addresses within the address range of the processors, possibly varying over time, provided it is smaller than the size of main memory. This flexibility, which is similar to that of an address space under virtual memory management, offers new possibilities to organise virtual memory in order to support a general purpose multiprogramming environment.
This thesis presents an analysis of possible ways to organise virtual memory on such machines, and proposes two main alternatives: traditional virtual memory (TVM) is organised around a fixed and contiguous physical address space using a traditional mapping; associative memory virtual memory (AMVM) is organised around a varying and non-contiguous physical address space using a simpler mapping. Our analysis suggests that AMVM has performance advantages over TVM. The KSR-1, the first commercial data diffusion architecture, has a virtual memory which is similar to AMVM, but is partly integrated with the data diffusing hardware. AMVM is more hardware independent, and has potentially improved
performance.
For data to diffuse, a fraction of main memory must be reserved as diffusion space to ensure reasonable performance. This thesis presents an analysis of diffusion space requirements that suggests that, on set-associative memory, the adequate provision of diffusion space should start with a base size diffusion space across all memory sets, and that it be adjusted on demand.
To evaluate TVM and AMVM, and to gain insight into the provision of diffusion space, a multiprocessor emulation of a data diffusion architecture has been extended to include the emulation of part of the Mach operating system virtual memory. This extension implements TVM; a slightly modified version implements AMVM. On applications tested, AMVM shows a marginal performance gain over TVM. We argue that AMVM will offer greater advantages on applications with higher degrees of parallelism or larger data sets. For the provision of diffusion space to be adequate, our results on set-associative memory suggest the need for a simple interaction between virtual memory software and the data diffusing hardware.
Acknowledgements
It is a pleasure to thank the people who made possible the completion of this work. The British Council, CINVESTAV, CONACyT and SEP provided the necessary funds. PACT and the University of Bristol provided a very suitable environment. Dr. Sergio Chapa was particularly supportive at CINVESTAV, ensuring that funds were available all the way through.
I had very useful discussions and nice talks with Catherine Barnaby, Andy Jones, Hendrik-Jan Agterkamp, Julio Peralta and Oscar Olmedo. Graham Riley was very helpful during my visit to the Centre for Novel Computing at Manchester University, where I studied the operation of virtual memory management on the KSR-1 multiprocessor.
Mike Brown, manager of PACT, was extremely patient and supportive in giving me as much time as possible for the completion of the experimental work and writing up of the thesis. He also took care of “migrating” the experimental platform
to México for this research to be continued at the CINVESTAV. Professor David May was very supportive in the last stage of the writing up of the thesis.
Henk Muller and Paul Stallard were a source of support and inspiration. The emulator of the HORN DDM, on which the experimental work was carried out, is a piece of art. This work would not have been possible without it. I also had lots of very useful discussions with them. Thank you very much Henk and Paul (just alphabetical order).
I feel very lucky to have worked with my adviser, Professor David H. D. Warren. His patience, encouragement, and both technical and administrative support made possible the completion of this thesis. Above all, from working with him I had all the way through such an experience as that described below.
“It is quite obvious that in order to be sensitive to oneself, one has to have an image of complete, healthy human functioning – and how is one to acquire such an experience if one has not had it in one’s own childhood, or later in life? ... In previous epochs of our own culture, or in China and India, ... the teacher was not only, or even primarily, a source of information, but his function was to convey certain human attitudes.”
Thank you David.
To Paty de Alba, my siblings Paty, Raúl and el Güero, my father Jorge and my mother Maria Luisa. “Finish what you start, and then you can go and play”.
DECLARATION
Contents
Abstract
Acknowledgements
DECLARATION
1 Introduction
1.1 Summary and Thesis Layout
2 Virtual Memory
2.1 Introduction
2.2 Virtual Address Space Management
2.2.1 Address Translation
2.2.2 Paging or Segmentation?
2.2.3 Combining Paging and Segmentation
2.2.4 Multiple, Single, or Global Address Space Virtual Memory?
2.3 Page Table Organisation Alternatives
2.4 Translation Lookaside Buffer
2.4.1 TLB Coherency
2.4.2 TLB Organisation and Management
2.4.3 TLB Performance
2.5 Main Memory Management
2.5.1 Page Replacement Policies
2.5.2 Allocation Policies
2.6 Swap Space Management
2.7 Summary
3 Virtual memory on shared memory multiprocessors
3.1 Shared Memory Multiprocessors
3.1.1 Data Coherence
3.1.2 Memory Consistency
3.2 Virtual Address Space Management
3.3 TLB Management
3.3.1 Hardware Solutions
3.3.2 Software Solutions
3.4 Scalability
3.5 Mach: a Portable Uniprocessor and Multiprocessor Operating System
3.5.1 Virtual Address Space Management
3.5.2 Main Memory Management
3.6 Summary
4 Data diffusion architectures
4.1 Background and Related Work
4.1.1 A Software Approach: Shared Virtual Memory
4.1.2 A Hardware-software Approach: Shared Virtual Memory II
4.1.3 A Hardware Approach: Virtual Shared Memory
4.1.4 A CC-NUMA Example: the DASH Multiprocessor
4.2 Data Diffusion Architectures – Design Issues
4.2.1 Data Coherency Protocol
4.2.2 Associative Main Memory Organisation
4.3 Data Diffusion Architectures – Examples
4.3.1 The HORN DDM
4.3.2 The KSR-1
4.4 Summary
5 Design Issues for Virtual Memory on Data Diffusion Architectures
5.1 Introduction
5.2 Supporting New Views of Memory
5.3 Supporting a Traditional View of Memory
5.4 Design Goals
5.5 Summary
6 The KSR-1 Virtual Memory
6.1 Introduction
6.2 Virtual Address Space Management
6.3 Main Memory Management
6.4 Discussion
6.4.1 Efficient Address Translation
6.4.2 Advantages of a Large Main Memory Allocation Unit
6.4.3 Disadvantages
6.5 Summary
7 Our Proposals for Virtual Memory on Data Diffusion Architectures
7.1 Traditional Virtual Memory: TVM
7.2 Associative Memory Virtual Memory: AMVM
7.2.1 Using TLB Support for Superpages under AMVM
7.3 Discussion
7.4 Performance Evaluation – Experimental Goal
7.5 Provision of Diffusion Space – Experimental Goal
7.5.1 Informal analysis of diffusion space requirements
7.5.2 Providing diffusion space
7.6 Summary
8 Experimental methodology
8.1 Multiprocessor emulation
8.2 The HORN DDM emulator
8.2.1 Calibration
8.2.2 Emulation of the HORN DDM
8.3 Emulation of virtual memory
8.3.1 Emulation of TLB
8.3.2 Emulation of Traditional Virtual Memory
8.3.3 Some comments on the port
8.3.4 Emulation of Associative Memory Virtual Memory
8.3.5 Base experimental configuration
8.4 Validation of virtual memory emulation
8.4.1 Method
8.4.2 Results
8.4.3 Conclusions
8.5 Applications
8.6 Summary
9 Performance Evaluation of Two Approaches
9.1 Method
9.2 Results
9.2.1 Wave Virtual Memory Overhead
9.2.2 Aurora Virtual Memory Overhead
9.2.3 Mp3d Virtual Memory Overhead
9.3 Discussion
9.4 Summary
10 Evaluation of Diffusion Space Requirements
10.1 Method
10.1.1 Comments on our experimental methodology
10.2 Results
10.2.1 Diffusion Space Requirement of Wave
10.2.2 Diffusion Space Requirements of Aurora
10.2.3 Diffusion Space Requirements of Mp3d
10.3 Discussion
10.3.1 Adjusting Diffusion Space
10.3.2 Other Results
10.3.3 Generalising our Results
10.3.4 Further Work
10.4 Summary
11 Conclusions and Further Work
11.1 Limitations and Further work
Glossary
List of Figures
2.1 Virtual memory organisation. Each virtual address used by an application is dynamically translated into a physical address using mapping tables.
2.2 A virtual address space (shaded) is logically divided into pages all of the same size, into segments of variable size, or into segments themselves divided into pages.
2.3 Virtual-to-physical address translation at page granularity.
2.4 Address translation in the GE 645 processor under segmentation and paging.
2.5 Global address space virtual memory, above. The segments of each application are mapped to the paged global virtual address space. Below, private-to-global virtual address translation in the IBM RT PC architecture.
3.1 Shared memory multiprocessors: bus-based (left) and cross-bar-based.
4.1 The DASH architecture.
4.2 The Data Diffusion Machine with buses as interconnect medium.
4.3 The Bristol DDM.
4.4 The HORN DDM.
4.5 The HORN DDM with a general interconnect and collapsed hierarchy.
4.6 The KSR-1 architecture for 1088 processors.
5.1 In data diffusion architectures the physical address space can be any set of addresses, possibly varying over time, within a physical address range which is limited by the addressing capability of the processors or the associative organisation of main memory.
6.1 Mapping of virtual addresses into global virtual addresses in the KSR-1.
6.2 Global address space virtual memory in the KSR-1. The global virtual address space is mapped with the identity function to the physical address range of the associative main memory.
6.3 Memory layout in each memory node in the KSR-1.
6.4 OSF-1 memory lists for each machine-wide memory set in the KSR-1.
7.1 Alternatives to organise virtual memory around a fixed and contiguous physical address space and a traditional mapping with commodity hardware.
7.2 Traditional virtual memory (TVM).
7.3 Associative memory virtual memory (AMVM).
8.1 Increasing the fanout of directory nodes in the HORN DDM emulator on a network of T800 transputers.
8.2 Transputer processes in the HORN DDM emulator.
8.3 Transputer address space layout under the HORN DDM emulator.
8.4 Superpage and partial subblock TLB entries.
8.5 Transputer address space layout under TVM.
8.6 Transputer address space layout under single address space AMVM.
8.7 Our version of Rosenburg’s program on the HORN DDM emulator.
8.8 The speedup of our applications on the HORN DDM emulator without virtual memory.
9.1 Wave virtual memory overhead.
9.2 Aurora virtual memory overhead.
9.3 Mp3d virtual memory overhead.
10.1 The effect of varying the size of diffusion space on Wave-TVM.
10.2 Grid data partitioning of Wave into each processor.
10.3 The effect of varying the size of diffusion space on Aurora-TVM.
10.4 The effect of varying the size of diffusion space on Mp3d-TVM.
List of Tables
4.1 Access times in the memory hierarchy of the KSR-1 multiprocessor.
8.1 Calibration example.
8.2 Our experimental HORN DDM hardware configuration.
8.3 Shared references made by our applications without virtual memory.
9.1 Total TLB miss count over all processors.
10.1 The effect of varying the number of computation steps and associativity of first-level directories on the execution time of Wave-TVM on 32 processors. Main memory is of size 64 Mega bytes (512 4-K-byte pages for each memory node), and 16-way set-associative. Diffusion space is 300 4-K-byte pages for each memory node.
10.2 The effect of varying the number of computation steps and associativity of first-level directories on the execution time of Wave-TVM on 32 processors, under the FIFO with second chance replacement for memory sets. Main memory is of size 64 Mega bytes (512 4-K-byte pages for each memory node), and 16-way set-associative. Diffusion space is 300 4-K-byte pages for each memory node.
10.3 The effect of varying the number of computation steps and associativity of first-level directories on the execution time of Wave-TVM on 32 processors, under the standard replacement policy for memory sets and modifying the array that in the original version is only read. Main memory is of size 64 Mega bytes (512 4-K-byte pages for each memory node), and 16-way set-associative. Diffusion space is 300 4-K-byte pages for each memory node.
Chapter 1
Introduction
Multiprocessors are general purpose machines capable of exploiting parallelism, and thus are a means to improve application performance. To this end, the computation of the application has to be divided among several processors, which need to communicate with each other in order to coordinate their task and to share data and code. Dividing the computation and specifying processor communication are the responsibility of the application programmer, system software, or both [4, pp. 223–287].
However, the way processor communication actually takes place depends on the underlying hardware organisation. In early designs of multiprocessors, memory is either shared by all processors through a common interconnect medium, such as a bus, or distributed among the processors, each processor using privately a portion of memory. These configurations are classified as shared memory and distributed memory multiprocessors, respectively.
Processor communication in shared memory multiprocessors takes place via shared variables. There is thus no need to specify communication for processors to share data and code, only to coordinate their tasks. This aspect simplifies parallel programming, and also offers more portability to software developed for uniprocessor systems. Also, the common interconnect simplifies maintaining data coherence (guaranteeing that each processor uses the last value written to a data item). Coherence becomes an issue as a result of each processor using a cache. When multiple caches hold a copy of a data item that is subsequently written by a processor, all other copies must be invalidated or updated. A common interconnect also provides uniform memory access (UMA) time to each processor, which simplifies migrating a computation from one processor to another in order to balance the system workload. Unfortunately, shared memory multiprocessors are not scalable, as the memory bandwidth of a common interconnect does not increase as more processors are added.
In contrast, distributed memory multiprocessors tend to be scalable because memory bandwidth increases with the number of processors. They are also simpler to build than shared memory multiprocessors because data coherence is not an issue for the hardware. However, programmers have to specify communication for processors to coordinate the task, and also to share data and code, via messages explicitly sent and received by each processor. Message-passing communication requires more involvement from the programmer, and renders
software less portable.
More recently, the design of some multiprocessors has been seeking to combine the ease of programming of shared memory multiprocessors with the scalability of distributed memory multiprocessors. The approach has been to provide a shared address space on top of memory distributed among the processors. Memory bandwidth thus increases as more processors are added. In these systems, a reference by a processor can be local (to the nearest memory portion) or remote. However, this is transparent to software; the underlying hardware is responsible for fetching local or remote data to each processor. Multiprocessors with this organisation have variously been referred to as virtual shared memory machines, distributed shared memory machines, or scalable shared memory machines. These systems are attractive because in principle they can offer as much processing power as vector or array processors at a more reasonable cost, and because they are software compatible with shared memory multiprocessors. A few early designs have already been implemented. The RP3 [15], the Stanford DASH [64] and the MIT Alewife [17, 2] are research prototypes. The KSR-1 was a commercial product during 1991-95 [44].
Non-uniform memory access (NUMA) architectures were the first proposed class of virtual shared memory multiprocessors. The term NUMA emphasises the difference in time to access different data due to the distributed organisation of memory. The design of NUMAs was mainly addressed to the system interconnect and the mapping of the shared address space onto memory nodes. However, data coherence at the level of caches, if available and used, was the programmer’s responsibility [15]. The design of cache-coherent (CC-)NUMA architectures then built on that early work to support by hardware data coherence at cache line granularity of typically 16 to 128 bytes.
In NUMAs and CC-NUMAs, a key factor to achieve scalable performance is that most of the data used by each processor should be resident either in its cache or in
its local memory. If a processor mostly incurs remote references, performance may degrade significantly. The placement of data upon memory nodes is thus an issue in these architectures, although it can be avoided in CC-NUMAs if caches are large enough to hold most of the local and remote data used by each processor. Yet to improve the local to remote access ratio (hence performance), methods have been suggested that include modifying application code and data structures [41], or modifying system software to migrate and replicate data among memory nodes as needed [18, 113]. Neither alternative is simple to implement: software for (UMA) shared memory multiprocessors has to be modified to deal with the distributed organisation of main memory. Also, neither alternative can guarantee improving the local to remote access ratio, due to the dynamic access pattern of each application. Balancing the system workload by migrating a computation from one processor to another is clearly difficult too. The RP3 is a NUMA; the DASH and MIT Alewife are both CC-NUMAs.
Data diffusion architectures are another class of virtual shared memory multiprocessors. Their distinctive feature is that data diffuses, or migrates and replicates, in main memory
For this reason, we think the term “cache only memory architecture” is somewhat misleading, and prefer the term “data diffusion architecture”.
An associative main memory will be somewhat more expensive than traditional main memory.
However, hardware components are becoming cheaper, and represent a cost to be made only once. In contrast, the cost of software development has increased, and tuning applications to improve the local to remote access ratio has to be made for each application and for each processor-count and memory configuration.
Data diffusion architectures were introduced with the proposal of the Data Diffusion Machine (DDM) [114]. The KSR-1 was the first commercial architecture of this class. Both the DDM and the KSR-1 use a hierarchical organisation where software runs at the leaf nodes only, and nodes above the leaves keep track of the data items in the nodes below. The DDM was first proposed with a hierarchy of buses, but has since been investigated for point-to-point links as interconnect medium [86, 83, 85, 115]. The KSR-1 uses a hierarchy of A-F is a more recent proposal of a flat (non-hierarchical) data diffusion architecture [48, 111].
This thesis investigates the design issues for virtual memory on data diffusion architectures. As is well known, virtual memory is a memory management scheme that makes main memory appear much larger than it actually is. This illusion simplifies the programming task, as applications larger than the available main memory can run partially loaded in main memory in a way transparent to the programmer. Virtual memory also efficiently supports multiprogramming, since the more applications partially loaded in main memory there are, the easier it is for the scheduler to find an application ready to run when others block for I/O or synchronisation. These benefits of virtual memory are possible by considering each address used by an application only a means to identify (code and) data [32], and not a description of where in main memory data is. The addresses used by an application are virtual addresses which are dynamically translated into physical addresses using mapping tables.
When the mapping of a virtual address is invalid, the data is initialised or swapped into main memory from secondary storage, and the mapping tables updated accordingly. The virtual address space an application can use is thus not limited to the size of main memory; it can be as large as the addressing capability of the processor.
Virtual memory makes possible a flexible and general purpose multiprogramming environment, which is desirable for a data diffusion architecture. Yet the design of virtual memory for these architectures is interesting in itself. Virtual memory was conceived with traditional main memory underneath, where physical addresses are bound to storage locations. Hence the physical address space is fixed and in effect contiguous. In data diffusion architectures, the associative organisation of main memory in effect decouples each (physical) address and its data item from any storage location. Physical addresses are only means to identify data in main memory. They are like virtual addresses.
Hence in principle, the physical address space can be any set of addresses within the address range of the processor. Software can choose the configuration of the physical address space. Moreover, the set of addresses composing the physical address space can vary over time. The set of all possible physical addresses that can be represented in main memory, or physical address range, is only limited by the addressing capability of the processors.
The flexibility of the physical address space of data diffusion architectures is similar to that of a virtual address space, and offers new possibilities to organise virtual memory. This thesis presents an analysis of possible ways to organise virtual memory in these architectures. The thesis describes how virtual memory can support novel views of memory in which the use of address space is sparse: only items actually used come to exist. Although this is also possible on traditional main memory, the virtual-to-physical address mapping would have to be organised at item granularity: with a mapping table entry for each item. However, for efficiency in main memory use and swapping operations, the mapping in most systems is organised at fixed-size page granularity of typically 4 K-bytes; the mapping of a virtual address $va$ into a physical address $pa$ is: $pa = f(p) + o$, where $p$ is the virtual page number (the most significant bits) of $va$, $o$ is the offset (least significant bits) of $va$, and $f(p)$ gives the initial address of the physical page which is the image of the virtual page being accessed. In data diffusion architectures, address translation can proceed at page granularity, yet only items actually allocated come to exist.
The thesis also describes how virtual memory can provide a traditional view of memory, where address space is allocated in contiguous chunks, but take a novel view of the physical memory. Virtual memory can be organised in several different ways to provide a traditional view of memory.
Here, however, we only describe briefly our proposals to organise virtual memory on data diffusion architectures, which were chosen based on the criteria of low implementation cost and performance.
The motivating factor of our first proposal is low implementation cost: maximum portability. Hence we ignore the flexibility of the physical address space, and choose a fixed and contiguous physical address space as in traditional main memory. Hence commodity hardware and software can be used to implement the virtual memory system. A virtual memory for a (UMA) shared memory multiprocessor should require little change to make it immediately usable in a data diffusion architecture. Most virtual memory systems today support what is referred to as multiple address space virtual memory: a private virtual address space is handled for each application, and each virtual address space is logically divided into pages. Hence our first proposal is a paged multiple address virtual memory, which we will refer to as traditional virtual memory, or TVM, because the view of main memory as a fixed and contiguous physical address space is traditional.
Address translation under TVM will proceed at page granularity, which has been found to incur a somewhat significant overhead for some applications (on uniprocessors), up to 50% of the execution time [105, 88]. This overhead corresponds to what is referred to as translation lookaside buffer (TLB) miss overhead. The TLB is a small and fast associative memory near the processor which caches the more recently used page mapping table entries, and is an essential component to achieve reasonable performance under virtual memory. Only on a TLB miss, when the TLB does not hold the required mapping table entry, are page tables in main memory actually accessed. Even so, for large applications the TLB may not hold the mapping for all the pages continuously used. And unfortunately, it is not only a matter of increasing the size of the TLB; larger TLBs are slower and thus address translation and
accessing data are also slower [107].
Address translation overhead can be reduced if the virtual address space of each application is divided into variable-size segments, which can be of size up to Giga bytes. Since most applications have a code segment, a data segment and for each processor a stack segment, each processor only requires three or so TLB entries. Hence TLB miss overhead is negligible. However, handling variable-size segments makes the management of main memory a difficult task. Swapping segments of variable size in and out tends to fragment main memory into small, non-contiguous portions that cannot be used even though they are not allocated to any application; and a segment can only be swapped into main memory if a large enough contiguous portion of physical address space is available. In contrast, handling pages all of the same size simplifies main memory management because any virtual page can be placed into any physical page.
Another alternative to reduce TLB miss overhead is to use TLB support for variable-size pages [105, 88]. Pages larger than the base-size page are referred to as superpages, and can be up to tens, hundreds or even a few thousand Mega-bytes. The use of TLB support for superpages tends to reduce the TLB miss count and consequently TLB miss overhead. But it is optional because superpages are like segments only restricted to have a size and alignment in powers of two: using superpages requires mapping a virtual superpage into a physical superpage. If swapping and main memory management are to proceed at base-size page granularity, which is desirable for efficiency, then physical superpages must be dynamically promoted by virtual memory software. Base-size physical pages must be copied around, possibly deallocating and reallocating some of them, and this entails invalidating mapping information in TLBs so that processors do not access erroneous data. And while superpage promotion is taking place the processors must wait.
The motivating factor of our second
proposal is performance. In our second proposal, a private virtual address space is used for each application, and is mapped at segment granularity into a paged global virtual address space. Applications can easily share data in the global virtual address space by mapping private segments to the same global segment. The mapping from private virtual address into global addresses is efficient because each processor only needs a few segmented TLB entries or segment registers. Yet swapping and main memory management are efficient because the global virtual address space is paged.
The mapping of global virtual addresses into physical addresses is simply the identity function: the global virtual address space is mapped with the identity function to the physical address range of the associative main memory. However, since the physical address range cannot all be represented in main memory, as neither can the data it may identify, physical address creation and destruction must be supported by the data diffusing hardware. Hence the physical address space will be varying and non-contiguous at page granularity. Because using a varying physical address space is only possible if main memory is associative, we refer to our second proposal as associative memory virtual memory, or AMVM.
The rationale behind AMVM is to use TLB support for superpages without incurring the overhead of superpage promotion. Since the global virtual addresses of each segment are mapped into physical