An Implementation of Pipelined Parallel Processing System for Multi-Access Memory System

QAM is a widely used multilevel modulation technique with a variety of applications in data radio communication systems. Most existing implementations of QAM-based systems use high levels of modulation in order to meet the high data rate constraints of emerging applications. This work presents the architecture of a highly parallel QAM modulator, using an MPSoC-based design flow and design methodology, which offers multirate modulation. The proposed MPSoC architecture is modular and provides dynamic reconfiguration of the QAM utilizing on-chip interconnection networks, offering high data rates (more than 1 Gbps), even at low modulation levels (16-QAM). Furthermore, the proposed QAM implementation integrates a hardware-based resource allocation algorithm that can provide better throughput and fault tolerance, depending on the on-chip interconnection network congestion and run-time faults. Preliminary results from this work have been published in the Proceedings of the 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC 2010). The current version of the work includes a detailed description of the proposed system architecture, extends the results significantly using more test cases, and investigates the impact of various design parameters. Furthermore, this work investigates the use of the hardware resource allocation algorithm as a graceful degradation mechanism, providing simulation results about the performance of the QAM in the presence of faulty components.

Quadrature Amplitude Modulation (QAM) is a popular modulation scheme, widely used in various communication protocols such as Wi-Fi and Digital Video Broadcasting (DVB). The architecture of a digital QAM modulator/demodulator is typically constrained by several, often conflicting, requirements. Such requirements may include demanding throughput, high immunity to noise, flexibility for various communication standards, and low on-chip power. The majority of existing QAM implementations follow a sequential implementation approach and rely on high modulation levels in order to meet the emerging high data rate constraints. These techniques, however, are vulnerable to noise at a given transmission power, which reduces the reliable communication distance. The problem is addressed by increasing the number of modulators in a system, through emerging Software-Defined Radio (SDR) systems, which are mapped on MPSoCs in an effort to boost parallelism. These works, however, treat the QAM modulator as an individual system task, whereas it is a task that can further be optimized and designed with further parallelism in order to achieve high data rates, even at low modulation levels.

Designing the QAM modulator in a parallel manner can be beneficial in many ways. Firstly, the resulting parallel streams (modulated) can be combined at the output, resulting in a system whose majority of logic runs at lower clock frequencies, while allowing for high throughput even at low modulation levels. This is particularly important as lower modulation levels are less susceptible to multipath distortion, provide power efficiency, and achieve a low bit error rate (BER). Furthermore, a parallel modulation architecture can benefit multiple-input multiple-output (MIMO) communication systems, where information is sent and received over two or more antennas often shared among many users. Using multiple antennas at both transmitter and receiver offers significant capacity enhancement in many modern applications, including IEEE 802.11n, 3GPP LTE, and mobile WiMAX systems, providing increased throughput at the same channel bandwidth and transmit power.
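To make the low-modulation-level argument concrete, the following minimal Python sketch shows Gray-coded 16-QAM mapping applied independently to a few parallel bit streams, so that each mapper can run at a fraction of the aggregate symbol rate. The constellation convention, lane count and recombination order are illustrative assumptions, not the paper's hardware design.

```python
import numpy as np

GRAY_4PAM = {0b00: -3, 0b01: -1, 0b11: +1, 0b10: +3}  # Gray-coded 4-PAM levels

def map_16qam(bits: np.ndarray) -> np.ndarray:
    """Map a bit array (length multiple of 4) to 16-QAM symbols.

    Bits b3 b2 select the I level and b1 b0 the Q level (one common
    convention; actual constellations differ between standards).
    """
    nibbles = bits.reshape(-1, 4)
    i = np.array([GRAY_4PAM[(b[0] << 1) | b[1]] for b in nibbles], dtype=float)
    q = np.array([GRAY_4PAM[(b[2] << 1) | b[3]] for b in nibbles], dtype=float)
    return (i + 1j * q) / np.sqrt(10.0)   # normalize average symbol energy to 1

# Split one high-rate bit stream across 4 parallel mappers and interleave the
# resulting symbol streams at the output, as a parallel modulator might.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=4 * 4 * 64)
lanes = [map_16qam(lane) for lane in bits.reshape(4, -1)]
symbols = np.stack(lanes, axis=1).reshape(-1)    # round-robin recombination
```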
In order to achieve the benefit of MIMO systems, appropriate design aspects on the modulation and demodulation architectures have to be taken into consideration. It is obvious that transmitter architectures with multiple output ports, and the more complicated receiver architectures with multiple input ports, are mainly required. However, the demodulation architecture is beyond the scope of this work and is part of future work.

This work presents an MPSoC implementation of the QAM modulator that can provide a modular and reconfigurable architecture to facilitate integration of the different processing units involved in QAM modulation. The work attempts to investigate how the performance of a sequential QAM modulator can be improved by exploiting parallelism in two forms: first, by developing a simple, pipelined version of the conventional QAM modulator, and second, by using design methodologies employed in present-day MPSoCs in order to map multiple QAM modulators on an underlying MPSoC interconnected via packet-based network-on-chip (NoC). Furthermore, this work presents a hardware-based resource allocation algorithm, enabling the system to further gain performance through dynamic load balancing. The resource allocation algorithm can also act as a graceful degradation mechanism, limiting the influence of run-time faults on the average system throughput. Additionally, the proposed MPSoC-based system can adopt variable data rates and protocols simultaneously, taking advantage of resource sharing mechanisms. The proposed system architecture was simulated using a high-level simulator and implemented/evaluated on an FPGA platform. Moreover, although this work currently targets QAM-based modulation scenarios, the methodology and reconfiguration mechanisms can target QAM-based demodulation scenarios as well. However, the design and implementation of an MPSoC-based demodulator was left as future work.

While an MPSoC implementation of the QAM modulator is beneficial in terms of throughput, there are overheads associated with the on-chip network. As such, the MPSoC-based modulator was compared to a straightforward implementation featuring multiple QAM modulators, in an effort to identify the conditions that favor the MPSoC implementation.
Comparison was carried out under variable incoming rates, system configurations and fault conditions, and simulation results showed on average double throughput rates during normal operation and ~25% less throughput degradation in the presence of faulty components, at the cost of approximately 35% more area, obtained from an FPGA implementation and synthesis results. The hardware overheads, which stem from the NoC and the resource allocation algorithm, are well within the typical values for NoC-based systems and are adequately balanced by the high throughput rates obtained.

Most of the existing hardware implementations involving QAM modulation/demodulation follow a sequential approach and simply consider the QAM as an individual module. There has been limited design exploration, and most works allow limited reconfiguration, offering inadequate data rates when using low modulation levels. The latter has been addressed through emerging SDR implementations mapped on MPSoCs, which also treat the QAM modulation as an individual system task, integrated as part of the system, rather than focusing on optimizing the performance of the modulator. Some existing works use a specific modulation type; they can, however, be extended to use higher modulation levels in order to increase the resulting data rate. Higher modulation levels, though, involve more divisions of both amplitude and phase and can potentially introduce decoding errors at the receiver, as the symbols are very close together (for a given transmission power level) and one level of amplitude may be confused (due to the effect of noise) with a higher level, thus distorting the received signal. In order to avoid this, it is necessary to allow for wide margins, and this can be done by increasing the available amplitude range through power amplification of the RF signal at the transmitter (to effectively spread the symbols out more); otherwise, data bits may be decoded incorrectly at the receiver, resulting in an increased bit error rate (BER).
However, increasing the amplitude range will operate the RF amplifiers well within their nonlinear (compression) region, causing distortion. Alternative QAM implementations try to avoid the use of multipliers and sine/cosine memories by using the CORDIC algorithm; however, they still follow a sequential approach. Software-based solutions lie in designing SDR systems mapped on general-purpose processors and/or digital signal processors (DSPs), and the QAM modulator is usually considered as a system task to be scheduled on an available processing unit. Other works utilize the MPSoC design methodology to implement SDR systems, treating the modulator as an individual system task. Reported results show that the problem with this approach is that several competing tasks running in parallel with QAM may hurt the performance of the modulation, making this approach inadequate for demanding wireless communications in terms of throughput and energy efficiency. Another particular issue raised in these works is the efficiency of the allocation algorithm. The allocation algorithm is implemented on a processor, which makes allocation slow. Moreover, the policies used to allocate tasks to processors (random allocation and distance-based allocation) may lead to on-chip contention and unbalanced loads at each processor, since the utilization of each processor is not taken into account. In one such work, a hardware unit called CoreManager is used for run-time scheduling of tasks, which aims at speeding up the allocation algorithm. The conclusions stemming from these works motivate implementing tasks such as reconfiguration and resource allocation in hardware, rather than in software running on dedicated CPUs, in an effort to reduce power consumption and improve the flexibility of the system.

This work presents a reconfigurable QAM modulator using MPSoC design methodologies and an on-chip network, with an integrated hardware resource allocation mechanism for dynamic reconfiguration. The allocation algorithm takes into consideration not only the distance between partitioned blocks (hop count) but also the utilization of each block, in an attempt to make the proposed MPSoC-based QAM modulator able to achieve robust performance under different incoming rates of data streams and different modulation levels.
Moreover, the allocation algorithm inherently acts as a graceful degradation mechanism, limiting the influence of run-time faults on the average system throughput. We used MPSoC design methodologies to map the QAM modulator onto an MPSoC architecture, which uses an on-chip, packet-based NoC. This allows a modular, "plug-and-play" approach that permits the integration of heterogeneous processing elements, in an attempt to create a reconfigurable QAM modulator. By partitioning the QAM modulator into different stand-alone tasks mapped on Processing Elements (PEs), [...]. This would require a content-addressable memory search and would expand the hardware logic of each sender PE's NIRA. Since one of our objectives is scalability, we integrated the hop count inside each destination PE's packet. The source PE polls its host NI for incoming control packets, which are stored in an internal FIFO queue. During each interval T, when the source PE receives the first control packet, a second timer is activated for a specified number of clock cycles, W. When this timer expires, the polling is halted and a heuristic algorithm based on the received conditions is run, in order to decide the next destination PE. In the case where a control packet is not received from a source PE in the specified time interval W, this PE is not included in the algorithm. This is a key feature of the proposed MPSoC-based QAM modulator; at extremely loaded conditions, it attempts to maintain a stable data rate by finding alternative PEs which are less busy.
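As a rough illustration of the heuristic just described, the following Python sketch models how a source PE might score the candidate PEs that answered within the window W using both hop count and reported utilization. The field names and the weighting are assumptions made for illustration, not the actual NIRA hardware algorithm.

```python
from dataclasses import dataclass

@dataclass
class ControlPacket:
    pe_id: int
    hops: int          # hop count from the source to this PE
    utilization: float # reported load of the PE, 0.0 (idle) .. 1.0 (saturated)

def choose_next_pe(packets: list[ControlPacket], alpha: float = 0.5) -> int | None:
    """Pick the PE with the best combined distance/utilization score.

    PEs that did not answer within the window W are simply absent from
    `packets`, mirroring the behaviour described in the text above.
    """
    if not packets:
        return None  # no candidate responded in time
    # scale utilization so it is comparable with small hop counts (assumed weighting)
    best = min(packets, key=lambda p: alpha * p.hops + (1 - alpha) * p.utilization * 10)
    return best.pe_id

# Example: a nearby but saturated PE loses to a slightly farther, idle one.
candidates = [ControlPacket(3, hops=1, utilization=0.95),
              ControlPacket(7, hops=2, utilization=0.10)]
print(choose_next_pe(candidates))   # -> 7
```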
Xilinx Virtex-5 Divider IP Manual

Radix-2 Solution
Radix-2 Feature Summary
• Provides quotient with integer or fractional remainder
Applications
Division is the most complex of the four basic arithmetic operations. Because hardware solutions are correspondingly larger and more complex than the solutions for other operations, it is best to minimize the number of divisions in any algorithm. There are many forms of division implementation, each of which may offer the optimal solution in different circumstances. The divider generator core provides two division algorithms, offering solutions targeted at small operands and large operands. The Radix-2 non-restoring algorithm solves one bit of the quotient per cycle using addition and subtraction. The design is fully pipelined, and can achieve a throughput of one division per clock cycle. If the throughput required is smaller, the divisions per clock parameter allows compromises of throughput and resource use. This algorithm naturally generates a remainder, so is the choice for applications requiring integer remainders or modulus results. The High Radix with prescaling algorithm resolves multiple bits of the quotient at a time. It is implemented by reusing the quotient estimation block, and so throughput is a function of the number of iterations required. The operands must be conditioned in preparation for the iterative operation. This overhead makes this algorithm less suitable for smaller operands. Although the iterative calculation is more complex than for Radix-2, taking more cycles to perform, the number of bits of quotient resolved per iteration and its use of XtremeDSP slices makes this the preferred option for larger operand widths.
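As a concrete illustration of the Radix-2 approach described above, the following Python sketch performs non-restoring division on unsigned integers, resolving one quotient bit per iteration using only shifts, additions and subtractions. It mirrors the named algorithm conceptually; it is not the Xilinx core's actual implementation.

```python
def nonrestoring_divide(dividend: int, divisor: int, width: int = 16):
    """Return (quotient, remainder) of dividend / divisor for unsigned operands."""
    if divisor == 0:
        raise ZeroDivisionError
    r = 0                     # partial remainder
    q = 0
    for i in range(width - 1, -1, -1):
        r = (r << 1) | ((dividend >> i) & 1)   # shift in the next dividend bit
        if r >= 0:
            r -= divisor
        else:
            r += divisor
        q = (q << 1) | (1 if r >= 0 else 0)    # quotient bit from the sign of r
    if r < 0:                 # final correction step of the non-restoring scheme
        r += divisor
    return q, r

assert nonrestoring_divide(1000, 7) == divmod(1000, 7)   # (142, 6)
```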
Pipeline approach used for recognition of dynamic meshes

Figure 1: Screenshot of MapEditor, the GUI front end of MVE-2, showing a simple convolution.

2.5 Advanced pipeline examples
The execution mechanism implemented in the core can do more than simple pipeline execution. We support multiple execution, module-driven execution and cycles. Any map can be run N times. A module can run a subbranch to provide its data more than once. It is also possible to create cycles in the module-map graph. Sinus is an example of sub-branch construction. Execution of the Sinus module is controlled by the GenerateGraph module. In this particular case the module-map runs only once while the Sinus module runs 100 times. (See Figure 2.) Counter is an example of DelayModule usage. The DelayModule acts as a single-place memory with initialization. It returns data from the previous (N-1) step. In the first step it returns data from the initialization port, allowing cycles in the module-map graph. This example counts from zero to the number of runs minus one. The delay modules can be chained. (See Figure 3.)

2.6 Module creation
Simple creation of modules is one of the most interesting features of our system. By inheriting a new class from the MveCore.Module abstract class, a fully functional module is created. There are only two methods that have to be overridden. The first one is the constructor, which creates ports and defines their names and accepted data types. The second one is the Execute method that represents the activity of the module. We are using features of the .NET system to provide comfort to module authors. For example, any public property of a module is automatically displayed in a module setup dialog, and saved/restored with the module map. Adding a user-editable parameter is therefore a matter of exposing it using the property mechanism. There is a set of advanced methods that can be called and a set of events that can be handled by a module. These additional methods make it possible to create a module with advanced features, such as immediate reaction to incoming data, advanced module GUI creation, execution of a subbranch, etc.

2.7 Documenting MVE-2
Documentation for MVE-2 module libraries can be generated automatically using the MMDOC utility that is part of the system. It uses attribute classes and comments that describe modules, ports and data types, and generates electronic documentation in a number of formats (html, chm, ...). It can also be used to generate a list of uncommented entities (methods, modules etc.), thus enforcing careful commenting.

Figure 2: (left) Simple example of module-driven subbranch execution.
Figure 3: (right) Simple example of a loop with a delay module.

3 Application of MVE-2 for AI and CG
The recognition task was one of the main topics of AI research in the past decades. Many efforts appeared in the fields of voice recognition, image recognition and mesh recognition, as this task is crucial for understanding the environment. Recognition algorithms allow the AI to reduce the amount of information to be processed; they allow it to understand the relations in the environment and to make correct decisions quickly. The task we are addressing using MVE-2 is recognition of dynamic meshes, i.e. animations in surface representation. Our goal is to provide not only static information (e.g. "the object in front of the camera is a human"), but also dynamic information ("the object in front of the camera is a human, who is jumping").
This will not only allow the system to better analyze the scene at the current time, but it will also help the system to predict future states of its environment. The task of dynamic mesh extraction is one of the state-of-the-art problems investigated by many recent papers ([3], [4]), but for our purposes we can assume that the extraction has already been performed. Our input is therefore a dynamic mesh M that consists of n static meshes. Our task is to qualitatively evaluate the dynamic mesh and to produce information about it that will help an AI system to plan its actions.

Our approach is based on template comparison. We suppose that there is a library of dynamic meshes that represent actions known to the system. The information we are extracting is the correspondence of the given dynamic mesh M to the meshes present in the library. Namely, we want to create a metric in the space of dynamic meshes that will tell us which of the known animations is most similar to the one extracted from the environment of the system. Our method is based on the approach used for static mesh comparison ([1], [2]), i.e. using the Hausdorff distance of two objects. We represent each dynamic mesh in E3 by a static tetrahedral mesh in 4D, subsequently we compute the approximate Hausdorff distance of the given mesh to each of the library meshes, and finally we pick the one with the smallest distance. Following this scheme, however, requires addressing some non-trivial issues, which will be briefly discussed in the following paragraphs.

3.1 Dynamic mesh representation
Our approach is to represent a dynamic triangle mesh by a static tetrahedral mesh in space-time. This can be easily done for meshes of constant connectivity (i.e. where each triangle corresponds to exactly one triangle in any frame of the animation). In such a case we can see the evolution of a triangle between two frames as a prism in 4D. We can now break this prism into three tetrahedra. If implemented carefully, this approach leads to a consistent tetrahedral mesh representation, even though the faces of the 4D prism are non-planar. Another issue to be addressed is the units used. We must use consistent units for all the meshes, and we must define the relation between time and space units. In order to unify space units we have decided to use relative lengths only, i.e. all sizes and positions are expressed as fractions of the body diagonal of the object. This allows us to measure spatial difference consistently for all meshes. On the other hand, time can be measured absolutely and should never be scaled. The only thing that needs to be done is to relate the time units to spatial units in order to produce the Euclidean metric in space-time that will be needed for the Hausdorff distance computation. The purpose of the space-time representation is to find how similar two animations are. In other words, distance in the space should represent difference between meshes. Therefore the distance represented by a single unit in each direction should represent equal difference.
We wish to find a constant that will relate time (measured absolutely in seconds) and space (measured as a fraction of the main diagonal). We don't know the value of this constant, but we can make the following considerations in order to estimate its value:
1. A time span of 1/100 s is almost unrecognizable for a human observer, while a spatial shift of 10% is on the limit of acceptability; therefore we expect the constant to be larger than 0.01/0.1 = 0.1.
2. Time spans of units of seconds are on the limit of acceptability, while a spatial shift of 0.1% is almost unrecognizable; therefore we expect the constant to be smaller than 1/0.001 = 1000.
That said, we can guess the value of the relation coefficient to be about 10, i.e. a time span of 100 ms is equal to a spatial shift of 1%.

3.2 Implementation
We have implemented the proposed method in a set of MVE-2 modules. First, we have debugged a simple module for computation of the distance from a point to a tetrahedron in 4D. Constructing a module that composes a set of triangular meshes into a space-time tetrahedral mesh was very easy thanks to the generality of the data structures provided by the Visualization library. It is also easy to use a variety of input formats. In order to speed the computation up we have also constructed a module called AnimationDistanceEvaluator that encapsulates the distance evaluation from each vertex of one mesh to each tetrahedron of the other. This module provides a significant speedup of the process by using advanced acceleration techniques (spatial subdivision etc.), while it preserves reusability of code, because it calls public methods of the PointToTetrahedronDistanceEvaluator module. A typical map may consist of two loops that compose two space-time tetrahedral meshes. For each of them a new point attribute is computed using the PointToTetrahedronDistanceEvaluator module that represents the distance from each point to the other mesh. A general AttributeMax module can then be used for computing the one-way mesh distance, and a general ScalarMax module finally produces the symmetric estimate of the Hausdorff distance. The resulting point attribute can also be used in other ways. We may display its value distribution with the standard Attribute Histogram and CurveRenderer modules. Such visualization helps when considering similarity of animations. We can also transform this attribute into a color attribute and display it using an MVE-2 renderer. This allows us to see exactly where and when the two animations are similar or distinct. Such information can also be very useful in many AI tasks, for example machine learning, where a trainee can see how precisely she follows some pattern.

4 Conclusions
We have shown a method for comparing dynamic meshes. This method can be used for a variety of AI applications, from animation recognition to automated learning or teaching. The implementation in the MVE-2 environment allows easy experimenting with the method in various setups and algorithms. The current implementation is still not fast enough to compare moderately complex animations in real time, but we are still working on speeding the method up. We believe that the performance of the optimized …
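Looking back at the comparison method of Sections 3.1 and 3.2, the following Python sketch illustrates the idea in a simplified form: each animation becomes a 4D point cloud (space as a fraction of the body diagonal, time in seconds scaled by the constant of about 10 estimated above), and two animations are compared with a symmetric Hausdorff-style distance. The real method uses space-time tetrahedra and the MVE-2 modules named above; this sketch only approximates that pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

ALPHA = 10.0  # relation between time (s) and relative space units, as argued above

def to_spacetime_points(frames: list, fps: float) -> np.ndarray:
    """frames: list of (n_vertices, 3) arrays belonging to one animation."""
    verts = np.concatenate(frames)
    diag = np.linalg.norm(verts.max(axis=0) - verts.min(axis=0))  # body diagonal
    pts = []
    for k, f in enumerate(frames):
        t = np.full((len(f), 1), ALPHA * k / fps)   # scaled time coordinate
        pts.append(np.hstack([f / diag, t]))        # relative space + time
    return np.concatenate(pts)

def symmetric_hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    d_ab = cKDTree(b).query(a)[0].max()   # one-way distance A -> B
    d_ba = cKDTree(a).query(b)[0].max()   # one-way distance B -> A
    return max(d_ab, d_ba)

# Usage: pick the library animation with the smallest distance to the input, e.g.
# best = min(library, key=lambda m: symmetric_hausdorff(query_pts, m.pts))
```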
I-Shou University, Computer Organization — Chapter 4 The Processor

Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
How to Design a Processor?
Analyze instruction set (datapath requirement)
Select set of datapath components and establish clocking methodology
PC
Extender for zero- or sign-extension
Add 4 or extended immediate to PC
§ 4.3 Building a Datapath
Building a Datapath
Build datapath meeting the requirements
Analyze implementation of each instruction to determine setting of control points effecting the register transfer
Read two register operands
Perform arithmetic/logical operation
Write register result
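A behavioural Python sketch of the R-type steps listed above (read two register operands, perform the ALU operation, write the result back) is given below; the register conventions and ALU operation set are illustrative, not a complete MIPS datapath.

```python
regs = [0] * 32   # register file; register 0 is hard-wired to zero

ALU_OPS = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
}

def execute_rtype(op: str, rd: int, rs: int, rt: int) -> None:
    a, b = regs[rs], regs[rt]          # 1. read two register operands
    result = ALU_OPS[op](a, b)         # 2. perform arithmetic/logical operation
    if rd != 0:                        # 3. write register result ($zero stays 0)
        regs[rd] = result

regs[8], regs[9] = 5, 7
execute_rtype("add", 10, 8, 9)
assert regs[10] == 12
```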
NLFM Pulse Compression and Its Time-Domain FPGA Implementation

Journal of Hubei University (Natural Science), Vol. 40, No. 4, July 2018. Received: 2017-09-04. Supported by the National Natural Science Foundation of China (61601175). Authors: LU Cong (1993-), male, master's student; WANG Xuguang (corresponding author), Ph.D., lecturer, master's supervisor, E-mail: 109278484@. Article number: 1000-2375(2018)04-0384-06.

NLFM Pulse Compression and Its Time-Domain FPGA Implementation
LU Cong, YANG Weiming, WANG Xuguang, ZENG Zhangfan (School of Computer and Information Engineering, Hubei University, Wuhan 430062, China)

Abstract: This paper introduces the generation principle of nonlinear frequency modulated (NLFM) signals and the design of a matched filter that implements pulse compression. MATLAB is used to generate the NLFM pulse and the radar echo signal; a matched filter with a distributed FIR structure is designed on the FPGA device EP2C35F672C8 to perform pulse compression on the sampled and quantized echo; finally, Modelsim is used to simulate the waveform of the pulse-compressed echo and check the matched-filter design. The whole circuit uses a fully pipelined, parallel-execution structure and occupies 2468 logic elements, 2073 registers, and 25 KB of RAM. Using the FPGA chip's abundant BRAM and LABs instead of multiplier IP removes the limit that hardware resources place on filter length.
Keywords: NLFM signal; time-domain pulse compression; FPGA; matched filter; distributed filtering algorithm
CLC number: TN713; Document code: A; DOI: 10.3969/j.issn.1000-2375.2018.04.013

NLFM pulse compression and its time domain implementation by FPGA
LU Cong, YANG Weiming, WANG Xuguang, ZENG Zhangfan (School of Computer & Information Engineering, Hubei University, Wuhan 430062, China)
Abstract: The generation principle of the nonlinear frequency modulated (NLFM) signal and the design of a matched filter realizing pulse compression are analyzed in this paper. The NLFM signal and the radar echo signal were generated by MATLAB tools; a matched filter with a distributed FIR structure realizing pulse compression was designed on the FPGA device EP2C35F672C8 and processes the sampled and quantized echo signals. The waveform of the processed signal was simulated with Modelsim to check the effect of the matched filter. The whole filter circuit was designed with a fully pipelined, parallel-execution structure. The FPGA hardware resources occupied by the circuit include 2468 logic units, 2073 registers, and 25 KB of RAM. By using the BRAM and LABs of the FPGA chip instead of multiplier IP, the limitation of hardware resources on the length of the filter is removed.
Key words: NLFM signal; time domain pulse compression; FPGA; matched filter; distributed filter algorithm

0 Introduction
Modern radars usually adopt pulse compression to improve the system's velocity resolution and range resolution [1]. Pulse compression is the process in which the wide frequency-modulated pulse emitted by the radar transmitter is processed by a digital matched filter at the receiver to obtain a narrow echo pulse. The compressed signal has both a large time width and a large bandwidth, which guarantees the radar's detection range and target resolution [2]. LFM and NLFM signals are the two basic signals commonly used for pulse compression. The LFM signal is easy to generate and widely used, but when an LFM echo is passed directly through the matched filter the compressed output has high sidelobes; a window function is normally applied to the output to suppress them, which broadens the main lobe to some degree. NLFM signals are generally designed from a window function [3]; their advantage is that directly matched-filtering the echo already yields a very low-sidelobe output, so the weighting step is not needed.

Pulse compression can be implemented in the frequency domain or in the time domain [4]. The frequency-domain method is fast, but it requires repeated fast Fourier transforms (FFT) and inverse FFTs (IFFT), so the hardware cost is high; the time-domain method has a simple circuit structure but is slower. This paper designs an FIR matched filter based on the distributed algorithm [5-6] with a fully pipelined, parallel-execution structure and implements time-domain pulse compression of the NLFM signal on an FPGA, saving hardware while keeping the computation fast.

1 NLFM signal generation and implementation of pulse compression
1.1 Generation of the NLFM signal. Generating an NLFM signal is relatively complicated; there are many mathematical models and no unified standard, and approximations are generally used. A classical approach uses the principle of stationary phase: the weighting window applied to an LFM signal is turned into a spectrum function, so that the designed NLFM signal has an approximately window-shaped spectrum. Such a signal, compared with an LFM signal, needs no weighting step when it is pulse-compressed, and gives better sidelobe suppression and a steeper transition band. Taking the Hamming window as an example (other windows are handled similarly), the design principle is as follows [7]. The Hamming window is

W(f) = 0.54 + 0.46·cos(2πf/B),  −B/2 ≤ f ≤ B/2    (1)

and the group delay based on this window is

T(f) = K_T · ∫_{−∞}^{f} W(y) dy    (2)

where the constant K_T = (T/B)/0.54. Substituting (1) into (2) gives

T(f) = (T/B)·f + (0.426T/π)·sin(2πf/B),  −B/2 ≤ f ≤ B/2    (3)

and taking the inverse of this function with respect to T gives

f(T) = T^(−1)(f).    (4)

For clarity, t is written in place of T, so f(t) is the NLFM signal designed from the Hamming window. For a simple group-delay function the inverse can be obtained directly with MATLAB's built-in functions; when the group-delay function is complicated, numerical methods are needed to derive it. The NLFM signal can be produced with direct digital synthesis (DDS) or with MATLAB; in this paper the radar transmit signal and the echo signal are generated by numerical analysis in MATLAB.
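A hedged numerical sketch of the window-based design of Eqs. (1)-(4) follows: it builds the Hamming-window group delay, inverts it numerically to obtain the instantaneous frequency, and integrates to get the phase of the NLFM pulse. The bandwidth and pulse width follow the simulation parameters quoted later in the paper; the sampling rate is an assumption of this sketch.

```python
import numpy as np

B, T = 5e6, 5e-6          # bandwidth and pulse width (values from Section 3.2)
fs = 50e6                 # sampling rate (assumed for this sketch)

f_grid = np.linspace(-B / 2, B / 2, 4001)
# Eq. (3): group delay for a Hamming-shaped spectrum, shifted to [0, T]
t_of_f = (T / B) * f_grid + (0.426 * T / np.pi) * np.sin(2 * np.pi * f_grid / B) + T / 2

t = np.arange(0, T, 1 / fs)
f_inst = np.interp(t, t_of_f, f_grid)          # numerical inverse f(t) of Eq. (4)
phase = 2 * np.pi * np.cumsum(f_inst) / fs     # integrate frequency to phase
s = np.exp(1j * phase)                         # complex NLFM pulse
```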
1.2 Implementation of pulse compression. The principle of pulse compression is to compress the wide echo pulse at the radar receiver, reducing its time width and raising its peak, so that the time-bandwidth product of the compressed signal is much larger than 1. A radar system using pulse compression can provide both velocity resolution and range resolution, and a matched filter designed on an FPGA is currently the mainstream way to implement the technique. The processing can be described mathematically as follows [8]. Time-domain pulse compression is the linear convolution of the matched filter's impulse response h(t) with the radar echo s(t):

y(t) = s(t) * h(t) = ∫_{−∞}^{t} s(τ)·h(t−τ) dτ    (5)

According to the optimum matching criterion, when the output signal-to-noise ratio is maximized the matched filter's impulse response is

h(t) = K·s*(t0 − t)    (6)

where K is a constant, t0 is a delay, and s*(t) denotes the complex conjugate; with K = 1 and t0 = 0, the filter's impulse response is the complex conjugate of the echo. Considering the variety of noise carried by the echo and the uncertainty of the target information, the time-domain matched filter is designed with an approximation: the complex conjugate of the transmitted signal is used as the filter's impulse response. The transmitted signal is known, which greatly simplifies the filter design. In addition, compared with LFM pulse compression, where window weighting is used to suppress the sidelobes, directly using the NLFM signal as the radar transmit pulse makes the circuit design simpler and more effective.

2 Design and implementation of the matched filter
2.1 Analysis of the FIR filter structure. The structure of a traditional FIR matched filter is shown in Fig. 1. Its output is

y(n) = Σ_{i=0..N−1} x(n−i)·h(i)    (7)

Figure 1: Structure of the traditional FIR matched filter.

From (7), an N-tap FIR matched filter of the traditional structure needs N multipliers and N−1 adders; since both the echo and the filter's impulse response are complex, an N-tap matched filter actually needs 4N multipliers and 4N−1 adders. When N is large, the FPGA's embedded IP resources cannot meet the design requirement, and the multiplications are complex and slow. With the distributed algorithm, BRAM and LABs replace the multipliers, which not only removes the multiplier-resource limit on the filter design but also preserves the filter's computation rate.

2.2 Principle of the distributed filter. The distributed filter uses the BRAM embedded in the FPGA chip and its abundant LUTs, replacing the multipliers of the convolution by data storage and address conversion. The design works as follows: all possible values of the first-stage accumulation of the N-tap convolution are pre-stored in a RAM block; the input data are then converted into addresses of the storage module and the RAM is looked up; finally the outputs of the storage module are shifted and summed to obtain the convolution result. The algorithm converts multiplications into memory and register operations, making full use of the FPGA's resources and saving hardware cost. The principle is as follows. The echo x(t) is sampled to obtain the filter input x(n), whose binary representation is

x(n) = Σ_{k=0..b−1} x_k(n)·2^k    (8)

where x_k(n) is the k-th bit of x(n) and b is the word length of the sampled data. The output of the N-tap matched filter is then

y(n) = Σ_{n=0..N−1} x(n)·h(N−1−n) = Σ_{n=0..N−1} h(N−1−n)·Σ_{k=0..b−1} x_k(n)·2^k = Σ_{k=0..b−1} 2^k·Σ_{n=0..N−1} h(N−1−n)·x_k(n)    (9)

From (9), the k-th bit of each input sample (1 or 0) is first ANDed with the filter coefficients and the results are summed; the partial sum is then shifted left by k bits (multiplying by 2^k is a left shift by k) and the shifted sums are accumulated, finally giving the convolution y(n). The distributed algorithm thus turns the convolution from an accumulation of products into a shift-and-add process [9]. The analysis shows that once the first-stage accumulation values are known, shifting and summing yields the time-domain convolution; therefore, in the circuit design, all possible values of the first accumulation are pre-stored in a RAM block, the input data are converted into memory addresses, and the memory outputs are shifted and summed. This is the principle of the distributed algorithm.

2.3 FPGA implementation of the distributed filter. From (9), the filter designed here has 48 taps, so the possible values of the term Σ_{n=0..N−1} h(N−1−n)·x_k(n) number 2^48; taking complex multiplication into account, storing them directly in a ROM table would require 2^2 × 2^48 storage units, which is impossible on any existing FPGA chip. For long filters, a structure with several pipelines executing in parallel is therefore adopted: the overall pipeline is partitioned, which reduces the amount of storage required. With six parallel pipelines, each pipeline is an 8-tap FIR filter and needs a memory of 2^8 units, so the pipelined design reduces the RAM needed for the convolution to 6 × 2^8 × 2^2 units, which an ordinary FPGA can provide. The number of pipelines can be increased further to cut the storage requirement even more. The block diagram of the distributed filter is shown in Fig. 2.

Figure 2: Pipeline structure of the distributed filter.
Figure 3: Overall structure of the distributed filter.

In Fig. 2, k denotes the k-th bit of the input data. Each ROM table stores all possible product terms of an 8-tap distributed-arithmetic FIR stage and therefore needs 2^8 storage units; the input data are converted into addresses for the memory, and the ROM outputs are then shifted (left by k bits) and summed. The complete filter needs four such pipeline structures, whose outputs are the values I1, I2, Q1 and Q2 in Fig. 3: because the filter input is complex, complex multiplication requires four pipelines of the kind shown in Fig. 2. The overall structure of the distributed filter is shown in Fig. 3. As Fig. 3 shows, the real and imaginary parts of the echo are first separated and then sampled and quantized; this step is carried out with MATLAB. The filter must ultimately output the modulus of the signal. The traditional modulus computation uses multipliers and a square-root operation, which is complex and slow, so a simple estimation method is needed that also reduces the latency. Let the modulus of the signal be Y; the estimate is [10]

Y = MAX{ MAX(|I|,|Q|), (7/8)·MAX(|I|,|Q|) + (1/2)·MIN(|I|,|Q|) }    (10)

According to statistics, this complex-modulus formula loses no more than 0.13 dB of the signal, and the term (7/8)·MAX(|I|,|Q|) can be realized with a combination of shift registers and adders. This completes the overall structure of the matched filter: the whole design uses only register and adder resources, so in theory, as long as the FPGA has enough ROM and adder resources, a filter of any length can be built.

Figure 4: Circuit schematic of the distributed filter.

3 FPGA implementation and test of pulse compression
3.1 FPGA hardware circuit design. The hardware implementation of the distributed filter uses a fully pipelined, parallel-execution structure, whose characteristics are fast computation and high resource usage. The ALTERA FPGA device EP2C35F672C8 is chosen for the circuit design; the schematic is shown in Fig. 4. The real and imaginary parts of the echo, after sampling and quantization, are stored in the on-chip memory blocks ROM_real and ROM_imag; driven by the clock signal CLK and addressed by the counter module, they serve as the input data of the matched filter. The address module converts the k-th bit of the input data into addresses of the ROM tables, extracting the product terms; this step is the core of the distributed algorithm, turning the multipliers of the convolution into lookup tables. Finally the ROM outputs are shifted and summed to obtain the real part I and the imaginary part Q of the pulse-compressed echo. The distributed filter outputs the real and imaginary parts of the compressed echo, from which the modulus must still be computed. From (10), the approximation can be realized entirely with adders and shift registers; the circuit is shown in Fig. 5. The xor2 module computes the absolute values of the real and imaginary parts (the input data are XORed bit by bit with their sign bit); the comparator's data selection and the adder's accumulation complete the complex modulus computation, and the output data4[18:0] is the approximate modulus of the pulse-compressed echo. This completes the hardware design of the FPGA matched filter; a Test Bench driver must then be written to verify the filter's performance.

Figure 5: Schematic of the modulus-estimation circuit.
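The following Python sketch is a bit-level software model of the distributed algorithm of Eqs. (8)-(9) and of the modulus estimate of Eq. (10). It uses a short real-valued 8-tap example with 8-bit unsigned samples rather than the 48-tap complex hardware design, purely to show that the LUT-plus-shift-and-add form reproduces the direct convolution.

```python
import numpy as np

TAPS = np.array([1, -2, 3, -1, 2, 0, 1, -3])      # example coefficients (assumed)
BITS = 8

# Precompute the LUT: for every 8-bit address, the sum of taps whose bit is 1
# (this is the "first accumulation" stored in ROM in the hardware design).
LUT = np.array([sum(TAPS[n] for n in range(8) if (addr >> n) & 1)
                for addr in range(256)])

def da_fir_output(window: np.ndarray) -> int:
    """One output sample from the last 8 input samples (each 0..255)."""
    acc = 0
    for k in range(BITS):                          # one bit-plane per "cycle"
        addr = sum(((int(window[n]) >> k) & 1) << n for n in range(8))
        acc += LUT[addr] << k                      # shift-and-add of Eq. (9)
    return int(acc)

def magnitude_estimate(i: int, q: int) -> int:
    """Eq. (10): multiplier-free approximation of sqrt(i^2 + q^2)."""
    big, small = max(abs(i), abs(q)), min(abs(i), abs(q))
    return max(big, (7 * big) // 8 + small // 2)

window = np.array([10, 0, 255, 3, 17, 200, 5, 9])
assert da_fir_output(window) == int(np.dot(TAPS, window))   # matches direct FIR
```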
3.2 Modelsim simulation. MATLAB is used to design the NLFM signal [11] and to generate the radar echo as the filter's input signal. The design parameters are bandwidth B = 5 MHz and pulse width T = 5 μs. According to the Nyquist sampling theorem, the sampling frequency must be at least twice the highest frequency of the signal, otherwise aliasing occurs; the echo sampling frequency is therefore set to f_p = 2.5·B. From the radar range-resolution formula δ = c/2B, where c is the speed of light, the theoretical resolution is δ = 30 m; although the measured value differs from the theoretical one because of the sampling precision and the filter length, modern radars keep pursuing this theoretical value. The spacing between the two target signals is set to 45 m. The MATLAB simulation results are shown in Fig. 6.

Figure 6: MATLAB simulation of the pulse-compression technique.

As Fig. 6 shows, the pulse transmitted by the radar is an NLFM signal and two detection targets spaced 45 m apart are set; after some time the echo is received at the radar receiver. The number and spacing of the targets cannot be obtained from the echo waveform itself, whereas after pulse-compression processing the sidelobes of the waveform are suppressed and the two main lobes representing the target signals become much clearer, so the signal energy is concentrated in the main lobes and the energy loss is reduced. The echo is then sampled and quantized as the input data of the FPGA matched filter in order to test how well the distributed algorithm implements pulse compression; the result is shown in Fig. 7. Compared with Fig. 6, the sampled and quantized echo processed by the FPGA matched filter achieves the pulse-compression effect well, which also shows that the distributed algorithm can completely replace the multiplications of the linear convolution. The measured data give a main-lobe spacing of Δt = 12 ns and a measured spacing between the two targets of 45 m, consistent with the set value. Considering the influence of the sampling frequency and the quantization precision, increasing the filter order can further improve the resolution of the target spacing. The verification results show that, with the window-based NLFM signal as the radar transmit pulse, the pulse-compressed echo waveform has good sidelobe suppression and a steep transition band, giving strong target-recognition ability. The FPGA results show that the whole circuit occupies 2468 logic elements, 2073 registers and 25 KB of RAM; the fully pipelined implementation of the distributed algorithm demands comparatively many resources, but as process technology improves and chips integrate more basic resources, the distributed-algorithm implementation of pulse compression will be used ever more widely.

Figure 7: Waveform of the echo signal after processing by the FPGA circuit.

4 Conclusion
In this paper a matched filter with a distributed FIR structure is used to implement pulse compression of the NLFM signal. The registers, adders and ROM resources of the FPGA replace the multipliers of a traditional filter and the square-root operation of the modulus computation, greatly reducing the hardware cost, and the fully pipelined, parallel-execution structure guarantees the speed of the time-domain pulse compression. Comparing the pulse-compression simulation results of the NLFM and LFM signals shows that, with an NLFM signal as the radar transmit pulse, the receiver obtains an echo waveform with low sidelobes and a steep transition band, which reduces the energy loss of the radar signal within the effective bandwidth and provides strong target resolution. For the NLFM signal, the matched-filter order N must approach or even equal f_p·D/B; when the time-bandwidth product D is large, the cost of time-domain pulse compression is correspondingly high.

5 References
[1] PAN Lin. Research and implementation of a radar pulse compression system based on FPGA [D]. Shanghai: Shanghai Jiao Tong University, 2008.
[2] LIANG Li. Design of a radar signal processing system based on FPGA [D]. Nanjing: Nanjing University of Science and Technology, 2006.
[3] RUAN Liting. Waveform design and pulse compression of nonlinear frequency modulated signals [D]. Xi'an: Xidian University, 2009.
[4] WANG Kun. Research and implementation of a pulse compression system based on FPGA [D]. Wuhan: Huazhong University of Science and Technology, 2009.
[5] CHENG Yuandong, ZHENG Jingxiang. A high-order distributed FIR filter for digital down-conversion and its FPGA implementation [J]. Application of Electronic Technique, 2011, 37(2): 57-59.
[6] LI Shuhua, ZENG Yicheng. High-order FIR filter based on the distributed algorithm and its FPGA implementation [J]. Computer Engineering and Applications, 2010, 46(12): 136-138.
[7] XU Fei. FPGA-based implementation of pulse compression of nonlinear frequency modulated signals [D]. Xi'an: Xidian University, 2014.
[8] SUN Baopeng. Design and implementation of radar signal processing algorithms based on FPGA [D]. Beijing: Beijing Institute of Technology, 2014.
[9] CUI Yongqiang, GAO Xiaoding, HE Suxin. Filter design based on the FPGA distributed algorithm [J]. Modern Electronics Technique, 2010, 33(16): 117-119.
[10] YANG Weiming. A signal modulus method based on EPLD technology [J]. Journal of Hubei University (Natural Science), 1999, 11(2): 138-141.
[11] DU Yong. MATLAB and FPGA Implementation of Digital Filters [M]. 2nd ed. Beijing: Publishing House of Electronics Industry, 2015.
Computer Organization and Design, Fifth Edition — Answers

Computer Organization and Design: Computer Organization and Design is a book published by China Machine Press in 2010, written by David A. Patterson.
The book uses a MIPS processor to demonstrate the fundamentals of computer hardware technology, pipelining, the memory hierarchy, and I/O.
In addition, the book includes an introduction to the x86 architecture.
Synopsis: This best-selling computer organization book has been thoroughly updated to address the revolutionary change taking place in computer architecture today: the move from uniprocessors to multicore microprocessors.
Furthermore, the ARM edition of the book was published to emphasize the importance of embedded systems to the computing industry across Asia, and it uses ARM processors to discuss the instruction sets and arithmetic of real computers.
ARM is the most popular instruction set architecture for embedded devices, with roughly four billion embedded devices sold worldwide every year.
ARMv6 (the ARM 11 family) is used as the primary architecture to present the fundamentals of the instruction set and computer arithmetic.
It covers the revolutionary shift from sequential to parallel computing, adds a new chapter on parallelism, and includes sections in every chapter that highlight parallel hardware and software topics.
A new appendix, written by NVIDIA's chief scientist and head of architecture, covers the emergence and importance of the modern GPU, giving the first in-depth description of this highly parallel, multithreaded, multicore processor optimized for visual computing.
It describes a unique method for measuring multicore performance, the Roofline model, and uses it to benchmark and analyze the performance of the AMD Opteron X4, Intel Xeon 5000, Sun UltraSPARC T2 and IBM Cell.
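The Roofline bound itself is simple: attainable performance is limited by the smaller of peak compute throughput and peak memory bandwidth multiplied by arithmetic intensity. The tiny Python sketch below illustrates it with placeholder peak numbers, not measurements of the processors benchmarked in the book.

```python
def roofline(peak_gflops: float, peak_gbps: float, intensity_flops_per_byte: float) -> float:
    # attainable GFLOP/s = min(peak compute, bandwidth * arithmetic intensity)
    return min(peak_gflops, peak_gbps * intensity_flops_per_byte)

# Example: a hypothetical chip with 70 GFLOP/s peak and 20 GB/s of memory bandwidth.
for ai in (0.25, 1.0, 4.0, 16.0):
    print(f"AI={ai:>5}: attainable {roofline(70.0, 20.0, ai):.1f} GFLOP/s")
```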
It includes new material on flash memory and virtual machines.
It provides a large number of stimulating exercises, amounting to more than 200 pages.
The AMD Opteron X4 and Intel Nehalem are used as running examples throughout Computer Organization and Design: The Hardware/Software Interface (English, 4th Edition, ARM Edition).
All processor performance examples are updated with the SPEC CPU2006 suite.
Table of contents:
1 Computer Abstractions and Technology: 1.1 Introduction; 1.2 Below Your Program; 1.3 Under the Covers; 1.4 Performance; 1.5 The Power Wall; 1.6 The Sea Change: The Switch from Uniprocessors to Multiprocessors; 1.7 Real Stuff: Manufacturing and Benchmarking the AMD Opteron X4; 1.8 Fallacies and Pitfalls; 1.9 Concluding Remarks; 1.10 Historical Perspective and Further Reading; 1.11 Exercises
2 Instructions: Language of the Computer: 2.1 Introduction; 2.2 Operations of the Computer Hardware; 2.3 Operands of the Computer Hardware; 2.4 Signed and Unsigned Numbers; 2.5 Representing Instructions in the Computer; 2.6 Logical Operations; 2.7 Instructions for Making Decisions; 2.8 Supporting Procedures in Computer Hardware; 2.9 Communicating with People; 2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes; 2.11 Parallelism and Instructions: Synchronization; 2.12 Translating and Starting a Program; 2.13 A C Sort Example to Put It All Together; 2.14 Arrays versus Pointers; 2.15 Advanced Material: Compiling C and Interpreting Java; 2.16 Real Stuff: MIPS Instructions; 2.17 Real Stuff: x86 Instructions; 2.18 Fallacies and Pitfalls; 2.19 Concluding Remarks; 2.20 Historical Perspective and Further Reading; 2.21 Exercises
3 Arithmetic for Computers: 3.1 Introduction; 3.2 Addition and Subtraction; 3.3 Multiplication; 3.4 Division; 3.5 Floating Point; 3.6 Parallelism and Computer Arithmetic: Associativity; 3.7 Real Stuff: Floating Point in the x86; 3.8 Fallacies and Pitfalls; 3.9 Concluding Remarks; 3.10 Historical Perspective and Further Reading; 3.11 Exercises
4 The Processor: 4.1 Introduction; 4.2 Logic Design Conventions; 4.3 Building a Datapath; 4.4 A Simple Implementation Scheme; 4.5 An Overview of Pipelining; 4.6 Pipelined Datapath and Control; 4.7 Data Hazards: Forwarding versus Stalling; 4.8 Control Hazards; 4.9 Exceptions; 4.10 Parallelism and Advanced Instruction-Level Parallelism; 4.11 Real Stuff: the AMD Opteron X4 (Barcelona) Pipeline; 4.12 Advanced Topic: an Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations; 4.13 Fallacies and Pitfalls; 4.14 Concluding Remarks; 4.15 Historical Perspective and Further Reading; 4.16 Exercises
5 Large and Fast: Exploiting Memory Hierarchy: 5.1 Introduction; 5.2 The Basics of Caches; 5.3 Measuring and Improving Cache Performance; 5.4 Virtual Memory; 5.5 A Common Framework for Memory Hierarchies; 5.6 Virtual Machines; 5.7 Using a Finite-State Machine to Control a Simple Cache; 5.8 Parallelism and Memory Hierarchies: Cache Coherence; 5.9 Advanced Material: Implementing Cache Controllers; 5.10 Real Stuff: the AMD Opteron X4 (Barcelona) and Intel Nehalem Memory Hierarchies; 5.11 Fallacies and Pitfalls; 5.12 Concluding Remarks; 5.13 Historical Perspective and Further Reading; 5.14 Exercises
6 Storage and Other I/O Topics: 6.1 Introduction; 6.2 Dependability, Reliability, and Availability; 6.3 Disk Storage; 6.4 Flash Storage; 6.5 Connecting Processors, Memory, and I/O Devices; 6.6 Interfacing I/O Devices to the Processor, Memory, and Operating System; 6.7 I/O Performance Measures: Examples from Disk and File Systems; 6.8 Designing an I/O System; 6.9 Parallelism and I/O: Redundant Arrays of Inexpensive Disks; 6.10 Real Stuff: Sun Fire x4150 Server; 6.11 Advanced Topics: Networks; 6.12 Fallacies and Pitfalls; 6.13 Concluding Remarks; 6.14 Historical Perspective and Further Reading; 6.15 Exercises
7 Multicores, Multiprocessors, and Clusters: 7.1 Introduction; 7.2 The Difficulty of Creating Parallel Processing Programs; 7.3 Shared Memory Multiprocessors; 7.4 Clusters and Other Message-Passing Multiprocessors; 7.5 Hardware Multithreading; 7.6 SISD, MIMD, SIMD, SPMD, and Vector; 7.7 Introduction to Graphics Processing Units; 7.8 Introduction to Multiprocessor Network Topologies; 7.9 Multiprocessor Benchmarks; 7.10 Roofline: A Simple Performance Model; 7.11 Real Stuff: Benchmarking Four Multicores Using the Roofline Model; 7.12 Fallacies and Pitfalls; 7.13 Concluding Remarks; 7.14 Historical Perspective and Further Reading; 7.15 Exercises
Index
CD-ROM CONTENT
A Graphics and Computing GPUs: A.1 Introduction; A.2 GPU System Architectures; A.3 Scalable Parallelism - Programming GPUs; A.4 Multithreaded Multiprocessor Architecture; A.5 Parallel Memory System; A.6 Floating Point Arithmetic; A.7 Real Stuff: The NVIDIA GeForce 8800; A.8 Real Stuff: Mapping Applications to GPUs; A.9 Fallacies and Pitfalls; A.10 Concluding Remarks; A.11 Historical Perspective and Further Reading
B1 ARM and Thumb Assembler Instructions: B1.1 Using This Appendix; B1.2 Syntax; B1.3 Alphabetical List of ARM and Thumb Instructions; B1.4 ARM Assembler Quick Reference; B1.5 GNU Assembler Quick Reference
B2 ARM and Thumb Instruction Encodings
B3 Instruction Cycle Timings
C The Basics of Logic Design
D Mapping Control to Hardware
ADVANCED CONTENT
HISTORICAL PERSPECTIVES & FURTHER READING
TUTORIALS
SOFTWARE
About the author: David A. Patterson is a professor of computer science at the University of California, Berkeley.
Xian Yang - English Translation

Shenyang University of Technology, Undergraduate Graduation Project (Thesis), English Translation
Project title: A High-Speed DES Implementation for Network Applications
School: School of Information Science and Engineering
Major and class: Computer Science and Technology, Class 0704
Student name: Xian Yang
Student number: 070405127
Supervisor: Liu Ge

A High-speed DES Implementation for Network Applications

Abstract
A high-speed data encryption chip implementing the Data Encryption Standard (DES) has been developed. The DES modes of operation supported are Electronic Code Book and Cipher Block Chaining. The chip is based on a gallium arsenide (GaAs) gate array containing 50K transistors. At a clock frequency of 250 MHz, data can be encrypted or decrypted at a rate of 1 GBit/second, making this the fastest single-chip implementation reported to date. High performance and high density have been achieved by using custom-designed circuits to implement the core of the DES algorithm. These circuits employ precharged logic, a methodology novel to the design of GaAs devices. A pipelined flow-through architecture and an efficient key exchange mechanism make this chip suitable for low-latency network controllers.

1. Introduction
Networking and secure distributed systems are major research areas at the Digital Equipment Corporation's Systems Research Center. A prototype network called Autonet with 100 MBit/s links has been in service there since early 1990 [14]. We are currently working on a follow-on network with link data rates of 1 GBit/s. The work described here was motivated by the need for data encryption hardware for this new high-speed network. Secure transmission over a network requires encryption hardware that operates at link speed. Encryption will become an integral part of future high-speed networks.

We have chosen the Data Encryption Standard (DES) since it is widely used in commercial applications and allows for efficient hardware implementations. Several single-chip implementations of the DES algorithm exist or have been announced. Commercial products include the AmZ8068/Am9518 [1] with an encryption rate of 14 MBit/s and the recently announced VM007 with a throughput of 192 MBit/s [18].

An encryption rate of 1 GBit/s can be achieved by using a fast VLSI technology. Possible candidates are GaAs direct-coupled field-effect transistor logic (DCFL) and silicon emitter-coupled logic (ECL). As a semiconductor material GaAs is attractive because of the high electron mobility, which makes GaAs circuits twice as fast as silicon circuits. In addition, electrons reach maximum velocity in GaAs at a lower voltage than in silicon, allowing for lower internal operating voltages, which decreases power consumption. These properties position GaAs favorably with respect to silicon, in particular for high-speed applications. The disadvantage of GaAs technology is its immaturity compared with silicon technology. GaAs has been recognized as a possible alternative to silicon for over twenty years, but only recently have the difficulties with manufacturing been overcome. GaAs is becoming a viable contender for VLSI designs [8, 10] and motivated us to explore the feasibility of GaAs for our design.

In this paper, we will describe a new implementation of the DES algorithm with a GaAs gate array. We will show how high performance can be obtained even with the limited flexibility of a semi-custom design. Our approach was to use custom-designed circuits to implement the core of the DES algorithm and an unconventional chip layout that optimizes the data paths. Further, we will describe how encryption can be incorporated into network controllers without compromising network throughput or latency.
We will show that low latency can be achieved with a fully pipelined DES chip architecture and hardware support for a key exchange mechanism that allows for selecting the key on the fly. Section 2 of this paper outlines the DES algorithm. Section 3 describes the GaAs gate array that we used for implementing the DES algorithm. Section 4 provides a detailed description of our DES implementation. Section 5 shows how the chip can be used for network applications and the features that make it suitable for building low-latency network controllers. This section also includes a short analysis of the economics of breaking DES-enciphered data. Finally, section 6 contains some concluding remarks.

2. DES Algorithm
The DES algorithm was issued by the National Bureau of Standards (NBS) in 1977. A detailed description of the algorithm can be found in [11, 13]. The DES algorithm enciphers 64-bit data blocks using a 56-bit secret key (not including parity bits which are part of the 64-bit key block). The algorithm employs three different types of operations: permutations, rotations, and substitutions. The exact choices for these transformations, i.e. the permutation and substitution tables, are not important to this paper. They are described in [11].

As shown in Fig. 1, a block to be enciphered is first subjected to an initial permutation (IP), then to 16 iterations, or rounds, of a complex key-dependent computation, and finally to the inverse initial permutation (IP-1). The key schedule transforms the 56-bit key into sixteen 48-bit partial keys by using each of the key bits several times. Fig. 1 shows an expanded version of the 16 DES iterations for encryption. The inputs to the 16 rounds are the output of IP and sixteen 48-bit keys K1..16 that are derived from the supplied 56-bit key. First, the 64-bit output data block of IP is divided into two halves L0 and R0, each consisting of 32 bits. Decryption and encryption use the same data path, and differ only in the order in which the key bits are presented to function f. That is, for decryption K16 is used in the first iteration, K15 in the second, and so on, with K1 used in the 16th iteration. The order is reversed simply by changing the direction of the rotate operation performed on CD.

For enciphering data streams that are longer than 64 bits, the obvious solution is to cut the stream into 64-bit blocks and encipher each of them independently. This method is known as Electronic Code Book (ECB) mode [12]. Since for a given key and a given plaintext block the resulting ciphertext block will always be the same, frequency analysis could be used to retrieve the original data. There exist alternatives to the ECB mode that use the concept of diffusion so that each ciphertext block depends on all previous plaintext blocks. These modes are called Cipher Block Chaining (CBC) mode, Cipher Feedback (CFB) mode, and Output Feedback (OFB) mode.
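To illustrate the property noted above, that decryption uses the same data path with the key schedule applied in reverse order, the following Python sketch implements a generic 16-round Feistel skeleton with a toy round function. It is not real DES (the expansion, S-box and permutation tables are omitted), but it shows why reversing the subkey order inverts the cipher.

```python
def toy_f(half: int, subkey: int) -> int:
    """Stand-in for the DES cipher function f (32-bit in, 32-bit out)."""
    x = (half ^ subkey) & 0xFFFFFFFF
    return ((x * 0x9E3779B1) ^ (x >> 13)) & 0xFFFFFFFF

def feistel(block64: int, subkeys: list) -> int:
    left = (block64 >> 32) & 0xFFFFFFFF
    right = block64 & 0xFFFFFFFF
    for k in subkeys:                       # 16 rounds when len(subkeys) == 16
        left, right = right, left ^ toy_f(right, k)
    # final swap, so that decryption is the same function with reversed keys
    return (right << 32) | left

subkeys = [(0xA5A5A5A5 * (i + 1)) & 0xFFFFFFFF for i in range(16)]
plain = 0x0123456789ABCDEF
cipher = feistel(plain, subkeys)
assert feistel(cipher, subkeys[::-1]) == plain   # decrypt = reversed key order
```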
3. DES Chip Implementation
This section describes how we implemented the DES algorithm.

3.1. Organization
There are two ways to improve an algorithm's performance. One can choose a dense but slow technology such as silicon CMOS and increase performance by parallelizing the algorithm or flattening the logic. Alternatively, one can choose a fast but low-density technology such as silicon ECL or GaAs DCFL.

The DES algorithm imposes limits on the former approach. The CBC mode of operation combines the result obtained by encrypting a block with the next input block. Since the result has to be available before the next block can be processed, it is impossible to parallelize the algorithm and operate on more than one block at a time. It is, however, possible to unroll the 16 rounds of Fig. 1 and implement all 16 iterations in sequence. Flattening the design in this manner will save the time needed to latch the intermediate results in a register on every iteration. Even though the density of CMOS chips is sufficient for doing this, the speed requirements of a 1 GBit/s CMOS implementation might still be challenging.

Since we wanted to use GaAs technology, we had to choose a different approach. The limited density of GaAs gate arrays forced us to implement only one of the 16 rounds and reuse it for all 16 iterations. Even without unrolling the 16 rounds, fitting the implementation into the available space and meeting the speed requirements was a major challenge. In order to achieve a data rate of 1 GBit/s, each block has to be processed in 64 ns, which corresponds to 4 ns per iteration or a clock rate of 250 MHz.

The register-level block diagrams for encryption and decryption are shown in Figures 5 and 6. The DES chip realizes a rigid 3-stage pipeline; that is, a block is first written into the input register I, is then moved into register LR, where it undergoes the 16 iterations of the cipher function f, and finally is written into the output register O.

3.2. Implementation Characteristics
The implementation of the DES chip contains 480 flip-flops, 2580 gates, and 8 PLAs. There are up to ten logic levels that have to be passed during the 4 ns clock period. The chip uses 84% of the transistors available in the VSC15K gate array. The high utilization is the result of a fully manual placement. Timing constraints further forced us to lay out signal wires partially by hand.

The chip's interface is completely asynchronous. The data ports are 8, 16, or 32 bits wide. A separate 7-bit wide port is available for loading the master key. Of the 211 available pins, 144 are used for signals and 45 are used for power and ground. With the exception of the 250 MHz clock, which is ECL compatible, all input and output signals are TTL compatible. The chip requires power supply voltages of -2 V for the GaAs logic and 5 V for the TTL-compatible output drivers. The maximum power consumption is 8 W.

3.3. Asynchronous Interface
Asynchronous ports are provided in order to avoid synchronization with the 250 MHz clock. The data input and output registers are controlled by two-way handshake signals which determine when the registers can be written or read. The data ports are 8, 16, or 32 bits wide. The variable width allows for reducing the width of the external data path at lower operating speeds. With the 32-bit wide port, a new data word must be loaded every 32 ns in order to achieve an encryption rate of 1 GBit/s. The master key register is loaded through a separate, also fully asynchronous, 7-bit wide port. Our implementation does not check the byte parity bits included in the 64-bit key. The low speed of the data and key ports makes it possible to use TTL levels for all signals except for the 250 MHz clock, which is a differential ECL-compatible signal.

Thanks to the fully asynchronous chip interface, the chip manufacturer was able to do at-speed testing even without being able to supply test vectors at full speed.
For this purpose, the 250 MHz clock was generated by a separate generator, while the test vectors were supplied asynchronously by a tester running at only 40 MHz. At-speed testing was essential particularly in testing the precharged logic, which will be described in the following section.

4. Applications
Our implementation of the DES algorithm is tailored for high-speed network applications. This requires not only encryption hardware operating at link speed but also support for low-latency controllers. Operating at link data rates of 1 GBit/s requires a completely pipelined controller structure. Low latency can be achieved by buffering data in the controller as little as possible and by avoiding protocol processing in the controller. In this respect, the main features of the DES chip are a pipelined flow-through design and an efficient key exchange mechanism.

As described in the previous section, the chip is implemented as a rigid 3-stage pipeline with separate input and output ports. Each 64-bit data block is entered into the pipeline together with a command word. While the data block flows through the pipeline, the accompanying command instructs the pipeline stages which operations to apply to the data block. On a block-by-block basis it is possible to enable or disable encryption, to choose ECB or CBC mode, and to select the master key in MK or the key in CD. None of these commands causes the pipeline to stall. It is further possible to instruct the pipeline to load a block from the output register O into register CD. Typical usage of this feature is as follows: a data block is decrypted with the master key, is loaded into CD, and is then used for encrypting or decrypting subsequent data blocks. This operation requires a one-cycle delay slot; that is, the new key in CD cannot be applied to the data block immediately following.

5. Conclusions
We began designing the DES chip in early 1989 and received the first prototypes at the beginning of 1991. The parts were logically functional, but exhibited electrical problems and failed at high temperature. A minor design change fixed this problem. In the fall of 1991, we received 25 fully functional parts that we plan to use in future high-speed network controllers.

With an encryption rate of 1 GBit/s, the design presented in this paper is the fastest DES implementation reported to date. Both ECB and CBC modes of operation are supported at full speed. This data rate is based on a worst-case timing analysis and a clock frequency of 250 MHz. The fastest chips we tested run at 350 MHz or 1.4 GBit/s.

We have shown that a high-speed implementation of the DES algorithm is possible even with the limited flexibility of a semi-custom design. An efficient implementation of the S-boxes offering both high performance and high density has been achieved with a novel approach to designing PLA structures in GaAs. An unconventional floorplan has been presented that eliminates long wires caused by permuted data bits in the critical path.

The architecture of the DES chip makes it possible to build very low-latency network controllers. A pipelined design together with separate, fully asynchronous input and output ports allows for easy integration into controllers with a flow-through architecture. ECL levels are required only for the 250 MHz clock; TTL levels are used for all the data and control pins, thus providing a cost-effective interface even at data rates of 1 GBit/s.
The provision of a data path for loading the key from the data stream allows for selecting the encryption or decryption key on the fly. These features make it possible to use encryption hardware for network applications with very little overhead.

References
1. Advanced Micro Devices: AmZ8068/Am9518 Data Ciphering Processor. Datasheet, July 1984
2. National Bureau of Standards: Data Encryption Standard. Federal Information Processing Standards Publication FIPS PUB 46, January 1977
3. National Bureau of Standards: DES Modes of Operation. Federal Information Processing Standards Publication FIPS PUB 81, December 1980
4. Diffie, W., Hellman, M.: Exhaustive cryptanalysis of the NBS Data Encryption Standard. Computer, vol. 10, no. 6, June 1977, pp. 74-84
5. McCluskey, E.: Logic Design Principles. Prentice-Hall, 1986
6. National Bureau of Standards: Guidelines for Implementing and Using the NBS Data Encryption Standard. Federal Information Processing Standards Publication FIPS PUB 74, April 1981
7. VLSI Technology: VM007 Data Encryption Processor. Datasheet, October 1991
8. Brassard, G.: Modern Cryptology. Lecture Notes in Computer Science, no. 325, Springer-Verlag, 1988
Question Bank for English for Electronic and Information Engineering (5th Edition)

Question Bank for English for Electronic and Information Engineering (5th Edition)
Contents: Section A Term Translation; Section B Paragraph Translation; Section C Reading Comprehension Materials: C.1 History of Tablets, C.2 A Brief History of Satellite Communication, C.3 Smartphones, C.4 Analog, Digital and HDTV, C.5 SoC

Section A Term Translation
Section B Paragraph Translation
Section C Reading Comprehension Materials

C.1 History of Tablets
The idea of the tablet computer isn't new. Back in 1968, a computer scientist named Alan Kay proposed that with advances in flat-panel display technology, user interfaces, miniaturization of computer components and some experimental work in WiFi technology, you could develop an all-in-one computing device. He developed the idea further, suggesting that such a device would be perfect as an educational tool for schoolchildren. In 1972, he published a paper about the device and called it the Dynabook.

The sketches of the Dynabook show a device very similar to the tablet computers we have today, with a couple of exceptions. The Dynabook had both a screen and a keyboard all on the same plane. But Kay's vision went even further. He predicted that with the right touch-screen technology, you could do away with the physical keyboard and display a virtual keyboard in any configuration on the screen itself.

Kay was ahead of his time. It would take nearly four decades before a tablet similar to the one he imagined took the public by storm. But that doesn't mean there were no tablet computers on the market between the Dynabook concept and Apple's famed iPad.

One early tablet was the GRiDPad. First produced in 1989, the GRiDPad included a monochromatic capacitance touch screen and a wired stylus. It weighed just under 5 pounds (2.26 kilograms). Compared to today's tablets, the GRiDPad was bulky and heavy, with a short battery life of only three hours. The man behind the GRiDPad was Jeff Hawkins, who later founded Palm.

Other pen-based tablet computers followed, but none received much support from the public. Apple first entered the tablet battlefield with the Newton, a device that's received equal amounts of love and ridicule over the years. Much of the criticism for the Newton focuses on its handwriting-recognition software.

It really wasn't until Steve Jobs revealed the first iPad to an eager crowd that tablet computers became a viable consumer product. Today, companies like Apple, Google, Microsoft and HP are trying to predict consumer needs while designing the next generation of tablet devices.

C.2 A Brief History of Satellite Communication
In an article in Wireless World in 1945, Arthur C. Clarke proposed the idea of placing satellites in geostationary orbit around Earth such that three equally spaced satellites could provide worldwide coverage. However, it was not until 1957 that the Soviet Union launched the first satellite, Sputnik 1, which was followed in early 1958 by the U.S. Army's Explorer 1. Both Sputnik and Explorer transmitted telemetry information.

The first communications satellite, the Signal Communicating Orbit Repeater Experiment (SCORE), was launched in 1958 by the U.S. Air Force. SCORE was a delayed-repeater satellite, which received signals from Earth at 150 MHz and stored them on tape for later retransmission. A further experimental communication satellite, Echo 1, was launched on August 12, 1960 and placed into inclined orbit at about 1500 km above Earth. Echo 1 was an aluminized plastic balloon with a diameter of 30 m and a weight of 75.3 kg. Echo 1 successfully demonstrated the first two-way voice communications by satellite.
On October 4, 1960, the U.S. Department of Defense launched Courier into an elliptical orbit between 956 and 1240 km, with a period of 107 min. Although Courier lasted only 17 days, it was used for real-time voice, data, and facsimile transmission. The satellite also had five tape recorders onboard; four were used for delayed repetition of digital information, and the other for delayed repetition of analog messages.

Direct-repeated satellite transmission began with the launch of Telstar I on July 10, 1962. Telstar I was an 87-cm, 80-kg sphere placed in low-Earth orbit between 960 and 6140 km, with an orbital period of 158 min. Telstar I was the first satellite to be able to transmit and receive simultaneously and was used for experimental telephone, image, and television transmission. However, on February 21, 1963, Telstar I suffered damage caused by the newly discovered Van Allen belts.

Telstar II was made more radiation resistant and was launched on May 7, 1963. Telstar II was a straight repeater with a 6.5-GHz uplink and a 4.1-GHz downlink. The satellite power amplifier used a specially developed 2-W traveling wave tube. Along with its other capabilities, the broadband amplifier was able to relay color TV transmissions. The first successful trans-Atlantic transmission of video was accomplished with Telstar II, which also incorporated radiation measurements and experiments that exposed semiconductor components to space radiation.

The first satellites placed in geostationary orbit were the synchronous communication (SYNCOM) satellites launched by NASA in 1963. SYNCOM I failed on injection into orbit. However, SYNCOM II was successfully launched on July 26, 1964 and provided telephone, teletype, and facsimile transmission. SYNCOM III was launched on August 19, 1964 and transmitted TV pictures from the Tokyo Olympics.

The International Telecommunications by Satellite (INTELSAT) consortium was founded in July 1964 with the charter to design, construct, establish, and maintain the operation of a global commercial communications system on a nondiscriminatory basis. The INTELSAT network started with the launch, on April 6, 1965, of INTELSAT I, also called Early Bird. On June 28, 1965, INTELSAT I began providing 240 commercial international telephone channels as well as TV transmission between the United States and Europe.

In 1979, INMARSAT established a third global system. In 1995, the INMARSAT name was changed to the International Mobile Satellite Organization to reflect the fact that the organization had evolved to become the only provider of global mobile satellite communications at sea, in the air, and on the land.

Early telecommunication satellites were mainly used for long-distance continental and intercontinental broadband, narrowband, and TV transmission. With the advent of broadband optical fiber transmission, satellite services shifted focus to TV distribution, and to point-to-multipoint and very small aperture terminal (VSAT) applications. Satellite transmission is currently undergoing further significant growth with the introduction of mobile satellite systems for personal communications and fixed satellite systems for broadband data transmission.

C.3 Smartphones
Think of a daily task, any daily task, and it's likely there's a specialized, pocket-sized device designed to help you accomplish it. You can get a separate, tiny and powerful machine to make phone calls, keep your calendar and address book, entertain you, play your music, give directions, take pictures, check your e-mail, and do countless other things.
But how many pockets do you have? Handheld devices become as clunky as a room-sized supercomputer when you have to carry four of them around with you every day.

A smartphone is one device that can take care of all of your handheld computing and communication needs in a single, small package. It's not so much a distinct class of products as it is a different set of standards for cell phones to live up to.

Unlike many traditional cell phones, smartphones allow individual users to install, configure and run applications of their choosing. A smartphone offers the ability to conform the device to your particular way of doing things. Most standard cell-phone software offers only limited choices for re-configuration, forcing you to adapt to the way it's set up. On a standard phone, whether or not you like the built-in calendar application, you are stuck with it except for a few minor tweaks. If that phone were a smartphone, you could install any compatible calendar application you like.

Here's a list of some of the things smartphones can do:
• Send and receive mobile phone calls
• Personal Information Management (PIM) including notes, calendar and to-do list
• Communication with laptop or desktop computers
• Data synchronization with applications like Microsoft Outlook
• E-mail
• Instant messaging
• Applications such as word processing programs or video games
• Play audio and video files in some standard formats

C.4 Analog, Digital and HDTV

For years, watching TV has involved analog signals and cathode ray tube (CRT) sets. The signal is made of continually varying radio waves that the TV translates into a picture and sound. An analog signal can reach a person's TV over the air, through a cable or via satellite. Digital signals, like the ones from DVD players, are converted to analog when played on traditional TVs.

This system has worked pretty well for a long time, but it has some limitations:
• Conventional CRT sets display around 480 visible lines of pixels. Broadcasters have been sending signals that work well with this resolution for years, and they can't fit enough resolution to fill a huge television into the analog signal.
• Analog pictures are interlaced - a CRT's electron gun paints only half the lines for each pass down the screen. On some TVs, interlacing makes the picture flicker.
• Converting video to analog format lowers its quality.

United States broadcasting is currently changing to digital television (DTV). A digital signal transmits the information for video and sound as ones and zeros instead of as a wave. For over-the-air broadcasting, DTV will generally use the UHF portion of the radio spectrum with a 6 MHz bandwidth, just like analog TV signals do.

DTV has several advantages:
• The picture, even when displayed on a small TV, is better quality.
• A digital signal can support a higher resolution, so the picture will still look good when shown on a larger TV screen.
• The video can be progressive rather than interlaced - the screen shows the entire picture for every frame instead of every other line of pixels.
• TV stations can broadcast several signals using the same bandwidth. This is called multicasting.
• If broadcasters choose to, they can include interactive content or additional information with the DTV signal.
• It can support high-definition (HDTV) broadcasts.

DTV also has one really big disadvantage: Analog TVs can't decode and display digital signals.
When analog broadcasting ends, you'll only be able to watch TV on your trusty old set if you have cable or satellite service transmitting analog signals or if you have a set-top digital converter.

C.5 SoC

The semiconductor industry has continued to make impressive improvements in the achievable density of very large-scale integrated (VLSI) circuits. In order to keep pace with the levels of integration available, design engineers have developed new methodologies and techniques to manage the increased complexity inherent in these large chips. One such emerging methodology is system-on-chip (SoC) design, wherein predesigned and pre-verified blocks, often called intellectual property (IP) blocks, IP cores, or virtual components, are obtained from internal sources or third parties and combined on a single chip.

These reusable IP cores may include embedded processors, memory blocks, interface blocks, analog blocks, and components that handle application-specific processing functions. Corresponding software components are also provided in a reusable form and may include real-time operating systems and kernels, library functions, and device drivers.

Large productivity gains can be achieved using this SoC/IP approach. In fact, rather than implementing each of these components separately, the role of the SoC designer is to integrate them onto a chip to implement complex functions in a relatively short amount of time.

The integration process involves connecting the IP blocks to the communication network, implementing design-for-test (DFT) techniques, and using methodologies to verify and validate the overall system-level design. Even larger productivity gains are possible if the system is architected as a platform in such a way that derivative designs can be generated quickly.

In the past, the concept of SoC simply implied higher and higher levels of integration. That is, it was viewed as migrating a multichip system-on-board (SoB) to a single chip containing digital logic, memory, analog/mixed signal, and RF blocks. The primary drivers for this direction were the reduction of power, smaller form factor, and lower overall cost. It is important to recognize that integrating more and more functionality on a chip has always existed as a trend by virtue of Moore's Law, which predicts that the number of transistors on a chip will double every 18-24 months. The challenge is to increase designer productivity to keep pace with Moore's Law. Therefore, today's notion of SoC is defined in terms of overall productivity gains through reusable design and integration of components.
Pipeline Emergency Response Guidelines

I. Introduction

Pipelines are the safest and most reliable way to transport energy products, including natural gas, crude oil, liquid petroleum products, and chemical products. Pipelines are primarily underground, which keeps them away from public contact and accidental damage. It is also a fact that pipelines can move large volumes of product at a significantly lower operating cost when compared to other modes of transportation. Despite safety and efficiency statistics, increases in energy consumption and population growth near pipelines present the potential for a pipeline incident.

To meet the pipeline industry's goal of incident-free operation, pipeline operators invest substantial human and financial resources to protect the people, property and environments near pipelines. Damage prevention measures include routine inspection and maintenance, corrosion protection, continuous monitoring and control technologies, public awareness programs, and integrity management and emergency response plans. While pipelines are generally the safest method of transporting hazardous chemicals, they are not failsafe. Pipeline product releases, whether in the form of a slow leak or violent rupture, are a risk in any community.

In the unlikely event of an incident near or involving a pipeline, it is critical you know how to respond and are prepared to work together with the pipeline operator's representatives.

This guide is intended to provide fire fighters, law enforcement officers, emergency medical technicians and all other emergency responders who may be the first to arrive at the scene with the information they need to safely handle a pipeline incident. This guide is not intended to provide information on the physical or chemical properties of the products transported through pipelines. Nor should it be considered a substitute for emergency response training, knowledge or sound judgment. Rather, this guide contains information that will help you make decisions about how to best protect your emergency response team and the surrounding public during a pipeline incident. Please review and become familiar with the emergency response guidelines before you are called to respond to a pipeline incident.

II. Pipeline Basics

Before we discuss how to respond to a pipeline incident, let's quickly review the basics about pipelines:
• What are pipelines and why do we use them?
• Where are pipelines located?
• How will you identify a pipeline right-of-way in your community?
• How does the operator monitor pipeline performance?

A. Pipelines in Your Community

People across the nation expect to have the energy they need to drive their cars, heat their homes and cook dinner, never really considering how they get the petroleum, natural gas, and other chemical products necessary to power their daily activities.

The pipeline industry has installed more than 2.1 million miles of pipeline to transport a variety of gases and liquids from gathering points to storage areas, and from refineries and processing plants to customers' homes and places of business. The U.S. Department of Transportation (DOT) defines a pipeline system as all parts of a pipeline facility through which a hazardous liquid or gas moves in transportation, including piping, valves, and other appurtenances connected to the pipeline, pumping units, fabricated assemblies associated with pumping units, metering and delivery stations, and breakout tanks.
To ensure these pipeline systems remain safe, a body of local, state and federal laws, regulations and standards governs pipeline design, construction, operation, and public awareness and damage prevention programs.

Specifically, pipeline operators use a series of gathering, transmission and distribution pipelines to transport more than 43 different gas and liquid products.
• Gathering pipelines transport crude oil and natural gas from the wellheads and production facility to processing facilities where the oil, gas and water are separated and processed.
• Transmission pipelines move refined liquid products, crude oil, and natural gas from refineries to marketing and distribution terminals, typically using larger diameter, high-pressure lines.
• Distribution systems for liquid and gas products vary. Liquid products are stored and transported by tanker trucks to their final destination, while gases, such as natural gas, butane, propane, ethane, etc., are transported from a storage location directly to residential and industrial customers through low-pressure distribution pipelines.

Maps of transmission pipelines and contact information for pipeline operators in your area can be found in the National Pipeline Mapping System (NPMS). The directory can be searched by zip code or state and county. More detailed pipeline maps are also available to Emergency Responders who have obtained a logon ID and password.

B. Pipeline Right-of-Way

Although typically buried underground, pipelines may also be found aboveground in extremely cold and harsh environments, and at pump and compressor stations, some valve stations and terminals. Whether aboveground or belowground, pipelines are constructed along a clear corridor of land called the right-of-way (ROW). The ROW may contain one or more pipelines, may vary in width, and will cross through public and private property. The ROW should be free of permanent structures and trees and be identified with a marker sign.

C. Pipeline Marker Signs

Aboveground signs and markers identify the approximate location of underground pipelines. Markers are required to be present wherever a pipeline crosses under roads, railroads or waterways. They may also be found at other intervals and locations along the pipeline right-of-way, such as near buildings and pipeline facilities. Markers do NOT tell you the exact location, depth or direction of the pipeline; the pipeline may curve or angle around natural and manmade features. If there are multiple pipelines in the ROW, a marker sign should be posted for each pipeline.

Pipeline markers may look different, but every sign tells you the same information:
• Pipeline product
• Pipeline operator
• 24-hour emergency phone number

[Marker examples: painted metal or plastic posts; signs located near roads, railroads and along pipeline right-of-ways; pipeline casing vents; markers for pipeline patrol planes.]

NOTE: If you are responding to a 9-1-1 call about a strange odor or leak in the area, approach the scene with caution, look for clues that a pipeline is involved, and find a marker sign identifying the pipeline product, operator and phone number to call to report the incident and obtain additional information.

D. Pipeline Control Center

When you call the 24-hour emergency phone number on a marker sign, you will speak with someone at the pipeline operator's control center. The control center is the heart of pipeline operations.
Information about the pipeline's operating equipment and parameters is constantly communicated to the control center, where personnel use computers to monitor pipeline pressure, temperature, flow, alarms, and other conditions in the pipeline. While pipeline operators work hard to achieve incident-free operation, accidents do occur. In the event of an emergency, the control center can immediately shut down the pipeline and begin to isolate the source of the leak. The pipeline operator's control center may also have the capability to remotely open and close valves and transfer products both to and from the main pipeline at marketing and distribution facilities.

NOTE: As an emergency responder, you can help control the incident by being prepared to communicate as much information as possible to the pipeline operator about the current incident situation.

III. Pipeline Incidents

A pipeline incident exists when third-party damage, corrosion, material defects, worker error or natural events cause a fire, explosion, accidental release, or operational failure that disrupts normal operating conditions.

Pipeline incidents present some of the most dangerous situations an emergency responder may encounter. Pipelines contain flammable, hazardous and even deadly petroleum gases, liquids, and other chemical products that present emergency responders with a myriad of hazards and risks that vary depending on the topography, weather, and properties of the material involved. For the majority of pipeline incidents, you will have a limited number of options to actually stop the leak. In almost all cases, the pipeline operator will be required to resolve the incident safely. Consequently, your goal is to minimize the level of risk to other responders, the community and the environment.

Advance knowledge of where pipelines are located in your community, the products transported in them, and how to contact and work together with the pipeline operator in the event of an incident are key factors to an effective and safe response. Each pipeline operator maintains an emergency response plan that outlines the roles and responsibilities of company, contractor, and local response personnel.

NOTE: Contact your local pipeline operator(s) to learn more about the pipeline systems and specific response plans regarding your area of jurisdiction. Make sure you comment on special issues in your community. Pipeline operators use the feedback from communications with emergency responders to develop and update their integrity management and emergency response plans.

To effectively respond to a pipeline leak, spill or fire, emergency responders need to understand the hazards and risks associated with the incident. You should seek additional information about the pipeline in question as soon as possible. Calling the 24-hour emergency phone number on a nearby pipeline marker sign, contacting the appropriate emergency response agency, and consulting the information in the DOT 2004 Emergency Response Guidebook may provide more detailed, situation-specific information.

Regardless of the nature of the pipeline incident, following standardized procedures will bring consistency to the response operation and will help minimize the risk of exposure to all responders.
Pipeline operators hope you never have to respond to a pipeline incident, but if you do, remember:
• Every incident is different - each will have special problems and concerns.
• Carefully select actions to protect people, property and the environment.
• Continue to gather information and monitor the situation until the threat is removed.

IV. Incident Response Steps

Following standardized procedures will bring consistency to each response operation and will help minimize the risk of exposure to all responders. The information in this guide provides a framework to discuss safety issues as they relate to the hazards and risks presented by pipeline emergencies in your community. After reviewing the standard pipeline incident response steps, you should discuss your agency's pipeline emergency preparedness, how you will handle an incident, and other planning issues within your community.

A. Assess the Situation

1. Approach with Caution from Upwind Location

To protect yourself and other responders from any hazards associated with the incident, it is critical you approach cautiously from an upwind and/or crosswind location.
• Do not park over manholes or storm drains.
• Do not approach the scene with vehicles or mechanized equipment until the isolation zones have been established. Vehicle engines are a potential ignition source.
• Do not walk or drive into a vapor cloud or puddle of liquid.
• Use appropriate air-monitoring equipment to establish the extent of vapor travel.
• Because any number of fire and health hazards may be involved, it is important you resist the urge to rush in until you know more about the product and hazards involved in the incident. Consider the following:
− Is there a fire, spill or leak?
− What are the weather conditions?
− What direction is the wind blowing?
− What is the terrain like?
− Who and what is at risk: people, property or environment?
− Is there a vapor cloud?
− What actions should be taken: evacuation or diking?
− What human/equipment resources are required and readily available?
− What can be done immediately?

2. Secure the Scene

Without entering the immediate hazard area, you want to isolate the area and deny entry to unauthorized persons, including other responders. It may be necessary to evacuate everyone in the danger area to a safe location upwind of the incident area.

3. Employ NIMS and the Incident Command System

Developed by the Department of Homeland Security, the National Incident Management System (NIMS) integrates effective practices in emergency preparedness and response into a comprehensive national framework for incident management. The NIMS enables responders at all jurisdictional levels and across all disciplines to work together effectively and efficiently. Because pipeline incidents require coordination of information and resources among all responders, the Incident Command System (ICS) is one of the most important 'best practices' in the NIMS. The ICS provides common terminology, organizational structure and duties, and operational procedures among operator personnel and various federal, state and local regulatory and response agencies that may be involved in response operations.
• Identify an Incident Commander. The Incident Commander is the person responsible for the management of on-scene emergency response operations. In cooperation with the pipeline operator's point of contact, the Incident Commander determines when it is safe for the response teams to enter the area and access the pipeline.
The Incident Commander must be trained to perform these responsibilities and not be automatically authorized by virtue of his/her normal position within the organization.
• Establish a command post, lines of communication and a staging area for additional responding equipment and personnel.

NOTE: If other public safety units are on-scene, ensure operations are coordinated and unified command is established.

4. Identify the Hazards

A product's physical and chemical properties determine how the product will behave and how it can cause harm. Emergency responders need to analyze the problem and assess potential outcomes based on the hazardous materials involved, the type of container and its integrity, and the environment where the incident has occurred. Understanding the hazards will enable you to understand what risk you will be taking and how to select the best course of action with the least risk.
• Locate a pipeline marker sign to identify the pipeline product, operator and 24-hour emergency phone number. Use caution, as you may encounter:
− Flammable atmospheres
− Hydrogen sulfide (H2S) in crude oil/natural gas pipelines
− Anhydrous ammonia pipelines
− Oxygen deficient/enriched atmospheres
• Call the emergency phone number to report the incident to the pipeline operator's control center. Control center personnel may provide additional information about the pipeline product and its hazards.
• Use the DOT 2004 Emergency Response Guidebook to initially analyze the key properties (flash point, explosive range, specific gravity, and vapor density).
• Use air-monitoring equipment appropriate to the materials in the pipeline. Do NOT assume gases or vapors are harmless because of a lack of smell or quick desensitization to the strong odors of materials such as hydrogen sulfide or anhydrous ammonia.
• Use the highest level of precaution and protection until you know the area is free of flammable, toxic, and mechanized and electrical hazards.

NOTE: If natural gas is escaping inside a building, refer to Appendix B for additional precautions.

B. Respond to Protect People, Property and the Environment

Protective actions are those steps taken to preserve the health and safety of emergency responders and the public during a pipeline incident. While the pipeline operator concentrates on the pipeline, responders should concentrate on isolating and removing ignition sources and moving the public out of harm's way. Several response procedures can and should be pursued simultaneously. You will also need to continually reassess and modify your response accordingly.

1. Establish Isolation Zones and Set Up Barricades

Isolation zones and barricades prevent unauthorized people and unprotected emergency responders from entering the hazard area and becoming injured. The size of the containment area will be dictated by the location and size of the release. You also want to consider atmospheric conditions, as isolation distances change from daytime to nighttime due to different mixing and dispersion conditions in the air. Remember, gas odor or the lack of gas odor is not a sufficient measurement to establish safe isolation zones.
• Based on the type of incident, use any or all of the following to calculate and establish isolation zones:
− DOT 2004 Emergency Response Guidebook
− Information from the pipeline operator's representative
− Heat intensity levels
− Measurements from air-monitoring equipment
• Use visible landmarks, barricade tape and traffic cones to identify hot/warm/cold zones.
• Define entry and exit routes.
Plan an escape route in case conditions deteriorate.
• Be certain to allow enough room to move and remove your own equipment. The more time, distance and shielding between you and the material, the lower the risk.

2. Rescue and Evacuate People

Any efforts made to rescue persons and protect property or the environment must be weighed against the possibility that you could become part of the problem.
• Do not walk or drive into a vapor cloud or puddle of liquid.
• Evacuate or shelter-in-place as necessary, providing instruction and frequent updates to the public while evacuated or sheltered-in-place.
• Administer first aid and medical treatment, as needed.
• Enter the area only when wearing appropriate protective gear, such as Structural Fire Fighters' Protective Clothing (SFPC) (helmet, coat, pants, boots, gloves and hood) and a Positive Pressure Self-Contained Breathing Apparatus (SCBA). Because no single protective clothing material will protect you from all dangerous pipeline materials, always use the highest level of caution.

3. Eliminate Ignition Sources

Ignition sources include electrical motors, firearms, vehicles, telephones, emergency radios, cigarettes, construction equipment, static electricity, open flames or sparks.
• Eliminate ignition sources, if possible without additional exposure or great risk.
• Park all emergency vehicles at a safe distance beyond the isolation zone (upwind).
• Do NOT light a match, start an engine, use a telephone or radio, switch lights on or off, or use anything that may create a spark.

4. Control Fires, Vapor and Leaks

Because there are many variables to consider, the decision to use water on fires or spills involving water-reactive materials should be based on information from an authoritative source, such as the pipeline operator, who can be contacted by calling the 24-hour emergency phone number listed on a nearby pipeline marker sign.

WARNING: Some products, such as anhydrous ammonia, can react violently or even explosively with water. Water getting inside a ruptured or leaking container may cause an explosion, or the product's reaction with water may be more toxic, corrosive, or otherwise more undesirable than the product of a fire without water applied. Consequently, it is best to leave a fire or leak alone except to prevent its spreading.

a. Fire Control

Extinguishing a primary fire can result in explosive re-ignition. Unless it is necessary to save human life, fires on flammable gas pipelines should NOT be extinguished unless the fuel source has been isolated and the pipeline operator advises you to take this action! If the fuel source is not shut off and the fire is extinguished, leaking gas can migrate away from the pipeline and find an ignition source.
• Let the primary fire burn. Eliminate potential ignition sources.
• Cool surrounding structures, equipment and vessels. Because water is an inefficient and even dangerous way to fight fuel fires, use a fog pattern, NOT a straight stream of water. Please note some products are not compatible with water; refer to the DOT 2004 Emergency Response Guidebook.
• Do not inhale fumes, smoke or vapors.
• Once the primary fire is out, beware of hot spot re-ignition.
• Do not operate pipeline equipment.

b. Vapor Control

Limiting the amount of vapor released from a pool of flammable or corrosive liquids requires the use of proper protective clothing, specialized equipment, appropriate chemical agents, and skilled personnel.
For these reasons, it is best to contain the hazards and wait for the pipeline operator's representative to handle the pipeline and its product.
• Do not inhale fumes, smoke or vapors.
• Eliminate ignition sources! Flammable gases may escape under pressure from a pipeline, form a vapor cloud, and be ignited by an ignition source in the area. Explosions of unconfined vapor clouds can cause major structural damage and quickly escalate the emergency beyond responder capabilities.
• Do NOT ignite a vapor cloud! Pipeline operators will perform this dangerous task.
• Avoid forced ventilation of structures and excavations. Forced ventilation can actually increase the possibility of a flammable atmosphere.
• Limited fog misting can be of some benefit in knocking down a vapor cloud, especially if such a cloud appears to be spreading beyond the containment site. Fog misting must be used carefully to prevent incompatible product/water mixing or the spread of product to other areas, as containment dikes may become overfilled.
• Product-compatible foam can be used to suppress vapors or for rescue situations; however, be extremely cautious if fuel discharge is not yet stopped.

CAUTION: Before using water spray or foam to control vapor emissions or suppress ignition, obtain technical advice based on chemical name identification. Refer to the pipeline operator and the DOT 2004 Emergency Response Guidebook.

c. Leak Control

In addition to hazards such as flammability, toxicity and oxygen deficiency, liquid pipeline leaks and ruptures can create major problems with spill confinement and containment. What seems like a minor spill may evolve into a major spill as liquid inside the pipeline continues to bleed out of the line.
• Ask yourself where the spill will be in a few hours, how close the incident is to exposures or sensitive areas, and what can be done to confine the spill or divert it away from exposures.
• Establish barriers to prevent the leak from spreading to water sources, storm drains or other sensitive areas. There are several basic containment devices that can be used to prevent the migration of petroleum products on land or on small streams:
− Storm sewer or manhole dam
− Small stream containment boom
− Pipe skimming underflow dam
− Wire fence or straw filter dam
• If a leak is accidentally ignited, firefighting should focus on limiting the spread of fire damage, but in NO circumstances should efforts be made to extinguish the fire until the source of supply has been cut off or controlled.
• Do not walk into or touch spilled material.
• Do not operate pipeline equipment.

C. Call for Assistance of Trained Personnel

1. Contact Your Organization

As soon as possible, contact your organization. This will set in motion a series of events ranging from dispatching additional trained personnel to the scene to activating the local emergency response plan. Ensure that other local emergency response departments have been notified.

2. Call the Pipeline Operator

• Immediately call the 24-hour emergency phone number of the pipeline operator, which is listed on a marker sign located at a nearby road crossing, railroad or other point along the pipeline right-of-way. During the call, pipeline control center personnel will dispatch a representative to the scene. The control center will immediately act to shut down the pipeline and isolate the emergency.
The pipeline control center may also have the capability to remotely open and close manifold valves and to transfer products both to and from the main pipeline at marketing and distribution facilities.
• Be prepared to provide pipeline control center personnel with the following information:
− Call-back number, contact name (usually the Incident Commander)
− Detailed location, including state, county, town, street or road
− Type of emergency: fire, leak, vapor
− When the incident was reported locally
− Any known injuries
− Other officials on site: police, fire, medical, LEPCs, etc.
− Surrounding exposures/sensitive areas
− Any special conditions: nearby school, hospital, prison, railroad, etc.
− Local conditions: weather, terrain

3. Obtain National Assistance

If the pipeline operator's 24-hour emergency phone number is not available, contact the appropriate emergency response agency listed in the DOT 2004 Emergency Response Guidebook.

D. Work Together with the Pipeline Operator

Pipeline operator personnel will establish product containment and drain barriers while working in concert with local emergency responders to limit or contain the spill, and avoid possible ignition of a leak or vapor cloud.

1. Pipeline Operator's Representative:
• Serves as the primary contact for communication between the operator's team and emergency responders. They will be familiar with the Incident Command System and are normally HAZWOPER certified as well.
• Establishes contact with the Incident Commander before and upon arrival to avoid accidental entry into isolation zones or ignition of the release.
• Communicates which actions to take, especially as they relate to containment and control of the pipeline product. The pipeline operator's representative(s) is trained to know:
− How to shut off the supply of gas or liquid. Only the operator's representative is trained to operate pipeline equipment.
− What potential hazards may be present at the location.
− What additional complications may result from response activities as they relate to the pipeline and its product.
− How to fight small fires with hand-held extinguishers, administer basic first aid, perform CPR, and assist with evacuations or traffic control.

2. Emergency Responders:
• Maintain site control and act as Incident Commander.
• Eliminate ignition sources.
• Provide standby fire-watch personnel.
• Suppress vapor generation.
• Provide standby rescue personnel to pipeline operator personnel entering the incident area to stop the release.
• Help maintain containment dams and install more as needed.
• Monitor the atmosphere in the repair and containment areas.

3. Together, the Incident Commander and the Pipeline Operator's Representative:
• Review whether it is safe for the operator's emergency response team and/or their equipment to enter the incident area.
• Determine whether the zone of influence needs additional barricading and diking.
• Decide when the area is safe for the affected public to re-enter.

V. Damage Prevention − A Shared Responsibility

The pipeline industry uses a wide range of tools and technologies to maintain safe operations. They visually inspect aboveground pipes and related equipment for damage.
Operator personnel walk, drive and fly over pipeline right-of-ways, inspecting them for corrosion, obvious damage, unauthorized activities that might endanger the pipeline, or unusual changes in vegetation that might indicate a leak. As you already know, pipeline control center personnel continuously monitor pipeline operation. Pipeline operators also use in-line inspection tools known as "smart pigs", hydrostatic testing, electromagnetic testing, and other techniques to remove impurities and ensure the integrity of the pipeline. If inspection and testing identify any integrity-threatening anomalies, the operator repairs them as soon as possible.

In our nation's time of heightened security, it is more important than ever to protect pipelines against damage or attack. Homeland security and infrastructure protection is a shared responsibility.

A. Pipeline Mapping

Locate transmission pipelines in your community. Search the DOT's National Pipeline Mapping System (NPMS) web site by zip code or county and state.

B. One-Call Center

Remind those in your community to always call the state or local one-call center to request marking of underground facilities within a proposed excavation site. This is a free service. They can simply dial 811 and will be connected to the center serving that area, or they can call the national one-call locator number at 1-888-258-0808 to learn the number of the state's one-call center.

C. Damage Reporting

Report any damage or unusual or suspicious activities along a pipeline right-of-way to the pipeline operator. The operator will immediately investigate and repair any damage.

Improved communication and cooperation with local organizations are key components to protecting life, enhancing public safety, improving emergency preparedness, increasing protection of the environment, and preventing damage to pipeline property and facilities. If you would like more information about The Pipeline Association for Public Awareness or pipeline safety and emergency preparedness education, please contact The Association at 16361 Table Mountain Parkway, Golden, CO 80403, or visit the Association's web site.
The Workflow of a Pipeline (in English)

English answer:

Pipeline Workflow

In modern computer architecture, a pipeline is a set of processing stages that are connected in a linear fashion. Each stage performs a specific operation on the data, and the data flow through the stages in sequential order. This allows for greater efficiency and performance by reducing the amount of time that the processor is idle.

The workflow of a pipeline can be described as follows:

1. Instruction fetch: The first stage of the pipeline is the instruction fetch stage. In this stage, the processor fetches the next instruction from memory.

2. Instruction decode: The next stage of the pipeline is the instruction decode stage. In this stage, the processor decodes the instruction and determines what operation needs to be performed.

3. Operand fetch: The third stage of the pipeline is the operand fetch stage. In this stage, the processor fetches the operands that are needed for the operation.

4. Execute: The fourth stage of the pipeline is the execute stage. In this stage, the processor executes the operation.

5. Write back: The fifth and final stage of the pipeline is the write back stage. In this stage, the processor writes the results of the operation back to memory.

The pipeline workflow is a continuous process. Once the first instruction is fetched, the pipeline will continue to execute instructions until there are no more instructions to execute.
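The five-stage flow described above can be sketched in a small, self-contained simulation. The following C program is only an illustration written for this answer and is not modeled on any particular processor: it moves a handful of numbered instructions through the fetch, decode, operand-fetch, execute and write-back stages, one stage per clock cycle, and prints the pipeline occupancy for each cycle. It also shows the key property of pipelining: N instructions finish in roughly N + (stages - 1) cycles rather than N x stages.

```c
#include <stdio.h>

#define NUM_STAGES 5
#define NUM_INSTRS 6

/* The five stages described above. */
static const char *stage_name[NUM_STAGES] = {
    "fetch", "decode", "operand-fetch", "execute", "write-back"
};

int main(void)
{
    /* stage[s] holds the number of the instruction currently in stage s,
     * or -1 if the stage is empty (a bubble). */
    int stage[NUM_STAGES] = { -1, -1, -1, -1, -1 };
    int next_to_fetch = 0;

    /* A pipeline finishes N instructions in N + (stages - 1) cycles:
     * here 6 + 4 = 10 cycles instead of 6 * 5 = 30 without pipelining. */
    int total_cycles = NUM_INSTRS + NUM_STAGES - 1;

    for (int cycle = 1; cycle <= total_cycles; cycle++) {
        /* Advance the pipeline: every instruction moves one stage forward;
         * whatever was in write-back has now retired. */
        for (int s = NUM_STAGES - 1; s > 0; s--)
            stage[s] = stage[s - 1];

        /* Fetch the next instruction into the first stage, if any remain. */
        stage[0] = (next_to_fetch < NUM_INSTRS) ? next_to_fetch++ : -1;

        /* Show which instruction occupies each stage during this cycle. */
        printf("cycle %2d:", cycle);
        for (int s = 0; s < NUM_STAGES; s++) {
            if (stage[s] >= 0)
                printf("  %s=I%d", stage_name[s], stage[s]);
            else
                printf("  %s=--", stage_name[s]);
        }
        printf("\n");
    }
    return 0;
}
```

Compiling and running it (for example, `cc pipeline.c && ./a.out`) prints ten cycles for six instructions, with up to five instructions in flight at once.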
Pressure Pipeline Component Manufacturing Permit Rules (in English)

The manufacturing of pressure pipeline components is a critical industry that requires strict adherence to safety standards and regulations to prevent catastrophic failures. The "Pressure Pipeline Component Manufacturing Permit Rules" are designed to ensure that manufacturers comply with the highest safety and quality standards.

These rules mandate that all manufacturers must undergo a rigorous certification process. This includes the evaluation of their manufacturing processes, quality control systems, and the qualifications of their personnel. Manufacturers are required to demonstrate that they have the necessary infrastructure, equipment, and trained staff to produce components that meet the required specifications and safety standards.

The permit rules also stipulate the need for ongoing inspections and audits of the manufacturing facilities. This is to ensure that the manufacturers maintain the necessary standards and are in compliance with the regulations at all times. Any deviations from the set standards can result in penalties, including the suspension or revocation of the manufacturing permit.

Furthermore, the rules emphasize the importance of traceability and documentation. Manufacturers must keep detailed records of the materials used, the manufacturing process, and the testing and inspection results. This documentation is essential for the traceability of the components and for the verification of compliance with the regulations.

In addition to the manufacturing process itself, the permit rules also cover the design and testing of pressure pipeline components. Manufacturers must ensure that their designs are in accordance with the latest engineering standards and that the components are thoroughly tested to withstand the pressures and conditions they will be subjected to in service.

Finally, the permit rules require manufacturers to have a robust system for handling non-conforming products. This includes procedures for the identification, segregation, and disposal of any components that do not meet the required standards.

In conclusion, the "Pressure Pipeline Component Manufacturing Permit Rules" are essential for maintaining the safety and reliability of pressure pipelines. By enforcing these rules, regulatory bodies can ensure that only high-quality components are used in critical infrastructure, thereby protecting the public and the environment from potential hazards associated with pipeline failures.
The Pros and Cons of Pipelining (English Essay)

## Pipelining Advantages and Disadvantages

Advantages of Pipelining

- Increased performance: Pipelining allows multiple instructions to be executed simultaneously, increasing the overall performance of the system.
- Reduced latency: Each stage of the pipeline can execute its task independently, reducing the latency for individual instructions.
- Improved throughput: The pipeline can process a continuous stream of instructions, improving the throughput of the system.
- Increased utilization of resources: The pipeline ensures that all resources are utilized efficiently, reducing idle time and increasing overall efficiency.
- Reduced power consumption: Pipelining can reduce power consumption by optimizing the utilization of resources and reducing the need for repeated operations.

Disadvantages of Pipelining

- Increased complexity: Pipelining introduces additional complexity to the system design and implementation.
- Increased delays: The pipeline can introduce delays in the execution of certain instructions, such as branches or cache misses.
- Increased area overhead: The additional stages in the pipeline require additional hardware resources, increasing the area overhead.
- Increased design costs: Designing and validating a pipelined system can be more challenging and costly than designing a non-pipelined system.
- Increased susceptibility to pipeline hazards: Pipelining can be susceptible to hazards, such as structural hazards, data hazards, and control hazards, which can disrupt the flow of instructions in the pipeline (a small example of a data hazard is sketched below).
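The data hazard mentioned in the last point is easy to demonstrate. The short C sketch below is purely illustrative, with invented instruction encodings and register numbers: it scans a tiny three-address program and flags any read-after-write dependency between adjacent instructions, which is exactly the situation that forces a simple in-order pipeline to stall or to forward the result.

```c
#include <stdio.h>
#include <stdbool.h>

/* A toy three-address instruction: dest = src1 op src2. */
typedef struct {
    const char *text;  /* human-readable form, for printing    */
    int dest;          /* register written by this instruction */
    int src1, src2;    /* registers read by this instruction   */
} Instr;

/* A read-after-write (RAW) data hazard exists when a later instruction
 * reads a register that an earlier, still-in-flight instruction writes. */
static bool raw_hazard(const Instr *earlier, const Instr *later)
{
    return later->src1 == earlier->dest || later->src2 == earlier->dest;
}

int main(void)
{
    /* r3 = r1 + r2; r5 = r3 + r4: the second reads r3 before the
     * first has written it back, so a simple pipeline must stall. */
    Instr prog[] = {
        { "r3 = r1 + r2", 3, 1, 2 },
        { "r5 = r3 + r4", 5, 3, 4 },
        { "r8 = r6 + r7", 8, 6, 7 },  /* independent: no stall needed */
    };
    int n = sizeof prog / sizeof prog[0];

    for (int i = 1; i < n; i++) {
        if (raw_hazard(&prog[i - 1], &prog[i]))
            printf("hazard: \"%s\" depends on \"%s\" -> stall or forward\n",
                   prog[i].text, prog[i - 1].text);
        else
            printf("ok:     \"%s\" is independent of \"%s\"\n",
                   prog[i].text, prog[i - 1].text);
    }
    return 0;
}
```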
MCIM: A Memory-Centric Interconnection Mechanism for Parallel Systems

Chinese Journal of Computers, Vol. 22, No. 4, April 1999

MCIM: A Memory-Centric Interconnection Mechanism for Parallel Systems

LI San-Li, GE Yi, WU Jian-Feng
(Department of Computer Science and Engineering, Tsinghua University, Beijing 100084)

Abstract: The interconnection network (IN) which connects computing nodes in parallel systems remains the key issue in parallel architecture research. In the past 30 years the structures and features of various INs have been studied; however, almost all of these INs are built on the basis of logic circuits. This paper proposes a kind of interconnection mechanism for parallel systems that is termed the Memory-Centric Interconnection Mechanism (MCIM). MCIM differs from the data-sharing-oriented shared-memory concept: it employs a multi-ported fast static RAM (MPFSRAM) of fairly small capacity, divided into several communication mailboxes for message passing, in which each port corresponds to a Port Access Interface (PAI) that links 8-16 computing nodes. A conventional four-port memory can therefore construct a parallel system with 32-64 computing nodes; it constitutes a so-called "supernode" that is promising in the parallel supercomputing area. An MCIM parallel system can fully utilize the bandwidth of the MPFSRAM, allowing the data transmission rate for message passing to reach more than 10 Gbps, and allowing data sending and data receiving to be operated in pipelined mode, thus reducing the latency of message passing. This paper describes the principle of MCIM operation, the arbitration and selection function of the PAI, and the physical layout for implementing the memory-centric architecture. It also illustrates the simulation framework for the MCIM parallel system and gives the experimental results.

Keywords: memory-centric interconnection mechanism, parallel system, message passing
Classification: TP302

(Manuscript received in 1997; revised version received May 18, 1998. This work was supported by the National Climbing Program. LI San-Li, male, academician of the Chinese Academy of Engineering and Ph.D. supervisor, works on computer architecture, networked parallel computing, and parallel processing. GE Yi, male, Ph.D. candidate, works on computer architecture and parallel processing. WU Jian-Feng, male, Ph.D. candidate, works on computer architecture and parallel processing.)

1 Overview

Among the approaches to high-performance computing, MPP technology has gone through several cycles. An important current trend in MPP is to use a structure with a moderate number of nodes in which every node has very high performance, the so-called supernode. Multiple supernodes are connected by specially designed interconnection networks, as in SGI's ORIGIN 2000 and Sun's recently announced STARFIRE; each supernode of this type can contain 64 to 128 processors. These two new high-performance parallel systems represent a direction in which scalability, modularity and software portability are all strong, and they mark the beginning of volume-produced commercial high-performance parallel systems entering the market.

Current supernode systems generally adopt a multi-level interconnection structure. STARFIRE, for example, forms a supernode through a tightly coupled two-level bus structure: the first level is a processing unit (PU) with a single-bus SMP structure built from 2-4 CPUs, and the processing units communicate with one another over a second-level system bus. To relieve the bottleneck created by the system bus, the second level uses four independent groups of bus control logic and a crossbar structure for data exchange. Although this implementation remedies the drawback above, it increases design complexity; moreover, setting up a transfer path requires that the sending and receiving processing units and the system bus all be idle, so operation latency is large and system resources are not used effectively. This paper therefore proposes an interconnection mechanism suited to supernode-structured parallel computer systems that is not based on logic circuits but on multi-port fast static memory. We call it the Memory-Centric Interconnection Mechanism (MCIM).

2 Principles of MCIM

An important feature of the computer industry in recent years has been the rapid progress of memory technology: the capacity and speed of memory chips have risen sharply while prices have dropped substantially. Large main-memory chips of 1 Gbit are already available, static RAM has also advanced quickly, with access cycles below 5 ns, and new dual-port and four-port fast static RAMs (FSRAM) are commercially available. All of this provides favorable conditions for implementing MCIM.

2.1 MCIM structure

Figure 1 shows the interconnection principle of a parallel system that uses a four-port fast static RAM for message passing. [Figure 1: MCIM communication principle; a port access interface (PAI) sits on each port of the multi-port memory.] At the center is the multi-port memory used for message passing. Logically it is divided into four message-passing mailbox areas, one per port. A message or data item to be sent is first placed in the mailbox corresponding to the destination port and is then transferred through that port to the destination node. Physically, the whole memory can be divided into many fixed-length message buffers (statistics show that in typical applications most messages are no longer than 2 KB, which is a suitable buffer length; the buffers can also be made smaller, in which case some of the advantages of wormhole routing are obtained). Each message to be transferred is given one buffer; a longer message is split into several shorter packets so that the buffer does not overflow, and the buffer is released when its message has been delivered. Messages destined for the same port form a message queue, the logical mailbox, which uses the storage space efficiently. This communication mechanism also makes broadcast and multicast easy to implement: a packet is simply placed into several message queues at once.

The Port Access Interface (PAI) is the key component that turns the multi-port memory into a message-passing mechanism, and it is also what fundamentally distinguishes MCIM from data sharing in a shared memory. The PAI controls the sending and receiving of messages on its port, allocates free message buffers, manages the port's message queue, and arbitrates and acknowledges the access requests of the nodes. Because the multi-port memory allows several ports to access the same data at the same time, sending and receiving can proceed concurrently, that is, messages move between nodes in a pipelined fashion, which markedly reduces the transfer latency.

2.2 Features of MCIM

As a message-passing interconnection mechanism, MCIM has several features that distinguish it from traditional interconnection networks built from logic circuits:

1. Because the multi-port memory can be read and written simultaneously or in a pipelined manner, the transfer latency can be as low as two memory cycles. Multi-port static RAMs on the market reach a 5 ns read/write cycle; with four ports of 64 bits each, the aggregate bandwidth of MCIM can reach 50 Gbps while the latency is only 10 ns, which is good transfer performance.

2. Multiple transfer modes can be realized between the ports. On top of MCIM one can implement not only point-to-point transfers between pairs of ports but also multicast and broadcast, and even have all ports send or receive simultaneously. An interconnection built on this basis can choose the appropriate communication mode for each application without software adjustment, which reduces the communication software overhead and helps improve system efficiency. For example, in a shared-memory multiprocessor the coherence protocol often has to broadcast or multicast coherence messages; with point-to-point communication a node must send the same message repeatedly, whereas with MCIM the PAI can deliver all the copies in a single operation.

3. Since the multi-port memory itself supports the storage of data and its exchange between the ports, only a small amount of control logic needs to be added to obtain the interconnection function.

3 A parallel system based on MCIM

In an MCIM parallel system that uses a four-port fast static RAM, each memory port serves one row of computing nodes, with k nodes per row (Fig. 2). Each computing node may be a single processor or an SMP node with more than one processor, so the system contains 4 x k computing nodes in total. Through the four ports the nodes read or write data words of up to 64 bits to the fast static RAM (MPFSRAM). Static RAM with a 5 ns read/write cycle can already be manufactured; with such memory the aggregate bandwidth of MCIM reaches 50 Gbps. [Figure 2: structure of an MCIM parallel system.]

Arbitration and selection are an important part of an MCIM parallel system (see Fig. 2). When a computing node in one row, say a node attached to port 2, wants to send data to a destination node, it must use port 2 of the FSRAM. Other nodes in the same row may want to use port 2 at the same time, so contention can occur, and arbitration according to the states and priorities of the nodes is required. Likewise, the destination node may have to contend for memory port 3 when it receives the data; arbitration is again performed according to the states and priorities of the nodes in row 3, and the receiving node is selected according to the destination node address.

In an MCIM parallel system the task of arbitration and selection is carried out mainly by the port access interface, the PAI. After receiving an access request from a node, it makes a selection based on the relevant state information and returns an acknowledge signal. The PAI also implements other functions, including management of the message buffer queues and handshaking with the nodes and with the memory port. Figure 3 sketches the logical structure of the PAI. [Figure 3: logical structure of the PAI, comprising the node control unit (NCU), the memory control unit (MCU), the buffer management unit (BMU), the feature word register (IWR) and a bidirectional data channel (DC).] DC is a bidirectional data channel controlled jointly by the NCU and the MCU; under the given control signals, data flow between the processing unit and the memory port. The Memory Control Unit (MCU) performs the PAI's operations on the memory port: all port commands are issued by the MCU. Because during a data transfer it is the PAI, not the node, that directly accesses the port of the central fast static RAM, the read/write commands and address signals for the port are issued through the MCU, which also sends control signals to the data channel so that data move between the memory and the processing unit. The Buffer Management Unit (BMU) is mainly responsible for managing the port's message buffer queue: it requests a free buffer before a message transfer, releases unused buffers when the transfer is finished, and records the state information; it also supplies the address of the transfer buffer to the MCU so that the transfer can start. The Node Control Unit (NCU) handles the exchange between the PAI and the processing units and controls the transfer process: it first arbitrates and selects among the processing units' transfer requests and gives the appropriate response, then controls the data transfer through handshaking with the sending unit, and at the same time signals the MCU to start the memory operation.

To cooperate with the memory port and the port access interface, every node must have its own feature word IW. This feature word is written into a feature word register IWR (see Fig. 4), which takes part directly in the arbitration and selection inside the PAI. The feature word consists of several fields: the state of the node itself, its priority, the destination port number, and the destination node address; the number of bytes to transfer can also be included as an optional field. If the byte count is carried inside the transferred message data (as is usually the case in practice), the IWR need not contain it.

4 Message sending and receiving

[Figure 4: composition of the feature word register IWR.]

In a parallel system interconnected by MCIM, a sending node that wants to send a message to a destination node first sends the feature word IW of the message to the feature word register IWR; the node-state, priority and destination-port fields take part in the arbitration first. For example, if a node P wants to send data to a destination node, it takes part in the arbitration in the port access interface PAI to decide whether it can win, among the nodes of its row, the right to use port 2. If it wins port 2, the sending node P starts writing the data to be sent, according to the destination port number in its IWR, into the message-passing buffer of the corresponding queue in the multi-port MPFSRAM (here the destination port number is 2). At the same time, the destination node address in the sending node's feature word register is sent to the arbitration and selection logic of the row in which the destination node lies, to take part in the selection there. Since the feature words of the nodes of that row are also waiting for selection, whether the transfer can proceed now depends on whether the state of the destination node is "idle" (0) rather than "busy" (1), and on whether its priority allows it to win memory port 3. If the destination node is busy, or its priority is too low to obtain port 3, the request waits according to the scheduling algorithm, or the sending and receiving nodes issue new feature words. If the receiving node is idle and its priority allows it to win memory port 3, it receives (reads) the message data word by word from message buffer queue 3. Because this is a multi-port memory, the sending and receiving of data can proceed concurrently. After receiving each data word, the receiving node clears the corresponding storage location in the message buffer and then reads the next word from the following location; the clearing can be done between two adjacent read operations and does not interrupt the continuous reading. While the receiving node is reading the incoming data, no other node of its row is allowed to send, because it would necessarily lose the arbitration and cannot obtain port 3 while that port is busy receiving data. The receiving node issues an acknowledge signal both when it starts to receive the incoming data and after it has received all of it.

The arbitration and selection for message passing and the read/write timing are shown in Fig. 5. [Figure 5: timing of sending and receiving; A: the sending node takes part in arbitration; B: the sending node writes the data; C: the destination node address is selected; D: the destination node reads the data; E: the original data in the communication mailbox are cleared; F: the receiving node issues the acknowledge signal.] As the figure shows, the delay between sending a data word and receiving it is only two memory cycles; with a fast static RAM (cycle time 5 ns, for example), the delay is only 10 ns.

If processor nodes within the same row wish to exchange messages, an arbitration and selection process also takes place. The feature word of the sending node is first sent to its PAI for arbitration; one node is selected to occupy the port according to the node states and priorities, and the destination node is then chosen according to the destination node address. If the destination node is in the same row, the NCU sends it a select command; when the destination node answers "ready to receive," the NCU signals the sending node and the transfer begins. Because the data lines of the nodes of a row are connected together through per-node selection gates, once the gate is opened the sending node can transfer the data directly and quickly to the receiving node over the data bus, much like a DMA transfer, without going through the central memory.

5 Simulation and experimental results

Our simulation experiments mainly study the transfer bandwidth and the latency of the MCIM parallel-system interconnect. Moderately parallel high-performance computers today generally use single-bus SMP or multi-level-bus parallel structures, which in turn are often the technical basis of the supernode structure described above; the simulation experiments compare and analyze MCIM against these two types of parallel systems.

5.1 Experimental model

For the experimental model we first chose a dual-port fast static RAM, a commercial part that is common on the market. The structure of the MCIM simulation model is shown in Fig. 6. [Figure 6: structure of the MCIM simulation model.] Processor groups 1-4 correspond to the rows of nodes in Fig. 2. In the model, each pair consisting of a read port and a write port corresponds to one port of the multi-port memory: communication data enter the memory through the write port and are taken out through the read port, with write arbitration and read arbitration controlling the nodes' write and read accesses to the ports respectively. Banks 1-4 correspond to the message buffer queues of the ports; since data flow one way between a write port and its read port, a FIFO model is sufficient to simulate a message buffer queue. A sending node first obtains write permission for the port through write arbitration and can then write its data into the corresponding buffer queue. Likewise, a receiving node can read data out of the central memory only after it has obtained access to the read port through read arbitration. The whole process proceeds flexibly in two phases according to the states of the sending and receiving nodes.

5.2 Simulation results and analysis

In the simulation we assume that the generation of messages by the computation is a Poisson process, that destination node addresses are uniformly distributed, and that message lengths are exponentially distributed. The experiments assume 16 processor nodes; for MCIM we assume 4 read ports and 4 write ports. G is the expected interval between the messages sent by each node in the MCIM parallel system; it indicates how heavily the processor nodes load the ports and serves as the independent variable of the comparison. Length is the expected message length; its influence on system performance interacts with G, and to describe the trends more accurately it is set to 512 bits, which matches practical situations. To reflect the uncertainty of real operation, both G and Length are random variables. Figures 7-10 plot MCIM's bandwidth and latency as the sending interval and the message length are varied. [Figure 7: MCIM bandwidth versus the message sending interval (cycles). Figure 9: MCIM bandwidth versus message length. Figure 10: MCIM latency versus average message length (bits).]

The curves show that MCIM's message transfer performance depends on both the message length and the sending frequency. As the message length grows or the sending interval shrinks, the transfer latency gradually increases and the system throughput also increases until it saturates. The reason is that when messages become longer or more frequent, the probability of conflicts between messages competing for resources increases, which changes the transfer latency; the throughput, however, is not hurt by this and still rises step by step to the maximum transfer rate. This is determined by MCIM's transfer mechanism: thanks to the central memory, a transfer does not require the whole path to be idle before it can proceed, so system resources are used to the greatest possible extent.

Figure 12 compares, with all other conditions equal, the transfer bandwidth and latency of MCIM with those of a multi-level-bus multiprocessor (MP) system (Fig. 11). [Figure 11: structure of a multi-level-bus MP system. Figure 12: simulation results for MCIM versus the multi-level-bus structure.]

In the simulation experiments the SRAM of MCIM is operated in two read/write modes. Mode 1, read-as-written: after a message starts to be written, the read operation can begin in the cycle following the start of the write, as long as the read port of the destination node is idle. Mode 2, synchronized read/write: the message data can be written only when both the write port of the sending node and the read port of the receiving node are idle, and the destination node reads the data out in a pipelined fashion. In the experimental curves, B1 is the per-node transfer bandwidth and T1 the average message latency of mode 1; B2 and T2 are the per-node bandwidth and average message latency of mode 2; Bs and Ts denote the average per-node bandwidth and latency of the two-level-bus system, and Bs' and Ts' the per-node bandwidth and latency of that system when the system bus is twice as wide as the lower-level bus.

The results in the figure show that MCIM still holds a clear performance advantage over the multi-level-bus multiprocessor system. In a multi-level-bus system the upper-level bus often becomes the bottleneck; even if its bandwidth is increased, the long arbitration and response delays keep its transfer performance below that of the MCIM interconnect. Compared with a multi-level-bus interconnection structure, MCIM has several advantages. First, because the interconnection is memory-centric, the data width of each port can reach 8 to 32 bytes, which exceeds the width of the interconnection-network paths of typical parallel systems. Second, MCIM's tightly coupled implementation and pipelined sending and receiving of messages give it a higher operating frequency and a smaller delay, while the characteristics of the multi-port memory and the hardware implementation of port arbitration avoid the blocking of messages during transfers. Compared with the interconnection networks of current parallel systems, whose per-node bandwidth is a few hundred Mbps, MCIM is therefore an interconnection technique with definite advantages.

6 Conclusion

The memory-centric interconnection mechanism MCIM is a new concept and structure, and parallel systems built on MCIM form a new research area; they offer the notable advantages of wide data-transfer bandwidth and small data-transfer latency. Using commercial multi-port FSRAMs is simpler and cheaper than the custom routers of parallel-system interconnection networks. If an MCIM system is regarded as a supernode and combined with a mesh structure, it opens up good prospects for hundred-teraflops supercomputer technology; it can also be attached to the Internet so that its considerable computing power can be used for parallel operation and resource sharing over a wider scope.
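The mailbox organization described in Section 2.1 can be sketched in ordinary software to make the idea concrete. The C fragment below is only an illustration, not the MCIM hardware: it keeps one queue (mailbox) of fixed-size message buffers per destination port, a send operation that claims a free buffer in the destination mailbox, and a receive operation that drains the oldest buffer and releases it. It does not model the PAI arbitration, the feature word register, or memory timing, and all sizes and names (NUM_PORTS, QUEUE_DEPTH, BUF_BYTES, mcim_send, mcim_recv) are invented for the example.

```c
#include <stdio.h>
#include <string.h>

#define NUM_PORTS   4      /* one mailbox per memory port           */
#define QUEUE_DEPTH 8      /* message buffers per mailbox           */
#define BUF_BYTES   2048   /* fixed message-buffer size (cf. ~2 KB) */

/* One fixed-length message buffer in the central memory. */
typedef struct {
    int  src_node;         /* sending node id                       */
    int  length;           /* valid bytes in data[]                 */
    char data[BUF_BYTES];
} MsgBuf;

/* The mailbox (message queue) belonging to one destination port. */
typedef struct {
    MsgBuf buf[QUEUE_DEPTH];
    int head, tail, count; /* simple circular FIFO bookkeeping      */
} Mailbox;

static Mailbox mailbox[NUM_PORTS];

/* "Send": claim a free buffer in the destination port's mailbox and copy
 * the message into it.  Returns 0 on success, -1 if the queue is full
 * (a real PAI would make the sender wait or retry). */
static int mcim_send(int src_node, int dst_port, const void *msg, int len)
{
    if (dst_port < 0 || dst_port >= NUM_PORTS || len > BUF_BYTES) return -1;
    Mailbox *mb = &mailbox[dst_port];
    if (mb->count == QUEUE_DEPTH) return -1;

    MsgBuf *b = &mb->buf[mb->tail];
    b->src_node = src_node;
    b->length   = len;
    memcpy(b->data, msg, (size_t)len);
    mb->tail = (mb->tail + 1) % QUEUE_DEPTH;
    mb->count++;
    return 0;
}

/* "Receive": the node attached to dst_port drains the oldest message from
 * its mailbox and releases the buffer.  Returns the number of bytes
 * received, or -1 if the mailbox is empty. */
static int mcim_recv(int dst_port, void *out, int max_len)
{
    Mailbox *mb = &mailbox[dst_port];
    if (mb->count == 0) return -1;

    MsgBuf *b = &mb->buf[mb->head];
    int len = b->length < max_len ? b->length : max_len;
    memcpy(out, b->data, (size_t)len);
    mb->head = (mb->head + 1) % QUEUE_DEPTH;
    mb->count--;               /* buffer released for reuse */
    return len;
}

int main(void)
{
    char in[] = "hello via the central memory", out[BUF_BYTES];

    /* Broadcast is just placing the same packet in several mailboxes. */
    mcim_send(/*src_node=*/5, /*dst_port=*/2, in, (int)sizeof in);
    mcim_send(5, 3, in, (int)sizeof in);

    int n = mcim_recv(2, out, sizeof out);
    printf("port 2 received %d bytes: %s\n", n, out);
    return 0;
}
```

Broadcast or multicast, as described in the paper, then amounts to placing the same packet into several mailboxes, which the example's main() does for ports 2 and 3.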
Pipelined Compressed Checkpointing for Heterogeneous Parallel Computing Systems

Acta Electronica Sinica (电子学报), Vol. 40, No. 2, February 2012

Pipelined Compressed Checkpointing for Heterogeneous Systems

LIU Yong-peng (1), WANG Feng (1), LU Kai (1), LIU Yong-yan (2)
(1. College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China; 2. Information Center, Ministry of Science and Technology of China, Beijing 100862, China)

Abstract: In large-scale parallel computing systems, a parallel checkpoint triggers a large number of nodes to save their computation state at the same time, which causes a huge file storage overhead and puts enormous access pressure on the communication and storage subsystems. Data compression can shrink the checkpoint files and thus reduce the storage overhead and the access pressure on the communication and storage subsystems, but it also introduces additional compression computation overhead. For heterogeneous parallel computing systems, this paper proposes a pipelined parallel compressed checkpointing technique that uses a series of optimizations to reduce the computation delay introduced by compression, including pipelined double write-buffer queues, merging of file write operations, a GPU-accelerated pipelined compression algorithm, and multi-process scheduling of GPU resources. The paper describes an implementation of the technique in the Tianhe-1 (TH-1) system and a comprehensive evaluation of the resulting checkpointing system. The experimental data show that the approach is feasible, efficient and practical in large-scale heterogeneous parallel computing systems.

Keywords: heterogeneous parallel architecture; checkpoint; data compression; software pipeline; graphics processing unit (GPU)

CLC number: TP338.4; Document code: A; Article ID: 0372-2112(2012)02-0223-07; DOI: 10.3969/j.issn.0372-2112.2012.02.003

(Received 2010-09-20; revised 2011-07-27. Supported by the National 863 High Technology Research and Development Program of China (No. 2009AA01A128), the Open Fund of the State Key Laboratory of High-Efficiency Server and Storage Technology (No. 2009HSSA04), and the National Natural Science Foundation of China (No. 60603061).)

1 Introduction

For large-scale parallel computing systems, failures are unavoidable, and the inherent reliability of such systems cannot meet the running requirements of high-performance applications [1]. Checkpoint-based rollback recovery is a typical software fault-tolerance technique. The basic idea is to save the current computation state when the computation reaches a checkpoint; when a failure occurs later, the system rolls back to the most recently saved state and continues the computation. In a large-scale parallel computing system, a parallel checkpoint must save the running state of all parallel computing processes, which causes a large number of compute nodes to save their state data to the file system at the same time. This not only creates a huge storage space overhead but also generates intensive file accesses that put great pressure on the communication and file systems. Data compression is an important means of reducing the volume of checkpoint file accesses, but it introduces extra computation overhead that affects the normal computing performance of the system [7,8,10]. Socket-level heterogeneity has outstanding advantages in computing density, energy efficiency and cost-performance, and it is an important trend in high-performance computing: four of the top ten systems in the latest Top500 list are socket-level heterogeneous systems. In this kind of heterogeneous architecture the coprocessors provide abundant computing power, which offers a new opportunity for reducing the compression time.

2 Analysis of the Time Overhead of Compressed Checkpointing

A typical socket-level heterogeneous parallel computing system is shown in Fig. 1: each compute node contains two kinds of heterogeneous computing units, a CPU and a coprocessor (GPU), and the compute array is connected to the storage array through an interconnection network, through which it accesses a global file system. [Figure 1: a socket-level heterogeneous parallel computing system.]

In a large-scale parallel computing system, when a large number of nodes create local checkpoints in parallel, the resulting intensive file operations put enormous access pressure on the interconnection network and the storage array. Reducing the size of the checkpoint files not only saves storage space but also relieves the access pressure on the network and the file system, and is therefore an important way to improve the usability of parallel checkpointing.

A local checkpoint operation consists of two parts: data collection and data saving. The data-collection time T_0 includes the time needed to gather the process state, construct the checkpoint data, and perform related operations. In a socket-level heterogeneous system the file throughput b of each process is essentially the same. Suppose that when the number of processes P reaches k, the volume of file accesses issued in parallel per unit time equals the access bandwidth β of the file system, that is, k × b = β. When a parallel checkpoint is created, if the total volume of file accesses per unit time is smaller than the bandwidth, the system bandwidth is not the bottleneck, and the file access time is determined by the process's own checkpoint data volume S and the file access rate (β/k), as in Eq. (1a); we call k the applicable scale of the parallel checkpointing technique. When the total volume of file accesses per unit time is greater than or equal to the bandwidth, the file access time is related to the number of processes and is limited by the file access bandwidth of the system, as in Eq. (1b):

$$T_{ckpt}(S) = T_0(S) + \frac{kS}{\beta}, \qquad P < k \qquad (1\text{a})$$
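The pipelined double write-buffer queue mentioned in the abstract can be illustrated with a small host-side sketch. The C program below is not the paper's TH-1 implementation and does not use a GPU; it only shows the overlap that the paper exploits: while a background thread compresses and writes one buffer of checkpoint data, the application thread keeps filling the other buffer. The compress_block() routine here is a placeholder that merely copies the data, standing in for a real (possibly GPU-accelerated) compressor, and all buffer sizes and names are invented.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_BYTES (1 << 20)   /* 1 MiB per write buffer (illustrative) */
#define NUM_BLOCKS 8          /* blocks of checkpoint state to save    */

typedef struct {
    char   data[BUF_BYTES];
    size_t len;
    int    ready;             /* 1: filled, waiting to be flushed      */
} Buffer;

static Buffer buf[2];                      /* the double write buffer  */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int done = 0;

/* Placeholder "compression": real code would call a compressor here
 * (possibly offloaded to a GPU); this stub just copies the block. */
static size_t compress_block(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
    return n;
}

/* Background thread: compress and write whichever buffer is ready,
 * overlapping with the application thread filling the other buffer. */
static void *flusher(void *arg)
{
    FILE *f = (FILE *)arg;
    char *scratch = malloc(BUF_BYTES);
    int next = 0;
    while (scratch) {
        pthread_mutex_lock(&mtx);
        while (!buf[next].ready && !done)
            pthread_cond_wait(&cv, &mtx);
        if (!buf[next].ready && done) { pthread_mutex_unlock(&mtx); break; }
        pthread_mutex_unlock(&mtx);

        size_t clen = compress_block(scratch, buf[next].data, buf[next].len);
        fwrite(scratch, 1, clen, f);          /* aggregated file write */

        pthread_mutex_lock(&mtx);
        buf[next].ready = 0;                  /* hand the buffer back  */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
        next ^= 1;
    }
    free(scratch);
    return NULL;
}

int main(void)
{
    FILE *f = fopen("checkpoint.img", "wb");
    if (!f) return 1;
    pthread_t t;
    pthread_create(&t, NULL, flusher, f);

    int cur = 0;
    for (int blk = 0; blk < NUM_BLOCKS; blk++) {
        pthread_mutex_lock(&mtx);
        while (buf[cur].ready)                /* wait until buffer free */
            pthread_cond_wait(&cv, &mtx);
        pthread_mutex_unlock(&mtx);

        memset(buf[cur].data, blk, BUF_BYTES);/* "capture" process state */
        buf[cur].len = BUF_BYTES;

        pthread_mutex_lock(&mtx);
        buf[cur].ready = 1;                   /* publish to the flusher */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
        cur ^= 1;
    }

    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
    pthread_join(t, NULL);
    fclose(f);
    return 0;
}
```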
Expressing irregular computations in modern Fortran dialects

Expressing Irregular Computations in Modern Fortran Dialects

Jan F. Prins, Siddhartha Chatterjee, and Martin Simons
Department of Computer Science, The University of North Carolina, Chapel Hill, NC 27599-3175

Abstract. Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism in Fortran, we enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested data-parallel computation is unpredictable and often poor using current compilers, we investigate source-to-source transformation techniques that yield Fortran 90 programs with improved performance and performance stability.

1 Introduction

Modern science and engineering disciplines make extensive use of computer simulations. As these simulations increase in size and detail, the computational costs of naive algorithms can overwhelm even the largest parallel computers available today. Fortunately, computational costs can be reduced using sophisticated modeling methods that vary model resolution as needed, coupled with sparse and adaptive solution techniques that vary computational effort in time and space as needed. Such techniques have been developed and are routinely employed in sequential computation, for example, in cosmological simulations (using adaptive n-body methods) and computational fluid dynamics (using adaptive meshing and sparse linear system solvers).

However, these so-called irregular or unstructured computations are problematic for parallel computation, where high performance requires equal distribution of work over processors and locality of reference within each processor. For many irregular computations, the distribution of work and data cannot be characterized a priori, as these quantities are input-dependent and/or evolve with the computation itself. Further, irregular computations are difficult to express using performance-oriented languages such as Fortran, because there is an apparent mismatch between data types such as trees, graphs, and nested sequences characteristic of irregular computations and the statically analyzable rectangular multi-dimensional arrays that are the core data types in modern Fortran dialects such as Fortran 90/95 [19] and High Performance Fortran (HPF) [16].
Irregular data types can be introduced using the data abstraction facilities, with a representation exploiting pointers. Optimization of operations on such an abstract data type is currently beyond compile-time analysis, and compilers have difficulty generating high-performance parallel code for such programs. This paper primarily addresses the expression of irregular computations in Fortran 95, but does so with a particular view of the compilation and high-performance execution of such computations on parallel processors.

The modern Fortran dialects enjoy increasing use and good support as mainstream performance-oriented programming languages. By providing the ability to express irregular computations as Fortran modules, and by preprocessing these modules into a form that current Fortran compilers can successfully optimize, we enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application.

For example, consider the NAS CG (Conjugate Gradient) benchmark, which solves an unstructured sparse linear system using the method of conjugate gradients [2]. Within the distributed sample sequential Fortran solution, 79% of the lines of code are standard Fortran 77 concerned with problem construction and performance reporting. The next 16% consist of scalar and regular vector computations of the BLAS 2 variety [17], while the final 5% of the code is the irregular computation of the sparse matrix-vector product. Clearly we want to rewrite only this 5% of the code (which performs 97% of the work in the class B computation), while the remainder should be left intact for the Fortran compiler. This is not just for convenience. It is also critical for performance reasons; following Amdahl's Law, as the performance of the irregular computation improves, the performance of the regular component becomes increasingly critical for sustained high performance overall. Fortran compilers provide good compiler/annotation techniques to achieve high performance for the regular computations in the problem, and can thus provide an efficient and seamless interface between the regular and irregular portions of the computation.

We manually applied the implementation techniques described in Sect. 4 to the irregular computation in the NAS CG problem. The resultant Fortran program achieved a performance on the class B NAS CG 1.0 benchmark of 13.5 GFLOPS using a 32-processor NEC SX-4 [25]. We believe this to be the highest performance achieved for this benchmark to date. It exceeds, by a factor of 2.6, the highest performance reported in the last NPB 1.0 report [27], and is slightly faster than the 12.9 GFLOPS recently achieved using a 1024-processor Cray T3E-900 [18]. These encouraging initial results support the thesis that high-level expression and high performance for irregular computations can be supported simultaneously in a production Fortran programming environment.
2 Expressing irregular computations using nested data parallelism

We adopt the data-parallel programming model of Fortran as our starting point. The data-parallel programming model has proven to be popular because of its power and simplicity. Data-parallel languages are founded on the concept of collections (such as arrays) and a means to allow programmers to express parallelism through the application of an operation independently to all elements of a collection (e.g., the elementwise addition of two arrays). Most of the common data-parallel languages, such as the array-based parallelism of Fortran 90, offer restricted data-parallel capabilities: they limit collections to multidimensional rectangular arrays, limit the type of the elements of a collection to scalar and record types, and limit the operations that can be applied in parallel to the elements of a collection to certain predefined operations rather than arbitrary user-defined functions. These limitations are aimed at enabling compile-time analysis and optimization of the work and communication for parallel execution, but make it difficult to express irregular computations in this model.

If the elements of a collection are themselves permitted to have arbitrary type, then arbitrary functions can be applied in parallel over collections. In particular, by operating on a collection of collections, it is possible to specify a parallel computation, each simultaneous operation of which in turn involves (a potentially different-sized) parallel subcomputation. This programming model, called nested data parallelism, combines aspects of both data parallelism and control parallelism. It retains the simple programming model and portability of the data-parallel model while being better suited for describing algorithms on irregular data structures. The utility of nested data parallelism as an expressive mechanism has been understood for a long time in the LISP, SETL [29], and APL communities, although always with a sequential execution semantics and implementation.

Nested data parallelism occurs naturally in the succinct expression of many irregular scientific problems. Consider the sparse matrix-vector product at the heart of the NAS CG benchmark. In the popular compressed sparse row (CSR) format of representing sparse matrices, the nonzero elements of an n x n sparse matrix A are represented as a sequence of rows, where the i-th row is, in turn, represented by a (possibly empty) sequence of (v, c) pairs, where v is the nonzero value and c is the column in which it occurs. With a dense vector x represented as a simple sequence of values, the sparse matrix-vector product y = Ax may now be written as shown using the NESL notation [4]:

y = { dotp(row, x) : row in A }

This expression specifies the application of dotp, in parallel, to each row of A to yield the elements of the result sequence y. The sequence constructor { ... : row in A } serves a dual role: it specifies parallelism (for each row in A), and it establishes the order in which the result elements are assembled into the result sequence y. We obtain nested data parallelism if the body expression dotp(row, x) itself specifies the parallel computation of the dot product of the row with x as the sum-reduction of a sequence of nonzero products:

dotp(row, x) = sum({ v * x[c] : (v, c) in row })

More concisely, the complete expression could also be written as follows:

y = { sum({ v * x[c] : (v, c) in row }) : row in A }

where the nested parallelism is visible as nested sequence constructors in the source text. Nested data parallelism provides a succinct and powerful notation for specifying parallel computation, including irregular parallel computations. Many more examples of efficient parallel algorithms expressed using nested data parallelism have been described
in [4].

3 Nested data parallelism in Fortran

If we consider expressing nested data parallelism in standard imperative programming languages, we find that they either lack a data-parallel control construct (C, C++) or else lack a nested collection data type (Fortran). A data-parallel control construct can be added to C [11] or C++ [30], but the pervasive pointer semantics of these languages complicate its meaning. There is also incomplete agreement about the form that parallelism should take in these languages.

The FORALL construct, which originated in HPF [16] and was later added into Fortran 95, specifies data-parallel evaluation of expressions and array assignments. To ensure that there are no side effects between these parallel evaluations, functions that occur in the expressions must have the PURE attribute. Fortran 90 lacks a construct that specifies parallel evaluations. However, many compilers infer such an evaluation if specified using a conventional loop, possibly with an attached directive asserting the independence of iterations. FORALL constructs (or Fortran 90 loops) may be nested. To specify nested data-parallel computations with these constructs, it suffices to introduce nested aggregates, which we can do via the data abstraction mechanism of Fortran 90.

As a consequence of these language features, it is entirely possible to express nested data-parallel computations in modern Fortran dialects. For example, we might introduce the types shown in Fig. 1 to represent a sparse matrix: one type for a sparse matrix element, i.e., the (v, c) pair of the NESL example, and one type for vectors (1-D arrays) of sparse matrix elements, i.e., a row of the matrix. A sparse matrix is characterized by the number of rows and columns, and by the nested sequence of sparse matrix elements. Using these definitions, the sparse matrix-vector product can be succinctly written as shown in Fig. 2. The FORALL loop specifies parallel evaluation of the inner products for all rows. Nested parallelism is a consequence of the use of parallel operations such as sum and elementwise multiplication, projection, and indexing.

Discussion

Earlier experiments with nested data parallelism in imperative languages include V [11], Amelia [30], and F90V [1]. For the first two of these languages the issues of side effects in the underlying notation (C++ and C, respectively) were problematic in the potential introduction of interference between parallel iterations, and the efforts were abandoned.
Fortran finesses this problem by requiring procedures used within a FORALL construct to be PURE, an attribute that can be verified statically. This renders invalid those constructions in which side effects (other than the nondeterministic order of stores) can be observed, although such a syntactic constraint is not enforced in Fortran 90.

The specifications of nested data parallelism in Fortran and NESL differ in important ways, many of them reflecting differences between the imperative and functional programming paradigms. First, a sequence is formally a function from an index set to a value set. The NESL sequence constructor specifies parallelism over the value set of a sequence while the Fortran FORALL statement specifies parallelism over the index set of a sequence. This allows a more concise syntax and also makes explicit the shape of the common index domain shared by several collections participating in a FORALL construct. Second, the NESL sequence constructor implicitly specifies the ordering of result elements, while this ordering is explicit in the FORALL statement. One consequence is that the restriction clause has different semantics. For instance, a NESL sequence constructor restricted to the odd values of a sequence yields a shorter result containing just those values, while the corresponding masked Fortran FORALL statement assigns only to the positions that satisfy the mask, leaving the remaining elements of the target array unchanged. Third, the Fortran construct provides explicit control over memory. Explicit control over memory can be quite important for performance. For example, if we were to multiply the same sparse matrix repeatedly by different right-hand sides (which is in fact exactly what happens in the CG benchmark), we could reuse a single temporary instead of freeing and allocating it each time. Explicit control over memory also gives us a better interface to the regular portions of the computation. Finally, the base types of a nested aggregate in Fortran are drawn from the Fortran data types and include multidimensional arrays and pointers. In NESL, we are restricted to simple scalar values and record types. Thus, expressing a sparse matrix as a collection of supernodes would be cumbersome in NESL. Another important difference is that we may construct nested aggregates of heterogeneous depth with Fortran, which is important, for example, in the representation of adaptive oct-tree spatial decompositions.
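To make the nested-aggregate idea concrete, here is a rough C analogue of the pointer-based representation and the row-wise product described above (the type and function names are invented for illustration; the paper itself works with Fortran 90 derived types and a FORALL loop):

/* A nonzero element: its value and the column in which it occurs. */
typedef struct { double val; int col; } elem_t;

/* A row is a (possibly empty) vector of elements. */
typedef struct { int nnz; elem_t *elems; } row_t;

/* A sparse matrix is a nested aggregate: a vector of rows. */
typedef struct { int nrows, ncols; row_t *rows; } sparse_t;

/* y = A*x over the nested representation.  The outer loop over rows is the
   data-parallel part; the inner loop is the per-row dot product, whose
   length varies from row to row. */
void nested_matvec(const sparse_t *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double s = 0.0;
        for (int k = 0; k < A->rows[i].nnz; k++)
            s += A->rows[i].elems[k].val * x[A->rows[i].elems[k].col];
        y[i] = s;
    }
}

The varying inner-loop length is exactly the irregularity that the implementation strategies of Sect. 4 must cope with.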
4 Implementation issues

Expression of nested data parallelism in Fortran is of limited interest and of no utility if such computations cannot achieve high performance. Parallel execution and tuning for the memory hierarchy are the two basic requirements for high performance. Since the locus of activity and amount of work in a nested data-parallel computation cannot be statically predicted, run-time techniques are generally required.

4.1 Implementation strategies

There are two general strategies for the parallel execution of nested data parallelism, both consisting of a compile-time and a run-time component.

The thread-based approach. This technique conceptually spawns a different thread of computation for every parallel evaluation within a FORALL construct. The compile-time component constructs the threads from the nested loops. A run-time component dynamically schedules these threads across processors. Recent work has resulted in run-time scheduling techniques that minimize completion time and memory use of the generated threads [9,6,20]. Scheduling very fine-grained threads (e.g., a single multiplication in the sparse matrix-vector product example) is impractical, hence compile-time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance.

Fig. 3: (a) a nested data-parallel computation, (b) its dependence graph, (c) decomposition into parallel threads, (d) decomposition into data-parallel steps.

The flattening approach. This technique replaces nested loops by a sequence of steps, each of which is a simple data-parallel operation. The compile-time component of this approach is a program transformation that replaces FORALL constructs with "data-parallel extensions" of their bodies and restructures the representation of nested aggregate values into a form suitable for the efficient implementation of the data-parallel operations [8,26]. The run-time component is a library of data-parallel operations closely resembling HPFLIB, the standard library that accompanies HPF. A nested data-parallel loop that has been flattened may perform a small multiplicative factor of additional work compared with a sequential implementation. However, full parallelism and optimal load balance are easily achieved in this approach. Compile-time techniques to fuse data-parallel operations can reduce the number of barrier synchronizations, decrease space requirements, and improve reuse [12,24].

The two approaches are illustrated for a nested data-parallel computation and its associated dependence graph in Fig. 3. The labeled statements in the loop bodies denote assignments that cannot introduce additional dependences, since there can be no data dependences between iterations of FORALL loops. In Fig. 3(c) we show a decomposition of the work into parallel threads.
In this decomposition the body of the outer loop has been serialized to increase the grain size of each thread. As a result the amount of work in each thread is quite different. On the other hand, since each thread executes a larger portion of the sequential implementation, it can exhibit good locality of reference. In Fig. 3(d) we show a decomposition of the work into sequential steps, each of which is a simple data-parallel operation. The advantage of this approach is that we may partition the parallelism in each operation to suit the resources. For example, we can create parallel slack at each processor to hide network or memory latencies. In this example, the dependence structure permits the parallel execution of some of the steps, although this increases the complexity of the run-time scheduler.

4.2 Nested parallelism using current Fortran compilers

What happens when we compile the Fortran 90 sparse matrix-vector product shown in Fig. 2 for parallel execution using current Fortran compilers?

For shared-memory multiprocessors we examined two auto-parallelizing Fortran 90 compilers: the SGI F90 V7.2.1 compiler (beta release, March 1998) for SGI Origin class machines and the NEC FORTRAN90/SX R7.2 compiler (release 140, February 1998) for the NEC SX-4. We replaced the FORALL construct in Fig. 2 with an equivalent loop to obtain a Fortran 90 program. Since the nested parallel loops do not define a polyhedral iteration space, many classical techniques for parallelization do not apply. However, both compilers recognize that iterations of the outer loop (over rows) are independent and, in both cases, these iterations are distributed over processors. The dot-product inner loop is compiled for serial execution or vectorized. This strategy is not always optimal, since the distribution of work over outermost iterations may be uneven or there may be insufficient parallelism in the outer iterations.

For distributed-memory multiprocessors we examined one HPF compiler. This compiler failed to compile the program because it had no support for pointers in Fortran 90 derived types. Our impression is that this situation is representative of HPF compilers in general, since the focus has been on the parallel execution of programs operating on rectangular arrays. The data distribution issues for the more complex derived types with pointers are unclear. Instead, HPF 2.0 supports the non-uniform distribution of arrays over processors. This requires the programmer to embed irregular data structures in an array and determine the appropriate mapping for the distribution.

We conclude that current Fortran compilers do not sufficiently address the problems of irregular nested data parallelism. The challenge for irregular computations is to achieve uniformly high and predictable performance in the face of dynamically varying distribution of work. We are investigating the combined use of threading and flattening techniques for this problem. Our approach is to transform nested data-parallel constructs into simple Fortran 90, providing simple integration with regular computations, and leveraging the capabilities of current Fortran compilers. This source-to-source translation restricts our options somewhat for the thread scheduling strategy. Since threads are not part of Fortran 90, the only mechanism for their (implicit) creation is loops, and the scheduling strategies we can choose from are limited by those offered by the compiler/run-time system. In this regard, standardized loop scheduling directives like the OpenMP directives [23] can improve portability. A nested data-parallel computation should be transformed into
a (possibly nested) iteration space that is partitioned over threads. Dynamic scheduling can be used to tolerate variations in progress among threads. Flattening of the loop body can be used to ensure that the amount of work per thread is relatively uniform.

4.3 Example

Consider a sparse matrix with a given total number of nonzeros. Implementation of the simple nested data parallelism in the procedure of Fig. 2 must address many of the problems that may arise in irregular computations:
– Uneven units of work: the matrix may contain both dense and sparse rows.
– Small units of work: the matrix may contain rows with very few nonzeros.
– Insufficient units of work: if the number of rows is less than the number of processors and the rows are sufficiently large, then parallelism should be exploited within the dot products rather than between the dot products.

We constructed two implementations of the sparse matrix-vector product. The pointer-based implementation is obtained by direct compilation of the program in Fig. 2 using auto-parallelization. As mentioned, this results in a parallelized outer loop, in which the dot products for different rows are statically or dynamically scheduled across processors.

The flat implementation is obtained by flattening. To flatten, we replace the nested sequence representation of the matrix with a linearized representation: an array of (value, column) pairs, partitioned into rows by a segment descriptor. Application of the flattening transformations to the loop in Fig. 2 yields a computation built around a segmented reduction, a data-parallel operation with efficient parallel implementations [3]. By substituting the elementwise products for the first argument in the body of the reduction, the sum and product may be fused into a segmented dot-product. The resulting algorithm was implemented in Fortran 90 for our two target architectures.

For the SGI Origin 200, the array of products is divided into sections, several per processor, where the number of sections per processor is a factor chosen to improve the load balance in the presence of multiprogramming and operating system overhead on the processors. Sections are processed independently and dot products are computed sequentially within each section. Sums for segments spanning sections are adjusted after all sections are summed.

For the NEC SX-4, the array is divided into sections whose number is determined by the vector length required by the vector units [5]. Each section occupies one element position of a full-length vector per thread, so prefix dot-products are computed independently for all sections using a sequence of vector additions on each processor. Segment dot-products are computed from the prefix dot-products, and sums for segments spanning sections are adjusted after all sections are summed [25]. On the SX-4, the extra load-balancing factor is typically not needed since the operating system performs gang-scheduling and the threads experience very similar progress rates.

Fig. 4: Performance in MFLOPS of the flat and pointer-based implementations on the SGI Origin 200 for 1, 2, and 4 processors: constant number of nonzeros per row [20] with varying matrix size (top), and varying number of nonzeros per row with constant matrix size [20000] (bottom).

4.4 Results

The SGI Origin 200 used is a 4-processor cache-based shared-memory multiprocessor. The processors are 180 MHz R10000 with 1 MB L2 cache per processor. The NEC SX-4 used is a 16-processor shared-memory parallel vector processor with vector length 256. Each processor has a vector unit that can perform 8 or 16 memory reads or writes per cycle. The clock rate is 125 MHz. The memory subsystem provides sufficient sustained bandwidth to simultaneously service independent references from all vector units
at the maximum rate.

The performance on square sparse matrices of both implementations is shown for 1, 2, and 4 processors for the Origin 200 in Fig. 4 and for the SX-4 in Fig. 5. The top graph of each figure shows the performance as a function of problem size in megaflops per second, where the number of floating-point operations for the problem is twice the number of nonzeros. Each row contains an average of 20 nonzeros and the number of rows is varied between 1000 and 175000. The bottom graph shows the influence of the average number of nonzeros per row on the performance of the code. To measure this, we chose a fixed matrix size (20000 rows) and varied the average number of nonzeros on each row between 5 and 175. In each case, the performance reported is averaged over 50 different matrices.

Fig. 5: Performance in MFLOPS of the flat and pointer-based implementations on the NEC SX-4 for 1, 2, and 4 processors: constant number of nonzeros per row [20] with varying matrix size (top), and varying number of nonzeros per row with constant matrix size [20000] (bottom).

On the Origin 200 the flattened implementation performed at least as well as the pointer-based version over most inputs. The absolute performance of neither implementation is particularly impressive. The sparse matrix-vector problem is particularly tough for processors with limited memory bandwidth since there is no temporal locality in the use of the matrix (within a single matrix-vector product), and the locality in reference to the dense vector diminishes with increasing problem size. While reordering may mitigate these effects in some applications, it has little effect for the random matrices used here. The Origin 200 implementations also do not exhibit good parallel scaling. This is likely a function of limited memory bandwidth that must be shared among the processors. Higher performance can be obtained with further tuning. For example, the current compiler does not perform optimizations to map the value and column components of the element type into separate arrays. When applied manually, this optimization increases performance by 25% or more.

On the SX-4 the flattened implementation performs significantly better than the pointer implementation over all inputs. This is because the flattened implementation always operates on full-sized vectors (provided the problem is large enough), while the pointer-based implementation performs vector operations whose length is determined by the number of nonzeros in a row. Hence the pointer-based implementation is insensitive to problem size but improves with average row length. For the flattened implementation, absolute performance and parallel scaling are good primarily because the memory system has sufficient bandwidth and the full-sized vector operations fully amortize the memory access latencies.

Next, we examined the performance on two different inputs. The regular input is a square sparse matrix in which each row has an average of 36 randomly placed nonzeros, for a total of 900000 nonzeros. The irregular input is a square sparse matrix in which each row has 20 randomly placed nonzeros, but now 20 consecutive rows near the top of the matrix contain 20000 nonzeros each. Thus the total number of nonzeros is again 900000, but in this case nearly half of the work lies in less than 0.1% of the dot products. The performance of the two implementations is shown in Fig. 6.

Fig. 6: Performance of the flat and pointer-based implementations on the regular and irregular inputs, (a) SGI Origin 200 and (b) NEC SX-4.

The pointer-based implementation for the Origin 200 is significantly slower for the irregular problem,
regardless of the thread scheduling technique used (dynamic or static). The problem is that a small "bite" of the iteration space may contain a large amount of work, leading to a load imbalance that may not be correctable using a dynamic scheduling technique. In the case of the SX-4 pointer-based implementation this effect is not as noticeable, since the dot product of a dense row operates nearly two orders of magnitude faster than the dot product of a row with few nonzeros. The flattened implementation delivers essentially the same performance for both problems on the Origin 200. The SX-4 performance in the irregular case is reduced.
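As a concrete sketch of the flattening idea, the following C code uses a linearized representation with a segment descriptor and computes the product section by section. The names and the sequential sectioning scheme are invented for illustration; the paper's actual flat implementation is produced by source-to-source transformation and tuned separately for the Origin 200 and the SX-4.

#include <stddef.h>
#include <string.h>

/* Linearized (flattened) representation: all nonzeros in one array,
   with seg[i] .. seg[i+1]-1 delimiting row i (a segment descriptor). */
typedef struct {
    size_t nrows, nnz;
    const size_t *seg;   /* length nrows + 1                           */
    const size_t *col;   /* column index of each nonzero, length nnz   */
    const double *val;   /* value of each nonzero, length nnz          */
} flat_t;

/* Sectioned segmented dot-product: the nnz elementwise products are cut
   into nsect equal slices, so the work per slice is balanced no matter
   how unevenly the nonzeros are distributed over rows.  Each product is
   added exactly once here; a parallel version would give each slice
   private partial sums for its first and last rows and combine them
   afterwards, which is the adjustment of segments spanning sections
   described above. */
void flat_matvec(const flat_t *A, const double *x, double *y, size_t nsect)
{
    memset(y, 0, A->nrows * sizeof *y);
    size_t row = 0;
    for (size_t s = 0; s < nsect; s++) {
        size_t lo = s * A->nnz / nsect, hi = (s + 1) * A->nnz / nsect;
        for (size_t k = lo; k < hi; k++) {
            while (A->seg[row + 1] <= k)   /* advance to the row owning k */
                row++;
            y[row] += A->val[k] * x[A->col[k]];
        }
    }
}

Because the sections partition the nonzero products rather than the rows, each worker gets an almost equal amount of work even for the irregular input above; the price is the bookkeeping for rows that straddle a section boundary.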
TPOT Translucent Proxying of TCP Pablo Rodriguez

TPOT: Translucent Proxying of TCP
Pablo Rodriguez (EURECOM, France, rodrigue@eurecom.fr), Sandeep Sibal and Oliver Spatscheck (AT&T Labs–Research, sibal, spatsch@)

Abstract

Transparent Layer-4 proxies are being widely deployed in the current Internet to enable a vast variety of applications. These include Web proxy caching, transcoding, service differentiation, and load balancing. To ensure that all IP packets of an intercepted TCP connection are seen by the intercepting transparent proxy, they must sit at focal points in the network. Translucent Proxying of TCP (TPOT) overcomes this limitation by using TCP options and IP tunneling to ensure that all IP packets belonging to a TCP connection will traverse the proxy that intercepted the first packet. This guarantee allows the ad-hoc deployment of TPOT proxies anywhere within the network. No extra signaling support is required for its correct functioning. In addition to the advantages TPOT proxies offer at the application level, they also usually improve the throughput of intercepted TCP connections. In this paper we discuss the TPOT protocol, explain how it enables various applications, describe a prototype implementation, analyze its impact on the performance of TCP, and address scalability issues.

1 Introduction and Related Work

Transparent proxies are commonly used in solutions where an application is to be proxied in a manner that is completely oblivious to a client, without requiring any prior configuration. Recently, there has been a great deal of activity in the area of transparent proxies for Web caching. Several vendors in the area of Web proxy caching have announced dedicated Web proxy switches and appliances [1,2,8,12].

In the simplest scenario, a transparent proxy intercepts all TCP connections that are routed through it. This may be refined by having the proxy intercept TCP connections destined only for specific ports (e.g., 80 for HTTP), or for a specific set of destination addresses. The proxy responds to the client request, masquerading as the remote web server.

A TPOT proxy, on seeing such a SYN packet, intercepts it. The ACK packet that it returns to the source carries the proxy's IP address stuffed within a TCP-OPTION. On receiving this ACK, the source sends the rest of the packets via the intercepting proxy over an IP tunnel. The protocol is discussed in detail in Section 2.

The above mechanism will work if the client is TPOT enabled. In a situation where the client is not TPOT enabled, we may still be able to use TPOT. As long as the client is single-homed, and has a proxy at a focal point, we can TPOT-enable the connection by having the proxy behave like a regular transparent proxy on the side facing the client, but a TPOT (translucent) proxy on the side facing the server. Implementation of such a proxy is covered in Section 3.

The general idea of using TCP-OPTIONs as a signaling scheme for proxies is not new [20]. However, combining this idea with IP tunneling to pin down the path of a TCP connection has not been proposed before to the best of our knowledge. One alternative to TPOT is the use of Active Network techniques [31]. We believe that TPOT is a relatively lightweight solution that does not require an overhaul of existing IP networks. In addition, TPOT can be deployed incrementally in the current IP network, without disrupting other Internet traffic.

1.1 Applications of TPOT

In addition to allowing the placement of transparent Web proxy caches anywhere in the network, TPOT also enables newer architectures that employ Web proxy networks. In such architectures a proxy located along the path from the client to
the server simply picks up the request and satisfies it from its own cache, or lets it pass through. This, in turn, may be picked up by another proxy further down the path. These incremental actions lead to the dynamic construction of spontaneous hierarchies rooted at the server. Such architectures require the placement of multiple proxies within the network, not just at their edges and gateways. Existing proposals [15,21,33] either need extra signaling, or they simply assume that all packets of the connection will pass through an intercepting proxy. Since TPOT explicitly provides this guarantee, implementing such architectures with TPOT is elegant and easy. With TPOT no extra signaling support or prior knowledge of neighboring proxies is required.

While the original motivation for TPOT was to enable Web proxy caching and Web proxy caching networks in an oblivious and ad-hoc fashion, the general idea of TPOT can also be applied to enable proxy-based services in a variety of other applications layered on top of TCP in an elegant and efficient fashion.

One such use is Transcoding. This refers to a broad class of problems that involve some sort of adaptation of content (e.g., [13,23]), where content is transformed so as to increase transfer efficiency, or is distilled to suit the capabilities of the client. Another similar use is the notion of enabling a transformer tunnel [30] over a segment of the path within which data transfer is accomplished through some alternate technique that may be better suited to the specific properties of the link(s) traversed. Proposals that we know of in this space require one end-point to explicitly know of the existence of the other end-point, requiring either manual configuration or some external signaling/discovery protocol. TPOT can accomplish such functionality in a superior fashion. In TPOT an end-point non-invasively flags a connection, signifying that it can transform content, without actually performing any transformation. Only if and when a second TPOT proxy (capable of handling this transformation) sees this flag and notifies the first proxy of its existence, does the first proxy begin to transform the connection. Note that this does not require any additional handshake to operate correctly, since the TPOT mechanism plays out in concert with TCP's existing 3-way handshake.

Another use of TPOT is to enable the selection of specific applications for preferential treatment. Such service differentiation could be used to enable and enforce QoS policies. One might also want to prioritize traffic belonging to an important set of clients, or a set of mission-critical servers.

1.2 Paper Overview

Section 2 describes the TPOT protocol. In addition to the basic version, a pipelined version of the protocol is also discussed. Pathological cases, extensions, and limitations are also studied. Section 3 details a prototype implementation of TPOT in Scout [24]. We use this prototype in all our experiments. We address the TCP-level performance of TPOT in Section 4 using both theoretical analysis as well as experiments. Contrary to what we initially expected, TPOT typically improves the performance of TCP connections. This apparently counter-intuitive result has been observed before [3,7,16], though in somewhat different contexts.
In [3] a modified TCP stack called Indirect TCP is employed for mobile hosts to combat problems of mobility and unreliability of wireless links. Results show that employing Indirect TCP outperforms regular TCP. In [16] similar improvements are reported for the case when TCP connections over a satellite link are split using a proxy. Finally, in [7], the authors discuss at length how TCP performance may be enhanced by using proxies for HFC networks. The notion of inserting proxies with the sole reason of enhancing performance has recently led to the coining of the term Performance Enhancing Proxies (PEP). An overview is provided in [5]. As we will see in Section 4, TPOT does indeed enhance performance, but unlike PEP, this is not the motivation behind TPOT.

Scalability is an important criterion if TPOT is to be practically deployed. Section 5 discusses our approach to solving this problem using a technique that we call TPARTY, which employs a farm of servers that sit behind a front-end machine. The front-end machine only farms out requests to the army of TPOT machines that sit behind it. We show that our solution scales almost linearly with the number of TCP connections in the region of interest.

Finally, Section 6 highlights our major contributions, discusses future work, and possible extensions to TPOT.

2 The TPOT Protocol

This section describes the operation of the basic and pipelined versions of the TPOT protocol. Pathological cases, extensions, and limitations are also studied. Before describing the operation of the TPOT protocol, we provide a brief background of IP and TCP which will help in better understanding TPOT. See [29] for a detailed discussion of TCP.

2.1 IP and TCP

Each IP packet typically contains an IP header and a TCP segment. The IP header contains the packet's source and destination IP address. The TCP segment itself contains a TCP header. The TCP header contains the source port and the destination port that the packet is intended for. This 4-tuple of the IP addresses and port numbers of the source and destination uniquely identifies the TCP connection that the packet belongs to. In addition, the TCP header contains a flag that indicates whether it is a SYN packet, and also an ACK flag and sequence number that acknowledges the receipt of data from its peer. Finally, a TCP header might also contain TCP-OPTIONs that can be used for custom signaling.

In addition to the above basic format of an IP packet, an IP packet can also be encapsulated in another IP packet. At the source, this involves prefixing an IP header carrying the IP address of an intermediate tunnel point onto an IP packet. On reaching the intermediate tunnel point, the IP header of the intermediary is stripped off. The (remaining) IP packet is then processed as usual. See RFC 2003 [27] for a longer discussion.

2.2 TPOT: Basic Version

In the basic version of TPOT, consider a source S that intends to connect with a destination D via TCP, as shown in Figure 1(a). Assume that the first (SYN) packet sent out by S to D reaches the intermediary TPOT proxy T. (X, X_p, Y, Y_p) is the notation that we use to describe a packet that is headed from X to Y, and has X_p and Y_p as the source and destination ports respectively. To co-exist peacefully with other end-points that do not wish to talk TPOT, we use a special TCP-OPTION "TPOT" that a source uses to explicitly indicate to TPOT proxies within the network, such as T, that it is interested in using the TPOT mechanism. If T does not see this option, it will take no action, and simply forwards the packet on to D on its fast-path. If T sees a SYN packet that has the TCP-OPTION "TPOT" set, it responds to S with a
SYN-ACK that encodes its own IP address in the TCP-OPTION field. On receiving this packet, S must then send the remaining packets of that TCP connection IP-tunneled to T. From an implementation standpoint this would imply adding another 20-byte IP header, with T's IP address as destination address, to all packets that S sends out for that TCP connection. Since this additional header is removed at the next TPOT proxy, the total overhead is limited to 20 bytes regardless of the number of TPOT proxies intercepting the connection from the source to the final destination. This overhead can be further reduced by IP header compression [10,18].

For applications such as Web Caching where T may be able to satisfy a request from S, the response is simply served from one or more caches attached to T. In the case of a "cache miss", or for other applications where T might connect to D after inspecting some data, T communicates with the destination D as shown in Figure 1(a). Note that the proxy T sets the TCP-OPTION "TPOT" in its SYN to D to allow possibly another TPOT proxy along the way to again proxy the connection. Note that Figure 1 only shows the single-proxy scenario.

2.3 TPOT: Pipelined Version

In certain situations one can do better than the basic version of the TPOT protocol. It is possible for T to pipeline the handshake by sending out the SYN to D immediately after receiving the SYN from S. This pipelined version of TPOT is depicted in Figure 1(b). The degree of pipelining depends on the objective of the proxying mechanism. In the case of an L-4 proxy for Web Caching, the initial SYN contains the destination IP address and port number. Since L-4 proxies do not inspect the content, no further information is needed from the connection before deciding a course of action. In such a situation a SYN can be sent out by T to D almost immediately after T received a SYN from S, as shown in Figure 1(b). In the case of L-7 switching, however, the proxy would need to inspect the HTTP Request (or at a minimum the URL in the Request). Since this is typically not sent with the SYN, a SYN sent out to D can only happen after the first ACK is received by T from S. This is consistent with Figure 1.

2.4 Pathological Cases

While the typical operation of TPOT appears correct, we are aware of two pathological cases that also need to be addressed.

Figure 1: The TPOT protocol, between Source (S, S_p), Intermediary (T, T_p), and Destination (D, D_p). (a) Basic version:
S -> T: SYN (S,S_p,D,D_p), tcp-option: TPOT
T -> S: SYN-ACK (D,D_p,S,S_p), tcp-option: T
S -> T: DATA (S,S_p,D,D_p), ip-tunneled via T
T -> D: SYN (T,T_p,D,D_p), tcp-option: TPOT
D -> T: SYN-ACK (D,D_p,T,T_p)
T -> D: DATA (T,T_p,D,D_p)
(b) Pipelined version: the same messages, except that T sends its SYN to D immediately after receiving the SYN from S.

1. In a situation when a SYN is retransmitted by S, it is possible that the retransmitted SYN is intercepted by T while the first SYN is not, or vice versa. In such a situation, S may receive SYN-ACKs from both D as well as T. In such a situation S simply ignores the second SYN-ACK, by sending a RST to the source of the second SYN-ACK.

2. Yet another scenario is a simultaneous open from S to D and vice versa that uses the same port number, where T further intercepts only one of the SYNs. This is a situation that does not arise in the client-server applications which we envision for TPOT. Since the source can turn on TPOT for only those TCP connections for which TPOT is appropriate, this scenario is not a cause for concern.

2.5 Extensions

As a further sophistication to the TPOT
protocol, it is possible for multiple proxied TCP connections at a client or proxy that terminate at the same (next-hop) proxy to integrate their congestion control and loss recovery at the TCP level. Mechanisms such as TCP-Int proposed in [4] can be employed in TPOT as well. Since the primary focus of TPOT, and this paper, is to enable proxy services on-the-fly, rather than enhance performance, we do not discuss this further. The interested reader is directed to [4] and [32] for such a discussion.

Note that an alternative approach is to multiplex several TCP connections onto a single TCP connection. This is generally more complex as it requires the demarcation of the multiple data-streams, so that they may be sensibly demultiplexed at the other end. Proposals such as P-HTTP [22] and MUX [14], which use this approach, may also be built into TPOT.

2.6 Limitations

As shown in Figure 1, the TCP connection that the intermediate proxy T initiates to the destination D will carry T's IP address. This defeats any IP-based access control or authentication that D may use. Note that this limitation is not germane to TPOT, and in general is true of any transparent or explicit proxying mechanism.

In a situation where the entire payload of an IP packet is encrypted, as is the case with IPsec, TPOT will simply not be enabled. This does not break TPOT, it simply restricts its use.

The purist may also object to TPOT breaking the semantics of TCP, since in TPOT a proxy in general interacts with S in a fashion that is asynchronous with its interaction with D. While it is possible to construct a version of TPOT that preserves the semantics of TCP, we do not pursue it here. In defense, we point to several applications that are prolific on the Internet today (such as firewalls) that are just as promiscuous as TPOT.

3 Implementing TPOT in Scout

TPOT can be implemented in any operating system. This section describes an implementation in an OS designed specifically to support communication: Scout [24]. While the primary purpose of this section is to flesh out some of the details any implementation would have to address, it has a secondary objective of illustrating how a technique like TPOT can be naturally realized in an operating system designed around communication-oriented abstractions. Many overheads and latency penalties incurred by proxies on general-purpose operating systems like Linux, BSD or WindowsNT can be avoided by such an operating system.

Scout is a configurable OS explicitly designed to support data flows, such as video streams through an MPEG player, or a pair of TCP connections through a firewall. Specifically, Scout defines a path abstraction that encapsulates data as they move through the system, for example, from input device to output device. In effect, a Scout path is an extension of a network connection through the OS.
Each path is an object that encapsulates two important elements: (1) it defines the sequence of code modules that are applied to the data as they move through the system, and (2) it represents the entity that is scheduled for execution.

The path abstraction lends itself to a natural implementation of TCP proxies. Figure 2 schematically depicts a naive implementation of a proxy in Scout. It consists of two paths: one connecting the first network interface to the proxy, and another connecting the proxy to a second network interface. In this figure, the path has a source and a sink queue, and is labeled with the sequence of software modules that define how the path transforms the data it carries.

Figure 2: TCP proxy in two Scout paths (module stacks PROXY / TCP / IP / NET1 and PROXY / TCP / IP / NET2).

As a first approximation, the configuration of Scout shown in Figure 2 represents the implementation one would expect in a traditional OS. The two-path configuration shown in Figure 2 has suboptimal performance because it requires the hand-off of each

Figure 4: TPOT implementation in Scout (paths built from TCP-with-TPOT, FWD, IP, IP-in-IP, and NET1/NET2 modules).

buffer is also set to 32 KByte to match the values used by the Linux client and server.

4 Performance Measurements

This section analyzes the TCP performance of TPOT based on actual measurements in lab setups using prototype TPOT proxies. Wherever relevant, we compare the observed performance with expected values suggested by theoretical results on the performance of idealized TCP.

In our experiments we use the Reno flavor of TCP [29], which is generally considered to be the most popular implementation on the Internet today. We expect our observations to largely hold for other flavors of TCP, though it is quite possible that flavors of TCP such as TCP-Vegas [6], which have different congestion detection and avoidance techniques, may yield somewhat different numbers. The primary focus of the following experiments is to evaluate the performance benefits and penalties in the presence of realistic round-trip times (RTTs) and packet losses, when one or more TPOT machines intercept a TCP connection. For these experiments the TPOT machines are not overloaded. Section 5 discusses overload situations, and techniques of scaling TPOT to combat this. For our experiments we tested the pipelined version of TPOT (see Section 2). In the worst case the basic version would incur an additional delay of half a round-trip time. Aside from this, the two versions of TPOT yield similar results.

4.1 Setup

All hosts used in our experiment are at least 200 MHz Pentium II workstations with 256 KB cache, 32 MB RAM, and 3COM 3x59 32-bit PCI 10/100 Mb/s adapters. The first TPOT machine runs the transparent proxy version of TPOT, while the second TPOT machine runs the interior TPOT version, both of which are described in the previous section. The clients and servers are Linux 2.2.12 machines. The physical configuration of our test setup is shown in Figure 5. The client is connected with a 10 Mbit hub to the first TPOT machine. The first TPOT machine is connected by another 10 Mbit hub to the second TPOT machine. The second TPOT machine is in turn connected by a 10 Mbit hub to the server.

Figure 5: Test setup (Client - Hub - TPOT 1 - Hub - TPOT 2 - Hub - Server).

The TPOT machines either operate as TPOT proxies or as simple routers. If they operate as TPOT proxies, the first TPOT machine enables the TPOT protocol and data is subsequently tunneled between the TPOT machines. Delays and losses are added in the device driver code of each TPOT device. The granularity of the delay queue
is 1 ms. For throughput measurements TTCP is used to measure the throughput on the receiver. TTCP transfers a specified amount of data from the client to the server. After all the data has been transferred, the connection is closed. The results reported for each experiment are averaged over ten runs. The Linux TCP code implements the Timestamp option, which is not supported by Scout. We believe that the impact of this shortcoming is minor in our test environments, which by design have low RTT variances. Despite the fact that both Linux and Scout advertise the SACK option during the initial handshake, tcpdump traces show that SACK was not used during the data transfer phase of the TCP connections in any of the experiments.

4.2 Impact of RTT

To measure the impact of the RTT, we introduced delay into the output queue of the Ethernet devices on the TPOT machines. In one set of experiments the TPOT machines work exclusively as routers, and in the second set exclusively as TPOT proxies. In the second experiment the added delay is either equally distributed over all links, or is concentrated on a single link between the two TPOT machines.

Figure 6: Throughput for different RTTs for 10 KB (top) and 10 MB (bottom) document sizes.

Figure 6 shows the throughput for RTTs from 1-300 ms for 10 KB and 10 MB document sizes. Smaller documents are not measured since the connection establishment time dominates the experiment. The impact on small documents is discussed in Section 4.5. The results show that if the entire RTT is concentrated at the single link between the two TPOT machines, the throughput is on average 24% worse for 10 KB documents and 6% worse for 10 MB documents. This is not surprising since the TPOT machines need to perform additional processing during connection setup, which gets amortized over the lifetime of the connection, in addition to the processing for each packet. On the other hand, in the case when the RTT is equally distributed over the links, we find that TPOT improves the overall throughput. For example, the TPOT throughput for a 300 ms RTT and 10 MB documents is 92% better than the routed throughput for the same RTT.

Theoretical Analysis

To better understand this phenomenon we turn our attention to results in the literature that analyze the performance of TCP using idealized models. Note that this section is intended as a theoretical backing for our study, and is not intended as a comprehensive or formal analysis of TCP. In [11] the authors provide a rough sketch for the throughput of an idealized TCP connection in the congestion avoidance phase. A more rigorous derivation of this and a few other results may be found in [25]. The authors of [26] model TCP throughput in a more comprehensive fashion, taking into account TCP timeouts as well. We use the terminology of [26] in what follows.

Let p_i and RTT_i be the packet loss and RTT on link i, and T_i be the corresponding throughput in packets per second. Also, let W_m be the maximum advertised window size. Let the constant b be the number of packets acknowledged by each ACK. Then in steady state, as per [26]:

T_i = min( W_m / RTT_i , 1 / ( RTT_i * sqrt(2*b*p_i/3) ) )    (1)

Note that the above equation ignores timeouts. Including timeouts does not change the nature of the analysis that follows. A detailed discussion is beyond the scope of this paper. For
connections with a high RTT the advertised window size W_m becomes the bottleneck, so that the above equation reduces to:

T_i = W_m / RTT_i

If the scaled window becomes so large that it is no longer the bottleneck, the send window will become the limiting factor. This case is discussed in the next section.

4.3 Impact of Loss

While the advertised window size was the determining factor of throughput for high-RTT connections in the previous experiment, the goal of this experiment is to demonstrate that TPOT also performs better if the sender's congestion window, and not the receiver's advertised window, limits throughput. To study this scenario we randomly drop packets in a uniform and independent fashion in the Ethernet device driver of the TPOT machines. In this experiment no artificial delay is introduced. This results in an RTT of 1 ms between the client and server due to the real delay on the Ethernet and TPOT machines. The idea here is to simulate packet losses either due to lossy links or packet loss due to buffer overflows along the path. Again we measure the performance of end-to-end TCP using the TPOT machines configured exclusively as routers, or as TPOT proxies. In the case where they are configured as proxies, the loss is either equally distributed between the links, or is concentrated on the link between the two TPOT machines.

Figure 7: Throughput for different drop rates.

Figure 7 depicts the results of this experiment for 10 MB document sizes for different loss rates. The experiment for 10 KB is not reported since the results were highly variable. This is because of the timeout behavior of TCP SYN packets, and the fact that the total number of packets transferred is low. Figure 7 shows that the Router version is slightly better than the TPOT proxy version with packet loss concentrated on the central link. We believe this is due to the overhead involved in introducing TPOT proxies. However, when the packet loss is equally distributed, the TPOT proxy version outperforms the Router version by far. Note that this shows up only for throughput values below 600 KBps, since above this, the 10 Mbps Ethernet dominates the picture.

Theoretical Analysis

In this situation, the RTT of each link is the same; let us denote this by RTT. When the throughput is not dominated by W_m, Equation 1 reduces to:

T_i = 1 / ( RTT * sqrt(2*b*p_i/3) )

The end-to-end throughput is thus determined by the most lossy link. Note that in this case the overall loss probability is conserved, so that the per-link loss probabilities add up to the loss probability of the routed path. The results of the experiments roughly corroborate this. For the 10 MB document size, we see that after the throughput drops below the Ethernet saturation point, the equally distributed packet loss case outperforms the router case significantly, in fact slightly more than theoretically expected.

Table 1: RTT and packet drop rate distributions for the experiments (Cases I-III).

A transfer of size 100 KB was measured from the client to the server.
The TPOT machines, as in the previous experiments, were either used as routers or as TPOT proxies. The RTT and packet drop rate distribution was the same in either case.

Table 2: 100 KB transfer for Cases I-III with and without TPOT.

Table 2 shows the results of the experiment. Not surprisingly, the throughput remains the same for all cases when the TPOT machines are configured as routers. It does not make any difference where the data is lost or where RTT delays are introduced, as long as the end-to-end RTT and loss are equal. Also not surprisingly, the throughput of the proxied TCP increases by more than a factor of three due to the reduction of the individual TCP connections' drop rates and RTTs. This case study also shows that additional benefits can be derived from the TPOT architecture. Cases I and II can be implemented without TPOT since all proxies can be placed on focal points in the network. However, Case III is possible only if the connection is TPOT enabled, since in Case III the proxy/switch is in the middle of the network.

4.5 Small Data Transfers

Another important question is how TPOT affects small data transfers. The previous experiments have shown that throughput increases when TPOT proxies are used. However, for small files, the connection establishment overhead dominates the overall performance, and sustained throughput rates become irrelevant. To measure the effect of TPOT on small file transfers, the setup of the previous experiments was used. However, instead of TTCP, which transfers data in one direction, we used a TCP ping test which returns the data back to the sender, simulating an HTTP Request followed by a short HTTP Response. The total time from before the open system call to after the close system call on the client side of the connection was measured using the hardware cycle counter of the Pentium II processor.

Table 3 shows the results for three transfer sizes and two different values for RTT. The delay was equally distributed over the links.

Table 3: Measured transfer times for three transfer sizes (1 B, 1 KB, 10 KB) and two RTT values, routed and with TPOT.
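As a rough numerical companion to the theoretical analysis in Section 4, here is a small C sketch of the idealized steady-state model (Equation 1 with the timeout term dropped); the window size, the per-ACK constant b, and the example loss/RTT values are assumptions chosen for illustration, not measurements from the paper.

#include <math.h>
#include <stdio.h>

/* Idealized TCP throughput in packets/s for one link or segment:
   min( Wm/RTT , 1/(RTT*sqrt(2*b*p/3)) ), i.e. Equation 1 without
   timeouts.  wm = max advertised window in packets, b = packets
   acknowledged per ACK, p = loss probability, rtt in seconds.      */
static double tcp_rate(double wm, double b, double p, double rtt)
{
    if (p <= 0.0)
        return wm / rtt;                        /* window-limited only */
    return fmin(wm / rtt, 1.0 / (rtt * sqrt(2.0 * b * p / 3.0)));
}

int main(void)
{
    double wm = 22.0, b = 2.0;   /* e.g. a 32 KB window of 1460-byte packets */
    /* End-to-end path vs. the same path split by a TPOT proxy into two
       segments, each with half the RTT and half the loss; the split
       connection runs at the rate of its slower segment (here equal). */
    double end_to_end  = tcp_rate(wm, b, 0.02, 0.300);
    double per_segment = tcp_rate(wm, b, 0.01, 0.150);
    printf("end-to-end: %.1f pkt/s, split: %.1f pkt/s\n", end_to_end, per_segment);
    return 0;
}

In this model, halving both the RTT and the loss improves the loss-limited rate of each segment by roughly a factor of 2*sqrt(2), which is in the same direction as the improvements observed in Figures 6 and 7.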
Hardware Implementation of Convolutional Neural Networks (CNN)

– The depth of the FIFO determines the longest time that each segment may take
2009/Architecture
• VALU:
– All of the basic operations of a convolutional neural network are implemented here, mainly including 2-D convolution with accumulation, 2-D spatial pooling and subsampling using average and max filters, and pointwise nonlinear transformation
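A minimal C sketch of the three kinds of operations listed above, i.e., 2-D convolution with accumulation, 2x2 max pooling for subsampling, and a pointwise nonlinearity; the map sizes, the stride-1 "valid" convolution, and the choice of tanh are illustrative assumptions rather than details taken from the architecture.

#include <math.h>

/* 2-D convolution of one input feature map with a K x K kernel,
   accumulating into a caller-initialized output map (valid region,
   stride 1). */
void conv2d_acc(int H, int W, int K,
                float in[H][W], float ker[K][K],
                float out[H - K + 1][W - K + 1])
{
    for (int y = 0; y <= H - K; y++)
        for (int x = 0; x <= W - K; x++)
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    out[y][x] += in[y + i][x + j] * ker[i][j];
}

/* 2x2 max pooling: spatial subsampling by 2 in each dimension. */
void maxpool2x2(int H, int W, float in[H][W], float out[H / 2][W / 2])
{
    for (int y = 0; y < H / 2; y++)
        for (int x = 0; x < W / 2; x++) {
            float m = in[2 * y][2 * x];
            if (in[2 * y][2 * x + 1] > m) m = in[2 * y][2 * x + 1];
            if (in[2 * y + 1][2 * x] > m) m = in[2 * y + 1][2 * x];
            if (in[2 * y + 1][2 * x + 1] > m) m = in[2 * y + 1][2 * x + 1];
            out[y][x] = m;
        }
}

/* Pointwise nonlinear transform applied to a whole map of N values. */
void pointwise_tanh(int N, float a[N])
{
    for (int i = 0; i < N; i++)
        a[i] = tanhf(a[i]);
}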
2013B/A Memory-Centric Architecture—ICCD
• The effect of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads.
• Some configuration parameters affect the hardware implementation and are therefore fixed after synthesis, for example the number of PEs, their interconnection, the supported addressing modes, the buffer configuration, and the buffer depths.
• The paper uses the Vivado HLS (AutoESL) tool to quickly evaluate accelerator configurations (presumably in order to find the best configuration); the accelerator is described at a high level in C, and the configurable aspects include the partitioning of the flexible memory space and the pipelining of the accumulating PEs.
– The configurable acceleration module for CNNs has flexible data-reuse buffers; this module supports the different computation parameters found in CNN workloads and can be configured to match the external memory bandwidth
• A memory-centric design flow to synthesize and program the accelerator template. Our design flow uses quick design space exploration to optimize on-chip memory size and data reuse.
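To illustrate the data-reuse idea behind the flexible buffers, here is a hedged C sketch of a tiled convolution loop in the style one might feed to an HLS tool: the input tile is first copied into a small on-chip buffer and then reused for every kernel position inside the tile. The tile size, kernel size, and buffer layout are made-up parameters, not the accelerator's actual configuration.

#define TILE 16          /* output tile edge, chosen to fit the on-chip buffer */
#define K    5           /* kernel edge */
#define BUF  (TILE + K - 1)

/* One output tile of a valid convolution over an image of width W
   (output width W - K + 1).  Each input pixel in the buffer is read
   K*K times from on-chip memory but fetched from external memory only
   once per tile, which is the reuse a flexible buffer hierarchy exploits. */
void conv_tile(int W, const float *in, const float ker[K][K],
               float *out, int ty, int tx)
{
    float buf[BUF][BUF];                       /* on-chip tile buffer        */
    for (int y = 0; y < BUF; y++)              /* burst-load the input tile  */
        for (int x = 0; x < BUF; x++)
            buf[y][x] = in[(ty + y) * W + (tx + x)];

    for (int y = 0; y < TILE; y++)
        for (int x = 0; x < TILE; x++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += buf[y + i][x + j] * ker[i][j];
            out[(ty + y) * (W - K + 1) + (tx + x)] = acc;
        }
}

Each external-memory word is fetched once per tile but read K*K times from the buffer; sizing the on-chip buffers to maximize exactly this kind of reuse is what the memory-centric design space exploration optimizes.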
WinMIPS64 Help Document

Using WinMIPS64 Simulator: A Simple Tutorial

This exercise introduces WinMIPS64, a Windows based simulator of a pipelined implementation of the MIPS64 64-bit processor.

1. Starting and configuring WinMIPS64

Start WinMIPS64 from the task bar. A window (denoted the main window) appears with seven child windows and a status line at the bottom. The seven windows are Pipeline, Code, Data, Registers, Statistics, Cycles and Terminal. To make sure the simulation is reset, click on the File menu and click Reset MIPS64.

WinMIPS64 can be configured in many ways. You can change the structure and time requirements of the floating-point pipeline, and the code/data memory size. To view or change standard settings click Configure/Architecture (read this as: click Configure to open the menu, then clicking on Architecture) and you will see the following settings: You can change the settings by clicking in the appropriate field and editing the given numbers. Any changes to the floating-point latencies will be reflected in the Pipeline window. The Code Address Bus refers to the actual number of wires in the address bus, so a value of 10 means that 2^10 = 1024 bytes of code memory will be displayed in the Code window. When you are finished, click OK to return to the main window.

Four more options in the Configuration menu can be selected: Multi-Step, Enable Forwarding, Enable Branch Target Buffer and Enable Delay Slot. Of these Enable Forwarding should be enabled, that is, a small hook should be shown beside it. If this is not the case, click on the option. You can change the size and/or position of child windows or bring up only one window using the maximise option for that window.

2. Loading a test program.

Use a standard text editor to create this file sum.s, which is a MIPS64 program that calculates the sum of two integers A and B from memory, and stores the result into memory at location C.

.data
A: .word 10
B: .word 8
C: .word 0
.text
main:
ld r4,A(r0)
ld r5,B(r0)
dadd r3,r4,r5
sd r3,C(r0)
halt

A small command line utility asm.exe is provided to test a program for syntactical correctness. To check this program type

H:\>asm sum.s

In order to be able to start the simulation, the program must be loaded into the main memory. To accomplish this, select File/Open. A list of assembler programs in the current directory appears in a window, including sum.s. To load this file into WinMIPS64, do the following:
▪ Click on sum.s
▪ Click the Open button
The program is now loaded into the memory and the simulation is ready to begin. You can view the content of code memory using the Code window, and observe the program data in the Data window.

3. Simulation

3.1 Cycle-by-cycle Simulation

At any stage you can press F10 to restart the simulation from the beginning. At the start you will note that the first line in the Code window with the address 0000 is coloured yellow. The IF stage in the Pipeline window is also coloured in yellow and contains the assembler mnemonic of the first instruction in the program. Now inspect the Code window and observe the first instruction ld r4,A(r0). Look in the Data window to find the program variable A.

Clock 1: Pressing Execute/Single Cycle (or simply pressing F7) advances the simulation for one time step or one clock tick; in the Code Window, the colour of the first instruction is changed to blue and the second instruction is coloured in yellow.
These colours indicate the pipeline stage the instruction is in (yellow for IF, blue for ID, red for EX, green for MEM, and purple for WB). If you look in the IF stage in the Pipeline window, you can see that the second instruction ld r5,B(r0) is in the IF stage and the first instruction ld r4,A(r0) has advanced to the second stage, ID.

Clock 2: Pressing F7 again will re-arrange the colours in the Code window, introducing red for the third pipeline stage EX. Instruction dadd r3,r4,r5 enters the pipeline. Note that the colour of an instruction indicates the stage in the pipeline that it will complete on the next clock tick.

Clock 3: Pressing F7 again will re-arrange the colours in the Code window, introducing green for the fourth pipeline stage MEM. Instruction sd r3,C(r0) enters the pipeline. Observe the Clock Cycle Diagram which shows a history of which instruction was in each stage before each clock tick.

Clock 4: Press F7 again. Each stage in the pipeline is now active with an instruction. The value that will end up in r4 has been read from memory, but has not yet been written back to r4. However it is available for forwarding from the MEM stage. Hence observe that r4 is displayed as green (the colour for MEM) in the Registers window. Can you explain the value of r4? Note that the last instruction halt has already entered the pipeline.

Clock 5: Press F7 again. Something interesting happens. The value destined for r5 becomes available for forwarding. However the value for r5 was not available in time for the dadd r3,r4,r5 instruction to execute in EX. So it remains in EX, stalled. The status line reads "RAW stall in EX (R5)", indicating where the stall occurred, and which register's unavailability was responsible for it. The picture in the Clock Cycle Diagram and the Pipeline window clearly shows that the dadd instruction is stalled in EX, and that the instructions behind it in the pipeline are also unable to progress. In the Clock Cycle Diagram, the dadd instruction is highlighted in blue, and the instructions behind are shown in gray.

Clock 6: Press F7. The dadd r3,r4,r5 instruction executes and its output, destined for r3, becomes available for forwarding. This value is 12 hex, which is the sum of 10+8 = 18 in decimal. This is our answer.

Clock 7: Press F7. The halt instruction entering IF has had the effect of "freezing" the pipeline, so no new instructions are accepted into it.

Clock 8: Press F7. Examine Data memory, and observe that the variable C now has the value 12 hex. The sd r3,C(r0) instruction wrote it to memory in the MEM stage of the pipeline, using the forwarded value for r3.

Clock 9: Press F7.

Clock 10: Press F7. The program is finished.

Look at the Statistics window and note that there has been one RAW stall. 10 clock cycles were needed to execute 5 instructions, so CPI=2. This is artificially high due to the one-off start-up cost in clock cycles needed to initially fill the pipeline.

The statistics window is extremely useful for comparing the effects of changes in the configuration. Let us examine the effect of forwarding in the example. Until now, we have used this feature; what would the execution time have been without forwarding? To accomplish this, click on Configure. To disable forwarding, click on Enable Forwarding (the hook must vanish). Repeat the cycle-by-cycle program execution, re-examine the Statistics window and compare the results. Note that there are more stalls as instructions are held up in ID waiting for a register, and hence waiting for an earlier instruction to complete WB.
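For intuition about these stall counts, here is a deliberately simplified C model of RAW-hazard stalls in an in-order five-stage pipeline. It is not WinMIPS64's implementation (the simulator models register timing, halt, and branches more carefully); the assumption that operands are needed at the start of EX, with forwarding from the end of EX or MEM, is only an approximation.

#include <stdio.h>

/* Simplified RAW-stall model for an in-order, single-issue 5-stage pipeline
   (IF ID EX MEM WB).  Assumed rules: an instruction needs its sources at
   the start of EX; with forwarding an ALU result is ready after the
   producer's EX and a load result after its MEM; without forwarding every
   result is ready only after WB. */
typedef struct { int dst, src1, src2, is_load; } instr_t;   /* -1 = no register */

static int count_stalls(const instr_t *prog, int n, int forwarding)
{
    int ready[32] = {0};   /* cycle by whose end each register's value is ready */
    int prev_ex = 0, stalls = 0;
    for (int i = 0; i < n; i++) {
        int need = 0;      /* latest "ready" cycle among this instruction's sources */
        if (prog[i].src1 >= 0 && ready[prog[i].src1] > need) need = ready[prog[i].src1];
        if (prog[i].src2 >= 0 && ready[prog[i].src2] > need) need = ready[prog[i].src2];
        int ex = (prev_ex + 1 > i + 3) ? prev_ex + 1 : i + 3;  /* earliest EX cycle */
        if (need >= ex) { stalls += need - ex + 1; ex = need + 1; }
        prev_ex = ex;
        if (prog[i].dst >= 0)
            ready[prog[i].dst] = forwarding ? ex + prog[i].is_load : ex + 2;
    }
    return stalls;
}

int main(void)
{
    /* The tutorial program: ld r4,A ; ld r5,B ; dadd r3,r4,r5 ; sd r3,C
       (the base register r0 and the halt instruction are omitted). */
    instr_t prog[] = { {4, -1, -1, 1}, {5, -1, -1, 1}, {3, 4, 5, 0}, {-1, 3, -1, 0} };
    printf("RAW stalls with forwarding:    %d\n", count_stalls(prog, 4, 1));
    printf("RAW stalls without forwarding: %d\n", count_stalls(prog, 4, 0));
    return 0;
}

With these assumptions the example program incurs the single load-use stall seen in the tutorial when forwarding is on, and noticeably more when every result must wait for WB.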
3.2 Other execution modes

Click on File/Reset MIPS64. If you click on File/Full Reset, you will delete the data memory, so you will have to repeat the procedure for program loading. Clicking on File/Reload, or pressing F10, is a handy way to restart a simulation.

You can run the simulation for a specified number of cycles. Use Execute/Multi cycle... for this. The number of cycles stepped through can be changed via Configure/Multi-step. You can run the whole program with a single key-press - press F4. Alternatively, click on Execute/Run to.

You can also set breakpoints. Press F10. To set a break-point, double-left-click on the instruction, for example on dadd r3,r4,r5. Now press F4. The program will halt when this instruction enters IF. To clear the break-point, double-left-click on the same instruction again.

3.3 Terminal Output

The simulator supports a simple I/O device, which works like a simple dumb terminal screen with some graphical capability. The output of a program can appear on this screen. To output the result of the previous program, modify it like this:

.data
A: .word 10
B: .word 8
C: .word 0
CR: .word32 0x10000
DR: .word32 0x10008
.text
main:
ld r4,A(r0)
ld r5,B(r0)
dadd r3,r4,r5
sd r3,C(r0)
lwu r1,CR(r0) ; Control Register
lwu r2,DR(r0) ; Data Register
daddi r10,r0,1
sd r3,(r2) ; r3 output..
sd r10,(r1) ; .. to screen
halt

After this program is executed you can see the result of the addition printed in decimal on the Terminal window. For a more complete example of the I/O capabilities, see the testio.s and hail.s example programs.

The Instruction set

The following assembler directives are supported:

.data - start of data segment
.text - start of code segment
.code - start of code segment (same as .text)
.org <n> - start address
.space <n> - leave n empty bytes
.asciiz <s> - enters zero terminated ascii string
.ascii <s> - enters ascii string
.align <n> - align to n-byte boundary
.word <n1>,<n2>.. - enters word(s) of data (64-bits)
.byte <n1>,<n2>.. - enters bytes
.word32 <n1>,<n2>.. - enters 32 bit number(s)
.word16 <n1>,<n2>.. - enters 16 bit number(s)
.double <n1>,<n2>.. - enters floating-point number(s)

where <n> denotes a number like 24, <s> denotes a string like "fred", and <n1>,<n2>.. denotes numbers separated by commas.

The integer registers can be referred to as r0-r31, or R0-R31, or $0-$31, or using standard MIPS pseudo-names, like $zero for r0, $t0 for r8, etc. Note that the size of an immediate is limited to 16 bits. The maximum size of an immediate register shift is 5 bits (so a shift by more than 31 bits is illegal). Floating-point registers can be referred to as f0-f31, or F0-F31.
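To tie these directives to the terminal interface from Section 3.3, here is a small sketch that prints a string. It relies on the CONTROL = 4 mode listed in the memory-mapped I/O section further below, and it assumes the assembler accepts a data label (msg) as the immediate of daddi, as in the distributed example programs; if it does not, the string's address can instead be stored in memory as a .word32 constant and loaded from there. The labels CR, DR, and msg are simply names chosen for this example.

.data
CR:  .word32 0x10000 ; address of the CONTROL register
DR:  .word32 0x10008 ; address of the DATA register
msg: .asciiz "sum done" ; zero-terminated string to print
.text
main:
lwu r1,CR(r0) ; r1 = address of CONTROL
lwu r2,DR(r0) ; r2 = address of DATA
daddi r3,r0,msg ; r3 = address of the string (label used as immediate)
daddi r10,r0,4 ; control code 4 = print string
sd r3,(r2) ; DATA <- address of the string
sd r10,(r1) ; CONTROL <- 4 triggers the output
halt

Stepping through it, the string should appear on the Terminal window when the store to CONTROL completes.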
The following instructions are supported. Note reg is an integer register, freg is a floating-point (FP) register, and imm is an immediate value.

lb reg,imm(reg) - load byte
lbu reg,imm(reg) - load byte unsigned
sb reg,imm(reg) - store byte
lh reg,imm(reg) - load 16-bit half-word
lhu reg,imm(reg) - load 16-bit half-word unsigned
sh reg,imm(reg) - store 16-bit half-word
lw reg,imm(reg) - load 32-bit word
lwu reg,imm(reg) - load 32-bit word unsigned
sw reg,imm(reg) - store 32-bit word
ld reg,imm(reg) - load 64-bit double-word
sd reg,imm(reg) - store 64-bit double-word
l.d freg,imm(reg) - load 64-bit floating-point
s.d freg,imm(reg) - store 64-bit floating-point
halt - stops the program
daddi reg,reg,imm - add immediate
daddui reg,reg,imm - add immediate unsigned
andi reg,reg,imm - logical and immediate
ori reg,reg,imm - logical or immediate
xori reg,reg,imm - exclusive or immediate
lui reg,imm - load upper half of register immediate
slti reg,reg,imm - set if less than immediate
sltiu reg,reg,imm - set if less than immediate unsigned
beq reg,reg,imm - branch if pair of registers are equal
bne reg,reg,imm - branch if pair of registers are not equal
beqz reg,imm - branch if register is equal to zero
bnez reg,imm - branch if register is not equal to zero
j imm - jump to address
jr reg - jump to address in register
jal imm - jump and link to address (call subroutine)
jalr reg - jump and link to address in register
dsll reg,reg,imm - shift left logical
dsrl reg,reg,imm - shift right logical
dsra reg,reg,imm - shift right arithmetic
dsllv reg,reg,reg - shift left logical by variable amount
dsrlv reg,reg,reg - shift right logical by variable amount
dsrav reg,reg,reg - shift right arithmetic by variable amount
movz reg,reg,reg - move if register equals zero
movn reg,reg,reg - move if register not equal to zero
nop - no operation
and reg,reg,reg - logical and
or reg,reg,reg - logical or
xor reg,reg,reg - logical xor
slt reg,reg,reg - set if less than
sltu reg,reg,reg - set if less than unsigned
dadd reg,reg,reg - add integers
daddu reg,reg,reg - add integers unsigned
dsub reg,reg,reg - subtract integers
dsubu reg,reg,reg - subtract integers unsigned
dmul reg,reg,reg - signed integer multiplication
dmulu reg,reg,reg - unsigned integer multiplication
ddiv reg,reg,reg - signed integer division
ddivu reg,reg,reg - unsigned integer division
add.d freg,freg,freg - add floating-point
sub.d freg,freg,freg - subtract floating-point
mul.d freg,freg,freg - multiply floating-point
div.d freg,freg,freg - divide floating-point
mov.d freg,freg - move floating-point
cvt.d.l freg,freg - convert 64-bit integer to a double FP format
cvt.l.d freg,freg - convert double FP to a 64-bit integer format
c.lt.d freg,freg - set FP flag if less than
c.le.d freg,freg - set FP flag if less than or equal to
c.eq.d freg,freg - set FP flag if equal to
bc1f imm - branch to address if FP flag is FALSE
bc1t imm - branch to address if FP flag is TRUE
mtc1 reg,freg - move data from integer register to FP register
mfc1 reg,freg - move data from FP register to integer register

Memory Mapped I/O area

Addresses of the CONTROL and DATA registers:

CONTROL: .word32 0x10000
DATA: .word32 0x10008

Set CONTROL = 1, set DATA to the unsigned integer to be output
Set CONTROL = 2, set DATA to the signed integer to be output
Set CONTROL = 3, set DATA to the floating-point value to be output
Set CONTROL = 4, set DATA to the address of the string to be output
Set CONTROL = 5, set DATA+5 to the x coordinate, DATA+4 to the y coordinate, and DATA to the RGB colour to be output
Set CONTROL = 6, clears the terminal screen
Set CONTROL = 7, clears the graphics screen
Set CONTROL = 8, reads the DATA (either an integer or a floating-point value) from the keyboard
Set CONTROL = 9, reads one byte from DATA, with no character echo

Notes on the Pipeline Simulation

The pipeline simulation attempts to mimic as far as possible that described in Appendix A of Computer Architecture: A Quantitative Approach. However, in a few places alternative strategies were suggested, and we had to choose one or the other. Stalls are handled where they arise in the pipeline, not necessarily in ID.

We decided to allow floating-point instructions to issue out of ID into their own pipelines, if available. There they either proceed or stall, waiting for their operands to become available. This has the advantage of allowing out-of-order completion to be demonstrated, but it can cause WAR hazards to arise. However, the student can thus learn the advantages of register renaming. Consider this simple program fragment:

.text
add.d f7,f7,f3
add.d f7,f7,f4
mul.d f4,f5,f6 ; WAR on f4

If the mul.d is allowed to issue, it could "overtake" the second add.d and write to f4 first. Therefore in this case the mul.d must be stalled in ID.

Structural hazards arise at the MEM stage bottleneck, as instructions attempt to exit more than one of the execute-stage pipelines at the same time. Our simple rule is longest latency first. See page A-52.

Installation

On your own computer, just install anywhere convenient, and create a short-cut to point at it. Note that winmips64 will write two initialisation files into this directory: winmips64.ini, which stores architectural details, and winmips64.las, which remembers the last .s file accessed.

On a network drive, install winmips64.exe into a suitable system directory. Then use a standard text editor to create a file called winmips64.pth, and put this file in the same directory. The read-only file winmips64.pth should contain a single-line path to a read-write directory private to any logged-in user. This directory will then be used to store each user's .ini and .las files. For example winmips64.pth might contain

H:

or

c:\temp

But remember - only a single line; don't press return at the end of it!
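Returning briefly to the pipeline notes above, here is a short program you can single-step to watch out-of-order completion. It is only a sketch and assumes the default floating-point latencies (the multiplier pipeline being longer than the adder); the labels one and two are simply names chosen for this example.

.data
one: .double 1.0
two: .double 2.0
.text
main:
l.d f1,one(r0) ; load the two operands
l.d f2,two(r0)
mul.d f3,f1,f2 ; issued first, but uses the long multiplier pipeline
add.d f4,f1,f2 ; issued later, uses the shorter adder pipeline
halt

With the default latencies the add.d should write back before the earlier mul.d, which you can confirm in the Clock Cycle Diagram; there is no WAR hazard here because the add.d writes a register that the mul.d does not read.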
An Implementation of Pipelined Parallel Processing System for Multi-Access Memory System

Hyung Lee1, Hyeon-Koo Cho2, Dae-Sang You1, and Jong-Won Park1

1 Department of Information & Communications Engineering, Chungnam National University, 220 Gung-Dong Yusung-Gu, Daejeon, 305-764, KOREA
Tel.: +82-42-821-7793, Fax.: +82-42-825-7792
2 Virtual I Tech. Inc., Room 503, Engineering Building 3, 220 Gung-Dong Yusung-Gu, Daejeon, 305-764, KOREA
e-mail: {hyung, dsyou, hkcho, jwpark}@u.ac.kr

Abstract: We have been developing a variety of parallel processing systems in order to improve the processing speed of visual media applications. These systems use a multi-access memory system (MAMS) as a parallel memory system, which provides simultaneous access to the image points in a line segment with an arbitrary orientation, as required by many low-level image processing operations such as edge or line detection in a particular direction. However, the performance of these systems fell short of the expected speed because of the asynchronous operation of MAMS and the processing elements. To improve the processing speed of these systems, we have developed a pipelined parallel processing system using MAMS. Although the system, like the earlier systems, is of the single instruction multiple data (SIMD) type, it achieved about 2.5 times their speed.

1. Introduction

There already exists a variety of machines capable of performing high-speed image applications. In general, these machines can be divided into two classes [1]. The first basically comprises two-dimensional (2-D) array processors that operate on an entire image or subimage in a set of parallel processes. Examples of this type of machine are CLIP, MPP, and PIXIE-5000. In general, all of these machines can also be considered as being of the single instruction multiple data (SIMD) type. The main drawback of 2-D array processors is their cost. In addition, due to the inherently serial nature of the input image data, full utilization of the processors may not be attained.

The second class of machines, local-window processors, scans an image and performs operations on a small neighborhood window. Examples of such machines include MITE, PIPE, and Cytocomputer. Note that, with this type of processor, an increase in image size requires a quadratic increase in processor speed in order to maintain a constant processing speed.

Most of the above local-window processors are general-purpose in nature in that they are programmable. Although these general-purpose cellular machines are flexible because of their programmability, they do not provide simultaneous access to the image points in a line segment with an arbitrary orientation, which is required in many low-level image processing operations such as edge or line detection in a particular direction. We have been developing a parallel processing system of the pipelined SIMD type with a multi-access memory system (MAMS) that satisfies this requirement of simultaneous access to image points.

In this paper, we propose a 5-stage pipelined parallel processing system built around MAMS, which accesses data elements simultaneously in three access types with a constant interval. Each processing element (PE) is designed with two states, one for memory-access instructions and one for general instructions.
Although the two states operate in parallel, the cycle time of general instructions depends on that of the memory-access instructions, because memory-access instructions have to access data via MAMS.

The remainder of this paper is organized as follows. Section 2 introduces the multi-access memory system, which has been redesigned for the proposed system, and the proposed pipelined parallel processing system is described in Section 3. Section 4 presents experimental results obtained by simulation. Finally, we conclude the paper in Section 5, followed by the references.

2. Multi-Access Memory System

For a parallel processing system with PEs, it is necessary to use an MAMS [2,3] to reduce the memory access time. The memory system also has the important goal of providing efficient utilization of the n = pq PEs of the proposed pipelined parallel processing system, where p and q are design parameters. The goals are as follows: various access types with a constant interval between the data elements, simultaneous access with no restriction on the location, simple and fast address calculation and routing circuitry, and a small number of memory modules.

The memory system consists of a memory module selection circuitry, a data routing circuitry for WRITE, an address calculation and routing circuitry, memory modules, and a data routing circuitry for READ. In order to distribute the data elements of the M × N array I(*,*) among the m = pq + 1 memory modules, a memory module assignment function must place in distinct memory modules those array elements that are to be accessed simultaneously. Also, an address assignment function must allocate different addresses to array elements assigned to the same memory module.

The redesigned MAMS is implemented as a three-stage pipeline, because the parallel processing system to be introduced in Section 3 is designed around a pipelined architecture. In the case of sequential memory operations, memory access times are therefore reduced compared with the original design. The block diagram of the multi-access memory system is presented in Figure 1.

Figure 1. Block diagram of the Multi-Access Memory System.

3. Pipelined Parallel Processing System

To perform high-speed visual applications, we have developed a variety of parallel processing systems using MAMS. However, the performance of these systems fell short of the expected speed because the two modules, the MAMS module and the module containing the processing elements, operate asynchronously [4,5]. That is, when a memory instruction was followed by a general instruction in the earlier parallel processing systems, the general instruction had to wait until the previous memory instruction had completed. Although these systems provide efficient simultaneous access to data in a logical 2-D memory array, their memory access time is longer than that of a conventional memory system without parallelism, because every access must pass through MAMS. This was the main drawback limiting the achievable speedup, as dictated by Amdahl's law.
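As a reminder of why this matters (with illustrative numbers only, not measurements from any of these systems), Amdahl's law bounds the overall speedup when a fraction of the execution time cannot be accelerated:

\[ S = \frac{1}{(1 - f) + f/s} \;\le\; \frac{1}{1 - f} \]

where f is the fraction of execution time that the parallel hardware can speed up by a factor s. If, say, half of the run time is serialized memory traffic, then 1 - f = 0.5 and no number of additional PEs can push the speedup beyond 2.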
To improve the processing speed of these systems, we have developed a pipelined parallel processing system together with MAMS.

The proposed pipelined parallel processing system consists of a processing unit, made up of a processor module (Motorola MPC860P processor) for global processing and for controlling all devices, and a PCI controller (PLX9056) for transferring data to and from the host computer; a local memory, which stores instructions and common data and provides programmability; a DMA controller, which has a set of registers to fetch instructions from the local memory and issue them to the n PEs; n PEs, which interpret and execute instructions synchronously; a multi-access memory controller (MAMC), which provides n data elements to the n PEs simultaneously; and m external memory modules, which store the n data elements to be manipulated.

The processor unit (PU) controls the system and communicates with the host computer via the PCI bus. The DMA controller fetches an instruction, stores it in a register pool, synchronously transfers data or instructions to the n PEs, and controls them. Each PE is designed as an ALU with the ability to interpret an issued instruction. To run an application, the DMA controller steals bus cycles until the end of the application.

A PE can execute two kinds of instructions: memory-reference instructions, for accessing the m external memory modules via the MAMC, and 16 general instructions, including register-reference instructions and I/O instructions. Therefore, an application to be processed on the system is compiled into operation codes drawn from 18 instructions. When two instructions belong to different instruction sets, they are executed at the same time; that is, a memory-reference instruction and a general instruction are executed simultaneously. Hence, a memory-reference instruction followed by a general instruction is executed within one memory access cycle, and vice versa. This reduces processing time in kernels such as convolution mask operators, which are frequently used in the spatial domain.

The system offers system programmers a logically two-dimensional addressing scheme, as used in the (r,c)-based image domain, because MAMS removes the semantic gap. Thanks to this feature, most spatial-domain image processing operations can be carried out with sufficient parallel processing power. The block diagram of the system is presented in Figure 2.

Figure 2. Block diagram of the Pipelined Parallel Processing System.

4. Experiments

To verify the performance of the proposed system, we chose a convolution mask operator, which is frequently used in image processing but is one of its more time-consuming operations. We transformed the operator's code into a form adapted to the proposed system in order to obtain a parallel version of the operation. The code is presented in Code 1. Notice that two instructions, one memory-reference instruction and one general instruction, are executed simultaneously.

Code 1.
Read (1,0,0)    ValTran AC 0
Read (1,1,0)    Add rd_reg
Read (1,2,0)    Add rd_reg
......
NOP             Add rd_reg
NOP             Div rd_reg
Write (1,1,1)   Nop

The dedicated parallel processing system was described in Verilog-HDL and simulated with the CADENCE Verilog-XL hardware simulation package in order to verify its functionality. The system was then compiled and fitted into an EPF10K200EGC599-1 device with MAX+PLUS II. The final simulation was performed after obtaining the delay files and performing back-annotation. The waveform generated during the simulation is illustrated in Figure 3.
Figure 3. A waveform obtained through post-layout simulation.

The waveform shows that instruction fetch, a memory-access instruction, and an arithmetic-logic instruction were overlapped. It also shows that the execution time of an application on the proposed system depends on the memory access time. In the simulation results, the proposed system was about 2.5 times faster than the earlier systems [4,5]. Unfortunately, we cannot present real measured values because the manufactured circuit board has not yet been verified. This work is in progress, and the test environment is depicted in Figure 4.

5. Conclusions

The demand for processing multimedia data in real time on a unified and scalable architecture is ever increasing with the proliferation of multimedia applications. We have been developing a variety of parallel processing systems in order to improve the processing speed of visual media applications. These systems use MAMS as a parallel memory system, which provides simultaneous access to the image points in a line segment with an arbitrary orientation. However, the performance of these systems fell short of the expected speed because of the asynchronous operation of MAMS and the processing elements.

Figure 4. The test board used to verify the system.

To improve the processing speed of these systems, we have developed a pipelined parallel processing system using MAMS. Although the system, like the earlier systems, is of the SIMD type, it achieved about 2.5 times their speed. The comparison values mentioned above, and the speedups achieved for each application on the system, were obtained through simulation; they are only estimates, because the proposed system has not yet been verified on the manufactured circuit board.

Unfortunately, some problems remain in transferring large numbers of data elements between the host and the system while an application is being processed; the data transfer time exceeds the processing time. To solve this, the bus bandwidth needs to be improved on the system side, and transfer methods specific to the system need to be developed on the methodology side.

References

[1] Alexander C. P. Loui, et al., "Flexible Architecture for Morphological Image Processing and Analysis," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 1, Mar. 1992.
[2] J. W. Park, "An efficient memory system for image processing," IEEE Transactions on Computers, Vol. C-35, No. 7, pp. 33-39, 1986.
[3] J. W. Park and D. T. Harper III, "An Efficient Memory System for SIMD Construction of a Gaussian Pyramid," IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 7, July 1996.
[4] Hyung Lee, K. A. Moon, and J. W. Park, "Design of parallel processing system for facial image retrieval," 4th International ACPC'99, Salzburg, Austria, Feb. 1999.
[5] Hyung Lee and J. W. Park, "A study on Parallel Processing System for Automatic Segmentation of Moving Object in Image Sequences," ITS-CSCC 2000, Vol. 2, pp. 429-432, July 2000.