RECONFIGURABLE RADIO WITH FPGA-BASED APPLICATION-SPECIFIC PROCESSORS
数字示波器外文翻译文献
数字示波器外文翻译文献(文档含中英文对照即英文原文和中文翻译)原文:Design and FPGA implementation of a wireless hyperchaotic communication system for secure real-time image transmission AbstractIn this paper, we propose and demonstrate experimentally a new wireless digital encryption hyperchaotic communication system based on radio frequency (RF) communication protocols for secure real-time data or image transmission. A reconfigurable hardware architecture is developed to ensure the interconnection between two field programmable gate array developmentplatforms through XBee RF modules. To ensure the synchronization and encryption of data between the transmitter and the receiver, a feedback masking hyperchaotic synchronization technique based on a dynamic feedback modulation has been implemented to digitally synchronize the encrypter hyperchaotic systems. The obtained experimental results show the relevance of the idea of combining XBee (Zigbee or Wireless Fidelity) protocol, known for its high noise immunity, to secure hyperchaotic communications. In fact, we have recovered the information data or image correctly after real-time encrypted data or image transmission tests at a maximum distance (indoor range) of more than 30 m and with maximum digital modulation rate of 625,000 baud allowing a wireless encrypted video transmission rate of 25 images per second with a spatial resolution of 128 ×128 pixels. The obtained performance of the communication system is suitable for secure data or image transmissions in wireless sensor networks.IntroductionOver the past decades, the confidentiality of multimedia communications such as audio, images, and video has become increasingly important since communications of digital products over the network (wired/wireless) occur more frequently. Therefore, the need for secure data and transmission is increasing dramatically and defined by the required levels of security depending on the purpose of communication. To meet these requirements, a wide variety of cryptographic algorithms have been proposed. In this context, the main challenge of stream cipher cryptography relates to the generation of long unpredictable key sequences. More precisely, the sequence has to be random, its period must be large, and the various patterns of a given length must be uniformly distributed over the sequence. Traditional ciphers like DES, 3DES, IDEA, RSA, or AES are less efficient for real-time secure multimedia data encryption systems and exhibit some drawbacks and weakness in the high streamdata encryption. Indeed, the increase and availability of a high-power computation machine allow a force brute attack against these ciphers. Moreover, for some applications which require a high-levelcomputation and where a large computational time and high computing power are needed (for example, encryption of large digital images), these cryptosystems suffer from low-level efficiency. Consequently, these encryption schemes are not suitable for many high-speed applications due to their slow speed in real-time processing and some other issues such as in the handling of various data formatting. Over the recent years, considerable researches have been taken to develop new chaotic or hyperchaotic systems and for their promising applications in real-time encryption and communication. In fact, it has been shown that chaotic systems are good candidates for designing cryptosystems with desired properties. The most prominent is sensitivity dependence on initial conditions and system parameters, and unpredictable trajectories.Furthermore, chaos-based and other dynamical systembased algorithms have many important properties such as the pseudorandom properties, ergodicity and nonperiodicity. These properties meet some requirements such as sensitivity to keys, diffusion, and mixing in the cryptographic context. Therefore, chaotic dynamics is expected to provide a fast and easy way for building superior performance cryptosystems, and the properties of chaotic maps such as sensitivity to initial conditions and random-like behavior have attracted the attention to develop data encryption algorithms suitable for secure multimedia communications. Until recently, chaotic communication has been a subject of major interest in the field of wireless communications. Many techniques based on chaos have been proposed such as additive chaos masking (ACM), where the analog message signal is added to the output of the chaos generator within the transmitter. In, chaos shift keying is used where the binary message signal selects the carrier signal from two or more different chaotic attractors. Authors use chaotic modulation where the message information modulates a parameter of the chaotic generator. Chaos control methods rely on the fact that small perturbations cause the symbolic dynamics of a chaotic system to track a prescribed symbol sequence. In, the receiver system is designed in an inverse manner to ensure the recovery of theencryption signal. An impulsive synchronization scheme is employed to synchronize chaotic transmitters and receivers. However, all of these techniques do not provide a real and practical solution to the challenging issue of chaotic communication which is based on extreme sensitivity of chaotic synchronization to both the additive channel noise and parameter mismatches. Precisely, since chaos is sensitive to small variations of its initial conditions and parameters, it is very difficult to synchronize two chaotic systems in a communication scheme. Some proposed synchronization techniques have improved the robustness to parameter mismatches as reported in, where impulsive chaotic synchronization and an open-loop-closed-loopbased coupling scheme are proposed, respectively. Other authors proposed to improve the robustness of chaotic synchronization to channel noise, where a coupled lattice instead of coupled single maps is used to decrease the master-slave synchronization error. In, symbolic dynamics-based noise reduction and coding are proposed. Some research into equalization algorithms for chaotic communication systems are also proposed. For other related results in the literature, see. However, none of them were tested through a real channel under real transmission conditions. Digital synchronization can overcome the failed attempts to realize experimentally a performed chaotic communication system. In particular, when techniques exhibit any difference between the master/transmitter and slave/receiver systems, it is due to additive information or noise channel (disturbed chaotic dynamics) which breaks the symmetry between the two systems, leading to an accurate non-recovery of the transmitted information signal at the receiver. In, an original solution to the hard problem of chaotic synchronization high sensibility to channel noise has been proposed. This solution, based on a controlled digital regenerated chaotic signal at the receiver, has been tested and validated experimentally in a real channel noise environment through a realized wireless digital chaotic communication system based on zonal intercommunication global-standard, where battery life was long, which was economical to deploy and which exhibited efficient use of resources, knownas the ZigBee protocol. However, this synchronization technique becomes sensible to high channel noise from a higher transmission rate of 115 kbps, limiting the use of the ZigBee and Wireless Fidelity (Wi-Fi) protocols which permit wireless transmissions up to 250 kbps and 65 Mbps, respectively.Consequently, no reliable commercial chaos-based communication system is used to date to the best of our knowledge. Therefore, there are still plentiful issues to be resolved before chaos-based systems can be put into practical use. To overcome these drawbacks, we propose in this paper a digital feedback hyperchaotic synchronization and suggest the use of advanced wireless communication technologies, characterized by high noise immunity, to exploit digital hyperchaotic modulation advantages for robust secure data transmissions. In this context, as results of the rapid growth of communication technologies, in terms of reliability and resistance to channel noise, an interesting communication protocol for wireless personal area networks (WPANs, i.e., ZigBee or ZigBee Pro Low-Rate-WPAN protocols) and wireless local area network (WLAN, i.e., Wi-Fi protocol WLAN) is developed. These protocols are identified by the IEEE 802.15.4 and IEEE 802.11 standards and known under the name ZigBee and Wi-Fi communication protocols, respectively. These protocols are designed to communicate data through hostile Radio Frequency (RF) environments and to provide an easy-to-use wireless data solution characterized by secure, low-power, and reliable wireless network architectures. These properties are very attractive for resolving the problems of chaotic communications especially the high noise immunity property. Hence, our idea is to associate chaotic communication with theWLAN or WPAN communication protocols. However, this association needs a numerical generation of the chaotic behavior since the XBee protocol is based on digital communications.In the hardware area, advanced modern digital signal processing devices, such as field programmable gate array (FPGA), have been widely used to generate numerically the chaotic dynamics or the encryption keys. The advantage of these techniques is that the parameter mismatch problem does not existcontrary to the analog techniques. In addition, they offer a large possible integration of chaotic systems in the most recent digital communication technologies such as the ZigBee communication protocol. In this paper, a wireless hyperchaotic communication system based on dynamic feedback modulation and RF XBee protocols is investigated and realized experimentally. The transmitter and the receiver are implemented separately on two Xilinx Virtex-II Pro circuits and connected with the XBee RF module based on the Wi-Fi or ZigBee protocols. To ensure and maintain this connection, we have developed a VHSIC (very high speed integrated circuit) hardware description language (VHDL)-based hardware architecture to adapt the implemented hyperchaotic generators, at the transmitter and receiver, to the XBee communication protocol. Note that the XBee modules interface to a host device through a logic-level asynchronous serial port. Through its serial port, the module can communicate with any logic and voltage-compatible Universal Asynchronous Receiver/Transmitter (UART). The used hyperchaotic generator is the well-known and the most investigated hyperchaotic Lorenz system. This hyperchaotic key generator is implemented on FPGA technology using an extension of the technique developed in for three-dimensional (3D) chaotic systems. This technique is optimal since it uses directly VHDL description of a numerical resolution method of continuous chaotic system models. A number of transmission tests are carried out for different distances between the transmitter and receiver. The real-time results obtained validate the proposed hardware architecture. Furthermore, it demonstrates the efficiency of the proposed solution consisting on the association of wireless protocols to hyperchaotic modulation in order to build a reliable digital encrypted data or image hyperchaotic communication system.Hyperchaotic synchronization and encryption techniqueContrary to a trigger-based slave/receiver chaotic synchronization by the transmitted chaotic masking signal, which limits the performance of the rate synchronization transmission, we propose a digital feedback hyperchaoticsynchronization (FHS). More precisely, we investigate a new scheme for the secured transmission of information based on master-slave synchronization of hyperchaotic systems, using unknown input observers. The proposed digital communication system is based on the FHS through a dynamic feedback modulation (DFM) technique between two Lorenz hyperchaotic generators. This technique is an extension and improvement of the one developed in for synchronizing two 3D continuous chaotic systems in the case of a wired connection.The proposed digital feedback communication scheme synchronizes the master/transmitter and the slave/receiver by the injection of the transmitted masking signal in the hyperchaotic dynamics of the slave/receiver. The basic idea of the FHS is to transmit a hyperchaotic drive signal S(t) after additive masking with a hyperchaotic signal x(t) of the master (transmitter) system (x , y , z ,w ). Hyperchaotic drive signal is then injected both in the three subsystems (y , z ,w ) and (r r r w z y ,,). The subscript r represents the slave or receiver system (r r r r w z y x ,,,). At the receiver, the slave system regenerates the chaotic signal )(t x r and a synchronization is obtained between two trajectories x(t) and )(t x r if()()0||lim =-∞→t X t X r t (1) This technique can be applied to chaotic modulation. In our case, it is used for generating hyperchaotic keys for stream cipher communications, where the synchronization between the encrypter and the decrypter is very important. Therefore, at the transmitter, the transmitted signal after the additive hyperchaos masking (digital modulation) isS(t) = x(t) + d(t). (2)where d(t) is the information signal and x(t) is the hyperchaotic carrier. At the receiver, after synchronization of the regenerated hyperchaotic signal )(t x rwith the received signal )(t S r and the demodulation operation, we can recover the information signal d(t) correctly as follows:)()()(t x t S t d r r -=. (3)Therefore, the slave/receiver will generate a hyperchaotic behavior identical to that of the master/transmitter allowing to recover correctly the information signal after the demodulation operation. The advantageof this technique is that the information signal d(t) doesnot perturb the hyperchaotic generator dynamics, contraryto the ACM-based techniques of and, because d(t) is injected at both the master/transmitter and slave/receiver after the additive hyperchaotic masking. Thus, for small values of information magnitude, the information will be recovered correctly. It should be noted that we have already confirmed this advantage by testing experimentally the HS-DFM technique performances for synchronizing hyperchaotic systems (four-dimensional (4D) continuous chaotic systems) in the case of wired connection between two Virtex-II Pro development platforms. After many experimental tests and from the obtained real-time results, we concluded that the HS-DFM is very suitable for wired digital chaotic communication systems. However, in the present work, one of the objectives is to test and study the performances of the HS-DFM technique in the presence of channel noise through real-time wireless communication tests. To performthe proposed approach, a digital implementation of the master and slave hyperchaotic systems is required. Therefore, we investigate the hardware implementation of the proposed FHS-DFM technique between two Lorenz hyperchaotic generators using FPGA. To achieve this objective, we propose the following details of the proposed architecture.译文:无线超混沌通信系统安全的实时图像传输的设计和FPGA实现摘要在本文中,我们提出并论证了一种基于无线电频率通信协议对数据或图像安全实时传输的新的无线数字超混沌加密通信系统。
A survey of CORDIC algorithms for FPGA based computers
A survey of CORDIC algorithms for FPGA based computersRay AndrakaAndraka Consulting Group, Inc16 Arcadia DriveNorth Kingstown, RI 02852401/884-7930 FAX 401/884-7950email:randraka@1.ABSTRACTThe current trend back toward hardware intensive signal processing has uncovered a relative lack of understanding of hardware signal processing architectures. Many hardware efficient algorithms exist, but these are generally not well known due to the dominance of software systems over the past quarter century. Among these algorithms is a set of shift-add algorithms collectively known as CORDIC for computing a wide range of functions including certain trigonometric, hyperbolic, linear and logarithmic functions. While there are numerous articles covering various aspects of CORDIC algorithms, very few survey more than one or two, and even fewer concentrate on implementation in FPGAs. This paper attempts to survey commonly used functions that may be accomplished using a CORDIC architecture, explain how the algorithms work, and explore implementation specific to FPGAs.1.1KeywordsCORDIC, sine, cosine, vector magnitude, polar conversion 2.INTRODUCTIONThe digital signal processing landscape has long been dominated by microprocessors with enhancements such as single cycle multiply-accumulate instructions and special addressing modes. While these processors are low cost and offer extreme flexiblility, they are often not fast enough for truly demanding DSP tasks. The advent of reconfigurable logic computers permits the higher speeds of dedicated hardware solutions at costs that are competitive with the traditional software approach. Unfortunately, algorithms optimized for these microprocessor based systems do not usually map well into hardware. While hardware-efficient solutions often exist, the dominance of the software systems has kept those solutions out of the spotlight. Among these hardware-efficient algorithms is a class of iterative solutions for trigonometric and other transcendental functions that use only shifts and adds to perform. The trigonometric functions are based on vector rotations, while other functions such as square root are implemented using an incremental expression of the desired function. The trigonometric algorithm is called CORDIC, an acronym for COordinate Rotation DIgital Computer. The incremental functions are performed with a very simple extension to the hardware architecture, and while not CORDIC in the strict sense, are often included because of the close similarity. The CORDIC algorithms generally produce one additional bit of accuracy for each iteration. The trigonometric CORDIC algorithms were originally developed as a digital solution for real-time navigation problems. The original work is credited to Jack Volder [4,9]. Extensions to the CORDIC theory based on work by John Walther[1] and others provide solutions to a broader class of functions. The CORDIC algorithm has found its way into diverse applications including the 8087 math coprocessor[7], the HP-35 calculator, radar signal processors[3] and robotics. CORDIC rotation has also been proposed for computing Discrete Fourier[4], Discrete Cosine[4], Discrete Hartley[10] and Chirp-Z [9] transforms, filtering[4], Singular Value Decomposition[14], and solving linear systems[1].This paper attempts to survey the existing CORDIC and CORDIC-like algorithms with an eye toward implementation in Field Programmable Gate Arrays (FPGAs). First a brief description of the theory behind the algorithm and the derivation of several functions is presented. Then the theory is extended to the so-called unified CORDIC algorithms, after which implementation of FPGA CORDIC processors is discussed.Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@. FPGA 98 Monterey CA USACopyright 1998 ACM 0-89791-978-5/98/01..$5.003. CORDIC THEORY: AN ALGORITHM FOR VECTOR ROTATIONAll of the trigonometric functions can be computed or derived from functions using vector rotations, as will be discussed in the following sections. Vector rotation can also be used for polar to rectangular and rectangular to polar conversions, for vector magnitude, and as a building block in certain transforms such as the DFT and DCT. The CORDIC algorithm provides an iterative method of performing vector rotations by arbitrary angles using only shifts and adds. The algorithm, credited to Volder[4], is derived from the general (Givens) rotation transform:x x y y y x ’cos sin ’cos sin =−=+φφφφwhich rotates a vector in a Cartesian plane by the angle φ.These can be rearranged so that:[][]x x y y y x ’cos tan ’cos tan =⋅−=⋅+φφφφSo far, nothing is simplified. However, if the rotation angles are restricted so that tan(φ)=±2-i , the multiplication by the tangent term is reduced to simple shift operation.Arbitrary angles of rotation are obtainable by performing a series of successively smaller elementary rotations. If the decision at each iteration, i, is which direction to rotate rather than whether or not to rotate, then the cos(δi ) term becomes a constant (because cos(δi ) = cos(-δi )). The iterative rotation can now be expressed as:[][]x K x y d y K yx d i i i i i i i iii i i+−+−=−⋅⋅=+⋅⋅1122 where:()K d i i ii ==+=±−−−cos tan 1221121Removing the scale constant from the iterative equations yields a shift-add algorithm for vector rotation. The product of the K i ’s can be applied elsewhere in the system or treated as part of a system processing gain. That product approaches 0.6073 as the number of iterations goes to infinity. Therefore, the rotation algorithm has a gain, A n ,of approximately 1.647. The exact gain depends on the number of iterations, and obeys the relationA n in=+−∏122The angle of a composite rotation is uniquely defined by the sequence of the directions of the elementary rotations. That sequence can be represented by a decision vector. The set of all possible decision vectors is an angular measurementsystem based on binary arctangents. Conversions between this angular system and any other can be accomplished using a look-up. A better conversion method uses an additional adder-subtractor that accumulates the elementary rotation angles at each iteration. The elementary angles can be expressed in any convenient angular unit. Those angular values are supplied by a small lookup table (one entry per iteration) or are hardwired, depending on the implementation. The angle accumulator adds a third difference equation to the CORDIC algorithm:()z z d i i i i+−−=−⋅112tan Obviously, in cases where the angle is useful in the arctangent base, this extra element is not needed.The CORDIC rotator is normally operated in one of two modes. The first, called rotation by Volder[4], rotates the input vector by a specified angle (given as an argument).The second mode, called vectoring, rotates the input vector to the x axis while recording the angle required to make that rotation.In rotation mode, the angle accumulator is initialized with the desired rotation angle. The rotation decision at each iteration is made to diminish the magnitude of the residual angle in the angle accumulator. The decision at each iteration is therefore based on the sign of the residual angle after each step. Naturally, if the input angle is already expressed in the binary arctangent base, the angle accumulator may be eliminated. For rotation mode, the CORDIC equations are:()x x y d y y x d z z d i i i i i i i i i i i i i i+−+−+−−=−⋅⋅=+⋅⋅=−⋅1111222tan whered i = -1 if z i < 0, +1 otherwisewhich provides the following result:[][]x A x z y z y A yz x z z A n n nnn n in=−=+==+−∏00000002012cos sin cos sin In the vectoring mode, the CORDIC rotator rotates the input vector through whatever angle is necessary to align the result vector with the x axis. The result of the vectoring operation is a rotation angle and the scaled magnitude of the original vector (the x component of the result). The vectoring function works by seeking to minimize the y component of the residual vector at each rotation. The signINTRODUCTION ABOUT THE TWO MODESof the residual y component is used to determine which direction to rotate next. If the angle accumulator is initialized with zero, it will contain the traversed angle at the end of the iterations. In vectoring mode, the CORDIC equations are:()x x y d y y x d z z d i i i i i i i i i i i i i i+−+−+−−=−⋅⋅=+⋅⋅=−⋅1111222tan whered i = +1 if y i < 0, -1 otherwise.Then:x A x y y z z y x A n n n n n in=+==+=+−−∏020201002012tan The CORDIC rotation and vectoring algorithms as stated are limited to rotation angles between -π/2 and π/2. This limitation is due to the use of 20 for the tangent in the first iteration. For composite rotation angles larger than π/2, an additional rotation is required. Volder[4] describes an initial rotation ±π/2. This gives the correction iteration:x d y y d x z z d ’’’=−⋅=⋅=+⋅π2whered = +1 if y <0, -1 otherwise.There is no growth for this initial rotation. Alternatively, an initial rotation of either π or 0 can be made, avoiding the reassignment of the x and y components to the rotator elements. Again, there is no growth due to the initial rotation:x d x y d y’’=⋅=⋅z’ = z if d = 1, or z - π if d = -1d = -1 if x <0, +1 otherwise.Both reduction forms assume a modulo 2π representation of the input angle. The style of first reduction is more consistent with the succeeding rotations, while the second reduction may be more convenient when wiring is restricted, as is often the case with FPGAs.The CORDIC rotator described is usable to compute several trigonometric functions directly and others indirectly. Judicious choice of initial values and modes permits direct computation of sine, cosine, arctangent,vector magnitude and transformations between polar and Cartesian coordinates.3.1 Sine and CosineThe rotational mode CORDIC operation can simultaneously compute the sine and cosine of the input angle. Setting the y component of the input vector to zero reduces the rotation mode result to:x A x z y A x z n n n n =⋅=⋅0000cos sin By setting x 0 equal to 1/ A n , the rotation produces the unscaled sine and cosine of the angle argument, z 0. Very often, the sine and cosine values modulate a magnitude value. Using other techniques (e.g., a look up table)requires a pair of multipliers to obtain the modulation. The CORDIC technique performs the multiply as part of the rotation operation, and therefore eliminates the need for a pair of explicit multipliers. The output of the CORDIC rotator is scaled by the rotator gain. If the gain is not acceptable, a single multiply by the reciprocal of the gain constant placed before the CORDIC rotator will yield unscaled results. It is worth noting that the hardware complexity of the CORDIC rotator is approximately equivalent to that of a single multiplier with the same word size.3.2 Polar to Rectangular TransformationA logical extension to the sine and cosine computer is a polar to Cartesian coordinate transformer. The transformation from polar to Cartesian space is defined by:x = r cos θy = r sin θAs pointed out above, the multiplication by the magnitude comes for free using the CORDIC rotator. The transformation is accomplished by selecting the rotation mode with x 0=polar magnitude, z 0=polar phase, and y 0=0.The vector result represents the polar input transformed to Cartesian space. The transform has a gain equal to the rotator gain, which needs to be accounted for somewhere in the system. If the gain is unacceptable, the polar magnitude may be multiplied by the reciprocal of the rotator gain before it is presented to the CORDIC rotator.3.3 General vector rotationThe rotation mode CORDIC rotator is also useful for performing general vector rotations, as are often encountered in motion correction and control systems. For general rotation, the 2 dimensional input vector is presented to the rotator inputs. The rotator rotates the vector throughthe desired angle. The output is scaled by the CORDIC rotator gain, which must be accounted for elsewhere in the system. If the scaling is unacceptable, a pair of constant multipliers is required to compensate for the gain.CORDIC rotators may be cascaded in a tree architecture for general rotation in n-dimensions. Some optimization of multidimensional rotation is possible to permit computational savings over the general n-dimensioned case,as reported by Hsiao et al. [4]3.4 ArctangentThe arctangent, θ=Atan(y/x), is directly computed using the vectoring mode CORDIC rotator if the angle accumulator is initialized with zero. The argument must be provided as a ratio expressed as a vector (x, y). Presenting the argument as a ratio has the advantage of being able to represent infinity (by setting x=0). Since the arctangent result is taken from the angle accumulator, the CORDIC rotator growth does not affect the result.z z y x n =+−0100tan 3.5 Vector MagnitudeThe vectoring mode CORDIC rotator produces the magnitude of the input vector as a byproduct of computing the arctangent. After the vectoring mode rotation, the vector is aligned with the x axis. The magnitude of the vector is therefore the same as the x component of the rotated vector. This result is apparent in the result equations for the vector mode rotator:x A x y n n =+0202The magnitude result is scaled by the processor gain, which needs to be accounted for elsewhere in the system. This implementation of vector magnitude has a hardware complexity of roughly one multiplier of the same width.The CORDIC implementation represents a significant hardware savings over an equivalent Pythagorean processor. The accuracy of the magnitude result improves by 2 bits for each iteration performed.3.6 Cartesian to Polar transformationThe Cartesian to Polar transformation consists of finding the magnitude (r=sqrt(x 2+y 2)) and phase angle (φ=atan[y/x])of the input vector, (x, y). The reader will immediately recognize that both functions are provided simultaneously by the vectoring mode CORDIC rotator. The magnitude of the result will be scaled by the CORDIC rotator gain, and should be accounted for elsewhere in the system. If the gain is unacceptable, it can be corrected by multiplying the resulting magnitude by the reciprocal of the gain constant.3.7 Inverse CORDIC functionsIn most cases, if a function can be generated by a CORDIC style computer, its inverse can also be computed. Unlessthe inverse is calculable by changing the mode of the rotator, its computation normally involves comparing the output to a target value. The CORDIC inverse is illustrated by the Arcsine function.3.8 Arcsine and ArccosineThe Arcsine can be computed by starting with a unit vector on the positive x axis, then rotating it so that its y component is equal to the input argument. The arcsine is then the angle subtended to cause the y component of the rotated vector to match the argument. The decision function in this case is the result of a comparison between the input value and the y component of the rotated vector at each iteration:()x x y d y y x d z z d i i i i i i i i i i i i i i+−+−+−−=−⋅⋅=+⋅⋅=−⋅1111222tan whered i = +1 if y i < c , -1 otherwise, and c = input argument.Rotation produces the following result:()x A x c y cz z c An x A n n n n n in=⋅−==+⋅=+−∏02200212arcsin The arcsine function as stated above returns correct angles for inputs -1 < c /A n x 0 < 1, although the accuracy suffers as the input approaches ±1 (the error increases rapidly for inputs larger than about 0.98). This loss of accuracy is due to the gain of the rotator. For angles near the y axis, the rotator gain causes the rotated vector to be shorter than the reference (input), so the decisions are made improperly.The gain problems can be corrected using a “double iteration algorithm”[9] at the cost of an increase in complexity.The Arccosine computation is similar, except the difference between the x component and the input is used as the decision function. Without modification, the arccosine algorithm works only for inputs less than 1/A n , making the double iteration algorithm a necessity. The Arccosine could also be computed by using the arcsine function and subtracting π/2 from the result, followed by an angular reduction if the result is in the fourth quadrant.3.9 Extension to Linear functionsA simple modification to the CORDIC equation permits the computation of linear functions:()x x y d x y y x d z z d i i i i i i i i i i i i i i i+−+−+−=−⋅⋅⋅==+⋅⋅=−⋅1110222For rotation mode (d i = -1 if z i < 0, +1 otherwise) the linear rotation produces:x x y y x z z n n n ==+=0000This operation is similar to the shift-add implementation of a multiplier, and as multipliers go is not an optimal solution. The multiplication is handy in applications where a CORDIC structure is already available. The vectoring mode (d i = +1 if y i < 0, -1 otherwise) is more interesting, as it provides a method for evaluating ratios:x x y z z y x n n n ===−0000The rotations in the linear coordinate system have a unity gain, so no scaling corrections are required.3.10 Extension to Hyperbolic FunctionsThe close relationship between the trigonometric and hyperbolic functions suggests the same architecture can be used to compute the hyperbolic functions. While, there is early mention of using the CORDIC structure for hyperbolic coordinate transforms [4], the first description of the algorithm is that by Walther [1]. The CORDIC equations for hyperbolic rotations are derived using the same manipulations as those used to derive the rotation in the circular coordinate system. For rotation mode these are:()x x y d y y x d z z d i i i i i i i i i i i i i i+−+−+−−=+⋅⋅=+⋅⋅=−⋅1111222tanh whered i = -1 if z i < 0, +1 otherwise.Then:[][]x A x z y z y A y z x z z A n n n n n n i n=+=+==−≈−∏000000002012080cosh sinh cosh sinh .In vectoring mode (d i = +1 if y i < 0, -1 otherwise) the rotation produces:x A x y y z z y x A n n n n n in=−==+=−−−∏020201002012tanh The elemental rotations in the hyperbolic coordinate system do not converge. However, it can be shown[1] that convergence is achieved if certain iterations (I=4, 13, 40,...,k, 3k+1,...) are repeated.The hyperbolic equivalents of all the functions discussed for the circular coordinate system can be computed in a similar fashion. Additionally, as Walther[1] points out, the following functions can be derived from the CORDIC functions: tan α = sin α/cos α tanh α = sinh α/cosh α exp α = sinh α + cosh αln α = 2tanh -1[y/x] where x=α +1 and y=α-1 (α)1/2 = (x 2-y 2)1/2 where x=α+1/4 and y=α-1/4It is worth noting the similarities between the CORDIC equations for circular, linear, and hyperbolic systems. The selection of coordinate system can be made by introducing a mode variable that takes on values 1,0, or -1 for circular,linear and hyperbolic systems respectively. The unified [1]CORDIC iteration equations are then:x x m y d y y x d z z d e i i i i i i i i i i i i i i+−+−+=−⋅⋅⋅=+⋅⋅=−⋅11122where e i is the elementary angle of rotation for iteration i inthe selected coordinate system. Specifically, e i = tan -1(2-i)for m=1, e i = 2-i for m=0, and e i = tanh -1(2-i) for m=-1.This unification, due to Walther, permits the design of a general purpose CORDIC processor.3.11 Short cutsFor fixed angle rotations, as are encountered in such places as fast Fourier Transforms (FFTs), the arctangent base representation of the angle can be pre-computed and applied directly to the CORDIC rotator. This hardwiring of a fixed angle(s) eliminates the need for the angle accumulator, which reduces the circuit complexity by about25 percent. If the constraints on the decision variable are relaxed to allow that variable to take on values of {-1,0,1}instead of just {-1,1}, the number of iterations can also be reduced. Iterations for which the decision variable is zero pass the data unrotated, and can thus be eliminated. This modification causes the gain to become a function of the rotated angle, so it is only useful if the rotation angle is fixed. Hu and Naganathan[10] propose a method of pre-computing the recoded angles for the ternary decision variable. This technique can significantly reduce the complexity of on-line CORDIC processors used for fixed angle rotations.4. IMPLEMENTATION IN AN FPGAThere are a number of ways to implement a CORDIC processor. The ideal architecture depends on the speed versus area tradeoffs in the intended application. First we will examine an iterative architecture that is a direct translation from the CORDIC equations. From there, we will look at a minimum hardware solution and a maximum performance solution.4.1 Iterative CORDIC ProcessorsAn iterative CORDIC architecture can be obtained simply by duplicating each of the three difference equations in hardware as shown in Figure 1. The decision function, d i , is driven by the sign of the y or z register depending on whether it is operated in rotation or vectoring mode. In operation, the initial values are loaded via multiplexers into the x, y and z registers. Then on each of the next n clock cycles, the values from the registers are passed through the shifters and adder-subtractors and the results placed back in the registers. The shifters are modified on each iteration to cause the desired shift for the iteration. Likewise, the ROM address is incremented on each iteration so that the appropriate elementary angle value is presented to the z adder-subtractor. On the last iteration, the results are read directly from the adder-subtractors. Obviously, a simple state machine is required keep track of the current iteration,and to select the degree of shift and ROM address for each iteration.The design depicted in Figure 1 uses word-wide data paths (called bit-parallel design). The bit-parallel variable shift shifters do not map well to FPGA architectures because of the high fan-in required. If implemented, those shifters will typically require several layers of logic (i.e., the signal will need to pass through a number of FPGA cells). The result is a slow design that uses a large number of logic cells.x 0y 0z 0nnnFigure 1. Iterative CORDIC structureA considerably more compact design is possible using bit serial arithmetic. The simplified interconnect and logic in a bit serial design allows it to work at a much higher clock rate than the equivalent bit parallel design. Of course, the design also needs to clocked w times for each iteration (w is the width of the data word). The bit serial design consists of three bit serial adder-subtractors, three shift registers and a serial Read Only Memory (ROM). Each shift register has a length equal to the word width. There is also some gating or multiplexers to select taps off the shift registers for the right shifted cross terms (shifting is accomplished using bit delays in bit serial systems). The bit serial CORDIC architecture is shown in Figure 2. In this design,w clocks are required for each of the n iterations, where w is precision of the adders. In operation, the load multiplexers on the left are opened for w clock periods to initialize the x ,y and z registers (these registers could also be parallel loaded to initialize). Once loaded, the data is shifted right through the serial adder-subtractors and returned to the left end of the register. Each iteration requires w clocks to return the result to the register. At the beginning of each iteration, the control state machine reads the sign of the y (or z ) register and sets the add/subtract controls accordingly. The appropriate tap off the register for the cross terms is also selected at the beginning of each iteration. During the n th iteration, the results can be read from the outputs of the serial adders while the next initialization data is shifted into the registers.Figure 2 Bit serial iterative CORDICThe simplicity of the bit serial design is apparent from figure 2. Even in this case, the wiring of the shift tap multiplexers can present problems in some FPGAs (this is one place where tri-state long lines can come in handy). Even so, the interconnect is minimal and the logic between registers is simple. This combination permits bit clock rates near the maximum toggle frequency of the FPGA. The possibility of using extreme bit clock frequencies makes up for the large number of clock cycles required to complete each rotation.Now, if the design is in a Xilinx 4000E series part, the shift registers can be implemented in the CLB RAM[2]. The RAM emulates a shift register by incrementing the read/write address after each access. The dual port capability of the CLB RAM provides the capability to read two locations in the 16x1 RAM simultaneously [9]. By properly sequencing the second address, the effect of the shift tap multiplexer is achieved without a physical multiplexer. The result is the shift register and multiplexer for word lengths up to 16 bits are implemented in a single CLB (plus 8 CLBs for the 2 address sequencers and iteration counter, which are shared by the three shifters). The serial ROM also uses the CLB for data storage. One CLB is required for every two iterations. The 16 bit, 8 iteration CORDIC processor shown in Figure 3 uses only 21 CLBs, and will run at bit rates up to about 90 MHz (mainly limited by the RAM write cycle). This translates to about a 1.5µS processing time, which is only about three and a half times longer than the best one could expect from the much larger bit parallel iterative solution.4.2On-Line CORDIC ProcessorsThe CORDIC processors discussed so far are iterative, which means the processor has to perform iterations at n times the data rate. The iteration process can unrolled[18] so that each of n processing elements always performs the same iteration. An unrolled CORDIC processor is shown in Figure 4. Unrolling the processor results in two significant simplifications. First the shifters are each a fixed shift, which means that they can be implemented in the wiring. Second, the lookup values for the angle accumulator are distributed as constants to each adder in the angle accumulator chain. Those constants can be hardwired instead of requiring storage space. The entire CORDIC processor is reduced to an array of interconnected adder-subtractors. The need for registers is also eliminated, making the unrolled processor strictly combinatorial. The delay through the resulting circuit would be substantial, but the processing time is reduced from that required by the iterative circuit (if by nothing else than the set-up and hold times of the register). Most times, especially in an FPGA, it does not make sense to use such a large combinatorial circuit. The unrolled processor is easily pipelined by inserting registers between the adder-subtractors. In the case of most FPGA architectures there are already registers present in each logic cell, so the addition of the pipeline registers has no hardware cost.FPGA uses 21 CLBs。
可重构计算(Reconfigurable Computing)
发展趋势
…
…
系统互连的趋势
交换式结构代替总线式 高速串行点对点连接代替并行总线 基于包交换的协议代替独立控制信号 异步协议代替同步协议 传统意义上的互联走向通信模式? 为可重构互连带来了机会? 模块化 异步性
“拆”和“聚”
光互连让“拆”成为了可能: 长距离传输,带宽 可重构计算为“聚”提供了支持: 编制新的应用程序时,可直接调用共享内存或消息 传递算法模块,利用已有成果,加速程序的开发。一个应 用程序可能包括对三类结构库函数的并行调用。例如程序 员开发通过投票方式确定基因比对结果的程序(一组数据 调用三组函数库独立处理,结果比对,2:1为执行完), 机器将自动调整为三部分(SMP、MPP、Cluster),并行 执行三个独立的程序,数据可以共享!
DSAG:光互连-“拆”;RC-“聚”,聚的过程需要重构 研究RC体系结构理论和方法对DSAG理论的指导 研究如何利用现有的RC技术和产品构建DSAG
RC的研究主题
体系结构 逻辑,连接 软件技术 描述,编译,开发环境 快速可重构技术 实时性,更高的动态性 应用 ASIC(小雨点卡),design/verification(龙芯),DSAG (?)
可重构计算(Reconfigurable Computing)
李磊 eniac@ 智能中心HPC-OG组 2003-10-22
内容
RC:what&why RC的体系结构 RC的研究项目 RC与DSAG
RC:What & Why
可重构计算:Reconfigurable Computing, RC FPGA-based RC 历史:50年代,80年代 目标:"the performance of hardware with the flexibility of software." ASIC-专用,processor-通用 性能-成本 我们的目的
A C Compiler for a Processor with a Reconfigurable Functional Unit
A C Compiler for a Processor with a ReconfigurableFunctional UnitZhi Alex Ye Nagaraj Shenoy Prithviraj BanerjeeDepartment of Electrical and Computer Engineering,Northwestern UniversityEvanston, IL 60201, USA{ye, nagaraj, banerjee}@ABSTRACTThis paper describes a C compiler for a mixed Processor/FPGA architecture where the FPGA is a Reconfigurable Functional Unit (RFU). It presents three compilation techniques that can extract computations from applications to put into the RFU. The results show that large instruction sequences can be created and extracted by these techniques. An average speedup of 2.6 is achieved over a set of benchmarks.1.INTRODUCTIONWith the flexibility of the FPGA, reconfigurable systems are able to get significant speedups for some applications. As the general purpose processor and the FPGA each has its own suitable area of applications, several architectures are proposed to integrate a processor with an FPGA in the same chip.In this paper, we talk about a C compiler for a Processor/FPGA system. The target architecture is Chimaera, which is a RISC processor with a Reconfigurable Functional Unit (RFU). We describe how the compiler identifies sequences of statements in a C program and changes them into RFU operations (RFUOPs). We show the performance benefits that can be achieved by such optimizations over a set of benchmarks.The rest of the paper is organized into five sections. Section 2 discusses related work. In Section 3, we give an overview of the Chimaera architecture. Section 4 discusses the compiler organization and implementation in detail. In this section, we first discuss a technique to enhance the size of the instruction sequence: control localization. Next, we describe the application of the RFU to SIMD Within A Register (SWAR) operations. Lastly, we introduce an algorithm to identify RFUOPs in a basic block. Section 5 demonstrates some experimental results. We summarize this paper in Section 6.2.RELATED WORKSeveral architectures have been proposed to integrate a processor with an FPGA [6,7,8,9,13,14,15]. The usage of the FPGA can be divided into two categories: FPGA as a coprocessor or FPGA as a functional unit.In the coprocessor schemes such as Garp[9], Napa[6], DISC[14], and PipeRench[7], the host processor is coupled with an FPGA based reconfigurable coprocessor. The coprocessor usually has the ability of accessing memory and performing control flow operations. There is a communication cost between the coprocessor and the host processor, which is several cycles or more. Therefore, these architectures tend to map a large portion of the application, e.g. a loop, into the FPGA. One calculation in the FPGA usually corresponds to a task that takes several hundred cycles or more.In the functional unit schemes such as Chimaera[8], OneChip[15], and PRISC[13], the host processor is integrated with an FPGA based Reconfigurable Functional Unit (RFU). One RFU Operation (RFUOP) can take on a task that usually requires several instructions on the host processor. As the functional unit is interfaced only with the register file, it cannot perform memory operations or control flow operations. The communication is faster than the coprocessor scheme. For example, in the Chimaera architecture, after an RFUOP’s configuration is loaded, an invocation of it has no overhead in communication. This gives such architecture a larger range of application. Even in cases where only a few instructions can be combined into one RFUOP, we could still apply the optimization if the execution frequency is high enough.3.CHIMAERA ARCHITECTUREIn this section, we review the Chimaera architecture to provide adequate background information for explaining the compiler support for this architecture. More information about Chimaera can be found in [8].The overall Chimaera architecture is shown in Figure 1. The main component of the system is the Reconfigurable Functional Unit (RFU), which consists of FPGA-like logic designed to support high-performance computations. It gets inputs from the host processor’s register file, or a shadow register file which duplicates a subset of the values in the host’s register file. The RFU is capable of computing data-dependent operations (e.g., tmp=r2-r3, r5=tmp+r1), conditional evaluations (e.g., "if (b>0) a=0; else a=1;"), and multiple sub-word operations (e.g., four instances of 8-bit addition).The RFU contains several configurations at the same time. An RFUOP instruction will activate the corresponding configuration in the RFU. An RFU configuration itself determines from whichregisters it reads its operands. A single RFUOP can read from all the registers connected to the RFU and then put the result on the result bus. The maximum number of input registers is 9 in Chimaera. Each RFUOP instruction is associated with a configuration and an ID. For example, an execution sequence “r2=r3<<2; r4=r2+r5; r6=lw 0(r4)” can be optimized to “r4=RFUOP #1; r6=lw 0(r4)”. Here #1 is the ID of this RFUOP and “r5+r3<<2” is the operation of the corresponding configuration. After an RFUOP instruction is fetched and decoded, the Chimaera processor checks the RFU for the configuration corresponding to the instruction ID. If the configuration is currently loaded in the RFU, the corresponding output is written to the destination register during the instruction writeback cycle. Otherwise, the processor stalls when the RFU loads the configuration.4. COMPILER IMPLEMENTATIONWe have developed a C compiler for Chimaera, which automatically maps some operations into RFUOPs. The generated code is currently run on a Chimaera simulator to gather performance information. A future version of the compiler will be integrated with a synthesis tool.The compiler is built using the widely available GCC framework. Figure 2 depicts the phase ordering of the implementation. The C code is parsed into the intermediate language of GCC: Register Transfer Language (RTL), which is then enhanced by several early optimizations such as common expression elimination, flow analysis, etc. The partially optimized RTL is passed through the Chimaera optimization phase, as will be explained below. The Chimaera optimized RTL is then processed by later optimization phases such as instruction scheduling, registers allocation, etc. Finally, the code for the target architecture is generated along with RFUOP configuration information.From the compiler’s perspective, we can consider an RFUOP as an operation with multiple register inputs and a single register output. The goal of the compiler is to identify the suitable multiple-input-single-output sequences in the programs and change them into RFUOPs.Chimaera Optimization consists of three steps: Control Localization, SWAR optimization and Instruction Combination.Due to the configuration loading time, these optimizations can be applied only in the kernels of the programs. Currently, we only optimize the innermost loop in the programs.The first step of Chimaera optimization is control localization.It will transform some branches into one macroinstruction to form a larger basic block. The second step is the SIMD Within A Register (SWAR) Optimization. This step searches the loop body for subword operations and unrolls the loop when appropriate.The third step is instruction combination. It takes a basic block as input and extracts the multiple-input-single-output patterns from the data flow graph. These patterns are changed into RFUOPs if they can be implemented in RFU. The following subsections discuss the three steps in detail.4.1 Control LocalizationIn order to get more speedup, we want to find larger and more RFUOPs. Intuitively, a larger basic block contains more instructions, thus has more chances of finding larger and more RFUOPs. We find that control localization technique [11][13] isFigure 1. The overall Chimaera architectureH o s t P r o c e s s o rFigure 2: Phase ordering of the C compiler for Chimaera(a)(b)Figure 3: Control Localization(a) control flow graph before control localization.Each oval is an instruction, and the dashed box marks the code sequence to be control localized.(b) control flow graph after control localizationuseful in increasing the size of basic blocks. Figure 3 shows an example of it. After control localization, several branches are combined into one macroinstruction, with multiple output and multiple input. In addition to enlarging the basic block, the control localization sometimes finds RFUOPs directly. When a macroinstruction has only one output, and all the operations in it can be implemented in the RFU, this macroinstruction can be mapped into an RFUOP. This RFUOP can speculatively compute all operations on different branch paths. The result on the correct path where the condition evaluates to true is selected to put into the result bus. This macro instruction is called as “CI macroin”and can be optimized by Instruction Combination.4.2SWAR OptimizationAs a method to exploit medium-grain data parallelism, SIMD (single instruction, multiple data) has been used in parallel computers for many years. Extending this idea to general purpose processors has led to a new version of SIMD, namely SIMD Within A Register (SWAR)[4]. The SWAR model partitions each register into fields that can be operated on in parallel. The ALUs are set up to perform multiple field-by-field operations. SWAR has been successful in improving the multimedia performance. Most of the implementations of this concept are called multimedia extensions, such as Intel MMX, HP MAX, SUN SPARC VIS, etc. For example, “PADDB A, B” is an instruction from Intel MMX. Both operands A and B are 64-bit and are divided into eight 8-bit fields. The instruction performs eight additions in parallel and stores the eight results to A.However, current implementations of SWAR do not support a general SWAR model. Some of their limitations are:•The input data must be packed and aligned correctly, causing packing and unpacking penalties sometimes.•Most of current hardware implementations support 8, 16 and 32-bit field size only. Other important sizes such as 2-bit and 10-bit are not supported.•Only a few operations are supported. When the operation for one item becomes complex, SIMD is impossible. For example, the following code does not map well to a simple sequence of SIMD operations:char out[100],in1[100],in2[100];for(i=0;i<100;i++) {if ((in1[i]-in2[i])>10)out[i]=in1[i]-in2[i];elseout[i]=10;}With the flexibility of the FPGA, the RFU can support a more general SWAR model without the above disadvantages. The only requirement is that the output fields should fit within a single register. The inputs don’t need to be stored in packed format, nor is there limitation on the alignment. In addition, complex operations can be performed. For example, the former example can be implemented in one RFUOP.Our compiler currently supports 8-bit field size, which is the size of “char” in C. In current implementation, the compiler looks for the opportunity to pack several 8-bit outputs into a word. In most cases, this kind of pattern exists in the loop with stride one. Therefore, the compiler searches for the pattern such that the memory store size is a byte and the address changes by one forunrolled four times. In the loop unrolling, conventional optimizations such as local register renaming and strength reduction are performed. In addition, the four memory stores are changed to four sub-register movements. For example,“store_byte r1,address;store_byte r2,address+1;store_byte r3,address+2;store_byte r4,address+3;”are changed into“(r5,0)=r1; (r5,1)=r2;(r5,2)=r3; (r5,3)=r4;”.The notation (r, n) refers to the n th byte of register r. We generate a pseudo instruction "collective-move" that moves the four sub-registers into a word register, e.g. “r5=(r5,0) (r5,1) (r5,2) (r5,3)”. In the data flow graph, the four outputs merge through this “collective-move” into one. Thus a multiple-input-single-output subgraph is formed. The next step, Instruction Combination, canrecognize this subgraph and change it to an RFUOP when appropriate. Finally, a memory store instruction is generated tostore the word register. The compiler then passes the unrolled copy to the instruction combination step.4.3Instruction CombinationThe instruction combination step analyzes a basic block and changes the RFU sequences into RFUOPs. It first finds out what instructions can be implemented in the RFU. It then identifies the RFU sequences. At last, it selects the appropriate RFU sequences and changes them into RFUOPs.We categorize instructions into Chimaera Instruction (CI) and Non-Chimaera Instruction (NCI). Currently CI includes logic operation, constant shift and integer add/subtract. The “collective_move”, “subregister movement” and “CI macroin” are also considered as CI. NCI includes other instructions such as multiplication/division, memory load/store, floating-point operation, etc.The algorithm FindSequences in Figure 4 finds all the maximum instruction sequences for the RFU. It colors each node in the data flow graph(DFG). The NCI instructions are marked as BLACK. A CI instruction is marked as BROWN when its output must be put into a register, that is, the output is live-on-exit or is the input of some NCI instructions. Other CI instructions are marked as WHITE. The RFU sequences are the subgraphs in the DFG that consists of BROWN nodes and WHITE nodes.The compiler currently changes all the identified sequences into RFUOPs. Under the assumption that every RFUOP takes one cycle and the configuration loading time can be amortized over several executions, this gives an upper bound of the speedup we could expect from Chimaera. In the future, we will take into account other factors such as the FPGA size, configuration loading time, actual RFUOP execution time, etc.5.EXPERIMENTAL RESULTSWe have tested the compiler’s output through a set of benchmarks on the Chimaera simulator. The simulator is a modification of SimpleScalar Simulator[3]. The simulated architecture has 32 general purpose 32-bit registers and 32 floating point registers. The instruction set is a superset of MIPS-IV ISA. Presently, the simulator executes the programs sequentially and gathers theEarly results on some benchmarks are presented in this section. Each benchmark is compiled in two ways: one is using “gcc -O2”, the other is using our Chimaera compiler. We studied the differences between the two versions of assembly codes as well as the simulation results. In the benchmarks, decompress.c and compress.c are from Honeywell benchmark[10], jacobi and life are from Raw benchmark[2], image reconstruction[12] and dct[1] are implementations of two program kernels of MPEG, image restoration is an image processing program. They are noted as dcmp, cmp, life, jcbi, dct, rcn and rst in the following figure.Table 1 shows the simulation results of the RFU optimizations. Insn1 and insn2 are the instruction counts without and with RFU optimization. The speedup is calculated as insn1/insn2. The following three columns IC, CL and SWAR stand for the portion of performance gain from Instruction Combination, Control Localization and SWAR respectively.The three optimizations give an average speedup of 2.60. The best speedup is up to 7.19.To illustrate the impact of each optimization on the kernel sizes, we categorize instructions into four types: NC, IC, CL and SWAR. NC is the part of instructions that cannot be optimized for Chimaera. NCI instructions and some non-combinable integer operations fall in this category. IC, CL and SWAR stand for the instructions that can be optimized by Instruction Combination, Control Localization and SWAR optimization respectively. Figure 5 shows the distribution of these four types of instructions in the program kernels. After the three optimizations, the kernel size can be reduced by an average of 37.5%. Of this amount, 22.3% is from Instruction Combination, 9.8% from Control Localization and 5.4% from SWAR.Table 1: Performance results over some benchmarks. The "avg" row is the average of all benchmarks.62.50%22.30%0%0%Figure 5: Distribution of the kernel instructionsFurther analysis shows that 58.4% of the IC portion comes from address calculation. For example, the following C code “int a[10], ...=a[i]” is translated to "r3=r2<<2, r4=r3+r1, r5=lw 0(r4)" in assembly. The first two instructions can be combined in Chimaera. The large portion of address calculation indicates that our optimizations can be applied to a wide range of applications, as long as they have complex address calculations in the kernel. Furthermore, as the address calculation is basically sequential, existing ILP architectures like superscalar and VLIW cannot take advantage of it. This suggests that we may expect speedup if we integrate a RFU into an advanced ILP architecture.Figure 6 illustrates the frequencies of different RFUOP sizes. For Instruction Combination and Control Localization, most of the sizes are from 2 to 6. These small sizes indicate that these techniques are benefiting from the fast communication of the functional unit scheme. In the coprocessor scheme, the communication overhead would make them prohibitive to apply. The SWAR optimization generally identifies much larger RFUOPs. The largest one comes from the image reconstruction benchmark, whose kernel is shown in Figure 7. In this case, a total of 52 instructions are combined in the RFU, which results in a speedup of 4.2.model. We have also simulated the architecture in an out-of-order execution environment. We considered a superscalar host processor, different latencies of RFUOPs, and configuration loading time. These results are reported in [16].In summary, the results show that the compilation techniques are able to create and find many instruction sequences for the RFU. Most of their sizes are several instructions, which demonstrate that the fast communication is necessary. The system gives an average speedup of 2.6.6.CONCLUSIONThis paper describes a C compiler for the Processor/FPGA architecture when the FPGA is served as a Reconfigurable Functional Unit (RFU).We have introduced an instruction combination algorithm to identify RFU sequences of instructions in a basic block. We have also shown that the control localization technique can effectively enlarge the size of the basic blocks and find some more sequences. In addition, we have illustrated the RFU support for SWAR. By introducing “sub-register movement” and “collective-move”, the instruction combination algorithm is able to identify complex SIMD instructions for the RFU.Finally, we have presented the experimental results, which demonstrate that these techniques can effectively create and identify larger and more RFU sequences. With the fast communication between RFU and the processor, the system can achieve considerable speedups.7.ACKNOWLEDGEMENTSWe would like to thank Scott Hauck for his contribution to this research. We would also like to thank the reviewers for their helpful comments. This work was supported by DARPA under Contract DABT-63-97-0035.8.REFERENCES[1]K. Atsuta, DCT implementation, http://marine.et.u-tokai.ac.jp/database/koichi.html.[2]J.Babb, M.Frank, et al. The RAW benchmark Suite:Computation Structures for General Purpose Computing. FCCM, Napa Vally, CA, Apr.1997[3] D. Burger, and T. Austin, The Simplescalar Tool Kit,University of Wisconsin-Madison Computer Sciences Department Technical Report #1342, June, 1997 [4]P. Faraboschi, et al. The Latest Word in Digital andMedia Processing, IEEE signal processing magazine, Mar 1998[5]R. J. Fisher, and H. G. Dietz, Compiling For SIMDWithin A Register, 1998 Workshop on Languages and Compilers for Parallel Computing, North Carolina, Aug 1998[6]M.B. Gokhale, et al. Napa C: Compiling for a HybridRISC/FPGA Architecture, FCCM 98, CA, USA[7]S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S.Cadambi, R. R. Taylor, and R. Laufer. PipeRench: A Coprocessor for Streaming Multimedia Acceleration, ISCA’99, May 1999, Atlanta, Georgia[8]S. Hauck, T. W. Fry, M. M. Hosler, J. P. Ka, TheChimaera Reconfigurable Functional Unit, IEEE Symposium on FPGAs for Custom Computing Machines, 1997[9]J. R. Hauser and J. Wawrzynek. GARP: A MIPSprocessor with a reconfigurable coprocessor.Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), Napa, CA, April 1997.[10]Honeywell Inc, Adaptive Computing SystemsBenchmarking,/projects/acsbench/ [11]W. Lee, R. Barua, and et al. Space-Time Scheduling ofInstruction-Level Parallelism on a Raw Machine, MIT.ASPLOS VIII 10/98, CA, USA[12]S. Rathnam, et al. Processing the New World ofInteractive Media, IEEE signal processing magazine March 1998[13]R. Razdan, PRISC: Programmable Reduced InstructionSet Computers, Ph.D. Thesis, Harvard University, Division of Applied Sciences,1994[14]M. J. Wirthlin, and B. L. Hutchings. A DynamicInstruction Set Computer, FCCM, Napa Vally, CA, April, 1995[15]R. D. Wittig and P. Chow. OneChip: An FPGAProcessor with Reconfigurable Logic, FCCM, Napa Vally, CA, April, 1996[16]Z. A. Ye, A. Moshovos, P. Banerjee, and S. Hauck,"Chimaera, a high performance architecture with a tightly-coupled reconfigurable functional unit", submitted to the 27th International Symposium on Computer Architecture (ISCA-2000).。
NI 9775 4-通道 ±10 V 20 MS s 14 位数字化仪器数据手册说明书
DAT ASHEETNI 97754-Ch, ±10 V, 20 MS/s, 14-Bit Digitizer•BNC connectivity•High-speed measurements up to 20 MS/s/ch at 68 dB SNR•High-resolution measurements up to 5 MS/s/ch at 74 dB SNR•14-bit resolution•Built-in analog reference trigger •128 Mbits onboard memoryThe NI 9775, a 4-channel digitizer, can measure transient phenomenon like faults in electrical transmission lines from lightning strikes or structural failure events at 20 MS/s/ch. Themodule's store and forward architecture allows up to 128 Mbits of measurement data to be sent back to the controller and analyzed. The module has a built-in analog reference trigger, or you can use CompactRIO and LabVIEW FPGA to develop an advanced trigger based on low-speed streaming data for added flexibility.Kit ContentsAccessories • NI 9775• NI 9775 Getting Started Guide• BNC Male to BNC Male CablesNI C Series OverviewNI provides more than 100 C Series modules for measurement, control, and communication applications. C Series modules can connect to any sensor or bus and allow for high-accuracy measurements that meet the demands of advanced data acquisition and control applications.•Measurement-specific signal conditioning that connects to an array of sensors and signals •Isolation options such as bank-to-bank, channel-to-channel, and channel-to-earth ground •-40 °C to 70 °C temperature range to meet a variety of application and environmental needs•Hot-swappableThe majority of C Series modules are supported in both CompactRIO and CompactDAQ platforms and you can move modules from one platform to the other with no modification. CompactRIOCompactRIO combines an open-embedded architecturewith small size, extreme ruggedness, and C Seriesmodules in a platform powered by the NI LabVIEWreconfigurable I/O (RIO) architecture. Each systemcontains an FPGA for custom timing, triggering, andprocessing with a wide array of available modular I/O tomeet any embedded application requirement. CompactDAQCompactDAQ is a portable, rugged data acquisition platformthat integrates connectivity, data acquisition, and signalconditioning into modular I/O for directly interfacing to anysensor or signal. Using CompactDAQ with LabVIEW, youcan easily customize how you acquire, analyze, visualize, andmanage your measurement data.2| | NI 9775 DatasheetSoftwareLabVIEW Professional Development System for Windows•Use advanced software tools for large project development•Generate code automatically using DAQ Assistant and InstrumentI/O Assistant•Use advanced measurement analysis and digital signal processing•Take advantage of open connectivity with DLLs, ActiveX,and .NET objects•Build DLLs, executables, and MSI installersNI LabVIEW FPGA Module•Design FPGA applications for NI RIO hardware•Program with the same graphical environment used for desktop andreal-time applications•Execute control algorithms with loop rates up to 300 MHz•Implement custom timing and triggering logic, digital protocols, andDSP algorithms•Incorporate existing HDL code and third-party IP including Xilinx IP generator functions•Purchase as part of the LabVIEW Embedded Control and Monitoring SuiteNI LabVIEW Real-Time Module•Design deterministic real-time applications with LabVIEWgraphical programming•Download to dedicated NI or third-party hardware for reliableexecution and a wide selection of I/O•Take advantage of built-in PID control, signal processing, andanalysis functions•Automatically take advantage of multicore CPUs or setprocessor affinity manually•Take advantage of real-time OS, development and debuggingsupport, and board support•Purchase individually or as part of a LabVIEW suiteNI 9775 Datasheet| © National Instruments| 3CircuitryNote The diagram shows one channel inside the NI 9775.•The shell of the BNC connects to CHASSIS GND.•The four channels of the NI 9775 share the clock circuit and operate simultaneously.•The NI 9775 has two separate data engines operating simultaneously for each channel: the continuous stream engine and triggered record engine.•The module waits for a trigger event, fills the circular buffer with the configured set of data (called a record), then streams the entire record from the module to the chassis.•The analog filter allows you to select 10 MHz or 5 MHz bandwidth.•The software-selectable digital decimation filter improves resolution and alias rejection.•The ADC samples the analog signal continuously at 20 MS/s.Timing ModesThe NI 9775 has two timing modes: high-speed and high-resolution. High-speed mode turns off the digital decimation filter on all channels and enables you to set the analog filter per channel. High-resolution mode turns on both the analog filter and digital decimation filter for all channels.Acquisition ModesThe NI 9775 has three acquisition modes: continuous mode, record mode, and advanced mode. In continuous mode, the NI 9775 transfers real-time data to the chassis at an aggregate rate of 4 MS/s across all channels. In record mode, the NI 9775 stores samples into onboard memory at up to 20 MS/s then transfers the data to the chassis at a slower rate. In advanced 4| | NI 9775 Datasheetmode, the NI 9775 combines the functionality of continuous mode and record mode to enable more complex triggering schemes based on the continuous data.Note Advanced mode is only available on CompactRIO systems.Related InformationHorizontal on page 11NI 9775 SpecificationsThe following specifications are typical for the range -40 °C to 70 °C unless otherwise noted.Caution Do not operate the NI 9775 in a manner not specified in this document.Product misuse can result in a hazard. You can compromise the safety protectionbuilt into the product if the product is damaged in any way. If the product isdamaged, return it to NI for repair.Input CharacteristicsNumber of channels 4 (simultaneously sampled)Input type Reference single-endedInput impedance 1 MΩInput capacitance24 pFInput coupling DCInput rangeNominal±10 VTypical±11.3 VMinimum±10.04 VADC resolution14 bitsOvervoltage protection±30 V DC, safe operating areaNI 9775 Datasheet| © National Instruments| 5Figure 1. Safe Operating AreaInput Signal Frequency (MHz)V o l t a g e (V )DC gain drift ±140 ppm/°C DC offset drift ±0.34 mV/°C AC amplitude accuracy ±0.25 dB at 50 kHz AC amplitude drift±172 ppm/°C Channel-to-channel crosstalk < -90 dB at 5 MHzTiming modes (software-selectable)High-speed, high-resolution Analog filter (software-selectable)6th order low-pass Bessel1Range equals 10 V for absolute accuracy calcuations.2Uncalibrated accuracy refers to the accuracy achieved when acquiring in raw or unscaled modes where the calibration constants stored in the module are not applied to the data.6 | | NI 9775 DatasheetAnalog filter -3 dB bandwidthHigh-speed mode with analog filter disabled13.9 MHz High-speed mode with analog filter enabled4.7 MHz High-resolution mode2.36 MHzAlias rejection in high-resolution mode45 dB at 5 MS/s onlyFigure 2. Frequency Response in High-Speed Mode with Analog Filter DisabledA m p l i t u d e (dB )-3.5-4.0-3.0-2.5-2.0-1.5-1.00.5-0.50Frequency (Hz)10k10M1M100k20MFigure 3. Frequency Response in High-Speed Mode with Analog Filter EnabledA m p l i t u d e (dB )–50.0–60.0–40.0–30.0–20.0–10.010.00Frequency (Hz)10 k1 M10 M 100 k20 MNI 9775 Datasheet | © National Instruments | 7Figure 4. Frequency Response in High-Resolution ModeA m p l i t u d e (dB )–3.5–4.0–3.0–2.5–2.0–1.5–1.00.5–0.50Frequency (Hz)10 k1 M100 k10 MFigure 5. Idle Channel FFT in High-Speed Mode with Analog Filter Disabled (20 MS/s,32,768 point FFT)FrequencyA m p l i t u d e (dB F S )0 Hz–120–110–100–90–80–70–60–50–40–30–20–1001 MHz2 MHz3 MHz4 MHz5 MHz6 MHz7 MHz8 MHz9 MHz 10 MHz8 | | NI 9775 DatasheetFigure 6. Idle Channel FFT in High-Resolution Mode (1 MS/s, 32,768 point FFT)A m p l i t u d e (dB F S )Frequency (kHz)Figure 7. Step ResponseTime (ns)A m p l i t u d e (V)NI 9775 Datasheet | © National Instruments | 9Figure 8. Single-T one Spectrum at (-1 dB FS, 2.45 MHz)A m p l i t u d e (dB )Frequency (Hz)Spurious free dynamic range (-60 dB FS input)High-speed mode at 2.45 MHz 89 dB FS High-resolution mode at 100 kHz 94 dB FSInput to trigger delayHigh-speed mode with analog filter disabled863 ns High-speed mode with analog filter enabled950 ns High-resolution mode 4.62 μs Input delay (Continuous Mode)High-speed mode with analog filter disabled913 ns High-speed mode with analog filter enabled999 ns High-resolution mode 4.67 µs NoiseHigh-speed mode 2.8 mV RMS High-resolution mode 1.4 mV RMS Effective number of bitsHigh-speed mode 11 bits High-resolution mode12 bits 10 | | NI 9775 DatasheetSignal-to-Noise ratioHigh-speed mode 68 dB at 2.45 MHz High-resolution mode74 dB at 100 kHzTotal harmonic distortion at -1 dB FS inputHigh-speed mode with analog filter disabled at 2.45 MHz-62 dB FS High-speed mode with analog filter enabled at 1 MHz-69 dB FS High-resolution mode at 100 kHz and -75 dB FS Channel-to-channel skewAnalog filter disabled 1.5 ns Analog filter enabled 12.7 ns LSB weight1.385 mV/LSB HorizontalSample clock source20 MHz PLLMaximum sample rate in record modeHigh-speed mode 20 MS/s High-resolution mode5 MS/sFigure 9. Phase NoiseP h a s e N o i s e (d B c /H z )–160–170–150–140–130–120–70–110–100–90–80Frequency (Hz)100100 k 1 M1 k10 k 10 MTimebase frequency 20 MHz Timebase accuracy±50 ppmNI 9775 Datasheet | © National Instruments | 11PLL reference clock sourceInternal master timebase12.8 MHzChassis OCLK12.8 MHzData Rate in Record Mode20MS/sWhereN ∈ {1, 2, 3, 4, 5, ..., 65,535} for high-speed modeN ∈ {4, 8, 12, 16, 20, ..., 65,532} for high-resolution modeData Rate in Continuous Mode20MS/sWhereM ∈ {5, 6, 7, 8, 9, ..., 65,535} with one channel enabled for high-speed modeM ∈ {8, 12, 16, 20, 24, ..., 65,532} with one channel enabled for high-resolution mode M ∈ {10, 11, 12, 13, 14, ..., 65,535} with two channels enabled for high-speed modeM ∈ {12, 16, 20, 24, 28, ..., 65,532} with two channels enabled for high-resolution mode M ∈ {15, 16, 17, 18, 19, ..., 65,535} with three channels enabled for high-speed mode M ∈ {16, 20, 24, 28, 32, ..., 65,532} with three channels enabled for high-resolutionmodeM ∈ {20, 21, 22, 23, 24, ..., 65,535} with four channels enabled for high-speed mode M ∈ {20, 24, 28, 32, 36, ..., 65,532} with four channels enabled for high-resolution modeT riggerSupported trigger modes Start and referenceTrigger types Analog edge, digital edge, and software Trigger sources AI0 to AI3 and chassis backplaneDead time0 samplesAnalog Edge TriggerTrigger sources AI0 to AI3Settings Level, slope, and hysteresisTrigger uncertainty≤ 1 sampleRearm time 1 sample minimum12| | NI 9775 DatasheetWaveform SpecificationsOnboard memory size128 MbitsMinimum record length16 samplesMinimum number of pre-trigger samplesCompactRIO1CompactDAQ2Minimum number of post-trigger samplesCompactRIO1CompactDAQ2Maximum number of records32 recordsMaximum number of samples per record3BBB BℎBBBRecord data transfer rateMaximum4 4.7 MS/sTypical 4 MS/sPower RequirementsPower consumed from chassisActive mode0.9 W maximumSleep mode52.5 μW maximumThermal dissipation (at 70 °C)Active mode 1.06 W maximumSleep mode 3.65 mW maximum3The maximum number of samples per record is different for CompactRIO systems.4With all four channels enabled.NI 9775 Datasheet| © National Instruments| 13Safety VoltagesConnect only voltages that are within Measurement Category O.IsolationChannel-to-channel NoneChannel-to-earth ground NoneNote Measurement Categories CAT I and CAT O are equivalent. These test andmeasurement circuits are not intended for direct connection to the MAINS buildinginstallations of Measurement Categories CAT II, CAT III, or CAT IV.Caution Do not connect the NI 9775 to signals or use for measurements withinMeasurement Categories II, III, or IV.Physical CharacteristicsIf you need to clean the module, wipe it with a dry towel.Tip For two-dimensional drawings and three-dimensional models of the C Seriesmodule and connectors, visit /dimensions and search by module number. Weight172 gHazardous LocationsU.S. (UL)Class I, Division 2, Groups A, B, C, D, T4;Class I, Zone 2, AEx nA IIC T4Canada (C-UL)Class I, Division 2, Groups A, B, C, D, T4;Class I, Zone 2, Ex nA IIC T4Europe (ATEX) and International (IECEx)Ex nA IIC T4 GcSafety and Hazardous Locations StandardsThis product is designed to meet the requirements of the following electrical equipment safety standards for measurement, control, and laboratory use:•IEC 61010-1, EN 61010-1•UL 61010-1, CSA 61010-1•EN 60079-0:2012, EN 60079-15:2010•IEC 60079-0: Ed 6, IEC 60079-15; Ed 4•UL 60079-0; Ed 6, UL 60079-15; Ed 4•CSA 60079-0:2011, CSA 60079-15:2012Note For UL and other safety certifications, refer to the product label or the OnlineProduct Certification section.14| | NI 9775 DatasheetElectromagnetic CompatibilityThis product meets the requirements of the following EMC standards for electrical equipment for measurement, control, and laboratory use:•EN 61326-1 (IEC 61326-1): Class A emissions; Industrial immunity•EN 55011 (CISPR 11): Group 1, Class A emissions•EN 55022 (CISPR 22): Class A emissions•EN 55024 (CISPR 24): Immunity•AS/NZS CISPR 11: Group 1, Class A emissions•AS/NZS CISPR 22: Class A emissions•FCC 47 CFR Part 15B: Class A emissions•ICES-001: Class A emissionsNote In the United States (per FCC 47 CFR), Class A equipment is intended foruse in commercial, light-industrial, and heavy-industrial locations. In Europe,Canada, Australia and New Zealand (per CISPR 11) Class A equipment is intendedfor use only in heavy-industrial locations.Note Group 1 equipment (per CISPR 11) is any industrial, scientific, or medicalequipment that does not intentionally generate radio frequency energy for thetreatment of material or inspection/analysis purposes.Note For EMC declarations and certifications, and additional information, refer tothe Online Product Certification section.CE ComplianceThis product meets the essential requirements of applicable European Directives, as follows:•2014/35/EU; Low-V oltage Directive (safety)•2014/30/EU; Electromagnetic Compatibility Directive (EMC)•2014/34/EU; Potentially Explosive Atmospheres (ATEX)Online Product CertificationRefer to the product Declaration of Conformity (DoC) for additional regulatory compliance information. To obtain product certifications and the DoC for this product, visit / certification, search by model number or product line, and click the appropriate link in the Certification column.NI 9775 Datasheet| © National Instruments| 15Shock and VibrationTo meet these specifications, you must panel mount the system.Operating vibrationRandom (IEC 60068-2-64) 5 g rms, 10 Hz to 500 HzSinusoidal (IEC 60068-2-6) 5 g, 10 Hz to 500 HzOperating shock (IEC 60068-2-27)30 g, 11 ms half sine; 50 g, 3 ms half sine;18 shocks at 6 orientations EnvironmentalRefer to the manual for the chassis you are using for more information about meeting these specifications.-40 °C to 70 °COperating temperature(IEC 60068-2-1, IEC 60068-2-2)-40 °C to 85 °CStorage temperature(IEC 60068-2-1, IEC 60068-2-2)Ingress protection IP40Operating humidity (IEC 60068-2-78)10% RH to 90% RH, noncondensing Storage humidity (IEC 60068-2-78)5% RH to 95% RH, noncondensing Pollution Degree2Maximum altitude5,000 mIndoor use only.Environmental ManagementNI is committed to designing and manufacturing products in an environmentally responsible manner. NI recognizes that eliminating certain hazardous substances from our products is beneficial to the environment and to NI customers.For additional environmental information, refer to the Minimize Our Environmental Impact web page at /environment. This page contains the environmental regulations and directives with which NI complies, as well as other environmental information not included in this document.Waste Electrical and Electronic Equipment (WEEE) EU Customers At the end of the product life cycle, all NI products must bedisposed of according to local laws and regulations. For more information abouthow to recycle NI products in your region, visit /environment/weee.16| | NI 9775 Datasheet电子信息产品污染控制管理办法(中国RoHS)中国客户National Instruments符合中国电子信息产品中限制使用某些有害物质指令(RoHS)。
NI PXIe-7868R R Series Reconfigurable I O 模块(AI、AO
GETTING STARTED GUIDENI PXIe-7868RR Series Reconfigurable I/O Module (AI, AO, DIO) for PXI Express, 6 AI, 18 AO, 48 DIO, 1 MS/s AIO, 512 MB DRAM,Kintex-7 325T FPGAThis document describes how to begin using the NI PXIe-7868R.Safety GuidelinesCaution Do not operate the NI PXIe-7868R in a manner not specified in thisdocument. Product misuse can result in a hazard. You can compromise the safetyprotection built into the product if the product is damaged in any way. If the productis damaged, return it to NI for repair.Caution This icon denotes a caution, which advises you to consult documentationwhere this symbol is marked.Electromagnetic Compatibility GuidelinesThis product was tested and complies with the regulatory requirements and limits for electromagnetic compatibility (EMC) stated in the product specifications. These requirements and limits provide reasonable protection against harmful interference when the product is operated in the intended operational electromagnetic environment.This product is intended for use in industrial locations. However, harmful interference may occur in some installations, when the product is connected to a peripheral device or test object, or if the product is used in residential or commercial areas. To minimize interference with radio and television reception and prevent unacceptable performance degradation, install and use this product in strict accordance with the instructions in the product documentation.Furthermore, any changes or modifications to the product not expressly approved by National Instruments could void your authority to operate it under your local regulatory rules.Caution To ensure the specified EMC performance, operate this product only withshielded cables and accessories.Caution To ensure the specified EMC performance, the length of all I/O cablesmust be no longer than 3 m (10 ft).Preparing the EnvironmentEnsure that the environment in which you are using the NI PXIe-7868R meets the following specifications.0 °C to 55 °COperating temperature(IEC 60068-2-1, IEC 60068-2-2)Operating humidity (IEC 60068-2-56)10% RH to 90% RH, noncondensing Pollution degree2Maximum altitude2,000 mIndoor use only.Note Refer to the device specifications on /manuals for completespecifications.2| | NI PXIe-7868R Getting Started GuideUnpacking the KitCaution To prevent electrostatic discharge (ESD) from damaging the device,ground yourself using a grounding strap or by holding a grounded object, such asyour computer chassis.1.Touch the antistatic package to a metal part of the computer chassis.2.Remove the device from the package and inspect the device for loose components or anyother sign of damage.Caution Never touch the exposed pins of connectors.Note Do not install a device if it appears damaged in any way.3.Unpack any other items and documentation from the kit.Store the device in the antistatic package when the device is not in use.Verifying the Kit ContentsVerify that the following items are included in the NI PXIe-7868R kit.Figure 1. NI PXIe-7868R Kit Contents1.Hardware2.NI-RIO Media3.Getting Started GuideNI PXIe-7868R Getting Started Guide | © National Instruments| 3Installing Software on the Host ComputerBefore using the NI PXIe-7868R, you must install the following application software and device drivers on the host computer.bVIEW 2017 or laterbVIEW Real-Time Module 2017 or later1bVIEW FPGA Module 2017 or later4.NI R Series Multifunction RIO Device Drivers July 2017 or laterVisit /info and enter the Info Code softwareversion for minimum software support information.Installing the NI PXIe-7868RCaution To prevent damage to the NI PXIe-7868R caused by ESD orcontamination, handle the module using the edges or the metal bracket.1.Ensure the AC power source is connected to the chassis before installing the module.The AC power cord grounds the chassis and protects it from electrical damage while you install the module.2.Power off the chassis.3.Inspect the slot pins on the chassis backplane for any bends or damage prior toinstallation. Do not install a module if the backplane is damaged.4.Remove the black plastic covers from all the captive screws on the module front panel.5.Identify a supported slot in the chassis. The following figure shows the symbols thatindicate the slot types.Figure 2. Chassis Compatibility Symbols1.PXI Express System Controller Slot2.PXI Peripheral Slot3.PXI Express Hybrid Peripheral Slot4.PXI Express System Timing Slot5.PXI Express Peripheral SlotNI PXIe-7868R modules can be placed in PXI Express peripheral slots, PXI Express hybrid peripheral slots, or PXI Express system timing slots.6.Touch any metal part of the chassis to discharge static electricity.1LabVIEW Real Time Module is only required when the R Series board is used in a chassis where the PXIe Controller is running a real-time operating system.4| | NI PXIe-7868R Getting Started Guide7.Place the module edges into the module guides at the top and bottom of the chassis. Slide the module into the slot until it is fully inserted.Figure 3. Module Installation1.Chassis2.System Controller3.Hardware Module4.Front-Panel Mounting Screws5.Module Guides6.Power Switch8.Secure the module front panel to the chassis using the front-panel mounting screws.Note Tightening the top and bottom mounting screws increases mechanicalstability and also electrically connects the front panel to the chassis, which can improve the signal quality and electromagnetic performance.9.Cover all empty slots using EMC filler panels or fill using slot blockers to maximize cooling air flow, depending on your application.10.Power on the chassis.Verifying Hardware Installation for Host TargetsYou can verify that the system recognizes the NI PXIe-7868R by using Measurement &Automation Explorer (MAX).unch MAX by navigating to Start »All Programs »National Instruments »MAX or byclicking the MAX desktop icon.2.Expand Devices and Interfaces .3.Verify that the device appears under Devices and Interfaces .If the device does not appear, press <F5> to refresh the view in MAX. If the device does not appear after refreshing the view, visit /support for troubleshooting information.NI PXIe-7868R Getting Started Guide | © National Instruments | 5Verifying Hardware Installation for RemoteT argetsYou can verify that the system recognizes the NI PXIe-7868R by using Measurement & Automation Explorer (MAX).unch MAX on the host computer.2.Expand Remote Targets in the configuration tree and locate your system.3.Install LabVIEW Real-Time Module 2017 and NI RIO Device Drivers July 2017 or lateron your Remote Target.a)Refer to the Installing Software on the Host Computer section for information aboutinstalling software on the host.b)Refer to the PXI Express Controllers User Manual at /manuals forinformation on installing software on the target.4.Under Remote Targets, find and expand Devices and Interfaces.If the device does not appear, press <F5> to refresh the view in MAX. If the device does not appear after refreshing the view, visit /support for troubleshooting information. Connecting the NI PXIe-7868RNI recommends using the following cables and accessories with the NI PXIe-7868R:6| | NI PXIe-7868R Getting Started GuideNote The SCB-68A DIP switches must be set for Direct Feedthrough mode for use with R Series devices. Visit /info and enter the Info Code scb68acables for more information on the SCB-68A accessory.Note NI is not liable for connections that exceed any of the maximum ratings of input or output signals on the NI PXIe-7868R and on the computer chassis. Refer to the NI PXIe-7868R Specifications, available at /info for the maximum input and output ratings for each signal.NI PXIe-7868R Getting Started Guide | © National Instruments| 7PinoutCONNECTOR 0(RMIO)CONNECTOR 2(RAO)CONNECTOR 1(RDIO)Table 2. NI PXIe-7868R Signal Descriptions8 | | NI PXIe-7868R Getting Started GuideTable 2. NI PXIe-7868R Signal Descriptions (Continued)The NI PXIe-7868R is protected from overvoltage and overcurrent conditions.Note Refer to the NI PXIe-7868R Specifications, available at /manuals formore information.Note The pinout label on the lid of the SCB-68A accessory is incompatible withthe NI PXIe-7868R. Refer to the NI 78xxR Pinout Labels for the SCB-68A, availableat /manuals for the compatible pinout labels.NI PXIe-7868R Getting Started Guide | © National Instruments| 9Where to Go NextWorldwide Support and ServicesThe NI website is your complete resource for technical support. At /support, you have access to everything from troubleshooting and application development self-help resources to email and phone assistance from NI Application Engineers.Visit /services for NI Factory Installation Services, repairs, extended warranty, and other services.Visit /register to register your NI product. Product registration facilitates technical support and ensures that you receive important information updates from NI.A Declaration of Conformity (DoC) is our claim of compliance with the Council of the European Communities using the manufacturer’s declaration of conformity. This system affords the user protection for electromagnetic compatibility (EMC) and product safety. You can obtain the DoC for your product by visiting /certification. If your product supports calibration, you can obtain the calibration certificate for your product at /calibration. 10| | NI PXIe-7868R Getting Started GuideNI corporate headquarters is located at 11500 North Mopac Expressway, Austin, Texas, 78759-3504. NI also has offices located around the world. For telephone support in the United States, create your service request at /support or dial 1 866 ASK MYNI (275 6964). For telephone support outside the United States, visit the Worldwide Offices section of / niglobal to access the branch office websites, which provide up-to-date contact information, support phone numbers, email addresses, and current events.NI PXIe-7868R Getting Started Guide | © National Instruments| 11Information is subject to change without notice. Refer to the NI T rademarks and Logo Guidelines at /trademarks for information on NI trademarks. Other product and company names mentioned herein are trademarks or trade names of their respective companies. For patents covering NI products/technology, refer to the appropriate location: Help»Patents in your software, the patents.txt file on your media, or the National Instruments Patent Notice at /patents. Y ou can find information about end-user license agreements (EULAs) and third-party legal notices in the readme file for your NI product. Refer to the Export Compliance Information at /legal/export-compliance for the NI global trade compliance policy and how to obtain relevant HTS codes, ECCNs, and other import/export data. NI MAKES NO EXPRESS OR IMPLIED WARRANTIES AS TO THE ACCURACY OF THE INFORMA TION CONTAINED HEREIN AND SHALL NOT BE LIABLE FOR ANY ERRORS. U.S. Government Customers: The data contained in this manual was developed at private expense and is subject to the applicable limited rights and restricted data rights as set forth in FAR 52.227-14, DFAR 252.227-7014, and DFAR 252.227-7015.© 2017 National Instruments. All rights reserved.378036B-01July 5, 2017。
基于FPGA的卷积神经网络并行加速器设计
0引言随着人工智能的快速发展,卷积神经网络越来越受到人们的关注。
由于它的高适应性和出色的识别能力,它已被广泛应用于分类和识别、目标检测、目标跟踪等领域[1]。
与传统算法相比,CNN 的计算复杂度要高得多,并且通用CPU 不再能够满足计算需求。
目前,主要解决方案是使用GPU 进行CNN 计算。
尽管GPU 在并行计算中具有自然优势,但在成本和功耗方面存在很大的缺点。
卷积神经网络推理过程的实现占用空间大,计算能耗大[2],无法满足终端系统的CNN 计算要求。
FPGA 具有强大的并行处理功能,灵活的可配置功能以及超低功耗,使其成为CNN 实现平台的理想选择。
FPGA 的可重配置特性适合于变化的神经网络网络结构。
因此,许多研究人员已经研究了使用FPGA 实现CNN 加速的方法[3]。
本文参考了Google 提出的轻量级网络MobileNet 结构[4],并通过并行处理和流水线结构在FPGA 上设计了高速CNN 系统,并将其与CPU 和GPU 的实现进行了比较。
1卷积神经网络加速器的设计研究1.1卷积神经网络的介绍在深度学习领域中,卷积神经网络占有着非常重要的地位,它的图像识别准确率接近甚至高于人类的识别水平。
卷积神经网络是同时具有层次结构性和局部连通性的人工神经网络[5]。
卷积神经网络的结构都是类似的,它们采用前向网络模型结构,节点使用神经元来实现分层连接。
并且,相邻层之间的节点是在局部区域内相连接,同一层中的一些神经元节点之间是共享连接权基于FPGA 的卷积神经网络并行加速器设计王婷,陈斌岳,张福海(南开大学电子信息与光学工程学院,天津300350)摘要:近年来,卷积神经网络在许多领域中发挥着越来越重要的作用,然而功耗和速度是限制其应用的主要因素。
为了克服其限制因素,设计一种基于FPGA 平台的卷积神经网络并行加速器,以Ultra96-V2为实验开发平台,而且卷积神经网络计算IP 核的设计实现采用了高级设计综合工具,使用Vivado 开发工具完成了基于FPGA 的卷积神经网络加速器系统设计实现。
FPGA的英文文献及翻译
Building Programmable Automation Controllers with LabVIEW FPGAOverviewProgrammable Automation Controllers(PACs)are gaining acceptance within the industrial control market as the ideal solution for applications that require highly integratedanalog and digital I/O,floating-point processing,and seamless connectivity to multiple processing nodes.National Instruments offers a variety of PAC solutions powered by onecommon software development environment,NI LabVIEW.With LabVIEW,you can buildcustom I/O interfaces for industrial applications using add-on software,such as the NI LabVIEW FPGA Module.With the LabVIEW FPGA Module and reconfigurable I/O(RIO)hardware,National Instruments delivers an intuitive,accessible solution for incorporating the flexibility andcustomizability of FPGA technology into industrial PAC systems.You can define the logicembedded in FPGA chips across the family of RIO hardware targets without knowing low-level hardware description languages(HDLs)or board-level hardware design details, as wellas quickly define hardware for ultrahigh-speed control,customized timing and synchronization,low-level signal processing,and custom I/O with analog,digital,and counters within a single device.You also can integrate your custom NI RIO hardware withimage acquisition and analysis,motion control,and industrial protocols,such as CAN andRS232,to rapidly prototype and implement a complete PAC system.Table of Contents1.IntroductionNI RIO2.Hardware for PACsBuilding PACs with LabVIEW and bVIEW FPGA ModuleFPGA Development4.FlowUsing NI SoftMotion to Create5.Custom Motion ControllersApplications6.Conclusion7.IntroductionYou can use graphical programming in LabVIEW and the LabVIEW FPGA Module to configure the FPGA(field-programmable gate array)on NI RIO devices.RIO technology,themerging of LabVIEW graphical programming with FPGAs on NI RIO hardware, provides aflexible platform for creating sophisticated measurement and control systems that you couldhardware.custom-designed with only create previouslyAn FPGA is a chip that consists of many unconfigured logic gates.Unlike the fixed, vendor-defined functionality of an ASIC(application-specific integrated circuit)chip, you canconfigure and reconfigure the logic on FPGAs for your specific application.FPGAs are usedin applications where either the cost of developing and fabricating an ASIC is prohibitive,orthe hardware must be reconfigured after being placed into service.The flexible, software-programmable architecture of FPGAs offer benefits such as high-performance execution ofcustom algorithms,precise timing and synchronization,rapid decision making,and simultaneous execution of parallel tasks.Today,FPGAs appear in such devices as instruments,consumer electronics,automobiles,aircraft,copy machines,and application-specific computer hardware.While FPGAs are often used in industrial control products,FPGA functionality has not previously been made accessible to industrial control engineers. Defining FPGAs has historically required expertise using HDL programming or complexdesign tools used more by hardware design engineers than by control engineers.With the LabVIEW FPGA Module and NI RIO hardware,you now can use LabVIEW, ahigh-level graphical development environment designed specifically for measurement andcontrol applications,to create PACs that have the customization,flexibility,and high-performance of FPGAs.Because the LabVIEW FPGA Module configures custom circuitry inhardware,your system can process and generate synchronized analog and digital signalsrapidly and deterministically.Figure1illustrates many of the NI RIO devices that you canconfigure using the LabVIEW FPGA Module.bVIEW FPGA VI Block Diagram and RIO Hardware PlatformsNI RIO Hardware for PACsHistorically,programming FPGAs has been limited to engineers who have in-depth knowledge of VHDL or other low-level design tools,which require overcoming a very steeplearning curve.With the LabVIEW FPGA Module,NI has opened FPGA technology to abroader set of engineers who can now define FPGA logic using LabVIEW graphical development.Measurement and control engineers can focus primarily on their test and controlapplication,where their expertise lies,rather than the low-level semantics of transferring logicinto the cells of the chip.The LabVIEW FPGA Module model works because of the tightintegration between the LabVIEW FPGA Module and the commercial off-the-shelf (COTS)hardware architecture of the FPGA and surrounding I/O components.National Instruments PACs provide modular,off-the-shelf platforms for your industrialcontrol applications.With the implementation of RIO technology on PCI,PXI,and CompactVision System platforms and the introduction of RIO-based CompactRIO,engineers nowhave the benefits of a COTS platform with the high-performance,flexibility,and customization benefits of FPGAs at their disposal to build PACs.National Instruments PCIand PXI R Series plug-in devices provide analog and digital data acquisition and control forhigh-performance,user-configurable timing and synchronization,as well as onboard decisionmaking on a single ing these off-the-shelf devices,you can extend your NI PXI orPCI industrial control system to include high-speed discrete and analog control, customsensor interfaces,and precise timing and control.NI CompactRIO,a platform centered on RIO technology,provides a small,industrially rugged,modular PAC platform that gives you high-performance I/O and unprecedentedflexibility in system timing.You can use NI CompactRIO to build an embedded system forapplications such as in-vehicle data acquisition,mobile NVH testing,and embedded machinecontrol systems.The rugged NI CompactRIO system is industrially rated and certified, and itis designed for greater than50g of shock at a temperature range of-40to70°C.NI Compact Vision System is a rugged machine vision package that withstands the harshenvironments common in robotics,automated test,and industrial inspection systems. NICVS-145x devices offer unprecedented I/O capabilities and network connectivity for distributed machine vision applications.NI CVS-145x systems use IEEE1394 (FireWire)technology,compatible with more than40cameras with a wide range of functionality, performance,and price.NI CVS-1455and NI CVS-1456devices contain configurable FPGAs so you can implement custom counters,timing,or motor control in yourvision application.Building PACs with LabVIEW and the LabVIEW FPGA ModuleWith LabVIEW and the LabVIEW FPGA Module,you add significant flexibility and customization to your industrial control hardware.Because many PACs are already programmed using LabVIEW,programming FPGAs with LabVIEW is easy because it usesthe same LabVIEW development environment.When you target the FPGA on an NI RIOdevice,LabVIEW displays only the functions that can be implemented in the FPGA, furthereasing the use of LabVIEW to program FPGAs.The LabVIEW FPGA Module Functionspalette includes typical LabVIEW structures and functions,such as While Loops,For Loops,Case Structures,and Sequence Structures as well as a dedicated set of LabVIEW FPGA-specific functions for math,signal generation and analysis,linear and nonlinear control,comparison logic,array and cluster manipulation,occurrences,analog and digital I/O, andtiming.You can use a combination of these functions to define logic and embed intelligencedevice.RIO NI your ontoFigure2shows an FPGA application that implements a PID control algorithm on the NIRIO hardware and a host application on a Windows machine or an RT target that communicates with the NI RIO hardware.This application reads from analog input0 (AI0),performs the PID calculation,and outputs the resulting data on analog output0(AO0). Whilethe FPGA clock runs at40MHz the loop in this example runs much slower because eachcomponent takes longer than one-clock cycle to execute.Analog control loops can run on anFPGA at a rate of about200kHz.You can specify the clock rate at compile time.This example shows only one PID loop;however,creating additional functionality on the NI RIOdevice is merely a matter of adding another While Loop.Unlike traditional PCFPGAs are parallel processors.Adding additional loops to your application does not affect theperformance of your PID loop.Figure2.PID Control Using an Embedded LabVIEW FPGA VI with Corresponding LabVIEW Host VI.FPGA Development FlowAfter you create the LabVIEW FPGA VI,you compile the code to run on the NI RIO hardware.Depending on the complexity of your code and the specifications of your development system,compile time for an FPGA VI can range from minutes to several hours.To maximize development productivity,with the R Series RIO devices you can use a bit-accurate emulation mode so you can verify the logic of your design before initiating thecompile process.When you target the FPGA Device Emulator,LabVIEW accesses I/O fromthe device and executes the VI logic on the Windows development computer.In this mode,you can use the same debugging tools available in LabVIEW for Windows,such as executionhighlighting,probes,and breakpoints.Once the LabVIEW FPGA code is compiled,you create a LabVIEW host VI to integrateyour NI RIO hardware into the rest of your PAC system.Figure3illustrates the developmentprocess for creating an FPGA application.The host VI uses controls and indicators on theFPGA VI front panel to transfer data between the FPGA on the RIO device and the hostprocessing engine.These front panel objects are represented as data registers within theFPGA.The host computer can be either a PC or PXI controller running Windows or a PC,PXI controller,Compact Vision System,or CompactRIO controller running a real-time operating system(RTOS).In the above example,we exchange the set point,PID gains, looprate,AI0,and AO0data with the LabVIEW host VI.bVIEW FPGA Development FlowThe NI RIO device driver includes a set of functions to develop a communication interface to the FPGA.The first step in building a host VI is to open a reference to the FPGAVI and RIO device.The Open FPGA VI Reference function,as seen in Figure2,also downloads and runs the compiled FPGA code during execution.After opening the reference,you read and write to the control and indicator registers on the FPGA using theRead/WriteControl function.Once you wire the FPGA reference into this function,you can simply selectwhich controls and indicators you want to read and write to.You can enclose the FPGARead/Write function within a While Loop to continuously read and write to the FPGA. Finally,the last function within the LabVIEW host VI in Figure2is the Close FPGA VIReference function.The Close FPGA VI Reference function stops the FPGA VI and closesthe reference to the device.Now you can download other compiled FPGA VIs to the device tochange or modify its functionality.The LabVIEW host VI can also be used to perform floating-point calculations,data logging,networking,and any calculations that do not fit within the FPGA fabric.For addeddeterminism and reliability,you can run your host application on an RTOS with the LabVIEW Real-Time bVIEW Real-Time systems provide deterministic processing engines for functions performed synchronously or asynchronously to the FPGA.For example,floating-point arithmetic,including FFTs,PID calculations,and custom controlalgorithms,are often performed in the LabVIEW Real-Time environment.Relevant data canbe stored on a LabVIEW Real-Time system or transferred to a Windows host computeroff-line analysis,data logging,or user interface displays.The architecture for this configuration is shown in Figure4.Each NI PAC platform that offers RIO hardware can runLabVIEW Real-Time VIs.plete PAC Architecture Using LabVIEW FPGA,LabVIEW Real-Time and Host PC Within each R Series and CompactRIO device,there is flash memory available to store acompiled LabVIEW FPGA VI and run the application immediately upon power up of thedevice.In this configuration,as long as the FPGA has power,it runs the FPGA VI, even if thehost computer crashes or is powered down.This is ideal for programming safety power downand power up sequences when unexpected events occur.Using NI SoftMotion to Create Custom Motion ControllersThe NI SoftMotion Development Module for LabVIEW provides VIs and functions to help you build custom motion controllers as part of NI PAC hardware platforms that caninclude NI RIO devices,DAQ devices,and Compact FieldPoint.NI SoftMotion provides allof the functions that typically reside on a motion controller DSP.With it,you can handle pathplanning,trajectory generation,and position and velocity loop control in the NI LabVIEWenvironment and then deploy the code on LabVIEW Real-Time or LabVIEWFPGA-basedtarget hardware.NI SoftMotion includes functions for trajectory generator and spline engine and examples with complete source code for supervisory control,position,and velocity controlloop using the PID algorithm.Supervisory control and the trajectory generator run on a LabVIEW Real-Time target and run at millisecond loop rates.The spline engine and thecontrol loop can run either on a LabVIEW Real-Time target at millisecond loop rates or on aLabVIEW FPGA target at microsecond loop rates.ApplicationsBecause the LabVIEW FPGA Module can configure low-level hardware design of FPGAs and use the FPGAs within in a modular system,it is ideal for industrial control applications requiring custom hardware.These custom applications can include a custom mixof analog,digital,and counter/timer I/O,analog control up to125kHz,digital control up to20MHz,and interfacing to custom digital protocols for the following:Batch control?Discrete control?Motion control?In-vehicle data acquisition?Machine condition monitoring?Rapid control prototyping(RCP)?Industrial control and acquisition?Distributed data acquisition and control?Mobile/portable noise,vibration,and harshness(NVH)analysis?ConclusionThe LabVIEW FPGA Module brings the flexibility,performance,and customization ofFPGAs to PAC ing NI RIO devices and LabVIEW graphical programming,youcan build flexible and custom hardware using the COTS hardware often required in industrialcontrol applications.Because you are using LabVIEW,a programming language already usedin many industrial control applications,to define your NI RIO hardware,there is nolearn VHDL or other low-level hardware design tools to create custom hardware. Using theLabVIEW FPGA Module and NI RIO hardware as part of your NI PAC adds significantflexibility and functionality for applications requiring ultrahigh-speed control, interfaces tocounters.and digital,analog,of mix I/O custom a or protocols,digital custom使用(现场可编程门阵列)模块开发可编程自动化控FPGALabVIEW制器综述工业控制上的应用要求高度集成的模拟和数字输入输出、浮点运算和多重处理节点的无缝连接。
一种基于胸阻抗法的心功能无创检测分析仪
一种基于胸阻抗法的心功能无创检测分析仪霍威;季忠;赵云冬【摘要】目的:为了经济、快捷、全面地实现心功能的无创检测,基于胸阻抗法成功研发了一种心功能无创检测分析仪,可方便地实现胸阻抗信号、心电信号、心音信号的同步检测分析,从而实现对患者心功能的无创综合评价。
该方法无毒无创,操作简单,完全可以实现家用化普及。
方法本文首先描述了系统的硬件模块构成,说明了胸阻抗信号的采集过程。
其次,使用FPGA芯片与DDS芯片构成系统的控制与信号发生核心,指出了恒流源的精度等性能指标。
再次,指出了胸阻抗信号处理的要点,运用互感原理实现干扰信号的隔离。
最后,介绍了仪器软件功能,并展示了仪器软件实测结果。
结果通过临床试验,对胸阻抗法与超声多普勒法检测的数据结果进行t检验,结果表明二者具有一致性。
结论由于采用了先进的特征点判别方法,该设备具有较高的临床检测精度和较好的临床适用性,可满足临床心功能无创检测和评估的要求。
%Objective A kind of heart function analytical deviceis invented based on impedance cardiography method to detect heart function economically, fast and comprehensively� The device can detect and analyze ICG, ECG and PCG signals synchronously, so comprehensive noninvasive assessment of heart function in patients can be achieved�Methods This article introduces the system and main hardware modules of the device, describes the process of ICG acquisition� FPGA and DDS are used as hardware core and performance indices of the device are pointed out� Key points of signal processing are presented and the electromagnetic principle of mutual inductance is used in this part�Results By analyzing and comparing clinical data, this article indicates thatthe device can gain more accurate clinical test data and has better clinical applicability�Conclusions Because of its advanced feature points’ extraction method, this device meets the requirements of clinical examination and assessment for heart function.【期刊名称】《北京生物医学工程》【年(卷),期】2015(000)005【总页数】6页(P489-494)【关键词】胸阻抗;仪器设计;临床数据分析【作者】霍威;季忠;赵云冬【作者单位】重庆大学生物工程学院重庆 400044;重庆大学生物工程学院重庆400044;重庆大学生物工程学院重庆 400044【正文语种】中文【中图分类】R318.04胸阻抗血流图检测是运用人体阻抗测量技术检测心搏量等生理参数以反映心功能的常用方法。
一种针对COTS器件的抗辐射加固方法
一种针对COTS器件的抗辐射加固方法梁健;张润宁;赵帅【摘要】随着商用现货(COTS)器件在空间任务中的广泛应用,COTS器件的抗辐射加固显得尤为重要,针对COTS器件在空间环境下易受宇宙射线和高能粒子冲击而产生辐射效应的特点,文章结合三模冗余(TMR)技术与现场可编程门阵列(FPGA)的重构技术,提出了一种基于TMR的可重构星载处理单元抗辐射加固方法.通过基于Markov过程的可靠度分析可知,冗余和重构技术相结合可以使处理单元具有更强的容错能力.文章利用实验模拟验证了该星载处理单元的各项关键技术,结果表明:此处理单元能够屏蔽单模故障,并能够定位和修复由空间复杂环境引发的软错误.【期刊名称】《航天器工程》【年(卷),期】2016(025)004【总页数】6页(P81-86)【关键词】星载处理单元;冗余;可重构;商用现货【作者】梁健;张润宁;赵帅【作者单位】航天东方红卫星有限公司,北京 100094;航天东方红卫星有限公司,北京 100094;西北工业大学,西安710072【正文语种】中文【中图分类】V473空间环境中存在较多的宇宙射线和高能粒子,运行在这种复杂环境下的航天器的计算系统很容易受到这些粒子和射线的冲击而产生辐射效应。
当前,在空间处理系统中,现场可编程门阵列(FPGA)以其低功耗、灵活性、通用性、高集成性等优点获得了广泛的应用。
基于静态随机存储器(SRAM)型的商用现货(COTS)FPGA器件很容易受到高能粒子的影响而产生辐射效应[1-3]。
而基于Flash架构的FPGA器件有较强的抗辐射性能,对由空间高能粒子引发的固件错误具有免疫能力,但是基于Flash架构的FPGA器件逻辑资源有限,不具备嵌入式软处理器的能力,故运算和处理能力受限,直接应用于复杂的星务管理和运算具有一定的局限性[4-5]。
传统的三模冗余(TMR)技术可以有效屏蔽单模故障,但表决器本身并不具备抗辐射能力,或者表决器由简单的逻辑开关组成,控制与协调能力不足[6-7]。
基于FPGA的无人机电调模块的设计
的 是I R 2 l 3 2 芯片 ,此 芯片 是 利用 电容 充放 电技 术 来驱 动 功 率MO S . F E T 设计 的, 由于 其 内部 驱动 器 的 导通 阻抗 小 , 因此在 MO S F E T 的 栅 极 以及源 极加 入 了 电阻 以保证 更好 的工 作 “ 。
含 了 电平 转 换 和 驱动 电路 等 简 单 电路 ,避 免 了MC U等 单指 令 周 期 芯 片 的时序 特 点 ,系 统硬 件结 构简 单 ,扩 展性 更 强 。能够 满足 各种 类型 无人 机 的实 时远 程控 制要 求 。 本 设计 系统 框 图如 图 l 所示 ,是 一个 闭环 反馈 系 统 ,控 制 量 是
图1无 人 机 电 调 模 块 原 理 图
巳
=  ̄l u F /=
l ∞V
图 2三 相 功 率 桥 电 路 以 及 位 置 检 测 电路
图3 PI D 控 制 器原 理 图
2 . 系 统硬 件 电 路 设 计 方 案
本 设 计 选 用 的 是 自带 数 字 型霍 尔 传 感 器 的无 刷 直 流 电机 , 电 极对 数 为 I , 可反 馈 三路 H a 、H b 、H c 霍尔 信 号 。额 定转 速 为3 0 0 0 r / mi n , 额 定 电压 为 2 4 V。硬 件 电路 部 分 包 括 功率 桥 电路 , 位置 检 测
的 无 人机 电调主 要 是 由单 片机 、A R M等 实现 的。采 用 基于 F P GA 设
计 无 人机 电调 ,充 分利 用 了F P G A并 行数 据 处理 能 力 和 同步 设 计优 势 ,将 大 部 分功 能 模块 都 集 成 在F P G A芯 片 内部 , 外 围 电路仅 仅 包
基于射频收发芯片的jesd204c链路层设计
基于射频收发芯片的jesd204c链路层设计英文版Design of JESD204C Link Layer Based on RF Transceiver Chip Abstract:This article presents the design of a JESD204C link layer based on an RF transceiver chip. The JESD204C standard is widely used in high-speed data converters, enabling efficient data transmission between converters and FPGA/ASIC-based processing units. This design focuses on the implementation of the link layer, which ensures reliable and high-speed data communication.Introduction:With the ever-increasing demand for high-speed data transmission in modern electronic systems, standards like JESD204C have become crucial. JESD204C, a standard for high-speed digital interfaces, provides a scalable and flexible mechanism for data converters to interface with processingunits. This article details the design considerations and implementation of the JESD204C link layer using an RF transceiver chip.Design Overview:The JESD204C link layer design involves several key components: physical layer interface, link layer controller, and error detection and correction mechanisms. The physical layer interface handles the low-level details of data transmission, such as clock synchronization and data encoding. The link layer controller manages the flow of data, ensuring reliable transmission and error-free reception. Error detection and correction mechanisms are essential to maintain data integrity.Implementation Details:The RF transceiver chip chosen for this design offers excellent performance and scalability. It supports high-speed data transmission and provides the necessary interfaces for JESD204C implementation. The physical layer interface is implemented using the transceiver's built-in features, while thelink layer controller is programmed using a suitable microcontroller or FPGA. Error detection and correction are handled using standard JESD204C protocols.Conclusion:The design of a JESD204C link layer based on an RF transceiver chip offers a robust and efficient solution for high-speed data transmission in modern electronic systems. This design ensures reliable data communication between converters and processing units, supporting the growing demand for high-performance data acquisition and processing.中文版基于射频收发芯片的JESD204C链路层设计摘要:本文介绍了一种基于射频收发芯片的JESD204C链路层设计。
专业英语句子
Unit 9A。
1. According to the Nyquist theorem,a signal with a maximum frequency of W Hz (called a band—limited signal)must be sampled at least 2W samples per second to ensure accurate recording。
根据奈奎斯特采样定理,为了确保准确记录信号,最高频率为W Hz的信号(称为带限信号),每秒内必须采集至少2W个样本。
2. This process begins by converting each digital code into an analog voltage that is proportional in size to the number represented by the code。
这个电压值在零阶保持器中保持到下一个码字出现,即需要保持一个采样间隔。
3。
Depending on the relationship between the signal frequencies and the sampling rate,spectral inversion may cause the shape of the spectrum in the baseband to be inverted from the true spectrum of the signal。
根据信号频率和采样频率之间的关系的不同,可能出现“频谱反转”现象——基带频谱的形状和信号真实频谱的形状正好相反。
4。
Field-Programmable Gate Arrays (FPGA) have the capability of being reconfigurable within a system,which can be a big advantage in applications that need multiple trial versions within development, offering reasonably fast time to market.现场可编程门阵列具有在系统可重新配置的能力,在开发需要多次试用的应用时,这是一个巨大的优势——它能提供快速的上市时间.5. However, for applications in which the end product must process answers in real time,or must do so while powered by consumer batteries,GPPs comparatively poor real time performance and high power consumption all but rules them out。
lesson16 翻译
不能第16课DSP的基本概念We don't speak in a digital signal. A digital signal is a language of 1s and 0s that canbe processed by mathematics. We speak in real-world, analog signals. Analog signalsare real world signals that we experience everyday-sound, light, temperature,and pressure. A digital signal is a numerical representation of the analog signal. It may beeasier and more cost effective to process these signals in the digital world. In the real world,we can convert these signals into digital signals through the analog-to-digital converter, process the signals,and if needed, bring the signals back out to the analogworld through the digital-to-analog converter.我们说话时不用数字信号。
一个数字信号是1和0的语言,可以通过数学处理。
在现实世界中说话我们用模拟信号。
模拟信号是现实世界中的信号,我们日常的经验,声,光,温度和压力。
数字信号是模拟信号的数字表示。
在世界上的这些数字信号,可能是更简单,更符合成本效益的过程。
在现实世界中,我们可以通过模数转换器将模拟信号转换成数字信号,对信号进行处理时,如果需要,将通过数模转换器转换成模拟信号。
FPGA技术介绍
外文翻译FPGA技术介绍系部:文翻译班级:姓名:学号:指导教师:年月日FPGA技术介绍概述:场域可程式化闸阵列FPGA技术正持续发展,而全世界FPGA市场的产值,则预估可从 2005 年的 19 亿美金提升到 2010 年的 27 亿 5 千万美金。
FPGA是在 1984 年由Xilinx 公司所发明,从简单的胶合逻辑Glue logic 晶片,演变为可取代客制的特定应用积体电路 ASIC 与处理器,适用于讯号处理与控制应用。
为何FPGA技术如此成功?此篇文章将介绍FPGA,并说明数项让FPGA如此独特的优点。
什么是FPGA?最笼统来说,FPGAs 即为可再程式化的晶片。
透过预先建立的逻辑区块与可程式化路由资源,不需更改面包板或焊锡部分,即可设定这些晶片以建置客制硬体功能。
使用者可于软体中开发数位运算系统 Computing task 并将之编译为组态档案或位元流Bitstream,可包含元件接线的相关资讯。
此外,FPGA完全为可重设性质,当使用者重新编译不同的电路设定时,可立刻拥有不同的特性。
在过去,工程师必须深入了解数位硬体设计,才能够使用FPGA技术。
然而,高阶设计工具的新技术可针对图形化程式区或 C 程式码,转换为数位硬体电路,即变更了FPGA程式设计的规则。
FPGA整合了 ASIC 与处理器架构系统的最佳部分,使FPGA晶片可应用于所有产业。
FPGA具有硬体时脉的速度与可靠性,且其仅需少量即可进行作业;可降低客制化 ASIC设计的费用。
可重新程式设计的晶片,具有与软体相同的弹性,却不受限于处理核心的数量。
与处理器不同的是,FPGA为实际的平行架构,因此不同的处理作业并不需要占用相同资源。
每个独立的处理作业均将指派至专属的晶片区块,不需影响其他逻辑区块即可自动产生功能。
因此,当新增其他处理作业时,应用某部分的效能亦不会受到影响。
FPGA技术的 5 大优点:效能–透过硬体的平行机制,FPGA可突破依序执行 Sequential execution 的固定运算,并于每时脉循环完成更多作业,以超越数位讯号处理器 DSP 的计算功能。
简单梳理一下IRS
简单梳理⼀下IRSWhat is IRS? IRS is inessence a reconfigurable metal surface equipped with of a large number of passive reflecting elements without RF chains. 智能反射⾯(IRS)是⼀种全新的⾰命性技术,它可以通过在平⾯上集成⼤量低成本的⽆源反射元件,智能地重新配置⽆线传播环境,从⽽显著提⾼⽆线通信⽹络的性能。
具体地说,IRS的不同元件可以独⽴地控制反射信号的幅度或相位来实现信号的增强或衰减。
Hareware architecture of IRS IRS的硬件是依靠“超表⾯(metasurface)”的概念实现的。
超表⾯(metasurface)是元原⼦阵列组成的超薄光学元件。
通过设计其元件的⼏何外形、尺⼨、⽅向和排列等,超表⾯可以控制光的相位、赋值。
IRS的硬件主要有以下⼏个部分组成: (1)最外层:称为reflecting element。
主要是由许多的反射元件组合⽽成的阵列,直接与⼊射信号接触。
反射元件可以通过电⼦元件制作⽽成(PIN⼆极管、场效应管或微机电系统(MEMS))。
⼀般来说,每个元件都嵌⼊PIN⼆极管,通过控制PIN⼆极管的偏置电压,PIN⼆极管可以在开和关之间转换,从⽽产⽣Pi的相移差。
(2)中间层:称为Copper backplane。
主要由铜板组成,防⽌信号的能量泄漏。
(3)最⾥层:称为Control circuit board。
主要是⼀个控制电路板,上⾯集成着IRS控制器(通常为FPGA实现)。
它主要控制每个反射元件的反射系数(反射幅度和相位),并通过⽆线链路与其他的⽹络组件(BS、AP、Devices)通信。
值得⼀提的是,连续的控制IRS元件是利于⽆线通信⽹络的。
但是在现实中,考虑到设计复杂度和成本问题,通常都会采取离散的幅度(相位)变化。
基于CompactRIO的脉冲发电机励磁控制器设计
Key words:pulse generator; excitation regulation; pulse-triggered; CompactRIO; LabVIEW FPGA
0 前言
中国环流器 2 号 M 装置( HL-2M) 是我国目前规
模最大、 参数最高的新一代核聚变实验研究装置, 在
构, 如图 3 所示, 励磁控制器的 PC 上位机为励磁控
能; CompactRIO 为励磁控制器的控制核心, 提供浮点
运算、 实时控制和逻辑处理等功能[10-11] 。 CompactRIO 嵌
入式硬件平台由实时控制器、 可重配置的FPGA和工
业级 I / O 模块三个部分组成 [12-13] 。 搭载NI Linux操作
internal communication of excitation controller is realized through TCP / IP protocol and shared
variables technology. The fundamental functions of the excitation controller are verified based on a
系统仍沿用 HL-2A 时期的 DOS 系统, 由多个功能模
块共同完成脉冲发电机励磁控制, 系统复杂度高且不
2023. №6
易维护, 同时也无法兼容日后 HL-2M 装置中央控制
系统对实时数据传输、 数据共享、 集总式控制的要
求
[8-9]
91
大 电 机 技 术
。 另外, 在高参数等离子体物理实验中, 脉冲
调节器以及基于 LabVIEW FPGA 实现励磁触发器。
LDPC码的级联译码算法的改进与实现
LDPC码的级联译码算法的改进与实现陈猛【摘要】针对中短码长中LDPC码的OSD串行级联译码算法,给出了一种FPGA实现方案.该方案基于FPGA芯片中的块RAM资源,实现了OSD译码中GF(2)上的高斯消元算法,避免了其对逻辑资源的大量消耗.结果表明,该实现方案可在中低端FPGA上实现500 kbit·s-1吞吐量的LDPC码OSD串行级联译码器.【期刊名称】《电子科技》【年(卷),期】2014(027)006【总页数】4页(P156-159)【关键词】信道编码;低密度奇偶校验码;可靠性译码;对数似然比累积;现场可编程门阵列【作者】陈猛【作者单位】中航雷达与电子设备研究院1部,江苏无锡214063【正文语种】中文【中图分类】TN911.22LDPC(LDPC:Low Density Parity-Check)码是一类具有逼近香农限的译码性能纠错码,近年来受到了广泛的关注和研究。
文献[1]中给出了LDPC 码在基于置信传播(Belief Propagation,BP)译码算法下的性能限,其推导建立在由Tanner 图中环所带来的错误传播可忽略的假设上,这就要求LDPC 码的码长达到一定长度。
然而在实际应用中,由于对系统时延的要求,使得LDPC 码的码长不能过长,这就可能造成较大的译码性能损失。
而级联译码算法被广泛应用于中短码长LDPC码的译码中,已达到性能与复杂度的折中。
LDPC 码的级联译码是指将BP 译码所输出的软信息传送给基于可靠性的软判决译码器进行译码。
文献[2]将排序统计译码(Ordered Statistic Decoding,OSD)算法嵌入BP 译码迭代之中,即在每次BP 迭代后均进行一次OSD 译码,从而有效提高了译码性能并减少了迭代的次数。
更加实用的级联译码方案是指在最后一次迭代后进行可靠性译码,即将BP 译码与可靠性译码串行级联。
文献[4]给出了一种串行级联译码方案,该方案通过对对数似然比进行累积(LLRA:Log-Likelihood Ratio Accumulation)以消除BP 算法输出软信息的震荡,从而改善送给可靠性译码的软信息的准确性。
基于FPGA技术的HDLC帧收发器的设计与实现的开题报告
基于FPGA技术的HDLC帧收发器的设计与实现的开题报告一、研究背景与意义在现代通信系统中, HDLC(High Level Data Link Control)协议是一种广泛应用的数据链路层协议,在数据传输领域具有重要的意义。
HDLC协议能够实现可靠的数据传输、错误检测和纠错等功能,被广泛应用于各种通信系统中,如计算机网络、通信卫星、传感器网络等。
而在HDLC协议中,帧的生成和解析是其最基本的部分,因此设计一种高效、可靠的HDLC帧收发器技术,对实现基于HDLC协议的数据通信具有重要的意义。
而FPGA(Field-Programmable Gate Array)技术是一种高性能、高灵活性的可编程逻辑器件,具有并行处理能力强、延迟低、功耗小、可扩展性高等特点,被广泛应用于数字电路设计领域。
基于FPGA技术设计和实现HDLC帧收发器,能够实现高速、高效的帧传输和处理,提高通信系统的可靠性和稳定性。
二、研究内容与目标本次研究的主要内容是基于FPGA技术设计和实现一种高效、可靠的HDLC帧收发器,实现对于HDLC帧的收发和解析。
具体包括以下内容:1. HDLC协议的理论分析和研究,了解HDLC帧格式和协议流程,确定设计方案和实现策略;2. FPGA器件的选型和系统设计,包括硬件电路设计和信号处理等方面的内容;3. 基于Verilog HDL进行设计和实现,包括模块化设计、状态机设计、时序控制等;4. 通过仿真和实际硬件测试验证设计的正确性和可行性;本研究的目标是实现一种高效、可靠、具有扩展性的HDLC帧收发器,满足各种通信系统对于数据传输和处理的需求,并且与现有通信系统的兼容性良好。
三、研究方法1. 理论分析法:对HDLC协议进行深入研究和理论分析,确定设计方案和实现策略,为后续的FPGA设计提供理论基础和指导;2. 硬件电路设计法:选用适合的FPGA器件进行系统设计,包括电路设计和信号处理等方面的内容,确定硬件电路结构和信号流程;3. Verilog HDL设计法:基于Verilog HDL进行设计和实现,包括模块化设计、状态机设计、时序控制等,实现HDLC协议的帧收发和解析功能;4. 仿真和测试法:采用Modelsim等仿真工具进行软件仿真,通过实际硬件测试验证设计的正确性和可行性。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
RECONFIGURABLE RADIO WITHFPGA-BASED APPLICATION-SPECIFIC PROCESSORS Rob Jackson (Altera European Technology Centre; High Wycombe; UK; rjackson@) Sambuddhi Hettiaratchi (European Technology Centre; High Wycombe; UK: shettiar@); Mike Fitton (European Technology Centre; High Wycombe; UK; mfitton@); Steven Perry (European Technology Centre; HighWycombe; UK; sperry@)ABSTRACTFPGAs are a viable solution for the computationally intensive signal processing requirements in software-defined radio (SDR). The conventional approach to partially or fully reconfigure the device to meet different requirements results in an associated delay and/or over-provision of resources required. We demonstrate a new approach based on application-specific integrated processors (ASIPs) where the FPGA is configured to provide a number of functional units controlled by a simple processor and associated program. Reconfiguration then merely requires modification of the program and may be performed quickly and simply.1.INTRODUCTIONFPGAs are widely used for computationally intensive signal processing applications. We describe a novel approach to implementing algorithms on FPGAs based on ASIPs.An ASIP combines the flexibility of a software approach with the efficiency and performance of dedicated hardware. It is a processor that has been specialised to perform certain tasks or classes of tasks efficiently and with a required level of performance. An ASIP may be derived from a general purpose processor by varying the number and type of function units (arithmetic, logical, load/store, register-file), introducing new types of application-specific function unit (multiply-add, permute, address generation), and by changing the internal topology of processor—i.e., the pattern of interconnection between function units.Digital signal processors and network processors can be viewed as ASIPs targeting specific classes of application. In general, the more specialised the processor, the greater the efficiency and less the applicability.There are a number of advantages to implementing an ASIP using an FPGA. While a processor implemented on a FPGA may be larger and slower than one implemented as an ASIC, they use the same process technology. FPGAs are typically fabricated using a more advanced process than is available to the majority of ASIC applications. An FPGA processor implementation is also able to benefit from improvements in process technology as they are introduced with new generations of FPGA devices. FPGAs have become large enough to contain a complete system on a single device, hence the name “system on a programmable chip” (SOPC). The single-chip design allows tight integration between multiple processors, memories, and other system components [1].We believe building ASIPs using FPGAs has further advantages. Modern FPGAs have all the building blocks necessary to build a processor: digital signal processing (DSP) blocks (multiply-accumulators), small and large RAMs, and fast arithmetic and logic. The FPGA fabric supports variation of the number, type, and topology of the function units and creation of new types of function unit. Highly specialised FPGA-based ASIPs can be quickly and cheaply produced and offer very high levels of efficiency.Furthermore, for computationally intensive tasks that require a hardware solution to meet performance requirements, the ASIP-on-FPGA model has a number of benefits over traditional design flows using hardware description languages like VHDL and Verilog.In contrast to the traditional hardware design flows, the ASIP approach naturally separates behaviour (algorithm) from implementation (architecture). For example, sharing (or reuse) of hardware blocks is much more naturally expressed in terms of a processor. In a hardware description language, multiplexers and control signals must be written explicitly, whereas they are implicit in a given ASIP architecture.The assembler and compiler tools available for an ASIP manage pipeline delays, schedule operations, and automatically generate a control program. These tasks are performed manually when writing a hardware description.The ASIP program is explicitly stated using a sequential programming model with instruction-level parallelism. A hardware description implicitly encodes the program in the behaviour of a number of parallel processes.Control flow is expressed as a series of arithmetic and logical operations applied to a set of variables. In an ASIP this may be implemented using a register file and an arithmetic logic unit (ALU) with each compare-and-branchoperation performed sequentially. A hardware implementation may include a finite state machine, multiple registers, and comparators, and perform multiple comparisons in a single step (an N-way branch). However,in many computationally intensive applications, control flow represents a small fraction of the total number of operations and is not performance critical. The smaller, denser ASIP may require more cycles but may also operate faster. The ASIP may also be quickly and easily reprogrammed by changing the contents of the program memory. A hardware approach will require the FPGA to be reconfigured to make even trivial changes to its behaviour.As a result of a higher degree of algorithm-implementation separation, the ASIP approach is more amenable to design automation and design space exploration than traditional hardware design flows. Moreover, this technique permits an easier migration from programmable logic to ASICs without compromising its ability to be reconfigured.2. CONSTRUCTING AN ASIP2.1 Analysis and DesignThe first step in constructing an ASIP is to analyse the application that is to be executed. A simple analysis can yield characteristics such as the amount of memory required and the number and type of operations. Given a real-time constraint such as throughput or latency, we can translate algorithm requirements into resource requirements: i.e., a simple implementation of a 15-tap finite impulse response (FIR) filter requires 15 multiply-add operations, so if givena throughput of 1,000,000 samples per second, we require resources to provide 15 MMACS (million multiply-accumulated operations per second).By comparing the algorithmic resource requirements against the features and performance of various functionFigure 1. Architecture of an ASIP Notes:1.FU = function unit2.Mux – multiplexerunits, we can begin to define an ASIP architecture.The availability, classes, and performance of function units depend on the FPGA device targeted. For instance, using the DSP blocks on Altera’s largest Stratix® II device, we could construct 384 18-bit multiply-accumulate function units with appropriate rounding and saturation, and each performing 370 MMACS [2]. A straightforward 32-bit add/subtract unit implemented using logic cells can perform 350 million add/subs per second.Memory requirements are often more important than computational requirements. The ASIP architecture must include sufficient memory units to satisfy the bandwidth requirements of the algorithm. Modern high-density FPGAs typically contain a range of memory resources which can be combined or configured to provide different sized memories.Designing the ASIP architecture is an iterative process whereby the processor architecture and the algorithm are progressively refined. The cost of candidate processor architecture may be estimated relatively easily by counting the resources used. The performance of the algorithm can be assessed by a more detailed analysis of implementation of the key sections of the application (such as innermost loop bodies).Reaching the optimal implementation using this approach may, in general, require a number of iterations. However, we find in practise that an acceptable architecture can be designed for many applications relatively quickly.For applications that have been implemented in software or in previous hardware generations, modern FPGAs often provide vastly more computational power than is required. We can, therefore, apply standard architectural templates to generate ASIP architectures: for DSP-like algorithms, an architecture template with N memories feeding M multipliers and accumulators, and for control applications, a register file feeding an ALU.2.2 Architecture DefinitionTypically a microprocessor is defined in terms of a non-application-specific instruction set. This instruction set is somewhat abstract, often relying on a set of logical register names. An instruction-set-based processor will require logic to:•Detect hazards: for example, when a register is read but its value has not been completely computed•Forward results: either by directly moving a computed value from the output of one function unit to the input of another while bypassing the register file, or bydetecting which physical location contains the current value of a logical register nameIn an FPGA, many function units are pipelined and so the number of hazard conditions and the number of forwarding paths will be large.When a function unit can receive its arguments from a number of sources, a multiplexer must be included to select between them. The multiplexer must be constructed out of logic cells, thereby adding to the resources needed for the processor and potentially slowing its execution. A four-to-one multiplexer implemented in an Altera® Stratix FPGA will consume twice as many resources as an add/sub unit.When constructing an ASIP, we include no hazard detection hardware. We require the assembler/compiler to statically schedule instructions so that no hazards occur. Moving this analysis to compile time means the processors generated are smaller and can run faster.However, this form of static scheduling may lead to inefficient programs. Some hazards are data dependent either due to control flow or through address generation and cannot be determined until the program executes. A static schedule must be pessimistic to ensure correctness. For most applications, fortunately, the inefficiency we encounter in the program is more than compensated by obtaining a faster and smaller processor. In DSP algorithms, there is often little data-dependent execution, and memory accesses are regular and analyzable, so the potential benefits of run-time hazard detection are minimal.When constructing an FPGA-targeted ASIP, we only include those routing paths which are required by the application. The need for multiplexers is removed or reduced, resulting in a smaller and faster processor. The resulting ASIP architecture has much in common with a transport-triggered architecture (TTA) and the finite state machine plus datapath (FSM-D) architecture generated by many high-level behavioural synthesis tools [3][4].Figure 1 shows the general form of an FPGA-based ASIP. Pipelined memory and counter programs supply the machine with an encoded instruction word. The memory program is typically included within the processor and exploits the dual-port facilities of the memories to allow external sources to load program code.The encoded instruction word feeds a decode block that decodes the data to provide a set of control signals for the processor. Control signals include immediate values such as literals, register file read and write addresses, function unit enable and operation select signals, and multiplexer operand-select codes.The processing core includes a set of function units and the multiplexers that route data between them. The function units include memories, registers, basic arithmetic and logic units, and multiply-add blocks. These blocks may exploit specific features of the FPGA device or may rely on standard libraries such as the library of parameterized modules (LPM) [5]. In addition, custom application-specific units may be included.Function units implementing bus-masters, slaves, general purpose I/O (GPIOs), and streaming point-to-point protocols provide I/O functionality.Figure 2. Combined FFT/FIR ASIP Architecture2.3 ProgramEach ASIP defines its own custom assembly language. For each processor we also construct a custom assembler which translates an executable specification of the algorithm into the correct control signals. The assembler schedules operations and loops to avoid hazards and minimise execution time.3. RECONFIGURABLE RADIOSDR implements algorithms in a software form to improve portability, lifetime costs, and retargetability. However, achieving cost and performance requirements necessitates the use of application-specific hardware. Fast Fourier transform (FFT) and FIR filters are representative of the typical algorithms used in SDR. Therefore, we use FFT and FIR implementations to demonstrate the effectiveness of the ASIP on FPGA methodology.3.1 Fast Fourier TransformThe details of the FFT algorithm and its implementation are well understood and widely known [6][7], so we will not describe them in detail here. The N-point FFT can be implemented as a series of log2(N) passes across the source data set. Each pass requires N/2 multiplies. Therefore, a 1024-point FFT requires 5120 multiply operations.Our FFT ASIP is primarily composed of a dual-ported data memory and a single-ported coefficient memory with associated address generation units. A single 16 x 16 multiplier and a 16-bit arithmetic block are included. We include an accumulator and single-ported register file (memory) to act as a simple general purpose controller.3.2 Finite Impulse ResponseThe FIR filter is the building block of many DSP applications. We chose the simplest FIR structure that requires a single ported data and coefficient, a multiplier, along with an accumulator.3.3 Combined FFT/FIRThe architecture of the processor we have specified is shown in Figure 2. In effect, we have built a simple digital signal processor capable of performing FIR and FFT. The instruction set exposes all the registers and potential hazards. The FFT and FIR algorithms are expressed in terms of the operations the machine performs and the tools automatically schedule each operation to avoid pipelined hazards.The final ASIP can be considered as either an application-specific VLIW digital signal processor or a hardware block implementing the FFT algorithm using a microcoded FSM.The hardware is simple and small (322 Logic Elements (LEs), one 16 x 16 multiplier) and fast (230 MHz in a Stratix -5 part). A 1024-point radix-2 complex FFT takes 21850 cycles and executes in approximately 95 µs. Moreover, the FFT/FIR ASIP takes less than 5 percent of the available LEs and DSP blocks on the smallest Altera Stratix device.In comparison, a complex radix-2 FFT implemented on a C62x DSP device from Texas Instruments would take 20840 cycles at 300 MHz (69.5 µs) to compute a 1024 point FFT [8]. Although the DSP solution is faster than the specific ASIP implementation reported here, a DSP solution requires an entire digital signal processor, while the ASIP on FPGA solution requires less than five percent of a Stratix device.Another alternative solution would be to use an IP Core optimised for a particular family of FPGAs. The Radix-4 FFT IP core from Xilinx, for example, takes 1869 logic slices (a Xilinx logic slice contains four 4-input look-up tables (LUTs) compared to an Altera LE containing one 4-input LUT) and takes 4145 clock cycles at 100 MHz (41.45 µs) [9]. A radix-4 FFT implemented using our ASIP on FPGA methodology takes about 6368 cycles at 256Hz (under 25 µs) and 600 LEs.4. CONCLUSIONWe have described a methodology for constructing customised application-specific processors targeting an FPGA. We believe that this architectural style leads to efficient algorithm implementations in modern FPGA devices, with a FFT/FIR processor as an example. FPGA-implemented ASIPs are a good target for SDR as they give the performance of an FPGA but can be re-programmed without having to complete a hardware change cycle with its associated risks. They also remain reprogrammable if the FPGA design is converted to a structured ASIC like Hardcopy®.5. REFERENCES[1] Altera Corporation, NIOS II Processor Reference Handbook,/literature/lit-nio2.jsp, 2004.[2] Altera Corporation, Stratix II Device Handbook,/literature/lit-stx2.jsp, 2004.[3] H. Corporaal, Microprocessor Architectures from VLIW toTTA, Wiley, Chichster, 1998.[4] D. Gajski, N. Dutt, and A. Wu, High-Level Synthesis, Kluwer,Boston, 1992.[5] Electronic Industry Alliance, “EIA/IS-103-A Library ofParameterizedModules.”http://www.ediforg/lpmweb, 1999. [6] P. Duhamel and M. Vetterli. "Fast Fourier Transforms: ATutorial Review." Signal Processing, Vol. 19, pp 259-299, 1990.[7] G.D. Gergkand "A Guided Tour of the Fast FourierTransform." IEEE Spectrum, Vol. 6, pp 41-52, July 1969. [8] C62x DSP Benchmarks, Texas Instruments, Inc., latestversion.[9] LogiCore High Performance 1024-Point Complex FFT/IFFT,Xilinx, Inc., July 2000.。