FPGA Foreign-Language Literature

English-Language FPGA Literature with Translation

Building Programmable Automation Controllers with LabVIEW FPGA

Overview

Programmable Automation Controllers (PACs) are gaining acceptance within the industrial control market as the ideal solution for applications that require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes. National Instruments offers a variety of PAC solutions powered by one common software development environment, NI LabVIEW. With LabVIEW, you can build custom I/O interfaces for industrial applications using add-on software, such as the NI LabVIEW FPGA Module.

With the LabVIEW FPGA Module and reconfigurable I/O (RIO) hardware, National Instruments delivers an intuitive, accessible solution for incorporating the flexibility and customizability of FPGA technology into industrial PAC systems. You can define the logic embedded in FPGA chips across the family of RIO hardware targets without knowing low-level hardware description languages (HDLs) or board-level hardware design details. You can also quickly define hardware for ultrahigh-speed control, customized timing and synchronization, low-level signal processing, and custom I/O with analog, digital, and counter channels within a single device. In addition, you can integrate your custom NI RIO hardware with image acquisition and analysis, motion control, and industrial protocols, such as CAN and RS232, to rapidly prototype and implement a complete PAC system.

Table of Contents

1. Introduction
2. NI RIO Hardware for PACs
3. Building PACs with LabVIEW and the LabVIEW FPGA Module
4. FPGA Development Flow
5. Using NI SoftMotion to Create Custom Motion Controllers
6. Applications
7. Conclusion

Introduction

You can use graphical programming in LabVIEW and the LabVIEW FPGA Module to configure the FPGA (field-programmable gate array) on NI RIO devices.
RIO technology, the merging of LabVIEW graphical programming with FPGAs on NI RIO hardware, provides a flexible platform for creating sophisticated measurement and control systems that you could previously create only with custom-designed hardware.

An FPGA is a chip that consists of many unconfigured logic gates. Unlike an ASIC (application-specific integrated circuit), whose functionality is fixed and vendor-defined, an FPGA lets you configure and reconfigure its logic for your specific application. FPGAs are used in applications where either the cost of developing and fabricating an ASIC is prohibitive, or the hardware must be reconfigured after being placed into service. The flexible, software-programmable architecture of FPGAs offers benefits such as high-performance execution of custom algorithms, precise timing and synchronization, rapid decision making, and simultaneous execution of parallel tasks. Today, FPGAs appear in such devices as instruments, consumer electronics, automobiles, aircraft, copy machines, and application-specific computer hardware. While FPGAs are often used in industrial control products, FPGA functionality has not previously been made accessible to industrial control engineers. Defining FPGAs has historically required expertise in HDL programming or in complex design tools used more by hardware design engineers than by control engineers.

With the LabVIEW FPGA Module and NI RIO hardware, you now can use LabVIEW, a high-level graphical development environment designed specifically for measurement and control applications, to create PACs that have the customization, flexibility, and high performance of FPGAs. Because the LabVIEW FPGA Module configures custom circuitry in hardware, your system can process and generate synchronized analog and digital signals rapidly and deterministically. Figure 1 illustrates many of the NI RIO devices that you can configure using the LabVIEW FPGA Module.

Figure 1. LabVIEW FPGA VI Block Diagram and RIO Hardware Platforms

NI RIO Hardware for PACs

Historically, programming FPGAs has been limited to engineers who have in-depth knowledge of VHDL or other low-level design tools, which require overcoming a very steep learning curve. With the LabVIEW FPGA Module, NI has opened FPGA technology to a broader set of engineers who can now define FPGA logic using LabVIEW graphical development. Measurement and control engineers can focus primarily on their test and control application, where their expertise lies, rather than on the low-level semantics of transferring logic into the cells of the chip. The LabVIEW FPGA Module model works because of the tight integration between the LabVIEW FPGA Module and the commercial off-the-shelf (COTS) hardware architecture of the FPGA and surrounding I/O components.

National Instruments PACs provide modular, off-the-shelf platforms for your industrial control applications. With the implementation of RIO technology on PCI, PXI, and Compact Vision System platforms and the introduction of RIO-based CompactRIO, engineers now have the benefits of a COTS platform together with the high performance, flexibility, and customization of FPGAs at their disposal to build PACs. National Instruments PCI and PXI R Series plug-in devices provide analog and digital data acquisition and control for high-performance, user-configurable timing and synchronization, as well as onboard decision making on a single device. Using these off-the-shelf devices, you can extend your NI PXI or PCI industrial control system to include high-speed discrete and analog control, custom sensor interfaces, and precise timing and control.

NI CompactRIO, a platform centered on RIO technology, provides a small, industrially rugged, modular PAC platform that gives you high-performance I/O and unprecedented flexibility in system timing.
You can use NI CompactRIO to build an embedded system for applications such as in-vehicle data acquisition, mobile NVH testing, and embedded machine control systems. The rugged NI CompactRIO system is industrially rated and certified, and it is designed to withstand more than 50 g of shock and to operate over a temperature range of -40 to 70 °C.

NI Compact Vision System is a rugged machine vision package that withstands the harsh environments common in robotics, automated test, and industrial inspection systems. NI CVS-145x devices offer unprecedented I/O capabilities and network connectivity for distributed machine vision applications. NI CVS-145x systems use IEEE 1394 (FireWire) technology, compatible with more than 40 cameras with a wide range of functionality, performance, and price. NI CVS-1455 and NI CVS-1456 devices contain configurable FPGAs so you can implement custom counters, timing, or motor control in your machine vision application.

Building PACs with LabVIEW and the LabVIEW FPGA Module

With LabVIEW and the LabVIEW FPGA Module, you add significant flexibility and customization to your industrial control hardware. Because many PACs are already programmed using LabVIEW, programming FPGAs is straightforward: it uses the same LabVIEW development environment. When you target the FPGA on an NI RIO device, LabVIEW displays only the functions that can be implemented in the FPGA, further easing the use of LabVIEW to program FPGAs. The LabVIEW FPGA Module Functions palette includes typical LabVIEW structures and functions, such as While Loops, For Loops, Case Structures, and Sequence Structures, as well as a dedicated set of LabVIEW FPGA-specific functions for math, signal generation and analysis, linear and nonlinear control, comparison logic, array and cluster manipulation, occurrences, analog and digital I/O, and timing.
You can use a combination of these functions to define logic and embed intelligence onto your NI RIO device.

Figure 2 shows an FPGA application that implements a PID control algorithm on the NI RIO hardware and a host application on a Windows machine or an RT target that communicates with the NI RIO hardware. This application reads from analog input 0 (AI0), performs the PID calculation, and outputs the resulting data on analog output 0 (AO0). While the FPGA clock runs at 40 MHz, the loop in this example runs much slower because each component takes longer than one clock cycle to execute. Analog control loops can run on an FPGA at a rate of about 200 kHz. You can specify the clock rate at compile time. This example shows only one PID loop; however, creating additional functionality on the NI RIO device is merely a matter of adding another While Loop. Unlike traditional PC processors, FPGAs are parallel processors, so adding loops to your application does not affect the performance of your PID loop.

Figure 2. PID Control Using an Embedded LabVIEW FPGA VI with Corresponding LabVIEW Host VI

FPGA Development Flow

After you create the LabVIEW FPGA VI, you compile the code to run on the NI RIO hardware. Depending on the complexity of your code and the specifications of your development system, compile time for an FPGA VI can range from minutes to several hours.

To maximize development productivity, with the R Series RIO devices you can use a bit-accurate emulation mode to verify the logic of your design before initiating the compile process. When you target the FPGA Device Emulator, LabVIEW accesses I/O from the device and executes the VI logic on the Windows development computer. In this mode, you can use the same debugging tools available in LabVIEW for Windows, such as execution highlighting, probes, and breakpoints.

Once the LabVIEW FPGA code is compiled, you create a LabVIEW host VI to integrate your NI RIO hardware into the rest of your PAC system.
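The PID loop described for Figure 2 (read AI0, compute the PID output, write AO0) can be sketched in ordinary Python to make the arithmetic concrete. This is only an illustration under stated assumptions, not LabVIEW FPGA code: the `PID` class is a textbook positional-form controller, the first-order plant, its time constant, and the gains are hypothetical, and floating-point math is used for readability where an FPGA design would normally use fixed-point arithmetic. The helper `ticks_per_iteration` simply relates the 40 MHz clock to the roughly 200 kHz loop rate quoted above.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (positional form)."""
    kp: float
    ki: float
    kd: float
    dt: float                 # loop period in seconds
    integral: float = 0.0     # running integral of the error
    prev_error: float = 0.0   # error from the previous iteration

    def step(self, setpoint: float, measurement: float) -> float:
        """One loop iteration: sample in, control value out."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def ticks_per_iteration(clock_hz: float, loop_hz: float) -> int:
    """Clock cycles available to each loop iteration."""
    return round(clock_hz / loop_hz)


def simulate(steps: int = 4000) -> float:
    """Close the loop around a hypothetical first-order plant.

    The plant (time constant 1 ms) and the gains are assumptions chosen
    only so that the example settles; they do not come from the article.
    """
    dt = 1.0 / 200_000                  # 200 kHz loop rate
    pid = PID(kp=2.0, ki=2000.0, kd=0.0, dt=dt)
    tau = 1e-3                          # plant time constant in seconds
    ai0 = 0.0                           # simulated analog input reading
    for _ in range(steps):
        ao0 = pid.step(1.0, ai0)        # set point fixed at 1.0
        ai0 += dt * (ao0 - ai0) / tau   # forward-Euler plant update
    return ai0


if __name__ == "__main__":
    print(ticks_per_iteration(40e6, 200e3))
    print(simulate())
```

At a 200 kHz loop rate, each iteration has a budget of 200 cycles of the 40 MHz clock, which is why the loop can be far slower than the clock while still being fast for analog control.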
Figure 3 illustrates the development process for creating an FPGA application. The host VI uses controls and indicators on the FPGA VI front panel to transfer data between the FPGA on the RIO device and the host processing engine. These front panel objects are represented as data registers within the FPGA. The host computer can be either a PC or PXI controller running Windows, or a PC, PXI controller, Compact Vision System, or CompactRIO controller running a real-time operating system (RTOS). In the PID example of Figure 2, the set point, PID gains, loop rate, AI0, and AO0 data are exchanged with the LabVIEW host VI.

Figure 3. LabVIEW FPGA Development Flow

The NI RIO device driver includes a set of functions to develop a communication interface to the FPGA. The first step in building a host VI is to open a reference to the FPGA VI and RIO device. The Open FPGA VI Reference function, as seen in Figure 2, also downloads and runs the compiled FPGA code during execution. After opening the reference, you read and write the control and indicator registers on the FPGA using the Read/Write Control function. Once you wire the FPGA reference into this function, you can simply select which controls and indicators you want to read and write. You can enclose the Read/Write Control function within a While Loop to continuously read from and write to the FPGA. The last function within the LabVIEW host VI in Figure 2 is the Close FPGA VI Reference function, which stops the FPGA VI and closes the reference to the device. You can then download other compiled FPGA VIs to the device to change or modify its functionality.

The LabVIEW host VI can also be used to perform floating-point calculations, data logging, networking, and any calculations that do not fit within the FPGA fabric. For added determinism and reliability, you can run your host application on an RTOS with the LabVIEW Real-Time Module.
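The host-side sequence described above (open a reference, read and write registers in a loop, close the reference) can be sketched as follows. Everything here is a hypothetical pure-Python stand-in written for illustration: `MockFpga`, `open_fpga_reference`, and the register names are assumptions and are not part of any NI driver or API. The placeholder logic inside the mock (AO0 = Setpoint - AI0) only exists so that the loop has something to read back.

```python
from contextlib import contextmanager
from typing import Dict, Iterable, List


class MockFpga:
    """Hypothetical stand-in for a RIO device; NOT an NI API."""

    def __init__(self) -> None:
        # Front panel controls/indicators modeled as named registers.
        self.registers: Dict[str, float] = {"Setpoint": 0.0, "AI0": 0.0, "AO0": 0.0}
        self.running = False

    def run(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def write(self, name: str, value: float) -> None:
        if name not in self.registers:
            raise KeyError(name)
        self.registers[name] = value
        if self.running:
            # Placeholder for the compiled FPGA logic (assumption).
            self.registers["AO0"] = self.registers["Setpoint"] - self.registers["AI0"]

    def read(self, name: str) -> float:
        return self.registers[name]


@contextmanager
def open_fpga_reference(device: MockFpga):
    """Start the device on entry; always stop it on exit (the 'close' step)."""
    device.run()
    try:
        yield device
    finally:
        device.stop()


def host_loop(device: MockFpga, setpoints: Iterable[float]) -> List[float]:
    """Write each set point, then read back the output register."""
    readings = []
    with open_fpga_reference(device) as fpga:
        for sp in setpoints:
            fpga.write("Setpoint", sp)
            readings.append(fpga.read("AO0"))
    return readings


if __name__ == "__main__":
    print(host_loop(MockFpga(), [1.0, 2.0]))
```

Wrapping the open and close steps in a context manager mirrors the pairing of the open and close functions at the two ends of the host VI: the device is stopped even if the loop body raises an error.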
LabVIEW Real-Time systems provide deterministic processing engines for functions performed synchronously or asynchronously to the FPGA. For example, floating-point arithmetic, including FFTs, PID calculations, and custom control algorithms, is often performed in the LabVIEW Real-Time environment. Relevant data can be stored on a LabVIEW Real-Time system or transferred to a Windows host computer for off-line analysis, data logging, or user interface displays. The architecture for this configuration is shown in Figure 4. Each NI PAC platform that offers RIO hardware can run LabVIEW Real-Time VIs.

Figure 4. Complete PAC Architecture Using LabVIEW FPGA, LabVIEW Real-Time, and Host PC

Within each R Series and CompactRIO device, there is flash memory available to store a compiled LabVIEW FPGA VI and run the application immediately upon power-up of the device. In this configuration, as long as the FPGA has power, it runs the FPGA VI, even if the host computer crashes or is powered down. This is ideal for programming safe power-down and power-up sequences when unexpected events occur.

Using NI SoftMotion to Create Custom Motion Controllers

The NI SoftMotion Development Module for LabVIEW provides VIs and functions to help you build custom motion controllers as part of NI PAC hardware platforms that can include NI RIO devices, DAQ devices, and Compact FieldPoint. NI SoftMotion provides all of the functions that typically reside on a motion controller DSP. With it, you can handle path planning, trajectory generation, and position and velocity loop control in the NI LabVIEW environment and then deploy the code on LabVIEW Real-Time or LabVIEW FPGA-based target hardware.

NI SoftMotion includes functions for the trajectory generator and spline engine, as well as examples with complete source code for supervisory control and for position and velocity control loops using the PID algorithm.
Supervisory control and the trajectory generator run on a LabVIEW Real-Time target at millisecond loop rates. The spline engine and the control loop can run either on a LabVIEW Real-Time target at millisecond loop rates or on a LabVIEW FPGA target at microsecond loop rates.

Applications

Because the LabVIEW FPGA Module can configure the low-level hardware design of FPGAs and use the FPGAs within a modular system, it is ideal for industrial control applications requiring custom hardware. These custom applications can include a custom mix of analog, digital, and counter/timer I/O, analog control up to 125 kHz, digital control up to 20 MHz, and interfacing to custom digital protocols for the following:

• Batch control
• Discrete control
• Motion control
• In-vehicle data acquisition
• Machine condition monitoring
• Rapid control prototyping (RCP)
• Industrial control and acquisition
• Distributed data acquisition and control
• Mobile/portable noise, vibration, and harshness (NVH) analysis

Conclusion

The LabVIEW FPGA Module brings the flexibility, performance, and customization of FPGAs to PAC platforms. Using NI RIO devices and LabVIEW graphical programming, you can build the flexible, custom hardware often required in industrial control applications from COTS hardware. Because you are using LabVIEW, a programming language already used in many industrial control applications, to define your NI RIO hardware, there is no need to learn VHDL or other low-level hardware design tools to create custom hardware. Using the LabVIEW FPGA Module and NI RIO hardware as part of your NI PAC adds significant flexibility and functionality for applications requiring ultrahigh-speed control, interfaces to custom digital protocols, or a custom I/O mix of analog, digital, and counters.

Developing Programmable Automation Controllers with the LabVIEW FPGA (Field-Programmable Gate Array) Module

Overview

Industrial control applications require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes.

FPGA Foreign-Language Material 133

System-Level FPGA Device Driver with High-Level Synthesis Support
Kizheppatt Vipin, Suhaib A. Fahmy, Nachiket Kapre
I. INTRODUCTION

FPGAs are used in both embedded platforms and specialized, stand-alone, bespoke computing systems (e.g. PCI Pamette [1], Splash [2], BEE2 [3]). The ability to design custom interfaces allows data to be streamed to and from FPGA logic pipelines at very high throughput. We have seen FPGAs make their way into commodity computing platforms as first-class computing devices in tandem with CPUs. Some platforms allow FPGAs to be integrated over a PCIe interface (e.g. Xilinx ML605, VC707, Altera DE4, Maxeler Max3), some over Ethernet (e.g. Maxeler 10G, NetFPGA), and others using a CPU-socket FSB interface (e.g. Convey HC, Nallatech ACP). There have also been recent attempts at creating open-source, standards-inspired interfaces (e.g. RIFFA, OpenCPI, SIRC), which further ease the design burden. We investigate the design and engineering of an FPGA driver that (1) is portable across multiple physical interfaces, and (2) provides simple plug-and-play composition with high-level synthesis tools. In this regard, the stable CUDA driver API is an example of effective driver interface design. It supports a variety of CUDA-capable GPU devices in a high-performance, portable manner. It provides a limited set of interaction primitives that are precise, clear, and behave consistently across different GPU devices. In the context of an FPGA driver, we have a harder challenge: we must design both the hardware and software components of the driver. The idea of considering

English-Language Literature Related to FPGA Papers

FPGA Implementation of RS232 to Universal Serial Bus Converter

1. V. Vijaya, (PhD) M.Tech, Assoc. Professor, VCEW, vsrtej@yahoo.co.in
2. Rama Valupadasu, (PhD) M.Tech, Asst. Professor, NIT, Warangal, agnivesh91@yahoo.co.in
3. B. Rama Rao Chunduri, PhD, M.Tech, Professor, NIT, Warangal, cbrr@nitw.ac.in
4. Ch. Kranthi Rekha, M.Tech, Asst. Professor, LUC, Mantin, Malaysia, madakranthirekha@yahoo.co.in
5. B. Sreedevi, M.Tech, Assoc. Professor, VCEW, vaagvijs_15@yahoo.co.in

Abstract: Universal Serial Bus (USB) is a new personal computer interconnection protocol, developed to make the connection of peripheral devices to a computer easier and more efficient. It reduces the cost for the end user, improves communication speed, and supports simultaneous attachment of multiple devices (up to 127). RS232, on the other hand, was designed for single-device connection, but is one of the most widely used communication protocols. An embedded converter from RS232 to USB is very interesting, since it would allow serial-based devices to benefit from the advantages of USB without major changes. This work describes the specification and development of such a converter, and it is also a useful guide for implementing other USB devices. The main blocks in the implementation are the USB device, the UART (RS232 protocol engine), and the interface FIFO logic. The USB device block has to detect and respond to events at a USB port, and it has to provide a way for the device to store data to be sent and retrieve data that have been received. The UART consists of different blocks that handle serial communication through the RS232 protocol, with a set of control registers to control the data transfer. The interface FIFO logic uses a FIFO to bridge the data rate differences between the USB and RS232 protocols.

Index Terms: First-In-First-Out, RS-232, Universal Asynchronous Receive Transmit, Universal Serial Bus.

I. INTRODUCTION

This paper describes the specification and implementation of a converter from RS232 to USB (Universal Serial Bus).
This converter is responsible for receiving data from a peripheral device's serial interface and sending it to a computer's USB interface. In the same way, it must be able to send data from the PC's USB interface to the device. The problems faced with the old standards stimulated the development of a new communication protocol, which should be easier to use, faster, and more efficient.

RS232 is a definition for serial communication on a one-to-one basis. RS232 defines the interface layer, but not the application layer. To use RS232 in a specific situation, application-specific software must be written on the devices at both ends of the connecting RS232 cable. RS232 ports can be accessed either directly by an application or via a device driver in the operating system.

USB is a new personal computer interconnection standard developed by industry and telecommunication leaders, which implements Plug and Play technology. It allows the connection of multiple devices (up to 127), easing the attachment of devices to PCs. USB is a low-cost solution and supports transfer rates up to 12 Mb/s, covering the low-speed and mid-speed data ranges. The use of a converter from a serial interface to USB would free a serial communication port for other applications, allowing a device that uses a serial interface to communicate through a USB interface.

USB, on the other hand, is a bus system that allows more than one peripheral to be connected to a host computer via one USB port. Hubs can be used in the USB chain to extend the cable length and allow even more devices to connect to the same USB port. The standard describes not only the physical properties of the interface, but also the protocols to be used. Because of the complex USB protocol requirements, communication with USB ports on a computer is always performed via a device driver. This way, we are not limited by the availability of a serial port and we can benefit from the advantages of USB.
Using a converter allows us to leave the device unchanged, making the converter responsible for handling the differences between the protocols. This work was based on a protocol engine that can be managed by exchanging data with a PC across a serial interface. Most of the time, this communication is not done constantly, since it requires a serial port reserved just for it. This paper presents the converter implementation, focusing on the development process, which comprises the device itself and the PC-side software that communicates with it. This methodology can be extended to other devices. We first present some important USB standard concepts. Then, we define the system specification, divided into host and device requirements. Afterwards, we describe the hardware (UART) features and the software design and implementation. Finally, we discuss the results achieved and future work.

II. PROBLEM DESCRIPTION

The USB specification describes bus attributes, protocol definition, programming interface, and other features required to design and build systems and peripherals compliant with the USB standard. We briefly explain the features used in our project.

2011 IEEE Symposium on Computers & Informatics

The USB interface does not give this flexibility. When, however, an RS232 port is used via a USB to RS232 converter, this flexibility should be present in some way. Therefore, to use an RS232 port via a USB port, a second device driver is necessary which emulates an RS232 UART but communicates via USB. USB works as a Master/Slave bus, where the USB Host is the Master and the devices are the Slaves. The only system resources required by a USB system are the memory locations used by USB system software and the memory and/or I/O address space and IRQ line used by the USB host controller. USB devices can be functional (displays, mice, etc.) or hubs, used to connect other devices to the bus. They can be implemented as low or high-speed devices.
Low-speed devices are limited to a maximum rate of 1.5 Mb/s. Each device has a number of individual registers, known as Endpoints, which are indirectly accessed by the device drivers for data exchange. Each endpoint supports particular transfer characteristics and has a unique address and direction. A special case is Endpoint 0, which is used for control operations and can do bi-directional transfers. It must be present in all devices. According to the device's characteristics, other types of endpoints can be defined. The USB Host detects the attachment and detachment of devices, initiating the enumeration process and managing all the following transactions. It is responsible for installing the device driver (based on information provided by device descriptors), for automatically reconfiguring the system (hot attachment), and for collecting statistics and the status of each device.

A device's descriptors specify the attributes and characteristics of the USB device and describe its communication requirements (Endpoint Descriptors). The USB host uses this information to configure the device, to find its driver, and to access it. Devices with similar functions are grouped into classes [1, 2] in order to share common features and even use the same device drivers. Each class can define its own descriptors (class-specific descriptors), for example, HID (Human Interface Device) Class Descriptors and Report Descriptors. The HID class consists of devices used by people to control computer systems.
It defines a structure that describes a HID device, with specific communication requirements. Given its characteristics, the converter can be implemented as a HID device, using already developed HID drivers. A HID device's descriptors must support an Interrupt IN endpoint, and the firmware must also contain a report descriptor that defines the format for transmitted and received device data.

A. Requests

The USB protocol is based on requests sent by the host and processed by the USB devices. These requests can be directed to a device or to a specific endpoint in it. Standard requests must be implemented by all devices and are used for configuring a device and controlling the state of its USB interface, among other features. Two HID-specific requests must be supported by the converter: Set Report and Get Report. These requests enable the device to receive and send generic device information to the host. The Set Report request is the only way the host can send data to a HID device, since the device does not have an Interrupt OUT endpoint.

B. Communication Flow

USB is a shared bus, and many devices might use it at the same time. The devices share the bandwidth using a token-based protocol commanded by the host. USB communication is based on transferring data at regular intervals called frames. A frame is composed of one or more transactions that must be executed within 1 ms. USB data transfers are typically originated by a USB Device Driver when it needs to communicate with its device. It supplies a memory buffer used to store the data in transfers to or from the USB device. The USB Driver provides the interface between the USB Device Driver and the USB Host Controller, translating transfer requests into USB transactions, consistent with the bandwidth requirements and protocol structure. Some of these transfers consist of a large block of data, which needs to be split into several transactions.
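The splitting of a large transfer into several transactions can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the default packet size of 8 bytes is the low-speed maximum, and the assumption that an interrupt endpoint carries at most one transaction per polling interval is used to estimate a lower bound on the transfer time. The 10 ms polling interval in the usage example is an arbitrary assumed value.

```python
from typing import List


def split_into_transactions(data: bytes, max_packet: int = 8) -> List[bytes]:
    """Cut one transfer into payloads no larger than max_packet bytes."""
    if max_packet <= 0:
        raise ValueError("max_packet must be positive")
    return [data[i:i + max_packet] for i in range(0, len(data), max_packet)]


def min_transfer_time_ms(length: int, max_packet: int = 8, interval_ms: int = 10) -> int:
    """Lower bound on transfer time for an interrupt endpoint.

    Assumes at most one transaction per polling interval; interval_ms is
    an assumed example value, not a figure taken from the paper.
    """
    transactions = -(-length // max_packet)   # ceiling division
    return transactions * interval_ms


if __name__ == "__main__":
    payload = bytes(range(20))
    for packet in split_into_transactions(payload):
        print(packet.hex())
    print(min_transfer_time_ms(len(payload)))
```

For a 20-byte block and 8-byte packets, the split yields three transactions of 8, 8, and 4 bytes.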
The Host Controller generates the transaction based on the Transfer Descriptor, which describes how the frame is shared among the requests of the several devices. When a transaction is sent to the bus, all devices see it. Each transaction begins with a packet that determines its type and the endpoint address. The USB driver controls this addressing scheme. Inside the device, the USB Device Layer comprises the actual USB communication mechanism and transfer characteristics. The USB Logical Device implements a collection of endpoints that comprise a given functional interface, which can be manipulated by its respective USB client.

C. Transfer Types

The USB specification defines four transfer types: Control, Interrupt, Isochronous, and Bulk. Control transfers send requests and data relating to the device's abilities and configuration. They can also be used to transfer blocks of information for any other purpose. Control transfers consist of a Setup stage, followed by a Data stage, which is composed of one or more Data transactions, and a Status stage. All data transactions in a Data stage must be in the same direction (IN or OUT). Interrupt transfers are typically used for devices that need to transfer data at regular intervals and consequently must be polled periodically. The polling interval is defined in the Endpoint Descriptor. The data payload for this kind of transfer for low-speed devices is 8 bytes. Error correction is done in this kind of transfer. The two other transfer types, Isochronous and Bulk, are used respectively for devices that need a guaranteed transfer rate and for transfers of large blocks of data. They are not used in this work.

III. PROCEDURE/ALGORITHM

A. System Specification

To develop a USB peripheral we need all of the following: A host that supports USB. Driver software on the host to communicate with the peripheral. An application executing on the host that communicates with the peripheral device. A UART with a USB interface.
Code on the USB controller to carry out the USB communication. Code on the USB controller to carry out the peripheral functions.

A hardware-specific problem arises from handshaking, which is used to prevent buffer overflows at the receiver's side. RS232 applications can use two types of handshaking: either control commands in the data stream, called software flow control, or physical lines, called hardware flow control. Not all USB to RS232 converters provide these hardware flow control lines, and it is not always easy to tell whether an application needs them. Some applications do not use hardware flow control at all, and cheap USB to RS232 converters will work with them without problems. Other applications use hardware flow control, but infrequently: only with large data bursts, or in situations where the CPU is busy performing other tasks, might hardware flow control kick in to prevent data loss. In those situations, communication may seem error-free, but bytes are sometimes lost or unspecified errors occur. In the UART, a FIFO is used to store sent and received data in the USB communication process. Two endpoints were defined for the converter: the first is Endpoint 0, used for control operations, and the second is an Interrupt IN endpoint, defined for sending data to the host. This way, a converter from a serial interface to USB can be implemented as a HID device with the features mentioned above.

Fig 2. RS232 to USB Converter

B. HOST REQUIREMENTS

The choice of the Operating System used by the host was made in 1999, based on the USB support it provides. It should provide the entire driver infrastructure and support the protocol characteristics, for example, Plug and Play. The host must be able to receive USB data using its device drivers and make them available to the applications that made the request.
It is essential that we have a driver in the host to process USB transfers, recognizing the device and receiving and sending data to a USB device.

C. Device Requirements

Some communication requirements, such as transmission speed, frequency, and amount of data to be transferred, were essential in the process of defining the UART to be used. Considering the speeds available for USB devices, it was clear that the converter could be implemented as a low-speed device, where the communication speed varies from 10 to 100 kb/s. Considering the amount of data transferred and the transmission frequency, the converter was defined to use Interrupt transfers, a transfer type where considerable amounts of data must be transferred in predefined amounts of time. The host is responsible for checking from time to time whether the device needs to transmit data. Interrupt transfers can be done in both directions, but not at the same time. For the converter, they could be used to send and receive data from the PC. The Operating System provides HID drivers that allow us to use this transfer type. The maximum packet size for one transaction is 8 bytes for low-speed devices. If we are sending larger amounts of data, they need to be split into many transactions, since USB is a shared bus. Another feature defined for the converter was the number of endpoints needed. As explained before, endpoints are buffers.

Fig 1. RS232 to USB Interface Diagram

III. HARDWARE DESCRIPTION

It is a low-cost solution for low-speed applications with high I/O requirements. RS232 ports which are physically mounted in a computer are often powered by three power sources: +5 Volts for the UART logic, and -12 Volts and +12 Volts for the output drivers.
USB, however, provides only a +5 Volt power source. Some USB to RS232 converters use integrated DC/DC converters to create the appropriate voltage levels for the RS232 signals; in other implementations, the +5 Volt supply is used directly to drive the output.

The UART has a serial interface to the RS232 driver. The operation of the UART is controlled by an external host processor. There is an 8-bit data interface to the host, along with read and write control signals. The clock is fed from an external crystal. The family is compliant with the USB specification [1] and supports one address and three data endpoints [5]. A UART with three endpoints was chosen so that, beyond the Interrupt IN endpoint, an Interrupt OUT endpoint could be used for receiving data from the host. Its definition requires an odd endpoint number besides Endpoint 0. This configuration could not be implemented at the time the project was being developed, since the Operating System did not support Interrupt OUT endpoints, which were defined in a later version of the specification.

The instruction set has been optimized specifically for USB operations. The USB controller provides one USB device address with three endpoints. The USB device address is assigned to the device and saved in the USB Device Address Register (7 bits) during the USB enumeration process. The USB controller communicates with the host using dedicated FIFOs, one per endpoint. Each endpoint FIFO is implemented as 8 bytes of dedicated SRAM, and each is monitored and controlled through its Mode Register and Count Register.

IV. SOFTWARE DESIGN AND IMPLEMENTATION
The development of the converter was divided into phases: descriptor definition; device detection and enumeration module (request treatment); serial data exchange module; USB/serial module interface; and USB data exchange module (request treatment).
Defining these phases does not imply that they cannot overlap.

A. Descriptor definition
The main data structure to be implemented consists of the device descriptors, as defined by the USB specification [1]. These descriptors store information about the device and the USB communication process, and the host uses them to identify the device and its characteristics. The Device Descriptor is the first descriptor the host reads on device attachment. It includes the basic information the host needs in order to retrieve further characteristics from the device. Its field values were defined according to the converter characteristics [7]. To implement a new device, some of these values must be re-evaluated and changed if necessary. The converter was defined to use just one interface and two endpoints (Control and Interrupt IN). Interrupt OUT endpoints were defined only in a later version of the HID specification. To work around this, data packets are sent to the UPS across Endpoint 0, using the SET REPORT request, and received through Endpoint 1, using Interrupt transfers. The device receives data through Output Reports, defined as sixteen 8-bit fields according to the largest command sent to the UPS. It sends data to the host through Input Reports, defined as eight 8-bit fields. Report Descriptors define the size and use of the data that implements the device's functionality.

B. Device detection and enumeration
The second phase consists of implementing the code that enables the host to detect and enumerate the device. These routines were based on example code [8, 9, 10]. The code must access the descriptors and recognize and respond to the request codes that the host sends when it enumerates the device.

C. The process of sending and receiving data
Data is sent to the UPS through Control Transfers using SET REPORT on Endpoint 0.
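As a rough illustration of the SET REPORT path described here, the sketch below packs the 8-byte setup stage of a HID SET_REPORT control transfer and splits a 16-byte Output Report into packets of at most 8 bytes, the low-speed limit quoted earlier. The request constants come from the USB HID class definition; the function names, the report ID and interface number of 0, and the zero padding of short commands are assumptions made for illustration and are not taken from the converter's firmware.

```python
import struct
from typing import List

# Constants from the USB HID class definition.
REQ_HOST_TO_INTERFACE = 0x21   # bmRequestType: host-to-device, class, interface
HID_SET_REPORT = 0x09          # bRequest code for SET_REPORT
REPORT_TYPE_OUTPUT = 0x02      # high byte of wValue for an Output report

MAX_PACKET = 8     # low-speed maximum packet size, as stated in the text
REPORT_SIZE = 16   # Output Report length used by the converter


def build_setup_packet(report_id: int, interface: int, length: int) -> bytes:
    """Pack the 8-byte SETUP stage of a SET_REPORT control transfer."""
    w_value = (REPORT_TYPE_OUTPUT << 8) | report_id
    # Little-endian: bmRequestType, bRequest, wValue, wIndex, wLength.
    return struct.pack("<BBHHH", REQ_HOST_TO_INTERFACE, HID_SET_REPORT,
                       w_value, interface, length)


def pad_command(command: bytes, size: int = REPORT_SIZE) -> bytes:
    """Zero-pad a command to the fixed report size (padding is an assumption)."""
    if len(command) > size:
        raise ValueError("command longer than the Output Report")
    return command.ljust(size, b"\x00")


def split_data_stage(report: bytes, max_packet: int = MAX_PACKET) -> List[bytes]:
    """Split a report into data packets no larger than max_packet bytes."""
    return [report[i:i + max_packet] for i in range(0, len(report), max_packet)]
```

Calling `split_data_stage(pad_command(b"\x01\x02\x03"))` yields two 8-byte packets, which is why a 16-byte report needs more than one transaction on a low-speed control endpoint.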
The host sends a request to the USB device, indicating that it wants to send data. An interrupt informs the device when new data has arrived on Endpoint 0, and the corresponding Interrupt Service Routine copies it into a data buffer, which is used in the serial communication process. The maximum packet size received from the host was defined according to the largest command that must be sent to the UPS; the function must be changed to allow receiving an arbitrary number of bytes. These routines are called after the host or the controller sends a packet to the bus. The Endpoint 0 ISR handles reception.

Using hardware flow control implies that more lines must be present between the sender and the receiver, leading to a thicker and more expensive cable. Software flow control is therefore a good alternative when maximum communication performance is not needed. Software flow control uses the data channel between the two devices, which reduces the bandwidth, although in most cases the reduction is too small to be a reason not to use it.

With hardware flow control, the computer first sets its RTS line to signal the device that some information is present. The device checks whether there is room to receive the information and, if so, sets the CTS line to start the transfer. With a null modem connection, this is somewhat different. There are two ways to handle this type of handshaking in that situation. In the first, the RTS of each side is connected to the CTS of the other. The communication protocol then differs somewhat from the original one: the RTS output of computer A signals computer B that A is capable of receiving information, rather than requesting to send information as in the original configuration. This type of communication can be performed with a null modem cable for full handshaking.
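The original RTS/CTS exchange described above (the sender raises RTS, the receiver answers with CTS only when it has room) can be modeled in a few lines. This is a minimal behavioral sketch, not driver code: the `Receiver` class, its buffer capacity, and the `send` helper are hypothetical names introduced only for this example.

```python
from collections import deque


class Receiver:
    """Device side: asserts CTS only while its buffer has room."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = deque()

    def cts(self, rts: bool) -> bool:
        # Answer a request to send only if at least one more byte fits.
        return rts and len(self.buffer) < self.capacity

    def accept(self, byte: int) -> None:
        self.buffer.append(byte)

    def drain(self, count: int) -> list:
        # Simulate the device consuming up to `count` buffered bytes.
        return [self.buffer.popleft()
                for _ in range(min(count, len(self.buffer)))]


def send(data: bytes, receiver: Receiver) -> int:
    """Send bytes while CTS is asserted; return how many were transferred."""
    sent = 0
    for byte in data:
        rts = True                   # sender signals that data is waiting
        if not receiver.cts(rts):    # no room: hold the rest back
            break
        receiver.accept(byte)
        sent += 1
    return sent
```

With a 4-byte buffer, `send(b"abcdef", Receiver(4))` transfers only the first four bytes; the remaining two stay with the sender until the receiver drains its buffer, so nothing is silently dropped.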
Although using this cable is not completely compatible with the way hardware flow control was originally designed, software that is properly designed for it can achieve the highest possible speed, because there is no overhead for requesting on the RTS line and answering on the CTS line.

In the second case of null modem communication with hardware flow control, the software side looks quite similar to the original use of the handshaking lines. The CTS and RTS lines of one device are connected directly to each other, so the request to send answers itself: as soon as the RTS output is set, the CTS input detects a high logical value indicating that sending is allowed. Without further checking, information is therefore always sent as soon as a device requests it. To prevent this, two other pins on the connector are used: data set ready (DSR) and data terminal ready (DTR). These two lines indicate whether the attached device is working properly and willing to accept data. When these lines are cross-connected (as in most null modem cables), flow control can be performed using them. A DTR output is set if that computer accepts incoming characters.

V. RESULT ANALYSIS
Fig. 4. Waveforms of the RS232 to USB converter
Fig. 5. RTL schematics
Fig. 6. The routed design

VI. CONCLUSIONS
An embedded converter from RS232 to USB is designed in this project. VHDL is used to implement all of these blocks, and the ModelSim simulator is used for functional simulation of the design. USB reduces the cost for the end user, improves communication speed, and supports simultaneous attachment of multiple devices (up to 127). The USB 2.0 protocol operates at up to 480 Mbps. The FPGA implementation of the design is done on a Spartan-3E FPGA (XC3S500E).
The design used 6% of the FPGA area, and a maximum frequency of 130 MHz was obtained.

ACKNOWLEDGMENT
We are grateful to the management of Vaagdevi College of Engineering, Warangal; NIT Warangal; and Linton University College, Mantin, for providing the facilities to complete the project in time.

REFERENCES
1. Ana Luiza de Almeida Pereira Zuquim, Claudionor José Nunes Coelho Jr., Antônio Otávio Fernandes, Marcos Pêgo de Oliveira, Andréa Iabrudi Tavares, "An Embedded Converter from RS232 to Universal Serial Bus", IEEE.
2. Jan Axelson, "USB Complete: Everything You Need to Develop Custom USB Peripherals", Penram Intl. Publishing (India), 1999.
3. Universal Serial Bus Specification, Revision 2.0.
4.
5. Charles H. Roth, Jr., "Digital Systems Design Using VHDL", PWS Publishing Company, 1996.
6. Zainalabedin Navabi, "VHDL: Analysis and Modeling of Digital Systems", McGraw-Hill, Second Edition.
7.
8.
9. Douglas L. Perry, "VHDL", Second Edition, McGraw-Hill, Inc., 1993.
10. .au/catalog/targus-usb-to-parallel-adapter-p-1160.html
11. USB Complete: The Developer's Guide, 4th Edition.
12. USB Mass Storage: Designing and Programming Devices and Embedded Hosts.
14. Pong P. Chu, FPGA Prototyping by VHDL Examples: Xilinx Spartan-3 Version.

Bibliographical notes
V.Vijaya obtained her B.Tech degree in Electronics & Communication Engineering from Jawaharlal Nehru Technological University (JNTU) College of Engineering, Ananthapur, and her M.Tech degree in Instrumentation and Control Systems from JNTUK College of Engineering, Kakinada, and is pursuing a PhD at JNTUH, Hyderabad. She worked at APEL Radio Communication Systems, Hyderabad, and is presently working as an Associate Professor in the ECE Department of Vaagdevi College of Engineering, Warangal. She has 10 years of teaching experience and 2 years of industrial experience. She has attended 15 workshops, refresher courses, and short-term courses at various places. She is a member of the Project Review Committee (UG/PG) and CRC for (UG/PG), and is the project coordinator for UG/PG.
Her areas of interest are image processing, signal processing, VLSI, mobile communications, and wireless communications. She is a life member of ISTE and IETE and a member of IEEE. She has published a number of papers in national and international conferences.

V.Rama obtained her B.Tech in Electronics & Communication Engineering from JNTU, Kakinada, and her M.Tech from NIT, Warangal, where she is pursuing a PhD. She is working as an Assistant Professor in the ECE Department at NIT, Warangal, where she is staff adviser of the ECE Department and in charge of the basic electronics lab. She is involved in extracurricular activities at the institute. She has 12 years of teaching experience. She has organized a number of UGC workshops at NITW. Her area of research is biomedical signal processing. Her areas of interest are image processing, signal processing, and telemedicine. She is a member of IEEE. She has published a number of papers in national and international conferences.

CH.Kranthi Rekha received her B.E in Electronics and Communication Engineering from Madurai Kamaraj University in 2000 and completed her M.Tech at JNTUH, Hyderabad. She is presently working as a Lecturer at Linton University College, Mantin, Malaysia. She has more than 10 years of teaching experience. She is the author of two books (Digital Communications and Digital Image Processing). She organized the student-level technical symposium technocraft-'09, has attended 10 workshops, refresher courses, and short-term courses at various places, and has served as a resource person speaking on image processing. She is a member of the Project Review Committee (UG/PG) and CRC for (UG/PG). Her areas of interest are neural networks, image processing, signal processing, VLSI, and communications. She is a life member of ISTE and IETE. She has published a number of papers in national and international conferences.

B.Sreedevi obtained her AMIE degree in Electronics & Communication Engineering from the Institution of Engineers, Calcutta, and her M.Tech degree in Digital System Computer Electronics from JNTUA College of Engineering, Ananthapur. She is working as an Associate Professor in the ECE Department of Vaagdevi College of Engineering, Warangal. She has 10 years of teaching experience. She has attended 12 workshops, refresher courses, and short-term courses at various places. She is a member of the Project Review Committee (UG/PG) and CRC for (UG/PG). Her areas of interest are image processing, signal processing, VLSI, and communications. She is a life member of ISTE and IETE. She has published a number of papers in national and international conferences.

C.B.RamaRao obtained his B.Tech in Electronics & Communication Engineering from JNTU Kakinada, his M.Tech from JNTU Kakinada, and his Ph.D from IIT, Kharagpur. He is working as a Professor in the ECE Department at NIT, Warangal, and is at present the Head of the ECE Department. He is involved in various activities at the institute and has acted as associate dean of academic affairs at NITW. He has 28 years of teaching experience. He has organized a number of workshops at NITW. His area of research is advanced digital signal processing. His areas of interest are biomedical signal processing, image processing, and signal processing. He is a member of IEEE. He has published a number of papers in national and international conferences.

An English-Language Paper on FPGAs, with Translation

Developing Programmable Automation Controllers with the LabVIEW FPGA Module
School: School of Communication and Electronic Engineering
Class: Electronics 071
Student ID: 2007131010
Name: 欧洪材

Building Programmable Automation Controllers with LabVIEW FPGA

Overview
Programmable Automation Controllers (PACs) are gaining acceptance within the industrial control market as the ideal solution for applications that require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes. National Instruments offers a variety of PAC solutions powered by one common software development environment, NI LabVIEW. With LabVIEW, you can build custom I/O interfaces for industrial applications using add-on software, such as the NI LabVIEW FPGA Module.

With the LabVIEW FPGA Module and reconfigurable I/O (RIO) hardware, National Instruments delivers an intuitive, accessible solution for incorporating the flexibility and customizability of FPGA technology into industrial PAC systems. You can define the logic embedded in FPGA chips across the family of RIO hardware targets without knowing low-level hardware description languages (HDLs) or board-level hardware design details, as well as quickly define hardware for ultrahigh-speed control, customized timing and synchronization, low-level signal processing, and custom I/O with analog, digital, and counters within a single device. You also can integrate your custom NI RIO hardware with image acquisition and analysis, motion control, and industrial protocols, such as CAN and RS232, to rapidly prototype and implement a complete PAC system.

Table of Contents
1. Introduction
2. NI RIO Hardware for PACs
3. Building PACs with LabVIEW and the LabVIEW FPGA Module
4. FPGA Development Flow
5. Using NI SoftMotion to Create Custom Motion Controllers
6. Applications
7. Conclusion

Introduction
You can use graphical programming in LabVIEW and the LabVIEW FPGA Module to configure the FPGA (field-programmable gate array) on NI RIO devices. RIO technology, the merging of LabVIEW graphical programming with FPGAs on NI RIO hardware, provides a flexible platform for creating sophisticated measurement and control systems that you could previously create only with custom-designed hardware.

An FPGA is a chip that consists of many unconfigured logic gates. Unlike the fixed, vendor-defined functionality of an ASIC (application-specific integrated circuit) chip, you can configure and reconfigure the logic on FPGAs for your specific application. FPGAs are used in applications where either the cost of developing and fabricating an ASIC is prohibitive, or the hardware must be reconfigured after being placed into service. The flexible, software-programmable architecture of FPGAs offers benefits such as high-performance execution of custom algorithms, precise timing and synchronization, rapid decision making, and simultaneous execution of parallel tasks. Today, FPGAs appear in such devices as instruments, consumer electronics, automobiles, aircraft, copy machines, and application-specific computer hardware. While FPGAs are often used in industrial control products, FPGA functionality has not previously been made accessible to industrial control engineers.
Defining FPGAs has historically required expertise using HDL programming or complex design tools used more by hardware design engineers than by control engineers.

Within the test fixture, the tx output of the transmitter module is looped back to the rx input of the receiver module. This allows the transmitter module to be used as a test signal generator for the receiver module. Data can be written in parallel format to the transmitter module and looped back in serial format to the rx input of the receiver module, and the received data can finally be read out in parallel format from the receiver module. In order to automate the testing of the UART as much as possible, three independent Verilog tasks were written as follows. The Verilog task "write_to_transmitter" holds all statements required to generate a single parallel data write sequence to the transmitter module. Data written to the transmitter upon execution of the "write_to_transmitter" task is latched inside the test fixture for later analysis. The Verilog task "read_out_receiver" holds all statements required to generate a single parallel data read-out sequence from the receiver module. Data read out of the receiver upon execution of the "read_out_receiver" task is latched inside the test fixture for later analysis. The Verilog task "compare_data" holds all statements required to compare the data previously written to the transmitter module with the corresponding, most recent data received and read out from the receiver module. If any discrepancy occurs, the "compare_data" task flags an error by writing out the data values that were written to the transmitter module, as well as the corresponding data values that were received by and read out from the receiver module, and immediately stops the simulation. Besides the three Verilog tasks mentioned above, the test fixture holds the statements that generate mclkx16 and the master reset signals, as well as the "tx to rx" loopback feature. These statements are considered trivial and are not illustrated here, but can be found in the test fixture itself. The core of the test fixture is a behavioral-level "for loop" that executes the three Verilog tasks mentioned above in order to write all possible data combinations to the transmitter and verify that the same data is properly received by the receiver. The for loop is shown below in figure 21.

After the port definitions come the port directions. Directions are specified as input, output, or inout (bidirectional), and are listed in table 1. After the port directions comes the declaration of internal signals. Internal signals in Verilog are declared as "wire" or "reg" data types. Signals of the "wire" type are used for continuous assignments, also called combinatorial statements. Signals of the "reg" type are used for assignments within the Verilog "always" block, often, but not necessarily, for sequential logic. For further explanation, see a Verilog reference book. The data types of the module's internal signals are listed in table 3. With all necessary declarations covered, we are now ready to look at the actual implementation. Using a hardware description language allows us to describe the function of the transmitter in a more behavioral manner, rather than focusing on its actual implementation at gate level.

In software programming languages, functions and procedures break larger programs into more readable, manageable, and maintainable pieces. The Verilog language provides functions and tasks as constructs analogous to software functions and procedures. A Verilog function or task stands in for multiple lines of Verilog code in which certain inputs or signals affect certain outputs or variables. Functions and tasks are usually used where multiple lines of code are repeated in a design, which makes the design easier to read and maintain. A Verilog function can have multiple inputs but always has exactly one output, while a Verilog task can have multiple inputs and multiple outputs, and in some cases none of either. Below is shown the Verilog task that holds all the sequential statements needed to describe the transmitter in the "shift" mode.

With the LabVIEW FPGA Module and NI RIO hardware, you now can use LabVIEW, a high-level graphical development environment designed specifically for measurement and control applications, to create PACs that have the customization, flexibility, and high performance of FPGAs. Because the LabVIEW FPGA Module configures custom circuitry in hardware, your system can process and generate synchronized analog and digital signals rapidly and deterministically. Figure 1 illustrates many of the NI RIO devices that you can configure using the LabVIEW FPGA Module.

Figure 1. LabVIEW FPGA VI Block Diagram and RIO Hardware Platforms

NI RIO Hardware for PACs
Historically, programming FPGAs has been limited to engineers who have in-depth knowledge of VHDL or other low-level design tools, which require overcoming a very steep learning curve. With the LabVIEW FPGA Module, NI has opened FPGA technology to a broader set of engineers who can now define FPGA logic using LabVIEW graphical development. Measurement and control engineers can focus primarily on their test and control application, where their expertise lies, rather than the low-level semantics of transferring logic into the cells of the chip. The LabVIEW FPGA Module model works because of the tight integration between the LabVIEW FPGA Module and the commercial off-the-shelf (COTS) hardware architecture of the FPGA and surrounding I/O components.

National Instruments PACs provide modular, off-the-shelf platforms for your industrial control applications.
With the implementation of RIO technology on PCI, PXI, and Compact Vision System platforms and the introduction of RIO-based CompactRIO, engineers now have the benefits of a COTS platform with the high performance, flexibility, and customization benefits of FPGAs at their disposal to build PACs. National Instruments PCI and PXI R Series plug-in devices provide analog and digital data acquisition and control for high-performance, user-configurable timing and synchronization, as well as onboard decision making on a single device. Using these off-the-shelf devices, you can extend your NI PXI or PCI industrial control system to include high-speed discrete and analog control, custom sensor interfaces, and precise timing and control.

NI CompactRIO, a platform centered on RIO technology, provides a small, industrially rugged, modular PAC platform that gives you high-performance I/O and unprecedented flexibility in system timing. You can use NI CompactRIO to build an embedded system for applications such as in-vehicle data acquisition, mobile NVH testing, and embedded machine control systems. The rugged NI CompactRIO system is industrially rated and certified, and it is designed for greater than 50 g of shock at a temperature range of -40 to 70 °C.

NI Compact Vision System is a rugged machine vision package that withstands the harsh environments common in robotics, automated test, and industrial inspection systems. NI CVS-145x devices offer unprecedented I/O capabilities and network connectivity for distributed machine vision applications. NI CVS-145x systems use IEEE 1394 (FireWire) technology, compatible with more than 40 cameras with a wide range of functionality, performance, and price.
NI CVS-1455 and NI CVS-1456 devices contain configurable FPGAs so you can implement custom counters, timing, or motor control in your machine vision application.

Building PACs with LabVIEW and the LabVIEW FPGA Module
With LabVIEW and the LabVIEW FPGA Module, you add significant flexibility and customization to your industrial control hardware. Because many PACs are already programmed using LabVIEW, programming FPGAs with LabVIEW is easy because it uses the same LabVIEW development environment. When you target the FPGA on an NI RIO device, LabVIEW displays only the functions that can be implemented in the FPGA, further easing the use of LabVIEW to program FPGAs. The LabVIEW FPGA Module Functions palette includes typical LabVIEW structures and functions, such as While Loops, For Loops, Case Structures, and Sequence Structures, as well as a dedicated set of LabVIEW FPGA-specific functions for math, signal generation and analysis, linear and nonlinear control, comparison logic, array and cluster manipulation, occurrences, analog and digital I/O, and timing. You can use a combination of these functions to define logic and embed intelligence onto your NI RIO device.

Figure 2 shows an FPGA application that implements a PID control algorithm on the NI RIO hardware and a host application on a Windows machine or an RT target that communicates with the NI RIO hardware. This application reads from analog input 0 (AI0), performs the PID calculation, and outputs the resulting data on analog output 0 (AO0). While the FPGA clock runs at 40 MHz, the loop in this example runs much slower because each component takes longer than one clock cycle to execute. Analog control loops can run on an FPGA at a rate of about 200 kHz. You can specify the clock rate at compile time. This example shows only one PID loop; however, creating additional functionality on the NI RIO device is merely a matter of adding another While Loop. Unlike traditional PC processors, FPGAs are parallel processors.
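To make the read, compute, write cycle of that PID example concrete, here is a small floating-point model of it, together with the clock arithmetic implied by the 40 MHz and 200 kHz figures. It is only a sketch under stated assumptions: the real application is a graphical LabVIEW FPGA VI, whereas this is a textbook discrete PID in Python, and the first-order plant, its time constant, the gains, and all names are hypothetical values chosen for illustration.

```python
class PID:
    """Textbook discrete PID controller (not NI's implementation)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def ticks_per_iteration(clock_hz, loop_hz):
    """Clock cycles available to one loop iteration."""
    return clock_hz // loop_hz


def run_loop(pid, setpoint, steps, tau=1e-4):
    """Close the loop around a hypothetical first-order plant.

    `value` stands in for the AI0 reading and `output` for the AO0 write.
    """
    value = 0.0
    for _ in range(steps):
        output = pid.update(setpoint, value)        # PID calculation
        value += (output - value) * pid.dt / tau    # plant responds to AO0
    return value


LOOP_HZ = 200_000                      # loop rate quoted in the text
pid = PID(kp=2.0, ki=20_000.0, kd=0.0, dt=1.0 / LOOP_HZ)
final_value = run_loop(pid, setpoint=1.0, steps=2000)
```

At a 200 kHz loop rate, a 40 MHz clock leaves `ticks_per_iteration(40_000_000, 200_000)` = 200 clock cycles per iteration, which is consistent with the statement that each component of the loop takes more than one cycle.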
Adding additional loops to your application does not affect the performance of your PID loop.

Figure 2. PID Control Using an Embedded LabVIEW FPGA VI with Corresponding LabVIEW Host VI

FPGA Development Flow
After you create the LabVIEW FPGA VI, you compile the code to run on the NI RIO hardware. Depending on the complexity of your code and the specifications of your development system, compile time for an FPGA VI can range from minutes to several hours. To maximize development productivity, with the R Series RIO devices you can use a bit-accurate emulation mode so you can verify the logic of your design before initiating the compile process. When you target the FPGA Device Emulator, LabVIEW accesses I/O from the device and executes the VI logic on the Windows development computer. In this mode, you can use the same debugging tools available in LabVIEW for Windows, such as execution highlighting, probes, and breakpoints.

Once the LabVIEW FPGA code is compiled, you create a LabVIEW host VI to integrate your NI RIO hardware into the rest of your PAC system. Figure 3 illustrates the development process for creating an FPGA application. The host VI uses controls and indicators on the FPGA VI front panel to transfer data between the FPGA on the RIO device and the host processing engine. These front panel objects are represented as data registers within the FPGA. The host computer can be either a PC or PXI controller running Windows or a PC, PXI controller, Compact Vision System, or CompactRIO controller running a real-time operating system (RTOS). In the above example, we exchange the set point, PID gains, loop rate, AI0, and AO0 data with the LabVIEW host VI.

Figure 3. LabVIEW FPGA Development Flow

The NI RIO device driver includes a set of functions to develop a communication interface to the FPGA. The first step in building a host VI is to open a reference to the FPGA VI and RIO device.
The Open FPGA VI Reference function, as seen in Figure 2, also downloads and runs the compiled FPGA code during execution. After opening the reference, you read and write to the control and indicator registers on the FPGA using the Read/Write Control function. Once you wire the FPGA reference into this function, you can simply select which controls and indicators you want to read and write to. You can enclose the FPGA Read/Write function within a While Loop to continuously read and write to the FPGA. Finally, the last function within the LabVIEW host VI in Figure 2 is the Close FPGA VI Reference function. The Close FPGA VI Reference function stops the FPGA VI and closes the reference to the device. Now you can download other compiled FPGA VIs to the device to change or modify its functionality.

The LabVIEW host VI can also be used to perform floating-point calculations, data logging, networking, and any calculations that do not fit within the FPGA fabric. For added determinism and reliability, you can run your host application on an RTOS with the LabVIEW Real-Time Module. LabVIEW Real-Time systems provide deterministic processing engines for functions performed synchronously or asynchronously to the FPGA. For example, floating-point arithmetic, including FFTs, PID calculations, and custom control algorithms, is often performed in the LabVIEW Real-Time environment. Relevant data can be stored on a LabVIEW Real-Time system or transferred to a Windows host computer for off-line analysis, data logging, or user interface displays. The architecture for this configuration is shown in Figure 4. Each NI PAC platform that offers RIO hardware can run LabVIEW Real-Time VIs.

Figure 4. Complete PAC Architecture Using LabVIEW FPGA, LabVIEW Real-Time, and Host PC

Within each R Series and CompactRIO device, there is flash memory available to store a compiled LabVIEW FPGA VI and run the application immediately upon power up of the device.
In this configuration, as long as the FPGA has power, it runs the FPGA VI, even if the host computer crashes or is powered down. This is ideal for programming safety power down and power up sequences when unexpected events occur.

Using NI SoftMotion to Create Custom Motion Controllers
The NI SoftMotion Development Module for LabVIEW provides VIs and functions to help you build custom motion controllers as part of NI PAC hardware platforms that can include NI RIO devices, DAQ devices, and Compact FieldPoint. NI SoftMotion provides all of the functions that typically reside on a motion controller DSP. With it, you can handle path planning, trajectory generation, and position and velocity loop control in the NI LabVIEW environment and then deploy the code on LabVIEW Real-Time or LabVIEW FPGA-based target hardware.

NI SoftMotion includes functions for the trajectory generator and spline engine, and examples with complete source code for supervisory control and for position and velocity control loops using the PID algorithm. Supervisory control and the trajectory generator run on a LabVIEW Real-Time target at millisecond loop rates. The spline engine and the control loop can run either on a LabVIEW Real-Time target at millisecond loop rates or on a LabVIEW FPGA target at microsecond loop rates.

Applications
Because the LabVIEW FPGA Module can configure low-level hardware design of FPGAs and use the FPGAs within a modular system, it is ideal for industrial control applications requiring custom hardware.
These custom applications can include a custom mix of analog, digital, and counter/timer I/O, analog control up to 125 kHz, digital control up to 20 MHz, and interfacing to custom digital protocols for the following:
∙ Batch control
∙ Discrete control
∙ Motion control
∙ In-vehicle data acquisition
∙ Machine condition monitoring
∙ Rapid control prototyping (RCP)
∙ Industrial control and acquisition
∙ Distributed data acquisition and control
∙ Mobile/portable noise, vibration, and harshness (NVH) analysis

Conclusion
The LabVIEW FPGA Module brings the flexibility, performance, and customization of FPGAs to PAC platforms. Using NI RIO devices and LabVIEW graphical programming, you can build flexible and custom hardware using the COTS hardware often required in industrial control applications. Because you are using LabVIEW, a programming language already used in many industrial control applications, to define your NI RIO hardware, there is no need to learn VHDL or other low-level hardware design tools to create custom hardware. Using the LabVIEW FPGA Module and NI RIO hardware as part of your NI PAC adds significant flexibility and functionality for applications requiring ultrahigh-speed control, interfaces to custom digital protocols, or a custom I/O mix of analog, digital, and counters.

Developing Programmable Automation Controllers with the LabVIEW FPGA (Field-Programmable Gate Array) Module

Overview
Industrial control applications require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes.

An English-Language Article on FPGAs, with Translation

Building Programmable Automation Controllers with the LabVIEW FPGA Module

School: School of Communication and Electronic Engineering
Class: Electronics 071
Student ID: 2007131010
Name: 欧洪材

Building Programmable Automation Controllers with LabVIEW FPGA

Overview

Programmable Automation Controllers (PACs) are gaining acceptance within the industrial control market as the ideal solution for applications that require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes. National Instruments offers a variety of PAC solutions powered by one common software development environment, NI LabVIEW. With LabVIEW, you can build custom I/O interfaces for industrial applications using add-on software, such as the NI LabVIEW FPGA Module.

With the LabVIEW FPGA Module and reconfigurable I/O (RIO) hardware, National Instruments delivers an intuitive, accessible solution for incorporating the flexibility and customizability of FPGA technology into industrial PAC systems. You can define the logic embedded in FPGA chips across the family of RIO hardware targets without knowing low-level hardware description languages (HDLs) or board-level hardware design details, as well as quickly define hardware for ultrahigh-speed control, customized timing and synchronization, low-level signal processing, and custom I/O with analog, digital, and counters within a single device. You also can integrate your custom NI RIO hardware with image acquisition and analysis, motion control, and industrial protocols, such as CAN and RS232, to rapidly prototype and implement a complete PAC system.

Table of Contents

1. Introduction
2. NI RIO Hardware for PACs
3. Building PACs with LabVIEW and the LabVIEW FPGA Module
4. FPGA Development Flow
5. Using NI SoftMotion to Create Custom Motion Controllers
6. Applications
7. Conclusion

Introduction

You can use graphical programming in LabVIEW and the LabVIEW FPGA Module to configure the FPGA (field-programmable gate array) on NI RIO devices.
RIO technology, the merging of LabVIEW graphical programming with FPGAs on NI RIO hardware, provides a flexible platform for creating sophisticated measurement and control systems that you could previously create only with custom-designed hardware.

An FPGA is a chip that consists of many unconfigured logic gates. Unlike the fixed, vendor-defined functionality of an ASIC (application-specific integrated circuit) chip, you can configure and reconfigure the logic on FPGAs for your specific application. FPGAs are used in applications where either the cost of developing and fabricating an ASIC is prohibitive, or the hardware must be reconfigured after being placed into service. The flexible, software-programmable architecture of FPGAs offers benefits such as high-performance execution of custom algorithms, precise timing and synchronization, rapid decision making, and simultaneous execution of parallel tasks. Today, FPGAs appear in such devices as instruments, consumer electronics, automobiles, aircraft, copy machines, and application-specific computer hardware. While FPGAs are often used in industrial control products, FPGA functionality has not previously been made accessible to industrial control engineers. Defining FPGAs has historically required expertise using HDL programming or complex design tools used more by hardware design engineers than by control engineers.

Within the test fixture, the tx output of the transmitter module is looped back to the rx input of the receiver module. This allows the transmitter module to be used as a test signal generator for the receiver module. Data can be written in parallel format to the transmitter module and looped back in serial format to the rx input of the receiver module, and the data received can finally be read out in parallel format from the receiver module. In order to automate the testing of the UART as much as possible, three independent Verilog tasks were written as follows.
The Verilog task “write_to_transmitter” holds all statements required to generate a single parallel data write sequence to the transmitter module. Data written to the transmitter when the “write_to_transmitter” task executes is latched inside the test fixture for later analysis. The Verilog task “read_out_receiver” holds all statements required to generate a single parallel data read-out sequence from the receiver module. Data read out of the receiver when the “read_out_receiver” task executes is likewise latched inside the test fixture for later analysis. The Verilog task “compare_data” holds all statements required to compare the data previously written to the transmitter module with the corresponding, most recent data received and read out from the receiver module. If any discrepancy occurs, the “compare_data” task flags an error by writing out the data values that were written to the transmitter module, as well as the corresponding data values that were received by and read out from the receiver module, and immediately stops the simulation. Besides the three Verilog tasks mentioned above, the test fixture holds the statements that generate mclkx16 and the master reset signal, as well as the “tx to rx” loop-back feature. These statements are considered trivial and are not illustrated here, but can be found in the test fixture itself. The core of the test fixture is a behavioral-level “for loop” that executes the three Verilog tasks mentioned above in order to write all possible data combinations to the transmitter and verify that the same data is properly received by the receiver. The for loop is shown in Figure 21.

Next to port definitions come port directions. Directions are specified as input, output, or inout (bidirectional), and are listed in Table 1.
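The write, read-back, and compare flow of the test fixture described above can be illustrated in software. The following is only a behavioral sketch in Python, not the original Verilog test fixture: the function names are hypothetical, and the frame format (one start bit, eight data bits sent LSB first, one stop bit) is an assumption, since the text does not state it.

```python
def serialize(byte):
    """Frame one byte as start bit (0), 8 data bits LSB first, stop bit (1).

    The 10-bit frame format is an assumption made for this sketch.
    """
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]


def deserialize(bits):
    """Recover the byte from a 10-bit frame; raise on a framing error."""
    if len(bits) != 10 or bits[0] != 0 or bits[-1] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))


def loopback_test(channel=lambda bits: bits):
    """Write every 8-bit value, read it back, and stop at the first mismatch.

    `channel` models the tx-to-rx wire; the default is an ideal loop-back.
    Returns None on success, or (written, received) for the first discrepancy.
    """
    for written in range(256):                  # all possible data combinations
        frame = serialize(written)              # role of write_to_transmitter
        received = deserialize(channel(frame))  # role of read_out_receiver
        if received != written:                 # role of compare_data
            return (written, received)
    return None
```

Calling `loopback_test()` with the default channel returns `None`. Passing a channel that corrupts a data bit makes the loop stop at the first value that comes back wrong, which mirrors how “compare_data” halts the simulation.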
Next to the specification of port directions comes the declaration of internal signals. Internal signals in Verilog are declared as “wire” or “reg” data types. Signals of the “wire” type are used for continuous assignments, also called combinatorial statements. Signals of the “reg” type are used for assignments within the Verilog “always” block, often used for sequential logic assignments, but not necessarily. For further explanation, see a Verilog reference book. Data types of the internal signals of the module are listed in Table 3.

We have now covered all necessary declarations and are ready to look at the actual implementation. Using a hardware description language allows us to describe the function of the transmitter in a more behavioral manner, rather than focusing on its actual implementation at gate level.

In software programming languages, functions and procedures break larger programs into more readable, manageable, and maintainable pieces. The Verilog language provides functions and tasks as constructs analogous to software functions and procedures. A Verilog function or task is used as the equivalent of multiple lines of Verilog code, where certain inputs or signals affect certain outputs or variables. Functions and tasks are usually used where multiple lines of code are repeated in a design, which makes the design easier to read and maintain. A Verilog function can have multiple inputs but always has exactly one output, while a Verilog task can have multiple inputs and multiple outputs and, in some cases, none of either.
Below is shown the Verilog task that holds all the sequential statements needed to describe the transmitter in the “shift” mode.

With the LabVIEW FPGA Module and NI RIO hardware, you now can use LabVIEW, a high-level graphical development environment designed specifically for measurement and control applications, to create PACs that have the customization, flexibility, and high performance of FPGAs. Because the LabVIEW FPGA Module configures custom circuitry in hardware, your system can process and generate synchronized analog and digital signals rapidly and deterministically. Figure 1 illustrates many of the NI RIO devices that you can configure using the LabVIEW FPGA Module.

Figure 1. LabVIEW FPGA VI Block Diagram and RIO Hardware Platforms

NI RIO Hardware for PACs

Historically, programming FPGAs has been limited to engineers who have in-depth knowledge of VHDL or other low-level design tools, which require overcoming a very steep learning curve. With the LabVIEW FPGA Module, NI has opened FPGA technology to a broader set of engineers who can now define FPGA logic using LabVIEW graphical development. Measurement and control engineers can focus primarily on their test and control application, where their expertise lies, rather than the low-level semantics of transferring logic into the cells of the chip. The LabVIEW FPGA Module model works because of the tight integration between the LabVIEW FPGA Module and the commercial off-the-shelf (COTS) hardware architecture of the FPGA and surrounding I/O components.

National Instruments PACs provide modular, off-the-shelf platforms for your industrial control applications. With the implementation of RIO technology on PCI, PXI, and Compact Vision System platforms and the introduction of RIO-based CompactRIO, engineers now have the benefits of a COTS platform with the high-performance, flexibility, and customization benefits of FPGAs at their disposal to build PACs.
National Instruments PCI and PXI R Series plug-in devices provide analog and digital data acquisition and control for high-performance, user-configurable timing and synchronization, as well as onboard decision making on a single device. Using these off-the-shelf devices, you can extend your NI PXI or PCI industrial control system to include high-speed discrete and analog control, custom sensor interfaces, and precise timing and control.

NI CompactRIO, a platform centered on RIO technology, provides a small, industrially rugged, modular PAC platform that gives you high-performance I/O and unprecedented flexibility in system timing. You can use NI CompactRIO to build an embedded system for applications such as in-vehicle data acquisition, mobile NVH testing, and embedded machine control systems. The rugged NI CompactRIO system is industrially rated and certified, and it is designed for greater than 50 g of shock at a temperature range of -40 to 70 °C.

NI Compact Vision System is a rugged machine vision package that withstands the harsh environments common in robotics, automated test, and industrial inspection systems. NI CVS-145x devices offer unprecedented I/O capabilities and network connectivity for distributed machine vision applications. NI CVS-145x systems use IEEE 1394 (FireWire) technology, compatible with more than 40 cameras with a wide range of functionality, performance, and price. NI CVS-1455 and NI CVS-1456 devices contain configurable FPGAs so you can implement custom counters, timing, or motor control in your machine vision application.

Building PACs with LabVIEW and the LabVIEW FPGA Module

With LabVIEW and the LabVIEW FPGA Module, you add significant flexibility and customization to your industrial control hardware. Because many PACs are already programmed using LabVIEW, programming FPGAs with LabVIEW is easy because it uses the same LabVIEW development environment.
When you target the FPGA on an NI RIO device, LabVIEW displays only the functions that can be implemented in the FPGA, further easing the use of LabVIEW to program FPGAs. The LabVIEW FPGA Module Functions palette includes typical LabVIEW structures and functions, such as While Loops, For Loops, Case Structures, and Sequence Structures, as well as a dedicated set of LabVIEW FPGA-specific functions for math, signal generation and analysis, linear and nonlinear control, comparison logic, array and cluster manipulation, occurrences, analog and digital I/O, and timing. You can use a combination of these functions to define logic and embed intelligence onto your NI RIO device.

Figure 2 shows an FPGA application that implements a PID control algorithm on the NI RIO hardware and a host application on a Windows machine or an RT target that communicates with the NI RIO hardware. This application reads from analog input 0 (AI0), performs the PID calculation, and outputs the resulting data on analog output 0 (AO0). While the FPGA clock runs at 40 MHz, the loop in this example runs much slower because each component takes longer than one clock cycle to execute. Analog control loops can run on an FPGA at a rate of about 200 kHz. You can specify the clock rate at compile time. This example shows only one PID loop; however, creating additional functionality on the NI RIO device is merely a matter of adding another While Loop. Unlike traditional PC processors, FPGAs are parallel processors. Adding additional loops to your application does not affect the performance of your PID loop.

Figure 2. PID Control Using an Embedded LabVIEW FPGA VI with Corresponding LabVIEW Host VI.

FPGA Development Flow

After you create the LabVIEW FPGA VI, you compile the code to run on the NI RIO hardware. Depending on the complexity of your code and the specifications of your development system, compile time for an FPGA VI can range from minutes to several hours.
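For readers who want a textual picture of the Figure 2 loop (read AI0, perform the PID calculation, write AO0), here is a minimal software sketch in Python. It is not LabVIEW code and not the NI implementation. The class name, the `read_ai0` and `write_ao0` callables, and the use of floating-point arithmetic are all assumptions made for illustration, and the integer or fixed-point arithmetic an FPGA would typically use is not modeled.

```python
class PID:
    """Textbook discrete PID controller (illustrative, floating-point)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0     # running sum of error * dt
        self.prev_error = 0.0   # error from the previous iteration

    def step(self, setpoint, measurement):
        """Return the controller output for one loop iteration."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def run_loop(pid, setpoint, read_ai0, write_ao0, iterations):
    """Read the input, compute the PID output, write the output, repeat.

    read_ai0 and write_ao0 are hypothetical stand-ins for the analog I/O.
    """
    for _ in range(iterations):
        write_ao0(pid.step(setpoint, read_ai0()))
```

A second, independent controller would simply be another `PID` instance with its own loop. On the FPGA such loops execute in parallel hardware, whereas in this software sketch they would share a processor.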
To maximize development productivity, with the R Series RIO devices you can use a bit-accurate emulation mode so you can verify the logic of your design before initiating the compile process. When you target the FPGA Device Emulator, LabVIEW accesses I/O from the device and executes the VI logic on the Windows development computer. In this mode, you can use the same debugging tools available in LabVIEW for Windows, such as execution highlighting, probes, and breakpoints.

Once the LabVIEW FPGA code is compiled, you create a LabVIEW host VI to integrate your NI RIO hardware into the rest of your PAC system. Figure 3 illustrates the development process for creating an FPGA application. The host VI uses controls and indicators on the FPGA VI front panel to transfer data between the FPGA on the RIO device and the host processing engine. These front panel objects are represented as data registers within the FPGA. The host computer can be either a PC or PXI controller running Windows or a PC, PXI controller, Compact Vision System, or CompactRIO controller running a real-time operating system (RTOS). In the above example, we exchange the set point, PID gains, loop rate, AI0, and AO0 data with the LabVIEW host VI.

Figure 3. LabVIEW FPGA Development Flow

The NI RIO device driver includes a set of functions to develop a communication interface to the FPGA. The first step in building a host VI is to open a reference to the FPGA VI and RIO device. The Open FPGA VI Reference function, as seen in Figure 2, also downloads and runs the compiled FPGA code during execution. After opening the reference, you read and write to the control and indicator registers on the FPGA using the Read/Write Control function. Once you wire the FPGA reference into this function, you can simply select which controls and indicators you want to read and write to. You can enclose the FPGA Read/Write function within a While Loop to continuously read and write to the FPGA.
Finally, the last function within the LabVIEW host VI in Figure 2 is the Close FPGA VI Reference function. The Close FPGA VI Reference function stops the FPGA VI and closes the reference to the device. Now you can download other compiled FPGA VIs to the device to change or modify its functionality.

The LabVIEW host VI can also be used to perform floating-point calculations, data logging, networking, and any calculations that do not fit within the FPGA fabric. For added determinism and reliability, you can run your host application on an RTOS with the LabVIEW Real-Time Module. LabVIEW Real-Time systems provide deterministic processing engines for functions performed synchronously or asynchronously to the FPGA. For example, floating-point arithmetic, including FFTs, PID calculations, and custom control algorithms, is often performed in the LabVIEW Real-Time environment. Relevant data can be stored on a LabVIEW Real-Time system or transferred to a Windows host computer for off-line analysis, data logging, or user interface displays. The architecture for this configuration is shown in Figure 4. Each NI PAC platform that offers RIO hardware can run LabVIEW Real-Time VIs.

Figure 4. Complete PAC Architecture Using LabVIEW FPGA, LabVIEW Real-Time and Host PC

Within each R Series and CompactRIO device, there is flash memory available to store a compiled LabVIEW FPGA VI and run the application immediately upon power up of the device. In this configuration, as long as the FPGA has power, it runs the FPGA VI, even if the host computer crashes or is powered down. This is ideal for programming safety power-down and power-up sequences when unexpected events occur.

Using NI SoftMotion to Create Custom Motion Controllers

The NI SoftMotion Development Module for LabVIEW provides VIs and functions to help you build custom motion controllers as part of NI PAC hardware platforms that can include NI RIO devices, DAQ devices, and Compact FieldPoint.
NI SoftMotion provides all of the functions that typically reside on a motion controller DSP. With it, you can handle path planning, trajectory generation, and position and velocity loop control in the NI LabVIEW environment and then deploy the code on LabVIEW Real-Time or LabVIEW FPGA-based target hardware.

NI SoftMotion includes functions for the trajectory generator and spline engine, and examples with complete source code for supervisory control and for position and velocity control loops using the PID algorithm. Supervisory control and the trajectory generator run on a LabVIEW Real-Time target at millisecond loop rates. The spline engine and the control loop can run either on a LabVIEW Real-Time target at millisecond loop rates or on a LabVIEW FPGA target at microsecond loop rates.

Applications

Because the LabVIEW FPGA Module can configure low-level hardware design of FPGAs and use the FPGAs within a modular system, it is ideal for industrial control applications requiring custom hardware. These custom applications can include a custom mix of analog, digital, and counter/timer I/O, analog control up to 125 kHz, digital control up to 20 MHz, and interfacing to custom digital protocols for the following:

• Batch control
• Discrete control
• Motion control
• In-vehicle data acquisition
• Machine condition monitoring
• Rapid control prototyping (RCP)
• Industrial control and acquisition
• Distributed data acquisition and control
• Mobile/portable noise, vibration, and harshness (NVH) analysis

Conclusion

The LabVIEW FPGA Module brings the flexibility, performance, and customization of FPGAs to PAC platforms. Using NI RIO devices and LabVIEW graphical programming, you can build flexible and custom hardware using the COTS hardware often required in industrial control applications.
Because you are using LabVIEW, a programming language already used in many industrial control applications, to define your NI RIO hardware, there is no need to learn VHDL or other low-level hardware design tools to create custom hardware. Using the LabVIEW FPGA Module and NI RIO hardware as part of your NI PAC adds significant flexibility and functionality for applications requiring ultrahigh-speed control, interfaces to custom digital protocols, or a custom I/O mix of analog, digital, and counters.

Building Programmable Automation Controllers with the LabVIEW FPGA (Field-Programmable Gate Array) Module

Overview

Industrial control applications require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes.

FPGA Foreign-Language Reference Material 125

Testing Configurable LUT-Based FPGA’s

Wei Kang Huang, Fred J. Meyer, Member, IEEE, Xiao-Tao Chen, and Fabrizio Lombardi, Member, IEEE

Abstract: We present a new technique for testing field programmable gate arrays (FPGA’s) based on look-up tables (LUT’s). We consider a generalized structure for the basic FPGA logic element (cell); it includes devices such as LUT’s, sequential elements (flip-flops), multiplexers, and control circuitry. We use a hybrid fault model for these devices. The model is based on a physical as well as a behavioral characterization. This permits detection of all single faults (either stuck-at or functional) and some multiple faults using repeated FPGA reprogramming. We show that different arrangements of disjoint one-dimensional (1-D) cell arrays with cascaded horizontal connections and common vertical input lines provide a good logic testing regimen. The testing time is independent of the number of cells in the array (C-testability). We define new conditions for C-testability of programmable/reconfigurable arrays. These conditions do not suffer from limited I/O pins. Cell configuration affects the controllability/observability of the iterative array. We apply the approach to various Xilinx FPGA families and compare it to prior work.

Index Terms: C-testability, field programmable gate array, programmability, reconfigurability, testing.

I. NOTATION AND DEFINITIONS

Horizontal: The internal inputs (outputs) of an iterative array. These propagate dependency between the CLB’s in the array.
Vertical: The external inputs of an iterative array. These can be directly specified in the test patterns and require I/O blocks.
Phase: Each testing phase is a reprogramming of the FPGA followed by test vector application. Since reprogramming is slow, the number of phases is a good measure of testing time.
Session: The application of every CLB test configuration to those CLB’s that are under test. Multiple sessions are required if not all CLB’s can be under test simultaneously.
C-testable: An FPGA is C-testable with a given testing method if the number of programmings is independent of the circuit size. In particular, for an iterative array, it is independent of the length of the array.

[…] and segments with programmable devices. Customization is accomplished by configuring the interconnect and the CLB’s by loading them with appropriate data from an external storage device. The FPGA also includes input/output blocks (IOB’s), which provide the interface between the package pins and the internal logic. The numbers of CLB’s and IOB’s vary widely depending on the particular chip and manufacturer [2]. FPGA’s are versatile and in widespread use, due to their programmable nature and their ease of reconfiguration [3]. Internal static configuration (memory) elements determine the logic functions and the interconnections. The SRAM in memory-based FPGA’s can be used to configure functions via look-up tables (LUT’s). Also, they often have a mode where the configuration SRAM is usable as memory. We focus on CLB testing for SRAM-based FPGA’s that implement functions via LUT’s.

III. BACKGROUND

Testing FPGA’s is addressed in the literature such as [4]–[7]. These works and this paper deal with manufacturing test. Other tests in the field, such as verifying correctly loaded configuration data, are typically handled by architectural features for reprogrammable FPGA’s [2]. Reference [4] discusses testing of row-based (segmented channel) FPGA’s. The approach sequentially tests every cell using a modified scan procedure, providing 100% fault coverage of single stuck-at faults. It requires many tests and does not fully exploit the regularity of the FPGA to reduce test time.

The methodology in [8] for testing uncommitted segmented channel FPGA’s for single stuck-at faults is based on connecting the cells of each row as a one-dimensional (1-D) unilateral array, such that the FPGA could be tested as a set of disjoint arrays. This yields considerable reduction in both vectors and test circuitry.
Simultaneous testing of disjoint arrays helps achieve constant test set size (C-testability), so that test cost will be independent of chip size [9].

In [7], the FPGA is configured to conduct direct output comparisons of pairs of logic cells using full cell controllability. Test generation and output response comparison are handled internally using some of the logic resources in a built-in self-test (BIST) arrangement. This requires at least one extra “session,” i.e., a doubling of chip programmings so that the cells previously used for test pattern generation and for output comparisons can become cells under test. Fault simulation established that 100% fault coverage can be accomplished for single stuck-at faults. In [10], the logic resources are arranged as an iterative logic array (ILA) [9]. This allows better scalability than the previous BIST arrangement [7]; however, it also requires another additional session, i.e., a tripling of chip programmings.

A simple testing arrangement (referred to as “naive”) was mentioned in [5]. It connects together all input lines to the CLB’s (cells) under test from the IOB’s, and uses the remaining IOB’s for direct observability of the output lines of each cell under test. Fig. 1 shows a single programming phase with the three leftmost CLB’s under test. Each CLB in the figure has three inputs and two outputs. Three IOB’s are consumed in order to provide the cells under test with their input vectors. The cells under test have no connections between them. Their output response is directly observed at the IOB’s. In each programming phase, only a few CLB’s can be tested in parallel. This is basically restricted by the number of IOB’s and the number of output lines of each CLB. In Fig. 1, only three CLB’s can be under test, because, after three IOB’s are used for CLB inputs, only seven IOB’s remain to observe output response.

Fig. 1. One programming phase with the naive testing method.

Fig. 2. Interior of a Xilinx XC5200 CLB.

IV. PROPOSED FAULT MODEL

Fig. 2 shows a portion of a Xilinx XC5200 CLB. A full CLB consists of four stacked copies of the figure (with the carry in (CI) and carry out (CO) signals rippled through), plus a little extra logic. The portion in Fig. 2 has a single LUT with four inputs, so it has 2^4 configuration bits to specify its function. Of the three multiplexers, M1 is a conventional multiplexer. M2 and M3 are programmable multiplexers; each needs only a single configuration bit to specify which input it passes. There is also a D flip-flop.

Our generalized internal CLB structure permits these devices: LUT’s, programmable (configurable) multiplexers, conventional multiplexers, and flip-flops. Some conventional logic usually does not interfere with test generation.

We assume the interconnect and the IOB’s have already been tested; the interested reader should refer to [11] and [12] for a detailed treatment. In our proposed testing strategy, we divide CLB’s into independent sets (linear arrays). For our fault model, within each linear array, we assume at most one faulty CLB; otherwise, fault masking might occur. For the single faulty CLB, the nature of the fault could take any form.
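To make the configurable devices of Fig. 2 concrete, the following Python sketch models a four-input LUT as a 16-bit memory addressed by its inputs, plus a two-input programmable multiplexer selected by one configuration bit. This is an illustrative model only: the class and function names are hypothetical, the bit ordering is an assumption, and it does not reflect the actual Xilinx configuration format.

```python
class Lut4:
    """Four-input look-up table: 2**4 = 16 configuration bits, one output bit."""

    def __init__(self, config_bits):
        if len(config_bits) != 16 or any(b not in (0, 1) for b in config_bits):
            raise ValueError("a 4-input LUT needs exactly 16 bits of 0/1")
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, func):
        """Program the LUT from a Boolean function of four 0/1 inputs."""
        bits = []
        for address in range(16):
            inputs = [(address >> i) & 1 for i in range(4)]
            bits.append(int(bool(func(*inputs))))
        return cls(bits)

    def read(self, i0, i1, i2, i3):
        """The four inputs act as an address into the memory matrix."""
        address = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
        return self.config_bits[address]


def programmable_mux(select_bit, in0, in1):
    """Two-input multiplexer whose select line is a single configuration bit."""
    return in1 if select_bit else in0
```

For example, `Lut4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)` programs the LUT as a four-input parity function. Reprogramming it with a different function corresponds to a new programming phase in the paper's terminology.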
For simplicity of illustration, in our investigation, we limited a CLB to a single faulty device. The nature of a device fault varies with the device. We model single device faults both physically (e.g., stuck-at) and functionally [13]. This hybrid fault model is adaptable to emerging FPGA technology and to different products as they become commercially available [2]. The fault model was shown suitable to FPGA’s in [5] and confirmed by industrial experiments. In particular, the faults considered for each device are as follows.

• For a LUT, a fault can occur in any one of the memory matrix, the decoders, and the input and output lines. A faulty memory matrix makes some memory cells incapable of storing the correct logic value (the LUT has a single-bit output, so this value is either 0 or 1) [2]. Any number of memory cells in the LUT could be faulty. If the fault is in the decoder, the erroneous address leads to reading the contents of the wrong cell, i.e., a one-bit address fault. The third possible LUT fault is on the I/O lines, with respect to which we allow any single stuck-at fault. The one-bit decoder address fault can be collapsed to the stuck-at fault of a LUT input line, so this fault type is detected when the decoders are tested. A stuck-at fault on a LUT output line is covered by the tests for the LUT memory matrix.
• For a multiplexer, we use a functional fault model, because the internal logic structure varies from FPGA to FPGA [2]. Testing confirms the multiplexer’s ability to select each input.
• For the D flip-flops, we use a functional fault model. A fault can cause a flip-flop to receive no data, to be unable to be triggered by the correct clock edge, or to be unable to be set or reset.

Our testing objectives are as follows:

• 100% fault coverage under a single faulty device model with neither delay nor area overhead;
• ease of test pattern generation, because patterns are generated for a CLB, not the whole FPGA;
• efficient implementation of the testing process as measured by the amount of memory required to store the test instructions (configuration bits and test patterns);
• the number of programming phases must be as small as possible, because reprogramming time is much greater than test pattern application time [3].

V. TESTING A CLB

We generate test patterns in two steps according to the CLB partitioning.

Consider initially a CLB made of a single LUT. We can test the LUT memory matrix by reading all the memory bits in two phases. The programmed memory matrix contents in the second phase are complements of the first. The scenario is different for testing stuck-at faults at the LUT inputs. The LUT matrix contents must be arranged such that the Boolean difference is one for the input to be tested; multiple patterns are required. Furthermore, each LUT output must be observable at a primary CLB output, so we require a sensitized path from the LUT output to a primary CLB output. By definition of the combinational partition, this can be achieved by configuring the multiplexers (or other devices) in the partition.

We use a functional test for the multiplexers. Since a multiplexer selects from among all inputs, each data input must be active in at least one phase. Further, the functional test consists of applying logic 1 to the selected input while holding all others at logic 0, and a second test pattern with these logic values reversed. The multiplexer output must be observable from at least one primary CLB output. If a multiplexer is not simultaneously controllable/observable, additional phases could be required. We need at least […] phases to test a multiplexer with […] inputs. A multiplexer can be tested with a LUT (if connected); a possible way to accomplish this consists of choosing a memory matrix for the LUT that satisfies the multiplexer(s) testability conditions. In this way, test phases can be overlapped (reduced).

A. Testing the Sequential Partition

The sequential partition includes the D flip-flops as well as multiplexers and control circuits emanating from them or only observable through them. During test generation, we seek to
overlap testing of programmable multiplexers with that of flip-flops.In some FPGA’s,flip-flops are more complicated than the D type.In particular,the Xilinx XC4000family[2]has Dflip-flops plus added logic that can be programmed to add set/reset capability.The S/R controllers are configurable to allow a set function,a reset function,or neither.For the XC4000,this requires three separate programming phases;however,testing of the S/R controllers can be overlapped with testing the flip-flops.1)Testing the D Flip-Flops:We functionally test the D flip-flops.We test the input and hold function with the data sequence010(or101)at D.Separate phases are required to test both rising and falling edge trigger mechanisms,if applicable.We can test the set(reset)function by applying the set(reset)signal after aflip-flop is in the“0”(“1”)state. The set/reset disable functions must also be tested if present, leading to another phase.To test the clock enable function,we use thefive-vector sequence given in Fig.3.Some functional tests can be overlapped to reduce the number of phases.We can possibly also overlap phases with those for multiplexers, depending on the sequential partition’s structure.VI.P ROPOSED T EST S TRATEGYFig.4shows a linear iterative array.There is a cascaded (horizontal)input reflecting the dependence of the cells in the iterative array,and testvector,etc.The period of thearray,Fig.3.Testing the enablefunction.Fig.4.Iterative array with period three.number of cells we must traverse in order to repeat the cascaded input.We do not allow all test patterns;the test generation process must ensure the periodicity is satisfied as it searches the input space.In Fig.4it must ensurethat—couldall be different,but we will constrain them to be identical—so,if wehaveand(and schedule the corresponding vertical CLBinputs)and we will have every CLB experience testvectorvertical inputs.Then,weneedIOB’s forcontrollability/observabilitywhen•For testing the CLB sequential partition,we 
program a 1-D (sequential) array and use the pipeline technique of [8]. To reduce the number of required IOB's, as many as possible vertical inputs are made common (i.e., they will have identical logic values when test vectors are applied). This is beneficial because, in a sequential array, the requirements of controllability and observability are far more stringent than for a combinational array of the same size [8], [9]. Let the number of vertical inputs with different logic values in the test process for the pipeline technique be. So, IOB's are used as primary vertical inputs for the vertical inputs of the CLB's with different logic values during the test process and IOB's are used as primary common vertical inputs for those vertical inputs that do not need to be distinguished. The total required IOB's is then where where tests and hence, the total number of test patterns for the CLB sequential partitions is the total number of phases is and Again, the number of required phases for testing the whole FPGA is the same as that for testing a single CLB, so The configured arrays are sequential in 12 of the required 19 phases. For the combinational arrays, the number of vectors required in each phase is the same as for a single CLB. Using the sequential arrays in [8], we need additional cycles equal to the number of CLB's in an iterative array. Therefore,

Fig. 5. Block diagram of an XC5200 CLB.

A. Example: The XC5200 FPGA Family

An XC5200 CLB has four independent four-input LUT's, 14 multiplexers, and four D flip-flops (Fig. 5). This is a stack of four independent logic cells (LC's). Each LC in Fig. 2 contains one four-input LUT, three multiplexers, and one D flip-flop for a total of five independent inputs (). Also, it has a), while the sequential partition has five inputs patterns tests LC0/2. At the same time, we notice that we have tested to stuck faults. We need two phases, because we need to program each memory cell for both 0 and 1. There are 16 memory cells in each LUT and we need to access each in both phases for a total of 32 test patterns; however, six of these have
already been performed (two each in phases 2–4).
• Phase 7: In this sequential test phase we test to stuck faults.
• Phase 8: In this sequential test phase we test follows the LUT output, we propagate errors by horizontally connecting patterns to test the whole FPGA, because denotes the number of test "sessions." In each session, different CLB's are under test. Comparing the formulae for is equivalent in nature to is good for the Xilinx families studied, especially the XC3000. Table III gives in Table III would be increased if the BIST-ILA configuration of [10] were used. For the array-based method, all CLB's are under test simultaneously, so the equivalent to and of an approach, i.e., the maximum number of faults such that test invalidation (fault masking) cannot occur. We give is under worst case conditions. The upper bound can be achieved if the faulty CLB locations are favorable. We assume that test invalidation always occurs if the CLB locations permit it.

TABLE IV
CLB ARCHITECTURAL COMPARISON

• Naive Approach. Every CLB is tested independently, hence was barely achievable.)
• Array-Based Approach. We assume test invalidation occurs in a 1-D array if there is more than one faulty CLB. Since we configure the arrays along the rows of the FPGA,

IX. ARCHITECTURES AND COMPARATIVE ANALYSIS

We compare the Xilinx FPGA's in the series 3000, 4000, and 5200 with respect to their CLB structures and IOB limitations. We compare each CLB architecture by considering programmability and controllability/observability and the required programming phases. Further details are in [14].

A. Comparison of CLB Devices and Features

Table IV gives the numbers of D flip-flops, LUT's, and programmable and conventional multiplexers, together with the test patterns and programming phases needed by the array-based method. It also gives the configuration memory size due to the flip-flops and programmable multiplexers. The difference between the XC3000 and XC4000 families is not large. The XC4000 has an extra 8-bit LUT connected in series with the other two LUT's. This partially affects controllability and
observability; however, some of the required tests for the additional LUT can still be combined with the tests for the other two LUT's. We consider the R/S control elements equivalent to two MUX's each, so the XC4000 CLB has more programmable MUX's. Also, six MUX's have four inputs. Since the number of programming phases to test a MUX depends linearly on the number of MUX input lines [5], these need more phases (and test vectors) for the XC4000 compared to the XC3000.

Two main XC5200 features contribute to its suitability to the array-based testing approach. First, it can be treated as four parallel simple logic cells (LC's) with little hassle. Like the XC4000 LUT's, internal signals are not independent (due to MUX" and "" is neutral. So "IOB's cf. CLB outputs, and better CLB observability compared to other families. For a test method,
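The two-phase LUT memory test described in Section V (read every memory bit, then reprogram with the complemented contents and read again) can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Lut` class, its `stuck` fault map, and the checkerboard contents are hypothetical illustrations and are not taken from the paper.

```python
from itertools import product


class Lut:
    """Toy model of an n-input LUT (hypothetical, for illustration only).

    `stuck` optionally maps a memory-cell address to a forced bit value,
    modelling a stuck-at fault in the memory matrix.
    """

    def __init__(self, n_inputs=4, stuck=None):
        self.n_inputs = n_inputs
        self.cells = [0] * (1 << n_inputs)
        self.stuck = stuck or {}

    def program(self, contents):
        # A stuck cell ignores the value we try to program into it.
        self.cells = [self.stuck.get(a, bit) for a, bit in enumerate(contents)]

    def read(self, inputs):
        # The input vector addresses one memory cell.
        address = int("".join(map(str, inputs)), 2)
        return self.cells[address]


def two_phase_test(lut, phase1):
    """Read every cell under `phase1` contents, then under its complement.

    Returns (number of test patterns applied, True if no mismatch was seen).
    """
    patterns = 0
    passed = True
    for contents in (phase1, [1 - b for b in phase1]):
        lut.program(contents)  # one programming phase
        for inputs in product((0, 1), repeat=lut.n_inputs):
            address = int("".join(map(str, inputs)), 2)
            patterns += 1
            if lut.read(inputs) != contents[address]:
                passed = False
    return patterns, passed


checkerboard = [a & 1 for a in range(16)]  # arbitrary phase-1 contents
print(two_phase_test(Lut(), checkerboard))               # fault-free LUT
print(two_phase_test(Lut(stuck={5: 0}), checkerboard))   # cell 5 stuck at 0
```

For a four-input LUT the model applies 16 patterns in each of the two phases, matching the total of 32 test patterns quoted in the XC5200 example. Because every cell is programmed to both 0 and 1, a stuck cell is caught regardless of the phase-1 contents chosen.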

FPGA English Literature Translation


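Field-programmable gate array

1. History

The FPGA industry grew out of programmable read-only memory (PROM) and programmable logic devices (PLDs).

Both PROMs and PLDs could be programmed in batches at the factory or in the field (field-programmable); however, their programmable logic was hard-wired between the logic gates.

In the late 1980s, the Naval Surface Warfare Center funded an experiment proposed by Steve Casselman to develop a computer that would implement 600,000 reprogrammable gates.

Casselman was successful, and a patent related to the system was issued in 1992.

Some of the industry's foundational concepts and technologies for programmable logic arrays, gates, and logic blocks are founded in patents awarded to David W. Page and LuVerne R. Peterson in 1985.

In the same year, Xilinx co-founders Ross Freeman and Bernard Vonderschmitt invented the first commercially viable field-programmable gate array, the XC2064.

The XC2064 had programmable gates and programmable interconnects between gates, and it marked the beginning of a new technology and market.

The XC2064 had 64 configurable logic blocks (CLBs), each with two three-input lookup tables (LUTs).

More than 20 years later, Ross Freeman was inducted into the National Inventors Hall of Fame for his invention.

Xilinx continued unchallenged and grew quickly from 1985 to the mid-1990s, when competitors sprang up and eroded a significant part of its market share.

By 1993, Actel was serving about 18 percent of the market.

The 1990s were an explosive period for FPGAs, both in sophistication and in production volume.

In the early 1990s, FPGAs were used primarily in telecommunications and networking.

By the end of the decade, FPGAs had found their way into consumer, automotive, and industrial applications.

FPGAs got a glimpse of fame in 1997, when Adrian Thompson, a researcher working at the University of Sussex, combined genetic algorithm technology and FPGAs to create a sound recognition device.

Thompson's algorithm configured an array of 10 × 10 cells in a Xilinx FPGA chip to discriminate between two tones, exploiting analog features of the digital chip.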

English Literature and Translation (FPGA)


Building Programmable Automation Controllers with LabVIEW FPGA

Overview

Programmable Automation Controllers (PACs) are gaining acceptance within the industrial control market as the ideal solution for applications that require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes. National Instruments offers a variety of PAC solutions powered by one common software development environment, NI LabVIEW. With LabVIEW, you can build custom I/O interfaces for industrial applications using add-on software, such as the NI LabVIEW FPGA Module.

With the LabVIEW FPGA Module and reconfigurable I/O (RIO) hardware, National Instruments delivers an intuitive, accessible solution for incorporating the flexibility and customizability of FPGA technology into industrial PAC systems. You can define the logic embedded in FPGA chips across the family of RIO hardware targets without knowing low-level hardware description languages (HDLs) or board-level hardware design details, as well as quickly define hardware for ultrahigh-speed control, customized timing and synchronization, low-level signal processing, and custom I/O with analog, digital, and counters within a single device. You also can integrate your custom NI RIO hardware with image acquisition and analysis, motion control, and industrial protocols, such as CAN and RS232, to rapidly prototype and implement a complete PAC system.

Table of Contents

1. Introduction
2. NI RIO Hardware for PACs
3. Building PACs with LabVIEW and the LabVIEW FPGA Module
4. FPGA Development Flow
5. Using NI SoftMotion to Create Custom Motion Controllers
6. Applications
7. Conclusion

Introduction

You can use graphical programming in LabVIEW and the LabVIEW FPGA Module to configure the FPGA (field-programmable gate array) on NI RIO devices.
RIO technology, the merging of LabVIEW graphical programming with FPGAs on NI RIO hardware, provides a flexible platform for creating sophisticated measurement and control systems that you could previously create only with custom-designed hardware.

An FPGA is a chip that consists of many unconfigured logic gates. Unlike the fixed, vendor-defined functionality of an ASIC (application-specific integrated circuit) chip, you can configure and reconfigure the logic on FPGAs for your specific application. FPGAs are used in applications where either the cost of developing and fabricating an ASIC is prohibitive, or the hardware must be reconfigured after being placed into service. The flexible, software-programmable architecture of FPGAs offers benefits such as high-performance execution of custom algorithms, precise timing and synchronization, rapid decision making, and simultaneous execution of parallel tasks. Today, FPGAs appear in such devices as instruments, consumer electronics, automobiles, aircraft, copy machines, and application-specific computer hardware. While FPGAs are often used in industrial control products, FPGA functionality has not previously been made accessible to industrial control engineers. Defining FPGAs has historically required expertise using HDL programming or complex design tools used more by hardware design engineers than by control engineers.

With the LabVIEW FPGA Module and NI RIO hardware, you now can use LabVIEW, a high-level graphical development environment designed specifically for measurement and control applications, to create PACs that have the customization, flexibility, and high performance of FPGAs.
Because the LabVIEW FPGA Module configures custom circuitry in hardware, your system can process and generate synchronized analog and digital signals rapidly and deterministically.

NI RIO Hardware for PACs

Historically, programming FPGAs has been limited to engineers who have in-depth knowledge of VHDL or other low-level design tools, which require overcoming a very steep learning curve. With the LabVIEW FPGA Module, NI has opened FPGA technology to a broader set of engineers who can now define FPGA logic using LabVIEW graphical development. Measurement and control engineers can focus primarily on their test and control application, where their expertise lies, rather than the low-level semantics of transferring logic into the cells of the chip. The LabVIEW FPGA Module model works because of the tight integration between the LabVIEW FPGA Module and the commercial off-the-shelf (COTS) hardware architecture of the FPGA and surrounding I/O components.

National Instruments PACs provide modular, off-the-shelf platforms for your industrial control applications. With the implementation of RIO technology on PCI, PXI, and Compact Vision System platforms and the introduction of RIO-based CompactRIO, engineers now have the benefits of a COTS platform with the high performance, flexibility, and customization benefits of FPGAs at their disposal to build PACs. National Instruments PCI and PXI R Series plug-in devices provide analog and digital data acquisition and control for high-performance, user-configurable timing and synchronization, as well as onboard decision making on a single device.
Using these off-the-shelf devices, you can extend your NI PXI or PCI industrial control system to include high-speed discrete and analog control, custom sensor interfaces, and precise timing and control.

NI CompactRIO, a platform centered on RIO technology, provides a small, industrially rugged, modular PAC platform that gives you high-performance I/O and unprecedented flexibility in system timing. You can use NI CompactRIO to build an embedded system for applications such as in-vehicle data acquisition, mobile NVH testing, and embedded machine control systems. The rugged NI CompactRIO system is industrially rated and certified, and it is designed for greater than 50 g of shock at a temperature range of -40 to 70 °C.

NI Compact Vision System is a rugged machine vision package that withstands the harsh environments common in robotics, automated test, and industrial inspection systems. NI CVS-145x devices offer unprecedented I/O capabilities and network connectivity for distributed machine vision applications.

NI CVS-145x systems use IEEE 1394 (FireWire) technology, compatible with more than 40 cameras with a wide range of functionality, performance, and price. NI CVS-1455 and NI CVS-1456 devices contain configurable FPGAs so you can implement custom counters, timing, or motor control in your machine vision application.

Building PACs with LabVIEW and the LabVIEW FPGA Module

With LabVIEW and the LabVIEW FPGA Module, you add significant flexibility and customization to your industrial control hardware. Because many PACs are already programmed using LabVIEW, programming FPGAs with LabVIEW is easy because it uses the same LabVIEW development environment. When you target the FPGA on an NI RIO device, LabVIEW displays only the functions that can be implemented in the FPGA, further easing the use of LabVIEW to program FPGAs.
The LabVIEW FPGA Module Functions palette includes typical LabVIEW structures and functions, such as While Loops, For Loops, Case Structures, and Sequence Structures, as well as a dedicated set of LabVIEW FPGA-specific functions for math, signal generation and analysis, linear and nonlinear control, comparison logic, array and cluster manipulation, occurrences, analog and digital I/O, and timing. You can use a combination of these functions to define logic and embed intelligence onto your NI RIO device.

This application reads from analog input 0 (AI0), performs the PID calculation, and outputs the resulting data on analog output 0 (AO0). While the FPGA clock runs at 40 MHz, the loop in this example runs much slower because each component takes longer than one clock cycle to execute. Analog control loops can run on an FPGA at a rate of about 200 kHz. You can specify the clock rate at compile time. This example shows only one PID loop; however, creating additional functionality on the NI RIO device is merely a matter of adding another While Loop. Unlike traditional PC processors, FPGAs are parallel processors. Adding more loops to your application does not affect the performance of your PID loop.

FPGA Development Flow

After you create the LabVIEW FPGA VI, you compile the code to run on the NI RIO hardware. Depending on the complexity of your code and the specifications of your development system, compile time for an FPGA VI can range from minutes to several hours. To maximize development productivity, with the R Series RIO devices you can use a bit-accurate emulation mode so you can verify the logic of your design before initiating the compile process. When you target the FPGA Device Emulator, LabVIEW accesses I/O from the device and executes the VI logic on the Windows development computer.
In this mode, you can use the same debugging tools available in LabVIEW for Windows, such as execution highlighting, probes, and breakpoints.

Once the LabVIEW FPGA code is compiled, you create a LabVIEW host VI to integrate your NI RIO hardware into the rest of your PAC system. The host VI uses controls and indicators on the FPGA VI front panel to transfer data between the FPGA on the RIO device and the host processing engine. These front panel objects are represented as data registers within the FPGA. The host computer can be either a PC or PXI controller running Windows or a PC, PXI controller, Compact Vision System, or CompactRIO controller running a real-time operating system (RTOS). In the above example, we exchange the set point, PID gains, loop rate, AI0, and AO0 data with the LabVIEW host VI.

The NI RIO device driver includes a set of functions to develop a communication interface to the FPGA. The first step in building a host VI is to open a reference to the FPGA VI and RIO device. The Open FPGA VI Reference function also downloads and runs the compiled FPGA code during execution. After opening the reference, you read and write to the control and indicator registers on the FPGA using the Read/Write Control function. Once you wire the FPGA reference into this function, you can simply select which controls and indicators you want to read and write to. You can enclose the FPGA Read/Write function within a While Loop to continuously read and write to the FPGA. Finally, the last function within the LabVIEW host VI is the Close FPGA VI Reference function. The Close FPGA VI Reference function stops the FPGA VI and closes the reference to the device. Now you can download other compiled FPGA VIs to the device to change or modify its functionality.

The LabVIEW host VI can also be used to perform floating-point calculations, data logging, networking, and any calculations that do not fit within the FPGA fabric.
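The host-side sequence described above (open a reference, read and write front-panel registers in a loop, close the reference) can be sketched in text form. LabVIEW itself is graphical, so the Python below is only an analogy: `FakeFpgaReference`, the register names, the bitfile name, and the placeholder logic inside the fake are all hypothetical and do not represent an NI API.

```python
class FakeFpgaReference:
    """Hypothetical in-memory stand-in for an open FPGA VI reference."""

    def __init__(self, bitfile):
        self.bitfile = bitfile
        self.running = False
        # Front-panel controls and indicators modelled as data registers.
        self.registers = {"Setpoint": 0.0, "AI0": 0.0}

    def __enter__(self):
        # Opening the reference also downloads and runs the FPGA code.
        self.running = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # Closing the reference stops the FPGA VI.
        self.running = False
        return False

    def write(self, name, value):
        self.registers[name] = value

    def read(self, name):
        if name == "AO0":
            # Placeholder for the FPGA-side logic: plain control error,
            # not a real PID calculation.
            return self.registers["Setpoint"] - self.registers["AI0"]
        return self.registers[name]


def run_host(bitfile, setpoints):
    """Open the reference, exchange register values in a loop, then close."""
    readings = []
    with FakeFpgaReference(bitfile) as fpga:
        for setpoint in setpoints:
            fpga.write("Setpoint", setpoint)    # write a control
            readings.append(fpga.read("AO0"))   # read an indicator
    return readings, fpga.running


print(run_host("pid_example.lvbitx", [1.0, 2.0]))
```

Using a context manager mirrors the requirement that the reference is always closed, which in the real system stops the FPGA VI and frees the device for another compiled VI.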
For added determinism and reliability, you can run your host application on an RTOS with the LabVIEW Real-Time Module. LabVIEW Real-Time systems provide deterministic processing engines for functions performed synchronously or asynchronously to the FPGA. For example, floating-point arithmetic, including FFTs, PID calculations, and custom control algorithms, is often performed in the LabVIEW Real-Time environment. Relevant data can be stored on a LabVIEW Real-Time system or transferred to a Windows host computer for off-line analysis, data logging, or user interface displays. The architecture for this configuration . Each NI PAC platform that offers RIO hardware can run LabVIEW Real-Time VIs.

Within each R Series and CompactRIO device, there is flash memory available to store a compiled LabVIEW FPGA VI and run the application immediately upon power-up of the device. In this configuration, as long as the FPGA has power, it runs the FPGA VI, even if the host computer crashes or is powered down. This is ideal for programming safe power-down and power-up sequences when unexpected events occur.

Using NI SoftMotion to Create Custom Motion Controllers

The NI SoftMotion Development Module for LabVIEW provides VIs and functions to help you build custom motion controllers as part of NI PAC hardware platforms that can include NI RIO devices, DAQ devices, and Compact FieldPoint. NI SoftMotion provides all of the functions that typically reside on a motion controller DSP. With it, you can handle path planning, trajectory generation, and position and velocity loop control in the NI LabVIEW environment and then deploy the code on LabVIEW Real-Time or LabVIEW FPGA-based target hardware.

NI SoftMotion includes functions for the trajectory generator and spline engine, and examples with complete source code for supervisory control and for position and velocity control loops using the PID algorithm.
Supervisory control and the trajectory generator run on a LabVIEW Real-Time target at millisecond loop rates. The spline engine and the control loop can run either on a LabVIEW Real-Time target at millisecond loop rates or on a LabVIEW FPGA target at microsecond loop rates.

Applications

Because the LabVIEW FPGA Module can configure the low-level hardware design of FPGAs and use the FPGAs within a modular system, it is ideal for industrial control applications requiring custom hardware. These custom applications can include a custom mix of analog, digital, and counter/timer I/O, analog control up to 125 kHz, digital control up to 20 MHz, and interfacing to custom digital protocols for the following:
∙ Batch control
∙ Discrete control
∙ Motion control
∙ In-vehicle data acquisition
∙ Machine condition monitoring
∙ Rapid control prototyping (RCP)
∙ Industrial control and acquisition
∙ Distributed data acquisition and control
∙ Mobile/portable noise, vibration, and harshness (NVH) analysis

Conclusion

The LabVIEW FPGA Module brings the flexibility, performance, and customization of FPGAs to PAC platforms. Using NI RIO devices and LabVIEW graphical programming, you can build flexible and custom hardware using the COTS hardware often required in industrial control applications. Because you are using LabVIEW, a programming language already used in many industrial control applications, to define your NI RIO hardware, there is no need to learn VHDL or other low-level hardware design tools to create custom hardware. Using the LabVIEW FPGA Module and NI RIO hardware as part of your NI PAC adds significant flexibility and functionality for applications requiring ultrahigh-speed control, interfaces to custom digital protocols, or a custom I/O mix of analog, digital, and counters.

Developing Programmable Automation Controllers with the LabVIEW FPGA (Field-Programmable Gate Array) Module

Overview

Industrial control applications require highly integrated analog and digital I/O, floating-point processing, and seamless connectivity to multiple processing nodes.

FPGA Foreign-Language Material 13


Calibration of RO-based temperature sensors for a toolset for measuring thermal behavior of FPGA devices

Paweł Weber a, Maciej Zagrabski a, Bartosz Wojciechowski a,*, Maciej Nikodem a, Krzysztof Kępa b, Krzysztof S. Berezowski a

a Institute of Computer Engineering, Control and Robotics, Wrocław University of Technology, ul. Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
b Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA

Article info

Article history: Received 31 January 2014; received in revised form 13 June 2014; accepted 16 June 2014

Keywords: FPGA; RO; Temperature sensor; XDL; Netlist; Thermal model; Process variance

Abstract

In this work, we present a calibration effort for Ring Oscillator-based temperature sensors in Field Programmable Gate Array (FPGA) devices and a toolset suitable for the analysis of the thermal behavior of FPGAs. The toolset allows for the automatic synthesis of a unified temperature measurement and heat generation infrastructure combined with the necessary control structures and communication interfaces. The tools insert temperature sensors and heaters based on Ring Oscillators (RO) through low-level manipulations of Xilinx Design Language (XDL) descriptions of FPGA resource allocations. The purpose of the kit is to support rapid setup of thermal behavior experiments by providing basic heating and sensing elements with verified properties as well as an area-optimised IP core for control of the experiments. Furthermore, to evaluate inter- and intra-device variability, we compare the results of temperature sensor calibration using our toolset and six devices from two FPGA families: Spartan-3E and Spartan-6.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, the scaling of technology has promoted temperature to become one of the most important design constraints in integrated circuit (IC) design. This is because the exponential growth in available logic resources has induced a rapid growth in chip power density, which in consequence has contributed to
increased operating temperatures [1]. This puts strain on the static power dissipation, as leakage-induced power consumption is exponentially coupled to the operating temperature. In consequence, increased leakage power increases operating temperature even further. The reliability of ICs is also affected by thermal gradients that cause mechanical stress of the silicon, accelerating chip aging. Additionally, on-chip current densities have grown so high that only a portion of a chip can be safely activated at a time. Such factors have necessitated taking a proactive approach to the temperature management of computational devices.

Among chips that are typically implemented using state-of-the-art nanometer technologies are high-performance Field Programmable Gate Arrays (FPGAs) [2]. FPGA devices provide large amounts of power-efficient (MIPS/Watt) programmable resources, including specialized high-performance blocks such as DSP cores, memories, standard communication interfaces, and even complete microprocessor cores. High performance results in high power densities, so FPGAs suffer from thermal constraints. Since such devices can exhibit local and design-dependent hotspots, it is beneficial to be able to incorporate thermal sensors in a design to monitor its temperature on-line [3].

Another way of dealing with the threat of overheating in FPGA-based Systems-on-Chip (SoC) is temperature-aware design. This requires thermal modeling and simulation. However, most vendors only provide very simplistic Compact Thermal Models (CTMs) of their devices that adhere to the DELPHI methodology [4]. Such models are suitable for system simulation but do not reflect thermal gradients within a die properly. At the other end of the spectrum, using accurate Finite Element Method (FEM) models is usually prohibitive in terms of the labor and computation required. Consequently, there is a need for building CTMs of FPGAs that reflect thermal relations between different parts of an SoC, such as varying IP cores. In order to develop such
models, scrupulous measurements are needed to formulate or validate thermal models of FPGA devices [5].

Microelectronics Journal, DOI: 10.1016/j.mejo.2014.06.004. 0026-2692/© 2014 Elsevier Ltd. All rights reserved.
* Corresponding author.
E-mail addresses: pawel.weber@.pl (P. Weber), maciej.zagrabski@.pl (M. Zagrabski), bartosz.wojciechowski@.pl (B. Wojciechowski), maciej.nikodem@.pl (M. Nikodem), krzysztof.kepa@poczta.fm (K. Kępa), krzysztof.berezowski@.pl (K.S. Berezowski).

Operating temperature of a commercial-grade FPGA is typically limited to 125 °C. However, operating an actual chip near this temperature is not advised, as it accelerates IC wear-out caused by multiple temperature-induced ageing effects, such as electromigration or time-dependent dielectric breakdown (TDDB) [6]. This justifies modeling the temperature-dependent reliability of programmable logic devices [7] as well as online monitoring of the die's temperature.

While high-performance cooling systems can allow for the worst-case design, in embedded systems factors like unit cost, form factor and device volume may render them infeasible. Modern FPGAs exhibit high spatial thermal gradients, often exceeding 20 °C [8]. These are caused by varying utilisation of their configurable fabric, but also by the fact that contemporary high-end FPGA platforms incorporate multiple hard blocks such as dedicated processors, multipliers or DLLs, which increase differences in power densities across the die [8].

Temperature should be monitored and managed during operation of the system. For this reason, high-end FPGA devices have built-in thermal sensors, and vendors provide IP cores that allow them to be used in custom designs. However, even high-end devices are equipped with only a single temperature sensor. For example, Xilinx Virtex devices starting from the Virtex-5 family have one temperature sensor, available through the System Monitor [9]. In particular
designs, however, hotspots may occur far from the built-in sensor, which would suggest using multiple sensors. On the other hand, implementing multiple sensors at manufacturing time is not cost-effective because the location of hot spots in a programmable fabric is application-dependent.

The contribution of this work is a toolset for sensor placement and a sensor calibration method that can be used for a novel method of physical emulation of heat transfer in ICs using physical FPGA-based heat transfer models. We describe our in-house built toolset and a flow designed to conduct both such emulations and detailed temperature measurements. As a by-product, our method allows real-time monitoring of the temperature of custom FPGA designs. The presented toolset consists of:
• basic heating and sensing elements with verified properties,
• an area-optimised IP core for controlling the sensors and reporting the measurement results,
• a set of EDA tools for high-level design of thermal model floorplans; and
• EDA tools for sensor placement, inserting them into already synthesised design netlists, and conducting measurement experiments.

In our experimental environment, the proposed EDA tools manipulate the low-level description of FPGA-based systems in order to place temperature sensors and heaters where needed, with fine-grained control over their physical layout. The tools control operation of the sensors and heaters with very little overhead in terms of device area. As a result, our toolset provides an environment for rapid preparation of experiments, high sensing resolution, and a low-cost implementation of individual sensors. Our approach is universal and may be implemented in a wide variety of Xilinx FPGAs.

The ability to sense the temperature of an FPGA device and to locally heat it up provides two significant benefits to the FPGA-based SoC designer: (1) Design testing and temperature monitoring: the designer can inject temperature sensors with the required communication infrastructure into his custom
design and quickly evaluate the thermal behavior of the system. If thermal constraints are not met, countermeasures can be applied, e.g. the floorplanning step can be repeated in order to mitigate the identified thermal issues. (2) Evaluation of thermal models of the device: by placing a set of controllable heaters and temperature sensors, the thermal response of a system's physical floorplan to user-determined stimuli can be rapidly emulated and used to validate thermal models, as was done in [10]. Apart from that, a generic FPGA development kit with controllable heaters can be used as a low-cost thermal test chip that is much cheaper than its ASIC counterpart. Since many Xilinx FPGA families are supported, the size and power density can be easily changed. The proposed setup offers very high coverage with fine control over heaters, relatively high precision of static temperature measurements (<1 °C) and good resolution of transient measurements (on the order of 10 ms).

2. Related work

Analog sensors used in integrated circuits produce voltage or current levels that are proportional to the temperature of the sensor. An A/D converter is then used to obtain a useful reading. Digital sensors can be implemented using a ring oscillator (RO), whose oscillation frequency is sensitive to temperature, a counter and a time reference. In this work we use ROs due to the near-linear dependence of logic delay on temperature [11,5,12,13]. The system counts the time of a fixed number of oscillations of an odd-numbered ring of inverters using a fixed-frequency reference oscillator. Sensors built with ROs are very often used in reconfigurable logic.
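Because the RO reading depends nearly linearly on temperature, a per-sensor calibration can be as simple as a straight-line fit between counter readings and reference temperatures. The sketch below is a minimal illustration under stated assumptions: the calibration points are synthetic numbers generated from an exactly linear model (not measured data), and the function names are hypothetical rather than part of the toolset.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Synthetic calibration points (assumption, NOT measured data):
# reference temperatures in degrees Celsius and the counter value
# a hypothetical sensor would report at each of them.
temps_c = [20.0, 40.0, 60.0, 80.0]
counts = [20000 - 15 * t for t in temps_c]

# Fit temperature as a function of the counter reading.
slope, offset = fit_line(counts, temps_c)


def count_to_temp(count):
    """Convert a raw counter reading to degrees Celsius for this sensor."""
    return slope * count + offset


print(round(count_to_temp(19250), 2))
```

Fitting each sensor separately is one way to absorb the inter- and intra-device variability mentioned in the abstract, since every RO gets its own slope and offset.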
The first works on sensing temperature in FPGAs using ROs [11,14] demonstrated the concept and clearly listed the advantages of such an approach: (i) hotspots and signal contentions can be detected if they manifest themselves with increased temperature, (ii) no dedicated hardware is needed, as sensors can be inserted and removed at design time or even reconfigured dynamically, (iii) the sensors are purely digital and need to operate only for a very short time, hence with negligible contribution to both power consumption and temperature rise.

Some possible applications of temperature sensors in FPGAs are not straightforward. For instance, temperature sensing using ROs was also used to create a covert channel between electrically separated parts of an FPGA-based design [15]. The proof of concept was implemented in a Xilinx Spartan-IIE device and allowed the exchange of up to 1 bit per second within the device and up to 1/8 bit/s between the FPGA and an external transceiver. In [16] a discussion of possible crosstalk effects in FPGAs is presented, as well as a demonstration of a method for time-to-digital conversion in order to measure the impact of coupling capacitances in the interconnection structure of FPGAs. This method enables simple post-production investigations of the signal integrity of programmable devices. Zick and Hayes [17] presented a sensor node implementable in reconfigurable logic which is smaller than in previously reported works while also being self-contained, i.e., it requires no external logic. They also show procedures for inferring fine-grained variations in delay, temperature, switching-induced IR drop, and leakage-induced IR drop. All experiments reported in [17] were performed with over 100 sensors in a Virtex-5 device.

Thermal sensors that are planted across the chip require a controller unit responsible for scheduling and collecting measurements. The overhead is in both timing and programmable resources and is related to the very existence of that controller. It also scales up with the amount of
sensors in use. Therefore, the number of thermal sensors implemented in the fabric has to strike a reasonable balance between the spatial granularity of sensing and the overhead in area and timing induced by the controller unit. Consequently, with a limited number of sensors, it is highly advisable to place sensors in places of interest, e.g. in close proximity to the expected hot spots. The proximity constraint poses a design problem that needs to be solved. Mukherjee et al. [18] proposed algorithms for sensor placement in reconfigurable logic that helped to reduce the number of sensors required to maintain high accuracy in monitoring hot spots. Sayed and Reda [19] proposed the use of soft sensors to circumvent the problem of sensor placement in reconfigurable devices, where exact positions of hot spots may change based on the current configuration and workload. Sensor placement is constrained by the distance to a possible hotspot and the allowed error margin. Knowing the maximum power density achievable on a given device, it is possible to estimate this error as a function of the distance from a hotspot. An analytical model for the upper bound of on-chip temperature differences was presented in [20].

The authors of [10] report that each temperature sensor consumes about 1 mW of power. To avoid the disturbing self-heating effect of the sensor resources, sensor blocks should only be active for a relatively short time. Usually a trade-off between sensor noise and self-heating can be observed, e.g. the authors of [21] advise performing measurements with a period of no more than 2^16 cycles of a 100 MHz clock (655 μs). Also, with a large number of sensors, the overall power consumption of all the sensors should not be neglected.
Compact sensor design and daisy-chain routing, combined with the very high capacity of modern FPGAs, may allow for hundreds of sensors in one design.

3. The overall architecture of the measurement system

The overall architecture of our measurement system is depicted in Fig. 1. It consists of a set of temperature sensors, an optional set of heaters, and control and communication logic implemented in an IP core called SimulationCore. The SimulationCore is the only element of the whole toolset that has to be synthesized with the target design. Other elements of the system, e.g. sensors and heating blocks, can be added to the final, placed and routed, netlist. When one wants to measure the thermal response to an artificial set of power stimuli, a pre-synthesised version can be used to quickly start an experiment. The logic in the SimulationCore can support sequential readouts from up to $2^n$ sensors and control the power output of up to $m$ heaters, where $n$ and $m$ are limited by the amount of configurable fabric that one is willing to use.

The logic controlling the thermometers consists of a frequency divider providing the time base, a demultiplexer setting the enable signal of each sensor, a counter and register for storing the RO oscillation count that is connected to all sensors via a multiplexer, and a counter that controls this multiplexer and provides the current sensor number. The main part of the heater logic of the SimulationCore consists of an array of register-counter pairs, one for each heater.
Each counter is clocked with the main 50 MHz clock. These pairs provide a compact and efficient PWM implementation for each heater: it is enabled only when the counter's value is higher than the value in the heater's register. With 8-bit counters/registers a control resolution step of 0.4% can be achieved, and it can be adjusted easily.

Operation of the SimulationCore module is controlled by two compact Finite State Machines (FSMs). The first FSM is responsible for setting the configuration registers for each heater in response to control commands received from the PC workstation that is controlling the experiment. Commands are 3-byte sequences: the first byte defines the command type, the second is used as the heater address, and the third as its setting. The second FSM controls the thermometers, multiplexes enable signals to the sensors, and counts their oscillations in a predefined sensing time. This FSM also communicates the measurements via UART to the PC that controls the experiment. Each readout is a 3-byte package with a 1-byte sensor number and a 2-byte counter value.

Temperature sensors were designed independently of the SimulationCore and can be easily changed. Using the centralised approach to gathering the readouts, we have very compact sensors, each occupying only one Configurable Logic Block (CLB) and being just an RO with an enable signal and an output for counting the oscillations.
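The two packet formats and the PWM rule described above can be sketched from the host side as follows. This is a hypothetical illustration only: the command-type code, the big-endian byte order of the 2-byte counter value, and all function names are assumptions, since the text does not specify them. The duty-cycle helper further assumes a free-running counter that wraps over its full range.

```python
import struct
from typing import Tuple

# Hypothetical command-type code; the real values are not given in the text.
CMD_SET_HEATER = 0x01


def encode_heater_command(heater_addr: int, setting: int,
                          cmd_type: int = CMD_SET_HEATER) -> bytes:
    """Build a 3-byte command: type, heater address, heater setting."""
    for name, value in (("cmd_type", cmd_type),
                        ("heater_addr", heater_addr),
                        ("setting", setting)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in one byte, got {value}")
    return bytes([cmd_type, heater_addr, setting])


def decode_readout(packet: bytes) -> Tuple[int, int]:
    """Split a 3-byte readout into (sensor number, 16-bit counter value).

    Big-endian order for the counter is an ASSUMPTION.
    """
    if len(packet) != 3:
        raise ValueError("a readout is exactly 3 bytes")
    sensor, count = struct.unpack(">BH", packet)
    return sensor, count


def pwm_duty(register: int, bits: int = 8) -> float:
    """Fraction of counter states for which counter > register."""
    states = 2 ** bits
    if not 0 <= register < states:
        raise ValueError("register value out of range")
    return (states - 1 - register) / states


print(encode_heater_command(heater_addr=3, setting=128).hex())
print(decode_readout(bytes([7, 0x5F, 0x4B])))
print(f"duty step: {pwm_duty(0) - pwm_duty(1):.4%}")
```

With 8-bit registers the duty step is 1/256, which matches the roughly 0.4% control resolution stated above.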
Each logical heater in our architecture can consist of any number of one-CLB heater modules. As with the temperature sensors, the heaters are controlled by the SimulationCore, but are designed independently and can be changed.

4. Temperature sensor design

In general, an RO-based temperature sensor measures temperature by counting the number of oscillations that happen within a time window measured by a fixed number of ticks of the time reference. As the oscillation frequency of the RO is temperature-dependent, the number of oscillations observed within the time window changes with the sensor's temperature. Information on oscillation numbers allows one to compute the RO's oscillation period/frequency. The complete measurement setup is usually composed of three elements: a ring oscillator with an odd number of inverters and a single AND gate acting as an on/off switch (see Fig. 2), a pulse counter, and a time-reference counter. Such a composition allows for different architectural solutions: a temperature sensor can be fully integrated with both counters and able to operate stand-alone; or it can be distributed, reducing the sensing element itself to just an RO, while the counters and the control unit can be located elsewhere and/or shared by many or even all sensors. The latter architecture trades off the area/resource overhead of the complete measurement system against the number of measurements it can snapshot across the die simultaneously. In such a configuration, a complete temperature sensing setup can comprise a number of ring oscillators, the multiplexing/demultiplexing network, a time-reference counter, and optional communication logic sending the readouts outside of the FPGA.

In the reported design the latter approach is chosen, i.e., sensing elements are reduced to ROs only. This allows for a very compact implementation of the sensing element itself, at the expense of the system's temporal resolution. In our design, sensors share one common oscillation counter and therefore have to be read one by one.

An
important trade-off in designing an RO-based sensor is between area and noise. The longer the inverter chain is, the more stable its oscillations are, the lower the noise, and the lower its power density. On the other hand, the size of the sensor grows with the length of the inverter chain. Typically authors use between 3 and 21 [16,22,21] inverters in one chain. Inverter chains shorter than 3 elements tend to oscillate at higher frequencies than can be registered by a logic element in the FPGA and thus cannot be used as sensors. In FPGA devices, logic resources are organized in a regular grid of Configurable Logic Blocks (CLBs). Therefore, in order to reduce internal fragmentation of resources, RO chain lengths should be aligned with the capacity of CLBs. With this in mind, we settled on a 7-stage RO that fits in one CLB of our test devices, the Spartan-3E 1600 and Spartan-6 LX45 from Xilinx (see Fig. 2).

Fig. 1. Overall architecture of the measurement system.

Fig. 2. Left: schematic description of the temperature sensing circuit implemented in one CLB within a Spartan-3E FPGA. The sensor is a 7-stage ring oscillator. Right: schematic of a heater implemented in one CLB. The heater consists of eight 1-stage oscillators.

In our design, we use a 16-bit wide time-reference counter to measure the time it takes for a 14-bit RO oscillation counter to overflow. Effectively we measure the RO oscillation frequency with 16-bit resolution. At the beginning of the measurement the AND gate, serving as an enable switch, is turned on and simultaneously a counter starts counting RO oscillations. Some authors [23,24] advise starting the measurement after the RO stabilizes. However, with a sufficiently long measurement time this does not bring any improvement in terms of noise reduction.

Naturally, sensors generate heat when active. Thus, the measurement time has to be limited, such that the amount of heat generated by the sensor, and
more importantly its contribution to the local temperature of the die, remains negligible. While most authors acknowledge the phenomenon, to the best of our knowledge there is no systematic study of temperature sensor self-heating. The authors of [21] propose limiting the measurement time to less than 1 ms, while Franco et al. [22] designed the control unit to make measurements lasting 20 ms. In our setup, we were able to limit the measurement time to as little as 10 μs, however at the cost of substantial noise. Since sensors are enabled and disabled one after another, with high numbers of them their self-heating is negligible. Overall we demonstrated a sensor grid of almost 150 sensors in the Spartan-3E FPGA, with a temporal resolution of around 100 ms. Analysing the various design trade-offs associated with ROs is beyond the scope of this work. A systematic study of the properties of RO-based sensors and a proposed quality metric for such sensors can be found in [21].

5. Sensor calibration procedure

Due to process variation in new fabrication technologies and nondeterministic routing algorithms, calibration should be repeated for each sensor in any FPGA device in order to produce accurate results. Three main approaches to this calibration process are described in the literature. One is based on a physical thermal sensor integrated in some devices [5,21]. Another calibration method requires measurements of RO frequencies in a controlled environment such as a climate chamber. For each temperature level, the frequencies are logged and then linear or polynomial regression is performed against the data [22]. A similar method relies on IR measurements of the device to link temperature to sensor output.
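The regression step of the chamber-based method can be illustrated with an ordinary least-squares line fit of temperature against oscillation count. This is a minimal sketch under stated assumptions: the data points are synthetic values generated from an assumed line, not measurements from any of the cited works, and `fit_line` is an illustrative name.

```python
from typing import Sequence, Tuple


def fit_line(counts: Sequence[float], temps: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of temps = a * counts + b; returns (a, b)."""
    n = len(counts)
    if n < 2 or n != len(temps):
        raise ValueError("need at least two matching (count, temperature) pairs")
    mean_f = sum(counts) / n
    mean_t = sum(temps) / n
    sxx = sum((f - mean_f) ** 2 for f in counts)
    if sxx == 0:
        raise ValueError("all counts are identical; slope is undefined")
    sxy = sum((f - mean_f) * (t - mean_t) for f, t in zip(counts, temps))
    a = sxy / sxx
    b = mean_t - a * mean_f
    return a, b


# SYNTHETIC chamber levels (deg C) and oscillation counts for one sensor.
temps = [20.0, 30.0, 40.0, 50.0, 60.0]
counts = [24500.0, 24250.0, 24000.0, 23750.0, 23500.0]

a, b = fit_line(counts, temps)
print(f"a = {a:.4f} degC per count, b = {b:.1f} degC")
```

The slope is negative because the oscillation count drops as the die heats up; a polynomial fit would follow the same pattern with more coefficients.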
However, the latter two methods assume that the temperature of all the sensors is equal to that of the package. This assumption requires the device to be idle and to dissipate minimal power uniformly over its area. In any case the calibration process should take into account the thermal properties of the device as well as the nature of the sensor. The main drawback of this approach is the limitation on the temperature ranges in which the sensors can be tested. While most FPGAs can operate at temperatures up to 100 °C, the evaluation boards are not designed to endure such conditions. Hence, non-linearities that may manifest at higher temperatures may not be included in the resulting sensor model.

Initially, sensors were calibrated by measuring RO oscillations in an otherwise idling device, simultaneously with IR measurements of the FPGA's package temperature in a steady state. Apart from the ambient conditions, where the package temperature was close to 32 °C, we used an external heat source to obtain two additional stable measurement points at 52 and 64 °C. We performed linear regression against the data and confirmed that the temperature-frequency relation is linear for the temperature range tested.

There is a systematic variation in sensor frequencies that is common to both tested devices. In each device used in our experiments there is a hotter region in the upper right corner and a cooler one in the bottom left (see Fig. 9). Unfortunately, this intra-sensor variation is quite high and actually exceeds the dynamic range of our experiment. Differences between the slowest and fastest ROs reach 1800 oscillations (over 7%) when the device is in idle mode in ambient conditions (see Fig. 3). In order to overcome this intra-sensor variation an additional calibration is required.

Differences between individual sensors are hard to predict at design time. This issue can be easily resolved by performing initial sensor calibration at a set of known temperature levels. However, this procedure is prohibitively time and effort
consuming. For this reason we employed a mixed approach, where we use detailed calibration results to determine the temperature-frequency relationship and then pinpoint the exact parameters of the sensors with a measurement at only one temperature point (usually the ambient temperature).

To collect data for the sensor calibration we ran two series of measurements in a programmable climate chamber. The chamber had a volume of 140 liters, an operating temperature range of −40 to +180 °C, and controllable humidity. Because of the consumer-grade electronic parts on the evaluation boards we set the humidity to 0% and used temperature ranges first from 20 to 60 °C and then from 0 to 70 °C, in steps of 10°. The chamber was programmed to change temperature quickly and then keep it stable for a fixed time of 30 min. We measured the temperature inside the chamber with a DS18B20 digital thermometer, which has 0.5 °C accuracy over the whole temperature range of the test. The resulting temperature profile is shown in Fig. 5 (left). In the first experiment we compared 3 Spartan-3E devices with exactly the same configuration to verify the possible inter-device variation in oscillator parameters. In the second one we used two Spartan-3E and two Spartan-6 boards with varying placement of the SimulationCore in the devices to evaluate how it affects the parameters of the sensors.

The calibration steps were then as follows: (i) collect profiles with long periods of stable temperature, (ii) filter the data with a median filter to remove outliers (resulting from communication errors), (iii) downsample the data to further remove noise and reduce the number of points, (iv) for each sensor find the median oscillation count in periods of stable temperature, (v) perform polynomial regression on the known data points.

We fitted the sensor data to polynomials of orders from one to five. On average the linear approximation did not result in errors in excess of 0.8 °C, therefore for the sake of simplicity we use linear regression later on. However, by using quadratic and cubic functions we
were able to get a much better fit: the norm of residuals dropped from 1.13 for the linear approximation to 0.86 for the quadratic and 0.45 for the cubic. Further increases in polynomial order did not improve the quality of the fit.

Then, we analyzed the coefficients $A$ and $B$ of the temperature-oscillations relation $T = A \cdot f_{RO} + B$ for (i) a set of FPGA devices of the same model, and (ii) different sensor grid setups. This provided insight into the extent of variation between sensors on one device and on different devices, and the variation introduced by routing differences, which is to be expected with different sensor-grid setups.

Fig. 3. Regression results for the oscillations-to-temperature relation. Results for only one in ten sensors are pictured (including the slowest and the fastest RO) for image readability.

When operating in steady state at ambient temperature, the difference between the minimal and maximal average RO oscillation counts is 7.5%, with $\sigma$ of only 1.4%, for the Spartan-3E. In Fig. 4 we present the $A$ and $B$ coefficient maps for 3 devices with exactly the same configuration. The maps are plotted with the same color scale. Each point in the map is one coefficient of a single sensor in the $15 \times 10$ grid. It can be seen that sensor parameters differ between the devices, but the $A$ parameter has a standard deviation of only 1.3e−3 and 3.9e−4, with an average of 4.3e−2 and 3.1e−2, for the Spartan-3E and Spartan-6 respectively (see Table 1).

Having obtained the regression data, one can finish the calibration for any placement of sensors using only the ambient temperature. For that, the temperature profiles are normalized with respect to the mean oscillation count of all the sensors at the ambient temperature. Each profile is shifted by its difference from the mean of all the sensors at the ambient temperature. These shifted profiles are shown in Fig. 6 (left). One thing immediately noticeable is the difference in oscillation counts between sensors, which grows with temperature. This however is
already resolved with linear regression. The linear regression results for all the sensors in one device are shown in the right part of Fig. 6. By comparing the original and recreated profiles one can see that the linear model fits the data well (see Fig. 5, right). The maximal absolute temperature error $\epsilon_{max}$ does not exceed 1 °C. Also, the absolute difference between the maximal and minimal temperature indicated by all the sensors was below 1° at all times, with a median maximal difference of 0.6 °C.

6. Heater design

Controllable heating sources have a number of applications in experimental research on the thermal behavior of both FPGAs and ASICs. First, the use of heating modules allows replacing simulation with experiments using real devices, hardware-in-the-loop emulation, characterization and calibration of heat flow models [5,25], and thermal-aware testing [26]. Also, hardware-in-the-loop enables accurate thermal characterization of the device itself, including power density as well as the impact of variability on its thermal response. Finally, controllable heat generation units were also used to improve and validate IR monitoring techniques and temperature-to-power mapping algorithms [27]. Unfortunately, the authors of most publications do not describe the architectures or performance results of the heaters used in much detail. The only systematic study of the design of programmable heating elements in configurable logic is [28], where eight different heat generating cores, implemented using different resources available in the FPGA matrix, are evaluated in terms of their maximal temperature. The authors report a temperature increase in excess of 100 °C and compare it to their previous heaters, which allowed only for a 6 °C gradient. The same group revised their evaluation method in [29].

In this work we use controllable heaters to evaluate and characterize our temperature sensing infrastructure that is described in the previous section. In a similar way to [28], we use length-of-1 ring

Fig. 4. Linear
regression coefficient maps for three Spartan-3E 1600 FPGAs.

Table 1
Temperature sensor regression results: A and B coefficients of the linear function $T = A \cdot f_{RO} + B$. S3E1k6 is the Spartan-3E 1600 on the evaluation board, S6LX45 is the Spartan-6 LX45 on the Atlys board; MD: mean deviation, σ: standard deviation.

Parameter   S3E1k6                  S6LX45
            Spartan#2   Spartan#3   Atlys#4   Atlys#6
Mean A      −0.043      −0.045      −0.030    −0.031
Mean B      1081        1073        966       1042
MD A        0.001       0.001       3.2e−4    4.4e−4
MD B        25.12       24.2        10.14     14.59
σ A         0.0013      0.0013      3.9e−4    5.4e−4
σ B         32.98       30.16       12.49     18.21
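The single-point calibration described in Section 5 can be illustrated with the mean coefficients from Table 1. This is a simplified sketch, not the authors' implementation: it applies one shared slope and intercept (the Spartan#2 column as read from Table 1) to every sensor, the oscillation counts are synthetic, and the function names are illustrative.

```python
from statistics import mean
from typing import List, Sequence

# Mean A and Mean B for Spartan#2 as read from Table 1 (T = A * f_RO + B).
MEAN_A = -0.043
MEAN_B = 1081.0


def ambient_offsets(ambient_counts: Sequence[float]) -> List[float]:
    """Per-sensor shift: difference from the mean count of all sensors at ambient."""
    reference = mean(ambient_counts)
    return [count - reference for count in ambient_counts]


def counts_to_temperature(counts: Sequence[float], offsets: Sequence[float],
                          a: float = MEAN_A, b: float = MEAN_B) -> List[float]:
    """Shift each sensor's count by its ambient offset, then apply T = a * f + b."""
    return [a * (count - offset) + b for count, offset in zip(counts, offsets)]


# SYNTHETIC counts for three sensors: at ambient, then after uniform heating.
ambient = [24000.0, 24400.0, 24800.0]
heated = [23900.0, 24300.0, 24700.0]

offsets = ambient_offsets(ambient)
for t in counts_to_temperature(heated, offsets):
    print(f"{t:.1f} degC")
```

A per-sensor slope, as in the regression maps of Fig. 4, would replace the shared `a` when the spread in oscillation counts grows with temperature.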

FPGA Foreign-Language Materials 113

FPGA-Based Subset Sum Delay Lines

Chung-Yun Wang, Yu-Yi Chen, Jiun-Lang Huang
Graduate Institute of Electronics Engineering, Department of Electrical Engineering
National Taiwan University

Xuan-Lun Huang
Industrial Technology Research Institute
Hsinchu, Taiwan

Abstract: The programmable delay line is one of the key components in automatic test equipment. Recently, the implementation of programmable delay lines on FPGAs has drawn growing attention due to the flexibility and reconfiguration capability that FPGAs provide. In this work, we propose the subset sum delay line (SSDL) architecture for FPGA-based delay lines. The SSDL architecture takes advantage of the inevitable FPGA process variations, structure irregularities and routing uncertainties to realize high-quality FPGA-based delay lines. Furthermore, compared to previous FPGA-based delay lines, the SSDL architecture is FPGA independent; this substantially enhances its portability across different FPGA generations and suppliers. An SSDL is realized on an Altera Cyclone II FPGA. Measurement results show that it achieves 76 ps resolution and has a dynamic range of 32 ns.
Keywords: FPGA, ATE, programmable delay line, subset sum

I. INTRODUCTION

The programmable delay line plays a very important role in test and measurement applications; it facilitates de-skew, timing measurement, precise edge placement, and time-to-digital conversion. Recently, FPGA-based delay lines have been drawing growing attention and several techniques have been proposed. Compared to their ASIC counterparts, FPGA-based delay lines have the following advantages.

1) FPGA-based delay lines incur very low or no NRE (non-recurring engineering) cost and manufacturing delay.
2) With their off-line and even on-line reconfigurability, FPGA-based delay lines permit quick and flexible delay line performance adjustment; this makes FPGA-based timing circuits suitable for a wide spectrum of measurement and design needs.
3) Modern FPGAs possess a rich amount of digital resources, e.g., millions of logic elements, memory blocks, and even embedded processors; this allows one to realize the characterization/calibration and even application-related circuitry on the FPGA.

FPGA-based delay lines are not without limitations. First, the operating speed in general lags that of state-of-the-art ASICs. Fortunately, as fabrication technology keeps advancing, FPGA performance improves as well. Second, the architecture of the adopted FPGA limits the choice of delay line architectures. In general, one has to fit an existing delay line architecture onto the FPGA architecture or develop new architectures that best utilize the FPGA architecture. Finally, one must pay more attention to the FPGA design flow so that the desired delay line schematics remain intact; this requires intensive knowledge of and experience with the FPGA architecture as well as the design tools.

Most of the reported FPGA-based timing circuits ([5], [4], [11], [9], [7], [10], [1], [6], [2], [3]) target time-to-digital conversion (TDC) applications. In [5], a programmable delay line that uses MUX-based delay elements is reported; it achieves a 250-ps resolution. The authors in [4] proposed a ring oscillator
based TDC technique. In [11], [9], [7], the authors adopted the wave union TDC technique to resolve the unavoidable ultra-wide bin issue, i.e., some delay elements possess much larger delay values than others. The idea is to generate, for each input hit, a wave union with several 0-to-1 and 1-to-0 transitions and feed them into the TDC; this implicitly makes several measurements. [11] proposed post-measurement analyses to enhance the measurement resolution and accuracy. The bin-to-bin calibration scheme in [1], [2] improves the measurement resolution and achieves the same resolution on all 32 channels. FPGA-based delay generators are reported in [6], [8]. In [3], the authors reported on-FPGA coarse and fine delay lines whose resolutions are 105 and 13 ps, respectively.

To avoid interconnect-induced delay line non-linearities, most FPGA-based timing circuits use a dedicated chain structure, e.g., the carry chain, available in FPGA families [10]. However, the carry chain structure provides limited regularity, e.g., for eight or sixteen consecutive elements; this results in severe non-linearity when the delay line crosses logic block boundaries. As a result, post-processing and calibration are in general mandatory.

In this work, we propose a new FPGA-based programmable delay line architecture; it is called the subset sum delay line (SSDL) as the key concept is similar to the subset sum problem. The core idea is to take advantage of (1) the inevitable delay deviations associated with FPGA-based delay elements, (2) the FPGA structure irregularities, and (3) redundant delay elements. First, a sufficiently large number of delay elements are generated and connected to form a programmable delay line. For each delay element, its actual delay value is measured and stored. Then, the modified subset sum algorithm is applied to identify the set of delay elements needed to generate the desired delay values; the results are stored in a lookup table for future applications.

2014 IEEE 23rd Asian Test Symposium

Compared to past
FPGA-based delay lines, the proposed subset sum delay line (SSDL) has the following advantages.

1) Potential for ultra-high resolution. Instead of being limited by FPGA logic element and interconnect delay variations, SSDL takes advantage of these non-ideal factors to achieve very high resolution. In fact, it is these non-idealities that make the SSDL concept a feasible FPGA-based delay line architecture.
2) Portability across different FPGA architectures. As just mentioned, SSDL achieves high resolution by using redundant delay elements whose delay values deviate from their nominal values. Since SSDL does not require that the delay elements strictly adhere to their nominal values, the delay elements can be realized with high-level FPGA constructs; there is no need to manually place the delay elements and modify the EDA flow as in [3]. Thus, the overhead of migrating to different platforms is much less than for past techniques that rely heavily on knowledge of the FPGA structure and low-level constructs.

An SSDL has been implemented on the Altera Cyclone II FPGA. The measurement results show that the resolution of the prototype delay line is 76 ps for a 32 ns dynamic range.

The rest of the paper is organized as follows. In Section II, the proposed SSDL architecture is described. Then, Section III shows the implementation and measurement results. Finally, Section IV concludes this work.

II. THE PROPOSED SUBSET SUM DELAY LINE (SSDL)

A. The $D_n$ Delay Elements

The proposed subset sum delay line (SSDL) consists of cascaded delay elements, called $D_n$ elements. The conceptual schematic of a $D_n$ element is shown in Fig. 1a.
Each $D_n$ element consists of a series of $n$ unit buffers and a multiplexer. The sel signal determines whether the input signal goes through the $n$ buffers or not. For simplicity, we use the symbol in Fig. 1b to represent a $D_n$ element. Fig. 1c depicts an example subset sum delay line; it consists of four $D_1$'s, one $D_2$, one $D_3$, one $D_5$, and one $D_9$.

In the ideal case, the schematic in Fig. 1a suggests that the delay value of $D_n$, denoted by $\tau_n$, is

$$\tau_n = \tau_{mux} + sel \cdot n \cdot \tau_{buf} \qquad (1)$$

where $\tau_{mux}$ and $\tau_{buf}$ correspond to the delays of the multiplexer and the unit buffer, respectively.

From (1), the highest achievable resolution of a delay line that consists of $D_n$ elements is $\tau_{buf}$, and the linearity is limited by the buffer delay variations and interconnect delay variations. To achieve high resolution, one may use the vernier delay line. Using vernier delay elements, the achievable resolution becomes the difference between two different types of unit delay elements; however, the incurred high delay latency may be unacceptable in some applications.

Figure 1. The $D_n$ delay element schematic.

Achieving high linearity is also challenging for FPGA-based programmable delay lines. First, there is nothing that one can do to reduce the process variations. Second, the structure regularity, e.g., a block of evenly placed logic elements, is in general limited to only 8 or 16 consecutive logic blocks.

As will be shown later, instead of fighting these non-ideal factors, the proposed SSDL takes advantage of the inherent variations and non-ideality to achieve high resolution and high linearity.

B. The SSDL Concept

At first glance, the SSDL hardware is rather simple and is unlikely to realize a high-quality programmable delay line if the $D_n$ elements exactly follow the model specified by (1). Take the SSDL in Fig. 1c for example. Assuming that there are no delay variations, it can only produce twenty-four delay values

$$8 \cdot \tau_{mux} + k \cdot \tau_{buf}, \quad 0 \le k \le 23$$

and the resolution equals the unit buffer delay $\tau_{buf}$.

Figure 2. Simulation results of the SSDL in Fig. 1c.

On the
contrary, the SSDL architecture takes advantage of the inevitable unit buffer and interconnect delay variations to achieve high resolution and linearity. Consider the SSDL in Fig. 1c again. For convenience, let us translate the multiplexer and interconnect delay variations into equivalent $D_n$ delay variations; therefore, we only have to consider the $D_n$ delay variations. With the $D_n$ variations taken into account, the SSDL in Fig. 1c will produce at most $2^8 = 256$ delay values, about an order of magnitude more than the original twenty-four when variations are ignored. In general, an SSDL that consists of $k$ $D_n$'s will produce at most $2^k$ delay values; an SSDL with $k = 20$ can easily produce one million delay values.

Of course, generating a large number of delay values is insufficient to realize a high-quality programmable delay line, which requires the delay values to be evenly spaced. In SSDL, out of all the possible delay values, a subset is chosen to realize a linear delay line.

Assume that (1) the nominal delay value of the unit buffer in Fig. 1 is 1, (2) the standard deviation of the unit buffer delay variation is 0.1, and (3) the multiplexer delay is 0. Fig. 2a shows the 256 delay values that the SSDL in Fig. 1c may produce. While the range is similar to that of the ideal case, i.e., from 0 to 23, these values clearly cover the range more densely. Fig. 2b shows the histogram of the 256 delay values. In this histogram, the bin width is 0.624, the minimum width such that there is no empty bin; this corresponds to 36 bins. Except for the first and last bins, all bins contain at least two delay values.

After binning, a quality delay line can be generated. To generate the delay line, we first use the bin centers as the target values. Then, for each bin center value, the closest delay value (from the 256 values) is chosen as the actual value, i.e., the actual value that this delay line can produce. Fig. 2c shows the 36 selected delay values; Fig. 2d shows the INL (integral nonlinearity) plot. For this SSDL example, 1 LSB = 0.624 (the bin
width) and the maximum INL is ±0.5 LSB. It is clear that the SSDL outperforms the original nominal delay line. First, it delivers higher resolution by reducing 1 LSB from 1 down to 0.624. Furthermore, the maximum INL never exceeds 0.5 LSB.

C. Modified Subset Sum Algorithm

For the SSDL architecture, one key technique is to select the desired delay values out of a huge number of possible delay values. Take a 10-bit 20-ps resolution delay line for example. The delay line is supposed to generate 1,024 delay values:

$$T_0 + 20 \cdot k \ \text{ps} \qquad (2)$$

where $T_0$ is the minimum delay, i.e., when all $D_n$ control signals are set to zero, and $k = 0, 1, \cdots, 1023$. Assume that the SSDL consists of twenty $D_n$'s; this corresponds to 1M possible delay values, out of which 1,024 will be selected.

Before applying the subset sum algorithm, the $D_n$ elements are characterized first. To begin with, the control signal sel of each delay stage is set to 0; the corresponding delay value is denoted by $T_0$. Then, for stage $i$, we set its and only its sel signal to 1, and denote the measured delay value by $T_i$. The delay difference that stage $i$ can introduce is thus

$$\Delta T_i = T_i - T_0. \qquad (3)$$

Assume that the SSDL consists of $N$ stages. The proposed subset sum algorithm is depicted in Alg. 1. In addition to the $D_n$ delay values, $T_0, T_1, \cdots, T_N$, the trimming threshold $t$ is also specified. At the beginning, the set of selected combinations $\Theta$ consists of $\{s_0\}$ only. Note that $s_0$ is an imaginary delay stage whose delay value equals $T_0$. Then, the algorithm enters the expansion (line 4) and trimming (lines 5 to 14) loop $N$ times. Each time, the current combinations are expanded with delay stage $s_i$. For example, after expanding with the first stage $s_1$, we have

$$\Theta = \{\{s_0\}, \{s_0, s_1\}\} \qquad (4)$$

where $\{s_0, s_1\}$ means that the sel signal is 1 for stage 1 and 0 for the other stages, and its delay value is $T_0 + \Delta T_1$.

Algorithm 1 Modified Subset Sum Algorithm
Input: $T_0, \Delta T_1, \Delta T_2, \cdots, \Delta T_N, t$
Output: $\Theta$
 1: $\Theta \leftarrow \{\{s_0\}\}$
 2: $i \leftarrow 1$
 3: while $i \le N$ do
 4:     $\Theta \leftarrow \Theta \cup \{\theta \cup \{s_i\} \mid \theta \in \Theta\}$
 5:     $\Theta' \leftarrow \emptyset$
 6:     $b \leftarrow T_0$
 7:     while $|\Theta| > 0$ do
 8:         $\theta \leftarrow$ getMinDelay($\Theta$)
 9:         if delay($\theta$) $> b + t$ then
10:             $\Theta' \leftarrow \Theta' \cup \{\theta\}$
11:             $b \leftarrow$ delay($\theta$)
12:         end if
13:     end while
14:     $\Theta \leftarrow \Theta'$
15:     $i \leftarrow i + 1$
16: end while

After further expansion with stage $s_2$ (without trimming), the set of delay stage combinations becomes

$$\Theta = \{\{s_0\}, \{s_0, s_1\}, \{s_0, s_2\}, \{s_0, s_1, s_2\}\}. \qquad (5)$$

Clearly, the size of $\Theta$ grows exponentially with $N$. Thus, we trim $\Theta$ (lines 5 to 14) by removing the combinations whose delay values are too close to others. Each time, the combination $\theta$ in $\Theta$ that has the minimum delay value is popped. If its delay value is within $t$, the trimming threshold, of the current baseline $b$, $\theta$ is removed.

After applying the modified subset sum algorithm, $\Theta$ still contains more combinations than needed. For each desired delay value, we select from $\Theta$ the closest one and save it in a lookup table. For example, suppose that $N = 8$ and the combination whose delay value is closest to 1 ns is $\{s_0, s_1, s_2, s_8\}$. To generate a 1 ns delay, the stored delay line control signal will be 11000001.

D. SSDL Design Guide

From Section II-B, we know that delay variations play a very important role in SSDL. Besides the inherent logic element and interconnect delay variations, irregular routing and placement also introduce delay variations; recall that these can be viewed as buffer delay variations.

The method we adopt to introduce irregular routing and placement is to use only high-level constructs, and let the design tools perform logic synthesis, placement and routing. Taking the Altera Cyclone II as an example, we just need to keep the buffers intact by using the LCELL primitive, and leave all the rest to the design tools. Fig. 1d shows the $D_n$ code. This approach also enhances the SSDL portability across different FPGA families.

Figure 3. An SSDL delay value histogram example.

An SSDL can achieve a resolution of $\gamma$ with a maximum INL of 0.5 LSB if the histogram (with bin width $\gamma$) of all the delay values it generates has no empty bin. For this to be true, one safe approach is as follows.
• Use a sufficiently large number of $D_1$'s. This provides denser coverage.
• Use larger $D_n$'s to extend the dynamic range.
An example histogram is shown in
Fig. 3. Note that the peaks in the histogram correspond to the delay values generated by the larger $D_n$'s. The skirt around each peak is generated by the $D_1$'s.

III. EXPERIMENTAL RESULTS

In this section, we show the steps and results of implementing an SSDL delay line.

First, we characterize the $D_1$ stage delay variation. To do so, forty $D_1$ stages are implemented on the Altera Cyclone II FPGA, and their delay values are measured. The standard deviation is computed to be 81 ps. It is derived that 12 $D_1$ stages are needed to fill all 10 ps wide bins with a 99% probability.

Then, larger $D_n$'s are added to cover a dynamic range of 40 ns. This is done by adding one $D_n$ at a time while ensuring that the resulting histogram contains no empty bin. At the end, the SSDL configuration is as follows.

$$\{D_{90}, D_{48}, D_{26}, D_{15}, D_9, D_6, D_4, D_3, D_2, D_1 \times 13\}$$

Note that the configuration resembles a sub-radix-2 sequence, which ensures that no histogram bin is empty. The final SSDL consists of 22 $D_n$'s; this corresponds to a maximum of about four million delay values, out of which we try to find the set of values that corresponds to a 20-ps resolution delay line.

One problem we encounter is the deviation between estimated and measured delay values. Recall that only 22 delay values (of the 22 $D_n$ elements) are measured, from which the four million delay values are to be derived.
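Deriving combination delays from the per-stage measurements, together with the expansion-and-trimming loop of Alg. 1 and the lookup step, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the delay of a combination is estimated additively as $T_0$ plus the selected $\Delta T_i$ values, all $\Delta T_i$ are assumed non-negative, the smallest combination is always kept during trimming (a deviation from the listing, which would otherwise drop the baseline), and the stage delays in the example are made-up numbers.

```python
from typing import List, Sequence, Tuple

# (estimated delay, indices of the stages whose sel signal is 1)
Combo = Tuple[float, Tuple[int, ...]]


def expand_and_trim(t0: float, deltas: Sequence[float],
                    threshold: float) -> List[Combo]:
    """Expansion and trimming in the spirit of Alg. 1, sorted by delay."""
    combos: List[Combo] = [(t0, ())]
    for i, delta in enumerate(deltas, start=1):
        # Expansion: every current combination, with and without stage i.
        expanded = combos + [(delay + delta, sel + (i,)) for delay, sel in combos]
        expanded.sort(key=lambda combo: combo[0])
        # Trimming: keep a combination only if it lies more than `threshold`
        # above the last kept one. The minimum is always kept (assumption).
        kept = [expanded[0]]
        for delay, sel in expanded[1:]:
            if delay > kept[-1][0] + threshold:
                kept.append((delay, sel))
        combos = kept
    return combos


def closest(combos: Sequence[Combo], target: float) -> Combo:
    """Lookup step: the combination whose estimated delay is nearest the target."""
    return min(combos, key=lambda combo: abs(combo[0] - target))


def control_word(sel: Sequence[int], n_stages: int) -> str:
    """Control bits with stage 1 leftmost, e.g. stages 1, 2, 8 of 8 -> 11000001."""
    return "".join("1" if i in sel else "0" for i in range(1, n_stages + 1))


# MADE-UP stage delays (arbitrary units) for a 4-stage line.
deltas = [1.0, 1.1, 2.05, 4.2]
combos = expand_and_trim(t0=10.0, deltas=deltas, threshold=0.05)
delay, sel = closest(combos, target=13.0)
print(len(combos), delay, control_word(sel, len(deltas)))
```

The additive estimate is what keeps the search tractable, since only one measurement per stage is needed.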
However, it is observed that the estimation error could be large at times. Assuming that the delay values can be derived from the individual D_n delay values, we choose 5,096 delay values. Then, from the 5,096 values, we apply the histogram approach in Section II-B. The upper plot of Fig. 4 shows the selected delay values, a total of 623; the lower plot of Fig. 4 shows the INL values. The final delay line specification is summarized in Table I. Currently, the resolution is not as good as expected; the main cause is the imperfect delay estimation process.

Figure 4. Final results. Upper panel: selected delay values (delay versus delay index). Lower panel: INL in LSB versus delay index.

Table I. The final delay line specifications
LSB          76 ps
Tap count    623
Max INL      -0.41 LSB

IV. CONCLUSION

In this paper, we presented the subset sum delay line (SSDL), a new FPGA-based programmable delay line architecture. SSDL takes advantage of the inevitable delay variations and has the potential of achieving very high resolution; furthermore, SSDL only uses high-level FPGA constructs and thus possesses high portability. The prototype SSDL result is shown. Due to the delay value measurement limitation, the final resolution is 76 ps with a maximum INL magnitude of 0.41 LSB. We are investigating more sophisticated algorithms to construct a high-resolution delay line out of the delay values that the SSDL can provide.
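The expand-and-trim procedure and the lookup-table step described in the SSDL paper can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes that the delay of a combination is the sum of its stage delays (the paper itself notes that this estimate can be inaccurate), that s0 is always in the signal path, and that the baseline starts undefined so the smallest delay is always kept. The function names and the choice of integer picoseconds are hypothetical.

```python
from bisect import bisect_left


def trimmed_subset_sums(stage_delays_ps, base_delay_ps, t_ps):
    """Expand one stage at a time and trim after each expansion.

    Returns a list of (delay_ps, mask) sorted by delay. Bit i of mask is
    set when stage s(i+1) is enabled; s0 is assumed to be always present.
    """
    combos = [(base_delay_ps, 0)]  # start from the combination {s0}
    for i, d in enumerate(stage_delays_ps):
        # Expansion: each existing combination with the new stage added.
        with_stage = [(delay + d, mask | (1 << i)) for delay, mask in combos]
        candidates = sorted(combos + with_stage)
        # Trimming: visit in ascending delay and drop any combination
        # whose delay is within t_ps of the last kept delay (the baseline).
        kept, baseline = [], None
        for delay, mask in candidates:
            if baseline is None or delay - baseline > t_ps:
                kept.append((delay, mask))
                baseline = delay
        combos = kept
    return combos


def build_lookup(combos, targets_ps):
    """Map each desired delay to the closest available (delay, mask)."""
    delays = [delay for delay, _ in combos]
    table = {}
    for target in targets_ps:
        j = bisect_left(delays, target)
        neighbours = combos[max(j - 1, 0): j + 1]
        table[target] = min(neighbours, key=lambda c: abs(c[0] - target))
    return table


def control_word(mask, n_stages):
    """Render a mask as a bit string with s1 leftmost and sN rightmost."""
    return "".join("1" if (mask >> i) & 1 else "0" for i in range(n_stages))
```

With this bit ordering, the combination {s0, s1, s2, s8} for N = 8 maps to the control word 11000001, which agrees with the example in the text. A larger trimming threshold keeps fewer combinations per stage and therefore bounds the growth of the set, at the cost of coarser coverage.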

Introduction to FPGAs: foreign-language literature with Chinese-English translation


Introduction to FPGAs

A programmable logic device is a general-purpose logic chip that can be configured for various purposes. It is a semi-custom device for implementing ASICs (application-specific integrated circuits), and its emergence and development allow electronic system designers to use CAD tools to design their own ASIC devices in the laboratory. In particular, with the emergence and development of the FPGA (field-programmable gate array), it has joined the microprocessor and memory as a new industry-standard product for digital system design (that is, a standard product that can be bought from a catalog on the open market). Digital systems are moving toward being built from these three standard building blocks, microprocessor, memory, and FPGA, or from their integration.

Designing digital circuits with FPGA devices not only simplifies the design process but also reduces the size and cost of the entire system and increases its reliability. FPGAs do not require the large amount of time and effort traditionally needed to create an integrated circuit, which avoids investment risk, and they have become the fastest-growing category of electronic devices. The main advantages of designing digital systems with FPGA devices are as follows:

(1) Flexible design
With FPGA devices, logic functions are not limited to those of standard series devices. The logic can be modified at any stage of system design and use simply by reprogramming the FPGA device, which gives system design great flexibility.

(2) Increased functional density
Functional density is the number of logic functions that can be integrated in a given space. A programmable logic chip contains a large number of gates, so one FPGA can replace several, dozens, or even hundreds of small-scale digital integrated circuit chips. A digital system implemented with FPGA devices uses fewer chips, which reduces printed circuit board area and the number of printed circuit boards, and ultimately reduces overall system size.

(3) Improved reliability
Reducing the number of chips and printed circuit boards not only reduces system size but also greatly improves system reliability. A highly integrated system is much more reliable than the same system built with a low degree of integration from many standard components. Using FPGA devices reduces the number of chips needed to implement the system, and the number of leads and pads on the printed circuit board is reduced as well, so system reliability improves.

(4) Shorter design cycle
Because FPGA devices are programmable and flexible, designing a system with them takes much less time than traditional methods. FPGA devices are highly integrated, so printed circuit board layout is simple. Moreover, once the prototype design succeeds, the logic is simple and quick to modify thanks to advanced, highly automated development tools. Using FPGA devices can therefore greatly shorten the design cycle, speed up time to market, and improve product competitiveness.

(5) High operating speed
FPGA/CPLD devices operate at high speed, generally reaching several hundred megahertz, far faster than DSP devices. In addition, fewer circuit stages are needed to implement the system with FPGA devices, so the operating speed of the entire system improves.

(6) Improved system security
Many FPGA devices have encryption capabilities; using them widely in a system effectively prevents the product from being illegally copied by others.

(7) Reduced cost
If only the price of the device itself is considered, the advantage of implementing a digital system with FPGA devices is sometimes not apparent, but many factors affect system cost. Taken together, the cost advantage of FPGAs is clear. First, FPGA designs are easy to modify and the design cycle is shorter, which lowers research and development cost. Second, FPGA devices reduce the printed circuit board area and the number of plug-in components required, which lowers manufacturing cost. Third, FPGA devices improve system reliability and reduce the maintenance workload, which lowers servicing cost. In short, designing a system with FPGA devices saves cost.

FPGA design principles

An important guiding principle of FPGA design is the balance and interchange of area and speed; this principle is borne out extensively in the filter design discussed later.

Here, "area" means the amount of FPGA/CPLD logic resources that a design consumes. For an FPGA it can be measured by the number of flip-flops (FFs) and lookup tables (LUTs) consumed; a more general measure is the number of equivalent logic gates the design occupies. "Speed" refers to the highest frequency at which the design runs stably on the chip. This frequency is determined by the timing of the design and is closely related to the clock period, pad-to-pad time, clock setup time, clock hold time, clock-to-output delay, and many other timing characteristics. Area and speed run through all of FPGA design and are the ultimate criteria for evaluating design quality. There are two basic ideas concerning them: the balance of area and speed, and the exchange of area and speed.

Area and speed are a pair of opposing goals. Demanding both the smallest area and the highest operating frequency from a design is unrealistic. A more sensible design goal is to occupy the smallest chip area while meeting the timing requirements of the design (including the required frequency), or, within a specified area, to achieve a larger timing margin and a higher operating frequency. Both goals reflect the idea of balancing area and speed. Area and speed requirements should not be read simply as a way to raise the engineer's skill or to pursue a perfect design; they are directly related to product quality and cost. If a design has a relatively large timing margin and runs at a relatively high frequency, the design is more robust and the quality of the whole system is better assured. On the other hand, if a design consumes less area, a single chip can hold more functional modules, fewer chips are needed, and the cost of the entire system drops sharply. Although they are two sides of one trade-off, area and speed do not carry equal weight. Meeting the timing and operating frequency requirements is more important, so when the two conflict, speed takes priority.

Exchanging area and speed is an important idea in FPGA design. In theory, if a design has a large timing margin and can run at a much higher frequency than required, functional modules can be reused (time-shared) to reduce the chip area consumed by the whole design; this trades the speed advantage for area savings. Conversely, if the timing requirements are so high that conventional methods cannot reach the design frequency, the data stream is generally converted from serial to parallel, several copies of the processing module operate in parallel, and the results are converted back from parallel to serial at the output of the module. Seen from outside, the chip as a whole then meets the required processing speed. This trades replicated area for higher speed.

Consider an example. Assume that the input data stream of a digital signal processing system is 350 Mb/s, while the data processing module in the FPGA design can handle at most 150 Mb/s. Because the throughput of one processing module cannot meet the requirement, a direct implementation on the FPGA is impossible. In this case we apply the "area for speed" idea: replicate the processing module at least three times, convert the input data from serial to parallel, process it in parallel with the three modules, and then convert the results from parallel back to serial to meet the data rate requirement. Viewed at the two ends of the processing block, the data rate is 350 Mb/s, while inside the FPGA each sub-module processes data at no more than 150 Mb/s. The overall throughput is guaranteed by the three sub-modules working in parallel; that is, more chip area is used to achieve high-speed processing, and the design is realized by "replicating area in exchange for processing speed."

FPGA is the abbreviation of field-programmable gate array. It is the result of further development of programmable devices such as the PAL, GAL, and EPLD. It emerged as a semi-custom circuit in the ASIC field; it both remedies the shortcomings of custom circuits and overcomes the limited gate count of earlier programmable devices.

The FPGA adopts a new concept, the logic cell array (LCA), which comprises three parts: configurable logic blocks (CLBs), input/output blocks (IOBs), and the internal interconnect.
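The arithmetic behind the area-for-speed example above (a 350 Mb/s input stream and a 150 Mb/s processing module) can be checked with a short Python sketch. It is only a software model of the idea under stated assumptions: the function names are hypothetical, samples are dealt to the lanes round-robin, and each lane's processing is modeled as a plain per-sample function rather than hardware.

```python
def replicas_needed(input_rate_mbps, module_rate_mbps):
    """Smallest number of parallel modules whose combined rate covers the input."""
    return (input_rate_mbps + module_rate_mbps - 1) // module_rate_mbps


def serial_to_parallel(samples, lanes):
    """Deal samples round-robin: sample i goes to lane i % lanes."""
    return [samples[i::lanes] for i in range(lanes)]


def parallel_to_serial(lane_outputs):
    """Interleave lane outputs back into the original sample order."""
    merged = []
    for i in range(max(len(lane) for lane in lane_outputs)):
        for lane in lane_outputs:
            if i < len(lane):
                merged.append(lane[i])
    return merged


def process_in_parallel(samples, lanes, process):
    """Split, apply the same per-sample processing in every lane, and merge."""
    split = serial_to_parallel(samples, lanes)
    return parallel_to_serial([[process(x) for x in lane] for lane in split])
```

For the figures in the text, `replicas_needed(350, 150)` gives 3, and each of the three lanes then carries about 117 Mb/s, which is within the 150 Mb/s limit of a single module. The round-robin split is one simple choice; a real design would pick the split to match the data framing.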
The basic characteristics of FPGAs are:
(1) When an FPGA is used to design an ASIC circuit, the user obtains a usable chip without tape-out and fabrication.
(2) An FPGA can serve as a prototype for other full-custom or semi-custom ASIC circuits.
(3) An FPGA contains abundant flip-flops and I/O pins.
(4) Among ASIC circuits, the FPGA is one of the devices with the shortest design cycle, the lowest development cost, and the least risk.
(5) FPGAs use a high-speed, low-power CHMOS process and are compatible with CMOS and TTL levels.

It can be said that the FPGA is one of the best choices for small-volume systems that need higher integration and reliability.

Currently there are many varieties of FPGA, including Xilinx's XC series, TI's TPC series, and Altera's devices.

An FPGA sets its operating state from a program stored in on-chip RAM, so the on-chip RAM must be programmed before the device operates. The user can choose among different programming methods depending on the configuration mode.

At power-up, the FPGA reads the data in an EPROM into the on-chip programming RAM. When configuration is complete, the FPGA enters its working state. After power is removed, the FPGA returns to a blank chip and its internal logic disappears, so the FPGA can be used repeatedly. Programming an FPGA does not require a dedicated FPGA programmer; a common EPROM or PROM programmer is enough. When the function of the FPGA needs to change, only the EPROM has to be replaced. Thus the same FPGA can implement different circuit functions with different programming data, which makes the FPGA very flexible.

There are several FPGA configuration modes: parallel master mode uses one FPGA plus one EPROM; master-slave mode allows one PROM to program multiple FPGAs; serial mode programs the FPGA from a serial PROM; and peripheral mode lets the FPGA act as a peripheral of a microprocessor, which programs it.

Verilog HDL is a hardware description language used to model digital systems at multiple levels of abstraction, from the algorithm level through the gate level to the switch level. Digital systems can be described hierarchically, and timing can be modeled explicitly within the same description.

Verilog HDL can describe the behavioral characteristics of a design, its data flow characteristics, its structural composition, and the delay and waveform generation mechanisms used for design verification, including response monitoring. All of these use the same modeling language. In addition, Verilog HDL provides a programming language interface through which the design can be accessed from outside during simulation and verification, including specific control and execution of the simulation.

Verilog HDL defines not only the syntax but also clear simulation semantics for each syntactic construct. Models written in the language can therefore be verified with a Verilog simulator. The language inherits many operators and constructs from the C programming language. Verilog HDL provides extended modeling capabilities, many of which are difficult to understand at first.

Introduction to FPGAs: A programmable logic device is a general-purpose chip that can be configured into logic for various purposes. It is a semi-custom device for implementing application-specific integrated circuits (ASICs), and its emergence and development allow electronic system designers to design their own ASIC devices in the laboratory with the help of CAD tools.

FPGA foreign-language material 16


Applied Soft Computing 21 (2014) 533–541

FPGA multi-filter system for speech enhancement via multi-criteria optimization

Ka Fai Cedric Yiu a,*, Zhibao Li a, Siow Yong Low b, Sven Nordholm c

a Department of Applied Mathematics, The Hong Kong Polytechnic University, Hunghom, Kowloon, Hong Kong, China
b School of Electronics and Computer Science, University of Southampton Malaysia Campus, Nusajaya, Johor, Malaysia
c Department of Electrical and Computer Engineering, Curtin University, WA, Australia
* Corresponding author. Tel.: +852 ********.

Article history: Received 27 February 2012; received in revised form 7 March 2014; accepted 11 March 2014; available online 13 April 2014.

Keywords: Signal enhancement; Multi-criteria; FPGA

DOI: 10.1016/j.asoc.2014.03.016. © 2014 Elsevier B.V. All rights reserved.

Abstract

Speech is the main medium for human communication and interaction. Apart from traditional telephones, more and more applications come with speech interfaces, which use the speech signal as an input for various purposes. However, many of these applications might fail to perform in noisy environments as the signal-to-noise ratio (SNR) degrades. Two important measures for any speech enhancement algorithm are noise suppression and speech distortion. Naturally, different speech enhancement algorithms will have different trade-offs. Moreover, depending on the environment, it is possible that one algorithm will outperform the others in some respects. This paper proposes a multi-filter system, which has the capability of continually adjusting the noise suppression level and the speech distortion level in a Pareto fashion. Moreover, we show that the system works under a variety of noisy environments and we obtain the efficient frontier of the combined filters for each background noise. Because the multi-filters are adapting in parallel, the final system can be implemented on FPGA efficiently.

1. Introduction

Being a natural interface between human and machine, the speech input engine has already been integrated into many telecommunication systems such as call centres and mobile phones, including the Philips noise void mobile phone and, more recently, the iPhone Siri system. Voice control is an attractive functionality that can be embedded in many products such as electrical appliances. It provides a natural means of remote control, making the products more user-friendly. However, in real environments such as living rooms or factories, the speech signal picked up by a microphone can be very difficult to analyze because of interference. The performance of the speech input engine diminishes rapidly when there is a mismatch between the training and testing conditions [1,2]. Since noise is additive in nature, one way to improve the signal-to-noise ratio (SNR) is to estimate the noise statistics during non-speech periods. With the noise characteristics available, noise reduction algorithms can then be applied.

One of the most popular single channel speech enhancement techniques is spectral subtraction. It was originally suggested by Boll [3] and has since gained popularity due to its simplicity. Spectral subtraction relies on the assumption that the target signal and noise signal are uncorrelated [3,4]. Therefore, if the noise spectral components are estimated correctly, the target signal can be recovered by just subtracting the estimated spectral noise from the noisy spectral observation. Spectral subtractive-type algorithms can be viewed as filtering of the noisy observation with a time-varying linear filter that depends on the characteristics of the noisy observation and on the estimated noise spectral components.

Whilst Boll's original spectral subtraction is simple and efficient, the musical artifacts from the residual noise can be very annoying. As reported in [6,7], a critical parameter for reducing the musical noise is the a priori SNR. Ephraim and Malah [8] proposed a computationally efficient approach to determine the a priori SNR by using the decision-directed estimator. Nevertheless, the a priori SNR estimator has a trade-off between the response time and the musical residual noise [5]. Generally speaking, each speech enhancement algorithm has its own trade-off in terms of noise suppression capability and speech distortion level. Depending on the observed noise signal, one algorithm might outperform another in certain aspects. This is because different additive noise may have different sensitivity towards the spectral distortion of the original signal. As a result, it would be difficult to find a single type of speech enhancement algorithm that works for all types of noise.

In an effort to generalize the solution for a wide range of noise, this paper proposes to optimally combine several gain functions for the best trade-off between noise suppression and speech distortion. In this paper, we propose to formulate the optimal filter design as a multi-criteria decision problem with the use of a system of parallel filters to trade off between suppression and distortion.
It was previously reported in [9–11] that the trade-offs played a crucial role in enhancing speech recognition accuracy for voice control devices. However, those works focused on the optimization of speech recognition accuracy, and no work has been conducted to combine the gain functions to establish a Pareto optimal solution in the form of an efficient frontier [12]. Therefore, we study the use of the Non-dominated Sorting Genetic Algorithm-II (NSGA-II) [13] here for solving the formulated multi-criteria decision problem and demonstrate how to construct efficient frontiers for a variety of noisy environments.

This paper aims to combine different filter characteristics as a means to cater for a wide range of noise profiles. As mentioned, the motivation stems from the fact that different gain functions may have different performance limitations on different types of noise. Thus, by optimizing the gain functions in a Pareto sense, the best compromise can be obtained, in the form of an efficient frontier.
In terms of computational requirement, it is possible to implement the filters in parallel, which will yield a lower overall delay and improved computational efficiency. The contribution of the paper is therefore twofold. The first is to formulate the multi-criteria optimization problem for a system of parallel filters and to study the Pareto optimum solutions under a variety of background noisy environments. We show that the proposed parallel filter system can continually adjust the noise suppression level and the speech distortion level in a Pareto fashion. The second contribution is associated with the proposed real-time implementation scheme. As the proposed filters are parallel in nature, we adopt the field programmable gate array (FPGA), which is usually used to integrate large amounts of logic in a single integrated circuit [21–24]. In this case, we analyze the common computationally intensive operations for all filters and design dedicated processing units. The proposed design is implemented in fixed point arithmetic with a suitable bitwidth, which can reduce the overall circuit size significantly when compared with a direct realization of the software onto an FPGA platform. The acceleration is evaluated on a Virtex-4 platform, showing that the FPGA-based implementation at 184 MHz can achieve real-time performance by processing a maximum of 21,375 samples per second. As a result, real-time adaptation to changing noise is feasible.

2. Single channel noise reduction

Single channel speech enhancement relies solely on the temporal and spectral information of the observations to remove the background noise. For simplicity, assume that the received signal $x(t)$ consists only of the observed source speech signal $s(t)$ and the background noise $n(t)$, given as

$$x(t) = s(t) + n(t). \tag{1}$$

The $M$-point short-time Fourier representation of (1) can be written as

$$x(\omega,k) = s(\omega,k) + n(\omega,k), \tag{2}$$

where $s(\omega,k)$ and $n(\omega,k)$ are the transformed representations of $s(t)$ and $n(t)$, respectively. The variable $\omega$ denotes one of the real angular centre frequencies, $\omega \in \{\omega_0, \ldots, \omega_{M-1}\}$, and $k$ is the associated time index. Assume that the active period of the speech signal is identifiable (typically identified through a voice activity detector (VAD)). Then one straightforward way to suppress $n(\omega,k)$ is to estimate its statistics during periods of non-speech. There have been numerous variations of Boll's spectral subtraction over the years. In this paper, we concentrate on four types of subtractive algorithm, namely the conventional Boll's spectral subtraction [3], the Wiener filter [14], Ephraim-Malah's minimum mean-square error (MMSE) log-spectral amplitude estimator [8], and the recently developed Cohen's noncausal a priori SNR estimator [6]. The following sections briefly detail each of the spectral subtractive algorithmic variations.

2.1. Boll's spectral subtraction

The gain function for Boll's spectral subtraction is described by

$$G_{SS}(\omega,k) = \max\left(1 - \frac{P_n(\omega,k)}{P_x(\omega,k)},\ \delta_{\mathrm{floor}}\right), \tag{3}$$

where $P_n(\omega,k)$ and $P_x(\omega,k)$ are the noise and signal averages. The constant $\delta_{\mathrm{floor}}$ is the noise floor introduced to avoid a vanishing gain function. In this case, $P_x(\omega,k)$ is estimated as

$$P_x(\omega,k) = \alpha P_x(\omega,k-1) + (1-\alpha)\,|x(\omega,k)|^p, \tag{4}$$

where $\alpha$ is a smoothing constant and $|\cdot|$ is the absolute value operator. The factor $p$ determines whether it is a magnitude subtraction ($p = 1$) or a power subtraction ($p = 2$). From (3), it can be seen that spectral subtraction works on the principle of subtracting the estimate of the noise spectrum from the noisy signals.

In this paper, the noise spectrum $P_n(\omega,k)$ is estimated by using the minimum statistics approach [15]. Briefly, minimum statistics is based on the observation that the short-time subband power of a noisy speech signal exhibits distinct peaks and valleys [15]. The peaks refer to speech active periods and the valleys are representative values of the noise power level. Therefore, by tracking the minimum power (hence the name minimum statistics) within a finite window large enough to bridge high power speech periods, the noise statistics can be efficiently estimated.

2.2. Wiener filter

Unlike the conventional spectral subtraction, which is primarily based on a heuristic formulation, the Wiener filter formulation is optimally derived in the mean square sense [1]. The gain function can be written as

$$G_{WF}(\omega,k) = \frac{\mathrm{SNR}(\omega,k)}{\mathrm{SNR}(\omega,k) + 1}, \tag{5}$$

where

$$\mathrm{SNR}(\omega,k) = \frac{P_x(\omega,k)}{P_n(\omega,k)}. \tag{6}$$

Eq. (5) reveals that when $\mathrm{SNR}(\omega,k) \approx 0$, the Wiener gain function approaches zero. On the other hand, at high SNR, the Wiener gain function approaches unity. This suggests that the Wiener formulation suppresses the signal when the SNR is low and passes the signal when the SNR is good.

2.3. MMSE log spectral amplitude estimator

Whilst the Wiener filter is an optimal complex spectral estimator in the mean square sense, it is not an optimal spectral magnitude estimator [1]. Since it is well known that the short-time spectral amplitude plays an important role in determining the overall speech quality, a spectral magnitude estimator is proposed in [16]. It has been reported in [17] that the squared error of the log magnitude spectra yields a better performance compared to its predecessor, the spectral magnitude estimator. The gain function for the log spectral amplitude estimator is given as

$$G_{LOG}(\omega,k) = \frac{\mathrm{SNR}_{priori}(\omega,k)}{\mathrm{SNR}_{priori}(\omega,k) + 1}\exp\left(\frac{1}{2}\int_{\nu(\omega,k)}^{\infty}\frac{e^{-t}}{t}\,dt\right), \tag{7}$$

where

$$\mathrm{SNR}_{priori}(\omega,k) = \alpha\, G_{LOG}(\omega,k-1)\,\mathrm{SNR}(\omega,k-1) + (1-\alpha)\max\left(\mathrm{SNR}(\omega,k) - 1,\ 0\right). \tag{8}$$

Here, $\alpha$ is the smoothing constant and $\mathrm{SNR}(\omega,k)$ is defined in (6). The integral parameter $\nu(\omega,k)$ is defined as

$$\nu(\omega,k) = \frac{\mathrm{SNR}_{priori}(\omega,k)}{\mathrm{SNR}_{priori}(\omega,k) + 1}\,\mathrm{SNR}(\omega,k). \tag{9}$$

From (8), it can be seen that if $\mathrm{SNR}(\omega,k)$ is much greater than 0 dB, $\mathrm{SNR}_{priori}(\omega,k)$ corresponds to a frame-delayed version of $\mathrm{SNR}(\omega,k)$. Moreover, if it is lower than 0 dB, $\mathrm{SNR}_{priori}(\omega,k)$ is a smoothed, delayed version of $\mathrm{SNR}(\omega,k)$. It is precisely this smoothing effect that reduces the effects of musical tones with the log spectral amplitude estimator gain function. However, the delay introduced in the estimate may introduce reverberation effects, especially during speech onset and offset periods.

2.4. Cohen's noncausal a priori SNR estimator

As described earlier, the estimation of $\mathrm{SNR}_{priori}(\omega,k)$ in (8) assumes a speech presence probability of unity. This means that the information estimated in $\mathrm{SNR}_{priori}(\omega,k)$ cannot discriminate between speech onsets and noise irregularities under speech presence uncertainty. The work by Cohen in [6] proposed a noncausal $\mathrm{SNR}_{priori}(\omega,k)$ estimator to overcome this deficiency. The proposed method aims to better estimate the spectral variance of the clean speech. It was shown that the noncausal approach can distinguish speech onsets from noise irregularities in the estimation process by using just a few subsequent spectral measurements. The basic idea is to employ the "propagation and update" technique to estimate the noncausal $\mathrm{SNR}_{priori}(\omega,k)$. For brevity, the reader may refer to [6] for a detailed explanation of the technique.

3. A parallel multi-filter system

Since different additive noise has a different sensitivity toward spectral distortion of the original signal, it would be difficult to find a single type of filter that works for all types of noise. In other words, each gain function has its own unique property in terms of noise reduction and signal distortion. For instance, the Boll and Wiener formulations may yield the best suppression, but will also generate a fair amount of musical noise. Whilst the MMSE log spectral estimator is good, Cohen's approach may improve its performance in terms of musical artifacts. Nevertheless, Cohen's solution may not be as robust as the Boll, Wiener, and MMSE log spectral estimators, as it requires a noncausal estimation. Quite plainly, each of the gain functions has its own ideal operating conditions, depending on the noisy observations at hand. In order to improve the performance of the gain functions collectively, we attempt to form a linear combination of different filters in a Pareto optimal sense, namely a four-filter system. In this case, the combined gains will have the degrees of freedom to cater for a wide range of background noise. We propose to let the gain functions be combined as

$$G(\mathbf{a},\omega,k) = a_1 G_{SS}(\omega,k) + a_2 G_{WF}(\omega,k) + a_3 G_{LOG}(\omega,k) + a_4 G_{CO}(\omega,k), \tag{10}$$

where $G_{SS}$, $G_{WF}$, $G_{LOG}$, and $G_{CO}$ are the gain functions for Boll's spectral subtraction, the Wiener formulation, the Ephraim-Malah log spectral estimator, and Cohen's approach, respectively. The parameters $a_1$, $a_2$, $a_3$, and $a_4$ are the weights of the individual gain functions. Depending on the type of noise, the gain functions should be scaled optimally among them. Applying this gain function as the noise filter and synthesizing all the subband signals, we obtain the filtered signal $y(t)$. Fig. 1 shows the parallel nature of these four filters. The optimization of the proposed combination is detailed in the following section.

Fig. 1. A four filter system.

3.1. Optimization strategy

The objective specified in (10) is in fact assessed by two separate measures: the distortion caused by the filters, measured by the deviation between the filter output and the source signal, and the noise suppression. There are different ways of measuring distortion and suppression [1]. Following [18], some intuitive measures are employed here. The normalized distortion quantity, $D$, is

$$D = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left|C_d\,\hat{P}_{y_s}(w) - \hat{P}_{x_s}(w)\right|dw, \tag{11}$$

where $w = 2\pi f$, and $f$ is the normalized frequency. The constant $C_d$ is for normalization and can be defined as

$$C_d = \frac{\int_{-\pi}^{\pi}\hat{P}_{x_s}(w)\,dw}{\int_{-\pi}^{\pi}\hat{P}_{y_s}(w)\,dw}, \tag{12}$$

where $\hat{P}_{x_s}(w)$ is a spectral power estimate of the observed signal and $\hat{P}_{y_s}(w)$ is the spectral power estimate of the filter output when the source signal is active alone. The constant $C_d$ normalizes the mean output spectral power to that of the single sensor spectral power. The measure of distortion in Eq. (11) is the mean output spectral power deviation from the observed signal spectral power. Ideally, the distortion is zero.

The normalized noise suppression quantity, $S_n$, is

$$S_n = C_s\,\frac{\int_{-\pi}^{\pi}\hat{P}_{y_n}(w)\,dw}{\int_{-\pi}^{\pi}\hat{P}_{x_n}(w)\,dw}, \tag{13}$$

where

$$C_s = \frac{1}{C_d}, \tag{14}$$

and where $\hat{P}_{y_n}(w)$ and $\hat{P}_{x_n}(w)$ are spectral power estimates of the filter output and the observation, respectively, when the surrounding noise is active alone. The noise suppression measure is normalized to the amplification or attenuation that the filter applies to the observation when the source signal is active alone; that is, if the filter attenuates the source signal by a specific amount, the noise suppression quantity is reduced by the same amount.

Because there is more than one objective in the design of the filters, it is basically a multi-criteria design problem. When different scaling factors are applied to the criteria in the design process, a solution set can be derived in which all solutions are efficient, or Pareto optimum. In the present context, the set of weights $\mathbf{a}^*$ is Pareto optimum if and only if there does not exist a set of weights $\mathbf{a}$ such that

$$S_n(\mathbf{a}) \ge S_n(\mathbf{a}^*), \tag{15}$$

$$D(\mathbf{a}) \le D(\mathbf{a}^*), \tag{16}$$

with strict inequality in at least one of the criteria. In order to solve for the Pareto optimum, we employ the well-known Non-dominated Sorting Genetic Algorithm-II (NSGA-II) with these two criteria and construct the set of Pareto optima.

4. FPGA system for parallel filters

The four filter algorithms described in the previous sections are implementable under ergodic signal property assumptions, where the expectation operator may be interchanged with time averaging estimations. Further, the source signal and the noise signal components have to be accessible separately so that we may estimate the source and the noise correlation estimates individually. In the following it is assumed that these constraints are fulfilled.

4.1. Fixed point arithmetic

In contrast to traditional software development, designing a system on an FPGA platform always involves an estimation of the bitwidth, which affects the circuit size, the system performance, and the quality of the calculation. As the design employs a fixed point representation and saturation arithmetic [19] to avoid the overflow
case, a fixed point library was developed which allows exploration of how the bitwidth affects the quality of the signal during signal processing. Bitwidth analysis can identify a near-optimal bitwidth for the hardware, one which preserves the quality of the signal with less area consumption. Many DSP algorithms rely upon fixed point arithmetic and its inherent speed advantage over floating point. Often, a fixed point algorithm requires the evaluation of FIR filters, FFTs, etc. We ran an experiment using a 32-bit fixed point representation, varying the integer size with saturation arithmetic disabled, in order to determine a suitable integer size for this system. In case of any overflow, the coefficients change dramatically and the results are invalid. The integer size was finally set to 12 bits. In practice an occasional overflow may still occur even with a 12-bit integer size, so saturation arithmetic and scaling have been employed in the hardware implementation to minimize the impact of overflow. During the calibration phase of the Cohen and Log filters, the exponential and exponential integral functions must be computed for the log spectral gain. Due to its computational complexity, the exponential term is hard to compute in such 32-bit fixed point arithmetic and to implement in digital hardware. However, as Fig. 2 depicts, the result approaches one when calculating exp(0.5*expint(x)). We therefore use the lookup table method and employ a new fixed-floating point data representation to compute the exponential term more accurately. 
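The fixed point scheme described above (a 32-bit word, a 12-bit integer part, and saturation on overflow) can be illustrated with a minimal sketch. This is not the authors' library: the names `to_fixed`, `sat_add` and `sat_mul` are hypothetical, and the sketch assumes the 12 integer bits include the sign bit, which the text does not state.

```python
# Minimal sketch of saturating fixed point arithmetic (assumed Q12.20 layout).
WORD_BITS = 32                      # total word length, from the text
INT_BITS = 12                       # integer size, from the text (sign bit assumed included)
FRAC_BITS = WORD_BITS - INT_BITS    # 20 fractional bits
Q_MAX = (1 << (WORD_BITS - 1)) - 1  # largest representable raw value
Q_MIN = -(1 << (WORD_BITS - 1))     # smallest representable raw value


def saturate(raw: int) -> int:
    """Clamp a raw integer to the 32-bit range instead of letting it wrap."""
    return max(Q_MIN, min(Q_MAX, raw))


def to_fixed(x: float) -> int:
    """Convert a float to the raw fixed point integer, saturating if needed."""
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(q: int) -> float:
    """Convert a raw fixed point integer back to a float."""
    return q / (1 << FRAC_BITS)


def sat_add(a: int, b: int) -> int:
    """Saturating addition of two raw fixed point values."""
    return saturate(a + b)


def sat_mul(a: int, b: int) -> int:
    """Saturating multiplication; the product is rescaled by FRAC_BITS."""
    return saturate((a * b) >> FRAC_BITS)


if __name__ == "__main__":
    print(to_float(sat_add(to_fixed(1.5), to_fixed(2.25))))
    print(to_float(sat_add(to_fixed(2000.0), to_fixed(100.0))))  # clamps near +2048
```

With saturation disabled, the second addition would wrap around to a large negative value, which is the kind of dramatic coefficient change the overflow experiment is designed to expose.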
Simulation results show that floating point computation occupies very little of the total processing time and that there is no significant degradation in performance.

Fig. 2. Exponential function.

4.2. Architecture exploration

In the time domain, the main operations of the filters are the calculations given by the equations in Section 2. These operations are greatly reduced by carrying out the actual filtering in the frequency domain and transforming the results back to the time domain, following the dataflow shown in Fig. 3, which performs the following calculations:

1. Analyze the input signals into their frequency domain representations via FFT.
2. Filter the subband signals by the subband impulse response estimates.
3. Synthesize the impulse response estimates back to the time domain via IFFT (inverse FFT).

The algorithm was carefully analyzed to determine an optimal mapping onto the available hardware. The mapping aims at computational efficiency by taking advantage of the parallelism of the algorithm in the frequency domain, which can be exploited at several levels:

• Loop level parallelism: consecutive loop iterations can be executed in parallel.
• Task level parallelism: entire procedures inside the program can be executed in parallel.
• Data parallelism.

Since the algorithm is made up of a control part and a computation part, the first stage consists of locating the computational kernels of the algorithm. Profiling identifies the most time-consuming operations, which are then implemented in hardware. The profiling results of the main operations are shown in Table 1, clearly indicating that the FFT and IFFT are the best candidates to be moved to hardware, since together they occupy over 86% of the CPU time. These kernels are mapped onto dedicated processing engines of the system, optimized to exploit the regularity of operations performed on large amounts of data, while the remaining parts of the code are implemented in software running on the PowerPC
processor. The FPGA present on the Virtex-4 is suitable for this implementation. The new Auxiliary Processor Unit (APU) controller simplifies the integration of hardware accelerators and co-processors. These hardware accelerator functions operate as extensions to the PowerPC, thereby offloading demanding computational tasks from the CPU.

Fig. 3. Dataflow of the main operations.

A block diagram of this architecture is shown in Fig. 4. As shown, the hardware accelerator is connected to the PowerPC processor using two APU channels. The first channel is used for data I/O from/to the FFT/IFFT module. The second connects to the vector-matrix multiplication module. The hardware accelerator was designed to be as reusable, flexible and customizable as possible. A set of constants is defined to specify the values of key parameters of the architecture, such as the bus width, the polarity of control signals, the functional units which should be inserted or removed, and so forth. To minimize the execution time difference between the four filters, multiple instances can be instantiated in the hardware accelerator to improve performance. Since these operations are performed in the subband/frequency domain, a high degree of parallelism can be achieved. Thus, the implementation may be tuned by optimizing area, by optimizing performance, or even by finding an optimal area/performance trade-off point.

Key features of the hardware accelerator:

• High parallelism, because functional units can operate independently from each other in the subband frequency domain. 
When different functional units commit their results simultaneously, a multi-port register file allows the concurrent write-back of those results;
• Scalability and adaptability: the functional units can be inserted into or removed from the architecture immediately, just by setting the value of dedicated VHDL generics. Many key parameters of the architecture can also be tuned to the taste of the user (width of the bus, latency of the functional units, throughput, etc.);
• Modularity of the functional units: each functional unit is dedicated to implementing one particular elementary arithmetic operation. It can be removed from the architecture and also be used as a stand-alone computational element inside other designs;
• Easy to integrate with and control from the PowerPC: the OPB register interface gives access to control and status registers, so that the operation of the hardware coprocessor can be configured by software.

Fig. 4. Block diagram of hardware/software co-design architecture for four filters.
Fig. 5. Block diagram of the hardware accelerator.
Fig. 6. Main state machine.

The details of the hardware accelerator are shown in Fig. 5. It contains two FCB channel interfacing logic modules responsible for data transfer from/to the accelerator, an FFT/IFFT module responsible for analyzing and synthesizing data, a vector-matrix multiplication module used to calculate the main loop operations, and some temporary storage modules used for constructing the data structure. When the APU passes an instruction to the hardware accelerator, the decoder logic decodes the instruction, waits for data from memory to arrive via the APU, and then executes it. After decoding the instruction, the state machines handle data transfers between the APU and the operation module, using registers to store data for the subsequent operations. Data is transferred between the APU and the interfacing logic via load and store instructions. The data is then transferred between the
interfacing logic and the operation module via a user-defined protocol. In this system, the flow of data is as follows:

1. A filter operation begins with the processor forwarding a load instruction for filter input data to the APU.
2. The APU passes the instruction to the interface logic, which decodes the instruction and waits for data from memory to arrive via the APU.
3. The interface logic sends the input data to the specified operation module.
4. When the load instructions are completed, the processor forwards a store instruction to the APU in anticipation of the filter output.
5. The interface logic decodes the store instruction and waits for data from the filter module.
6. After processing, the operation module returns results to the interface logic.
7. Finally, the interface logic returns the output data to the processor via the APU. This data is written back to memory.

Table 1. Profiling results of the main operations.
Function     % Overall time
FFT          43.3%
Main loop    13.7%
IFFT         43.0%

Fig. 7. Spectrogram of background noise.

In the FCB interface logic, a hardware state machine manages the data transfer, that is, sending data from software to hardware and receiving it back. Fig. 6 shows the main state machine that is responsible for load and store operations. This state machine communicates with the processor using the APU. The hardware state machine asserts a ready flag, which means that it is ready to accept data. The PowerPC provides the data, the address and a valid signal. The valid signal is the indication to write. As soon as the hardware receives the valid signal, it writes the data at the address provided by the PowerPC and then asserts a flag which tells the PowerPC that the data has been written. The state machine then enters the wait state to wait for the result. Once the result is ready, it asserts a result ready flag. The PowerPC can detect this completion in two ways. Firstly, the processor can continuously poll a bit in order to detect when
the calculation has completed. Secondly, an interrupt can be enabled, and on completion the interrupt output signal will be asserted. If the hardware accelerator interrupt signal is correctly connected to the processor and configured for use, an interrupt will occur to indicate completion. In this way, the processor avoids wasting valuable time in polling loops and can perform other calculations while the hardware accelerator is updating the log spectral gain of one filter. The final result from the filter can be sent back to the PowerPC to be combined with the results from the other filters.

To maximize system performance, the FFT/IFFT are implemented using the core generator provided by the vendor tools, while the vector-matrix multiplication module must be designed from scratch. Because the adaptation is a data-oriented application, it can in theory easily be implemented by a combinational circuit; however, that approach uses a large number of functional units and thus requires a significant amount of hardware resources. The dependency and movement of the data are examined, so that we can design the multiplication function in a time-multiplexed fashion,

Fig. 8. Distortion and suppression for the four different filters (panels, in dB: buccaneer, white noise, railway station, pink noise, mountain stream, high street, heavy traffic, city traffic, city park).
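The Pareto condition in (15) and (16), which underlies the distortion/suppression trade-off shown in Fig. 8, can be sketched as a brute-force dominance filter. This is not NSGA-II itself, only the non-domination test it relies on; the function names and the candidate values are hypothetical, and each candidate is assumed to be a pair (suppression, distortion) already evaluated for one weight vector.

```python
from typing import List, Tuple

Candidate = Tuple[float, float]  # (suppression S_n, distortion D) for one weight vector


def dominates(p: Candidate, q: Candidate) -> bool:
    """True if p is at least as good as q in both criteria and strictly better in one.

    Following (15)-(16): larger suppression is better, smaller distortion is better.
    """
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [q for q in candidates
            if not any(dominates(p, q) for p in candidates)]


if __name__ == "__main__":
    # Hypothetical (S_n, D) values for four weight vectors.
    points = [(10.0, 0.30), (12.0, 0.25), (8.0, 0.10), (12.0, 0.40)]
    print(pareto_front(points))
```

The quadratic scan is adequate for a handful of candidates; NSGA-II applies the same dominance relation inside a sorted, evolving population.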

FPGA Foreign-Language Material 4

A schematic diagram of an armature-controlled DC motor is given in Fig. 1. The equations of a DC motor, based on Kirchhoff's voltage law combined with Newton's moment law, can be written as [10,13]:
Figure 1 DC motor equivalence circuit.
Figure 4 New sub-block diagram of adjusting mechanism.
the designed digital system to reduce the development time and to enhance the system performance [9,12].
2. Modeling of DC motor
Figure 2 Block diagram representing the nonlinear DC motor system.
application-specific integrated-circuit hardware and general-purpose processors [4]. Embedded processors can now be developed in a simple way, allowing the user to design and mix hardware and software on a single FPGA chip. Complicated control algorithms with heavy computation can be realized in software on the FPGA. The resulting software/hardware co-design increases the programmability and flexibility of

Translation of Foreign-Language Literature on FPGAs (Chinese and English)

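Translation

VPR: A New Packing, Placement and Routing Tool for FPGA Research

Vaughn Betz and Jonathan Rose, Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada M5S 3G4, {vaughn, jayar}

Abstract

We describe the capabilities of, and the algorithms used in, a new FPGA CAD tool, Versatile Place and Route (VPR).

In terms of minimizing routing area, VPR outperforms all published FPGA place-and-route tools to which we can compare.

Although the algorithms used are based on previously known approaches, we present several enhancements that improve run time and quality.

We present placement and routing results on a new set of large circuits, to allow future benchmark comparisons of FPGA place-and-route tools on circuit sizes more typical of today's industrial designs.

VPR is capable of targeting a broad range of FPGA architectures, and the source code is publicly available.

It and the associated netlist translation/clustering tool VPACK have already been used in a number of research projects worldwide, and are useful for FPGA architecture research.

1 Introduction

In FPGA research, one must typically evaluate the utility of new architectural features experimentally.

That is, benchmark circuits are technology-mapped, placed and routed onto the FPGA architectures of interest, and measures of architecture quality, such as speed or area, can then be easily extracted.

Accordingly, there is a considerable need for flexible CAD tools that can target a wide variety of FPGA architectures efficiently, and hence allow fair comparisons of the architectures.

This paper describes the Versatile Place and Route (VPR) tool, which is designed to be flexible enough to allow the comparison of many FPGA architectures. VPR can perform placement and either global routing or combined global-detailed routing.

It is publicly available at /~jayar/software.

For FPGA architecture comparisons to be meaningful, it is crucial that the CAD tools used to map circuits onto each architecture are of high quality.

The routing phase of VPR outperforms all published FPGA routers for which standard benchmark results are available, and the combination of VPR's placer and router outperforms all published combinations of FPGA placement and routing tools.

This paper is organized as follows. In Section 2 we describe some of the features of VPR and the range of FPGA architectures with which it may be used.

In Sections 3 and 4 we describe the placement and routing algorithms.

In Section 5 we compare the number of tracks VPR requires to route circuits successfully with the number required by other published tools.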

FPGA English Literature Translation (doc material)

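Field-programmable gate array

1. History

The FPGA industry sprouted from programmable read-only memory (PROM) and programmable logic devices (PLDs).

PROMs and PLDs could both be programmed in batches in a factory or in the field (field-programmable); however, the programmable logic was hard-wired between logic gates.

In the late 1980s, the Naval Surface Warfare Center funded an experiment proposed by Steve Casselman to develop a computer that would implement 600,000 reprogrammable gates.

Casselman was successful, and a patent related to the system was issued in 1992.

Some of the industry's foundational concepts and technologies for programmable logic arrays, gates and logic blocks are founded in patents awarded to David W. Page and LuVerne R. Peterson in 1985.

In the same year, Xilinx co-founders Ross Freeman and Bernard Vonderschmitt invented the first commercially viable field-programmable gate array, the XC2064.

The XC2064 had programmable gates and programmable interconnects between gates, the beginnings of a new technology and market.

The XC2064 had 64 configurable logic blocks (CLBs), each with two three-input lookup tables (LUTs).

More than 20 years later, Ross Freeman was inducted into the National Inventors Hall of Fame for his invention.

Xilinx continued unchallenged and grew quickly from 1985 to the mid-1990s, when competitors sprouted up and eroded a significant share of its market.

By 1993, Actel was serving about 18 percent of the market.

The 1990s were an explosive period for FPGAs, both in sophistication and in production volume.

In the early 1990s, FPGAs were primarily used in telecommunications and networking.

By the end of the decade, FPGAs had found their way into consumer, automotive and industrial applications.

FPGAs got a glimpse of fame in 1997, when Adrian Thompson, a researcher working at the University of Sussex, merged genetic algorithm technology and FPGAs to create a sound recognition device.

Thompson's algorithm configured an array of 10 × 10 cells in a Xilinx FPGA chip to discriminate between two tones, utilising analogue features of the digital chip.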

FPGA Foreign-Language Material 52

Fusion Engineering and Design 89 (2014) 698–701
journal homepage: www.elsevier.com/locate/fusengdes

Design of FPGA based high-speed data acquisition and real-time data processing system on J-TEXT tokamak

W. Zheng a,b, R. Liu a,b, M. Zhang a,b,∗, G. Zhuang a,b, T. Yuan a,b
a State Key Laboratory of Advanced Electromagnetic Engineering and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
b School of Electrical and Electronic Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

Highlights
• It is a data acquisition system for the polarimeter–interferometer diagnostic on the J-TEXT tokamak, based on FPGA and PXIe devices.
• The system provides powerful data acquisition and real-time data processing performance.
• Users can implement different data processing applications on the FPGA in a short time.
• This system supports EPICS and has been integrated into the J-TEXT CODAC system.

Article info
Article history: Received 24 May 2013; Received in revised form 8 January 2014; Accepted 14 January 2014; Available online 19 February 2014
Keywords: Tokamak; Data acquisition; FPGA; Data processing; Polarimeter–interferometer diagnostic

Abstract
Tokamak experiments require high-speed data acquisition and processing systems. In a traditional data acquisition system, the sampling rate, number of channels and processing speed are limited by bus throughput and CPU speed. This paper presents a data acquisition and processing system based on an FPGA. The data can be processed in real time before it is passed to the CPU. The system can process more channels at higher sampling rates than a traditional data acquisition system while ensuring deterministic real-time performance. A working prototype has been developed for the newly built polarimeter–interferometer diagnostic system on the Joint Texas Experimental Tokamak (J-TEXT). It provides 16 channels with
120 MHz maximum sampling rate and 16-bit resolution. The onboard FPGA is able to calculate the plasma electron density and Faraday rotation angle. A RAID 5 storage device providing 700 MB/s read–write speed is adopted to buffer the data to the hard disk continuously for better performance.
© 2014 Elsevier B.V. All rights reserved.

∗ Corresponding author at: State Key Laboratory of Advanced Electromagnetic Engineering and Technology, Huazhong University of Science and Technology, Wuhan 430074, China. Tel.: +86027877930058307; fax: +8602787793060. E-mail addresses: zhangming@ , zheng.jtext@ (M. Zhang).

1. Introduction

The Joint Texas Experimental Tokamak (J-TEXT) contains a large number of diagnostics, requiring data acquisition (DAQ) systems with multiple channels and high rates [1]. Among them, the newly built far-infrared three-wave polarimeter–interferometer system poses a challenge to the DAQ system.

A plasma that deviates from its equilibrium state can develop various magneto-hydrodynamic (MHD) instabilities, an issue that is crucial for future fusion reactors. Because the plasma current density gradient is a main factor driving these instabilities, detailed knowledge of the current density profile is needed to better understand the mechanism of equilibrium and MHD stability. To acquire information on the plasma current profile dynamics, a multichannel three-wave interferometer–polarimeter system has been established on J-TEXT [2]. It can measure the plasma electron density and Faraday rotation angle simultaneously by employing three separate FIR laser beams with slightly offset frequencies. This system is planned to have 30 signal channels, which produce 1–3 MHz intermediate frequency (IF) signals. The phase shift information between the IF signals is extracted, from which the Faraday rotation angle profile and the plasma electron density profile can be calculated. Achieving high spatial and temporal resolution requires very high sampling rates. To obtain the Faraday rotation angle profile and the
plasma electron density profile, the phase shift information must be extracted in real time. This is beyond the capability of the existing DAQ system on J-TEXT, so a new DAQ platform must be developed.

To meet the above requirements, a DAQ system which can process the data in real time using an onboard Field Programmable Gate Array (FPGA) has been developed. It allows the data to be processed while it is being acquired. A digital phase comparator is implemented on the FPGA. The system is able to calculate the plasma electron density in real time and stream it to other systems, like the plasma control system, in real time. The presented system also allows users to build customized signal processing applications for various diagnostic systems.

Fig. 1. The functionalities of the interferometer–polarimeter DAQ system.

This DAQ system is implemented using National Instruments (NI) FlexRIO devices [3]. Each device has an onboard FPGA, on which the user can implement custom functions. The presented system demonstrates a way of creating, in a very short time, a DAQ system that is capable of processing data using an FPGA. It is deployed in the J-TEXT interferometer–polarimeter system. Prior work integrated FlexRIO devices into the ITER fast controller framework, focusing on providing a development workflow using the ITER Control, Data Access and Communication (CODAC) Core system [4].

0920-3796/$ – see front matter. /10.1016/j.fusengdes.2014.01.027

2. System architecture

2.1. The requirement of the interferometer–polarimeter DAQ system

The interferometer–polarimeter system is designed to obtain the plasma current density profile with 1 µs resolution. It requires the phase shift information to be extracted from the IF signals. To achieve this with precision, the 1–3 MHz IF signals must be acquired at a much higher frequency. A sampling rate of 120 MS/s was selected as a balance between performance and cost. Moreover, it is very
useful to calculate the plasma electron density from the phase shift information in real time, since it may be used in plasma control. The system also needs to be developed in a short time for the coming experiment campaign.

Traditional DAQ systems are characterized by hardware that acquires the data and passes it to a host where it is processed [5]. The channel count, sampling rate and processing rate of these systems are limited by the processing power of the CPU and the bus throughput. With multiple channels running at 120 MS/s, the amount of data acquired could exceed the practical limits of traditional DAQ systems, let alone allow processing in real time.

Fig. 1 illustrates that the main functionalities of the presented system are data acquisition, processing and storage. Furthermore, interfaces for system configuration and monitoring must be supported. The diagnostic data is acquired by the acquisition module at a 120 MS/s sampling rate and streamed to the processing module. Data is processed in real time by the processing module at 1 MHz and streamed to the storage module. The processing module extracts the 1 MS/s phase shift signals from the 120 MS/s IF signals. This leads to a significant reduction in data storage and bus throughput requirements, so that more than 30 channels for the interferometer–polarimeter can be implemented [6]. The data is then persisted on permanent storage such as a Redundant Array of Inexpensive Disks (RAID) via the data storage module. The dashed arrows are optional functions for the interferometer–polarimeter. 

Fig. 2. The hardware structure design of the interferometer–polarimeter DAQ system.
The data sharing module can be added in some cases to share the processed data with real-time control via the Synchronous Data Network (SDN). The clock synchronization module can use the external clock from the clock distribution system to override the built-in clock.

2.2. The hardware structure design

Considering the requirements of the diagnostic system, the proposed DAQ system is based on the J-TEXT CODAC real-time fast controller framework [7]. The controller framework is based on PXI/PXIe bus devices and is similar to an ITER Fast Plant System Controller [8]. As illustrated in Fig. 2, the DAQ system contains a PXIe chassis, an embedded controller, multiple FlexRIO FPGA modules with digitizer adapter modules, and an external RAID.

The feature that distinguishes the proposed system from traditional DAQ systems is that it employs NI FlexRIO devices. A FlexRIO device is a PXIe device with a user-configurable FPGA and adapters for various functions that attach to it. Users can implement highly specialized DAQ jobs or deterministic real-time data processing applications on it.

In the interferometer–polarimeter DAQ system, we chose NI PXIe-7966R FlexRIO devices, which have a Virtex-5 SX95T FPGA. An NI 5734 digitizer adapter module is attached to each FlexRIO. The FlexRIO onboard FPGA configures and controls the digitizer adapter module. It feeds sampling clocks to the digitizer and reads out samples from it. It also allows complex data processing functions to be implemented on it. The NI 5734 digitizer adapter module has 4 simultaneously sampled 16-bit analog input channels with a 120 MS/s sampling rate. Four FlexRIO modules are used to implement 16 channels for the interferometer–polarimeter diagnostic.

As shown in Fig. 2, the interferometer–polarimeter diagnostic signals are connected to the digitizer adapter modules by coaxial cables. The FlexRIO module reads out the samples from the digitizer and processes the data in real time. After being processed, the data is transferred to the host memory via Direct Memory Access (DMA). 
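The data reduction claimed for on-FPGA processing can be made concrete with a short worked calculation. The channel count, resolution and rates below come from the text (four modules of four 16-bit channels at 120 MS/s, 1 MS/s phase output, 700 MB/s RAID); the 4-byte size of one phase value and the reading of MB as 10^6 bytes are assumptions made only for this sketch, which compares rates and does not model how or when raw data is stored.

```python
# Worked throughput arithmetic for the 16-channel system (sketch, see assumptions above).
MODULES = 4                      # FlexRIO modules, from the text
CHANNELS_PER_MODULE = 4          # NI 5734 channels per module, from the text
SAMPLE_RATE_HZ = 120_000_000     # 120 MS/s per channel, from the text
BYTES_PER_SAMPLE = 2             # 16-bit samples, from the text
PHASE_RATE_HZ = 1_000_000        # 1 MS/s phase shift signal, from the text
BYTES_PER_PHASE = 4              # ASSUMPTION: one phase value stored in 4 bytes
RAID_BYTES_PER_S = 700_000_000   # 700 MB/s, ASSUMPTION: MB = 10**6 bytes


def stream_rate(channels: int, rate_hz: int, bytes_per_value: int) -> int:
    """Aggregate data rate in bytes per second for a set of identical channels."""
    return channels * rate_hz * bytes_per_value


channels = MODULES * CHANNELS_PER_MODULE
raw_bytes_per_s = stream_rate(channels, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
processed_bytes_per_s = stream_rate(channels, PHASE_RATE_HZ, BYTES_PER_PHASE)

if __name__ == "__main__":
    print(f"raw IF stream:       {raw_bytes_per_s / 1e9:.2f} GB/s")
    print(f"processed stream:    {processed_bytes_per_s / 1e6:.0f} MB/s")
    print(f"share of RAID rate:  {processed_bytes_per_s / RAID_BYTES_PER_S:.1%}")
```

Under these assumptions the raw IF stream is several times the sustained RAID rate, while the processed phase stream uses only a small fraction of it, which is the reduction in storage and bus throughput that the processing module is meant to deliver.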
The FPGA module can receive configurations and instructions from the host and expose its status to the host as well. This approach allows great flexibility in implementing real-time data processing.

The host is implemented using the NI PXIe-8135 embedded controller. The host stores the data on the RAID. In some cases the host can further process the data and share it via SDN, which in J-TEXT is implemented using the reflective-memory network. Another responsibility of the host is to configure the system. It provides an Experimental Physics and Industrial Control System (EPICS) Channel Access interface [9].

Fig. 3. The workflow of the interferometer–polarimeter DAQ system.

An NI PXIe chassis is chosen to enclose the system. It also provides reference clocks and trigger distribution via the PXIe backplane. An NI HDD-8265 RAID is attached to it via the PXIe link. It offers a total storage capacity of 24 TB and 700 MB/s of sustained data read and write rates in RAID 5 mode. This allows the DAQ system to continuously store processed data in permanent storage.

3. Software implementation

This section presents the host software application implementation and the FPGA implementation. The host application is software running on the embedded controller. The FPGA application contains no software. It is a bitfile that configures the FPGA. 
However, this bitfile is generated by LabVIEW-FPGA using the LabVIEW graphical programming language, so it is discussed in this section.

The data is acquired and mainly processed by the FPGA application. A digital phase comparator is implemented on the FPGA and calculates the phase shift. The host application is mainly in charge of configuring and monitoring the whole system. Some further data processing can also be done by the host.

3.1. Host application

The host application is developed using LabVIEW. It can run under the Microsoft Windows operating system. If real-time control or data sharing via SDN is required, it can run on a LabVIEW-RT target with a real-time operating system. In the interferometer–polarimeter DAQ system Windows is used. LabVIEW provides complete support for the NI hardware, so using it enabled the presented system to be developed in a short time.

As illustrated in Fig. 3, the host application uploads the FPGA bitfile to the FlexRIO devices at initialization. Once the FlexRIO modules have been initialized, the host application configures the parameters needed by the FPGA application. The configuration data contains the clock, timing, trigger, channel and digitizer settings. The digitizer setting includes the filter, gain and coupling settings. After configuring the FlexRIO, the host application is ready to process the data. The host application monitors the FlexRIO devices and dispatches commands from the EPICS control network to them. With EPICS support, the proposed system is integrated into the newly built J-TEXT CODAC system [10,11]. The EPICS support is provided by the NI Distributed Control System pack.

Since most of the data acquisition and processing is done by the FlexRIO, not much is left to do in the host application software. Once the FlexRIO devices are triggered, the processed data is streamed to the host. The data is written into files and uploaded to MDSplus. A local MDSplus database is used to take
advantage of the RAID storage as well as to avoid overloading the main MDSplus repository on J-TEXT with a large amount of data traffic.

Fig. 4. The state transition diagram of the FPGA application in the data handling phase.

If running on a LabVIEW-RT real-time target, the host can also process the data in real time and publish it on the reflective-memory network based SDN. The line-integrated plasma density is proportional to the phase shift. The host can easily calculate the line-integrated plasma density from the phase shift calculated by the FPGA and share it with the plasma control system via the SDN.

3.2. FPGA application design

FPGA applications are also programmed using LabVIEW. Instead of running on a processor, the program is synthesized into a bitfile that configures the FPGA on the FlexRIO devices.

The workflow of the FPGA application is shown in Fig. 3 as well. The FlexRIO devices initialize the digitizer adapter modules after start-up. After that, the FPGA application configures the digitizer parameters using the configuration information from the host application. The FPGA application then enters the data handling phase, which contains multiple sub-states. Data acquisition, processing and streaming to the host are done in one of these states.

Since there are multiple FlexRIO modules in the system, one of them is selected as the master module. Fig. 4 shows the state transitions of the data handling phase in the master module and the slave modules. To ensure that the slaves and the master are triggered at the same time, their triggers are aligned with the PXIe backplane clock. When the master receives a trigger, it enters the send-trigger state and sends a trigger to the trigger bus in the backplane. The master enters the running state at the next backplane clock, at the same time as the slaves do. In the running state, the FlexRIO feeds the digitizers with clocks and reads out samples. The samples read out are immediately sent to the processing module on
the FPGA in which they are buffered and processed.

To process the data, a simple digital phase comparator is implemented on the FlexRIO FPGA as the data processing module. To calculate the phase shift, the probe data is multiplied by the reference data. A Fast Fourier Transform is then performed, and an inverse cosine is evaluated using a look-up table. The detailed implementation of the phase comparator is beyond the scope of this paper.

4. Applications

To test the simple phase comparator on the FlexRIO, two sinusoidal waves with slightly different frequencies are streamed to the FlexRIO. Fig. 5 shows the test input and result, which confirm that the FlexRIO is able to extract the phase shift between two signals.

Fig. 5. Test result of the phase comparator on the FlexRIO FPGA. The frequency is 0.9 MHz for the red line and 1 MHz for the blue line. The lower part is the phase shift between these signals.

This system has been deployed in the J-TEXT interferometer–polarimeter system. Equipped with four FlexRIO devices, it is able to acquire 16 channels of IF signals at 120 MS/s simultaneously and produce 1 MS/s phase shift signals. Currently both raw and processed data are stored in the MDSplus database. The raw data will be used in more accurate offline analysis. The system is integrated into the J-TEXT CODAC and is controlled and monitored using EPICS. It has been supporting the interferometer–polarimeter diagnostic experiments and producing promising results [12].

5. Conclusion and future work

The system has been developed to provide powerful data acquisition and real-time data processing for the J-TEXT interferometer–polarimeter system and other diagnostics in plasma fusion experiments. It proved the feasibility of using FlexRIO FPGA based devices in DAQ and real-time data processing. 
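As an illustration of the real-time processing referred to here, the phase comparator outlined in Section 3.2 (multiply probe by reference, transform, take an inverse cosine) can be sketched in a few lines. The paper leaves the implementation out of scope, so this is a generic product detector under stated assumptions, not the authors' design: both signals are unit-amplitude sinusoids of the same frequency, the window spans a whole number of periods, only the DC bin of the transform is used (computed as a mean), and `math.acos` stands in for the look-up table. The function names and test values are hypothetical.

```python
import math
from typing import List


def make_tone(n: int, freq_hz: float, fs_hz: float, phase_rad: float) -> List[float]:
    """Generate n samples of a unit-amplitude cosine at freq_hz sampled at fs_hz."""
    return [math.cos(2.0 * math.pi * freq_hz * k / fs_hz + phase_rad) for k in range(n)]


def phase_shift(probe: List[float], reference: List[float]) -> float:
    """Estimate the magnitude of the phase difference, in radians within [0, pi].

    For unit-amplitude tones, probe * reference = 0.5*cos(phi) + a double-frequency
    term. Averaging over whole periods (the DC bin of the DFT divided by N) removes
    the double-frequency term, leaving 0.5*cos(phi).
    """
    n = len(probe)
    dc = sum(p * r for p, r in zip(probe, reference)) / n
    cos_phi = max(-1.0, min(1.0, 2.0 * dc))  # clamp against rounding error
    return math.acos(cos_phi)


if __name__ == "__main__":
    fs = 120e6   # sampling rate from the text
    f_if = 2e6   # hypothetical IF inside the 1-3 MHz band
    n = 120      # two full IF periods at 60 samples per period
    ref = make_tone(n, f_if, fs, 0.0)
    probe = make_tone(n, f_if, fs, math.pi / 3)
    print(phase_shift(probe, ref))
```

The inverse cosine only recovers the magnitude of the phase difference; resolving its sign would need a quadrature reference, which this sketch omits.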
By using LabVIEW-RT and SDN, it is able to share the processed data with other controllers. In the near future, more complex data processing applications, which can handle noisy signals and calculate the current density profile in real time, will be implemented and tested. Moreover, this system will be ported to Linux and integrated into the ITER CODAC Core System tools.

Acknowledgments

We want to thank our colleagues from the J-TEXT team for their encouraging contributions in testing and tuning the new diagnostics. In particular, Q. Li did a lot of work testing the program and finding problems in the system, so that we could improve it and better meet the requirements. This work was supported by the National ITER Project of China (No. 2010GB108004 and 2013GB113003).

References

[1] G. Zhuang, Y. Pan, X.W. Hu, Z.J. Wang, Y.H. Ding, M. Zhang, et al., The reconstruction and research progress of the TEXT-U tokamak in China, Nucl. Fusion 51 (2011) 094020.
[2] J. Chen, L. Gao, G. Zhuang, Z.J. Wang, K.W. Gentle, Design of far-infrared three-wave polarimeter–interferometer system for the J-TEXT tokamak, Rev. Sci. Instrum. 81 (2010), 10D502–10D502-3.
[3] National Instruments FlexRIO Home Page, [Online], Available http://www./flexrio/
[4] D. Sanz, M. Ruiz, R. Castro, J. Vega, J.M. Lopez, E. Barrera, et al., Implementation of intelligent data acquisition systems for fusion experiments using EPICS and FlexRIO technology, IEEE Trans. Nucl. Sci. 60 (2013) 3446–3453.
[5] E. Barrera, M. Ruiz, S. Lopez, D. Machon, J. Vega, PXI-based architecture for real-time data acquisition and distributed dynamic data processing, IEEE Trans. Nucl. Sci. 53 (2006) 923–926.
[6] A. Salim, L. Crockett, J. McLean, ne, A user configurable data acquisition and signal processing system for high-rate, high channel count applications, Fusion Eng. Des. 87 (2012) 2174–2177.
[7] Z. Wei, Z. Ge, Z. Ming, W. Chuqiao, L. Rui, H. Yang, et al., Real-time fast controller prototype for J-TEXT tokamak, in: Real Time Conference (RT), 2012 18th IEEE-NPSS, 2012.
[8] B. Gonçalves, J. Sousa, B.B. Carvalho, A.P. Rodrigues, M. Correia, A. Batista, et al., ITER prototype fast plant
system controller, Fusion Eng. Des. 86 (2011) 556–560.
[9] Experimental Physics and Industrial Control System Home Page [Online], Available /epics/
[10] W. Zheng, M. Zhang, J. Zhang, G. Zhuang, Y. He, T. Ding, The J-TEXT CODAC system design and implementation, in: The 9th IAEA Technical Meeting on Control, Data Acquisition, and Remote Participation for Fusion Research, Hefei, People's Republic of China, 2013.
[11] W. Zheng, M. Zhang, J. Zhang, G. Zhuang, J-TEXT-EPICS: an EPICS toolkit attempted to improve productivity, Fusion Eng. Des. 88 (2013) 3041–3045.
[12] J. Chen, G. Zhuang, Z.J. Wang, L. Gao, Q. Li, W. Chen, et al., First results from the J-TEXT high-resolution three-wave polarimeter–interferometer, Rev. Sci. Instrum. 83 (2012), 10E306-310E306-303.
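The phase comparator of Section 3.2 multiplies the probe signal with the reference signal and recovers the phase shift through an inverse cosine. The paper does not give the implementation, so the following is only a minimal software sketch of the same idea, under the assumptions that both signals have the same frequency and that the record spans an integer number of periods. It replaces the FFT and look-up table of the FPGA design with a plain average and `math.acos`; the function name `phase_shift` is hypothetical.

```python
import math


def phase_shift(probe, reference):
    """Estimate the magnitude of the phase difference (0..pi) in radians.

    Assumes equal-frequency sinusoids sampled over whole periods, so that
    mean(probe * reference) = 0.5 * A * B * cos(dphi).
    """
    n = len(probe)
    # DC term of the product signal (stands in for the FFT bin 0 of the paper).
    dc = sum(p * r for p, r in zip(probe, reference)) / n
    # RMS amplitudes: A/sqrt(2) and B/sqrt(2) for pure sinusoids.
    rms_p = math.sqrt(sum(p * p for p in probe) / n)
    rms_r = math.sqrt(sum(r * r for r in reference) / n)
    cos_dphi = dc / (rms_p * rms_r)
    # Clamp against rounding error before the inverse cosine
    # (the FPGA design uses a look-up table for this step).
    cos_dphi = max(-1.0, min(1.0, cos_dphi))
    return math.acos(cos_dphi)


if __name__ == "__main__":
    n, cycles, dphi = 1000, 10, 0.7
    ref = [math.cos(2 * math.pi * cycles * k / n) for k in range(n)]
    prb = [math.cos(2 * math.pi * cycles * k / n - dphi) for k in range(n)]
    print(phase_shift(prb, ref))
```

Because the inverse cosine only covers 0 to π, this sketch returns the magnitude of the phase difference; recovering the sign would need a quadrature reference, which is outside what the text describes.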

FPGA Foreign-Language Literature (Original Version)

Z. Salcic / Microprocessors and Microsystems 21 (1997) 249-256

2.3. Standard FPGA chip

A standard Altera FLEX 8000 [7] chip is used as the major resource for implementing application-specific hardware structures, which are a part of the embedded system solution. In our case we decided to implement a PCB with a FLEX8282-84 device, but it can easily be modified to accommodate any other FPGA from the FLEX 8000 family because they share the same architecture and reconfiguration mechanism.

2.4. Memory

The existing 68HC11 on-chip memory resources are not sufficient for most of the intended applications. This was the reason for using the microcontroller in the expanded bus mode and extending the memory resources with 8 KB of external SRAM and 32 KB of EEPROM. Larger memory resources are needed to store programs and data, but also to store the hardware configurations that are implemented in the FPGA chip.

2.5. Serial communication link

A serial communication link is needed to communicate with a personal computer, which is used as the software/hardware development platform. It enables both programs that run on the microcontroller and hardware configurations to be downloaded from the PC to the prototyping board. It can also be used in the target application.

2.6. Simple input/output devices for testing purposes

To provide flexibility in system operation, to select different options, and to indicate the current state of the system, a number of manually operated input switches and a number of LED indicators are provided.

2.7. System clocks

The PROTOS system provides two system clocks. One drives the microcontroller at 2 MHz, and the other drives the sequential circuits implemented in the FPGA at higher frequencies (up to 50 MHz in our case).

2.8. Access to the FPGA through memory-mapped I/O

As the 68HC11 supports memory-mapped I/O, our decision was to extend this I/O method to the FPGA.
This enables access to the FPGA resources through a number of registers, implemented in an EPLD, that appear in the address space of the 68HC11. However, this does not prevent a user from implementing more registers within the FPGA, as an application requires.
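The memory-mapped scheme of Sections 2.4 and 2.8 can be illustrated with a small software model: the processor issues ordinary reads and writes, and an address decoder routes them to SRAM, EEPROM, or the FPGA interface registers. This is only a sketch; the base addresses and the number of registers below are hypothetical and are not taken from the PROTOS design, only the 8 KB and 32 KB sizes come from the text.

```python
class MemoryMappedBus:
    """Toy 16-bit address bus that routes byte accesses by address range."""

    def __init__(self):
        self._regions = []  # list of (base, size, storage)

    def attach(self, base, size):
        """Map `size` bytes of storage at address `base` and return the storage."""
        storage = bytearray(size)
        self._regions.append((base, size, storage))
        return storage

    def _decode(self, addr):
        # Address decoding: find the region that contains this address.
        for base, size, storage in self._regions:
            if base <= addr < base + size:
                return storage, addr - base
        raise ValueError(f"unmapped address 0x{addr:04X}")

    def read(self, addr):
        storage, offset = self._decode(addr)
        return storage[offset]

    def write(self, addr, value):
        storage, offset = self._decode(addr)
        storage[offset] = value & 0xFF


# Hypothetical memory map (addresses chosen for illustration only).
SRAM_BASE = 0x2000      # 8 KB external SRAM
FPGA_REG_BASE = 0x6000  # four interface registers in front of the FPGA
EEPROM_BASE = 0x8000    # 32 KB EEPROM

bus = MemoryMappedBus()
sram = bus.attach(SRAM_BASE, 8 * 1024)
fpga_regs = bus.attach(FPGA_REG_BASE, 4)
eeprom = bus.attach(EEPROM_BASE, 32 * 1024)

# To software, writing an FPGA register looks exactly like a memory write.
bus.write(FPGA_REG_BASE + 1, 0x5A)
```

The point of the design choice is that no special I/O instructions are needed: adding more registers inside the FPGA only means decoding more addresses.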

FPGA Foreign-Language Material 21


An FPGA-based parallel architecture for on-line parameter estimation using the RLS identification algorithm

T. Ananthan ⇑, M.V. Vaidyan

Department of Electrical Engineering, National Institute of Technology Calicut, Kerala 673601, India

Article info

Article history: Available online 20 March 2014

Keywords: RLS identification algorithm; On-line parameter estimation; FPGA; Parallel architecture; ASIC

Abstract

A parallel architecture for an on-line implementation of the recursive least squares (RLS) identification algorithm on a field programmable gate array (FPGA) is presented. The main shortcoming of this algorithm for on-line applications is its computational complexity. The matrix computation to update the error covariance consumes most of the time. To improve the processing speed of the RLS architecture, a multi-stage matrix multiplication (MMM) algorithm was developed. In addition, a trace technique was used to reduce the computational burden on the proposed architecture. High throughput was achieved by employing a pipelined design. The scope of the architecture was explored by estimating the parameters of a servo position control system. No vendor-dependent modules were used in this design. The RLS algorithm was mapped to a Xilinx Virtex-5 FPGA device. The entire architecture operates at a maximum frequency of 339.156 MHz. Compared to earlier work, the hardware utilization was substantially reduced. An application-specific integrated circuit (ASIC) design was implemented in 180 nm technology with the Cadence RTL compiler.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

A broad range of contemporary engineering systems require parameter estimation for estimating unknown parameters [1]. These parameters are estimated on-line by recursive identification algorithms. The amount of time available for running such algorithms is only a fraction of the sampling interval [2–4]. Among the variety of available recursive algorithms, the recursive least squares (RLS) algorithm is extensively applied for parameter
estimation in signal processing, time series modelling, spectral analysis, adaptive control, and speech coding [5]. The main issue in applying RLS to on-line applications is its computational complexity and the resulting time overhead. This problem can be tackled by employing parallel processing techniques to attain the desired specifications [6,7]. One of the important tasks in parallel processing is the distribution of computational tasks across multiple processing elements. Processing speed is increased by the following techniques: optimizing interconnection schemes, scheduling and mapping onto the architecture, detecting parallelism, and partitioning the algorithm into modules [8,9].

2. Related work

This section discusses related work in this field. The breakthrough of VLSI technology has provided a path to develop parallel architectures for implementing complex algorithms. Systolic architectures have been proposed for the implementation of matrix operations. These consist of arrays for performing arithmetic operations. By chaining the arrays, a systolic architecture has been developed for the RLS architecture [10,11]. Its performance is further improved with an exponential memory for recursive applications. To overcome the need for global synchronization and the lack of programmability in systolic arrays, a wavefront array processor was introduced [12]. For on-line parameter estimation, a parallel architecture has been proposed for implementing the RLS identification algorithm, including the Kalman filter [13,14]. Further, the throughput of the RLS architecture is improved with directional forgetting and by vectorizing the measurement data [15]. The difficulties in RLS parallel processing with high throughput are pointed out and, subsequently, an architecture with high throughput is developed [16,17]. Implementations of RLS for on-line applications using conventional uni-processors and multiprocessors are found to be unsatisfactory
[18,19]. Alternatively, a multiprocessor scheme is employed utilizing a digital signal processor (DSP). Texas Instruments [20,21] introduced the first parallel DSP, the TMS320C40, which allows parallel processing using multiple processors. The computation time of RLS could be decreased by [...] number of processors. However, the [...] an overhead to the system and it is not cost [...]

In recent times, FPGAs have been identified to [...] platforms for implementing the RLS algorithm. [...] development of specific hardware architectures for [...] high processing speed without the use of digital [...] [24,25]. An implementation of such hardware [...] currently feasible due to the developments of FPGA [...] electronic design automation (EDA) tools and the existence of [...] for hardware description languages (HDLs) [27–29]. [...] the algorithm on an FPGA can also be converted to an application-specific integrated circuit (ASIC), if a high-volume [...] design is envisaged [30,31]. Attempts have been made to implement an FPGA-based RLS for applications such as QR decomposition (QRD) based matrix inversion [32], digital [...] system in power amplifiers [33], noise canceller system with adaptive filters [34], adaptive array antenna [35,36], and QRD-RLS systolic array [37,38].

A serial architecture for the RLS identification algorithm has been developed for estimating the parameters of a second-order process in an adaptive control system [39–41]. This architecture incorporates scalar-based direct algorithm mapping (SBDAM) to reduce the computation time incurred by matrix multiplications. Further, the architecture is improved by introducing a covariance matrix resetting technique to overcome the inaccuracy in the estimated parameters caused by a command signal that stays constant over a long time [42]. The above architectures [39–42] have been designed with Altera library modules. Another methodology has been proposed to reduce the design cycles needed in implementing RLS [43]. In this work, an Altera DSP builder is used for

/10.1016/j.micpro.2014.03.005
0141-9331/© 2014 Elsevier B.V. All rights reserved.
⇑ Corresponding author. E-mail address: ananthan71@ (T. Ananthan).
estimating the parameters of a fuel cell stack system in a Matlab/Simulink environment. High-cost computers are used to process the RLS in industrial automation applications. Alternatively, a reduced instruction set computing (RISC) processor has been developed and mapped on an FPGA [44]. From all the work reported above, it is observed that the development of an FPGA-based parallel architecture and an ASIC implementation for the RLS identification algorithm have not been optimally addressed.

In this paper, a novel parallel architecture for the RLS identification algorithm and its ASIC design are proposed. The architecture incorporates two key features for improving the processing speed of the algorithm. First, a multi-stage matrix multiplication (MMM) algorithm is employed to reduce the computation time taken by matrix multiplications in the updating stage of the error covariance. Next, a trace technique is introduced in the same stage to reduce the number of matrix computations. The architecture is deeply pipelined to achieve high throughput. The performance of the proposed architecture has been evaluated by estimating the parameters of a servo position control system. No vendor-dependent modules are used in the design.

The paper is organized as follows. The theoretical background of the RLS identification algorithm and its application are described in Sections 3 and 4. The architectural design of RLS employing MMM and the trace technique is described in Section 5. The FPGA and ASIC implementation results are presented and discussed in Section 6. The performance of the proposed architecture is compared to earlier work in Section 7. Conclusions are drawn in Section 8.

3. Theoretical background

The equations governing the RLS identification algorithm for on-line parameter estimation are described in this section.

3.1. On-line system identification

On-line system identification is the process of estimating unknown parameters while a process is in operation [45,46].
Employing the RLS algorithm in a system identification process can be implemented as shown in Fig. 1. Assume that the system is a single-input–single-output (SISO) discrete-time system as shown in Fig. 2. The z-transform of the input–output relationship [47] is given by

$$\frac{Y(z)}{U(z)} = H(z) = \frac{a_0 + a_1 z^{-1} + \cdots + a_m z^{-m}}{1 + b_1 z^{-1} + \cdots + b_n z^{-n}}, \tag{1}$$

where $z = e^{sT}$, $T$ is the sampling period, $U(z)$ is the z-transform of $u_k$, $Y(z)$ is the z-transform of $y_k$, $H(z)$ is the pulse transfer function of the SISO system, $a_0, a_1, \ldots, a_m$ and $b_1, b_2, \ldots, b_n$ are the unknown parameters of the model, and $m$ and $n$ are the orders of the numerator and denominator polynomials, respectively, with $n \ge m$.

The assumed model [47] of an SISO system (Eq. (1)) in difference equation form is given by

$$y_k = \sum_{i=0}^{m} a_i u_{k-i} - \sum_{i=1}^{n} b_i y_{k-i}, \tag{2}$$

where $y_k = y(kT)$ and $u_k = u(kT)$, for $k = 1, 2, 3, \ldots$

The RLS algorithm is used for estimating the unknown parameters $a_0, a_1, \ldots, a_m$ and $b_1, b_2, \ldots, b_n$ of the system using Eq. (2). The input signal is applied simultaneously to the system and the model as shown in Fig. 1. The difference between their responses defines the estimation error $e(k)$. With an error criterion, the algorithm reduces the estimation error by adjusting the values of the parameters. The operation is then iteratively updated with each new input–output data set until the estimation error becomes sufficiently small in a statistical sense. Once the estimation error falls within the acceptable tolerance band, the obtained parameters represent the unknown system parameters.

Fig. 1. On-line system identification using RLS [47].

Fig. 2. SISO discrete-time system.

3.2. Governing equations of RLS for parameter estimation

In a practical application, there is additive noise present at the output of the system (Eq. (1)). This results in the weighted least [...] The governing equations of WLS (see Eqs. (A.23)–(A.25) and (A.27) in Appendix A), with the initial values of $P_p$ and $\hat{\theta}_p$, are as given below:

$$e_{k+p} = y_{k+p} - u_{k+p}^{T}\,\hat{\theta}_p, \tag{3}$$

$$K_{p+1} = P_{p+1}\,u_{k+p}, \tag{4}$$

$$\hat{\theta}_{p+1} = \hat{\theta}_p + K_{p+1}\,e_{k+p}, \tag{5}$$

$$P_{p+1} = \frac{1}{\lambda}\left(P_p - \frac{P_p\,u_{k+p}\,u_{k+p}^{T}\,P_p}{\lambda + u_{k+p}^{T}\,P_p\,u_{k+p}}\right), \tag{6}$$

where $e_{k+p}$ is the error between the actual measured output and the predicted output, $K_{p+1}$ is the weighted least squares gain matrix, $\hat{\theta}_{p+1}$ is the least squares parameter estimate, $P_p$ is the error covariance matrix, and $\lambda$ is the forgetting factor. Simplicity and accuracy are the main features of the RLS algorithm [43,44]. Its flow chart is shown in Fig. 3.

Fig. 3. Flow chart of the RLS algorithm.

4. Estimation of the parameters of a servo position control system using RLS

The application of RLS in a servo position control system is presented in this section. Its block diagram, shown in Fig. 4, consists of the RLS estimator, the proportional, integral and derivative (PID) controller, and the servo system. The unknown parameters of the system are estimated by the RLS estimator. Using the estimated values, the controller is tuned to obtain the desired performance. The input–output relationship [48] of the system is given by

$$\left.\frac{\omega(s)}{V_t(s)}\right|_{T_d = 0} = \frac{K_t}{F R_a (1 + s\tau_e)(1 + s\tau_m) + K_t^{2}}, \tag{7}$$

where $V_t$ is the terminal voltage, $\omega$ is the motor speed, $F$ is the viscous friction, $R_a$ is the armature resistance, $K_t$ is the motor torque constant, $\tau_m$ is the mechanical time constant, $\tau_e$ is the electrical time constant, and $T_d$ is the disturbance torque. Discretizing Eq. (7) gives

$$\frac{\Theta(z)}{V_t(z)} = \frac{b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}, \tag{8}$$

where $\Theta$ is the position of the motor and

$$a_2 = e^{-T/\tau}, \quad a_1 = -(1 + a_2), \quad b_1 = K\left[T - \tau\left(1 - e^{-T/\tau}\right)\right], \quad b_2 = K\left[\tau\left(1 - e^{-T/\tau}\right) - T e^{-T/\tau}\right],$$

where $J$ is the total inertia, $K$ is the static gain, and $\tau$ is the motor time constant. The unknown parameters of the system (Eq. (8)) are $a_1$, $b_1$, and $b_2$. The assumed true values [49] of the parameters are $a_2 = 0.6065$, $b_1 = 6.1065$, and $b_2 = 2.090$. Substituting these values, with $a_1 = -(1 + a_2) = -1.6065$, Eq. (8) becomes

$$\frac{\Theta(z)}{V_t(z)} = \frac{6.1065\,z^{-1} + 2.090\,z^{-2}}{1 - 1.6065\,z^{-1} + 0.6065\,z^{-2}}. \tag{9}$$

A model structure of the same order (Eq. (8)) is assumed for parameter estimation.

Fig. 4. Block diagram of a servo position control system.

Fig. 5a. Record of the system input–output data.

Fig. 5b. Estimation error.

4.1. Functional verification of RLS

The functionality of RLS is verified in MATLAB, version
7.9.0.529 (R2009b). The input–output data of the system is recorded by applying a zero-mean, unit-variance random signal at the input (see Eq. (A.12) in Appendix A) and adding a measurement noise signal at the output. This is shown in Fig. 5a. $u(k)$ and $y(k)$ correspond to the input terminal voltage $V_t$ and the motor position output $\Theta$, respectively (Eq. (8)). The iteration is started with the initial values, and the estimation error is as shown in Fig. 5b. After approximately 100 iterations, the estimated parameters converge to the true parameters as shown in Fig. 5c. The estimated parameters are shown in Table 1. These parameters are used for tuning the PID controller to obtain the desired performance of the servo system. Fig. 5d shows the response of the system to a unit step input. Thus, functional verification of the RLS algorithm is performed in MATLAB before attempting hardware implementation.

Fig. 5c. Estimated true parameters.

Table 1. Parameter estimates.

Parameter          a2        b1        b2
True value         0.6065    6.1065    2.090
Estimated value    0.5978    6.0023    2.000

Fig. 5d. Unit step response of the servo system.

5. Hardware implementation

The parallel architecture design of RLS incorporating MMM and the trace technique is described in this section.

5.1. Multi-stage matrix multiplication algorithm

[...] Eq. (10) is rewritten as

$$\begin{pmatrix} a^{1}_{11} & \cdots & a^{1}_{1N} \\ \vdots & \ddots & \vdots \\ a^{1}_{N1} & \cdots & a^{1}_{NN} \end{pmatrix} * \cdots * \begin{pmatrix} a^{n}_{11} & \cdots & a^{n}_{1N} \\ \vdots & \ddots & \vdots \\ a^{n}_{N1} & \cdots & a^{n}_{NN} \end{pmatrix} = \begin{pmatrix} r_{11} & \cdots & r_{1N} \\ \vdots & \ddots & \vdots \\ r_{N1} & \cdots & r_{NN} \end{pmatrix}. \tag{11}$$

The final product matrix $R$ is calculated as follows:

$$\left.\begin{aligned} C_1 &= A_1 * A_2 && \cdots\ \text{stage I} \\ C_2 &= C_1 * A_3 && \cdots\ \text{stage II} \\ C_3 &= C_2 * A_4 && \cdots\ \text{stage III} \\ &\ \ \vdots \\ C_{n-2} &= C_{n-3} * A_{n-1} && \cdots\ \text{stage } n-2 \\ R &= C_{n-2} * A_n && \cdots\ \text{stage } R \end{aligned}\right\} \tag{12}$$

where $C_1 \cdots C_{n-1}$ are the partial products, $C_{n-1} = R$, and stage I $\cdots$ stage $R$ are the multiplication stages. The number of multiplication stages for multiplying $n$ matrices is given by

$$Q = n - 1, \tag{13}$$

where $Q$ is the number of multiplication stages. Rewriting Eq. (12) in a general recursive form,

$$c^{1}_{ij} = \sum_{k=1}^{N} a^{1}_{ik}\,a^{2}_{kj}, \tag{14}$$

where $C_1 = (c^{1}_{ij})$, $A_1 = (a^{1}_{ik})$, $A_2 = (a^{2}_{kj})$, $N$ is the matrix order, and $i$, $j$ are the row and column indices, respectively. Expanding Eq. (14) for $R$,

$$\left.\begin{aligned} C_1(i,j) &= A_1(i,k) * A_2(k,j) && \cdots\ \text{stage I} \\ C_2(i,j) &= C_1(i,k) * A_3(k,j) && \cdots\ \text{stage II} \\ C_3(i,j) &= C_2(i,k) * A_4(k,j) && \cdots\ \text{stage III} \\ &\ \ \vdots \\ C_{n-2}(i,j) &= C_{n-3}(i,k) * A_{n-1}(k,j) && \cdots\ \text{stage } n-2 \\ R(i,j) &= C_{n-2}(i,k) * A_n(k,j) && \cdots\ \text{stage } R \end{aligned}\right\} \tag{15}$$

where $C_1(i,j)$ is the element in the $i$th row and $j$th column of $C_1$. The loop indices are taken as $i$, $j$, and $k$. A pseudocode for concurrent multiplication of Eq. (14) is shown in Table 2. The MMM hardware implementation uses a pipelined design to achieve high throughput.

5.2. Trace technique

This technique reduces the computational burden on the proposed architecture in the updating stage of the error covariance matrix. The trace ($\mathrm{tr}$) of an $N \times N$ matrix $A$ [51] is given by

$$\mathrm{tr}(A) = \sum_{i=1}^{N} a_{ii}. \tag{16}$$

Rewriting Eq. (6),

$$P_{p+1} = \frac{1}{\lambda}\left(P_p - \frac{N_p}{\lambda + D_p}\right), \tag{17}$$

where

$$N_p = P_p\,u_{k+p}\,u_{k+p}^{T}\,P_p, \tag{18}$$

$$D_p = u_{k+p}^{T}\,P_p\,u_{k+p} \quad \text{(scalar)}. \tag{19}$$

From Eq. (18),

$$T = P_p\,u_{k+p}\,u_{k+p}^{T}. \tag{20}$$

From Eqs. (19) and (20),

$$\mathrm{tr}(T) = D_p. \tag{21}$$

From Eq. (21), it is established that $\mathrm{tr}(T)$ avoids the separate computation of $D_p$.

5.3. Architectural design

The RLS algorithm is partitioned into five modules, namely, the NP module (NPM), the error covariance module (ECM), the weighted least squares gain module (WGM), the prediction error module (EM), and the parameter module (PM). The governing equations of the algorithm are processed by these modules, as listed in Table 3.

5.3.1. NP module

The implementation of NPM is shown in Fig. 6. It consists of a three-stage MMM. The processing elements $PE_1$, $PE_2$, and $PE_3$ are used to process the matrix multiplications. The temporary registers $TR_1$–$TR_3$ provide a data path connection between the processing elements. The control signals are as follows: $E_1$–$E_3$ are the enabling signals, $W_1$–$W_3$ are the write pulses, and $T_1$–$T_3$ are the trigger signals. At the rising clock edge, the address lines are generated such that the $i$th row of $P_p$ and the $j$th column of $u_{k+p}$ are
synchronously entered at the input of $PE_1$. On the same clock pulse, $T_1$ is HIGH and the stage I computation is triggered. The first partial product, $(P_p u_{k+p})$, is available at the output of $PE_1$ with a latency of 7 clock cycles (ccs).

At the 7th rising clock edge, both $E_1$ and $TR_1$ are HIGH. The first row of partial products is latched by $TR_1$ in 3 ccs. At the 10th (7 + 3 = 10 ccs) rising clock edge, it is entered at the input of $PE_2$. On the same clock pulse, the $j$th column of the part highlighted in Eq. (20) appears on the $PE_2$ input, and $T_2$ is HIGH. The stage II computation is then triggered. The second partial product, $[(P_p u_{k+p})\,u_{k+p}^{T}]$, is available at the output of $PE_2$ with a latency of 7 ccs.

At the 17th (10 + 7 = 17 ccs) rising clock edge, both $E_2$ and $TR_2$ are HIGH. The first row of the second partial product is latched by $TR_2$ in 3 ccs. At the 20th (17 + 3 = 20 ccs) rising clock edge, it is entered at the input of $PE_3$. On the same clock pulse, the first column of $P_p$ appears on the $PE_3$ input, and $T_3$ is HIGH. The stage III computation, $\{[(P_p u_{k+p})\,u_{k+p}^{T}]\,P_p\}$, is then triggered. The product is available at the $PE_3$ output with a latency of 27 ccs (20 + 7 = 27 ccs).

Table 2. Pseudocode for the matrix multiplication.

Table 3. Algorithm partitioning.

Module    Implemented equation
NPM       (18)
ECM       (6)
WGM       (4)
EM        (3)
PM        (5)

T. Ananthan, M.V. Vaidyan / Microprocessors and Microsystems 38 (2014) 496–508

5.3.2. Trace computation

The implemented trace computing unit is shown in Fig. 6. It consists of a parallel adder for adding the leading diagonal elements of Eq. (20). From Eqs. (18) and (19), it is seen that the output of $PE_2$ is required for the trace computation. When $PE_2$ generates its output at the 17th rising clock edge, both $E_3$ and $W_3$ are HIGH. The output of $PE_2$ is latched by $TR_3$ and the leading diagonal elements appear on the parallel adder inputs.

5.3.3. Error covariance module

An ECM is implemented by incorporating NPM into it as shown in Fig. 7. $N_p$ and $\mathrm{tr}(T)$ are the outputs from NPM. The data $N_p$ has arrived at the divider input with a latency of
27 ccs, whereas the data $\lambda + \mathrm{tr}(T)$ has arrived with a latency of 23 ccs. Hence, the data $N_p$ and $\lambda + \mathrm{tr}(T)$ do not enter the divider inputs synchronously. This makes the computation ineffective if care is not taken. To overcome this, the signal $\lambda + \mathrm{tr}(T)$ is delayed by 3 ccs. The updated error covariance is available at the output with a latency of 43 ccs.

Fig. 6. Design of NPM.

Fig. 7. Design of ECM.

5.3.4. Parallel architecture of the RLS algorithm

A parallel architecture of the RLS algorithm, implemented by incorporating ECM into it, is shown in Fig. 8. In addition to the preceding modules it consists of EM, WGM, PM, and a data controller. At the 43rd rising clock edge, the WGM and PM computations are triggered. Subsequently, on the 51st rising clock edge, the PM computation is triggered. Finally, once the pipeline is full (depth 59 ccs), the estimated parameters are issued one at every rising edge of the clock, without a break. While this output is progressing, the stage I and stage II computations are carried out concurrently. Fig. 9 shows the pipelined structure of the architecture. The process through the pipelined design is shown in Fig. 11.

Fig. 8. Parallel architecture of RLS.

Fig. 9. Pipelined RLS architecture structure.

Fig. 10. FPGA-based RLS estimator output with a latency of 59 ccs.

5.3.5. Data controller module

The data controller orchestrates, coordinates, and synchronizes the operations of the modules by generating all the control and handshake signals required for an effective computation. Fig. 12 shows that it consists of counters and a finite state machine (FSM). The counters (C1–C8) generate the control signals, and the FSM generates the address lines.

Fig. 11. Process through the pipelined RLS architecture.

Fig. 12. Block diagram of data controller.

5.3.6. Hardware utilization

The hardware implementation requires arithmetic elements such as adders, multipliers, and dividers. Table 4 lists the hardware requirements. When the system order increases, the number of multipliers also increases. The expression to calculate the number of multipliers is given by

$$\Delta_m = \sum_{q=1}^{5} PE_q + \big((N + 1) - N\big) + 2, \tag{22}$$

where $\Delta_m$ is the increase in the number of multipliers, $PE_q$ is the number of multipliers in the $q$th PE, and $N$ is the matrix order. With an increase in the system order, the number of multipliers increases as shown in Fig. 13.

Table 4. Hardware requirements.

Module    Multiplier    Adder    Divider
NPM       9             4        –
ECM       1             2        1
WGM       3             1        –
EM        3             2        –
PM        1             1        –
Total     17            10       1

Fig. 13. Effect of system order on the multiplier.

5.3.7. Performance evaluation

The performance of the proposed architecture is evaluated by implementing it in a servo position control system as discussed in Section 4. The block diagram of an FPGA-based implementation of the RLS estimator in a servo position control system is shown in Fig. 14.

Fig. 14. FPGA-based servo position control system.

6. Implementation results and discussion

The FPGA and ASIC implementation results of the parallel architecture are presented and discussed in this section.

6.1. FPGA implementation

The architecture has been designed using Verilog. Verification and simulation have been done in ModelSim (6.4g SE). The hardware is mapped to a Xilinx Virtex-5 FPGA device. The total power consumed by the design is estimated using the XPower analyzer.
The synthesis report in Table 5 shows that the proposed architecture can compute an output in 2.946 ns.

Latency analysis in pipelining

Latency ($L$) is the outcome of the pipelined design [28]. In the proposed architecture, the input data is clocked through the pipelined design with a latency of 59 ccs as shown in Fig. 10. Table 6 lists the latency of the design modules, where $b$, $a$, $g$, and $q$ are the latencies of the adder (a), multiplier (m), and divider (d) in the respective modules. The total latency of the design [...] Latency depends on the system order. The expression to find the variation in latency is given by

$$L + \Delta L = (\text{Total})_L + \left[(N + 1)^2 - N^2\right]\ \text{ccs}, \tag{23}$$

$$T_c = \ell\,\frac{1}{f_m}\ \text{ns}, \tag{24}$$

where $T_c$ is the computation time, $\ell$ is the number of iterations, and $f_m$ is the maximum frequency of the FPGA in MHz. The computation time taken for an iteration is found to be 2.946 ns.

6.2. ASIC implementation

The proposed design is implemented in 180 nm technology using the Cadence RTL compiler (version v09-10-p104-1-32-bit). The ASIC implementation report in Table 7 shows that the number of gates used is 1330. For the set clock period of 10000 ps the total data path delay is 2604 ps. This results in a positive timing slack (7396 ps). Hence, the design is found to be satisfactory in the synthesis stage. The schematic diagram of the proposed architecture is shown in Fig. 16.

7. Performance comparison

The performance comparison has been made in terms of processing speed and hardware requirements. A scalar-based direct algorithm mapping (SBDAM) was developed to reduce the computation time taken by the matrix multiplications [39]. In [40–42], a serial architecture was implemented by incorporating SBDAM into RLS, but it resulted in low processing speed. In contrast, the proposed parallel architecture has high processing speed as shown in Table 8. Using SBDAM, the governing equations of the RLS algorithm are decomposed into scalar equations and implemented directly with multipliers and adders. The algorithm is easy to realize. Nevertheless, it requires a large amount of hardware. In contrast, in the proposed
method, the hardware requirements are reduced significantly using the MMM algorithm. This is shown in Table 9. In [40–42], Altera library modules (LM) were used for implementing the RLS architecture. Hence, the hardware mapping was confined to an Altera platform only and could not support ASIC conversion. In the design presented here, no vendor-dependent modules are used, so it can be used with both FPGA and ASIC. The implementation features of the proposed architecture are compared to earlier work in Table 10.

Table 5. Synthesis report.

Maximum frequency             339.156 MHz
Number of slices              666
Number of slice flip-flops    997
Number of 4-input LUTs        810
Total power consumption       1200 mW

Table 6. Latency calculation.

Module      Formula                       L (ns)
NPM         3(b_m + b_a + 2N) + N^2       79.542
ECM         (NPM)_L + a_a + a_m + b_d     47.136
WGM & EM    (ECM)_L + g_a + g_m           23.568
PM          (WGM & ERM)_L + q_m + q_a     23.568
Total                                     173.814

Fig. 15. The effect of system order on latency.

Table 7. ASIC implementation report.

Logic utilization
Type          Gate count    Area         Area %
Sequential    730           41150.894    79.0
Inverter      79            525.571      1.0
Tristate      8             106.445      0.2
Logic         513           10288.555    19.8
Total         1330          52071.466    100

Area and power
Cell area        52071.466
Total area       52071.466
Leakage power    276.769 nW
Dynamic power    7839733.095 nW
Total power      7840009.864 nW

Timing
Violating paths     0
Clock set           10,000 ps
Total path delay    2604 ps
Timing slack        7396 ps

8. Conclusions

A novel FPGA-based high-throughput parallel architecture is proposed to overcome the shortcomings of the RLS identification algorithm for on-line applications. By employing an MMM algorithm in the updating stage of the error covariance, the computation time taken by the matrix multiplications is reduced considerably.
In addition to MMM, a trace technique is also introduced in the same stage to reduce matrix computations. Compared to earlier work, the number of multipliers for implementing the algorithm is found to be significantly reduced. The proposed architecture is resource efficient and has high processing speed. The performance of the proposed architecture has been evaluated by implementing it in a servo position control system. The [...] is mapped to a Xilinx Virtex-5 FPGA device with a maximum frequency of 339.156 MHz. An ASIC implementation of the [...] is also envisaged. With all these features, the proposed RLS architecture is suitable for on-line applications in diverse fields of engineering.

Acknowledgement

The work was supported by the Ministry of Communication [...] Information Technology, Government of India through Special [...] Power Development Program (SMDP) II.

Appendix A

A.1. Least square method

The model shown in Fig. 2 is of theoretical interest only, since measurements are always contaminated with noise [47,52]. For such practical situations, the system model is shown in Fig. 17, and its output takes the form

$$y_k = \sum_{i=0}^{m} a_i u_{k-i} - \sum_{i=1}^{n} b_i y_{k-i} + v_k, \tag{A.1}$$

$$y_k = \hat{u}_k^{T}\,\theta + v_k, \tag{A.2}$$

where $\hat{u}_k^{T}$ is called the observation matrix, $\hat{u}_k^{T} = [\,u_k \cdots u_{k-m}\ \ {-y_{k-1}} \cdots {-y_{k-n}}\,]$, $\theta$ is the parameter vector, $\theta = [\,a_0\ a_1 \cdots b_1\ b_2 \cdots b_n\,]^{T}$, $\{n_k\}$ is the measurement noise sequence, and $v_k = n_k + \sum_{i=1}^{n} b_i n_{k-i}$. The concatenated form of Eq. (A.1) is given by

$$A_p\,\theta = y_p - v_p, \tag{A.3}$$

where $p = m + n - 1$ is the number of equations for the estimation, and the concatenated observation matrix is

$$A_p = \begin{bmatrix} u_k & \cdots & u_{k-m} & -y_{k-1} & \cdots & -y_{k-n} \\ u_{k+1} & \cdots & u_{k-m+1} & -y_k & \cdots & -y_{k-n+1} \\ \vdots & & \vdots & \vdots & & \vdots \\ u_{k+p-1} & \cdots & u_{k+p-m-1} & -y_{k+p-2} & \cdots & -y_{k+p-n-1} \end{bmatrix}. \tag{A.4}$$

[...]

Fig. 16. Schematic diagram of RLS parallel architecture.

Table 8. Comparison of processing speed.

Arch.       Freq. (MHz)    Time (ns)    SR^a (s/s)
Proposed    339.156        2946         400,000
[40]        13.03          4602         200,000
[41,42]     10             6000         166,000

^a Sampling rate.

Table 9. Comparison of hardware requirements.

Arch.       Multiplier    Adder    Divider
Proposed    19            10       1
[40,41]     60            35       20
[42]        44            40       20

Table 10. Comparison of implementation features.

Arch.       Design       Mapping        ASIC
Proposed    Verilog      Any FPGA       Supported
[40–42]     Altera LM    Only Altera    Not supported

Fig. 17. System and measurement noise.
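The RLS recursion with a forgetting factor described in Section 3 (prediction error, covariance update, gain, parameter update) and the trace shortcut of Section 5.2 can be sketched in software as follows. This is a minimal illustration and not the authors' hardware design: the second-order test system, its coefficients, the initial covariance of 1000·I, and the function names `rls_update` and `identify` are all assumptions made for the example.

```python
import random


def rls_update(theta, P, u, y, lam=0.98):
    """One RLS step with forgetting factor lam.

    theta: parameter estimate (list), P: covariance (list of lists),
    u: regressor vector (list), y: measured output (float).
    """
    n = len(theta)
    # Prediction error: e = y - u^T theta
    e = y - sum(u[i] * theta[i] for i in range(n))
    # P u (column) and u^T P (row)
    Pu = [sum(P[i][j] * u[j] for j in range(n)) for i in range(n)]
    uP = [sum(u[i] * P[i][j] for i in range(n)) for j in range(n)]
    # Trace technique: the diagonal of T = P u u^T is (P u)_i * u_i,
    # so tr(T) equals the scalar D = u^T P u.
    D = sum(Pu[i] * u[i] for i in range(n))
    # Covariance update: P_new = (P - P u u^T P / (lam + D)) / lam
    P_new = [[(P[i][j] - Pu[i] * uP[j] / (lam + D)) / lam
              for j in range(n)] for i in range(n)]
    # Gain from the updated covariance: K = P_new u
    K = [sum(P_new[i][j] * u[j] for j in range(n)) for i in range(n)]
    # Parameter update: theta_new = theta + K e
    theta_new = [theta[i] + K[i] * e for i in range(n)]
    return theta_new, P_new, e


# Hypothetical test system (not from the paper):
# y_k = 0.5*y_{k-1} + 1.0*u_{k-1} + 0.3*u_{k-2}
TRUE_PARAMS = [0.5, 1.0, 0.3]


def identify(steps=300, seed=0):
    """Run RLS on noise-free data from the hypothetical system."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    P = [[1000.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    y1 = u1 = u2 = 0.0  # y_{k-1}, u_{k-1}, u_{k-2}
    for _ in range(steps):
        phi = [y1, u1, u2]
        y = sum(t * p for t, p in zip(TRUE_PARAMS, phi))
        theta, P, _ = rls_update(theta, P, phi, y)
        u2, u1, y1 = u1, rng.gauss(0.0, 1.0), y
    return theta


if __name__ == "__main__":
    print(identify())
```

In software the trace shortcut saves nothing, since both forms reduce to the same sum; in the hardware design its benefit is that the scalar is obtained from a product that the multiplier chain already produces.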

FPGA Foreign-Language Material 107


IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 62, NO. 5, MAY 2015 2859

A Modular Multilevel Converter Pulse Generation and Capacitor Voltage Balance Method Optimized for FPGA Implementation

Wei Li, Member, IEEE, Luc-André Grégoire, Student Member, IEEE, and Jean Bélanger, Member, IEEE

Abstract: To generate numerous gating signals at a fast rate, industry controllers of modular multilevel converters (MMCs) usually implement the pulse generation function in field-programmable gate array (FPGA) boards. Many methods of submodule (SM) capacitor voltage balance control (VBC) require knowing the gating signals and are therefore also implemented in the same FPGA. As the number of SMs in an MMC increases, both the latency and the resources required for the implementation could become too large to meet the control requirements or fit into the FPGA. Conventional methods impose a limitation on the design of large MMCs. This paper presents a pulse generation and VBC method that is optimized for FPGA implementation. With the fewest comparison operations, this method produces the same valve voltage as other modulation methods, and it removes the need for a sorting operation in VBC, which is the main difficulty in FPGA implementation. The proposed method is implemented in the FPGA-based RT-LAB real-time simulator and tested in a hardware-in-the-loop setup.
The performance of this method is validated in various tests.

Index Terms: Field-programmable gate array (FPGA), modular multilevel converter (MMC), power system simulation, real-time systems.

Manuscript received February 13, 2014; revised May 22, 2014, July 22, 2014, and September 8, 2014; accepted September 17, 2014. Date of publication October 14, 2014; date of current version April 8, 2015. The authors are with Opal-RT Technologies, Montréal, QC H3K 1G6, Canada (e-mail: wei.li@; luc-andre.gregoire@; jean.belanger@). Color versions of one or more of the figures in this paper are available online at . Digital Object Identifier 10.1109/TIE.2014.2362879

I. INTRODUCTION

Modular multilevel converters (MMCs) are gaining popularity in high-voltage direct current (HVDC) applications. Currently, two multiterminal MMC HVDC projects are being built in Nan'ao and Zhoushan, China. While the MMC has many advantages, including low ac harmonic contents, low switching loss, fast fault recovery, and high reliability, it also presents many challenges [1]–[6]. One of them is the implementation of the pulse generation and capacitor voltage balance control (VBC) due to the enormous number of submodules (SMs) in one MMC.

Different gating signal generation techniques are proposed in the literature for MMC control. Space vector modulation schemes are used in [7]–[9] for MMC with a low number of voltage levels. Multicarrier pulsewidth modulation (PWM) schemes become more applicable as the MMC voltage level increases and, thus, are more commonly proposed in the literature [10]–[22]. Particularly, the phase-shifted multicarrier PWM is used in [11]–[17], and the level-shifted multicarrier PWM is used in [19]–[23]. A comparison and evaluation of the two categories of PWM is given in [24] and [25]. A PWM scheme using the moving-average concept is introduced and used in [26]–[29].

Mainly two capacitor VBC approaches, namely, the individual control loop approach and the pulse reassignment approach, are proposed in the literature. In the first approach, capacitor voltage
control loops are added for each SM, and therefore, all references for pulse generation are different [10]–[14], [30], [31]. By using two loops, capacitor voltage balancing among valves and inside each valve can be achieved. Since the reference combines the components from different control loops, tuning the control parameters, which is system dependent, becomes important but difficult [11]. A large weight of the VBC signal in the reference could affect other control loops, whereas a small weight could lead to a slow response.

For the second approach, there is no capacitor voltage control loop. The voltage reference is the same for all SMs in one valve. The generated pulses are reassigned to SMs according to the sorting results of the capacitor voltage and valve current direction [8], [15]–[22], [25]–[28], [32]. This approach, which is effective for balancing inside each valve, is decoupled from other control loops and does not require tuning control parameters. It usually has faster response compared with the individual control loop approach. As this approach is applied after pulse generation, both the pulse generation and VBC are usually implemented in the field-programmable gate array (FPGA) [7], [21], [22], [27], [28]. The difficulty is in the FPGA implementation since the conventional methods need to sort the capacitor voltages in ascending or descending order and reassign the pulses according to the sorting result.

This paper discusses the practical difficulties in implementing the pulse generation and VBC, particularly for MMC with a large number of SMs. A method optimized for FPGA implementation is then presented.

II. PULSE GENERATION

The MMC topology with a half-bridge SM is given in Fig. 1.
When the capacitor voltages are well controlled to the nominal value, i.e., $V_{cap \cdot nom}$, the valve output voltage, i.e., $V_{MMC}$, is expressed as

$$V_{MMC} = \sum (N_i \cdot V_{cap \cdot i}) = V_{cap \cdot nom} \sum N_i \qquad (1)$$

where $N_i$ is the gating pulse, $V_{cap \cdot i}$ is the capacitor voltage, and subscript $i$ denotes that the value is for the $i$th SM. The MMC performance, including its harmonic contents, is only determined by the pulse summation or the number of ON-state SMs. The selection of the ON-state SM will not affect the results.

0278-0046 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See /publications_standards/publications/rights/index.html for more information.

Fig. 1. Schematic of MMC and SM.

Fig. 2. Multiple carriers and reference for the phase-shifted PWM method.

The principle of the proposed pulse generation method is to generate the exact same number of ON-state SMs as other modulation methods but with minimal operation. Thus, it is optimized for FPGA implementation without any negative impact on the MMC performance or harmonic contents. The selection of the ON-state SM is made at a later VBC stage.

A typical multicarrier phase-shifted PWM is illustrated in Fig. 2. The number of triangle carriers is the same as the SM number in one MMC valve, which is denoted as $M$. The carriers are evenly interleaved, i.e., a $2\pi/M$ phase shift between every two consecutive carriers. The magnitudes of all carriers and the reference are multiplied by $M$ for easier demonstration. For each SM, the pulse is determined by comparison between the reference and the carrier. The summation, i.e., $N_\Sigma$, is given as

$$N_\Sigma = \sum N_i = \sum U(S_{ref}, S_{car \cdot i}) \qquad (2)$$

where $S_{ref}$ is the reference, $S_{car \cdot i}$ is the carrier, and $U(x, y)$ is the comparison operator, which gives 1 if the first parameter $x$ is greater than the second parameter $y$, or 0 otherwise.

The carriers take the range of $[0, M]$ on the y-axis, which can be evenly divided into $M$ contiguous bands. The $i$th band is between $(i-1)$ and $i$ on the y-axis, where $i$ is $1, 2, \ldots, M$. The carriers cross each other simultaneously $(2M)$ times per carrier period. The crossing points are all at the band boundaries. At any time, each band contains only one carrier.

Since all SMs are not differentiated at this stage, the carriers can be reindexed without affecting the value of $N_\Sigma$. At any time, the carrier currently in the $i$th band is dynamically indexed as the new $i$th carrier, which is marked as $S'_{car \cdot i}$ (see Fig. 3). $S'_{car \cdot i}$ takes a value in the range $[i-1, i]$. Thus, (2) becomes

$$N_\Sigma = \sum_{i=1}^{M} U(S_{ref}, S'_{car \cdot i}) = \sum_{i=1}^{SN} U(S_{ref}, S'_{car \cdot i}) + U(S_{ref}, S'_{car \cdot (SN+1)}) + \sum_{i=SN+2}^{M} U(S_{ref}, S'_{car \cdot i}) = SN + U(S_{ref} - SN,\; S'_{car \cdot (SN+1)} - SN) \qquad (3)$$

where $SN$ is the integer part of $S_{ref}$. The term $(S_{ref} - SN)$ represents the fractional part of $S_{ref}$, which is identified as $SR$. The waveform of $(S'_{car \cdot (SN+1)} - SN)$, which is identified as $S_{car}$, is either the same as or the reverse of $S'_{car \cdot 1}$, depending on whether $SN$ is even or odd. Therefore, the pulse summation, rewritten as

$$N_\Sigma = SN + U(SR, S_{car}) \qquad (4)$$

can be achieved by adding the integer part of the reference to the comparison result of the fractional part of the reference with a new carrier, which has a frequency $M$ times that of the original multicarriers. The carrier flips when the integer part of the reference is an even number. Fig. 4 gives the schematic diagram, and Fig. 5 gives the new reference and carrier waveforms for the same case in Fig. 2.

Fig. 3. Reindexed carriers and reference for the phase-shifted PWM method.

Fig. 4. Schematic diagram to achieve the pulse summation $N_\Sigma$.

Fig. 5. $N_\Sigma$ is the sum of (a) the reference integer part, and (b) comparison result between the reference fractional part (red) and a reindexed carrier (black).

Fig. 6. Carriers of level-shifted PWM for (a) the phase disposition method, (b) the phase-opposition disposition method, and (c) the alternative-phase-opposition disposition method.

This method can be easily adapted to other
modulation methods. Three typical multicarrier level-shifted PWM methods are illustrated in Fig. 6. Using the same principle, the pulse summation can also be achieved by (4), where $S_{car}$ has the same frequency as the original level-shifted multicarrier. The $S_{car}$ waveforms for the phase-shifted and the three level-shifted PWM methods are slightly different, but can be generated by the same code in the FPGA.

When the number of SMs in a valve increases to a large number, e.g., 200, the harmonic becomes less of a concern. Some industry controllers use nearest level control (NLC) modulation to reduce the switching loss. This method can also be adapted to the NLC modulation. By setting $S_{car}$ in (4) to the constant 1, 0.5, or 0, the resulting $N_\Sigma$ gives the floor integer, rounded integer, or ceiling integer of the reference, respectively.

Therefore, the proposed method can produce the same valve voltage as other modulation methods, e.g., phase-shifted PWM, level-shifted PWM, or NLC, without recompilation of the FPGA program. The user can check the impact of different methods and carrier frequencies on the system harmonics on-the-fly.

More important, this method requires the comparison operation only once, regardless of the SM number. Compared with the conventional multicarrier PWM methods, which need $M$ carriers and $M$ comparison operations, it takes minimum fixed FPGA resources regardless of the MMC size.

III. CAPACITOR VOLTAGE BALANCE

A. Difficulties of Implementing Sorting Algorithm in FPGA

To balance the capacitor voltage, the conventional pulse reassignment approaches need to sort the capacitor voltage. In practice, the bubble sort algorithm is often used due to its simplicity to program in FPGA. For real-time applications, the algorithm performance in the worst case is considered. To sort $M$ SMs, the bubble sort needs $(M-1)$ passes. In the $n$th pass, $(M-n)$ steps consecutively compare, and swap if necessary, a pair of adjacent voltages.
This algorithm requires a total of $0.5M(M-1)$ steps and has complexity of $O(M^2)$, where the big O notation describes the limiting behavior of a function, and in this case, the complexity is asymptotically equivalent to $M^2$ when $M$ tends toward infinity.

The bubble sort implementation in FPGA can be in series, e.g., with one function of the compare-and-swap step being called $0.5M(M-1)$ times, or in partial parallel, e.g., with multiple functions being called fewer times. Each call of the function will take one FPGA clock. Hence, the series implementation will take at least $0.5M(M-1)$ FPGA clocks. In a partial-parallel implementation of $(M-1)$ functions, each function works for one pass, and the function for the first pass is called the most times, i.e., $(M-1)$ times. Thus, it takes a minimum of $(M-1)$ FPGA clocks. Compared with the series implementation, the partial-parallel implementation is faster. However, for a large $M$, it is still too slow for the control requirements and takes too much FPGA resources for the large number of functions.

Moreover, once the capacitor voltages are sorted, the pulses have to be reassigned according to the sorting results. This has to be done individually, which could also take considerable FPGA resources.

Improvements on the bubble sort algorithm are proposed for MMC applications, but the complexity order remains similar.
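The step count quoted above can be checked with a few lines of software. The following Python sketch is an illustration only, not the authors' FPGA code: it runs the fixed worst-case schedule described in the text ($M-1$ passes, $M-n$ compare-and-swap steps in pass $n$, no early exit) and counts every step. The function name `bubble_sort_steps` is an assumption introduced here.

```python
def bubble_sort_steps(values):
    """Bubble sort with the fixed worst-case schedule; returns (sorted list, step count).

    Hypothetical helper for illustration: each compare-and-swap counts as one
    step, which the text equates with one FPGA clock in a series implementation.
    """
    v = list(values)
    m = len(v)
    steps = 0
    for n in range(1, m):            # passes n = 1 .. M-1
        for j in range(m - n):       # pass n performs M-n compare-and-swap steps
            steps += 1
            if v[j] > v[j + 1]:
                v[j], v[j + 1] = v[j + 1], v[j]
    return v, steps


if __name__ == "__main__":
    sorted_v, steps = bubble_sort_steps([3, 1, 2, 5, 4])
    print(sorted_v, steps)           # 5 values need 0.5 * 5 * 4 = 10 steps
```

Because the schedule is fixed, the count depends only on $M$, which is why the worst-case figure is the one that matters for a real-time design.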
There exist other sorting algorithms with better worst-case complexity of $O(M \log M)$.

By its nature, the FPGA resources and latency of a sorting algorithm increase dramatically as $M$ increases. The VBC could become too slow to meet the upper level control requirements, or the implementation is too large to be accommodated in the FPGA. The design of the SM number in an MMC is limited by the VBC using the sorting method.

B. Flowchart of Proposed Max/Min Method

Without a sorting operation, the proposed VBC method only needs to find the SM of the maximum or minimum capacitor voltage. The actual number of ON-state SMs, i.e., $N_{\Sigma act}$, is calculated and compared with its reference, i.e., $N_{\Sigma ref}$, in each FPGA cycle, the simulation time step in the FPGA. Depending on the result and the current direction (the charging direction is defined as positive), this method changes, if necessary, only one SM's state according to the following rules.

• If $N_{\Sigma act} < N_{\Sigma ref}$ and positive current, turn on the OFF-state SM of the minimum capacitor voltage.
• If $N_{\Sigma act} < N_{\Sigma ref}$ and nonpositive current, turn on the OFF-state SM of the maximum capacitor voltage.
• If $N_{\Sigma act} > N_{\Sigma ref}$ and positive current, turn off the ON-state SM of the maximum capacitor voltage.
• If $N_{\Sigma act} > N_{\Sigma ref}$ and nonpositive current, turn off the ON-state SM of the minimum capacitor voltage.
• If $N_{\Sigma act} = N_{\Sigma ref}$, no switching.

The flowchart for the proposed method is given in Fig. 7.
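The five rules above can be sketched in software as a single per-cycle update. This is a minimal Python illustration under stated assumptions, not the authors' FPGA implementation: the function name `vbc_step`, the list-based state representation, and the use of filtering (rather than the voltage masking the paper describes for its flowchart) are all choices made here for readability. It assumes $0 \le N_{\Sigma ref} \le M$.

```python
def vbc_step(states, voltages, n_ref, current_positive):
    """Apply the max/min rules once; switches at most one SM.

    states: list of bool, True for an ON-state SM (modified in place).
    voltages: list of capacitor voltages, same length as states.
    n_ref: reference number of ON-state SMs.
    current_positive: True when the valve current charges the capacitors.
    Returns the index of the SM that changed state, or None.
    (Hypothetical helper; names and data layout are assumptions.)
    """
    n_act = sum(states)
    if n_act == n_ref:
        return None                              # rule 5: no switching
    turn_on = n_act < n_ref
    # Turning on considers only OFF-state SMs; turning off only ON-state SMs.
    candidates = [i for i, s in enumerate(states) if s != turn_on]
    # Minimum voltage is chosen for (turn on, positive) and (turn off, nonpositive);
    # maximum voltage is chosen in the other two cases.
    pick = min if turn_on == current_positive else max
    idx = pick(candidates, key=lambda i: voltages[i])
    states[idx] = turn_on
    return idx


if __name__ == "__main__":
    states = [True, False, False, True]
    volts = [1.0, 0.9, 1.1, 1.2]
    print(vbc_step(states, volts, n_ref=3, current_positive=True), states)
```

Only one minimum or maximum search is needed per cycle, which is the property the paper relies on to avoid a full sort.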
There are seven steps in each FPGA cycle. Step 1 reads the pulse summation generated in Fig. 4 as the reference number of ON-state SMs. Step 2 initializes the SM states if it is the first cycle since the MMC pulse is enabled. As all capacitors are charged to a similar voltage in the diode mode, the first $N_{\Sigma ref}$ SMs are set to ON-state and the others to OFF-state. In Step 3, the actual number of ON-state SMs is calculated and compared with the reference value. Steps 4–7 take action according to the rules explained above. In Step 5, Cmax and zero are the maximum and minimum possible capacitor voltages. Since this method searches only the maximum or minimum capacitor voltage of SMs with a specific state, this step is to exclude the SMs of the opposite state from being selected. For example, to select a previous OFF-state SM of the minimum voltage, the voltages of all previous ON-state SMs are set to the value of Cmax. Note in Steps 5 and 6 that the logic in different paths is similar and can be implemented using the same function to save FPGA resources.

Fig. 7. Flowchart of proposed capacitor voltage balance method.

C. Complexity and Latency of Proposed Method

To find the maximum or minimum among $M$ values, $(M-1)$ comparison operations are required, which has complexity of $O(M)$, a lower order than the complexity of sorting operations.

The implementation in FPGA can be in series, parallel, or partial parallel. The series implementation has only one function being called $(M-1)$ times. The parallel implementation requires floor$(M/2)$ functions and a maximum of ceiling$(\log_2(M))$ calls for one function, where floor$(x)$ and ceiling$(x)$ give the floor and ceiling integers of $x$. In the partial-parallel implementation, the $M$ voltages are divided into $K$ groups. Each group is treated in series, and the $K$ finalists of each group are treated in parallel. It requires $(M/K - 1 + \text{ceiling}(\log_2(K)))$ number of FPGA clocks and $(K + \text{floor}(K/2))$ number of
functions.

The series implementation takes longer time, and the parallel implementation takes more FPGA resources. The partial-parallel implementation provides a good combination of speed and resources, which can be adjusted by changing the value $K$. For example, when a valve has 1024 SMs, the required FPGA resources and latency for the implementation of the bubble sorting and the proposed method are summarized in Table I. Even taking enormous FPGA resources (1023 functions), the bubble sorting in partial-parallel implementation takes more than 10 μs (for a 10-ns FPGA clock), which might not meet the controller's requirements. The proposed method (with $K = 32$) takes approximately 21 times less resource and is 28 times faster (in less than half a μs).

TABLE I. FPGA RESOURCES AND LATENCY FOR DIFFERENT METHODS

D. Features of Proposed Method

When a big disturbance occurs to the system, the valve voltage has to rapidly change according to its reference in order to recover the system fast. That means the derivative of the valve voltage has to be large. This method can change the number of ON-state SMs by one at each FPGA cycle as in Fig. 7, which is smaller than the switch signal sampling time. For example, if the sampling time is 10 μs and the FPGA cycle is 500 ns, a maximum of 20 (= 10 μs/500 ns) SMs of the specific state with the first 20 maximum or minimum capacitor voltages could change their states at each 10 μs sampling period. Therefore, this method can achieve a very high valve voltage derivative.
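The figures quoted in the last two subsections (21 times fewer resources, 28 times faster, under half a microsecond, 20 SMs per sampling period) follow from the clock and function counts stated in the text. The Python sketch below simply reproduces that arithmetic; the function names are assumptions introduced here, and the partial-parallel formula is applied under the assumption that $M$ is divisible by $K$.

```python
def maxmin_partial_parallel(m, k):
    """Clocks and functions for the max/min search with K groups.

    Uses the counts stated in the text: M/K - 1 + ceiling(log2(K)) clocks
    and K + floor(K/2) functions. Assumes m is divisible by k.
    (Hypothetical helper for illustration.)
    """
    ceil_log2_k = (k - 1).bit_length()   # integer ceiling(log2(k)) for k >= 1
    clocks = m // k - 1 + ceil_log2_k
    functions = k + k // 2
    return clocks, functions


def bubble_partial_parallel(m):
    """Clocks and functions for partial-parallel bubble sort: M-1 of each."""
    return m - 1, m - 1


if __name__ == "__main__":
    clock_ns = 10                                    # 10-ns FPGA clock, as in the text
    mm_clocks, mm_funcs = maxmin_partial_parallel(1024, 32)
    bs_clocks, bs_funcs = bubble_partial_parallel(1024)
    print("max/min:", mm_clocks, "clocks,", mm_funcs, "functions,",
          mm_clocks * clock_ns, "ns")
    print("bubble: ", bs_clocks, "clocks,", bs_funcs, "functions,",
          bs_clocks * clock_ns, "ns")
    print("resource ratio:", bs_funcs / mm_funcs)
    print("speed ratio:   ", bs_clocks / mm_clocks)
    print("SM changes per 10 us sample at 500 ns cycle:", 10_000 // 500)
```

With $M = 1024$ and $K = 32$ this gives 36 clocks (360 ns) and 48 functions against 1023 clocks (10.23 μs) and 1023 functions for the bubble sort, consistent with the ratios quoted in the text.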
The per-unitized maximum absolute value is given as

$$\left| \frac{d}{dt} V_{MMC}^{pu} \right|_{max} = \frac{|\Delta N_\Sigma|_{max} / M}{\Delta t} = \frac{1/M}{T_{FPGAcycle}} \qquad (5)$$

which is 1000 pu/s for an MMC with 2000 SMs per valve, and the FPGA cycle, i.e., $T_{FPGAcycle}$, being 500 ns. For a typical application with fewer SMs and an equal or smaller FPGA cycle, the maximum derivative is even larger than 1000 pu/s, much higher than a control design may require. Therefore, by changing only one SM's state at each FPGA cycle, this method will not decrease the system performance or slow down the system recovery at big disturbances.

In a phase-shift modulation method with $M$ SMs and carrier frequency of $f_{car}$, each SM has a switching frequency of $f_{car}$ and switches twice (one for switching-on and another for switching-off) in each carrier signal period. The total switching number is $2M$ in one valve. As in Fig. 3, the proposed method only reindexes the carriers without modifying their pattern, so the total switching number in a valve is not changed, $2M$ in each period in this case. Although in one period, each individual SM might switch more or fewer times, the average switching number and, thus, the average switching frequency, is the same as in the original modulation method. For the level-shifted or NLC modulation method, the adapted method has the same average switching frequency as the original method.

For multicarrier modulation methods, a reference may cross an intersection point of two carriers, which means one SM switches to ON-state and the other switches to OFF-state at the same instant. For those rare occasions, the total number of ON-state SMs does not change, and thus, the proposed method has less total switching number than the original methods.

Fig. 8. Front and back views of the RT-LAB simulator-based HIL test bench.

Fig. 9. Schematic of the MMC ac–dc–ac system.

An optional watchdog can be added as in Step 4 in Fig. 7, to force one switching if a no-switching period lasts too long, which only occurs in abnormal conditions such
such as a constant voltage reference.The gating signals are sent from the controller to MMC devices through fiber optic or copper wires.Note that the sampling time in the I/O may not be same as the FPGA cycle.The I/O sampling time is in a few μs to tens of μs in industrial controllers.Having a much smaller FPGA cycle,e.g.,500ns,the same implementation of this method can be used for different I/O sampling times with minimum aliasing effect of two sampling time systems.Inside the FPGA,the synchronization requirement between the pulse generation part and the I/O driver becomes trivial.IV .T EST B ENCH S ETUP AND S TUDY S YSTEMA hardware-in-the-loop (HIL)test bench,based on the RT-LAB real-time simulation platform,is set up to validate the proposed method (see Fig.8).The test system is a two-terminal MMC HVDC system (see Fig.9).One MMC terminal is controlled by an external controller,whereas the other has an internal controller simulated in the same simulator.The system parameters of the external controller side are given in Table II.The SM capacitance is selected to store typically 1.5cycle of the energy for HVDC applications.The external controller is simulated in a second independent real-time simulator.The MMC valve control,using the pro-posed pulse generation and VBC method,is implemented in a Virtex-7FPGA board with a cycle of 500ns.The MMC pole control,explained in [33],is implemented in the CPU with a sampling rate of 25μs.TABLE IIS YSTEM P ARAMETER ON THE E XTERNAL C ONTROLLER SIDEThe system measurements,including the ac-side voltages and currents,dc-link voltages,and valve currents,are sent from the plant to the controller through copper wires (the white cables in the back view).They have to be calibrated to minimize the error and noise introduced in the analog and digital conversion and in the cables.The MMC measurements and commands are transferred through optical fibers.Each pair of fibers is used for one valve,and thus,six fibers are used for one station 
(the orange cables in the front view). For the controller, an outgoing message includes one valve current and 250 capacitor voltages, and an incoming message includes 250 MMC commands. No calibration is required since all signals are transferred in digital format.

The update rates of the outgoing and incoming messages can be individually adjusted during real-time simulation to study their impacts on the system performance.

V. CASE STUDY AND RESULTS

A. Performances at System Transient

Fig. 10 provides the MMC waveforms at the system transient when the reactive power reference changes from 0.5 to −0.5 pu and the active power reference remains 0. The control has a rate limiter; hence, the reference ramps to its final value in five cycles for a smooth transient. The carrier frequency is 300 Hz, i.e., six times the system frequency. The MMC measurements and commands are updated every 20 and 2 μs, respectively.

Note that the MMC terminal voltage, terminal current, and valve currents are highly sinusoidal and well controlled. Fig. 10(e) gives the individual capacitor voltages of the first three SMs and the upper and lower boundaries of all capacitor voltages in one valve. All capacitor voltages are controlled in a very narrow band within the boundaries and, therefore, are well balanced.

In the plant simulator, the power grid is simulated in CPU with a time step of 25 μs, and the MMC is simulated in the FPGA with a time step of 500 ns. Therefore, those system measurements in Fig. 10(a)–(d) have a resolution of 25 μs. Every 25 μs, a group of 32 capacitor voltages is sent from the FPGA to the CPU for data logging purposes only. Therefore, the voltages in Fig. 10(e) have a resolution of 1.5 ms and might not be simultaneously sampled. The upper and lower boundaries are calculated by the logged data and, thus, have a small error due to the consecutive logging manner. The actual difference between the boundaries should be smaller than the calculated value.

Fig. 10. Waveforms at a system transient. (a) MMC ac voltages, (b) ac currents, (c) ac-side active and reactive power, (d) valve currents, and (e) capacitor voltages of three SMs and the upper and lower boundaries of one valve.

B. Impact of Carrier Frequency and Signal Rate

The carrier frequency has a significant impact on the VBC performance. Fig. 11 gives the capacitor voltages and their boundaries when the carrier frequency is 50 Hz, the active and reactive power references are 0 and −0.5 pu, respectively. The other parameters are the same as the previous test. The voltages vary in a larger band between the upper and lower boundaries compared with that in Fig. 10.

For the two carrier frequencies, the bandwidth, i.e., the upper boundary minus the lower boundary, is given in Fig. 12. The time-averaged values at different carrier frequencies are given in Table III. Note that increasing the carrier frequency will have a better balance on the capacitor voltage because each SM switches more often.

Fig. 11. System waveforms as the carrier frequency is 50 Hz. (a) MMC valve currents. (b) Capacitor voltages of three SMs and the upper and lower boundaries of one valve.

Fig. 12. Width of capacitor voltage band when the carrier frequency is 50 Hz (red) and 300 Hz (blue).

TABLE III. WIDTH OF VOLTAGE BAND AT DIFFERENT CARRIER FREQUENCIES

TABLE IV. AVERAGE WIDTH OF VOLTAGE BAND AT DIFFERENT SIGNAL COMMUNICATION RATES

At a certain carrier frequency, e.g., 200 Hz, good performance is achieved. Further increasing the carrier frequency improves the performance, but only slowly.

The effect of the MMC measurements and commands update rates on the VBC performance is studied and summarized in Table IV. The results are achieved when the carrier frequency is 200 Hz, the active and reactive power references are 0 and −0.5 pu, respectively. It is observed that the signal update rate between the MMC and its controller has little effect on the performance.

C. Performances at Single SM Short-Circuit Fault

Normally, the short circuit of
a capacitor could cause permanent damage of the device, and the faulty SM has to be bypassed. In this paper, the hypothetical temporary short circuit, where the device recovers after the fault is cleared, is used to examine the effectiveness of the proposed VBC method in an extreme condition where the capacitor voltage of some SM deviates far away from others.

The capacitor voltage in the fault SM is completely discharged to 0 before the fault is cleared in 25 μs. In all the following fault tests, the carrier frequency is 200 Hz; the update rates of the MMC measurements and commands are 100 and 20 μs, respectively. Since this study is focused on voltage balancing within each valve, the same faults are applied to all six valves to eliminate the effects from other control loops.

Fig. 13 shows the system response to a temporary fault at SM 1 when the active and reactive power is 0.5 pu and 0, respectively. The complete discharge of one SM has a negligible impact on the ac voltages, ac currents, and valve currents. The SM 1 capacitor voltage, coincident with the lower boundary as in Fig. 13(d), recovers fast after the fault. The recovery time, i.e., the interval between the fault clearance and the instant that the width of the voltage band reduces to less than 5%, is 49.8 ms, less than three cycles.

Fig. 13. Waveforms at single fault on SM 1. (a) MMC ac voltages, (b) ac currents, (c) valve currents, and (d) capacitor voltage of SM1, SM2, SM3, and voltage boundaries of one valve.

Fig. 14. Capacitor voltage of the fault SM at different power conditions.

TABLE V. SINGLE FAULT RECOVERY TIMES AT DIFFERENT POWER CONDITIONS

Fig. 15. Waveforms at simultaneous fault on SM1, SM2, and SM3. (a) MMC ac voltages, (b) ac currents, (c) valve currents, and (d) capacitor voltage of SM1, SM2, SM3, and the voltage boundaries of one valve.

In each cycle, the valve current changes its direction twice. At recovery, the fault SM is switched on at the beginning of the charging half cycle to
increase its capacitor voltage and is switched off at the beginning of the discharge half cycle to maintain its voltage. Depending on the fault point on the valve current waveform, the recovery time may vary by a half cycle, i.e., 10 ms.

The capacitor voltage charging rate is determined by the valve current magnitude, which is proportional to the apparent power in the steady state. The capacitor voltages of the fault SM at different power conditions are given in Fig. 14, and the recovery times are given in Table V. Generally speaking, the larger the apparent power is, the faster the fault SM recovers.

D. Performances at Multiple SM Short-Circuit Fault

Multifault scenarios are studied where simultaneous short-circuit faults are applied to multiple SMs and cleared after 25 μs. Fig. 15 shows the system response when the SM 1, 2,


High Level Programming for Real Time FPGA Based Image Processing

D Crookes, K Benkrid, A Bouridane, K Alotaibi and A Benkrid
School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, UK

ABSTRACT

Reconfigurable hardware in the form of Field Programmable Gate Arrays (FPGAs) has been proposed as a way of obtaining high performance for computationally intensive DSP applications such as Image Processing (IP), even under real time requirements. The inherent reprogrammability of FPGAs gives them some of the flexibility of software while keeping the performance advantages of an application specific solution.

However, a major disadvantage of FPGAs is their low level programming model. To bridge the gap between these two levels, we present a high level software environment for FPGA-based image processing, which aims to hide hardware details as much as possible from the user. Our approach is to provide a very high level Image Processing Coprocessor (IPC) with a core instruction set based on the operations of Image Algebra. The environment includes a generator which generates optimised architectures for specific user-defined operations.

1. INTRODUCTION

Image Processing application developers require high performance systems for computationally intensive Image Processing (IP) applications, often under real time requirements. In addition, developing an IP application tends to be experimental and interactive. This means the developer must be able to modify, tune or replace algorithms rapidly and conveniently.

Because of the local nature of many low level IP operations (e.g. neighbourhood operations), one way of obtaining high performance in image processing has been to use parallel computing [1]. However, multiprocessor IP systems have generally speaking not yet fulfilled their promise.
This is partly a matter of cost, lack of stability and software support for parallel machines; it is also a matter of communications overheads, particularly if sequences of images are being captured and distributed across the processors in real time.

A second way of obtaining high performance in IP applications is to use Digital Signal Processing (DSP) processors [2,3]. DSP processors provide a performance improvement over standard microprocessors while still maintaining a high level programming model. However, because of the software based control, DSP processors still have difficulty in coping with real time video processing.

At the opposite end of the spectrum lie the dedicated hardware solutions. Application Specific Integrated Circuits (ASICs) offer a fully customised solution to a particular algorithm [4]. However, this solution suffers from a lack of flexibility, plus the high manufacturing cost and the relatively lengthy development cycle.

Reconfigurable hardware solutions in the form of FPGAs [5] offer high performance, with the ability to be electrically reprogrammed dynamically to perform other algorithms. Though the first FPGAs were only capable of modest integration levels and were thus used mainly for glue logic and system control, the latest devices [6] have crossed the million gate barrier, hence making it possible to implement an entire System On a Chip. Moreover, the introduction of the latest IC fabrication techniques has increased the maximum speed at which FPGAs can run. Design performance exceeding 150 MHz is no longer outside the realm of possibility in the new FPGA parts, hence allowing FPGAs to address high bandwidth applications such as video processing.

A range of commercial FPGA based custom computing systems includes: the Splash-2 system [7]; the G-800 system [8] and VCC's HOTWorks HOTI & HOTII development [9].
Though this solution seems to enjoy the advantages of both the dedicated solution and the software based one, many people are still reluctant to move toward this new technology because of the low level programming model offered by FPGAs. Although behavioural synthesis tools have made enormous progress [10, 11], structural design techniques (including careful floorplanning) often still result in circuits that are substantially smaller and faster than those developed using only behavioural synthesis tools [12].

In order to bridge the gap between these two levels, this paper presents a high level software environment for an FPGA-based Image Processing machine, which aims to hide the hardware details from the user. The environment generates optimised architectures for specific user-defined operations, in the form of a low level netlist. Our system uses Prolog as the basic notation for describing and composing the basic building blocks. Our current implementation of the IPC is based on the Xilinx 4000 FPGA series [13].

The paper first outlines the programming environment at the user level (the programming model). This includes facilities for defining low level Image Processing algorithms based on the operators of Image Algebra [14], without any reference to hardware details. Next, the design of the basic building blocks necessary for implementing the IPC instruction set is presented. Then, we describe the runtime execution environment.

2. THE USER'S PROGRAMMING MODEL

At its most basic level, the programming model for our image processing machine is a host processor (typically a PC programmed in C++) and an FPGA-based Image Processing Coprocessor (IPC) which carries out complete image operations (such as convolution, erosion etc.) as a single coprocessor instruction. The instruction set of the IPC provides a core of instructions based on the operators of Image Algebra.
The instruction set is also extensible in the sense that new compound instructions can be defined by the user, in terms of the primitive operations in the core instruction set. (Adding a new primitive instruction is a task for an architecture designer.)

The coprocessor core instruction set

Many IP neighbourhood operations can be described by a template (a static window with user defined weights) and one of a set of Image Algebra operators. Indeed, simple neighbourhood operations can be split into two stages:
∙ A 'local' operator applied between an image pixel and the corresponding window coefficient.
∙ A 'global' operator applied to the set of intermediate results of the local operation, to reduce this set to a single result pixel.

The set of local operators contains 'Add' ('+') and 'Multiplication' ('*'), whereas the set of global operators contains 'Accumulation' ('∑'), 'Maximum' ('Max') and 'Minimum' ('Min'). With these local and global operators, the following neighbourhood operations can be built:

For instance, a simple Laplace operation would be performed by doing convolution (i.e. Local operation = '*' and Global operation = '∑') with the following template:

The programmer interface to this instruction set is via a C++ class. First, the programmer creates the required instruction object (and its FPGA configuration), and subsequently applies it to an actual image. Creating an instruction object is generally done in two phases: first build an object describing the operation, and then generate the configuration in a file. For neighbourhood operations, these are carried out by two C++ object constructors:

image_operator (template & operator details)
image_instruction (operator object, filename)

For instructions with a single template operator, these can be conveniently combined in a single constructor:

Neighbourhood_instruction (template, operators, filename)

The details required when building a new image operator object include:
∙ The dimension of the image (e.g. 256 × 256).
∙ The pixel size (e.g. 16 bits).
∙ The size of the window (e.g. 3 × 3).
∙ The weights of the neighbourhood window.
∙ The target position within the window, for aligning it with the image pixels (e.g. 1,1).
∙ The 'local' and 'global' operations.

Later, to apply an instruction to an actual image, the apply method of the instruction object is used:

Result = instruction_object.apply (input image)

This will reconfigure the FPGA (if necessary), download the input pixel data and store the result pixels in the RAM of the IPC as they are generated.

The following example shows how a programmer would create and perform a 3 by 3 Laplace operation. The image is 256 by 256; the pixel size is 16 bits.

2.1 Extending the Model for Compound Operations

In practical image processing applications, many algorithms comprise more than a single operation. Such compound operations can be broken into a number of primitive core instructions.

Instruction Pipelining: A number of basic image operations can be put together in series. A typical example of two neighbourhood operations in series is the 'Open' operation. To do an 'Open' operation, an 'Erode' neighbourhood operation is first performed, and the resulting image is fed into a 'Dilate' neighbourhood operation as shown in Figure 1.

Figure 1 'Open' complex operation

This operation is described as follows in our high level environment:

Task parallel: A number of basic image operations can be put together in parallel. For example, the Sobel edge detection algorithm can be performed (approximately) by adding the absolute results of two separate convolutions. Assuming that the FPGA has enough computing resources available, the best solution is to implement the operations in parallel using separate regions of the FPGA chip.

Figure 2 Sobel complex operation

The following is an example of the code, based on our high level instruction set, to define and use a Sobel edge detection instruction.
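For reference, the same computation can be modelled in plain C++ on the host. The sketch below is a software model under stated assumptions, not the IPC class library: the names `Image`, `Window`, `neighbourhood` and `sobel` are hypothetical, pixels outside the image are assumed to read as zero, the window is applied without flipping, and the template weights are the usual 3 × 3 Sobel kernels, which the text does not list.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <vector>

// Hypothetical host-side types; they are not part of the IPC class library.
struct Image {
    int rows, cols;
    std::vector<int> px;  // row-major pixel values
    Image(int r, int c) : rows(r), cols(c), px(static_cast<std::size_t>(r) * c, 0) {}
    int at(int r, int c) const {
        // Assumption: pixels outside the image read as zero.
        if (r < 0 || r >= rows || c < 0 || c >= cols) return 0;
        return px[static_cast<std::size_t>(r) * cols + c];
    }
};

struct Window {
    int rows, cols;            // window size, e.g. 3 x 3
    int target_r, target_c;    // target position within the window, e.g. (1,1)
    std::vector<int> weights;  // row-major template coefficients
};

using BinaryOp = std::function<int(int, int)>;

// One neighbourhood operation: the local operator combines each pixel with
// its window coefficient, the global operator reduces the intermediate
// results to a single output pixel.
Image neighbourhood(const Image& in, const Window& w,
                    const BinaryOp& local, const BinaryOp& global) {
    assert(static_cast<int>(w.weights.size()) == w.rows * w.cols);
    Image out(in.rows, in.cols);
    for (int r = 0; r < in.rows; ++r) {
        for (int c = 0; c < in.cols; ++c) {
            bool first = true;
            int acc = 0;
            for (int i = 0; i < w.rows; ++i) {
                for (int j = 0; j < w.cols; ++j) {
                    const int v = local(in.at(r + i - w.target_r, c + j - w.target_c),
                                        w.weights[i * w.cols + j]);
                    acc = first ? v : global(acc, v);
                    first = false;
                }
            }
            out.px[static_cast<std::size_t>(r) * out.cols + c] = acc;
        }
    }
    return out;
}

// Approximate Sobel: sum of the absolute results of two convolutions.
Image sobel(const Image& in) {
    const BinaryOp mul = [](int a, int b) { return a * b; };
    const BinaryOp add = [](int a, int b) { return a + b; };
    const Window horizontal{3, 3, 1, 1, {-1, 0, 1, -2, 0, 2, -1, 0, 1}};
    const Window vertical{3, 3, 1, 1, {-1, -2, -1, 0, 0, 0, 1, 2, 1}};
    const Image gx = neighbourhood(in, horizontal, mul, add);
    const Image gy = neighbourhood(in, vertical, mul, add);
    Image out(in.rows, in.cols);
    for (std::size_t k = 0; k < out.px.size(); ++k)
        out.px[k] = std::abs(gx.px[k]) + std::abs(gy.px[k]);
    return out;
}
```

Calling `sobel(img)` on an `Image` returns the edge-strength image. Passing addition as the local operator and a maximum or minimum as the global operator to `neighbourhood` gives the additive/maximum and additive/minimum forms in the same way.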
The user defines two neighbourhood operators (horizontal and vertical Sobel), and builds the image instruction by summing the absolute results from the two neighbourhood operations.

The generation phase will automatically insert the appropriate delays to synchronise the two parallel operations.

3. ARCHITECTURES FROM OPERATIONS

When a new Image_instruction object (e.g. Neighbourhood_instruction) is created (by new), the corresponding FPGA configuration will be generated dynamically. In this section, we will present the structure of the FPGA configurations necessary to implement the high level instruction set for the neighbourhood operations described above. As a key example, the structure of a general 2-D convolver will be presented. Other neighbourhood operations are essentially variations of this, with different local and global operator sub-blocks.

A general 2D convolver

As mentioned earlier, any neighbourhood image operation involves passing a 2-D window over an image, and carrying out a calculation at each window position.

To allow each pixel to be supplied only once to the FPGA, internal line delays are required. These synchronise the supply of input values to the processing elements, ensuring that all the pixel values involved in a particular neighbourhood operation are processed at the same instant [15, 16]. Assuming a vertical scan of the image, Figure 3 shows the architecture of a generic 2-D convolver with a P by Q template. Each Processing Element (PE) performs the necessary Multiply/Accumulate operation.

Figure 3 Architecture of a generic 2-D, P by Q convolution operation

Architecture of a Processing Element

Before deriving the architecture of a Processing Element, we first have to decide which type of arithmetic to use: either bit parallel or bit serial processing.

While parallel designs process all data bits simultaneously, bit serial ones process input data one bit at a time.
The required hardware for a parallel implementation is typically 'n' times that of the equivalent serial implementation (for an n-bit word). On the other hand, the bit serial approach requires 'n' clock cycles to process an n-bit word while the equivalent parallel one needs only one clock cycle. However, bit serial architectures operate at a higher clock frequency due to their smaller combinatorial delays. Also, the resulting layout in a serial implementation is more regular than a parallel one, because of the reduced number of interconnections needed between PEs (i.e. less routing stress). This regularity means that FPGA architectures generated from a high level specification can have more predictable layout and performance. Moreover, a serial architecture is not tied to a particular processing word length. It is relatively straightforward to move from one word length to another with very little extra hardware (if any). For these reasons, we decided to implement the IPC hardware architectures using serial arithmetic.

Note also that the need to pipeline the bit serial Maximum and Minimum operations common in Image Algebra suggests we should process data Most Significant Bit first (MSBF). Following on from this choice, because of problems in doing addition MSBF in 2's complement, there are certain advantages in using an alternative number representation to 2's complement. For the purposes of the work described in this paper, we have chosen to use a redundant number representation in the form of a radix-2 Signed Digit Number system (SDNR) [17]. Because of the inherent carry-free property of SDNR add/subtract operations, the corresponding architectures can be clocked at high speed. There are of course several alternative representations which could have been chosen, each with their own advantages.
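The carry-free property can be illustrated with a small software model. The sketch below is an assumption-laden illustration, not the serial online adder of [20]: the names `SignedDigits`, `from_twos_complement`, `value_of` and `add` are hypothetical, and the adder is a standard two-step (interim sum plus transfer) scheme for radix-2 signed digits, written digit-parallel for clarity. Each sum digit depends only on the operand digits at positions i, i-1 and i-2, so no carry ever ripples along the word.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A radix-2 signed-digit number: digits in {-1, 0, +1}, most significant first.
using SignedDigits = std::vector<int>;

// Convert an n-bit 2's complement value to signed digits, MSB first.
// The sign bit has weight -2^(n-1), so it maps to digit -1 when set;
// every other bit maps directly to digit 0 or +1.
SignedDigits from_twos_complement(std::int32_t value, int n_bits) {
    assert(n_bits >= 1 && n_bits <= 31);
    assert(value >= -(1 << (n_bits - 1)) && value < (1 << (n_bits - 1)));
    SignedDigits d(n_bits);
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    for (int i = 0; i < n_bits; ++i) {
        const int bit = static_cast<int>((bits >> (n_bits - 1 - i)) & 1u);
        d[i] = (i == 0) ? -bit : bit;
    }
    return d;
}

// Evaluate the digits MSB first (Horner's rule).
std::int64_t value_of(const SignedDigits& d) {
    std::int64_t v = 0;
    for (int digit : d) v = 2 * v + digit;
    return v;
}

// Carry-free addition of two equal-length signed-digit numbers.
// Position i (0 = least significant) produces an interim sum w[i] and a
// transfer t[i+1]; the signs of the digits at position i-1 decide the split
// so that w[i] + t[i] always stays within {-1, 0, +1}.
SignedDigits add(const SignedDigits& x, const SignedDigits& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    auto at = [n](const SignedDigits& a, int i) {
        return (i >= 0 && i < n) ? a[n - 1 - i] : 0;
    };
    std::vector<int> w(n + 1, 0), t(n + 1, 0);  // t[i] is the transfer into position i
    for (int i = 0; i < n; ++i) {
        const int p = at(x, i) + at(y, i);
        const bool lower_nonneg = at(x, i - 1) >= 0 && at(y, i - 1) >= 0;
        if (p == 2) {
            t[i + 1] = 1;
        } else if (p == 1) {
            if (lower_nonneg) { t[i + 1] = 1; w[i] = -1; } else { w[i] = 1; }
        } else if (p == -1) {
            if (lower_nonneg) { w[i] = -1; } else { t[i + 1] = -1; w[i] = 1; }
        } else if (p == -2) {
            t[i + 1] = -1;
        }
        // p == 0 leaves both the interim sum and the transfer at zero.
    }
    SignedDigits s(n + 1);  // the result has one extra leading digit
    for (int i = 0; i <= n; ++i) s[n - i] = w[i] + t[i];
    return s;
}
```

For example, `value_of(add(from_twos_complement(7, 4), from_twos_complement(1, 4)))` evaluates the sum of 7 and 1 without any carry chain. The representation is redundant: several digit strings can encode the same value, which is what leaves room for the bounded transfer rule.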
However, the work presented in this paper is based on the following design choices:
∙ Bit serial arithmetic
∙ Most Significant Bit First processing
∙ Radix-2 Signed Digit Number Representation (SDNR) rather than 2's complement.

Because image data may occasionally have to be processed on the host processor, the basic storage format for image data is still, however, 2's complement. Therefore, processing elements first convert their incoming image data to SDNR. This also reduces the chip area required for the line buffers (in which data is held in 2's complement). A final unit to convert an SDNR result into 2's complement will be needed before any results can be returned to the host system. With these considerations, a more detailed design of a general Processing Element (in terms of a local and a global operation) is given in Figure 4.

Figure 4 Architecture of a standard Processing Element

Design of the Basic Building Blocks

In what follows, we will present the physical implementation of the five basic building blocks stated in section 2 (the adder, multiplier, accumulator and maximum/minimum units). These basic components were carefully designed in order to fit together with as little wastage as possible.

The 'multiplier' unit

The multiplier unit used is based on a hybrid serial-parallel multiplier outlined in [18]. It multiplies a serial SDNR input with a two's complement parallel coefficient $B = b_N b_{N-1} \ldots b_1$ as shown in Figure 5. The multiplier has a modular, scaleable design, and comprises four distinct basic building components [19]: Type A, Type B, Type C and Type D. An N bit coefficient multiplier is constructed by:

Type A → Type B → (N-3) * Type C → Type D

The coefficient word length may be varied by varying the number of Type C units. On the Xilinx 4000 FPGA, Type A, B and C units occupy one CLB each, and a Type D unit occupies 2 CLBs. Thus an N bit coefficient multiplier is 1 CLB wide and N+1 CLBs high.
The online delay of the multiplier is 3.

Figure 5 Design of an N bit hybrid serial-parallel multiplier

The 'accumulation' global operation unit

The accumulation unit is the global operation used in the case of a convolution. It adds two SDNR operands serially and outputs the result in SDNR format as shown in Figure 6. The accumulation unit is based on a serial online adder presented in [20]. It occupies 3 CLBs laid out vertically in order to fit with the multiplier unit in a convolver design.

Figure 6 Block diagram and floorplan of an accumulation unit

The 'Addition' local operation unit

This unit is used in additive/maximum and additive/minimum operations. It takes a single SDNR input value and adds it to the corresponding window template coefficient. The coefficient is stored in 2's complement format in a RAM addressed by a counter whose period is the pixel word length. To keep the design compact, we have implemented the counter using Linear Feedback Shift Registers (LFSRs). The coefficient bits are preloaded into the appropriate RAM cells according to the counter output sequence. The input SDNR operand is added to the coefficient bit serially, MSBF.

Figure 7 Block diagram and floorplan of an 'Addition' local operation unit

The adder unit occupies 3 CLBs. The whole addition unit occupies 9 CLBs laid out in a 3x3 array. The online delay of this unit is 3 clock cycles.

The Maximum/Minimum unit

The Maximum unit selects the maximum of two SDNR inputs presented to its input serially, most significant bit first. Figure 8 shows the transition diagram of the finite state machine computing the maximum 'O' of two SDNRs 'X' and 'Y'. The physical implementation of this machine occupies 13 CLBs laid out in an area 3 CLBs wide by 5 high. Note that this will allow this unit to fit the addition local operation in an Additive/Maximum neighbourhood operation.
The online delay of this unit is 3, compatible with the online delay of the accumulation global operation.

Figure 8 State diagram and floorplan of a Maximum unit

The minimum of two SDNRs can be determined in a similar manner, knowing that Min(X,Y) = -Max(-X,-Y).

5. THE COMPLETE ENVIRONMENT

The complete system is given in Figure 11. For internal working purposes, we have developed our own intermediate high level hardware description notation called HIDE4k [21]. This is Prolog-based [22], and enables highly scaleable and parameterised component descriptions to be written.

In the front end, the user programs in a high level software environment (typically C++) or can interact with a Dialog-based graphical interface, specifying the IP operation to be carried out on the FPGA in terms of Local and Global operators, window template coefficients etc. The user can also specify:
∙ The desired operating speed of the circuit.
∙ The input pixel bit-length.
∙ Whether to use our floorplanner to place the circuit or leave this task to the FPGA vendor's Placement and Routing tools.

The system provides the option of two output circuit description formats: EDIF netlist (the default), and VHDL at RTL level.

Behind the scenes, when the user gives all the parameters needed for the specific IP operation, the intermediate HIDE code is generated. Depending on the choice of the output netlist format, the HIDE code will go through either the EDIF generator tool to generate an EDIF netlist, or the VHDL generator tool to generate a VHDL netlist. In the latter case, the resulting VHDL netlist needs to be synthesised into an EDIF netlist by a VHDL synthesiser tool. Finally, the resulting EDIF netlist will go through the FPGA vendor's specific tools to generate the configuration bitstream file. The whole process is invisible to the user, thus making the FPGA completely hidden from the user's point of view.
Note that the resulting configuration is stored in a library, so it will not be regenerated if exactly the same operation happens to be defined again.

Complete and efficient configurations have been produced from our high level instruction set for all the Image Algebra operations and for a variety of complex operations including 'Sobel', 'Open' and 'Close'. They have been successfully simulated using the Xilinx Foundation Project Manager CAD tools.

Figure 10 presents the resulting layout for a Sobel edge detection operation on XC4036EX-2 for a 256x256 input image with 8-bit pixels. An EDIF configuration file, with all the placement information, has been generated automatically by our tools from the high level description in section 2.1. Note that the generator optimises the design, and uses just a single shared line buffer area for the two (task parallel) neighbourhood operations. The resulting EDIF file is fed to Xilinx PAR tools to generate the FPGA configuration bitstream. The circuit occupies 475 CLBs. Timing simulation shows that the circuit can run at a speed of 75MHz, which leads to a theoretical frame rate of 143 frames per second.

Figure 10 Physical configuration of Sobel operation on XC4036EX-2

Figure 11 presents the resulting layout for an 'Open' operation on XC4036EX-2 for a 256x256 input image with 8-bit pixels. As previously, an EDIF configuration file with all the placement information has been generated automatically by our tools from the corresponding high level description presented in section 2.1. The resulting EDIF file is then fed to Xilinx PAR tools to generate the FPGA configuration bitstream. The circuit occupies 962 CLBs. Timing simulation shows that the circuit can run at a speed of 75MHz, which leads to a theoretical frame rate of 133 frames per second.

Figure 11 Physical configuration of Open operation on XC4036EX-2

6. CONCLUSIONS

In this paper, we have presented the design of an FPGA-based Image Processing Coprocessor (IPC) along with its high level programming environment. The coprocessor instruction set is based on a core level containing the operations of Image Algebra. Architectures for user-defined compound operations can be added to the system. Possibly the most significant aspect of this work is that it opens the way for image processing application developers to exploit the high performance capability of a direct hardware solution, while programming in an application-oriented model. Figures presented for actual architectures show that real time video processing rates can be achieved when starting from a high level design.

The work presented in this paper is based specifically on Radix-2 SDNR, bit serial MSBF processing. In other situations, alternative number representations may be more appropriate. Sets of alternative representations are being added to the environment, including a full bit parallel implementation of the IPC [23]. This will give the user a choice when trying to satisfy competing constraints.

Although our basic approach is not tied to a particular FPGA, we have implemented our system on the XC4000 FPGA series. However, the special facilities provided by the new Xilinx VIRTEX family (e.g. large on-chip synchronous memory, built in Delay Locked Loops etc.) make it a very suitable target architecture for this type of application. Upgrading our system to operate on this new series of FPGA chips is underway.

REFERENCES

[1] Webber, H C (ed.), 'Image processing and transputers', IOS Press, 1992.
[2] Rajan, K, Sangunni, K S and Ramakrishna, J, 'Dual-DSP systems for signal and image-processing', Microprocessing & Microsystems, Vol 17, No 9, pp 556-560, 1993.
[3] Akiyama, T, Aono, H, Aoki, K, et al, 'MPEG2 video codec using Image compression DSP', IEEE Transactions on Consumer Electronics, Vol 40, No 3, pp 466-472, 1994.
[4] L.A. Christopher, W.T. Mayweather and S.S. Perlman, 'VLSI median filter for impulse noise elimination in composite or component TV signals', IEEE Transactions on Consumer Electronics, Vol 34, No 1, pp 263-267, 1988.
[5] J. Rose and A. Sangiovanni-Vincentelli, 'Architecture of Field Programmable Gate Arrays', Proceedings of the IEEE, Vol 81, No 7, pp 1013-1029, 1993.
[6] /products/virtex/ss_vir.htm
[6] Arnold, J M, Buell, D A and Davis, E G, 'Splash-2', Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM Press, pp 316-324, June 1992.
[7] Gigaops Ltd., The G-800 System, 2374 Eunice St. Berkeley, CA 94708.
[8] Chan, S C, Ngai, H O and Ho, K L, 'A programmable image processing system using FPGAs', International Journal of Electronics, Vol 75, No 4, pp 725-730, 1993.
[9] /
[10] /news/pubs/snug/snug99_papers/Jaffer_Final.pdf
[11] FPL99.
[12] Hutchings.
[13] Xilinx 4000.
[14] Ritter G X, Wilson J N and Davidson J L, 'Image Algebra: an overview', Computer Vision, Graphics and Image Processing, No 49, pp 297-331, 1990.
[15] Shoup, R G, 'Parameterised Convolution Filtering in an FPGA', More FPGAs, W Moore and W Luk (editors), Abington, EE&CS Books, pp 274, 1994.
[16] Kamp, W, Kunemund, H, Soldner and Hofer, H, 'Programmable 2D linear filter for video applications', IEEE Journal of Solid State Circuits, pp 735-740, 1990.
[17] Avizienis A, 'Signed Digit Number Representation for Fast Parallel Arithmetic', IRE Transactions on Electronic Computers, Vol 10, pp 389-400, 1961.
[18] Moran, J, Rios, I and Meneses, J, 'Signed Digit Arithmetic on FPGAs', More FPGAs, W Moore and W Luk (editors), Abington, EE&CS Books, pp 250, 1994.
[19] Donachy, P, 'Design and implementation of a high level image processing machine using reconfigurable hardware', PhD Thesis, Department of Computer Science, The Queen's University of Belfast, 1996.
[20] Duprat, J, Herreros, Y and Muller, J, 'Some results about on-line computation of function', 9th Symposium on Computer Arithmetic, Santa Monica, September 1989.
[21] D Crookes, K Alotaibi, A Bouridane, P Donachy and A Benkrid, 'An Environment for Generating FPGA Architectures for Image Algebra-based Algorithms', ICIP98, Vol 3, pp 990-994, 1998.
[22] Clocksin W F and Mellish C S, 'Programming in Prolog', Springer-Verlag, 1994.
