A Survey on Compression Algorithms in Hadoop


Coverage Control for Mobile Sensing Networks


IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, VOL. 20, NO. 2, APRIL 2004243Coverage Control for Mobile Sensing NetworksJorge Cortés, Member, IEEE, Sonia Martínez, Member, IEEE, Timur Karatas, and ¸ Francesco Bullo, Senior Member, IEEEAbstract—This paper presents control and coordination algorithms for groups of vehicles. The focus is on autonomous vehicle networks performing distributed sensing tasks, where each vehicle plays the role of a mobile tunable sensor. The paper proposes gradient descent algorithms for a class of utility functions which encode optimal coverage and sensing policies. The resulting closed-loop behavior is adaptive, distributed, asynchronous, and verifiably correct. Index Terms—Centroidal Voronoi partitions, coverage control, distributed and asynchronous algorithms, sensor networks.rents, and other distributed oceanographic signals. The vehicles communicate via an acoustic local area network and coordinate their motion in response to local sensing information and to evolving global data. This mobile sensing network is meant to provide the ability to sample the environment adaptively in space and time. By identifying evolving temperature and current gradients with higher accuracy and resolution than current static sensors, this technology could lead to the development and validation of improved oceanographic models. B. Optimal Sensor Allocation and Coverage Problems A fundamental prototype problem in this paper is that of characterizing and optimizing notions of quality-of-service (QoS) provided by an adaptive sensor network in a dynamic environment. To this goal, we introduce a notion of sensor coverage that formalizes an optimal sensor placement problem. This spatial resource-allocation problem is the subject of a discipline called locational optimization [5]–[9]. Locational optimization problems pervade a broad spectrum of scientific disciplines. Biologists rely on locational optimization tools to study how animals share territory and to characterize the behavior of animal groups obeying the following interaction rule: each animal establishes a region of dominance and moves toward its center. Locational optimization problems are spatial resource-allocation problems (e.g., where to place mailboxes in a city or cache servers on the internet) and play a central role in quantization and information theory (e.g., how to design a minimum-distortion fixed-rate vector quantizer). Other technologies affected by locational optimization include mesh and grid optimization methods, clustering analysis, data compression, and statistical pattern recognition. Because locational optimization problems are so widely studied, it is not surprising that methods are indeed available to tackle coverage problems; see [5], and [8]–[10]. However, most currently available algorithms are not applicable to mobile sensing networks because they inherently assume a centralized computation for a limited-size problem in a known static environment. This is not the case in multivehicle networks which, instead, rely on a distributed communication and computation architecture. Although an ad-hoc wireless network provides the ability to share some information, no global omniscient leader might be present to coordinate the group. The inherent spatially distributed nature and limited communication capabilities of a mobile network invalidate classic approaches to algorithm design. C. 
Distributed Asynchronous Algorithms for Coverage Control In this paper, we design coordination algorithms implementable by a multivehicle network with limited sensing andI. INTRODUCTION A. Mobile Sensing NetworksTHE deployment of large groups of autonomous vehicles is rapidly becoming possible because of technological advances in networking and in miniaturization of electromechanical systems. In the near future, large numbers of robots will coordinate their actions through ad-hoc communication networks, and will perform challenging tasks, including search and recovery operations, manipulation in hazardous environments, exploration, surveillance, and environmental monitoring for pollution detection and estimation. The potential advantages of employing teams of agents are numerous. For instance, certain tasks are difficult, if not impossible, when performed by a single vehicle agent. Further, a group of vehicles inherently provides robustness to failures of single agents or communication links. Working prototypes of active sensing networks have already been developed; see [1]–[4]. In [3], launchable miniature mobile robots communicate through a wireless network. The vehicles are equipped with sensors for vibrations, acoustic, magnetic, and infrared (IR) signals as well as an active video module (i.e., the camera or micro-radar is controlled via a pan-tilt unit). A second system is suggested in [4] under the name of Autonomous Oceanographic Sampling Network. In this case, underwater vehicles are envisioned measuring temperature, cur-Manuscript received November 4, 2002; revised June 26, 2003. This paper was recommended for publication by Associate Editor L. Parker and Editor A. De Luca upon evaluation of the reviewers’ comments. This work was supported in part by the Army Research Office (ARO) under Grant DAAD 190110716, and in part by the Defense Advanced Research Projects Agency/Air Force Office of Scientific Research (DARPA/AFOSR) under MURI Award F49620-02-1-0325. This paper was presented in part at the IEEE Conference on Robotics and Automation, Arlington, VA, May 2002, and in part at the Mediterranean Conference on Control and Automation, Lisbon, Portugal, July 2002. J. Cortés, T. Karatas, and F. Bullo are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: jcortes@; tkaratas@; bullo@). S. Martínez is with the Escola Universitària Politècnica de Vilanova i la Geltrú, Universidad Politécnica de Cataluña, Vilanova i la Geltrú 08800, Spain (e-mail: soniam@mat.upc.es). Digital Object Identifier 10.1109/TRA.2004.8246981042-296X/04$20.00 © 2004 IEEE244IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, VOL. 20, NO. 2, APRIL 2004communication capabilities. Our approach is related to the classic Lloyd algorithm from quantization theory; see [11] for a reprint of the original report and [12] for a historical overview. We present Lloyd descent algorithms that take into careful consideration all constraints on the mobile sensing network. In particular, we design coverage algorithms that are adaptive, distributed, asynchronous, and verifiably asymptotically correct. Adaptive: Our coverage algorithms provide the network with the ability to address changing environments, sensing task, and network topology (due to agents’ departures, arrivals, or failures). Distributed: Our coverage algorithms are distributed in the sense that the behavior of each vehicle depends only on the location of its neighbors. 
Also, our algorithms do not require a fixed-topology communication graph, i.e., the neighborhood relationships do change as the network evolves. The advantages of distributed algorithms are scalability and robustness. Asynchronous: Our coverage algorithms are amenable to asynchronous implementation. This means that the algorithms can be implemented in a network composed of agents evolving at different speeds, with different computation and communication capabilities. Furthermore, our algorithms do not require a global synchronization, and convergence properties are preserved, even if information about neighboring vehicles propagates with some delay. An advantage of asynchronism is a minimized communication overhead. Verifiable Asymptotically Correct: Our algorithms guarantee monotonic descent of the cost function encoding the sensing task. Asymptotically, the evolution of the mobile sensing network is guaranteed to converge to so-called centroidal Voronoi configurations (i.e., configurations where the location of each generator coincides with the centroid of the corresponding Voronoi cell) that are critical points of the optimal sensor-coverage problem. Let us describe the contributions of this paper in some detail. Section II reviews certain locational optimization problems and their solutions as centroidal Voronoi partitions. Section III provides a continuous-time version of the classic Lloyd algorithm from vector quantization and applies it to the setting of multivehicle networks. In discrete time, we propose a family of Lloyd algorithms. We carefully characterize convergence properties for both continuous and discrete-time versions (Appendix I collects some relevant facts on descent flows). We discuss a worst-case optimization problem, we investigate a simple uniform planar setting, and we present simulation results. Section IV presents two asynchronous distributed implementations of Lloyd algorithm for ad-hoc networks with communication and sensing capabilities. Our treatment carefully accounts for the constraints imposed by the distributed nature of the vehicle network. We present two asynchronous implementations, one based on classic results on distributed gradient flows, the other based on the structure of the coverage problem. (Appendix II briefly reviews some known results on asynchronous gradient algorithms.)Section V-A considers vehicle models with more realistic dynamics. We present two formal results on passive vehicle dynamics and on vehicles equipped with individual local controllers. We present numerical simulations of passive vehicle models and of unicycle mobile vehicles. Next, Section V-B describes density functions that lead the multivehicle network to predetermined geometric patterns. We present our conclusions and directions for future research in Section VI. D. Review of Distributed Algorithms for Cooperative Control Recent years have witnessed a large research effort focused on motion planning and coordination problems for multivehicle systems. Issues include geometric patterns [13]–[16], formation control [17], [18], gradient climbing [19], and conflict avoidance [20]. It is only recently, however, that truly distributed coordination laws for dynamic networks are being proposed; e.g., see [21]–[23]. Heuristic approaches to the design of interaction rules and emerging behaviors have been throughly investigated within the literature on behavior-based robotics; see [17], and [24]–[28]. An example of coverage control is discussed in [29]. 
Along this line of research, algorithms have been designed for sophisticated cooperative tasks. However, no formal results are currently available on how to design reactive control laws, ensure their correctness, and guarantee their optimality with respect to an aggregate objective. The study of distributed algorithms is concerned with providing mathematical models, devising precise specifications for their behavior, and formally proving their correctness and complexity. Via an automata-theoretic approach, the references [30] and [31] treat distributed consensus, resource allocation, communication, and data consistency problems. From a numerical optimization viewpoint, the works in [32] and [33] discuss distributed asynchronous algorithms as networking algorithms, rate and flow control, and gradient descent flows. Typically, both these sets of references consider networks with fixed topology, and do not address algorithms over ad-hoc dynamically changing networks. Another common assumption is that any time an agent communicates its location, it broadcasts it to every other agent in the network. In our setting, this would require a nondistributed communication setup. Finally, we note that the terminology “coverage” is also used in [34] and [35] and references therein to refer to a different problem called the coverage path-planning problem, where a single robot equipped with a limited footprint sensor needs to visit all points in its environment. II. FROM LOCATION OPTIMIZATION TO CENTROIDAL VORONOI PARTITIONS A. Locational Optimization In this section, we describe a collection of known facts about a meaningful optimization problem. References include the theory and applications of centroidal Voronoi partitions, see [10], and the discipline of facility location, see [6]. In the paper, we interchangeably refer to the elements of the network be the set as sensors, agents, vehicles, or robots. We letCORTÉS et al.: COVERAGE CONTROL FOR MOBILE SENSING NETWORKS245the position of the sensors and the partition of the space. This problem is referred to as a facility location problem and, in particular, as a continuous -median problem in [6]. Remark 2.2: Note that if we interchange the positions of any two agents, along with their associated regions of dominance, is not afthe value of the locational optimization function denotes the discrete group of perfected. Equivalently, if mutations of elements, then for all . To eliminate this discrete redundancy, one could take natural action on , and consider as the configuration space of for the position of the vehicles. B. Voronoi PartitionsFig. 1. Contour plot on a polygonal environment of the Gaussian density y ). function  = exp( x0 0of nonnegative real numbers, be the set of positive natural numbers, and . be a convex polytope in , including its interior, Let denote the Euclidean distance function. We call and let a distribution density function if it reprea map sents a measure of information or probability that some event takes place over . In equivalent words, we can consider to be the bounded support of the function . Let be the location of sensors, each moving in the space . Because of noise and loss of resolution, the sensing performance at point taken from th sensor at the position degrades with the between and ; we describe this degradadistance . tion with a nondecreasing differentiable function Accordingly, provides a quantitative assessment of how poor the sensing performance is (see Fig. 1). 
mobile robots Remark 2.1: As an example, consider equipped with microphones attempting to detect, identify, and localize a sound source. How should we plan the robots’ motion in order to maximize the detection probability? Assuming the source emits a known signal, the optimal detection algorithm is a matched filter (i.e., convolve the known waveform with the received signal and threshold). The source is detected depending on the signal-to-noise-ratio (SNR), which is inversely proportional to the distance between the microphone and the source. Various electromagnetic and sound sensors have SNRs inversely proportional to distance. Within the context of this paper, a partition of is a collecwith disjoint interiors, tion of polytopes and are whose union is . We say that two partitions equal if and only differ by a set of -measure zero, for . all We consider the task of minimizing the locational optimization function (1)One can easily see that, at a fixed-sensors location, the optimal partition of is the Voronoi partition generated by the pointsWe refer to [9] for a comprehensive treatment on Voronoi diagrams, and briefly present some relevant concepts. The set is called the Voronoi diagram for the of regions . When the two Voronoi regions generators and are adjacent (i.e., they share an edge), is called a (and vice versa). The set of indexes (Voronoi) neighbor of of the Voronoi neighbors of is denoted by . Clearly, if and only if . We also define the face as . Voronoi diagrams can be defined with , and respect to various distance functions, e.g., the 1-, 2-, -norm over , see [36]. Some useful facts about the Euclidean setting are the following: if is a convex polytope in an -dimensional Euclidean space, the boundary of each is the union of -dimensional convex polytopes. In what follows, we shall writeNote that using the definition of the Voronoi partition, we have for all . Therefore(2) that is, the locational optimization function can be interpreted as an expected value composed with a “min” operation. This is the usual way in which the problem is presented in the facility location and operations research literature [6]. Remarkably, one can show [10] that(3) where we assume that the th sensor is responsible for measure. Note that the function ments over its “dominance region” is to be minimized with respect to both 1) the sensors location , and 2) the assignment of the dominance regions . The optimization is, therefore, to be performed with respect to with respect to the th sensor i.e., the partial derivative of only depends on its own position and the position of its Voronoi neighbors. Therefore, the computation of the derivative of with respect to the sensors’ location is decentralized in the sense246IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, VOL. 20, NO. 2, APRIL 2004of Voronoi. Moreover, one can deduce some smoothness prop: since the Voronoi partition depends at least conerties of , for all , the function tinuously on is at least continuously differentiable on for some . C. Centroidal Voronoi Partitions Let us recall some basic quantities associated with a region and a mass density function . The (generalized) mass, centroid (or center of mass), and polar moment of inertia are defined asIII. CONTINUOUS AND DISCRETE-TIME LLOYD DESCENT FOR COVERAGE CONTROL In this section, we describe algorithms to compute the location of sensors that minimize the cost , both in continuous and in discrete time. In Section III-A, we propose a continuous-time version of the classic Lloyd algorithm. 
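In the original paper, the locational optimization function (1) has the form H(p, W) = sum_i of the integral over W_i of f(||q - p_i||) phi(q) dq, which, when W is the Voronoi partition generated by the sensors, reduces to the expectation of min_i f(||q - p_i||) against the density phi as in (2). The sketch below approximates this cost on a discretized unit square, assuming f(x) = x^2 and a Gaussian density similar to Fig. 1; the grid resolution, number of sensors, and density parameters are illustrative choices rather than values from the paper.

```python
import numpy as np

def coverage_cost(points, samples, density, f=lambda d: d**2):
    """Approximate H(p) = integral over Q of min_i f(||q - p_i||) * phi(q) dq.

    points  : (n, 2) sensor locations p_i
    samples : (m, 2) grid points discretizing the convex region Q (unit square here)
    density : (m,)   values of the density phi at the sample points
    f       : nondecreasing degradation function of distance (f(x) = x**2 assumed)
    """
    # distance from every sample point to every sensor
    d = np.linalg.norm(samples[:, None, :] - points[None, :, :], axis=2)
    # each point is charged to its closest sensor: the Voronoi assignment behind (2)
    per_point = f(d.min(axis=1)) * density
    # Riemann-sum approximation of the integral over the unit square (area 1)
    return per_point.mean()

# Illustrative setup: 5 sensors, a 100 x 100 grid, and a Gaussian density as in Fig. 1
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(5, 2))
xs, ys = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
samples = np.column_stack([xs.ravel(), ys.ravel()])
density = np.exp(-((samples[:, 0] - 0.5) ** 2 + (samples[:, 1] - 0.5) ** 2) / 0.05)
print(coverage_cost(points, samples, density))
```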
Here, both the positions and partitions evolve in continuous time, whereas the Lloyd algorithm for vector quantization is designed in discrete time. In Section III-B, we develop a family of variations of Lloyd algorithm in discrete time. In both settings, we prove that the proposed algorithms are gradient descent flows. A. A Continuous-Time Lloyd Algorithm Assume the sensors location obeys a first-order dynamical behavior described byAdditionally, by the parallel axis theorem, one can write (4) is defined as the polar moment of inertia of where . the region about its centroid Let us consider again the locational optimization problem (1), and suppose now we are strictly interested in the setting (5) . The parallel axis that is, we assume and theorem leads to simplifications for both the function its partial derivative Consider a cost function to be minimized and impose that follows a gradient descent. In equivalent conthe location a Lyapunov function, and trol theoretical terms, consider stabilize the multivehicle system to one of its local minima via dissipative control. Formally, we set (6) where is a positive gain, and where we assume that the is continuously updated. partition Proposition 3.1 (Continuous-Time Lloyd Descent): For the closed-loop system induced by (6), the sensors location con, i.e., the verges asymptotically to the set of critical points of set of centroidal Voronoi configurations on . Assuming this set is finite, the sensors location converges to a centroidal Voronoi configuration. Proof: Under the control law (6), we haveHere, the mass density function is define. It is convenient toTherefore, the (not necessarily unique) local minimum points are centroids of their for the location optimization function Voronoi cells, i.e., each location satisfies two properties simultaneously: it is the generator for the Voronoi cell , and it is its centroidAccordingly, the critical partitions and points for are called centroidal Voronoi partitions. We will refer to a sensors’ configuration as a centroidal Voronoi configuration if it gives rise to a centroidal Voronoi partition. Of course, centroidal Voronoi configurations depend on the specific distribution density function , and an arbitrary pair admits, in general, multiple centroidal Voronoi configurations. This discussion provides a proof alternative to the one given in [10] for the necessity of centroidal Voronoi partitions as solutions to the continuous -median location problem.By LaSalle’s principle, the sensors location converges to the , which is precisely the largest invariant set contained in set of centroidal Voronoi configurations. Since this set is clearly consists of invariant for (6), we get the stated result. If converges to one of them a finite collection of points, then (see Corollary 1.2). Remark 3.2: If is finite, and , then a sufficient condition that guarantees exponential convergence is be positive definite at . Establishing that the Hessian of this property is a known open problem, see [10]. Note that this gradient descent is not guaranteed to find the global minimum. For example, in the vector quantization and signal processing literature [12], it is known that for bimodal distribution density functions, the solution to the gradient flow reaches local minima where the number of generators allocated to the two region of maxima are not optimally partitioned.CORTÉS et al.: COVERAGE CONTROL FOR MOBILE SENSING NETWORKS247B. 
A Family of Discrete-Time Lloyd Algorithms Let us consider the following variations of Lloyd algorithm. verifying the Let be a continuous mapping following two properties: , 1) for all , where denotes the th component of ; is not centroidal, then there exists a such that 2) if . Property 1) guarantees that, if moved according to , the agents of the network do not increase their distance to its corresponding centroid. Property 2) ensures that at least one robot moves at each iteration and strictly approaches the centroid of its Voronoi region. Because of this property, the fixed points of are the set of centroidal Voronoi configurations. Proposition 3.3 (Discrete-Time Lloyd Descent): Let be a continuous mapping satisfying properties denote the initial sensors’ location. 1) and 2). Let converges to the set Then, the sequence of centroidal Voronoi configurations. If this set is finite, then converges to a centroidal the sequence Voronoi configuration. as an objective funcProof: Consider tion for the algorithm . Using the parallel axis theorem, , and therefore (7) for all , as long as with strict inequality if for any , . , with strict inequality if In particular, , where denotes the set of centroids of the partition . Moreover, since the Voronoi partition is the optimal one for fixed , we also have (8) with strict inequality if . Now, because of property 1) of , inequality (7) yieldsalgorithm can also be seen as a fixed-point iteration. Consider for the mappingsLet be defined by . is continuous (indeed, ), and corresponds to Clearly, , Lloyd algorithm. Now, for all . Moreover, if is not centroidal, then . Therefore, verifies the inequality is strict for all properties 1) and 2). C. Remarks 1) Note that different sensor performance functions in (1) correspond to different optimization problems. Provided one [cf. (1)], uses the Euclidean distance in the definition of the standard Voronoi partition computed with respect to the Euclidean metric remains the optimal partition. For arbitrary , it is not possible anymore to decompose into the sum and . Nevertheless, it is still of terms similar to possible to implement the gradient flow via the expression for the partial derivative (3). Proposition 3.5: Assume the sensors location obeys a first. Then, for the closed-loop order dynamical behavior, system induced by the gradient law (3), , the converges asymptotically to sensors location . Assuming this set is finite, the the set of critical points of sensors location converges to a critical point. 2) More generally, various distance notions can be used to define locational optimization functions. Different performance function gives rise to corresponding notions of “center of a region” (any notion of geometric center, mean, or average is an interesting candidate). These can then be adopted in designing coverage algorithms. We refer to [36] for a discussion on Voronoi partitions based on non-Euclidean distance functions, and to [5] and [8] for a discussion on the corresponding locational optimization problems. 3) Next, let us discuss an interesting variation of the original problem. In [6], minimizing the expected minimum distance in (2) is referred to as the continuous -median function problem. It is instructive to consider the worst-case minimum distance function, corresponding to the scenario where no information is available on the distribution density function. 
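Remark 3.4 summarizes Lloyd algorithm: construct the Voronoi partition of the current agent locations, move every agent to the mass centroid of its Voronoi region, and repeat. The sketch below is a minimal discrete-time version on a sampled domain; it approximates the Voronoi partition by nearest-generator assignment over a grid instead of computing exact polygonal cells, and its fixed points are approximate centroidal Voronoi configurations. The `samples` and `density` arrays are assumed to discretize the region Q and the density phi as in the previous sketch; the step count and tolerance are illustrative.

```python
import numpy as np

def lloyd_iteration(points, samples, density, steps=50, tol=1e-6):
    """Discrete-time Lloyd descent on a sampled convex region.

    Each iteration: (1) assign every sample point to its nearest generator
    (a grid approximation of the Voronoi partition), then (2) move every
    generator to the mass centroid of its cell, weighted by the density.
    """
    points = points.copy()
    for _ in range(steps):
        # step 1: Voronoi assignment of the sample grid
        d = np.linalg.norm(samples[:, None, :] - points[None, :, :], axis=2)
        owner = d.argmin(axis=1)
        new_points = points.copy()
        for i in range(len(points)):
            w = density[owner == i]
            if w.sum() > 0:                      # an empty cell keeps its generator
                q = samples[owner == i]
                # step 2: centroid C_Vi = (1 / M_Vi) * integral of q * phi(q) over V_i
                new_points[i] = (q * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(new_points - points) < tol:   # no generator moved: converged
            break
        points = new_points
    return points

# e.g., with points, samples, density built as in the previous sketch:
# centroidal_config = lloyd_iteration(points, samples, density)
```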
In other words, the network seeks to minimize the largest possible distance from any point in to any of the sensor locations, i.e., to minimize the functionand the inequality is strict if of . In additionis not centroidal by property 2)because of (8). Hence, , and the inequality is a is strict if is not centroidal. We then conclude that descent function for the algorithm . The result now follows from the global convergence Theorem 1.3 and Proposition 1.4 in Appendix I. Remark 3.4: Lloyd algorithm in quantization theory [11], [12] is usually presented as follows. Given the location of agents, : 1) construct the Voronoi partition corresponding to ; 2) compute the mass centroids of the Voronoi regions found in step 1). Set the new location of the agents to these centroids, and return to step 1). LloydThis optimization is referred to as the -center problem in [6] and [7]. One can design a strategy for the -center problem analog to the Lloyd algorithm for the -median problem. Each vehicle moves, in continuous or discrete time, toward the center of the minimum-radius sphere enclosing the polytope. We refer to [37] for a convergence analysis of the continuous-time algorithms. In what follows, we shall restrict our attention to the -median problem and to centroidal Voronoi partitions.248IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, VOL. 20, NO. 2, APRIL 2004E. Numerical Simulations To illustrate the performance of the continuous-time Lloyd algorithm, we include some simulation results. The as a single cenalgorithm is implemented in setting, the code computes the tralized program. For the package bounded Voronoi diagram using the , and computes mass, centroid, and polar moment of inertia of polygons via the numerical . Careful attention was paid to integration routine numerical accuracy issues in the computation of the Voronoi diagram and in the integration. We illustrate the performance of the closed-loop system in Fig. 3. IV. ASYNCHRONOUS DISTRIBUTED IMPLEMENTATIONS In this section, we show how the Lloyd gradient algorithm can be implemented in an asynchronous distributed fashion. In Section IV-A, we describe our model for a network of robotic agents, and we introduce a precise notion of distributed evolution. Next, we provide two distributed algorithms for the local computation and maintenance of the Voronoi cells. Finally, in Section IV-C, we propose two distributed asynchronous implementations of Lloyd algorithm. The first one is based on the gradient optimization algorithms, as described in [32], and the second one relies on the special structure of the coverage problem. A. Modeling an Asynchronous Distributed Network of Mobile Robotic Agents We start by modeling a robotic agent that performs sensing, communication, computation, and control actions. We are interested in the behavior of the asynchronous network resulting from the interaction of finitely many robotic agents. A framework to formalize the following concepts is the theory of distributed algorithms; see [30]. Let us here introduce the notion of robotic agent with computation, communication, and control capabilities as the th element of a network. The th agent has a processor with the ability of allocating continuous and discrete states and performing operations on them. Each vehicle has access to its unique identifier . 
The th agent occupies a location and it is cafor any period of pable of moving in space, at any time time , according to a first-order dynamics of the form (11) The processor has access to the agent’s location and deter. The processor of the th agent mines the control pair has access to a local clock , and a scheduling sequence, such i.e., an increasing sequence of times and . The that processor of the th agent is capable of transmitting information to any other agent within a closed disk of radius . We to be a quantity controlassume the communication radius lable by the th processor and the corresponding communication bandwidth to be limited. We represent the information flow between the agents by means of “send” (within specified radius ) and “receive” commands with a finite number of arguments.Fig. 2. Notation conventions for a convex polygon.D. Computations Over Polygons With Uniform Density In this section, we investigate closed-form expression for the control laws introduced above. Assume the Voronoi region is a convex polygon (i.e., a polytope in ) with vertexes such as in Fig. 2. It is labeled convenient to define . Furthermore, we . By evaluating assume that the density function is the corresponding integrals, one can obtain the following closed-form expressions:(9) To present a simple formula for the polar moment of inertia, let and , for . Then, the polar moment of inertia of a polygon about its centroid becomesThe proof of these formulas is based on decomposing the polygon into the union of disjoint triangles. We refer to [38] for . analog expressions over Note also that the Voronoi polygon’s vertexes can be expressed as a function of the neighboring vehicles. The vertexes of the th Voronoi polygon that lie in the interior of are the circumcenters of the triangles formed by and any two neighbors adjacent to . The circumcenter of the triangle determined by , , and is(10) is the area of the triangle, and . where Equation (9) for a polygon’s centroid and (10) for the Voronoi cell’s vertexes lead to a closed-form algebraic expression for the control law in (6) as a function of the neighboring vehicles’ location.。
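For a convex polygon with uniform density, the closed-form expressions behind (9) are the standard shoelace formulas for a polygon's area and centroid, and (10) is the circumcenter of the triangle formed by a sensor and two adjacent neighbors. The sketch below implements these standard textbook formulas as a reconstruction; the function names and the test polygon are illustrative and not taken from the paper's code.

```python
import numpy as np

def polygon_area_centroid(V):
    """Area and centroid of a convex polygon with uniform density.

    V : (N, 2) array of vertices ordered counter-clockwise.
    Uses the shoelace-type closed forms that (9) refers to.
    """
    x, y = V[:, 0], V[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)          # vertices shifted by one (wrap around)
    cross = x * yn - xn * y
    A = 0.5 * cross.sum()
    cx = ((x + xn) * cross).sum() / (6.0 * A)
    cy = ((y + yn) * cross).sum() / (6.0 * A)
    return A, np.array([cx, cy])

def circumcenter(p_i, p_j, p_k):
    """Circumcenter of the triangle (p_i, p_j, p_k): the Voronoi vertex
    generated by three mutually neighboring sensors, cf. (10)."""
    ax, ay = p_i
    bx, by = p_j
    cx, cy = p_k
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return np.array([ux, uy])

# Unit square: area 1, centroid (0.5, 0.5); right triangle: circumcenter at hypotenuse midpoint
print(polygon_area_centroid(np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)))
print(circumcenter((0.0, 0.0), (1.0, 0.0), (0.0, 1.0)))   # -> [0.5, 0.5]
```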

Computer Science English Test Questions and Answers


Computer Science English Test

Part I. Vocabulary (1 point each, 20 points total)

(1) Translate the following words and expressions into Chinese (10 points, 1 point each)
1. Cyber cafe
2. microcomputer
3. ROM
4. Object-oriented programming
5. utility program
6. system specification
7. database administrator
8. modulator-demodulator
9. client/server model
10. spreadsheet program

(2) Translate the following terms or phrases from Chinese into English (10 points, 1 point each)
1. 中央处理器
2. 广域网
3. 超级计算机
4. 电子商务
5. 计算机安全
6. 计算机文化
7. 网站
8. 域名
9. 数据库管理系统
10. 软件工程

Part II. Cloze (1 point per blank, 20 points total)

Fill in each of the blanks with one of the words given in the list following each paragraph, making changes if necessary:

1. Computer hardware is the ____ involved in the ____ of a computer and consists of the ____ that can be physically handled. The function of these components is typically divided into three main categories: ____, ____, and ____. Components in these categories connect to ____, specifically, the computer's central ____ unit (CPU), the electronic ____ that provides the computational ability and control of the computer, via wires or circuitry called a bus.

microprocessors, component, processing, function, output, equipment, input, circuitry, storage

2. In the relational model, data is organized in two-dimensional ____ called ____. There is no ____ or ____ structure imposed on the data. The tables or relations are, however, related to each other. The ____ database management system (RDBMS) ____ the data so that its external ____ is a ____ of relations or tables. This does not mean that data is stored as tables: the physical ____ of the data is independent of the way in which the ____ is logically organized.

hierarchical, set, organize, relational, relation, data, storage, view, network, table

Part III. English-to-Chinese Translation (10 points each, 20 points total)

Translate the following passages from English into Chinese:

1. The field of computer science has grown rapidly since the 1950s due to the increase in their use. Computer programs have undergone many changes during this time in response to user needs and advances in technology. Newer ideas in computing such as parallel computing, distributed computing, and artificial intelligence have radically altered the traditional concepts that once determined program form and function. In parallel computing, parts of a problem are worked on simultaneously by different processors, and this speeds up the solution of the problem. Another type of parallel computing called distributed computing uses CPUs from many interconnected computers to solve problems. Research into artificial intelligence (AI) has led to several other new styles of programming.

2. High-level languages are commonly classified as procedure-oriented, functional, object-oriented, or logic languages. The most common high-level languages today are procedure-oriented languages. In these languages, one or more related blocks of statements that perform some complete function are grouped together into a program module, or procedure, and given a name such as "procedure A". If the same sequence of operations is needed elsewhere in the program, a simple statement can be used to refer back to the procedure. In essence, a procedure is just a mini-program. A large program can be constructed by grouping together procedures that perform different tasks.

Part IV. Chinese-to-English Translation (20 points)

最著名的互联网例子是因特网。
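The second reading passage in Part III describes how procedure-oriented languages group related statements into a named procedure that can then be invoked by a simple statement elsewhere in the program. A minimal, illustrative Python sketch of that idea:

```python
# A "procedure" groups a complete, reusable block of statements under a name.
def average(values):
    total = 0.0
    for v in values:          # one self-contained task
        total += v
    return total / len(values)

# Elsewhere in the program, a simple statement refers back to the procedure by name.
print(average([70, 85, 92]))
print(average([3.5, 4.0]))
```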

Compression Algorithms for Automatic Speech Recognition Models: A Survey


第62卷 第1期吉林大学学报(理学版)V o l .62 N o .12024年1月J o u r n a l o f J i l i nU n i v e r s i t y (S c i e n c eE d i t i o n )J a n 2024研究综述d o i :10.13413/j .c n k i .jd x b l x b .2023058自动语音识别模型压缩算法综述时小虎1,袁宇平2,吕贵林3,常志勇4,邹元君5(1.吉林大学计算机科学与技术学院,长春130012;2.吉林大学大数据和网络管理中心,长春130012;3.中国第一汽车集团有限公司研发总院智能网联开发院,长春130011;4.吉林大学生物与农业工程学院,长春130022;5.长春中医药大学医药信息学院,长春130117)摘要:随着深度学习技术的发展,自动语音识别任务模型的参数数量越来越庞大,使得模型的计算开销㊁存储需求和功耗花费逐渐增加,难以在资源受限设备上部署.因此对基于深度学习的自动语音识别模型进行压缩,在降低模型大小的同时尽量保持原有性能具有重要价值.针对上述问题,全面综述了近年来该领域的主要工作,将其归纳为知识蒸馏㊁模型量化㊁低秩分解㊁网络剪枝㊁参数共享以及组合模型几类方法,并进行了系统综述,为模型在资源受限设备的部署提供可选的解决方案.关键词:语音识别;模型压缩;知识蒸馏;模型量化;低秩分解;网络剪枝;参数共享中图分类号:T P 391 文献标志码:A 文章编号:1671-5489(2024)01-0122-10C o m p r e s s i o nA l g o r i t h m s f o rA u t o m a t i c S pe e c h R e c o g n i t i o n M o d e l s :AS u r v e yS H IX i a o h u 1,Y U A N Y u p i n g 2,L ㊆U G u i l i n 3,C H A N GZ h i y o n g 4,Z O U Y u a n ju n 5(1.C o l l e g e o f C o m p u t e rS c i e n c e a n dT e c h n o l o g y ,J i l i nU n i v e r s i t y ,C h a n gc h u n 130012,C h i n a ;2.M a n a g e m e n t C e n t e r o f B i g D a t aa nd Ne t w o r k ,J i l i nU n i v e r s i t y ,C h a n g c h u n 130012,C h i n a ;3.I n t e l l i g e n tN e t w o r kD e v e l o p m e n t I n s t i t u t e ,R &DI n s t i t u t e of C h i n aF A W G r o u p C o .,L t d ,C h a n gc h u n 130011,C h i n a ;4.C o l l e g e o f B i o l o g i c a l a n dA g r i c u l t u r a lE n g i n e e r i n g ,J i l i nU n i v e r s i t y ,C h a n gc h u n 130022,C h i n a ;5.S c h o o l o f M ed i c a l I n f o r m a t i o n ,C h a n g c h u nU n i ve r s i t y of C h i n e s eM e d i c i n e ,C h a ng ch u n 130117,C hi n a )收稿日期:2023-02-23.第一作者简介:时小虎(1974 ),男,汉族,博士,教授,博士生导师,从事机器学习的研究,E -m a i l :s h i x h @j l u .e d u .c n .通信作者简介:邹元君(1975 ),男,汉族,博士,教授,从事医药信息化的研究,E -m a i l :z o u y j @c c u c m.e d u .c n .基金项目:国家自然科学基金(批准号:62272192)㊁吉林省科技发展计划项目(批准号:20210201080G X )㊁吉林省发改委项目(批准号:2021C 044-1)和吉林省教育厅科研基金(批准号:J J K H 20200871K J ).A b s t r a c t :W i t h t h e d e v e l o p m e n t o f d e e p l e a r n i n g t e c h n o l o g y,t h en u m b e r o f p a r a m e t e r s i na u t o m a t i c s p e e c hr e c o g n i t i o nt a s k m o d e l s w a s b e c o m i n g i n c r e a s i n g l y l a r g e ,w h i c h g r a d u a l l y i n c r e a s e dt h e c o m p u t i n g o v e r h e a d ,s t o r a g e r e q u i r e m e n t s a n d p o w e r c o n s u m p t i o no f t h em o d e l s ,a n d i tw a s d i f f i c u l t t od e p l o y o n r e s o u r c e -c o n s t r a i n e dd e v i c e s .T h e r e f o r e ,i tw a s o f g r e a t v a l u e t o c o m p r e s s t h e a u t o m a t i c s p e e c h r e c o g n i t i o nm o d e l s b a s e d o n d e e p l e a r n i n g t o r e d u c e t h e s i z e o f t h em o d e sw h i l em a i n t a i n i n gt h e o r i g i n a l p e r f o r m a n c e a sm u c h a s p o s s i b l e .A i m i n g a t t h e a b o v e p r o b l e m s ,a c o m p r e h e n s i v e s u r v e y w a s c o n d u c t e do n t h em a i nw o r k s i n t h i s f i e l d i n r e c e n t ye a r s ,w h i c hw a s s u mm a r i z e d a s s e v e r a lm e t h o d s ,i n c l u d i n g k n o w l e d g ed i s t i l l a t i o n ,m o d e l q u a n t i z a t i o n ,l o w -r a n k d e c o m p o s i t i o n ,n e t w o r k p r u n i n g ,p a r a m e t e r s h a r i n g a n d c o m b i n a t i o nm o d e l s ,a n dc o n d u c t e da s ys t e m a t i c r e v i e wt o p r o v i d e a l t e r n a t i v e s o l u t i o n s f o r t h ed e p l o y m e n t o fm o d e l s o n r e s o u r c e -c o n s t r a i n e dd e 
v i c e s .K e yw o r d s :s p e e c h r e c o g n i t i o n ;m o d e lc o m p r e s s i o n ;k n o w l e d g e d i s t i l l a t i o n ;m o d e l q u a n t i z a t i o n ;l o w -r a n kd e c o m p o s i t i o n ;n e t w o r k p r u n i n g ;p a r a m e t e r s h a r i n g 近年来,随着人机智能交互的快速发展,自动语音识别(a u t o m a t i c s p e e c h r e c o gn i t i o n ,A S R )作为智能交互应用的一种关键技术,已经在各种应用场景下凸显重要作用,如在语音搜索㊁语音助手㊁会议记录㊁智能服务㊁机器人等方面均应用广泛.随着深度学习的发展,基于深度学习的方法表现出了比传统的如高斯混合模型㊁隐M a r k o v 模型等方法更优越的性能.但基于深度学习的语音识别网络,特别是端到端模型,通常有具有数百万甚至高达数十亿的参数,占用大量内存,在运行时也需要较大的计算资源,仅适用于部署在专业服务器上.例如,Q u a r t z N e t 模型包含1.89ˑ107的参数[1],C o n f o r m e rT r a n s d u c e r 语音识别模型包含1.188ˑ108的参数[2],基于T r a n s f o r m e r 的声学模型包含2.70ˑ108以上的参数[3].随着移动㊁家居㊁车载等智能设备的广泛使用,产生了巨大的资源受限设备智能交互需求.而语音作为人们日常交流最自然的方式,在智能交互中扮演着重要角色.通常语音识别模型在设备端的部署无需与服务器进行数据传输,将模型部署在本地设备上具有更好的实时性㊁安全性和隐私保护能力.但由于资源受限设备的存储空间有限,同时计算能力较弱,无法将庞大的深度学习语音识别模型在资源受限的环境中实际部署.如何在尽量保持性能的同时,减小模型的大小㊁降低运行延迟成为自动语音识别的重要课题.本文对近年来关于自动语音识别模型的压缩算法进行总结和综述,将其归纳为知识蒸馏㊁模型量化㊁网络剪枝㊁矩阵分解㊁参数共享以及组合模型几类方法,并进行了系统综述,为模型在不同条件和场景下的部署提供可选的解决方案.1 基于知识蒸馏的方法知识蒸馏,主要通过模仿高延迟㊁高精度的大型神经网络行为,训练得到低延迟的轻量神经网络.其中,高延迟㊁高精度的大型网络称为教师模型,低延迟的轻量网络称为学生模型.在知识蒸馏过程中,教师模型的知识迁移到学生模型,从而提高学生模型的性能,以获得低延迟㊁低内存㊁高性能的学生模型.一般的知识蒸馏过程如图1所示.首先,训练一个教师模型,根据输入特征输出软标签的概率分布;然后,学生模型通过软标签模仿教师模型的行为.软标签不仅包含了正确的类别分布,也反应了类似类别之间的关系,通常比独热编码的真实标签包含更丰富的信息.真实标签的使用通常对训练有益,因此可以将根据真实标签训练的语音识别损失与知识蒸馏损失相结合,共同进行学生模型的优化.传统的混合A S R 系统需要首先进行帧的对齐,然后再进行识别,而目前的主流方法已经发展到端到端模型,不再需要进行帧对齐,而是直接将输入语音转换为对应文本.端到端模型主要可分为连接时序分类方法(c o n n e c t i o n i s tt e m po r a lc l a s s i f i c a t i o n ,C T C )[4]㊁循环神经网络转换器(r e c u r r e n t n e u r a l n e t w o r kt r a n s d u c e r ,R N N -T )[5]㊁基于注意力的编码器-解码器(a t t e n t i o n -b a s e de n c o d e r -d e c o d e r ,A E D )[6]3类.因此按照被压缩的原始A S R 系统划分,基于知识蒸馏技术的压缩模型主要分为传统的混合A S R 压缩模型和端到端压缩模型,后者又包括C T C 压缩模型㊁R N N -T 压缩模型和A E D 压缩模型,主要方法如图2所示.下面将依次介绍这几类方法,最后介绍多个教师的知识蒸馏方法.传统的混合A S R 系统包括基于高斯混合和隐M a r k o v 的混合模型(G a u s s i a n m i x e dm o d e l -h i d d e n M a r k o vm o d e l ,GMM -HMM )[7],基于深度神经网络和隐M a r k o v 的混合模型(d e e p n e u r a l n e t w o r k s -h i d d e n M a r k o vm o d e l ,D N N -HMM )[8]等.已有学者应用知识蒸馏技术对传统的A S R 系统进行了模型压缩.如L i 等[9]首先尝试应用教师-学生学习方法对D N N -HMM 混合系统实现了语音识别蒸馏.D N N -HMM 混合系统的训练需要每个输入声学帧的输出标签,即进行帧对齐,所以帧级知识蒸馏是教师用来指导学生模型非常自然的方式.因此,文献[9]和文献[10]都是通过最大限度地减少教师和学生的帧级输出分布之间的K u l l b a c k -L e i b l e r (K L )散度进行帧级知识蒸馏.在训练中,只需优化学生321 第1期 时小虎,等:自动语音识别模型压缩算法综述图1 知识蒸馏的一般过程F i g .1G e n e r a l p r o c e s s o f k n o w l e d ge d i s t i l l a t i on 图2 基于知识蒸馏的方法F i g .2 M e t h o d s b a s e do nk n o w l e d ge d i s t i l l a t i o n 模型的参数,最小化K L 散度即等价于最小化交叉熵.但帧级训练未考虑语音数据的顺序性质,因此A S R 模型的序列训练会比帧级训练产生更好的性能[11].针对D N N -HMM 混合模型,W o n g 等[12]通过最小化最大互信息(m a x i m u m m u t u a l i n f o r m a t i o n ,MM I)损失与学生和教师假设级后验概率之间的K L 散度的加权平均值,并将假设后验分解为语言模型和声学模型,以调整混合A S R 架构中后验分布的动态范围,更好地进行序列训练,实现了序列级的知识蒸馏.C T C 损失函数的目标是最大化正确标签的后验概率,不执行严格的时间对齐.C T C 模型的后验概率分布是尖锐的,多数帧发出高概率的空白标签,只有少数发出非空白目标标签概率较高的帧可以有效地训练学生模型,因此将帧级知识蒸馏应用于基于C T C 的A S R 系统会降低性能[13].此外,由于C T C 无帧对齐的特点,因此即使对相同样本,出现尖峰的时间也不相同,使帧级知识蒸馏在优化时难以收敛[14].T a k a s h i m a 等[14]提出使用教师模型的N -b e s t 假设在C T C 框架中进行序列级知识蒸馏,将N -b e s t 假设的C T C 损失进行加权求和作为训练学生模型的蒸馏损失.这种方法能有效提升学生模型的性能,但需要对每个N -b e s t 假设计算C T C 损失,训练成本较高.文献[15]提出了基于格的序列级知识蒸馏方法,相比基于N -b e s t 的方法能抑制训练时间的增加.针对C T C 峰值时间的分歧,K u r a t a 等[16]通过在教师模型双向长短期记忆网络(b i -d i r e c t i o n a l l o n g s h o r t -t e r m m e m o r y ,B i L S T M )的帧窗口中,选择匹配学生模型单向长短期记忆网络(u n i -d i r e c t i o n a l l o n g s h o r t -t e r m m e m o r y ,U n i L S T M )当前帧的分布,放松帧对齐的严格假设.但基于B i L S T M C T C 模型中的峰值通常早于U n i L S T M C T 
C的情况,不具有普适性.因此,K u r a t a 等[17]进一步提出了引导C T C 训练,能明确地引导学生模型的峰值时间与教师模型对齐,在输出符号集合相同的情况下,可用于不同的架构.T i a n 等[18]设计了一种空白帧消除机制解决K L 散度应用于C T C 后验分布的困难,引入两阶段知识蒸馏过程,第一阶段利用余弦相似性执行特征级知识蒸馏,第二阶段利用提出的一致声学表示学习方法和空白帧消除机制执行S o f t m a x 级知识蒸馏.Y o o n 等[19]深入探索了异构教师和学生之间的知识蒸馏,首先使用表示级知识蒸馏初始化学生模型参数,利用帧加权选择教师模型关注的帧;然后用S o f t m a x 级知识蒸馏传输S o f t m a x 预测.文献[19]研究表明,在C T C 框架下蒸馏帧级后验概率分布,使用L 2损失比传统的K L 散度更合适,原因在于两个分布之间的L 2距离总是有界的,所以可提高蒸馏损失的数值稳定性.进一步,Y o o n 等[20]又在学生网络中添加多个中间C T C 层,采用文献[19]提出的S o f t m a x 级蒸馏将教师的知识迁移到中间的C T C 层.在R N N -T 模型中,由于其损失由编码器生成的特征和输出符号的所有可能对齐求和获得的负对数条件概率给出,所以需要更精妙地设计蒸馏损失才能实现这类模型的知识蒸馏.P a n c h a p a g e s a n 等[21]提出了一种基于格的蒸馏损失,该损失由教师和学生条件概率在整个R N N -T 输出概率格上的K L 散度之和给出.为提升效率,损失被计算为粗粒度条件概率之间的K L 散度之和,粗粒度条件概率指仅跟踪 y(在特定输出步骤的正确标签)㊁ 空白 标签(用于在时间方向上改变对齐)和其余标签的条件概率,将维度从输出词典大小粗粒度到三维.Z h a o 等[22]提出了一种基于对数课程模块替换的知识蒸馏方法,应用模块替换压缩教师模型,将包含较多层的教师模型替换为包含较少层的学生模型.该方法可以使学生和教师在梯度级别进行互动,能更有效地传播知识.此外,由于它只使用R N N -T损失作为训练损失,没有额外的蒸馏损失,所以不需要调整多个损失项之间的权重.V i gn e s h 等[23]探421 吉林大学学报(理学版) 第62卷索教师和学生编码器的共同学习以对R N N -T 语音识别模型进行压缩,共享共同解码器的同时协同训练,并指出其优于预先训练的静态教师.T r a n s d u c e r 的编码器输出具有很高的熵,并包含有关声学上相似的词片混淆的丰富信息,当与低熵解码器输出相结合以产生联合网络输出时,这种丰富的信息被抑制.因此引入辅助损失,以从教师T r a n s d u c e r 的编码器中提取编码器信息.R a t h o d 等[24]提出了对C o n f o r m e rT r a n s d u c e r 的多级渐近式压缩方法,将前一阶段获得的学生模型作为当前阶段的教师模型,为该阶段蒸馏出一个新的学生模型,重复上述过程直到达到理想尺寸.由于A E D 语音识别模型的解码相对更简单,所以知识蒸馏方法较容易推广到A E D 模型.W u 等[25]通过将m i x u p 数据增强方法与S o f t m a x 级知识蒸馏相结合,对S p e e c h -T r a n s f o r m e r 语音识别模型进行压缩,从数据角度优化蒸馏,使学生模型在更真实的数据分布上训练.M u n i m 等[26]研究了对于A E D 模型的序列级蒸馏,通过b e a ms e a r c h 为每个样本生成多个假设,然后将教师生成的假设作为伪标签在同一数据集上训练学生模型.L Ü等[27]认为目前流行的知识蒸馏方法只蒸馏了教师模型中有限数量的块,未能充分利用教师模型的信息,因此提出了一种基于特征的多层次蒸馏方法,以学习教师模型所有块总结出的特征表示.此外,还将任务驱动的损失函数集成到解码器的中间块中以提高学生模型的性能.多个不同的A S R 系统通常会产生给定音频样本的不同转录,对多个教师模型进行集成可以包含更多的补充信息,给学生模型提供多样的可学习分布,常具有更好的性能.C h e b o t a r 等[28]通过平均单独训练的神经网络模型的上下文相关状态后验概率构建高精度的教师模型,然后使用标准交叉熵训练学生模型.G a o 等[29]遵循交叉熵损失和C T C 损失与错误率没有直接关系的事实,考虑序列级错误率定义了3种蒸馏策略,将错误率指标集成到教师选择中,直接面向语音识别的相关评价指标蒸馏和优化学生模型.Y a n g 等[30]提出将集成教师模型的精髓知识传播到学生模型,只保留教师模型的前k 个输出,并且以多任务的方式使用教师模型产生的软标签和真实标签训练学生模型.2 基于模型量化的方法将模型权重和激活量化为低精度,能达到压缩神经网络㊁减少内存占用㊁降低计算成本的目的.量化方案按阶段不同可划分为后训练量化(p o s t -t r a i n i n gq u a n t i z a t i o n ,P T Q )和量化感知训练(q u a n t i z a t i o n -a w a r e t r a i n i n g ,Q A T )两种.后训练量化将已经训练好的浮点模型转换成低精度模型,在不重新训练网络的前提下获取网络的量化参数,通常会导致模型性能下降,其量化过程如图3(A )图3 模型量化流程F i g .3 F l o wc h a r t o fm o d e l qu a n t i z a t i o n 所示.量化感知训练在量化的过程中,对网络进行训练,从而使网络参数能更好地适应量化带来的影响,减轻性能损失,其量化过程如图3(B )所示.量化感知训练会向全精度模型中插入伪量化节点,伪量化节点先对权重或激活进行量化得到真实值对应的整数值,然后反量化回浮点表示,模拟量化产生的误差,使模型在训练时适应量化操作.在伪量化过程中产生的量化权重/激活,无法进行梯度计算,一般通过直通估计器(s t r a i g h t t h r o u gh e s t i m a t o r ,S T E )将梯度传回伪量化前的权重/激活上.无论是后训练量化,还是量化感知训练,其主要目的是为了确定缩放因子和零点两个参数,以用于对全精度(32-b i t)模型的量化.线性量化一般可表示为q =ro u n d r S +æèçöø÷Z ,(1)其中:r 表示32-b i t 浮点值;q 表示量化值;S 为缩放因子,表示浮点数和整数之间的比例关系;Z 表示521 第1期 时小虎,等:自动语音识别模型压缩算法综述零点,即F P 32浮点数r =0对应的量化值;r o u n d 为四舍五入取整操作.反量化公式为^r =S (q -Z ),(2)其中S 和Z 的计算方法为S =r m a x -r m i nq m a x -q m in ,(3)Z =r o u n d q m ax -r m a x æèçöø÷S ,(4)r m a x 和r m i n 分别为r 的最大值和最小值,q m ax 和q m i n 分别为q 的最大值和最小值.如上所述,基于模型量化的语音识别模型压缩方法也分为后训练量化方法和量化感知训练方法.后训练量化一般使用部分真实数据作为校准数据集对全精度模型进行前向传播,获得网络各层权重和激活的数据分布特性(如最大㊁最小值),用于后续量化参数的计算.M c G r a w 等[31]将模型参数量化为8-b i t 整型表示,采用均匀线性量化器,假设给定范围内的值均匀分布.首先找到原始参数的最小值和最大值,然后使用一个简单的映射公式确定一个比例因子,当乘以参数时,该比例因子在较小的精度尺度上均匀地扩展值,从而获得原始参数的量化版本.H e 等[32]使用一种更简单的量化方法,不再有明确的零点偏移,因此可假设值分布在浮点零的周围,避免了在执行较低精度的运算(如乘法)前必须应用零点偏移,从而加快了执行速度.不同于一般的后训练量化方法,K i m 等[33]提出了一种仅整数z e r o -s h o t 
量化方法,先生成合成数据再进行校准,将合成数据输入到目标模型中以确定激活的截断范围,即使无法访问训练和/或验证数据,也能达到较好的性能.后训练量化无需微调模型,相对更简便;而量化感知训练通过微调网络学习得到量化参数,通常能保证量化损失精度更小,因此量化感知训练成为量化研究的主流.下面介绍采用量化感知训练对语音识别模型进行压缩的方法.针对伪量化方法在量化下计算权重梯度较难,从而忽略了反向传播过程中梯度计算的量化损失问题,N g u ye n 等[34]提出了一种基于绝对余弦正则化的新的量化感知训练方法,在模型权重上施加一个与量化模型相似的分布,将权重值驱动到量化水平.针对伪量化在训练时和推理时会产生数值差异的缺点(在训练时执行浮点运算,推理时执行整数运算),D i n g 等[35]用原生量化感知训练得到了4-b i t 的C o n f o r m e rA S R 模型,原生量化感知训练使用整数运算执行量化操作(如矩阵乘法),使训练和推理时的准确性没有任何区别.F a s o l i 等[36]研究表明,对基于L S T M 的大规模语音识别模型进行4-b i t 量化常会伴随严重的性能下降,为此提出了一种新的量化器边界感知裁剪(b o u n da w a r e c l i p p i n g ,B A C ),可根据d r o p o u t 设置对称边界,选择性地将不同的量化策略应用于不同的层.Z h e n 等[37]针对现有的量化感知训练方法需要预先确定和固定量化质心的缺点,提出了一种无正则化㊁模型无关的通用量化方案,能在训练期间确定合适的量化质心.上述研究都将模型量化为4-b i t 或8-b i t ,X i a n g 等[38]提出了二进制的深度神经网络(D N N )模型用于语音识别,并实现了高效的二进制矩阵乘法,使得量化结果能在真实的硬件平台上实现加速效果.图4 矩阵分解F i g .4 M a t r i xd e c o m po s i t i o n 3 基于低秩分解的方法通过近似分解矩阵或张量也是神经网络压缩的一种重要方法.主要思想是将原来大的权重矩阵或张量分解为多个小的权重矩阵或张量,用低秩矩阵或张量近似原有的权重矩阵或张量.如图4所示,将m ˑn 大小的权重矩阵A ,分解为m ˑr 大小的权重矩阵B 和r ˑn 大小的权重矩阵C 的乘积.一般要求r 远小于m 和n ,因此参数量由m ˑn 大幅度缩减到r ˑ(m +n ).一些学者采用低秩分解的方法进行了A S R 模型的压缩.如S a i n a t h 等[39]认为由于网络经过大量输出目标的训练以实现良好性能,因此网络参数大多数位于最终的权重层,所以对D N N 网络最后的621 吉林大学学报(理学版) 第62卷权重层使用低秩矩阵分解,将大小为m ˑn ㊁秩为r 的矩阵A 分解为A =B ˑC ,其中B 是大小为m ˑr的满秩矩阵,C 是大小为r ˑn 的满秩矩阵.M o r i 等[40]在端到端A S R 框架中对门控循环单元的权重矩阵进行T e n s o r -T r a i n 分解,能用多个低秩张量表示循环神经网络(R N N )层内的密集权重矩阵.按照T e n s o r -T r a i n 格式,如果对每个k ɪ{1,2, ,d }和k 维索引j k ɪ{1,2, ,n k },存在一个矩阵G k [j k ],则k 维张量W 中的每个元素可表示为W (j 1,j 2, ,j d -1,j d )=G 1[j 1]㊃G 2[j 2]㊃ ㊃G d -1[j d -1]㊃G d [j d ].(5)H e 等[41]采用了与上述文献类似的方法对D N N 语音识别模型进行压缩.X u e 等[42]在D N N 模型的权重矩阵上应用奇异值分解,对于一个m ˑn 的权重矩阵P ,对其应用奇异值分解,可得到P m ˑn =U m ˑn Σn ˑn V T n ˑn ,其中Σ是一个对角矩阵,P 的奇异值在对角线上以递增顺序排列.由于P 是一个稀疏矩阵,其奇异值大多数很小,所以可以只保留部分大奇异值,构成更小的对角矩阵,获得更大的压缩.4 基于网络剪枝的方法网络剪枝是对深度学习模型进行修剪的一种常见技术,其通过衡量权重的贡献确定保留或舍弃权重.剪枝方法可分为非结构化和结构化两类:非结构化剪枝去除单个权重;结构化剪枝按结构单元(如神经元节点㊁卷积核㊁通道等)对权重进行修剪.图5为网络剪枝的示意图,其中图5(C )以神经元节点剪枝为例对结构化剪枝进行了描述,相当于删除权重矩阵的整行或整列.剪枝由3个步骤组成:第一步是从预训练的完整神经网络进行初始化;第二步根据一定标准对某些权重连接进行删除,将网络逐步稀疏化;第三步微调得到的稀疏网络,一般将修剪过的权重及其梯度设置为零.交替执行第二步和第三步,直到达到预定的稀疏水平.基于网络剪枝的语音识别模型压缩方法也分为非结构化和结构化两类.图5 网络剪枝示意图F i g .5 S c h e m a t i c d i a g r a mo f n e t w o r k p r u n i n g非结构化剪枝也称为稀疏剪枝,因为其会导致稀疏的矩阵.Y u 等[43]利用非结构化剪枝对上下文相关的D N N -HMM 语音识别模型进行压缩,将执行减少D N N 连接数量的任务作为软正则化和凸约束优化问题的目标,并进一步提出了新的数据结构利用随机稀疏模式节省存储且加快解码.W u 等[44]提出了动态稀疏神经网络,无需单独训练就能获得不同稀疏度级别的模型,以满足不同规格的硬件和不同延迟的要求,该方法使用基于梯度的剪枝修剪权重,以权重ˑ梯度的L 1范数作为修剪标准.非结构化剪枝通常会产生具有竞争力的性能,但由于其不规则的稀疏性通常只有在特定的硬件上才能产生加速效果,而结构化剪枝的部署相对更容易.H e 等[45]提出了一种节点剪枝方法对D N N 语音识别模型进行压缩,提出了几种节点重要性函数对D N N 的隐层节点按分数排序,删除那些不重要节点的同时,将其对应的输入和输出连接同时删除.O c h i a i 等[46]利用G r o u p La s s o 正则化对D N N -HMM 语音识别系统中的D N N 隐层节点进行选择,测试了两种类型的G r o u p La s s o 正则化,一种用于输入权重向量,另一种用于输出权重向量.T a k e d a 等[47]集成了权重和节点活动的熵,提出了衡量节点重要性的分数函数,以用于不重要节点的选择,每个节点的权重熵考虑节点的权重数量及其模式,每个节点活动的熵用于查找输出完全不变的节点.L i u 等[48]通过剪枝学习一个双模式的语音识别模型,即剪枝㊁稀疏的流式模型和未剪枝㊁密集的非流式模型.在训练过程中两个模型共同学习,除作用在流式模型721 第1期 时小虎,等:自动语音识别模型压缩算法综述821吉林大学学报(理学版)第62卷上的剪枝掩码,两种模式共享参数,使用G r o u p L a s s o正则化进行结构化剪枝分组变量的选择.5基于参数共享的方法除上述常用的压缩方法,还有利用参数共享压缩语音识别模型的研究,其主要思路是使多个结构相同的网络层共享参数,从而达到降低模型参数的作用.其方法框架如图6所示,编码器由n个结构完全相同的编码器层构成,所有编码器层共享参数,因此编码器的参数量从nˑQ压缩到Q,Q为每个编码器层的参数量.T r a n s f o r m e r模型的编码器和解码器各自堆叠了多个块,更深的堆叠结构可获得更好的识别结果,但它会显著增加模型大小,也会增加解码延迟.L i等[49]在堆叠结构中共享参数,而不是在每层引入新的参数,以压缩T r a n s f o r m e r语音识别模型的参数,先创建第一个编码器块和解码器块的参数,其他块重用第一个块的参数.共享参数方法不会减少计算量,只减少了参数量,但缓存交换的减少也可以加快训练和解码过程.为减轻参数数量减少导致的性能下降,该方法中还引入了语音属性以增强训练数据.图6编码器参数共享F i g.6E n c o d e r p a r a m e t e r s h a r i n 
g6组合模型上述各方法从不同方面压缩模型:知识蒸馏通过改变模型结构压缩模型;模型量化通过将权重由全精度转换为低精度表示缩小模型大小,对激活量化能提高计算效率,达到进一步的加速;分解方法通过对权重分解压缩模型;网络剪枝通过丢弃不必要的权重压缩模型;参数共享通过重用参数压缩模型.这些方法都起到了缩小内存占用㊁降低计算成本㊁减少推理延迟的作用.一些研究人员尝试组合以上几种不同类型的方法,以获得更好的效果.如L i等[50]通过联合知识蒸馏和剪枝两种技术对D N N 模型进行压缩,通过优化交叉熵损失进行知识蒸馏,在剪枝过程中对所有参数进行排序,然后修剪低于全局阈值的权重.Y u a n等[51]结合剪枝和量化方法压缩语音识别模型,与大多数逐步将权重矩阵置零的方法不同,剪枝过程中保持修剪权重的值,通过掩码控制剪枝的元素,修剪的权重元素可在稍后的训练阶段恢复;这项工作探索了两种量化方案:混合量化和整数量化,两种方案都以后训练量化的形式出现.G a o等[52]利用跨层权重共享㊁非结构化剪枝和后训练量化压缩模型,通过对称量化将权重和激活映射到8位整数.L e t a i f a等[53]探索剪枝和量化方法的组合压缩T r a n s f o r m e r模型,分3个步骤完成:首先设置剪枝率,然后剪枝模型,最后量化模型.综上所述,深度学习由于具有强大的表示能力在语音识别方面取得了显著成果,增加模型容量和使用大数据集被证明是提高性能的有效方式.随着物联网的迅速发展,在资源受限设备上部署基于深度学习的语音识别系统有越来越多的需求,在许多场景都需要借助语音识别进行便捷㊁智能的交互,如智能家电㊁智能客服等.而面对设备内存㊁计算能力和功耗的资源受限,需要对模型进行压缩才能实现部署.本文对现有的自动语音识别模型压缩方法进行了全面的综述,根据所使用的压缩技术将其分为知识蒸馏㊁模型量化㊁矩阵分解㊁网络剪枝㊁参数共享和组合模型几种类型,并对不同类型方法进行了系统综述,可为语音识别模型的许多实际应用场景中如何在资源受限设备上实现压缩和部署提供有益的指导.。
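Section 2 of the survey above (model quantization) states the standard affine quantization relations (1)-(4): q = round(r/S + Z), the dequantization r_hat = S(q - Z), the scale S = (r_max - r_min)/(q_max - q_min), and the zero point Z = round(q_max - r_max/S). A minimal post-training-style sketch of these formulas for 8-bit weights follows; calibrating S and Z from the tensor's own minimum and maximum is an illustrative choice, not a prescription from any of the cited systems.

```python
import numpy as np

def quantize_tensor(r, num_bits=8):
    """Affine (asymmetric) linear quantization of a float tensor.

    Implements q = round(r / S + Z) with
        S = (r_max - r_min) / (q_max - q_min)
        Z = round(q_max - r_max / S)
    and returns the integer tensor together with (S, Z) for dequantization.
    """
    qmin, qmax = 0, 2 ** num_bits - 1               # unsigned range, e.g. 0..255 for 8-bit
    rmin, rmax = float(r.min()), float(r.max())
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)     # keep 0.0 exactly representable
    S = (rmax - rmin) / (qmax - qmin)
    Z = int(round(qmax - rmax / S))
    q = np.clip(np.round(r / S + Z), qmin, qmax).astype(np.uint8)
    return q, S, Z

def dequantize_tensor(q, S, Z):
    """Inverse mapping r_hat = S * (q - Z)."""
    return S * (q.astype(np.float32) - Z)

# Example: quantize a random weight matrix and measure the reconstruction error
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(4, 4)).astype(np.float32)
q, S, Z = quantize_tensor(W)
W_hat = dequantize_tensor(q, S, Z)
print(np.abs(W - W_hat).max())    # worst-case rounding error, roughly S / 2
```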

Algorithmic Efficiency in Computational Problems


Algorithmic efficiency in computational problems refers to the ability of an algorithm to solve a problem in the most efficient manner possible. In computer science, algorithmic efficiency is a key concept that plays a crucial role in the design and analysis of algorithms. It is important to analyze and compare the efficiency of different algorithms in order to determine the best algorithm for a given problem.

Several factors contribute to the efficiency of an algorithm, including time complexity, space complexity, and the quality of the algorithm design. Time complexity refers to the amount of time it takes for an algorithm to solve a problem, while space complexity refers to the amount of memory space required by an algorithm to solve a problem. The quality of algorithm design includes factors such as the choice of data structures and the way the algorithm is implemented.

One important measure of algorithmic efficiency is big O notation, which provides an upper bound on the growth rate of an algorithm. Big O notation allows us to compare the efficiency of different algorithms and make informed decisions about which algorithm to use for a particular problem. For example, an algorithm with a time complexity of O(n) is considered more efficient than an algorithm with a time complexity of O(n^2) for large input sizes.

In order to improve the efficiency of algorithms, it is important to understand the theory behind algorithm design and analysis. This includes understanding different algorithm design techniques such as divide and conquer, dynamic programming, and greedy algorithms. By using these techniques, it is possible to design algorithms that solve problems in a faster and more resource-efficient manner.

In addition to understanding algorithm design techniques, it is also important to consider the specific characteristics of the problem at hand when designing algorithms. For example, some problems may have specific constraints that can be exploited to improve algorithm efficiency. By taking these constraints into account, it is possible to design algorithms that are tailored to a specific problem and can solve it more efficiently.

Another key aspect of algorithmic efficiency is the implementation of algorithms. The choice of programming language, data structures, and optimization techniques can all affect the efficiency of an algorithm. By optimizing the implementation of an algorithm, it is possible to reduce its time and space complexity and improve its overall efficiency.

Overall, algorithmic efficiency is a fundamental concept in computer science that plays a crucial role in the design and analysis of algorithms. By understanding the theory behind algorithm design and analysis, and by carefully considering the specific characteristics of the problem at hand, it is possible to design algorithms that are efficient, fast, and resource-efficient. This can lead to significant improvements in the performance of computational problems and the development of more effective software applications.
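As a concrete illustration of the O(n) versus O(n^2) comparison above, consider checking whether a list contains a duplicate value: a nested-loop version does quadratic work, while a hash-set version does linear expected work. The example below is a generic sketch with illustrative function names.

```python
def has_duplicate_quadratic(items):
    """O(n^2): compare every pair of elements."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    """O(n) expected time: remember what has been seen in a hash set."""
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

# Both return the same answer; for large inputs the set-based version is far
# faster because its work grows linearly rather than quadratically with n.
data = list(range(100_000)) + [42]
print(has_duplicate_quadratic(data[:1000]), has_duplicate_linear(data))
```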

A Survey on Secure Storage in Cloud Computing


A Survey on Secure Storage in Cloud Computing

Abstract

Cloud computing is an environment for providing information and resources that are delivered as a service to end users over the Internet on demand. The cloud thus enables users to access their data from any geographical location at any time and has also brought benefits in the form of online storage services. Cloud storage services avoid expensive outlays on software and personnel maintenance while providing better performance, lower storage cost, and scalability. However, maintaining stored data in a secure manner is not an easy task in a cloud environment, especially since the stored data may not be completely trustworthy. The cloud delivers services through the Internet, which increases their exposure to storage security vulnerabilities. Security is therefore one of the major drawbacks preventing several large organizations from entering the cloud computing environment. This work surveys several existing cloud storage frameworks and techniques, along with their advantages and drawbacks, and also discusses the challenges involved in implementing secure cloud data storage. The survey results help identify future research areas and methods for overcoming the existing drawbacks.

Key words: Cloud Computing, Data, Security, Storage Techniques, Survey.

1. Introduction

Cloud computing is a kind of computing whereby shared resources and IT-related capabilities are provided as a service to external customers using Internet technologies. Cloud computing depends on sharing information and computing resources instead of using local servers or personal devices to manage applications. Cloud computing has begun to receive wide attention in corporate organizations because it enables the data center to work like the Internet, sharing and accessing resources in a safe and secure manner. To provide data storage services, cloud computing utilizes networks of enormous numbers of servers, generally running lower-cost consumer PC technology with specialized connections, to disperse data-processing tasks across end users.

The reason for moving into the cloud is simply that the cloud allows users to access applications from anywhere at any time through the Internet, whereas in the past consumers ran their programs and applications from software downloaded onto a physical server in their home or building. The cloud provides benefits such as flexibility, disaster recovery, automatic software updates, a pay-per-use model, and cost reduction. However, the cloud also involves major risks such as security, data integrity, network dependency, and centralization. When storing customers' data in cloud storage, security plays a vital role. Customers sometimes store sensitive information in the cloud storage environment, which raises serious security issues, so protecting such sensitive information is one of the difficult problems in cloud computing. In preceding works, several methods have been proposed for securely storing data in the cloud. This paper discusses those methodologies and the various techniques for storing data effectively, analyzes their advantages and drawbacks, and provides some directions for future research work.

2. Storage Techniques in Cloud Computing

In this section, various existing techniques are discussed.
Cloud storage is regarded as a system of disseminated data centers that generally utilizes virtualization technology and supplies interfaces for data storage.2.1 Implicit Storage Security to Data in OnlineProviding implicit Storage Security to data in Online is more beneficial in a cloud environment. Presented implicit storage security architecture for storing data where security is disseminated among many entities and also look at some common partitioning methods. So data partitioning scheme is proposed for online data storage that involves the finite field polynomial root. This strategy comprises of two partitioning scheme. Partitioned data are saved on cloud servers that are chosen in a random manner on network and these partitions are regained in order to renovate the master copy of data. Data pieces are accessible to one who has knowledge of passwords and storage locations of partitioned pieces.2.2 Identity-Based AuthenticationIn Cloud Computing, resources and services are distributed across numerous consumers. So there is a chance of various security risks. Therefore authentication of users as well as services is an important requirement for cloud security and trust. When SSL Authentication Protocol (SAP) was employed to cloud, it becomes very complex. As an alternative to SAP, proposed a new authentication protocol based on identity which is based on hierarchical model with corresponding signature and encryption schemes. Signature and encryption schemes are proposed to achieve security in cloud communication. When comparing performance, authentication protocol based on identity is very weightless and more efficient and also weightless protocol for client side.2.3 Public Auditing with Complete Data Dynamics SupportVerification of data integrity at unreliable servers is the major concern in cloud storage. Proposed scheme first focused to discover the potential security threats and difficulties of preceding works and build a refined verification scheme Public auditing system with protocol that supports complete dynamic data operations is presented. To accomplish dynamic data support, the existent proofread of PDP or PoR scheme is improved by spoofing the basic Markle Hash Tree (MHT). Proposed system extended in the direction of allowing TPA to perform many auditing jobs by examining the bilinear aggregate signature technique.2.4 Efficient Third Party Auditing (TPA)Cloud consumers save data in cloud server so that security as well as data storage correctness is primary concern. A novel and homogeneous structure is introduced to provide security to different cloud types. To achieve data storage security, BLS (Boneh–Lynn–Shacham) algorithm is used to signing the data blocks before outsourcing data into cloud. BLS (Boneh–Lynn–Shacham) algorithm is efficient and safer than the former algorithms. Batch auditing is achieved by using bilinear aggregate signature technique simultaneously. Reed-Solomon technique is used for error correction and to ensure data storage correctness. Multiple batch auditing is an important feature of this proposed work. It allows TPA to perform multiple auditingtasks for different users at the same.2.5 Way of Dynamically Storing Data in CloudSecurely preserving all data in cloud is not an easy job when there is demand in numerous applications for clients in cloud. Data storage in cloud may not be completely trustable because the clients did not have local copy of data stored in cloud. 
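The Merkle Hash Tree (MHT) used by the public-auditing scheme of Section 2.3 is the basic structure that lets an auditor check a single data block against a short root digest. The sketch below is a generic MHT construction and proof check using SHA-256; it illustrates only the data structure, not the full PDP/PoR protocol of the cited work, and all names are illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle_tree(blocks):
    """Return the list of tree levels, leaf hashes first, root level last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:                 # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def merkle_proof(levels, index):
    """Sibling hashes (with the node's side) needed to rebuild the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1                     # the neighbor within the pair
        proof.append((level[sibling], index % 2))   # 1 means the current node is the right child
        index //= 2
    return proof

def verify_block(block, proof, root):
    digest = h(block)
    for sibling, node_is_right in proof:
        digest = h(sibling + digest) if node_is_right else h(digest + sibling)
    return digest == root

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
levels = build_merkle_tree(blocks)
root = levels[-1][0]
proof = merkle_proof(levels, 2)
print(verify_block(b"block-2", proof, root))    # True
print(verify_block(b"tampered", proof, root))   # False
```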
To address these issues, proposed a new protocol system using the data reading protocol algorithm to check the data integrity. Service providers help the clients to check the data security by using the proposed effective automatic data reading algorithm. To recover data in future, also presented a multi server data comparison algorithm with overall data cal culation in each update before outsourcing it to server’s remote access point.2.6 Effective and Secure Storage ProtocolCurrent trend is users outsourcing data into service provider who have enough area for storage with lower storage cost. A secure and efficient storage protocol is proposed that guarantees the data storage confidentiality and integrity. This protocol is invented by using the construction of Elliptic curve cryptography and Sobol Sequence is used to confirm the data integrity arbitrarily. Cloud Server challenges a random set of blocks that generates probabilistic proof of integrity. Challenge-Response protocol is credential so that it will not exposes the contents of data to outsiders. Data dynamic operations are also used to keep the same security assurance and also provide relief to users from the difficulty of data leakage and corruptions problems.2.7 Storage Security of DataResources are being shared across internet in public surroundings that creates severe troubles to data security in cloud. Transmitting data over internet is dangerous due to the intruder attack. So data encryption plays an important role in Cloud environment. Introduced a consistent and novel structure for providing security to cloud types and implemented a secure cross platform. The proposed method includes some essential security services that are supplied to cloud system. A network framework is created which consist of three data backups for data recovery. These backups located in remote location from main server. This method used SHA Hash algorithm for encryption, GZIP algorithm for compression and SFSPL algorithm for splitting files. Thus, a secure cross platform is proposed for cloud computing.2.8 Secure and Dependable Storage ServicesStorage service of cloud permits consumers to place data in cloud as well as allowed to utilize the available well qualified applications with no worry about data storage maintenance. Although cloud provides benefits, such a service gives up the self-control of user’s data that introduced fresh vulnerability hazards to cloud data correctness. To handle the novel security issue, accomplish the cloud data integrity and availability assurances, a pliable mechanism is proposed for auditing integrity in a dispersed manner. Proposed mechanism allows users to auditing the cloud data storage and this auditing result utilized Homomorphic token with Reed-Solomon erasure correcting code technique that guarantee the correctness insurance and also identifying misconduct servers rapidly. The proposed design is extended to support block-level data dynamic operations. If cloud consumer is notable to possess information, time and utility then the users can assign their job to an evaluator i.e. TPA for auditing process in safe manner.2.9 Optimal Cloud Storage SystemsCloud data storage which requires no effort is acquiring more popularity for individual, enterprise and institutions data backup and synchronization. A taxonomic approach to attain storage service optimality with res ource provider, consumer’s lifecycle is presented. 
Proposed scheme contributes storage system definition, storage optimality, ontology for storage service and controller architecture for storage which is conscious of optimality. When compared with existing work, more general architecture is created that works as a pattern for storage controller. A new prototype NubiSave is also proposed which is available freely and it implements almost all of RAOC concepts.2.10 Process of access and Store Small Files with StorageTo support internet services extensively, Hadoop distributed file system (HDFS) is acquired. Several reasons are examined for small file trouble of native Hadoop distributed file system: Burden on NameNode of Hadoop distributed file system is enforced by large amount of small files, for data placement correlations are not considered, prefetching mechanism is not also presented. In order to overcome these small size problems, proposed an approach that improves the small files efficiency on Hadoop distributed file system. Hadoop distributed file system is an Internet file system representative, which functioning on clusters. The cut-off point is measured in Hadoop distributed file system’s circumstance in an experimental way, which helps to improve I/O performance. From taxonomic way, files are categorized as independent files, structurally and logically-related files. Finally prefetching technique is used to make better access efficiency and considering correlations when files are stored.2.11 File Storage Security MaintenanceTo assure the security of stored data in cloud, presented a system which utilizes distributed scheme. Proposed system consists of a master server and a set of slave servers. There is no direct communication link between clients and slave servers in the proposed model. Master server is responsible to process the client’s requests and at slave server chunking operation is carried out to store copies of files in order to provide data backup for file recovery in future. Users can also perform effective and dynamic data operations. Clients file is stored in the form of tokens on main server and files were chunked on slave servers for file recovery. Thus proposed scheme achieved storage correctness insurance and data availability by using Token generation algorithm with homomorphic token and merging algorithm were used.2.12 File Assured Deletion (FADE) for Secure StorageProposed a file assured deletion scheme based on policy to dependably efface files of cancelled file access policies. Working prototype of FADE is implemented at the top of Amazon S3. Performance overhead is also evaluated on Amazon S3.2.12.1. File Assured Deletion Based on PolicyData file is logically connected with file access policy and a data key. Each file access policy should be attached with control key. Maintenance of control key is theresponsibility of key manager. When a policy is cancelled, control key of that policy will be dispatched from the key manager. The main idea is as follows: each file with data key is saved and control key is used to protect data key. Here key manager is responsible for retaining keys. The control key is deleted when a policy is cancelled. So that the encrypted file and data key could not be regained. In case the file is removed still a copy exists, that file is encrypted and unavailable to everyone. Multiple policies such as conjunctive and disjunctive policies are also presented. Conjunctive policies are used to recover file by satisfying all policies whereas disjunctive policies satisfying only one policy. 
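The policy-based deletion idea behind FADE can be made concrete with a short, hedged sketch. The snippet below is not the authors' implementation (real FADE runs a separate key manager with blinded key operations on top of Amazon S3); it only shows the core mechanism: each file's data key is wrapped by a per-policy control key, so discarding the control key renders the stored ciphertext unrecoverable. The `KeyManager` class and the policy name are illustrative assumptions.

```python
# Minimal sketch of policy-based assured deletion (FADE-style), not the original
# FADE implementation. Requires the third-party 'cryptography' package.
from cryptography.fernet import Fernet

class KeyManager:
    """Illustrative stand-in for FADE's key manager: one control key per policy."""
    def __init__(self):
        self._control_keys = {}

    def control_key(self, policy: str) -> bytes:
        # Create the control key for a policy on first use.
        return self._control_keys.setdefault(policy, Fernet.generate_key())

    def revoke(self, policy: str) -> None:
        # Revoking a policy deletes its control key; data keys wrapped under it
        # can no longer be unwrapped, so the files are assuredly deleted.
        self._control_keys.pop(policy, None)

def store(km: KeyManager, policy: str, plaintext: bytes):
    data_key = Fernet.generate_key()                  # per-file data key
    ciphertext = Fernet(data_key).encrypt(plaintext)  # encrypt file with data key
    wrapped_key = Fernet(km.control_key(policy)).encrypt(data_key)  # wrap data key
    return ciphertext, wrapped_key                    # both go to cloud storage

def retrieve(km: KeyManager, policy: str, ciphertext: bytes, wrapped_key: bytes) -> bytes:
    data_key = Fernet(km.control_key(policy)).decrypt(wrapped_key)
    return Fernet(data_key).decrypt(ciphertext)

if __name__ == "__main__":
    km = KeyManager()
    ct, wk = store(km, policy="project-x", plaintext=b"sensitive report")
    print(retrieve(km, "project-x", ct, wk))  # b'sensitive report'
    km.revoke("project-x")
    # retrieve(km, "project-x", ct, wk) would now fail: a fresh control key
    # cannot unwrap the old data key, so the ciphertext stays unreadable.
```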
The conclusion is that FADE is practical: it supports dynamic data operations, requires few cryptographic operations, and adds only a small metadata overhead.

2.13 Accessing Outsourced Data Efficiently
An approach is proposed to achieve flexible access control over dynamic, large-scale data in a safe and efficient way. An owner-write, user-read scenario is presented: only the original data owner can update or modify the data, while cloud users can read it according to their access rights. The approach covers key generation, dynamics handling, and overhead analysis. In the key generation part, a key derivation hierarchy is built so that storage overhead stays moderate. The dynamics handling part covers dynamic data operations and changes to user access rights. Eavesdropping is countered by over-encryption and lazy revocation.

3. Conclusion
Cloud computing is an emerging computing paradigm that allows users to share resources and information drawn from a pool of distributed computing infrastructure as a service over the Internet. Although the cloud brings clear benefits, the security and privacy of stored data remain major issues in cloud storage. Cloud storage is far more beneficial than traditional storage systems, especially with respect to scalability, cost reduction, portability, and functionality. This paper presented a survey of secure storage techniques in cloud computing. Several storage techniques that secure data in the cloud were discussed in detail, and the need for further research on storage methods that provide better security and accountability was highlighted. Finally, a comparative analysis of the storage techniques was presented, covering each proposed approach together with its advantages and limitations.
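Several of the auditing schemes surveyed above (Sections 2.3 and 2.4) rest on a Merkle Hash Tree so that a verifier can check individual data blocks against a single signed root. The sketch below is a minimal, generic MHT in Python, not the exact construction of any particular scheme; the block contents and tree shape are illustrative assumptions.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Compute the Merkle root over the leaf hashes of the data blocks."""
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:             # duplicate last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(blocks: list[bytes], index: int) -> list[tuple[bytes, str]]:
    """Sibling hashes (with their side) needed to re-derive the root for one block."""
    level = [_h(b) for b in blocks]
    proof, i = [], index
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = i + 1 if i % 2 == 0 else i - 1
        proof.append((level[sib], "right" if i % 2 == 0 else "left"))
        level = [_h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        i //= 2
    return proof

def verify(block: bytes, proof, root: bytes) -> bool:
    """The auditor recomputes the path from one challenged block up to the root."""
    acc = _h(block)
    for sibling, side in proof:
        acc = _h(acc + sibling) if side == "right" else _h(sibling + acc)
    return acc == root

blocks = [b"block-%d" % i for i in range(5)]
root = merkle_root(blocks)                 # the owner signs and publishes this root
assert verify(blocks[3], merkle_proof(blocks, 3), root)
assert not verify(b"tampered", merkle_proof(blocks, 3), root)
```

In the auditing schemes above, the server returns the challenged blocks together with such sibling paths, and the TPA checks them against the owner's signed root without ever holding the full file.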

Title: Perspectives on the Algorithmic Era

In the contemporary landscape of technological advancements, the algorithmic era has emerged as a defining epoch, reshaping various aspects of human existence. From the way we communicate and consume information to how we conduct business and make decisions, algorithms wield significant influence. This essay aims to delve into the multifaceted dimensions of the algorithmic era, examining its impacts on society, economy, and individual lives.At the heart of the algorithmic era lies the omnipresence of algorithms, intricate sets of instructions designed to perform specific tasks or solve problems. These algorithms permeate our daily lives, powering search engines, social media platforms, recommendation systems, and even financial markets. They analyze vast amounts of data, extract patterns, and generate insights with unprecedented speed and accuracy. Consequently, algorithmshave become indispensable tools for navigating the complexities of the digital age.One of the most notable domains transformed by algorithms is the realm of communication and information dissemination. Social media algorithms, for instance, curate personalized feeds based on users' preferences, behavior, and demographics. While this enhances user experience by presenting relevant content, it also engenders concerns regarding filter bubbles and echo chambers, where individuals are exposed only to information that aligns with their existing beliefs, potentially exacerbating polarization and misinformation.Moreover, algorithms wield profound influence in the realm of commerce and industry. E-commerce platformsutilize recommendation algorithms to suggest products tailored to users' preferences, thereby enhancing sales and customer satisfaction. Similarly, in the financial sector, algorithmic trading algorithms execute high-speed transactions based on complex mathematical models, reshaping market dynamics and posing challenges fortraditional traders.However, the algorithmic era is not without its challenges and ethical dilemmas. One pressing concern is the issue of algorithmic bias, where algorithms inadvertently reflect and perpetuate existing societal inequalities. For example, facial recognition algorithms have been found to exhibit higher error rates for certain demographic groups, leading to discriminatory outcomes in law enforcement and surveillance practices. Addressing these biases requires concerted efforts to enhance algorithmic fairness and transparency, incorporating diverse perspectives in the development and deployment of algorithms.Furthermore, the algorithmic era raises profound questions about privacy and data security. With algorithms processing vast amounts of personal data, concerns regarding data breaches, surveillance, and algorithmic manipulation have escalated. Safeguarding privacy rights and ensuring data protection necessitate robust regulatory frameworks and ethical guidelines to govern the collection,storage, and usage of data in algorithmic systems.Despite these challenges, the algorithmic era presents immense opportunities for innovation and progress. Machine learning algorithms, for instance, hold promise in revolutionizing healthcare by facilitating early disease detection, personalized treatment plans, and drug discovery. 
Likewise, in transportation and urban planning, algorithms can optimize traffic flow, reduce congestion, and enhance sustainability through smart city initiatives.

In conclusion, the algorithmic era signifies a paradigm shift in how we perceive and interact with technology. While algorithms offer unprecedented capabilities to analyze data and automate tasks, they also pose complex challenges regarding fairness, transparency, and privacy. Navigating the algorithmic era requires a balanced approach that harnesses the benefits of algorithms while mitigating their potential harms. By fostering collaboration between technologists, policymakers, and society at large, we can shape a future where algorithms serve as tools for empowerment and advancement, enriching the human experience in the digital age.

Compressed Sensing Recognition Algorithm for Sonar Image Based on Non-negative Matrix Factorization and Adjacency Spectra FeatureYanling Hao, Liang WangCollege of AutomationHarbin Engineering UniversityHarbin, 150001, Chinawlyvi@Abstract— A compressed sensing (CS) recognition algorithm for sonar image based on non-negative matrix factorization and adjacency spectra (A-NMF) feature extraction is proposed in this paper. The feature vector of the tradition CS recognition algorithm is random Gaussian matrix, which causes the identification rate unstable. The stable feature vector which is extracted by the A-NMF feature extraction is used in this algorithm. The feature and structure of the original data can be expressed more accurately by this feature vector. Then the sonar images are classified under the compressed sensing framework. Experimental results show that the sonar image recognition is high, efficient and stable.Keywords- sonar image recogition;compressesd sensing; non-negative matrix factorization; adjacency spectra featureI.I NTRODUCTIONThese years, many countries which have well-developed marine technology have done a series of research and developing work on the automatic underwater vehicle (AUV). AUV is a kind of robot that can finish the work independently and continuously underwater. It can keep away from obstacles and also can be used for target identification, target tracking, independent navigation and so on. With the development of the acoustic imaging technology, it is widely used in ocean development field. The topic of target recognition using sonar image has become an attractive research field [1].Compressed Sensing concept has been proposed by Candés and Donoho in 2006 [2-4]. The main idea is to merge the compression with the sampling. Firstly, the non-adaptive Linear Projection (measured values) of the signal is sampled. Then the measured values are used to reconstruct the primary signal according to the corresponding refactor algorithm. The discriminant features of the traditional compressed sensing is used in [5] to accomplish the face identification. But in traditional compressed sensing identification, the feature vector is random Gaussian matrix, which causes the identification rate unstable. The structural characteristics of the image can not be reflected by the Non-Structural character of the Gaussian matrix. Aiming at these problems, a sonar image recognition method based on the A-NMF and compressed sensing is proposed in this paper. The non-negative matrix factorization has pure additive and sparsity [6], which can decrease the computational complexity of the compressed sensing and has reconstruction effect. The character of spectrum can describe the feature of the structural information in the image [7].II.A-NMF F EATUREA-NMF is a method [8] that combines non-negativematrix factorization and adjacency spectra, and uses imagefeature points to construct adjacency spectra. The adjacencyspectra can show the structural information. 
Then, when carrying out the non-negative matrix factorization, the random initial values are replaced by the adjacency spectra. The resulting feature vector has a strong ability to express the features and structure of the original data.

Assume there are $k$ classes of training samples with $n_1, n_2, \ldots, n_k$ samples per class, and each sample is an $l \times h$ grayscale image stacked into a vector $I \in \mathbb{R}^m$, $m = l \times h$. The $n_i$ samples of the $i$-th class form the columns of the matrix $A_i = [I_{i,1}, I_{i,2}, \ldots, I_{i,n_i}] \in \mathbb{R}^{m \times n_i}$, and $A = [A_1, A_2, \ldots, A_k] = [I_{1,1}, I_{1,2}, \ldots, I_{k,n_k}]$ is the matrix of all training samples, with $n = n_1 + n_2 + \cdots + n_k$ columns in total.

Given the nonnegative training sample matrix $A$ of size $m \times n$, we seek two nonnegative matrices $W$ of size $m \times r$ and $H$ of size $r \times n$ such that

  $A_{m \times n} \approx W_{m \times r} H_{r \times n}$   (1)

where $r$ should satisfy $(m + n) r < mn$. $H$ is the projection of $A$ onto the subspace spanned by $W$, and both $W$ and $H$ are nonnegative.

To obtain an approximate solution of $A \approx W H$, a suitable objective function is needed. The common choices are the Euclidean distance and the K-L divergence [9]. In this paper, the simple Euclidean distance is chosen as the objective function:

  $F = \sum_{i}^{m} \sum_{j}^{n} \left( V_{ij} - (WH)_{ij} \right)^2$   (2)

(here $V$ denotes the nonnegative data matrix being factorized, i.e., $A$). Thus, the nonnegative matrix factorization becomes the following optimization problem: with $W$ and $H$ nonnegative, iterate until the Euclidean distance falls below a chosen tolerance. Furthermore, the adjacency spectra of all samples are chosen as the initial value of $H$.

(1) Initialization. The adjacency spectra of all training samples are selected as the initial value of $H$. Let $V_k$ ($k = 1, 2, \ldots, N$) denote the feature point set of the $k$-th image and $E_k \subseteq V_k \times V_k$ its edge set; the adjacency matrix $B_k$ is defined as

  $B_k(i,j) = \begin{cases} \lVert V_k^i - V_k^j \rVert, & \text{if } (i,j) \in E_k \\ 0, & \text{otherwise} \end{cases}$   (3)

Then an SVD decomposition is applied to $B_k$:

  $B_k = U \Delta U^{T}$   (4)

where $\Delta = \mathrm{diag}\{\rho_k^1, \rho_k^2, \ldots, \rho_k^m\}$, $\rho_k = [\rho_k^1, \rho_k^2, \ldots, \rho_k^m]^T$, and $\rho_k^1 \ge \rho_k^2 \ge \cdots \ge \rho_k^m \ge 0$. Taking the normalized absolute values of $\rho_k$ gives the feature vector $\lambda_k = [\lambda_k^1, \lambda_k^2, \ldots, \lambda_k^m]^T$, $k = 1, 2, \ldots, N$. Finally, $\lambda_k$ is used as the initial value of $H_k$ in the NMF, and the initial value of $W_k$ is computed from equation (1).

(2) Iteration of $W$ and $H$:

  $H_{ij} \leftarrow H_{ij} \dfrac{(W^T V)_{ij}}{(W^T W H)_{ij}}$   (5)

  $W_{ij} \leftarrow W_{ij} \dfrac{(V H^T)_{ij}}{(W H H^T)_{ij}}$   (6)

The iteration stops as soon as the Euclidean distance is smaller than the given threshold; the resulting $W$ and $H$ are the factorization matrices.

III. PROPOSED RECOGNITION ALGORITHM BASED ON A-NMF AND CS

A. Feature representation of traditional CS

For recognition based on sparse representation, if the training samples of one class lie on a linear subspace, a new test sample can be written as a linear combination of training samples of the same class: any new test sample $y \in \mathbb{R}^m$ from the $i$-th class can be represented as

  $y = \alpha_{i,1} I_{i,1} + \alpha_{i,2} I_{i,2} + \cdots + \alpha_{i,n_i} I_{i,n_i} = \sum_{j=1}^{n_i} \alpha_{i,j} I_{i,j}, \quad \alpha_{i,j} \in \mathbb{R}, \; j = 1, 2, \ldots, n_i$   (7)

Since the class that $y$ belongs to is unknown, it is reasonable to represent $y$ as a combination of the entire set of training samples:

  $y = A x_0 \in \mathbb{R}^m$   (8)

where the coefficient vector is $x_0 = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^T \in \mathbb{R}^n$; all coefficients are zero except those associated with the $i$-th class. In traditional compressed sensing, the image is projected into a lower-dimensional space by a matrix $R \in \mathbb{R}^{d \times m}$, $d \ll m$, where $R$ is usually a random Gaussian matrix whose entries are drawn independently from $N(0, 1/N)$:

  $\tilde{y} = R y = R A x_0 \in \mathbb{R}^d$   (9)

B. Feature representation and classification of the improved CS

Using the A-NMF algorithm, the column vectors of $W$ are obtained and used as the structured subspace of basis vectors. Every column vector $y_i$ of the sample image is projected onto this subspace, $\tilde{y}_i = W^T y_i$, which yields an $r$-dimensional feature vector; $W^T$ therefore plays the role of the feature (measurement) matrix for the target image. Because this feature vector has a clustering characteristic, better classification results are obtained by relying on it:

  $\tilde{y} = W^T y = W^T A x_0 \in \mathbb{R}^r$   (10)

Since $r \ll n$, $\tilde{y} = W^T A x_0$ is an under-determined system, so $x_0$ is not unique. Exploiting the sparsity of $x_0$, the $\ell_0$ norm should be used to solve this equation:

  $\hat{x}_0 = \arg\min \lVert x \rVert_0 \quad \text{subject to} \quad W^T A x = \tilde{y}$   (11)

This is an NP-hard problem. According to sparse representation and compressed sensing theory, if $x_0$ is sufficiently sparse, $\ell_1$-minimization is equivalent to $\ell_0$-minimization:

  $\hat{x}_1 = \arg\min \lVert x \rVert_1 \quad \text{subject to} \quad W^T A x = \tilde{y}$   (12)

Given a new test sample $y$ from one of the training classes, $\hat{x}_1$ is computed from equation (12), and the class of $y$ can be judged from the nonzero coefficients of $\hat{x}_1$. For the $i$-th class, the coefficients relevant to that class are selected through $\delta_i(\hat{x}_1)$, which keeps the entries associated with the $i$-th class and sets all others to zero; a new approximation $\hat{y}_i$ is then obtained, and the minimal residual determines the class:

  $\min_i r_i(y) = \lVert y - A\, \delta_i(\hat{x}_1) \rVert_2$   (13)

where $i$ is the class label assigned to $y$. The residuals $r_i(y)$ are obtained from the minimum-$\ell_1$-norm solution.

C. Steps of the proposed method
1: Input: the training matrix $A = [A_1, A_2, \ldots, A_k] \in \mathbb{R}^{m \times n}$ and a test sample $y \in \mathbb{R}^m$.
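A compact, hedged sketch of the pipeline described above: NMF with the multiplicative updates of equations (1), (5), (6), followed by projection with $W^T$ and $\ell_1$-based sparse classification as in equations (10)-(13). This is an illustration of the method's structure, not the authors' code; the $\ell_1$ solver (a linear-programming reformulation), the data shapes, and the generic initializer argument `H0` are assumptions (the paper builds the initial $H$ from the adjacency-spectrum vectors $\lambda_k$ of equation (4)).

```python
import numpy as np
from scipy.optimize import linprog

def nmf(V, r, H0=None, iters=200, tol=1e-6, seed=0):
    """Multiplicative-update NMF: V (m x n) ~= W (m x r) @ H (r x n).
    H0 lets the caller pass a structured initializer (e.g., adjacency spectra)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 1e-3
    H = (H0 if H0 is not None else rng.random((r, n))) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)      # update (5)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)      # update (6)
        if np.linalg.norm(V - W @ H) ** 2 < tol:    # objective (2) below threshold
            break
    return W, H

def l1_sparse_code(B, y):
    """Solve min ||x||_1 s.t. B x = y via the standard LP split x = u - v, u, v >= 0."""
    _, n = B.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([B, -B])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v

def classify(A, labels, W, y):
    """Project with W^T, sparse-code, and pick the class with minimal residual (13)."""
    B, y_t = W.T @ A, W.T @ y
    x_hat = l1_sparse_code(B, y_t)
    residuals = {}
    for c in set(labels):
        x_c = np.where(np.asarray(labels) == c, x_hat, 0.0)   # delta_i(x_hat)
        residuals[c] = np.linalg.norm(y_t - B @ x_c)
    return min(residuals, key=residuals.get)
```

Here `labels` is the length-$n$ list of class labels for the columns of $A$; the residual is computed in the projected domain, which is one reasonable reading of equation (13).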

Audio: PCM and G.711 Encoding

Modulation

In the diagram, a sine wave (red curve) is sampled and quantized for PCM. The sine wave is sampled at regular intervals, shown as ticks on the x-axis. For each sample, one of the available values (ticks on the y-axis) is chosen by some algorithm (in this case, the floor function is used). This produces a fully discrete representation of the input signal (shaded area) that can be easily encoded as digital data for storage or manipulation. For the sine wave example at right, we can verify that the quantized values at the sampling moments are 7, 9, 11, 12, 13, 14, 14, 15, 15, 15, 14, etc. Encoding these values as binary numbers would result in the following set of nibbles: 0111, 1001, 1011, 1100, 1101, 1110, 1110, 1111, 1111, 1111, 1110, etc. These digital values could then be further processed or analyzed by a purpose-specific digital signal processor or general purpose CPU. Several Pulse Code Modulation streams could also be multiplexed into a larger aggregate data stream, generally for transmission of multiple streams over a single physical link. This technique is called time-division multiplexing, or TDM, and is widely used, notably in the modern public telephone system.

There are many ways to implement a real device that performs this task. In real systems, such a device is commonly implemented on a single integrated circuit that lacks only the clock necessary for sampling, and is generally referred to as an ADC (Analog-to-Digital Converter). These devices will produce on their output a binary representation of the input whenever they are triggered by a clock signal, which would then be read by a processor of some sort.

Demodulation

To produce output from the sampled data, the procedure of modulation is applied in reverse. After each sampling period has passed, the next value is read and the output of the system is shifted instantaneously (in an idealized system) to the new value. As a result of these instantaneous transitions, the discrete signal will have a significant amount of inherent high-frequency energy, mostly harmonics of the sampling frequency (see square wave). To smooth out the signal and remove these undesirable harmonics, the signal would be passed through analog filters that suppress artifacts outside the expected frequency range (i.e., greater than the maximum resolvable frequency). Some systems use digital filtering to remove the lowest and largest harmonics. In some systems, no explicit filtering is done at all; as it is impossible for any system to reproduce a signal with infinite bandwidth, inherent losses in the system compensate for the artifacts, or the system simply does not require much precision. The sampling theorem suggests that practical PCM devices, provided a sampling frequency that is sufficiently greater than that of the input signal, can operate without introducing significant distortions within their designed frequency bands.

The electronics involved in producing an accurate analog signal from the discrete data are similar to those used for generating the digital signal. These devices are DACs (digital-to-analog converters), and operate similarly to ADCs. They produce on their output a voltage or current (depending on type) that represents the value presented on their inputs. This output would then generally be filtered and amplified for use.

To summarize: the PCM discussed here is linear PCM; it is called "linear" to contrast it with the non-linear (companded) coding described below.
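A small sketch of the linear PCM quantization described above: sample a sine wave at regular intervals and map each sample to a 4-bit code with the floor function. The amplitude, offset, and sampling rate below are assumptions chosen only to mimic the figure, so the exact code values will differ from the 7, 9, 11, ... sequence quoted there; the point is the sample-then-floor-quantize structure.

```python
import math

def quantize_linear_pcm(signal, bits=4):
    """Map samples in [0, 1] to integer codes 0 .. 2**bits - 1 using floor."""
    levels = 2 ** bits
    return [min(levels - 1, int(math.floor(s * levels))) for s in signal]

# Sample one period of a sine wave shifted into [0, 1], 16 samples per period.
fs, f, n = 16, 1.0, 16                       # sampling rate, frequency, sample count
samples = [0.5 + 0.5 * math.sin(2 * math.pi * f * t / fs) for t in range(n)]

codes = quantize_linear_pcm(samples, bits=4)
nibbles = [format(c, "04b") for c in codes]  # 4-bit binary codes ("nibbles")
print(codes)
print(nibbles)
```

G.711 then applies a non-linear companding curve (A-law or mu-law) on top of this idea, spending more code levels on small amplitudes than large ones.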

VESA Display Stream Compression Encoder IP v1.0 UserGuideIntroductionDisplay Stream Compression (DSC) is a visually lossless video compression targeted for display devices. As there is demand for higher video resolutions and higher frame rates, the data bandwidth required to transmit the video keeps increasing. To transmit high video resolutions such as 4K and 8K, the source, transmission path, that is the display cable, and the display should support higher data rates. These high data rates increase the cost of the source, cable and the display. DSC is used to reduce the data rate required to transmit high resolution videos and there by reducing the cost. DSC was first introduced by Video Electronics Standards Association (VESA) in 2014. DSC compression is supported by the latest versions of the popularly used protocols such as HDMI, Display port, and MIPI DSI.DSC implements compression by combining a group of pixels in a horizontal line. The compression algorithm uses several stages such as prediction, quantization, entropy encoding, and rate control. There are two types of algorithms for prediction, which are Modified Median Adaptive Filter (MMAP) and Mid-Point Prediction (MPP). The predicted data is quantized based on the rate control to achieve constant bandwidth at the output. The quantized data is then passed to the Variable Length Coding (VLC) that minimizes the bits used to represent the quantized output. These compression stages are implemented for Y, Cb, and Cr component and the outputs of these stages are combined at the end using a substream multiplexer.DSC supports splitting a video frame into multiple slices horizontally with equal size. The slicing of a frame allows parallel processing of slices to handle high resolution video frames. The DSC IP supports two slices and uses MMAP and MPP predictions.FeaturesDSC has the following features:•VESA DSC 1.2a Spec•Implements Compression on YCbCr 444 Video Format•Supports 12-bits Per Pixel (12 bpp) and 8-bits Per Component•Standalone Operation, CPU, or Processor Assistance not Required•Supports Compression for 432x240, 648x480, 960x540, 1296x720, and 1920x1080 Resolutions at 60 Frames Per Second (fps)•Supports Two SlicesSupported FamiliesDSC supports the following family of products:•PolarFire® SoC FPGA•PolarFire FPGATable of ContentsIntroduction (1)Features (1)Supported Families (1)1. Hardware Implementation (3)1.1. Inputs and outputs (3)1.2. Configuration Parameters (3)1.3. Hardware Implementation of DSC IP (4)2. Testbench (7)2.1. Simulation (7)3. License (10)4. Installation Instructions (11)5. Resource Utilization (12)6. Revision History (13)Microchip FPGA Support (14)Microchip Information (14)The Microchip Website (14)Product Change Notification Service (14)Customer Support (14)Microchip Devices Code Protection Feature (14)Legal Notice (15)Trademarks (15)Quality Management System (16)Worldwide Sales and Service (17)1. Hardware ImplementationThe following figure shows the DSC IP block diagram.Figure 1-1. DSC Encoder IP Block Diagram1.1 Inputs and outputsThe following table lists the input and output ports of DSC IP.Table 1-1. Input and Output Ports of DSC IP1.2 Configuration ParametersThe following figure shows the DSC Encoder IP configuration parameters.Figure 1-2. DSC Encoder IP Configurator1.3 Hardware Implementation of DSC IPThis section describes the different internal modules of the DSC Encoder IP. 
The data input to the IP must be in the form of a raster scan image in the YCbCr 444 format.The following figure shows the DSC Encoder IP block diagram that divides the input image into two slices. The width of each slice is half of the input image width and the slice height is same as the input image height.Figure 1-3. DSC Encoder IP Block Diagram Slice 1The following figure shows the DSC Encoder block diagram for each slice.Figure 1-4. DSC Encoder Block Diagram Slice 2SLICE-1 BLOCK DIAGRAM1.3.1 Prediction and QuantizationEach group, consisting of three consecutive pixels, is predicted by using the MMAP and MPP algorithms. Predicted values are subtracted from the original pixel values, and the resulting residual pixels are quantized. In addition,reconstruction step is performed in the encoder wherein, the inverse quantized residuals are added to the predicted sample to ensure that both encoder and decoder have the same reference pixels.MMAP algorithm uses the current group’s pixels, the previous line’s adjacent pixels, and the reconstructed pixelimmediately to the left of the group. This is the default prediction method.The MPP predictor is a value at or near the midpoint of the range. The predictor depends on the rightmostreconstructed sample value of the previous group.1.3.2 VLC Entropy EncoderThe size of each residual is predicted using the previous residual size and changing the Quantization Parameter(QP). Variable length encoding effectively compresses the residual data.1.3.3 Rate ControlRate control block calculates the master Quantization Parameter (masterQP) to be used for prediction and VLC to ensure that the rate buffer neither underflows nor overflows. masterQP value is not transmitted along the bitstream, and the same rate control algorithm is imitated in the decoder. The RC algorithm is designed to optimize subjectivepicture quality by way of its QP decisions. Lower QP on flat areas of the image and Higher QP on busy areas of the image ensures you to maintain constant quality for all the pixels.1.3.4 Decoder ModelDecoder is an idealized theoretical actual decoder model. Decoder model dictates how the substeams Y, Cb, and Cr are multiplexed. The Balance FIFOs ensure that the multiplexer has at least one mux word’s worth of data whenever the multiplexer receives a request signal from the decoder model.1.3.5 Substream MultiplexerThe substream multiplexer multiplexes the Y, CB, and Cr components into a single slice of data. Each muxword has 48-bit data. Muxwords are inserted in the bitstream depending on the size of their syntax elements.1.3.6 Slice MultiplexerEach picture is divided into two equal slices. Each slice is independently decoded without referencing other slices.Two slices are merged in the bitstream by the slice multiplexing process.2. TestbenchTestbench is provided to check the functionality of the DSC IP.2.1 SimulationThe simulation uses a 432x240 image in YCbCr444 format represented by three files, each for Y, Cb, and Cr as input and generates a .txt file format that contains one frame.To simulate the core using the testbench, perform the following steps:1.Go to Libero® SoC Catalog tab, expand Solutions-Video, double-click DSC_Encoder, and then click OK.Note: If you do not see the Catalog tab, navigate to View > Windows menu and click Catalog to make itvisible.Figure 2-1. DSC Encoder IP Core in Libero SoC Catalog2.Go to the Files tab, right-click simulation, and then click Import Files.Figure 2-2. 
Import Files3.Import the img_in_luma.txt, img_in_cb.txt, img_in_cr.txt, and DSC_out_ref.txt files from thefollowing path: ..\<Project_name>\component\Microchip\SolutionCore\ DSC_Encoder\<DSC IP version>\Stimulus.The imported file is listed in the simulation folder as shown in the following figure.Figure 2-3. Imported Files4.Go to Libero SoC Stimulus Hierarchy tab, select the testbench (DSC_Encoder_tb. v), right-click and thenclick Simulate Pre-Synth Design > Open Interactively. The IP is simulated for one frame.Note: If you do not see the Stimulus Hierarchy tab, navigate to View > Windows menu, and then click Stimulus Hierarchy to make it visible.Figure 2-4. Simulating the Pre-Synthesis DesignModelSim opens with the testbench file as shown in the following figure.Figure 2-5. ModelSim Simulation WindowNote: If the simulation is interrupted due to the runtime limit specified in the DO file, use the run -allcommand to complete the simulation.3. LicenseVESA DSC IP is provided only in encrypted form.Encrypted RTL source code is license locked, which needs to be purchased separately. You can perform simulation, synthesis, layout, and program the Field Programmable Gate Array (FPGA) silicon using the Libero design suite.Evaluation license is provided for free to explore the VESA DSC IP features. The evaluation license expires after an hour’s use on the hardware.Installation Instructions 4. Installation InstructionsDSC IP core must be installed to the IP Catalog of the Libero SoC software. This is done automatically through theIP Catalog update function in the Libero SoC software, or the IP core can be manually downloaded from the catalog.Once the IP core is installed in the Libero SoC software IP Catalog, the core can be configured, generated, andinstantiated within the SmartDesign tool for inclusion in the Libero projects list.Resource Utilization 5. Resource UtilizationThe following table lists the resource utilization of a sample DSC IP design made for PolarFire FPGA(MPF300TS-1FCG1152I package) and generates compressed data by using 4:4:4 sampling of input data.Table 5-1. Resource UtilizationRevision History 6. Revision HistoryThe revision history describes the changes that were implemented in the document. The changes are listed byrevision, starting with the current publication.Table 6-1. Revision HistoryMicrochip FPGA SupportMicrochip FPGA products group backs its products with various support services, including Customer Service, Customer Technical Support Center, a website, and worldwide sales offices. Customers are suggested to visit Microchip online resources prior to contacting support as it is very likely that their queries have been already answered.Contact Technical Support Center through the website at /support. Mention the FPGA Device Part number, select appropriate case category, and upload design files while creating a technical support case.Contact Customer Service for non-technical product support, such as product pricing, product upgrades, update information, order status, and authorization.•From North America, call 800.262.1060•From the rest of the world, call 650.318.4460•Fax, from anywhere in the world, 650.318.8044Microchip InformationThe Microchip WebsiteMicrochip provides online support via our website at /. This website is used to make files and information easily available to customers. 
Some of the content available includes:•Product Support – Data sheets and errata, application notes and sample programs, design resources, user’s guides and hardware support documents, latest software releases and archived software•General Technical Support – Frequently Asked Questions (FAQs), technical support requests, online discussion groups, Microchip design partner program member listing•Business of Microchip – Product selector and ordering guides, latest Microchip press releases, listing of seminars and events, listings of Microchip sales offices, distributors and factory representativesProduct Change Notification ServiceMicrochip’s product change notification service helps keep customers current on Microchip products. Subscribers will receive email notification whenever there are changes, updates, revisions or errata related to a specified product family or development tool of interest.To register, go to /pcn and follow the registration instructions.Customer SupportUsers of Microchip products can receive assistance through several channels:•Distributor or Representative•Local Sales Office•Embedded Solutions Engineer (ESE)•Technical SupportCustomers should contact their distributor, representative or ESE for support. Local sales offices are also available to help customers. A listing of sales offices and locations is included in this document.Technical support is available through the website at: /supportMicrochip Devices Code Protection FeatureNote the following details of the code protection feature on Microchip products:•Microchip products meet the specifications contained in their particular Microchip Data Sheet.•Microchip believes that its family of products is secure when used in the intended manner, within operating specifications, and under normal conditions.•Microchip values and aggressively protects its intellectual property rights. Attempts to breach the code protection features of Microchip product is strictly prohibited and may violate the Digital Millennium Copyright Act.•Neither Microchip nor any other semiconductor manufacturer can guarantee the security of its code. Code protection does not mean that we are guaranteeing the product is “unbreakable”. Code protection is constantly evolving. Microchip is committed to continuously improving the code protection features of our products. Legal NoticeThis publication and the information herein may be used only with Microchip products, including to design, test,and integrate Microchip products with your application. Use of this information in any other manner violates these terms. Information regarding device applications is provided only for your convenience and may be supersededby updates. It is your responsibility to ensure that your application meets with your specifications. Contact yourlocal Microchip sales office for additional support or, obtain additional support at /en-us/support/ design-help/client-support-services.THIS INFORMATION IS PROVIDED BY MICROCHIP "AS IS". 
MICROCHIP MAKES NO REPRESENTATIONSOR WARRANTIES OF ANY KIND WHETHER EXPRESS OR IMPLIED, WRITTEN OR ORAL, STATUTORYOR OTHERWISE, RELATED TO THE INFORMATION INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE, OR WARRANTIES RELATED TO ITS CONDITION, QUALITY, OR PERFORMANCE.IN NO EVENT WILL MICROCHIP BE LIABLE FOR ANY INDIRECT, SPECIAL, PUNITIVE, INCIDENTAL, OR CONSEQUENTIAL LOSS, DAMAGE, COST, OR EXPENSE OF ANY KIND WHATSOEVER RELATED TO THE INFORMATION OR ITS USE, HOWEVER CAUSED, EVEN IF MICROCHIP HAS BEEN ADVISED OF THE POSSIBILITY OR THE DAMAGES ARE FORESEEABLE. TO THE FULLEST EXTENT ALLOWED BY LAW, MICROCHIP'S TOTAL LIABILITY ON ALL CLAIMS IN ANY WAY RELATED TO THE INFORMATION OR ITS USE WILL NOT EXCEED THE AMOUNT OF FEES, IF ANY, THAT YOU HAVE PAID DIRECTLY TO MICROCHIP FOR THE INFORMATION.Use of Microchip devices in life support and/or safety applications is entirely at the buyer's risk, and the buyer agrees to defend, indemnify and hold harmless Microchip from any and all damages, claims, suits, or expenses resulting from such use. No licenses are conveyed, implicitly or otherwise, under any Microchip intellectual property rights unless otherwise stated.TrademarksThe Microchip name and logo, the Microchip logo, Adaptec, AVR, AVR logo, AVR Freaks, BesTime, BitCloud, CryptoMemory, CryptoRF, dsPIC, flexPWR, HELDO, IGLOO, JukeBlox, KeeLoq, Kleer, LANCheck, LinkMD, maXStylus, maXTouch, MediaLB, megaAVR, Microsemi, Microsemi logo, MOST, MOST logo, MPLAB, OptoLyzer, PIC, picoPower, PICSTART, PIC32 logo, PolarFire, Prochip Designer, QTouch, SAM-BA, SenGenuity, SpyNIC, SST, SST Logo, SuperFlash, Symmetricom, SyncServer, Tachyon, TimeSource, tinyAVR, UNI/O, Vectron, and XMEGA are registered trademarks of Microchip Technology Incorporated in the U.S.A. and other countries.AgileSwitch, APT, ClockWorks, The Embedded Control Solutions Company, EtherSynch, Flashtec, Hyper Speed Control, HyperLight Load, Libero, motorBench, mTouch, Powermite 3, Precision Edge, ProASIC, ProASIC Plus, ProASIC Plus logo, Quiet- Wire, SmartFusion, SyncWorld, Temux, TimeCesium, TimeHub, TimePictra, TimeProvider, TrueTime, and ZL are registered trademarks of Microchip Technology Incorporated in the U.S.A.Adjacent Key Suppression, AKS, Analog-for-the-Digital Age, Any Capacitor, AnyIn, AnyOut, Augmented Switching, BlueSky, BodyCom, Clockstudio, CodeGuard, CryptoAuthentication, CryptoAutomotive, CryptoCompanion, CryptoController, dsPICDEM, , Dynamic Average Matching, DAM, ECAN, Espresso T1S, EtherGREEN, GridTime, IdealBridge, In-Circuit Serial Programming, ICSP, INICnet, Intelligent Paralleling, IntelliMOS, Inter-Chip Connectivity, JitterBlocker, Knob-on-Display, KoD, maxCrypto, maxView, memBrain, Mindi, MiWi, MPASM, MPF, MPLAB Certified logo, MPLIB, MPLINK, MultiTRAK, NetDetach, Omniscient Code Generation, PICDEM, , PICkit, PICtail, PowerSmart, PureSilicon, QMatrix, REAL ICE, Ripple Blocker, RTAX, RTG4, SAM-ICE, Serial Quad I/O, simpleMAP, SimpliPHY, SmartBuffer, SmartHLS, SMART-I.S., storClad, SQI, SuperSwitcher, SuperSwitcher II, Switchtec, SynchroPHY, Total Endurance, Trusted Time, TSHARC, USBCheck, VariSense, VectorBlox, VeriPHY, ViewSpan, WiperLock, XpressConnect, and ZENA are trademarks of Microchip Technology Incorporated in the U.S.A. 
and other countries.SQTP is a service mark of Microchip Technology Incorporated in the U.S.A.The Adaptec logo, Frequency on Demand, Silicon Storage Technology, and Symmcom are registered trademarks of Microchip Technology Inc. in other countries.GestIC is a registered trademark of Microchip Technology Germany II GmbH & Co. KG, a subsidiary of Microchip Technology Inc., in other countries.All other trademarks mentioned herein are property of their respective companies.© 2022, Microchip Technology Incorporated and its subsidiaries. All Rights Reserved.ISBN: 978-1-6683-1273-5Quality Management SystemFor information regarding Microchip’s Quality Management Systems, please visit /quality.Worldwide Sales and Service。
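Returning to the compression pipeline of Section 1.3, the following toy sketch illustrates the predict-quantize-reconstruct loop on a three-pixel group with a midpoint-style (MPP-like) predictor. It is a greatly simplified illustration of the concepts, not the VESA DSC 1.2a algorithm or this IP's behavior; the prediction rule, bit depth, and quantization parameter handling are assumptions.

```python
def mpp_predict(prev_recon, bit_depth=8):
    """Midpoint-style predictor: halfway between mid-range and the previous
    reconstructed sample (a simplified stand-in for DSC's MPP rule)."""
    midpoint = 1 << (bit_depth - 1)
    return (midpoint + prev_recon) // 2

def encode_group(pixels, qp, prev_recon=128, bit_depth=8):
    """Predict each pixel, quantize the residual, and keep a reconstructed value
    so encoder and decoder stay in sync, as Section 1.3.1 describes."""
    step = 1 << qp                               # coarser quantization at higher QP
    max_val = (1 << bit_depth) - 1
    quantized, recon = [], prev_recon
    for p in pixels:
        pred = mpp_predict(recon, bit_depth)
        q = round((p - pred) / step)             # quantized residual to be entropy-coded
        recon = min(max_val, max(0, pred + q * step))   # inverse-quantized reconstruction
        quantized.append(q)
    return quantized, recon

group = [118, 121, 140]                          # three consecutive luma samples
for qp in (0, 2, 4):
    q, last_recon = encode_group(group, qp)
    print(f"QP={qp}: residuals={q}, last reconstructed sample={last_recon}")
```

Lower QP keeps residuals fine-grained (flat image areas); higher QP coarsens them, which is the trade-off the rate control block of Section 1.3.3 steers to keep the rate buffer from under- or overflowing.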

A Survey of Clustering Data Mining TechniquesPavel BerkhinYahoo!,Inc.pberkhin@Summary.Clustering is the division of data into groups of similar objects.It dis-regards some details in exchange for data simplifirmally,clustering can be viewed as data modeling concisely summarizing the data,and,therefore,it re-lates to many disciplines from statistics to numerical analysis.Clustering plays an important role in a broad range of applications,from information retrieval to CRM. Such applications usually deal with large datasets and many attributes.Exploration of such data is a subject of data mining.This survey concentrates on clustering algorithms from a data mining perspective.1IntroductionThe goal of this survey is to provide a comprehensive review of different clus-tering techniques in data mining.Clustering is a division of data into groups of similar objects.Each group,called a cluster,consists of objects that are similar to one another and dissimilar to objects of other groups.When repre-senting data with fewer clusters necessarily loses certainfine details(akin to lossy data compression),but achieves simplification.It represents many data objects by few clusters,and hence,it models data by its clusters.Data mod-eling puts clustering in a historical perspective rooted in mathematics,sta-tistics,and numerical analysis.From a machine learning perspective clusters correspond to hidden patterns,the search for clusters is unsupervised learn-ing,and the resulting system represents a data concept.Therefore,clustering is unsupervised learning of a hidden data concept.Data mining applications add to a general picture three complications:(a)large databases,(b)many attributes,(c)attributes of different types.This imposes on a data analysis se-vere computational requirements.Data mining applications include scientific data exploration,information retrieval,text mining,spatial databases,Web analysis,CRM,marketing,medical diagnostics,computational biology,and many others.They present real challenges to classic clustering algorithms. 
These challenges led to the emergence of powerful broadly applicable data2Pavel Berkhinmining clustering methods developed on the foundation of classic techniques.They are subject of this survey.1.1NotationsTo fix the context and clarify terminology,consider a dataset X consisting of data points (i.e.,objects ,instances ,cases ,patterns ,tuples ,transactions )x i =(x i 1,···,x id ),i =1:N ,in attribute space A ,where each component x il ∈A l ,l =1:d ,is a numerical or nominal categorical attribute (i.e.,feature ,variable ,dimension ,component ,field ).For a discussion of attribute data types see [106].Such point-by-attribute data format conceptually corresponds to a N ×d matrix and is used by a majority of algorithms reviewed below.However,data of other formats,such as variable length sequences and heterogeneous data,are not uncommon.The simplest subset in an attribute space is a direct Cartesian product of sub-ranges C = C l ⊂A ,C l ⊂A l ,called a segment (i.e.,cube ,cell ,region ).A unit is an elementary segment whose sub-ranges consist of a single category value,or of a small numerical bin.Describing the numbers of data points per every unit represents an extreme case of clustering,a histogram .This is a very expensive representation,and not a very revealing er driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains.Unlike segmentation,clustering is assumed to be automatic,and so it is a machine learning technique.The ultimate goal of clustering is to assign points to a finite system of k subsets (clusters).Usually (but not always)subsets do not intersect,and their union is equal to a full dataset with the possible exception of outliersX =C 1 ··· C k C outliers ,C i C j =0,i =j.1.2Clustering Bibliography at GlanceGeneral references regarding clustering include [110],[205],[116],[131],[63],[72],[165],[119],[75],[141],[107],[91].A very good introduction to contem-porary data mining clustering techniques can be found in the textbook [106].There is a close relationship between clustering and many other fields.Clustering has always been used in statistics [10]and science [158].The clas-sic introduction into pattern recognition framework is given in [64].Typical applications include speech and character recognition.Machine learning clus-tering algorithms were applied to image segmentation and computer vision[117].For statistical approaches to pattern recognition see [56]and [85].Clus-tering can be viewed as a density estimation problem.This is the subject of traditional multivariate statistical estimation [197].Clustering is also widelyA Survey of Clustering Data Mining Techniques3 used for data compression in image processing,which is also known as vec-tor quantization[89].Datafitting in numerical analysis provides still another venue in data modeling[53].This survey’s emphasis is on clustering in data mining.Such clustering is characterized by large datasets with many attributes of different types. 
Though we do not even try to review particular applications,many important ideas are related to the specificfields.Clustering in data mining was brought to life by intense developments in information retrieval and text mining[52], [206],[58],spatial database applications,for example,GIS or astronomical data,[223],[189],[68],sequence and heterogeneous data analysis[43],Web applications[48],[111],[81],DNA analysis in computational biology[23],and many others.They resulted in a large amount of application-specific devel-opments,but also in some general techniques.These techniques and classic clustering algorithms that relate to them are surveyed below.1.3Plan of Further PresentationClassification of clustering algorithms is neither straightforward,nor canoni-cal.In reality,different classes of algorithms overlap.Traditionally clustering techniques are broadly divided in hierarchical and partitioning.Hierarchical clustering is further subdivided into agglomerative and divisive.The basics of hierarchical clustering include Lance-Williams formula,idea of conceptual clustering,now classic algorithms SLINK,COBWEB,as well as newer algo-rithms CURE and CHAMELEON.We survey these algorithms in the section Hierarchical Clustering.While hierarchical algorithms gradually(dis)assemble points into clusters (as crystals grow),partitioning algorithms learn clusters directly.In doing so they try to discover clusters either by iteratively relocating points between subsets,or by identifying areas heavily populated with data.Algorithms of thefirst kind are called Partitioning Relocation Clustering. They are further classified into probabilistic clustering(EM framework,al-gorithms SNOB,AUTOCLASS,MCLUST),k-medoids methods(algorithms PAM,CLARA,CLARANS,and its extension),and k-means methods(differ-ent schemes,initialization,optimization,harmonic means,extensions).Such methods concentrate on how well pointsfit into their clusters and tend to build clusters of proper convex shapes.Partitioning algorithms of the second type are surveyed in the section Density-Based Partitioning.They attempt to discover dense connected com-ponents of data,which areflexible in terms of their shape.Density-based connectivity is used in the algorithms DBSCAN,OPTICS,DBCLASD,while the algorithm DENCLUE exploits space density functions.These algorithms are less sensitive to outliers and can discover clusters of irregular shape.They usually work with low-dimensional numerical data,known as spatial data. 
Spatial objects could include not only points,but also geometrically extended objects(algorithm GDBSCAN).4Pavel BerkhinSome algorithms work with data indirectly by constructing summaries of data over the attribute space subsets.They perform space segmentation and then aggregate appropriate segments.We discuss them in the section Grid-Based Methods.They frequently use hierarchical agglomeration as one phase of processing.Algorithms BANG,STING,WaveCluster,and FC are discussed in this section.Grid-based methods are fast and handle outliers well.Grid-based methodology is also used as an intermediate step in many other algorithms (for example,CLIQUE,MAFIA).Categorical data is intimately connected with transactional databases.The concept of a similarity alone is not sufficient for clustering such data.The idea of categorical data co-occurrence comes to the rescue.The algorithms ROCK,SNN,and CACTUS are surveyed in the section Co-Occurrence of Categorical Data.The situation gets even more aggravated with the growth of the number of items involved.To help with this problem the effort is shifted from data clustering to pre-clustering of items or categorical attribute values. Development based on hyper-graph partitioning and the algorithm STIRR exemplify this approach.Many other clustering techniques are developed,primarily in machine learning,that either have theoretical significance,are used traditionally out-side the data mining community,or do notfit in previously outlined categories. The boundary is blurred.In the section Other Developments we discuss the emerging direction of constraint-based clustering,the important researchfield of graph partitioning,and the relationship of clustering to supervised learning, gradient descent,artificial neural networks,and evolutionary methods.Data Mining primarily works with large databases.Clustering large datasets presents scalability problems reviewed in the section Scalability and VLDB Extensions.Here we talk about algorithms like DIGNET,about BIRCH and other data squashing techniques,and about Hoffding or Chernoffbounds.Another trait of real-life data is high dimensionality.Corresponding de-velopments are surveyed in the section Clustering High Dimensional Data. 
The trouble comes from a decrease in metric separation when the dimension grows.One approach to dimensionality reduction uses attributes transforma-tions(DFT,PCA,wavelets).Another way to address the problem is through subspace clustering(algorithms CLIQUE,MAFIA,ENCLUS,OPTIGRID, PROCLUS,ORCLUS).Still another approach clusters attributes in groups and uses their derived proxies to cluster objects.This double clustering is known as co-clustering.Issues common to different clustering methods are overviewed in the sec-tion General Algorithmic Issues.We talk about assessment of results,de-termination of appropriate number of clusters to build,data preprocessing, proximity measures,and handling of outliers.For reader’s convenience we provide a classification of clustering algorithms closely followed by this survey:•Hierarchical MethodsA Survey of Clustering Data Mining Techniques5Agglomerative AlgorithmsDivisive Algorithms•Partitioning Relocation MethodsProbabilistic ClusteringK-medoids MethodsK-means Methods•Density-Based Partitioning MethodsDensity-Based Connectivity ClusteringDensity Functions Clustering•Grid-Based Methods•Methods Based on Co-Occurrence of Categorical Data•Other Clustering TechniquesConstraint-Based ClusteringGraph PartitioningClustering Algorithms and Supervised LearningClustering Algorithms in Machine Learning•Scalable Clustering Algorithms•Algorithms For High Dimensional DataSubspace ClusteringCo-Clustering Techniques1.4Important IssuesThe properties of clustering algorithms we are primarily concerned with in data mining include:•Type of attributes algorithm can handle•Scalability to large datasets•Ability to work with high dimensional data•Ability tofind clusters of irregular shape•Handling outliers•Time complexity(we frequently simply use the term complexity)•Data order dependency•Labeling or assignment(hard or strict vs.soft or fuzzy)•Reliance on a priori knowledge and user defined parameters •Interpretability of resultsRealistically,with every algorithm we discuss only some of these properties. 
The list is in no way exhaustive.For example,as appropriate,we also discuss algorithms ability to work in pre-defined memory buffer,to restart,and to provide an intermediate solution.6Pavel Berkhin2Hierarchical ClusteringHierarchical clustering builds a cluster hierarchy or a tree of clusters,also known as a dendrogram.Every cluster node contains child clusters;sibling clusters partition the points covered by their common parent.Such an ap-proach allows exploring data on different levels of granularity.Hierarchical clustering methods are categorized into agglomerative(bottom-up)and divi-sive(top-down)[116],[131].An agglomerative clustering starts with one-point (singleton)clusters and recursively merges two or more of the most similar clusters.A divisive clustering starts with a single cluster containing all data points and recursively splits the most appropriate cluster.The process contin-ues until a stopping criterion(frequently,the requested number k of clusters) is achieved.Advantages of hierarchical clustering include:•Flexibility regarding the level of granularity•Ease of handling any form of similarity or distance•Applicability to any attribute typesDisadvantages of hierarchical clustering are related to:•Vagueness of termination criteria•Most hierarchical algorithms do not revisit(intermediate)clusters once constructed.The classic approaches to hierarchical clustering are presented in the sub-section Linkage Metrics.Hierarchical clustering based on linkage metrics re-sults in clusters of proper(convex)shapes.Active contemporary efforts to build cluster systems that incorporate our intuitive concept of clusters as con-nected components of arbitrary shape,including the algorithms CURE and CHAMELEON,are surveyed in the subsection Hierarchical Clusters of Arbi-trary Shapes.Divisive techniques based on binary taxonomies are presented in the subsection Binary Divisive Partitioning.The subsection Other Devel-opments contains information related to incremental learning,model-based clustering,and cluster refinement.In hierarchical clustering our regular point-by-attribute data representa-tion frequently is of secondary importance.Instead,hierarchical clustering frequently deals with the N×N matrix of distances(dissimilarities)or sim-ilarities between training points sometimes called a connectivity matrix.So-called linkage metrics are constructed from elements of this matrix.The re-quirement of keeping a connectivity matrix in memory is unrealistic.To relax this limitation different techniques are used to sparsify(introduce zeros into) the connectivity matrix.This can be done by omitting entries smaller than a certain threshold,by using only a certain subset of data representatives,or by keeping with each point only a certain number of its nearest neighbors(for nearest neighbor chains see[177]).Notice that the way we process the original (dis)similarity matrix and construct a linkage metric reflects our a priori ideas about the data model.A Survey of Clustering Data Mining Techniques7With the(sparsified)connectivity matrix we can associate the weighted connectivity graph G(X,E)whose vertices X are data points,and edges E and their weights are defined by the connectivity matrix.This establishes a connection between hierarchical clustering and graph partitioning.One of the most striking developments in hierarchical clustering is the algorithm BIRCH.It is discussed in the section Scalable VLDB Extensions.Hierarchical clustering initializes a cluster system as a set of singleton 
clusters(agglomerative case)or a single cluster of all points(divisive case) and proceeds iteratively merging or splitting the most appropriate cluster(s) until the stopping criterion is achieved.The appropriateness of a cluster(s) for merging or splitting depends on the(dis)similarity of cluster(s)elements. This reflects a general presumption that clusters consist of similar points.An important example of dissimilarity between two points is the distance between them.To merge or split subsets of points rather than individual points,the dis-tance between individual points has to be generalized to the distance between subsets.Such a derived proximity measure is called a linkage metric.The type of a linkage metric significantly affects hierarchical algorithms,because it re-flects a particular concept of closeness and connectivity.Major inter-cluster linkage metrics[171],[177]include single link,average link,and complete link. The underlying dissimilarity measure(usually,distance)is computed for every pair of nodes with one node in thefirst set and another node in the second set.A specific operation such as minimum(single link),average(average link),or maximum(complete link)is applied to pair-wise dissimilarity measures:d(C1,C2)=Op{d(x,y),x∈C1,y∈C2}Early examples include the algorithm SLINK[199],which implements single link(Op=min),Voorhees’method[215],which implements average link (Op=Avr),and the algorithm CLINK[55],which implements complete link (Op=max).It is related to the problem offinding the Euclidean minimal spanning tree[224]and has O(N2)complexity.The methods using inter-cluster distances defined in terms of pairs of nodes(one in each respective cluster)are called graph methods.They do not use any cluster representation other than a set of points.This name naturally relates to the connectivity graph G(X,E)introduced above,because every data partition corresponds to a graph partition.Such methods can be augmented by so-called geometric methods in which a cluster is represented by its central point.Under the assumption of numerical attributes,the center point is defined as a centroid or an average of two cluster centroids subject to agglomeration.It results in centroid,median,and minimum variance linkage metrics.All of the above linkage metrics can be derived from the Lance-Williams updating formula[145],d(C iC j,C k)=a(i)d(C i,C k)+a(j)d(C j,C k)+b·d(C i,C j)+c|d(C i,C k)−d(C j,C k)|.8Pavel BerkhinHere a,b,c are coefficients corresponding to a particular linkage.This formula expresses a linkage metric between a union of the two clusters and the third cluster in terms of underlying nodes.The Lance-Williams formula is crucial to making the dis(similarity)computations feasible.Surveys of linkage metrics can be found in [170][54].When distance is used as a base measure,linkage metrics capture inter-cluster proximity.However,a similarity-based view that results in intra-cluster connectivity considerations is also used,for example,in the original average link agglomeration (Group-Average Method)[116].Under reasonable assumptions,such as reducibility condition (graph meth-ods satisfy this condition),linkage metrics methods suffer from O N 2 time complexity [177].Despite the unfavorable time complexity,these algorithms are widely used.As an example,the algorithm AGNES (AGlomerative NESt-ing)[131]is used in S-Plus.When the connectivity N ×N matrix is sparsified,graph methods directly dealing with the connectivity graph G can be used.In particular,hierarchical divisive MST (Minimum Spanning 
Tree)algorithm is based on graph parti-tioning [116].2.1Hierarchical Clusters of Arbitrary ShapesFor spatial data,linkage metrics based on Euclidean distance naturally gener-ate clusters of convex shapes.Meanwhile,visual inspection of spatial images frequently discovers clusters with curvy appearance.Guha et al.[99]introduced the hierarchical agglomerative clustering algo-rithm CURE (Clustering Using REpresentatives).This algorithm has a num-ber of novel features of general importance.It takes special steps to handle outliers and to provide labeling in assignment stage.It also uses two techniques to achieve scalability:data sampling (section 8),and data partitioning.CURE creates p partitions,so that fine granularity clusters are constructed in parti-tions first.A major feature of CURE is that it represents a cluster by a fixed number,c ,of points scattered around it.The distance between two clusters used in the agglomerative process is the minimum of distances between two scattered representatives.Therefore,CURE takes a middle approach between the graph (all-points)methods and the geometric (one centroid)methods.Single and average link closeness are replaced by representatives’aggregate closeness.Selecting representatives scattered around a cluster makes it pos-sible to cover non-spherical shapes.As before,agglomeration continues until the requested number k of clusters is achieved.CURE employs one additional trick:originally selected scattered points are shrunk to the geometric centroid of the cluster by a user-specified factor α.Shrinkage suppresses the affect of outliers;outliers happen to be located further from the cluster centroid than the other scattered representatives.CURE is capable of finding clusters of different shapes and sizes,and it is insensitive to outliers.Because CURE uses sampling,estimation of its complexity is not straightforward.For low-dimensional data authors provide a complexity estimate of O (N 2sample )definedA Survey of Clustering Data Mining Techniques9 in terms of a sample size.More exact bounds depend on input parameters: shrink factorα,number of representative points c,number of partitions p,and a sample size.Figure1(a)illustrates agglomeration in CURE.Three clusters, each with three representatives,are shown before and after the merge and shrinkage.Two closest representatives are connected.While the algorithm CURE works with numerical attributes(particularly low dimensional spatial data),the algorithm ROCK developed by the same researchers[100]targets hierarchical agglomerative clustering for categorical attributes.It is reviewed in the section Co-Occurrence of Categorical Data.The hierarchical agglomerative algorithm CHAMELEON[127]uses the connectivity graph G corresponding to the K-nearest neighbor model spar-sification of the connectivity matrix:the edges of K most similar points to any given point are preserved,the rest are pruned.CHAMELEON has two stages.In thefirst stage small tight clusters are built to ignite the second stage.This involves a graph partitioning[129].In the second stage agglomer-ative process is performed.It utilizes measures of relative inter-connectivity RI(C i,C j)and relative closeness RC(C i,C j);both are locally normalized by internal interconnectivity and closeness of clusters C i and C j.In this sense the modeling is dynamic:it depends on data locally.Normalization involves certain non-obvious graph operations[129].CHAMELEON relies heavily on graph partitioning implemented in the library HMETIS(see the section6). 
Agglomerative process depends on user provided thresholds.A decision to merge is made based on the combinationRI(C i,C j)·RC(C i,C j)αof local measures.The algorithm does not depend on assumptions about the data model.It has been proven tofind clusters of different shapes,densities, and sizes in2D(two-dimensional)space.It has a complexity of O(Nm+ Nlog(N)+m2log(m),where m is the number of sub-clusters built during the first initialization phase.Figure1(b)(analogous to the one in[127])clarifies the difference with CURE.It presents a choice of four clusters(a)-(d)for a merge.While CURE would merge clusters(a)and(b),CHAMELEON makes intuitively better choice of merging(c)and(d).2.2Binary Divisive PartitioningIn linguistics,information retrieval,and document clustering applications bi-nary taxonomies are very useful.Linear algebra methods,based on singular value decomposition(SVD)are used for this purpose in collaborativefilter-ing and information retrieval[26].Application of SVD to hierarchical divisive clustering of document collections resulted in the PDDP(Principal Direction Divisive Partitioning)algorithm[31].In our notations,object x is a docu-ment,l th attribute corresponds to a word(index term),and a matrix X entry x il is a measure(e.g.TF-IDF)of l-term frequency in a document x.PDDP constructs SVD decomposition of the matrix10Pavel Berkhin(a)Algorithm CURE (b)Algorithm CHAMELEONFig.1.Agglomeration in Clusters of Arbitrary Shapes(X −e ¯x ),¯x =1Ni =1:N x i ,e =(1,...,1)T .This algorithm bisects data in Euclidean space by a hyperplane that passes through data centroid orthogonal to the eigenvector with the largest singular value.A k -way split is also possible if the k largest singular values are consid-ered.Bisecting is a good way to categorize documents and it yields a binary tree.When k -means (2-means)is used for bisecting,the dividing hyperplane is orthogonal to the line connecting the two centroids.The comparative study of SVD vs.k -means approaches [191]can be used for further references.Hier-archical divisive bisecting k -means was proven [206]to be preferable to PDDP for document clustering.While PDDP or 2-means are concerned with how to split a cluster,the problem of which cluster to split is also important.Simple strategies are:(1)split each node at a given level,(2)split the cluster with highest cardinality,and,(3)split the cluster with the largest intra-cluster variance.All three strategies have problems.For a more detailed analysis of this subject and better strategies,see [192].2.3Other DevelopmentsOne of early agglomerative clustering algorithms,Ward’s method [222],is based not on linkage metric,but on an objective function used in k -means.The merger decision is viewed in terms of its effect on the objective function.The popular hierarchical clustering algorithm for categorical data COB-WEB [77]has two very important qualities.First,it utilizes incremental learn-ing.Instead of following divisive or agglomerative approaches,it dynamically builds a dendrogram by processing one data point at a time.Second,COB-WEB is an example of conceptual or model-based learning.This means that each cluster is considered as a model that can be described intrinsically,rather than as a collection of points assigned to it.COBWEB’s dendrogram is calleda classification tree.Each tree node(cluster)C is associated with the condi-tional probabilities for categorical attribute-values pairs,P r(x l=νlp|C),l=1:d,p=1:|A l|.This easily can be recognized as a C-specific Na¨ıve Bayes classifier.During 
the classification tree construction,every new point is descended along the tree and the tree is potentially updated(by an insert/split/merge/create op-eration).Decisions are based on the category utility[49]CU{C1,...,C k}=1j=1:kCU(C j)CU(C j)=l,p(P r(x l=νlp|C j)2−(P r(x l=νlp)2.Category utility is similar to the GINI index.It rewards clusters C j for in-creases in predictability of the categorical attribute valuesνlp.Being incre-mental,COBWEB is fast with a complexity of O(tN),though it depends non-linearly on tree characteristics packed into a constant t.There is a similar incremental hierarchical algorithm for all numerical attributes called CLAS-SIT[88].CLASSIT associates normal distributions with cluster nodes.Both algorithms can result in highly unbalanced trees.Chiu et al.[47]proposed another conceptual or model-based approach to hierarchical clustering.This development contains several different use-ful features,such as the extension of scalability preprocessing to categori-cal attributes,outliers handling,and a two-step strategy for monitoring the number of clusters including BIC(defined below).A model associated with a cluster covers both numerical and categorical attributes and constitutes a blend of Gaussian and multinomial models.Denote corresponding multivari-ate parameters byθ.With every cluster C we associate a logarithm of its (classification)likelihoodl C=x i∈Clog(p(x i|θ))The algorithm uses maximum likelihood estimates for parameterθ.The dis-tance between two clusters is defined(instead of linkage metric)as a decrease in log-likelihoodd(C1,C2)=l C1+l C2−l C1∪C2caused by merging of the two clusters under consideration.The agglomerative process continues until the stopping criterion is satisfied.As such,determina-tion of the best k is automatic.This algorithm has the commercial implemen-tation(in SPSS Clementine).The complexity of the algorithm is linear in N for the summarization phase.Traditional hierarchical clustering does not change points membership in once assigned clusters due to its greedy approach:after a merge or a split is selected it is not refined.Though COBWEB does reconsider its decisions,its。
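The linkage metrics discussed in this excerpt all reduce to applying one operator (minimum for single link, mean for average link, maximum for complete link) to the pairwise distances between two clusters. The sketch below is a generic illustration of that idea, not code from the survey; the cluster contents and the Euclidean metric are arbitrary choices made here for demonstration.

```java
import java.util.List;

public class LinkageMetrics {
    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(s);
    }

    // d(C1, C2) = Op { d(x, y) : x in C1, y in C2 }, where Op is min (single link),
    // max (complete link), or the mean of all pairwise distances (average link).
    static double linkage(List<double[]> c1, List<double[]> c2, String type) {
        double acc = type.equals("single") ? Double.MAX_VALUE : 0;
        int pairs = 0;
        for (double[] x : c1) {
            for (double[] y : c2) {
                double d = euclidean(x, y);
                switch (type) {
                    case "single":   acc = Math.min(acc, d); break;
                    case "complete": acc = Math.max(acc, d); break;
                    default:         acc += d; pairs++;       // average link
                }
            }
        }
        return type.equals("average") ? acc / pairs : acc;
    }

    public static void main(String[] args) {
        List<double[]> c1 = List.of(new double[]{0, 0}, new double[]{1, 0});
        List<double[]> c2 = List.of(new double[]{4, 0}, new double[]{5, 0});
        for (String t : List.of("single", "average", "complete")) {
            System.out.printf("%s link: %.2f%n", t, linkage(c1, c2, t));
        }
    }
}
```

In an agglomerative loop, the pair of clusters with the smallest linkage value would be merged at each step, which is exactly the behavior the Lance-Williams update formula computes incrementally.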

Principles of Compression and Restoration Techniques


Compression and decompression techniques play a crucial role in the digital world. 压缩和解压技术在数字世界中发挥着至关重要的作用。

Data compression is a method used to reduce the size of data by encoding information using fewer bits. 数据压缩是一种通过使用更少的位来编码信息以减小数据大小的方法。

This is particularly important when it comes to storing or transmitting large amounts of data efficiently. 这在有效存储或传输大量数据时尤为重要。

There are various compression algorithms and techniques available, each with its own advantages and disadvantages. 有各种各样的压缩算法和技术可供选择,每种都有其自身的优点和缺点。

One common compression technique is lossless compression, which allows the original data to be perfectly reconstructed from the compressed data. 一种常见的压缩技术是无损压缩,它允许从压缩数据完美地重建原始数据。

Lossless compression is ideal for situations where every single bit of data matters, such as medical imaging or text files. 无损压缩非常适用于每一位数据都很重要的情况,如医学成像或文本文件。
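As a concrete illustration of the lossless property described above, the following sketch compresses a byte array with DEFLATE and decompresses it again, checking that the round trip is bit-exact. It is a minimal example using the standard java.util.zip API and is not taken from the original text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class LosslessRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] original = "Lossless compression must reproduce every bit exactly."
                .getBytes(StandardCharsets.UTF_8);

        // Compress with DEFLATE (the algorithm behind gzip/zlib).
        Deflater deflater = new Deflater();
        deflater.setInput(original);
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            compressed.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();

        // Decompress and verify the round trip restores the original exactly.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed.toByteArray());
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        while (!inflater.finished()) {
            restored.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();

        System.out.println("compressed size: " + compressed.size());
        System.out.println("identical: " + Arrays.equals(original, restored.toByteArray()));
    }
}
```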

An Improved Association Rule Algorithm Based on Inter-dimensional Extension and Transaction Compression


0 Introduction
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques can be used to explore large databases and uncover previously unknown, useful patterns. Association rule mining computes, over a large transaction data set, the conditional probability that itemsets occur together, finds the frequent patterns present in the data, derives strong association rules from them, and uses these rules to predict trends. Data mining methods of this kind have gradually been applied to retail, insurance, banking, medical data analysis, and other fields.

A transaction database contains m items, that is, m distinct attribute values. A collection of zero or more items is called an itemset; an itemset containing k items is called a k-itemset, and each member of a k-itemset is called an element of that k-itemset. A transaction is also a set of items and is a subset of the item universe I. If an itemset X is a subset of a transaction t_j, then t_j is said to contain X. An association rule has the form X → Y, where X is the antecedent, Y is the consequent, and both X and Y are subsets of I.

1.2 Important properties of association rules. The support of a rule is the probability that X and Y appear together in the transaction set, where σ(X ∪ Y) denotes the number of transactions containing both the antecedent X and the consequent Y, and |T| denotes the total number of transactions in the database. Before mining, the user must first set a support threshold.
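To make the support and confidence definitions above concrete, here is a small, self-contained sketch. The toy transaction set and the candidate rule are illustrative inventions, not data from the paper.

```java
import java.util.List;
import java.util.Set;

public class RuleMetrics {
    // support(X -> Y)    = |transactions containing X ∪ Y| / |T|
    // confidence(X -> Y) = |transactions containing X ∪ Y| / |transactions containing X|
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "diapers", "beer", "eggs"),
                Set.of("milk", "diapers", "beer", "cola"),
                Set.of("bread", "milk", "diapers", "beer"),
                Set.of("bread", "milk", "diapers", "cola"));

        Set<String> x = Set.of("milk", "diapers");   // antecedent X
        Set<String> y = Set.of("beer");              // consequent Y

        long countX = transactions.stream().filter(t -> t.containsAll(x)).count();
        long countXY = transactions.stream()
                .filter(t -> t.containsAll(x) && t.containsAll(y)).count();

        double support = (double) countXY / transactions.size();
        double confidence = (double) countXY / countX;
        System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
    }
}
```

A rule is reported as "strong" only when both values clear the user-defined support and confidence thresholds mentioned above.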

Specialized English for Communication Engineering, Text 1: Digital Representation of Information


The key advantage of digital representation lies in the universality of representation. Since any medium, be it a text, an image, or a sound, is coded in a unique form which ultimately results in a sequence of bits, all kinds of information can be handled in the same way and by the same type of equipment. Furthermore, transformations of digital information are error free, while analog transformations introduce distortions and noise.

1. Storage
The same digital data storage device can be used for all media. The only difference may lie in size requirements. Still images and motion video require larger volumes than text or graphics. Sound also is demanding, though less so than imaging. Thus, appropriate digital devices may be required, such as compact disk read-only memories (CD-ROMs). But the key point is that a single type of digital device may store all kinds of information.

2. Transmission
Any data communication system capable of carrying bits has the potential to transmit any multimedia digital information. Thus, a single communication network supporting digital transmission can, in theory, be envisaged. This concept is referred to as an integrated services digital network (ISDN), which is a facility aimed at integrating the transport of all information media. The difficulties may arise from the requirements of certain applications, in particular those which need to respect with fidelity the time-dependency of digital signals. Even when the potential for integrating different types of information on a single communication system is not exploited, the benefits of digital versus analog transmission are numerous. First, digital signals are less sensitive to transmission noise than analog signals. Second, the regeneration of the signal -- the process by which an attenuated signal is strengthened -- is easier. Third, error detection and correction can be implemented. Fourth, the encrypting of the information is also easier.

3. Processing
As all information is stored on computers, it can be processed, analyzed, modified, altered, or complemented by computer programs like any other data. This is probably where the potential is the highest.
• An attempt may be made to recognize semantic contents (speech, handwriting, form, and pattern recognition). An example of advanced contents recognition is the detection of cuts in motion video to automatically identify and index video sequences.
• Data structures, chaining using pointers between information elements, may be created for faster and more flexible retrieval.
• Powerful editing by cut-and-paste functions to create monomedia (e.g. sound only) or multimedia documents is possible.
• The quality of information may be improved by the removal of noise or errors, as shown by the digitization of old vinyl records to create higher-quality audio compact disks.
• Information of a similar type but created through different processes -- such as motion imaging synthesized by computers and camera-captured video -- may be mixed.

To summarize: digital representation permits the storage of different information types on the same device. Information may also be transmitted over a single digital network. Likewise, when digitized, all forms of information may be treated by computer programs, for editing, quality improvement, or recognition of the meaning of the information.

4. Drawback of Digital Representation of Continuous Information
The major drawback of the digital representation of information lies in coding distortion.
The process of first sampling and then quantizing and coding the sampled values introduces distortions. As a result, the signal generated after digital-to-analog conversion and presented to the end user has little chance of being completely identical to the original signal. Increasing the sampling rate and multiplying the number of bits used to code samples reduces the distortion. The result is an increased number of bits per time or space unit, called the bit rate, necessary to represent the information. There are clearly technological limits to the explosion of bit rates: storage capacity is not infinite, and transmission systems also have limited bandwidths. One of the key issues, then, is to choose an appropriate balance between the accuracy of the digitization, which determines the bit rate, and the distortions perceived by users. Do not forget that it is how distortions are perceived by humans which matters, not what they are physically. Another consequence, as mentioned before, is the need for a large digital storage capacity for moving or still images and, to a lesser extent, for sound. Eight minutes of stereophonic CD sound are enough to completely fill the 80 megabyte hard disk of a standard personal computer. Fortunately, compression algorithms have been developed to alleviate these requirements. But only spectacular recent progress in digital storage capacity and cost has enabled the development of massive digital storage of media such as images, sound, and motion video.

Let us summarize: the digitization process introduces a distortion of the information. Reducing this distortion may be achieved by increasing the sampling rate and the number of bits used to code each sample. Images, sound, and motion video require a large amount of digital storage capacity.
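The storage figure quoted above follows directly from bit-rate arithmetic. The short sketch below reproduces the calculation for eight minutes of stereo CD audio; the CD parameters are the standard values, and the program itself is illustrative rather than part of the text.

```java
public class BitRate {
    public static void main(String[] args) {
        int sampleRateHz = 44_100;      // CD audio sampling rate
        int bitsPerSample = 16;         // CD audio sample resolution
        int channels = 2;               // stereo
        long bitsPerSecond = (long) sampleRateHz * bitsPerSample * channels;

        long seconds = 8 * 60;          // eight minutes
        long totalBytes = bitsPerSecond * seconds / 8;

        System.out.println("bit rate: " + bitsPerSecond + " bit/s");              // 1,411,200 bit/s
        System.out.println("8 minutes: " + totalBytes / (1024 * 1024) + " MiB");  // roughly 80 MB
    }
}
```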

Mitigating Computational Burden


In today's digital era, computational power is a crucial component of various industries, including healthcare, finance, and technology. However, with the increasing complexity of tasks and the ever-growing demand for processing data, computational pressure often becomes a significant challenge. To address this, several strategies can be employed to mitigate the burden on computational resources.

One effective approach is to optimize algorithms and software. By refining the underlying logic and eliminating redundant or inefficient operations, it is possible to significantly reduce the computational requirements of a task. This often involves the use of advanced techniques such as parallel processing, where multiple processors work simultaneously to complete a task, or distributed computing, where multiple computers collaborate to solve a problem.

Another strategy is to leverage high-performance computing (HPC) resources. These powerful systems, often housed in specialized data centers, are designed to handle computationally intensive tasks efficiently. By offloading complex calculations to these HPC platforms, regular computers can focus on more manageable tasks, thus reducing overall computational pressure.

Cloud computing also plays a pivotal role in mitigating computational burden. By leveraging the elastic nature of cloud resources, users can dynamically allocate computing power as needed, without expensive upfront investments. This flexibility allows for better scaling and optimization of computational workloads, ensuring efficient use of resources.

Additionally, data compression and optimization techniques can help reduce the size and complexity of datasets, thereby easing the computational load. By identifying and eliminating redundant or unnecessary data, or by using compression algorithms to shrink file sizes, the overall processing requirements of a task can be significantly reduced.

In conclusion, mitigating computational burden is crucial in ensuring the efficient and effective performance of complex tasks. By optimizing algorithms, leveraging HPC resources, employing cloud computing, and optimizing data, we can overcome the challenges posed by increasing computational demands and harness the full potential of our technological advancements.
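As a minimal illustration of the parallel-processing strategy mentioned above, Java's parallel streams spread a CPU-bound computation across the available cores. The workload here is an arbitrary stand-in chosen for the sketch; real speedups depend on the hardware and on how well the work divides.

```java
import java.util.stream.LongStream;

public class ParallelSum {
    public static void main(String[] args) {
        long n = 50_000_000L;

        long start = System.nanoTime();
        // Sequential baseline.
        double seq = LongStream.rangeClosed(1, n).mapToDouble(Math::sqrt).sum();
        long seqMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        // The same work split across the common fork/join pool.
        double par = LongStream.rangeClosed(1, n).parallel().mapToDouble(Math::sqrt).sum();
        long parMs = (System.nanoTime() - start) / 1_000_000;

        // Tiny differences between the two sums are expected because the
        // floating-point additions happen in a different order.
        System.out.printf("sequential: %d ms (%.3e), parallel: %d ms (%.3e)%n",
                seqMs, seq, parMs, par);
    }
}
```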

A New Hybrid Data Compression Algorithm for Weather Radar


Computer & Digital Engineering, Vol. 45, No. 3 (Serial No. 329), 2017

A New Hybrid Data Compression Algorithm for Weather Radar
CHEN Lu (1), MA Ke (2), LI Chongyang (2)
(1. Xi'an Aerospace Remote Sensing Data Technology Co., Ltd, Xi'an 710100; 2. Xi'an Electronic Engineering Research Institute, Xi'an 710100)

Abstract: To address the problem of weather radar data compression, a hybrid compression algorithm is proposed. The algorithm consists of three steps: pre-compression, lossy compression, and lossless compression. The paper first uses the characteristics of weather radar data to explain the different data requirements of different users; it then introduces the pre-compression algorithm and, according to the practical requirements of the various users, proposes applying both lossless and lossy compression algorithms to the radar data; finally, it gives the flow chart of the hybrid compression algorithm and verifies its effectiveness on measured data.

Key Words: weather radar, data compression, hybrid compression algorithm
Class Number: TN958   DOI: 10.3969/j.issn.1672-9722.2017.03.008

1 Introduction
Weather radar can observe the evolution of the weather accurately and rapidly, and it plays an important role in forecasting disastrous weather such as rainstorms and hail [1-2].
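The three-step pipeline described in this abstract can be sketched as a lossy quantization stage followed by a lossless stage. The quantization step, the 0.5 dBZ resolution, and the synthetic sweep below are illustrative assumptions, not the paper's actual pre-compression scheme, and DEFLATE stands in here for whichever lossless coder is chosen.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class HybridCompressionSketch {
    // Lossy stage: quantize reflectivity values (dBZ) to one byte each.
    static byte[] quantize(double[] dbz, double minDbz, double stepDbz) {
        byte[] out = new byte[dbz.length];
        for (int i = 0; i < dbz.length; i++) {
            long level = Math.round((dbz[i] - minDbz) / stepDbz);
            out[i] = (byte) Math.max(0, Math.min(255, level));
        }
        return out;
    }

    // Lossless stage: DEFLATE the quantized bytes.
    static byte[] deflate(byte[] data) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            bos.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        double[] sweep = new double[360 * 1000];           // one toy radar sweep
        for (int i = 0; i < sweep.length; i++) {
            sweep[i] = 30 + 10 * Math.sin(i / 500.0);      // synthetic reflectivity field
        }
        byte[] lossy = quantize(sweep, -32.0, 0.5);        // 0.5 dBZ resolution
        byte[] packed = deflate(lossy);
        System.out.printf("raw=%d B, quantized=%d B, compressed=%d B%n",
                sweep.length * 8, lossy.length, packed.length);
    }
}
```

Users who need exact values would skip the quantization stage and receive only the lossless output, which matches the paper's idea of serving different user requirements with different branches of the pipeline.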

Research on Compression Algorithms of TIN


Research on Compression Algorithms of TIN
WANG Xuan (1), ZHU Ying (2), WANG Wei (2)
(1. Zhongshan Fundamental Geographic Information Center, Zhongshan, Guangdong 528400; 2. Jiangsu Fundamental Geographic Information Center, Nanjing, Jiangsu 210013)

Abstract: Using the visual development language C# and a static triangulation algorithm, a stand-alone DEM data compression platform was built that implements two TIN compression methods: vertex clustering and center-point-based vertex decimation. The paper compares the characteristics of the two methods in detail, identifies their advantages and disadvantages under different thresholds and over terrain of varying complexity, and suggests which compression method is applicable in which situation.
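The vertex clustering method named in this abstract replaces a group of TIN vertices with a single representative point whose elevation is a weighted average. The paper gives this as formulas (1) and (2), but they arrive garbled by the two-column page layout, so the LaTeX below is only a plausible reconstruction; in particular, the inverse weighting by the quantities S_i is inferred from the surviving fragments and is not confirmed by the text.

```latex
% Elevation of the representative vertex as a weighted average (1),
% with weights normalized from the per-vertex quantities S_i (2).
\begin{align}
Z   &= \sum_{i=1}^{n} p_i \, Z_i \tag{1} \\
p_i &= \frac{1/S_i}{\sum_{i=1}^{n} 1/S_i} \tag{2}
\end{align}
```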

Keywords: digital elevation model; TIN compression; vertex clustering; center-point-based vertex decimation
CLC number: P208   Document code: B   Article ID: 0494-0911(2014)01-0094-03

1. Introduction
With the development of geographic information systems, remote sensing, and virtual geographic environments, the digital elevation model (DEM) has come into wide use. The contradiction between DEM processing and transmission speed on the one hand and user demand on the other has become increasingly acute, so DEM data compression has become a current research hotspot. A DEM can be represented in many ways, and among them the triangulated irregular network (TIN) model has advantages that the other DEM representations cannot replace [1]. The growth of geographic data volumes places high demands on the storage and transmission of spatial data; besides upgrading hardware, developing DEM compression algorithms has become an urgent need. Research on building and compressing TINs is therefore very worthwhile [2].

2. The Vertex Clustering Method and the Center-Point-Based Vertex Decimation Method
2.1 The vertex clustering method
The vertex clustering method was first proposed by Rossignac and Borrel [3].

Explaining Information Separation with Information Theory


Information theory, as developed by Claude Shannon, provides a framework for understanding how information can be separated and transmitted efficiently. 信息理论是由克劳德·香农提出的,它提供了一个框架,用于理解信息如何被有效地分离和传输。

Shannon's work laid the foundation for modern communication systems, and his theories have been applied to various fields, including computer science, telecommunications, and cryptography. 香农的工作为现代通信系统奠定了基础,他的理论已被应用到各个领域,包括计算机科学、电信和密码学。

One of the key concepts in information theory is the idea of information entropy, which measures the uncertainty or surprise associated with a given set of information. 信息理论中的一个关键概念是信息熵的概念,它衡量了与给定信息集相关的不确定性或惊喜。

In the context of information separation, entropy provides a measure of the amount of information contained in a system and the degree to which it can be compressed or transmitted. 在信息分离的背景下,熵提供了容纳在一个系统中的信息的量的度量,以及它可以被压缩或传输的程度。
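Shannon entropy is the quantity behind these statements. The sketch below, which is illustrative and not part of the original text, computes the empirical entropy of a byte stream; this is a standard way to estimate how far a lossless compressor could shrink the data relative to the 8 bits per byte it currently occupies.

```java
import java.nio.charset.StandardCharsets;

public class EmpiricalEntropy {
    // H = -sum p(b) * log2 p(b), in bits per byte.
    static double entropyBitsPerByte(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] text = "abababababababab".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("entropy: %.3f bits/byte (8 bits/byte stored)%n",
                entropyBitsPerByte(text));
    }
}
```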

The Contraction Mapping Theorem and Its Applications


Alright, let's dive into the fascinating world of the Contraction Mapping Theorem and its applications in a conversational yet concise manner.

In math, the Contraction Mapping Theorem is kind of like a magic trick. It tells us that if you have a function that "shrinks" things down in a certain space, it'll have a unique fixed point, a spot where the function doesn't change it at all. It's pretty cool when you think about it!

Now, why is this theorem so useful? Well, imagine you're trying to solve a complex equation or find the equilibrium state of a system. The Contraction Mapping Theorem can give you a starting point, or even the exact answer, by narrowing down the possible solutions.

One practical application of this theorem is in computer science, especially in algorithms for optimizing functions. Think of it as a way to "zoom in" on the best possible solution by repeatedly applying the "shrinking" function. It's like having a superpower to find the exact right answer in a haystack of possibilities.

Another cool thing about the Contraction Mapping Theorem is that it's not just limited to math and computer science. You can find its footprints in physics, economics, and even social sciences. Whenever there's a need to find a stable point or equilibrium in a system, chances are the Contraction Mapping Theorem has something to say about it.
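A minimal sketch of the theorem in action: the map x -> cos x is a standard textbook contraction on [0, 1], so repeatedly applying it converges to its unique fixed point. The starting point and tolerance below are arbitrary choices for the demo.

```java
public class FixedPointIteration {
    public static void main(String[] args) {
        // cos is a contraction on [0, 1] (|sin x| < 1 there), so iteration
        // converges to the unique x* with x* = cos(x*), roughly 0.739085.
        double x = 0.5;                     // arbitrary starting point
        for (int i = 0; i < 100; i++) {
            double next = Math.cos(x);
            if (Math.abs(next - x) < 1e-12) {
                System.out.println("converged after " + i + " iterations");
                break;
            }
            x = next;
        }
        System.out.println("fixed point ~= " + x);
    }
}
```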

How to Write an End-of-Term Summary for an Algorithms Course in English


算法英语期末总结怎么写Abstract:This essay provides a comprehensive summary of the topics covered in the Algorithm course. It discusses various algorithm design paradigms, complexity analysis techniques, and data structures. Additionally, it examines algorithms for searching, sorting, graph traversal, and dynamic programming. Finally, it evaluates the practical applications of algorithms in various domains.Introduction:Algorithms form the foundation of computer science and play a vital role in solving complex problems efficiently. This course on algorithms aimed to introduce students to different algorithm design paradigms and techniques for analyzing their efficiency. The course covered various topics such as data structures, searching, sorting, graph algorithms, and dynamic programming. This essay summarizes the key concepts learned throughout the course and their practical applications in real-world scenarios.Algorithm Design Paradigms:The course started with an introduction to various algorithm design paradigms. Four fundamental paradigms were discussed: greedy algorithms, divide and conquer, dynamic programming, and backtracking. Each paradigm was explained in detail, highlighting its unique characteristics and suitable problem domains. Greedy algorithms focus on making locally optimal choices, divide and conquer breaks down problems into subproblems, dynamic programming stores solutions to subproblems for reuse, and backtracking exhaustively explores all possible solutions.Complexity Analysis Techniques:Understanding the efficiency of an algorithm is essential for making informed decisions about algorithm selection. The course covered several complexity analysis techniques, including asymptotic notation (Big O), worst-case analysis, average-case analysis, and amortized analysis. Utilizing these techniques, we can quantify the runtime and space requirements of algorithms, allowing for comparisons and determining the most efficient approach for a given problem.Data Structures:Data structures provide a way to organize and store data efficiently. The course delved into various data structures such as arrays, linked lists, stacks, queues, trees, and graphs. Each data structure was examined in terms of its basic operations, memory consumption, and runtime complexity. Understanding the strengths and weaknesses of different data structures is crucial for choosing the appropriate structure for a specific problem, ensuring efficient algorithm execution.Searching:Searching is a ubiquitous operation in computer science. The course explored different searching algorithms such as linear search, binary search, depth-first search (DFS), and breadth-first search (BFS). Linear search and binary search were analyzed in terms of their runtime complexity, while DFS and BFS were discussed regarding their applications in graph traversal. Searching algorithms are key components of various applications such as information retrieval systems and recommendation engines.Sorting:Sorting is another fundamental operation in computer science. The course examined popular sorting algorithms, including bubble sort, insertion sort, selection sort, merge sort, quicksort, and heapsort. Each algorithm was presented with its time complexity analysis. Sorting algorithms are widely used in organizing data and are essential in various domains such as databases, data analysis, and network routing.Graph Algorithms:Graphs represent relationships between entities, and the course dedicated a significant portion to graph algorithms. 
Topics covered included graph representations, graph traversal algorithms (DFS and BFS), shortest path algorithms (Dijkstra's algorithm and Bellman-Ford algorithm), minimum spanning tree algorithms (Prim's algorithm and Kruskal's algorithm), and topological sorting. Graph algorithms have broad applications in various domains, including social networks, transportation networks, and recommendation systems.Dynamic Programming:Dynamic programming is an efficient algorithmic technique used for solving optimization problems by breaking them down into overlapping subproblems. The course introduced the concept of overlapping subproblems and optimal substructure, which are the key characteristics of problems suitable for dynamic programming. The approach was exemplified through various problem scenarios where dynamic programming provided optimal solutions, such as the knapsack problem and the Fibonacci sequence. Dynamic programming is widely used in numerous areas, including operations research, computer graphics, and bioinformatics.Practical Applications:Algorithms have practical applications in various domains, and the course highlighted a few of them. For instance, in the field of bioinformatics, algorithms are used for DNA sequence alignment, genome assembly, and protein structure prediction. In computer networks, algorithms are employed for routing packets, load balancing, and network optimization. In finance, algorithmic trading relies on efficient algorithms for making timely buy and selldecisions. Furthermore, algorithms play a crucial role in artificial intelligence, machine learning, and data mining.Conclusion:The Algorithm course provided a comprehensive understanding of various algorithm design paradigms, complexity analysis techniques, and data structures. It covered essential algorithms for searching, sorting, graph traversal, and dynamic programming. Additionally, it highlighted the practical applications of algorithms in diverse domains. The knowledge gained from this course equips students with the necessary tools to design efficient algorithms and solve complex problems in the real world. Developing strong algorithmic thinking skills is crucial for success in the field of computer science.。
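To accompany the dynamic programming discussion in this summary, here is a compact 0/1 knapsack sketch. The item weights, values, and capacity are made up for illustration; the point is the overlapping-subproblem table that the text describes.

```java
public class Knapsack {
    // Classic 0/1 knapsack: dp[w] = best value achievable with capacity w.
    static int maxValue(int[] weight, int[] value, int capacity) {
        int[] dp = new int[capacity + 1];
        for (int i = 0; i < weight.length; i++) {
            // Iterate capacity downwards so each item is used at most once.
            for (int w = capacity; w >= weight[i]; w--) {
                dp[w] = Math.max(dp[w], dp[w - weight[i]] + value[i]);
            }
        }
        return dp[capacity];
    }

    public static void main(String[] args) {
        int[] weight = {2, 3, 4, 5};
        int[] value  = {3, 4, 5, 8};
        System.out.println("best value: " + maxValue(weight, value, 10)); // 15
    }
}
```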

Optimization Algorithms


Optimization algorithms are a crucial tool in various fields, including engineering, computer science, economics, and many others. These algorithms are designed to find the best solution to a given problem within a set of constraints. They play a significant role in improving efficiency, reducing costs, and enhancing performance in various applications. However, the effectiveness of optimization algorithms depends on various factors, such as the complexity of the problem, the quality of the algorithm, and the computational resources available.

One of the key challenges in using optimization algorithms is selecting the most appropriate algorithm for a specific problem. There are numerous optimization algorithms available, each with its strengths and weaknesses. Some algorithms are better suited for continuous optimization problems, while others are more effective for discrete optimization problems. Additionally, the performance of an algorithm can vary depending on the problem's characteristics, such as the number of variables, constraints, and the presence of noise or uncertainty.

Another important consideration in using optimization algorithms is the computational resources required to solve a particular problem. Some optimization algorithms are computationally expensive and may require significant time and memory resources to find the optimal solution. In contrast, other algorithms are more efficient and can quickly converge to a satisfactory solution. Therefore, it is essential to balance the trade-off between the algorithm's performance and the available computational resources when selecting an optimization algorithm.

Furthermore, the convergence properties of optimization algorithms play a crucial role in their effectiveness. Convergence refers to the algorithm's ability to reach a satisfactory solution within a reasonable number of iterations. Some algorithms may converge quickly but get stuck in local optima, while others may take longer to converge but find a better global optimum. Understanding the convergence properties of an algorithm is essential for determining its suitability for a specific problem and optimizing its performance.

In addition to selecting the right algorithm and managing computational resources, parameter tuning is another critical aspect of optimizing algorithm performance. Many optimization algorithms have parameters that need to be tuned to achieve the best results for a given problem. Parameter tuning involves adjusting the algorithm's settings to improve its convergence speed, accuracy, and robustness. It requires a deep understanding of the algorithm's behavior and the problem at hand to find the optimal parameter values.

Overall, optimization algorithms are powerful tools that can significantly improve efficiency and performance in various applications. By selecting the right algorithm, managing computational resources effectively, understanding convergence properties, and tuning parameters appropriately, practitioners can harness the full potential of optimization algorithms to solve complex problems and achieve optimal solutions. As technology continues to advance, optimization algorithms will play an increasingly important role in driving innovation and progress across diverse fields.
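As a tiny, concrete instance of the convergence and parameter-tuning issues described above, the sketch below runs plain gradient descent on a one-dimensional quadratic. The learning rate is the parameter whose choice decides whether the iteration converges quickly, slowly, or not at all; the function and constants are purely illustrative.

```java
public class GradientDescentDemo {
    public static void main(String[] args) {
        // Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2 (x - 3).
        double learningRate = 0.1;   // the key tuning parameter
        double x = 10.0;             // arbitrary starting point
        for (int i = 0; i < 200; i++) {
            double grad = 2 * (x - 3);
            x -= learningRate * grad;
            if (Math.abs(grad) < 1e-9) {
                System.out.println("converged after " + i + " iterations");
                break;
            }
        }
        System.out.println("x ~= " + x + " (true minimum at 3)");
        // For this function, a learning rate above 1.0 makes the iteration diverge.
    }
}
```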


A Survey on Compression Algorithms in Hadoop
Sampada Lovalekar
Department of IT, SIES Graduate School of Technology, Nerul, Navi Mumbai, India
sampada.lovalekar@

Abstract—Big data is a hot term in IT today. It involves very large volumes of data, which may be structured, unstructured, or semi-structured. Each big data source has different characteristics, such as the frequency, volume, velocity, and veracity of the data. The volume keeps growing because of the use of the internet, smartphones, social networks, GPS devices, and so on. Analyzing big data is therefore a very challenging problem today, and traditional data warehouse systems are not able to handle such large amounts of data. Because the size is so large, compression clearly adds a benefit when storing this data. This paper explains the various compression techniques available in Hadoop.

Keywords—bzip2, gzip, lzo, lz4, snappy

I. INTRODUCTION
The volume of big data grows day by day because of the use of smartphones, the internet, sensor devices, and so on. The three key characteristics of big data are volume, variety, and value. Volume describes the large quantity of data generated by today's technologies. Big data comes in different formats such as audio, video, and images; this is variety. Data is generated in real time, with demands for usable information to be served up as needed. Value refers to how important the data is. Big data is used in many sectors such as healthcare, banking, and insurance, and the amount of data keeps increasing; big data sizes vary from a few dozen terabytes to many petabytes. Big data does not only bring new data types and storage mechanisms, but new types of analysis as well. Processing and managing big data is a major challenge in today's era, and traditional methods of storage and analysis are less efficient for it, so analytics for big data differs from analytics for traditional data. The challenges that come with big data include data privacy and security, data storage, and creating business value from the large amount of data. The following points show how fast data is growing [1]:
• In 2011 alone, mankind created over 1.2 trillion GB of data.
• Data volumes are expected to grow 50 times by 2020.
• Google receives over 2,000,000 search queries every minute.
• 72 hours of video are added to YouTube every minute.
• There are 217 new mobile Internet users every minute.
• 571 new websites are created every minute of the day.
• According to Twitter's own research in early 2012, it sees roughly 175 million tweets every day and has more than 465 million accounts.

As the size of big data grows, compression is a must: these large amounts of data need to be compressed. The advantages of compression are [2]:
• Compressed data uses less bandwidth on the network than uncompressed data.
• Compressed data uses less disk space.
• Data transfer across the network and to or from disk is sped up.
• Cost is reduced.

II. BIG DATA TECHNOLOGIES
Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed unstructured data. Hadoop has two main components, described below.

Figure 1. HDFS architecture
A. HDFS: Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system designed to store very large data sets and to stream those data sets at high bandwidth to user applications. It is easily portable from one platform to another. Figure 1 [3] shows the HDFS architecture, which contains NameNodes and DataNodes. HDFS has a master/slave architecture [3]: it consists of a single NameNode and a number of DataNodes which manage storage attached to the nodes that they run on. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes. The DataNodes perform block creation, deletion, and replication upon instruction from the NameNode.

B. MapReduce
Hadoop MapReduce is a software framework for processing vast amounts of data. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing the failed tasks. The slaves execute the tasks as directed by the master. The following figure [13] shows an example of Hadoop MapReduce.

Figure 2. Example of Hadoop MapReduce

It can be seen [13] that the Hadoop system captures datasets from different sources and then performs functions such as storing, cleansing, distributing, indexing, transforming, searching, accessing, analyzing, and visualizing. In this way, semi-structured and unstructured data are converted into structured data.

III. COMPRESSION TYPES IN HADOOP
Compression reduces I/O and decreases network usage: compressed data uses less bandwidth on the network than uncompressed data, and more data can be saved in less space. Since big data contains complex and unstructured data, compressing it is important. A codec is an implementation of a compression and decompression algorithm. Some compression formats are splittable; performance on large files is better if the format is splittable. The common compression algorithms supported by Hadoop are:
• LZO
• Gzip
• Bzip2
• LZ4
• Snappy

A. LZO
The LZO compression format is composed of many smaller blocks of compressed data, allowing jobs to be split along block boundaries. The block size should be the same for compression and decompression. LZO is fast and splittable. It is a lossless data compression library written in ANSI C with good speed; its source code and compressed data format are designed to be portable across platforms, and decompression is very fast. The characteristics of LZO are [4, 11]:
• Compression is similar to other popular compression techniques, such as gzip and bzip.
• It enables very fast decompression.
• It requires no additional memory for decompression except for source and destination buffers.
• It includes compression levels for generating pre-compressed data which achieve a quite competitive compression ratio.
• There is also a compression level which needs only 8 kB for compression.
• The algorithm is thread safe.
• The algorithm is lossless.
• LZO is portable.
Lzop is a file compressor which uses LZO for its compression services; it is a very fast compressor and decompressor.
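Whichever of these codecs is installed, Hadoop exposes them through a single CompressionCodec interface. The hedged sketch below uses the standard org.apache.hadoop.io.compress API to compress a file with the codec chosen from the output file extension; the class name and paths are placeholders, and codecs such as LZO must be installed separately before the factory can find them.

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCopy {
    // Usage (paths are placeholders): hadoop jar app.jar CodecCopy input.txt output.txt.gz
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        // Pick the codec from the output file extension (.gz, .bz2, .lz4, .snappy, ...).
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(out);
        if (codec == null) {
            throw new IllegalArgumentException("No codec found for " + out);
        }

        try (InputStream rawIn = fs.open(in);
             OutputStream compressedOut = codec.createOutputStream(fs.create(out))) {
            IOUtils.copyBytes(rawIn, compressedOut, 4096, false);
        }
    }
}
```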
B. Gzip
Gzip is GNU zip and is naturally supported by Hadoop. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding, and it will generally compress better than LZO, though more slowly. Java will use Java's own GZIP implementation unless the native Hadoop libraries are available on the CLASSPATH, in which case the native compressors are used instead [5]. Gzip compression works by finding similar strings within a text file and replacing those strings to make the overall file size smaller. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a (distance, length) pair. Literals and match lengths are compressed with one Huffman tree, and match distances are compressed with another tree; the trees are stored in a compact form at the start of each block. Deflate is the compression algorithm and inflate is the decompression algorithm. Gzip files are stored with the .gz extension. The gzip sources, written in C, are available in various formats [6]:
• tar
• shar
• zip
• tar.gz
• tar.z

C. Bzip2
Bzip2 [7, 8] is a freely available, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques. Bzip2 compresses data in blocks of between 100 and 900 kB. Its performance is asymmetric, as decompression is relatively fast. The current version is 1.0.6. It supports (limited) recovery from media errors: if you are trying to restore compressed data from a backup tape or disk and that data contains some errors, bzip2 may still be able to decompress those parts of the file which are undamaged. It is very portable and should run on any 32- or 64-bit machine with an ANSI C compiler. Because bzip2 compresses large files in blocks, the block size affects both the compression ratio achieved and the amount of memory needed for compression and decompression. The header of bzip2 data starts with the letters "BZ".

D. LZ4
LZ4 is a lossless data compression algorithm that is focused on compression and decompression speed. It provides compression speed of around 400 MB/s per core and a fast decoder with speed of multiple GB/s per core [10].

Figure 3. LZ4 sequence

Figure 3 [10] explains the working of the LZ4 algorithm. An LZ4 stream is a series of sequences. The token is a one-byte value whose first field is the literal length: if this field is 0 there are no literals, and if it is 15 more bytes are added, each representing a value from 0 to 255 that is added to the previous value to produce the total length. The next field holds the literals themselves, which are uncompressed. The offset field comes next; it gives the position of the match to be copied from (1 means "current position minus 1 byte"), and its maximum value is 65,535. The next field is the match length, taken from the second token field with values from 0 to 15. A base length applies, which is the minimum length of a match, called minmatch; this minimum is 4, so a value of 0 means a match length of 4 bytes and a value of 15 means a match length of 19 or more bytes. On reaching the highest possible value (15), additional bytes are output one at a time, with values ranging from 0 to 255, and they are added to the total to give the final match length. With the offset and the match length, the decoder can copy the repeated data from the already decoded buffer. Decoding the match length ends the sequence, and the next sequence begins.

E. Snappy
Hadoop-Snappy [9] is a project for Hadoop that provides access to Snappy compression. Snappy is written in C++, and its focus is on very high speed with reasonable compression. The requirements to build Snappy are gcc/c++, autoconf, automake, libtool, Java 6 with JAVA_HOME set, and Maven 3 [9].
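In a MapReduce job, any of these codecs can be switched on through configuration rather than code changes. The property names below are the standard Hadoop 2.x keys; choosing SnappyCodec for intermediate map output and GzipCodec for the final output is only an example, not a recommendation from the survey, and both codecs must be available on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // Compress the final job output written to HDFS.
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-example");
        // ... set mapper, reducer, input and output paths as usual ...
        return job;
    }
}
```

A splittable format such as bzip2 (or indexed LZO) is usually preferred for job output that later jobs will read in parallel, which is exactly the trade-off summarized in the next section.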
IV. SUMMARY OF HADOOP COMPRESSION SCHEMES
There is a large amount of data and it is growing day by day, and the properties of unstructured and semi-structured data are not uniform. Compression reduces this large volume of data and therefore reduces the storage space required, which is clearly beneficial. A compression format is commonly referred to as a codec, which is short for coder-decoder. The compression algorithms discussed in this paper can be summarized as follows [7]. Gzip is a general-purpose compressor. Bzip2 compresses better than gzip but is slower. The LZO compression format is composed of many smaller (~256 KB) blocks of compressed data, allowing jobs to be split along block boundaries [12]; speed is good for LZO and Snappy, but their compression is less effective. LZO decompresses about twice as fast as gzip but does not compress quite as well: expect files on the order of 50% larger than their gzipped versions. Snappy is better at decompression than LZO. Table 1 [7] shows a summary of these compression algorithms.

TABLE 1: SUMMARY OF HADOOP COMPRESSION FORMATS

The following table [12] shows a typical example, starting with an 8.0 GB file containing some text-based log data:

TABLE 2: COMPARISON OF DIFFERENT COMPRESSION FORMATS

V. CONCLUSION
We are in the era of big data, which brings various challenges and issues. Large amounts of data are generated from various sources in structured, semi-structured, or unstructured form, and such data are scattered across the Internet. Hadoop supports various compression types and compression formats. This paper discussed the different types of compression algorithms, summarized them, and compared them.

REFERENCES
[1] "Introduction to Big Data: Infrastructure and Networking Considerations", Juniper Networks, White Paper
[2] /display/MapR/Compression
[3] Dhruba Borthakur, The Hadoop Distributed File System: Architecture and Design, pp. 4-5
[4] :8080/LJ/220/11186.html
[5] /book/pression.html
[6] /
[7] /
[8] /
[9] /p/snappy/
[10] /p/lz4/
[11] /opensource/lzo/
[12] /blog/2009/11/hadoop-t-twitter-part-1-splittable-lzo-compression/
[13] Jean Yan, "Big Data, Bigger Opportunities", White Paper, April 9, 2013
