Efficient Compression of XML Tree Structures


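A Survey of State-of-the-Art XML Compression Techniques

A review of data compressors that are XML-aware (or not) and that you can query (or not)

Summary: XML is regarded as the standard for representing data and exchanging it over the World Wide Web.

XML is extremely flexible and widely accepted, but it has one drawback: XML documents are large.

That size means the amount of information you transfer, process, store, and query is usually larger than with other data formats.

You can choose from several XML compression techniques to deal with these problems.

This article gives an overview of state-of-the-art XML compression techniques.

Date: 5 September 2011. Level: Intermediate. Original language: English.

Introduction

Frequently used acronyms
CDATA: Character data
DTD: Document Type Definition
GPS: Global Positioning System
HTML: Hypertext Markup Language
PPM: Prediction by Partial Matching
SAX: Simple API for XML
W3C: World Wide Web Consortium
XML: Extensible Markup Language

XML is one of the most useful and important technologies to emerge from the wide popularity of HTML and the World Wide Web.

XML solves many problems: it provides a neutral data representation between different architectures, bridges the gap between software systems with minimal effort, and stores large volumes of semi-structured data.

XML is often called self-describing data because it is designed so that the schema is repeated for every record in the document.

This self-describing property gives XML great flexibility, but it also makes XML documents verbose, and therefore large.

Because the use of XML keeps growing and large repositories of XML documents are now common, the demand for efficient XML compression tools is high.

Figure 1 illustrates the benefit of using an XML compressor to reduce the cost of transferring XML data over a network.

To tackle the size of large XML documents, many XML-aware compressors exploit the well-known structure of XML documents and achieve better compression ratios than general text compressors.

The advantages of XML compression tools include reducing the network bandwidth needed for data exchange, reducing the disk space needed for storage, and minimizing the main-memory requirements of processing and querying XML documents.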

How EXI Compresses XML

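EXI (Efficient XML Interchange) is a W3C encoding format that represents XML data compactly in binary form.

It works by removing redundant information from the XML data and by optimizing how the data is represented, so that both encoding and decoding are efficient.

The principles and operation of EXI are described below.

EXI can use an XML schema that describes the structure of the XML data, which removes some of the redundancy in XML. The schema is optional: EXI can also encode a document without one, usually less compactly.

An XML schema defines the relationships among the elements, attributes, and other components of the XML data, so EXI can derive a more compact encoding from it.

EXI uses an event-based encoding: the XML data is converted into a sequence of events, and those events are then encoded.

During encoding, EXI uses several techniques to reduce repetition and redundancy in the data.

For example, EXI uses a dictionary-style technique (string tables): strings that have already been seen are stored in a table and referred to by index.

This greatly reduces repeated data and gives better compression.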
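The event-plus-table idea described above can be illustrated with a small sketch. It is not an EXI encoder: the event names `SE` (start element) and `EE` (end element), the function `to_events`, and the tuple layout are assumptions chosen for illustration, and only element names are tabulated.

```python
import io
import xml.etree.ElementTree as ET


def to_events(xml_bytes):
    """Turn an XML document into (event, name_index) pairs plus a name table.

    Hypothetical helper: each element name is stored once in a table and
    every start-element event refers to it by a small integer index.
    """
    table = {}   # element name -> index, in order of first appearance
    events = []
    stream = io.BytesIO(xml_bytes)
    for kind, elem in ET.iterparse(stream, events=("start", "end")):
        if kind == "start":
            index = table.setdefault(elem.tag, len(table))
            events.append(("SE", index))
        else:
            events.append(("EE", None))
    names = sorted(table, key=table.get)
    return events, names


doc = b"<list><item/><item/><item/></list>"
events, names = to_events(doc)
print(names)    # the name table: each distinct name appears once
print(events)   # the event stream refers to names by index
```

The repeated name `item` is written into the table once; each later occurrence costs only an index, which is the effect the paragraph attributes to the dictionary technique.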

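EXI also encodes at the bit level: the data is converted into binary form for storage and transmission.

By using variable-length codes and a bit stream, EXI can choose an encoding suited to how often an item occurs and how large it is, which improves compression further.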
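As a hedged illustration of what a variable-length code buys, the sketch below stores a non-negative integer in 7-bit groups, so small values take one byte and larger values take more. The functions `encode_uint` and `decode_uint` are names invented here, and the sketch is a generic scheme of this kind rather than a byte-for-byte statement of what the EXI specification requires.

```python
def encode_uint(n):
    """Encode a non-negative integer in 7-bit groups, least significant first.

    The high bit of each byte says whether another byte follows.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)   # continuation bit set: more bytes follow
        else:
            out.append(group)          # last byte: continuation bit clear
            return bytes(out)


def decode_uint(data):
    """Inverse of encode_uint for a single encoded integer."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value
    raise ValueError("truncated input")


small = encode_uint(5)      # fits in one byte
larger = encode_uint(300)   # needs two bytes
```

Values below 128 cost a single byte, which is why a scheme like this suits data dominated by small numbers.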

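Decoding EXI is the reverse of encoding.

The decoder uses the encoding rules and the string tables to convert the encoded binary data back into XML.

In summary, EXI achieves efficient encoding and decoding of XML data by combining XML schemas, event-based encoding, and bit-level encoding.

It reduces redundancy and repetition in the data and makes data transfer more efficient.

With EXI, the information in the document is preserved while storage space and network bandwidth are reduced, which improves system performance and efficiency.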

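Thesis Proposal: XCR Compression of XML Documents and Its Application in IM Software

I. Research background and significance

With the continuing development of Internet technology, IM (instant messaging) software is attracting more and more attention.

IM software is real-time, interactive, and easy to use, and has become an indispensable part of daily life.

Because IM software handles large volumes of data and consumes a lot of traffic, effective data compression is needed to improve its performance and user experience.

In IM communication, XML (Extensible Markup Language) is often used for data exchange.

XML data has a clear structure and is easy to parse, so it is a common data format in IM software.

However, XML data contains a large amount of repeated information, so transmission carries redundant data, which lowers transfer speed and efficiency.

This research therefore investigates the XCR compression technique for XML documents and its application in IM software, with the aim of improving the data transfer efficiency of IM software, enabling fast transmission and processing, and improving the user experience.

II. Research content and methods

The research covers two areas.

1. XCR compression of XML documents

XCR compression extracts information from an XML document and encodes it so that the document is compressed to a smaller size while its integrity is preserved.

XCR compression consists of three stages:

(1) Information extraction: the XML document is parsed and analyzed syntactically to extract its data content and structural information.

(2) Encoding: the extracted information is compressed with an encoding algorithm. XCR mainly uses a modular encoding method that splits the XML document into modules and encodes each module separately.

(3) Data recovery: the compressed data is decoded and restored to the original XML document.

2. Application of XCR compression in IM software

The research examines two uses of XCR compression in IM software:

(1) Data transmission: applying XCR compression to the XML data produced by IM software reduces the amount of data transmitted and improves transfer speed and efficiency.

It also reduces the traffic consumed by IM software and lowers communication costs.

(2) Data processing: because IM software uses XML data heavily, XCR compression can improve the speed and efficiency of data processing and shorten the response time of IM software.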

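Efficient Compression of XML Tree Structures

Zhong Zhiping (仲志平); Yu Qishan (喻其山)

Abstract: A considerable fraction of an XML document consists of markup: begin- and end-element tags that describe the document's tree structure. This paper studies compression algorithms for the XML tree structure and compares how different coding techniques relate to the number of nodes in the tree and to the size of the patterns. An efficient compression algorithm for XML tree structures is presented. Experimental results show that the algorithm is especially effective for XML tree structures in which internal fragments (tree patterns) are repeated many times.

Journal: Journal of Anhui Normal University (Natural Science), 《安徽师范大学学报（自然科学版）》
Year (volume), issue: 2011 (034) 001
Pages: 5 (pp. 33-37)
Keywords: structure compression; unranked tree; DAG; SLT
Authors' affiliation: College of Physics and Electronic Information, Anhui Normal University, Wuhu, Anhui 241000
Language of the text: Chinese
CLC number: TP393.02

XML has become the main format for representing and exchanging data on the Web. It describes the tree structure of a document with nested begin- and end-element tags. Because these tags are repeated many times, they create a great deal of structural redundancy, which seriously reduces the efficiency of storing, transmitting, and exchanging XML data. How to compress the tree structure of XML documents effectively has become an active research topic.

The core idea of XMill [1], the first XML-specific compression technique, is to separate the structure of an XML document from its content, group related data items into the same container (tree structure and data), and then compress the containers with a conventional text compressor such as gzip, bzip2, or PPM. Compression tools based on the same idea include XWRT (XML Word Replacing Transform) [2] and XComp [3].

By degree of query support, XML compression techniques can be divided into queriable compression [3-11] and non-queriable compression [1,12]. Non-queriable techniques aim mainly at removing redundancy in XML data, raising the compression ratio, and reducing storage space, but answering a query requires decompressing the whole document, which adds load to the system. Queriable techniques allow queries to be processed directly on the compressed XML data, and their compression efficiency is usually lower than that of conventional compressors. XGrind [4], the first queriable XML compressor, keeps the original structure of the XML document, encodes element and attribute names with a dictionary, and compresses character data with semi-adaptive Huffman coding. XSeq [7] is a grammar-based queriable XML compressor; like XMill it separates structure from data, and it then compresses each with Sequitur, a well-known grammar-based string compression algorithm. TREECHOP [9] encodes the SAX event stream as binary numbers and writes them to the compressed stream in depth-first order; the codeword of each parent node is a prefix of the codewords of its children, and two nodes in the document that have the same path have the same codeword. Succinct DOM [13] and ISX [14] store the XML tree structure using balanced parentheses.

The algorithms above either do not support queries or improve compression by choosing a better-suited back-end compressor. None of them both supports queries and raises the compression ratio by improving the representation of the XML tree structure itself. Following the pattern-based compression algorithm BPLEX, the tree structure of a given XML document can be converted into a compact form with an SLT grammar and then binary-encoded, which gives effective compression.

The tree structure of an XML document arises naturally from the nesting of element tags and is an unranked tree: each node can have any number of children. Figure 1(a) shows an example XML tree structure. The root node p (parent) has 3 child nodes c (child), and the three c nodes have 0, 3, and 2 child nodes d (descendant) respectively. Using the common "first-child/next-sibling" encoding, this unranked tree is converted into a binary tree: the first child of each node in the unranked tree becomes the left child of the corresponding node in the binary tree, and the next sibling of each node becomes its right child. Any missing left or right child in the binary tree is filled with a nil leaf node, written "-". The result is shown in Figure 1(b). The binary tree encoding of the example (that is, the label sequence of the binary tree in preorder) is:

p c - c d - d - d - - c d - d - - - -

1.1 Fixed-length bit coding

The most direct way to encode the label sequence with fixed-length bit codes is to build a symbol table for the given binary tree, assign every symbol in the table a binary code of fixed length, and concatenate the codes of all node symbols in the binary tree to represent the label sequence. Specifically, the special symbol for the nil leaf is represented by the number 0 padded to the fixed length, the first symbol in the symbol table is assigned binary 1 (01, 001, and so on, depending on the length), and the following symbols are numbered in the same way. Each string in the symbol table is terminated by a character 0, and one more 0 is appended to mark the end of the table. For the example above the symbol table is T=parent0child0descendant00, where the first 0 after the string descendant ends the string and the second 0 ends the symbol table. The codes assigned to the symbols are therefore: -=00, parent=01, child=10, descendant=11. Encoding the example unranked tree with these codes gives:

BTEncode = 01100010110011001100001011001100000000

The advantage of fixed-length bit coding is that the unranked tree is represented as a sequence of binary codes, which is convenient for a conventional text compressor. The disadvantage is the large number of redundant nil leaf nodes, which increase the cost of storage and querying. If the original XML document has n nodes, the corresponding binary tree contains n+1 nil leaves. For example, the binary tree in Figure 1(b) has 9 internal nodes and 10 nil leaf nodes.

1.2 Fixed-length coding that avoids nil leaf nodes

This section avoids the redundant nil leaf nodes by introducing four prefix codes:

1) The prefix "11" denotes a binary node with no nil child, "01" a binary node whose left child is nil, "10" a binary node whose right child is nil, and "00" a binary node whose left and right children are both nil.
2) Each of these four prefixes is combined with the binary code of a symbol from the symbol table to represent an internal node of the binary tree.

For example, the code of the root node p is "01"; because the right child of p is a nil leaf, p is encoded as "1001". Applying this method to the example gives:

BTEncode = 100101101110011101110011101001110011

The advantage of this method is that the many redundant nil leaf nodes are avoided. However, every internal node of the binary tree now carries 2 extra bits that say whether its left and right children are nil, so compared with fixed-length coding the total length of the encoded node sequence does not improve markedly, and when $\lceil \log_2 m \rceil < \frac{2n}{n+1}$ the total length even increases (n is the total number of nodes in the XML document and m the number of distinct nodes; the proof follows). Encoding the example with "fixed-length bit coding" and with "fixed-length coding that avoids nil leaf nodes" gives lengths of 19·2 = 38 and 9·(2+2) = 36 bits respectively. If one more node is added to the symbol table, the results are 21·3 = 63 versus 10·(3+2) = 50. Evidently the outcome of the two methods depends on the size of the original XML document and on the number of nodes in the symbol table.

Proof: Let the XML document of an unranked tree have n nodes in total, m of them distinct. Binary tree encoding adds n+1 nil leaves, so the total number of nodes becomes 2n+1. Because m distinct nodes need at least $\lceil \log_2 m \rceil$ bits (the smallest integer not less than $\log_2 m$), the total length of the node sequence under fixed-length coding is

$$L_{\mathrm{fixed}} = (2n+1)\cdot\lceil \log_2 m \rceil \qquad ①$$

Under coding without nil leaf nodes, each node of the original XML document is encoded in $\lceil \log_2 m \rceil + 2$ bits, so the total length of the node sequence is

$$L_{\mathrm{no\text{-}nil}} = n\cdot(\lceil \log_2 m \rceil + 2) \qquad ②$$

From ① and ②, when $n\cdot(\lceil \log_2 m \rceil + 2) > (2n+1)\cdot\lceil \log_2 m \rceil$, that is, when $\lceil \log_2 m \rceil < \frac{2n}{n+1}$, the total length without nil leaves is greater than the total length under fixed-length coding.

1.3 Variable-length bit coding

Variable-length bit coding methods include var coding, Huffman coding, and arithmetic coding.

1) Var coding reserves the first bit of each byte to indicate whether the following byte still belongs to the current symbol (encoding result omitted).

2) Huffman coding encodes according to the frequency of each node symbol in the XML document. In the example, the three symbols "p", "c", and "d" occur 1, 3, and 5 times, and the corresponding Huffman codes are 00, 01, and 1. If the four prefixes "11", "01", "10", and "00" are used to say whether each node has nil leaf children, the Huffman encoding of the node sequence is

BTEncode = 1000010111010110110011001011001

with a total length of 31 bits (1·2 + 3·2 + 5·1 + 9·2).

3) Arithmetic coding encodes the whole input message directly as one fraction n with 0.0 ≤ n < 1.0. This codeword n is assigned to the entire input stream rather than assigning a codeword to each symbol. The input stream is read symbol by symbol, and each processed symbol adds a few bits to the codeword. The "current interval" is first defined as [0,1), and the following two steps are repeated for every symbol s of the input stream: ① divide the current interval into subintervals whose lengths are proportional to the symbol probabilities; ② select the subinterval for symbol s and make it the new current interval. The whole input stream is processed in this way, and any number in the interval that corresponds to the last symbol of the stream represents the result of the arithmetic coding.

The steps for encoding the input stream are as follows. First the interval [0,1) is divided among the symbols, in any order, each symbol receiving a subinterval whose size is proportional to its probability, as in the statistical model of Table 1. Then low and high are updated for each symbol x as

$$low' = low + (high - low)\cdot lowRange(x)$$
$$high' = low + (high - low)\cdot highRange(x)$$

where lowRange(x) and highRange(x) are the lower and upper bounds of the interval assigned to symbol x. In the example, when the first symbol p is read, low = 0.0 + (1.0 - 0.0)·0.69 = 0.69 and high = 0.0 + (1.0 - 0.0)·0.74 = 0.74; when the second symbol c is read, low = 0.69 + (0.74 - 0.69)·0.53 = 0.7165 and high = 0.69 + (0.74 - 0.69)·0.69 = 0.7245. The computation continues in this way up to the last symbol of the input stream, and any value between the final low and high is taken as the arithmetic code of the stream. The decoder reverses the encoding process; details are omitted. The advantage is that arithmetic coding can produce shorter codes than Huffman coding, but the whole document must be decompressed to answer a query.

A top-down XML tree structure can be represented as a directed acyclic graph (DAG). Moreover, a single traversal of the tree is enough to obtain the unique minimal DAG (mu-DAG for short) together with a hash table of subtree fragments. A DAG differs from a tree in two ways: 1) internal nodes of a DAG may have several incoming edges; 2) the DAG may have far fewer nodes and edges than the tree it represents. Typically the number of edges in the mu-DAG is about 10% of the number of edges in the original XML tree structure [4].

Here we use references to represent the XML tree structure as a binary DAG. When a subtree is repeated, the repeated occurrences refer to the name assigned to that subtree. For example, the mu-DAG of a binary tree of the form a(a… is rewritten with bottom-up numbering, in which the entry "0" denotes a single node symbol, the next entry denotes a node a whose left and right children are both "0", and the entry "2" denotes a node a whose left and right children are both "1". The original example (the mu-DAG of the binary tree including nil leaves) is written as a binary DAG in the same notation. Because the root node can never be referenced, the objects to encode in the example are the three reference numbers 0, 1, 2 and the three symbols p, c, d. With fixed-length bit coding each symbol therefore needs 3 bits, and the number 3 itself can be stored with variable-length var coding. The total code length of the DAG under fixed-length bit coding is 48 (16·3), whereas the total length of the binary tree under fixed-length bit coding is 38 (19·2) (see Section 1.1). If one more node is added to the tree structure, however, the two results are 54 (48 + 2·3) and 63 (21·3) respectively.

These results show that, when a back-end compressor is used, mu-DAGs do not necessarily give the best encoding size, because for small document trees the cost of subtree references is too high. We therefore introduce a threshold to decide whether a new pattern should be introduced: a new pattern is introduced only if its weight, defined as weight = (frequency of the pattern) × (size of the pattern), is greater than the threshold.

A unique minimal DAG avoids storing repeated subtrees. Sometimes, however, an internal fragment of a tree structure (a "tree pattern") is repeated many times, and a DAG cannot remove this redundancy. An example is two subtrees that are identical except at the leaf positions: because the subtrees are not completely identical, they cannot be shared in the DAG. Sharing graphs [10] generalize DAGs by allowing arbitrary connected subgraphs of the tree (that is, tree patterns) to be shared. A sharing graph can conveniently be represented by an SLT (straight-line tree) grammar [5]. As with DAGs, every shared component corresponds to a reference. A tree pattern is represented by its internal nodes; to keep the tree well formed, placeholder leaves labelled y1, y2, … are added from left to right. For example, the tree structure a(b(c),a(b(d),a(b(e),f))) contains no repeated subtree, so it is identical to its mu-DAG. Yet the tree pattern "an a node with left child b" occurs 3 times. A binary SLT grammar can express this pattern as a(b(y1),y2), and the binary SLT grammar for the whole tree structure is:

(1:a[b[y1],y2], 2:1[c,1[d,1[e,f]]])

Note that references now appear at internal nodes of the tree, and the variable placeholders y1, y2, … of a tree pattern can be instantiated with many different concrete subtrees. Compared with DAGs, the trees of an SLT grammar are no longer strictly binary, because a reference can have any (but fixed) number of children, equal to the number of variables in its definition. Because the encoding is bottom-up, the first tree pattern contains no references and every later pattern can refer only to earlier patterns, so the exact number of references need not be stored explicitly. The example above is encoded with fixed-length codes as follows: 1) the binary codes 000-101 represent the six symbols a-f; 2) the binary code 110 represents any variable; 3) the binary code 111 represents the reference "1". The fixed-length encoding of the SLT grammar (1:a[b[y1],y2], 2:1[c,1[d,1[e,f]]]) is therefore:

000001110110111010111011111100101

The final code length of the tree pattern representation is 33 (11·3). The fixed-length coding without nil leaf nodes from Section 1.2 would need 10·(3+2) = 50 bits, so the SLT grammar bit coding is clearly better than the coding that avoids nil leaf nodes.

To verify the effectiveness of the SLT grammars produced by BPLEX and of the DAG encoding, we compared them with two representative XML compressors: the queriable XML compressor TREECHOP [9] and the typical XML structure compressor XMill [1]. The text compressor gzip was used as the back-end compressor and was also applied to BPLEX and DAG. To test tree structure compression, all text data was removed from the data sets and replaced with placeholders; the characteristics of the test data sets are given in Table 2. Each algorithm was run once on each data set in Table 2 and its average compression ratio was computed; the results are shown in Figure 2. The results show that DAGs help XML tree structure compression, and that BPLEX not only produces a queriable representation of the tree structure but also compresses very effectively even without back-end text compression.

This paper focuses on compressing the tree structure of XML documents. A known algorithm is used to extract a grammar representation of the tree structure in which repeated tree patterns are factored out, and a compact binary encoding is then applied. Among the tree structure compression algorithms currently available, we provide the smallest queriable representation of XML tree structures.

References

[1] LIEFKE H, SUCIU D. XMill: An efficient compressor for XML data [C]. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, (18): 153-164.
[2] SKIBINSKI P, SWACHA J. Combining efficient XML compression with query processing [C]. Proceedings of the 11th East-European Conference on Advances in Databases and Information Systems, 2007, (16): 330-342.
[3] LI W. An XML compression tool, Master's thesis [R]. University of Waterloo, 2003.
[4] TOLANI P, HARITSA J. XGRIND: a query-friendly XML compressor [C]. Proceedings of the 18th International Conference on Data Engineering, 2002, (16): 225-234.
[5] MIN J, PARK M, CHUNG C. XPRESS: a queriable compression for XML data [C]. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003, 16: 122-133.
[6] ARION A, BONIFATI A, COSTA G, et al. XQueC: Pushing queries to compressed [C]. Proceedings of 29th International Conference on Very Large Data Bases, 2003, 18: 1065-1068.
[7] WANG H, LI J, et al. XCpaqs: Compression of XML document with XPath query support [C]. Proceedings of the International Conference on Information Technology, 2004, 18: 354.
[8] LIN Y, ZHANG Y, LI Q, et al. Supporting efficient query processing on compressed XML files [C]. Proceedings of the 2005 ACM Symposium on Applied Computing, 2005, 18: 660-665.
[9] LEIGHTON G, MÜLDNER T, DIAMOND J. TREECHOP: A tree-based queriable compressor for XML [R]. Jodrey School of Computer Science, Acadia University, 2005.
[10] MIN J, PARK M, CHUNG C. A compressor for effective archiving, retrieval and update of XML documents [J]. ACM Transactions on Internet Technology, 2006, 6(3): 223-258.
[11] WU Hao, GENG Huantong, WU Xiang. Research on a BBS topic discovery algorithm based on cluster analysis [J]. Journal of Anhui Normal University: Natural Science, 2009, 32(1): 9-13.
[12] LEIGHTON G, DIAMOND J, MULDNER T. AXECHOP: A grammar-based compressor for XML [C]. Proceedings of the Data Compression Conference, 2005, 18: 467.
[13] DELPRATT O, RAMAN R, RAHMAN N. Engineering succinct DOM [R]. In EDBT, 2008.
[14] WONG R K, LAM F, SHUI W M. Querying and maintaining a compact XML storage [R]. In WWW, 2007.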
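The bit strings quoted in Sections 1.1 to 1.3 and in the SLT grammar example can be reproduced with a short sketch. It is only an illustration under stated assumptions: the tuple representation of trees, the helper names `to_binary`, `preorder`, and `prefix`, and the token list for the grammar are inventions for this example, while the code tables are the ones given in the text.

```python
# Unranked example tree: p has three c children with 0, 3 and 2 d children.
# A node is (label, [children]); this representation is an assumption.
TREE = ("p", [("c", []),
              ("c", [("d", []), ("d", []), ("d", [])]),
              ("c", [("d", []), ("d", [])])])


def to_binary(siblings):
    """First-child/next-sibling conversion; None stands for a nil leaf."""
    if not siblings:
        return None
    (label, children), rest = siblings[0], siblings[1:]
    return (label, to_binary(children), to_binary(rest))


def preorder(node):
    """Yield (label, left_is_nil, right_is_nil); nil leaves are yielded as '-'."""
    if node is None:
        yield ("-", None, None)
        return
    label, left, right = node
    yield (label, left is None, right is None)
    yield from preorder(left)
    yield from preorder(right)


def prefix(left_nil, right_nil):
    """Section 1.2 prefixes: 11 no nil child, 01 left nil, 10 right nil, 00 both."""
    return ("0" if left_nil else "1") + ("0" if right_nil else "1")


FIXED = {"-": "00", "p": "01", "c": "10", "d": "11"}   # Section 1.1 codes
HUFFMAN = {"p": "00", "c": "01", "d": "1"}             # Section 1.3 codes

nodes = list(preorder(to_binary([TREE])))
labels = " ".join(label for label, _, _ in nodes)
fixed_bits = "".join(FIXED[label] for label, _, _ in nodes)
no_nil_bits = "".join(prefix(ln, rn) + FIXED[label]
                      for label, ln, rn in nodes if label != "-")
huffman_bits = "".join(prefix(ln, rn) + HUFFMAN[label]
                       for label, ln, rn in nodes if label != "-")

# SLT grammar (1:a[b[y1],y2], 2:1[c,1[d,1[e,f]]]) as a flat token list:
# a-f -> 000..101, any variable -> 110, the reference "1" -> 111.
SLT = {"a": "000", "b": "001", "c": "010", "d": "011",
       "e": "100", "f": "101", "y": "110", "1": "111"}
slt_bits = "".join(SLT[t] for t in "a b y y 1 c 1 d 1 e f".split())

print(labels)
print(len(fixed_bits), len(no_nil_bits), len(huffman_bits), len(slt_bits))
```

Running the sketch gives lengths of 38, 36, 31, and 33 bits, matching the figures in the text, and the first two also agree with formulas ① and ② for n = 9 and 2-bit symbol codes.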

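XML Array Structures

XML (Extensible Markup Language) is a markup language for storing and transporting data.

In XML, an array structure is a common way of organizing data that lets the data be stored and passed on in an ordered way.

This article looks at how XML array structures are defined, how they are used, and examples from practice.

1. Definition of XML array structures

1.1 Basic XML concepts

XML uses tags to organize data into a tree; a tag can carry attributes and a value.

Array structures are usually built in XML by nesting elements.

1.2 How an array structure is represented

In XML, an array structure can be written as follows:

    <array>
      <item>Value 1</item>
      <item>Value 2</item>
    </array>

In this fragment the <array> element contains several <item> elements, and each <item> element contains one value.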
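As a small illustration of reading such an array back in document order, the sketch below uses Python's standard `xml.etree.ElementTree` module. The function name `read_array` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

ARRAY_XML = "<array><item>Value 1</item><item>Value 2</item></array>"


def read_array(xml_text):
    """Return the text of every <item> child, in document order."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.findall("item")]


values = read_array(ARRAY_XML)
print(values)
```

`findall` returns the matching children in document order, so the order of the array is preserved.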

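2. Uses of XML array structures

2.1 Ordered storage

One of the main advantages of an XML array structure is that it stores data in order.

Defining several elements of the same type keeps the data in sequence and makes it easy to read and understand.

2.2 Multi-level structure

XML array structures can be nested to several levels to form complex data structures.

This makes XML well suited to strongly hierarchical data, such as trees.

2.3 Flexible data types

XML does not require the elements of an array to have the same data type.

This flexibility lets an XML array hold many kinds of data, from simple text to complex nested structures.

3. Practical examples of XML array structures

3.1 Configuration files

XML array structures are often used in configuration files, for example:

    <config>
      <server>
        <address>192.168.1.1</address>
        <port>8080</port>
      </server>
      <database>
        <host>localhost</host>
        <user>admin</user>
        <password>secret</password>
      </database>
    </config>

A structure like this shows the relationships between the configuration items clearly.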

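How EXI Compresses XML

XML (Extensible Markup Language) is a common data exchange format; it is readable and clearly structured.

In practice, however, XML files usually contain many tags and much redundant text, so they are large and awkward to transmit and store.

Many compression schemes have been developed to solve this problem. One widely known format is EXI (Efficient XML Interchange), a binary encoding of XML.

EXI compresses XML by converting the XML file into a binary format, which reduces its size.

Specifically, EXI compresses in the following steps:

1. Building a dictionary: EXI builds a dictionary (string tables) that holds the element and attribute names that can occur in the XML file.

This avoids storing the same element and attribute names repeatedly during compression.

2. Binary representation: EXI represents the element and attribute names in the XML file with binary codes.

This saves space, because a binary code is usually more compact than the textual form.

3. Value compression: EXI compresses the values in the XML file, such as attribute values.

Common techniques include integer encoding, string dictionaries, and boolean encoding.

These techniques represent values as shorter binary sequences and so reduce the file size.

4. Removing duplicate data: during compression, EXI detects repeated data in the XML file and avoids storing it again.

For example, if an element occurs several times in the XML file with the same value, EXI stores the value only once and refers to it afterwards.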
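The "store once, refer afterwards" behaviour in step 4 can be sketched with a simple value table. This is an illustration of the idea only: the `lit` and `ref` markers and the functions `encode_values` and `decode_values` are hypothetical and are not taken from the EXI format.

```python
def encode_values(values):
    """Replace repeated values with a reference to their first occurrence."""
    table = {}   # value -> index of first occurrence
    out = []
    for value in values:
        if value in table:
            out.append(("ref", table[value]))   # already seen: emit index only
        else:
            table[value] = len(table)
            out.append(("lit", value))          # first time: emit the literal
    return out


def decode_values(encoded):
    """Rebuild the original value list from literals and references."""
    table = []
    result = []
    for kind, payload in encoded:
        if kind == "lit":
            table.append(payload)
            result.append(payload)
        else:
            result.append(table[payload])
    return result


colors = ["red", "blue", "red", "red"]
encoded = encode_values(colors)
print(encoded)
```

The decoder rebuilds the same table as it reads, so no table has to be transmitted separately in this sketch.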

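Through these steps, EXI compresses an XML file into a smaller binary representation, which saves storage space and network bandwidth.

In addition, a binary format is typically faster for software to parse than text XML, so applications can often process EXI-encoded data more quickly.

In short, EXI uses dictionaries, binary representation, value compression, and removal of duplicate data to turn XML files into a smaller, more efficient binary format.

This kind of compression can reduce file size considerably and improve transfer efficiency. It preserves the structure and content of the XML file, although the binary form itself is not human-readable.

By compressing XML with EXI, network resources can be used more effectively and system performance and response time can improve.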

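A Queriable XML Data Compression Method Based on the XBW Transform

Hu Zhifei (胡智飞), Yang Luming (杨路明), Liu Bo (刘波), Li Jianjun (李建军)
School of Information Science and Engineering, Central South University, Changsha (410083)
E-mail: veron546@

Abstract: XML data is easy to create and parse, but it is verbose and hard to query directly. This paper introduces the XBW transform, which compresses XML data into three linear sequences so that query processing moves from the tree structure to these three sequences. Algorithms for navigation, subpath queries, and content queries based on the XBW transform are presented and implemented with the Rank&Select method.

Experimental results show that, in compression ratio and compression time, XBW ZIP performs close to or better than several queriable XML compression methods and several conventional general-purpose compression methods.

Keywords: XBW transform, Rank&Select method, XBW ZIP
CLC number: TP311.13. Document code: A

0. Introduction

The most attractive feature of XML is its self-describing structure. It uses tags to represent data items and embeds a large amount of metadata in a plain-text format, describing both the characteristics of the information and its content. This feature means that XML, while serving as a standard for data representation and exchange, also carries a great deal of redundancy. For XML files, and especially large ones, a compression method is therefore needed to save storage space and network bandwidth [1].

Researchers have done much useful work on XML data compression. In the area of queriable XML compression, Tolani and Haritsa proposed XGrind in [2]. XGrind keeps the structure of the XML file in the compressed form; while preserving structural integrity it also has the separation of data and structure found in XMILL [3]. Its drawback is relatively low compression efficiency.

Min J K, Park M J, and Chung C W proposed XPRESS in [4]. It uses a method called reverse arithmetic encoding to encode every path of the data, and supports queries on the compressed data while achieving an average compression ratio of 73%.

Buneman et al. proposed in [5] a framework that can run path queries directly on compressed XML files (Skeleton Compression Framework, SCF); Andrei Arion et al. proposed an adaptive compressor in the XQueC project [6].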
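The Rank&Select method named in the abstract rests on two primitive operations over a bit sequence. The sketch below shows them in a deliberately simple form, with a prefix-sum array instead of the succinct structures a real implementation would use; the class name `BitVector` and its method names are assumptions for this illustration and do not come from the paper.

```python
class BitVector:
    """Naive rank/select over a list of 0/1 values."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1 bits in bits[0:i]
        self.prefix = [0]
        for bit in self.bits:
            self.prefix.append(self.prefix[-1] + bit)

    def rank1(self, i):
        """Number of 1 bits in positions [0, i)."""
        return self.prefix[i]

    def select1(self, k):
        """Position of the k-th 1 bit (k counts from 1), by binary search."""
        if k < 1 or k > self.prefix[-1]:
            raise ValueError("k out of range")
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix[mid + 1] < k:
                lo = mid + 1
            else:
                hi = mid
        return lo


bv = BitVector([1, 0, 1, 1, 0, 1])
print(bv.rank1(4), bv.select1(3))
```

Navigation over linearized tree sequences is typically expressed as combinations of these two queries, which is why their cost matters for query speed.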

Effective XML Compressor: XMill with LZMA Data Compression (IJEME-V9-N4-1)

I.J. Education and Management Engineering, 2019, 4, 1-10
Published Online July 2019 in MECS
DOI: 10.5815/ijeme.2019.04.01
Available online at /ijem

Effective XML Compressor: XMill with LZMA Data Compression

Suchit A. Sapate (corresponding author)
Persistent Systems Ltd., Nagpur, 440022, India
E-mail address: suchit.sapate2005@

Received: 06 February 2019; Accepted: 25 April 2019; Published: 08 July 2019

Abstract

XMill is an efficient XML compression tool that takes advantage of XML awareness. XMill compresses data on the basis of three principles: separate the XML structure from the data, group related data, and apply semantic compressors. XMill uses the gZip library to compress XML string data in order to increase the compression ratio. Here we propose a new method to increase the compression ratio of the XMill tool. In this method we add the 7Zip library to XMill; the 7Zip library uses the LZMA algorithm to compress the data. LZMA is an enhanced and improved version of the LZ77 algorithm used in the gZip library. LZMA has the following features over LZ77:

• Uses a dictionary of up to 4 GB instead of 32 KB for removing duplicate data.
• Uses a look-ahead approach instead of a greedy approach.
• Uses optimal parsing, with shorter codes for recently repeated matches.
• Uses context handling.

Because of these features, the proposed approach achieves the best compression ratio with a comparable compression speed.

Index Terms: XML, XMill, LZ77, LZMA, 7Zip, gZip.

© 2019 Published by MECS Publisher. Selection and/or peer review under responsibility of the Research Association of Modern Education and Computer Science

1. Introduction

Nowadays XML has become an important standard for representing and exchanging data on the World Wide Web [1]. Its self-describing nature, flexibility, simplicity, and portability make XML popular in data communication, but it suffers from a verbosity problem [2].

Many effective XML compressors have been developed because of this verbosity problem. On the basis of XML awareness there are two types of compressors: general text-based compressors and XML-conscious compressors [3].

General text-based compressors are mostly used for compressing general text files and include data compressors such as gZip, WinRAR, and 7Zip [3]. The 7Zip compressor uses the 7Z format by default, and that format uses the LZMA method by default to compress the data [6]. LZMA is an improved version of the LZ77 algorithm that improves the data compression ratio [13,16].

XML-conscious compressors are aware of the XML structure, so they can take advantage of it to compress the data, which increases the compression ratio [3]. XMill is one such XML compressor; it eliminates redundant data by identifying similarities between semantically related data [12]. XMill also uses the gZip library to compress XML string data [7]. To improve the compression ratio of the XMill compressor we have added the 7Zip library, in addition to gZip, to compress the XML string data.

Section II gives a basic introduction to XMill and Section III introduces 7Zip. The new approach for increasing the compression ratio is described in Section IV and Section V. Section VI contains the experimental results, which show how our approach improves on the traditional XMill compressor.

2. Related Work

XMill was developed in summer 1999 at AT&T Labs by Hartmut Liefke and Dan Suciu [5]. It is an XML-aware compressor that uses the structure of XML to compress XML [12]. The source code of the tool has since moved to a SourceForge project, where new versions of XMill are provided [15]. XMill compresses data on the basis of the following three principles [7]:

a. Separate the XML structure from the data: XML tags and attributes form the XML structure, so under this principle the tags and attributes are separated from the data [12]. The data consists of data items, sequences of characters that represent element contents and attribute values [7,15].
b. Group related data items: related data items are grouped into containers [15]. For example, all <title> data items form one container and all <id> data items form another container [8]. These containers are then compressed separately, much like column-wise compression in a relational database.
c. Apply semantic compressors: XMill applies semantic, or specialized, compressors to the different containers to compress the XML efficiently [7,15]. For textual data it uses the GZip library.

3. 7Zip Introduction

7Zip is a general text compressor used for archiving. It is an open-source compressor, and most of its code is under the GNU LGPL license [4,9]. The 7Zip compressor uses the 7Z format by default [6]. The 7Z format gives 7Zip the following main features:

• Achieves high compression ratios [6,10].
• Supports strong AES256 encryption [10].
• Compresses large files of up to approximately 16 exbibytes, i.e. 16000000000 GB [10].
• Supports solid compression, in which multiple files are treated as a single solid block during compression, which increases the compression ratio [10].

7Z is an open architecture that allows new compression methods to be added [6]. At the time of writing, the following compression methods were defined in it.

LZMA
An enhanced and improved version of the LZ77 algorithm that uses a sliding dictionary of up to 4 GB instead of 32 KB for eliminating redundant data [6,10].

LZMA2
An enhanced and improved version of the LZMA algorithm that supports better multithreading than LZMA [10].

PPMd
Contains the PPMdH (PPMII/cPPMII) code with small changes [6].
PPMII is an improved version of PPM [10].

BZIP2
Uses the BWT algorithm. The BZIP algorithm internally uses arithmetic coding to compress data, whereas BZIP2 uses Huffman coding instead [10].

Deflate
This method combines Huffman coding with the LZ77 algorithm. It compresses files at high speed, but the compression ratio is not especially high. It uses a dictionary-based compression approach with a dictionary of up to 32 KB for eliminating redundant data [10].

The 7Z format uses the LZMA method by default to compress the data [6]. 7Zip also supports numerous other compression formats such as Zip, gZip, bZip, XZ, tar, and WIM [4].

4. Proposed Methodology for XML Compression

The XMill compressor uses the gZip library to compress XML string data. Our analysis showed that the compression ratio of XMill can be improved by adding the 7Zip library to it. We have therefore added one more option to the XMill compressor: compressing the XML string data with the 7Zip library. By default this library uses the LZMA algorithm. LZMA uses a dictionary of up to 4 GB for eliminating duplicate strings; because of this the compression ratio improves compared with the LZ77 algorithm, but the large dictionary makes the algorithm slower. LZ77 relies on a greedy approach to parsing, whereas LZMA relies on a look-ahead approach, which makes compression more effective than with gZip.

4.1 Proposed System Architecture

The XML file is parsed by a SAX parser that sends tokens to the path processor. Every XML token (tag, attribute, or data value) is assigned to a container. Tags and attributes, which form the XML structure, are sent to the structure container.
Data values are sent to various data containers according to the container expressions, and the containers are compressed independently.

The core of XMill is the path processor, which determines how data values are mapped to containers. The user can control this mapping by providing a series of container expressions on the command line. For each XML data value, the path processor checks its path against each container expression and either stores the value in an existing container or creates a new container for it.

Containers are kept in a main-memory window of fixed size (the default is 8 MB). When the window is full, all containers are compressed with the 7Zip LZMA algorithm and stored on disk, and compression resumes. In effect this splits the input file into independently compressed blocks.

The decompressor is simpler. Its architecture mirrors that of the compressor, but data flows from bottom to top instead of from top to bottom, reversing the compression process. After loading and decompressing the containers, the decompressor parses the structure container, invokes the corresponding semantic decompressor for the data items, and generates the output.

[Fig. 1. Proposed System Architecture for Compression. Components: Input XML File; SAX-Parser; Path Processor; Sem. Compressor 1, 2, ..., K; Structure Container; Data Container 1, 2, ..., K; 7Zip applied to each container; Output File: Compressed XML]

5. Experimental Results and Discussion

In this section we report a wide range of experiments that compare the compression performance of the existing compressors XMill, gZip, and LZMA with our approach, XMill-LZMA. The compressors are compared on the following four parameters.

1. Compression Ratio: the ratio between the compressed size and the original size of the XML file.
2. Compression Time: the time required to compress the XML file.
3. Decompression Time: the time required to decompress the compressed XML file.
4. Compressed File Size: the size of the file after compression.

Experimental Environment

We used the following environment to run the performance analysis of the GZip, LZMA, XMill, and XMill-LZMA compressors.

Table 1. Experimental Environment Specification

Fig. 2. Compression Ratio Graph

Experimental Datasets

We used the following XML files for the performance analysis of the GZip, LZMA, XMill, and XMill-LZMA compressors [18].

Table 2. Experimental Database

Experimental Comparison and Discussion

The compression ratio graph (Figure 2) shows that the XPPM compressor is better than all other compressors, but its compression process is too slow. The proposed compressor, XMill-LZMA, achieves a better compression ratio than the XMill compressor with comparable compression time. The compression ratio improvements of all algorithms are calculated with respect to the GZip algorithm. Table 3 shows that the compression ratio of the proposed approach is 4.09% better than that of the GZip algorithm.

Table 3. Compression Ratio Evaluation

Fig. 3. Compression time Graph

GZip-based compressors have faster compression times but the worst compression ratios, so the XMill compressor has the worst compression ratio. The compression time graph (Figure 3) shows that the XMill compressor is better than all other compressors. Compared with XPPM, the proposed compressor achieves the best overall average compression ratio at a lower cost in compression time. Table 4 shows the compression time evaluation with respect to the XPPM algorithm. The XMill algorithm needs 85.3% less time than XPPM, which means the XMill compression process is faster than XPPM. The proposed approach needs 34.57% less time than the XPPM algorithm.

Fig. 4. Decompression time

Table 4. Compression Time Evaluation

The decompression time graph (Figure 4) shows that the XMill-LZMA compressor takes much less time than the other XML-aware compressors. Table 5 shows the decompression time evaluation with respect to the XPPM algorithm. The XMill-LZMA algorithm needs 91.62% less time than XPPM, so this approach is well optimized in terms of decompression time.

Table 5. Decompression Time Evaluation

Table 6 shows the space required to store the compressed file with respect to the GZip algorithm. XMill-LZMA needs 27.38% less space to store the compressed file. The compressed file size graph (Figure 5) shows that XMill-LZMA needs much less space to store the compressed XML file than the XMill tool.

Table 5:- Compressed File Size Evaluation

Fig. 5. Compressed file size

The following summary report evaluates the performance of each algorithm on each parameter. It provides guidelines for selecting the most appropriate XML compression tool according to need and applicability. For example, when a user needs a very small compressed file and can compromise on compression speed, XPPM is the better choice; when a user wants a small compressed file at medium speed, XMill-LZMA is a good choice. The rank specifies the performance of each algorithm on each evaluation parameter: Rank 1 is the best performance and Rank 6 the worst in that category.

Table 6. - Summary Report

6. Conclusions

In this paper we have presented our approach to enhancing the XMill tool. The experimental results show how this approach is more effective than the traditional XMill approach. It increases the compression ratio and reduces the decompression time of XMill.
This approach is suitable when a user wants a good compression ratio and can compromise on compression speed, for example when sending an XML file by mail, because it saves network bandwidth. For archiving files this approach is beneficial compared with the traditional XMill approach.

References

[1] Vojtech Toman, "Compression of XML Data," Charles University, Master's Thesis at Department of Software Engineering, March 2003.
[2] Mark Nottingham and David Orchard, "On XML Optimization," BEA Systems Position Paper, Binary Interchange of XML Workshop, 2003.
[3] Sherif Sakr, "XML compression techniques: A survey and comparison," Elsevier, Information and Software Technology, 75 (2009) 303-322, 2009.
[4] Wikimedia Foundation, Inc, "7-Zip," 29 May 2014, /wiki/7-Zip.
[5] Smitha S. Nair, "XML Compression Techniques: A Survey," Department of Computer Science, University of Iowa, USA, https://people.ok.ubc.ca/rlawrenc/research/Students/SN_04_XMLCompress.pdf.
[6] Igor Pavlov, "7-Zip," 2013, /.
[7] H. Liefke and D. Suciu, "XMill: An Efficient Compressor for XML Data," Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000.
[8] Pankaj M. Tolani, Jayant R. Haritsa, "XGRIND: A query-friendly XML compressor," in: ICDE'02: Proceedings of the 18th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, 2002, p. 225.
[9] Markhor, "CodePlex Project Hosting for Open Source Software," Feb 6, 2012, version 70. https:///
[10] Wikimedia Foundation, Inc, "7Z," 19 May 2014, /wiki/7z.
[11] Wikimedia Foundation, Inc, "LZ77 and LZ78," 21 April 2014, /wiki/LZ77_and_LZ78.
[12] Wilfred Ng, Lam Wai Yeung and James Cheng, "Comparative Analysis of XML Compression Technologies," World Wide Web: Internet and Web Information Systems, 9, 5-33, Springer Science + Business Media, Inc. Manufactured in The Netherlands, DOI: 10.1007/s11280-005-1435-2, 2005.
[13] Wikimedia Foundation, Inc, "Lempel-Ziv-Markov chain algorithm," 5 June 2014, /wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm.
[14] E. Jebamalar Leavline and D. Asir Antony Gnana Singh, "Hardware Implementation of LZMA Data Compression Algorithm," International Journal of Applied Information Systems, Foundation of Computer Science FCS, Volume 5, Issue 4, March 2013.
[15] H. Liefke and D. Suciu, "An Extensible Compressor for XML Data," Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2000.
[16] Nandan Phadke, Omkar Bahirat, Tejaswi Konduri, and Chandrama Thorat, "Parallel Data Compression Using LZMA," International Journal of Advanced Computational Engineering and Networking, Volume 1, Issue 2, April 2013.
[17] XMill Compressor, /projects/xmill.
[18] XMLDataset. /research/xmldatasets/www/repository.html

Author's Profile

Suchit A. Sapate received a Bachelor of Engineering in Computer Engineering from the University of Nagpur, India in 2008 and a Master of Technology in Computer Science & Engineering from the University of Nagpur, India in 2015. He has been working as Project Lead at Persistent Systems Ltd. since 2008. His main research work focuses on Data Analytics and Data Mining. He has published 3 papers in reputed international journals.

How to cite this paper: Suchit A. Sapate, "Effective XML Compressor: XMill with LZMA Data Compression", International Journal of Education and Management Engineering (IJEME), Vol.9, No.4, pp.1-10, 2019. DOI: 10.5815/ijeme.2019.04.01
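The container idea behind the paper (separate structure from data, group values by element, compress each container independently, swap the back end) can be sketched with Python's standard library. This is a rough illustration under explicit assumptions, not the authors' implementation: `split_containers` and `pack` are hypothetical helpers, containers are keyed by element name rather than by XMill's path expressions, and Python's `lzma` module (XZ container with an LZMA2 filter by default) merely stands in for the 7Zip library.

```python
import io
import lzma
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_containers(xml_bytes):
    """Separate tag structure from text values, grouping values by element name."""
    structure = []                   # simplified structure stream
    containers = defaultdict(list)   # element name -> list of text values
    stream = io.BytesIO(xml_bytes)
    for kind, elem in ET.iterparse(stream, events=("start", "end")):
        if kind == "start":
            structure.append("<" + elem.tag)
        else:
            text = (elem.text or "").strip()
            if text:
                containers[elem.tag].append(text)
            structure.append(">")
    return structure, dict(containers)


def pack(containers, compress):
    """Compress every container independently with the given back end."""
    return {name: compress("\n".join(values).encode("utf-8"))
            for name, values in containers.items()}


doc = b"<books>" + b"".join(
    b"<book><id>%d</id><title>Title %d</title></book>" % (i, i)
    for i in range(200)
) + b"</books>"

structure, containers = split_containers(doc)
gz = pack(containers, zlib.compress)    # Deflate back end, as in gZip
xz = pack(containers, lzma.compress)    # LZMA-family back end
for name in containers:
    print(name, len(gz[name]), len(xz[name]))
```

Which back end wins depends on the data and the container sizes, so the sketch prints the sizes rather than asserting a ranking.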

2006: Interval++, a Compressed XML Index Structure Based on Interval Trees

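Journal of Computer Research and Development (计算机研究与发展)

[…] the compressed value interval, which is compared with the elements in the compressed file (whose values have already been converted to compressed interval values). If the compressed interval value of an element falls within that interval, the element satisfies the query condition and is added to the result set. Clearly, obtaining all query results requires scanning the whole file. This is very costly and is not acceptable in some application environments (for example, search engines). For this reason, building an index over the compressed XML is an effective way to reduce the cost. Because every element in the compressed XML file has been replaced by its compressed value (elements with the same path have the same compressed value), we build over this set of compressed values an extended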

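An Example of Persisting a Tree Structure (a Persistence Layer Implemented with XML)

Original work by sitinspring; please credit the author and source when reposting.
Tree structures are common in everyday life: company hierarchies, military ranks, category membership, and tag structures are all concrete examples. Persisting a tree structure and reading it back has always been awkward for applications that use a relational database, and is not as simple as storing and retrieving it directly with a database such as DB4O. I used an XML file to simulate a relational database and implemented the complete round trip of writing a tree structure to a file and reading it back. This may be a useful reference for programmers who struggle with storing tree structures.
The data structure used in the example is a tag structure: Java includes J2EE and J2SE, J2EE includes JSP, EJB, and so on, and J2SE includes Swing, AWT, Applet, and so on.
The code is as follows:
The tag class, which can form a tree structure:
The utility class that generates IDs:
The persistence class that reads Tags from XML and writes them to it:
Test code:
Test results: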
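The listings themselves are not reproduced here. As a stand-in, the following minimal Python sketch shows one way to write a tag tree to XML and read it back. It is not the author's code: it stores the hierarchy by nesting elements rather than by simulating relational rows with generated IDs, and the class `Tag` and the functions `to_element`, `from_element`, `dumps`, and `loads` are names assumed for this illustration.

```python
import xml.etree.ElementTree as ET


class Tag:
    """A node in the tag tree; children are kept in insertion order."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def to_element(tag):
    """Convert a Tag and its descendants into nested <tag name="..."> elements."""
    elem = ET.Element("tag", {"name": tag.name})
    for child in tag.children:
        elem.append(to_element(child))
    return elem


def from_element(elem):
    """Rebuild a Tag tree from nested <tag> elements."""
    tag = Tag(elem.get("name"))
    for child in elem.findall("tag"):
        tag.add(from_element(child))
    return tag


def dumps(tag):
    return ET.tostring(to_element(tag), encoding="unicode")


def loads(text):
    return from_element(ET.fromstring(text))


java = Tag("Java")
j2ee = java.add(Tag("J2EE"))
j2se = java.add(Tag("J2SE"))
for name in ("JSP", "EJB"):
    j2ee.add(Tag(name))
for name in ("Swing", "AWT", "Applet"):
    j2se.add(Tag(name))

restored = loads(dumps(java))
print(dumps(restored))
```

Nesting lets the XML document itself carry the parent-child relationship; an ID-based layout like the one the post describes would instead store each tag as a flat record with a reference to its parent.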

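Thesis Proposal: Multi-Query Processing on Compressed XML Data

I. Research background

As Internet applications and data storage technology develop, the volume of stored data keeps growing, and storing and querying large amounts of data efficiently has become a pressing problem.

XML is a widely used form of semi-structured data, with broad application in Web services, data exchange, and related areas.

Multi-query processing is one effective way to address the storage and querying of large data volumes: it merges several query requests into a single pass, which improves query efficiency and performance.

In multi-query processing over large-scale XML data, however, transferring and parsing the data can become a bottleneck and lower query efficiency.

An effective way of compressing XML data is therefore needed to make data transfer and parsing more efficient and so optimize multi-query processing.

II. Research objective

This work investigates a multi-query processing technique for compressed XML data. Compressing the XML data improves the efficiency of data transfer and parsing, which speeds up multi-query processing and optimizes query performance.

III. Research method

First, study and analyze current techniques for storing and querying XML data and the problems they have.

Second, read the relevant literature and research results to understand existing multi-query processing techniques and XML data compression algorithms.

Then, design and implement a multi-query processing technique for compressed XML data, run experiments, and compare the results to demonstrate that the technique is effective and feasible.

Finally, write the thesis and defend it.

IV. Research content

1. XML data storage techniques and their problems
2. The state of research on multi-query processing and its optimization strategies
3. The state of research on XML data compression algorithms
4. Design and implementation of multi-query processing based on XML data compression
5. Analysis and comparison of experimental results

VI. Research value

1. Improve the efficiency and performance of storing and querying XML data
2. Speed up multi-query processing and improve query performance
3. A multi-query processing technique that incorporates XML data compression has practical value and potential for wider adoption

VII. Expected results

1. Design and implement a multi-query processing technique based on XML data compression
2. Verify that the technique is effective and feasible, and compare it with existing multi-query processing techniques
3. Provide a method for optimizing the storage and querying of XML data, and offer a reference for research in related areas

VIII. Schedule

Phase 1: March 2022 to May 2022
1. Read the relevant literature on XML data storage and query techniques and their problems
2. Study the state of research on multi-query processing and its optimization strategies

Phase 2: June 2022 to September 2022
1. Study the state of research on XML data compression algorithms
2. Design and implement multi-query processing on compressed XML data

Phase 3: October 2022 to January 2023
1. Run experiments and compare results
2. Write and complete the thesis

Phase 4: February 2023 to March 2023
1. Revise and prepare the thesis
2. Thesis defense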