The CLaRK System Tools
XML-based Corpora Development
Kiril Simov, Alexander Simov, Krasimira Ivanova, Ilko Grigorov, Hristo Ganev
BulTreeBank Project
Linguistic Modelling Laboratory - CLPPI, BAS, Sofia, Bulgaria
kivs@, alex@, krassy v@, ilko@, hristo@
Abstract
CLaRK is an XML-based software system for corpora development. It incorporates several technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents. The basic components of the system are: a tagger, a concordancer, an extractor, a grammar processor, a constraint engine.
1 Introduction
The CLaRK System is an XML-based system for the development and exploration of text corpora (see (Simov et al., 2001), (Simov et al., 2002), (Simov et al., 2003)). One of the main goals behind the design of the system is to reduce the human labor involved in creating language resources. The system offers various facilities for encoding regularities and dependencies, so that different processing procedures can be run semi-automatically or fully automatically. The system relies on the following key technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents.
The work reported here is done within the BulTreeBank project. The project is funded by the Volkswagen Stiftung, Federal Republic of Germany, under the Programme "Cooperation with Natural and Engineering Scientists in Central and Eastern Europe", contract I/76887.
The data format on which the CLaRK System works is XML. The main reasons for this choice are its flexibility in adapting to different tasks and its comparatively easy-to-understand structure. This makes XML very popular in various scientific areas, especially in Corpus Linguistics. The architecture of the CLaRK System is based on a Unicode XML editor, which is the main interface to the system. Besides the XML language itself, the system supports an XPath language processor for navigation in documents and an XSLT engine for transformation of XML documents. In addition to the standard way of applying XSL transformations, there is a mechanism for applying transformations locally to XML elements and their content and incorporating the results back into the source document.
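The idea of a local transformation can be illustrated with a minimal sketch. It is not CLaRK's own interface: it assumes Python's standard xml.etree.ElementTree module (which supports only a subset of XPath), a plain Python function stands in for an XSLT stylesheet, and the names apply_locally and mark_words as well as the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def apply_locally(root, path, transform):
    """Replace each element matched by `path` with transform(element).

    Hypothetical helper; assumes the matched elements are not nested
    inside one another.
    """
    # ElementTree elements keep no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    count = 0
    for target in root.findall(path):
        parent = parent_of[target]
        index = list(parent).index(target)
        result = transform(target)
        result.tail = target.tail      # keep the text that followed the element
        parent.remove(target)
        parent.insert(index, result)   # put the result back in the same place
        count += 1
    return count


def mark_words(sentence):
    """Stand-in for a stylesheet: wrap each whitespace-separated word in <w>."""
    new = ET.Element(sentence.tag, sentence.attrib)
    for word in "".join(sentence.itertext()).split():
        ET.SubElement(new, "w").text = word
    return new


doc = ET.fromstring(
    "<text><head>Title here</head>"
    "<p><s>CLaRK is a system</s> <s>It uses XML</s></p></text>"
)
changed = apply_locally(doc, ".//p/s", mark_words)
result = ET.tostring(doc, encoding="unicode")
```

Only the sentences selected by the path are rewritten; the head element and everything else in the document stay untouched, which is the point of applying a transformation locally rather than to the whole document.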
For multilingual processing tasks, CLaRK uses a Unicode (UTF-16) encoding of the text inside the system. To segment the text in a sensible way, there is a mechanism for creating a hierarchy of tokenizers. Tokenizers can be attached to the elements in the DTDs, so that different tokenizers can be responsible for different parts of the documents.
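One way to picture tokenizers attached to elements is the following sketch. It rests on assumptions that the text above does not confirm: a tokenizer is modelled as an ordered list of (category, regular expression) rules, the hierarchy is modelled as a child tokenizer falling back on its parent's rules, and the attachment is a plain mapping from element names to tokenizers rather than a DTD annotation. The Tokenizer class, the category names and the sample document are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET


class Tokenizer:
    """Hypothetical tokenizer: ordered (category, regex) rules plus an optional parent."""

    def __init__(self, rules, parent=None):
        self.rules = rules
        self.parent = parent

    def all_rules(self):
        # Own rules are tried first, then the rules inherited from the parent.
        inherited = self.parent.all_rules() if self.parent else []
        return self.rules + inherited

    def tokenize(self, text):
        rules = [(cat, re.compile(rx)) for cat, rx in self.all_rules()]
        tokens, pos = [], 0
        while pos < len(text):
            for cat, rx in rules:
                m = rx.match(text, pos)
                if m and m.end() > pos:
                    tokens.append((cat, m.group()))
                    pos = m.end()
                    break
            else:
                pos += 1  # no rule covers this character, so skip it
        return tokens


# A general tokenizer; Python 3 str patterns are Unicode-aware, so Cyrillic works.
base = Tokenizer([
    ("WORD", r"[^\W\d_]+"),
    ("NUMBER", r"\d+"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"[^\w\s]"),
])
# A specialised tokenizer that adds one rule and inherits the rest.
date_tok = Tokenizer([("DATE", r"\d{1,2}\.\d{1,2}\.\d{4}")], parent=base)

# Assumed attachment of tokenizers to element names.
attached = {"date": date_tok}


def tokenize_document(root, attached, default):
    """Tokenize the leading text of each element with the tokenizer attached to its tag."""
    out = []
    for elem in root.iter():
        if elem.text and elem.text.strip():
            out.append((elem.tag, attached.get(elem.tag, default).tokenize(elem.text)))
    return out


doc = ET.fromstring("<text><p>София 2003</p><date>12.05.2003</date></text>")
out = tokenize_document(doc, attached, base)
```

In this sketch the content of the date element is recognised as a single DATE token, whereas the general tokenizer would split the same string into numbers and punctuation; this is the sense in which different tokenizers can be responsible for different parts of a document.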
The basic mechanism of CLaRK for linguistic processing of text corpora is the cascaded regular grammar processor. The main