and Processing of Natural Language


Contents
1 Introduction
1.1 Coverage and Costs
2 Background
2.1 Methodology
3 Exponential Ambiguity
4 Handling Underspecified Structures
5 Discussion
5.1 Theory
5.2 Formalization: Using AGFL for the structural module of ELSA
5.2.1 Minimal Attachment of PPs in transparent configurations
5.2.2 Late Closure in NP-sequences
5.2.3 Robustness
5.2.4 Computational advantages of AGFL
6 Conclusion
Abstract
This paper presents a strategy for handling syntactic ambiguity in a theoretically motivated fashion, following general linguistic principles. This strategy, which is called underspecification, was implemented in a Natural Language Engine (NLE) for automatic information extraction, called ELSA (an acronym for English Language Semantic Analyser), which was developed at the Department of Language & Speech of the University of Nijmegen. The crucial idea of the strategy is that, in case of ambiguity, the NLE should know which option to choose and when to choose it. Until that moment the analysis remains underspecified, i.e. only one derivation is produced.
At present, the NLE in question is being adapted to serve as the linguistic module of a knowledge-based Information Retrieval system, called Condorcet, being developed at the University of Twente, for documents in the fields of mechanical properties of engineering ceramics as a subfield of engineering, and epilepsy as a subfield of medicine. In this paper we will show how a theory-driven NLE can make a substantial contribution to (semi-)automatic information retrieval, making use of the AGFL system.
The authors are greatly indebted to Nicolaas J. I. Mars and Paul E. van der Vet, the initiators of the Condorcet project, for their substantial contribution to this article.
Chapter 1
Introduction
The Condorcet project deals with the development of a domain-specific Information Retrieval system for texts. This system will be concerned with semi-automatic indexing of descriptions (title + abstract) of documents within a specific domain, thus producing document representations, and with matching user requests to these representations. The domains that are to be covered are the domain of mechanical properties of engineering ceramics, as a subfield of engineering, and epilepsy, as a subfield of medicine.
Information Retrieval (IR) aims at locating and making available relevant documents out of a large machine-readable collection in response to a user query. The problem is twofold. First, the user has to determine which documents are relevant to a particular query; an IR system has to locate documents or document descriptions relevant to the query. Secondly, the documents judged relevant have to be made available in readable form. The Condorcet project will concentrate on the first problem (cf. [Van der Vet 1995]). The problem of finding (signposts to) relevant documents is commonly defined as that of finding all and only relevant items. The associated measures are recall (defined to be 1 when all relevant items are found and 0 when none are found) and precision (defined to be 1 when only relevant items are found, and 0 when no item is relevant); a set-theoretic rendering is given below. Current IR methods employ techniques that are able to locate items on the basis of words occurring in them. These words either occur in the running text of the item, its abstract, and its title, or in a special field where index terms taken from a thesaurus are stored. Experimental research has shown that synonymy, homonymy, and the use of metaphors in natural language prevent recall and precision from attaining high values if no index terms are added manually. The construction of a thesaurus and the manual addition of index terms are very costly processes. IR research now proceeds on two different tracks. One track emphasises the use of added index terms, but tries to lower the costs of adding them by partially automating the process. Condorcet explores this route. It is particularly
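For reference, the recall and precision measures defined above can be written in the standard set-based form, where Rel is the set of relevant items and Ret the set of retrieved items (a conventional rendering, consistent with the boundary cases described in the text):

\[
\mathrm{recall} = \frac{|\mathit{Rel} \cap \mathit{Ret}|}{|\mathit{Rel}|}, \qquad
\mathrm{precision} = \frac{|\mathit{Rel} \cap \mathit{Ret}|}{|\mathit{Ret}|}
\]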
[Footnote 2] This term is used in a broad sense, i.e. covering Language Technology, Linguistic Engineering and Technolinguistics.
[Footnote 3] Relevant here means relevant with respect to the objective of information retrieval.
[Footnote 4] This section is mainly adopted from [Van Bakel, forthcoming].
Figure 1: Linguistic Analysis by ELSA
Chapter 2
Background
The major pre-theoretical principle that underlies ELSA – like all computational linguistic research carried out at the Department of Language & Speech at the University of Nijmegen (cf. [Van Bakel 1984], [Coppen 1991], [Jagtman 1994]) – is introduced in [Van Bakel, forthcoming] as the Principle of Linguistic Motivation, or LM Principle. This principle states that "In Natural Language Processing, solutions to computational problems must be motivated by general linguistic considerations". The LM Principle holds at all levels; it is based on the idea that only linguistics can provide a sound and substantial basis for NLP, and will lead to well-performing, maintainable and expandable systems. Obviously, the LM Principle implies that NLP models have to be based on linguistic theories; this should, for example, be reflected in the internal structure of systems, which is indeed the case, as we will see further on. But it applies to other, less straightforward matters as well, one of which is the parsing strategy of syntactic underspecification.
In [Coppen 1991], a term is introduced to denote natural language processing that is in accordance with the LM Principle, i.e. Technolinguistics. By using this term, Coppen creates a dichotomy in terminology – Technolinguistics vs. Language Technology – in order to replace the somewhat confusing use of terms like computational linguistics, linguistic engineering, etc. Both terms are used to cover research involving both natural language and computer science. Using the terms correctly, research with emphasis on technology will be called Language Technology, and research with emphasis on linguistics Technolinguistics. Apart from developing NLP systems, technolinguistic research aims to develop and further improve theories on methods of implementation of linguistic theories within formal models. The strategy of underspecification must be viewed in this context.
Technolinguistics is application-oriented. Its main objective is to build systems that conduct linguistic analysis of some sort, and to have these systems operate within a practical situation. The LM Principle ensures that the systems are based on general linguistic principles, and this is reflected in the methodology.
2.1 Methodology
Chapter 3
Exponential Ambiguity
In automatic structural analysis, the major problem that has to be dealt with is structural ambiguity, and with it the exponential increase of the number of analyses. In general, whether a system successfully conducts automatic parsing depends largely on how it tackles this problem. The phenomenon of exponential ambiguity typically occurs in phrases with transparent constituent boundaries. In these phrases it is unclear whether a particular constituent belongs to the left of the boundary or to the right. For instance, take sentences with NP-PP sequences, in which a prepositional phrase (PP) can be a post-modifying phrase to a noun phrase (NP), or a constituent of its own. The following is a notorious example:
(1) She saw the girl with the binoculars.
This sentence is both structurally and semantically ambiguous, which becomes clear when it is passivized:
(2) The girl with the binoculars was seen (by her).
(3) The girl was seen with the binoculars (by her).
An unrestricted parser will produce two structural analyses for (1). In this particular case this is not harmful, as both structural possibilities correspond to two different meanings. However, this will not always be the case. In transparency situations, a number of structural analyses will usually be produced that do not lead to semantically valid analyses. What is more, if the number of PPs grows, the number of analyses increases exponentially. Many sentences with NP-PP sequences can be found in the chemical abstracts. Intuitively these are hardly recognized as ambiguous, but in NLP practice they prove to be highly 'ambiguity-prone', to the extent that the number of analyses increases exponentially:
[Parse tree omitted: an NP-PP attachment structure over a sequence ending in '... with ... ODS column'.]

The relation between the number of PPs in such a sequence and the number of structural analyses:

#PPs   #structures
 1          1
 3          5
 5         42
 7        429
 9       4862
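The counts in this table coincide with the Catalan numbers C(n) for n = 1, 3, 5, 7, 9 PPs. Purely as an illustration (this snippet is not part of ELSA), a few lines of Python reproduce the column:

    from math import comb

    def catalan(n: int) -> int:
        # C(n) = binom(2n, n) / (n + 1)
        return comb(2 * n, n) // (n + 1)

    for n_pps in (1, 3, 5, 7, 9):
        # number of structural analyses for a transparent sequence of n PPs
        print(n_pps, catalan(n_pps))  # 1, 5, 42, 429, 4862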
[Footnote 2] How information on subcategorization is applied in semantic analysis will be discussed in the next section.
[Parse trees omitted: analyses of the sequence 'by HPLC on an ODS column', shown first as bare constituent structure and then with AGFL feature bundles, e.g. PP[+bare,sg,+hd,noppnest], NP[+bare,sg,+hd,-pp], N2[+bare,sg,+hd,-pp], N1[sg,nom], NK[sg,nom], V[nom,sg], Det[-def,sg,+ob].]
[Footnote 3] The only exception is made in the case of PPs with the preposition of. These PPs are analysed as postmodifying phrases to NP.
[Footnote 4] By 'lower' is meant lower in the derivation tree of the NP specifier. In this respect QP1 is the highest possible node, and AP the lowest. Cf. [Van Bakel, forthcoming].
[Parse tree omitted: the NP 'those' with -hd features percolating through NP[det,det,-hd], N2[-hd], N1[-hd] and NK[-hd].]
Chapter 4
Handling Underspecified Structures
In the third module of ELSA, surface structures are converted into thematic structures that serve as input for the domain-specific module of the Analytical Chemical information extraction system (cf. [Postma 1995]). Basically, this conversion consists of the following three stages (sketched schematically after the list):

1. The surface structure is changed into a CP-IP-VP structure, known from GB theory.
2. All major constituents are linked to their original deep structure positions, without changing the word order of the sentence, however. In other words, argument chains are created.
3. The thematic grids of the main verbs are incorporated in the sentence, and the thematic roles are assigned to the various constituents. In the process, ungrammatical analyses are discarded by GB-based filters.
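Purely as an orientation, the three stages can be pictured as a pipeline (illustrative Python stubs; all names are invented here, and the actual module is realised in AGFL and its drivers):

    def build_cp_ip_vp(surface_tree):
        # Stage 1 (stub): recast the surface structure as a CP-IP-VP
        # structure, known from GB theory.
        raise NotImplementedError

    def create_argument_chains(tree):
        # Stage 2 (stub): link major constituents to their deep structure
        # positions without changing surface word order (argument chains).
        raise NotImplementedError

    def assign_theta_roles_and_filter(tree):
        # Stage 3 (stub): incorporate the main verb's thematic grid, assign
        # theta roles, and discard ungrammatical analyses via GB-based filters.
        raise NotImplementedError

    def to_thematic_structure(surface_tree):
        # the three stages applied in order
        return assign_theta_roles_and_filter(
            create_argument_chains(build_cp_ip_vp(surface_tree)))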
The third stage is important with respect to handling underspecified structures. We will explain here how this third stage is conducted by ELSA.
After all major constituents have been linked to their deep structure positions, and the conditions on agreement have been checked, the thematic grid of the main verb is incorporated in the analysis, and thematic roles are assigned. Basically, the process of theta role assignment is very simple. All verbs and nominalizations have a pointer (e.g. <sf<1>>) that corresponds to a thematic frame; all frames are defined in a separate lexicon (i.e. frames). The thematic grid is collected from the lexicon, and it is placed in the tree structure next to C&A, as follows:
[Tree fragment omitted: 'she analyses ...' with the SF (thematic grid) placed next to C&A; its nodes require NP[sem[+hum]] and NP[+con], the latter carrying the role OBJ.]
A thematic grid consists of an SF tree (Semantic Frame) that contains a number of nodes. The labels of these nodes contain the syntactic and semantic conditions that apply to the thematic roles, and the contents of the nodes are the thematic roles in question. For instance, in the example above the first theta role is AGE (Agens), and the condition that applies to the assignment of this role is that the constituent to which the role will be assigned has to be NP<+hum>. When a theta role candidate (a major constituent) meets the syntactic and semantic conditions that apply to a particular role, that role is assigned: the name of the role is placed within a special feature theta<> within the feature bundle of the constituent, and the theta role is removed from the thematic grid.
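As an illustration of this assignment step, here is a minimal Python sketch under invented names and a simplified feature representation (the real system encodes all of this in AGFL structures):

    def assign_roles(constituents, thematic_grid):
        # thematic_grid: ordered (role, conditions) pairs taken from the
        # frames lexicon, e.g. [("AGE", {"cat": "NP", "sem": "+hum"}),
        #                       ("OBJ", {"cat": "NP", "sem": "+con"})]
        remaining = list(thematic_grid)
        for constituent in constituents:
            for role, conditions in remaining:
                if meets(constituent, conditions):
                    # the role name is placed in the constituent's theta<> feature ...
                    constituent["features"]["theta"] = role
                    # ... and the role is removed from the thematic grid
                    remaining.remove((role, conditions))
                    break
        return remaining  # any roles left unassigned

    def meets(constituent, conditions):
        # a candidate must match the syntactic category and carry the
        # required semantic feature (e.g. sem[+hum])
        return (constituent["cat"] == conditions["cat"]
                and conditions["sem"] in constituent["features"].get("sem", []))

    # e.g. assigning AGE to 'she':
    she = {"cat": "NP", "features": {"sem": ["+hum"]}}
    assign_roles([she], [("AGE", {"cat": "NP", "sem": "+hum"})])
    # afterwards: she["features"]["theta"] == "AGE"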
[Tree fragments omitted: a constituent NP[sem[+con],theta[OBJ]] 'the sample' after role assignment, and the CP-IP-VP analysis of 'Lotion was analysed by HPLC on an ODS column' before theta role assignment.]
After theta role assignment the analysis is as follows:
[Tree omitted: 'Lotion was analysed by HPLC on an ODS column' after theta role assignment; the chain from NP[1,chain[+hd]] 'Lotion' to NP[1,chain[+tl],theta[OBJ]] links the surface subject to the object position, and the PP 'on an ODS column' has received no theta role.]
In this analysis we can see that the PP on an ODS column did not receive a theta role from analyse, and that it has to be moved to its preceding constituent. If in a transparency situation the preceding constituent is an NP, then the PP or CP is lowered by creating a PM position under N2 for PP, or under NP for CP. However, in the example the PP has to be moved to a preceding CP (which was created by transformation of the nominalization NP HPLC), and therefore the PP has to be lowered to the C&A of this CP:
[Tree omitted: the same analysis after PP lowering; 'on an ODS column' is now attached under the C&A of the CP created from the nominalization 'HPLC'.]
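The lowering rule just illustrated can be sketched as follows (a minimal Python sketch; the Node class and its methods are invented for illustration, and ELSA performs this on its own tree structures):

    class Node:
        def __init__(self, category, children=None):
            self.category = category
            self.children = children if children is not None else []

        def find(self, category):
            # depth-first search for the first descendant with this category
            for child in self.children:
                if child.category == category:
                    return child
                hit = child.find(category)
                if hit is not None:
                    return hit
            return None

        def attach(self, phrase):
            self.children.append(phrase)

    def lower_unassigned(phrase, preceding):
        # phrase: a PP or CP that received no theta role from the verb
        if preceding.category == "NP":
            # under an NP, a PP is given a PM position under N2;
            # a CP is attached under the NP itself
            host = preceding.find("N2") if phrase.category == "PP" else preceding
            host.attach(phrase)
        elif preceding.category == "CP":
            # a CP (e.g. one created from the nominalization HPLC)
            # receives the phrase in its C&A
            preceding.find("C&A").attach(phrase)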
[Trees omitted: the underspecified analysis of 'She saw the girl with the binoculars' with the thematic grid of see next to C&A, followed by the two thematic analyses ELSA produces: one in which 'with the binoculars' remains a postmodifier inside the object NP 'the girl' (theta[OBJ]), and one in which the PP is placed outside that NP as a constituent of its own.]
In this example we see that two thematic analyses are produced by ELSA, even though only one structural analysis was produced earlier on. Note furthermore that this second thematic analysis is only produced in case of thematic ambiguity. In other words, the strategy of syntactic underspecification proves to be a successful way of tackling exponential ambiguity, without losing relevant information regarding possible thematic ambiguity. What is more, this strategy is a fully linguistically motivated solution to a computational problem. However, it must be noted that this strategy is not the only strategy that is used within NLP models. In the next section we will discuss another strategy in this respect, i.e. the strategy of Packing, which is employed in the Core Language Engine (cf. [Alshawi 1992]), and we will compare it to the one presented here.
Chapter 5
Discussion
5.1 Theory
ELSA is one of many NLP systems for automatic analysis of natural language sentences.
A similar system in objective and methodology is the Core Language Engine (CLE, cf. [Alshawi 1992]), which is being developed at the SRI Cambridge Computer Science Research Centre in England. Like ELSA, CLE is a modularly structured system for automatic semantic analysis of English sentences; it is intended to be used in machine translation. CLE differs from ELSA in some respects as well. To begin with, CLE employs different formalisms and drivers than ELSA does, and the input for CLE is limited to a maximum of ten words per sentence, whereas in ELSA no such limitations are imposed (see e.g. (4)). But these are only minor differences. A significant difference is that semantic analysis by CLE is far more extensive than thematic analysis by ELSA; in fact, semantic analysis by CLE could perhaps be compared to the analysis performed by the entire Nijmegen information extraction system. A fair deal of the CLE system consists of modules that perform operations on logical forms, which can be compared to the module that was originally intended within ELSA as a fourth module, i.e. a transformational grammar based on Montague semantics.
Apart from the similarities and differences mentioned above, there are also some technological differences. For instance, CLE uses a bottom-up parsing algorithm in which some top-down constraints are employed, contrary to ELSA's parser, which is fully top-down. But the difference we would like to discuss here is the way CLE and ELSA tackle structural ambiguities. As in ELSA, a strategy is incorporated in CLE at the structural level to limit the number of analyses produced in transparency situations. The strategy used by CLE is based on the strategy that was introduced by [Tomita 1985], and it is called Packing. As in underspecification, structural ambiguities are not split out in this strategy, but a special notation is used, thus abstracting "away from alternative internal structures of constituents that have the same category" ([Moore & Alshawi 1992]; cf. [Alshawi 1992],
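The core idea of packing can be made concrete with a small sketch (illustrative Python only, not CLE's actual representation): all analyses of a constituent with the same category over the same span are stored under a single packed node, instead of being multiplied out into separate trees:

    class PackedNode:
        # one node per (category, span); the ambiguity stays local to the node
        def __init__(self, category, span):
            self.category = category      # e.g. "NP"
            self.span = span              # (start, end) word positions
            self.analyses = []            # alternative daughter sequences

        def add_analysis(self, daughters):
            # record one way of building this constituent; later processing
            # that only needs the category and span never unpacks these
            self.analyses.append(daughters)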
Chapter 6
Conclusion
The parsing strategy of syntactic underspecification that we presented in this paper is a linguistically sound way to tackle exponential ambiguity. It is elegant as well, since it enables the linguist to determine very accurately which transparency structures are to be specified, and in what way.
For natural language engines that are modularly structured, it is very important to deal with structural ambiguity in an efficient way. As we have argued, this entails a solution in which potentially problematic structures are underspecified by the structural module, in order to tackle the problem when enough information is available. By doing so, we provide a linguistically motivated solution to this problem rather than a technical workaround to camouflage it. We have shown furthermore that AGFL offers an elegant, flexible and fast formalism in which structures can be underspecified easily.
References
Analytical Abstracts
[1989] Analytical Abstracts Online (STN Host), the Royal Society of Chemistry, Letchworth, Herts, England.
Bakel, Bas van
[1995] Robustness in Condorcet's Natural Language Engine, KBS-note 95-035, Knowledge-Based Systems Group, University of Twente, 1995 (internal publication).
[forthcoming] A Linguistic Approach to Automatic Information Extraction, PhD Thesis, University of Nijmegen (forthcoming).
Bakel, Jan van
[1984] Automatic Semantic Interpretation, Foris, Dordrecht 1984.
Chomsky, Noam
[1986a] Barriers, Linguistic Inquiry Monographs 13, Cambridge, Mass. 1986.
Coppen, Peter-Arno
[1991] Specifying the Noun Phrase, PhD Thesis, University of Nijmegen, Amsterdam 1991.
[1996] Use of AGFL in Sequential Modular Analysis Systems, this volume, 1996.
Dekkers, C., C.H.A. Koster, M.-J. Nederhof & A. van Zwol
[1992] Manual for the Grammar Workbench, Version 1.5, Technical Report no. 92-14, University of Nijmegen, July 1992.
Dowty, David R., Robert E. Wall and Stanley Peters
[1981] Introduction to Montague Semantics, Dordrecht/Boston/London 1981.
Frazier, Lyn
[1987] 'Syntactic Processing: Evidence from Dutch', in: Natural Language & Linguistic Theory 5, 1987, pp. 519-559.
Jagtman, Margriet
[1994] Computer-Aided Syntactic Analysis of Interlanguage Data, PhD Thesis, University of Nijmegen, Enschede 1994.
Koster, C.H.A.
[1991a] Towards an Affix Grammar for the Hungarian Language, Technical Report no. 91-18, University of Nijmegen, Department of Informatics, 1991.
