Origins and Development of Corpus Linguistics in China and Abroad (PPT Courseware)
\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N
Limitation
• Speed
• A concordancer without indexing
• Cannot process texts larger than a few million words anyway
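The speed limitation follows from searching without an index: every query rescans the whole text, so search time grows with corpus size. A minimal sketch of such a linear-scan concordancer is given here. The function name `kwic` and the fixed character window are assumptions made for illustration; this is not PowerConc's code.

```python
import re


def kwic(text, pattern, width=20):
    """Return (left context, node, right context) for every hit.

    No index is built: each call scans the full text from start to end.
    """
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


if __name__ == "__main__":
    for left, node, right in kwic("the cat sat on the mat", r"\bthe\b", width=4):
        print(f"{left:>4} [{node}] {right}")
```

An indexed tool would instead look up token positions in a prebuilt table, trading preprocessing time and disk space for faster queries.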
PowerConc
• Size: 1.5MB, compressed package less than 1MB
• Installation: Doesn’t require any installation.
• OS: Works only on Windows now.
Design principles for PowerConc
• @be returns all inflectional forms of ‘be’
• #n returns all nouns
• * refers to any single word
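The three meta-characters can be read as shorthand that expands into a regular expression over POS-tagged text. The sketch below shows one hypothetical way such an expansion could work; it is not PowerConc's actual implementation. The names `smart_to_regex` and `find`, the `word_TAG` token format, the tag prefixes (N for nouns, J for adjectives, and so on) and the hand-written list of 'be' forms are all assumptions made for illustration.

```python
import re

# Assumption: a small hand-written lemma table; only 'be' is covered here.
LEMMA_FORMS = {
    "be": ["be", "am", "is", "are", "was", "were", "been", "being"],
}

# Assumption: POS tags of each category begin with these letters.
POS_PREFIX = {"n": "N", "adj": "J", "v": "V", "adv": "R"}


def smart_to_regex(query: str) -> str:
    """Expand a Smart Input style query into a regex over word_TAG text."""
    parts = []
    for item in query.split():
        if item == "*":                      # any single word
            parts.append(r"\S+_\S+")
        elif item.startswith("@"):           # all forms of a lemma
            forms = LEMMA_FORMS[item[1:].lower()]
            parts.append("(?:" + "|".join(forms) + r")_\S+")
        elif item.startswith("#"):           # any word of a POS category
            parts.append(r"\S+_" + POS_PREFIX[item[1:].lower()] + r"\S*")
        else:                                # a literal word form
            parts.append(re.escape(item) + r"_\S+")
    # Anchor the match on token boundaries (whitespace or text edge).
    return r"(?<!\S)" + r"\s".join(parts) + r"(?!\S)"


def find(query: str, tagged_text: str) -> list:
    """Return every stretch of tagged text matched by the query."""
    pattern = re.compile(smart_to_regex(query), re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(tagged_text)]


if __name__ == "__main__":
    tagged = ("It_PPH1 is_VBZ clear_JJ that_CST a_AT1 lot_NN1 "
              "of_IO people_NN agree_VV0")
    print(find("It @be #adj that", tagged))
    print(find("a * of", tagged))
```

The point of the design is that the user types three symbols while the tool writes the regular expression, which is where the "least effort" principle meets the expressive power of regex.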
• a * of => a * of
• It be ADJ that => It @be #adj that
• Noun noun compound => #n #n
• Bi-nominal => #n and #n
• Passive =>
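The regular expression that appears in this deck, `\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N`, reads as a passive pattern over POS-tagged text: a token whose tag begins with VB, then any number of tokens whose tags begin with R, X, P, J, D or N, then a token whose tag begins with V and ends in N. A minimal sketch of applying it in Python follows. The sample sentences and the CLAWS-style `word_TAG` tags (where VB* marks forms of 'be' and V*N marks past participles) are assumptions made for illustration.

```python
import re

# The pattern is copied from the deck; only the sample data is invented.
PASSIVE = re.compile(r"\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N")


def find_passives(tagged_text):
    """Return every stretch of tagged text matched by the passive pattern."""
    # group(0) is the whole match; the capturing group only holds the
    # last intervening token, so it is not used here.
    return [m.group(0) for m in PASSIVE.finditer(tagged_text)]


if __name__ == "__main__":
    sample = ("The_AT book_NN1 was_VBDZ quickly_RR written_VVN ._. "
              "She_PPHS1 is_VBZ writing_VVG a_AT1 letter_NN1 ._.")
    print(find_passives(sample))
```

The progressive "is writing" is skipped because its tag does not end in N, while an intervening adverb or negator is absorbed by the optional middle group.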
Ideally
• Most powerful: can do whatever a concordancer can do, and what it cannot
• Involves the least effort in learning to use it
• Doing MORE with less
• Reductionism in software design
PowerConc
• National Research Centre for Foreign Language Education, Beijing Foreign Studies University
• A general-purpose tool for corpus analysis
• Developed in Delphi
• Can deal with any ANSI-encoded texts
Fewer buttons and/or tabs
Frequency Search count List
• Frequency count
• Concordance
• Collocation & colligation
• N-gram list
• Key n-gram list
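Of the functions listed here, the n-gram list is the simplest to illustrate. The sketch below counts contiguous word sequences with a sliding window. Whitespace tokenisation, lower-casing and the function name `ngram_list` are assumptions made for illustration, and PowerConc's own procedure may differ.

```python
from collections import Counter


def ngram_list(text, n=2):
    """Return (n-gram, frequency) pairs, most frequent first."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text, one token at a time.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams).most_common()


if __name__ == "__main__":
    for gram, freq in ngram_list("a lot of people and a lot of time", n=3):
        print(freq, gram)
```

A key n-gram list would additionally compare these frequencies against a reference corpus, which is not attempted here.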
More possibilities in tool development
variation
• Tool development lags behind
From phraseology to R-gram
• Many of the ‘grammars’ can be seen as some sort of phraseology
• We coined a technical term ‘R-gram’.
– E.g. on a Simplified Chinese OS
– Works well with Simplified/Traditional Chinese texts, (un)tokenised or raw/POS-tagged, as well as raw/POS-tagged English texts
Q&A
PowerConc: An R-gram Based Corpus Analysis Tool
Jiajin Xu & Yunlong Jia
Beijing Foreign Studies University
• But regular expressions (regex) are too difficult for lay users.
Easy search with enhanced hits
• Smart Input
• Three meta-characters in Smart Input syntax, the simplest grammar ever.
• Corpus-informed/related ‘grammars’
– Pattern grammar (local grammar)
– Collostruction
– Lexical grammar (natural grammar, real grammar)
– Lexical priming (textual colligation)
– Longman grammar: Biber et al. grammar register
• a * of: collocational framework
• It be ADJ that: evaluative construction
• Noun noun compounds
• Bi-nominal constructions
• Passive constructions: be/get ADV. V-EN
• All these could be matched with Regular Expressions.
– An operational parallel to phraseology
– The unit of language can be words, lemmata, phrases, POS, POS sequences, and combinations of all these.
– Can be linguistic structures with uncertain words or categories (e.g. be passive/get passive).