An Overview of Corpus Research Methods
– Observe the meaning that is associated with the pattern;
– Observe its semantic prosody;
2012 Workshop on Corpora and Foreign Language Research
Instruments
– Antconc 3.0
• Concordance
Method
– To answer RQ 1, generate a wordlist of the given text and observe:
• The number of types
• The number of tokens
• the type/token ratio (TTR)
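The three counts above can be sketched in a few lines of Python. This is only an illustration, not a reproduction of what Antconc does: the tokenizer (lower-cased alphabetic strings, with an optional apostrophe part) and the function names are assumptions made for the example.

```python
import re
from collections import Counter

def wordlist(text):
    """Return a Counter mapping each lower-cased word form (type) to its frequency."""
    # Assumed tokenization rule: runs of letters, optionally with one apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

def type_token_stats(text):
    """Return (number of types, number of tokens, type/token ratio)."""
    freq = wordlist(text)
    n_types = len(freq)
    n_tokens = sum(freq.values())
    ttr = n_types / n_tokens if n_tokens else 0.0
    return n_types, n_tokens, ttr

sample = "The cat sat on the mat. The dog sat too."
n_types, n_tokens, ttr = type_token_stats(sample)
print(n_types, n_tokens, round(ttr, 2))
```

A different tokenization rule (for example, keeping numbers or hyphenated forms) will change all three figures, so the rule should be reported together with the results.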
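Innovation: data, method, technique, interpretation/theory/perspective
[Table: check marks (√, √√) rating what is new along each of these dimensions; cell layout not recoverable.]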
The corpus-based approach is a verification procedure; the corpus-driven approach is a discovery procedure.
Rationale: Any perception is but inferencing.
Research on relationship:
– shape
– direction
– strength
Research questions
– What are the words that are unique to the text in terms of its subject matter?
– Sort: Level 1, Level 2, Level 3
– Frequency count
• Collocates
– Sort
– Sort POS tags
Research on chunks
Objectives:
– To retrieve the multiword sequences;
– To examine the internal structure of
• Something or some phenomenon that is:
– out of expectation
– incongruent
– in need of a solution
– puzzling
Reading to be better informed
• What has been done as contribution
• What has been left undone
• What has been done wrong
occur? • Predictive: What will happen if…? • Never ask a question to which you already
know the answer; never ask a 'how to' question.
Finding a method
• Population
• Sample
• Sampling
– Use PowerGrep to retrieve the word class from the POS tagged text;
Explanatory research
– interrelationship (IR) between words
– IR between phraseologies
– IR between genres
2. To what extent can the level of difficulty of the text be computed on the basis of the graded wordlists?
3. How many different word classes are used? What is the number of each word class?
– To answer RQ 3, retrieve each word class from the POS tagged text, and sort them on frequency in decreasing order
• Retrieve all the nouns, verbs, and adjectives
• Sort the list
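The retrieval and sorting step for RQ 3 can be sketched with Python regular expressions in place of a dedicated grep tool. The sketch assumes the tagged text uses a word_TAG format with Penn Treebank-style tags (NN*, VB*, JJ*); the sample string, the tag prefixes, and the function name are illustrative assumptions, so adapt the pattern to the tagset actually used.

```python
import re
from collections import Counter

# Assumed input format: each token written as word_TAG (Penn Treebank-style tags).
TAGGED = ("The_DT corpus_NN contains_VBZ many_JJ texts_NNS and_CC "
          "large_JJ corpora_NNS ;_: the_DT corpus_NN grows_VBZ")

# Tag prefixes for the three word classes named in the method (an assumption).
WORD_CLASSES = {"noun": "NN", "verb": "VB", "adjective": "JJ"}

def retrieve_word_class(tagged_text, tag_prefix):
    """Return (word, frequency) pairs for one word class, most frequent first."""
    pattern = re.compile(r"(\S+)_(" + re.escape(tag_prefix) + r"\w*)(?=\s|$)")
    words = [m.group(1).lower() for m in pattern.finditer(tagged_text)]
    # most_common() sorts on frequency in decreasing order.
    return Counter(words).most_common()

for name, prefix in WORD_CLASSES.items():
    print(name, retrieve_word_class(TAGGED, prefix))
```

The number of distinct pairs returned for a class gives its type count, and the sum of the frequencies gives its token count.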
Instruments
– Use Antconc 3.0 to generate the wordlist;
– Use Range to compare and contrast the wordlist against a batch of graded wordlists;
Objectives:
– Observe the collocates of a word;
– Study its patterns of use;
– Study the meanings associated with its patterns of use;
– Study the semantic prosody of its
Topic Selection, Design, and Methods
Put it all together
Li Wenzhong (李文中), National Research Centre for Foreign Language Education
2012
Corpus linguistics is not something humans can learn; regular expressions are not something women can learn.
Corpus-driven is basically corpus based.
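Basic steps:
1. Choose a topic
2. Formulate the questions
3. Define the population and the sample
4. Select the tools
5. Process the data
6. Describe the results: classify and summarize features (description)
7. Explain the results: observe, describe, explain (explanation)
8. Interpret the results: meaning, value, application (interpretation)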
Identifying a problem
[Diagram: population (P), sample (S), result (R), and interpretation (I), with links labelled sampling validity, reliability, validity, and generalizability.]
• IF: P → S, S → R, R → I
• THEN: I → P
• Never count someone else’s money.
Formulating research questions
• Naming: what is… • Classificatory: How are they interrelated
(patterned)? • Explanatory: to what extent do they co-
Any corpus-based research is necessarily driven by corpus data.
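Goals: through corpus analysis and research, to
– verify hypotheses and intuitions
– make new discoveries
– form new hypotheses
– build new theories
– verify existing findings
– solve difficult problems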
Unbridgeable
world of reality
world of text
Einstein Gulf
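[Diagram: text (文本) at the centre, surrounded by the six sense objects (form, sound, smell, taste, touch, mental objects), each paired with its sense organ (eye, ear, nose, tongue, body, mind), together with the sequence "study, inquire, reflect, discern, act" (学问思辨行).]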
Method
– Compare & contrast the wordlist (of the observed text or corpus) against the wordlist of the reference text or corpus (larger);
– Observe and group the words within a classification framework;
• If the text is very large, standardize the TTR
• the types and their frequency cumulative percentage
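Standardizing the TTR and tabulating cumulative percentages can be sketched as follows. The sketch assumes the usual chunk-based definition of a standardized TTR (the mean TTR over consecutive chunks of equal size, with any incomplete final chunk dropped); the chunk size of 1000 tokens and the function names are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed tokenization rule: lower-cased runs of letters."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks; None if the text is shorter than one chunk."""
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return sum(ratios) / len(ratios) if ratios else None

def cumulative_percentages(tokens):
    """Return (type, frequency, cumulative % of tokens), most frequent type first."""
    freq = Counter(tokens)
    total = sum(freq.values())
    running = 0
    rows = []
    for word, count in freq.most_common():
        running += count
        rows.append((word, count, 100 * running / total))
    return rows

demo = ["a", "a", "b", "c"]
print(standardized_ttr(demo, chunk_size=2))
print(cumulative_percentages(demo))
```

Because every chunk has the same length, the standardized figure can be compared across texts of different sizes, which the raw TTR cannot.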
– To answer RQ 2, compute the wordlist against a batch of graded wordlists, and observe:
• How many types on Level 1, 2, and 3 lists are used in the text? And what is their percentage?
• What about their tokens?
• How many types that are not on any list are used in the text? Summarize their features.
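The comparison against graded wordlists can be sketched as below. This is not the Range program itself, only an illustration of the counts asked for above; the three tiny level lists and the function name are hypothetical, and a word is assigned to the first level whose list contains it.

```python
from collections import Counter

def coverage_by_level(tokens, graded_lists):
    """Count types and tokens per graded list; words on no list go to 'off-list'."""
    freq = Counter(tokens)
    n_types, n_tokens = len(freq), sum(freq.values())
    report = {level: {"types": 0, "tokens": 0} for level in graded_lists}
    report["off-list"] = {"types": 0, "tokens": 0}
    off_list_words = []
    for word, count in freq.items():
        # First matching level wins (levels are checked in insertion order).
        level = next((lv for lv, words in graded_lists.items() if word in words),
                     "off-list")
        report[level]["types"] += 1
        report[level]["tokens"] += count
        if level == "off-list":
            off_list_words.append(word)
    for row in report.values():
        row["type_pct"] = 100 * row["types"] / n_types if n_types else 0.0
        row["token_pct"] = 100 * row["tokens"] / n_tokens if n_tokens else 0.0
    return report, sorted(off_list_words)

# Hypothetical miniature graded lists, for illustration only.
graded = {"Level 1": {"the", "cat", "sat", "on"},
          "Level 2": {"mat"},
          "Level 3": {"feline"}}
tokens = "the cat sat on the mat with a feline".split()
report, off_list = coverage_by_level(tokens, graded)
for level, row in report.items():
    print(level, row)
print("off-list words:", off_list)
```

The off-list words returned at the end are the ones whose features the last question asks you to summarize.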
3. What is the semantic prosody of the pattern?
Method
– Search the word (KW, SW, or Node Word) as KWIC;
– Observe its collocates and their word classes;
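A KWIC search of the kind described above can be sketched in Python. The sketch is a simplification of what a concordancer does: the context width is counted in words rather than characters, and the function name, sample text, and display format are assumptions for illustration.

```python
import re

def kwic(text, node, width=4):
    """Return (left context, node, right context) for every hit of the node word."""
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "Heavy rain caused flooding. The delay was caused by heavy traffic."
hits = kwic(text, "caused", width=2)

# Sorting on the right context puts lines with the same following word together.
for left, kw, right in sorted(hits, key=lambda line: line[2].lower()):
    print(f"{left:>20} [{kw}] {right}")
```

Reading down the sorted lines is what makes recurrent collocates and patterns visible.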
Descriptive research
– single text
– text vs. text
– people vs. text
Research questions
1. How many different word forms are used in the text? How many running words are used? What is their distribution?
– How are these sequences structured in terms of lexical grammatical pattern?
such sequences;
– To obtain the sequences unique to a specific text;
Research questions
– What multiword sequences (in terms of n-gram) are found in the given text?
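Retrieving multiword sequences as n-grams can be sketched as follows. The sketch makes two simplifying assumptions: n-grams run across sentence boundaries, and a sequence counts as recurrent once it reaches a minimum frequency (min_freq, a hypothetical parameter, as are the function names).

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def frequent_ngrams(text, n=3, min_freq=2):
    """Return (n-gram, frequency) pairs that occur at least min_freq times."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(" ".join(gram) for gram in ngrams(tokens, n))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_freq]

sample = "On the other hand we see that on the other hand they do"
print(frequent_ngrams(sample, n=3))
```

Running the same function on a reference corpus and comparing the two lists is one way to approach the sequences unique to a specific text.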
Instruments
– Antconc 3.0
Other applications
– Literary analysis
– Automatic summarization
Research on word uses
meaning
Research questions
1. What words collocate with the Search Word? What is the strength of the collocability?
2. What is the pattern of the SW? And what is its semantic preference?
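One common way to put a number on the strength of collocation is a mutual information (MI) score, which compares the observed co-occurrence frequency with the frequency expected by chance. The sketch below assumes the formulation MI = log2(O / E) with E = f(node) × f(collocate) × window size / N; the slides do not say which statistic is intended, so treat the formula, the span of 4 words, and the function name as assumptions.

```python
import math
from collections import Counter

def collocates_mi(tokens, node, span=4, min_cooc=1):
    """Score words found within +/-span tokens of the node by MI = log2(O / E)."""
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            cooc.update(window)
    scores = {}
    for word, observed in cooc.items():
        if observed < min_cooc:
            continue
        # Expected co-occurrences if the two words were independent (window = 2 * span).
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = math.log2(observed / expected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("strong tea and strong coffee but powerful computers "
          "and strong tea").split()
for word, mi in collocates_mi(tokens, "strong", span=1):
    print(word, round(mi, 2))
```

MI tends to favour rare words, so a minimum co-occurrence threshold (min_cooc here) is usually set above 1 on real data.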
– To what extent are these words related to the subject/topic of the text?
– What patterns of relationships exist among the key words?
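Words that are unusually frequent in a text relative to a reference corpus are often identified with a keyness statistic. The sketch below uses the log-likelihood measure as one widely used option; the slides do not specify which statistic is intended, and the function names and toy data are assumptions for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness: a, b = word frequency in study/reference text;
    c, d = total tokens in study/reference text."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the study text
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference text
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, ref_tokens, top=10):
    """Return the words most over-represented in the study text, with their scores."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = sum(study.values()), sum(ref.values())
    rows = []
    for word, a in study.items():
        b = ref.get(word, 0)
        # Keep only words relatively more frequent in the study text.
        if a / c > b / d:
            rows.append((word, log_likelihood(a, b, c, d)))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

study = "corpus corpus the".split()
reference = "the the a of".split()
print(keywords(study, reference))
```

The resulting list is only a starting point: whether a high-scoring word is related to the subject matter still has to be judged by reading its concordance lines.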