6 Query Operations

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

t
P(ki | R): probability of observing the term k i in the set R of
relevant documents
P(ki | R ): probability of observing the term k i in the set R of
non-relevant documents
10
Relationship
D Cr Dr
Dn
11
12
Formula
• Rocchio，1971 • If Cr is known in advance, then the best query will be represented.
qopt
1 1 C d j N | C | dC d j | Cr | d j r r j r
3
Relevance Feedback Architecture
Query String Revise d Query Query Reformulation
1. Doc1 2. Doc2 3. Doc3 . .
Document corpus
Rankings
IR System
ReRanked Documents
• Bias towards rejecting just the highest ranked of the irrelevant documents:
qm q
d j Dr

d j max nonrelevant (d j )
6-3
: Tunable weight for initial query. : Tunable weight for relevant documents. : Tunable weight for irrelevant document.
• Several algorithms for query reformulation.
5
Outline
• User Relevance Feedback
– – – – 1 2 3 4 Query Reformulation for VSR Term Reweighting for the Probabilistic Model A Variant of Probabilistic Term Reweighting Evaluation of Relevance Feedback Strategies
16
17
18
Comparison of Methods
• Overall, experimental results indicate no clear preference for any one of the specific methods.
– Three techniques above yield similar results. – All methods generally improve retrieval performance (recall & precision) with feedback. – Generally just let tunable constants equal 1. – Use a positive feedback strategy only: r=0
14
Ide Regular Method
• Since more feedback should perhaps increase the degree of reformulation, do not normalize for amount of feedback:
qm q
d j Dr
19
Query Expansion and Term Reweighting for the VM
• Advantages
– Simplicity • The modified term weights are computed directly from the set of retrieved documents. – Good results • The modified query vector does reflect a portion of the intended query semantics.
1. Doc2 2. Doc4 3. Doc5 . .
Ranked Documents
1. Doc1 2. Doc2 3. Doc3 . .
Feedback
4
Query Reformulation
• Revise query to account for feedback:
– Query Expansion: Add new terms to query from relevant documents. – Term Reweighting: Reweighting the term in the expanded query • Increase weight of terms in relevant documents • and decrease weight of terms in irrelevant documents

dj
d j Dn

dj
6-2
: Tunable weight for initial query. : Tunable weight for relevant documents. : Tunable weight for irrelevant documents.
15
Ide “Dec Hi” Method
• Query Operations
– Normalization – Query Reformulation
2
Relevance Feedback
• After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents. • Use this feedback information to reformulate the query. • Produce new results based on reformulated query. • Allows more interactive, multi-pass process.
9
Query Expansion and Term Reweighting for the Vector Space Model
• Basic idea: Reformulate the query such that it gets closer to the term-weight vector space of the relevant documents. • Some symbols – Let Cr: set of relevant documents in the collection D. – for a given query, retrieved documents Dr+Dn. Their relations can be represented as follow.
• Standard_Rocchio Method:
qm q D d j N | D | dD d j | Dr | d j r r j n
6-1
•the Dr and Dn are the document sets which the user judged : Tunable weight for initial query. : Tunable weight for relevant documents. : Tunable weight for irrelevant documents.
???jd???jd????8?????nrdjdjmddq?q???????
Query Operations
Relevance Feedback & Query Expansion
1
• Query
– Formulating one‟s information need is difficult. – Good way: initial query formulation, and then query reformulation
Байду номын сангаас
7
Query Construction
8
1 Query Reformulation for VSR
• Change query vector using vector algebra. • Add the vectors for the relevant documents to the query vector. • Subtract the vectors for the irrelevant docs from the query vector. • This both adds both positive and negatively weighted terms to the query as well as reweighting the initial terms.
• Probabilistic ranking formula
P (ki | R ) 1 P ( ki | R ) sim(d j , q ) ~ wi ,q wi , j log log 1 P (ki | R ) P ( ki | R ) i 1
• Disadvantages
– No optimality criterion is adopted.
20
2 Term Reweighting for the Probabilistic Model
21
22
评价
23
2 Term Reweighting for the Probabilistic Model
– But, the probabilities P(ki | R) and
P (ki | R )
are unknown.
24
2 Term Reweighting for the Probabilistic Model
• Probability estimation
– Initial search • P(ki | R) is constant for P(ki | R) 0.5 all termsk i :
• The term probability distribution P(ki | R ) can ni P ( ki | R ) be approximated by the N distribution in the whole collection.: ni : the number of documents in the collection which contain the term k i
6-0
Where N is the total number of documents.
13
Formula
• According to the user feedback, we get the modified query q_m which can be written in three ways.
• Automatic
– Automatic Local Analysis (基于最初检出文献集合) – Automatic Global Analysis
6
User Relevance Feedback
• Query Expansion and Term Reweighting for the Vector Model • Term Reweighting for the Probabilistic Model • A Variant of Probabilistic Term Reweighting • Evaluation of Relevance Feedback Strategies