Overview of Information Retrieval Techniques


Non-Binary Independence Model
• Estimates a term's weight based on whether or not the term appears in a relevant document
• The probability that a term which appears tf times will appear in a relevant document is estimated
• Weights are normalized based on document size
• The final weight is computed as the ratio of the probability that a term occurs tf times in a relevant document to the probability that it occurs tf times in non-relevant documents
• Genetic algorithms are inspired by evolution in nature
• They search a space of hypotheses containing interacting parts, where the impact of each part on the overall fitness is hard to model
• Parallelizable, but essentially a randomized approach
• Hypotheses are described by bit strings whose interpretation depends on the application
• A fitness function is used to rank candidate hypotheses and to select them for the next-generation population
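Cross-Language Retrieval
• Currently a hot topic
• Lets a user pose a query in one language L and receive retrieved documents written in another language L'
• The key problem is that there is usually no direct correspondence between L and L'
• Languages differ in style, word usage, and other respects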
Probabilistic Retrieval
• Computes the similarity coefficient between a query and a document as the probability that the document is relevant to the query
• Probability theory can be used to compute this coefficient
• Two different approaches have been proposed
The Baldwin effect suggests that individual learning can alter the course of evolution
• Artificial neural networks (ANNs) are motivated by biological neural systems
• They learn real-valued, discrete-valued, and vector-valued functions from examples
• ANNs are built out of a densely interconnected set of simple units (such as perceptrons), each of which takes a number of real-valued inputs and produces a single real-valued output
• Robust to errors in the training data
• Highly parallel
• Require long training times
• Can evaluate the learned target function quickly
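A single unit of the kind described above can be sketched in a few lines. This is a minimal illustration, assuming a threshold perceptron with outputs of +1 or -1 and the classic perceptron training rule; the function names, the learning rate, and the AND example are illustrative choices, not taken from the slides.

```python
def perceptron_output(weights, bias, inputs):
    """Threshold unit: +1 if the weighted sum of the inputs is positive, else -1."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else -1


def train_perceptron(examples, n_inputs, rate=0.5, epochs=20):
    """Perceptron training rule: nudge each weight by rate * (target - output) * input."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - perceptron_output(weights, bias, inputs)
            bias += rate * error
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights, bias


# Hypothetical training set: logical AND, which is linearly separable.
and_examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(and_examples, n_inputs=2)
predictions = [perceptron_output(weights, bias, x) for x, _ in and_examples]
```

Training needs several passes over the data, while evaluating the learned function is a single weighted sum, which matches the last two bullets above.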
Lamarckian evolution proposes that evolution over many generations is directly influenced by the experiences of individual organisms during their lifetime; this is contradicted by current research
Genetic Programming
• The individuals in the population are computer programs rather than bit strings
• Programs are usually manipulated in the form of trees, i.e. the parse tree of the program
Precision: relevant retrieved / retrieved
Recall: relevant retrieved / relevant
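The two ratios can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below assumes documents are identified by hashable ids; the function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |relevant retrieved| / |retrieved|
    recall    = |relevant retrieved| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three documents actually relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
# p is 2/4 and r is 2/3
```

Returning 0.0 when a set is empty is a convention chosen here to avoid division by zero.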
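Retrieval Strategies
• Every strategy measures the degree of similarity between a document and a query
• All strategies share the same starting point: the more terms (words) that occur in both the query and the document, the more relevant the document is considered to be to the query
• A retrieval strategy is an algorithm that, given a query Q and a set of documents D1, D2, ..., Dn, computes the similarity coefficient SC(Q, Di) between each document Di and the query Q
• The most commonly used retrieval strategy is the vector space model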
Estimates a term's weight based on how often the term appears, or does not appear, in relevant and non-relevant documents, respectively
• Simple term weight model
• Non-binary independence model
• Poisson model
• Component-based model
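Overview of Information Retrieval Techniques
• Basic concepts
• Metrics for evaluating information retrieval techniques
• Retrieval strategies
• The vector space model
• Approaches to improving retrieval efficiency
• Cross-language retrieval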
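Definition / Concepts
• Before the user submits a query, an index is built over a static or nearly static collection of documents
• The user submits a query
• The documents relevant to the query are ranked by their similarity to it, and the results are presented to the user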
1) relies on usage patterns to predict relevance
2) uses each term in the query as a clue as to whether or not a document is relevant
Probabilistic Retrieval (2)
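Approaches to Improving Retrieval Efficiency
• Refine the query by adding or removing terms (words)
• Narrow the scope of retrieval by using the relevant part of a document, or even a single paragraph, instead of the full text
• Relevance feedback provided by the user
• Document classification and clustering
• Passage-based retrieval
• Use of a thesaurus
• Use of semantic networks
• Regression analysis
The methods above can be combined with any of the retrieval strategies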
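Approaches to Improving Retrieval Efficiency (2)
• Inverted index
• Query processing / feedback-based processing
• Signature files
• Duplicate document detection
• Parallel and distributed information retrieval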
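Of the techniques listed, the inverted index is the easiest to illustrate: it maps each term to the set of documents that contain it, so a query only touches the postings of its own terms instead of scanning every document. The sketch below assumes whitespace tokenization and lowercase folding; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup_all(index, query):
    """Return the ids of documents containing every query term (conjunctive lookup)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "shipment of gold damaged in a fire",
    2: "delivery of silver arrived in a silver truck",
    3: "shipment of gold arrived in a truck",
}
index = build_inverted_index(docs)
matches = lookup_all(index, "gold truck")
```

A production index would also store term frequencies and positions in each posting; this version keeps only document ids to stay short.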
Key Concerns with Probabilistic Retrieval Models
• Parameter estimation: accurate probabilistic computations depend on estimating relevance, so it is difficult to estimate the parameters accurately without a good training data set
• Independence assumption
A genetic algorithm iteratively updates a pool of hypotheses (the population). On each iteration:
• All members are evaluated according to the fitness function
• A new population is generated by probabilistically selecting the most fit individuals from the current population
• Some individuals are carried over intact
• The others are used as the basis for creating new offspring by applying genetic operators such as mutation and crossover
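The iteration above can be sketched as follows. This is a minimal illustration under several assumptions not stated in the slides: hypotheses are fixed-length bit strings, the fitness function simply counts 1-bits (a toy problem), selection is fitness-proportionate, crossover is single-point, and all parameter values and names are hypothetical.

```python
import random


def fitness(bits):
    """Toy fitness function (an assumption): the number of 1-bits in the string."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, keep=0.4,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    n_keep = int(keep * pop_size)
    for _ in range(generations):
        # Evaluate every member; the +1 keeps all selection weights positive.
        weights = [fitness(ind) + 1 for ind in population]
        # Probabilistically select individuals that survive intact.
        survivors = [list(ind) for ind in
                     rng.choices(population, weights=weights, k=n_keep)]
        # Fill the rest of the population with offspring.
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, n_bits)           # single-point crossover
            child = mom[:cut] + dad[cut:]
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]                 # bit-flip mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
```

Because both selection and mutation are random, the result is not guaranteed to be optimal, which reflects the "essentially random" character of the approach.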
Information retrieval (IR) is not simply about finding matching patterns; the goal is to find relevant documents
Evaluation Metrics
• Effectiveness: how well documents are ranked by their relevance to the user's query
• Efficiency: how quickly the documents can be ranked
Two metrics for measuring retrieval effectiveness (precision and recall):
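Vector Space Model
• Based on the idea that the content of a document is expressed by the words it uses
• The more similar the content of a document is to the content of the query, the more relevant the document is considered to be to the query
• A vector is defined for each document, and likewise for the query
• The similarity coefficient is usually computed as the inner product of the two vectors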
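Vector Space Model (2)
• The tf-idf weighting scheme is commonly used because it is simple
• $t$: the number of distinct terms (words) that occur in the document collection
• $tf_{ij}$: the number of times term $t_j$ occurs in document $D_i$
• $df_j$: the number of documents in the collection that contain term $t_j$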
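Vector Space Model (3)

$idf_j = \log(d / df_j)$, where $d$ is the number of documents in the collection
• $d_{ij} = tf_{ij} \times idf_j$
• $SC(Q, D_i) = \sum_{j=1}^{t} w_{qj} \times d_{ij}$, where $w_{qj}$ is the weight of term $t_j$ in the query $Q$
• $SC(Q, D_i)$ can also be computed in other ways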
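The tf-idf weights and the inner-product similarity coefficient can be sketched as below. Two points are assumptions rather than statements from the slides: the logarithm is taken to base 10, and the query weights are computed the same way as the document weights (query term count times idf). The function names and the three sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per document, plus the idf table."""
    tokenized = [doc.lower().split() for doc in docs]
    d = len(tokenized)
    # df_j: number of documents that contain term j
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log10(d / df_j) for term, df_j in df.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def similarity(query, doc_vector, idf):
    """SC(Q, Di): inner product of the query vector and the document vector."""
    q_tf = Counter(query.lower().split())
    return sum(tf * idf.get(term, 0.0) * doc_vector.get(term, 0.0)
               for term, tf in q_tf.items())


docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
vectors, idf = tf_idf_vectors(docs)
scores = [similarity("gold silver truck", v, idf) for v in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
# The second document ranks first: "silver" is rare in the collection and occurs twice in it.
```

Terms that occur in every document get an idf of zero and therefore contribute nothing to the score, which is the intended effect of the idf factor.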
Other Retrieval Strategies
• Probabilistic retrieval
• Language models
• Inference networks
• Boolean indexing
• Latent semantic indexing
• Neural networks
• Genetic algorithms
• Fuzzy set retrieval
Lamarckian evolution: the experience of a single organism directly affects the genetic makeup of its offspring; this is contradicted by current research
Simple Term Weight Model
• Assign probabilities to the components of the query, then use each of these as evidence in computing the final probability that a document is relevant to the query
• The weights correspond to the probability that a particular term, within a given query, will retrieve a relevant document
• The weights for each term in the query are combined to obtain a final measure of relevance
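One way to make these bullets concrete is the Robertson/Sparck Jones relevance weight, a common choice for this kind of term weighting; the slides do not name a specific formula, so treat it as an assumption. Here N is the number of documents, R the number of known relevant documents, n the number of documents containing the term, and r the number of relevant documents containing the term. The 0.5 smoothing constants, the function names, and the sample counts are illustrative.

```python
import math


def rsj_weight(N, R, n, r):
    """Log odds ratio of a term appearing in relevant vs. non-relevant documents.

    The 0.5 terms are a standard smoothing correction that avoids division by zero.
    """
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_non_relevant)


def score(doc_terms, query_terms, stats, N, R):
    """Combine the weights of the query terms that occur in the document by summing them."""
    return sum(rsj_weight(N, R, *stats[term])
               for term in query_terms
               if term in doc_terms and term in stats)


# Hypothetical training counts: term -> (n, r)
stats = {"retrieval": (10, 8), "the": (50, 1)}
good = rsj_weight(100, 10, *stats["retrieval"])  # term concentrated in relevant documents
poor = rsj_weight(100, 10, *stats["the"])        # term spread across non-relevant documents
total = score({"retrieval", "the", "index"}, ["retrieval", "index"], stats, 100, 10)
```

Summing the per-term log weights corresponds to multiplying the underlying odds, which relies on the term independence assumption of probabilistic retrieval models.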