profile)阙值(threshold)权值(weight)语词加权(term-weighting)相似度(similarity)相异度(dissimilarity)域建模(domain modeling)叙词表(thesaurus)扁平(flat)⼴义向量空间模型(generalized vector space model)神经元(neuron)潜语义标引模型(latent semantic indexing model)邻近结点(proximal node)贝叶斯信任度⽹络(Bayesian belief network)结构导向(structure guided)结构化⽂本检索(structured text retrieval, STR)推理⽹络(inference network)扩展布尔模型(extended Boolean model)⾮重叠链表(non-overlapping list)第3章检索性能评价(retrieval performance evaluation)会话(interactive session)查全率(R, Recall Ratio) 信息性(Informativeness)查准率(P, Precision Ratio) ⾯向⽤户(user-oriented)漏检率(O, Omission Ratio) 新颖率(novelty ratio)误检率(M, Miss Ratio) ⽤户负担(user effort)相对查全率(relative recall)覆盖率(coverage ratio)参考测试集(reference test collection)优劣程度(goodness)查全率负担(recall effort)主观性(subjectiveness)信息性测度(informativeness measure)第4章检索单元(retrieval unit)字母表(alphabet)分隔符(separator)复合性(compositional)模糊布尔(fuzzy Boolean)模式(pattern)SQL(Structured Query Language, 结构化查询语⾔) 布尔查询(Boolean query)参照(reference)半结合(semijoin)标签(tag)有序包含(ordered inclusion)⽆序包含(unordered inclusion)CCL(Common Command Language, 通⽤命令语⾔) 树包含(tree inclusion)布尔运算符(Boolean operator) searching allowing errors容错查询Structured Full-text relevance feedback 相关反馈Query Language (SFQL) (结构化全⽂查询语⾔) extended patterns扩展模式CD-RDx Compact Disk Read only Data exchange (CD-RDx)(只读磁盘数据交换)WAIS (⼴域信息服务系统Wide Area Information Service)visual query languages. 
查询语⾔的可视化查询语法树(query syntax tree)第5章query reformulation 查询重构 query expansion 查询扩展 term reweighting 语词重新加权相似性叙词表(similarity thesaurus)User Relevance Feedback⽤户相关反馈 the graphical interfaces 图形化界⾯簇(cluster)检索同义词(searchonym) local context analysis局部上下⽂分析第6章⽂献(document)样式(style)元数据(metadata)Descriptive Metadata 描述性元数据 Semantic Metadata 语义元数据intellectual property rights 知识产权 content rating 内容等级digital signatures数字签名 privacy levels 权限electronic commerce电⼦商务都柏林核⼼元数据集(Dublin Core Metadata Element Set)通⽤标记语⾔(SGML,standard general markup language)机读⽬录记录(Machine Readable Cataloging Record, MARC)资源描述框架(Resource Document Framework, RDF) XML(eXtensible Markup Language, 可扩展标记语⾔) HTML(HyperText Markup Language, 超⽂本标记语⾔)Tagged Image File Format (TIFF标签图像⽂件格式)Joint Photographic Experts Group (JPEG) Portable Network Graphics (PNG新型位图图像格式)第7章分隔符(separator)连字符(hyphen)排除表(list of stopwords)词⼲提取(stemming)波特(porter)词库(treasury of words)受控词汇表(controlled vocabulary)索引单元(indexing component)⽂本压缩text compression 压缩算法compression algorithm注释(explanation)统计⽅法(statistical method)赫夫曼(Huffman)压缩⽐(compression ratio)数据加密Encryption 半静态的(semi-static)词汇分析lexical analysis 排除停⽤词elimination of stopwords第8章半静态(semi-static)191 词汇表(vocabulary)192事件表(occurrence)192 inverted files倒排⽂档suffix arrays后缀数组 signature files签名档块寻址(block addressing)193 索引点(index point)199起始位置(beginning)199 Vocabulary search词汇表检索Retrieval of occurrences 事件表检索 Manipulation of occurrences事件表操作散列变换(hashing)205 误检(false drop)205查询语法树(query syntax tree)207 布鲁特-福斯算法简称BF(Brute-Force)故障(failure)210 移位-或(shift-or)位并⾏处理(bit-parallelism)212顺序检索(sequential search)220 原位(in-place)227第9章并⾏计算(parallel computing) SISD (单指令流单数据流)SIMD (单指令流多数据流) MISD (多指令流单数据流)MIMD (多指令流多数据流)分布计算(distributed computing)颗粒度(granularity)231 多任务(multitasking)I/O(input/output)233 标引器(indexer)映射(map)233 命中列表(hit-list)全局语词统计值(global term statistics)线程(thread)算术逻辑单元(arithmetic logic unit, ALU 中介器(broker)虚拟处理器(virtual processor)240分布式信息检索(distributed information 
retrieval)249⽂献收集器(gatherer)主中介器(central broker)254第10章信息可视化(information visualization)图标(icon)260颜⾊凸出显⽰(color highlighting)焦点+背景(focus-plus-context)画笔和链接(brushing and linking)魔术透镜(magic lenses)移动镜头和调焦(panning and zooming)弹性窗⼝(elastic window)概述及细节信息(overview plus details)⾼亮⾊显⽰(highlight)信息存取任务(information access tasks)⽂献替代(document surrogate)常见问题(FAQ, Frequently Asked Question) 群体性推荐(social recommendation)上下⽂关键词(keyword-in-context, KWIC)伪相关反馈(pseudo-relevance feedback)重叠式窗⼝(overlapping window)⼯作集(working set)第11/12章多媒体信息检索(Multimedia Information Retrieval, MIR)超类(superclass)半结构化数据(semi-structured data)数据⽚(data blade)可扩充型系统(extensible type system)相交(intersect)动态服务器(dynamic server)叠加(overlaps)档案库服务器(archive server)聚集(center)逻辑结构(logical structure)词包含(contain word)例⼦中的查询(query by example)路径名(path-name)通过图像内容查询(Query by Image Content, QBIC)图像标题(image header)主要成分分析(Principal Component Analysis, PCA)精确匹配(exact match)潜语义标引(Latent Semantic Indexing, LSI)基于内容(content-based)范围查寻(Range Query)第13章exponential growth指数增长 Distributed data 数据的分布性volatile data 不稳定数据 redundant data 冗余数据Heterogeneous data异构数据分界点(cut point)373Centralized Architecture集中式结构收集器-标引器(crawler-indexer)373 Wanderers 漫步者 Walkers 步⾏者 Knowbots 知识机器⼈Distributed Architecture分布式结构 gatherers 收集器brokers 中介器 the query interface 查询界⾯the answer interface响应界⾯ PageRank ⽹页级别Crawling the Web漫游Web breadth-first ⼴度优先depth-first fashion 深度优先 Indices(index pl.)索引Web Directories ⽹络⽬录 Metasearchers元搜索引擎Teaching the User⽤户培训颗粒度(granularity)384超⽂本推导主题检索(Hypertext Included Topic Search, HITS)380 Specific queries专指性查询 Broad queries 泛指性查询Vague queries模糊查询 Searching using Hyperlinks使⽤超链接搜索Web Query Languages查询语⾔ Dynamic Search 动态搜索Software Agents 软件代理鱼式搜索(fish search)鲨鱼搜索(shark search)拉出/推送(pull/push)393门户(portal)395 Duplicated data 重复数据第14章联机公共检索⽬录(online public access catalog, OPAC)397化学⽂摘(Chemical Abstract, CA)399 ⽣物学⽂摘(Biological Abstract, BA)⼯程索引(Engineering Index,EI)国会图书馆分类法(Library of Congress Classification)408杜威⼗进分类法(Dewey 
Decimal Classification); 联机计算机图书馆中心 (Online Computer Library Center, OCLC); 机读目录记录 (Machine Readable Cataloging Record, MARC)
第15章: NSF (National Science Foundation, 美国国家科学基金会); NASA (National Aeronautics and Space Administration, 美国航空航天局); 数字图书馆创新项目 (Digital Libraries Initiative, DLI); 5S (stream 信息流, structure 结构, space 空间, scenario 场景, society 社会); 数字化对象标识符 (Digital Object Identifier, DOI); 都柏林核心 (Dublin Core, DC); 数字图书馆 (Digital Library, DL); 资源描述框架 (Resource Description Framework, RDF); 文本编码创新项目 (Text Encoding Initiative, TEI)
John Wieczorek Museum of Vertebrate Zoology, UC Berkeley
Distributed Databases
Multiple sources of data …under local control, …with concepts in common …and a desire to deliver data as part of a community.
Project Goals
To define a protocol for retrieving structured data from multiple, heterogeneous databases across the Internet
To build a reference implementation of both provider and portal software using said protocol
LifeMapper Global Biodiversity Information Facility (GBIF)
Distributed vs. centralized
Multiple sources of data …under local control, …with concepts in common …and a desire to deliver data as part of a community
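As deep learning has become widespread in natural language processing, many people mistakenly assume that word2vec is a deep learning algorithm.
Another point worth emphasizing is that word2vec is an open-source tool for computing word vectors.
When we speak of the word2vec algorithm or model, we are really referring to the CBoW and Skip-gram models that it uses to compute word vectors.
Statistical Language Model
Before going into the details of the word2vec algorithm, let us first review a basic problem in natural language processing: how do we compute the probability that a text sequence occurs in a given language? It is called a basic problem because it plays an important role in many NLP tasks.
For a text sequence $S = w_1, w_2, \ldots, w_T$, its probability can be written as
$$P(S) = P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, w_2, \ldots, w_{t-1}),$$
that is, the joint probability of the sequence is converted into a product of conditional probabilities.
The problem then becomes how to predict these conditional probabilities given the previous words, $p(w_t \mid w_1, w_2, \ldots, w_{t-1})$. Because its parameter space is enormous, this raw model is of little practical use.
More often we use a simplified version, the n-gram model:
$$p(w_t \mid w_1, w_2, \ldots, w_{t-1}) \approx p(w_t \mid w_{t-n+1}, \ldots, w_{t-1}).$$
Common choices are the bigram model ($n = 2$) and the trigram model ($n = 3$).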
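The n-gram approximation can be illustrated with a minimal bigram sketch in Python. This is only an illustration under stated assumptions: the toy corpus, the `<s>` start-of-sentence marker, and the function names `train_bigram`, `bigram_prob`, and `sequence_prob` are all hypothetical and do not come from the text. It uses plain maximum-likelihood counts with no smoothing, so any unseen bigram drives the sequence probability to zero.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences.

    Each sentence is padded with a hypothetical "<s>" start marker so that
    the first word also has a context to condition on.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def bigram_prob(prev, cur, context_counts, bigram_counts):
    """Maximum-likelihood estimate of p(cur | prev); 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]


def sequence_prob(tokens, context_counts, bigram_counts):
    """P(S) as the product of p(w_t | w_{t-1}), i.e. the n = 2 approximation."""
    prob = 1.0
    prev = "<s>"
    for cur in tokens:
        prob *= bigram_prob(prev, cur, context_counts, bigram_counts)
        prev = cur
    return prob


# Toy corpus (an assumption for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
ctx, big = train_bigram(corpus)

# p(the|<s>) * p(cat|the) * p(sat|cat) = 1 * 2/3 * 1/2
print(sequence_prob(["the", "cat", "sat"], ctx, big))
```

In practice such counts are usually smoothed so that unseen word pairs do not receive zero probability; that step is left out here to keep the correspondence with the formula direct.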
Learning: Knowledge Representation, Organization, and Acquisition
Danielle S. McNamara and Tenaha O'Reilly
Old Dominion University
Knowledge acquisition is the process of absorbing and storing new information in memory, the success of which is often gauged by how well the information can later be remembered, or retrieved from memory. The process of storing and retrieving new information depends heavily on the representation and organization of this information. Moreover, the utility of knowledge can also be influenced by how the information is structured. For example, a bus schedule can be represented in the form of a map or a timetable. On the one hand, a timetable provides quick and easy access to the arrival time for each bus, but does little for finding where a particular stop is situated. On the other hand, a map provides a detailed picture of each bus stop's location, but cannot efficiently communicate bus schedules. Both forms of representation are useful, but it is important to select the representation most appropriate for the task. Similarly, knowledge acquisition can be improved by considering the purpose and function of the desired information. This article provides an overview of knowledge representation and organization, and offers five guidelines to improve knowledge acquisition and retrieval.
Knowledge Representation and Organization
There are numerous theories of how knowledge is represented and organized in the mind, including rule-based production models (Anderson & Lebière, 1998), distributed networks (Rumelhart & McClelland, 1986), and propositional models (Kintsch, 1998). However, these theories are all fundamentally based on the concept of semantic networks. A semantic network is a method of representing knowledge as a system of connections between concepts in memory.
Figure 1: Schematic representation of a semantic network
This section explains the basic assumptions of semantic networks and describes several different types of knowledge.
Semantic Networks
According to semantic network models, knowledge is organized based on meaning, such that semantically related concepts are interconnected. Knowledge networks are typically represented as diagrams of nodes (i.e., concepts) and links (i.e., relations). The nodes and links are given numerical weights to represent their strengths in memory. In Figure 1, the node representing DOCTOR is strongly related to SCALPEL, whereas NURSE is weakly related to SCALPEL. These link strengths are represented here in terms of line width. Similarly, some nodes in Figure 1 are bolded to represent their strength in memory. Concepts such as DOCTOR and BREAD are more memorable because they are more frequently encountered than concepts such as SCALPEL and CRUST.
Mental excitation, or activation, spreads automatically from one concept to another related concept. For example, thinking of BREAD spreads activation to related concepts, such as BUTTER and CRUST. These concepts are primed, and thus more easily recognized or retrieved from memory. For example, in a typical semantic priming study (Meyer & Schvaneveldt, 1976), a series of words (e.g., BUTTER) and nonwords (e.g., BOTTOR) are presented, and participants determine whether each item is a word. A word is more quickly recognized if it follows a semantically related word. For example, BUTTER is more quickly recognized as a word if BREAD precedes it rather than NURSE. This result supports the assumption that semantically related concepts are more strongly connected than unrelated concepts.
Figure 2: Schematic representation of ideas (propositions) in a semantic network.
Network models represent more than simple associations. They must represent the ideas and complex relationships that comprise knowledge and comprehension.
For example, the idea "The doctor uses a scalpel" can be represented as the proposition USE(DOCTOR, SCALPEL), consisting of the nodes DOCTOR and SCALPEL and the link USE (see Figure 2). Educators have successfully used similar diagrams, called concept maps, to communicate important relations and attributes amongst the key concepts of a lesson (Guastello, Beasley, & Sinatra, 2000).
Types of Knowledge
There are numerous types of knowledge, but the most important distinction is between declarative and procedural knowledge. Declarative knowledge refers to our memory for concepts, facts, or episodes, whereas procedural knowledge refers to the ability to perform various tasks. Knowledge of how to drive a car, solve a multiplication problem, or throw a football are all forms of procedural knowledge, called procedures or productions. Procedural knowledge may begin as declarative knowledge, but is proceduralized with practice (Anderson, 1982). For example, when first learning to drive a car, you may be told to put the key in the ignition to start the car, which is a declarative statement. However, after starting the car numerous times, this act becomes automatic and is completed with little thought. Indeed, procedural knowledge tends to be accessed automatically and require little attention. It also tends to be more durable (less susceptible to forgetting) than declarative knowledge (Jensen & Healy, 1998).
Knowledge Acquisition
This section describes five guidelines for knowledge acquisition that emerge from how knowledge is represented and organized.
Process the material semantically. Knowledge is organized semantically; therefore, knowledge acquisition is optimized when the learner focuses on the meaning of the new material. Craik and his colleagues were among the first to provide evidence for the importance of semantic processing (Craik & Tulving, 1975).
In their studies, participants answered questions concerning target words that varied according to the depth of processing involved. For example, semantic questions (e.g., Would the word fit appropriately in the sentence "He met a ____ on the street"? FRIEND vs. TREE) involve a greater depth of processing than phonemic questions (e.g., Does the word rhyme with LATE? CRATE vs. TREE), which in turn have a greater depth than questions concerning the structure of a word (e.g., Is the word in capital letters? TREE vs. tree). They found that words processed semantically were better learned than words processed phonemically or structurally. Further studies have confirmed that learning benefits from greater semantic processing of the material.
Process and retrieve information frequently. A second learning principle is to test and retrieve the information numerous times. Retrieving, or self-producing, information can be contrasted with simply reading or copying it. Decades of research on a phenomenon called the generation effect have shown that passively studying items by copying or reading them does little for memory in comparison to self-producing, or generating, an item (Slamecka & Graf, 1978). Moreover, learning improves as a function of the number of times information is retrieved. Within an academic situation, this principle points to the need for frequent practice tests, worksheets, or quizzes. In terms of studying, it is also important to break up, or distribute, retrieval attempts (Melton, 1967; Glenberg, 1979). Distributed retrieval can include studying or testing items in a random order, with breaks, or on different days. In contrast, repeating information numerous times sequentially involves only a single retrieval from long-term memory, which does little to improve memory for the information.
Learning and retrieval conditions should be similar.
How knowledge is represented is determined by the conditions and context (internal and external) in which it is learned, and this in turn determines how it is retrieved: information is best retrieved when the conditions of learning and retrieval are the same. This principle has been referred to as encoding specificity (Tulving & Thompson, 1973). For example, in one experiment, participants were shown sentences with an adjective and a noun printed in capital letters (e.g., The CHIP DIP tasted delicious.) and told that their memory for the nouns would be tested afterward. In the recognition test, participants were shown the noun either with the original adjective (CHIP DIP), a different adjective (SKINNY DIP), or without an adjective (DIP). Noun recognition was better when the original adjective (CHIP) was presented than when no adjective was presented. Moreover, presenting a different adjective (SKINNY) yielded the lowest recognition (Light & Carter-Sobell, 1970). This finding underscores the importance of matching learning and testing conditions.
Encoding specificity is also important in terms of the questions used to test memory or comprehension. Different types of questions tap into different levels of understanding. For example, recalling information involves a different level of understanding, and different mental processes, than does recognizing information. Likewise, essay and open-ended questions assess a different level of understanding than do multiple-choice questions (McNamara & Kintsch, 1996). Essay and open-ended questions generally tap into a conceptual or situational understanding of the material, which results from an integration of text-based information and the reader's prior knowledge. In contrast, multiple-choice questions involve recognition processes and typically assess a shallow or text-based understanding. A text-based representation can be impoverished and incomplete because it consists only of concepts and relations within the text.
This level of understanding, likely developed by a student preparing for a multiple-choice exam, would be inappropriate preparation for an exam with open-ended or essay questions. Thus, students should benefit by adjusting their study practices according to the expected type of questions. Alternatively, students may benefit from reviewing the material in many different ways, such as recognizing the information, recalling the information, and interpreting the information. These latter processes improve understanding and maximize the probability that the various ways the material is studied will match the way it is tested. From a teacher's point of view, including different types of questions on worksheets or exams ensures that each student will have an opportunity to convey their understanding of the material.
Connect new information to prior knowledge. Knowledge is interconnected; therefore, new material that is linked to prior knowledge will be better retained. A driving factor in text and discourse comprehension is prior knowledge (Bransford & Johnson, 1972). Skilled readers actively use their prior knowledge during comprehension. Prior knowledge helps the reader to fill in contextual gaps within the text and to develop a better global understanding or situation model of the text. Given that texts rarely (if ever) spell out everything needed for successful comprehension, using prior knowledge to understand text and discourse is critical. Moreover, thinking about what you already know about a topic provides connections in memory to the new information: the more connections that are formed, the more likely the information will be retrievable from memory.
Create cognitive procedures. Procedural knowledge is better retained and more easily accessed. Therefore, one should develop and use cognitive procedures when learning information.
Procedures can include short cuts for completing a task (e.g., using "fast 10s" to solve multiplication problems) as well as memory strategies that increase the distinctive meaning of information. Cognitive research has repeatedly demonstrated the benefits of memory strategies, or mnemonics, for enhancing the recall of information. There are numerous types of mnemonics, but one well-known mnemonic is the method of loci. This technique was invented originally for the purpose of memorizing long speeches in the times before luxuries such as paper and pencil were readily available (Yates, 1966). The first task is to imagine and memorize a series of distinct locations along a familiar route, such as a pathway from one campus building to another. Each topic of a speech (or word in a word list; Crovitz, 1971) can then be pictured in a location along the route. When it comes time to recall the speech or word list, the items are simply "found" by mentally traveling the pathway.
Mnemonics are generally effective because they increase semantic processing of the words (or phrases) and render them more meaningful by linking them to familiar concepts in memory. Mnemonics also provide "ready-made" effective cues for retrieving the information. Another important aspect of mnemonics is that mental imaging is often involved. Images not only render the information more meaningful, but they provide an additional route for "finding" information in memory (e.g., Paivio, 1990). As mentioned earlier, increasing the number of meaningful links to information in memory increases the likelihood it can be retrieved.
Strategies are also an important component of metacognition (Hacker, Dunlosky, & Graesser, 1998). Metacognition is the ability to think about, understand, and manage one's learning. First, one must develop an awareness of one's own thought processes. Simply being aware of thought processes increases the likelihood of more effective knowledge construction.
Second, the learner must be aware of whether or not comprehension has been successful. Realizing when comprehension has failed is crucial to learning. The final, and most important, stage of metacognitive processing is fixing the comprehension problem. The individual must be aware of and use strategies to remedy comprehension and learning difficulties. For successful knowledge acquisition to occur, all three of these processes must occur. Without thinking or worrying about learning, the student cannot realize whether the concepts have been successfully grasped. Without realizing that information has not been understood, the student cannot engage in strategies to remedy the situation. If nothing is done about a comprehension failure, awareness is futile.
Conclusion
Knowledge acquisition is integrally tied to how the mind organizes and represents information. Learning can be enhanced by considering the fundamental properties of human knowledge as well as the ultimate function of the desired information. The most important property is that knowledge is organized semantically; therefore, learning methods should enhance meaningful study of the new information. Learners should also create as many links to the information as possible. In addition, learning methods should be matched to the desired outcome. Just as using a bus timetable to find a bus stop location is ineffective, learning to recognize information will do little good on an essay exam.
Danielle S. McNamara
Tenaha O'Reilly
Bibliography
Anderson, J. R. 1982. Acquisition of a cognitive skill. Psychological Review 89: 369-406.
Anderson, J. R., and Lebière, C. 1998. The Atomic Components of Thought. Mahwah, NJ: Erlbaum.
Bransford, J., and Johnson, M. K. 1972. Contextual prerequisites for understanding: Some investigations of comprehension and recall. Journal of Verbal Learning and Verbal Behavior 11: 717-726.
Craik, F. I. M., and Tulving, E. 1975. Depth of processing and the retention of words in episodic memory.
Journal of Experimental Psychology: General 194: 268-294.
Crovitz, H. F. 1971. The capacity of memory loci in artificial memory. Psychonomic Science 24: 187-188.
Hacker, D. J., Dunlosky, J., and Graesser, A. C. 1998. Metacognition in Educational Theory and Practice. Mahwah, NJ: Lawrence Erlbaum.
Guastello, F., Beasley, M., and Sinatra, R. 2000. Concept mapping effects on science content comprehension of low-achieving inner-city seventh graders. Rase: Remedial & Special Education 21: 356-365.
Glenberg, A. M. 1979. Component-levels theory of the effects of spacing of repetitions on recall and recognition. Memory & Cognition 7: 95-112.
Kintsch, W. 1998. Comprehension: A Paradigm for Cognition. New York: Cambridge University Press.
Jensen, M. B., and Healy, A. F. 1998. Retention of procedural and declarative information from the Colorado Drivers' Manual. In M. J. Intons-Peterson & D. Best (Eds.), Memory Distortions and their Prevention (pp. 113-124). Mahwah, NJ: Erlbaum.
Light, L. L., and Carter-Sobell, L. 1970. Effects of changed semantic context on recognition memory. Journal of Verbal Learning and Verbal Behavior 9: 1-11.
McNamara, D. S., and Kintsch, W. 1996. Learning from text: Effects of prior knowledge and text coherence. Discourse Processes 22: 247-287.
Melton, A. W. 1967. Repetition and retrieval from memory. Science 158: 532.
Meyer, D. E., and Schvaneveldt, R. W. 1976. Meaning, memory structure, and mental processes. Science 192: 27-33.
Paivio, A. 1990. Mental Representations: A Dual Coding Approach. NY: Oxford University Press.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1: Foundations). Cambridge, MA: MIT Press.
Slamecka, N. J., and Graf, P. 1978. The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory 4: 592-604.
Tulving, E., and Thompson, D. M. 1973. Encoding specificity and retrieval processes in episodic memory. Psychological Review 80: 352-373.
Yates, F. A. 1966. The Art of Memory. Chicago, IL: University of Chicago Press.
Using query logs to establish vocabularies in distributed information retrieval
Milad Shokouhi *, Justin Zobel, Saied Tahaghoghi, Falk Scholer
School of Computer Science and Information Technology, RMIT University, Melbourne 3001, Australia
Received 11 December 2005; accepted 3 April 2006. Available online 7 July 2006.
Abstract
Users of search engines express their needs as queries, typically consisting of a small number of terms. The resulting search engine query logs are valuable resources that can be used to predict how people interact with the search system. In this paper, we introduce two novel applications of query logs, in the context of distributed information retrieval. First, we use query log terms to guide sampling from uncooperative distributed collections. We show that while our sampling strategy is at least as efficient as current methods, it consistently performs better. Second, we propose and evaluate a pruning strategy that uses query log information to eliminate terms. Our experiments show that our proposed pruning method maintains the accuracy achieved by complete indexes, while decreasing the index size by up to 60%. While such pruning may not always be desirable in practice, it provides a useful benchmark against which other pruning strategies can be measured.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Distributed information retrieval; Uncooperative environments; Indexing; Query logs
1. Introduction
Traditional information retrieval systems use corpus, document, and query statistics to identify likely answers to users' queries. However, these queries can be captured in a query log, providing an additional source of evidence of relevance. In recent years, considerable attention has been devoted to the study of query logs and the way people express their information needs (de Moura et al., 2005; Fagni, Perego, Silvestri, & Orlando, 2006; Jansen & Spink, 2005). The query logs of commercial search engines such as Excite (Spink, Wolfram, Jansen, & Saracevic, 2001), Altavista
(Silverstein, Marais, Henzinger, & Moricz, 1999), and AlltheWeb (Jansen & Spink, 2005) have been investigated and analysed. Query logs have been used in information retrieval research for applications such as query expansion (Billerbeck, Scholer, Williams, & Zobel, 2003; Cui, Wen, Nie, & Ma, 2002), contextual text retrieval (Wen, Lao, & Ma, 2004), and image retrieval (Hoi & Lyu, 2004). The question we explore in this paper is how query logs can be used to guide future search, in the context of distributed information retrieval.
* Corresponding author. E-mail address: milad@.au (M. Shokouhi).
0306-4573/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2006.04.003
Information Processing and Management 43 (2007) 169–180
In distributed information retrieval (DIR) systems, the task is to search a group of separate collections and identify the most likely answers from a subset of these. Brokers receive queries from the users and send them to those collections that are deemed most likely to contain relevant answers. In a cooperative environment, collections inform brokers about the information they contain by providing information such as term distribution statistics. In uncooperative environments, on the other hand, collections do not provide any information about their content to brokers. A technique that can be used to obtain information about collections in such environments is to send probe queries to each collection. Information gathered from the limited number of answer documents that a collection provides in response to such queries is used to construct a representation set; this representation set guides the evaluation of user queries.
In this paper, we introduce two novel applications of query logs: sampling for improved query probing, and pruning of index information. The first of these is sampling. Using a TREC web crawl, we show that query log terms can be used to produce effective samples from uncooperative collections. We compare the
performance of our strategy with the state-of-the-art method, and show that samples obtained using query log terms allow for more effective collection selection and retrieval performance; improvements in average precision are often over 50%. Our method is at least as efficient as current sampling methods, and can be much more efficient for some collections.
Our second new use of query logs is a pruning strategy that uses query log terms to remove less significant terms from collection representation sets. For a DIR environment with a large number of collections, the total size of collection representation sets on the broker might become impractically large. The goal of pruning methods is to eliminate unimportant terms from the index without harming retrieval performance. In previous work, such as that of Carmel et al. (2001), Craswell, Hawking, and Thistlewaite (1999), de Moura et al. (2005), and Lu and Callan (2002), pruning strategies have had an adverse effect on performance. The reason is that these approaches drop many terms that are necessary to future queries. We show that pruning based on query logs does not decrease search precision. In addition, our method can be applied during document indexing, which means that it is independent of term frequency statistics. We also test our method on central indexes and for different types of search tasks. We show that, by applying our pruning strategy, the same performance as a full index can be achieved, while substantially reducing index size. In practice, such pruning might not always be desirable; if a term is present, it should be searchable. However, our pruning does provide an interesting benchmark against which other methods can be measured, and is clearly superior to the principal alternative.
2. Distributed search
The vast volume of data on the web makes it extremely costly for a single search engine to provide comprehensive coverage. Moreover, public search engines cannot crawl and index documents to which there are no public links, or from which
crawlers are forbidden. These documents form the so-called hidden web and can generally only be viewed by using custom search interfaces supplied as part of the site. Distributed information retrieval (DIR) aims to address this issue by passing queries to multiple servers through a central broker. Each server sends its top-ranked answers back to the broker, which produces a single ranked list of answers for presentation to the user. For efficiency, the broker usually passes the query only to a subset of available servers, selecting those that are most likely to contain relevant answers. To identify the appropriate servers, the broker calculates a similarity between the query and the representation set of each server.
In cooperative environments, servers provide the broker with their representation sets (Callan, Lu, & Croft, 1995; Fuhr, 1999; Gravano, Chang, Garcia-Molina, & Paepcke, 1997, 1999; Yuwono & Lee, 1997). The broker can be aware of the distribution of terms at the servers, and is therefore able to calculate weights for each server. Queries are sent to those servers that indicate the highest weight for the query terms.
In practice, servers may be uncooperative and therefore do not publish their index information. Server representation sets can be gathered using query-based sampling (QBS) (Callan, Connell, & Du, 1999). In QBS, an initial query is created from frequently occurring terms found in a reference collection (to increase the chance of receiving an answer) and sent to the server. The query results provided by the server are downloaded, and another query is created using randomly selected terms from these results. This process continues until a sufficient number of documents have been downloaded (Callan & Connell, 2001; Callan et al., 1999; Shokouhi, Scholer, & Zobel, 2006). Many queries do not return any yet-unseen answers; Ipeirotis and Gravano (2002) claim that, on average, one new document is received per two queries. QBS also downloads
many pages that are not highly representative of the server.

2.1. Query-based sampling

Query-based sampling (QBS) was introduced by Callan et al. (1999), who suggested that even a small number of documents (such as 300) obtained by random sampling can effectively represent the collection held at a server. They tested their method on the CACM collection (Jones & Rijsbergen, 1976) and many other small servers artificially created from TREC newswire data (Voorhees & Harman, 2000). In QBS, subsequent queries after the first are selected by choosing terms from documents that have been downloaded so far (Callan et al., 1999). Various methods were explored; random selection of query terms was found to be the most effective way of choosing probe queries, and this method has since been used in other work on sampling non-cooperative servers (Craswell, Bailey, & Hawking, 2000; Si & Callan, 2003). These methods generally proceed until a fixed number of documents (usually 300) have been downloaded. However, Shokouhi et al. (2006) have shown that for more realistic, larger collections, fixed-size samples might not be suitable, as the coverage of the vocabulary of the server is poor.

An alternative technique, called Qprober (Gravano, Ipeirotis, & Sahami, 2003), has been proposed for automatic classification of servers. Here, a classification system is trained with a set of pages and judgments. The system then suggests classification rules and uses the rules as queries. For example, if the classification system suggests (Madonna → Music), it uses "Madonna" as a query and classifies the downloaded pages as music-related. Qprober differs from QBS in the way that probe queries are selected, and requires a classification system in the background.

2.2. Using query logs for sampling

Terms that appear in search engine query logs are, by definition, popular in queries, and tend to refer to topics that are well represented in the collection. We therefore hypothesise that probe queries composed of query log terms would return more answers than the random
terms, leading to higher efficiency. Since query terms are aligned with actual user interests, we also believe that sampling using query log terms would better reflect user needs than random terms from downloaded documents. Hence, instead of choosing the terms from downloaded documents for probe queries, we use terms from query logs. Analysis of our method shows that it is at least as efficient as previous methods, and generates samples that produce higher overall effectiveness.

2.3. Evaluation

To simulate a DIR environment, we extracted documents from the 100 largest servers in the TREC WT10g collection (Bailey, Craswell, & Hawking, 2003). These sets vary in size from 26505 documents (www9.yahoo.com) to 2790 documents, with an average size of 5602 documents per server. For sampling queries, we used the 1000 most frequent terms in the Excite search engine query logs collected in 1997 (Spink et al., 2001).

For each query, we download the top 10 answers; this is the number of results that most search interfaces return on the first page of results. Sampling stops after 300 unique documents have been downloaded or 1000 queries have been sent (whichever comes first). Although using fixed-size samples might not always be the optimal method (Shokouhi et al., 2006), we restrict ourselves to 300 documents to ensure that our results are comparable to the widely accepted baseline (Callan et al., 1999).

For each server we gather two samples: one by query-based sampling, and the other by our query log method. For query log (QL) experiments, each of the 1000 most frequent terms in the Excite query logs is passed as a probe query to the collection, and the top 10 returned answers are collected. For QBS, each probe query is selected from the documents downloaded so far, and the top 10 results of each query are gathered.

To evaluate the effectiveness of samples for different queries, we used topics 451–550 from the TREC-9 and TREC 2001 Web Tracks. We used only terms in the title field as queries. Since we are extracting only the largest
100 servers from WT10g, the number of available relevant documents is low, so the precision-recall metrics produce poor results. For this reason, many DIR experiments use the set of documents that are retrieved by a central server as an oracle. That is, all of the top-ranked pages returned by the central index are considered to be relevant, and the performance of DIR approaches is evaluated based on how effectively they can retrieve this set (Craswell et al., 2000; Xu & Callan, 1998). Therefore, we use a central index containing the documents of all 100 servers⁴ as a benchmark. For both the baseline and DIR experiments, we gathered the top 10 results for each query. Results for 100 and 1000 answers per query were found to be similar and are not presented here.

We tested different cutoff (CO) points in our evaluations: for a cutoff of 1, the queries were passed to the one server with the most similar corresponding representation set; for a cutoff of 50, queries were sent to the top 50 servers. Table 1 shows that the QL method consistently produces better results. Differences that are statistically significant based on the t-test at the 0.05 and 0.01 levels of significance are indicated by † and ‡, respectively. For mean average precision (MAP), which is considered to be the most reliable evaluation metric (Sanderson & Zobel, 2005), QL outperforms QBS significantly in four of five cases.

We made two key observations. First, query log (QL) terms did not retrieve the expected 300 documents for four servers after 1000 queries, while QBS failed to retrieve this number from only one server. Analysis showed that these servers contain documents unlikely to be of general interest to users. For example, one has error pages and HTML forms, while another includes many pages with non-text characters. Second, the QL method downloads an average of 2.43 unseen documents per query, while the corresponding average for QBS is 2.80.

Given the term document frequency information of a collection, it is possible to calculate the expected
number of answers from the collection for single-term queries extracted randomly from its index. Therefore, we indexed all of the servers together as a global collection. At most 10 answers are retrieved per query. The expected number of answers per query can be calculated as

\[
\frac{|\text{Number of Terms}_{df>9}|}{|\text{Total Number of Terms}|}\times 10
\;+\;
\sum_{i=1}^{9}\frac{|\text{Number of Terms}_{df=i}|}{|\text{Total Number of Terms}|}\times i
\]

which gives an expected value of 2.60, close to the numbers obtained by both the QL and QBS methods.

Table 1
Comparison of the QL and QBS methods on a subset of the WT10g data; QL consistently performs better

CO   MAP                P@5                P@10               R-precision
     QBS     QL         QBS     QL         QBS     QL         QBS     QL
1    0.0668  0.0902     0.1302  0.1721     0.0744  0.0988     0.0744  0.0988
10   0.1562  0.2515‡    0.3057  0.4322‡    0.2011  0.3023‡    0.2011  0.3023‡
20   0.1617  0.2811‡    0.3149  0.4621‡    0.2115  0.3437‡    0.2115  0.3437‡
30   0.1540  0.2655‡    0.2941  0.4471‡    0.2106  0.3259‡    0.2106  0.3259‡
40   0.1812  0.2639‡    0.3200  0.4306‡    0.2459  0.3212‡    0.2459  0.3212‡
50   0.1868  0.4188‡    0.3341  0.4188†    0.2506  0.3176‡    0.2506  0.3176‡

Differences that are statistically significant based on the t-test at the 0.05 and 0.01 levels of significance are indicated by † and ‡, respectively. "CO" is the cutoff number of servers from which answers are retrieved.

⁴ The 100 servers consist of 563656 documents in total, containing 309195668 terms, 1788711 of them unique.

However, these values contrast with those reported by Ipeirotis and Gravano (2002), who claim that QBS downloads an average of only one unseen document per two queries. On further investigation, we observed that the average varies for different collections, as shown in Table 2. The first collection is extracted from TREC AP newswire data and contains newspaper articles. Collections labelled WEB are subsets of the TREC WT10g collection (Bailey et al., 2003). Finally, GOV-1 is a subset of the TREC GOV1 collection (Craswell & Hawking, 2002). Note that the average values for QL are between 4.9 and 7.4 unseen documents per query, while for QBS these range from 1.2 to 4.5. In
general, the gap between methods is more significant for larger collections with broad topics. Each QL probe query returns about 10 answers (the maximum) on the first page, while this number is considerably lower for QBS.

3. Pruning using query logs

In uncooperative DIR systems, the broker keeps a representation sample for each collection (Callan & Connell, 2001; Craswell et al., 2000; Shokouhi et al., 2006). These samples usually contain a small number of documents downloaded by query-based sampling (Callan et al., 1999) from the corresponding collections.

Pruning is the process of excluding unimportant terms (or unimportant term occurrences) from the index to reduce storage cost and, by making better use of memory, increase querying speed. The importance of a term can be calculated according to factors such as lexicon statistics (Carmel et al., 2001) or position in the document (Craswell et al., 1999). The major drawback of current pruning strategies is that they decrease precision, because the pruned index can miss terms that occur in user queries. In addition, in lexicon-based pruning strategies, the indexing process is slowed significantly. First, documents need to be parsed, so that term distribution statistics are available. Then, unimportant terms can be identified, and excluded, based on the lexicon statistics. For example, terms that occur in a large proportion of documents might be treated as stopwords. Finally, the index needs to be updated based on the new pruned vocabulary statistics.

Lexicon-based pruning strategies face additional problems when dealing with broker indexes in DIR. Documents are gathered from different collections, with different vocabulary statistics. A term that appears unimportant in one collection based on its occurrences might in fact be critical in another collection. Therefore, pruning the broker's index based on the global lexicon statistics does not seem reasonable.

We introduce a new pruning method that addresses these problems. Our method prunes during parsing, and
is therefore faster than lexicon-based methods, as index updates are not required. Unlike other approaches, our proposed method does not harm precision, and can increase retrieval performance in some cases. Note, however, that we regard this pruning strategy as an illustration of the power of query logs rather than a method that should be deployed in practice: users who search for a term should be able to find matches if they are present in the collection. Although sampling inevitably involves some loss, that loss should be minimised. That said, as our experiments show, the new pruning method is both effective and efficient.

3.1. Related work

Pruning is widely used for efficiency, either to increase query processing speed (Persin, Zobel, & Sacks-Davis, 1996), or to save disk storage space (Carmel et al., 2001; Craswell et al., 1999; de Moura et al., 2005; Lu & Callan, 2002).

Table 2
Comparison of the QL and QBS methods, showing average number of answers returned per query

Collection  Size    Unseen (QBS)  Total (QBS)  Unseen (QL)  Total (QL)
Newswire    30507   4.6           5.8          4.9          9.1
WEB-1       304035  1.8           2.2          6.9          9.9
WEB-2       218489  2.5           2.9          6.6          9.9
WEB-3       817025  1.2

Carmel et al. (2001) proposed a pruning strategy where each indexed term is sent in turn as a query to their search system. Index information is discarded for those documents that contain the query term, but do not appear in the top-ranked results in response to the query. This strategy is computationally expensive and time consuming. The soundness of this approach is unclear; the highly ranked pages for many queries are not highly ranked for any of the individual query terms. De Moura et al. (2005) have extended Carmel's method.
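The term-as-query pruning idea attributed to Carmel et al. in the preceding paragraph can be illustrated with a small sketch. This is not their implementation: the toy corpus, the function names `build_index` and `prune_top_k`, and the use of raw term frequency as the ranking function are all assumptions made to keep the example short.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def prune_top_k(index, k):
    """For each term, issue it as a one-term query and keep postings only
    for the k top-ranked documents; all other postings are discarded.

    Assumption: ranking is by raw term frequency (ties broken by doc id),
    a stand-in for whatever ranking function the real search system uses.
    """
    pruned = {}
    for term, postings in index.items():
        ranked = sorted(postings.items(), key=lambda p: (-p[1], p[0]))
        pruned[term] = dict(ranked[:k])
    return pruned


# Hypothetical three-document collection.
docs = {
    "d1": "query logs query logs sampling",
    "d2": "query sampling",
    "d3": "logs pruning query query query",
}
index = build_index(docs)
pruned = prune_top_k(index, k=2)
print(sorted(pruned["query"]))  # ['d1', 'd3']: d2 ranks third and is dropped
```

Even in this toy form, every term in the vocabulary triggers a ranking pass over its postings, which is consistent with the cost concern noted in the text.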
De Moura et al. first apply Carmel's method to extract the most important terms of each collection. They then keep only those sentences that contain important terms, and delete the rest. For the same reason discussed previously, this approach is also not applicable in uncooperative DIR environments. Although this approach is more effective than Carmel's method in most cases, the loss in average precision compared to a full-text baseline is significant.

D'Souza, Thorn, and Zobel (2004) discuss surrogate methods for pruning, where only the most significant words are kept for each document. In this approach, the representation set is not the collection's vocabulary, but is instead a complete index for the surrogates. Such an approach requires a high level of cooperation between servers.

Craswell et al. (1999) use a pruning strategy to merge the results gathered from multiple search engines. In their work, they download the first four kilobytes of each returned document instead of extracting the whole document for result merging. They showed that in some cases, the loss in performance is negligible. They only evaluated their method for result merging.

In a comprehensive analysis of pruning in brokers, Lu and Callan (2002) divided pruning methods into various groups: frequency-based methods prune documents according to lexicon statistics; location-based methods exclude terms based on the position of their appearance in the documents; single-occurrence methods set a pruning threshold based on the number of unique terms in documents, and keep one instance of each term in the document; and multiple-occurrence methods allow for multiple occurrences of terms in pruned documents. Experiments evaluating the performance of nine methods demonstrated that four models can achieve similar optimal levels of performance, and do not have any significant advantage over each other. Of these best methods, FIRSTM is the only one which does not rely on a broker's vocabulary statistics. For each document, this approach stores information about
the first 1600 terms. The other methods measure the importance of terms based on frequency information. As discussed, these methods are unsuitable for DIR in many ways: the frequency of a term in the broker does not indicate its importance in the original collections; the cost of pruning and re-indexing might be high; and adding a new collection makes the current pruned index unusable, since after a new collection is added to the system, the previous information is no longer valid. The FIRSTM approach prunes during parsing, which makes it more comparable with our approach. Therefore, we evaluate our approach by using FIRSTM as a baseline.

Lu and Callan tested their methods on 100 collections created from TREC disks 1, 2, and 3, and showed that their models can reduce storage costs by 54%–93%, with less than a 10% loss in precision. We test our systems on WT10g and GOV1, which are larger and consist of unmanaged data crawled from the Web; these collections are described in more detail in Section 3.2.

Our proposed pruning method is applied during parsing, and is independent of index updates, as the addition of new collections to the system does not require the re-indexing of the original documents. Moreover, our pruning method does not reduce system performance and precision, while in all of the discussed previous work (Carmel et al., 2001; Craswell et al., 1999; de Moura et al., 2005; Lu & Callan, 2002), pruning results in a decrease in precision.

3.2. Using query logs for pruning

The main motivation for pruning is to omit unimportant terms from the index. That is, pruning methods are intended to exclude the terms that are less likely to appear in user queries (de Moura et al., 2005). Some methods prune terms that are rare in documents (Lu & Callan, 2002). However, the distribution of terms in user queries is not similar to that in typical web documents.

We propose using the history of previous user queries to achieve this directly. Our hypothesis is that pruning those terms that do not appear in a search engine's query logs will
be able to reduce index sizes while maintaining retrieval performance. We test our hypothesis with experiments on distributed environments and central indexes for different types of queries. In a standard search environment, where completeness may be more important than improvements in efficiency, such pruning (or any pruning) is unappealing; but in a distributed environment, where index information is incomplete and is difficult to gather, such an approach has significant promise.

For our experiments, we used a list of the 315936 unique terms in the log of about one million queries submitted to the Excite search engine on 16 September 1997 (Spink et al., 2001). Larger query logs, or a combination of query logs from different search engines, might be useful for larger collections. Also, for highly topic-specific collections, topical query logs (Beitzel, Jensen, Chowdhury, Grossman, & Frieder, 2004) and query terms that have been classified into different categories (Jansen, Spink, & Pedersen, 2005) could provide additional benefits.

For experiments on uncooperative DIR environments and brokers, we used the testbed described in Section 2.3. The 100 largest servers were extracted from the TREC WT10g collection, with each server being considered as a separate collection. Query-based sampling, as described in Section 2.1, was used to obtain representation sets for each collection by downloading 300 documents from each server in our testbed. We do not omit stopwords in any of our experiments. For each downloaded sample, we only retained information about those terms that were present in our query log, and eliminated the other terms from the broker. We used CORI (Callan et al., 1995) for collection selection and result merging; CORI has been used in many papers as a baseline (Craswell et al., 2000; Nottelmann & Fuhr, 2003; Powell & French, 2003; Si & Callan, 2003).

TREC topics 451–550 and their corresponding relevance judgements were used to evaluate the effectiveness
of our pruned representation sets. We use only the ⟨title⟩ field of TREC topics as queries for the search system.

To test our pruning method on central indexes, we used the TREC WT10g and GOV1 collections. The GOV1 collection (Craswell & Hawking, 2002) contains over a million documents crawled from the .gov domain. TREC topics 451–550 were used for our experiments on WT10g, and topics 551–600 and NP01–NP150 were used with the GOV1 collection. All experiments with central indexes use the Okapi BM25 similarity measure (Robertson, Walker, Hancock-Beaulieu, Gull, & Lau, 1992).

In addition to these topic-finding search tasks, we evaluate our pruning approach on central indexes for named-page finding and topic distillation tasks. In topic distillation, the objective is to find relevant homepages related to a general topic (Craswell, Hawking, Wilkinson, & Wu, 2003). We use the TREC topic distillation topics 551–600, and corresponding relevance judgements, with the GOV1 collection. For named-page finding (also known as homepage finding) the aim is to find particular web pages of named individuals or organisations. To evaluate this type of search task, we used the TREC named-page finding queries NP01–NP150 (Craswell & Hawking, 2002).

3.3. Distributed retrieval results

The results of our experiments using different pruning methods for DIR systems are shown in Table 3. For each scheme, up to 1000 answers were returned per query. The cutoff (CO) values show the number of collections that are selected for each query. That is, for the first row, only the best collection is selected while for the

Table 3
Effectiveness of different pruning schemes, on a subset of WT10g

CO   MAP                         P@10                        R-precision
     ORIG    FIRSTM  PR          ORIG    FIRSTM  PR          ORIG    FIRSTM  PR
1    0.0178  0.0096  0.0178      0.0367  0.0316  0.0367      0.0217  0.0071  0.0217
10   0.0415  0.0399  0.0415      0.0620  0.0468  0.0630      0.0593  0.0400  0.0594
20   0.0355  0.0468  0.0504      0.0590  0.0494  0.0646      0.0485  0.0460  0.0644
30   0.0399  0.0462  0.0546      0.0603  0.0481  0.0658†     0.0500  0.0426  0.0659*
40   0.0489  0.0509  0.0628†     0.0641  0.0561  0.0671†     0.0521  0.0437  0.0611
50   0.0506  0.0516  0.0647†     0.0654  0.0565  0.0684†     0.0562  0.0439  0.0708*

Significance at the 0.1 and 0.05 levels of confidence is indicated with * and †, respectively. "CO" is the cutoff number of servers from which answers are fetched.
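The query-log pruning step described in Section 3.2 (retain only the terms that occur in the query log, and discard the rest while parsing) can be sketched as follows. This is a minimal illustration, not the experimental code: whitespace tokenisation, the function names `load_query_vocabulary` and `index_with_query_log_pruning`, and the tiny query log and sampled documents are all assumptions.

```python
from collections import Counter


def load_query_vocabulary(queries):
    """Collect the set of unique terms that occur in a query log."""
    vocab = set()
    for query in queries:
        vocab.update(query.lower().split())
    return vocab


def index_with_query_log_pruning(docs, query_vocab):
    """Build term -> {doc_id: tf}, dropping at parse time every term that
    never occurs in the query log. No collection statistics are consulted,
    so documents from a new collection can be added without re-indexing.
    """
    index = {}
    kept = dropped = 0
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            if term in query_vocab:
                index.setdefault(term, {})[doc_id] = tf
                kept += 1
            else:
                dropped += 1
    return index, kept, dropped


# Hypothetical query log and sampled documents.
query_log = ["cheap flights", "madonna music", "flights to rome"]
docs = {
    "s1": "cheap flights and hotel deals",
    "s2": "madonna music videos zxqv",
}
vocab = load_query_vocabulary(query_log)
index, kept, dropped = index_with_query_log_pruning(docs, vocab)
print(sorted(index))   # ['cheap', 'flights', 'madonna', 'music']
print(kept, dropped)   # 4 5
```

Because the decision to keep a term depends only on the fixed query-log vocabulary, the filter runs in a single pass over each document. This mirrors the property claimed for the method in the text: pruning happens during parsing and is independent of term frequency statistics.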