Information Retrieval Based Writer Identification
Information Retrieval and Extraction
Information Retrieval
A generic information retrieval system
selects and returns to the user desired documents from a large collection, in accordance with criteria specified by the user.
Detection Need
Definition
a set of criteria specified by the user that describes the kind of information desired. » queries in the document search task » profiles in the routing task
Example: "Document must identify a company that has the capability to produce a document management system, either by obtaining a turnkey system or by obtaining and integrating the basic components."
search vs. routing
The search process matches a single Detection Need against the stored corpus to return a subset of documents. Routing matches a single document against a group of Profiles to determine which users are interested in the document. Profiles are long-term expressions of user needs; search queries are ad hoc in nature. A generic detection architecture can be used for both search and routing.
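The symmetry between the two modes can be sketched with a single match predicate used in both directions. Everything below (the corpus, the profiles, and the word-overlap `matches` rule) is invented for illustration, not taken from any real detection system:

```python
# A toy "generic detection architecture": one match function serves both
# ad hoc search (one query vs. many documents) and routing (one incoming
# document vs. many stored profiles).

def matches(need: str, document: str) -> bool:
    """A detection need matches a document if every need word occurs in it."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in need.lower().split())

corpus = {
    "d1": "turnkey document management system components",
    "d2": "weather report for the coastal region",
}
profiles = {
    "analyst": "document management",
    "meteorologist": "weather report",
}

# Search: one detection need (an ad hoc query) against the stored corpus.
hits = [doc_id for doc_id, text in corpus.items()
        if matches("document management", text)]

# Routing: one incoming document against all long-term profiles.
interested = [user for user, profile in profiles.items()
              if matches(profile, corpus["d1"])]
```

The same predicate drives both loops; only the roles of "stored set" and "single item" are swapped.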
Basic Principles of Information Retrieval
The number of significant positions is the effective length of an index term used in matching.
An item term is a specific index term.
5.2.2 Expressing Query Statements and Syntax Checking
Logical query expressions: a logical query expression is composed of logical operators and operands, i.e., the query terms. Each operand is a two-digit number between 00 and 99, with each two-digit code corresponding to one query term. The logical operators include logical OR, logical AND, logical NOT, parentheses, and an end-of-expression marker.
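A toy evaluator for such a logical query expression can map each two-digit operand to a posting set and translate the operators onto Python set operations. The concrete operator symbols (`+` for OR, `*` for AND, `-` for NOT, `#` as the end marker) and the posting sets are assumptions made for this sketch; logical NOT is handled as set difference (`A - B`, i.e., A AND NOT B):

```python
import re

# Operands are two-digit codes (00-99), each standing for one query term.
postings = {
    "01": {1, 2, 3},   # documents containing the term coded 01
    "02": {2, 4},
    "03": {3, 5},
}

def evaluate(expression):
    expr = expression.rstrip("#")                      # drop the end marker
    # replace each two-digit code with a lookup of its posting set
    expr = re.sub(r"\d\d", lambda m: f"P['{m.group()}']", expr)
    # map OR/AND onto set union/intersection; '-' is already set difference
    expr = expr.replace("+", "|").replace("*", "&")
    # eval() is acceptable for a sketch; a real system would parse properly
    return eval(expr, {"P": postings})

result = evaluate("01*(02+03)#")   # term 01 AND (term 02 OR term 03)
```

Python's operator precedence (`&` binds tighter than `|`) happens to match the usual AND-over-OR convention, so parenthesized expressions evaluate as expected.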
1.2.2 Logical Components of an Information Retrieval System — Source Selection and Acquisition Subsystem. The information source is where a retrieval system's information or data come from. Currently, the data in retrieval systems come mainly from public literature: primary literature such as journals, books, research reports, conference papers, patents, government publications, and theses; secondary literature such as abstracts, indexes, and catalogs; and tertiary literature such as encyclopedias, specialized dictionaries, directories, guides, and handbooks. Some systems also include the internal materials of various organizations, such as experiment records, test or observation results, engineering design documents, and statistics. The task of this module is to acquire various information sources rapidly, economically, broadly, and continuously, according to the system's operating policy and the needs of its users, so as to provide the system with sufficient and suitable data.
1.2.2 Logical Components of an Information Retrieval System
Indexing Subsystem. Indexing means analyzing the content of documents according to fixed rules and procedures and then assigning each document a number of content identifiers (classification
codes, subject headings, keywords, etc.) as the basis for storage and retrieval. Indexing is usually carried out together with cataloging and abstracting; the indexing results and other descriptive items are entered on worksheets and passed to data-entry staff for input into the computer.
Arranging all document records, in a standardized record structure, in linear order forms the sequential document file.
5.1.1 Offline Batch Retrieval Systems
Retrieval process of an offline batch retrieval system:
Sequential file
User queries
Retrieval processing
Output of matching documents
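The batch flow listed above (queued queries matched in a single pass over the sequential file, hits collected for later output) can be sketched as follows; all data and the single-term matching rule are invented for the example:

```python
# Offline batch retrieval sketch: many saved queries are run together in
# one linear pass over the sequential (document-ordered) file.

sequential_file = [
    (1, "inverted file organization"),
    (2, "online retrieval terminal"),
    (3, "batch processing of retrieval requests"),
]
batch_queries = {"q1": "retrieval", "q2": "online"}

hits = {qid: [] for qid in batch_queries}
for doc_id, text in sequential_file:        # one pass over the whole file
    words = set(text.split())
    for qid, term in batch_queries.items():
        if term in words:
            hits[qid].append(doc_id)        # collected for later printed output
```

The cost of the pass is shared by every queued query, which is what makes batch processing economical, at the price of turnaround time.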
5.1.2 Online Retrieval Systems
Online retrieval systems are generally built on inverted files.
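A minimal sketch of an inverted file: each term maps to the set of documents containing it, so a lookup is a dictionary access plus set intersection rather than a scan of the whole file. The toy corpus and whitespace tokenization are assumptions for illustration:

```python
from collections import defaultdict

docs = {
    1: "online information retrieval",
    2: "batch retrieval system",
    3: "online system design",
}

# Build the inverted file: term -> set of document ids containing the term.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

# An online query touches only the posting sets of its terms.
result = inverted["online"] & inverted["system"]
```

This is why interactive (online) retrieval favors the inverted organization, while a batch pass over the sequential file amortizes its cost across many queued queries.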
Master file and index files (MF, MX)
Modern Information Retrieval: A Brief Review
Modern Information Retrieval: A Brief Overview
Amit Singhal, Google, Inc., singhal@

Abstract. For thousands of years people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information; and finding useful information from such collections became a necessity. The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval, and a description of where the state-of-the-art is at in the field.

1 Brief History

The practice of archiving written information can be traced back to around 3000 BC, when the Sumerians designated special areas to store clay tablets with cuneiform inscriptions. Even then the Sumerians realized that proper organization and access to the archives was critical for efficient use of information. They developed special classifications to identify every tablet and its content. (See http://www.libraries.gr for a wonderful historical perspective on modern libraries.) The need to store and retrieve written information became increasingly important over centuries, especially with inventions like paper and the printing press.

Soon after computers were invented, people realized that they could be used for storing and mechanically retrieving large amounts of information. In 1945 Vannevar Bush published a ground-breaking article titled "As We May Think" that gave birth to the idea of automatic access to large amounts of stored knowledge. [5] In the 1950s, this idea materialized into more concrete descriptions of how archives of text could be searched automatically. Several works emerged in the mid 1950s that elaborated upon the basic idea of searching text with a computer. One of the most influential methods was described by H.P. Luhn in 1957, in which (put simply) he proposed using words as indexing units for documents and measuring word overlap as a criterion for retrieval. [17]

Several key developments in the field happened in the 1960s. Most notable were the development of the SMART system by Gerard Salton and his students, first at Harvard University and later at Cornell University; [25] and the Cranfield evaluations done by Cyril Cleverdon and his group at the College of Aeronautics in Cranfield. [6] The Cranfield tests developed an evaluation methodology for retrieval systems that is still in use by IR systems today. The SMART system, on the other hand, allowed researchers to experiment with ideas to improve search quality. A system for experimentation coupled with good evaluation methodology allowed rapid progress in the field, and paved the way for many critical developments.

The 1970s and 1980s saw many developments built on the advances of the 1960s. Various models for doing document retrieval were developed and advances were made along all dimensions of the retrieval process.
These new models/techniques were experimentally proven to be effective on the small text collections (several thousand articles) available to researchers at the time. However, due to lack of availability of large text collections, the question whether these models and techniques would scale to larger corpora remained unanswered. This changed in 1992 with the inception of the Text REtrieval Conference, or TREC. [11] TREC is a series of evaluation conferences sponsored by various US Government agencies under the auspices of NIST, which aims at encouraging research in IR from large text collections.

With large text collections available under TREC, many old techniques were modified, and many new techniques were developed (and are still being developed) to do effective retrieval over large collections. TREC has also branched IR into related but important fields like retrieval of spoken information, non-English language retrieval, information filtering, user interactions with a retrieval system, and so on. The algorithms developed in IR were the first ones to be employed for searching the World Wide Web from 1996 to 1998. Web search, however, matured into systems that take advantage of the cross linkage available on the web, and is not a focus of the present article. In this article, I will concentrate on describing the evolution of modern textual IR systems ([27, 33, 16] are some good IR resources).

2 Models and Implementation

Early IR systems were boolean systems which allowed users to specify their information need using a complex combination of boolean ANDs, ORs and NOTs. Boolean systems have several shortcomings, e.g., there is no inherent notion of document ranking, and it is very hard for a user to form a good search request. Even though boolean systems usually return matching documents in some order, e.g., ordered by date, or some other document feature, relevance ranking is often not critical in a boolean system. Even though it has been shown by the research community that boolean systems are less effective than ranked retrieval systems, many power users still use boolean systems as they feel more in control of the retrieval process. However, most everyday users of IR systems expect IR systems to do ranked retrieval. IR systems rank documents by their estimation of the usefulness of a document for a user query. Most IR systems assign a numeric score to every document and rank documents by this score. Several models have been proposed for this process. The three most used models in IR research are the vector space model, the probabilistic models, and the inference network model.

2.1 Vector Space Model

In the vector space model text is represented by a vector of terms. [28] The definition of a term is not inherent in the model, but terms are typically words and phrases. If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high dimensional vector space. Any text can then be represented by a vector in this high dimensional space. If a term belongs to a text, it gets a non-zero value in the text-vector along the dimension corresponding to the term. Since any text contains a limited set of terms (the vocabulary can be millions of terms), most text vectors are very sparse. Most vector based systems operate in the positive quadrant of the vector space, i.e., no term is assigned a negative value.

To assign a numeric score to a document for a query, the model measures the similarity between the query vector (since a query is also just text and can be converted into a vector) and the document vector. The similarity between two vectors is once again not inherent in the model. Typically, the angle between two vectors is used as a measure of divergence between the vectors, and cosine of the angle is used as the numeric similarity (since cosine has the nice property that it is 1.0 for identical vectors and 0.0 for orthogonal vectors). As an alternative, the inner-product (or dot-product) between two vectors is often used as a similarity measure. If all the vectors are forced to be unit length, then the cosine of the angle between two vectors is the same as their dot-product. If $\vec{d}$ is the document vector and $\vec{q}$ is the query vector, then the similarity of document $d$ to query $q$ (or score of $d$ for $q$) can be represented as:

$$\mathrm{sim}(\vec{q}, \vec{d}) = \sum_{t \in \vec{q} \cap \vec{d}} w_{t,q} \cdot w_{t,d}$$

where $w_{t,q}$ is the value of the $t$-th component in the query vector, and $w_{t,d}$ is the $t$-th component in the document vector. (Since any word not present in either the query or the document has a $w_{t,q}$ or $w_{t,d}$ value of 0, respectively, we can do the summation only over the terms common to the query and the document.) How we arrive at $w_{t,q}$ and $w_{t,d}$ is not defined by the model, but is quite critical to the search effectiveness of an IR system. $w_{t,d}$ is often referred to as the weight of term $t$ in document $d$, and is discussed in detail in Section 4.1.

2.2 Probabilistic Models

This family of IR models is based on the general principle that documents in a collection should be ranked by decreasing probability of their relevance to a query. This is often called the probabilistic ranking principle (PRP). [20] Since true probabilities are not available to an IR system, probabilistic IR models estimate the probability of relevance of documents for a query. This estimation is the key part of the model, and this is where most probabilistic models differ from one another. The initial idea of probabilistic retrieval was proposed by Maron and Kuhns in a paper published in 1960. [18] Since then, many probabilistic models have been proposed, each based on a different probability estimation technique.

Due to space limitations, it is not possible to discuss the details of these models here. However, the following description abstracts out the common basis for these models. We denote the probability of relevance for document $d$ by $P(R|d)$. Since this ranking criterion is monotonic under log-odds transformation, we can rank documents by $\log \frac{P(R|d)}{P(\bar{R}|d)}$, which by Bayes' rule equals $\log \frac{P(d|R)\,P(R)}{P(d|\bar{R})\,P(\bar{R})}$. Assuming that the prior probability of relevance, i.e., $P(R)$, is independent of the document under consideration and thus is constant across documents, $P(R)$ and $P(\bar{R})$ are just scaling factors for the final document scores and can be removed from the above formulation (for ranking purposes). This further simplifies the above formulation to $\log \frac{P(d|R)}{P(d|\bar{R})}$. Assuming that terms occur independently of one another, and writing $p_t = P(t|R)$ and $q_t = P(t|\bar{R})$, this reduces to a sum of per-term contributions, which can be rearranged (again dropping document-independent constants) to transform the ranking formula to use only the terms present in a document:

$$\mathrm{score}(d) = \sum_{t \in d} \log \frac{p_t\,(1 - q_t)}{q_t\,(1 - p_t)}$$

Different assumptions for estimation of $p_t$ and $q_t$ yield different document ranking functions. E.g., in [7] Croft and Harper assume that $p_t$ is the same for all query terms and that $q_t \approx n_t/N$, where $N$ is the collection size and $n_t$ is the number of documents that contain term $t$. This yields a scoring function of the form $\sum_{t \in q \cap d} \left( C + \log \frac{N - n_t}{n_t} \right)$. If we regard $C + \log \frac{N - n_t}{n_t}$ as the weight of term $t$ in document $d$, this formulation becomes very similar to the similarity formulation in the vector space model (Section 2.1) with query terms assigned a unit weight.

2.3 Inference Network Model

In this model, document retrieval is modeled as an inference process in an inference network. [32] Most techniques used by IR systems can be implemented under this model. In the simplest implementation of this model, a document instantiates a term with a certain strength, and the credit from multiple terms is accumulated given a query to compute the equivalent of a numeric score for the document. From an operational perspective, the strength of instantiation of a term for a document can be considered as the weight of the term in the document, and document ranking in the simplest form of this model becomes similar to ranking in the vector space model and the probabilistic models described above. The strength of instantiation of a term for a document is not defined by the model, and any formulation can be used.

2.4 Implementation

Most operational IR systems are based on the inverted list data structure. This enables fast access to a list of documents that contain a term along with other information (for example, the weight of the term in each document, the relative positions of the term in each document, etc.). A typical inverted list may be stored as

$$t_i \rightarrow d_{i_1}, d_{i_2}, \ldots, d_{i_n}$$

which depicts that term $t_i$ is contained in documents $d_{i_1}, d_{i_2}, \ldots, d_{i_n}$; each entry can also store any other information. All models described above can be implemented using inverted lists. Inverted lists exploit the fact that given a user query, most IR systems are only interested in scoring a small number of documents that contain some query term. This allows the system to only score documents that will have a non-zero numeric score. Most systems maintain the scores for documents in a heap (or another similar data structure) and at the end of processing return the top scoring documents for a query. Since all documents are indexed by the terms they contain, the process of generating, building, and storing document representations is called indexing and the resulting inverted files are called the inverted index.

Most IR systems use single words as the terms. Words that are considered non-informative, like function words (the, in, of, a, ...), also called stop-words, are often ignored. Conflating various forms of the same word to its root form, called stemming in IR jargon, is also used by many systems. The main idea behind stemming is that users searching for information on retrieval will also be interested in articles that have information about retrieve, retrieved, retrieving, retriever, and so on. This also makes the system susceptible to errors due to poor stemming. For example, a user interested in information retrieval might get an article titled Information on Golden Retrievers due to stemming. Several stemmers for various languages have been developed over the years, each with its own set of stemming rules. However, the usefulness of stemming for improved search quality has always been questioned in the research community, especially for English. The consensus is that, for English, on average stemming yields small improvements in search effectiveness; however, in cases where it causes poor retrieval, the user can be considerably annoyed. [12] Stemming is possibly more beneficial for languages with many word inflections (like German).

Some IR systems also use multi-word phrases (e.g., "information retrieval") as index terms. Since phrases are considered more meaningful than individual words, a phrase match in the document is considered more informative than single word matches. Several techniques to generate a list of phrases have been explored. These range from fully linguistic (e.g., based on parsing the sentences) to fully statistical (e.g., based on counting word co-occurrences). It is accepted in the IR research community that phrases are valuable indexing units and yield improved search effectiveness. However, the style of phrase generation used is not critical. Studies comparing linguistic phrases to statistical phrases have failed to show a difference in their retrieval performance. [8]

3 Evaluation

Objective evaluation of search effectiveness has been a cornerstone of IR. Progress in the field critically depends upon experimenting with new ideas and evaluating the effects of these ideas, especially given the experimental nature of the field. Since the early years, it was evident to researchers in the community that objective evaluation of search techniques would play a key role in the field. The Cranfield tests, conducted in the 1960s, established the desired set of characteristics for a retrieval system. Even though there has been some debate over the years, the two desired properties that have been accepted by the research community for measurement of search effectiveness are recall: the proportion of relevant documents retrieved by the system; and precision: the proportion of retrieved documents that are relevant. [6]

It is well accepted that a good IR system should retrieve as many relevant documents as possible (i.e., have a high recall), and it should retrieve very few non-relevant documents (i.e., have high precision). Unfortunately, these two goals have proven to be quite contradictory over the years. Techniques that tend to improve recall tend to hurt precision and vice-versa. Both recall and precision are set oriented measures and have no notion of ranked retrieval. Researchers have used several variants of recall and precision to evaluate ranked retrieval.
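As a concrete sketch of such ranked-retrieval measures, the following computes precision at a rank cutoff and average precision. Note that the variant below averages precision at the ranks where relevant documents appear (the common non-interpolated form), rather than at fixed recall points; the ranking and relevance judgments are invented example data:

```python
# Precision at rank k and non-interpolated average precision for one query.

def precision_at(k, ranking, relevant):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document.
    Relevant documents never retrieved contribute zero to the average."""
    total, hits = 0.0, 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

ranking = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2", "d5"}        # ground-truth judgments

p_at_2 = precision_at(2, ranking, relevant)   # one of the top two is relevant
ap = average_precision(ranking, relevant)
```

Because `d5` is never retrieved, it caps the achievable average precision, which is how the measure rewards both precision and recall in a single number.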
For example, if system designers feel that precision is more important to their users, they can use precision in the top ten or twenty documents as the evaluation metric. On the other hand, if recall is more important to users, one could measure precision at (say) 50% recall, which would indicate how many non-relevant documents a user would have to read in order to find half the relevant ones. One measure that deserves special mention is average precision, a single valued measure most commonly used by the IR research community to evaluate ranked retrieval. Average precision is computed by measuring precision at different recall points (say 10%, 20%, and so on) and averaging. [27]

4 Key Techniques

Section 2 described how different IR models can be implemented using inverted lists. The most critical piece of information needed for document ranking in all models is a term's weight in a document. A large body of work has gone into proper estimation of these weights in different models. Another technique that has been shown to be effective in improving document ranking is query modification via relevance feedback. A state-of-the-art ranking system uses an effective weighting scheme in combination with a good query expansion technique.
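The interplay of term weighting and the vector-space scoring of Section 2.1 can be made concrete with a small tf-idf cosine scorer. The corpus, the query, and the particular log-damped tf and idf formulas below are illustrative choices for the sketch, not the state-of-the-art weighting schemes of the literature:

```python
import math
from collections import Counter

docs = {
    "d1": "information retrieval system",
    "d2": "database system design",
}
query = "information system"

N = len(docs)
# document frequency: in how many documents each term appears
df = Counter(term for text in docs.values() for term in set(text.split()))

def tfidf_vector(text):
    """Sparse tf-idf vector; terms unseen in the collection are skipped."""
    tf = Counter(text.split())
    return {t: (1 + math.log(c)) * math.log(N / df[t])
            for t, c in tf.items() if df[t]}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

qvec = tfidf_vector(query)
scores = {doc_id: cosine(qvec, tfidf_vector(text))
          for doc_id, text in docs.items()}
```

Note how the idf factor already does useful work here: "system" occurs in every document, so its idf (and hence its weight) is zero, and the ranking is decided entirely by the more discriminating term "information".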
4.1 Term Weighting

Various methods for weighting terms have been developed in the field. Weighting methods developed under the probabilistic models rely heavily upon better estimation of various probabilities. [21] Two of the most effective weighting schemes are shown below (Table 1 in the original), using the following notation:

- tf: the term's frequency in the document
- qtf: the term's frequency in the query
- N: the total number of documents in the collection
- df: the number of documents that contain the term
- dl: the document length (in bytes)
- avdl: the average document length

Okapi weighting based document score: [23]

$$\sum_{t \in q} \ln\!\left(\frac{N - df + 0.5}{df + 0.5}\right) \cdot \frac{(k_1 + 1)\,tf}{k_1\!\left((1-b) + b\,\frac{dl}{avdl}\right) + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$

where $k_1$ (between 1.0 and 2.0), $b$ (usually 0.75), and $k_3$ (between 0 and 1000) are constants.

Pivoted normalization weighting based document score: [30]

$$\sum_{t \in q} \frac{1 + \ln(1 + \ln(tf))}{(1-s) + s\,\frac{dl}{avdl}} \cdot qtf \cdot \ln\!\left(\frac{N+1}{df}\right)$$

where $s$ is a constant (the pivot slope). There is often a large effectiveness gap between a naive tf-idf formulation and a state-of-the-art scoring method (like the ones shown in Table 1). Many such studies claim that their proposed methods are far superior to tf-idf weighting, often a wrong conclusion based on the poor weighting formulation used.

4.2 Query Modification

In the early years of IR, researchers realized that it was quite hard for users to formulate effective search requests. It was thought that adding synonyms of query words to the query should improve search effectiveness. Early research in IR relied on a thesaurus to find synonyms. [14] However, it is quite expensive to obtain a good general purpose thesaurus. Researchers developed techniques to automatically generate thesauri for use in query modification. Most of the automatic methods are based on analyzing word co-occurrence in the documents (which often produces a list of strongly related words). Most query augmentation techniques based on automatically generated thesauri had very limited success in improving search effectiveness. The main reason behind this is the lack of query context in the augmentation process. Not all words related to a query word are meaningful in context of the query.
E.g., even though machine is a very good alternative for the word engine, this augmentation is not meaningful if the query is search engine.

In 1965 Rocchio proposed using relevance feedback for query modification. [24] Relevance feedback is motivated by the fact that it is easy for users to judge some documents as relevant or non-relevant for their query. Using such relevance judgments, a system can then automatically generate a better query (e.g., by adding related new terms) for further searching. In general, the user is asked to judge the relevance of the top few documents retrieved by the system. Based on these judgments, the system modifies the query and issues the new query for finding more relevant documents from the collection. Relevance feedback has been shown to work quite effectively across test collections.

New techniques to do meaningful query expansion in the absence of any user feedback were developed in the early 1990s. Most notable of these is pseudo-feedback, a variant of relevance feedback. [3] Given that the top few documents retrieved by an IR system are often on the general query topic, selecting related terms from these documents should yield useful new terms irrespective of document relevance. In pseudo-feedback the IR system assumes that the top few documents retrieved for the initial user query are "relevant", and does relevance feedback to generate a new query. This expanded new query is then used to rank documents for presentation to the user. Pseudo-feedback has been shown to be a very effective technique, especially for short user queries.

5 Other Techniques and Applications

Many other techniques have been developed over the years and have met with varying success. The cluster hypothesis states that documents that cluster together (are very similar to each other) will have a similar relevance profile for a given query. [10] Document clustering techniques were (and still are) an active area of research.
Even though the usefulness of document clustering for improved search effectiveness (or efficiency) has been very limited, document clustering has allowed several developments in IR, e.g., for browsing and search interfaces. Natural Language Processing (NLP) has also been proposed as a tool to enhance retrieval effectiveness, but has had very limited success. [31] Even though document ranking is a critical application for IR, it is definitely not the only one. The field has developed techniques to attack many different problems like information filtering [2], topic detection and tracking (or TDT) [1], speech retrieval [13], cross-language retrieval [9], question answering [19], and many more.

6 Summing Up

The field of information retrieval has come a long way in the last forty years, and has enabled easier and faster information discovery. In the early years there were many doubts raised regarding the simple statistical techniques used in the field. However, for the task of finding information, these statistical techniques have indeed proven to be the most effective ones so far. Techniques developed in the field have been used in many other areas and have yielded many new technologies which are used by people on an everyday basis, e.g., web search engines, junk-email filters, news clipping services. Going forward, the field is attacking many critical problems that users face in today's information-ridden world. With exponential growth in the amount of information available, information retrieval will play an increasingly important role in the future.

References

[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 194-218, 1998.
[2] N. J. Belkin and W. B. Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38, 1992.
[3] Chris Buckley, James Allan, Gerard Salton, and Amit Singhal. Automatic query expansion using SMART: TREC 3. In Proceedings of the Third Text REtrieval Conference (TREC-3), pages 69-80. NIST Special Publication 500-225, April 1995.
[4] Chris Buckley, Gerard Salton, and James Allan. Automatic retrieval with locality information using SMART. In Proceedings of the First Text REtrieval Conference (TREC-1), pages 59-72. NIST Special Publication 500-207, March 1993.
[5] Vannevar Bush. As We May Think. Atlantic Monthly, 176:101-108, July 1945.
[6] C. W. Cleverdon. The Cranfield tests on index language devices. Aslib Proceedings, 19:173-192, 1967.
[7] W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35:285-295, 1979.
[8] J. L. Fagan. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2):115-139, 1989.
[9] G. Grefenstette, editor. Cross-Language Information Retrieval. Kluwer Academic Publishers, 1998.
[10] A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.
[11] D. K. Harman. Overview of the first Text REtrieval Conference (TREC-1). In Proceedings of the First Text REtrieval Conference (TREC-1), pages 1-20. NIST Special Publication 500-207, March 1993.
[12] David Hull. Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47(1):70-84, 1996.
[13] G. J. F. Jones, J. T. Foote, K. Sparck Jones, and S. J. Young. Retrieving spoken documents by combining multiple index sources. In Proceedings of ACM SIGIR'96, pages 30-38, 1996.
[14] K. Sparck Jones. Automatic Keyword Classification for Information Retrieval. Butterworths, London, 1971.
[15] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21, 1972.
[16] K. Sparck Jones and P. Willett, editors. Readings in Information Retrieval. Morgan Kaufmann, 1997.
[17] H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1957.
[18] M. E. Maron and J. L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7:216-244, 1960.
[19] Marius Pasca and Sanda Harabagiu. High performance question/answering. In Proceedings of the 24th International Conference on Research and Development in Information Retrieval, pages 366-374, 2001.
[20] S. E. Robertson. The probabilistic ranking principle in IR. Journal of Documentation, 33:294-304, 1977.
[21] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, May-June 1976.
[22] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SIGIR'94, pages 232-241, 1994.
[23] S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive tracks. In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 253-264. NIST Special Publication 500-242, July 1999.
[24] J. J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System - Experiments in Automatic Document Processing, pages 313-323. Prentice Hall, Englewood Cliffs, NJ, 1971.
[25] Gerard Salton, editor. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971.
[26] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[27] Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw Hill Book Co., New York, 1983.
[28] Gerard Salton, A. Wong, and C. S. Yang. A vector space model for information retrieval. Communications of the ACM, 18(11):613-620, November 1975.
[29] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In Proceedings of ACM SIGIR'96, pages 21-29. Association for Computing Machinery, New York, August 1996.
[30] Amit Singhal, John Choi, Donald Hindle, David Lewis, and Fernando Pereira. AT&T at TREC-7. In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 239-252. NIST Special Publication 500-242, July 1999.
[31] T. Strzalkowski, L. Guthrie, J. Karlgren, J. Leistensnider, F. Lin, J. Perez-Carballo, T. Straszheim, J. Wang, and J. Wilding. Natural language information retrieval: TREC-5 report. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), 1997.
[32] Howard Turtle. Inference Networks for Document Retrieval. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst, MA 01003, 1990. Available as COINS Technical Report 90-92.
[33] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
100 Information Engineering Terms (English-Chinese)
100 Information Engineering Terms (English-Chinese). Three sample write-ups follow for reference.

Sample 1

Information engineering is a vast field that covers a wide range of knowledge and skills. In this article, we will introduce 100 important terms and concepts in information engineering, both in English and Chinese.

1. Artificial Intelligence (AI) - 人工智能
2. Machine Learning - 机器学习
3. Deep Learning - 深度学习
4. Natural Language Processing (NLP) - 自然语言处理
5. Computer Vision - 计算机视觉
6. Data Mining - 数据挖掘
7. Big Data - 大数据
8. Internet of Things (IoT) - 物联网
9. Cloud Computing - 云计算
10. Virtual Reality (VR) - 虚拟现实
11. Augmented Reality (AR) - 增强现实
12. Cybersecurity - 网络安全
13. Cryptography - 密码学
14. Blockchain - 区块链
15. Information System - 信息系统
16. Database Management System (DBMS) - 数据库管理系统
17. Relational Database - 关系数据库
18. NoSQL - 非关系型数据库
19. SQL (Structured Query Language) - 结构化查询语言
20. Data Warehouse - 数据仓库
21. Data Mart - 数据集市
22. Data Lake - 数据湖
23. Data Modeling - 数据建模
24. Data Cleansing - 数据清洗
25. Data Visualization - 数据可视化
26. Hadoop - 分布式存储和计算框架
27. Spark - 大数据处理框架
28. Kafka - 流数据处理平台
29. Elasticsearch - 开源搜索引擎
30. Cyber-Physical System (CPS) - 信息物理系统
31. System Integration - 系统集成
32. Network Architecture - 网络架构
33. Network Protocol - 网络协议
34. TCP/IP - 传输控制协议/互联网协议
35. OSI Model - 开放系统互连参考模型
36. Router - 路由器
37. Switch - 交换机
38. Firewall - 防火墙
39. Load Balancer - 负载均衡器
40. VPN (Virtual Private Network) - 虚拟专用网络
41. SDN (Software-Defined Networking) - 软件定义网络
42. CDN (Content Delivery Network) - 内容分发网络
43. VoIP (Voice over Internet Protocol) - 互联网语音
44. Unified Communications - 统一通信
45. Mobile Computing - 移动计算
46. Mobile Application Development - 移动应用开发
47. Responsive Web Design - 响应式网页设计
48. UX/UI Design - 用户体验/用户界面设计
49. Agile Development - 敏捷开发
50. DevOps - 开发与运维
51. Continuous Integration/Continuous Deployment (CI/CD) - 持续集成/持续部署
52. Software Testing - 软件测试
53. Bug Tracking - 缺陷跟踪
54. Version Control - 版本控制
55. Git - 分布式版本控制系统
56. Agile Project Management - 敏捷项目管理
57. Scrum - 敏捷开发框架
58. Kanban - 看板管理法
59. Waterfall Model - 瀑布模型
60. Software Development Life Cycle (SDLC) - 软件开发生命周期
61. Requirements Engineering - 需求工程
62. Software Architecture - 软件架构
63. Software Design Patterns - 软件设计模式
64. Object-Oriented Programming (OOP) - 面向对象编程
65. Functional Programming - 函数式编程
66. Procedural Programming - 过程式编程
67. Dynamic Programming - 动态规划
68. Static Analysis - 静态分析
69. Code Refactoring - 代码重构
70. Code Review - 代码审查
71. Code Optimization - 代码优化
72. Software Development Tools - 软件开发工具
73. Integrated Development Environment (IDE) - 集成开发环境
74. Version Control System - 版本控制系统
75. Bug Tracking System - 缺陷跟踪系统
76. Code Repository - 代码仓库
77. Build Automation - 构建自动化
78. Continuous Integration/Continuous Deployment (CI/CD) - 持续集成/持续部署
79. Code Coverage - 代码覆盖率
80. Code Review - 代码审查
81. Software Development Methodologies - 软件开发方法论
82. Waterfall Model - 瀑布模型
83. Agile Development - 敏捷开发
84. Scrum - 敏捷开发框架
85. Kanban - 看板管理法
86. Lean Development - 精益开发
87. Extreme Programming (XP) - 极限编程
88. Test-Driven Development (TDD) - 测试驱动开发
89. Behavior-Driven Development (BDD) - 行为驱动开发
90. Model-Driven Development (MDD) - 模型驱动开发
91. Design Patterns - 设计模式
92. Creational Patterns - 创建型模式
93. Structural Patterns - 结构型模式
94. Behavioral Patterns - 行为型模式
95. Software Development Lifecycle (SDLC) - 软件开发生命周期
96. Requirement Analysis - 需求分析
97. System Design - 系统设计
98. Implementation - 实施
99. Testing - 测试
100. Deployment - 部署

These terms are just the tip of the iceberg when it comes to information engineering. As technology continues to advance, new terms and concepts will emerge, shaping the future of this dynamic field. Whether you are a student, a professional, or just someone interested in technology, familiarizing yourself with these terms will help you navigate the complex world of information engineering.

Sample 2

100 Information Engineering Professional Terms in English

1. Algorithm - a set of instructions for solving a problem or performing a task
2. Computer Science - the study of computers and their applications
3. Data Structures - the way data is organized in a computer system
4. Networking - the practice of linking computers together to share resources
5. Cybersecurity - measures taken to protect computer systems from unauthorized access or damage
6. Software Engineering - the application of engineering principles to software development
7. Artificial Intelligence - the simulation of human intelligence by machines
8. Machine Learning - a type of artificial intelligence that enables machines to learn from data
9. Big Data - large and complex sets of data that require specialized tools to process
10. Internet of Things (IoT) - the network of physical devices connected through the internet
11. Cloud Computing - the delivery of computing services over the internet
12. Virtual Reality - a computer-generated simulation of a real or imagined environment
13. Augmented Reality - the integration of digital information with the user's environment
14. Data Mining - the process of discovering patterns in large data sets
15. Quantum Computing - the use of quantum-mechanical phenomena to perform computation
16. Cryptography - the practice of securing communication by encoding it
17. Data Analytics - the process of analyzing data to extract meaningful insights
18. Information Retrieval - the process of finding relevant information in a large dataset
19. Web Development - the process of creating websites and web applications
20. Mobile Development - the process of creating mobile applications
21. User Experience (UX) - the overall experience of a user interacting with a product
22. User Interface (UI) - the visual and interactive aspects of a product that a user interacts with
23. Software Architecture - the design and organization of software components
24. Systems Analysis - the process of studying a system's requirements to improve its efficiency
25. Computer Graphics - the creation of visual content using computer software
26. Embedded Systems - systems designed to perform a specific function within a larger system
27. Information Security - measures taken to protect information from unauthorized access
28. Database Management - the process of organizing and storing data in a database
29. Cloud Security - measures taken to protect data stored in cloud computing environments
30. Agile Development - a software development methodology that emphasizes collaboration and adaptability
31. DevOps - a set of practices that combine software development and IT operations to improve efficiency
32. Continuous Integration - the practice of integrating code changes into a shared repository frequently
33. Machine Vision - the use of cameras and computers to process visual information
34. Predictive Analytics - the use of data and statistical algorithms to predict future outcomes
35. Information Systems - the study of how information is used in organizations
36. Data Visualization - the representation of data in visual formats to make it easier to understand
37. Edge Computing - the practice of processing data closer to its source rather than in a centralized data center
38. Natural Language Processing - the ability of computers to understand and generate human language
39. Cyber-Physical Systems - systems that integrate physical and computational elements
40. Computer Vision - the ability of computers to interpret and understand visual information
41. Information Architecture - the structural design of information systems
42. Information Technology - the use of computer systems to manage and process information
43. Computational Thinking - a problem-solving approach that uses computer science concepts
44. Embedded Software - software that controls hardware devices in an embedded system
45. Data Engineering - the process of collecting, processing, and analyzing data
46. Software Development Life Cycle - the process of developing software from conception to deployment
47. Internet Security - measures taken to protect internet-connected systems from cyber threats
48.
Application Development - the process of creating software applications for specific platforms49. Network Security - measures taken to protect computer networks from unauthorized access50. Artificial Neural Networks - computational models inspired by the biological brain's neural networks51. Systems Engineering - the discipline that focuses on designing and managing complex systems52. Information Management - the process of collecting, storing, and managing information within an organization53. Sensor Networks - networks of sensors that collect and transmit data for monitoring and control purposes54. Data Leakage - the unauthorized transmission of data to an external source55. Software Testing - the process of evaluating software to ensure it meets requirements and functions correctly56. Internet Protocol (IP) - a set of rules for sending data over a network57. Machine Translation - the automated translation of text from one language to another58. Cryptocurrency - a digital or virtual form of currency that uses cryptography for security59. Software Deployment - the process of making software available for use by end-users60. Computer Forensics - the process of analyzing digital evidence for legal or investigative purposes61. Virtual Private Network (VPN) - a secure connection that allows users to access a private network over a public network62. Internet Service Provider (ISP) - a company that provides access to the internet63. Data Center - a facility that houses computing and networking equipment for processing and storing data64. Network Protocol - a set of rules for communication between devices on a network65. Project Management - the practice of planning, organizing, and overseeing a project to achieve its goals66. Data Privacy - measures taken to protect personal data from unauthorized access or disclosure67. Software License - a legal agreement that governs the use of software68. 
Information Ethics - the study of ethical issues related to the use of information technology69. Search Engine Optimization (SEO) - the process of optimizing websites to rank higher in search engine results70. Internet of Everything (IoE) - the concept of connecting all physical and digital objects to the internet71. Software as a Service (SaaS) - a software delivery model in which applications are hosted by a provider and accessed over the internet72. Data Warehousing - the process of collecting and storing data from various sources for analysis and reporting73. Cloud Storage - the practice of storing data online in remote servers74. Mobile Security - measures taken to protect mobile devices from security threats75. Web Hosting - the service of providing storage space and access for websites on the internet76. Malware - software designed to harm a computer system or its users77. Information Governance - the process of managing information to meet legal, regulatory, and business requirements78. Enterprise Architecture - the practice of aligning an organization's IT infrastructure with its business goals79. Data Backup - the process of making copies of data to protect against loss or corruption80. Data Encryption - the process of converting data into a code to prevent unauthorized access81. Social Engineering - the manipulation of individuals to disclose confidential information82. Internet of Medical Things (IoMT) - the network of medical devices connected through the internet83. Content Management System (CMS) - software used to create and manage digital content84. Blockchain - a decentralized digital ledger used to record transactions85. Open Source - software that is publicly accessible for modification and distribution86. Network Monitoring - the process of monitoring and managing network performance and security87. Data Governance - the process of managing data to ensure its quality, availability, and security88. 
Software Patch - a piece of code used to fix a software vulnerability or add new features89. Zero-Day Exploit - a security vulnerability that is exploited before the vendor has a chance to patch it90. Data Migration - the process of moving data from one system to another91. Business Intelligence - the use of data analysis tools to gain insights into business operations92. Secure Socket Layer (SSL) - a protocol that encrypts data transmitted over the internet93. Mobile Device Management (MDM) - the practice of managing and securing mobile devices in an organization94. Dark Web - the part of the internet that is not indexed by search engines and often used for illegal activities95. Knowledge Management - the process of capturing, organizing, and sharing knowledge within an organization96. Data Cleansing - the process of detecting and correcting errors in a dataset97. Software Documentation - written information that describes how software works98. Open Data - data that is freely available for anyone to use and redistribute99. Predictive Maintenance - the use of data analytics to predict when equipment will need maintenance100. Software Licensing - the legal terms and conditions that govern the use and distribution of softwareThis list of 100 Information Engineering Professional Terms in English provides a comprehensive overview of key concepts and technologies in the field of information technology. These terms cover a wide range of topics, including computer science, data analysis, network security, and software development. By familiarizing yourself with these terms, you can better understand and communicate about the complex and rapidly evolving world of information engineering.篇3100 Information Engineering Professional Terms1. Algorithm - 算法2. Artificial Intelligence - 人工智能3. Big Data - 大数据4. Cloud Computing - 云计算5. Cryptography - 密码学6. Data Mining - 数据挖掘7. Database - 数据库8. Deep Learning - 深度学习9. Digital Signal Processing - 数字信号处理10. Internet of Things - 物联网11. 
Machine Learning - 机器学习12. Network Security - 网络安全13. Object-Oriented Programming - 面向对象编程14. Operating System - 操作系统15. Programming Language - 编程语言16. Software Engineering - 软件工程17. Web Development - 网页开发18. Agile Development - 敏捷开发19. Cybersecurity - 网络安全20. Data Analytics - 数据分析21. Network Protocol - 网络协议22. Artificial Neural Network - 人工神经网络23. Cloud Security - 云安全24. Data Visualization - 数据可视化25. Distributed Computing - 分布式计算26. Information Retrieval - 信息检索27. IoT Security - 物联网安全28. Machine Translation - 机器翻译29. Mobile App Development - 移动应用开发30. Software Architecture - 软件架构31. Data Warehousing - 数据仓库32. Network Architecture - 网络架构33. Robotics - 机器人技术34. Virtual Reality - 虚拟现实35. Web Application - 网页应用36. Biometrics - 生物识别技术37. Computer Graphics - 计算机图形学38. Cyber Attack - 网络攻击39. Data Compression - 数据压缩40. Network Management - 网络管理41. Operating System Security - 操作系统安全42. Real-Time Systems - 实时系统43. Social Media Analytics - 社交媒体分析44. Blockchain Technology - 区块链技术45. Computer Vision - 计算机视觉46. Data Integration - 数据集成47. Game Development - 游戏开发48. IoT Devices - 物联网设备49. Multimedia Systems - 多媒体系统50. Software Quality Assurance - 软件质量保证51. Data Science - 数据科学52. Information Security - 信息安全53. Machine Vision - 机器视觉54. Natural Language Processing - 自然语言处理55. Software Testing - 软件测试56. Chatbot - 聊天机器人57. Computer Networks - 计算机网络58. Cyber Defense - 网络防御60. Image Processing - 图像处理61. IoT Sensors - 物联网传感器62. Neural Network - 神经网络63. Network Traffic Analysis - 网络流量分析64. Software Development Life Cycle - 软件开发周期65. Data Governance - 数据治理66. Information Technology - 信息技术67. Malware Analysis - 恶意软件分析68. Online Privacy - 在线隐私69. Speech Recognition - 语音识别70. Cyber Forensics - 网络取证71. Data Anonymization - 数据匿名化72. IoT Platform - 物联网平台73. Network Infrastructure - 网络基础设施74. Predictive Analytics - 预测分析75. Software Development Tools - 软件开发工具77. Information Security Management - 信息安全管理78. Network Monitoring - 网络监控79. Software Deployment - 软件部署80. Data Encryption - 数据加密81. 
IoT Gateway - 物联网网关82. Network Topology - 网络拓扑结构83. Quantum Computing - 量子计算84. Software Configuration Management - 软件配置管理85. Data Lakes - 数据湖86. Infrastructure as a Service (IaaS) - 基础设施即服务87. Network Virtualization - 网络虚拟化88. Robotic Process Automation - 机器人流程自动化89. Software as a Service (SaaS) - 软件即服务90. Data Governance - 数据治理91. Information Security Policy - 信息安全政策92. Network Security Risk Assessment - 网络安全风险评估93. Secure Software Development - 安全软件开发94. Internet Security - 互联网安全95. Secure Coding Practices - 安全编码实践96. Secure Network Design - 安全网络设计97. Software Security Testing - 软件安全测试98. IoT Security Standards - 物联网安全标准99. Network Security Monitoring - 网络安全监控100. Vulnerability Management - 漏洞管理These terms cover a wide range of topics within the field of Information Engineering, and are essential in understanding and discussing the various aspects of this discipline. It is important for professionals in this field to be familiar with these terms in order to effectively communicate and collaborate with others in the industry.。
Information Retrieval
Query   A      B      B-A     rank of |B-A|
Q1      0.61   0.32   -0.29   4
Q2      0.52   0.55   +0.03   2
Q3      0.12   0.13   +0.01   1
Q4      0.73   0.32   -0.41   5
Q5      0.22   0.12   -0.10   3
Mean    0.44   0.29   -0.15
• rank differences by increasing absolute value (smallest |difference| gets rank 1) • null hypothesis: A and B are equivalent:
– ranks of positive and negative differences should be mixed
• add up positive ranks: W(+) = 1+2 = 3
• add up negative ranks: W(-) = 3+4+5 = 12
– look up a critical value for min(W(+), W(-)) – if the statistic falls below the critical value, reject the null hypothesis
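The signed-rank bookkeeping above can be sketched in Python. This is a minimal illustration using the per-query differences from the table; it omits tie handling and the critical-value lookup.

```python
# Wilcoxon signed-rank sketch for the five queries above.
# diffs holds the per-query differences (B - A) from the table.
diffs = {"Q1": -0.29, "Q2": +0.03, "Q3": +0.01, "Q4": -0.41, "Q5": -0.10}

# Rank differences by increasing absolute value (1 = smallest).
ordered = sorted(diffs, key=lambda q: abs(diffs[q]))
rank = {q: i + 1 for i, q in enumerate(ordered)}

w_plus = sum(r for q, r in rank.items() if diffs[q] > 0)   # ranks 1 + 2 = 3
w_minus = sum(r for q, r in rank.items() if diffs[q] < 0)  # ranks 3 + 4 + 5 = 12
statistic = min(w_plus, w_minus)                            # compared to a critical value
```

In practice a library routine such as scipy.stats.wilcoxon would also handle ties, zero differences, and the p-value computation.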
• “outliers” as far as most statistical models are concerned • dimensionality-reduction / component-analysis likely to hurt
– mostly unsupervised: very little training data
Hopkins IR Workshop 2005 Copyright © Victor Lavrenko
Evaluating a Ranked List
Search Engines: Information Retrieval in Practice (Slides, Chapter 1)
Easy to compare fields with well-defined semantics to queries in order to find matches. Text is more difficult.
Documents vs. Records
Example bank database query
– Find records with balance > $50,000 in branches located in Amherst, MA.
– Matches easily found by comparison with field values of records
designing and implementing them are major issues for search engines
Search Engines
Information Retrieval in Practice
All slides © Addison Wesley, 2008
Search and Information Retrieval
Search on the Web is a daily activity for many people throughout the world. Search and communication are the most popular uses of the computer. Applications involving search are everywhere. The field of computer science that is most involved with R&D for search is information retrieval (IR).
– Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed
Specialized English Vocabulary for Literature Retrieval (Courseware)
02
Literature retrieval technology
Boolean Search
Boolean Search technology is a common and effective information retrieval method, which is based on the development of computer technology and information retrieval theory
Topic search
Topic search technology can be applied to all fields of text data analysis, such as news, reports, and papers
Topic search technology has the characteristics of subjectivity, complexity, and uncertainty
Cluster analysis
• Cluster analysis is a data analysis method that divides data into different groups according to certain rules or algorithms, and then analyzes the characteristics and patterns of each group
01
Overview of Literature Retrieval
Definition of Literature Retrieval
Literature retrieval is an information retrieval technology that uses computer programs to search and retrieve relevant documents from a large corpus of text
Translated Text: Introduction to Information Retrieval
Undergraduate thesis: foreign literature and translation. Title: Introduction to Information Retrieval. Source: online. Publication date: 2008.3.20.

Foreign text: Introduction to Information Retrieval

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus: Information retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections. As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching. IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly "unstructured". This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup.
IR is also used to facilitate "semi-structured" search such as finding a document where the title contains Java and the body contains threading. The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories, classification is the task of deciding which classes, if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically. Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues include the need to gather documents for indexing, being able to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web. We focus on all these issues in Chapters 19–21. At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval. Email programs usually not only provide search but also text classification: they at least provide a spam filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders.
Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner. In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation's internal documents, a database of patents, or research articles on biochemistry. In this case, the documents will typically be stored on centralized file systems and one or a handful of dedicated machines will provide search over the collection. This book contains techniques of value over this whole spectrum, but our coverage of some aspects of parallel and distributed search in web-scale search systems is comparatively light owing to the relatively small published literature on the details of such systems. However, outside of a handful of web search companies, a software developer is most likely to encounter the personal search and enterprise scenarios. In this chapter we begin with a very simple example of an information retrieval problem, and introduce the idea of a term-document matrix and the central inverted index data structure. We will then examine the Boolean retrieval model and how Boolean queries are processed.

An example information retrieval problem

A fat book which many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents.
This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections, you really need nothing more. But for many purposes, you do need more:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".
3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare's – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units; they are usually words, and for the moment you can think of them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong, are not usually thought of as words.
Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND. The answers for this query are thus Antony and Cleopatra and Hamlet. The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words. Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book... We will refer to the group of documents over which we perform retrieval as the collection. It is sometimes also referred to as a corpus. Suppose each document is about 1000 words long. If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1. Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.
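The bitwise-AND query just described can be sketched in Python. The 6-bit incidence vectors below are the ones the textbook's Figure 1.1 gives for these three terms (the figure itself is not reproduced in this excerpt, so treat the vectors as assumed), with one bit per play.

```python
# Boolean query "Brutus AND Caesar AND NOT Calpurnia" as bitwise operations
# on rows of a term-document incidence matrix. One bit per play, leftmost
# bit = first play in the list below. Vectors assumed from Figure 1.1.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Complement the last vector, then AND everything; mask keeps 6 bits.
answer = brutus & caesar & ~calpurnia & 0b111111

# Decode set bits back to play titles (bit 5 is the leftmost play).
hits = [p for i, p in enumerate(plays) if answer & (1 << (5 - i))]
```

Decoding the result bit vector yields Antony and Cleopatra and Hamlet, matching the answer stated in the text.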
An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like "pipeline leaks" and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system, a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8. We now cannot build a term-document matrix in a naive way. A 500K × 1M matrix has half-a-trillion 0's and 1's – too many to fit in a computer's memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1's, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions. This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval. The basic idea of an inverted index is shown in Figure 1.3.
We keep a dictionary of terms. Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document – is conventionally called a posting. The list is then called a postings list, and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this.

A first take at building an inverted index

To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:

1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

We will define and discuss the earlier stages of processing, that is, steps 1–3, in Section 2.2. Until then you can think of tokens and normalized tokens as also loosely equivalent to words. Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted index by sort-based indexing. Within a document collection, we assume that each document has a unique serial number, known as the document identifier. During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4.
Multiple occurrences of the same term from the same document are then merged. Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term. This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time, and it is a statistic later used in many ranked retrieval models. The postings are secondarily sorted by docID. This provides the basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search. In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk, so the size of each is important, and in Chapter 5 we will examine how each can be optimized for storage and access efficiency. What data structure should be used for a postings list? A fixed length array would be wasteful as some words occur in many documents, and others in very few. For an in-memory postings list, two good alternatives are singly linked lists or variable length arrays. Singly linked lists allow cheap insertion of documents into postings lists, and naturally extend to more advanced indexing strategies such as skip lists, which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with memory caches. Extra pointers can in practice be encoded into the lists as offsets.
If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for each term. When postings lists are stored on disk, they are stored as a contiguous run of postings without explicit pointers, so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.
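The sort-based indexing steps described above (collect documents, produce (term, docID) pairs, sort them, merge duplicates into a dictionary plus docID-sorted postings) can be sketched as follows. The toy documents are invented for illustration, and linguistic preprocessing (step 3) is reduced to lowercasing.

```python
# Sort-based construction of a basic inverted index (steps 1-4 above).
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "new home sales rise"}

# Step 2: tokenize each document into (term, docID) pairs.
pairs = [(term, doc_id)
         for doc_id, text in docs.items()
         for term in text.lower().split()]

# Core step: sort by term, then docID (tuple order gives us both).
pairs.sort()

# Step 4: merge duplicates into dictionary -> postings list.
index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:  # skip repeats within a doc
        postings.append(doc_id)
```

Because the pairs were sorted before merging, each postings list comes out sorted by docID, which is exactly the property the text says query processing relies on.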
Information Retrieval (IR)
[Figure: the retrieval process in a generic IR system. The user interface passes text through text operations to form a logical view; query operations produce the query; searching runs it against the inverted-file index to return retrieved docs, which can be refined through user feedback; indexing builds the inverted file from the document text.]
A = measurement interval, B = relevance aspect (absolute relevance), C = document, D = context; relevance is measured here (including the information need); E = the user's judgment
Related Concepts
Text forms: text exists in several standard forms, usually unstructured (also called plain text), semi-structured, and structured. In most cases, text is treated as semi-structured. For example, the description of a book might take the following form:
Implementation Approaches
2. Indexing (*)
- Fast
- Easy to improve
For example:
Keyword representation:
Original sentence: applications of databases and artificial intelligence in industry
After preprocessing: database, artificial intelligence, industry, application
Original sentence: applications of artificial intelligence and databases in industry
After preprocessing: artificial intelligence, database, industry, application
Inverted file: artificial intelligence → {d1, d3, d5, d6, d7}
Query process: user query Q = {w1 = database, w2 = artificial intelligence, w3 = industry}, with Q = w1 AND w2 AND (NOT w3)
Document lists: w1 → {d1, d2, d5, d7, d9}
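The inverted-file lookup in this example can be sketched with Python sets. The postings for w1 (database) and w2 (artificial intelligence) are the ones given above; the postings for w3 (industry) are not given in the example, so the set below is a made-up placeholder just to keep the sketch runnable.

```python
# Boolean retrieval over the inverted file from the example above.
postings = {
    "database":                {"d1", "d2", "d5", "d7", "d9"},  # w1 (from the example)
    "artificial intelligence": {"d1", "d3", "d5", "d6", "d7"},  # w2 (from the example)
    "industry":                {"d5", "d8"},                    # w3 (assumed placeholder)
}

# Q = w1 AND w2 AND (NOT w3): intersect the first two postings lists,
# then subtract the postings of the negated term.
result = (postings["database"]
          & postings["artificial intelligence"]) - postings["industry"]
```

With these postings, w1 AND w2 gives {d1, d5, d7}; removing the assumed industry postings leaves the final answer.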
ISBN: 0-201-12227-8 Author: Salton, Gerard Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer Editor: Addison-Wesley Date: 1989 … Content: <Text Content>
Computer Information Retrieval (Courseware)
2013-8-30, Shanghai University Library, Information Department
II. Principles of Literature Information Databases
Basic field names: Abstracts (文摘), Article Title (文章题目), Author Key Words (作者关键词)
Auxiliary field names: Authors, Author Affiliation, Journal Title, International Standard Serial Number (ISSN), etc.
Figure: schematic of a sequential (linear) file: records (e.g. Record 4, Record 6) are stored in record-number order, each consisting of basic fields and auxiliary fields.
Schematic of an inverted file (author index):

Author Name | Record No.
陈春秀 | 1
陈东方 | 3
程文娟 | 3
黄茂 | 5
秦大河 | 6
秦翔 | 6
吴元康 | 2
肖中新 | 4
Structure of a database (file):
- Sequential file: the set of records; each record consists of fields, divided into basic fields and auxiliary fields.
- Inverted files (several): subject-term, author, journal-title, and other indexes; each entry gives a feature identifier together with the matching record numbers (record count).
II. Principles of Document Information Databases: Database Types
Secondary literature (same definitions as for printed retrieval tools):
- Catalog (目录): a set of bibliographic descriptions of related documents, whose main function is to report the publication or holdings of documents.
- Title list (题录): a tool compiled by arranging the titles of articles from books, periodicals, and other documents according to a given filing scheme, used to locate where an article appeared.
- Abstract (文摘): a retrieval tool that accurately records, in concise language, the essential content, academic viewpoints, data, and conclusions of a document, arranged according to fixed description rules and an ordering scheme for readers to consult.
Title: The Importance of Information Retrieval System Technology (IRST)

Introduction
Information Retrieval System Technology (IRST) is an essential tool for organizations and individuals in today's digital age. With the exponential growth of available information, the need for efficient and effective information retrieval has become paramount. This document explores the significance of IRST, its key features, its benefits, and its impact on various fields.

1. Definition and Components of IRST
Information Retrieval System Technology (IRST) is a software system designed to facilitate the retrieval of relevant information from large databases or collections of documents. It comprises three main components: document indexing, query processing, and relevancy ranking. Document indexing involves the categorization and organization of documents to enable quick and accurate retrieval. Query processing refers to the system's ability to interpret user queries and match them with indexed documents. Relevancy ranking aims to display the most relevant results based on user queries and indexing protocols.

2. Benefits of IRST
2.1 Time Efficiency. IRST plays a crucial role in streamlining the search process and saving valuable time. By employing efficient indexing techniques and relevancy-ranking algorithms, users can obtain the desired information promptly, reducing the time spent searching through vast amounts of data.
2.2 Enhanced Accuracy. One of the primary benefits of IRST is its ability to increase the accuracy of information retrieval. Through advanced indexing methods, users can access precisely what they are looking for, minimizing irrelevant results and false positives.
2.3 Improved Decision-Making. IRST provides users with quick access to relevant and reliable information, enabling better decision-making. In fields such as medicine, finance, and academia, where accurate information is vital, the use of IRST can significantly enhance the quality of decisions made.

3. Applications of IRST
3.1 Information Management in Libraries. IRST has revolutionized information management in libraries by offering efficient cataloging and retrieval of books, journals, and other resources. Librarians can index and categorize books, making it easier for users to find specific information.
3.2 Business Intelligence. In the corporate world, IRST finds applications in business intelligence. It enables companies to retrieve data from various sources, analyze market trends, and make informed business decisions. IRST assists in competitor analysis, market research, and data mining.
3.3 E-Commerce. In e-commerce, IRST is crucial for providing users with accurate and relevant search results, improving user experience, and ultimately boosting sales. By employing sophisticated indexing and relevancy-ranking algorithms, e-commerce platforms can match user queries with the most suitable products or services.

4. Challenges and Future Trends
4.1 Scalability. As the volume of digital information continues to grow, IRST faces challenges in handling large-scale databases and ensuring quick retrieval. Future developments in IRST should focus on scalability to meet the growing demands of information retrieval.
4.2 Multilingual Information Retrieval. With the globalization of businesses and the internet, the need for efficient multilingual information retrieval has become more apparent. Future IRST advancements may involve incorporating machine translation and cross-language retrieval techniques to bridge language barriers.
4.3 Personalized Retrieval. As user preferences and habits continue to shape the digital landscape, personalization in information retrieval becomes crucial. Future trends in IRST may involve incorporating user profiling and machine learning techniques to tailor search results according to individual preferences.

Conclusion
Information Retrieval System Technology (IRST) is an indispensable tool that improves the efficiency and accuracy of information retrieval. Its benefits extend across various fields, including libraries, business intelligence, and e-commerce. However, scalability, multilingual retrieval, and personalization are challenges IRST must address in the future to keep up with ever-growing information demands. As technology continues to advance, IRST will play an increasingly vital role in ensuring quick access to relevant information, revolutionizing the way we search for and utilize knowledge.
RetrievalQA Usage Examples

This collection of RetrievalQA usage examples contains four sample essays for the reader's reference.

Sample 1: RetrievalQA is a tool intended to improve information retrieval and question-answering systems, using recent natural language processing and machine learning techniques to deliver faster and more accurate retrieval results.
In today's era of information explosion we face massive amounts of data, and in most cases we need to find the answers we need within it.
RetrievalQA was created to solve exactly this problem.
Using RetrievalQA is very simple: the user enters a question or a keyword, and the system immediately returns the relevant documents or materials.
The results may be summaries of full-text documents, direct answers to the question, or links to related articles.
In this way, users can find the information they need more quickly and accurately, saving a great deal of time and effort.
RetrievalQA's strengths lie in its efficiency and accuracy.
Traditional search engines often fail to give satisfactory answers to complex questions, and their results lack structure.
RetrievalQA, built on recent natural language processing and machine learning algorithms, can better understand the user's intent and extract the most relevant answers from massive amounts of information.
RetrievalQA can also provide more personalized and customizable service.
Users can adjust the system's settings to their own needs and preferences, for example by specifying the search scope or the conditions for filtering results.
In this way, users obtain more accurate and targeted search results, improving the efficiency of their work and study.
RetrievalQA can be used not only for personal information retrieval but also in fields such as education, medicine, and finance.
In education, teachers can use RetrievalQA to prepare teaching materials and students can use it to consult supplementary resources; in medicine, doctors can quickly obtain clinical guidelines and the latest research results, improving diagnostic efficiency; in finance, investors can obtain market information and industry news to make wiser investment decisions.

Sample 2: In recent years, with the continuous development of artificial intelligence, retrieval-based question-answering (RetrievalQA) systems have been applied ever more widely to text understanding and semantic matching.
Information Retrieval Q&A
1. What is the definition of Information Retrieval?
Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. (p. 1)

2. What is the primary goal of an IR system?
The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible. (p. 2)

3. What is the difference between IR and data retrieval (DR)?
Data retrieval: determines which docs contain a set of keywords; well-defined semantics; a single erroneous object implies failure.
Information retrieval: seeks information about a subject or topic; semantics is frequently loose; small errors are tolerated.

4. What affects the effective retrieval of relevant information?
The user task and the logical view of the documents.

5. What is the relation between retrieval and browsing?
Retrieval of information or data is purposeful; browsing is glancing around.

1. What are the three classic models in information retrieval?
The three classic models in information retrieval are called Boolean, vector, and probabilistic.

3. What are the definitions and characteristics of ad hoc retrieval and filtering?
Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system.
Filtering: the queries remain relatively static while new documents come into the system (and leave).

4. Please say something about the Boolean model.
The Boolean model is a simple retrieval model based on set theory and Boolean algebra, valued for its inherent simplicity and neat formalism; the Boolean model also has well-known drawbacks.

5. What is hypertext?
A hypertext is a high-level interactive navigational structure which allows us to browse text non-sequentially on a computer screen. It consists basically of nodes which are correlated by directed links in a graph structure.

6. How can users avoid getting lost on the Web?
It is desirable that the hypertext include a hypertext map which shows where the user is at all times.

7. Please say something about the Web.
The Web is not exactly a proper hypertext because it lacks an underlying data model, it lacks a navigational plan, and it lacks a consistently designed user interface. Instead of saying that the Web is a hypertext, we prefer to say that it is a pool of interconnected webs.

1. What is the definition of retrieval performance evaluation?
Information retrieval systems require the evaluation of how precise the answer set is. This type of evaluation is referred to as retrieval performance evaluation.

2. What is retrieval performance evaluation usually based on?
It is usually based on a test reference collection and on an evaluation measure.

3. What are the most used retrieval evaluation measures?
The two most used retrieval evaluation measures are recall and precision.

4. What does the test reference collection consist of?
The test reference collection consists of a collection of documents, a set of example information requests, and a set of relevant documents (provided by specialists) for each example information request.
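Recall and precision as defined above follow directly from the answer set and the judged-relevant set. A small sketch with toy document IDs (not from the text):

```python
# Recall = |relevant ∩ retrieved| / |relevant|
# Precision = |relevant ∩ retrieved| / |retrieved|
relevant = {"d1", "d2", "d3", "d4"}   # judged relevant by specialists
retrieved = {"d2", "d3", "d5", "d6"}  # answer set returned by the system

ra = relevant & retrieved             # relevant documents actually retrieved
recall = len(ra) / len(relevant)      # fraction of relevant docs retrieved
precision = len(ra) / len(retrieved)  # fraction of retrieved docs relevant

print(recall, precision)  # 0.5 0.5
```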
Information Retrieval Technology
- Terms (technique, retrieval, document, known, ...) are converted to index numbers:
  123 345 110 2234 432 3565 2302 566 345 4321 3565 755 1128
- Compute tf (term frequency) for each index number:
  110:1, 123:1, 345:2, 432:1, 566:1, 755:1, 1128:1, 2234:1, 2302:1, 3565:2, 4321:1
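The tf computation above is just a count over the stream of index numbers:

```python
# Count how often each index number occurs in the token stream
# (IDs taken from the example above).
from collections import Counter

token_ids = [123, 345, 110, 2234, 432, 3565, 2302, 566,
             345, 4321, 3565, 755, 1128]

tf = Counter(token_ids)
print(tf[345], tf[3565], tf[110])  # 2 2 1
```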
Figure: Web search using IR. A Web spider crawls pages into a document corpus; indexing builds the text database (index); given a query string, the IR system matches it against the index and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).
Outline
- Retrieval efficiency
- Retrieval models
- Text processing
- Indexing and retrieval
- Query processing
(a recurring concern: retrieving too few or too many documents)
Vector Space Model
- Both a document D and a query Q (collectively, "texts") can be represented as vectors.
- Retrieval then amounts to computing the similarity between the document vector and the query vector.
- Results can be ranked by their similarity values.
- The ranked results can further drive relevance feedback.
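The process described above (represent texts as vectors, score by similarity, rank) can be sketched with cosine similarity; the term weights below are toy values:

```python
# Rank documents by cosine similarity to a query vector.
import math

def cosine(q, d):
    # dot product over the union of terms, normalized by vector lengths
    terms = set(q) | set(d)
    dot = sum(q.get(t, 0.0) * d.get(t, 0.0) for t in terms)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

query = {"database": 1.0, "retrieval": 1.0}
docs = {
    "d1": {"database": 0.8, "retrieval": 0.5, "index": 0.3},
    "d2": {"architecture": 0.9, "computer": 0.7},
}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # d1
```

Because similarity is a graded value rather than a yes/no match, the result list is naturally ordered, which is what enables ranking and relevance feedback.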
Inverted index
- dictionary
- posting list
- For a single query term, search is simply a dictionary lookup: there is no need to scan all documents, only to access that term's posting list, which is very fast.
- Components of an inverted file:
  - Vocabulary: the file holding the vocabulary is usually called the index file; the file holding the occurrence positions is usually called the posting file. Postings are usually stored with delta compression (storing gaps between successive docIDs rather than the IDs themselves).

Query processing: relevance feedback
Relevance feedback generates a new query from the user's relevance judgments on the retrieved documents. The basic idea of query modification: terms that appear in relevant documents are added to the original query vector, or their weights are increased to some degree when the new query is created; terms that appear in non-relevant documents are removed from the original query, or their weights are decreased to some degree.
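The query-modification idea above (boost terms from relevant documents, demote terms from non-relevant ones) matches the classic Rocchio scheme. A minimal sketch; the alpha/beta/gamma weights are illustrative assumptions:

```python
# Rocchio-style query modification over term-weight dictionaries.
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant:                    # boost terms from relevant docs
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant)
    for doc in nonrelevant:                 # demote terms from non-relevant docs
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    # drop terms whose weight fell to zero or below
    return {t: w for t, w in new_q.items() if w > 0}

q = {"database": 1.0}
rel = [{"database": 0.5, "index": 0.8}]
nonrel = [{"architecture": 0.9}]
print(sorted(rocchio(q, rel, nonrel)))  # ['database', 'index']
```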
Abstract Writing
1. Miniaturizing the Text
An abstract is a condensed statement of the contents of a paper. It can be viewed as a mini-version or miniature of the document, summarizing the content of the main body.
50-100 words may be sufficient for a short article. Each journal and/or abstracting index has its own requirements. As a general rule, an abstract will be approximately 3-5% of the length of the paper.
4. Dynamic operating characteristics over a one-month interval are given for the collector array and heat-transfer devices, and cost efficiency is compared with that of the conventional design. (85 words)
6. The transport fluid for transferring energy from the solar array to the storage tank was important to overall efficiency.
7. An optimum ratio of 64/36 was determined for the proportion of propylene glycol to water.
Research on an Ontology-Based Semantic Information Retrieval System Model

[Abstract] Traditional information retrieval cannot query information at the semantic level; in today's era of information explosion, it increasingly fails to meet users' demands for retrieval efficiency.
This paper designs an ontology-based semantic retrieval system model: semantic tags are used to annotate unstructured data, a unified metadata repository is built, corresponding domain ontologies are constructed, and the ontologies' semantic-reasoning capability is exploited to achieve semantic retrieval of information resources.
[Keywords] ontology; semantic retrieval; metadata

1. Introduction. With the development of the Internet and information technology, informatization has penetrated ever more deeply into every aspect of work and life, bringing with it an explosive growth in the volume of information.
Given the development of information processing technology, how to retrieve the needed information efficiently, quickly, and accurately from massive amounts of information has become a hot research topic in computer science.
Information retrieval is the process of finding the information a user needs within a collection of information.
In practice, traditional keyword-based retrieval mainly performs strict mechanical matching between the keywords representing the user's query and the index terms representing document content.
Because one meaning can be expressed by many words and one word can carry many meanings, and because such systems lack semantic understanding, multiple gaps in expression arise between the query keywords and the user's real need, and between keywords and index terms, leading to low precision and a high rate of false hits.
Therefore, this paper studies ontology-oriented intelligent information retrieval technology and builds a system model on that basis: an ontology repository and a metadata repository accurately map information resources, query conditions are processed at the semantic level, and retrieval efficiency is thereby improved.

2. Information Retrieval and Ontology. 2.1 Information retrieval. The term "information retrieval" originated in Calvin Mooers's 1948 master's thesis at MIT.
Information retrieval refers to the process of organizing and storing information in a certain way and finding the needed information according to user needs; it is also called "information storage and retrieval" [1].
Broadly speaking, information retrieval includes both the storage process and the retrieval process; for the user it often refers only to the retrieval process of finding the needed information.
Information storage mainly involves selecting information within a given subject area and, on that basis, describing its features, processing it, and ordering it, i.e., building a database.
Retrieval means using certain equipment and tools, and a series of methods and strategies, to find the needed information in the database.
Information Retrieval: An Overview
What is a database?
A database is a collection of similar data records stored in a common file (or collection of files).

Types of databases: examples
Databases form the basis for:
- catalogues of books or other types of documents
- computerized bibliographies
- address directories
- full-text newspapers, newsletters, magazines, journals, and collections of these
- WWW and Internet search engines
- intranet search engines
- ...

Information retrieval and related activities
- "Text retrieval" can be considered as part of the larger concept "information management".
- There is a great overlap between text retrieval and image retrieval, because image retrieval is in most cases based on text retrieval: retrieval of images is usually based not on computerized investigation of the images themselves, but on searches in the text that accompanies each image.

Information retrieval: the terminology
Several words are used with similar or related meanings:
- database / databank / corpus / collection / catalog / site / archive / file / web / ...
- contents of a database / records / documents / items / (web) pages / ...
- search / query / filter / ...
- thesaurus / controlled vocabulary / dictionary / lexicon / term bank / ontology / ...
- results / selection / retrieved documents / retrieved items / ...

Information retrieval software: a particular type of DBMS
- software for information storage and retrieval (ISR software)
- text(-oriented) database management systems (Text-DBMS)
- text information management systems (TIMS)
- document retrieval systems
- document management systems

Figure: information retrieval, via a database to the user. The information content is stored as a linear file and an inverted file; a search engine and search interface connect the database to the user.

Figure: building a database. Records derived from the input are stored in the database; an inverted file (index, register of the database) is built by indexing the records fed into the database management system, and retrieval runs against that index.

Question: The records input into a database system to be indexed do not necessarily appear completely in the output phase; that is, they are not shown completely to the user of the system in the results of a query. Can you illustrate this?

Information retrieval: the basic processes in search systems
An information problem is represented as a query; text documents are represented as indexed documents; the two representations are compared and matched to produce retrieved, sorted documents, which feed evaluation and feedback.

Information retrieval systems: many components make up a system
- Any retrieval system is built up of many more or less independent components.
- To increase the quality of the results, these components can be modified more or less independently of each other.

Information retrieval systems: important components
- the information content
- a system to describe formal aspects of information items
- a system to describe the subjects of information items
- concrete descriptions of information items = the application of the information description systems used
- information storage and retrieval computer program(s)
- the computer system used for retrieval
- the type of medium or information carrier used for distribution

Information retrieval systems: the information content
- The information content is the information that is created or gathered by the producer.
- The information content is independent of software and of distribution media.
- The information content is input into the retrieval system using
  - a system (rules) to describe the formal aspects
  - a system (rules) to describe the contents (classification, thesaurus, ...)

Information retrieval systems: media used for distribution
- Hard copy (for information retrieval systems only in the broad sense): print; microfiche.
- For computers (for information retrieval systems sensu stricto): magnetic tape; floppy disk; optical disk (CD-ROM, Photo-CD, DVD, ...); online.

Information retrieval systems: the computer program
The information retrieval program consists of several modules, including:
- the module that allows the creation of the inverted file(s) = index file(s) = dictionary file(s);
- the search engine, which provides the search features and power that allow the inverted file(s) to be searched;
- the interface between the system and the user, which determines how they (can) interact to search the database (using menus and/or icons and/or templates and/or commands).

What determines the results of a search in a retrieval system?
1. the information retrieval system (= contents + system)
2. the user of the retrieval system and the search strategy applied to the system

Layered structure of a database: Database (File) > Records > Fields > Characters; in many systems there are also relations/links between records.

A simple database architecture: all records together form a database
The "salami architecture" (or "sliced bread architecture"):
- the salami or the bread is a "database"
- each slice of salami or bread is a "database record"
- there are no relations between slices/records
- the retrieval system tries to offer the appropriate slices/records to the user

Question: The database architecture described here is simple, but which factors make retrieval nevertheless a complex procedure in many real databases with this architecture?

Characteristics/definition of structured text information
- The text information is structured (files, records, fields, sub-fields, links/relations among records, ...).
- The length of records and fields can be "long".
- Some fields are multi-valued: they occur more than once (repeated or repeatable fields).

Structure of a bibliographic file: each record has a record number, a title, author fields (name + first name, repeated for each author), a source, and descriptor fields (repeated); fields may contain sub-fields.

Text retrieval and language: an overview
Problems/difficulties related to language/terminology occur:
- in the case of "multi-linguality" ("cross-language information retrieval"), when more than one language is used in the contents of the searched database(s), and/or in the subject descriptors of the searched database(s), or in the search terms used in a query;
- even when only one language is applied throughout the system!

Text retrieval and language: enhancing retrieval
- Retrieval can be enhanced by coping with the problems caused by the use of natural language.
- Contributions to this enhancement can be made by the database producer, the computerized retrieval system, and the searcher/user. (The distinction between these is not very sharp and clear in all cases.)

Task/Assignment: Read about language and information retrieval: Large, Andrew; Tedd, Lucy A.; Hartley, R.J. Chapter 4 in: Information seeking in the online age: principles and practice. London: Bowker-Saur, 1999, 308 pp.

Task/Assignment: Read about information organization: Large, Andrew; Tedd, Lucy A.; Hartley, R.J. Chapter 5 in: Information seeking in the online age: principles and practice. London: Bowker-Saur, 1999, 308 pp.

Text retrieval and language: a word is not a concept
(a) Problem: a word, phrase, or term is not the same as a concept, subject, or topic.
(a') So, to "cover" a concept in a search, i.e. to increase the recall of a search, the user of a retrieval system should consider an expansion of the query; that is, the user should also include other words in the query to "cover" the concept:
(a'') synonyms (such as Latin names of species in biology besides the common names, or scientific names besides common names of substances in chemistry, ...);
(a''') narrower, more specific terms (such as particular brand names), including terms with prefixes (for instance: viruses, retroviruses, rotaviruses, ...); spelling variations (such as UK English versus US English); possible variations after transliteration;
(a'''') singular or plural forms of a noun (when a noun is used as a search term); (relevant) related terms; various forms of a verb (when a verb is used in the query); broader terms (perhaps).
(b) Method to solve the problem at the time of database production: adding to each database record the relevant codes from a classification system or terms from a thesaurus, and providing the user with knowledge about the system used; in some cases this process is computerized (with intellectual intervention or completely automatic).
(b') However, this solution is not perfect: addition of terms by humans from a controlled vocabulary/thesaurus is not easy and is time-consuming. Consequences: the added value lags behind the availability of the document; the process can delay access to the document; the process is expensive. Moreover, in practice most users of the resulting database do not exploit the method offered.
(c) Methods to solve the problem, provided by the computerized retrieval system:
- offering the user partly computerized access to the particular subject description system used by the database producer, then linking to the database for searching;
- computerized, automatic analysis of the "free text" search terms applied in a query by the user, for transparent "mapping" to the corresponding classification codes, categories, or thesaurus terms used by the database producer;
(c') offering the searching user access to a (general) thesaurus system, even when the database producer has not categorized the database contents, so that the user can refine the query; better, and more generally: computerized, automatic expansion of the query terms introduced by the user, based on a general thesaurus (however, not many retrieval systems offer this feature);
(c'') to avoid the problems of possible variations at the end of search terms: offering the user the possibility to truncate a search term explicitly, or computerized, automatic, transparent truncation without explicit user action;
(c''') to avoid the problems of possible prefixes and suffixes: computerized, automatic, transparent, intelligent morphological analysis of the query terms ("stemming" of the "free text" search terms used by the user); however, this does not work perfectly and has not (yet) been implemented in most retrieval systems; for languages with a richer morphology than English, this can offer an even larger pay-off.

Question: Which problems in text retrieval are illustrated by the following sentences?
- Time flies like an arrow.
- Fruit flies like a banana.

Text retrieval and language: ambiguity of meaning
(a) Problem: a word or phrase can have more than one meaning, because natural languages have evolved spontaneously, not strictly controlled. Ambiguity of meaning = polysemy. The meaning can depend on the context, and may depend on the region where the term is used. This is a problem for retrieval: it decreases the precision of many searches.
(a') An example is the word "pascal", which can have several meanings: the philosopher Blaise Pascal, the programming language Pascal, the physical unit of pressure, and the name of many persons. Another example: Turkey, the country, versus turkey, the animal.
(a'') Example sentences:
- The banks of New Zealand flooded our mailboxes with free account proposals.
- The banks of New Zealand flooded with heavy rains account for the economic loss.
(a''') Ambiguity of meaning may be the cause of low precision: a word can match an irrelevant concept that is NOT wanted.
(b) Method to solve the problem at the time of database production: adding to each database record codes from a classification system or terms from a thesaurus, and providing the user with knowledge about the system used; in some cases this process is computerized (completely automatic or with intellectual intervention).
(b') Method provided by the computerized retrieval system: offering the user partly computerized access to the subject description system, then linking to the database for searching.
(b'') Searching normally (without added value), but adding value by categorizing the retrieved items in the presentation phase to assist in "disambiguation"; this feature is offered, for instance, by the public access module of the book catalogue of the library automation system VUBIS at VUB, Belgium, when searching items that were assigned a particular keyword.

Task/Assignment: Search Clusty or Vivisimo or Wisenut as an example of a system that applies automatic, computerized subject categorization of database records.

(b''') Natural language processing of the queries: linguistic analysis to determine possible meanings of the query, which includes disambiguation of words in their context: "lexical" analysis at the level of the word, "semantic" analysis at the level of the sentence. However, most queries are short, and therefore it is difficult to apply semantic analysis for disambiguation.
(b'''') Natural language processing of the documents: linguistic analysis to determine possible meanings of a sentence, including disambiguation of words in their context (lexical analysis at the level of the word, semantic analysis at the level of the sentence). However, most retrieval systems do not apply this complicated method.

A word is not a concept; a concept is not a word
- The most simple relation between words and concepts (one word = one concept) is NOT valid.
- A concept cannot be "covered" by only one word or term; this may be the cause of low recall of a search.
- The meaning of many words is ambiguous; this may be the cause of low precision of a search.

Text retrieval and language: relation with recall and precision
Recapitulating the two problems discussed:
- Expansion of the query allows us to increase the recall.
- Disambiguation of the query allows us to increase the precision.

Text retrieval and language: evolution of meaning
- (a) Difficulty: the meaning of a word or phrase can change over time.
- (b) Method to solve the problem at the time of database production: using a categorization system and continuously adapting it to the changing reality and meanings of terms.

Text retrieval and language: phrases composed of words
- (a) Problem: most retrieval systems can search for words, but they do not directly recognize or "know" phrases/terms composed of more than one word.
- (b) Methods provided by the computerized retrieval system: the user can and should indicate explicitly that a few words should be considered together by the retrieval system as forming a phrase/term (for instance, in many Internet search engines, by putting the phrase in quotes like "three word phrase"); better: the retrieval system automatically recognizes a phrase/term by relying on a term bank created in advance (for example, the Internet search engines AltaVista and Scirus work in this way).

Text retrieval and language: searching more than one database
- (a) Problem: searching various databases at the same time, or merging databases for searching, suffers from the problem that these databases may use categorization systems to reduce the problems of terminology and language, but in most cases these systems are different and incompatible.
- (b) Method provided by the computerized retrieval system: mapping the search term chosen by the user to the various thesaurus terms used by the various databases; only a few retrieval systems try to accomplish this.

Text retrieval and language: relations among concepts
- (a) Difficulty: in many cases, when the user combines several concepts in one search, the user cannot communicate the intended relations among these concepts to the retrieval system.
- (a') Example: concept 1 = children/sons/daughters/...; concept 2 = parents/fathers/mothers/...; concept 3 = beating/violence/... How to find documents on "children beating their parents" while avoiding documents on "parents beating their children"?
- (a''') Example: concept 1 = computers; concept 2 = architecture. How to find documents on "(the application/role/importance of) computers in architecture" while avoiding documents on "the architecture of computers"?
- (b) Method provided by the database producer: offering facilities to the user for disambiguation, as in the simpler case of single terms without combinations with other terms.
- (b') Method provided by the computerized retrieval system: natural language analysis of both the documents and the natural language query, to interpret their structure and meaning.

Text retrieval and language: expressing the purpose of a search
Difficulty: classical queries and retrieval systems work with terms to match the subject, the "aboutness", expressed in the query with the documents, but do not try to express and understand the purpose, aim, and context of the search.

Question: Which are some of the problems caused by the use of language in information retrieval?

Text retrieval and multi-linguality
(1a) Problem: when the user does not know well the language of a (monolingual) database, searching is not efficient.
(1b) Methods to solve the problem at the time of database production: adding subject descriptors in various languages (for instance in Pascal and Francis, made by INIST); adding abstracts in various languages (for instance the English abstracts in INSPEC); translation of the complete contents of the database. These processes can be partly computerized, but they are still time-consuming and expensive.
(1c) Method provided by the computerized retrieval system: translating the user's query using a general multilingual thesaurus; however, most free-text queries are quite short, which makes it difficult to use the context to limit possible ambiguity; disambiguation by user-computer interaction, offered by the query interface, can increase the effectiveness here.
(2a) Problem: when documents in a database are written in more than one language, searching that database in a single language may not be sufficient to retrieve all interesting, relevant documents.
(2b) Method: extensions of the methods used when only one language occurs in the documents.
(3) Problem: when more than one database is searched at the same time, the mechanisms that solve language-related problems in each separate database cannot be applied so well anymore.
(4a) Problem: ideally, the user should be able to understand the contents of all the retrieved documents, even when various languages are used in those documents.
(4b) Methods at the time of database production: adding abstracts in various languages (for instance the English abstracts in INSPEC); translation of the complete contents of the database. These processes can be partly computerized, but they are still time-consuming and expensive.
(4c) Methods provided by the computerized retrieval system: rapid automated translation of the titles, the abstracts, or the complete retrieved records/documents (for instance, as offered by the Internet search engine AltaVista).

A good text retrieval system solves some problems due to language:
- accepts words/terms/phrases in the user's query
- maps the words to corresponding concepts
- presents these concepts to the user, who can then select the appropriate, relevant concept ("disambiguation")
- searches for this concept, even in documents written in another language
- presents the resulting, retrieved documents in the language preferred by the user

Enhanced text retrieval using natural language processing: natural language processing of the documents AND of the query, then comparison and matching of both. The information problem is represented as a query; the text documents are represented as indexed documents; the retrieved, sorted documents feed evaluation and feedback.

Text retrieval and language: conclusions
- The use of terms and language to retrieve information from databases/collections/corpora causes many problems.
- These problems are not recognized, or are underestimated, by many users of search/retrieval systems: the power of retrieval systems is overestimated by many users.
- Much research and development is still needed to enhance text retrieval.

Task/Assignment: Recommended reading: Veal, D.C. Progress in documentation: Techniques of document management: a review of text retrieval and related technologies. Journal of Documentation, Vol. 57, No. 2, March 2001, pp. 192-217.

Task/Assignment: Recommended reading: Chowdhury, G.G., and Chowdhury, Sudatta. Information retrieval in digital libraries. In: Introduction to digital libraries. London: Facet Publishing, 2003, 354 pp.

Question: Explain the basic relations/similarities between speech recognition (speech to text), translation of a text (text to text), summarizing texts (text to summary), text retrieval (query to texts), and cross-language text retrieval (a combination).

Hints on how to use information sources: overview (Part 1)
- Know the purpose and motivation for each search.
- Do not be lazy: search on your own before bothering experts with requests for advice.
- Plan your search in advance.
- Choose the best source(s) for each search.
- Use the available tools for subject searching well.
- Try to cope with the language problems: avoid spelling errors in your search query; use spelling variations in your search query.
Introduction to Information Retrieval — Chapter 1: Boolean Retrieval (lecture slides)
Sec. 1.1
Term-document incidence matrices
(1 if the play contains the word, 0 otherwise)

                        Antony  Brutus  Caesar  Calpurnia  Cleopatra  mercy  worser
Antony and Cleopatra      1       1       1        0          1        1      1
Julius Caesar             1       1       1        1          0        0      0
The Tempest               0       0       0        0          0        1      1
Hamlet                    0       1       1        0          0        1      1
Othello                   0       0       1        0          0        1      1
Macbeth                   1       0       1        0          0        1      0

Example Boolean query: Brutus AND Caesar BUT NOT Calpurnia
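A Boolean query like the one above can be answered directly on the incidence matrix with elementwise logic on the term rows. A minimal sketch (play names and 0/1 rows taken from the table above; the helper name is illustrative, not from the original slides):

```python
# Incidence rows over the six plays, in the order:
# Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_and_not(term_a, term_b, term_not):
    """Answer `term_a AND term_b AND NOT term_not` over the incidence rows."""
    return [play
            for play, a, b, c in zip(plays, incidence[term_a],
                                     incidence[term_b], incidence[term_not])
            if a and b and not c]

print(boolean_and_not("Brutus", "Caesar", "Calpurnia"))
# -> ['Antony and Cleopatra', 'Hamlet']
```

This works for a handful of plays, but the incidence matrix is far too sparse to materialize for a large collection, which is exactly what motivates the inverted index below.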
Postings lists
– On disk, a continuous run of postings is normal and best.
– In memory, can use linked lists or variable length arrays.
– Some tradeoffs in size / ease of insertion.

Dictionary    Postings
Brutus     -> 1  2  4  11  31  45  173  174
Caesar     -> 1  2  4  5   6   16  57   132
Calpurnia  -> 2  31 54 101

Postings are sorted by docID (more later on why).
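Keeping postings sorted by docID is what makes Boolean AND a linear-time merge of two lists. A minimal sketch, using the Brutus and Calpurnia postings from the slide above:

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by docID.
    Runs in O(len(p1) + len(p2)), advancing the pointer of the smaller head."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
# -> [2, 31]
```

The merge never backtracks, which is why the sort order matters: with unsorted postings, each lookup would cost a scan of the whole list.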
Sec. 1.2
Inverted index construction
Information retrieval: given a query Q, compute the relevance of each document D in a collection C to Q and rank the documents accordingly (ranking).

Relevance usually has only relative meaning: for a given Q, the relevance of different documents can be compared, but relevance scores obtained for different queries are not directly comparable.

The relevance computation can take more input, such as the user's background information or query history.

In modern information retrieval, relevance is not the only measure; there are also measures such as importance, authority, and novelty. Put differently, all of these factors influence "relevance".
Moreover, different users' judgments of relevance also differ.
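The ranking view of retrieval described above (score every document against Q, then sort by score) can be sketched in a few lines; `relevance` here is a hypothetical scoring function standing in for whatever relevance model is used:

```python
def rank(query, collection, relevance):
    """Return the documents of `collection` sorted by decreasing
    relevance(doc, query) -- the basic ranking loop of an IR system."""
    return sorted(collection, key=lambda doc: relevance(doc, query), reverse=True)

# Toy usage with a naive relevance: how often the query term occurs in the doc.
docs = ["cat cat dog", "dog", "cat"]
print(rank("cat", docs, lambda d, q: d.split().count(q)))
# -> ['cat cat dog', 'cat', 'dog']
```

Because scores are only comparable within one query, the sorted order is meaningful but the raw score values are not, which matches the point about relative relevance above.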
Basic concepts of information retrieval

Two perspectives for defining "relevance":
– System perspective: the system produces the results and the user is the receiver of the information. This understanding puts the user in a passive position, and research based on it focuses on the system itself. Topical relevance: the topic, i.e. the core content, of the documents retrieved by the system matches the user's information need. The system perspective is not divorced from the user, and relevance defined from the system perspective is simple to compute.
[Figure: "Internet: Past, Present, Future" — number of hosts (millions) plotted over time, with the annotation "The 'Network Effect' kicks in, and the web goes critical".]
Basic concepts of information retrieval

Document: the object of retrieval
– It may be text, or a multimedia document such as an image, video, or speech recording: text retrieval / image retrieval / video retrieval / speech retrieval / multimedia retrieval.
– It may be unstructured, semi-structured, or structured.
Information Retrieval Based Writer Identification

A. Bensefia, T. Paquet, L. Heutte
Laboratoire Perception Systèmes Information, UFR des Sciences, Université de Rouen,
F-76821 Mont-Saint-Aignan Cedex, France.
Ameur.Bensefia@univ-rouen.fr

Abstract

This communication deals with the writer identification task. Our previous work has shown the interest of using graphemes as features for describing the individual properties of handwriting. We propose here to exploit the same feature set, but using an information retrieval paradigm to describe and compare the handwritten query to each sample of handwriting in the database. With this technique the image processing stage is performed only once, before the retrieval process takes place, leading to a significant saving in the computation of each query response compared to our initial proposition. The method has been tested on two handwritten databases. The first one was collected from 88 different writers at the PSI Lab., while the second one contains 39 writers from the original correspondence of Emile Zola, a famous French novelist of the late 19th century. We also analyze the proposed method when using concatenations of graphemes (bi- and tri-grams) as features.

1. Introduction

In this communication we present a methodology for identifying the writer of a document. This task has been defined as assigning an unknown handwritten document to its correct writer among a finite set of possible candidates [9]. The implicit hypothesis behind this task is handwriting individuality. This assumption has proved to be well founded, since interesting performance has already been obtained in various experiments [9, 10, 11]. Two main approaches have generally been considered in the literature, depending on the kind of features used to characterize each handwriting.
When a sufficient amount of handwritten material is available in the image, a few global robust features defined on text blocks can provide the needed information. On the contrary, when the input image contains few samples of handwriting, local features are required to characterize the handwritten fragments.

In this communication, as in our previous work [5], the second approach is adopted. This choice leads to representing each handwritten input image in a high-dimensional feature space, in order to capture the whole variability over the database. As a consequence, the writer identification task consists in finding similar documents represented in a high-dimensional feature space. In the field of information retrieval (IR), this problem has been intensively studied and is still motivating a large amount of research, especially due to the need for web document retrieval.

In this communication we investigate the use of one of the most popular schemes used in IR [1] and apply it to the task of writer identification. In part 2 of this communication we recall our previous approach. Part 3 is devoted to the presentation of the IR model known in the literature as the "vector space model". In parts 4 and 5 we evaluate the proposed approach on two different databases. The first one was constituted in our lab, originally for recognition purposes; it contains 88 different writers. The second one contains 39 writers who took part in a correspondence with Emile Zola, a French novelist of the late 19th century.

2. Writer Identification

In our previous work [6] an original approach for writer identification was proposed, based on local features such as graphemes. This study also showed that, although prone to variability, each handwriting can be characterized by a set of invariant features, also called the writer's invariants.
Writer identification can be efficiently carried out using the writer's invariants instead of elementary graphemes, with no significant loss in identification performance.

Each grapheme is produced by the segmentation module of our recognition system [6, 7]. In this system, letter hypotheses are analyzed up to the concatenation of 3 consecutive graphemes.

Each handwritten document $D_j$ is thus described by the set of graphemes $x_i$ it is made of:

$$D_j = \{\, x_i,\ i \le \mathrm{card}(D_j) \,\} \quad (2.1)$$

A similarity measure between an unknown handwritten document $Q$ and a reference document $D$ in the database can be defined according to the following relation:

$$\mathrm{SIM}(Q, D) = \frac{1}{\mathrm{card}(Q)} \sum_{i=1}^{\mathrm{card}(Q)} \max_{x_j \in D} \, \mathrm{sim}(y_i, x_j) \quad (2.2)$$

where $y_i$ and $x_j$ are graphemes that belong respectively to document $Q$ and document $D$, and $\mathrm{sim}(y_i, x_j)$ is a similarity measure between two graphemes. Among many others, the correlation measure has been chosen for its average properties. Two documents are therefore all the closer as this measure is close to one. The writer of document $Q$ is determined as the writer of the closest document in the database:

$$\mathrm{Writer}(Q) = \mathrm{Writer}\Big( \arg\max_{D_j \in \mathrm{base}} \mathrm{SIM}(Q, D_j) \Big) \quad (2.3)$$

The first evaluation of this approach was carried out on a database of 88 writers constituted in our lab. Two experiments were conducted: the first one was designed to measure the performance of the approach on large blocks of text; the second one was designed for small handwritten queries.

The results are encouraging, giving rise to correct identification in nearly 98% of the cases when working with large handwritten samples as queries (typically 3 lines of text). When dealing with small queries (typically 50 graphemes), the correct writer was determined in nearly 93% of the cases. These results have shown the interest of using graphemes as local features for writer identification.

Two major drawbacks of this approach can however be pointed out.
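Equations (2.1)-(2.3) can be sketched as follows. This is a minimal sketch, not the authors' implementation: graphemes are treated as opaque objects, and `sim` is a placeholder for the grapheme correlation measure the paper uses:

```python
def doc_similarity(query_graphemes, doc_graphemes, sim):
    """SIM(Q, D) of eq. (2.2): mean, over the query graphemes y_i,
    of the best similarity sim(y_i, x_j) over the document graphemes x_j."""
    return sum(max(sim(y, x) for x in doc_graphemes)
               for y in query_graphemes) / len(query_graphemes)

def identify_writer(query_graphemes, database, sim):
    """Eq. (2.3): the writer of the closest document in the database.
    `database` maps a document id to a (writer, graphemes) pair."""
    best = max(database,
               key=lambda d: doc_similarity(query_graphemes, database[d][1], sim))
    return database[best][0]
```

Note that every query grapheme is matched against every document grapheme, which is exactly the O(T²N) cost per query that motivates the IR reformulation in the next section.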
The first one is that it is computationally expensive, due to the pattern matching technique employed. If T is the average size of a document, then the complexity of the retrieval process is O(T²N), where N is the number of documents in the database. The second one arises when using invariant graphemes as features: in this case, when calculating the similarity between two documents, each feature is assigned the same weight, no matter its effective frequency in the document.

3. Information Retrieval Model

Information retrieval techniques have been designed to query textual documents described in a high-dimensional feature space, such as terms. The problems of binary feature encoding and document querying have therefore been particularly studied in this field. An information retrieval system is characterized by [2]:
• the set of documents that constitute the database;
• an information retrieval model that orders the documents in the database according to their respective similarity with the query;
• document processing: documents are processed in order to gather statistical information.

One of the most popular models in IR was proposed by Salton [1]. Its first advantage is to propose a retrieval model that integrates the description of the documents and the query in a single high-dimensional feature space. High dimensionality ensures a minimum loss of information when describing each document in the database as well as the query. The second advantage is that, once the feature space has been defined, each document can be described independently from the query, thus avoiding any other access to the document content when responding to a query. This last point is of particular interest for our problem of writer identification, which otherwise requires intensive image matching. Although very simple, this model is still popular in the IR community [3, 4].

Various kinds of features can be used to describe an electronic document.
They can be words, n-grams, letters, HTML tags, etc. In the feature space, a similarity measure is then defined between the query and each document, giving an ordered list of relevant documents with respect to the query content. Two distinct steps are required: the indexing phase concerns the processing of each document in order to obtain a high-dimensional vector that describes the document; the retrieval phase concerns the calculation of the relevance score of each document for a particular query.

3.1. Indexing phase

Assume a binary feature set has been chosen, and denote by $\varphi_i$, $1 \le i \le m$, the $i$-th binary feature. For IR purposes, a feature is all the more relevant to describe a document as it is relatively frequent in this document compared to the other documents of the database. Using this principle, each document $D_j$, as well as the query $Q$, can be described as follows:

$$\vec{D_j} = (a_{0,j}, a_{1,j}, \ldots, a_{m-1,j})^T \quad (3.1)$$
$$\vec{Q} = (b_0, b_1, \ldots, b_{m-1})^T \quad (3.2)$$

where $a_{i,j}$ and $b_i$ are the weights assigned to each characteristic $\varphi_i$, defined by:

$$a_{i,j} = \mathrm{FF}(\varphi_i, D_j)\, \mathrm{IDF}(\varphi_i) \quad (3.3)$$
$$b_i = \mathrm{FF}(\varphi_i, Q)\, \mathrm{IDF}(\varphi_i) \quad (3.4)$$

$\mathrm{FF}(\varphi_i, D_j)$ is the feature frequency in document $D_j$. $\mathrm{IDF}(\varphi_i)$ is the inverse document frequency, based on the inverse of the number of documents that contain the characteristic $\varphi_i$; it is exactly defined by:

$$\mathrm{IDF}(\varphi_i) = \log\left( \frac{n + 1}{\mathrm{DF}(\varphi_i) + 1} \right) \quad (3.5)$$

where $n$ denotes the total number of documents in the database and $\mathrm{DF}(\varphi_i)$ is the document frequency, i.e. the number of documents that contain this characteristic. Notice that $\mathrm{IDF}(\varphi_i) = 0$ when $\varphi_i$ occurs in every document. Such characteristics will therefore be given a null score and should indeed be eliminated from the feature set.

3.2. Retrieval phase

Each document, as well as the query, being described in the same high-dimensional feature space, a similarity measure between a document and the query is required to provide an ordered list of pertinent documents. Many similarity measures have been proposed in the literature.
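The indexing phase of equations (3.1)-(3.5) can be sketched as follows, assuming documents are given as lists of hashable features (graphemes in the paper, but any tokens here); function names are illustrative:

```python
import math

def idf(feature, documents):
    """Eq. (3.5): IDF(phi) = log((n + 1) / (DF(phi) + 1)).
    Evaluates to zero when the feature occurs in every document."""
    df = sum(1 for doc in documents if feature in doc)
    return math.log((len(documents) + 1) / (df + 1))

def tfidf_vector(doc, feature_set, documents):
    """Eqs. (3.1)-(3.4): weight each feature by its frequency in `doc`
    (FF) times its inverse document frequency over the collection."""
    return [doc.count(f) * idf(f, documents) for f in feature_set]
```

With two documents `[["a", "b", "b"], ["a", "c"]]`, the feature `"a"` occurs everywhere, so `idf("a", …)` is exactly `log(3/3) = 0`, matching the remark that such features should be eliminated from the feature set.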
Most of them are defined on binary feature vectors, such as Dice, Jaccard, or Okapi. When dealing with real-valued feature vectors, a similarity measure can be defined by the normalized inner product of the two vectors, i.e. by the cosine of the angle between them. The similarity measure between document $D_j$ and the query $Q$ is therefore defined by:

$$\cos(Q, D_j) = \frac{\sum_{\varphi} \mathrm{TFIDF}_{\varphi,Q}\, \mathrm{TFIDF}_{\varphi,D_j}}{\sqrt{\sum_{\varphi} \mathrm{TFIDF}_{\varphi,Q}^2}\ \sqrt{\sum_{\varphi} \mathrm{TFIDF}_{\varphi,D_j}^2}} \quad (3.6)$$

where the two terms in the denominator are the lengths of the query and the document respectively. Compared to the direct pattern matching method, the retrieval process has a complexity of O(TN), where T is the size of the feature vector and N the number of documents in the database.

4. IR applied to writer identification

In this section we discuss the implementation of the IR model for the writer identification task. The central point lies in the definition of a common feature space over the entire database. The indexing and retrieval phases can then be implemented following the definitions given in section 3. Let us recall that our initial work implemented writer identification based on local features such as graphemes (see section 2). Besides, we have shown that writer identification can be efficiently carried out using invariant clusters within the set of graphemes of each writer.

Therefore, the writer's invariants can be viewed as binary features defined within the writer's set of graphemes. In order to define a set of binary features common to all the handwritten documents, it is required to cluster all the graphemes of the database. For this purpose, the procedure described in [6] is used; we briefly recall its main characteristics. Many sequential clustering phases are iterated with random selection, each of them providing a variable number of clusters.
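The cosine measure of equation (3.6) is a short function over two TF-IDF vectors; this sketch adds a guard for zero-length vectors, which the equation itself leaves undefined:

```python
import math

def cosine(q, d):
    """Eq. (3.6): normalized inner product of two TF-IDF vectors.
    Returns 0.0 when either vector has zero length (not covered by the equation)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq > 0 and nd > 0 else 0.0
```

Because each document vector can be precomputed and normalized during indexing, scoring a query against the whole database is a single pass over N vectors of size T, i.e. the O(TN) retrieval cost stated above.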
The invariant clusters are defined as the groups of patterns that have been clustered together after each sequential clustering phase.

Figure 2 gives some of the most frequent clusters obtained on the PSI_BASE (see the next section for details). These features can occur for different writers. A feature is all the more pertinent as it belongs to a low number of writers. TF-IDF scores are thus calculated for each feature and each document during the indexing phase.

Figure 2. Some invariant clusters of the PSI_BASE.

5. Experiments

5.1. Description of the databases

Two databases have been used for this experiment. The first one (PSI_BASE) contains 88 writers who were asked to copy a letter containing 107 words. The scanned images were divided into two parts: two thirds of each page for the learning base and one third for the test base. The second base (ZOLA_BASE) contains 39 writers who took part in a correspondence with Emile Zola, a famous novelist of the late 19th century (1840-1902). These images were scanned from a microfilm with a resolution of 300 dpi. They present a higher degree of difficulty than those of the PSI_BASE for various reasons: presence of noise, overlapping lines, slant, and the type of nib or quill used at the end of the 19th century. Finally, this database contains completely free writing. The original microfilm contains nearly 700 documents. This database was first inspected and manually annotated in order to discard irrelevant areas, such as printed zones, marks, etc., from the analysis. Although it contains a relatively large number of documents, they are far from being equally distributed among writers, and the number of words can vary dramatically from one document to another. For these reasons, the ZOLA_BASE was designed with text blocks having a sufficient amount of information and, at the same time, with a sufficient number of writers. The result was thus a compromise between these two criteria.
The learning base contains 39 documents, each containing between 5 and 7 handwritten lines, while the test base contains text blocks between 3 and 5 lines long. Figure 3 gives some samples of the ZOLA_BASE.

Figure 3. Some samples of the ZOLA_BASE.

Due to the variability of the ZOLA_BASE, it was necessary to modify our segmentation algorithm so that it operates on slanted connected components without any knowledge of the reference lines. Therefore, the graphemes produced by this segmentation step can differ from those produced on the PSI_BASE. Figure 4 gives the segmentation result on one example.

Figure 4. Segmentation produced on the ZOLA_BASE.

As connected graphemes can be grouped together to produce either bi- or tri-grams (a larger window could eventually be used), writer identification has been carried out on these three levels. Indeed, although our previous study has shown that graphemes are good local features, it is unclear whether concatenations of these features can better characterize a writing or not. Table 1 summarizes the properties of the two databases on the three levels of analysis.

                               level 1   level 2   level 3
PSI_BASE     # graphemes        43178     25088     15953
             # binary features   7230     13876     12722
ZOLA_BASE    # graphemes        25907     15647     10670
             # binary features   3567      5489      6266

Table 1. Properties of the two databases.

5.2. Results

Figure 5 gives the performance of the approach on the PSI_BASE. It shows that the correct writer is determined in 93% (83/88) of the cases using first-level graphemes. The identification rate rises to 95.45% (84/88) using bi-grams as features, while tri-grams give only 80% (70/88) correct identification. Let us recall that in our initial work [5] a correct identification rate of 97% was obtained on first-level graphemes, but intensive pattern matching was required in that case.

This first result shows that the vector space model of IR is pertinent for the task of writer identification when using local features.
Furthermore, bi-gram features may be even better features for the task.

Two reasons can explain the lower performance obtained on tri-grams. The first one is that, tri-gram features being more numerous, each one of them is less frequent and therefore cannot be as representative of a particular writer as lower-level features (bi-grams or graphemes). The second reason is that tri-gram features may be more dependent on the textual content. Therefore, while such a feature may be pertinent for the writer, its frequency may be so low (due to the low frequency of the textual passage) that the size of our database does not allow us to measure it.

Results obtained on the ZOLA_BASE are significantly lower than those obtained on the PSI_BASE. The particularities of this database, given in section 5.1, can explain these results. Nevertheless, the method allows a correct identification of 93.3% (36/39) in the top 5 propositions. However, bi-gram features are not as informative as on the PSI_BASE.

Figure 5. Writer identification on the PSI_BASE.

Figure 6. Writer identification on the ZOLA_BASE.

6. Conclusion

In this communication we have presented an information retrieval based writer identification method. The results obtained are comparable to those presented in our previous work, but the information retrieval model has a linear complexity, which is one order less than our initial method.

The method has been tested on two different databases. On a clean database the method performs very well; furthermore, it is shown that bi- or tri-gram features can also bring interesting information about the writer.
On a noisier database, performance decreases, but the method still provides an interesting means to query handwritten documents.

Acknowledgement

This study was sponsored by the French program CNRS STIC-SHS. The authors are grateful to DPCI for scanning the microfilm of Zola's letters.

References

[1] G. Salton, A. Wong, "A Vector Space Model for Automatic Indexing", Information Retrieval and Language Processing, pp. 613-620, 1975.
[2] P. Schäuble, Multimedia Information Retrieval: Content-Based Information Retrieval from Large Text and Audio Databases, Kluwer Academic Publishers, 1997.
[3] B. Pouliquen, D. Delamane, P. Lebeux, "Indexation des textes médicaux par extraction de concepts et ses utilisations", JADT, 6èmes journées d'analyse statistique des données textuelles, 2002.
[4] D. Memmi, "Le modèle vectoriel pour le traitement des …", Grenoble, France, n°14, 2000.
[5] A. Bensefia, A. Nosary, T. Paquet, L. Heutte, "Writer Identification by Writer's Invariants", International Workshop on Frontiers in Handwriting Recognition, IWFHR, pp. 274-279, 2002.
[6] A. Nosary, L. Heutte, T. Paquet, Y. Lecourtier, "Defining writer's invariants to adapt the recognition task", Proc. ICDAR'99, Bangalore (India), pp. 765-768, 1999.
[7] A. Nosary, Reconnaissance automatique des textes manuscrits par adaptation au scripteur, Thèse de Doctorat, Université de Rouen, 2002.
[8] H.E.S. Said, T.N. Tan, K.D. Baker, "Personal Identification Based on Handwriting", Pattern Recognition, vol. 33, pp. 149-160, 2000.
[9] S. Srihari, S. Cha, H. Arora, S. Lee, "Individuality of Handwriting: A Validity Study", Proc. ICDAR'01, Seattle (USA), pp. 106-109, 2001.
[10] U.V. Marti, R. Messerli, H. Bunke, "Writer Identification Using Text Line Based Features", Proc. ICDAR'01, Seattle (USA), pp. 101-105, 2001.
[11] E.N. Zois, V. Anastassopoulos, "Morphological Waveform Coding for Writer Identification", Pattern Recognition, vol. 33, n°3, pp. 385-398, 2000.