distributed information retrieval
3. 液体化学品泄漏演习方案
附:参考演练方案1.目标Objective该“泄露演习方案”旨在提供一种可供参考的液体化学品/材料泄露控制与处置方案,包括泄露或释放源现场控制和补救措施。
各施工队伍可以此为参考检查自身在化学品泄露控制及应急响应程序和工具准备方面的完整性和可靠性。
The objective of the “spill drill scenario” is to test the preparedness and the integrity of the spill team’s procedures and response, which includes stopping the source of the release, containing the release and commencing the remedial actions.
2.定义 Definition
MSDS(物料安全数据表或化学品安全说明书)-是化学品生产商和进口商用来阐明化学品的理化特性(如闪点、易燃度、反应活性等)以及对使用者健康(如致癌、致畸等)可能产生的危害的一份文件。
MSDS (Material Safety Data Sheet)- A document containing information about a material or chemical including the chemical and generic name of its ingredients, the chemical and physical properties of the substance, health hazard information and precautions for safe use and handling.泄露处理包- 采用亲油性超细纤维无纺布制作,不含化学药剂,不会造成二次公害,能迅速吸收本身重量数十倍的油污、有机溶剂、碳氢化合物、植物油等液体的处理材料(以吸油棉为最常见);用于在发生液体溢出或泄露时吸收外溢液体控制进一步扩散。
信息检索结课论文
信息检索结课论文
题目:基于网络的信息检索应用研究
学院:计算机科学与工程学院
专业:软件工程
学生:
学号:
授课教师:

基于网络的信息检索应用研究
王扬波(大学计算机学院电子与通信工程)
摘要:网络信息检索一般指因特网检索,是通过网络接口软件,用户可以在一终端查询各地上网的信息资源。
这一类检索系统都是基于互联网的分布式特点开发和应用的,即:数据分布式存储,大量的数据可以分散存储在不同的服务器上;用户分布式检索,任何地方的终端用户都可以访问存储数据;数据分布式处理,任何数据都可以在网上的任何地方进行处理。
本文对基于网络的信息检索应用进行研究,并分析了其局限。
关键词:信息检索;网络;分布式
Research on the Application of Information Retrieval Based on Network
XX(xx)
Abstract: Network information retrieval generally refers to Internet retrieval: through network interface software, users at a terminal can query information resources published anywhere on the network. Retrieval systems of this kind are developed and applied on the basis of the distributed nature of the Internet, that is, data can be stored in a distributed way on different servers, users anywhere can access the stored data, and data can be processed in any part of the network. In this paper, we study the application of network-based information retrieval and analyze its limitations.
Key words: information retrieval; network; distributed
1 网络信息检索简介
随着信息技术的飞速发展,信息已成为全社会的重要资源,对信息的占有程度及信息处理水平的先进程度已成为衡量一个国家或地区现代化程度的重要标志,而网络上丰富的信息在更大程度上改变了人们的工作和生活的方式。
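The abstract above names three distributed characteristics: distributed storage of data, distributed retrieval by users, and distributed processing. The following is a minimal scatter-gather sketch of that idea; it is not from the paper, the shard contents, the term-overlap scoring, and the function names are illustrative assumptions, and each "server" is simulated as an in-process shard.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "server" holds its own shard of documents (distributed storage).
SHARDS = [
    {"d1": "distributed information retrieval on the web",
     "d2": "network interface software for terminals"},
    {"d3": "users query information resources from any terminal",
     "d4": "data stored on different servers"},
]

def search_shard(shard, query):
    """Score documents in one shard by simple term overlap (local processing)."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in shard.items():
        score = len(terms & set(text.lower().split()))
        if score:
            hits.append((score, doc_id))
    return hits

def distributed_search(query, shards=SHARDS, top_k=3):
    """Send the query to every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda s: search_shard(s, query), shards)
    merged = [hit for hits in partial for hit in hits]
    return sorted(merged, reverse=True)[:top_k]

print(distributed_search("information retrieval from a terminal"))
```

A real system would rank with a proper retrieval model and merge global term statistics across shards, but the fan-out-and-merge structure is the same.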
信息检索关键词部分
信息检索关键词部分Key word第1章信息检索(Information Retrieval, IR)数据检索(data retrieval)相关性(relevance)推送(Push)超空间(hyperspace)拉出(pulling)⽂献逻辑表⽰(视图)(logical view of the document)检索任务(retrieval task 检索(retrieval )过滤(filtering)全⽂本(full text)词⼲提取(stemming)⽂本操作(text operation)标引词(indexing term)信息检索策略(retrieval strategy)光学字符识别(Optical Character Recognition, OCR)跨语⾔(cross-language)倒排⽂档(inverted file)检出⽂献(retrieved document)相关度(likelihood)信息检索的⼈机交互界⾯(human-computer interaction, HCI)检索模型与评价(Retrieval Model & Evaluation)⽂本图像(textual images)界⾯与可视化(Interface & Visualization)书⽬系统(bibliographic system)多媒体建模与检索(Multimedia Modeling & Searching)数字图书馆(Digital Library)检索评价(retrieval evaluation)标准通⽤标记语⾔(Standard Generalized Markup Language, SGML)标引和检索(indexing and searching)导航(Navigation)并⾏和分布式信息检索(parallel and distribution IR)模型与查询语⾔(model and query language)导航(Navigation)有效标引与检索(efficient indexing and searching)第2章特别检索(ad hoc retrieval)过滤(filtering)集合论(set theoretic)代数(algebraic)概率(probabilistic 路由选择(routing)⽤户需求档(user profile)阙值(threshold)权值(weight)语词加权(term-weighting)相似度(similarity)相异度(dissimilarity)域建模(domain modeling)叙词表(thesaurus)扁平(flat)⼴义向量空间模型(generalized vector space model)神经元(neuron)潜语义标引模型(latent semantic indexing model)邻近结点(proximal node)贝叶斯信任度⽹络(Bayesian belief network)结构导向(structure guided)结构化⽂本检索(structured text retrieval, STR)推理⽹络(inference network)扩展布尔模型(extended Boolean model)⾮重叠链表(non-overlapping list)第3章检索性能评价(retrieval performance evaluation)会话(interactive session)查全率(R, Recall Ratio) 信息性(Informativeness)查准率(P, Precision Ratio) ⾯向⽤户(user-oriented)漏检率(O, Omission Ratio) 新颖率(novelty ratio)误检率(M, Miss Ratio) ⽤户负担(user effort)相对查全率(relative recall)覆盖率(coverage ratio)参考测试集(reference test collection)优劣程度(goodness)查全率负担(recall effort)主观性(subjectiveness)信息性测度(informativeness measure)第4章检索单元(retrieval unit)字母表(alphabet)分隔符(separator)复合性(compositional)模糊布尔(fuzzy Boolean)模式(pattern)SQL(Structured Query Language, 结构化查询语⾔) 布尔查询(Boolean query)参照(reference)半结合(semijoin)标签(tag)有序包含(ordered inclusion)⽆序包含(unordered inclusion)CCL(Common Command Language, 通⽤命令语⾔) 树包含(tree inclusion)布尔运算符(Boolean operator) searching allowing errors容错查询Structured Full-text relevance feedback 相关反馈Query Language (SFQL) (结构化全⽂查询语⾔) extended patterns扩展模式CD-RDx Compact Disk Read only Data exchange (CD-RDx)(只读磁盘数据交换)WAIS (⼴域信息服务系统Wide Area Information Service)visual query languages. 
查询语⾔的可视化查询语法树(query syntax tree)第5章query reformulation 查询重构 query expansion 查询扩展 term reweighting 语词重新加权相似性叙词表(similarity thesaurus)User Relevance Feedback⽤户相关反馈 the graphical interfaces 图形化界⾯簇(cluster)检索同义词(searchonym) local context analysis局部上下⽂分析第6章⽂献(document)样式(style)元数据(metadata)Descriptive Metadata 描述性元数据 Semantic Metadata 语义元数据intellectual property rights 知识产权 content rating 内容等级digital signatures数字签名 privacy levels 权限electronic commerce电⼦商务都柏林核⼼元数据集(Dublin Core Metadata Element Set)通⽤标记语⾔(SGML,standard general markup language)机读⽬录记录(Machine Readable Cataloging Record, MARC)资源描述框架(Resource Document Framework, RDF) XML(eXtensible Markup Language, 可扩展标记语⾔) HTML(HyperText Markup Language, 超⽂本标记语⾔)Tagged Image File Format (TIFF标签图像⽂件格式)Joint Photographic Experts Group (JPEG) Portable Network Graphics (PNG新型位图图像格式)第7章分隔符(separator)连字符(hyphen)排除表(list of stopwords)词⼲提取(stemming)波特(porter)词库(treasury of words)受控词汇表(controlled vocabulary)索引单元(indexing component)⽂本压缩text compression 压缩算法compression algorithm注释(explanation)统计⽅法(statistical method)赫夫曼(Huffman)压缩⽐(compression ratio)数据加密Encryption 半静态的(semi-static)词汇分析lexical analysis 排除停⽤词elimination of stopwords第8章半静态(semi-static)191 词汇表(vocabulary)192事件表(occurrence)192 inverted files倒排⽂档suffix arrays后缀数组 signature files签名档块寻址(block addressing)193 索引点(index point)199起始位置(beginning)199 Vocabulary search词汇表检索Retrieval of occurrences 事件表检索 Manipulation of occurrences事件表操作散列变换(hashing)205 误检(false drop)205查询语法树(query syntax tree)207 布鲁特-福斯算法简称BF(Brute-Force)故障(failure)210 移位-或(shift-or)位并⾏处理(bit-parallelism)212顺序检索(sequential search)220 原位(in-place)227第9章并⾏计算(parallel computing) SISD (单指令流单数据流)SIMD (单指令流多数据流) MISD (多指令流单数据流)MIMD (多指令流多数据流)分布计算(distributed computing)颗粒度(granularity)231 多任务(multitasking)I/O(input/output)233 标引器(indexer)映射(map)233 命中列表(hit-list)全局语词统计值(global term statistics)线程(thread)算术逻辑单元(arithmetic logic unit, ALU 中介器(broker)虚拟处理器(virtual processor)240分布式信息检索(distributed information retrieval)249⽂献收集器(gatherer)主中介器(central broker)254第10章信息可视化(information visualization)图标(icon)260颜⾊凸出显⽰(color highlighting)焦点+背景(focus-plus-context)画笔和链接(brushing and linking)魔术透镜(magic lenses)移动镜头和调焦(panning and zooming)弹性窗⼝(elastic window)概述及细节信息(overview plus details)⾼亮⾊显⽰(highlight)信息存取任务(information access tasks)⽂献替代(document surrogate)常见问题(FAQ, Frequently Asked Question) 群体性推荐(social recommendation)上下⽂关键词(keyword-in-context, KWIC)伪相关反馈(pseudo-relevance feedback)重叠式窗⼝(overlapping window)⼯作集(working set)第11/12章多媒体信息检索(Multimedia Information Retrieval, MIR)超类(superclass)半结构化数据(semi-structured data)数据⽚(data blade)可扩充型系统(extensible type system)相交(intersect)动态服务器(dynamic server)叠加(overlaps)档案库服务器(archive server)聚集(center)逻辑结构(logical structure)词包含(contain word)例⼦中的查询(query by example)路径名(path-name)通过图像内容查询(Query by Image Content, QBIC)图像标题(image header)主要成分分析(Principal Component Analysis, PCA)精确匹配(exact match)潜语义标引(Latent Semantic Indexing, LSI)基于内容(content-based)范围查寻(Range Query)第13章exponential growth指数增长 Distributed data 数据的分布性volatile data 不稳定数据 redundant data 冗余数据Heterogeneous data异构数据分界点(cut point)373Centralized Architecture集中式结构收集器-标引器(crawler-indexer)373 Wanderers 漫步者 Walkers 步⾏者 Knowbots 知识机器⼈Distributed Architecture分布式结构 gatherers 收集器brokers 中介器 the query interface 查询界⾯the answer interface响应界⾯ PageRank ⽹页级别Crawling the Web漫游Web breadth-first ⼴度优先depth-first fashion 深度优先 Indices(index pl.)索引Web Directories ⽹络⽬录 Metasearchers元搜索引擎Teaching the User⽤户培训颗粒度(granularity)384超⽂本推导主题检索(Hypertext Included Topic Search, HITS)380 Specific queries专指性查询 
Broad queries 泛指性查询Vague queries模糊查询 Searching using Hyperlinks使⽤超链接搜索Web Query Languages查询语⾔ Dynamic Search 动态搜索Software Agents 软件代理鱼式搜索(fish search)鲨鱼搜索(shark search)拉出/推送(pull/push)393门户(portal)395 Duplicated data 重复数据第14章联机公共检索⽬录(online public access catalog, OPAC)397化学⽂摘(Chemical Abstract, CA)399 ⽣物学⽂摘(Biological Abstract, BA)⼯程索引(Engineering Index,EI)国会图书馆分类法(Library of Congress Classification)408杜威⼗进分类法(Dewey Decimal Classification)408联机计算机图书馆中⼼(Online Computer Library Center, OCLC)409机读⽬录记录(Machine Readable Cataloging Record, MARC)409第15章NSF (National Science Foundation, 美国国家科学基⾦会)NSNA(National Aeronautics and Space Administration,美国航空航天局)数字图书馆创新项⽬(Digital Libraries Initiative, DLI)4155S(stream,信息流structure,结构space, 空间scenario, 场景society社会)416基于数字化对象标识符(Digital Object Identifier, DOI)420都柏林核⼼(Dublin Core, DC)430 数字图书馆(Digital Library, DL)资源描述框架(Resource Document Framework, RDF)431text encoding initiative (TEI) (⽂本编码创新项⽬)431v。
Distributed Databases and Applications [English]
John Wieczorek Museum of Vertebrate Zoology, UC Berkeley
DiGIR
Distributed Databases
Multiple sources of data …under local control, …with concepts in common …and a desire to deliver data as part of a community.
Project Goals
To define a protocol for retrieving structured data from multiple, heterogeneous databases across the Internet To build a reference implementation of both provider and portal software using said protocol
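The goal stated above, retrieving structured data from many heterogeneous databases through one protocol with a provider/portal split, can be sketched roughly as follows. This is only an illustration of the idea, not the actual DiGIR protocol or its message format; the provider schemas, concept names, and mapping tables are invented for the example.

```python
# Two providers expose the same kind of records under different local schemas.
PROVIDER_A = [{"sci_name": "Puma concolor", "ctry": "US"},
              {"sci_name": "Lynx rufus", "ctry": "MX"}]
PROVIDER_B = [{"species": "Puma concolor", "country_code": "BR"}]

# Each provider maps its local fields onto concepts the community shares.
MAPPINGS = {
    "A": {"ScientificName": "sci_name", "Country": "ctry"},
    "B": {"ScientificName": "species", "Country": "country_code"},
}
SOURCES = {"A": PROVIDER_A, "B": PROVIDER_B}

def query_provider(name, concept, value):
    """Translate a shared-concept filter into the provider's local schema."""
    local_field = MAPPINGS[name][concept]
    return [
        dict({c: rec[f] for c, f in MAPPINGS[name].items()}, provider=name)
        for rec in SOURCES[name]
        if rec.get(local_field) == value
    ]

def portal_search(concept, value):
    """Fan the same structured query out to every provider and merge the answers."""
    results = []
    for name in SOURCES:
        results.extend(query_provider(name, concept, value))
    return results

print(portal_search("ScientificName", "Puma concolor"))
```

A real provider would translate the shared-concept filter into a query against its own database rather than filtering records in memory.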
LifeMapper Global Biodiversity Information Facility (GBIF)
Distributed vs. centralized
Multiple sources of data …under local control, …with concepts in common …and a desire to deliver data as part of a community
计算机科学与技术学院申请博士学位发表学术论文的规定(2008.9上网)
计算机科学与技术学院申请博士学位发表学术论文的规定根据《华中科技大学申请博士学位发表学术论文的规定》,我院博士研究生申请博士学位前,须按以下要求之一发表学术论文:1、A类、B类或学院规定的国际顶尖学术会议论文一篇;2、SCI期刊论文一篇,C类一篇,国内权威刊物一篇;3、SCI期刊论文一篇,国内权威刊物二篇;4、SCI期刊论文一篇,C类二篇。
A、B、C类期刊参照《华中科技大学期刊分类办法》中规定的计算机科学与技术及其它相关学科的期刊执行,其中C类含被EI检索的国际会议论文。
学院规定的国内权威刊物指中国科学、科学通报、Journal of Computer Science and Technology、计算机学报、软件学报、计算机研究与发展、Frontiers of Computer Science in China、电子学报、自动化学报、通信学报、数学学报、应用数学学报、计算机辅助设计与图形学学报及其它相关学科的一级学会学报。
学位申请人发表或接收发表的学术论文中,至少有一篇是以外文全文在C类及以上刊物上发表。
学位申请人发表或被接收发表的学术论文必须是其学位论文的重要组成部分,是学位申请人在导师指导下独立完成的科研成果,以华中科技大学为第一署名单位,以申请人为第一作者(与导师共同发表的论文,导师为第一作者,申请人可以第二作者)。
对于“同等贡献作者”排名的认定,参照《华中科技大学期刊分类办法》(校人[2008]28号文)执行。
本规定自2008年入学博士生起执行。
本规定的解释和修改权属计算机科学与技术学院学位审议委员会。
华中科技大学计算机科学与技术学院学位审议委员会二○○八年九月一日为提高研究生培养质量、提高学术水平、促进国际学术交流,经计算机学院学位审议委员会研究决定,国际顶尖学术会议分为A、B两类,分类如下:一、A类1. International Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS)2. ACM Conference on Computer and Communication Security (CCS)3. USENIX Conference on File and Storage Techniques (FAST)4. International Symposium on High Performance ComputerArchitecture (HPCA)5. International Conference on Software Engineering (ICSE)6. International Symposium on Computer Architecture (ISCA)7. USENIX Conference on Operating System and Design (OSDI)8. ACM SIGCOMM Conference (SIGCOMM)9. ACM Annual International ACM SIGIR Conference on Research andDevelopment in Information Retrieval (SIGIR)10. International Conference on Management of Data and Symposium onPrinciples of Database Systems (SIGMOD/PODS)11. ACM Symposium on Operating Systems Principles (SOSP)12. Annual ACM Symposium on Theory of Computing (STOC)13. USENIX Annual Technical Conference (USENIX)14. ACM International Conference on Virtual Execution Environments(VEE)15. International Conference on Very Large Data Bases (VLDB)二、B类1. International Conference on Dependable Systems and Networks (DSN)2. IEEE Symposium on Foundations of Computer Science (FOCS)3. IEEE International Symposium on High Performance DistributedComputing (HPDC)4. International Conference on Distributed Computing Systems (ICDCS)5. International Conference on Data Engineering (ICDE)6. IEEE International Conference on Network Protocols (ICNP)7. ACM International Conference on Supercomputing (ICS)8. International Joint Conference on Artificial Intelligence (IJCAI)9. IEEE Conference on Computer Communications (INFOCOM)10. ACM SIGKDD International Conference on Knowledge Discovery andData Mining (KDD)11. Annual IEEE/ACM International Symposium on Microarchitecture(MICRO)12. ACM/IFIP/USENIX International Middleware Conference (Middleware)13. ACM International Conference on Multimedia (MM)14. ACM International Conference on Mobile Systems, Applications, andServices (MobiSys)15. ACM Conference on Programming Language Design andImplementation (PLDI)16. Annual ACM Symposium on Principles of Distributed Computing(PODC)17. ACM Symposium on Principles of Programming Languages (POPL)18. ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP)19. IEEE Real-Time Systems Symposium (RTSS)20. Supercomputing (SC'XY) Conference21. ACM Conference on Computer Graphics and Interactive Techniques(SIGGRAPH)22. ACM Conference on Measurement and Modeling of ComputerSystems (SIGMETRICS)23. IEEE Symposium on Security and Privacy (SP)24. Annual ACM Symposium on Parallel Algorithms and Architectures(SPAA)25. International World Wide Web Conference (WWW)华中科技大学计算机科学与技术学院学位审议委员会二○○八年十一月十七日计算机学院资助教师和学生参加顶尖国际学术会议试行办法院办字[2006]06号为了促进计算机学院师生开展国际学术交流,提高学术水平,经第75次学院办公会议研究,并经第2次教授咨询委员会咨询,学院资助教师和在校学生参加顶尖国际学术会议,制定本办法。
文献检索练习册
第一章 单项选择题
1.根据国家相关标准,文献的定义是指“记录有(知识)的一切载体”。
2.以作者本人取得的成果为依据而创作的论文、报告等,并经公开发表或出版的各种文献,称为(一次文献)。
3.文摘、题录、目录等属于(二次文献)。
4.手稿、私人笔记等属于(零次)文献,辞典、手册等属于(三次)文献。
5.按照出版时间的先后,应将各个级别的文献排列成(一次文献、二次文献、三次文献)。
6.(二次文献)的主要功能是检索、通报、控制一次文献、帮助人们在较短时间内获取较多的文献信息。
7.一次文献、二次文献、三次文献是按照(加工深度)进行区分的。
8.从文献的(载体类型)角度区分,可将文献分为印刷型、缩微型等。
9.具有固定名称、统一出版形式和一定出版规律的定期或不定期的连续出版物,称为(期刊)。
10.(期刊)类型的专业文献出版周期最短、发行量最大、报道最迅速及时,成为多数论文发表渠道。
11.在公开出版物中,当前的(报纸文献)反映的信息内容可能最新。
12.(档案文献)不属于公共出版物。
13.根据文后参考文献信息区别期刊和图书,主要依据是判断有无(卷期号)特征词,若有则为期刊。
14.根据文后参考文献信息区别图书和会议文献,主要依据是判断有无(出版社)特征词,若有则为会议。
15.根据布拉德福文献分散定律,阅读(核心期刊)文章是一种有效的情报获取方法。
16.核心期刊的期刊影响因子具有(学科性、学术性、时间性)特点。
17.在文献信息传递的载体类型中,(印刷型)是历史最悠久的文献形式。
18.一次文献的出版类型有(期刊)等。
19.二次文献又可称为(检索工具),它报道文献的(主要信息及其来源出处)。
20.当我们需要对陌生知识作一般了解时,我们可先参考(图书)文献。
21.从载体的物理形态区分,(电子型)文献是文献发展的方向。
22.(期刊)提供的信息相对比较新颖、及时、可靠、专深。
23.要了解本专业的国内核心期刊,可参考(《中文核心期刊要目总览》)。
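Questions 13 and 14 above rely on keyword cues in a reference (volume/issue numbers suggest a journal, a publisher suggests a book, and so on). Below is a rough, hedged sketch of that kind of heuristic; the cue words and their precedence order are illustrative assumptions, not an authoritative rule.

```python
import re

# Cue patterns checked in order; the first match decides the guessed type.
CUES = [
    ("patent",     re.compile(r"patent\s*no", re.I)),
    ("standard",   re.compile(r"\bASTM\b|standard test", re.I)),
    ("thesis",     re.compile(r"thesis|dissertation", re.I)),
    ("report",     re.compile(r"technical report", re.I)),
    ("conference", re.compile(r"proceedings|conference|symposium|workshop", re.I)),
    ("journal",    re.compile(r"\bvol\.?\s*\d+|\d+\(\d+\)|journal", re.I)),
    ("book",       re.compile(r"press|publisher|publishing|edition", re.I)),
]

def guess_reference_type(reference):
    """Return a best-guess document type for a bibliographic reference string."""
    for doc_type, pattern in CUES:
        if pattern.search(reference):
            return doc_type
    return "unknown"

print(guess_reference_type("Information and Computation, 134(1), 314-316, 1997"))
print(guess_reference_type("Proceedings of CIKM-94, ACM Press, New York, 1994"))
print(guess_reference_type("U.S. patent No. 2400331, 3 Sept. 1968"))
```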
文献出版类型
文献的出版类型介绍根据文献的出版类型不同,文献可分为:图书报纸科技报告专利文献标准文献学位论文会议文献档案政府出版物期刊1图书(books, monographs):论述或介绍某一学科或领域知识的出版物。
1.1内容特点:比较成熟,系统全面,基本知识1.2标准著录格式:作者. 书名. 版本(第1版不写). 出版地: 出版者,出版年,页码1.3判别依据:出版地:出版者例如:(1) Etten V W.Fundamentals of optical fiber communication. London: Prentice-Hall,1991(2) 蒋永新主编. 自然科学技术信息检索教程.2版. 上海:上海大学出版社,2006.1.4实际情况:版次如:second edition,3rd edition编者ed.ISBN号出版商:press, publisher,Housepublishing company等如:GashS ed. Effective Literature Searching for Research. 2nd. Hampshire:Gower House,1999析出文献G.R. Mettam, L.B. Adams, New Discovery, in: B.S. Jones, R.Z. Smith (Eds.), Introduction to the Electronic Age, E-Publishing, Inc. New Y ork,1994, pp. 281-304.2期刊(journal, transactions):指有固定名称、统一出版形式和一定出版规律的定期或不定期的连续出版物。
2.1内容特点:新颖2.2标准著录格式:著者.文章篇名. 刊名,出版年,卷号,期号,起止页码2.3判别依据:刊名,卷号,期号例:T ohyama H. A plasma Image bar for an electrophoto-graphic printer. Journal of the Imaging Science, 1991, vol.35no.5,330-3 (J.Imag. Sci., 1991, 35(5):330-3)(Journal of the Imaging Science, 1991, v.35n.5,330-3)期刊(journal, transactions)刊名中可能出现的词:Journal(J.)Transaction(Trans.)Letter(Lett.)Annual (Ann.)3会议文献(conference):在国际和国内重要的学术或专业性会议上宣读发表的论文、报告。
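The 标准著录格式 patterns quoted above (author, title, and publisher details for a book; author, title, journal, year, volume, issue, and pages for a journal article) can be expressed as a small formatting helper. The sketch below is illustrative only; the function names and the exact punctuation are assumptions based on the formats and examples quoted above.

```python
def format_book(author, title, place, publisher, year, pages=None, edition=None):
    """著录格式: 作者. 书名. 版本(第1版不写). 出版地: 出版者, 出版年, 页码"""
    parts = [author, title]
    if edition and edition != 1:
        parts.append(f"{edition} ed")
    ref = ". ".join(parts) + f". {place}: {publisher}, {year}"
    return ref + (f", {pages}" if pages else "")

def format_journal(author, title, journal, year, volume, issue, pages):
    """著录格式: 著者. 文章篇名. 刊名, 出版年, 卷号(期号): 起止页码"""
    return f"{author}. {title}. {journal}, {year}, {volume}({issue}): {pages}"

print(format_book("Etten V W", "Fundamentals of optical fiber communication",
                  "London", "Prentice-Hall", 1991))
print(format_journal("Tohyama H", "A plasma image bar for an electrophotographic printer",
                     "Journal of the Imaging Science", 1991, 35, 5, "330-3"))
```

The second call reproduces the journal example given above in its abbreviated 卷(期):页 form.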
国际计算机会议与期刊分级列表
Computer Science Department Conference RankingsSome conferences accept multiple categories of papers. The rankingsbelow are for the most prestigious category of paper at a givenconference. All other categories should be treated as "unranked".AREA: Artificial Intelligence and Related SubjectsRank 1:IJCAI: Intl Joint Conf on AIAAAI: American Association for AI National ConferenceICAA: International Conference on Autonomous Agents(现改名为AAMAS) CVPR: IEEE Conf on Comp Vision and Pattern RecognitionICCV: Intl Conf on Computer VisionICML: Intl Conf on Machine LearningKDD: Knowledge Discovery and Data MiningKR: Intl Conf on Principles of KR & ReasoningNIPS: Neural Information Processing SystemsUAI: Conference on Uncertainty in AIACL: Annual Meeting of the ACL (Association of Computational Linguistics) Rank 2:AID: Intl Conf on AI in DesignAI-ED: World Conference on AI in EducationCAIP: Inttl Conf on Comp. Analysis of Images and PatternsCSSAC: Cognitive Science Society Annual ConferenceECCV: European Conference on Computer VisionEAI: European Conf on AIEML: European Conf on Machine LearningGP: Genetic Programming ConferenceIAAI: Innovative Applications in AIICIP: Intl Conf on Image ProcessingICNN/IJCNN: Intl (Joint) Conference on Neural NetworksICPR: Intl Conf on Pattern RecognitionICDAR: International Conference on Document Analysis and RecognitionICTAI: IEEE conference on Tools with AIAMAI: Artificial Intelligence and MathsDAS: International Workshop on Document Analysis SystemsWACV: IEEE Workshop on Apps of Computer VisionCOLING: International Conference on Computational LiguisticsEMNLP: Empirical Methods in Natural Language ProcessingRank 3:PRICAI: Pacific Rim Intl Conf on AIAAI: Australian National Conf on AIACCV: Asian Conference on Computer VisionAI*IA: Congress of the Italian Assoc for AIANNIE: Artificial Neural Networks in EngineeringANZIIS: Australian/NZ Conf on Intelligent Inf. SystemsCAIA: Conf on AI for ApplicationsCAAI: Canadian Artificial Intelligence ConferenceASADM: Chicago ASA Data Mining Conf: A Hard Look at DMEPIA: Portuguese Conference on Artificial IntelligenceFCKAML: French Conf on Know. Acquisition & Machine LearningICANN: International Conf on Artificial Neural NetworksICCB: International Conference on Case-Based ReasoningICGA: International Conference on Genetic AlgorithmsICONIP: Intl Conf on Neural Information ProcessingIEA/AIE: Intl Conf on Ind. & Eng. Apps of AI & Expert SysICMS: International Conference on Multiagent SystemsICPS: International conference on Planning SystemsIWANN: Intl Work-Conf on Art & Natural Neural NetworksPACES: Pacific Asian Conference on Expert SystemsSCAI: Scandinavian Conference on Artifical IntelligenceSPICIS: Singapore Intl Conf on Intelligent SystemPAKDD: Pacific-Asia Conf on Know. Discovery & Data MiningSMC: IEEE Intl Conf on Systems, Man and CyberneticsPAKDDM: Practical App of Knowledge Discovery & Data MiningWCNN: The World Congress on Neural NetworksWCES: World Congress on Expert SystemsINBS: IEEE Intl Symp on Intell. 
in Neural \& Bio SystemsASC: Intl Conf on AI and Soft ComputingPACLIC: Pacific Asia Conference on Language, Information and Computation ICCC: International Conference on Chinese ComputingOthers:ICRA: IEEE Intl Conf on Robotics and AutomationNNSP: Neural Networks for Signal ProcessingICASSP: IEEE Intl Conf on Acoustics, Speech and SPGCCCE: Global Chinese Conference on Computers in EducationICAI: Intl Conf on Artificial IntelligenceAEN: IASTED Intl Conf on AI, Exp Sys & Neural NetworksWMSCI: World Multiconfs on Sys, Cybernetics & InformaticsAREA: Hardware and ArchitectureRank 1:ASPLOS: Architectural Support for Prog Lang and OSISCA: ACM/IEEE Symp on Computer ArchitectureICCAD: Intl Conf on Computer-Aided DesignDAC: Design Automation ConfMICRO: Intl Symp on MicroarchitectureHPCA: IEEE Symp on High-Perf Comp ArchitectureRank 2:FCCM: IEEE Symposium on Field Programmable Custom Computing Machines SUPER: ACM/IEEE Supercomputing ConferenceICS: Intl Conf on SupercomputingISSCC: IEEE Intl Solid-State Circuits ConfHCS: Hot Chips SympVLSI: IEEE Symp VLSI CircuitsISSS: International Symposium on System SynthesisDATE: IEEE/ACM Design, Automation & Test in Europe ConferenceRank 3:ICA3PP: Algs and Archs for Parall ProcEuroMICRO: New Frontiers of Information TechnologyACS: Australian Supercomputing ConfUnranked:Advanced Research in VLSIInternational Symposium on System SynthesisInternational Symposium on Computer DesignInternational Symposium on Circuits and SystemsAsia Pacific Design Automation ConferenceInternational Symposium on Physical DesignInternational Conference on VLSI DesignAREA: ApplicationsRank 1:I3DG: ACM-SIGRAPH Interactive 3D GraphicsSIGGRAPH: ACM SIGGRAPH ConferenceACM-MM: ACM Multimedia ConferenceDCC: Data Compression ConfSIGMETRICS: ACM Conf on Meas. & Modelling of Comp SysSIGIR: ACM SIGIR Conf on Information RetrievalPECCS: IFIP Intl Conf on Perf Eval of Comp \& Comm SysWWW: World-Wide Web ConferenceRank 2:EUROGRAPH: European Graphics ConferenceCGI: Computer Graphics InternationalCANIM: Computer AnimationPG: Pacific GraphicsIEEE-MM: IEEE Intl Conf on Multimedia Computing and SysNOSSDAV: Network and OS Support for Digital A/VPADS: ACM/IEEE/SCS Workshop on Parallel \& Dist Simulation WSC: Winter Simulation ConferenceASS: IEEE Annual Simulation SymposiumMASCOTS: Symp Model Analysis \& Sim of Comp \& Telecom Sys PT: Perf Tools - Intl Conf on Model Tech \& Tools for CPENetStore - Network Storage SymposiumRank 3:ACM-HPC: ACM Hypertext ConfMMM: Multimedia ModellingDSS: Distributed Simulation SymposiumSCSC: Summer Computer Simulation ConferenceWCSS: World Congress on Systems SimulationESS: European Simulation SymposiumESM: European Simulation MulticonferenceHPCN: High-Performance Computing and NetworkingGeometry Modeling and ProcessingWISEDS-RT: Distributed Simulation and Real-time ApplicationsIEEE Intl Wshop on Dist Int Simul and Real-Time ApplicationsUn-ranked:DVAT: IS\&T/SPIE Conf on Dig Video Compression Alg \& Tech MME: IEEE Intl Conf. 
on Multimedia in EducationICMSO: Intl Conf on Modelling, Simulation and OptimisationICMS: IASTED Intl Conf on Modelling and SimulationAREA: System TechnologyRank 1:SIGCOMM: ACM Conf on Comm Architectures, Protocols & Apps INFOCOM: Annual Joint Conf IEEE Comp & Comm SocSPAA: Symp on Parallel Algms and ArchitecturePODC: ACM Symp on Principles of Distributed ComputingPPoPP: Principles and Practice of Parallel ProgrammingMassPar: Symp on Frontiers of Massively Parallel ProcRTSS: Real Time Systems SympSOSP: ACM SIGOPS Symp on OS PrinciplesSOSDI: Usenix Symp on OS Design and ImplementationCCS: ACM Conf on Comp and Communications SecurityIEEE Symposium on Security and PrivacyMOBICOM: ACM Intl Conf on Mobile Computing and Networking USENIX Conf on Internet Tech and SysICNP: Intl Conf on Network ProtocolsOPENARCH: IEEE Conf on Open Arch and Network ProgPACT: Intl Conf on Parallel Arch and Compil TechRank 2:CC: Compiler ConstructionIPDPS: Intl Parallel and Dist Processing SympIC3N: Intl Conf on Comp Comm and NetworksICPP: Intl Conf on Parallel ProcessingICDCS: IEEE Intl Conf on Distributed Comp SystemsSRDS: Symp on Reliable Distributed SystemsMPPOI: Massively Par Proc Using Opt InterconnsASAP: Intl Conf on Apps for Specific Array ProcessorsEuro-Par: European Conf. on Parallel ComputingFast Software EncryptionUsenix Security SymposiumEuropean Symposium on Research in Computer SecurityWCW: Web Caching WorkshopLCN: IEEE Annual Conference on Local Computer NetworksIPCCC: IEEE Intl Phoenix Conf on Comp & CommunicationsCCC: Cluster Computing ConferenceICC: Intl Conf on CommRank 3:MPCS: Intl. Conf. on Massively Parallel Computing SystemsGLOBECOM: Global CommICCC: Intl Conf on Comp CommunicationNOMS: IEEE Network Operations and Management SympCONPAR: Intl Conf on Vector and Parallel ProcessingVAPP: Vector and Parallel ProcessingICPADS: Intl Conf. on Parallel and Distributed SystemsPublic Key CryptosystemsIEEE Computer Security Foundations WorkshopAnnual Workshop on Selected Areas in CryptographyAustralasia Conference on Information Security and PrivacyInt. Conf on Inofrm and Comm. 
SecurityFinancial CryptographyWorkshop on Information HidingSmart Card Research and Advanced Application ConferenceICON: Intl Conf on NetworksIMSA: Intl Conf on Internet and MMedia SysNCC: Nat Conf CommIN: IEEE Intell Network WorkshopICME: Intl Conf on MMedia & ExpoSoftcomm: Conf on Software in Tcomms and Comp NetworksINET: Internet Society ConfWorkshop on Security and Privacy in E-commerceUn-ranked:PARCO: Parallel ComputingSE: Intl Conf on Systems EngineeringAREA: Programming Languages and Software EngineeringRank 1:POPL: ACM-SIGACT Symp on Principles of Prog LangsPLDI: ACM-SIGPLAN Symp on Prog Lang Design & ImplOOPSLA: OO Prog Systems, Langs and ApplicationsICFP: Intl Conf on Function ProgrammingJICSLP/ICLP/ILPS: (Joint) Intl Conf/Symp on Logic ProgICSE: Intl Conf on Software EngineeringFSE: ACM Conference on the Foundations of Software Engineering (inc: ESEC-FSE when held jointly)FM/FME: Formal Methods, World Congress/EuropeCAV: Computer Aided VerificationRank 2:CP: Intl Conf on Principles & Practice of Constraint ProgTACAS: Tools and Algos for the Const and An of SystemsESOP: European Conf on ProgrammingICCL: IEEE Intl Conf on Computer LanguagesPEPM: Symp on Partial Evalutation and Prog ManipulationSAS: Static Analysis SymposiumRTA: Rewriting Techniques and ApplicationsESEC: European Software Engineering ConfIWSSD: Intl Workshop on S/W Spec & DesignCAiSE: Intl Conf on Advanced Info System EngineeringITC: IEEE Intl Test ConfIWCASE: Intl Workshop on Cumpter-Aided Software EngSSR: ACM SIGSOFT Working Conf on Software ReusabilitySEKE: Intl Conf on S/E and Knowledge EngineeringICSR: IEEE Intl Conf on Software ReuseASE: Automated Software Engineering ConferencePADL: Practical Aspects of Declarative LanguagesISRE: Requirements EngineeringICECCS: IEEE Intl Conf on Eng. of Complex Computer SystemsIEEE Intl Conf on Formal Engineering MethodsIntl Conf on Integrated Formal MethodsFOSSACS: Foundations of Software Science and Comp StructRank 3:FASE: Fund Appr to Soft EngAPSEC: Asia-Pacific S/E ConfPAP/PACT: Practical Aspects of PROLOG/Constraint TechALP: Intl Conf on Algebraic and Logic ProgrammingPLILP: Prog, Lang Implentation & Logic ProgrammingLOPSTR: Intl Workshop on Logic Prog Synthesis & TransfICCC: Intl Conf on Compiler ConstructionCOMPSAC: Intl. Computer S/W and Applications ConfCSM: Conf on Software MaintenanceTAPSOFT: Intl Joint Conf on Theory & Pract of S/W DevWCRE: SIGSOFT Working Conf on Reverse EngineeringAQSDT: Symp on Assessment of Quality S/W Dev ToolsIFIP Intl Conf on Open Distributed ProcessingIntl Conf of Z UsersIFIP Joint Int'l Conference on Formal Description Techniques and Protocol Specification, Testing, And VerificationPSI (Ershov conference)UML: International Conference on the Unified Modeling LanguageUn-ranked:Australian Software Engineering ConferenceIEEE Int. W'shop on Object-oriented Real-time Dependable Sys. (WORDS)IEEE International Symposium on High Assurance Systems EngineeringThe Northern Formal Methods WorkshopsFormal Methods PacificInt. 
Workshop on Formal Methods for Industrial Critical SystemsJFPLC - International French Speaking Conference on Logic and Constraint ProgrammingL&L - Workshop on Logic and LearningSFP - Scottish Functional Programming WorkshopHASKELL - Haskell WorkshopLCCS - International Workshop on Logic and Complexity in Computer ScienceVLFM - Visual Languages and Formal MethodsNASA LaRC Formal Methods Workshop(1) FATES - A Satellite workshop on Formal Approaches to Testing of Software(1) Workshop On Java For High-Performance Computing(1) DSLSE - Domain-Specific Languages for Software Engineering(1) FTJP - Workshop on Formal Techniques for Java Programs(*) WFLP - International Workshop on Functional and (Constraint) Logic Programming(*) FOOL - International Workshop on Foundations of Object-Oriented Languages(*) SREIS - Symposium on Requirements Engineering for Information Security(*) HLPP - International workshop on High-level parallel programming and applications(*) INAP - International Conference on Applications of Prolog(*) MPOOL - Workshop on Multiparadigm Programming with OO Languages(*) PADO - Symposium on Programs as Data Objects(*) TOOLS: Int'l Conf Technology of Object-Oriented Languages and Systems(*) Australasian Conference on Parallel And Real-Time SystemsAREA: Algorithms and TheoryRank 1:STOC: ACM Symp on Theory of ComputingFOCS: IEEE Symp on Foundations of Computer ScienceCOLT: Computational Learning TheoryLICS: IEEE Symp on Logic in Computer ScienceSCG: ACM Symp on Computational GeometrySODA: ACM/SIAM Symp on Discrete AlgorithmsSPAA: ACM Symp on Parallel Algorithms and ArchitecturesPODC: ACM Symp on Principles of Distributed ComputingISSAC: Intl. Symp on Symbolic and Algebraic ComputationCRYPTO: Advances in CryptologyEUROCRYPT: European Conf on CryptographyRank 2:CONCUR: International Conference on Concurrency TheoryICALP: Intl Colloquium on Automata, Languages and ProgSTACS: Symp on Theoretical Aspects of Computer ScienceCC: IEEE Symp on Computational ComplexityWADS: Workshop on Algorithms and Data StructuresMFCS: Mathematical Foundations of Computer ScienceSWAT: Scandinavian Workshop on Algorithm TheoryESA: European Symp on AlgorithmsIPCO: MPS Conf on integer programming & comb optimization LFCS: Logical Foundations of Computer ScienceALT: Algorithmic Learning TheoryEUROCOLT: European Conf on Learning TheoryWDAG: Workshop on Distributed AlgorithmsISTCS: Israel Symp on Theory of Computing and SystemsISAAC: Intl Symp on Algorithms and ComputationFST&TCS: Foundations of S/W Tech & Theoretical CSLATIN: Intl Symp on Latin American Theoretical InformaticsRECOMB: Annual Intl Conf on Comp Molecular BiologyCADE: Conf on Automated DeductionIEEEIT: IEEE Symposium on Information TheoryAsiacryptRank 3:MEGA: Methods Effectives en Geometrie AlgebriqueASIAN: Asian Computing Science ConfCCCG: Canadian Conf on Computational GeometryFCT: Fundamentals of Computation TheoryWG: Workshop on Graph TheoryCIAC: Italian Conf on Algorithms and ComplexityICCI: Advances in Computing and InformationAWTI: Argentine Workshop on Theoretical InformaticsCATS: The Australian Theory SympCOCOON: Annual Intl Computing and Combinatorics ConfUMC: Unconventional Models of ComputationMCU: Universal Machines and ComputationsGD: Graph DrawingSIROCCO: Structural Info & Communication ComplexityALEX: Algorithms and ExperimentsALG: ENGG Workshop on Algorithm EngineeringLPMA: Intl Workshop on Logic Programming and Multi-Agents EWLR: European Workshop on Learning RobotsCITB: Complexity & info-theoretic approaches to biologyFTP: Intl 
Workshop on First-Order Theorem Proving (FTP)CSL: Annual Conf on Computer Science Logic (CSL)AAAAECC: Conf On Applied Algebra, Algebraic Algms & ECC DMTCS: Intl Conf on Disc Math and TCSUn-ranked:Information Theory WorkshopAREA: Data BasesRank 1:SIGMOD: ACM SIGMOD Conf on Management of DataPODS: ACM SIGMOD Conf on Principles of DB SystemsVLDB: Very Large Data BasesICDE: Intl Conf on Data EngineeringICDT: Intl Conf on Database TheoryRank 2:SSD: Intl Symp on Large Spatial DatabasesDEXA: Database and Expert System ApplicationsFODO: Intl Conf on Foundation on Data OrganizationEDBT: Extending DB TechnologyDOOD: Deductive and Object-Oriented DatabasesDASFAA: Database Systems for Advanced ApplicationsCIKM: Intl. Conf on Information and Knowledge ManagementSSDBM: Intl Conf on Scientific and Statistical DB MgmtCoopIS - Conference on Cooperative Information SystemsER - Intl Conf on Conceptual Modeling (ER)Rank 3:COMAD: Intl Conf on Management of DataBNCOD: British National Conference on DatabasesADC: Australasian Database ConferenceADBIS: Symposium on Advances in DB and Information SystemsDaWaK - Data Warehousing and Knowledge DiscoveryRIDE WorkshopIFIP-DS: IFIP-DS ConferenceIFIP-DBSEC - IFIP Workshop on Database SecurityNGDB: Intl Symp on Next Generation DB Systems and AppsADTI: Intl Symp on Advanced DB Technologies and IntegrationFEWFDB: Far East Workshop on Future DB SystemsMDM - Int. Conf. on Mobile Data Access/Management (MDA/MDM)ICDM - IEEE International Conference on Data MiningVDB - Visual Database SystemsIDEAS - International Database Engineering and Application SymposiumOthers:ARTDB - Active and Real-Time Database SystemsCODAS: Intl Symp on Cooperative DB Systems for Adv AppsDBPL - Workshop on Database Programming LanguagesEFIS/EFDBS - Engineering Federated Information (Database) SystemsKRDB - Knowledge Representation Meets DatabasesNDB - National Database Conference (China)NLDB - Applications of Natural Language to Data BasesKDDMBD - Knowledge Discovery and Data Mining in Biological Databases Meeting FQAS - Flexible Query-Answering SystemsIDC(W) - International Database Conference (HK CS)RTDB - Workshop on Real-Time DatabasesSBBD: Brazilian Symposium on DatabasesWebDB - International Workshop on the Web and DatabasesWAIM: Interational Conference on Web Age Information Management(1) DASWIS - Data Semantics in Web Information Systems(1) DMDW - Design and Management of Data Warehouses(1) DOLAP - International Workshop on Data Warehousing and OLAP(1) DMKD - Workshop on Research Issues in Data Mining and Knowledge Discovery (1) KDEX - Knowledge and Data Engineering Exchange Workshop(1) NRDM - Workshop on Network-Related Data Management(1) MobiDE - Workshop on Data Engineering for Wireless and Mobile Access(1) MDDS - Mobility in Databases and Distributed Systems(1) MEWS - Mining for Enhanced Web Search(1) TAKMA - Theory and Applications of Knowledge MAnagement(1) WIDM: International Workshop on Web Information and Data Management(1) W2GIS - International Workshop on Web and Wireless Geographical Information Systems * CDB - Constraint Databases and Applications* DTVE - Workshop on Database Technology for Virtual Enterprises* IWDOM - International Workshop on Distributed Object Management* IW-MMDBMS - Int. 
Workshop on Multi-Media Data Base Management Systems* OODBS - Workshop on Object-Oriented Database Systems* PDIS: Parallel and Distributed Information SystemsAREA: MiscellaneousRank 1:Rank 2:AMIA: American Medical Informatics Annual Fall SymposiumDNA: Meeting on DNA Based ComputersRank 3:MEDINFO: World Congress on Medical InformaticsInternational Conference on Sequences and their ApplicationsECAIM: European Conf on AI in MedicineAPAMI: Asia Pacific Assoc for Medical Informatics ConfSAC: ACM/SIGAPP Symposium on Applied ComputingICSC: Internal Computer Science ConferenceISCIS: Intl Symp on Computer and Information SciencesICSC2: International Computer Symposium ConferenceICCE: Intl Conf on Comps in EduEd-MediaWCC: World Computing CongressPATAT: Practice and Theory of Automated TimetablingNot Encouraged (due to dubious referee process):International Multiconferences in Computer Science -- 14 joint int'l confs.SCI: World Multi confs on systemics, sybernetics and informaticsSSGRR: International conf on Advances in Infrastructure for e-B, e-Edu and e-Science and e-MedicineIASTED conferences以下是期刊:IEEE/ACM TRANSACTIONS期刊系列一般都被公认为领域顶级期刊,所以以下列表在关于IEEE/ACM TRANSACTIONS的分类不一定太准确。
文献检索题目分析
文献检索(文科)第一章一、单项选择题2、以作者本人取得的成果为依据而创作的论文、报告等,并经公开发表或出版的各种文献,称为一次文献3、文摘、题录、目录等属于二次文献4、手稿、私人笔记等属于零次文献,辞典、手册等属于三次文献。
5、按照出版时间的先后,应将各个级别的文献排列成一次文献、二次文献、三次文献。17、在文献信息传递的载体类型中,印刷型是历史最悠久的文献形式。
20、当我们需要对陌生知识作一般了解时,我们可先参考图书文献。
21、从载体的物理形态区分,电子型文献是文献发展的方向。
27、学位论文属于一次文献。
二、多项选择题2、期刊、报纸文献就属于连续出版物。
3、专利、报告、标准文献就属于特种文献。
三、是非题4、一次文献内容新颖丰富,叙述具体详尽,参考价值大,但数量庞大,分布分散。
正确5、一次文献是产生二三次文献的基础,是检索利用的主要对象。
正确6、从零次文献、一次文献到二次文献,再到三次文献,是一个知识内容由分散到集中,由无组织到系统化的过程。
正确14、图书馆馆藏的书刊既包括纸质版书刊,也包含网络版书刊。
正确20、文献的出版类型是针对一次文献所含的内容的特点和出版方式进行区分的文献类型。
正确
四、填空题
1.文献有印刷型、缩微型和电子型等载体形式,其中_______是最基本的文献形式。
3.判断以下文献各属于期刊、图书、会议、学位论文、标准、科技报告或专利中的哪种类型,并说明判断依据。
① Brewington B. Mobile agents for distributed information retrieval. Klusch M. (Ed.) Intelligent Information Agents, Berlin: Springer, 1999 ( )
② Donini F M, Lenzerini M, Nardi D, Nutt W. The complexity of concept languages. Information and Computation, 134(1), 314-316, 1997 ( )
③ Finin T, Fritzson R, McKay D, McEntire R. KQML as an agent communication language. Proceedings of the Third International Conference on Information and Knowledge Management (CIKM-94), ACM Press, New York, 1994 ( )
④ Sycara K, Lu J, Klusch M. Interoperability among heterogeneous software agents on the Internet. Technical report CMU-RI-TR-98-22, CMU, Pittsburgh, USA, 1998 ( )
⑤ Papadopoulos, Gregory M. Implementation of a General Purpose Dataflow Multiprocessor. MIT Electrical Engineering and Computer Science, Ph.D. Thesis, Aug. 1988, 1-155 ( )
⑥ Harris, Daniel J. Gauging Device including a Probe Having a Plurality of Concentric and Coextensive Electrodes. U.S. patent No. 2400331, 3 Sept. 1968 ( )
医学文献检索复习考试重点总结
医学文献检索复习考试重点总结第一章文献检索基础本章要点1.1 文献信息的基本概念1.2 文献的类型和级别1.3 医学文献的分布规律重点印刷型文献出版类型的识别方法电子文献的文件格式和主要类型文献信息的时间、地区和学科分布规律第一节医学文献基础信息、知识、情报和文献以及相互关系信息 (Information)信息是人体感官对事物存在或运动状态及其特征的反应。
基本属性:客观性可塑性依附性共享性知识 (Knowledge)知识是系统化的信息,是人类不断接受信息经过大脑加工得出的经验或总结。
情报(Information)是满足特定用户的特定需要的动态知识。
情报是知识的传递并起作用的部分。
基本属性:①知识性②传递性③用效性文献(Literature)我国国家标准 GB4898-85把文献定义为“记录有知识的一切载体”。
医学文献记录了千千万万医学工作者研究人类生命过程、同疾病斗争的科学知识。
文献的构成包括四个要素①知识信息内容②信息符号:文字、图表、声音、图像等③载体材料:甲骨、竹简、纸张、胶卷、磁盘、光盘等④记录的方式及手段:刀刻、书写、印刷、录音、录像等信息、知识和情报之间的逻辑关系可形象地用图来表示。
第二节文献的类型和级别1.2.1 按载体形式区分按文献载体物理类型分:印刷型文献;缩微型文献;视听型文献;数字化文献等。
印刷型文献(printed form)各种印刷品,如:正式出版的图书、报刊及杂志等。
缩微型文献(microform)缩微型文献是以感光材料为载体,以光学缩微技术为记录手段而产生的一种文献形式。
视听型文献(audio-visual)声像型文献又叫视听资料。
是以磁性、感光材料为载体,直接记录声音、图像而形成的一种文献。
数字化文献(electronic form)以数字的形式存贮在光盘、磁盘和U盘等介质上,并通过计算机阅读和利用。
如:光盘等。
各类文献载体形式的比较:
印刷型(载体:纸张等;记录手段:印刷;种类:图书、报刊;优点:便于阅读;缺点:密度低、耗人力物力)
缩微型(载体:感光材料;记录手段:缩微拍摄;种类:缩微平片;优点:体积小、成本低;缺点:不能直接阅读)
视听型(载体:磁性材料、感光材料;记录手段:机械装置输入;种类:录像带;优点:闻其声、观其形;缺点:成本稍高)
电子型(载体:磁性材料;记录手段:键盘及穿孔输入;种类:电子图书;优点:密度高、速度快;缺点:成本高)
手写型(载体:甲骨、简策;记录手段:手写;种类:甲骨文;缺点:密度低)
图书 (Book)
图书是指论述或总结某一学科或知识领域的出版物,指一定页数的印刷品。
研究生科技文献检索(理工类)考察作业任务
科技文献检索(理工类)期末综合大作业作业要求:1)作业请独立完成,抄袭与被抄袭(截图雷同)均判不及格。
2)用A4纸打印,作业字体大小为五号字,请注意填写页眉信息。
3)作业上交时间与地点:2018年6月20日1:30—3:00交到上课教室。
(一)基础知识与概念1.《中图法》的全称是什么?它将图书分为几个基本部类,多少基本大类?TP393是哪类书?答:《中图法》的全称是《中国图书馆分类法》。
它将图书分为五个基本部类,二十二个大类。
TP393是计算机网络。
2.一次文献和二次文献有什么区别?图书馆文献数据库中哪些是一次文献库,哪些是二次文献库,各举2个例子。
答:一次文献是指作者创作的原始文献。
作者以自己的研究成果为基本素材而创作(或撰写)的文献,并向社会公开。
如:图书、报纸、期刊论文、科技报告、会议论文、学位论文、专利、标准等。
二次文献是指按一定的方法对一次文献进行整理加工,以使之有序化而形成的文献。
二次文献在内容上并不具有原创性,它只提供有关一次文献的内容线索,由情报人员对一次文献进行加工、整理、提炼、标引及编序后形成的工具性文献。
如:各种目录、题录、索引、文摘等。
二次文献是用来查找一次文献的工具。
3.在CNKI中文核心期刊要目中查找你所在专业的核心期刊一种,写出刊名、主办单位、ISSN号和CN号。
答:刊名:《软件学报》;主办单位:中国科学院软件研究所ISSN号:1000-9825;CN号:11-2560/TP4.文献检索时往往会出现检索结果过多、过少、或者根本不相关的情况,请问检索策略调整有哪些方法?答:检索结果过多—--缩小检索范围;检索结果过少----扩大检索范围;检索结果相关度小----修改检索词、检索式、更换检索工具。
5. 判断以下文献各属于期刊、图书、会议、学位论文、标准、科技报告或专利中的哪种类型。
①B.Brewington.Mobile agents for distributed information retrieval.M.klusch(ED.Intelligent Information Agent) [M],Berlin:Springer,1999 ②F.M.Donini,M.lenzerini,D.Nardi,W.Nutt.The complexity of concept languages.Information and Computation .134(1),314-316,1997③T.Finin,R.Fritzson,D.McKay,R.McEntire.KQML as an agent communication language .Proceedings of Third International Conference on Information and Knowledge Management(CIKM-94),ACM press ,New York,1994④Sycara,J.Lu,M.Klusch.Interoperability among heterogeneous software agents on the internet.Technical report CMU-RI-TR-98-22,CMU,Pittsburgh,USA ,1998⑤Papadopoulos,Gregory M.Implementation of a General Purpose Dataflow Multiprocessor.MIT Electrical Engineering and Computer Science ,PH.D.Thesis,Aug.1988,1-155⑥Harris,Daniel J.Gauging Device including a Probe Having a Plurality of Concentric and Coextensive Electrodes.U.S.patent No.2400331.3 sept.1968 ⑦American Society for testing and Materials Standard.Standard Test for Rubber Property-effect of liquids,ASTM-D 471,1995答:①专著(含教材等) ②期刊③会议④科技报告⑤学位论文⑥专利文献⑦标准(二) 检索练习题1.利用CNKI中国优秀硕士学位论文全文数据库,检索我校2011年计算机应用技术专业下载量最高的一篇论文(要求给出论文题目、作者、导师姓名、下载次数)。
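Question 4 above adjusts a search by narrowing or broadening its scope. One concrete way to see this is with Boolean operators over a toy inverted index: AND (intersection) narrows the result set, OR (union) broadens it. The sketch below is illustrative and not tied to CNKI or any particular retrieval system.

```python
from collections import defaultdict

DOCS = {
    1: "distributed information retrieval",
    2: "information retrieval evaluation",
    3: "distributed database systems",
    4: "web search engines and retrieval",
}

# Build a toy inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in DOCS.items():
    for term in text.split():
        index[term].add(doc_id)

def search_and(*terms):
    """AND narrows: only documents containing every term."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(*terms):
    """OR broadens: documents containing any of the terms."""
    return set.union(*(index[t] for t in terms)) if terms else set()

print(search_and("distributed", "retrieval"))   # fewer hits: {1}
print(search_or("distributed", "retrieval"))    # more hits: {1, 2, 3, 4}
```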
Word2Vec Explained
word2vec can be trained efficiently on vocabularies with millions of entries and datasets with hundreds of millions of samples; second, the result of training, the word vectors (word embeddings), measures the similarity between words well.
With the growing use of deep learning in natural language processing, many people mistakenly assume that word2vec is a deep-learning algorithm.
In fact, the model behind word2vec is a shallow neural network.
Another point worth stressing is that word2vec is an open-source tool for computing word vectors.
When we talk about the word2vec algorithm or model, we are really referring to the CBoW and Skip-gram models it uses internally to compute word vectors.
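Since word2vec is described above as an open-source tool built around the CBoW and Skip-gram models, a typical way to use it is through the gensim library. The following is a usage sketch under the assumption that gensim 4.x is installed; the toy corpus and parameter values are illustrative, not recommendations from the original post.

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; real training needs far more text.
sentences = [
    ["information", "retrieval", "on", "the", "web"],
    ["distributed", "information", "retrieval", "systems"],
    ["word", "vectors", "measure", "word", "similarity"],
]

# sg=0 trains CBoW, sg=1 trains Skip-gram; both are shallow networks.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

vec = model.wv["retrieval"]                        # the learned word vector
print(vec.shape)
print(model.wv.most_similar("retrieval", topn=3))  # nearest words by cosine similarity
```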
Statistical Language Model
Before going into the details of the word2vec algorithm, let us first revisit a basic problem in natural language processing: how do we compute the probability that a piece of text occurs in a given language? It is called a basic problem because it plays an important role in many NLP tasks.
For example, in machine translation, if we know the probability of every sentence in the target language, we can pick the most plausible sentence from the candidate set and return it as the translation.
Statistical language models provide a basic framework for solving this class of problems.
For a text sequence $S = w_1, w_2, \ldots, w_T$, its probability can be expressed as
$$P(S) = P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, w_2, \ldots, w_{t-1}),$$
that is, the joint probability of the sequence is decomposed into a product of conditional probabilities.
The problem then becomes how to predict these conditional probabilities given the previous words: $p(w_t \mid w_1, w_2, \ldots, w_{t-1})$. Because of its huge parameter space, such a raw model is of little practical use.
Instead, its simplified version, the N-gram model, is usually adopted:
$$p(w_t \mid w_1, w_2, \ldots, w_{t-1}) \approx p(w_t \mid w_{t-n+1}, \ldots, w_{t-1}),$$
with the bigram model ($N = 2$) and the trigram model ($N = 3$) being the most common choices.
In fact, because of the trade-off between model complexity and prediction accuracy, models with $N > 3$ are rarely considered.
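To make the factorization and the N-gram approximation above concrete, here is a minimal bigram (N = 2) sketch with add-one smoothing. It is not from the original post; the toy corpus, the smoothing choice, and the function names are illustrative assumptions.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams from tokenized sentences (with <s> padding)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, alpha=1.0):
    """p(w | w_prev) with add-alpha smoothing to avoid zero probabilities."""
    return (bigrams[(w_prev, w)] + alpha) / (unigrams[w_prev] + alpha * vocab_size)

def sentence_prob(sent, unigrams, bigrams):
    """P(S) as the product over t of p(w_t | w_{t-1}), per the bigram approximation."""
    vocab_size = len(unigrams)
    p = 1.0
    tokens = ["<s>"] + sent
    for w_prev, w in zip(tokens, tokens[1:]):
        p *= bigram_prob(w_prev, w, unigrams, bigrams, vocab_size)
    return p

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram(corpus)
print(sentence_prob(["the", "cat", "sat"], uni, bi))
```

Real language models use better smoothing and back-off schemes, but the computation keeps the same product-of-conditionals structure.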
CRUST
Learning: Knowledge Representation, Organization, and AcquisitionDanielle S. McNamara and Tenaha O’ReillyOld Dominion UniversityKnowledge acquisition is the process of absorbing and storing new information in memory, the success of which is often gauged by how well the information can later be remembered, or retrieved from memory. The process of storing and retrieving new information depends heavily on the representation and organization of this information. Moreover, the utility of knowledge can also be influenced by how the information is structured. For example, a bus schedule can be represented in the form of a map or a timetable. On the one hand, a timetable provides quick and easy access to the arrival time for each bus, but does little for finding where a particular stop is situated. On the other hand, a map provides a detailed picture of each bus stop’s location, but cannot efficiently communicate bus schedules. Both forms of representation are useful, but it is important to select the representation most appropriate for the task. Similarly, knowledge acquisition can be improved by considering the purpose and function of the desired information. This article provides an overview of knowledge representation and organization, and offers five guidelines to improve knowledge acquisition and retrieval.Knowledge Representation and OrganizationThere are numerous theories of how knowledge is represented and organized in the mind including rule-based production models (Anderson & Lebière, 1998), distributed networks (Rumelhart & McClelland, 1986), and propositional models (Kintsch, 1998). However, these theories are all fundamentally based on the concept of semantic networks. A semantic networkFigure 1: Schematic representation of a semantic networkis a method of representing knowledge as a system of connections between concepts in memory. This section explains the basic assumptions of semantic networks and describes several different types of knowledge.Semantic NetworksAccording to semantic network models, knowledge is organized based on meaning, such that semantically related concepts are interconnected. Knowledge networks are typically represented as diagrams of nodes (i.e., concepts) and links (i.e., relations). The nodes and links are given numerical weights to represent their strengths in memory. In Figure 1, the node representing DOCTOR is strongly related to SCALPEL, whereas NURSE is weakly related to SCALPEL. These link strengths are represented here in terms of line width. Similarly, some nodes in Figure 1 are bolded to represent their strength in memory. Concepts such as DOCTOR and BREAD are more memorable because they are more frequently encountered than concepts such as SCALPEL and CRUST.Mental excitation, or activation, spreads automatically from one concept to another related concept. For example, thinking of BREAD spreads activation to related concepts, such as BUTTER and CRUST. These concepts are primed, and thus more easily recognized or retrieved from memory. For example, in a typical semantic priming study (Meyer &Schvaneveldt, 1976), a series of words (e.g., BUTTER) and nonwords (e.g., BOTTOR) are presented, and participants determine whether each item is a word. A word is more quickly recognized if it follows a semantically related word. For example, BUTTER is more quickly recognized as a word if BREAD precedes it rather than NURSE. 
This result supports the assumption that semantically related concepts are more strongly connected than unrelated concepts.Figure 2: Schematic representation of ideas (propositions) in a semantic network.Network models represent more than simple associations. They must represent the ideas and complex relationships that comprise knowledge and comprehension. For example, the idea “The doctor uses a scalpel” can be represented as the proposition USE(DOCTOR,SCALPEL) consisting of the nodes DOCTOR and SCALPEL and the link USE (see Figure 2). Educators have successfully used similar diagrams, called concept maps, to communicate important relations and attributes amongst the key concepts of a lesson (Guastello, Beasley, & Sinatra 2000).Types of KnowledgeThere are numerous types of knowledge, but the most important distinction is between declarative and procedural knowledge. Declarative knowledge refers to our memory for concepts, facts, or episodes, whereas procedural knowledge refers to the ability to perform various tasks. Knowledge of how to drive a car, solve a multiplication problem, or throw a football are all forms of procedural knowledge, called procedures or productions. Procedural knowledge may begin as declarative knowledge, but is proceduralized with practice (Anderson, 1982). For example, when first learning to drive a car, you may be told to put the key in the ignition to start the car, which is a declarative statement. However, after starting the car numerous times, this act becomes automatic and is completed with little thought. Indeed, procedural knowledge tends to be accessed automatically and require little attention. It also tends to be more durable (less susceptible to forgetting) than declarative knowledge (Jensen & Healy, 1998).Knowledge AcquisitionThis section describes five guidelines for knowledge acquisition that emerge from how knowledge is represented and organized.Process the material semantically. Knowledge is organized semantically; therefore, knowledge acquisition is optimized when the learner focuses on the meaning of the new material. Craik and his colleagues were among the first to provide evidence for the importance of semantic processing(Craik & Tulving, 1975). In their studies, participants answered questions concerning target words that varied according to the depth of processing involved. For example, semantic questions (e.g., Would the word fit appropriately in the sentence?: "He met a____ on the street"? FRIEND vs. TREE) involves a greater depth of processing than phonemic questions (e.g., Does the word rhyme with LATE?: CRATE vs. TREE), which in turn have a greater depth than questions concerning the structure of a word (e.g., Is the word in capital letters?: TREE vs. tree). They found that words processed semantically were better learned than words processed phonemically or structurally. Further studies have confirmed that learning benefits from greater semantic processing of the material.Process and retrieve information frequently. A second learning principle is to test and retrieve the information numerous times. Retrieving, or self-producing information can be contrasted with simply reading or copying it. Decades of research on a phenomenon called the generation effect has shown that passively studying items by copying or reading them does little for memory in comparison to self-producing, or generating, an item (Slamecka & Graf, 1978). Moreover, learning improves as a function of the number of times information is retrieved. 
Within an academic situation, this principle points to the need for frequent practice tests, worksheets, or quizzes. In terms of studying, it is also important to break up, or distribute retrieval attempts (Melton, 1967; Glenberg, 1979). Distributed retrieval can include studying or testing items in a random order, with breaks, or on different days. In contrast, repeating information numerous times sequentially involves only a single, retrieval from long-term memory, which does little to improve memory for the information.Learning and retrieval conditions should be similar. How knowledge is represented is determined by the conditions and context (internal and external) in which it is learned, and this in turn determines how it is retrieved: Information is best retrieved when the conditions of learning and retrieval are the same. This principle has been referred to as encoding specificity (Tulving & Thompson, 1973). For example, in one experiment, participants were shown sentences with anadjective and a noun printed in capital letters (e.g. The CHIP DIP tasted delicious.) and told that their memory for the nouns would be tested afterward. In the recognition test, participants were shown the noun either with the original adjective (CHIP DIP), a different adjective (SKINNY DIP), or without an adjective (DIP). Noun recognition was better when the original adjective (CHIP) was presented than when no adjective was presented. Moreover, presenting a different adjective (SKINNY) yielded the lowest recognition (Light & Carter-Sobell, 1970). This finding underscores the importance of matching learning and testing conditions.Encoding specificity is also important in terms of the questions used to test memory or comprehension. Different types of questions tap into different levels of understanding. For example, recalling information involves a different level of understanding, and different mental processes than does recognizing information. Likewise, essay and open-ended questions assess a different level of understanding than do multiple-choice questions (McNamara & Kintsch, 1996). Essay and open-ended questions generally tap into a conceptual or situational understanding of the material, which results from an integration of text-based information and the reader’s prior knowledge. In contrast, multiple-choice questions involve recognition processes and typically assess a shallow or text-based understanding. A text-based representation can be impoverished and incomplete because it consists only of concepts and relations within the text. This level of understanding, likely developed by a student preparing for a multiple-choice exam, would be inappropriate preparation for an exam with open-ended or essay questions. Thus, students should benefit by adjusting their study practices according to the expected type of questions. Alternatively, students may benefit from reviewing the material in many different ways, such as recognizing the information, recalling the information, and interpreting the information. These latter processes improve understanding and maximize the probability that the various ways thematerial is studied will match the way it is tested. From a teacher’s point of view, including different types of questions on worksheets or exams ensures that each student will have an opportunity to convey their understanding of the material.Connect new information to prior knowledge. Knowledge is interconnected; therefore, new material that is linked to prior knowledge will be better retained. 
A driving factor in text and discourse comprehension is prior knowledge (Bransford & Johnson, 1972). Skilled readers actively use their prior knowledge during comprehension. Prior knowledge helps the reader to fill in contextual gaps within the text and to develop a better global understanding or situation model of the text. Given that texts rarely (if ever) spell out everything needed for successful comprehension, using prior knowledge to understand text and discourse is critical. Moreover, thinking about what you already know about a topic provides connections in memory to the new information – the more connections that are formed, the more likely the information will be retrievable from memory.Create cognitive procedures. Procedural knowledge is better retained and more easily accessed. Therefore, one should develop and use cognitive procedures when learning information. Procedures can include short cuts for completing a task (e.g., using "fast 10s" to solve multiplication problems) as well as memory strategies that increase the distinctive meaning of information. Cognitive research has repeatedly demonstrated the benefits of memory strategies, or mnemonics, for enhancing the recall of information. There are numerous types of mnemonics, but one well-known mnemonic is the method of loci. This technique was invented originally for the purpose of memorizing long speeches in the times before luxuries such as paper and pencil were readily available (Yates, 1966). The first task is to imagine and memorize a series of distinct locations along a familiar route, such as a pathway from one campus buildingto another. Each topic of a speech (or word in a word list; Crovitz, 1971) can then be pictured in a location along the route. When it comes time to recall the speech or word list, the items are simply "found" by mentally traveling the pathway.Mnemonics are generally effective because they increase semantic processing of the words (or phrases) and render them more meaningful by linking them to familiar concepts in memory. Mnemonics also provide “ready-made” effective cues for retrieving the information. Another important aspect of mnemonics is that mental imaging is often involved. Images not only render the information more meaningful, but they provide an additional route for "finding" information in memory (e.g., Paivio, 1990). As mentioned earlier, increasing the number of meaningful links to information in memory increases the likelihood it can be retrieved.Strategies are also an important component of metacognition (Hacker, Dunlosky, & Graesser, 1998). Metacognition is the ability to think about, understand and manage one’s learning. First one must develop an awareness of one's own thought processes. Simply being aware of thought processes increases the likelihood of more effective knowledge construction. Second, the learner must be aware of whether or not comprehension has been successful. Realizing when comprehension has failed is crucial to learning. The final, and most important stage of metacognitive processing is fixing the comprehension problem. The individual must be aware of and use strategies to remedy comprehension and learning difficulties. For successful knowledge acquisition to occur, all three of these processes must occur. Without thinking or worrying about learning, the student cannot realize whether the concepts have been successfully grasped. Without realizing that information has not been understood, the student cannot engage in strategies to remedy the situation. 
If nothing is done about a comprehension failure, awareness is futile.ConclusionKnowledge acquisition is integrally tied to how the mind organizes and represents information. Learning can be enhanced by considering the fundamental properties of human knowledge as well as the ultimate function of the desired information. The most important property is that knowledge is organized semantically; therefore, learning methods should enhance meaningful study of the new information. Learners should also create as many links to the information as possible. In addition, learning methods should be matched to the desired outcome. Just as using a bus timetable to find a bus stop location is ineffective, learning to recognize information will do little good on an essay exam.2,161 wordsDanielle S. McNamaraTenaha O'ReillyBibliographyAnderson, J. R. 1982. Acquisition of a cognitive skill. Psychological Review89:369-406. Anderson, J. R., and Lebière, C. 1998. The Atomic Components of Thought. Mahwah, NJ: Erlbaum.Bransford, J., and Johnson, M. K. 1972. Contextual prerequisites for understanding some investigations of comprehension and recall. Journal of Verbal Learning and VerbalBehavior11: 717-726.Craik, F. I. M., and Tulving, E. 1975. Depth of processing and the retention of words in episodic memory. Journal of Experimental Psychology: General194:268-294.Crovitz, H. F. 1971. The capacity of memory loci in artificial memory. Psychonomic Science24: 187-188.Hacker, D. J., Dunlosky, J., and Graesser, A. C. 1998. Metacognition in Educational Theory and Practice. Mahwah, NJ: Lawrence Erlbaum.Guastello, F., Beasley, M., and Sinatra, R. 2000. Concept mapping effects on science content comprehension of low-achieving inner-city seventh graders. Rase: Remedial & Special Education 21: 356-365.Glenberg, A. M. 1979. Component-levels theory of the effects of spacing of repetitions on recall and recognition. Memory & Cognition 7: 95-112.Kintsch, W. 1998. Comprehension: A Paradigm for Cognition. New York: Cambridge University Press.Jensen, M. B., and Healy, A. F. 1998. Retention of procedural and declarative information from the Colorado Drivers' Manual. In M. J. Intons-Peterson & D. Best (Eds.), MemoryDistortions and their Prevention (pp. 113-124). Mahwah, NJ: Erlbaum.Light, L. L., and Carter-Sobell, L. 1970. Effects of changed semantic context on recognition memory. Journal of Verbal Learning and Verbal Behavior9:1-11.McNamara, D. S., and Kintsch, W. 1996. Learning from text: Effects of prior knowledge and text coherence. Discourse Processes 22: 247-287.Melton, A. W. 1967. Repetition and retrieval from memory. Science 158: 532.Meyer, D. E., and Schvaneveldt, R. W. 1976. Meaning, memory structure, and mental processes.Science192:27-33.Paivio, A. 1990. Mental Representations: A Dual Coding Approach. NY: Oxford University Press.Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1: Foundations). Cambridge, MA: MIT press. Slamecka, N. J., and Graf, P. 1978. The generation effect: Delineation of a phenomenon.Journal of Experimental Psychology: Human Learning and Memory4: 592-604.Tulving, E., and Thompson, D. M. 1973. Encoding specificity and retrieval processes in episodic memory. Psychological Review80: 352-373.Yates, F. A. 1966. The Art of Memory. Chicago, IL: University of Chicago Press.。
6-data mining(1)
Part II: Data Mining

Outline
- The concept of data mining
- Architecture of a typical data mining system
- What can be mined?
- Major issues in data mining
- Data cleaning

What is data mining?
- Data mining is the process of discovering interesting knowledge from large amounts of data.
- The main difference that separates information retrieval from data mining is their goals.
- Information retrieval helps users search for documents or data that satisfy their information needs, e.g. "Find customers who have purchased more than $10,000 in the last month."
- Data mining discovers useful knowledge by analysing data correlations using sophisticated data mining techniques, e.g. "Find all items which are frequently purchased with milk."

A KDD process (1)
- Some people view data mining as synonymous with knowledge discovery in databases (KDD).

A KDD process (2)
- Learning the application domain: relevant knowledge and the goals of the application
- Creating a target data set: data selection; data cleaning and preprocessing
- Choosing the functions of data mining: summarization, classification, association, clustering, etc.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: removing redundant patterns, visualization, transformation, etc.; present results to the user in a meaningful manner
- Use of the discovered knowledge

What kinds of patterns can be mined? (1)
- Concept/class description
  - Characterization: provide a summarization of the given data set
  - Comparison (discrimination): mine distinguishing characteristics that differentiate a target class from comparable contrasting classes
- Association rules (correlation and causality)
  - Association rules are of the form X ⇒ Y, for example:
    contains(T, "computer") ⇒ contains(T, "software") [support = 1%, confidence = 50%]
    age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
- Classification and prediction: find models that describe and distinguish classes for future prediction

What kinds of patterns can be mined? (2)
- Clustering: group data to form classes. Principle: maximize the intra-class similarity and minimize the inter-class similarity
- Outlier analysis: objects that do not comply with the general behavior or data model
- Trend and evolution analysis: sequential pattern mining, regression analysis, periodicity analysis, similarity-based analysis

What kinds of patterns can be mined? (3)
- In the context of text and Web mining, the knowledge also includes: word association, Web resource discovery, news events, browsing behavior, online communities, mining Web link structures to identify authoritative Web pages, finding spam sites, opinion mining, ...

Major issues in data mining (1)
- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages
  - Presentation and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation
- Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed and incremental mining methods

Major issues in data mining (2)
- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and the WWW
- Issues related to applications
  - Application of discovered knowledge
  - Domain-specific data mining tools
  - Intelligent query answering
  - Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy

Cultures
- Databases: concentrate on large-scale (non-main-memory) data. To a database person, data mining is an extreme form of analytic processing; the result is the data that answers the query.
- AI (machine learning): concentrate on complex methods, small data.
- Statistics: concentrate on models. To a statistician, data mining is the inference of models; the result is the parameters of the model. E.g., given a billion numbers, a statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.

Data cleaning (1)
- Data preprocessing: cleaning, integration, transformation, reduction, discretization
- Why data cleaning? No quality data, no quality mining results! Garbage in, garbage out!
- Measures of data quality: accuracy, completeness, consistency, timeliness, believability, interpretability, accessibility

Data cleaning (2)
- Data in the real world is dirty:
  - Incomplete: lacking some attribute values; lacking certain attributes of interest, or containing only aggregate data
  - Noisy: containing errors or outliers
  - Inconsistent: containing discrepancies in codes or names
- Major tasks in data cleaning:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Data cleaning (3)
- Handling missing values:
  - Ignore the tuple
  - Fill in the missing value manually
  - Use a global constant to fill in the missing value
  - Use the attribute mean to fill in the missing value
  - Use the attribute mean of all samples belonging to the same class to fill in the missing value
  - Use the most probable value to fill in the missing value
- Identifying outliers and smoothing out noisy data, binning method: first sort the data and partition it into bins; then smooth by bin means, bin medians, bin boundaries, etc.

Data cleaning (4)
Example: sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
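The equi-depth binning and smoothing steps worked through in the example above are easy to reproduce in code. The sketch below is illustrative only (the function names are not from the slides); it partitions sorted values into equal-depth bins and smooths each bin by its mean or by its nearer boundary.

```python
def equi_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of (near-)equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by whichever bin boundary (min or max) is closer.
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(data, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Running the sketch on the slide's data reproduces the three smoothed bins shown above.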
(Complete edition) Information Management and Information Systems: professional English vocabulary summary
Lesson 1: distributed applications; competitive advantage; data warehouses; incompatible databases; decision support systems; executive information systems; DBMS (database management systems); entry; metadata; mainframe computer; desktop computer; laptop computer; spreadsheet; LAN (local area network); database server; user views; data security; data integrity; concurrent user; data updating; data redundancy; consistency of data and metadata; distributed database; telecommunications network

Lesson 2: automatic indexing; human indexing; extraction indexing; assignment indexing; controlled vocabulary; non-substantive words; index terms; automatic stemming; weight; clue words; inverted file; absolute frequency; relative frequency; information retrieval; syntactic criteria; word string; NLDB (Natural Language DataBase); MAI (machine-aided indexing); recall ratio; precision ratio; descriptor; thesaurus; semantic vocabulary; concept headings; consistency of indexing; underassignment; overassignment; back file; main heading; subheading; access point

Lesson 3: machine-readable form; source document; subject indexing; back-of-the-book indexing; indexing scheme; NFAIS (National Federation of Abstracting and Information Services, US); scope notes; permuted list; CAS (Chemical Abstracts Service); character set; statistical correlation; ISI (Institute for Scientific Information, US); co-citation indexing; SCI (Science Citation Indexes); SSCI (Social Science Citation Indexes); bibliometric analysis

Lesson 4: performance enhancement; scarce resources; proxy servers; JAVA executables; source code; streaming media; outsourcing; wild card characters; real-time traffic analysis; static web pages; ISDN (Integrated Services Digital Network); URL (Uniform Resource Locator); HTML (Hypertext Markup Language); CGI (Common Gateway Interface); XML (Extensible Markup Language); OR (Operation Record); IIS (Internet Information Services)

Lesson 5: IR (information retrieval); search engine spam; soft computing; data mining; information fusion; classification; clustering; thesaurus construction; Web page categorization; JPG (Joint Photographic Experts Group); GIF (Graphics Interchange Format); PNG (Portable Network Graphics); the WWW Consortium; HTTP (Hypertext Transfer Protocol); TCP (Transmission Control Protocol); ASCII (American Standard Code for Information Interchange); CPU (Central Processing Unit)

Lesson 6: black-box services; delivering information; videoconferencing; cross reference; timeliness; cross check; knowledge framework

Lesson 7: IP (intellectual property); electronic holdings of libraries; information infrastructure; copyright; patent; exclusive right; subsequent editions

Lesson 8: encryption technologies; decrypted digital version; fair use doctrine; authenticity and integrity of the information; DMCA (the Digital Millennium Copyright Act); DVD (digital video disk); encyclopedias

Lesson 9: CKO (chief knowledge officer); knowledge sharing; manual; competitive intelligence; search engine; artificial intelligence; drill-down access; accessibility; knowledge discovery; quantitative data; qualitative data; virtual warehouses; virtual library; relational database; research and development; directory; newsletter; intelligent search agents; information resources; performance evaluation

Lesson 10: CIO (chief information officer); ERP (Enterprise Resource Planning); CRM (Customer Relationship Management); Collaborative Applications Environment; workflow package

Lesson 11: rights of information users; obligations of information users; terms and conditions
The Future Internet (English essay)
未来的互联网英语作文The Evolving Landscape of the Internet.In the ever-evolving technological landscape, the internet has emerged as a transformative force, shaping human interactions, communication, and access to information. As we look towards the future, the internet is poised to undergo further significant developments, reshaping its role in our lives and ushering in a new era of connectivity.Artificial Intelligence and Machine Learning.One of the most transformative advancements in the future internet will be the increasing integration of artificial intelligence (AI) and machine learning (ML) technologies. AI-powered algorithms will play a pivotalrole in automating tasks, enhancing user experiences, and personalizing content.For instance, AI chatbots will become more sophisticated, providing seamless and human-like customer support. ML algorithms will analyze user behavior patterns, tailoring recommendations and content to individual preferences. AI-driven search engines will deliver more relevant and comprehensive results, enhancing the ease and efficiency of information retrieval.Immersive Technologies.Virtual reality (VR) and augmented reality (AR) technologies are poised to redefine online experiences and entertainment. VR headsets will immerse users in virtual worlds, offering unparalleled gaming, educational, and social experiences. AR will seamlessly blend digital information with the physical world, enhancing user interactions and providing new ways to access and consume content.VR and AR technologies will find applications in various industries. In education, they can provide immersive learning environments that enhance studentengagement and understanding. In healthcare, they canassist in surgical procedures, provide remote consultation, and facilitate rehabilitation therapies.Edge Computing and 5G Networks.The future internet will be characterized by the proliferation of edge computing and 5G networks. Edge computing brings data processing and storage closer to users, reducing latency and improving response times for applications. 5G networks will provide ultra-fast connectivity, enabling real-time data transfer, seamless video streaming, and support for demanding applications such as autonomous vehicles and remote surgery.Edge computing and 5G will revolutionize industries by enabling real-time processing and decision-making. In manufacturing, edge devices can monitor production lines, detect anomalies, and optimize processes in real-time. In healthcare, 5G connectivity will support real-time monitoring of patients' vital signs and provide remote access to medical records, enhancing patient care.Quantum Computing.The advent of quantum computing promises to unleash unprecedented computational power, opening newpossibilities for the internet. Quantum algorithms can solve complex problems that are intractable for classical computers, enabling advancements in encryption, drug discovery, and materials science.Quantum computing will revolutionize the internet by enhancing the security of online transactions, accelerating the development of new technologies, and facilitating the creation of personalized and tailored content. It will also drive innovation in fields such as finance, healthcare, and energy, by enabling faster and more accurate modeling and simulations.Blockchain and Decentralization.Blockchain technology has the potential to reshape the future of the internet by promoting decentralization andtransparency. 
Blockchain networks are distributed ledgers that record transactions securely and transparently, without the need for intermediaries.

Decentralized applications (dApps) built on blockchain platforms will empower users to control their data and participate in decision-making processes. This will lead to more transparent and accountable internet governance, reducing the reliance on centralized authorities and fostering a more equitable distribution of power.

Conclusion.

The future internet is a tapestry of transformative technologies that will profoundly impact our lives. From the integration of AI and ML to the immersive experiences of VR and AR, from the speed and efficiency of edge computing and 5G to the computational power of quantum computing, the internet is poised to evolve into a more intelligent, connected, and decentralized ecosystem.

These advancements will reshape industries, enhance productivity, improve communication, and provide unprecedented opportunities for innovation and progress. As we embrace the future of the internet, we must actively shape its development, ensuring that it serves the needs of humanity and contributes to a more equitable, sustainable, and prosperous world.
Glossary of terms in data mining

Chapter 1
1. Data Mining: the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from the large amounts of data stored in databases, data warehouses, or other information repositories.
2. Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. AI is a branch of computer science; it attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in ways similar to human intelligence.
3. Machine Learning: the study of how computers can simulate or implement human learning behaviour in order to acquire new knowledge or skills, and to reorganize existing knowledge structures so as to continually improve their own performance.
4. Knowledge Engineering: the principles and methods of artificial intelligence that provide the means to solve application problems whose solution requires expert knowledge.
5. Information Retrieval: the process and techniques of organizing information in a certain way and finding the relevant information according to the needs of information users.
6. Data Visualization: the study of the visual representation of data, where the visual representation of data is defined as information abstracted in some schematic form, including the attributes and variables of the corresponding information units.
7. Online Transaction Processing (OLTP): collects and processes, in real time, the data connected with transactions and the changes in status of shared databases and other files. In online transaction processing, transactions are executed immediately; this contrasts with batch processing, in which a batch of transactions is stored for a period of time and then executed.
8. Online Analytical Processing (OLAP): a class of software technologies that enable analysts, managers, or executives to access information quickly, consistently, and interactively from multiple perspectives, thereby gaining deeper insight into the data.
9. Decision Support System (DSS): a computer application system that assists decision makers in making semi-structured or unstructured decisions through data, models, and knowledge, in a human-computer interaction mode. It provides decision makers with an environment for analysing problems, building models, and simulating decision processes and alternatives, and it calls on various information resources and analysis tools to help decision makers improve the level and quality of their decisions.
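The OLTP/OLAP distinction above is easiest to see with a small example. The sketch below is purely illustrative and not tied to any particular system; the sales records are invented. An OLTP-style operation touches one record at a time and takes effect immediately, while an OLAP-style query aggregates many records along a dimension for analysis.

```python
from collections import defaultdict

# Invented sales records: (order_id, region, product, amount)
sales = [
    (1, "north", "laptop", 1200.0),
    (2, "south", "phone", 650.0),
    (3, "north", "phone", 700.0),
    (4, "west",  "laptop", 1150.0),
]

# OLTP-style: record a single new transaction immediately.
def record_order(store, order):
    store.append(order)

record_order(sales, (5, "south", "laptop", 1250.0))

# OLAP-style: aggregate across many records, e.g. revenue per region.
revenue_by_region = defaultdict(float)
for _, region, _, amount in sales:
    revenue_by_region[region] += amount

print(dict(revenue_by_region))
# {'north': 1900.0, 'south': 1900.0, 'west': 1150.0}
```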
DSTR Training Materials

Data scalability
- Horizontal scaling: DSTR supports horizontal scaling; system performance and data-processing capacity can be increased by adding more nodes.
- Vertical scaling: DSTR also supports vertical scaling; processing capacity and performance can be increased by adding hardware resources to a single node.

5. Comparative analysis of DSTR and competing products
A comparison with relational databases, NoSQL databases, and full-text search engines illustrates DSTR's advantages and the scenarios it suits.

DSTR vs. MySQL
- Storage: DSTR uses in-memory storage while MySQL stores data on disk, so DSTR offers faster read and write speeds.

DSTR vs. Redis
- Point 1, performance: compared with Redis, DSTR has higher read/write speed and lower latency, because DSTR stores data in memory.
- Point 2, data safety: Redis supports data backup and recovery, whereas DSTR can only provide backup and recovery through its persistence mechanism, so DSTR is slightly weaker than Redis in terms of data safety.
- Point 3, memory limits: Redis memory capacity is constrained by the server's hardware configuration, whereas DSTR can scale memory capacity out through a distributed cluster.

DSTR vs. PostgreSQL
- Data structures: DSTR supports richer data structures such as hash, list, set, and zset, whereas PostgreSQL supports only a limited number of data types.
- Transaction processing: PostgreSQL supports transactions while DSTR does not; transaction processing guarantees data consistency and integrity.

6. DSTR case studies

Case 1: Weibo data storage scheme design
- Summary: efficient, stable, secure.

DSTR storage and retrieval
- Storage: DSTR uses data structures such as inverted indexes and dictionary (trie) trees; strings are broken into words, and for each word the frequency and positions of its occurrences in the original text are stored.
- Retrieval: through the search query interface, efficient keyword matching and ranking are supported, returning the relevant documents that contain the keywords together with their scores.
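The storage scheme just described (split text into words, then record each word's frequency and positions) is essentially a positional inverted index. The sketch below is a minimal illustration of that idea rather than DSTR's actual implementation; the class and method names are invented.

```python
from collections import defaultdict

class PositionalIndex:
    """Toy positional inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, text):
        # Record the position of every word occurrence in the document.
        for pos, word in enumerate(text.lower().split()):
            self.postings[word][doc_id].append(pos)

    def search(self, word):
        """Return (doc_id, frequency, positions) tuples, most frequent first."""
        entry = self.postings.get(word.lower(), {})
        results = [(doc, len(pos), pos) for doc, pos in entry.items()]
        return sorted(results, key=lambda r: r[1], reverse=True)

index = PositionalIndex()
index.add_document("d1", "distributed storage for distributed retrieval")
index.add_document("d2", "keyword matching and ranking")
print(index.search("distributed"))
# [('d1', 2, [0, 3])]
```

A production system would add scoring (e.g. frequency-based ranking across documents) and compression of the postings, but the data layout is the same as in the description above.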
Using query logs to establish vocabularies in distributed information retrieval

Milad Shokouhi *, Justin Zobel, Saied Tahaghoghi, Falk Scholer
School of Computer Science and Information Technology, RMIT University, Melbourne 3001, Australia
Received 11 December 2005; accepted 3 April 2006; available online 7 July 2006
Information Processing and Management 43 (2007) 169-180. doi:10.1016/j.ipm.2006.04.003
* Corresponding author. E-mail address: milad@.au (M. Shokouhi).

Abstract

Users of search engines express their needs as queries, typically consisting of a small number of terms. The resulting search engine query logs are valuable resources that can be used to predict how people interact with the search system. In this paper, we introduce two novel applications of query logs, in the context of distributed information retrieval. First, we use query log terms to guide sampling from uncooperative distributed collections. We show that while our sampling strategy is at least as efficient as current methods, it consistently performs better. Second, we propose and evaluate a pruning strategy that uses query log information to eliminate terms. Our experiments show that our proposed pruning method maintains the accuracy achieved by complete indexes, while decreasing the index size by up to 60%. While such pruning may not always be desirable in practice, it provides a useful benchmark against which other pruning strategies can be measured.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Distributed information retrieval; Uncooperative environments; Indexing; Query logs

1. Introduction

Traditional information retrieval systems use corpus, document, and query statistics to identify likely answers to users' queries. However, these queries can be captured in a query log, providing an additional source of evidence of relevance. In recent years, considerable attention has been devoted to the study of query logs and the way people express their information needs (de Moura et al., 2005; Fagni, Perego, Silvestri, & Orlando, 2006; Jansen & Spink, 2005). The query logs of commercial search engines such as Excite (Spink, Wolfram, Jansen, & Saracevic, 2001), Altavista (Silverstein, Marais, Henzinger, & Moricz, 1999), and AlltheWeb (Jansen & Spink, 2005) have been investigated and analysed. Query logs have been used in information retrieval research for applications such as query expansion (Billerbeck, Scholer, Williams, & Zobel, 2003; Cui, Wen, Nie, & Ma, 2002), contextual text retrieval (Wen, Lao, & Ma, 2004), and image retrieval (Hoi & Lyu, 2004). The question we explore in this paper is how query logs can be used to guide future search, in the context of distributed information retrieval.

In distributed information retrieval (DIR) systems, the task is to search a group of separate collections and identify the most likely answers from a subset of these. Brokers receive queries from the users and send them to those collections that are deemed most likely to contain relevant answers. In a cooperative environment, collections inform brokers about the information they contain by providing information such as term distribution statistics. In uncooperative environments, on the other hand, collections do not provide any information about their content to brokers. A technique that can be used to obtain information about collections in such environments is to send probe queries to each collection. Information gathered from the limited number of answer documents that a collection provides in response to such queries is used to construct a representation set; this representation set guides the evaluation of user queries.

In this paper, we introduce two novel applications of query logs: sampling for improved query probing, and pruning of index information. The first of these is sampling. Using a TREC web crawl, we show that query log terms can be used to produce effective samples from uncooperative collections. We compare the performance of our strategy with the state-of-the-art method, and show that samples obtained using query log terms allow for more effective collection selection and retrieval performance – improvements in average precision are often over 50%. Our method is at least as efficient as current sampling methods, and can be much more efficient for some collections.

Our second new use of query logs is a pruning strategy that uses query log terms to remove less significant terms from collection representation sets. For a DIR environment with a large number of collections, the total size of collection representation sets on the broker might become impractically large. The goal of pruning methods is to eliminate unimportant terms from the index without harming retrieval performance. In previous work – such as that of Carmel et al. (2001), Craswell, Hawking, and Thistlewaite (1999), de Moura et al. (2005) and Lu and Callan (2002) – pruning strategies have had an adverse effect on performance. The reason is that these approaches drop many terms that are necessary to future queries. We show that pruning based on query logs does not decrease search precision. In addition, our method can be applied during document indexing, which means that it is independent of term frequency statistics. We also test our method on central indexes and for different types of search tasks. We show that, by applying our pruning strategy, the same performance as a full index can be achieved, while substantially reducing index size. In practice, such pruning might not always be desirable; if a term is present, it should be searchable. However, our pruning does provide an interesting benchmark against which other methods can be measured, and is clearly superior to the principal alternative.

2. Distributed search

The vast volume of data on the web makes it extremely costly for a single search engine to provide comprehensive coverage. Moreover, public search engines cannot crawl and index documents to which there are no public links, or from which crawlers are forbidden. These documents form the so-called hidden web, and can generally only be viewed by using custom search interfaces supplied as part of the site. Distributed information retrieval (DIR) aims to address this issue by passing queries to multiple servers through a central broker. Each server sends its top-ranked answers back to the broker, which produces a single ranked list of answers for presentation to the user. For efficiency, the broker usually passes the query only to a subset of available servers, selecting those that are most likely to contain relevant answers. To identify the appropriate servers, the broker calculates a similarity between the query and the representation set of each server.

In cooperative environments, servers provide the broker with their representation sets (Callan, Lu, & Croft, 1995; Fuhr, 1999; Gravano, Chang, Garcia-Molina, & Paepcke, 1997, 1999; Yuwono & Lee, 1997). The broker can be aware of the distribution of terms at the servers, and is therefore able to calculate weights for each server. Queries are sent to those servers that indicate the highest weight for the query terms.

In practice, servers may be uncooperative and therefore do not publish their index information. Server representation sets can be gathered using query-based sampling (QBS) (Callan, Connell, & Du, 1999). In QBS, an initial query is created from frequently-occurring terms found in a reference collection – to increase the chance of receiving an answer – and sent to the server. The query results provided by the server are downloaded, and another query is created using randomly-selected terms from these results. This process continues until a sufficient number of documents have been downloaded (Callan & Connell, 2001; Callan et al., 1999; Shokouhi, Scholer, & Zobel, 2006). Many queries do not return any yet-unseen answers; Ipeirotis and Gravano (2002) claim that, on average, one new document is received per two queries. QBS also downloads many pages that are not highly representative for the server.

2.1. Query-based sampling

Query-based sampling (QBS) was introduced by Callan et al. (1999), who suggested that even a small number of documents (such as 300) obtained by random sampling can effectively represent the collection held at a server. They tested their method on the CACM collection (Jones & Rijsbergen, 1976) and many other small servers artificially created from TREC newswire data (Voorhees & Harman, 2000). In QBS, subsequent queries after the first are selected by choosing terms from documents that have been downloaded so far (Callan et al., 1999). Various methods were explored; random selection of query terms was found to be the most effective way of choosing probe queries, and this method has since been used in other work on sampling non-cooperative servers (Craswell, Bailey, & Hawking, 2000; Si & Callan, 2003). These methods generally proceed until a fixed number of documents (usually 300) have been downloaded. However, Shokouhi et al. (2006) have shown that for more realistic, larger collections, fixed-size samples might not be suitable, as the coverage of the vocabulary of the server is poor.

An alternative technique, called Qprober (Gravano, Ipeirotis, & Sahami, 2003), has been proposed for automatic classification of servers. Here, a classification system is trained with a set of pages and judgments. Then the system suggests the classification rules and uses the rules as queries. For example, if the classification system suggests (Madonna → Music), it uses "Madonna" as a query and classifies the downloaded pages as music-related. Qprober differs from QBS in the way that probe queries are selected, and requires a classification system in the background.

2.2. Using query logs for sampling

Terms that appear in search engine query logs are – by definition – popular in queries, and tend to refer to topics that are well-represented in the collection. We therefore hypothesise that probe queries composed of query log terms would return more answers than random terms, leading to higher efficiency. Since query terms are aligned with actual user interests, we also believe that sampling using query log terms would better reflect user needs than random terms from downloaded documents. Hence, instead of choosing the terms from downloaded documents for probe queries, we use terms from query logs. Analysis of our method shows that it is at least as efficient as previous methods, and generates samples that produce higher overall effectiveness.

2.3. Evaluation

To simulate a DIR environment, we extracted documents from the 100 largest servers in the TREC WT10g collection (Bailey, Craswell, & Hawking, 2003). These sets vary in size from 26,505 documents (www9.yahoo.com) to 2,790 documents (), with an average size of 5,602 documents per server. For sampling queries, we used the 1000 most frequent terms in the Excite search engine query logs collected in 1997 (Spink et al., 2001).

For each query, we download the top 10 answers; this is the number of results that most search interfaces return on the first page of results. Sampling stops after 300 unique documents have been downloaded or 1000 queries have been sent (whichever comes first). Although using fixed-size samples might not always be the optimal method (Shokouhi et al., 2006), we restrict ourselves to 300 documents to ensure that our results are comparable to the widely accepted baseline (Callan et al., 1999). For each server we gather two samples: one by query-based sampling, and the other by our query log method. For query log (QL) experiments, each of the 1000 most frequent terms in the Excite query logs is passed as a probe query to the collection, and the top 10 returned answers are collected. For QBS, probe queries are selected from the currently downloaded documents each time, and the top 10 results of each query are gathered.

To evaluate the effectiveness of samples for different queries, we used topics 451-550 from the TREC-9 and TREC 2001 Web Tracks. We used only terms in the title field as queries. Since we are extracting only the largest 100 servers from WT10g, the number of available relevant documents is low, so the precision-recall metrics produce poor results. For this reason, many DIR experiments use the set of documents that are retrieved by a central server as an oracle. That is, all of the top-ranked pages returned by the central index are considered to be relevant, and the performance of DIR approaches is evaluated based on how effectively they can retrieve this set (Craswell et al., 2000; Xu & Callan, 1998). Therefore, we use a central index containing the documents of all 100 servers as a benchmark. (The 100 servers consist of 563,656 documents in total, containing 309,195,668 terms, 1,788,711 of them unique.) For both the baseline and DIR experiments, we gathered the top 10 results for each query. Results for 100 and 1000 answers per query were found to be similar and are not presented here.

We tested different cutoff (CO) points in our evaluations: for a cutoff of 1, the queries were passed to the one server with the most similar corresponding representation set; for a cutoff of 50, queries were sent to the top 50 servers. Table 1 shows that the QL method consistently produces better results. Differences that are statistically significant based on the t-test at the 0.05 and 0.01 levels of significance are indicated by † and ‡, respectively. For mean average precision (MAP), which is considered to be the most reliable evaluation metric (Sanderson & Zobel, 2005), QL outperforms QBS significantly in four of five cases.

We made two key observations. First, query log (QL) terms did not retrieve the expected 300 documents for four servers after 1000 queries, while QBS failed to retrieve this number from only one server. Analysis showed that these servers contain documents unlikely to be of general interest to users. For example, has error pages and HTML forms while includes many pages with non-text characters. Second, the QL method downloads an average of 2.43 unseen documents per query, while the corresponding average for QBS is 2.80.

Having access to the term document frequency information of any collection, it is possible to calculate the expected number of answers from the collection, for single-term queries extracted randomly from its index. Therefore, we indexed all of the servers together as a global collection. At most 10 answers are retrieved per query. The expected number of answers per query can be calculated as

\[
\frac{|\text{Terms with df} > 9|}{|\text{Total Terms}|} \times 10
\;+\;
\sum_{i=1}^{9} \frac{|\text{Terms with df} = i|}{|\text{Total Terms}|} \times i
\]

which gives an expected value of 2.60, close to numbers obtained by both the QL and QBS methods.

Table 1
Comparison of the QL and QBS methods on a subset of the WT10g data; QL consistently performs better.

      MAP              P@5              P@10             R-precision
CO    QBS     QL       QBS     QL       QBS     QL       QBS     QL
1     0.0668  0.0902   0.1302  0.1721   0.0744  0.0988   0.0744  0.0988
10    0.1562  0.2515‡  0.3057  0.4322‡  0.2011  0.3023‡  0.2011  0.3023‡
20    0.1617  0.2811‡  0.3149  0.4621‡  0.2115  0.3437‡  0.2115  0.3437‡
30    0.1540  0.2655‡  0.2941  0.4471‡  0.2106  0.3259‡  0.2106  0.3259‡
40    0.1812  0.2639‡  0.3200  0.4306‡  0.2459  0.3212‡  0.2459  0.3212‡
50    0.1868  0.4188‡  0.3341  0.4188†  0.2506  0.3176‡  0.2506  0.3176‡

Differences that are statistically significant based on the t-test at the 0.05 and 0.01 levels of significance are indicated by † and ‡, respectively. "CO" is the cutoff number of servers from which answers are retrieved.
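The two sampling strategies compared in Table 1 can be summarised in a few lines of Python. This is a simplified sketch of the procedures described above, not the authors' implementation; search(collection, query, k) is a placeholder for whatever search interface a collection exposes, and the parameter defaults mirror the experimental setup (300 documents, 1000 queries, top 10 answers).

```python
import random

def sample_qbs(collection, search, seed_terms,
               target_docs=300, max_queries=1000, k=10):
    """Query-based sampling: probe with random terms from documents seen so far."""
    sample, seen = [], set()
    query = random.choice(seed_terms)  # initial query from a reference vocabulary
    for _ in range(max_queries):
        for doc_id, text in search(collection, query, k):
            if doc_id not in seen:
                seen.add(doc_id)
                sample.append(text)
        if len(sample) >= target_docs:
            break
        # Next probe: a random term from the downloaded documents so far.
        pool = " ".join(sample).split() or seed_terms
        query = random.choice(pool)
    return sample

def sample_query_log(collection, search, log_terms, target_docs=300, k=10):
    """Query-log sampling: probe with the most frequent query-log terms in turn."""
    sample, seen = [], set()
    for term in log_terms:  # e.g. the 1000 most frequent Excite query-log terms
        for doc_id, text in search(collection, term, k):
            if doc_id not in seen:
                seen.add(doc_id)
                sample.append(text)
        if len(sample) >= target_docs:
            break
    return sample
```

Both samplers stop at the same fixed sample size; the query-log variant simply walks down the frequency-sorted log instead of re-sampling terms from the documents it has already downloaded.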
collection.Therefore,pruning the broker’s index based on the global lexicon statistics does not seem reasonable.We introduce a new pruning method that addresses these problems.Our method prunes during parsing,and is therefore faster than lexicon-based methods,as index updates are not required.Unlike other approaches,our proposed method does not harm precision,and can increase retrieval performance in some cases.Note,however,that we regard this pruning strategy as an illustration of the power of query logs rather than a method that should be deployed in practice:users for search for a term should be able to find matches if they are present in the collection.Although sampling inevitably involves some loss,that loss should be mini-mised.That said,as our experiments show the new pruning method is both effective and efficient.3.1.Related workPruning is widely used for efficiency,either to increase query processing speed (Persin,Zobel,&Sacks-Davis,1996),or to save disk storage space (Carmel et al.,2001;Craswell et al.,1999;de Moura et al.,2005;Lu &Callan,2002).Table 2Comparison of the QLandQBSmethods,showing average number of answers returned per queryCollection Size Unseen (QBS )Total (QBS )Unseen (QL )Total (QL )Newswire 30507 4.6 5.8 4.99.1WEB-1304035 1.8 2.2 6.99.9WEB-2218489 2.5 2.9 6.69.9WEB-3817025 1.2 1.57.49.9GOV-11361762.53.87.39.7M.Shokouhi et al./Information Processing and Management 43(2007)169–180173174M.Shokouhi et al./Information Processing and Management43(2007)169–180 Carmel et al.(2001)proposed a pruning strategy where each indexed term is sent in turn as a query to their search system.Index information is discarded for those documents that contain the query term,but do not appear in the top ranked results in response to the query.This strategy is computationally expensive and time consuming.The soundness of this approach is unclear;the highly ranked pages for many queries are not highly ranked for any of the individual query terms,de Moura et al.(2005)have extended Carmel’s method. 
The apply Carmel’s method to extract the most important terms of each collection.Then they keep only those sentences that contain important terms,and delete the rest.For the same reason discussed previously,this approach also is not applicable in uncooperative DIR environments.Although this approach is more effective than Carmel’s method in most cases,the loss in average precision compared to a full-text baseline is significant.D’Souza,Thorn,and Zobel(2004)discuss surrogate methods for pruning,where only the most significant words are kept for each document.In this approach,the representation set is not the collection’s vocabulary, but is instead a complete index for the surrogates.Such an approach requires a high level of cooperation between servers.Craswell et al.(1999)use a pruning strategy to merge the results gathered from multiple search engines.In their work,they download thefirst four kilobytes of each returned document instead of extracting the whole document for result merging.They showed that in some cases,the loss in performance is negligible.They only evaluated their method for result merging.In a comprehensive analysis of pruning in brokers,Lu and Callan(2002)divided pruning methods into var-ious groups:frequency-based methods prune documents according to lexicon statistics;location-based methods exclude terms based on the position of their appearance in the documents;single-occurrence methods set a pruning threshold based on the number of unique terms in documents,and keep one instance of each term in the document;and multiple-occurrence methods allow for multiple occurrences of terms in pruned docu-ments.Experiments evaluating the performance of nine methods demonstrated that four models can achieve similar optimal levels of performance,and do not have any significant advantage over each other.Of these best methods,FIRSTM is the only one which does not rely on a broker’s vocabulary statistics.For each document,this approach stores information about thefirst1600terms.The other methods measure the importance of terms based on frequency information.As discussed,these methods are unsuitable for DIR in many ways:the frequency of a term in the broker does not indicate its importance in the original collections; the cost of pruning and re-indexing might be high;and,adding a new collection makes the current pruned index unusable,since after a new collection is added to the system,the previous information is no longer valid. 
The FIRSTM approach prunes during parsing,which makes it more comparable with our approach.Therefore, we evaluate our approach by using FIRSTM as a baseline.Lu and Callan tested their methods on100collections created from TREC disks1,2,and3,and showed that their models can reduce storage costs by54%–93%,with less than a10%loss in precision.We test our systems on WT10g and GOV1,which are larger and consist of unmanaged data crawled from the Web;these collections are described in more detail in Section3.2.Our proposed pruning method is applied during parsing,and is independent of index updates,as the addi-tion of new collections to the system does not require the re-indexing of the original documents.Moreover, our pruning method does not reduce system performance and precision,while in all of the discussed previous work(Carmel et al.,2001;Craswell et al.,1999;de Moura et al.,2005;Lu&Callan,2002),pruning results in a decrease in precision.ing query logs for pruningThe main motivation for pruning is to omit unimportant terms from the index.That is,pruning methods are intended to exclude the terms that are less likely to appear in user queries(de Moura et al.,2005).Some methods prune terms that are rare in documents(Lu&Callan,2002).However,the distribution of terms in user queries is not similar to that in typical web documents.We propose using the history of previous user queries to achieve this directly.Our hypothesis is that prun-ing those terms that do not appear in a search engine query logs will be able to reduce index sizes while main-taining retrieval performance.We test our hypothesis with experiments on distributed environments andM.Shokouhi et al./Information Processing and Management43(2007)169–180175 central indexes for different types of queries.In a standard search environment,where completeness may be more important than improvements in efficiency,such pruning(or any pruning)is unappealing;but in a dis-tributed environment,where index information is incomplete and is difficult to gather,such an approach has significant promise.For our experiments,we used a list of the315936unique terms in the log of about one million queries submitted to the Excite search engine on16September1997(Spink et al.,2001).Larger query logs,or a com-bination of query logs from different search engines,might be useful for larger collections.Also,for highly topic-specific collections,topical query logs(Beitzel,Jensen,Chowdhury,Grossman,&Frieder,2004)and query terms that have been classified into different categories(Jansen,Spink,&Pedersen,2005)could provide additional benefits.For experiments on uncooperative DIR environments and brokers,we used the testbed described in Section 2.3.The100largest servers were extracted from the TREC WT10g collection,with each server being considered as a separate collection.Query-based sampling,as described in Section2.1,was used to obtain representation sets for each collection by downloading300documents from each server in our testbed.We do not omit stop-words in any of our experiments.For each downloaded sample,we only retained information about those terms that were present in our query log,and eliminated the other terms from the broker.We used CORI (Callan et al.,1995)for collection selection and result merging;CORI has been used in many papers as a base-line(Craswell et al.,2000;Nottelmann&Fuhr,2003;Powell&French,2003;Si&Callan,2003).TREC topics451–550and their corresponding relevance judgements were used to evaluate the effectiveness of our pruned representation 
sets.We use only theÆtitleæfield of TREC topics as queries for the search system.To test our pruning method on central indexes,we used the TREC WT10g and GOV1collections.The GOV1 collection(Craswell&Hawking,2002)contains over a million documents crawled from domain. TREC topics451–550were used for our experiments on WT10g,and topics551–600and NP01–NP150were used with the GOV1collection.All experiments with central indexes use the OkapiBM25similarity measure (Robertson,Walker,Hancock-Beaulieu,Gull,&Lau,1992).In addition to these topic-finding search tasks,we evaluate our pruning approach on central indexes for named-pagefinding and topic distillation tasks.In topic distillation,the objective is tofind relevant homepages related to a general topic(Craswell,Hawking,Wilkinson,&Wu,2003).We use the TREC topic distillation top-ics551–600,and corresponding relevance judgements,with the GOV1collection.For named-pagefinding(also known as homepagefinding)the aim is tofind particular web pages of named individuals or organisations.To evaluate this type of search task,we used the TREC named-pagefinding queries NP01-NP150(Craswell& Hawking,2002).3.3.Distributed retrieval resultsThe results of our experiments using different pruning methods for DIR systems are shown in Table3.For each scheme,up to1000answers were returned per query.The cutoff(CO)values show the number of collec-tions that are selected for each query.That is,for thefirst row,only the best collection is selected while for theTable3Effectiveness of different pruning schemes,on a subset of WT10gCO MAP P@10R-precisionORIG FIRSTM PR ORIG FIRSTM PR ORIG FIRSTM PR10.01780.00960.01780.03670.03160.03670.02170.00710.0217 100.04150.03990.04150.06200.04680.06300.05930.04000.0594 200.03550.04680.05040.05900.04940.06460.04850.04600.0644 300.03990.04620.05460.06030.04810.0658 0.05000.04260.0659* 400.04890.05090.0628 0.06410.05610.0671 0.05210.04370.0611 500.05060.05160.0647 0.06540.05650.0684 0.05620.04390.0708* Significance at the0.1and0.05levels of confidence are indicated with*and ,respectively.‘‘CO’’is the cutoffnumber of servers from which answers are fetched.。